
Monday 20 June 2016

Shutting Down HDFS

You can shut down HDFS using the following command.

$ stop-dfs.sh 
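
On Hadoop 2.x and later, the start/stop scripts normally live under $HADOOP_HOME/sbin rather than on the PATH; assuming that layout, the same shutdown can be invoked as:
$ $HADOOP_HOME/sbin/stop-dfs.sh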

Retrieving Data from HDFS

Assume we have a file in HDFS called outfile. Given below is a simple demonstration of retrieving the required file from the Hadoop file system.

Step 1

Initially, view the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile 
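
For a large file, you may want to inspect only the beginning of the data. A minimal sketch, piping the output through the standard head utility:
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile | head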

Step 2

Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/ 
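
Since the get command above is given the directory /user/output/, the copy should appear under /home/hadoop_tp/output/ on the local file system. Assuming those example paths, you can verify the copy with the local ls command:
$ ls /home/hadoop_tp/output/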

Inserting Data into HDFS

Assume we have data in a file called file.txt on the local system that needs to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.

Step 1

You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input 
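
On newer Hadoop releases, you can add the -p option so that any missing parent directories (here, /user) are created as well:
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/input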

Step 2

Transfer and store a data file from the local system to the Hadoop file system using the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input 
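
The -copyFromLocal command behaves the same as put when the source is on the local file system, so this step can equivalently be written as:
$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal /home/file.txt /user/input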

Step 3

You can verify the file using the ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input 
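
The exact columns vary between Hadoop versions, but the listing generally shows the permissions, replication factor, owner, group, size, modification time, and path of each entry. An illustrative (not literal) line of output might look like:
-rw-r--r--   1 hadoop supergroup       1234 2016-06-20 10:30 /user/input/file.txt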

Listing Files in HDFS

After loading the information into the server, we can find the list of files in a directory, or the status of a file, using the ls command. Given below is the syntax of ls; you can pass a directory or a filename as an argument.

$ $HADOOP_HOME/bin/hadoop fs -ls <args>
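
For example, assuming the directories created earlier, you can list the contents of /user, or list it recursively with the -R option (supported on newer releases; older ones provided the equivalent -lsr command):
$ $HADOOP_HOME/bin/hadoop fs -ls /user
$ $HADOOP_HOME/bin/hadoop fs -ls -R /user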

Starting HDFS

Initially, you have to format the configured HDFS file system. Open the namenode (HDFS server) and execute the following command.
$ hadoop namenode -format 
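On Hadoop 2.x and later, the same step is normally performed through the hdfs command instead:
$ hdfs namenode -format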
After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data nodes as a cluster.
$ start-dfs.sh 
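
To confirm that the daemons came up, you can run the JDK's jps utility on the node. Assuming a single-node setup, it should list processes such as NameNode, DataNode, and SecondaryNameNode:
$ jps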

Hadoop - HDFS Operations

Goals of HDFS

  • Fault detection and recovery : Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
  • Huge datasets : HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
  • Hardware at data : A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.

Block

Generally the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
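
A minimal sketch of how the block size could be raised in hdfs-site.xml, assuming the classic dfs.block.size property (renamed dfs.blocksize in Hadoop 2.x); the value is given in bytes and applies only to files written after the change:

<property>
  <!-- 128 MB, expressed in bytes -->
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>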