BIG-DATA – TecAdmin (https://tecadmin.net) – How-to guides for system administrators and developers

Creating Directory In HDFS And Copy Files (Hadoop)
https://tecadmin.net/working-with-hdfs-file-system/

HDFS is the Hadoop Distributed File System. It’s a distributed storage system for large data sets which supports fault tolerance, high throughput, and scalability. It works by dividing data into blocks that are replicated across multiple machines in a cluster. The blocks can be written to or read from in parallel, facilitating high throughput and fault tolerance. HDFS provides RAID-like redundancy with automatic failover. HDFS also supports compression, replication, and encryption.

The most common use case for HDFS is storing large collections of data such as image and video files, logs, sensor data, and so on.

Creating Directory Structure with HDFS

The hdfs command-line utility is available in the ${HADOOP_HOME}/bin directory. The steps below assume that this directory is already included in your PATH environment variable. Log in as the Hadoop user and follow the instructions.
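
If it is not, you can add it to PATH for the current session (a minimal sketch, assuming Hadoop is installed under /home/hadoop/hadoop as in the installation guides below):

export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin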

  1. Create a /data directory in the HDFS file system. This directory will be used to hold all of the application data.
    hdfs dfs -mkdir /data 
    
  2. Create another directory, /var/log, that will contain all the log files. As the /var directory does not exist yet, use -p to create the parent directory as well.
    hdfs dfs -mkdir -p /var/log 
    
  3. You can also use shell variables during directory creation. For example, create a directory with the same name as the currently logged-in user. This directory can be used to hold that user's data.
    hdfs dfs -mkdir -p /Users/$USER 
    

Changing File Permissions with HDFS

You can also change file ownership as well as permissions in the HDFS file system.

  • To change the file owner and group owner, use the -chown subcommand:
    hdfs dfs -chown -R $HADOOP_USER:$HADOOP_USER  /Users/hadoop 
    
  • To change the file permissions, use the -chmod subcommand:
    hdfs dfs -chmod -R 775 /Users/hadoop
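    To verify the new ownership and permissions, list the directory; the first columns of the output show the mode, owner, and group:
    hdfs dfs -ls /Users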
    

Copying Files to HDFS

The hdfs dfs command provides the -put and -get subcommands to copy files to and from the HDFS file system.

  • For example, to copy a single file from the local file system to HDFS:
    hdfs dfs -put ~/testfile.txt /var/log/  
    
  • Copy multiple files or an entire directory tree using wildcard characters:
    hdfs dfs -put ~/log/* /var/log/  
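    To copy a file back from HDFS to the local file system, use -get. For example, to download the file uploaded above into your home directory:
    hdfs dfs -get /var/log/testfile.txt ~/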
    

Listing Files in HDFS

While working with a Hadoop cluster, you can view files in the HDFS file system via the command line as well as through a web GUI.

  • Use the -ls subcommand with hdfs to list files in the HDFS file system. For example, to list all files in the root directory, use:
    hdfs dfs -ls / 
    
  • The same command can be used to list files from subdirectories as well.
    hdfs dfs -ls /Users/hadoop 
    

    You should get a listing of the files and directories under /Users/hadoop in the output.
  • Besides the command line, Hadoop also provides a graphical explorer to view, download, and upload files easily. Browse the HDFS file system on the NameNode web port at the following URL:

    http://localhost:9870/explorer.html


Conclusion

HDFS also supports a range of other workloads, such as MapReduce jobs that process large volumes of data, and provides user authentication and access control mechanisms. HDFS can also be combined with object stores like Amazon S3 and OpenStack Swift to create hybrid solutions that combine high availability and low latency with low-cost storage.

In this article, you have learned about creating a directory structure in the HDFS file system, changing permissions, and copying and listing files with HDFS.

How To Install and Configure Hadoop on CentOS/RHEL 8
https://tecadmin.net/install-hadoop-centos-8/

Hadoop is a free, open-source, Java-based software framework used for the storage and processing of large datasets on clusters of machines. It uses HDFS to store its data and processes this data using MapReduce. It is an ecosystem of Big Data tools that are primarily used for data mining and machine learning. It has four major components: Hadoop Common, HDFS, YARN, and MapReduce.

In this guide, we will explain how to install Apache Hadoop on RHEL/CentOS 8.

Step 1 – Disable SELinux

Before starting, it is a good idea to disable SELinux on your system.

To disable SELinux, open the /etc/selinux/config file:

nano /etc/selinux/config

Change the SELINUX value to disabled:

SELINUX=disabled

Save the file when you are finished. Next, restart your system to apply the SELinux changes.
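
After the reboot, you can optionally confirm the change with the sestatus command, which should report that SELinux is disabled:

sestatus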

Step 2 – Install Java

Hadoop is written in Java, and this release supports only Java version 8. You can install OpenJDK 8 and Ant using the DNF command as shown below:

dnf install java-1.8.0-openjdk ant -y

Once installed, verify the installed version of Java with the following command:

java -version

You should get the following output:

openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

Step 3 – Create a Hadoop User

It is a good idea to create a separate user to run Hadoop for security reasons.

Run the following command to create a new user with name hadoop:

useradd hadoop

Next, set the password for this user with the following command:

passwd hadoop

Provide and confirm the new password as shown below:

Changing password for user hadoop.
New password: 
Retype new password: 
passwd: all authentication tokens updated successfully.

Step 4 – Configure SSH Key-based Authentication

Next, you will need to configure passwordless SSH authentication for the local system.

First, change the user to hadoop with the following command:

su - hadoop

Next, run the following command to generate Public and Private Key Pairs:

ssh-keygen -t rsa

You will be asked to enter the filename. Just press Enter to complete the process:

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:a/og+N3cNBssyE1ulKK95gys0POOC0dvj+Yh1dfZpf8 hadoop@centos8
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|                 |
|              .  |
|     .   o o o   |
|  . . o S o o    |
| o = + O o   .   |
|o * O = B =   .  |
| + O.O.O + +   . |
|  +=*oB.+ o     E|
+----[SHA256]-----+

Next, append the generated public keys from id_rsa.pub to authorized_keys and set proper permission:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys

Next, verify the passwordless SSH authentication with the following command:

ssh localhost

You will be asked to authenticate hosts by adding RSA keys to known hosts. Type yes and hit Enter to authenticate the localhost:

The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:0YR1kDGu44AKg43PHn2gEnUzSvRjBBPjAT3Bwrdr3mw.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Activate the web console with: systemctl enable --now cockpit.socket

Last login: Sat Feb  1 02:48:55 2020
[hadoop@centos8 ~]$ 

Step 5 – Install Hadoop

First, change the user to hadoop with the following command:

su - hadoop

Next, download Hadoop 3.2.1 using the wget command:

wget http://apachemirror.wuchna.com/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

Once downloaded, extract the downloaded file:

tar -xvzf hadoop-3.2.1.tar.gz

Next, rename the extracted directory to hadoop:

mv hadoop-3.2.1 hadoop

Next, you will need to configure Hadoop and Java Environment Variables on your system.

Open the ~/.bashrc file in your favorite text editor:

nano ~/.bashrc

Append the following lines:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.232.b09-2.el8_1.x86_64/
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Save and close the file. Then, activate the environment variables with the following command:

source ~/.bashrc
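
You can quickly confirm that the new variables are picked up by printing the Hadoop version:

hadoop version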

Next, open the Hadoop environment variable file:

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Update the JAVA_HOME variable as per your Java installation path:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.232.b09-2.el8_1.x86_64/

Save and close the file when you are finished.

Step 6 – Configure Hadoop

First, you will need to create the namenode and datanode directories inside the hadoop user's home directory:

Run the following command to create both directories:

mkdir -p ~/hadoopdata/hdfs/namenode
mkdir -p ~/hadoopdata/hdfs/datanode

Next, edit the core-site.xml file and update with your system hostname:

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Change the fs.defaultFS value to match your system hostname:

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop.tecadmin.com:9000</value>
        </property>
</configuration>

Save and close the file. Then, edit the hdfs-site.xml file:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Change the NameNode and DataNode directory path as shown below:

<configuration>

        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>

        <property>
                <name>dfs.name.dir</name>
                <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
        </property>

        <property>
                <name>dfs.data.dir</name>
                <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
        </property>
</configuration>

Save and close the file. Then, edit the mapred-site.xml file:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Make the following changes:

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>

Save and close the file. Then, edit the yarn-site.xml file:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Make the following changes:

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>

Save and close the file when you are finished.

Step 7 – Start Hadoop Cluster

Before starting the Hadoop cluster, you will need to format the Namenode as the hadoop user.

Run the following command to format the hadoop Namenode:

hdfs namenode -format

You should get the following output:

2020-02-05 03:10:40,380 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2020-02-05 03:10:40,389 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2020-02-05 03:10:40,389 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop.tecadmin.com/45.58.38.202
************************************************************/

After formatting the Namenode, run the following command to start the Hadoop cluster:

start-dfs.sh

Once HDFS has started successfully, you should get the following output:

Starting namenodes on [hadoop.tecadmin.com]
hadoop.tecadmin.com: Warning: Permanently added 'hadoop.tecadmin.com,fe80::200:2dff:fe3a:26ca%eth0' (ECDSA) to the list of known hosts.
Starting datanodes
Starting secondary namenodes [hadoop.tecadmin.com]

Next, start the YARN service as shown below:

start-yarn.sh

You should get the following output:

Starting resourcemanager
Starting nodemanagers

You can now check the status of all Hadoop services using the jps command:

jps

You should see all the running services in the following output:

7987 DataNode
9606 Jps
8183 SecondaryNameNode
8570 NodeManager
8445 ResourceManager
7870 NameNode

Step 8 – Configure Firewall

Hadoop is now started and listening on ports 9870 and 8088. Next, you will need to allow these ports through the firewall.

Run the following command to allow Hadoop connections through the firewall:

firewall-cmd --permanent --add-port=9870/tcp
firewall-cmd --permanent --add-port=8088/tcp

Next, reload the firewalld service to apply the changes:

firewall-cmd --reload
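
You can confirm that the ports are now open with:

firewall-cmd --list-ports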

Step 9 – Access Hadoop Namenode and Resource Manager

To access the Namenode web interface, open your web browser and visit the URL http://your-server-ip:9870.

To access the Resource Manager, open your web browser and visit the URL http://your-server-ip:8088.
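
If you are working on a headless server, you can also check from the command line that both web interfaces respond (run this on the Hadoop host itself):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088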

Step 10 – Verify the Hadoop Cluster

At this point, the Hadoop cluster is installed and configured. Next, we will create some directories in the HDFS filesystem to test Hadoop.

Let's create some directories in the HDFS filesystem using the following commands:

hdfs dfs -mkdir /test1
hdfs dfs -mkdir /test2

Next, run the following command to list the above directories:

hdfs dfs -ls /

You should get the following output:

Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2020-02-05 03:25 /test1
drwxr-xr-x   - hadoop supergroup          0 2020-02-05 03:35 /test2
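
As an additional check, you can copy a small local file into one of the new directories and read it back (using /etc/hosts purely as an example file):

hdfs dfs -put /etc/hosts /test1/
hdfs dfs -cat /test1/hosts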

You can also verify the above directories in the Hadoop Namenode web interface.

Go to the Namenode web interface and click Utilities => Browse the file system. You should see the directories you created earlier.

Step 11 – Stop Hadoop Cluster

You can stop the Hadoop Namenode and YARN services at any time by running the stop-dfs.sh and stop-yarn.sh scripts as the hadoop user.

To stop the Hadoop Namenode service, run the following command as a hadoop user:

stop-dfs.sh 

To stop the Hadoop Resource Manager service, run the following command:

stop-yarn.sh

Conclusion

In the above tutorial, you learned how to set up a Hadoop single-node cluster on CentOS 8. You should now have enough knowledge to install Hadoop in a production environment.

How to Setup Hadoop on Ubuntu 18.04 & 16.04 LTS
https://tecadmin.net/setup-hadoop-on-ubuntu/

Apache Hadoop 3.1 has noticeable improvements and many bug fixes over the previous stable 3.0 releases. This version has many improvements in HDFS and MapReduce. This tutorial will help you install and configure a Hadoop 3.1.2 single-node cluster on Ubuntu 18.04, 16.04 LTS, and Linux Mint systems. This article has been tested with Ubuntu 18.04 LTS.

Step 1 – Prerequisites

Java is the primary requirement for running Hadoop on any system, so make sure you have Java installed on your system. If you don't have Java installed, install it first.

Step 2 – Create User for Hadoop

We recommend creating a normal (non-root) account for Hadoop to work under. To create the account, use the following command.

adduser hadoop

After creating the account, you also need to set up key-based SSH to its own account. To do this, execute the following commands.

su - hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Now, SSH to localhost as the hadoop user. This should not ask for a password, but the first time it will prompt you to add the RSA key to the list of known hosts.

ssh localhost
exit

Step 3 – Download Hadoop Source Archive

In this step, download the Hadoop 3.1.2 archive using the command below. You can also select an alternate download mirror to increase download speed.

cd ~
wget http://www-eu.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
tar xzf hadoop-3.1.2.tar.gz
mv hadoop-3.1.2 hadoop

Step 4 – Setup Hadoop Pseudo-Distributed Mode

4.1. Setup Hadoop Environment Variables

Set up the environment variables used by Hadoop. Edit the ~/.bashrc file and append the following values at the end of the file.

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Then, apply the changes to the current running environment:

source ~/.bashrc

Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable. Change the Java path to match the installation on your system. This path may vary depending on your operating system version and installation source, so make sure you are using the correct path.

vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Update the entry below:

export JAVA_HOME=/usr/lib/jvm/java-11-oracle
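
If you are not sure of the correct value, you can usually discover the Java installation directory with readlink (this assumes the java binary on your PATH is the one Hadoop should use):

readlink -f $(which java) | sed 's:/bin/java::'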

4.2. Setup Hadoop Configuration Files

Hadoop has many configuration files, which need to be configured as per the requirements of your Hadoop infrastructure. Let's start with a basic Hadoop single-node cluster configuration. First, navigate to the location below:

cd $HADOOP_HOME/etc/hadoop

Edit core-site.xml

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>
</configuration>

Edit hdfs-site.xml

<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>

<property>
  <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

Edit mapred-site.xml

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>

Edit yarn-site.xml

<configuration>
 <property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
</configuration>

4.3. Format Namenode

Now format the namenode using the following command, and make sure in the output that the storage directory has been successfully formatted:

hdfs namenode -format

Sample output:

WARNING: /home/hadoop/hadoop/logs does not exist. Creating.
2018-05-02 17:52:09,678 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = tecadmin/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.1.2
...
...
...
2018-05-02 17:52:13,717 INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.
2018-05-02 17:52:13,806 INFO namenode.FSImageFormatProtobuf: Saving image file /home/hadoop/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
2018-05-02 17:52:14,161 INFO namenode.FSImageFormatProtobuf: Image file /home/hadoop/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 391 bytes saved in 0 seconds .
2018-05-02 17:52:14,224 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-05-02 17:52:14,282 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at tecadmin/127.0.1.1
************************************************************/

Step 5 – Start Hadoop Cluster

Let's start your Hadoop cluster using the scripts provided by Hadoop. Just navigate to your $HADOOP_HOME/sbin directory and execute the scripts one by one.

cd $HADOOP_HOME/sbin/

Now execute the start-dfs.sh script.

./start-dfs.sh

Then execute the start-yarn.sh script.

./start-yarn.sh
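
Once both scripts have completed, you can verify that the Hadoop daemons are running with the jps command:

jps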

Step 6 – Access Hadoop Services in Browser

The Hadoop NameNode web interface starts on port 9870 by default. Access your server on port 9870 in your favorite web browser.

http://svr1.tecadmin.net:9870/

Now access port 8088 to get information about the cluster and all running applications.

http://svr1.tecadmin.net:8088/

Access port 9864 to get details about the DataNode on your Hadoop node.

http://svr1.tecadmin.net:9864/

Step 7 – Test Hadoop Single Node Setup

7.1. Make the required HDFS directories using the following commands.

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hadoop

7.2. Copy all files from the local file system directory /var/log/apache2 to the Hadoop distributed file system using the command below.

bin/hdfs dfs -put /var/log/apache2 logs

7.3. Browse the Hadoop distributed file system by opening the URL below in a browser. You will see an apache2 folder in the list. Click the folder name to open it and you will find all the log files there.

 http://svr1.tecadmin.net:9870/explorer.html#/user/hadoop/logs/

7.4. Now copy the logs directory from the Hadoop distributed file system to the local file system.

bin/hdfs dfs -get logs /tmp/logs
ls -l /tmp/logs/
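
If you want to clean up the test data afterwards, you can remove the logs directory from HDFS:

bin/hdfs dfs -rm -r logs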

You can also check this tutorial to run a wordcount MapReduce job example using the command line.

Hadoop Commands to Manage Files on HDFS
https://tecadmin.net/hadoop-commands-to-manage-files-on-hdfs/

This tutorial helps you learn to manage files on HDFS in Hadoop. You will learn how to create, upload, download, and list contents in HDFS. The commands below show how to create a directory structure in HDFS, copy files from the local file system to HDFS, and download files from HDFS back to the local file system.

Create Directory in HDFS

The -mkdir subcommand takes one or more path URIs as arguments and creates the corresponding directories.

hdfs dfs -mkdir <path> [<path> ...]

Remember that you must create a home directory in HDFS matching your system username. For example, if you are logged in as hduser on your system, first create /user/hduser, otherwise you will get a 'No such file or directory' error. Then create the directory structure inside it.

hdfs dfs -mkdir /user/hduser
hdfs dfs -mkdir /user/hduser/input
hdfs dfs -mkdir /user/hduser/output 
hdfs dfs -mkdir /user/hduser/input/text /user/hduser/input/xml

Copy Files to HDFS

After creating the directory structure, put some files into HDFS from your local file system.

hdfs dfs -put LOCAL_FILE HDFS_PATH

For example, you have test1.txt in the current directory and /tmp/test2.xml on your local file system.

hdfs dfs -put test1.txt /user/hduser/input/text/
hdfs dfs -put /tmp/test2.xml /user/hduser/input/xml/

List Files from HDFS

Use the following example commands to list the contents of directories in HDFS.

hdfs dfs -ls /user/hduser
hdfs dfs -ls /user/hduser/input/
hdfs dfs -ls /user/hduser/input/text/

Use -R to list files recursively inside directories. For example:

hdfs dfs -ls -R /user/hduser/input/

Download Files from HDFS

At this point, you have learned how to copy and list files in HDFS. Now use the following example commands to download (copy) files from HDFS to the local file system.

hdfs dfs -get /user/hduser/input/text/test1.txt /tmp/
hdfs dfs -get /user/hduser/input/xml/test2.xml /tmp/

Here, /tmp is on the system's local file system.

Copy Files between HDFS Directories

You can copy files between HDFS directories using the distcp tool.

hadoop distcp /user/hduser/input/xml/test2.xml /user/hduser/output
hadoop distcp /user/hduser/input/text/test1.txt /user/hduser/output
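
The same copy could also be done with the simpler -cp subcommand, and -rm removes a file when you no longer need it (example paths from above):

hdfs dfs -cp /user/hduser/input/text/test1.txt /user/hduser/output/
hdfs dfs -rm /user/hduser/output/test1.txt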

HADOOP/HDFS ls: ‘.’: No such file or directory
https://tecadmin.net/hadoop-hdfs-ls-no-such-file-or-directory/

Sometimes, on a freshly set up Hadoop cluster, listing the filesystem fails with an error like ls: ‘.’: No such file or directory. This issue occurs because no home directory has been created on HDFS for your current user.

To resolve this issue, create the home directory on HDFS. For example, suppose you are logged in as user hduser on your system.

$ hdfs dfs -mkdir -p /user/hduser
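
If you had to create this directory as a different administrative account (for example, the user that runs the Namenode), also hand it over to your own user afterwards; the user and group names here are just an example:

$ hdfs dfs -chown hduser:hduser /user/hduser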

All set. Now you can list files and directories in the Hadoop Distributed File System and perform other operations normally.

$ hdfs dfs -ls

Hadoop – Namenode is in safe mode
https://tecadmin.net/hadoop-namenode-safe-mode/

The Namenode loads the filesystem state from the fsimage file, stays in safe mode, and waits for data nodes to report their blocks. Safe mode is a read-only mode for the HDFS cluster, so that it does not prematurely start replicating blocks. Use the following command to force the namenode to leave safe mode.

$ hadoop dfsadmin -safemode leave

In newer versions of Hadoop, the hadoop dfsadmin form of the command is deprecated. You can use the hdfs command instead of hadoop. For example:

$ hdfs dfsadmin -safemode leave
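
To check whether the Namenode is currently in safe mode before forcing it out, you can query the status first:

$ hdfs dfsadmin -safemode get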

You should run hdfs fsck afterwards to sort out any inconsistencies created in HDFS due to the above command.
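
For example, to check the whole filesystem for inconsistencies:

$ hdfs fsck /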

Hadoop – Running a Wordcount Mapreduce Example
https://tecadmin.net/hadoop-running-a-wordcount-mapreduce-example/

This tutorial will help you run a wordcount MapReduce example in Hadoop using the command line. This can also serve as an initial test of your Hadoop setup.

1. Prerequisites

You must have a running Hadoop setup on your system. If you don't have Hadoop installed, visit the Hadoop installation on Linux tutorial.

2. Copy Files to Namenode Filesystem

After successfully formatting the namenode, make sure all Hadoop services are started properly. Now create a directory in the Hadoop filesystem.

$ hdfs dfs -mkdir -p /user/hadoop/input

Copy a text file into the input directory on the Hadoop filesystem. Here, LICENSE.txt is copied to it; you can copy more than one file.

$ hdfs dfs -put LICENSE.txt /user/hadoop/input/

3. Running Wordcount Command

Now run the wordcount MapReduce example using the following command. The command below reads all files from the input folder and processes them with the MapReduce examples jar file. After successful completion of the task, the results are placed in the output directory.

$ cd $HADOOP_HOME
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output

4. Show Results

First, check the names of the result files created under /user/hadoop/output in HDFS using the following command.

$ hdfs dfs -ls /user/hadoop/output

Now show the content of the result file, where you will see the wordcount output with the count of each word.

$ hdfs dfs -cat /user/hadoop/output/part-r-00000
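
Note that MapReduce refuses to run if the output directory already exists, so if you want to re-run the job, remove the output directory first:

$ hdfs dfs -rm -r /user/hadoop/output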

How to Install and Configure Apache Hadoop on CentOS & Fedora
https://tecadmin.net/setup-hadoop-single-node-cluster-on-centos-redhat/

Having been around for some time now, Hadoop has become one of the most popular open-source big data solutions. It processes data in batches and is famous for its scalable, cost-effective, and distributed computing capabilities. It’s one of the most popular open-source frameworks in the data analysis and storage space. As a user, you can use it to manage your data, analyze that data, and store it again – all in an automated way. With Hadoop installed on your Fedora system, you can access important analytic services with ease.

This article covers how to install Apache Hadoop on CentOS and Fedora systems, both for local usage and for a production server.

1. Prerequisites

Java is the primary requirement for running Hadoop on any system, so make sure you have Java installed on your system. If you don't have Java installed, install it first.

2. Create Hadoop User

We recommend creating a normal (non-root) account for Hadoop to work under. To create the account, use the following commands.

adduser hadoop
passwd hadoop

After creating the account, you also need to set up key-based SSH to its own account. To do this, execute the following commands.

su - hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Let's verify key-based login. The command below should not ask for a password, but the first time it will prompt you to add the RSA key to the list of known hosts.

ssh localhost
exit

3. Download Hadoop 3.1 Archive

In this step, download the Hadoop 3.1.0 archive using the command below. You can also select an alternate download mirror to increase download speed.

cd ~
wget http://www-eu.apache.org/dist/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz
tar xzf hadoop-3.1.0.tar.gz
mv hadoop-3.1.0 hadoop

4. Setup Hadoop Pseudo-Distributed Mode

4.1. Setup Hadoop Environment Variables

First, we need to set the environment variables used by Hadoop. Edit the ~/.bashrc file and append the following values at the end of the file.

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply the changes to the current running environment:

source ~/.bashrc

Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable. Change the Java path to match the installation on your system. This path may vary depending on your operating system version and installation source, so make sure you are using the correct path.

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
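
If you are unsure which path to use, you can list the JVM directories installed on the system and pick the matching one (the exact directory names depend on the packages you installed):

ls -d /usr/lib/jvm/*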

4.2. Setup Hadoop Configuration Files

Hadoop has many configuration files, which need to be configured as per the requirements of your Hadoop infrastructure. Let's start with a basic Hadoop single-node cluster configuration. First, navigate to the location below:

cd $HADOOP_HOME/etc/hadoop

Edit core-site.xml

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>
</configuration>

Edit hdfs-site.xml

<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>

<property>
  <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

Edit mapred-site.xml

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>

Edit yarn-site.xml

<configuration>
 <property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
</configuration>

4.3. Format Namenode

Now format the namenode using the following command, and make sure in the output that the storage directory has been successfully formatted:

hdfs namenode -format

Sample output:

WARNING: /home/hadoop/hadoop/logs does not exist. Creating.
2018-05-02 17:52:09,678 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = tecadmin/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.1.0
...
...
...
2018-05-02 17:52:13,717 INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.
2018-05-02 17:52:13,806 INFO namenode.FSImageFormatProtobuf: Saving image file /home/hadoop/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
2018-05-02 17:52:14,161 INFO namenode.FSImageFormatProtobuf: Image file /home/hadoop/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 391 bytes saved in 0 seconds .
2018-05-02 17:52:14,224 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-05-02 17:52:14,282 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at tecadmin/127.0.1.1
************************************************************/

5. Start Hadoop Cluster

Let's start your Hadoop cluster using the scripts provided by Hadoop. Just navigate to your $HADOOP_HOME/sbin directory and execute the scripts one by one.

cd $HADOOP_HOME/sbin/

Now run the start-dfs.sh script.

./start-dfs.sh

Sample output:

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [tecadmin]
2018-05-02 18:00:32,565 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Now run the start-yarn.sh script.

./start-yarn.sh

Sample output:

Starting resourcemanager
Starting nodemanagers
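
You can verify that all daemons are up with the jps command; a healthy single-node setup shows NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and Jps:

jps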

6. Access Hadoop Services in Browser

The Hadoop NameNode web interface starts on port 9870 by default. Access your server on port 9870 in your favorite web browser.

http://svr1.tecadmin.net:9870/

Now access port 8088 to get information about the cluster and all running applications.

http://svr1.tecadmin.net:8088/

Access port 9864 to get details about the DataNode on your Hadoop node.

http://svr1.tecadmin.net:9864/

7. Test Hadoop Single Node Setup

7.1. Make the required HDFS directories using the following commands.

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hadoop

7.2. Copy all files from the local file system directory /var/log/httpd to the Hadoop distributed file system using the command below.

bin/hdfs dfs -put /var/log/httpd logs

7.3. Browse the Hadoop distributed file system by opening the URL below in a browser. You will see an httpd folder in the list. Click the folder name to open it and you will find all the log files there.

 http://svr1.tecadmin.net:9870/explorer.html#/user/hadoop/logs/

7.4. Now copy the logs directory from the Hadoop distributed file system to the local file system.

bin/hdfs dfs -get logs /tmp/logs
ls -l /tmp/logs/

You can also check this tutorial to run a wordcount MapReduce job example using the command line.

How to Setup Hadoop 2.6.5 (Single Node Cluster) on Ubuntu, CentOS And Fedora
https://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/

Apache Hadoop 2.6.5 has noticeable improvements over the previous stable 2.X.Y releases. This version has many improvements in HDFS and MapReduce. This how-to guide will help you install Hadoop 2.6 on CentOS/RHEL 7/6/5, Ubuntu, and other Debian-based operating systems. This article doesn't include the overall configuration of Hadoop; it covers only the basic configuration required to start working with Hadoop.

Step 1: Installing Java

Java is the primary requirement to set up Hadoop on any system, so make sure you have Java installed on your system using the following command.

# java -version 

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

If you don’t have Java installed on your system, use one of the following links to install it first.

Install Java 8 on CentOS/RHEL 7/6/5
Install Java 8 on Ubuntu

Step 2: Creating Hadoop User

We recommend creating a normal (non-root) account for Hadoop to work under. Create a system account using the following commands.

# adduser hadoop
# passwd hadoop

After creating the account, you also need to set up key-based SSH to its own account. To do this, execute the following commands.

# su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Let's verify key-based login. The command below should not ask for a password, but the first time it will prompt you to add the RSA key to the list of known hosts.

$ ssh localhost
$ exit

Step 3. Downloading Hadoop 2.6.5

Now download the Hadoop 2.6.5 archive using the command below. You can also select an alternate download mirror to increase download speed.

$ cd ~
$ wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz 
$ tar xzf hadoop-2.6.5.tar.gz 
$ mv hadoop-2.6.5 hadoop

Step 4. Configure Hadoop Pseudo-Distributed Mode

4.1. Setup Hadoop Environment Variables

First, we need to set the environment variables used by Hadoop. Edit the ~/.bashrc file and append the following values at the end of the file.

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply the changes to the current running environment:

$ source ~/.bashrc

Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable. Change the Java path to match the installation on your system.

export JAVA_HOME=/opt/jdk1.8.0_131/

4.2. Edit Configuration Files

Hadoop has many configuration files, which need to be configured as per the requirements of your Hadoop infrastructure. Let's start with a basic Hadoop single-node cluster configuration. First, navigate to the location below:

$ cd $HADOOP_HOME/etc/hadoop

Edit core-site.xml

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>
</configuration>

Edit hdfs-site.xml

<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>

<property>
  <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

Edit mapred-site.xml

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>

Edit yarn-site.xml

<configuration>
 <property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
</configuration>

4.3. Format Namenode

Now format the namenode using the following command, and make sure in the output that the storage directory has been successfully formatted:

$ hdfs namenode -format

Sample output:

15/02/04 09:58:43 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = svr1.tecadmin.net/192.168.1.133
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.5
...
...
15/02/04 09:58:57 INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.
15/02/04 09:58:57 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/02/04 09:58:57 INFO util.ExitUtil: Exiting with status 0
15/02/04 09:58:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at svr1.tecadmin.net/192.168.1.133
************************************************************/

Step 5. Start Hadoop Cluster

Now start your Hadoop cluster using the scripts provided by Hadoop. Just navigate to your Hadoop sbin directory and execute the scripts one by one.

$ cd $HADOOP_HOME/sbin/

Now run the start-dfs.sh script.

$ start-dfs.sh

Sample output:

15/02/04 10:00:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-svr1.tecadmin.net.out
localhost: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-svr1.tecadmin.net.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is 3c:c4:f6:f1:72:d9:84:f9:71:73:4a:0d:55:2c:f9:43.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-svr1.tecadmin.net.out
15/02/04 10:01:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Now run the start-yarn.sh script.

$ start-yarn.sh

Sample output:

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-svr1.tecadmin.net.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-svr1.tecadmin.net.out

Step 6. Access Hadoop Services in Browser

The Hadoop NameNode web interface starts on port 50070 by default. Access your server on port 50070 in your favorite web browser.

http://svr1.tecadmin.net:50070/

Now access port 8088 to get information about the cluster and all applications.

http://svr1.tecadmin.net:8088/

Access port 50090 to get details about the secondary namenode.

http://svr1.tecadmin.net:50090/

Access port 50075 to get details about the DataNode.

http://svr1.tecadmin.net:50075/

Step 7. Test Hadoop Single Node Setup

7.1 – Make the required HDFS directories using the following commands.

$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hadoop

7.2 – Now copy all files from the local file system directory /var/log/httpd to the Hadoop distributed file system using the command below.

$ bin/hdfs dfs -put /var/log/httpd logs
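
Relative HDFS paths like logs resolve under your HDFS home directory (/user/hadoop here), so you can confirm the upload with:

$ bin/hdfs dfs -ls logs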

7.3 – Now browse the Hadoop distributed file system by opening the URL below in a browser.

 http://svr1.tecadmin.net:50070/explorer.html#/user/hadoop/logs

7.4 – Now copy the logs directory from the Hadoop distributed file system to the local file system.

$ bin/hdfs dfs -get logs /tmp/logs
$ ls -l /tmp/logs/

You can also check this tutorial to run a wordcount MapReduce job example using the command line.

How to Install Elasticsearch on CentOS 7/6
https://tecadmin.net/install-elasticsearch-on-linux/

Elasticsearch is a flexible and powerful open-source, distributed, real-time search and analytics engine. Using a simple set of APIs, it provides the ability for full-text search. Elasticsearch is freely available under the Apache 2 license, which provides the most flexibility.

This tutorial will help you set up an Elasticsearch single-node cluster on CentOS, Red Hat, and Fedora systems.

Step 1 – Prerequisites

Java is the primary requirement for installing Elasticsearch on any system. You can check the installed version of Java by executing the following command. If it returns an error, install Java on your system using this tutorial.

java -version

Step 2 – Setup Yum Repository

First of all, import the GPG key for the Elasticsearch RPM packages.

sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch

Then create a yum repository file for Elasticsearch. Edit the /etc/yum.repos.d/elasticsearch.repo file:

sudo vi /etc/yum.repos.d/elasticsearch.repo

Add the content below:

[Elasticsearch-7]
name=Elasticsearch repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

Step 3 – Install Elasticsearch

After adding the yum repository, install Elasticsearch on your CentOS or RHEL system using the following command:

sudo yum install elasticsearch

After a successful installation, edit the Elasticsearch configuration file "/etc/elasticsearch/elasticsearch.yml" and set network.host to localhost. You can also change it to the system's LAN IP address to make it accessible over the network.

vim /etc/elasticsearch/elasticsearch.yml
  network.host: localhost

Then enable the elasticsearch service and start it.

sudo systemctl enable elasticsearch
sudo systemctl start elasticsearch
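
You can check that the service came up cleanly with:

sudo systemctl status elasticsearch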

Elasticsearch has now been successfully installed and is running on your CentOS or RHEL system.

Run the following command to verify the service:

curl -X GET "localhost:9200/?pretty"

You will see the results like below:

{
  "name" : "tecadmin",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "HY8HoLHnRCeb3QzXnTcmrQ",
  "version" : {
    "number" : "7.4.0",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "22e1767283e61a198cb4db791ea66e3f11ab9910",
    "build_date" : "2019-09-27T08:36:48.569419Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Step 4 – Elasticsearch Examples (Optional)

The following examples will help you add, fetch, and search data in the Elasticsearch cluster.

Create a New Index

curl -XPUT http://localhost:9200/mybucket

Output:

{"acknowledged":true}

Adding Data to Elasticsearch

Use the following commands to add some data to Elasticsearch.
Command 1:

curl -XPUT 'http://localhost:9200/mybucket/user/johny' -d '{ "name" : "Rahul Kumar" }'

Output:

{"_index":"mybucket","_type":"user","_id":"johny","_version":1,"created":true}

Command 2:

curl -XPUT 'http://localhost:9200/mybucket/post/1' -d '
{
    "user": "Rahul",
    "postDate": "01-15-2015",
    "body": "This is Demo Post 1 in Elasticsearch" ,
    "title": "Demo Post 1"
}'

Output:

{"_index":"mybucket","_type":"post","_id":"1","_version":1,"created":true}

Command 3:

curl -XPUT 'http://localhost:9200/mybucket/post/2' -d '
{
    "user": "TecAdmin",
    "postDate": "01-15-2015",
    "body": "This is Demo Post 2 in Elasticsearch" ,
    "title": "Demo Post 2"
}'

Output:

{"_index":"mybucket","_type":"post","_id":"2","_version":1,"created":true}

Fetching Data from Elasticsearch

Use the following commands to GET data from Elasticsearch and read the output.

curl -XGET 'http://localhost:9200/mybucket/user/johny?pretty=true'
curl -XGET 'http://localhost:9200/mybucket/post/1?pretty=true'
curl -XGET 'http://localhost:9200/mybucket/post/2?pretty=true'

Searching in Elasticsearch

Use the following command to search data in Elasticsearch. The command below will search for all posts associated with user TecAdmin.

curl 'http://localhost:9200/mybucket/post/_search?q=user:TecAdmin&pretty=true'

Output:

{
  "took" : 145,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "mybucket",
      "_type" : "post",
      "_id" : "2",
      "_score" : 0.30685282,
      "_source":
{
    "user": "TecAdmin",
    "postDate": "01-15-2015",
    "body": "This is Demo Post 2 in Elasticsearch" ,
    "title": "Demo Post 2"
}
    } ]
  }
}

Congratulations! You have successfully configured an Elasticsearch single-node cluster on your Linux system.
