hadoop

https://hadoop.apache.org/

Prerequisites

Install a JDK on each node; see “Install Java Development Environment”.

Get the shell script install_java_bin.

Download the Hadoop binary package:

$ curl -LO https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz

Verify the checksum:

$ sha512sum hadoop-3.4.1.tar.gz
09cda6943625bc8e4307deca7a4df76d676a51aca1b9a0171938b793521dfe1ab5970fdb9a490bab34c12a2230ffdaed2992bad16458169ac51b281be1ab6741  hadoop-3.4.1.tar.gz
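The digest can also be compared automatically instead of by eye. A minimal sketch, assuming GNU coreutils' sha512sum; the function name verify_sha512 is hypothetical:

```shell
# verify_sha512 FILE EXPECTED_HEX
# Returns 0 if the SHA-512 digest of FILE equals EXPECTED_HEX, non-zero otherwise.
verify_sha512() {
    file="$1"
    expected="$2"
    # sha512sum -c reads "DIGEST  FILENAME" lines (two spaces) from stdin;
    # --status suppresses output so that only the exit code reports the result.
    printf '%s  %s\n' "$expected" "$file" | sha512sum -c --status -
}
```

For example, `verify_sha512 hadoop-3.4.1.tar.gz <digest> && echo OK`, where `<digest>` is the value published for the release.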

Set up password-free SSH login to all workers for the user that runs Hadoop (ubuntu in this case). This is required even for a worker that is the same node from which the commands are issued.
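One common way to do this is with an SSH key pair. The sketch below is an assumption about the environment, not part of the original procedure: it presumes that OpenSSH's ssh-keygen and ssh-copy-id are installed and that the nodes are named las0, las1, and las2, as in the workers file used on this page. The helper names run and setup_ssh_login are hypothetical. With DRY_RUN set, the commands are only printed.

```shell
# run CMD ARGS...: execute the command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# setup_ssh_login USER HOST...: create a key pair if none exists, then
# install the public key on every host (asks for the password once per host).
setup_ssh_login() {
    user="$1"
    shift
    key="${HOME}/.ssh/id_ed25519"
    [ -f "$key" ] || run ssh-keygen -t ed25519 -N "" -f "$key"
    for host in "$@"; do
        run ssh-copy-id -i "${key}.pub" "${user}@${host}"
    done
}
```

Usage: `setup_ssh_login ubuntu las0 las1 las2`. Afterwards, `ssh ubuntu@las1 true` should return without asking for a password.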

Deploy

Install the Hadoop package on each node:

$ install_java_bin hadoop hadoop-3.4.1.tar.gz /opt
$ sudo chown ubuntu:ubuntu /opt/hadoop

Set environment variables on each node:

$ echo "export HADOOP_HOME=\"/opt/hadoop\"" | sudo tee -a /etc/profile.d/hadoop.sh
$ echo "export HADOOP_CLASSPATH=\"\$(\${HADOOP_HOME}/bin/hadoop classpath)\"" | sudo tee -a /etc/profile.d/hadoop.sh

Configure

Edit file /opt/hadoop/etc/hadoop/hadoop-env.sh:

 # The java implementation to use. By default, this environment
 # variable is REQUIRED on ALL platforms except OS X!
 # export JAVA_HOME=
+export $(/usr/bin/env java -XshowSettings:properties -version 2>&1 | grep 'java.home' | sed -e 's/java.home/JAVA_HOME/;s/ //g;')
 
 # The language environment in which Hadoop runs. Use the English
 # environment to ensure that logs are printed as expected.
 #
 # For example, to limit who can execute the namenode command,
 # export HDFS_NAMENODE_USER=hdfs
+export HDFS_NAMENODE_USER=ubuntu
+export HDFS_DATANODE_USER=ubuntu
+export HDFS_SECONDARYNAMENODE_USER=ubuntu
+export YARN_RESOURCEMANAGER_USER=ubuntu
+export YARN_NODEMANAGER_USER=ubuntu
 
 
 ###

Edit file /opt/hadoop/etc/hadoop/core-site.xml:

 <!-- Put site-specific property overrides in this file. -->
 
 <configuration>
+    <property>
+        <name>fs.defaultFS</name>
+        <value>hdfs://las0:9000</value>
+    </property>
+    <property>
+        <name>hadoop.tmp.dir</name>
+        <value>/opt/tmp/hadoop</value>
+    </property>
+    <property>
+        <name>hadoop.proxyuser.root.hosts</name>
+        <value>*</value>
+    </property>
+    <property>
+        <name>hadoop.proxyuser.root.groups</name>
+        <value>*</value>
+    </property>
 </configuration>

Copy these files to the same path on all nodes.

Create the directory for ${hadoop.tmp.dir} on each node:

$ sudo mkdir -p /opt/tmp/hadoop
$ sudo chown ubuntu:ubuntu /opt/tmp/hadoop

Edit file /opt/hadoop/etc/hadoop/workers:

-localhost
+las0
+las1
+las2

Configure hdfs

Edit file /opt/hadoop/etc/hadoop/hdfs-site.xml:

 <!-- Put site-specific property overrides in this file. -->
 
 <configuration>
-
+    <property>
+        <name>dfs.replication</name>
+        <value>3</value>
+    </property>
+    <property>
+        <name>dfs.blocksize</name>
+        <value>262144</value>
+    </property>
+    <property>
+        <name>dfs.namenode.fs-limits.min-block-size</name>
+        <value>262144</value>
+    </property>
+    <property>
+        <name>dfs.namenode.handler.count</name>
+        <value>10</value>
+    </property>
 </configuration>

Copy this file to the same path on all nodes.
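The value 262144 is a byte count: 256 KiB. That is far below the stock defaults (dfs.blocksize defaults to 134217728 bytes, or 128 MiB, and dfs.namenode.fs-limits.min-block-size to 1048576 bytes), which is why the minimum is lowered to the same value here. The arithmetic, as a small sketch (the function name blocks_for is hypothetical):

```shell
# Block size from hdfs-site.xml, in bytes (256 KiB).
block_size=$((256 * 1024))
echo "$block_size"                          # prints 262144

# blocks_for SIZE BLOCK_SIZE: number of blocks needed for a file of SIZE bytes
# (ceiling division, since the last block may be only partly filled).
blocks_for() {
    size="$1"
    bs="$2"
    echo $(( (size + bs - 1) / bs ))
}

blocks_for $((1024 * 1024)) "$block_size"   # prints 4
```

With dfs.replication set to 3, each of those blocks is stored on three DataNodes.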

Configure yarn

Edit file /opt/hadoop/etc/hadoop/yarn-site.xml:

 <configuration>
 
 <!-- Site specific YARN configuration properties -->
-
+    <property>
+        <name>yarn.resourcemanager.hostname</name>
+        <value>las0</value>
+    </property>
+    <property>
+        <name>yarn.nodemanager.aux-services</name>
+        <value>mapreduce_shuffle</value>
+    </property>
+    <property>
+        <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
+        <value>true</value>
+    </property>
+    <property>
+        <name>yarn.scheduler.maximum-allocation-vcores</name>
+        <value>8</value>
+    </property>
+    <property>
+        <name>yarn.resourcemanager.ha.enabled</name>
+        <value>false</value>
+    </property>
+    <property>
+        <name>yarn.webapp.ui2.enable</name>
+        <value>true</value>
+    </property>
 </configuration>

Edit file /opt/hadoop/etc/hadoop/mapred-site.xml:

 <!-- Put site-specific property overrides in this file. -->
 
 <configuration>
-
+    <property>
+        <name>mapreduce.framework.name</name>
+        <value>yarn</value>
+    </property>
+    <property>
+        <name>mapreduce.application.classpath</name>
+        <value>/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/mapreduce/lib/*</value>
+    </property>
 </configuration>

Copy these files to the same path on all nodes.
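Copying the edited configuration files can be scripted. A minimal sketch, assuming the files were edited on las0, that password-free SSH is already set up as described in the prerequisites, and that scp is available. The helper names run and push_conf are hypothetical; with DRY_RUN set, the commands are only printed.

```shell
CONF_DIR=/opt/hadoop/etc/hadoop

# run CMD ARGS...: execute the command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# push_conf HOST...: copy every configuration file edited on this page
# to the same path on each given host.
push_conf() {
    for host in "$@"; do
        for f in hadoop-env.sh core-site.xml hdfs-site.xml \
                 yarn-site.xml mapred-site.xml workers; do
            run scp "${CONF_DIR}/${f}" "${host}:${CONF_DIR}/${f}"
        done
    done
}
```

Usage: `push_conf las1 las2`.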

Run

Check the version:

$ hadoop version
Hadoop 3.4.1
Source code repository https://github.com/apache/hadoop.git -r 4d7825309348956336b8f06a08322b78422849b1
Compiled by mthakur on 2024-10-09T14:57Z
Compiled on platform linux-x86_64
Compiled with protoc 3.23.4
From source with checksum 7292fe9dba5e2e44e3a9f763fce3e680
This command was run using /opt/hadoop-3.4.1/share/hadoop/common/hadoop-common-3.4.1.jar

Initialize HDFS by formatting the NameNode (only once, before the first start):

$ hdfs namenode -format

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at las0/10.225.4.51
************************************************************/

Start hdfs:

$ start-dfs.sh
Starting namenodes on [las0]
Starting datanodes
Starting secondary namenodes [las0]

Start yarn:

$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers

Show the Java processes:

$ jps -lm
2509842 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
2510720 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
2509278 org.apache.hadoop.hdfs.server.namenode.NameNode
2511597 sun.tools.jps.Jps -lm
2509499 org.apache.hadoop.hdfs.server.datanode.DataNode
2510937 org.apache.hadoop.yarn.server.nodemanager.NodeManager
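The listing above can be checked by script rather than by eye. A minimal sketch that reads `jps -l` output and reports missing daemons; the function name check_daemons is hypothetical, and the expected list matches a node such as las0 that runs both the master and the worker daemons (a pure worker runs only DataNode and NodeManager):

```shell
# check_daemons: read `jps -l` output on stdin and print the expected
# Hadoop daemons that are absent. Returns non-zero if any is missing.
check_daemons() {
    listing=$(cat)
    missing=0
    for name in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
        # Match the simple class name at the end of the fully qualified name,
        # so that "NameNode" does not match "SecondaryNameNode".
        if ! printf '%s\n' "$listing" | grep -q "\.${name}\$"; then
            echo "missing: ${name}"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: `jps -l | check_daemons && echo "all daemons running"`.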

Stop them:

$ stop-yarn.sh
Stopping nodemanagers
Stopping resourcemanager
$ stop-dfs.sh
Stopping namenodes on [las0]
Stopping datanodes
Stopping secondary namenodes [las0]

Usage

hdfs

$ hdfs dfs -ls /
$ hdfs dfs -mkdir -p /user/ubuntu
$ echo 'Hello world!' > file.dat
$ hdfs dfs -put file.dat
$ hdfs dfs -cat file.dat
Hello world!
$ hdfs dfs -rm file.dat
Deleted file.dat

The HDFS NameNode web UI is available at http://las0:9870.

Safe mode

Show current safe mode status:

$ hdfs dfsadmin -safemode get
Safe mode is OFF

Enter safe mode:

$ hdfs dfsadmin -safemode enter
Safe mode is ON

Leave safe mode:

$ hdfs dfsadmin -safemode leave
Safe mode is OFF

Note

A freshly started or restarted NameNode is temporarily in safe mode. It leaves safe mode automatically once enough DataNodes have reported their blocks.
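A script that starts HDFS and uses it right away can block until safe mode ends by using the wait subcommand of `hdfs dfsadmin -safemode`. A minimal sketch; the wrapper name after_safemode is hypothetical:

```shell
# after_safemode CMD ARGS...: block until the NameNode has left safe mode,
# then run the given command. Does not run it if the wait itself fails.
after_safemode() {
    hdfs dfsadmin -safemode wait && "$@"
}
```

Usage: `after_safemode hdfs dfs -ls /`.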

Clear all data

To clear all HDFS data, stop HDFS and run the following command on each node:

$ rm -rf /opt/tmp/hadoop/dfs/*

This also removes the NameNode metadata, so format the NameNode again (hdfs namenode -format) before restarting HDFS.

yarn

The YARN ResourceManager web UI is available at http://las0:8088. The new UI, enabled above with yarn.webapp.ui2.enable, is at http://las0:8088/ui2/.