hadoop

https://hadoop.apache.org/

Prerequisites

Install a JDK on each node; see “Install Java Development Environment”.

Get the shell script install_java_bin.

Download the Hadoop binary package:

$ curl -LO https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz

Verify the checksum:

$ sha512sum hadoop-3.4.1.tar.gz
09cda6943625bc8e4307deca7a4df76d676a51aca1b9a0171938b793521dfe1ab5970fdb9a490bab34c12a2230ffdaed2992bad16458169ac51b281be1ab6741  hadoop-3.4.1.tar.gz
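Comparing a 128-character digest by eye is error-prone. The following is a minimal sketch that lets sha512sum do the comparison, assuming GNU coreutils (as on Ubuntu); the function name verify_sha512 is hypothetical and not part of any package:

```shell
# verify_sha512 FILE EXPECTED_DIGEST
# Returns 0 if the SHA-512 digest of FILE equals EXPECTED_DIGEST,
# non-zero otherwise. Prints nothing (--status).
verify_sha512() {
    # sha512sum -c expects lines of the form "DIGEST  FILENAME"
    # (two spaces); here the single line is passed on stdin.
    printf '%s  %s\n' "$2" "$1" | sha512sum -c --status -
}

# Usage (digest copied from the output above):
#   verify_sha512 hadoop-3.4.1.tar.gz 09cda694...e1ab6741 && echo OK
```

In the usage comment the digest is abbreviated for readability; pass the full value when running it.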

Set up password-free SSH login to all workers (including the node the commands are issued from) for the user that runs Hadoop (ubuntu in this case).
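One common way to do this is with an SSH key pair and ssh-copy-id. The sketch below only prints the commands so they can be reviewed first; the function name ssh_setup_commands is hypothetical, the ed25519 key type and empty passphrase are assumptions, and the host names las0, las1 and las2 are the ones used in the workers file later in this document:

```shell
# ssh_setup_commands USER HOST...
# Print the commands that set up key-based SSH login for USER on each HOST.
# Nothing is executed; review the output, then pipe it to `sh` to run it.
ssh_setup_commands() {
    user=$1
    shift
    # Create a key pair only if none exists yet (empty passphrase: assumption).
    echo "test -f ~/.ssh/id_ed25519 || ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519"
    for host in "$@"; do
        # ssh-copy-id asks for the password once per host.
        echo "ssh-copy-id ${user}@${host}"
    done
}

# Usage:
#   ssh_setup_commands ubuntu las0 las1 las2        # review
#   ssh_setup_commands ubuntu las0 las1 las2 | sh   # execute
```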

Deploy

Install the Hadoop package on each node:

$ install_java_bin hadoop hadoop-3.4.1.tar.gz /opt
$ sudo chown ubuntu:ubuntu /opt/hadoop

Set environment variables on each node:

$ echo "export HADOOP_HOME=\"/opt/hadoop\"" | sudo tee -a /etc/profile.d/hadoop.sh
$ echo "export HADOOP_CLASSPATH=\"\$(\${HADOOP_HOME}/bin/hadoop classpath)\"" | sudo tee -a /etc/profile.d/hadoop.sh

Configure

Edit file /opt/hadoop/etc/hadoop/hadoop-env.sh:

 # The java implementation to use. By default, this environment
 # variable is REQUIRED on ALL platforms except OS X!
 # export JAVA_HOME=
+export $(/usr/bin/env java -XshowSettings:properties -version 2>&1 | grep 'java.home' | sed -e 's/java.home/JAVA_HOME/;s/ //g;')
 
 # The language environment in which Hadoop runs. Use the English
 # environment to ensure that logs are printed as expected.
 #
 # For example, to limit who can execute the namenode command,
 # export HDFS_NAMENODE_USER=hdfs
+export HDFS_NAMENODE_USER=ubuntu
+export HDFS_DATANODE_USER=ubuntu
+export HDFS_SECONDARYNAMENODE_USER=ubuntu
+export YARN_RESOURCEMANAGER_USER=ubuntu
+export YARN_NODEMANAGER_USER=ubuntu
 
 
 ###

Edit file /opt/hadoop/etc/hadoop/core-site.xml:

 <!-- Put site-specific property overrides in this file. -->
 
 <configuration>
+    <property>
+        <name>fs.defaultFS</name>
+        <value>hdfs://las0:9000</value>
+    </property>
+    <property>
+        <name>hadoop.tmp.dir</name>
+        <value>/opt/tmp/hadoop</value>
+    </property>
+    <property>
+        <name>hadoop.proxyuser.root.hosts</name>
+        <value>*</value>
+    </property>
+    <property>
+        <name>hadoop.proxyuser.root.groups</name>
+        <value>*</value>
+    </property>
 </configuration>

Copy these files to the same path on all nodes.

Create the directory for ${hadoop.tmp.dir} on each node:

$ sudo mkdir -p /opt/tmp/hadoop
$ sudo chown ubuntu:ubuntu /opt/tmp/hadoop

Edit file /opt/hadoop/etc/hadoop/workers:

-localhost
+las0
+las1
+las2

Configure hdfs

Edit file /opt/hadoop/etc/hadoop/hdfs-site.xml:

 <!-- Put site-specific property overrides in this file. -->
 
 <configuration>
-
+    <property>
+        <name>dfs.replication</name>
+        <value>3</value>
+    </property>
+    <property>
+        <name>dfs.blocksize</name>
+        <value>262144</value>
+    </property>
+    <property>
+        <name>dfs.namenode.fs-limits.min-block-size</name>
+        <value>262144</value>
+    </property>
+    <property>
+        <name>dfs.namenode.handler.count</name>
+        <value>10</value>
+    </property>
 </configuration>

This file needs to be copied to the same path on all nodes.
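The dfs.blocksize and dfs.namenode.fs-limits.min-block-size values above are given in bytes: 262144 is 256 KiB, far below Hadoop's default block size of 128 MiB, and the minimum block size has to be lowered as well because its default is 1 MiB. A small sketch of the arithmetic follows; the helper blocks_for is hypothetical and only illustrates how many blocks a file of a given size occupies:

```shell
# 256 KiB expressed in bytes, as used in hdfs-site.xml.
block_size=$((256 * 1024))
echo "$block_size"    # prints 262144

# blocks_for SIZE_BYTES BLOCK_SIZE_BYTES
# Number of HDFS blocks a file occupies (ceiling division).
blocks_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

blocks_for 1048576 "$block_size"    # a 1 MiB file: prints 4
```

With dfs.replication set to 3, each of those blocks is stored on three DataNodes.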

Configure yarn

Edit file /opt/hadoop/etc/hadoop/yarn-site.xml:

 <configuration>
 
 <!-- Site specific YARN configuration properties -->
-
+    <property>
+        <name>yarn.resourcemanager.hostname</name>
+        <value>las0</value>
+    </property>
+    <property>
+        <name>yarn.nodemanager.aux-services</name>
+        <value>mapreduce_shuffle</value>
+    </property>
+    <property>
+        <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
+        <value>true</value>
+    </property>
+    <property>
+        <name>yarn.scheduler.maximum-allocation-vcores</name>
+        <value>8</value>
+    </property>
+    <property>
+        <name>yarn.resourcemanager.ha.enabled</name>
+        <value>false</value>
+    </property>
+    <property>
+        <name>yarn.webapp.ui2.enable</name>
+        <value>true</value>
+    </property>
 </configuration>

Edit file /opt/hadoop/etc/hadoop/mapred-site.xml:

 <!-- Put site-specific property overrides in this file. -->
 
 <configuration>
-
+    <property>
+        <name>mapreduce.framework.name</name>
+        <value>yarn</value>
+    </property>
+    <property>
+        <name>mapreduce.application.classpath</name>
+        <value>/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/mapreduce/lib/*</value>
+    </property>
 </configuration>

Copy these files to the same path on all nodes.
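Distributing the configuration can be scripted. The following is a minimal sketch, assuming password-free SSH is already set up, that the host names in the workers file match the output of `uname -n` on each node, and that the files edited above are the only ones to distribute; the function name sync_conf and the DRY_RUN variable are hypothetical:

```shell
# sync_conf CONF_DIR WORKERS_FILE
# Copy the Hadoop configuration files in CONF_DIR to the same path on
# every host listed in WORKERS_FILE. With DRY_RUN=1 the scp commands
# are printed instead of executed.
sync_conf() {
    conf_dir=$1
    workers=$2
    this_host=$(uname -n)
    # Ignore blank lines and comment lines in the workers file.
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$workers" |
    while read -r host; do
        # Never scp a file onto itself: that can truncate it.
        if [ "$host" = "$this_host" ]; then
            continue
        fi
        for f in hadoop-env.sh core-site.xml hdfs-site.xml \
                 yarn-site.xml mapred-site.xml workers; do
            # </dev/null keeps scp from consuming the host list on stdin.
            ${DRY_RUN:+echo} scp "$conf_dir/$f" "$host:$conf_dir/$f" </dev/null
        done
    done
}

# Usage:
#   DRY_RUN=1 sync_conf /opt/hadoop/etc/hadoop /opt/hadoop/etc/hadoop/workers
#   sync_conf /opt/hadoop/etc/hadoop /opt/hadoop/etc/hadoop/workers
```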

Run

Check the version:

$ hadoop version
Hadoop 3.4.1
Source code repository https://github.com/apache/hadoop.git -r 4d7825309348956336b8f06a08322b78422849b1
Compiled by mthakur on 2024-10-09T14:57Z
Compiled on platform linux-x86_64
Compiled with protoc 3.23.4
From source with checksum 7292fe9dba5e2e44e3a9f763fce3e680
This command was run using /opt/hadoop-3.4.1/share/hadoop/common/hadoop-common-3.4.1.jar

Initialize HDFS by formatting the NameNode:

$ hdfs namenode -format
⋮
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at las0/10.225.4.51
************************************************************/

Start hdfs:

$ start-dfs.sh
Starting namenodes on [las0]
Starting datanodes
Starting secondary namenodes [las0]

Start yarn:

$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers

Show the Java processes:

$ jps -lm
2509842 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
2510720 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
2509278 org.apache.hadoop.hdfs.server.namenode.NameNode
2511597 sun.tools.jps.Jps -lm
2509499 org.apache.hadoop.hdfs.server.datanode.DataNode
2510937 org.apache.hadoop.yarn.server.nodemanager.NodeManager
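On las0, which is both the master and a worker in this setup, all five daemons should appear. The check can be automated with a small filter over the jps output; this is a sketch, and the function name missing_daemons is hypothetical:

```shell
# missing_daemons
# Read `jps -l` (or `jps -lm`) output on stdin and print the name of
# every expected Hadoop daemon whose main class does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
        # Match the daemon as the last component of the fully qualified
        # class name, followed by a space (arguments) or end of line.
        if ! printf '%s\n' "$jps_output" | grep -Eq "\.${daemon}( |\$)"; then
            echo "$daemon"
        fi
    done
}

# Usage:
#   jps -lm | missing_daemons    # prints nothing if all daemons are up
```

On a node that is only a worker, NameNode, SecondaryNameNode and ResourceManager are expected to be reported as missing.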

Stop them:

$ stop-yarn.sh
Stopping nodemanagers
Stopping resourcemanager
$ stop-dfs.sh
Stopping namenodes on [las0]
Stopping datanodes
Stopping secondary namenodes [las0]

Usage

hdfs

$ hdfs dfs -ls /
$ hdfs dfs -mkdir -p /user/ubuntu
$ echo 'Hello world!' > file.dat
$ hdfs dfs -put file.dat
$ hdfs dfs -cat file.dat
Hello world!
$ hdfs dfs -rm file.dat
Deleted file.dat

The HDFS NameNode web UI is available at http://las0:9870.

Safe mode

Show current safe mode status:

$ hdfs dfsadmin -safemode get
Safe mode is OFF

Enter safe mode:

$ hdfs dfsadmin -safemode enter
Safe mode is ON

Leave safe mode:

$ hdfs dfsadmin -safemode leave
Safe mode is OFF

Note

A freshly started or restarted NameNode stays in safe mode temporarily and leaves it automatically once enough DataNodes have reported their blocks.
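Scripts that write to HDFS right after a start can block until safe mode is over by using the wait action of the same dfsadmin subcommand. A minimal sketch; the wrapper name wait_until_writable is hypothetical:

```shell
# wait_until_writable
# Block until the NameNode has left safe mode, then confirm the state.
# Returns 0 only if safe mode is reported as OFF afterwards.
wait_until_writable() {
    hdfs dfsadmin -safemode wait &&
        hdfs dfsadmin -safemode get | grep -q 'Safe mode is OFF'
}

# Usage:
#   start-dfs.sh && wait_until_writable && hdfs dfs -mkdir -p /user/ubuntu
```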

Clear all data

To clear the HDFS data, stop HDFS and run the following command on each node. Format the NameNode again (hdfs namenode -format) before starting HDFS afterwards:

$ rm -rf /opt/tmp/hadoop/dfs/*

yarn

The yarn web UI is available at http://las0:8088 or http://las0:8088/ui2/.
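The same port also serves the ResourceManager REST API, which is convenient for a quick check from the command line. The sketch below extracts the number of active NodeManagers from the cluster metrics response; the function name active_nodes is hypothetical, and the sed-based parsing assumes the JSON field appears as "activeNodes":<number>:

```shell
# active_nodes
# Read the JSON returned by /ws/v1/cluster/metrics on stdin and print
# the value of the "activeNodes" field.
active_nodes() {
    sed -n 's/.*"activeNodes"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Usage:
#   curl -s http://las0:8088/ws/v1/cluster/metrics | active_nodes
```

With the three workers configured above, the expected value is 3 once all NodeManagers have registered.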