
Lab 2 | Installing Spark and Hadoop

yiyun


I. Objectives

(1) Learn how to install Hadoop and Spark in a Linux virtual machine;

(2) Become familiar with the basic use of HDFS;

(3) Learn how to access local files and HDFS files from Spark.

II. Experiment Platform

Operating system: Ubuntu 20.04;

Spark version: 3.0.2;

Hadoop version: 2.7.3;

Python version: 3.4.3.

III. Experiment Content and Requirements

1. Install Hadoop and Spark

Enter the Linux system and, following the "Installing and Using Hadoop" guide in the "Lab Guide" section of the course's official website, install Hadoop in pseudo-distributed mode. After Hadoop is installed, install Spark (Local mode).

References:

Hadoop Installation Tutorial: Standalone/Pseudo-Distributed Configuration, Hadoop 2.6.0 (2.7.1) / Ubuntu 14.04 (16.04) (Xiamen University Database Lab blog)
Getting Started with Spark 2.1.0: Installing and Using Spark (Xiamen University Database Lab blog)

1.1 Preparation

Dockerfile

FROM ubuntu:20.04 AS base

LABEL maintainer="yiyun <yiyungent@gmail.com>"

# Use the Aliyun apt mirror (mainland China)
COPY etc/apt/aliyun-ubuntu-20.04-focal-sources.list   /etc/apt/sources.list

# Set the time zone
ENV TZ=Asia/Shanghai

RUN apt-get update

# 1. Install common tools
RUN apt-get install -y wget
RUN apt-get install -y ssh
RUN apt-get install -y vim

1.2 Install Java

Dockerfile

# 2. Install Java
ADD jdk-8u131-linux-x64.tar.gz /opt/
RUN mv /opt/jdk1.8.0_131 /opt/jdk1.8
ENV JAVA_HOME=/opt/jdk1.8
ENV JRE_HOME=${JAVA_HOME}/jre
ENV CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
ENV PATH=${JAVA_HOME}/bin:$PATH

1.3 Install Hadoop

Dockerfile

# 3. Install Hadoop
ADD hadoop-2.7.3.tar.gz /opt/
RUN mv /opt/hadoop-2.7.3 /opt/hadoop
ENV HADOOP_HOME=/opt/hadoop
ENV HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
ENV PATH=$PATH:${HADOOP_HOME}/sbin:${HADOOP_HOME}/bin

1.3 Hadoop Pseudo-Distributed Configuration

Hadoop can run on a single node in pseudo-distributed mode: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS.

In this setup, Hadoop's configuration files are located in /opt/hadoop/etc/hadoop/.

Pseudo-distributed mode requires modifying two configuration files: core-site.xml and hdfs-site.xml.

Hadoop configuration files are in XML format; each setting is declared as a property with a name and a value.

1.3.1 Configuration Files

Next, prepare the core-site.xml file with the following content:

core-site.xml

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

The hdfs-site.xml file has the following content:

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

Notes on the Hadoop configuration files

Hadoop's run mode is determined by its configuration files (they are read whenever Hadoop runs), so to switch from pseudo-distributed mode back to standalone (non-distributed) mode you need to remove the configuration items from core-site.xml.

Although pseudo-distributed mode will run with only fs.defaultFS and dfs.replication configured (as in the official tutorial), if hadoop.tmp.dir is not set the default temporary directory /tmp/hadoop-hadoop is used, and that directory may be wiped by the system on reboot, forcing you to format the NameNode again. We therefore set it, and also specify dfs.namenode.name.dir and dfs.datanode.data.dir; otherwise errors may occur in the following steps.

Note:

core-site.xml and hdfs-site.xml are both placed in the same directory as the Dockerfile.

Copy the configuration files into the image:

Dockerfile

# 3.1 Hadoop pseudo-distributed configuration
COPY core-site.xml /opt/hadoop/etc/hadoop/core-site.xml
COPY hdfs-site.xml /opt/hadoop/etc/hadoop/hdfs-site.xml

1.3.2 Format the NameNode

After the configuration is complete, format the NameNode:

Dockerfile

# 3.2 After configuration, format the NameNode
RUN /opt/hadoop/bin/hdfs namenode -format

1.3.3 Configure Passwordless SSH Login

Dockerfile

# 3.3 Configure passwordless ssh login
RUN /etc/init.d/ssh start
RUN ssh-keygen -f $HOME/.ssh/id_rsa -t rsa -N ''
RUN cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

1.3.4 Java Environment Variable

hadoop-env.sh

/opt/hadoop/etc/hadoop/hadoop-env.sh

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#      Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
# export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/opt/jdk1.8

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

Dockerfile

# 3.4 Java environment variable for Hadoop
COPY hadoop-env.sh /opt/hadoop/etc/hadoop/hadoop-env.sh

1.3.5 Start Hadoop

Build the image:

docker build -t spark-python-2 .

If the docker build output reports that the NameNode storage directory was successfully formatted, the format step succeeded.

Start a container from the image, mapping host ports 50070 and 4040 to the same ports inside the container, and open a bash terminal in the container:

docker run -it --name spark-python-2-container -p 50070:50070 -p 4040:4040 spark-python-2 bash

Note:

50070 is the Hadoop WebUI port,

4040 is the Spark WebUI port.

Start the ssh service.

Then start the NameNode and DataNode daemons; a minimal command sketch follows.

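A minimal sketch of this step, run inside the container (start-dfs.sh ships with Hadoop and is on PATH via the Dockerfile above; the exact jps output may vary):

/etc/init.d/ssh start
start-dfs.sh   # starts the NameNode, DataNode and SecondaryNameNode daemons
jps            # should list NameNode, DataNode, SecondaryNameNode and Jps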

Additional notes:

To stop Hadoop, run stop-dfs.sh.

To access the Hadoop web UI, open http://localhost:50070 in a browser on the host.

1.4 Install Spark

Since we have already installed Hadoop ourselves, we choose spark-3.0.2-bin-without-hadoop.tgz here.

Spark has four main deployment modes:

Local mode (single machine)

Standalone mode (using Spark's built-in simple cluster manager)

YARN mode (using YARN as the cluster manager)

Mesos mode (using Mesos as the cluster manager)

This section covers installing Spark in Local (single-machine) mode.

Dockerfile

# 4. Install Spark
RUN mkdir /opt/spark
RUN wget <download URL of spark-3.0.2-bin-without-hadoop.tgz>
RUN tar -zxvf spark-3.0.2-bin-without-hadoop.tgz -C /opt/spark/
ENV SPARK_HOME=/opt/spark/spark-3.0.2-bin-without-hadoop
ENV PATH=${SPARK_HOME}/bin:$PATH
COPY spark-env.sh /opt/spark/conf/spark-env.sh

spark-env.sh

#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Options for launcher
# - SPARK_LAUNCHER_OPTS, to set config properties and Java options for the launcher (e.g. "-Dx=y")

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.

# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1        Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1   Disable multi-threading of OpenBLAS

About spark-env.sh:

With the SPARK_DIST_CLASSPATH setting above, Spark can store data in the Hadoop distributed file system (HDFS) and read data from HDFS.

Without it, Spark can only read and write local data and cannot access HDFS. Once this configuration is in place, Spark can be used directly; unlike Hadoop, there is no startup command to run. To verify that Spark was installed successfully, run one of the bundled examples.

1.4.1 Run an Example

$SPARK_HOME/bin/run-example SparkPi 2>&1 | grep "Pi is"

If Spark is installed correctly, this prints a line of the form "Pi is roughly 3.14...".

1.5 Install Python

Python is installed so that pyspark can be run.

Dockerfile

RUN apt-get install -y python

2. Common HDFS Operations

Log in to the Linux system as the hadoop user and start Hadoop. Referring to Hadoop books or online material, or to the "Common HDFS Shell Commands" guide in the "Lab Guide" section of the course's official website, use the shell commands provided by Hadoop to complete the following operations (a command sketch follows the list):

(1) Start Hadoop and create the user directory "/user/hadoop" in HDFS;

(2) Create a text file test.txt in the "/home/hadoop" directory of the local Linux file system, type some content into it, and then upload it to the "/user/hadoop" directory in HDFS;

(3) Download the test.txt file from the "/user/hadoop" directory in HDFS to the local Linux directory "/home/hadoop/下载" (Downloads);

(4) Print the contents of the test.txt file in the "/user/hadoop" directory of HDFS to the terminal;

(5) Create a subdirectory input under "/user/hadoop" in HDFS, and copy the test.txt file from "/user/hadoop" into "/user/hadoop/input";

(6) Delete the test.txt file from the "/user/hadoop" directory in HDFS, and delete the input subdirectory of "/user/hadoop" together with all of its contents.
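One possible command sequence for these six tasks (a sketch, assuming the HDFS daemons are running and Hadoop's bin directory is on PATH as set in the Dockerfile above; the sample file content is arbitrary):

# (1) create the user directory in HDFS
hdfs dfs -mkdir -p /user/hadoop
# (2) create a local test.txt and upload it
mkdir -p /home/hadoop /home/hadoop/下载
echo "hello hadoop" > /home/hadoop/test.txt
hdfs dfs -put /home/hadoop/test.txt /user/hadoop
# (3) download it again into the local "下载" (Downloads) directory
hdfs dfs -get /user/hadoop/test.txt /home/hadoop/下载/
# (4) print its contents to the terminal
hdfs dfs -cat /user/hadoop/test.txt
# (5) create /user/hadoop/input and copy test.txt into it
hdfs dfs -mkdir /user/hadoop/input
hdfs dfs -cp /user/hadoop/test.txt /user/hadoop/input
# (6) delete the file, then the input directory recursively
hdfs dfs -rm /user/hadoop/test.txt
hdfs dfs -rm -r /user/hadoop/input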

3. Reading File System Data with Spark

(1) In pyspark, read the local Linux file "/home/hadoop/test.txt" and count its number of lines;

(2) In pyspark, read the HDFS file "/user/hadoop/test.txt" (create it first if it does not exist) and count its number of lines;

(3) Write a standalone application that reads the HDFS file "/user/hadoop/test.txt" (create it first if it does not exist) and counts its number of lines; submit the program to Spark with spark-submit. (A sketch covering these three tasks is given after this list.)
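A minimal sketch of one way to complete these tasks, assuming test.txt already exists locally and in HDFS as in section 2; the file name line_count.py and the final grep filter are illustrative choices only:

# (1)/(2) interactively, inside pyspark, using the sc that the shell creates:
#   sc.textFile("file:///home/hadoop/test.txt").count()
#   sc.textFile("hdfs://localhost:9000/user/hadoop/test.txt").count()

# (3) a standalone application submitted with spark-submit
cat > /home/hadoop/line_count.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext(appName="LineCount")
lines = sc.textFile("hdfs://localhost:9000/user/hadoop/test.txt")
print("line count: %d" % lines.count())
sc.stop()
EOF

${SPARK_HOME}/bin/spark-submit /home/hadoop/line_count.py 2>&1 | grep "line count"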

IV. Lab Report

Lab report for "Spark Programming Basics"

Title:

Name:

Date:

Experiment environment:

Experiment content and completion status:

Problems encountered:

Solutions (list the problems encountered and how they were solved, as well as any unresolved problems):

# Supplement

The Python Spark Shell

Enter the Python Spark shell:

${SPARK_HOME}/bin/pyspark


Note:

With PATH configured, you can also simply run pyspark.


sc is the SparkContext created by default in the PySpark shell.

Press Ctrl+D to exit the pyspark shell.

Use the following command to run a Python Spark shell script (script.py):

${SPARK_HOME}/bin/pyspark script.py

This runs script.py in the Python Spark shell. (Note: in Spark 2.0 and later this form may be rejected with a message telling you to use spark-submit instead; if so, run the script with ${SPARK_HOME}/bin/spark-submit script.py.)

Note:

When running a .py script through pyspark, do not create a SparkContext again; use the sc that the pyspark shell creates by default, otherwise you will get this error:

ValueError: Cannot run multiple SparkContexts at once

Running a pyspark Example

TODO: run a pyspark example
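One option for this TODO (a sketch, not verified here): the Spark binary distribution ships Python examples under examples/src/main/python, which can be submitted with spark-submit, for example the pi estimator:

${SPARK_HOME}/bin/spark-submit ${SPARK_HOME}/examples/src/main/python/pi.py 10 2>&1 | grep "Pi is"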

pyspark Error


root@a8878d819a6a:/# ${SPARK_HOME}/bin/pyspark
env: 'python': No such file or directory

Fix:

Method 1:

apt-get install -y python

Method 2:

Reference:

Running Spark reports env: 'python': No such file or directory - trp - 博客园 (cnblogs)

Note:

Method 2 has not been tested;

the Dockerfile in this article uses apt-get install -y python to solve the problem.
