
Lab 2 | Installing Spark and Hadoop

yiyun


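I. Objectives

(1) Learn how to install Hadoop and Spark in a Linux virtual machine;

(2) Become familiar with basic HDFS usage;

(3) Learn how to access local files and HDFS files from Spark.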

II. Platform

Operating system: Ubuntu 20.04;

Spark version: 3.0.2;

Hadoop version: 2.7.3;

Python version: 3.4.3.

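III. Lab Content and Requirements

1. Installing Hadoop and Spark

Log in to Linux and follow "Installing and Using Hadoop" in the "Lab Guide" section of the course website to install Hadoop in pseudo-distributed mode. Once Hadoop is installed, install Spark (Local mode).

References:

Hadoop Installation Tutorial: Standalone/Pseudo-Distributed Configuration, Hadoop 2.6.0 (2.7.1) / Ubuntu 14.04 (16.04) (Xiamen University Database Lab blog)

Getting Started with Spark 2.1.0: Installing and Using Spark (Xiamen University Database Lab blog)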

1.1 Preparation

Dockerfile

FROM ubuntu:20.04 AS base

LABEL maintainer="yiyun <yiyungent@gmail.com>"

# Use the Aliyun mirror (in China) as the apt source
COPY etc/apt/aliyun-ubuntu-20.04-focal-sources.list   /etc/apt/sources.list

# Set the time zone
ENV TZ=Asia/Shanghai

RUN apt-get update

# 1. Install common tools
RUN apt-get install -y wget
RUN apt-get install -y ssh
RUN apt-get install -y vim

1.2 Installing Java

Dockerfile

# 2. Install Java
ADD jdk-8u131-linux-x64.tar.gz /opt/
RUN mv /opt/jdk1.8.0_131 /opt/jdk1.8
ENV JAVA_HOME=/opt/jdk1.8
ENV JRE_HOME=${JAVA_HOME}/jre
ENV CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
ENV PATH=${JAVA_HOME}/bin:$PATH

1.3 Installing Hadoop

Dockerfile

# 3. Install Hadoop
ADD hadoop-2.7.3.tar.gz /opt/
RUN mv /opt/hadoop-2.7.3 /opt/hadoop
ENV HADOOP_HOME=/opt/hadoop
ENV HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
ENV PATH=$PATH:${HADOOP_HOME}/sbin:${HADOOP_HOME}/bin

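1.3 Hadoop Pseudo-Distributed Configuration

Hadoop can run on a single node in pseudo-distributed mode: each Hadoop daemon runs as a separate Java process, the node acts as both NameNode and DataNode, and files are read from HDFS.

Hadoop's configuration files live in /opt/hadoop/etc/hadoop/ in this setup (the referenced tutorial installs Hadoop under /usr/local/hadoop instead).

Pseudo-distributed mode requires changes to two of them: core-site.xml and hdfs-site.xml.

The configuration files are XML; each setting is declared as a property with a name and a value.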
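As a rough illustration of the name/value structure, the sketch below parses a Hadoop-style configuration with Python's standard xml.etree.ElementTree module. The helper name read_properties and the sample string are assumptions made for this example; Hadoop itself reads these files through its own Configuration class.

```python
import xml.etree.ElementTree as ET

# Sample in the same shape as core-site.xml (values taken from this lab).
SAMPLE = """
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/tmp</value>
    </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


if __name__ == "__main__":
    for key, val in read_properties(SAMPLE).items():
        print(key, "=", val)
```

To inspect a real file, read its text first and pass it to read_properties in the same way.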

1.3.1 Configuration Files

Next, prepare core-site.xml with the following content:

core-site.xml

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Then prepare hdfs-site.xml with the following content:

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

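Notes on the Hadoop configuration files

How Hadoop runs is determined by its configuration files, which are read when Hadoop starts. To switch from pseudo-distributed mode back to non-distributed (standalone) mode, remove the configuration entries from core-site.xml.

Pseudo-distributed mode can run with only fs.defaultFS and dfs.replication configured (this is what the official tutorial does). However, if hadoop.tmp.dir is not set, the default temporary directory is /tmp/hadoop-hadoop (that is, /tmp/hadoop-${user.name}), and this directory may be cleaned up by the system on reboot, which forces you to run the format step again.

That is why hadoop.tmp.dir is set here, together with dfs.namenode.name.dir and dfs.datanode.data.dir; otherwise the following steps may fail.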
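To make the dependency between these settings concrete, here is a small Python sketch of how the effective storage directories are derived. The default values are the ones listed in Hadoop's core-default.xml and hdfs-default.xml; the resolve helper and its simple ${...} substitution are a simplified stand-in written for this example, not Hadoop's own code.

```python
import re

# Defaults as documented in core-default.xml / hdfs-default.xml.
DEFAULTS = {
    "hadoop.tmp.dir": "/tmp/hadoop-${user.name}",
    "dfs.namenode.name.dir": "file://${hadoop.tmp.dir}/dfs/name",
    "dfs.datanode.data.dir": "file://${hadoop.tmp.dir}/dfs/data",
}


def resolve(key, site, user="hadoop"):
    """Expand ${...} references, letting site settings override the defaults."""
    merged = {**DEFAULTS, "user.name": user, **site}
    value = merged[key]
    pattern = re.compile(r"\$\{([^}]+)\}")
    # Substitute repeatedly until no reference is left (bounded for safety).
    for _ in range(20):
        match = pattern.search(value)
        if match is None:
            break
        value = value.replace(match.group(0), merged[match.group(1)])
    return value


if __name__ == "__main__":
    # Nothing configured: everything ends up under /tmp.
    print(resolve("dfs.namenode.name.dir", {}))
    # Explicit setting, as in this lab's hdfs-site.xml.
    site = {"dfs.namenode.name.dir": "file:/opt/hadoop/tmp/dfs/name"}
    print(resolve("dfs.namenode.name.dir", site))
```

With an empty site configuration both the NameNode and DataNode directories fall under /tmp, which is exactly the situation the explicit settings avoid.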

Note:

core-site.xml and hdfs-site.xml are both placed in the same directory as the Dockerfile.

Copying the configuration files:

Dockerfile

# 3.1 Hadoop pseudo-distributed configuration
COPY core-site.xml /opt/hadoop/etc/hadoop/core-site.xml
COPY hdfs-site.xml /opt/hadoop/etc/hadoop/hdfs-site.xml

1.3.2 Formatting the NameNode

Once the configuration is in place, format the NameNode:

Dockerfile

# 3.2 After configuration, format the NameNode
RUN /opt/hadoop/bin/hdfs namenode -format

1.3.3 Configuring Passwordless ssh Login

Dockerfile

# 3.3 Configure passwordless ssh login
RUN /etc/init.d/ssh start
RUN ssh-keygen -f $HOME/.ssh/id_rsa -t rsa -N ''
RUN cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

1.3.4 Java Environment Variable

hadoop-env.sh

/opt/hadoop/etc/hadoop/hadoop-env.sh

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
# export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/opt/jdk1.8

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

Dockerfile

# 3.4 Java environment variable
COPY hadoop-env.sh /opt/hadoop/etc/hadoop/hadoop-env.sh

1.3.5 Starting Hadoop

Build the image:

docker build -t spark-python-2 .

image-20210227145155969

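Output like the screenshot above during the build means the NameNode was formatted successfully.

Start a container from the image, map host port 50070 to container port 50070 (and 4040 to 4040), and open a bash shell inside the container:

docker run -it --name spark-python-2-container -p 50070:50070 -p 4040:4040 spark-python-2 bash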

Note:

50070 is the Hadoop web UI port;

4040 is the Spark web UI port.

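Start ssh inside the container (for example with /etc/init.d/ssh start, the same command the Dockerfile uses).

Then start the NameNode and DataNode daemons (Hadoop's start-dfs.sh script, found in ${HADOOP_HOME}/sbin and already on PATH, does this).

Supplement:

To stop Hadoop, stop-dfs.sh is the counterpart of start-dfs.sh.

To access the Hadoop web UI, open port 50070 on the host in a browser (http://localhost:50070 with the port mapping above).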
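Once the daemons are up, a quick way to check from the host that the published ports respond is a plain TCP connect. The sketch below uses only Python's standard socket module; the function name port_open is made up for this example, and the Spark UI on port 4040 only answers while a Spark application (such as a pyspark shell) is running.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 50070: Hadoop web UI, 4040: Spark web UI (ports published by docker run).
    for name, port in (("Hadoop web UI", 50070), ("Spark web UI", 4040)):
        state = "reachable" if port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```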

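1.4 Installing Spark

Since Hadoop is already installed separately, choose spark-3.0.2-bin-without-hadoop.tgz here.

Spark has four main deployment modes:

Local mode (single machine)

Standalone mode (uses Spark's own simple cluster manager)

YARN mode (uses YARN as the cluster manager)

Mesos mode (uses Mesos as the cluster manager)

This section covers installing Spark in Local mode.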

Dockerfile

# 4. Install Spark
RUN mkdir /opt/spark
RUN wget <URL of spark-3.0.2-bin-without-hadoop.tgz>
RUN tar -zxvf spark-3.0.2-bin-without-hadoop.tgz -C /opt/spark/
ENV SPARK_HOME=/opt/spark/spark-3.0.2-bin-without-hadoop
ENV PATH=${SPARK_HOME}/bin:$PATH
# Spark reads spark-env.sh from ${SPARK_HOME}/conf
COPY spark-env.sh ${SPARK_HOME}/conf/spark-env.sh

spark-env.sh

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Options for launcher
# - SPARK_LAUNCHER_OPTS, to set config properties and Java options for the launcher (e.g. "-Dx=y")

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.
# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1        Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1   Disable multi-threading of OpenBLAS

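About spark-env.sh:

The only setting added to the template is the SPARK_DIST_CLASSPATH line, which puts the Hadoop classpath on Spark's classpath. With it in place, Spark can store data in HDFS and read data back from HDFS.

Without it, Spark can only read and write local data, not HDFS data. After this configuration Spark is ready to use; unlike Hadoop, there is no start command to run. Run one of the examples bundled with Spark to verify the installation.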

1.4.1 Running an Example

$SPARK_HOME/bin/run-example SparkPi 2>&1 | grep "Pi is"

1.5 Installing Python

Python is installed so that pyspark can run.

Dockerfile

RUN apt-get install -y python

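2. Common HDFS Operations

Log in to Linux as the hadoop user and start Hadoop. Referring to Hadoop books or online material (or to "Common HDFS Shell Commands" in the "Lab Guide" section of the course website), use the shell commands that Hadoop provides to do the following:

(1) Start Hadoop and create the user directory "/user/hadoop" in HDFS;

(2) Create a text file test.txt in "/home/hadoop" on the local Linux file system, type some arbitrary content into it, and upload it to "/user/hadoop" in HDFS;

(3) Download test.txt from "/user/hadoop" in HDFS to "/home/hadoop/下载" (the Downloads directory) on the local Linux file system;

(4) Print the content of test.txt in "/user/hadoop" in HDFS to the terminal;

(5) Create a subdirectory input under "/user/hadoop" in HDFS and copy test.txt from "/user/hadoop" into "/user/hadoop/input";

(6) Delete test.txt from "/user/hadoop" in HDFS, then delete the input subdirectory under "/user/hadoop" together with everything inside it.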
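One possible way to work through the six tasks is sketched below in Python. It only assembles the corresponding hdfs dfs command lines (all standard FileSystem shell subcommands) and executes them when asked to and when the hdfs executable is found on PATH. The helper names are made up for this example and the paths come from the task list, so adjust them to your environment.

```python
import shutil
import subprocess

HDFS_HOME = "/user/hadoop"      # HDFS user directory from task (1)
LOCAL_HOME = "/home/hadoop"     # local home directory from task (2)


def build_commands():
    """Return (description, argv) pairs covering tasks (1) to (6)."""
    dfs = ["hdfs", "dfs"]
    return [
        ("(1) create the user directory",
         dfs + ["-mkdir", "-p", HDFS_HOME]),
        ("(2) upload test.txt",
         dfs + ["-put", f"{LOCAL_HOME}/test.txt", HDFS_HOME]),
        ("(3) download test.txt",
         dfs + ["-get", f"{HDFS_HOME}/test.txt", f"{LOCAL_HOME}/下载"]),
        ("(4) print test.txt",
         dfs + ["-cat", f"{HDFS_HOME}/test.txt"]),
        ("(5) create the input subdirectory",
         dfs + ["-mkdir", "-p", f"{HDFS_HOME}/input"]),
        ("(5) copy test.txt into input",
         dfs + ["-cp", f"{HDFS_HOME}/test.txt", f"{HDFS_HOME}/input"]),
        ("(6) delete test.txt",
         dfs + ["-rm", f"{HDFS_HOME}/test.txt"]),
        ("(6) delete input recursively",
         dfs + ["-rm", "-r", f"{HDFS_HOME}/input"]),
    ]


def run_all(dry_run=True):
    """Print each command; execute it only if dry_run is False and hdfs exists."""
    have_hdfs = shutil.which("hdfs") is not None
    for description, argv in build_commands():
        print(description + ": " + " ".join(argv))
        if not dry_run and have_hdfs:
            subprocess.run(argv, check=True)
```

Calling run_all() prints the commands without touching HDFS; run_all(dry_run=False) executes them in order, which assumes the local test.txt and the local download directory already exist.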

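3. Reading File System Data with Spark

(1) In pyspark, read the local Linux file "/home/hadoop/test.txt" and count its lines;

(2) In pyspark, read the HDFS file "/user/hadoop/test.txt" (create it first if it does not exist) and count its lines;

(3) Write a standalone application that reads the HDFS file "/user/hadoop/test.txt" (create it first if it does not exist) and counts its lines; submit the program to Spark with spark-submit.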
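A minimal PySpark sketch for the three tasks might look like this. It assumes the paths from the task list and the fs.defaultFS value hdfs://localhost:9000 configured earlier; count_lines and main are helper names invented here, and they only wrap the standard textFile and count calls.

```python
LOCAL_FILE = "file:///home/hadoop/test.txt"                 # task (1)
HDFS_FILE = "hdfs://localhost:9000/user/hadoop/test.txt"    # tasks (2) and (3)


def count_lines(sc, path):
    """Count the lines of a text file through the given SparkContext."""
    return sc.textFile(path).count()


def main():
    """Entry point for the standalone application of task (3)."""
    # Imported here so count_lines can be reused where sc already exists.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("CountLines").setMaster("local")
    sc = SparkContext(conf=conf)
    try:
        print("Lines in %s: %d" % (HDFS_FILE, count_lines(sc, HDFS_FILE)))
    finally:
        sc.stop()
```

In the pyspark shell, where sc already exists, tasks (1) and (2) reduce to count_lines(sc, LOCAL_FILE) and count_lines(sc, HDFS_FILE). For task (3), save the code to a file (the name is up to you), add a final line that calls main(), and pass that file to spark-submit.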

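IV. Lab Report

"Spark Programming Fundamentals" Lab Report

Title:

Name:

Date:

Environment:

Lab content and completion status:

Problems encountered:

Solutions (list the problems you ran into and how you solved them, and list any problems that remain unsolved):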

# Supplement

Python Spark Shell

Enter the Python Spark shell:

${SPARK_HOME}/bin/pyspark

Note:

If PATH is configured, you can also just run pyspark.

sc is the SparkContext that the PySpark shell creates by default.

Press Ctrl+D to exit the pyspark shell.

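To run a Python Spark script (script.py), submit it with spark-submit; since Spark 2.0, bin/pyspark no longer accepts a script file as an argument:

${SPARK_HOME}/bin/spark-submit script.py

Code from script.py can also be run inside the Python Spark shell, for example by pasting it at the prompt.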

Note:

Code run inside the pyspark shell must not create another SparkContext. Use sc directly, since the shell has already created it; otherwise the code fails with:

ValueError: Cannot run multiple SparkContexts at once
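If the same code has to work both inside the pyspark shell and under spark-submit, one common pattern is SparkContext.getOrCreate(), which returns the running context when there is one and creates a new one otherwise. The sketch below is an illustration rather than part of the original lab, and the function name get_context is hypothetical.

```python
def get_context(app_name="lab2"):
    """Return the active SparkContext, creating one only if none exists."""
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name)
    # getOrCreate reuses the shell's existing context instead of raising
    # "ValueError: Cannot run multiple SparkContexts at once".
    return SparkContext.getOrCreate(conf)
```

When a context already exists, the configuration passed here is not applied to it, so the application name only takes effect for a newly created context.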

Running a pyspark example

TODO: run a pyspark example

pyspark error

root@a8878d819a6a:/# ${SPARK_HOME}/bin/pyspark
env: 'python': No such file or directory

Fix:

Option 1:

apt-get install -y python

Option 2:

Reference:

"env: 'python': No such file or directory" when running Spark - trp - cnblogs (博客园)

Note:

This option has not been tested; the Dockerfile uses apt-get install -y python to solve the problem.
