I. Purpose of the Experiment
(1) Learn how to install Hadoop and Spark in a Linux virtual machine;
(2) Become familiar with the basic use of HDFS;
(3) Learn how to access local files and HDFS files from Spark.
II. Experiment Platform
Operating system: Ubuntu 20.04;
Spark version: 3.0.2;
Hadoop version: 2.7.3;
Python version: 3.4.3.
III. Experiment Content and Requirements
1. Install Hadoop and Spark
In the Linux system, follow "Installing and Using Hadoop" under the "Lab Guide" section of the course website to install Hadoop in pseudo-distributed mode. After Hadoop is installed, install Spark in Local mode.
References:
Hadoop installation tutorial: standalone/pseudo-distributed configuration, Hadoop 2.6.0 (2.7.1) on Ubuntu 14.04 (16.04) – Xiamen University Database Lab blog
Getting started with Spark 2.1.0: installing and using Spark – Xiamen University Database Lab blog
1.1 Preparation
Dockerfile
FROM ubuntu:20.04 AS base

LABEL maintainer="yiyun <yiyungent@gmail.com>"

# Use the Aliyun mirror for apt sources
COPY etc/apt/aliyun-ubuntu-20.04-focal-sources.list /etc/apt/sources.list

# Set the time zone
ENV TZ=Asia/Shanghai

RUN apt-get update

# 1. Install common tools
RUN apt-get install -y wget
RUN apt-get install -y ssh
RUN apt-get install -y vim
1.2 Install Java
Dockerfile
# 2. Install Java
ADD jdk-8u131-linux-x64.tar.gz /opt/
RUN mv /opt/jdk1.8.0_131 /opt/jdk1.8
ENV JAVA_HOME=/opt/jdk1.8
ENV JRE_HOME=${JAVA_HOME}/jre
ENV CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
ENV PATH=${JAVA_HOME}/bin:$PATH
1.3 Install Hadoop
Dockerfile
# 3. Install Hadoop
ADD hadoop-2.7.3.tar.gz /opt/
RUN mv /opt/hadoop-2.7.3 /opt/hadoop
ENV HADOOP_HOME=/opt/hadoop
ENV HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
ENV PATH=$PATH:${HADOOP_HOME}/sbin:${HADOOP_HOME}/bin
1.4 Hadoop pseudo-distributed configuration
Hadoop can run on a single node in pseudo-distributed mode: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS.
In this image, Hadoop's configuration files are located in /opt/hadoop/etc/hadoop/.
Pseudo-distributed mode requires modifying two configuration files: core-site.xml and hdfs-site.xml.
Hadoop configuration files are XML; each setting is declared as a property with a name and a value.
1.4.1 Configuration files
Next, prepare core-site.xml with the following content:
core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Then prepare hdfs-site.xml with the following content:
hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
Notes on the Hadoop configuration files
Hadoop's run mode is determined by its configuration files (they are read whenever Hadoop runs), so to switch from pseudo-distributed mode back to standalone mode you need to remove the properties added to core-site.xml.
Strictly speaking, pseudo-distributed mode only requires fs.defaultFS and dfs.replication (as in the official tutorial). However, if hadoop.tmp.dir is not set, the default temporary directory /tmp/hadoop-${user.name} is used, and that directory may be cleaned out on reboot, forcing another NameNode format. We therefore set it explicitly, and also specify dfs.namenode.name.dir and dfs.datanode.data.dir to avoid errors in the following steps.
Note: both core-site.xml and hdfs-site.xml are placed in the same directory as the Dockerfile.
Copy the configuration files into the image:
Dockerfile
# 3.1 Hadoop pseudo-distributed configuration
COPY core-site.xml /opt/hadoop/etc/hadoop/core-site.xml
COPY hdfs-site.xml /opt/hadoop/etc/hadoop/hdfs-site.xml
1.4.2 Format the NameNode
After the configuration is in place, format the NameNode:
Dockerfile
# 3.2 Format the NameNode once configuration is complete
RUN /opt/hadoop/bin/hdfs namenode -format
1.4.3 Configure passwordless SSH login
Dockerfile
# 3.3 Configure passwordless SSH login
RUN /etc/init.d/ssh start
RUN ssh-keygen -f $HOME/.ssh/id_rsa -t rsa -N ''
RUN cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
1.4.4 Java environment variable
hadoop-env.sh
/opt/hadoop/etc/hadoop/hadoop-env.sh
This is the stock hadoop-env.sh shipped with Hadoop 2.7.3; the only modification is hard-coding JAVA_HOME so the Hadoop scripts can find Java regardless of the caller's environment:

# The java implementation to use.
# export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/opt/jdk1.8

# (all remaining settings are left at the defaults shipped with Hadoop 2.7.3)
Dockerfile
# 3.4 Java environment variable for Hadoop
COPY hadoop-env.sh /opt/hadoop/etc/hadoop/hadoop-env.sh
1.4.5 Start Hadoop
Build the image:
docker build -t spark-python-2 .
[screenshot: docker build output showing the NameNode being formatted]
Output like the screenshot above during the build indicates that the NameNode was formatted successfully.
Start a container from the image, mapping host port 50070 to container port 50070 (and 4040 to 4040), and open a bash shell inside the container:
docker run -it --name spark-python-2-container -p 50070:50070 -p 4040:4040 spark-python-2 bash
Note:
50070 is the Hadoop web UI port;
4040 is the Spark web UI port.
Inside the container, start the SSH service, then start the NameNode and DataNode daemons (a typical command sequence is sketched after the note below).
[screenshots: starting sshd and the HDFS daemons]
Additional note: stop Hadoop again when you are finished.
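The original post shows these steps only as screenshots. With the paths baked into this image (HADOOP_HOME=/opt/hadoop, whose sbin directory is on PATH), a typical sequence would be:

/etc/init.d/ssh start   # sshd must be running so start-dfs.sh can ssh into localhost
start-dfs.sh            # start the NameNode, DataNode and SecondaryNameNode daemons
jps                     # verify that the three daemons are listed
stop-dfs.sh             # stop the HDFS daemons when finished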
Access the HDFS web UI from the host at http://localhost:50070.
[screenshot: HDFS NameNode web UI]
1.5 Install Spark
Since we have already installed Hadoop ourselves, we use the spark-3.0.2-bin-without-hadoop.tgz package.
Spark has four main deployment modes:
Local mode (single machine)
Standalone mode (using Spark's own simple cluster manager)
YARN mode (using YARN as the cluster manager)
Mesos mode (using Mesos as the cluster manager)
This section covers installing Spark in Local (single-machine) mode.
Dockerfile
# 4. Install Spark
RUN mkdir /opt/spark
# Download spark-3.0.2-bin-without-hadoop.tgz (the download URL was lost from the original post), then unpack it
RUN wget <spark-3.0.2-bin-without-hadoop.tgz download URL> \
    && tar -zxvf spark-3.0.2-bin-without-hadoop.tgz -C /opt/spark/
ENV SPARK_HOME=/opt/spark/spark-3.0.2-bin-without-hadoop
ENV PATH=${SPARK_HOME}/bin:$PATH
COPY spark-env.sh /opt/spark/conf/spark-env.sh
spark-env.sh
This is Spark's conf/spark-env.sh.template copied to spark-env.sh; the only addition is the SPARK_DIST_CLASSPATH line, which lets the "without-hadoop" build of Spark pick up the jars of our Hadoop installation:

#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)

# (the remaining lines are the commented-out defaults from spark-env.sh.template)
About spark-env.sh:
With the configuration above, Spark can store data in and read data from the Hadoop distributed file system HDFS. Without it, this Spark build can only read and write local files, not HDFS. Once configured, Spark can be used directly; unlike Hadoop, there is no start-up command to run. Run one of Spark's bundled examples to verify that the installation succeeded.
1.5.1 Run an example
$SPARK_HOME/bin/run-example SparkPi 2>&1 | grep "Pi is"
[screenshot: output of the SparkPi example]
1.6 Install Python
Python is installed so that pyspark can be run.
Dockerfile
RUN apt-get install -y python
2. Common HDFS operations
Log in to the Linux system as the hadoop user and start Hadoop. Then, referring to a Hadoop book or online material (or the "Common HDFS Shell Commands" page under the "Lab Guide" section of the course website), use Hadoop's shell commands to complete the following tasks (a possible command sequence is sketched after the list):
(1) Start Hadoop and create the user directory "/user/hadoop" in HDFS;
(2) In the local Linux file system, create a text file test.txt under "/home/hadoop", type some arbitrary content into it, and upload it to the HDFS directory "/user/hadoop";
(3) Download the test.txt file from the HDFS directory "/user/hadoop" to the local directory "/home/hadoop/下载";
(4) Print the contents of the test.txt file in the HDFS directory "/user/hadoop" to the terminal;
(5) Create a subdirectory input under "/user/hadoop" in HDFS, and copy the test.txt file from "/user/hadoop" into "/user/hadoop/input";
(6) Delete the test.txt file from the HDFS directory "/user/hadoop", and delete the input subdirectory under "/user/hadoop" together with all of its contents.
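The list above only states the requirements; one possible sequence of HDFS shell commands (assuming Hadoop is running and the PATH set in the Dockerfile, with the file contents chosen arbitrarily) would be:

hdfs dfs -mkdir -p /user/hadoop                          # (1) create the user directory
echo "hello hadoop" > /home/hadoop/test.txt              # (2) create a local test file ...
hdfs dfs -put /home/hadoop/test.txt /user/hadoop         #     ... and upload it to HDFS
hdfs dfs -get /user/hadoop/test.txt "/home/hadoop/下载"  # (3) download it back to the local "下载" directory
hdfs dfs -cat /user/hadoop/test.txt                      # (4) print its contents to the terminal
hdfs dfs -mkdir /user/hadoop/input                       # (5) create the input subdirectory ...
hdfs dfs -cp /user/hadoop/test.txt /user/hadoop/input    #     ... and copy test.txt into it
hdfs dfs -rm /user/hadoop/test.txt                       # (6) delete the file ...
hdfs dfs -rm -r /user/hadoop/input                       #     ... and the directory recursively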
3. Reading file-system data with Spark
(1) In pyspark, read the local Linux file "/home/hadoop/test.txt" and count its lines;
(2) In pyspark, read the HDFS file "/user/hadoop/test.txt" (create it first if it does not exist) and count its lines;
(3) Write a standalone application that reads the HDFS file "/user/hadoop/test.txt" (create it first if it does not exist) and counts its lines, and run it by submitting it to Spark with spark-submit (a minimal sketch follows the list).
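A minimal sketch of the three tasks, assuming test.txt already exists in both locations; the file name LineCount.py and the paths are only illustrative:

# (1)/(2) inside the pyspark shell, sc already exists:
#   sc.textFile("file:///home/hadoop/test.txt").count()                # local file
#   sc.textFile("hdfs://localhost:9000/user/hadoop/test.txt").count()  # HDFS file

# (3) LineCount.py -- a standalone application; run it with:
#     $SPARK_HOME/bin/spark-submit LineCount.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("LineCount").setMaster("local")
sc = SparkContext(conf=conf)
count = sc.textFile("hdfs://localhost:9000/user/hadoop/test.txt").count()
print("line count: %d" % count)
sc.stop()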
IV. Experiment Report
"Spark Programming Fundamentals" lab report
Title:
Name:
Date:
Experiment environment:
Experiment content and completion status:
Problems encountered:
Solutions (list the problems you ran into and how you solved them, as well as any unsolved problems):
# Supplement
Python Spark shell
Enter the Python Spark shell:
${SPARK_HOME}/bin/pyspark
[screenshot: the pyspark interactive shell starting up]
Note: if PATH is configured, you can also simply run pyspark.
[screenshot: starting pyspark directly]
sc is the SparkContext created by default in the PySpark shell.
Press Ctrl+D to exit the pyspark shell.
Run a Python script (script.py) with the following command:
${SPARK_HOME}/bin/pyspark script.py
(Note: since Spark 2.0 the pyspark launcher no longer accepts a script argument and will tell you to use spark-submit, so with Spark 3.0.2 use ${SPARK_HOME}/bin/spark-submit script.py instead.)
Running script.py in the Python Spark shell:
Note:
When running a .py script in pyspark, do not create a SparkContext again; use sc directly, since it is already created by default in pyspark (a short illustration follows). Otherwise you will get the error:
ValueError: Cannot run multiple SparkContexts at once
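A short illustration of this point, assuming a local test.txt (the path is only illustrative):

# Inside the pyspark shell, reuse the sc that the shell created:
rdd = sc.textFile("file:///home/hadoop/test.txt")
print(rdd.count())

# Do NOT create another context inside the shell; this would raise
# "ValueError: Cannot run multiple SparkContexts at once":
# from pyspark import SparkContext
# new_sc = SparkContext("local", "demo")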
Run a pyspark example
TODO: run a pyspark example
pyspark error
root@a8878d819a6a:/# ${SPARK_HOME}/bin/pyspark
env: 'python': No such file or directory
Solution:
Method 1:
apt-get install -y python
Method 2:
Reference: "env: 'python': No such file or directory when running Spark" – trp – 博客园 (cnblogs)
Note:
Method 2 has not been tested here;
the Dockerfile uses apt-get install -y python to resolve this problem.