
Knowledge Set | Final Summary of Spark Programming Fundamentals

LearningYard Academy


Share interests, spread happiness, increase knowledge, and leave something beautiful!

Dear reader, this is LearningYard Academy.

Today we bring you Knowledge Set | Final Summary of Spark Programming Fundamentals. Welcome!

Spark programming mainly covers four parts: RDD programming basics, Spark SQL, Spark Streaming, and Spark MLlib.

RDD programming covers creating, saving, persisting, and partitioning RDDs. An RDD can be created in two ways: by loading data from a file system or by parallelizing an existing collection. Loading from a file uses the SparkContext object that Spark provides; in the interactive shell it is available directly (as sc), while a standalone program must create it by hand. The loaded file becomes an RDD with one element per line. Alternatively, parallelizing builds an RDD from a collection that already exists in the driver program.
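Here is a minimal PySpark sketch of both creation methods; the app name and file path are placeholders:

```python
from pyspark import SparkConf, SparkContext

# In a standalone program the SparkContext must be created by hand;
# in the pyspark interactive shell it already exists as `sc`.
conf = SparkConf().setAppName("RDDBasics").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Way 1: load from a file system -- each line becomes one RDD element.
# "data/word.txt" is a placeholder path.
lines = sc.textFile("data/word.txt")

# Way 2: create an RDD from an existing collection with parallelize().
nums = sc.parallelize([1, 2, 3, 4, 5])
```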

RDDs support two kinds of operations: transformations and actions. Transformations are lazy: they perform no real computation and only record the lineage. They include filter(), map(), flatMap(), groupByKey(), and reduceByKey(). Actions include count(), collect(), first(), take(), reduce(), and foreach().
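A short sketch of this lazy behavior, reusing the sc and lines objects from the previous snippet (the word-count pipeline itself is just an illustration):

```python
# Transformations only record the lineage; no computation happens here.
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Actions trigger the actual computation.
print(counts.count())   # number of distinct words
print(counts.take(5))   # first five (word, count) pairs
```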

Persistence uses the persist() method: when the first action is executed, the computed result is cached, so later actions no longer need to recompute it from scratch. Partitioning aims to increase parallelism and reduce communication overhead; the number of partitions can be changed with repartition().
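A sketch of persistence and repartitioning, continuing from the counts RDD above; the storage level and partition count are illustrative choices:

```python
from pyspark import StorageLevel

# Mark the RDD for persistence; nothing is cached until an action runs.
counts.persist(StorageLevel.MEMORY_ONLY)   # same effect as counts.cache()

counts.count()     # first action: computes the lineage and caches the result
counts.collect()   # later actions reuse the cache instead of recomputing

print(counts.getNumPartitions())   # current number of partitions
counts4 = counts.repartition(4)    # reshuffle the data into 4 partitions
```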

Spark SQL is Spark's component for querying structured data. It works on RDDs that carry a schema, namely DataFrames. Use spark.read to load data in different formats into a DataFrame and df.write to save it. Common DataFrame operations include printSchema(), select(), filter(), groupBy(), and sort(). An RDD can be converted into a DataFrame either through the reflection mechanism (schema inference) or by programmatically defining a schema. Spark SQL can also read data from databases.
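A minimal DataFrame sketch; the file names, column names, and JDBC connection details are all placeholders:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("SparkSQLBasics").getOrCreate()

# Read a JSON file into a DataFrame, query it, and save it as Parquet.
df = spark.read.json("data/people.json")          # placeholder file
df.printSchema()
df.select("name", "age").filter(df.age > 20).sort("age").show()
df.groupBy("age").count().show()
df.write.mode("overwrite").parquet("data/people.parquet")

# RDD -> DataFrame by reflection: Row objects let Spark infer the schema.
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])
people = spark.createDataFrame(rdd.map(lambda p: Row(name=p[0], age=p[1])))

# Reading a database table over JDBC (all connection details are made up):
# db = (spark.read.format("jdbc")
#       .option("url", "jdbc:mysql://localhost:3306/spark")
#       .option("dbtable", "student")
#       .option("user", "root")
#       .option("password", "******")
#       .load())
```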

That's all for today's sharing. If you have any thoughts of your own on today's article, feel free to leave a comment. See you tomorrow, and have a happy day!

Reference textbook: 《Spark编程基础(Python版)》 (Spark Programming Fundamentals, Python Edition), edited by Lin Ziyu.

Translation: Google Translate

This article is original to LearningYard Academy; some images and text come from the Internet. Please contact us in case of any infringement.

Tags: #Spark Programming Fundamentals Final Exam