
A Big Data Expert Explains in Detail How Hadoop MapReduce Handles Massive Numbers of Small Files: Compressed Files

大数据架构师


Preface

Storing a large number of small files on HDFS consumes a great deal of NameNode memory: the NameNode keeps a metadata entry for every file (and its blocks) in memory and loads all of that metadata at startup, so the more files there are, the greater the burden on the NameNode. If we instead compress the small files into a single file before uploading to HDFS, only one file's metadata needs to be kept, which dramatically reduces the NameNode's memory overhead. For MapReduce computation, Hadoop ships with the following compression formats:

DEFLATE
gzip
bzip2
LZO
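
MapReduce decides which of these codecs (if any) to apply to an input file from its filename extension, using CompressionCodecFactory. Below is a minimal sketch of that lookup; the class name CodecLookupExample is made up for illustration only:

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupExample {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The factory maps a file extension (.gz, .bz2, .deflate, ...) to the codec
        // that the input format will use to decompress the file transparently.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        for (String name : new String[] { "data.gz", "data.bz2", "data.deflate", "data.txt" }) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            System.out.println(name + " -> "
                    + (codec == null ? "no codec (plain text)" : codec.getClass().getName()));
        }
    }
}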

The extra cost of running MapReduce over compressed files is the time spent decompressing them, and in certain applications that cost has to be weighed. For the massive-small-files scenario, however, what we buy by compressing the small files is data locality. If hundreds or thousands of small files compress down to a single block, that block sits on a single DataNode, the job reads it as one InputSplit (gzip is not splittable, so the whole compressed file is handled by one map task), and the computation runs locally with no data shipped across the network.
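
One simple way to obtain that single compressed file is to concatenate the small text files into one gzip file on the local machine before uploading it to HDFS. The following is only a rough sketch, assuming every small file holds whole lines in the same record format; the class name SmallFilesGzipPacker is hypothetical:

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class SmallFilesGzipPacker {

    // Concatenate every small text file under inputDir into a single
    // gzip-compressed file, which can then be uploaded to HDFS as one block.
    // Assumes each small file ends with a newline so records are not glued together.
    public static void pack(File inputDir, File outputGzFile) throws IOException {
        OutputStream out = new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(outputGzFile)));
        byte[] buffer = new byte[8192];
        try {
            for (File file : inputDir.listFiles()) {
                if (!file.isFile()) {
                    continue;
                }
                InputStream in = new BufferedInputStream(new FileInputStream(file));
                try {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }
    }

    public static void main(String[] args) throws IOException {
        pack(new File(args[0]), new File(args[1]));
    }
}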

If instead the small files were uploaded to HDFS as they are, hundreds or thousands of small blocks would end up scattered across different DataNodes, and the job might have to "move the data" before it could compute. With only a handful of files you would notice little beyond the extra NameNode memory, but once the small files reach a certain scale the network transfer cost becomes very noticeable. Below, we compress the small files with gzip, upload them to HDFS, and run a MapReduce job over them. A single class implements both the Map task and the Reduce task, as shown below. (Original author: 时延军)

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class GzipFilesMaxCostComputation {

    public static class GzipFilesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private final static LongWritable costValue = new LongWritable(0);
        private Text code = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // a line, such as 'SG 253654006139495 253654006164392 619850464'
            String line = value.toString();
            String[] array = line.split("\\s");
            if (array.length == 4) {
                String countryCode = array[0];
                String strCost = array[3];
                long cost = 0L;
                try {
                    cost = Long.parseLong(strCost);
                } catch (NumberFormatException e) {
                    cost = 0L;
                }
                if (cost != 0) {
                    code.set(countryCode);
                    costValue.set(cost);
                    context.write(code, costValue);
                }
            }
        }
    }

    public static class GzipFilesReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long max = 0L;
            Iterator<LongWritable> iter = values.iterator();
            while (iter.hasNext()) {
                LongWritable current = iter.next();
                if (current.get() > max) {
                    max = current.get();
                }
            }
            context.write(key, new LongWritable(max));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: gzipmaxcost <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "gzip maxcost");

        job.getConfiguration().setBoolean("mapred.output.compress", true);
        job.getConfiguration().setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

        job.setJarByClass(GzipFilesMaxCostComputation.class);
        job.setMapperClass(GzipFilesMapper.class);
        job.setCombinerClass(GzipFilesReducer.class);
        job.setReducerClass(GzipFilesReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);
    }
}

The program above simply computes a per-country maximum, so the implementation is straightforward, and it reads gzip-compressed input files. In addition, if a large amount of map output has to be copied over to the reducers, you can also turn on compression when configuring the Job, in the same style as the mapred.output.compress settings shown in the code above; a sketch of the map-output variant follows.
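
For reference, here is what compressing the intermediate map output could look like with the Hadoop 1.x property names; the helper class and method are made up purely for illustration:

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompressionConfig {

    // Turn on compression of the intermediate map output that is shuffled to
    // the reducers, using the Hadoop 1.x property names, in the same style as
    // the mapred.output.compress settings used in the job above.
    public static void enableMapOutputCompression(Job job) {
        job.getConfiguration().setBoolean("mapred.compress.map.output", true);
        job.getConfiguration().setClass("mapred.map.output.compression.codec",
                GzipCodec.class, CompressionCodec.class);
    }
}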

Next, let's walk through running the program.

Preparing the data

xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ du -sh ../dataset/gzipfiles/*
147M     ../dataset/gzipfiles/data_10m.gz
43M      ../dataset/gzipfiles/data_50000_1.gz
16M      ../dataset/gzipfiles/data_50000_2.gz
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -mkdir /user/xiaoxiang/datasets/gzipfiles
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -copyFromLocal ../dataset/gzipfiles/* /user/xiaoxiang/datasets/gzipfiles
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -ls /user/xiaoxiang/datasets/gzipfiles
Found 3 items
-rw-r--r--   3 xiaoxiang supergroup  153719349 2013-03-24 12:56 /user/xiaoxiang/datasets/gzipfiles/data_10m.gz
-rw-r--r--   3 xiaoxiang supergroup   44476101 2013-03-24 12:56 /user/xiaoxiang/datasets/gzipfiles/data_50000_1.gz
-rw-r--r--   3 xiaoxiang supergroup   15935178 2013-03-24 12:56 /user/xiaoxiang/datasets/gzipfiles/data_50000_2.gz
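
The same upload can also be done programmatically through the HDFS FileSystem API; the following is a small sketch mirroring the -mkdir and -copyFromLocal commands above (the class name GzipFilesUploader is illustrative):

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipFilesUploader {

    // args[0]: local directory holding the .gz files, args[1]: target HDFS directory.
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path(args[1]);
        // Equivalent of: bin/hadoop fs -mkdir <target>
        fs.mkdirs(target);
        // Equivalent of: bin/hadoop fs -copyFromLocal <local>/*.gz <target>
        for (File file : new File(args[0]).listFiles()) {
            if (file.isFile() && file.getName().endsWith(".gz")) {
                fs.copyFromLocalFile(new Path(file.getAbsolutePath()), target);
            }
        }
        fs.close();
    }
}
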
Running the program
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop jar gzip-compression.jar org.shirdrn.kodz.inaction.hadoop.smallfiles.compression.GzipFilesMaxCostComputation /user/xiaoxiang/datasets/gzipfiles /user/xiaoxiang/output/smallfiles/gzip
13/03/24 13:06:28 INFO input.FileInputFormat: Total input paths to process : 3
13/03/24 13:06:28 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/24 13:06:28 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/24 13:06:28 INFO mapred.JobClient: Running job: job_201303111631_0039
13/03/24 13:06:29 INFO mapred.JobClient:  map 0% reduce 0%
13/03/24 13:06:55 INFO mapred.JobClient:  map 33% reduce 0%
13/03/24 13:07:04 INFO mapred.JobClient:  map 66% reduce 11%
13/03/24 13:07:13 INFO mapred.JobClient:  map 66% reduce 22%
13/03/24 13:07:25 INFO mapred.JobClient:  map 100% reduce 22%
13/03/24 13:07:31 INFO mapred.JobClient:  map 100% reduce 100%
13/03/24 13:07:36 INFO mapred.JobClient: Job complete: job_201303111631_0039
13/03/24 13:07:36 INFO mapred.JobClient: Counters: 29
13/03/24 13:07:36 INFO mapred.JobClient:   Job Counters
13/03/24 13:07:36 INFO mapred.JobClient:     Launched reduce tasks=1
13/03/24 13:07:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=78231
13/03/24 13:07:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/24 13:07:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/03/24 13:07:36 INFO mapred.JobClient:     Launched map tasks=3
13/03/24 13:07:36 INFO mapred.JobClient:     Data-local map tasks=3
13/03/24 13:07:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=34413
13/03/24 13:07:36 INFO mapred.JobClient:   File Output Format Counters
13/03/24 13:07:36 INFO mapred.JobClient:     Bytes Written=1337
13/03/24 13:07:36 INFO mapred.JobClient:   FileSystemCounters
13/03/24 13:07:36 INFO mapred.JobClient:     FILE_BYTES_READ=288127
13/03/24 13:07:36 INFO mapred.JobClient:     HDFS_BYTES_READ=214131026
13/03/24 13:07:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=385721
13/03/24 13:07:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1337
13/03/24 13:07:36 INFO mapred.JobClient:   File Input Format Counters
13/03/24 13:07:36 INFO mapred.JobClient:     Bytes Read=214130628
13/03/24 13:07:36 INFO mapred.JobClient:   Map-Reduce Framework
13/03/24 13:07:36 INFO mapred.JobClient:     Map output materialized bytes=9105
13/03/24 13:07:36 INFO mapred.JobClient:     Map input records=14080003
13/03/24 13:07:36 INFO mapred.JobClient:     Reduce shuffle bytes=6070
13/03/24 13:07:36 INFO mapred.JobClient:     Spilled Records=22834
13/03/24 13:07:36 INFO mapred.JobClient:     Map output bytes=154878493
13/03/24 13:07:36 INFO mapred.JobClient:     CPU time spent (ms)=90200
13/03/24 13:07:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=688193536
13/03/24 13:07:36 INFO mapred.JobClient:     Combine input records=14092911
13/03/24 13:07:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=398
13/03/24 13:07:36 INFO mapred.JobClient:     Reduce input records=699
13/03/24 13:07:36 INFO mapred.JobClient:     Reduce input groups=233
13/03/24 13:07:36 INFO mapred.JobClient:     Combine output records=13747
13/03/24 13:07:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=765448192
13/03/24 13:07:36 INFO mapred.JobClient:     Reduce output records=233
13/03/24 13:07:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2211237888
13/03/24 13:07:36 INFO mapred.JobClient:     Map output records=14079863
Run results
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -ls /user/xiaoxiang/output/smallfiles/gzip
Found 3 items
-rw-r--r--   3 xiaoxiang supergroup          0 2013-03-24 13:07 /user/xiaoxiang/output/smallfiles/gzip/_SUCCESS
drwxr-xr-x   - xiaoxiang supergroup          0 2013-03-24 13:06 /user/xiaoxiang/output/smallfiles/gzip/_logs
-rw-r--r--   3 xiaoxiang supergroup       1337 2013-03-24 13:07 /user/xiaoxiang/output/smallfiles/gzip/part-r-00000.gz
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -copyToLocal /user/xiaoxiang/output/smallfiles/gzip/part-r-00000.gz ./
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ gunzip -c ./part-r-00000.gz
AD     999974516
AE     999938630
AF     999996180
AG     999991085
AI     999989595
AL     999998489
AM     999978568
AO     999989628
AQ     999995031
AR     999999563
AS     999935982
AT     999999909
AU     999937089
AW     999965784
AZ     999996557
BA     999994828
BB     999992177
BD     999992272
BE     999925057
BF     999999220
BG     999971528
BH     999994900
BI     999982573
BJ     999977886
BM     999991925
BN     999986630
BO     999995482
BR     999989947
BS     999983475
BT     999992685
BW     999984222
BY     999998496
BZ     999997173
CA     999991096
CC     999969761
CD     999978139
CF     999995342
CG     999957938
CH     999997524
CI     999998864
CK     999968719
CL     999967083
CM     999998369
CN     999975367
CO     999999167
CR     999980097
CU     999976352
CV     999990543
CW     999996327
CX     999987579
CY     999982925
CZ     999993908
DE     999985416
DJ     999997438
DK     999963312
DM     999941706
DO     999992176
DZ     999973610
EC     999971018
EE     999960984
EG     999980522
ER     999980425
ES     999949155
ET     999987033
FI     999989788
FJ     999990686
FK     999977799
FM     999994183
FO     999988472
FR     999988342
GA     999982099
GB     999970658
GD     999996318
GE     999991970
GF     999982024
GH     999941039
GI     999995295
GL     999948726
GM     999984872
GN     999992209
GP     999996090
GQ     999988635
GR     999999672
GT     999981025
GU     999975956
GW     999962551
GY     999999881
HK     999970084
HN     999972628
HR     999986688
HT     999970913
HU     999997568
ID     999994762
IE     999996686
IL     999982184
IM     999987831
IN     999973935
IO     999984611
IQ     999990126
IR     999986780
IS     999973585
IT     999997239
JM     999986629
JO     999982595
JP     999985598
KE     999996012
KG     999991556
KH     999975644
KI     999994328
KM     999989895
KN     999991068
KP     999967939
KR     999992162
KW     999924295
KY     999985907
KZ     999992835
LA     999989151
LB     999989233
LC     999994793
LI     999986863
LK     999989876
LR     999984906
LS     999957706
LT     999999688
LU     999999823
LV     999981633
LY     999992365
MA     999993880
MC     999978886
MD     999997483
MG     999996602
MH     999989668
MK     999983468
ML     999990079
MM     999989010
MN     999969051
MO     999978283
MP     999995848
MQ     999913110
MR     999982303
MS     999997548
MT     999982604
MU     999988632
MV     999975914
MW     999991903
MX     999978066
MY     999995010
MZ     999981189
NA     999976735
NC     999961053
NE     999990091
NF     999989399
NG     999985037
NI     999965733
NL     999988890
NO     999993122
NP     999972410
NR     999956464
NU     999987046
NZ     999998214
OM     999967428
PA     999944775
PE     999998598
PF     999959978
PG     999987347
PH     999981534
PK     999954268
PL     999996619
PM     999998975
PR     999978127
PT     999993404
PW     999991278
PY     999993590
QA     999995061
RE     999998518
RO     999994148
RS     999999923
RU     999995809
RW     999980184
SA     999973822
SB     999972832
SC     999991021
SD     999963744
SE     999972256
SG     999977637
SH     999999068
SI     999980580
SK     999998152
SL     999999269
SM     999941188
SN     999990278
SO     999978960
SR     999997483
ST     999980447
SV     999999945
SX     999938671
SY     999990666
SZ     999992537
TC     999969904
TD     999999303
TG     999977640
TH     999979255
TJ     999983666
TK     999971131
TM     999958998
TN     999979170
TO     999959971
TP     999986796
TR     999996679
TT     999984435
TV     999974536
TW     999975092
TZ     999992734
UA     999972948
UG     999980070
UM     999998377
US     999918442
UY     999989662
UZ     999982762
VA     999987372
VC     999991495
VE     999997971
VG     999954576
VI     999990063
VN     999974393
VU     999976113
WF     999961299
WS     999970242
YE     999984650
YT     999994707
ZA     999998692
ZM     999993331
ZW     999943540
