```bash
# Download and unpack the hadoop-lzo source
wget https://download.github.com/kevinweil-hadoop-lzo-2ad6654.tar.gz
tar -zxvf kevinweil-hadoop-lzo-2ad6654.tar.gz
cd kevinweil-hadoop-lzo-2ad6654
# Build the jar and the 64-bit native libraries
export CFLAGS=-m64
export CXXFLAGS=-m64
ant compile-native tar
```
```bash
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer hdfs://namenode:9000/lzo_logs
```
To write a job that uses LZO, take an existing job such as wordcount and replace its TextInputFormat with LzoTextInputFormat. Nothing else needs to change: the job will read the LZO-compressed files from HDFS and process the splits in parallel across the cluster.
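A minimal sketch of such a driver, using Hadoop's newer mapreduce API (the mapper and reducer are ordinary wordcount classes written inline; input and output paths come from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoWordCount {

  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "lzo wordcount");
    job.setJarByClass(LzoWordCount.class);

    // The only LZO-specific change: LzoTextInputFormat instead of TextInputFormat.
    job.setInputFormatClass(LzoTextInputFormat.class);

    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run it exactly like a plain wordcount, pointing the first argument at a directory of indexed .lzo files.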
Using Hadoop and LZO
Reading and Writing LZO Data
The project provides LzoInputStream and LzoOutputStream wrapping regular streams, to allow you to easily read and write compressed LZO data.
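These wrap the standard Hadoop codec machinery, so the same round trip can also be written against the CompressionCodec interface. A minimal sketch, assuming com.hadoop.compression.lzo.LzopCodec and a placeholder HDFS path:

```java
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import com.hadoop.compression.lzo.LzopCodec;

public class LzoStreamDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    LzopCodec codec = new LzopCodec();
    codec.setConf(conf);  // the codec needs a Configuration to load the native library

    // Write: wrap a regular HDFS output stream so bytes are LZO-compressed on the way out.
    try (OutputStream out = codec.createOutputStream(fs.create(new Path("/tmp/demo.lzo")))) {
      out.write("hello lzo\n".getBytes("UTF-8"));
    }

    // Read: wrap the input stream so bytes are decompressed on the way in.
    try (InputStream in = codec.createInputStream(fs.open(new Path("/tmp/demo.lzo")))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```

Note that the native LZO libraries must be loadable at runtime (on java.library.path), or stream creation will fail.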
Indexing LZO Files
At this point, you should also be able to use the indexer to index lzo files in Hadoop (recall: this makes them splittable, so that they can be analyzed in parallel in a mapreduce job). Imagine that big_file.lzo is a 1 GB LZO file. You have two options:
- index it in-process via:
  hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer big_file.lzo
- index it in a map-reduce job via:
  hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer big_file.lzo
Either way, after 10-20 seconds there will be a file named big_file.lzo.index. The newly-created index file tells the LzoTextInputFormat's getSplits function how to break the LZO file into splits that can be decompressed and processed in parallel. Alternatively, if you specify a directory instead of a filename, both indexers will recursively walk the directory structure looking for .lzo files, indexing any that do not already have corresponding .lzo.index files.
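The indexer can also be driven from Java rather than the command line. A sketch, assuming the LzoIndex helper class that ships with hadoop-lzo (the file path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import com.hadoop.compression.lzo.LzoIndex;

public class IndexBigFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path lzoFile = new Path("hdfs://namenode:9000/big_file.lzo");
    FileSystem fs = lzoFile.getFileSystem(conf);

    // Writes big_file.lzo.index next to the file, same as running LzoIndexer in-process.
    LzoIndex.createIndex(fs, lzoFile);

    // The index can then be read back to inspect the splittable block boundaries.
    LzoIndex index = LzoIndex.readIndex(fs, lzoFile);
    System.out.println("Number of LZO blocks: " + index.getNumberOfBlocks());
  }
}
```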
Running MR Jobs over Indexed Files
Now run any job, say wordcount, over the new file. In Java-based M/R jobs, just replace any uses of TextInputFormat by LzoTextInputFormat. In streaming jobs, add "-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat" (streaming still uses the old APIs, and needs a class that inherits from org.apache.hadoop.mapred.InputFormat). Note that to use the DeprecatedLzoTextInputFormat properly with hadoop-streaming, you should also set the jobconf property stream.map.input.ignoreKey=true. That will replicate the behavior of the default TextInputFormat by stripping off the byte offset keys from the input lines that get piped to the mapper process. For Pig jobs, email me or check the pig list -- I have custom LZO loader classes that work but are not (yet) contributed back.
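Streaming itself is launched from the command line, but the same two settings translate directly to an old-API Java job. A hedged sketch (paths are placeholders; the mapper and reducer are left as the identity defaults, so it simply copies input lines through):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import com.hadoop.mapred.DeprecatedLzoTextInputFormat;

public class OldApiLzoJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(OldApiLzoJob.class);
    conf.setJobName("old-api lzo passthrough");

    // Old-API input format; this is the class streaming jobs must name via -inputformat.
    conf.setInputFormat(DeprecatedLzoTextInputFormat.class);

    // Only consulted by hadoop-streaming: drop the byte-offset keys before piping
    // lines to the mapper process, mimicking plain TextInputFormat behavior.
    conf.setBoolean("stream.map.input.ignoreKey", true);

    // DeprecatedLzoTextInputFormat emits <LongWritable offset, Text line> pairs,
    // which the default identity mapper/reducer pass straight through.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path("/lzo_logs"));
    FileOutputFormat.setOutputPath(conf, new Path("/lzo_passthrough_out"));
    JobClient.runJob(conf);
  }
}
```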
Note that if you forget to index an .lzo file, the job will work but will process the entire file in a single split, which will be less efficient.
References
- [LZO local compression and decompression example](http://blog.csdn.net/scorpiohjx2/article/details/18423529)
- [Installing and configuring LZO on a Hadoop cluster](http://share.blog.51cto.com/278008/549393/)
- [Successfully installing LZO for Hadoop 2.0.0-cdh4.3.0](http://www.tuicool.com/articles/VVj6rm)
- [hadoop-lzo source code](https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1)
- [Fixing the Hadoop "Could not load native gpl library" error](http://guoyunsky.iteye.com/blog/1237327)