A few weeks ago I read the book “Hadoop: The Definitive Guide”. In the second chapter, the book explains MapReduce, followed by an example that uses data from the National Climatic Data Center (www.ncdc.noaa.gov).
To get the NCDC weather data, the book refers to Appendix C, where you can find the link to the book’s website. If you go to that site and navigate to the “Code and Data” menu, you will find the link to the GitHub repository:
http://github.com/tomwhite/hadoop-book/
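If you only want the sample data used in the book, cloning that repository is enough; a quick sketch (where you clone it is up to you):

# Clone the book's repository and list the sample NCDC files
git clone https://github.com/tomwhite/hadoop-book.git
ls hadoop-book/input/ncdc/all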
Here is the interesting part:
“A sample of the NCDC weather dataset that is used throughout the book can be found at https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all. The full dataset is stored on Amazon S3 in the hadoopbook bucket, and if you have an AWS account you can copy it to an EC2-based Hadoop cluster using Hadoop’s distcp command.”
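For reference, such a distcp copy would look roughly like the sketch below. The s3n:// scheme and the credential properties match older Hadoop releases (newer ones use s3a://), and the destination path is only an assumption:

# Hypothetical sketch: copy the full dataset from the hadoopbook S3 bucket into HDFS
# (the s3n:// scheme, the credential properties and the target path are assumptions)
hadoop distcp \
  -Dfs.s3n.awsAccessKeyId='YOUR_ACCESS_KEY' \
  -Dfs.s3n.awsSecretAccessKey='YOUR_SECRET_KEY' \
  s3n://hadoopbook/ncdc/all input/ncdc/all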
So that’s it. If you are interested, you can try it with AWS. In my case, I preferred to get the data directly from the NCDC website, so I found this link:
https://www.ncdc.noaa.gov/cdo-web/datasets
Here is the step-by-step procedure:
- Click on Legacy Applications -> Global Hourly Data
- Click on FTP button
Then you’ll arrive at the FTP server with all the data from 1901 to 2015:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
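If you want to grab a whole year from the FTP server non-interactively, something like this should work; the year 2005 and the local path are my own choices, picked to match the script below:

# Mirror the gzipped station files for one year into a local directory
# (the year and the destination path are assumptions)
wget -r -np -nH --cut-dirs=3 -P /home/user/Projects/Hadoop/data/noaa \
  ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2005/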
I hope this is useful for someone!
PS. Remember that to follow the book’s exercise it is necessary to preprocess the files, generating only one file per year. The reason, according to the book, is that processing a small number of relatively large files is easier and more efficient than processing many small ones. You can use this shell script to do it:
#!/usr/bin/env bash
# Concatenate the NCDC weather files for one year into a single file to load into Hadoop

target="/home/user/Projects/Hadoop/data/noaa"
year="2005"

# Un-gzip each station file and concatenate them into one file for the year
echo "reporter:status:Un-gzipping $target/$year" >&2
for file in "$target/$year"/*
do
  gunzip -c "$file" >> "$target/$year.all"
  echo "reporter:status:Processed $file" >&2
done
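Once the concatenated file exists, you can copy it into HDFS before running the Chapter 2 example; a minimal sketch, assuming input/ncdc/all as the destination in your HDFS home directory:

# Hypothetical sketch: load the concatenated year file into HDFS
# (the HDFS destination directory is an assumption)
hadoop fs -mkdir -p input/ncdc/all
hadoop fs -put /home/user/Projects/Hadoop/data/noaa/2005.all input/ncdc/all/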