Data for ‘Hadoop: The Definitive Guide’ book, chapter 2

Some weeks ago I read the book “Hadoop: The Definitive Guide”. In the second chapter, the book explains MapReduce and then works through an example using data from the National Climatic Data Center (www.ncdc.noaa.gov).

To get the NCDC weather data, the book refers to Appendix C, which points to the book’s website:

http://www.hadoopbook.com

If you follow that link and open the “Code and Data” menu, you will find the link to the GitHub repository:

http://github.com/tomwhite/hadoop-book/

Here is the interesting part:

“A sample of the NCDC weather dataset that is used throughout the book can be found at https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all.

“The full dataset is stored on Amazon S3 in the hadoopbook bucket, and if you have an AWS account you can copy it to a EC2-based Hadoop cluster using Hadoop’s distcp command.”

So that’s it. If you are interested, you can try it with AWS. In my case, I preferred to get the data directly from the NCDC website, so I looked for it there and found this link:

https://www.ncdc.noaa.gov/cdo-web/datasets

Here is the step-by-step procedure:

  1. Click on Legacy Applications -> Global Hourly Data
  2. Click on the FTP button

Then you’ll arrive at the FTP server with all the data from 1901 to 2015:

ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
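
If you only need a few years, something like the following could fetch them from that FTP server. This is just a sketch under some assumptions: that the server keeps one subdirectory per year (for example .../noaa/2005/) with the gzipped station files inside, and that a wget build with FTP support is installed. The function names ncdc_year_url and download_year are my own, not something from the book.

```shell
#!/usr/bin/env bash

# Base URL of the NCDC FTP directory mentioned in this post.
NCDC_FTP_BASE="ftp://ftp.ncdc.noaa.gov/pub/data/noaa"

# Build the URL of one year's directory (assumed layout: <base>/<year>/).
ncdc_year_url() {
  local year="$1"
  printf '%s/%s/\n' "$NCDC_FTP_BASE" "$year"
}

# Download every file of one year into <dest>/<year> (hypothetical helper).
download_year() {
  local year="$1" dest="$2"
  mkdir -p "$dest/$year"
  # -r:  recurse into the remote directory
  # -np: never ascend to the parent directory
  # -nd: do not recreate the remote directory tree locally
  # -P:  save the files under this local directory
  wget -r -np -nd -P "$dest/$year" "$(ncdc_year_url "$year")"
}
```

For example, download_year 2005 /home/user/Projects/Hadoop/data/noaa would leave the 2005 files in /home/user/Projects/Hadoop/data/noaa/2005, which is the layout the script at the end of this post expects.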

I hope this is useful for someone!

P.S. Remember that, to follow the book’s exercise, you need to preprocess the files so that there is only one file per year. The reason, according to the book, is that processing a smaller number of relatively large files is easier and more efficient. You can use this shell script to do it (as written, it handles the 2005 directory and appends everything to a single file):

#!/usr/bin/env bash

# Directory holding the NCDC weather files to load into Hadoop
target="/home/user/Projects/Hadoop/data/noaa"

# Un-gzip each 2005 station file and append it to one file ($target.all)
echo "reporter:status:Un-gzipping $target" >&2
for file in "$target"/2005/*
do
  gunzip -c "$file" >> "$target.all"
  echo "reporter:status:Processed $file" >&2
done

You can download it from here.
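
The script above only handles the 2005 directory. To get one file per year for several years, a variation like this might work. It is a sketch, not the book’s script: the function names concat_year and concat_years are mine, and it assumes the downloaded files sit in one subdirectory per year with a .gz extension.

```shell
#!/usr/bin/env bash

# Concatenate all gzipped station files of one year into a single plain file.
# Arguments: source root (one subdirectory per year), year, output directory.
concat_year() {
  local src="$1" year="$2" out="$3"
  local file
  mkdir -p "$out"
  : > "$out/$year"               # start from an empty output file
  for file in "$src/$year"/*.gz
  do
    [ -e "$file" ] || continue   # the glob matched nothing
    gunzip -c "$file" >> "$out/$year"
  done
}

# Process a list of years, writing one output file per year.
# Arguments: source root, output directory, then the years.
concat_years() {
  local src="$1" out="$2"
  shift 2
  local year
  for year in "$@"
  do
    concat_year "$src" "$year" "$out"
    echo "Processed $year" >&2
  done
}
```

Calling concat_years /home/user/Projects/Hadoop/data/noaa /home/user/Projects/Hadoop/data/by-year 1901 1902 would then write the files by-year/1901 and by-year/1902 (the by-year directory name is only an example).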
