Benchmarking ‘Hadoop The Definitive Guide Chapter 2’ approaches

Following up on the previous post, I thought it would be interesting to run some benchmarks and compare the execution time of each implementation approach presented in Chapter 2 of Hadoop: The Definitive Guide.

To recap the problem: the book exercise consists of processing a big file and performing an aggregation over its data. More specifically:

  • For the benchmark, I’ve created a relatively big file (named noaa.all) following the instructions from the previous post. It’s a text file with 16,873,949 lines of weather data (4.1 GB). Each line holds one measurement from a weather sensor somewhere around the globe; we’re only interested in the air temperature and the year of the measurement.
  • The aggregation operation consists of getting the max value of temperature by year.
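As a rough illustration of the aggregation, here is a minimal Python sketch that computes the maximum temperature per year from fixed-width records. The temperature and quality offsets mirror the awk script used in this post (substr($0, 88, 5) and substr($0, 93, 1)); the year offset (characters 16 to 19) is an assumption based on the book's Python mapper. The names max_temperature_by_year and make_record are hypothetical and only exist for this example.

```python
from typing import Dict, Iterable

MISSING = 9999          # sentinel value used for a missing reading
GOOD_QUALITY = "01459"  # quality codes accepted by the awk script


def max_temperature_by_year(lines: Iterable[str]) -> Dict[str, int]:
    """Return the highest valid air temperature seen for each year."""
    maxima: Dict[str, int] = {}
    for line in lines:
        if len(line) < 93:
            continue  # too short to hold the fields we need
        year = line[15:19]         # assumed year offset
        temp = int(line[87:92])    # same span as awk substr($0, 88, 5)
        quality = line[92:93]      # same span as awk substr($0, 93, 1)
        if temp == MISSING or quality not in GOOD_QUALITY:
            continue
        if year not in maxima or temp > maxima[year]:
            maxima[year] = temp
    return maxima


def make_record(year: str, temp: int, quality: str) -> str:
    """Build a fake fixed-width record (hypothetical helper for the demo)."""
    return "0" * 15 + year + "0" * 68 + f"{temp:+05d}" + quality


if __name__ == "__main__":
    sample = [
        make_record("1949", 111, "1"),
        make_record("1949", 78, "1"),
        make_record("1950", 22, "1"),
        make_record("1950", -11, "1"),
        make_record("1950", 9999, "1"),  # missing reading, skipped
        make_record("1950", 300, "2"),   # rejected quality code, skipped
    ]
    print(max_temperature_by_year(sample))
```

The fake records only fill the three fields the sketch reads; real lines carry many more fields, which the slicing simply ignores.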

The implementation approaches proposed by the book are:

  1. Using Unix Commands (cat and awk)
  2. Using Hadoop (with and without a combiner function)
  3. Using Hadoop Streaming with Ruby (with and without a combiner function)
  4. Using Hadoop Streaming with Python (with and without a combiner function)
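To make the "with and without a combiner function" distinction concrete, the following Python sketch simulates the data flow on a single machine. It is not Hadoop code: run_job and reduce_max are hypothetical names, and the splits are made-up sample data. Because taking a maximum is associative and commutative, the same function can be applied to each map output first (the combiner) and then again to the merged data (the reducer) without changing the result; only the number of pairs that have to be shuffled goes down.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]  # (year, temperature)


def reduce_max(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the maximum value per key; works as combiner and as reducer."""
    best: Dict[str, int] = {}
    for year, temp in pairs:
        if year not in best or temp > best[year]:
            best[year] = temp
    return sorted(best.items())


def run_job(splits: List[List[Pair]], use_combiner: bool) -> Tuple[List[Pair], int]:
    """Simulate map output -> optional combine -> shuffle -> reduce.

    Returns the final result and the number of pairs that were shuffled.
    """
    shuffled: List[Pair] = []
    for split in splits:
        # With a combiner, each map output is pre-aggregated locally.
        shuffled.extend(reduce_max(split) if use_combiner else split)
    return reduce_max(shuffled), len(shuffled)


if __name__ == "__main__":
    # Made-up output of two map tasks
    splits = [
        [("1950", 0), ("1950", 20), ("1950", 10)],
        [("1950", 25), ("1950", 15)],
    ]
    print(run_job(splits, use_combiner=False))
    print(run_job(splits, use_combiner=True))
```

Both calls produce the same maximum; the second one moves fewer pairs between the simulated map and reduce phases.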

And these are the execution time results:

Label  Approach                                 Time (seconds)
1      cat + awk                                23.482
2      cat + python script                      82.675
3      Hadoop Streaming with combiner + python  84.359
4      cat + ruby script                        86.169
5      Hadoop Streaming with combiner + ruby    89.356
6      Hadoop Streaming + python                90.558
7      Hadoop Streaming + ruby                  93.952
8      Hadoop with combiner function            114.527
9      Hadoop                                   123.788
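To put these numbers in perspective, the short Python sketch below derives two figures from the measured times: how many times slower each approach is than cat + awk, and how much time the combiner saves for each Hadoop variant. The timings are copied from the results above; the helper names slowdown and combiner_saving are only illustrative.

```python
# Measured times in seconds, copied from the benchmark results
timings = {
    "cat + awk": 23.482,
    "cat + python script": 82.675,
    "Hadoop Streaming with combiner + python": 84.359,
    "cat + ruby script": 86.169,
    "Hadoop Streaming with combiner + ruby": 89.356,
    "Hadoop Streaming + python": 90.558,
    "Hadoop Streaming + ruby": 93.952,
    "Hadoop with combiner function": 114.527,
    "Hadoop": 123.788,
}


def slowdown(label: str, baseline: str = "cat + awk") -> float:
    """How many times slower `label` is than the baseline approach."""
    return timings[label] / timings[baseline]


def combiner_saving(without: str, with_combiner: str) -> float:
    """Percentage of time saved by adding the combiner."""
    saved = timings[without] - timings[with_combiner]
    return 100.0 * saved / timings[without]


if __name__ == "__main__":
    for label in timings:
        print(f"{label:<42} {slowdown(label):5.2f}x")
    pairs = [
        ("Hadoop", "Hadoop with combiner function"),
        ("Hadoop Streaming + python", "Hadoop Streaming with combiner + python"),
        ("Hadoop Streaming + ruby", "Hadoop Streaming with combiner + ruby"),
    ]
    for without, with_combiner in pairs:
        print(f"{without}: combiner saves {combiner_saving(without, with_combiner):.1f}%")
```

On a single machine the combiner saves only a few percent in every variant, which is what one would expect when there is no network transfer to cut down.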


The differences are easier to appreciate in a chart:



It seems clear that, on a single machine, the fastest approach is plain Unix commands. Python and Ruby are the second and third best options. The overhead of using Hadoop with Java code is not justified in this case. If we needed to reduce execution times further and had several machines available, we could consider the Hadoop option. As explained in the book, on a 10-node EC2 cluster running High-CPU Extra Large instances, the program runs seven times faster than the naive awk version does.

How to execute the benchmark

I used the Ruby and Python scripts provided in the book’s git repo. They are in the ch02-mr-intro/src/main/python and ch02-mr-intro/src/main/ruby folders.

The script that runs the cat + awk commands is a modified version of the file in ch02-mr-intro/src/main/awk. This is the code:

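#!/usr/bin/env bash

# Input file to scan (assumed here to be passed as the first argument)
target="$1"

# Read every record and keep the highest valid air temperature
cat "$target" |
awk '{ temp = substr($0, 88, 5) + 0;      # air temperature field
       q = substr($0, 93, 1);             # quality code
       # skip missing readings (9999) and readings with a bad quality code
       if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
     END { print max }'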

You can download it from here.

Commands executed:

1 time
2 time cat noaa.all | | sort |
3 time hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input noaa.all -output output -mapper -combiner -reducer
4 time cat noaa.all | max_temperature_map.rb | sort | max_temperature_reduce.rb
5 time hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input noaa.all -output output -mapper max_temperature_map.rb -combiner max_temperature_reduce.rb -reducer max_temperature_reduce.rb
6 time hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input noaa.all -output output -mapper -reducer
7 time hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input noaa.all -output output -mapper max_temperature_map.rb -reducer max_temperature_reduce.rb
8 time hadoop MaxTemperatureWithCombiner noaa.all output
9 time hadoop MaxTemperature noaa.all output

You can download a shell script with all the commands from here.
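The timings in this post come from a single run of each command under the shell's time built-in. If you want to repeat a measurement several times and look at the spread, a small wrapper like the following could be used. This is only a sketch: time_command is a hypothetical helper, the number of runs is arbitrary, and the stand-in command must be replaced with one of the commands listed above. For the Hadoop commands, the output directory has to be removed between runs, because Hadoop refuses to write into an existing one.

```python
import statistics
import subprocess
import sys
import time
from typing import List


def time_command(command: List[str], runs: int = 3) -> List[float]:
    """Run `command` `runs` times and return the wall-clock seconds of each run."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is not timed silently
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. For a real measurement,
    # replace it with something like:
    # ["bash", "-c", "cat noaa.all | max_temperature_map.rb | sort | max_temperature_reduce.rb"]
    command = [sys.executable, "-c", "pass"]
    samples = time_command(command)
    print(f"median {statistics.median(samples):.3f}s over {len(samples)} runs")
```

Note that this measures wall-clock time only, which corresponds to the "real" figure reported by time.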

Environment used

Ubuntu 14.04.3

Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

8 GB RAM

