Introduction to Analytics using Hadoop



October 31, 2014 to November 28, 2014

taught by Benjamin Bengfort

Aim of Course:

This class will introduce statisticians to Hadoop, and provide an exemplar workflow for using Hadoop, writing MapReduce jobs, and finally leveraging Hadoop Streaming to conclude work in an analytics programming language such as Python.  In this course you will learn

  1. What Hadoop is and how to leverage it to perform analytics
  2. The software components of the Hadoop Ecosystem
  3. How to manage data on a distributed file system
  4. How to write MapReduce jobs to perform computations with Hadoop
  5. How to utilize Hadoop Streaming to output jobs

Background - "Big Data"

The term “Big Data” has come into vogue to refer not just to data volume, but also to an exciting new set of tools and techniques that are powering modern applications and whose novelty seems to be changing the way the world is computing. In most cases, the "end game" is the application of well-known statistical and machine-learning techniques. However, modern distributed computation techniques are allowing the analysis of data sets far larger than those that could typically be analyzed in the past.

The need for distributed computing arises from a combination of rapidly increasing data flows generated by organizations and the Internet, and the fact that the huge size of these data sets greatly widens the scope for prediction and analysis. A key milestone was the release of Apache Hadoop in 2005. Hadoop is an open source project based on two seminal papers produced by Google: The Google File System (2003) and MapReduce: Simplified Data Processing on Large Clusters (2004). These two papers discuss the two key components that make up Hadoop: a distributed file system and MapReduce functional computations. Now it seems that whenever someone says “Big Data” they are probably referring to computation using Hadoop.

This course may be taken individually (one-off) or as part of a certificate program.

Course Program:

WEEK 1: A Distributed Computing Environment

The first week is all about getting to know Hadoop and getting set up to develop MapReduce jobs in a development environment.  This task by itself is not particularly easy, but it is crucial to getting started with Hadoop.

  • Introduce Hadoop, its motivations and core concepts
  • Discover HDFS and MapReduce and their roles
  • NameNodes, JobTrackers, and DataNodes (The Hadoop Anatomy)
  • Learn about the other applications in the Hadoop Ecosystem
  • Get a development environment set up

WEEK 2: Working with Hadoop

In week 2, we’ll explore how to use the Hadoop Filesystem to load and manage data. We’ll also learn the data flow of Hadoop jobs and execute some simple, pre-built jobs. 

  • Introduce the Hadoop Filesystem
  • Learn how to read and write data to HDFS
  • Learn data flow in Hadoop Jobs
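
As a rough illustration of the data flow in a Hadoop job, the sketch below simulates the three phases (map, shuffle and sort, reduce) in plain Python on a single machine. It is not course material and does not use Hadoop itself; the `run_job` helper, the "sensor_id,reading" record format, and the sample data are all hypothetical.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate the map -> shuffle/sort -> reduce data flow in memory."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    mapped = []
    for record in records:
        mapped.extend(mapper(record))

    # Shuffle and sort phase: collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce phase: each key and its list of values yields output pairs.
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


def max_mapper(record):
    # Hypothetical record format: "sensor_id,reading"
    sensor, reading = record.split(",")
    yield sensor, float(reading)


def max_reducer(key, values):
    # Keep only the largest reading seen for each sensor.
    yield key, max(values)


records = ["a,1.5", "b,0.2", "a,3.0", "b,0.9"]
result = run_job(records, max_mapper, max_reducer)
print(result)
```

On a real cluster the map and reduce phases run in parallel across many machines and the shuffle moves data over the network, but the logical flow is the same.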

WEEK 3: Computing with MapReduce

We’ll kick week three off with a discussion of MapReduce programming, and write our first MapReduce jobs to execute on our Hadoop cluster. This is where the rubber meets the road, and we’ll use Hadoop Streaming and the language of your choice to develop simple analytics. 

  • Functional programming with Mappers and Reducers
  • A sample MapReduce Algorithm
  • Mappers and Reducers in Detail
  • Running MapReduce jobs
  • Hadoop Streaming
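
To give a flavor of the Hadoop Streaming style, here is a minimal word-count sketch in Python. Streaming passes data to the mapper and reducer as lines of text, conventionally with a tab between key and value, and the reducer receives its input sorted by key. The function names and the sample text are hypothetical, and the sort step is simulated locally with `sorted` rather than performed by Hadoop.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word<TAB>1" line for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the job: map, sort by key, then reduce.
text = ["Hadoop streams lines", "lines in lines out"]
mapped = sorted(map_lines(text))
counts = list(reduce_lines(mapped))
for line in counts:
    print(line)
```

Because both functions accept any iterable of lines, in an actual Streaming job each could live in its own script and be fed `sys.stdin`, printing its output lines to standard output.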

WEEK 4: Towards Last Mile Computation

In the last section we’ll discuss how to use Hadoop to transform large data sets into a more manageable computational size. We’ll talk about workflows towards last mile computation, filtering, searching, and aggregating, as well as writing some more MapReduce jobs.

  • Hadoop workflows with Python
  • Utilizing your programming environment with Hadoop Streaming
  • Filtering, Aggregating and Searching
  • Intro to advanced topics in Hadoop
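
The idea of filtering and aggregating a large data set down to something small enough for "last mile" analysis can be sketched as follows. This is only an illustration under stated assumptions: the "region,amount" record format, the threshold, and the function names are hypothetical, and the example runs locally rather than on a cluster.

```python
from itertools import groupby


def filter_mapper(lines, min_amount):
    """Filter: keep records at or above min_amount, emit (region, amount)."""
    for line in lines:
        region, amount = line.strip().split(",")
        amount = float(amount)
        if amount >= min_amount:
            yield region, amount


def summary_reducer(pairs):
    """Aggregate sorted (region, amount) pairs into (region, count, total)."""
    for region, group in groupby(pairs, key=lambda pair: pair[0]):
        amounts = [amount for _, amount in group]
        yield region, len(amounts), sum(amounts)


# Hypothetical raw records in "region,amount" form.
raw = ["east,10.0", "west,2.0", "east,30.0", "west,50.0", "east,1.0"]
pairs = sorted(filter_mapper(raw, min_amount=5.0))
summary = list(summary_reducer(pairs))

# Last mile: the summary is small enough to analyze in memory.
averages = {region: total / count for region, count, total in summary}
print(averages)
```

The reducer emits counts and totals rather than averages so that partial results can be combined correctly; the final division happens once, on the small summary.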



In addition to assigned readings, this course also has example code and supplemental readings available online.

Be sure you meet all of the minimum requirements before you register.


Course Fee: $549

Tuition Savings:  When you register online for 3 or more courses, $200 is automatically deducted from the total tuition. (This offer cannot be combined and is only applicable to courses of 3 weeks or longer.)


Have you reviewed the REQUIREMENTS for this course?

Add a $50 service fee if you require a prior invoice, or if you need to submit a purchase order or voucher, pay by wire transfer or EFT, or refund and reprocess a prior payment. Please use the printed registration form for these and other special orders.

Courses may fill up at any time and registrations are processed in the order in which they are received. Your registration will be confirmed for the first available course date, unless you specify otherwise.

Who Should Take This Course:

Data scientists and statisticians with programming experience who need to deal with large data sets and want to learn about Hadoop's distributed computing capability.



These are listed for your benefit, so you can determine for yourself whether you have the needed background, either from taking the listed courses or from other experience.
1. Command line experience on Linux, to manage system processes, find appropriate files and set permissions.
2. Familiarity with R or another programming language, for the last week, when we use Hadoop Streaming to do computations.
3. See the "Software" section below.

Organization of the Course:

This course takes place online at the Institute for 4 weeks. During each course week, you participate at times of your own choosing - there are no set times when you must be online. Course participants will be given access to a private discussion board. In class discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.

The course typically requires 15 hours per week. At the beginning of each week, you receive the relevant material, in addition to answers to exercises from the previous session. During the week, you are expected to go over the course materials, work through exercises, and submit answers. Discussion among participants is encouraged. The instructor will provide answers and comments, and at the end of the week, you will receive individual feedback on your homework answers.

Students come to the Institute for a variety of reasons. As you begin the course, you will be asked to specify your category:
  1. You may be interested only in learning the material presented, and not be concerned with grades or a record of completion.
  2. You may be enrolled in PASS (Programs in Analytics and Statistical Studies) that requires demonstration of proficiency in the subject, in which case your work will be assessed for a grade.
  3. You may require a "Record of Course Completion," along with professional development credit in the form of Continuing Education Units (CEUs).  For those successfully completing the course, 5.0 CEUs and a record of course completion will be issued by The Institute upon request.

Course Text:

Please read the two papers produced by Google mentioned above: The Google File System (2003) and MapReduce: Simplified Data Processing on Large Clusters (2004).

Recommended text:  Hadoop: The Definitive Guide, 3rd ed., by Tom White (O'Reilly Media).  Optional readings will be assigned from this reference.

Required readings will be provided as PDF documents in the course.


Software:

The required software is Apache Hadoop and R.  Familiarity with Linux is required.  IMPORTANT:  Please continue reading below for configuration information.

Hadoop and Virtual Machines

Hadoop developers often use a “Single Node Cluster” for development tasks. This is often a virtual machine running a virtual server environment, which runs the various Hadoop daemons. You can access this VM with SSH from your main development box, just as you would access a Hadoop cluster. In order to create a virtual environment, you need virtualization software such as VirtualBox, VMware, or Parallels.

The installation instructions discuss how to set up an Ubuntu x64 virtual machine, and the course provides a preconfigured one for use with VMware or VirtualBox. If you’d like to use the preconfigured virtual machine instead of setting up your own, you will be able to download it from the Resources section in the course.  Note that you will need a 64-bit machine in any case.

Software Development

The native API for Hadoop is written in Java, and in order to best serve the class I will discuss code in Java in the second and third weeks. To that end, you’ll need a tool to develop and compile Java. The best known are Eclipse and NetBeans, as well as a popular professional IDE, IntelliJ IDEA.

In the last week of the course we’ll use R to perform MapReduce jobs using the Hadoop Streaming interface. You will also need the RHadoop packages.

SSH on Windows

If you’re on Windows, you’ll need a client called PuTTY to SSH into your VM. On Mac or Linux, you’ll be fine using SSH from the terminal. Note that this class does not cover command line usage, SSH, or virtual machine setup. The best place to ask for help on these topics will be in the forums, and if you’re an expert on these topics, please help your fellow classmates as well!

© statistics.com 2004-2014