Introduction to Analytics using Hadoop

Instructor(s):

Benjamin Bengfort

Dates:

  • March 27, 2015 to April 24, 2015
  • October 30, 2015 to November 27, 2015
  • March 25, 2016 to April 22, 2016
  • October 28, 2016 to November 25, 2016
  • March 24, 2017 to April 21, 2017
  • October 27, 2017 to November 24, 2017
  • March 23, 2018 to April 20, 2018
  • October 26, 2018 to November 23, 2018


This course has been evaluated by the American Council on Education (ACE) and is recommended for the upper-division baccalaureate degree category, 3 semester hours in computer science or programming. ACE CREDIT helps adults gain access to academic credit at colleges and universities for formal courses taken outside traditional higher education. ACE CREDIT's College and University Network is a group of more than 2,200 higher education institutions that consider ACE credit recommendations for transfer to degree programs. Note: The decision to accept specific credit recommendations is up to each institution.
 
This course can help candidates prepare for the Institute for Operations Research and the Management Sciences (INFORMS) Certified Analytics Professional (CAP®) exam, or help CAP® analysts accrue acceptable Professional Development Units to maintain their certification.

Aim of Course:

In this online course, “Introduction to Analytics using Hadoop,” analytics professionals will be introduced to Hadoop and Spark, and provided with an exemplar workflow for using Hadoop. They will also be introduced to writing Spark and MapReduce jobs, and to leveraging Hadoop Streaming to carry out work in an analytics programming language such as Python.  In this course you will learn:

  1. What Hadoop is and how to leverage it to perform analytics
  2. The software components of the Hadoop Ecosystem
  3. How to manage data on a distributed file system
  4. How to write MapReduce jobs to perform computations with Hadoop
  5. How to use Hadoop Streaming to run jobs in the language of your choice

Background - "Big Data"

The term “Big Data” has come into vogue to refer not just to data volume, but also to an exciting new set of applications and techniques that are powering modern applications and whose novelty seems to be changing the way the world is computing. In most cases, the "end game" is the application of well-known statistical and machine-learning techniques. However, modern distributed computation techniques are allowing the analysis of data sets far larger than those that could typically be analyzed in the past.

The need for distributed computing arises from a combination of rapidly increasing data flows generated by organizations and by the Internet, and the fact that the huge size of these data sets greatly widens the scope for prediction and analysis. A key milestone was the release of Apache Hadoop in 2005. Hadoop is an open source project based on two seminal papers produced by Google: The Google File System (2003) and MapReduce: Simplified Data Processing on Large Clusters (2004). These two papers describe the two key components that make up Hadoop: a distributed file system and MapReduce functional computations. Now it seems that whenever someone says “Big Data,” they are probably referring to computation using Hadoop.
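
To make the MapReduce model concrete, here is a minimal sketch in plain Python (no Hadoop required; the function names are our own invention, not part of any Hadoop API) that simulates the map, shuffle, and reduce phases for a word count:

    from itertools import groupby
    from operator import itemgetter

    # Illustrative only: simulates MapReduce's three phases in memory.
    # Real Hadoop distributes these same steps across a cluster.

    def mapper(line):
        # Map: emit a (word, 1) pair for every word in one line of input.
        for word in line.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        # Shuffle: group intermediate pairs by key, as Hadoop does
        # between the map and reduce phases.
        return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

    def reducer(word, pairs):
        # Reduce: sum the counts collected for one word.
        return (word, sum(count for _, count in pairs))

    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    intermediate = (pair for line in lines for pair in mapper(line))
    for word, group in shuffle(intermediate):
        print(reducer(word, group))
    # ('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)

Real Hadoop performs the same three phases, but spreads the map and reduce work across the machines of a cluster and the shuffle across the network.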

Here's an excellent introduction to Spark, the newest component of the Hadoop ecosystem.

This course may be taken individually (one-off) or as part of a certificate program.

Course Program:

WEEK 1: A Distributed Computing Environment

The first week is all about getting to know Hadoop and getting set up to develop MapReduce jobs on a development environment.  This task by itself is not particularly easy, but is crucial to getting started with Hadoop.

  • Introduce Hadoop, its motivations and core concepts
  • Discover HDFS and MapReduce and their roles
  • NameNodes, JobTrackers, and DataNodes (The Hadoop Anatomy)
  • Learn about the other applications in the Hadoop Ecosystem
  • Get a development environment set up


WEEK 2: Working with Hadoop

In week 2, we’ll explore how to use the Hadoop Filesystem to load and manage data. We’ll also learn the data flow of Hadoop jobs and execute some simple, pre-built jobs. A short, illustrative sketch of basic HDFS operations follows the topic list below.

  • Introduce the Hadoop Filesystem
  • Learn how to read and write data to HDFS
  • Learn data flow in Hadoop Jobs
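
As a taste of what's ahead, here is a minimal sketch, with purely hypothetical paths and file names, of driving the standard hadoop fs shell commands from Python to load data into HDFS and read it back:

    import subprocess

    # A minimal sketch (all paths and file names here are hypothetical) of
    # driving the standard `hadoop fs` shell commands from Python.

    def hdfs(*args):
        # Run one `hadoop fs` subcommand and return its output as text.
        return subprocess.check_output(["hadoop", "fs"] + list(args)).decode()

    hdfs("-mkdir", "-p", "/user/analyst/input")            # create an HDFS directory
    hdfs("-put", "local_data.txt", "/user/analyst/input")  # copy a local file into HDFS
    print(hdfs("-ls", "/user/analyst/input"))              # list what landed there
    print(hdfs("-cat", "/user/analyst/input/local_data.txt"))  # stream the file back

The same four operations can, of course, be typed directly at the Linux command line; the course covers the hadoop fs commands in detail.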


WEEK 3: Computing with MapReduce

We’ll kick week three off with a discussion of MapReduce programming, and write our first MapReduce jobs to execute on our Hadoop cluster. This is where the rubber meets the road, and we’ll use Hadoop Streaming and the language of your choice to develop simple analytics; a word-count sketch follows the topic list below.

  • Functional programming with Mappers and Reducers
  • A sample MapReduce Algorithm
  • Mappers and Reducers in Detail
  • Running MapReduce jobs
  • Hadoop Streaming
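
Hadoop Streaming lets any program that reads stdin and writes stdout serve as a mapper or reducer. As a preview, here is a hedged sketch of the classic word-count pair of scripts in Python (the file names are our own choices):

    #!/usr/bin/env python
    # mapper.py -- emit "word<TAB>1" for every word read from stdin.
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

    #!/usr/bin/env python
    # reducer.py -- Streaming sorts mapper output by key, so equal words
    # arrive as consecutive lines; total each run and emit "word<TAB>count".
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

You can dry-run the pair locally with no cluster at all: cat input.txt | python mapper.py | sort | python reducer.py. On the cluster, the same two scripts are submitted through the hadoop-streaming JAR (its exact path depends on your installation).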


WEEK 4: Towards Last Mile Computation

In the last week we’ll discuss how to use Hadoop to transform large data sets into a more manageable computational size. We’ll talk about workflows for last mile computation, filtering, searching, and aggregating, as well as writing some more MapReduce jobs; a filtering sketch follows the topic list below.

  • Hadoop workflows with Python
  • Utilizing your programming environment with Hadoop Streaming
  • Filtering, Aggregating and Searching
  • Intro to advanced topics in Hadoop
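
As an illustration of the filtering step, here is a sketch of a map-only Streaming job; the three-column record layout is invented for the example:

    #!/usr/bin/env python
    # filter_mapper.py -- "last mile" reduction: keep only the records of
    # interest so the surviving data set is small enough to analyze locally.
    # The tab-separated layout (timestamp, status, latency_ms) is hypothetical.
    import sys
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3 and fields[1] == "ERROR":
            print("\t".join(fields))

Run with zero reducers, a job like this simply writes the matching records back to HDFS, shrunk to a size that can be pulled down and analyzed with conventional tools in Python.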


HOMEWORK

In addition to assigned readings, this course also has example software code and supplemental readings available online.

Course Fee: $549

Do you meet the course prerequisites? What about the book and software? (See the Prerequisite and Software sections below.)

Tuition Savings:  When you register online for 3 or more courses, $200 is automatically deducted from the total tuition. (This offer cannot be combined with other offers and is only applicable to courses of 3 weeks or longer.)



Add a $50 service fee if you require a prior invoice, or if you need to submit a purchase order or voucher, pay by wire transfer or EFT, or refund and reprocess a prior payment. Please use the printed registration form for these and other special orders.

Courses may fill up at any time and registrations are processed in the order in which they are received. Your registration will be confirmed for the first available course date, unless you specify otherwise.


Who Should Take This Course:

Data scientists and statisticians with programming experience who need to deal with large data sets and want to learn about Hadoop's distributed computing capability should take Introduction to Analytics using Hadoop. This course is particularly suited to data scientists who need to access and analyze large amounts of unstructured or semi-structured data that do not fit well into traditional relational databases.

Level:

Intermediate

Prerequisite:
These are listed for your benefit so you can determine for yourself whether you have the needed background, whether from taking the listed courses or from other experience.
  1. Command line experience on Linux, to manage system processes, find appropriate files and set permissions.

  2. Familiarity with Python or another programming language, to leverage Hadoop Streaming to perform computations.

  3. See the "Software" section below.

Organization of the Course:

This course takes place online at the Institute for 4 weeks. During each course week, you participate at times of your own choosing; there are no set times when you must be online. Course participants will be given access to a private discussion board. In class discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.

At the beginning of each week, you receive the relevant material, in addition to answers to exercises from the previous session. During the week, you are expected to go over the course materials, work through exercises, and submit answers. Discussion among participants is encouraged. The instructor will provide answers and comments, and at the end of the week, you will receive individual feedback on your homework answers.

Time Requirement: about 15 hours per week, at times of your choosing.


Credit:
Students come to the Institute for a variety of reasons. As you begin the course, you will be asked to specify your category:
  1. You may be interested only in learning the material presented, and not be concerned with grades or a record of completion.
  2. You may be enrolled in PASS (Programs in Analytics and Statistical Studies), which requires demonstration of proficiency in the subject, in which case your work will be assessed for a grade.
  3. You may require a "Record of Course Completion," along with professional development credit in the form of Continuing Education Units (CEUs).  For those successfully completing the course, 5.0 CEUs and a record of course completion will be issued by The Institute, upon request.

Course Text:

Please read the aforementioned papers produced by Google: The Google File System (2003) and MapReduce: Simplified Data Processing on Large Clusters (2004).

Recommended text:  Hadoop: The Definitive Guide, 3rd ed., by Tom White (O'Reilly Media).  Optional readings will be assigned from this reference.

Required readings will be provided as PDF documents in the course.

Software:

The required software is Apache Hadoop and Python.  Familiarity with Linux is required, and we will be using a virtual machine (VM) to make things easier.  You will need a 64-bit computer. 

Before the course starts we recommend that you:

1.  Install virtualization software so you can run the VM in the course.  We recommend VirtualBox, which is free; VMware or Parallels are also possible.  Our technology supervisor, Dr. Stan Blank, will monitor a discussion board in our Learning Management System 4 days prior to the course start to provide assistance.

2.  Download the pre-configured Virtual Machine (VM) that will be used in the course.  Note that you will receive a preview error message.  This is OK.  Click on the download button below the message. 

3.  Using the downloaded VM, which includes Linux, brush up on your command-line Linux.  If you really need to re-learn Linux, or learn it in the first place, you should allow several weeks to do this on your own before the course starts.  For help, see The Command Line Crash Course.

4.  Make sure you have Python available and a text editor such as Sublime Text to write your code. Once the course opens, we will be working with Python to execute jobs via Hadoop Streaming.  There are several frameworks available to assist writing Hadoop jobs in Python, which will be discussed during the course.

If you know Java or other languages...
  • Those with experience in Java may use the native API to implement MapReduce jobs, but the class will focus on Hadoop Streaming.  To access more advanced functionality you’ll need a tool to develop and compile Java. The best known are Eclipse and NetBeans, as well as the popular professional IDE IntelliJ IDEA. For this course, however, this is completely optional.
  • For the programming work, any programming language that accepts data from stdin and writes to stdout (R, Ruby, Perl, etc.) can be used, but all examples and pseudo-code will be in Python. 

If you prefer to set up your own VM ...

  • If you prefer not to download the pre-configured VM, instructions in the course will describe how to set up an Ubuntu x64 virtual machine.

 

 

