Hadoop: Hive, Sqoop and Spark
Taught by Ms. Jenny Kim


Aim of Course:

In this online course, "Hadoop: Hive, Sqoop and Spark," you will expand on the topics from the "Introduction to Analytics using Hadoop" course and learn how to use higher-order tools in the Hadoop ecosystem and the Spark computing platform to perform data analysis and implement machine learning patterns on data at scale. In this course, you will learn about:

  • The software components of the Hadoop Ecosystem
  • Data loading, warehousing and manipulation with HBase, Hive, and Sqoop
  • Data aggregation and designing data workflows with Pig and Spark
  • Machine learning and data mining with Spark's MLlib library

This course may be taken individually (one-off) or as part of a certificate program.


Course Program:

WEEK 1: The Hadoop Ecosystem and Data Warehousing and Manipulation pt. 1

  • Review the basic installation and configuration of Hadoop in single-node, pseudo-distributed mode
  • Structured data querying and warehousing with Hive

 

WEEK 2: Data Warehousing and Manipulation pt. 2

  • Working with Hadoop’s NoSQL database HBase
  • Accessing Relational Data with Sqoop

 

WEEK 3: Higher-Order Hadoop Programming

  • Data processing flows with Pig
  • Fast, in-memory big-data processing with Spark's Python API
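
Spark's Python API expresses jobs as chains of transformations such as map, flatMap, and reduceByKey. That functional style can be previewed with Python's built-ins; the sketch below is a plain-Python word count (no Spark installation assumed), mirroring the shape of a typical PySpark pipeline:

```python
from collections import Counter

def word_count(lines):
    """Count words across lines, in the flatMap -> map -> reduceByKey
    style that a PySpark job would use, but locally and in memory."""
    # 'flatMap': split each line into a flat stream of words
    words = (word for line in lines for word in line.split())
    # 'map' + 'reduceByKey': pair each word with 1, then sum counts per word
    return dict(Counter(words))

lines = ["spark makes big data fast", "big data big insights"]
print(word_count(lines))
```

In actual PySpark, the same pipeline would be distributed across a cluster rather than run on a single in-memory generator, but the transformation chain looks much the same.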

 

WEEK 4: Machine Learning and Data Mining

  • Introduction to Data Mining and Machine Learning
  • Building a Machine Learning system with Spark's MLlib
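
MLlib exposes machine learning through a train-then-predict pattern: build a model from a distributed dataset, then query it. The toy one-dimensional k-means below (plain Python, with hypothetical function names, no Spark required) illustrates the iterative training loop that MLlib scales out across a cluster:

```python
def kmeans_1d(points, k, iters=10):
    """Toy 1-D k-means: the kind of iterative training step that
    Spark's MLlib distributes across an RDD of points."""
    centers = sorted(points)[:k]  # naive initialization: k smallest points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step (a 'map' over the data)
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # update step (a 'reduce' per cluster): move centers to cluster means
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def predict(centers, p):
    """Return the index of the nearest center, like a model's predict()."""
    return min(range(len(centers)), key=lambda i: abs(p - centers[i]))

centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], k=2)
```

The course covers MLlib's own implementations, which follow this same fit/predict shape but operate on data partitioned across many machines.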


Homework:

The homework in this course consists of short-answer questions to test concepts, guided exercises in writing code, and guided data analysis problems using software.

Each week of the course also includes example code, supplemental readings available online, and video lectures.


Who Should Take This Course:
Data scientists and statisticians who are familiar with Hadoop fundamentals, have programming experience, and want to learn how to process and analyze large data sets with Hadoop's distributed computing capabilities and ecosystem components.
Level:
Intermediate/Advanced
Prerequisite:
  1. Big Data Computing with Hadoop or equivalent familiarity with Hadoop and its core components

  2. Strong understanding of MapReduce and the MapReduce API

  3. Intermediate familiarity with Python preferred

  4. “SQL and R: Introduction to Database Queries” or the equivalent familiarity with SQL and query languages

  5. Basic knowledge of operating systems (UNIX/Linux)

Organization of the Course:

This course takes place online at the Institute for 4 weeks. During each course week, you participate at times of your own choosing; there are no set times when you must be online. Course participants are given access to a private discussion board. In discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.

At the beginning of each week, you receive the relevant material, in addition to answers to exercises from the previous session. During the week, you are expected to go over the course materials, work through exercises, and submit answers. Discussion among participants is encouraged. The instructor will provide answers and comments, and at the end of the week, you will receive individual feedback on your homework answers.

Time Requirement:
About 15 hours per week, at times of your choosing.

Options for Credit and Recognition:
Students come to the Institute for a variety of reasons. As you begin the course, you will be asked to specify your category:
  1. No credit - You may be interested only in learning the material presented, and not be concerned with grades or a record of completion.
  2. Certificate - You may be enrolled in PASS (Programs in Analytics and Statistical Studies), which requires demonstration of proficiency in the subject, in which case your work will be assessed for a grade.
  3. CEUs and/or proof of completion - You may require a "Record of Course Completion," along with professional development credit in the form of Continuing Education Units (CEUs). For those successfully completing the course, CEUs and a record of course completion will be issued by the Institute upon request.
  4. Other options - Statistics.com Specializations, INFORMS CAP recognition, and academic (college) credit are available for some Statistics.com courses.
Course Text:

Required readings will be provided as PDF documents in the course.

Recommended texts:

Hadoop: The Definitive Guide, 3rd ed., by Tom White (O'Reilly Media).  Optional readings will be assigned from this reference.

Java Resources:

Head First Java, 2nd ed., by Kathy Sierra and Bert Bates (O’Reilly Media).  Good introductory book on Java.

Effective Java, 2nd ed., by Joshua Bloch (Addison-Wesley).  Excellent book for those familiar with Java but looking for insights into best practices and effective Java patterns.

Software:

The required software is Apache Hadoop and Java JDK 7. Familiarity with Linux is required. IMPORTANT: Please continue reading below for configuration information.

Hadoop and Virtual Machines

Hadoop developers often perform development tasks on a "single-node cluster." This is typically a virtual machine running a server environment that hosts the various Hadoop daemons. You can access this VM over SSH from your main development machine, just as you would access a real Hadoop cluster. To create a virtual environment, you need virtualization software such as VirtualBox, VMware, or Parallels.

VirtualBox with an Ubuntu VM is used in the examples within the course material.

The installation instructions discuss how to set up an Ubuntu x64 virtual machine, and the course provides a preconfigured one for use with VMware or VirtualBox. If you’d like to use the preconfigured virtual machine instead of setting up your own, you can download it from the Resources section in the course. Note that you will need a 64-bit machine in any case.
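
Since the VM requires a 64-bit host, a quick way to check your machine from Python (the names below are from the standard library's `platform` module):

```python
import platform

# architecture()[0] reports the Python build ('32bit' or '64bit');
# machine() reports the CPU architecture (e.g. 'x86_64' on 64-bit hosts).
print(platform.architecture()[0], platform.machine())
```

If this reports a 32-bit build on hardware you believe is 64-bit, check whether a 32-bit OS or interpreter is installed.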

Software Development

Hadoop's native API is written in Java, so you will need a tool to develop and compile Java code. The best-known IDEs are Eclipse and NetBeans, along with the popular professional IDE IntelliJ IDEA.

SSH on Windows

If you’re on Windows, you’ll need an SSH client such as PuTTY to connect to your VM; on Mac or Linux you can use SSH from the terminal. Note that this class does not cover command-line usage, SSH, or virtual machine setup. The best place to ask for help on these topics is the forums, and if you’re an expert on these topics, please help your fellow classmates as well!

Instructor(s):

Dates:

To be scheduled.


Course Fee: $549


This course may be scheduled on a contract basis. Please contact ourcourses@statistics.com to arrange.


Add a $50 service fee if you require a prior invoice, need to submit a purchase order or voucher, pay by wire transfer or EFT, or need to refund and reprocess a prior payment. Please use the printed registration form for these and other special orders.

Courses may fill up at any time and registrations are processed in the order in which they are received. Your registration will be confirmed for the first available course date, unless you specify otherwise.

The Institute for Statistics Education is certified to operate by the State Council of Higher Education in Virginia (SCHEV).
