Natural Language Processing Using NLTK

taught by Nitin Indurkhya


Aim of Course:

After taking "Natural Language Processing using NLTK", you will be equipped to introduce natural language processing (NLP) processes into your projects and software applications.

Computational linguistics and the related field of natural language processing (NLP) are widely used in software applications, analytics, and other contexts where humans communicate via machines. Modern NLP in a software context requires the combination of machine learning with linguistic knowledge. In order for a piece of software to take advantage of NLP, a framework and a level of computational sturdiness are required.

In this course you will use Python and a module called NLTK (the Natural Language Toolkit) to perform natural language processing on medium-sized text corpora. NLTK provides analysts, software developers, researchers, and students with cutting-edge linguistic and machine learning tools that are on par with traditional NLP frameworks. Because it is Python, it comes with "batteries included" and is available on most systems, providing a low barrier to entry for language processing in general and, in particular, allowing you to quickly and easily analyze text data in larger applications.

In the first section of this course, we will review linguistic analyses that you will be familiar with after learning language models in the Introduction to Natural Language Processing course (or the equivalent): tokenization, part-of-speech tagging, stemming and lemmatizing, and creating n-gram language models. This sets the stage for a more application-oriented look at NLP in the real world; we will also cover information extraction and text classification, and introduce other natural language analytics.

This course may be taken individually (one-off) or as part of a certificate program.
Course Program:

WEEK 1: Natural Language Processing with Python

  • Tokenization and segmentation
  • Part of speech tagging
  • Stemming and lemmatizing
  • N-Gram chunking

After this lesson, you should have a basic familiarity with the language tools available in NLTK, and should be able to apply those tools to your own corpora as well as practice with the corpora included with NLTK.
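
To give a rough sense of two of the Week 1 topics, the following is a minimal plain-Python sketch of tokenization and n-gram extraction. It is not NLTK's implementation: the function names `tokenize` and `ngrams` and the regular expression are assumptions chosen for illustration. NLTK ships its own, more capable tools for these tasks (for example `nltk.word_tokenize` and `nltk.util.ngrams`).

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens using a simple regular expression.

    Illustrative only: this keeps letters, digits, and simple contractions,
    and drops punctuation.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams over a token list, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = tokenize("The quick brown fox jumps over the lazy dog.")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams[:3])
```

Counting how often each n-gram occurs in a corpus is the starting point for the n-gram language models mentioned in the course description.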

WEEK 2: Information Extraction

  • Managing Linguistic Data
  • Information Extraction and Chunking
  • Named Entity Recognition and Classification
  • Entity Resolution and Disambiguation

After this lesson, you should be able to create language models from a corpus that you have constructed and compute significant collocations from the text. You will also be able to use NLTK's built-in NERC classifier to extract entities from documents.
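
One common way to score collocations is pointwise mutual information (PMI), which compares how often two words occur together with how often they would co-occur by chance. The sketch below is a plain-Python illustration under that assumption; the function name `pmi_collocations`, the `min_count` threshold, and the toy token list are hypothetical and not taken from the course materials. NLTK provides its own collocation tools in `nltk.collocations`.

```python
import math
from collections import Counter


def pmi_collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Pairs seen fewer than min_count times are ignored, because PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest PMI first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


toy_tokens = ["new", "york", "is", "big", "new", "york", "is", "busy", "is", "it"]
ranked = pmi_collocations(toy_tokens)
for pair, score in ranked:
    print(pair, round(score, 3))
```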

WEEK 3: Classification of Text

  • Decision Tree Classifiers
  • Naive Bayes Classifiers
  • Maximum Entropy Classifiers
  • Document Classification

After this week you will be able to train classifiers using both Naive Bayes and Maximum Entropy, determine the accuracy of a classifier using cross-validation, and employ these classifiers effectively in a variety of settings.
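
As a rough illustration of the Week 3 workflow, here is a minimal plain-Python sketch of a multinomial Naive Bayes text classifier with add-one smoothing, evaluated by k-fold cross-validation. It is an assumption-laden toy, not the course's code or NLTK's API: the function names (`train_nb`, `classify_nb`, `cross_validate`) and the six-document data set are invented for this example. In NLTK itself, the corresponding classifier is `nltk.NaiveBayesClassifier`.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Collect the counts needed for Naive Bayes from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify_nb(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, k):
    """Return mean accuracy over k folds, each held out once for testing."""
    folds = [docs[i::k] for i in range(k)]
    accuracies = []
    for i, held_out in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model = train_nb(train)
        correct = sum(classify_nb(model, toks) == label for toks, label in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Hypothetical toy data, for illustration only.
docs = [
    ("great fun movie".split(), "pos"),
    ("boring dull movie".split(), "neg"),
    ("great acting fun plot".split(), "pos"),
    ("dull plot boring acting".split(), "neg"),
    ("fun great story".split(), "pos"),
    ("boring dull story".split(), "neg"),
]
print("mean accuracy:", cross_validate(docs, 3))
```

On real corpora the folds are usually shuffled first, and accuracy is reported alongside other measures such as precision and recall.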

WEEK 4: Topics in Natural Language Processing

  • Syntactic Parsing
  • Topic Modeling
  • NLTK and Hadoop

At the end of this week you should have a better understanding of how to develop natural language applications and how to perform meaningful analyses with text. After the course you should have a good understanding of the many tools used in NLTK to perform language analyses.


The homework in this course consists of short-answer questions to test concepts, guided exercises in writing code, and guided data analysis problems using software.

This course also has example code and supplemental readings available online.


Who Should Take This Course:

Python developers who have a need for language-aware data products in their applications. Computational linguists who want to use a fast and easy tool for doing language analyses. Anyone interested in the machine learning aspects of text and software development. This class is especially good for students who may have language requirements in their software development.

Intermediate/Advanced
  • Python (you should be proficient at Python programming, and comfortable at creating and executing Python programs that use third party libraries)
  • Natural Lang. Processing

Organization of the Course:

This course takes place online at the Institute for 4 weeks. During each course week, you participate at times of your own choosing - there are no set times when you must be online. Course participants will be given access to a private discussion board. In class discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.

At the beginning of each week, you receive the relevant material, in addition to answers to exercises from the previous session. During the week, you are expected to go over the course materials, work through exercises, and submit answers. Discussion among participants is encouraged. The instructor will provide answers and comments, and at the end of the week, you will receive individual feedback on your homework answers.

Time Requirement:
About 15 hours per week, at times of your choosing.

Options for Credit and Recognition:
Students come to the Institute for a variety of reasons. As you begin the course, you will be asked to specify your category:
  1. No credit - You may be interested only in learning the material presented, and not be concerned with grades or a record of completion.
  2. Certificate - You may be enrolled in PASS (Programs in Analytics and Statistical Studies) that requires demonstration of proficiency in the subject, in which case your work will be assessed for a grade.
  3. CEUs and/or proof of completion - You may require a "Record of Course Completion," along with professional development credit in the form of Continuing Education Units (CEUs). For those successfully completing the course, CEUs and a record of course completion will be issued by The Institute upon request.
  4. Other options - Specializations, INFORMS CAP recognition, and academic (college) credit are available for some courses.
Specializations are an easy way for you to demonstrate mastery of a specific skill in statistics and analytics. This course is part of the Text Mining and Analytics Specialization, which gives a deep dive into text mining, natural language processing, and sentiment analysis. The specialization requires Python and some familiarity with Bayesian statistics. Take any three of the four courses on this topic (this course, plus the courses listed under "related courses," not including conferences). For savings, use the promo code "text-specialization" and register for all three courses at once for $1197 ($399 per course, not combinable with other tuition savings). If you register for all four, you'll still receive the discounted rate.

This course is also recognized by the Institute for Operations Research and the Management Sciences (INFORMS) as helpful preparation for the Certified Analytics Professional (CAP®) exam, and can help CAP® analysts accrue Professional Development Units to maintain their certification.
Course Text:
The required text is Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. The course text is available as a free book online or for purchase as a print or eBook from O'Reilly. This book is considered the definitive guide to NLP with Python because of its comprehensive coverage of NLTK and language processing in general, and because the authors are also the creators of and primary contributors to NLTK.

NLTK is a leading platform for building Python programs to work with human language data.

To follow along with the reading and complete the assignments, you'll need to have Python and NLTK installed on your system. Please also make sure that you download the entire NLTK data set to a directory that is accessible by NLTK. For instructions on installing the software, please read the following links:

  1. Installing NLTK:
  2. Installing Data:

Other third-party packages may need to be installed alongside NLTK: for example, you will need matplotlib to draw graphs and NetworkX for network analysis. These libraries can be easily installed if you have pip on your system. We recommend "How to Develop Quality Python Code" for a discussion of how to be effective with Python, particularly when integrating third-party tools into your Python projects.
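
Before starting the assignments, it can help to confirm that the packages named above are importable. The following is a small sketch using only the Python standard library; the helper name `missing_packages` is hypothetical, and the script only reports what is missing rather than installing anything.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the packages mentioned in this section.
missing = missing_packages(["nltk", "matplotlib", "networkx"])
if missing:
    print("Not found. Try: pip install " + " ".join(missing))
else:
    print("All packages found.")
```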




September 15, 2017 to October 13, 2017
September 14, 2018 to October 12, 2018

Course Fee: $549

Academic affiliation? You may be eligible for a discount at checkout.

Take 3 or 4 text analytics courses, save $100 per course (code text2016).

Add a $50 service fee if you require a prior invoice, or if you need to submit a purchase order or voucher, pay by wire transfer or EFT, or refund and reprocess a prior payment. Please use the printed registration form for these and other special orders.

Courses may fill up at any time and registrations are processed in the order in which they are received. Your registration will be confirmed for the first available course date, unless you specify otherwise.

The Institute for Statistics Education is certified to operate by the State Council of Higher Education in Virginia (SCHEV).
