
Word of the Week – Entity Extraction

In Natural Language Processing (our course on the subject starts Jan 15), entity extraction is the process of identifying chunks of text and labeling them as entities (e.g., people or organizations). Consider this phrase from the blog on close elections linked above:

“the tie was not between Jefferson and Adams, but between Jefferson and Aaron Burr, who was also a Democratic Republican…” 

Suppose our interest is in identifying and extracting text that represents people and political parties. An entity extraction algorithm typically operates as a machine learning classifier, classifying each word as a person, a political party, or other. Within the person and political party classes, a word is further labeled as either the beginning word of an entity (e.g. Aaron) or a subsequent word in an entity that already has a beginning word (Burr). This lets multi-word entities such as "Aaron Burr" be extracted as a single unit. The features used as predictors are typically the other words in the text, with particular attention paid to their sequence and proximity to the word being classified.
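
The labeling scheme described above is commonly called BIO (or IOB) tagging. As a rough illustration, the Python sketch below applies it to the example phrase. The tags are assigned by hand rather than predicted by a trained classifier, and the tag names (`PER`, `PARTY`) and the helper functions `extract_entities` and `window_features` are hypothetical names chosen for this example. The feature function is only one simple way to capture sequence and proximity, not a description of any particular system.

```python
# Tokens from the example phrase (punctuation split off as separate tokens).
tokens = ["the", "tie", "was", "not", "between", "Jefferson", "and", "Adams", ",",
          "but", "between", "Jefferson", "and", "Aaron", "Burr", ",", "who", "was",
          "also", "a", "Democratic", "Republican"]

# Hand-assigned labels (an assumption for illustration; a trained classifier
# would predict these). B- = beginning word, I- = subsequent word, O = other.
tags = ["O", "O", "O", "O", "O", "B-PER", "O", "B-PER", "O",
        "O", "O", "B-PER", "O", "B-PER", "I-PER", "O", "O", "O",
        "O", "O", "B-PARTY", "I-PARTY"]


def extract_entities(tokens, tags):
    """Group B-/I- tagged words into (entity text, entity type) pairs."""
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A beginning word closes any open entity and starts a new one.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # A subsequent word extends the entity that is already open.
            current_words.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes the entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


def window_features(tokens, i):
    """Describe token i by itself and its immediate neighbors."""
    return {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


entities = extract_entities(tokens, tags)
burr_features = window_features(tokens, 14)  # features for "Burr"

for text, entity_type in entities:
    print(entity_type, text)
```

In this sketch, "Aaron" and "Burr" are combined into one person entity because "Burr" carries a subsequent-word tag, while "Jefferson" and "Adams" remain separate entities because each carries its own beginning-word tag. The feature dictionary for "Burr" records that the preceding word is "aaron", which is the kind of proximity signal a classifier could learn from.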