Glossary

Tokenization

In processing unstructured text, tokenization is the step in which the character string of a text segment is split into units, called tokens, for further analysis. Ideally, those tokens would be words, but numbers and other characters can also count as tokens. A major challenge in tokenization is determining the delimiters that separate tokens. Delimiters could be white space, commas, periods, HTML tags, etc., but the same character does not always act as a delimiter: the period in "3.14", for example, is part of the token rather than a boundary between tokens. After the text is broken into tokens, a list of “types,” or unique tokens, is created. In the previous sentence, the token “is” appears twice, but there is just a single “is” type.
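The distinction between tokens and types can be illustrated with a minimal Python sketch. The `tokenize` function and its regular expression are assumptions made for illustration, not a standard tokenizer: they lowercase the text, treat white space and most punctuation as delimiters, and keep a period or apostrophe only when it sits between letters or digits.

```python
import re
from collections import Counter

def tokenize(text):
    """Hypothetical, simplified tokenizer used only for illustration.

    Lowercases the text and returns runs of letters/digits. A period or
    apostrophe is kept only when it is surrounded by letters or digits
    (as in "3.14" or "don't"); otherwise it acts as a delimiter.
    """
    return re.findall(r"[a-z0-9]+(?:[.'][a-z0-9]+)*", text.lower())

sentence = ("After the text is broken into tokens, a list of types, "
            "or unique tokens, is created.")

tokens = tokenize(sentence)   # every occurrence, in order
counts = Counter(tokens)      # how often each token occurs
types = sorted(counts)        # the unique tokens ("types")

print(len(tokens), "tokens,", len(types), "types")
print("occurrences of 'is':", counts["is"])
```

Under these assumptions, "is" is counted as two tokens but contributes only one entry to the list of types. Real tokenizers need more careful rules for abbreviations, hyphens, markup, and languages that do not separate words with white space.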
