In processing unstructured text, tokenization is the step that turns the character string in a text segment into units, called tokens, for further analysis. Ideally, those tokens would be words, but numbers and other characters can also count as tokens. A major challenge in tokenization is determining the delimiters that separate tokens. Delimiters could be white space, commas, periods, HTML tags, etc., but the same character is not always a delimiter: a period ends a sentence, for example, yet it does not separate tokens in a number such as "3.14". After the text is broken into tokens, a list of “types,” or unique tokens, is created. In the previous sentence, the token “is” appears twice, but there is just a single “is” type.
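
The distinction between tokens and types can be illustrated with a short Python sketch. It is only one possible approach, not a standard tokenizer: the function name `tokenize` and the regular expression are assumptions chosen for illustration. The rule used here treats a run of letters as a token, keeps a decimal number such as "3.14" together as one token, and treats everything else (white space, commas, periods, quotation marks) as a delimiter.

```python
import re
from collections import Counter

# Hypothetical rule for this sketch: a token is either a number
# (optionally with a decimal part) or a run of letters.
# Everything else is treated as a delimiter and discarded.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|[a-z]+")


def tokenize(text):
    """Return the list of lowercase tokens found in text."""
    return TOKEN_PATTERN.findall(text.lower())


sentence = ('After the text is broken into tokens, a list of "types," '
            'or unique tokens, is created.')

tokens = tokenize(sentence)   # every occurrence, in order
counts = Counter(tokens)      # how often each token occurs
types = set(tokens)           # unique tokens only

print(len(tokens), "tokens")
print(len(types), "types")
print("'is' occurs", counts["is"], "times but is one type")
```

Under this rule the example sentence yields 16 tokens and 14 types, because "is" and "tokens" each occur twice. A different delimiter rule (for instance, one that keeps apostrophes or hyphens inside words) would give different counts, which is why the choice of delimiters matters.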
