In processing unstructured text, tokenization is the step in which the character string of a text segment is split into units, called tokens, for further analysis. Ideally, those tokens would be words, but numbers and other character sequences can also count as tokens. A major challenge in tokenization is deciding which delimiters separate tokens. Delimiters could be white space, commas, periods, HTML tags, etc., but the same character is not always a delimiter: a period ends a sentence, yet in “3.14” or “U.S.” it is part of the token. After the text is broken into tokens, a list of “types,” or unique tokens, is created. In the previous sentence, the token “is” appears twice, but there is just a single “is” type.
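The token/type distinction can be illustrated with a short Python sketch. It is only one possible approach, not a standard tokenizer: the regular expression, the decision to lowercase tokens, and the function names tokenize and count_types are all assumptions made for this example. The pattern treats runs of letters and digits as tokens, allows an internal period or apostrophe (so “3.14” stays whole), and treats everything else as a delimiter.

```python
import re
from collections import Counter

# Assumption: a token is a run of ASCII letters/digits, optionally joined by
# internal apostrophes or periods (e.g. "don't", "3.14"). Any other character
# (white space, commas, quotes, a sentence-final period) acts as a delimiter.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['.][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into lowercase tokens (lowercasing is a choice, not a rule)."""
    return [match.lower() for match in TOKEN_PATTERN.findall(text)]


def count_types(tokens):
    """Map each type (unique token) to the number of times it occurs."""
    return Counter(tokens)


sentence = ("After the text is broken into tokens, a list of \"types,\" "
            "or unique tokens, is created.")
tokens = tokenize(sentence)
types = count_types(tokens)

print(len(tokens))   # number of tokens: 16
print(len(types))    # number of types: 14
print(types["is"])   # the type "is" occurs as a token 2 times
```

Here the sentence yields 16 tokens but only 14 types, because “is” and “tokens” each occur twice. A different delimiter rule (for example, splitting on every period) would change both counts, which is why the choice of delimiters matters.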