In processing unstructured text, tokenization is the step that turns the character string of a text segment into units, called tokens, for further analysis. Ideally, those tokens would be words, but numbers and other character sequences can also count as tokens. A major challenge in tokenization is determining the delimiters that separate tokens. Candidate delimiters include white space, commas, periods, and HTML tags, but the same character is not a delimiter in every context: the period in "3.14" or "e.g." does not end a token. After the text is broken into tokens, a list of "types," or unique tokens, is created. In the previous sentence, the token "is" appears twice, but there is just a single "is" type.
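The token/type distinction can be illustrated with a minimal sketch. It assumes a deliberately naive rule (lowercase the text, then treat every run of letters, digits, or apostrophes as a token and everything else as a delimiter); the function name `tokenize` and this rule are illustrative choices, not a standard tokenizer.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (illustrative assumption, not a standard algorithm).

    Lowercases the text, then returns every maximal run of letters,
    digits, or apostrophes. All other characters act as delimiters.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


sentence = ('After the text is broken into tokens, a list of "types," '
            'or unique tokens, is created.')

tokens = tokenize(sentence)    # every occurrence counts as a token
type_counts = Counter(tokens)  # one entry per type (unique token)

print(len(tokens))        # 16 tokens
print(len(type_counts))   # 14 types
print(type_counts["is"])  # 2 occurrences of the single type "is"

# The same rule shows why delimiters are hard: the period inside a
# decimal number is wrongly treated as a separator.
print(tokenize("Pi is 3.14."))  # ['pi', 'is', '3', '14']
```

A real tokenizer would need context-sensitive rules (or a trained model) to keep "3.14" together while still splitting a sentence-final period from the word before it.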