Microsoft Research Word Breaker Demo

WordBreaker4Web - new and improved!

Enter some text without spaces:
Select a language model:

Word breaking is a well-established natural language processing (NLP) task. It is important for processing many European languages that use compound words (e.g. German, Dutch, Greek), and it is critical for East Asian languages (e.g. Chinese, Japanese, Korean) whose writing systems do not use whitespace to mark word boundaries.

The Web has raised the need for word breaking to a new height. For example, because the URL format does not allow whitespace, we are all forced to concatenate words when specifying, say, file paths or domain names. The hashtag convention in tweets is another example. On the web, we regularly need to parse domain names like "247moms" as "twenty-four seven moms" (a salute to all the moms in the world!), and we must be careful to understand that the web site called "penisland" is about Pen Island, an island located in Hudson Bay.

As a byproduct of developing statistical NLP technologies to understand search queries, we found that the same techniques can also be applied to word breaking for "web languages". Below is a demo we showed at WWW-2010 and NAACL/HLT-2010; a detailed description can be found in the WWW-2011 paper. Just type in your concatenated string (i.e., one with no spaces or obvious word boundary markers) and click the "Break" button. The app will show the 5 most plausible results along with their base-10 log probabilities.
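To make the idea of ranked results concrete, here is a minimal sketch of a statistical word breaker in Python. It is not the demo's implementation: it assumes a tiny hand-made unigram table (TOY_UNIGRAMS), an arbitrary penalty for unknown words (UNKNOWN_LOG10), and a simple k-best dynamic program, whereas the demo queries a web-scale N-gram model. All names and numbers in it are hypothetical.

```python
import heapq
import math

# Hypothetical toy unigram probabilities, invented for illustration only.
# A real system would look these up in a web-scale N-gram model.
TOY_UNIGRAMS = {
    "pen": 1e-4, "island": 1e-4, "penis": 1e-5, "land": 2e-4,
    "is": 1e-2, "moms": 1e-5, "247": 1e-6, "24": 1e-4, "7": 1e-3,
}
UNKNOWN_LOG10 = -12.0  # assumed per-character penalty for unseen words


def word_log10(word):
    """Return the base-10 log probability of a single word."""
    if word in TOY_UNIGRAMS:
        return math.log10(TOY_UNIGRAMS[word])
    return UNKNOWN_LOG10 * len(word)


def break_words(text, k=5, max_word_len=12):
    """Return up to k (log10 probability, words) pairs, best first."""
    # best[i] holds the k best segmentations of the prefix text[:i].
    best = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            score = word_log10(word)
            # Extend every kept segmentation of text[:start] by this word.
            for prev_score, prev_words in best[start]:
                candidates.append((prev_score + score, prev_words + [word]))
        best[end] = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return best[len(text)]


if __name__ == "__main__":
    for score, words in break_words("penisland"):
        print(f"{score:9.3f}  {' '.join(words)}")
```

With these made-up numbers, "pen island" ranks first for the input "penisland" because its two words have the highest combined log probability. A model with longer N-gram context would score each word given its predecessors rather than independently.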

The demo is powered by Microsoft Web N-gram Services. The N-gram model is trained on web documents indexed by Bing in the EN-US market. However, as pointed out in the NAACL/HLT paper, it seems to understand languages other than English to some extent (e.g. even Chinese pinyin written in the Roman alphabet). Also, the N-gram model follows Bing's tokenization, in which punctuation marks, dollar signs and even apostrophes are all treated as spaces. Please don't be surprised if the word breaking results show you something like "don t" (don't) or "he ll" (he'll).
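The effect of that tokenization can be illustrated with a short Python sketch. This is only an approximation under stated assumptions: the function name tokenize_like_demo is hypothetical, and treating every character other than an ASCII letter or digit as a space is our simplification, not Bing's actual rule set.

```python
import re


def tokenize_like_demo(text):
    """Split text after replacing non-alphanumeric characters with spaces.

    Assumption: anything that is not an ASCII letter or digit (punctuation,
    dollar signs, apostrophes) counts as a separator.
    """
    return re.sub(r"[^A-Za-z0-9]+", " ", text).split()


if __name__ == "__main__":
    print(tokenize_like_demo("don't"))
    print(tokenize_like_demo("he'll pay $5."))
```

Because the apostrophe becomes a space, "don't" turns into the two tokens "don" and "t", which is why such fragments can appear in the results.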

Feedback and Questions?