WordBreaker4Web - new and improved!
Word breaking is a well-established natural language processing (NLP) task.
It is important for processing many European languages that use compound words
(e.g. German, Dutch, Greek), and is critical for East Asian languages
(e.g. Chinese, Japanese, Korean) whose writing systems do not use
white space to mark word boundaries.
The Web has raised the need for word breaking to new heights. For example, because
the URL format does not allow white space, we are all forced to concatenate words
when specifying, say, file paths or domain names. The hashtag convention
in tweets is another example. On the web, we regularly need to parse domain
names like "247moms" as "twenty-four seven moms" (salute to all the moms in the world!),
and be careful to understand that the web site called "penisland" is about
Pen Island, an island located in Hudson Bay.
As a byproduct of developing statistical NLP technologies to understand search queries,
we found that the same techniques can also be applied to word breaking "web languages".
Below is a demo we showed at WWW-2010 and NAACL/HLT-2010; a detailed description
can be found in the paper in WWW-2011. Just type in your concatenated string
(i.e., no spaces or obvious word boundary markers) and click on the "Break" button.
The app will show the 5 most plausible results with their log base 10 probabilities.
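To make the idea of "the 5 most plausible results with log base 10 probabilities" concrete, here is a minimal sketch of ranked word breaking. It is not the method behind the demo or the Web N-gram service: it assumes a simple unigram model, and the vocabulary, the probability values, the unknown-word penalty, and the function names (`word_score`, `top_breaks`) are all hypothetical, chosen only for illustration.

```python
import heapq

# Hypothetical toy unigram log10 probabilities (made-up values,
# NOT taken from the Web N-gram service).
TOY_LOG10_PROB = {
    "pen": -3.0,
    "island": -3.5,
    "penis": -4.8,
    "land": -3.2,
    "is": -2.0,
}
# Assumed penalty per character for words outside the toy vocabulary.
UNKNOWN_LOG10_PROB_PER_CHAR = -5.0
# Assumed upper bound on the length of a single word.
MAX_WORD_LEN = 12


def word_score(word):
    """Log10 probability of one word under the toy unigram model."""
    if word in TOY_LOG10_PROB:
        return TOY_LOG10_PROB[word]
    return UNKNOWN_LOG10_PROB_PER_CHAR * len(word)


def top_breaks(text, k=5):
    """Return up to k (log10 probability, words) pairs, best first."""
    # best[i] holds the k highest-scoring segmentations of text[:i].
    best = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            # Extend every kept segmentation of the prefix with this word.
            for score, words in best[start]:
                candidates.append((score + word_score(word), words + [word]))
        best[end] = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return best[len(text)]


if __name__ == "__main__":
    for score, words in top_breaks("penisland"):
        print(f"{score:8.2f}  {' '.join(words)}")
```

Because the scores are sums of per-word log probabilities, keeping only the k best segmentations of each prefix is enough to recover the k best segmentations of the whole string. A real system would replace the toy table with probabilities from a large language model, typically of higher order than unigram.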
The demo is powered by Microsoft Web N-gram Services. The N-gram model is trained on web documents indexed
by Bing in the EN-US market. However, as pointed out in the NAACL/HLT paper, it seems to understand
languages other than English to some extent (e.g. even Chinese pinyin written in the Roman alphabet).
Also, the N-gram model follows Bing's tokenization, in which punctuation marks, dollar signs and
even apostrophes are all treated as spaces. Please don't be surprised if the word breaking results
show you something like "don t" (don't) or "he ll" (he'll).
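The effect of this tokenization can be illustrated with a short sketch. The exact rules Bing applies are not described here, so the code below simply assumes that every character that is neither a letter, a digit, an underscore, nor white space is replaced by a space; the function name `tokenize_like_description` is hypothetical.

```python
import re


def tokenize_like_description(text):
    """Split text after turning punctuation-like characters into spaces.

    Assumption: anything that is not a word character or white space
    (punctuation marks, dollar signs, apostrophes) counts as a space.
    """
    return re.sub(r"[^\w\s]", " ", text).split()


if __name__ == "__main__":
    print(tokenize_like_description("don't"))
    print(tokenize_like_description("he'll pay $5."))
```

Under this assumption "don't" becomes the two tokens "don" and "t", which is why such pairs can appear in the word breaking results.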
Feedback and Questions?