From text to token: How tokenization pipelines work

(paradedb.com)

71 points | by philippemnoel 1 day ago

5 comments

heikkilevanto 1 hour ago
Good explanation on tokenizing English text for regular search. But it is far from universal, and will not work well in Finnish, for example.
Folding diacritics makes "vähä" (little) into "vaha" (wax).
Dropping stop words like "The" misses the word for "tea" (in rather old-fashioned Finnish, but also in current Danish).
Stemming Finnish words is also much more complex, as we tend to append suffixes to the words instead of putting small words in front of the word. "talo" is "house", "talosta" is "from the house", "talostani" is "from my house", and "talostaniko" makes it a question: "from my house?"
If that sounds too easy, consider Japanese. From what little I know they don't use whitespace to separate words, mix two phonetic alphabets with Chinese ideograms, etc.
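A minimal sketch of the folding and stop-word problems described above, using only Python's standard library. The stop-word list and the function names are made up for illustration; this is not how any particular search engine implements its pipeline.

```python
import unicodedata

# Hypothetical, deliberately tiny English stop-word list.
ENGLISH_STOP_WORDS = {"the", "a", "an", "of", "and"}

def fold_diacritics(text):
    """Decompose to NFD, then drop combining marks (ASCII-folding style)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def english_pipeline(text):
    """Lowercase, split on whitespace, fold diacritics, drop English stop words."""
    tokens = [fold_diacritics(tok) for tok in text.lower().split()]
    return [tok for tok in tokens if tok not in ENGLISH_STOP_WORDS]

print(fold_diacritics("vähä"))            # vaha
print(english_pipeline("The vähä talo"))  # ['vaha', 'talo']
```

After folding, "vähä" and "vaha" index to the same term, and a query containing "the" loses that word entirely, whatever it meant in the source language.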
- philippemnoel 3 minutes ago
  That's true. For this reason, most modern search engines support language-aware stemming and tokenization. Popular tokenizers for CJK languages include Lindera and Jieba.
  We (ParadeDB) use a search library called Tantivy under the hood, which supports stemming in Finnish, Danish and many other languages: https://docs.paradedb.com/documentation/token-filters/stemmi...
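  To illustrate why the suffix chain in the parent comment needs a language-aware stemmer, here is a toy suffix stripper that only knows the three endings from the "talo" example. The suffix list and function name are hypothetical; this is not Tantivy's or Snowball's algorithm, just a sketch of the idea.

```python
# Hypothetical, hand-picked suffixes from the "talo" example, outermost first:
# "ko" (question), "ni" (my), "sta" (from).
SUFFIX_LAYERS = ["ko", "ni", "sta"]

def toy_finnish_stem(word):
    """Strip each known suffix once, outermost first, keeping at least 3 letters."""
    for suffix in SUFFIX_LAYERS:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
    return word

for form in ["talo", "talosta", "talostani", "talostaniko"]:
    print(form, "->", toy_finnish_stem(form))  # every form maps to "talo"
```

  A rule list like this will also mangle any word that merely happens to end in those letters, which is why real stemmers carry far more language-specific rules.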
wongarsu 4 hours ago
Notably, this is tokenization for traditional search. LLMs use very different tokenization with very different goals.
gortok 1 hour ago
My biggest complaints about search come from day-to-day uses:
I use search in my email pretty heavily, and I’m most interested in specific words in the email, and in whether those emails are from specific folks or a specific domain. But the mobile version of Gmail produces different results than the mobile Outlook app, which in turn differs from the desktop version of Gmail, and all of them are pretty terrible at search as it pertains to email.
I have a hard time getting them to pull up emails in search that I know exist, that I know contain certain words, and that I know have certain email addresses in the body.
I recognize a generalized search mechanism is going to get domain-specific nuances wrong, but is it really so hard to make a search engine that works on email and email-based attachments that no one cares enough to try?
the_arun 1 hour ago
Just curious - if we remove stop words from prompts before sending them to an LLM, wouldn't it reduce the token count? Would the response from the LLM stay the same (original vs. without stop words)?
- kylecazar 57 minutes ago
  Search engines are often keyword based and can afford to throw out stopwords. Modern (frontier) LLMs need the nuance and semantics they signal though -- so they don't automatically strip them. There are probably special purpose smaller models that do this, but that's the exception.
  Yeah, it'll be fewer input tokens if you omit them yourself. It's not guaranteed to keep the response the same, though. You're asking the model to work with less context and more ambiguity at that point. So stripping your prompt of stopwords is going to save you negligible $ and potentially cost a lot in model performance.
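  A rough sketch of that trade-off. The stop-word list here is hypothetical, and whitespace word counts are only a crude stand-in for real LLM token counts, which come from subword tokenizers.

```python
# Hypothetical stop-word list; real lists are longer and language-specific.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "not", "do"}

def strip_stop_words(prompt):
    """Drop stop words from a whitespace-split prompt (case-insensitive)."""
    return " ".join(w for w in prompt.split() if w.lower() not in STOP_WORDS)

def word_count(text):
    """Crude proxy for token count: number of whitespace-separated words."""
    return len(text.split())

original = "Do not send the report to the client"
stripped = strip_stop_words(original)

print(stripped)                                    # send report client
print(word_count(original), word_count(stripped))  # 8 3
```

  The prompt gets shorter, but the stripped version now reads as the opposite instruction, which is the kind of ambiguity the model would be left to guess at.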
semicognitive 1 hour ago
ParadeDB is a great team; highly recommend using them.