Elasticsearch — Search in your local language
How to implement efficient keyword search in Python for languages like Russian, Polish, French, Hungarian, and more.

A while ago, we were developing an application for searching keywords in documents. One of the problems we faced was searching efficiently in the Slovene language.
Searching Slovene documents is more demanding than searching English ones, because English is morphologically a relatively simple language. Slovene has some features that many other languages lack, such as grammatical cases and the dual grammatical number. For example, in English the word “dogs” is the plural of “dog”, whereas in Slovene we have pes (a dog), psa (two dogs), and psi (dogs).
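To make the problem concrete, here is a minimal sketch of why exact keyword matching misses inflected forms, and how mapping every word to its lemma helps. The `LEMMAS` table and both helper functions are hypothetical and hand-written for this illustration; a real lemmatizer covers the whole vocabulary.

```python
# Hypothetical hand-written lookup table, for illustration only.
# It maps the three forms from the example above to one lemma.
LEMMAS = {"pes": "pes", "psa": "pes", "psi": "pes"}


def exact_match(query, tokens):
    """Return True only if the query appears verbatim among the tokens."""
    return query in tokens


def lemma_match(query, tokens):
    """Return True if the query and any token share the same lemma."""
    query_lemma = LEMMAS.get(query, query)
    return any(LEMMAS.get(token, token) == query_lemma for token in tokens)


document = ["psa"]  # a document that mentions "two dogs"

print(exact_match("pes", document))  # False
print(lemma_match("pes", document))  # True
```

A search engine with a lemmatizer applies the same idea at indexing time and at query time, so all forms of a word end up as the same term in the index.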
Solution
To make search efficient, we chose Elasticsearch, as we had positive experiences with it in the past. However, Elasticsearch doesn’t offer a Slovene lemmatizer out of the box. Luckily for us, there is a great plugin, LemmaGen, that fills this gap. The same procedure described below also works for:
- Bulgarian,
- Czech,
- Estonian,
- French,
- Hungarian,
- Macedonian,
- Persian,
- Polish,
- Romanian,
- Russian,
- Slovak,
- Slovene,
- Serbian,
- Ukrainian.
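
As a rough sketch of what switching languages could look like, the helper below builds index settings for a given language code. It assumes the plugin exposes a token filter of type `lemmagen` with a `lexicon` option, and that Slovene uses the code `sl`; check both against the README of the plugin version you install. The function name, the filter and analyzer names, and the choice of the `standard` tokenizer with `lowercase` applied first are my own assumptions, not something the plugin prescribes.

```python
import json


def lemmagen_index_settings(lexicon):
    """Build index settings with a custom analyzer for one language.

    Assumption: the LemmaGen plugin registers a token filter of type
    "lemmagen" that takes a "lexicon" language code (for example "sl").
    """
    filter_name = f"lemmagen_filter_{lexicon}"  # hypothetical name
    analyzer_name = f"lemmagen_{lexicon}"       # hypothetical name
    return {
        "settings": {
            "analysis": {
                "filter": {
                    filter_name: {"type": "lemmagen", "lexicon": lexicon}
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first, then reduce each token to its lemma.
                        "filter": ["lowercase", filter_name],
                    }
                },
            }
        }
    }


# Print the settings body that would be sent when creating the index.
print(json.dumps(lemmagen_index_settings("sl"), indent=2))
```

Under these assumptions, supporting another language from the list only requires passing a different lexicon code; the rest of the settings stay the same.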