Full-text search is a way to quickly find text within a large set of data. Think of it like using Google to search the web, but instead, you're searching rows in a database. It's designed to locate all entries that contain specific words or phrases, even if they're buried in longer text. In SQLite, this is handled by a module called FTS5, which makes searching fast and flexible.
I want to walk you through how full-text search works in SQLite, focusing on its FTS5 module and diving into tokenizers - specifically the trigram tokenizer. We'll use this approach to find matching names in a database with thousands of entries, where some names might be pretty similar. Let's start with the basics and build up from there.
FTS5 is a tool in SQLite that sets up virtual tables for full-text search. It's available in SQLite version 3.48 or later. If you're on an older version - like 3.26, which some Linux systems use - you'll need to upgrade.
Here's how:
mkdir /usr/local/src/sqlite cd /usr/local/src/sqlite wget https://www.sqlite.org/2025/sqlite-autoconf-3480000.tar.gz
tar xvfz sqlite-autoconf-3480000.tar.gz.
cd sqlite-autoconf-3480000
, ./configure
, make
, make install
.export PATH=/usr/local/bin:$PATH
to /etc/profile
and run source /etc/profile
.sqlite3 --version
.If you use Python, you must recompile it to link with the new SQLite. For Python 3.11:
wget https://www.python.org/ftp/python/3.11.9/Python-3.11.9.tgz
.tar xvf Python-3.11.9.tgz
.cd Python-3.11.9
, sudo LD_RUN_PATH=/usr/local/lib ./configure --enable-optimizations
, sudo LD_RUN_PATH=/usr/local/lib
make altinstall
.When you search with FTS5, your query gets split into smaller pieces called tokens. These are the bits the system looks for in your data. Normally, FTS5 splits text into words - so alpha beta gamma
becomes three tokens: alpha
, beta
, and gamma
. The tool that does this splitting is called a tokenizer.
FTS5 features four built-in tokenizer modules. In this article we will discuss the trigram tokenizer. A trigram is just three characters in a row. So, with the trigram tokenizer, text gets broken into overlapping sets of three letters. For example, "hello" becomes:
This is great for finding similar text, because even if two words aren't identical, they might share a lot of trigrams. When using the trigram tokenizer, a query or phrase token may match any sequence of characters within a row, not just a complete token. That's exactly what we want when searching for names that are close but not exact matches e.g. "John" and "Johnny".
Let's assume you have a table called entity with id and name columns. We'll create a virtual table for searching.
Here's the steps:
Clean Up the Names:
Before anything else, normalize the names. This means removing junk like titles, punctuation, and extra spaces, and making everything uppercase. Clean names make searches work better.
Here's a Python function to do it:
import re async def normalize_name(name): if name is None: return "" # Remove titles (assuming a list_of_titles exists) name = await remove_titles(name, list_of_titles) name = name.replace("`", "").replace("'", "").replace(".", "").replace("-", " ") name = re.sub(r"\s+", " ", name).strip().upper() return name
Create the Virtual Table
Now, set up the virtual table with FTS5 and the trigram tokenizer:
CREATE VIRTUAL TABLE IF NOT EXISTS v_normalized_entity USING FTS5 ( id UNINDEXED, name, tokenize = 'trigram' );
The UNINDEXED
bit means we don't index the id column - it speeds things up since we're only searching name.
Add the Names
Insert the cleaned-up names into the virtual table. Use cursor.executemany because it's faster for lots of records than doing one at a time:
cursor.executemany( "INSERT OR REPLACE INTO v_normalized_entity (id, name) VALUES (?, ?)", [(entity.id, entity.name) for entity in list_of_entities] )
Why executemany
? It batches everything into one operation, cutting down on overhead. If you're adding thousands of names, this saves time.
Build the Index After inserting, rebuild the index:
INSERT INTO v_normalized_entity(v_normalized_entity) VALUES('rebuild');
This makes sure searches run fast.
Note: If your SQLite is older than 3.48, you'll get an error or exception like sqlite3.OperationalError
: No such tokenizer: trigram
.
Upgrade first.
To find a name, use the MATCH
operator with the OR
operator. Split the search name into words and join them with "OR".
Here's how:
import asyncio import sqlite3 from typing import List import pandas as pd from pydantic import BaseModel # Models: class EntityRead(BaseModel): id: int name: str async def load_entities_from_db(name: str) -> List[EntityRead]: db_path = "app/data/entity.db" names = name.split() match_query = " OR ".join(names) conn = sqlite3.connect(db_path) sql_query = f"SELECT id, name FROM v_normalized_entity WHERE name MATCH '{match_query}'" results = pd.read_sql_query(sql_query, conn) conn.close() entities = await preprocess_entities(results) return entities # function to preprocess the entity names: async def preprocess_entities(results) -> List[EntityRead]: async def row_to_entity(row): normalized_name = await normalize_name(row["name"]) return EntityRead(id=row["id"], name=normalized_name) # Process rows asynchronously entities = await asyncio.gather(*[row_to_entity(row) for _, row in results.iterrows()]) return entities
For example, searching "John Doe" runs:
SELECT id, name FROM v_normalized_entity WHERE name MATCH 'John OR Doe';
This finds names with "John" or "Doe" or both. Want to sort by relevance? Add order by rank
:
SELECT id, name FROM v_normalized_entity WHERE name MATCH 'John OR Doe' ORDER BY rank;
In my tests, it takes around 0.028 seconds to search for "John Doe" in a database with 200,000 names.
Full-text search in SQLite with FTS5 is a solid way to handle text searches, and the trigram tokenizer takes it up a notch for finding similar names. By breaking text into three-character chunks, it catches close matches - like "John" and "Johnny" - which is perfect for a database with thousands of names.
With this setup, you can search quickly and get useful results in milliseconds, even when names aren't exactly the same or mispelled.
Enjoyed this blog post? Check out these related posts!
Simple CI/CD for Your FastAPI App with Google Cloud Build and Cloud Run
Push code, deploy automatically: A simple CI/CD guide for your web app.
Read More..
How to Get a Structured JSON Response from a Web Page Using AI
Using AI to Extract Structured Data from Web Pages: A Kenyan Invoice Verification Case Study
Read More..
Reflex Makes SEO Easier: Automatic robots.txt and sitemap.xml Generation
Discover how adding your deploy URL in Reflex automatically generates robots.txt and sitemap.xml for easier SEO.
Read More..
Adding middleware to FastAPI Applications: Process Time Headers, Security, and Compression
A practical guide to implementing middleware in FastAPI for better performance, security, and efficiency.
Read More..
Have a project in mind? Send me an email at hello@davidmuraya.com and let's bring your ideas to life. I am always available for exciting discussions.