Full-Text Search in SQLite: Using the Trigram Tokenizer to Find Similar Names

4 min read

David Muraya Tokenizer Blog Header Image

What Is Full-Text Search?

Full-text search is a way to quickly find text within a large set of data. Think of it like using Google to search the web, but instead, you're searching rows in a database. It's designed to locate all entries that contain specific words or phrases, even if they're buried in longer text. In SQLite, this is handled by a module called FTS5, which makes searching fast and flexible.

I want to walk you through how full-text search works in SQLite, focusing on its FTS5 module and diving into tokenizers - specifically the trigram tokenizer. We'll use this approach to find matching names in a database with thousands of entries, where some names might be pretty similar. Let's start with the basics and build up from there.

Overview of FTS5

FTS5 is a tool in SQLite that sets up virtual tables for full-text search. It's available in SQLite version 3.48 or later. If you're on an older version - like 3.26, which some Linux systems use - you'll need to upgrade.

Here's how:

Download the latest SQLite from https://www.sqlite.org.

mkdir /usr/local/src/sqlite
cd /usr/local/src/sqlite
wget https://www.sqlite.org/2025/sqlite-autoconf-3480000.tar.gz

Unpack it: tar xvfz sqlite-autoconf-3480000.tar.gz.
Build it: cd sqlite-autoconf-3480000, ./configure, make, make install.
Update your PATH: Add export PATH=/usr/local/bin:$PATH to /etc/profile and run source /etc/profile.
Check the version: sqlite3 --version.

If you use Python, you must recompile it to link with the new SQLite. For Python 3.11:

Download: wget https://www.python.org/ftp/python/3.11.9/Python-3.11.9.tgz.
Unpack: tar xvf Python-3.11.9.tgz.
Build: cd Python-3.11.9, sudo LD_RUN_PATH=/usr/local/lib ./configure --enable-optimizations, sudo LD_RUN_PATH=/usr/local/lib make altinstall.

Tokenizers

When you search with FTS5, your query gets split into smaller pieces called tokens. These are the bits the system looks for in your data. Normally, FTS5 splits text into words - so alpha beta gamma becomes three tokens: alpha, beta, and gamma. The tool that does this splitting is called a tokenizer.

FTS5 features four built-in tokenizer modules. In this article we will discuss the trigram tokenizer. A trigram is just three characters in a row. So, with the trigram tokenizer, text gets broken into overlapping sets of three letters. For example, "hello" becomes:

This is great for finding similar text, because even if two words aren't identical, they might share a lot of trigrams. When using the trigram tokenizer, a query or phrase token may match any sequence of characters within a row, not just a complete token. That's exactly what we want when searching for names that are close but not exact matches e.g. "John" and "Johnny".

Setting Up the Virtual Table with the Trigram Tokenizer

Let's assume you have a table called entity with id and name columns. We'll create a virtual table for searching.

Here's the steps:

Clean Up the Names:

Before anything else, normalize the names. This means removing junk like titles, punctuation, and extra spaces, and making everything uppercase. Clean names make searches work better.

Here's a Python function to do it:

import re

async def normalize_name(name):
    if name is None:
        return ""
    # Remove titles (assuming a list_of_titles exists)
    name = await remove_titles(name, list_of_titles)
    name = name.replace("`", "").replace("'", "").replace(".", "").replace("-", " ")
    name = re.sub(r"\s+", " ", name).strip().upper()
    return name

Create the Virtual Table

Now, set up the virtual table with FTS5 and the trigram tokenizer:

CREATE VIRTUAL TABLE IF NOT EXISTS v_normalized_entity USING FTS5 (
    id UNINDEXED,
    name,
    tokenize = 'trigram'
);

The UNINDEXED bit means we don't index the id column - it speeds things up since we're only searching name.

Add the Names

Insert the cleaned-up names into the virtual table. Use cursor.executemany because it's faster for lots of records than doing one at a time:

cursor.executemany(
    "INSERT OR REPLACE INTO v_normalized_entity (id, name) VALUES (?, ?)",
    [(entity.id, entity.name) for entity in list_of_entities]
)

Why executemany? It batches everything into one operation, cutting down on overhead. If you're adding thousands of names, this saves time.

Build the Index After inserting, rebuild the index:
```
INSERT INTO v_normalized_entity(v_normalized_entity) VALUES('rebuild');
```
This makes sure searches run fast.

Note: If your SQLite is older than 3.48, you'll get an error or exception like sqlite3.OperationalError: No such tokenizer: trigram. Upgrade first.

Searching for Names

To find a name, use the MATCH operator with the OR operator. Split the search name into words and join them with "OR".

Here's how:

import asyncio
import sqlite3
from typing import List

import pandas as pd
from pydantic import BaseModel

# Models:
class EntityRead(BaseModel):
    id: int
    name: str

async def load_entities_from_db(name: str) -> List[EntityRead]:
    db_path = "app/data/entity.db"
    names = name.split()
    match_query = " OR ".join(names)

    conn = sqlite3.connect(db_path)
    sql_query = f"SELECT id, name FROM v_normalized_entity WHERE name MATCH '{match_query}'"
    results = pd.read_sql_query(sql_query, conn)
    conn.close()

    entities = await preprocess_entities(results)
    return entities

# function to preprocess the entity names:
async def preprocess_entities(results) -> List[EntityRead]:
    async def row_to_entity(row):
        normalized_name = await normalize_name(row["name"])
        return EntityRead(id=row["id"], name=normalized_name)

    # Process rows asynchronously
    entities = await asyncio.gather(*[row_to_entity(row) for _, row in results.iterrows()])
    return entities

For example, searching "John Doe" runs:

SELECT id, name FROM v_normalized_entity WHERE name MATCH 'John OR Doe';

This finds names with "John" or "Doe" or both. Want to sort by relevance? Add order by rank:

SELECT id, name FROM v_normalized_entity WHERE name MATCH 'John OR Doe' ORDER BY rank;

In my tests, it takes around 0.028 seconds to search for "John Doe" in a database with 200,000 names.

Conclusion

Full-text search in SQLite with FTS5 is a solid way to handle text searches, and the trigram tokenizer takes it up a notch for finding similar names. By breaking text into three-character chunks, it catches close matches - like "John" and "Johnny" - which is perfect for a database with thousands of names.

With this setup, you can search quickly and get useful results in milliseconds, even when names aren't exactly the same or mispelled.