HomeBlog

Full-Text Search in SQLite: Using the Trigram Tokenizer to Find Similar Names

4 min read

David Muraya Tokenizer Blog Header Image

What Is Full-Text Search?

Full-text search is a way to quickly find text within a large set of data. Think of it like using Google to search the web, but instead, you're searching rows in a database. It's designed to locate all entries that contain specific words or phrases, even if they're buried in longer text. In SQLite, this is handled by a module called FTS5, which makes searching fast and flexible.

I want to walk you through how full-text search works in SQLite, focusing on its FTS5 module and diving into tokenizers - specifically the trigram tokenizer. We'll use this approach to find matching names in a database with thousands of entries, where some names might be pretty similar. Let's start with the basics and build up from there.

Overview of FTS5

FTS5 is a tool in SQLite that sets up virtual tables for full-text search. It's available in SQLite version 3.48 or later. If you're on an older version - like 3.26, which some Linux systems use - you'll need to upgrade.

Here's how:

  • Download the latest SQLite from https://www.sqlite.org.
    mkdir /usr/local/src/sqlite
    cd /usr/local/src/sqlite
    wget https://www.sqlite.org/2025/sqlite-autoconf-3480000.tar.gz
    
  • Unpack it: tar xvfz sqlite-autoconf-3480000.tar.gz.
  • Build it: cd sqlite-autoconf-3480000, ./configure, make, make install.
  • Update your PATH: Add export PATH=/usr/local/bin:$PATH to /etc/profile and run source /etc/profile.
  • Check the version: sqlite3 --version.

If you use Python, you must recompile it to link with the new SQLite. For Python 3.11:

  • Download: wget https://www.python.org/ftp/python/3.11.9/Python-3.11.9.tgz.
  • Unpack: tar xvf Python-3.11.9.tgz.
  • Build: cd Python-3.11.9, sudo LD_RUN_PATH=/usr/local/lib ./configure --enable-optimizations, sudo LD_RUN_PATH=/usr/local/lib make altinstall.

Tokenizers

When you search with FTS5, your query gets split into smaller pieces called tokens. These are the bits the system looks for in your data. Normally, FTS5 splits text into words - so alpha beta gamma becomes three tokens: alpha, beta, and gamma. The tool that does this splitting is called a tokenizer.

FTS5 features four built-in tokenizer modules. In this article we will discuss the trigram tokenizer. A trigram is just three characters in a row. So, with the trigram tokenizer, text gets broken into overlapping sets of three letters. For example, "hello" becomes:

  • hel
  • ell
  • llo

This is great for finding similar text, because even if two words aren't identical, they might share a lot of trigrams. When using the trigram tokenizer, a query or phrase token may match any sequence of characters within a row, not just a complete token. That's exactly what we want when searching for names that are close but not exact matches e.g. "John" and "Johnny".

Setting Up the Virtual Table with the Trigram Tokenizer

Let's assume you have a table called entity with id and name columns. We'll create a virtual table for searching.

Here's the steps:

  1. Clean Up the Names:

    Before anything else, normalize the names. This means removing junk like titles, punctuation, and extra spaces, and making everything uppercase. Clean names make searches work better.

    Here's a Python function to do it:

    import re
    
    async def normalize_name(name):
        if name is None:
            return ""
        # Remove titles (assuming a list_of_titles exists)
        name = await remove_titles(name, list_of_titles)
        name = name.replace("`", "").replace("'", "").replace(".", "").replace("-", " ")
        name = re.sub(r"\s+", " ", name).strip().upper()
        return name
    
  2. Create the Virtual Table

    Now, set up the virtual table with FTS5 and the trigram tokenizer:

    CREATE VIRTUAL TABLE IF NOT EXISTS v_normalized_entity USING FTS5 (
        id UNINDEXED,
        name,
        tokenize = 'trigram'
    );
    

    The UNINDEXED bit means we don't index the id column - it speeds things up since we're only searching name.

  3. Add the Names

    Insert the cleaned-up names into the virtual table. Use cursor.executemany because it's faster for lots of records than doing one at a time:

    cursor.executemany(
        "INSERT OR REPLACE INTO v_normalized_entity (id, name) VALUES (?, ?)",
        [(entity.id, entity.name) for entity in list_of_entities]
    )
    

    Why executemany? It batches everything into one operation, cutting down on overhead. If you're adding thousands of names, this saves time.

  4. Build the Index After inserting, rebuild the index:

    INSERT INTO v_normalized_entity(v_normalized_entity) VALUES('rebuild');
    

    This makes sure searches run fast.

    Note: If your SQLite is older than 3.48, you'll get an error or exception like sqlite3.OperationalError: No such tokenizer: trigram. Upgrade first.

Searching for Names

To find a name, use the MATCH operator with the OR operator. Split the search name into words and join them with "OR".

Here's how:

import asyncio
import sqlite3
from typing import List

import pandas as pd
from pydantic import BaseModel

# Models:
class EntityRead(BaseModel):
    id: int
    name: str

async def load_entities_from_db(name: str) -> List[EntityRead]:
    db_path = "app/data/entity.db"
    names = name.split()
    match_query = " OR ".join(names)

    conn = sqlite3.connect(db_path)
    sql_query = f"SELECT id, name FROM v_normalized_entity WHERE name MATCH '{match_query}'"
    results = pd.read_sql_query(sql_query, conn)
    conn.close()

    entities = await preprocess_entities(results)
    return entities

# function to preprocess the entity names:
async def preprocess_entities(results) -> List[EntityRead]:
    async def row_to_entity(row):
        normalized_name = await normalize_name(row["name"])
        return EntityRead(id=row["id"], name=normalized_name)

    # Process rows asynchronously
    entities = await asyncio.gather(*[row_to_entity(row) for _, row in results.iterrows()])
    return entities

For example, searching "John Doe" runs:

SELECT id, name FROM v_normalized_entity WHERE name MATCH 'John OR Doe';

This finds names with "John" or "Doe" or both. Want to sort by relevance? Add order by rank:

SELECT id, name FROM v_normalized_entity WHERE name MATCH 'John OR Doe' ORDER BY rank;

In my tests, it takes around 0.028 seconds to search for "John Doe" in a database with 200,000 names.

Conclusion

Full-text search in SQLite with FTS5 is a solid way to handle text searches, and the trigram tokenizer takes it up a notch for finding similar names. By breaking text into three-character chunks, it catches close matches - like "John" and "Johnny" - which is perfect for a database with thousands of names.

With this setup, you can search quickly and get useful results in milliseconds, even when names aren't exactly the same or mispelled.

Related Blog Posts

Enjoyed this blog post? Check out these related posts!

Simple CI/CD for Your FastAPI App with Google Cloud Build and Cloud Run

Simple CI/CD for Your FastAPI App with Google Cloud Build and Cloud Run

Push code, deploy automatically: A simple CI/CD guide for your web app.

Read More..

How to Get a Structured JSON Response from a Web Page Using AI

How to Get a Structured JSON Response from a Web Page Using AI

Using AI to Extract Structured Data from Web Pages: A Kenyan Invoice Verification Case Study

Read More..

Reflex Makes SEO Easier: Automatic robots.txt and sitemap.xml Generation

Reflex Makes SEO Easier: Automatic robots.txt and sitemap.xml Generation

Discover how adding your deploy URL in Reflex automatically generates robots.txt and sitemap.xml for easier SEO.

Read More..

Adding middleware to FastAPI Applications: Process Time Headers, Security, and Compression

Adding middleware to FastAPI Applications: Process Time Headers, Security, and Compression

A practical guide to implementing middleware in FastAPI for better performance, security, and efficiency.

Read More..

Contact Me

Have a project in mind? Send me an email at hello@davidmuraya.com and let's bring your ideas to life. I am always available for exciting discussions.

© 2025 David Muraya. All rights reserved.