Indexes and Query Performance #
The Problem: Searching Without an Index #
When you run a query like:
SELECT * FROM orders WHERE customer_id = 42;
the database needs to find all rows where customer_id equals 42. Without an index, it has only one option: read every single row in the table and check each one. This is called a full table scan (or sequential scan).
For a table with 100 rows, this is instant. For a table with 10 million rows, it can take seconds — or longer if the table is too large to fit in memory.
An index is a separate data structure that lets the database find rows matching a condition without scanning the entire table. It’s the difference between reading a book cover-to-cover to find a topic, and looking it up in the book’s index.
How Indexes Work #
An index is an auxiliary data structure, maintained by the database, that maps column values to the physical locations of rows containing those values. When you create an index on a column, the database builds this mapping and keeps it up to date as data changes.
B-Tree Indexes #
The most common index type is the B-tree (balanced tree). B-trees are the default index type in PostgreSQL, MySQL, SQLite, and most other relational databases.
A B-tree organizes values in a sorted, hierarchical structure:
- The root node contains a few values that divide the entire range.
- Internal nodes further subdivide the range.
- Leaf nodes contain the actual values (or pointers to rows).
To find a specific value, the database starts at the root and follows the appropriate branch at each level — much like a binary search, but with a higher branching factor (each node has many children, not just two). This gives B-trees O(log n) lookup time.
For a table with 10 million rows, a B-tree lookup typically requires 3-4 disk reads, compared to potentially millions of reads for a full scan.
B-tree indexes support:
- Exact lookups:
WHERE customer_id = 42 - Range queries:
WHERE price BETWEEN 10 AND 50 - Sorting:
ORDER BY order_date - Prefix matching:
WHERE name LIKE 'Ali%'(but notLIKE '%ice')
Hash Indexes #
A hash index uses a hash function to map values directly to row locations. Hash indexes provide O(1) lookup time for exact matches — faster than B-trees for simple equality checks.
However, hash indexes cannot support range queries, sorting, or prefix matching. They’re useful only when you exclusively query with =. In practice, B-trees are almost always preferred for their versatility.
Creating and Managing Indexes #
Creating an Index #
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
This creates a B-tree index on the customer_id column of the orders table. After this, queries filtering on customer_id can use the index instead of scanning the full table.
Unique Indexes #
CREATE UNIQUE INDEX idx_customers_email ON customers(email);
A unique index serves double duty: it speeds up lookups and enforces that no two rows can have the same email. When you declare a column as UNIQUE in your table definition, the database creates a unique index automatically.
Composite Indexes #
An index can cover multiple columns:
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
This is a composite index (also called a multi-column index). It’s useful when queries frequently filter or sort by both columns together.
The column order matters. A composite index on (customer_id, order_date) can efficiently answer:
WHERE customer_id = 42(uses the first column)WHERE customer_id = 42 AND order_date > '2025-01-01'(uses both columns)WHERE customer_id = 42 ORDER BY order_date(uses both columns)
But it cannot efficiently answer:
WHERE order_date > '2025-01-01'alone (the first column isn’t constrained)
Think of it like a phone book sorted by last name, then first name. You can look up all people named “Smith,” or “Smith, Alice.” But you can’t efficiently find all people named “Alice” — the book isn’t sorted that way.
Dropping an Index #
DROP INDEX idx_orders_customer_id;
The Cost of Indexes #
Indexes are not free. Every index introduces trade-offs:
Storage. An index is a separate data structure stored on disk. A table with five indexes uses significantly more disk space than the same table without indexes.
Write overhead. Every INSERT, UPDATE, or DELETE must also update all relevant indexes. The more indexes a table has, the slower writes become.
Maintenance. Indexes can become fragmented over time as data is inserted and deleted, potentially degrading performance. Some databases periodically rebuild indexes to address this.
The rule of thumb: index columns that you query by, not columns that you just store. If a column appears frequently in WHERE, JOIN, or ORDER BY clauses, it’s a candidate for an index. If it’s only ever read as part of a row that was already found by other means, it doesn’t need one.
Understanding Query Execution with EXPLAIN #
The most important tool for understanding query performance is EXPLAIN. It shows you the database’s execution plan — how it intends to find and return your data.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
The output varies by database, but typically shows:
- Whether the query uses an index or a full table scan.
- Which index it uses.
- The estimated number of rows it will examine.
- The estimated cost (in arbitrary units used by the optimizer).
In PostgreSQL, EXPLAIN ANALYZE actually runs the query and shows real execution times:
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
Reading EXPLAIN Output #
Key things to look for:
- Seq Scan (sequential scan): The database is reading every row. This is fine for small tables but concerning for large ones.
- Index Scan: The database is using an index to find rows. This is usually what you want.
- Index Only Scan: The database can answer the query entirely from the index, without reading the actual table rows. This is the fastest option.
- Sort: The database is sorting results. If this is on a large dataset without an index, it can be expensive.
- Nested Loop / Hash Join / Merge Join: Different strategies for joining tables, each with different performance characteristics.
EXPLAIN should be your first tool when a query is slow. Before adding indexes or rewriting queries, understand what the database is actually doing.
Common Indexing Patterns #
Primary keys are automatically indexed. #
When you declare a primary key, the database creates a unique index on that column. You don’t need to create one manually.
Foreign keys should usually be indexed. #
When you join tables on a foreign key, an index on that column dramatically speeds up the join. Some databases (like MySQL with InnoDB) create these automatically; others (like PostgreSQL) do not.
-- If not automatically indexed, create this explicitly
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
Columns in WHERE clauses are candidates. #
If a query is used frequently and filters on a specific column, that column likely benefits from an index.
Don’t index everything. #
A table with an index on every column will have very slow writes and use substantial disk space. Be selective. Measure before and after adding an index to verify it actually helps.
Summary #
Indexes are the primary mechanism for making queries fast. But they’re a trade-off, not a silver bullet:
- They speed up reads at the cost of slower writes and more storage.
- B-tree indexes are the default and handle the widest range of queries.
- Composite indexes are powerful but column order matters.
EXPLAINis how you verify that your indexes are actually being used.
The best approach: start without indexes (beyond the primary key), measure your query performance, and add indexes where they provide clear, measurable improvement.