Schema Design and Normalization #
Why Schema Design Matters #
Knowing SQL lets you read and write data. Schema design determines whether that data stays consistent, performs well, and remains maintainable as your application evolves.
A poorly designed schema doesn’t just make queries harder to write — it causes real bugs. Data gets duplicated, updates become partial, and invariants that your application depends on get silently violated.
Normalization is the systematic process of organizing a database schema to reduce redundancy and prevent anomalies. It’s not about following rules for their own sake — it’s about understanding what goes wrong when data is structured carelessly, and knowing how to fix it.
The Anomalies (Again, But Formally) #
We saw these informally in the first module. Now let’s be precise about what they are, using a concrete example.
Imagine a university database that stores course enrollments in a single flat table:
| student_id | student_name | department | course_id | course_name | instructor |
|---|---|---|---|---|---|
| 1 | Alice | Computer Sci | CS101 | Intro to CS | Prof. Smith |
| 1 | Alice | Computer Sci | MATH201 | Linear Algebra | Prof. Johnson |
| 2 | Bob | Physics | PHYS101 | Mechanics | Prof. Lee |
| 2 | Bob | Physics | CS101 | Intro to CS | Prof. Smith |
| 3 | Charlie | Computer Sci | CS101 | Intro to CS | Prof. Smith |
Three types of anomalies are present:
Update anomaly: Prof. Smith’s name is stored in three rows. If her name changes (marriage, correction, etc.), you must update all three. Miss one, and CS101 now has two different instructors in the same dataset.
Deletion anomaly: If Bob drops both of his courses, deleting those rows also erases the fact that Bob is a Physics student.
Insertion anomaly: You can’t record a new course until at least one student enrolls in it, because every row requires a student_id.
Normal Forms #
Normalization proceeds through a series of normal forms, each eliminating a specific type of redundancy. In practice, getting to Third Normal Form (3NF) resolves the vast majority of real-world problems.
First Normal Form (1NF) #
A table is in 1NF if:
- Every column holds atomic (indivisible) values — no lists, no sets, no nested tables.
- Every row is unique (the table has a primary key).
Violation example:
| student_id | student_name | courses |
|---|---|---|
| 1 | Alice | CS101, MATH201 |
| 2 | Bob | PHYS101, CS101 |
The courses column holds multiple values. You can’t easily query “which students are in CS101?” without parsing the comma-separated string.
Fix: Put each enrollment on its own row. Each column holds one value.
| student_id | student_name | course_id |
|---|---|---|
| 1 | Alice | CS101 |
| 1 | Alice | MATH201 |
| 2 | Bob | PHYS101 |
| 2 | Bob | CS101 |
Now it’s in 1NF. But notice that student_name is duplicated — Alice appears twice. This hints that further normalization is needed.
Second Normal Form (2NF) #
A table is in 2NF if:
- It’s in 1NF.
- Every non-key column depends on the entire primary key, not just part of it.
This only applies to tables with composite primary keys (primary keys made up of multiple columns).
In the enrollment table above, the composite primary key is (student_id, course_id). The column student_name depends only on student_id — it has nothing to do with course_id. This is a partial dependency, and it violates 2NF.
Fix: Decompose into two tables:
students:
| student_id | student_name |
|---|---|
| 1 | Alice |
| 2 | Bob |
enrollments:
| student_id | course_id |
|---|---|
| 1 | CS101 |
| 1 | MATH201 |
| 2 | PHYS101 |
| 2 | CS101 |
Now student_name lives in one place, and the enrollments table only contains facts about enrollments.
Third Normal Form (3NF) #
A table is in 3NF if:
- It’s in 2NF.
- Every non-key column depends on the primary key directly, not through another non-key column.
Consider the original flat table. Even after pulling students into their own table, suppose we have:
| student_id | student_name | department | dept_building |
|---|---|---|---|
| 1 | Alice | Computer Sci | Engineering Hall |
| 2 | Bob | Physics | Science Center |
| 3 | Charlie | Computer Sci | Engineering Hall |
The primary key is student_id. The column dept_building depends on department, which in turn depends on student_id. This chain — student_id -> department -> dept_building — is a transitive dependency.
If the Computer Science department moves to a new building, you have to update every student row in that department. That’s an update anomaly caused by the transitive dependency.
Fix: Decompose again:
students:
| student_id | student_name | department |
|---|---|---|
| 1 | Alice | Computer Sci |
| 2 | Bob | Physics |
| 3 | Charlie | Computer Sci |
departments:
| department | dept_building |
|---|---|
| Computer Sci | Engineering Hall |
| Physics | Science Center |
Now dept_building is stored once per department, and updating it means changing one row.
The Fully Normalized Design #
Starting from the original flat table, full normalization through 3NF produces four tables:
- students (student_id, student_name, department)
- departments (department, dept_building)
- courses (course_id, course_name, instructor)
- enrollments (student_id, course_id)
Each fact is stored in exactly one place. No anomalies are possible through normal operations.
When Not to Normalize #
Normalization is not always the right answer. There are legitimate reasons to keep some redundancy:
Read-heavy workloads. If your application reads data far more often than it writes, and a common query requires joining five tables, it may be worth storing some derived data directly. This is called denormalization — intentionally adding redundancy to improve read performance.
Reporting and analytics. Analytical queries often aggregate data across many tables. Data warehouses frequently use denormalized schemas (star schemas, snowflake schemas) because read performance matters more than write anomalies in that context.
Calculated fields. Storing a total_price column (price * quantity) alongside price and quantity is technically redundant. But if every query needs it, computing it once at write time and storing it can be a reasonable trade-off.
The key distinction: normalize by default, denormalize with intention. If you denormalize, you should be able to articulate exactly which read pattern you’re optimizing for and what consistency trade-off you’re accepting.
Practical Guidelines #
-
Start in 3NF. Unless you have a specific, measured performance reason to denormalize, start with a fully normalized schema.
-
Each table should represent one concept. If a table name feels like two nouns connected by “and,” it probably needs to be split.
-
Every non-key column should describe the entity identified by the primary key. If
dept_buildingdescribes a department and not a student, it doesn’t belong in thestudentstable. -
If you find yourself updating the same value in multiple rows, that’s a signal. The data is likely in the wrong table.
-
Naming matters. Tables get plural nouns (
customers,orders). Columns get singular, descriptive names (customer_id,order_date). Foreign keys should match the primary key they reference.