Schema Design and Normalization #

Why Schema Design Matters #

Knowing SQL lets you read and write data. Schema design determines whether that data stays consistent, performs well, and remains maintainable as your application evolves.

A poorly designed schema doesn’t just make queries harder to write — it causes real bugs. Data gets duplicated, updates become partial, and invariants that your application depends on get silently violated.

Normalization is the systematic process of organizing a database schema to reduce redundancy and prevent anomalies. It’s not about following rules for their own sake — it’s about understanding what goes wrong when data is structured carelessly, and knowing how to fix it.

The Anomalies (Again, But Formally) #

We saw these informally in the first module. Now let’s be precise about what they are, using a concrete example.

Imagine a university database that stores course enrollments in a single flat table:

student_id	student_name	department	course_id	course_name	instructor
1	Alice	Computer Sci	CS101	Intro to CS	Prof. Smith
1	Alice	Computer Sci	MATH201	Linear Algebra	Prof. Johnson
2	Bob	Physics	PHYS101	Mechanics	Prof. Lee
2	Bob	Physics	CS101	Intro to CS	Prof. Smith
3	Charlie	Computer Sci	CS101	Intro to CS	Prof. Smith

Three types of anomalies are present:

Update anomaly: Prof. Smith’s name is stored in three rows. If her name changes (marriage, correction, etc.), you must update all three. Miss one, and CS101 now has two different instructors in the same dataset.

Deletion anomaly: If Bob drops both of his courses, deleting those rows also erases the fact that Bob is a Physics student.

Insertion anomaly: You can’t record a new course until at least one student enrolls in it, because every row requires a student_id.

Normal Forms #

Normalization proceeds through a series of normal forms, each eliminating a specific type of redundancy. In practice, getting to Third Normal Form (3NF) resolves the vast majority of real-world problems.

First Normal Form (1NF) #

A table is in 1NF if:

Every column holds atomic (indivisible) values — no lists, no sets, no nested tables.
Every row is unique (the table has a primary key).

Violation example:

student_id	student_name	courses
1	Alice	CS101, MATH201
2	Bob	PHYS101, CS101

The courses column holds multiple values. You can’t easily query “which students are in CS101?” without parsing the comma-separated string.

Fix: Put each enrollment on its own row. Each column holds one value.

student_id	student_name	course_id
1	Alice	CS101
1	Alice	MATH201
2	Bob	PHYS101
2	Bob	CS101

Now it’s in 1NF. But notice that student_name is duplicated — Alice appears twice. This hints that further normalization is needed.

Second Normal Form (2NF) #

A table is in 2NF if:

It’s in 1NF.
Every non-key column depends on the entire primary key, not just part of it.

This only applies to tables with composite primary keys (primary keys made up of multiple columns).

In the enrollment table above, the composite primary key is (student_id, course_id). The column student_name depends only on student_id — it has nothing to do with course_id. This is a partial dependency, and it violates 2NF.

Fix: Decompose into two tables:

students:

student_id	student_name
1	Alice
2	Bob

enrollments:

student_id	course_id
1	CS101
1	MATH201
2	PHYS101
2	CS101

Now student_name lives in one place, and the enrollments table only contains facts about enrollments.

Third Normal Form (3NF) #

A table is in 3NF if:

It’s in 2NF.
Every non-key column depends on the primary key directly, not through another non-key column.

Consider the original flat table. Even after pulling students into their own table, suppose we have:

student_id	student_name	department	dept_building
1	Alice	Computer Sci	Engineering Hall
2	Bob	Physics	Science Center
3	Charlie	Computer Sci	Engineering Hall

The primary key is student_id. The column dept_building depends on department, which in turn depends on student_id. This chain — student_id -> department -> dept_building — is a transitive dependency.

If the Computer Science department moves to a new building, you have to update every student row in that department. That’s an update anomaly caused by the transitive dependency.

Fix: Decompose again:

students:

student_id	student_name	department
1	Alice	Computer Sci
2	Bob	Physics
3	Charlie	Computer Sci

departments:

department	dept_building
Computer Sci	Engineering Hall
Physics	Science Center

Now dept_building is stored once per department, and updating it means changing one row.

The Fully Normalized Design #

Starting from the original flat table, full normalization through 3NF produces four tables:

students (student_id, student_name, department)
departments (department, dept_building)
courses (course_id, course_name, instructor)
enrollments (student_id, course_id)

Each fact is stored in exactly one place. No anomalies are possible through normal operations.

When Not to Normalize #

Normalization is not always the right answer. There are legitimate reasons to keep some redundancy:

Read-heavy workloads. If your application reads data far more often than it writes, and a common query requires joining five tables, it may be worth storing some derived data directly. This is called denormalization — intentionally adding redundancy to improve read performance.

Reporting and analytics. Analytical queries often aggregate data across many tables. Data warehouses frequently use denormalized schemas (star schemas, snowflake schemas) because read performance matters more than write anomalies in that context.

Calculated fields. Storing a total_price column (price * quantity) alongside price and quantity is technically redundant. But if every query needs it, computing it once at write time and storing it can be a reasonable trade-off.

The key distinction: normalize by default, denormalize with intention. If you denormalize, you should be able to articulate exactly which read pattern you’re optimizing for and what consistency trade-off you’re accepting.

Practical Guidelines #

Start in 3NF. Unless you have a specific, measured performance reason to denormalize, start with a fully normalized schema.
Each table should represent one concept. If a table name feels like two nouns connected by “and,” it probably needs to be split.
Every non-key column should describe the entity identified by the primary key. If dept_building describes a department and not a student, it doesn’t belong in the students table.
If you find yourself updating the same value in multiple rows, that’s a signal. The data is likely in the wrong table.
Naming matters. Tables get plural nouns (customers, orders). Columns get singular, descriptive names (customer_id, order_date). Foreign keys should match the primary key they reference.