Why the CS50 Degrees Project Uses set() for Movies Instead of a List

May 23, 2026 at 12:11 pm #6633
Rajeev Bagra
Keymaster
A beginner studying the CS50 AI Degrees project may notice this code inside load_data():
```
people[row["id"]] = {
    "name": row["name"],
    "birth": row["birth"],
    "movies": set()
}
```
and naturally wonder:

Why is "movies" stored as a set() instead of a list?

After all, lists are commonly used for storing multiple values in Python.

This is actually an important design decision related to:
- data uniqueness
- graph traversal efficiency
- fast lookups
- relationship modeling

Let us understand it carefully.

What Does "movies": set() Mean?

This creates an empty Python set for storing movie IDs connected to a person.

Example:
```
people = {
    "102": {
        "name": "Tom Hanks",
        "birth": "1956",
        "movies": set()
    }
}
```
Initially, the actor has no movies stored.

Later, movie IDs are added using:
```
people[row["person_id"]]["movies"].add(row["movie_id"])
```
Suppose Tom Hanks acted in:
- Toy Story → movie ID 500
- Forrest Gump → movie ID 700

Then:
```
people["102"]["movies"]
```
becomes:
```
{"500", "700"}
```
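
To see the whole loading step in one place, here is a minimal sketch. It assumes CSV columns named id, name, birth, person_id and movie_id, as in the snippets above. The in-memory CSV text and the load_people helper are made up for illustration and are not the project's actual code.

```python
import csv
import io

# Illustrative in-memory stand-ins for the CSV files (hypothetical data).
PEOPLE_CSV = "id,name,birth\n102,Tom Hanks,1956\n"
STARS_CSV = "person_id,movie_id\n102,500\n102,700\n"


def load_people(people_text, stars_text):
    """Build the people dictionary, giving each person an empty set of movies."""
    people = {}
    for row in csv.DictReader(io.StringIO(people_text)):
        people[row["id"]] = {
            "name": row["name"],
            "birth": row["birth"],
            "movies": set(),
        }
    # Attach each movie ID to the person who appeared in it.
    for row in csv.DictReader(io.StringIO(stars_text)):
        if row["person_id"] in people:
            people[row["person_id"]]["movies"].add(row["movie_id"])
    return people


people = load_people(PEOPLE_CSV, STARS_CSV)
print(people["102"]["movies"])  # a set containing "500" and "700"
```

Note that the IDs stay as strings, because csv.DictReader returns every field as text.
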
Why Not Use a List?

A list could also store movie IDs:
```
["500", "700"]
```
So why did the project choose sets?

Sets offer three advantages for this kind of data.

1. Sets Automatically Prevent Duplicate Entries

Suppose the CSV data accidentally contains duplicate rows:
```
102,500
102,500
```
If movies were stored in a list:
```
["500", "500"]
```
the duplicate remains.

But with a set:
```
{"500"}
```
the duplicate is never stored: adding a value that is already present leaves the set unchanged.

This is important because an actor should not be connected to the same movie multiple times.
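
The difference is easy to check side by side. The rows below are hypothetical (person_id, movie_id) pairs with one accidental duplicate; they are not taken from the project's data.

```python
# Hypothetical (person_id, movie_id) rows, with one accidental duplicate.
rows = [("102", "500"), ("102", "500"), ("102", "700")]

movies_list = []
movies_set = set()
for person_id, movie_id in rows:
    movies_list.append(movie_id)  # keeps every row, duplicates included
    movies_set.add(movie_id)      # adding an existing element changes nothing

print(len(movies_list))  # 3
print(len(movies_set))   # 2
```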

2. Sets Provide Faster Membership Checking

Python sets are built on hash tables, which makes membership lookups very fast.

Example:
```
"500" in people["102"]["movies"]
```
On average this check takes about the same time however many movies the set holds, whereas a list would have to be scanned element by element.

The CS50 Degrees project performs many relationship checks while running Breadth-First Search (BFS), so fast lookups matter.
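
A rough way to observe the gap is to time the same membership test on a list and on a set. This is only a sketch: the 100,000 generated IDs and the time_membership helper are assumptions for illustration, and the actual timings depend on the machine.

```python
import timeit

# Hypothetical collection of 100,000 movie IDs stored both ways.
ids = [str(n) for n in range(100_000)]
ids_list = list(ids)
ids_set = set(ids)
needle = "99999"  # last element: the worst case for a list scan


def time_membership(container, repeats=200):
    """Return the seconds taken by `repeats` membership tests on container."""
    return timeit.timeit(lambda: needle in container, number=repeats)


list_seconds = time_membership(ids_list)
set_seconds = time_membership(ids_set)
print(f"list: {list_seconds:.6f}s  set: {set_seconds:.6f}s")
```

On most machines the set figure should come out far smaller, and the gap widens as the collection grows.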

3. Order Does Not Matter

The Degrees project only cares about:
- whether an actor appeared in a movie
- which actors are connected

It does NOT care about:
- the order of movies
- which movie came first
- sorting

Sets are ideal when:
- uniqueness matters
- order does not matter
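
A short check makes the point concrete. Two sets with the same elements are equal no matter how they were written, while lists also compare position. The movie IDs are the same illustrative ones used above.

```python
first = {"500", "700"}
second = {"700", "500"}
print(first == second)  # True: sets compare by contents only

print(["500", "700"] == ["700", "500"])  # False: lists also compare position

# When a stable order is needed for display, sort at that point.
print(sorted(first))  # ['500', '700']
```
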
How Movies Are Added to the Set

Sets use .add(), not .append().

Example:
```
movies_set = set()

movies_set.add("500")
movies_set.add("700")
movies_set.add("500")
```
Result:
```
{"500", "700"}
```
The duplicate "500" is ignored automatically.
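
For anyone coming from lists, the sketch below shows what happens when .append() is tried on a set, and how .update() adds several values at once. The variable names are only for illustration.

```python
movies_set = set()
message = ""

# Sets have no append method, so this raises AttributeError.
try:
    movies_set.append("500")
except AttributeError as error:
    message = str(error)

movies_set.add("500")                # add a single element
movies_set.update(["700", "500"])    # add several elements at once
print(movies_set == {"500", "700"})  # True
```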

List vs Set Comparison

Using a List
```
movies = ["500", "700", "500"]
```
Problems:
- duplicates allowed
- slower membership checks

Using a Set
```
movies = {"500", "700"}
```
Advantages:
- duplicates removed automatically
- faster lookups
- excellent for graph relationships

Why This Matters for BFS

The CS50 Degrees project models a graph like this:
```
Actor → Movie → Actor → Movie → Actor
```
The BFS algorithm repeatedly explores:
- which movies an actor appeared in
- which actors appeared in those movies

This requires:
- fast relationship traversal
- efficient duplicate prevention
- clean graph connections
Sets are perfect for this type of graph structure.
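
The traversal can be sketched on a tiny made-up graph. Everything here is an assumption for illustration: the movies dictionary with a "stars" set, the neighbors_for_person helper and the degrees_between function are not the project's actual code, only one way such a search could look.

```python
from collections import deque

# Tiny hypothetical dataset in the same shape as the dictionaries above.
people = {
    "1": {"name": "A", "movies": {"10"}},
    "2": {"name": "B", "movies": {"10", "20"}},
    "3": {"name": "C", "movies": {"20"}},
    "4": {"name": "D", "movies": set()},
}
movies = {
    "10": {"title": "X", "stars": {"1", "2"}},
    "20": {"title": "Y", "stars": {"2", "3"}},
}


def neighbors_for_person(person_id):
    """Return the set of people who share at least one movie with person_id."""
    neighbors = set()
    for movie_id in people[person_id]["movies"]:
        neighbors.update(movies[movie_id]["stars"])
    neighbors.discard(person_id)
    return neighbors


def degrees_between(source, target):
    """Breadth-first search; return the number of steps, or None if unreachable."""
    visited = {source}
    frontier = deque([(source, 0)])
    while frontier:
        person_id, steps = frontier.popleft()
        if person_id == target:
            return steps
        for neighbor in neighbors_for_person(person_id):
            if neighbor not in visited:  # fast set membership test
                visited.add(neighbor)
                frontier.append((neighbor, steps + 1))
    return None


print(degrees_between("1", "3"))  # 2
print(degrees_between("1", "4"))  # None
```

Sets appear three times in this sketch: each person's movies, each movie's stars, and the visited collection that stops the search from revisiting the same person.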

Real-World Uses of Sets

Sets are commonly used for:
- unique website visitors
- friend relationships in social networks
- unique tags or hashtags
- graph connections
- database relationship modeling
The Degrees project is essentially building a movie-actor social graph.
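
Set operations make questions about such a graph short to express. The two movie-ID sets below are hypothetical and only illustrate intersection, union and difference.

```python
# Hypothetical movie-ID sets for two actors.
actor_a_movies = {"500", "700", "900"}
actor_b_movies = {"700", "900", "1100"}

shared = actor_a_movies & actor_b_movies    # movies both appeared in
combined = actor_a_movies | actor_b_movies  # movies featuring either actor
only_a = actor_a_movies - actor_b_movies    # movies unique to actor A

print(sorted(shared))  # ['700', '900']
print(len(combined))   # 4
print(only_a)          # {'500'}
```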

Final Insight

The line:
```
"movies": set()
```
is not accidental.

It is a deliberate design choice because:
- duplicate movie IDs should not exist
- fast membership checking is important
- ordering is irrelevant
- graph traversal benefits from set operations

This is one of the first examples where a beginner can see how data structure choice directly affects algorithm efficiency and program design.