Why the CS50 Degrees Project Uses set() for Movies Instead of a List
May 23, 2026 at 12:11 pm #6633
A beginner studying the CS50 AI Degrees project may notice this line inside load_data():
    people[row["id"]] = {
        "name": row["name"],
        "birth": row["birth"],
        "movies": set()
    }
and naturally wonder: why is "movies" stored as a set() instead of a list?
After all, lists are commonly used for storing multiple values in Python.
This is actually an important design decision related to:
- data uniqueness
- graph traversal efficiency
- fast lookups
- relationship modeling
Let us understand it carefully.
What Does "movies": set() Mean?
This creates an empty Python set for storing movie IDs connected to a person.
Example:
    people = {
        "102": {
            "name": "Tom Hanks",
            "birth": "1956",
            "movies": set()
        }
    }
Initially, the actor has no movies stored.
Later, movie IDs are added using:
    people[row["person_id"]]["movies"].add(row["movie_id"])
Suppose Tom Hanks acted in:
- Toy Story → movie ID 500
- Forrest Gump → movie ID 700
Then people["102"]["movies"] becomes:
    {"500", "700"}
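To make the loading step concrete, here is a minimal sketch of the same pattern on a tiny in-memory dataset. The sample CSV text, the constant names, and the use of io.StringIO are assumptions made for illustration; the project itself reads its data from CSV files on disk.

```python
import csv
import io

# Hypothetical sample data, standing in for the project's CSV files.
PEOPLE_CSV = "id,name,birth\n102,Tom Hanks,1956\n"
STARS_CSV = "person_id,movie_id\n102,500\n102,700\n"

people = {}

# Create one entry per person, each starting with an empty set of movie IDs.
for row in csv.DictReader(io.StringIO(PEOPLE_CSV)):
    people[row["id"]] = {
        "name": row["name"],
        "birth": row["birth"],
        "movies": set(),
    }

# Attach each movie ID to the matching person.
for row in csv.DictReader(io.StringIO(STARS_CSV)):
    people[row["person_id"]]["movies"].add(row["movie_id"])

print(people["102"]["movies"])
```

Note that the IDs are strings, because csv.DictReader returns every field as text. Printing a set may show the elements in either order.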
Why Not Use a List?
A list could also store movie IDs:
    ["500", "700"]
So why did the project choose sets?
Because sets provide several important advantages.
1. Sets Automatically Prevent Duplicate Entries
Suppose the CSV data accidentally contains duplicate rows:
    102,500
    102,500
If movies were stored in a list:
    ["500", "500"]
the duplicate remains.
But with a set:
    {"500"}
the duplicate is discarded automatically.
This is important because an actor should not be connected to the same movie multiple times.
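The difference can be checked directly. The sketch below feeds the same rows, including one accidental duplicate, into both a list and a set; the row data and variable names are made up for illustration.

```python
# Hypothetical (person_id, movie_id) rows, with one accidental duplicate.
rows = [("102", "500"), ("102", "500"), ("102", "700")]

movies_list = []
movies_set = set()

for _person_id, movie_id in rows:
    movies_list.append(movie_id)  # keeps every occurrence
    movies_set.add(movie_id)      # adding an existing element changes nothing

print(len(movies_list))  # 3
print(len(movies_set))   # 2
```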
2. Sets Provide Faster Membership Checking
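Python sets are hash-based, so checking whether an element is present takes roughly constant time on average, no matter how many elements the set holds. A list has to be scanned element by element, so the same check gets slower as the list grows.
Example:
    "500" in people["102"]["movies"]
With a set, this check does not need to look at every stored movie ID.
The CS50 Degrees project walks these actor-movie relationships many times while running Breadth-First Search (BFS), so cheap lookups and duplicate-free collections matter.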
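A rough way to see the lookup difference is to time the same membership check on a list and on a set. This is only a sketch: the collection size, the target value, and the repeat count are arbitrary assumptions, and the measured times will vary from machine to machine.

```python
import timeit

# Hypothetical collection of 100,000 movie IDs stored both ways.
ids_list = [str(n) for n in range(100_000)]
ids_set = set(ids_list)

target = "99999"  # last element: the worst case for a list scan

# Both containers give the same answer; only the cost differs.
found_in_list = target in ids_list
found_in_set = target in ids_set

list_time = timeit.timeit(lambda: target in ids_list, number=100)
set_time = timeit.timeit(lambda: target in ids_set, number=100)

print(f"list: {list_time:.4f}s, set: {set_time:.6f}s")
```

On typical hardware the set lookup should come out far faster, but the exact ratio depends on the data and the machine.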
3. Order Does Not Matter
The Degrees project only cares about:
- whether an actor appeared in a movie
- which actors are connected
It does NOT care about:
- the order of movies
- which movie came first
- sorting
Sets are ideal when:
- uniqueness matters
- order does not matter
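A quick illustration of why ordering is irrelevant for sets: two sets with the same elements are equal regardless of the order they were written in, while two lists are not. The variable names here are only for illustration.

```python
# The same two movie IDs, written in a different order.
a_list, b_list = ["500", "700"], ["700", "500"]
a_set, b_set = {"500", "700"}, {"700", "500"}

print(a_list == b_list)  # False: lists compare element by element, in order
print(a_set == b_set)    # True: sets compare by membership only
```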
How Movies Are Added to the Set
Sets use .add(), not .append().
Example:
    movies_set = set()
    movies_set.add("500")
    movies_set.add("700")
    movies_set.add("500")
Result:
    {"500", "700"}
The duplicate "500" is ignored automatically.
List vs Set Comparison
Using a List
    movies = ["500", "700", "500"]
Problems:
- duplicates allowed
- slower membership checks
Using a Set
    movies = {"500", "700"}
Advantages:
- duplicates removed automatically
- faster lookups
- excellent for graph relationships
Why This Matters for BFS
The CS50 Degrees project models a graph like this:
    Actor → Movie → Actor → Movie → Actor
The BFS algorithm repeatedly explores:
- which movies an actor appeared in
- which actors appeared in those movies
This requires:
- fast relationship traversal
- efficient duplicate prevention
- clean graph connections
Sets are perfect for this type of graph structure.
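The traversal described above can be sketched on a toy graph. This is not the project's own code: the sample data, the "stars" key on each movie, and the helper names neighbors and degrees are all assumptions made for illustration. It only shows how sets fit into a breadth-first search.

```python
from collections import deque

# Hypothetical toy data: people map to movie IDs, movies map to person IDs.
people = {
    "1": {"name": "A", "movies": {"10"}},
    "2": {"name": "B", "movies": {"10", "20"}},
    "3": {"name": "C", "movies": {"20"}},
}
movies = {
    "10": {"stars": {"1", "2"}},
    "20": {"stars": {"2", "3"}},
}


def neighbors(person_id):
    """Return the set of people who share at least one movie with person_id."""
    result = set()
    for movie_id in people[person_id]["movies"]:
        result |= movies[movie_id]["stars"]  # set union merges without duplicates
    result.discard(person_id)
    return result


def degrees(source, target):
    """Return the number of movie links between source and target, or None."""
    visited = {source}  # a set, so each person is explored at most once
    queue = deque([(source, 0)])
    while queue:
        person_id, steps = queue.popleft()
        if person_id == target:
            return steps
        # Set difference skips everyone who has already been seen.
        for other in neighbors(person_id) - visited:
            visited.add(other)
            queue.append((other, steps + 1))
    return None


print(degrees("1", "3"))  # 2
```

Here sets do three jobs: the union in neighbors merges co-stars from several movies without repeats, the visited set prevents revisiting a person, and the set difference filters out already-seen people in one step.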
Real-World Uses of Sets
Sets are commonly used for:
- unique website visitors
- friend relationships in social networks
- unique tags or hashtags
- graph connections
- database relationship modeling
The Degrees project is essentially building a movie-actor social graph.
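Several of these uses rely on set operators rather than just uniqueness. As a small sketch with made-up user names, the friend-relationship case might look like this:

```python
# Hypothetical friend lists for two users of a social network.
friends = {
    "alice": {"bob", "carol", "dave"},
    "erin": {"carol", "dave", "frank"},
}

mutual = friends["alice"] & friends["erin"]      # in both sets
either = friends["alice"] | friends["erin"]      # in at least one set
only_alice = friends["alice"] - friends["erin"]  # in the first set only

print(sorted(mutual))      # ['carol', 'dave']
print(sorted(either))      # ['bob', 'carol', 'dave', 'frank']
print(sorted(only_alice))  # ['bob']
```

The same operators would apply to movie IDs, for example to find the movies two actors have in common.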
Final Insight
The line:
    "movies": set()
is not accidental.
It is a deliberate design choice because:
- duplicate movie IDs should not exist
- fast membership checking is important
- ordering is irrelevant
- graph traversal benefits from set operations
This is one of the first examples where a beginner can see how data structure choice directly affects algorithm efficiency and program design.
