Understanding the Difference Between val_predictions and val_y in Kaggle Model Validation
May 23, 2026 at 11:06 am #6631
A learner working through the Kaggle tutorial on model validation became confused after seeing two different variables:
- val_predictions
- val_y
At first glance, both seem to represent the same thing: the dependent variable (y), which in the Iowa housing dataset is the house price.
The confusion usually comes from code like this:
[code language="python"]
# get predicted prices on validation data
val_predictions = iowa_model.predict(val_X)
# print the top few validation predictions
print(val_predictions[:5])
# print the top few actual prices from validation data
print(val_y.head())
[/code]
The learner wondered:
If both are house prices, then why do we need both? Aren’t they the same dependent variable?
Short Answer
They are related to the same dependent variable, but they are not the same data.
- val_y = the real, correct answers from the dataset
- val_predictions = the model’s guessed answers
What Happens During Validation?
In machine learning, the model is trained using training data:
[code language="python"]
train_X
train_y
[/code]
After the model has learned patterns from the training data, we test it using validation data:
[code language="python"]
val_X
val_y
[/code]
The important detail:
- val_X contains only input features
- val_y contains the real answers
Now we ask the model:
“Can you predict the house prices for these validation houses?”
That is done using:
[code language="python"]
val_predictions = iowa_model.predict(val_X)
[/code]
So:
- val_predictions are generated by the model
- val_y already existed in the dataset
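To see the whole sequence run end to end, here is a minimal sketch. It assumes a small synthetic dataset (a made-up LotArea column and a made-up price rule) in place of the real Iowa data, and it assumes a DecisionTreeRegressor as the model. Only the variable names come from the snippet above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the Iowa data: price grows with lot area, plus noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"LotArea": rng.integers(5000, 15000, size=200)})
y = pd.Series(20 * X["LotArea"] + rng.normal(0, 10000, size=200), name="SalePrice")

# Split the rows into a training part and a validation part.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# The model only sees train_X and train_y while fitting.
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)

# val_y is never passed to predict(); the model must guess from val_X alone.
val_predictions = iowa_model.predict(val_X)

# One guess per validation row, to be compared with one real answer per row.
print(len(val_X), len(val_y), len(val_predictions))
```

The point to notice is that val_y comes out of the split, before any model exists, while val_predictions only appears after calling predict().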
Real-World Analogy
Imagine a teacher giving students a math test.
- The answer sheet = val_y
- The student’s answers = val_predictions
Validation is simply checking:
“How close are the predictions to the real answers?”
Example
Suppose the actual prices are:
[code language="python"]
val_y
[/code]
| House | Actual Price |
|-------|--------------|
| 1     | 200000       |
| 2     | 150000       |
| 3     | 300000       |

Now suppose the model predicts:
[code language="python"]
val_predictions
[/code]
| House | Predicted Price |
|-------|-----------------|
| 1     | 210000          |
| 2     | 140000          |
| 3     | 310000          |

Notice:
- The predictions are close
- But they are not exactly equal
That difference is called the prediction error.
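As a rough illustration, the error for each of the three example houses can be worked out in a few lines. The sign convention used here (predicted minus actual) is an assumption; some texts subtract the other way round.

```python
# Values copied from the example above.
actual = [200000, 150000, 300000]      # plays the role of val_y
predicted = [210000, 140000, 310000]   # plays the role of val_predictions

# Assumed convention: error = predicted - actual.
# Positive means the model overestimated, negative means it underestimated.
errors = [p - a for a, p in zip(actual, predicted)]

for house, (a, p, e) in enumerate(zip(actual, predicted, errors), start=1):
    print(f"House {house}: actual={a}, predicted={p}, error={e:+d}")
```

Every house is off by 10000 in one direction or the other, which is exactly the gap that validation is meant to expose.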
Why Validation Exists
Without validation, we would never know whether the model is good or bad.
We compare:
[code language="python"]
val_predictions
[/code]
against:
[code language="python"]
val_y
[/code]
to measure how accurate the model is.
One common metric is MAE (Mean Absolute Error):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$
where:
- $y_i$ = actual values (val_y)
- $\hat{y}_i$ = predicted values (val_predictions)
- $n$ = number of validation examples
Smaller MAE means better predictions.
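The formula can be checked against the three example houses. This sketch computes MAE once by hand with NumPy and once with scikit-learn's mean_absolute_error. The arrays reuse the example numbers, not real tutorial output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Example values from the earlier tables, used here as stand-ins.
val_y = np.array([200000, 150000, 300000])
val_predictions = np.array([210000, 140000, 310000])

# Direct translation of the formula: the mean of |y_i - y_hat_i|.
mae_by_hand = np.mean(np.abs(val_y - val_predictions))

# scikit-learn's helper takes the true values first, then the predictions.
mae_sklearn = mean_absolute_error(val_y, val_predictions)

print(mae_by_hand, mae_sklearn)
```

Both approaches give the same number, which reads as "on average, a prediction is off by this many dollars".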
Important Concept
A beginner often thinks:
“If predictions are prices and actual values are also prices, then they must be identical.”
But the entire purpose of machine learning is:
to generate estimates that try to match the real values.
If predictions and actual values were always automatically identical, there would be no need for machine learning at all.
Another Important Observation
Notice these lines:
[code language="python"]
print(val_predictions[:5])
print(val_y.head())
[/code]
Why different syntax?
Because:
- val_predictions is usually a NumPy array
- val_y is usually a pandas Series
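One way to confirm this is to inspect the two objects directly. The sketch below uses made-up values and made-up row labels (17, 4, 52) rather than real tutorial output. The labels stand in for the original dataset row numbers that a Series carries along after a shuffled split.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the real variables.
val_predictions = np.array([210000.0, 140000.0, 310000.0])
val_y = pd.Series([200000, 150000, 300000], index=[17, 4, 52], name="SalePrice")

print(type(val_predictions))  # the NumPy array type
print(type(val_y))            # the pandas Series type

# NumPy arrays have no head() method, so slicing is used instead.
print(hasattr(val_predictions, "head"))

first_predictions = val_predictions[:2]  # first two elements, no labels
first_actuals = val_y.head(2)            # first two rows, labels kept

print(first_predictions)
print(first_actuals)
```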
So:
- NumPy arrays use slicing:
[code language="python"]
val_predictions[:5]
[/code]
- Pandas Series use:
[code language="python"]
val_y.head()
[/code]
Source
This discussion is based on Kaggle’s tutorial on model validation.
