Understanding the Difference Between val_predictions and val_y in Kaggle Model Validation

May 23, 2026 at 11:06 am #6631
Rajeev Bagra
Keymaster
A learner working through the Kaggle tutorial on model validation became confused after seeing two different variables:
- val_predictions
- val_y
At first glance, both seem to represent the same thing — the dependent variable (y), which in the Iowa housing dataset is the house price.

The confusion usually comes from code like this:

[code language="python"]
# get predicted prices on validation data
val_predictions = iowa_model.predict(val_X)

# print the top few validation predictions
print(val_predictions[:5])

# print the top few actual prices from validation data
print(val_y.head())
[/code]

The learner wondered:

If both are house prices, then why do we need both? Aren’t they the same dependent variable?

Short Answer

They are related to the same dependent variable, but they are not the same data.
- val_y = the real correct answers from the dataset
- val_predictions = the model’s guessed answers
What Happens During Validation?

In machine learning, the model is trained using training data:

[code language="python"]
train_X
train_y
[/code]

After learning patterns, we test the model using validation data:

[code language="python"]
val_X
val_y
[/code]

The important detail:
- val_X contains only input features
- val_y contains the real answers
Now we ask the model:

“Can you predict the house prices for these validation houses?”

That is done using:

[code language="python"]
val_predictions = iowa_model.predict(val_X)
[/code]

So:
- val_predictions are generated by the model
- val_y already existed in the dataset
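
The steps above can be sketched end to end. This is only a minimal sketch, not the tutorial's exact notebook: the small `home_data` table and all of its numbers are invented for illustration, and the choice of `DecisionTreeRegressor` with `random_state=1` is an assumption made so the example runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Invented toy data standing in for the Iowa housing dataset (not real records).
home_data = pd.DataFrame({
    "LotArea":   [8000, 9500, 11000, 7000, 14000, 12500, 10000, 9000],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973],
    "SalePrice": [200000, 150000, 300000, 120000, 310000, 260000, 280000, 140000],
})

X = home_data[["LotArea", "YearBuilt"]]   # input features
y = home_data["SalePrice"]                # the real answers

# Split once: the rows in val_X and val_y are never shown to the model during fit.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)          # learns only from the training rows

# The model receives val_X only; val_y is kept aside as the real answers.
val_predictions = iowa_model.predict(val_X)

print(type(val_y).__name__)               # Series
print(type(val_predictions).__name__)     # ndarray
print(len(val_y) == len(val_predictions)) # True
```

Note that `val_y` is never passed to `predict`. It comes straight from the dataset, while `val_predictions` only exists after the model has produced it.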
Real-World Analogy

Imagine a teacher giving students a math test.
- The answer sheet = val_y
- The student’s answers = val_predictions
Validation is simply checking:

“How close are the predictions to the real answers?”

Example

Suppose the actual prices are:

[code language="python"]
val_y
[/code]

House	Actual Price
1	200000
2	150000
3	300000

Now suppose the model predicts:

[code language="python"]
val_predictions
[/code]

House	Predicted Price
1	210000
2	140000
3	310000

Notice:
- The predictions are close
- But they are not exactly equal
That difference is called the prediction error.
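
Using the numbers from the two tables above, the per-house error can be worked out directly. The list names `actual_prices` and `predicted_prices` are stand-ins chosen for this sketch; in the tutorial code they would correspond to `val_y` and `val_predictions`.

```python
# Values copied from the example tables above.
actual_prices = [200000, 150000, 300000]      # plays the role of val_y
predicted_prices = [210000, 140000, 310000]   # plays the role of val_predictions

# Signed error: positive means the model guessed too high, negative too low.
errors = [pred - actual for pred, actual in zip(predicted_prices, actual_prices)]

for house, error in enumerate(errors, start=1):
    print(f"House {house}: error = {error:+d}")
```

Here the model overshoots houses 1 and 3 by 10000 and undershoots house 2 by 10000.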

Why Validation Exists

Without validation, we would never know whether the model is good or bad.

We compare:

[code language="python"]
val_predictions
[/code]

against:

[code language="python"]
val_y
[/code]

to measure how close the predicted prices are to the real prices.

One common metric is MAE (Mean Absolute Error):

MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|

where:
- y_i = actual values
- \hat{y}_i = predicted values
Smaller MAE means better predictions.
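
To make the formula concrete, here is a small sketch that applies it by hand to the three example houses. The helper name `mean_absolute_error_by_hand` is made up for this illustration and is not part of any library.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average of |y_i - y_hat_i| over all examples."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Values copied from the example tables in this post.
actual_prices = [200000, 150000, 300000]      # plays the role of val_y
predicted_prices = [210000, 140000, 310000]   # plays the role of val_predictions

mae = mean_absolute_error_by_hand(actual_prices, predicted_prices)
print(mae)  # 10000.0
```

The absolute value matters: the signed errors here are +10000, -10000, and +10000, so a plain average would partly cancel out and understate how far off the model is. In practice, scikit-learn's `mean_absolute_error(val_y, val_predictions)` from `sklearn.metrics` computes the same quantity.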

Important Concept

A beginner often thinks:

“If predictions are prices and actual values are also prices, then they must be identical.”

But the entire purpose of machine learning is:

to generate estimates that try to match the real values.

If predictions and actual values were always automatically identical, there would be no need for machine learning at all.

Another Important Observation

Notice these lines:

[code language="python"]
print(val_predictions[:5])
print(val_y.head())
[/code]

Why different syntax?

Because:
- val_predictions is usually a NumPy array
- val_y is usually a pandas Series
So:
- NumPy arrays use slicing:
    [code language="python"]
    val_predictions[:5]
    [/code]
- Pandas Series use:
    [code language="python"]
    val_y.head()
    [/code]
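
The difference between the two types can be checked with a short sketch. The values and index labels below are invented for illustration; the index labels imitate the original row numbers that a pandas Series typically keeps after the data has been split.

```python
import numpy as np
import pandas as pd

# Invented values; the index labels mimic row numbers kept from the original data.
val_y = pd.Series([200000, 150000, 300000], index=[17, 4, 42], name="SalePrice")
val_predictions = np.array([210000.0, 140000.0, 310000.0])

print(val_predictions[:2])                     # slicing works on a NumPy array
print(val_y.head(2))                           # head() is a pandas method
print(val_y.iloc[:2].equals(val_y.head(2)))    # positional slicing gives the same rows
print(hasattr(val_predictions, "head"))        # False: arrays have no head()

# Put both in one table. val_predictions has no index of its own, so it is
# matched to val_y purely by position (same row order).
comparison = pd.DataFrame({"actual": val_y, "predicted": val_predictions})
comparison["error"] = comparison["predicted"] - comparison["actual"]
print(comparison)
```

So the different syntax is a matter of convenience rather than necessity: a Series can also be sliced by position, but a plain NumPy array does not offer `head()`.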

Source

This discussion is based on Kaggle’s tutorial:

Kaggle Model Validation Tutorial by Dan Becker

(kaggle.com)