Chapter 1 · Simple Linear Regression

Worked Example — MLB Home Runs

Predicting runs scored from home runs hit across six teams, with every number computed by hand.

Step 1 of 8

The raw data

Do teams that hit more home runs score more runs? x = home runs, y = runs scored — both per season, six teams.

[Figure: A data table on a clipboard. Caption: Six rows, two columns.]

6 MLB teams — one season

Team      Home runs (x)   Runs scored (y)
Yankees   245             807
Dodgers   221             758
Red Sox   198             726
Astros    214             749
Cubs      177             691
Padres    162             668
Moneyball connection: Billy Beane's A's found that some cheap counting stats had a big effect on winning. Spotting which x-variables move y the most is exactly what regression does.
Step 2 of 8

Find the means x̄ and ȳ

The means are the data's centre of gravity. A key OLS fact: the fitted line always passes through the point (x̄, ȳ).

[Figure: Crosshair marking the mean point of a scatter. Caption: The line must pass through (x̄, ȳ).]

x̄ — mean home runs

x̄ = (245 + 221 + 198 + 214 + 177 + 162) / 6 = 1217 / 6 = 202.83

ȳ — mean runs scored

ȳ = (807 + 758 + 726 + 749 + 691 + 668) / 6 = 4399 / 6 = 733.17
(x̄, ȳ) = (202.83, 733.17) is the "average team" in this sample. No matter what the slope and intercept turn out to be, the regression line is guaranteed to run through this point.
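The two means can be checked in a few lines of Python. This is only a minimal sketch: the variable names (home_runs, runs_scored, x_bar, y_bar) are illustrative choices, not part of the original example.

```python
# Home runs (x) and runs scored (y) for the six teams, in table order.
home_runs = [245, 221, 198, 214, 177, 162]
runs_scored = [807, 758, 726, 749, 691, 668]

n = len(home_runs)
x_bar = sum(home_runs) / n    # 1217 / 6
y_bar = sum(runs_scored) / n  # 4399 / 6

print(f"x_bar = {x_bar:.2f}")  # x_bar = 202.83
print(f"y_bar = {y_bar:.2f}")  # y_bar = 733.17
```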
Step 3 of 8

Compute Sxx and Sxy

Sxx measures how spread out the home-run counts are. Sxy measures how x and y move together. Both feed straight into the slope.
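The two sums described above can be sketched in Python as follows. The names dx, dy, s_xx and s_xy are illustrative. The sketch keeps the deviations unrounded, so its totals can differ in the last digit from a hand calculation that rounds each deviation to two decimals first.

```python
home_runs = [245, 221, 198, 214, 177, 162]
runs_scored = [807, 758, 726, 749, 691, 668]

n = len(home_runs)
x_bar = sum(home_runs) / n
y_bar = sum(runs_scored) / n

# Deviation of each observation from its own mean.
dx = [x - x_bar for x in home_runs]
dy = [y - y_bar for y in runs_scored]

# Sxx: sum of squared x deviations. Sxy: sum of deviation cross-products.
s_xx = sum(d * d for d in dx)
s_xy = sum(a * b for a, b in zip(dx, dy))

print(f"Sxx = {s_xx:.2f}")  # Sxx = 4590.83
print(f"Sxy = {s_xy:.2f}")  # Sxy = 7526.17
```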

[Figure: Arrows showing deviations from the centre line. Caption: Deviations from the mean.]

Deviation table

Team      xᵢ−x̄     yᵢ−ȳ     (xᵢ−x̄)²   (xᵢ−x̄)(yᵢ−ȳ)
Yankees   42.17    73.83    1778.3    3113.4
Dodgers   18.17    24.83    330.1     451.2
Red Sox   −4.83    −7.17    23.3      34.6
Astros    11.17    15.83    124.8     176.8
Cubs      −25.83   −42.17   667.2     1089.3
Padres    −40.83   −65.17   1667.1    2660.9
Σ         ≈ 0      ≈ 0      4590.8    7526.2

Sxx = Σ(xᵢ−x̄)² = 4590.8
Sxy = Σ(xᵢ−x̄)(yᵢ−ȳ) = 7526.2
The deviation columns sum to zero up to rounding, which is always true and makes a handy arithmetic check. Sxy is large and positive: teams with more home runs clearly tend to score more runs.
Step 4 of 8

The slope β̂₁ and intercept β̂₀

With Sxx and Sxy in hand, the line is one division and one subtraction away.

[Figure: A line with a slope triangle showing rise over run. Caption: Slope = rise ÷ run.]

Slope

β̂₁ = Sxy / Sxx = 7526.2 / 4590.8 = 1.639

Each extra home run is worth about 1.639 more runs.

Intercept

β̂₀ = ȳ − β̂₁ × x̄ = 733.17 − 1.639 × 202.83 = 400.73
ŷ = 400.73 + 1.639 × x
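The division and subtraction above can be sketched in Python as below. The variable names are illustrative. Because the sketch carries the slope at full precision instead of rounding it to 1.639 before multiplying, its intercept comes out near 400.64 instead of the hand value 400.73; the gap is purely rounding.

```python
home_runs = [245, 221, 198, 214, 177, 162]
runs_scored = [807, 758, 726, 749, 691, 668]

n = len(home_runs)
x_bar = sum(home_runs) / n
y_bar = sum(runs_scored) / n

s_xx = sum((x - x_bar) ** 2 for x in home_runs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, runs_scored))

slope = s_xy / s_xx                # one division
intercept = y_bar - slope * x_bar  # one subtraction

print(f"slope = {slope:.4f}")          # slope = 1.6394
print(f"intercept = {intercept:.2f}")  # intercept = 400.64

# The fitted line evaluated at x_bar returns y_bar, as stated in Step 2.
print(abs(intercept + slope * x_bar - y_bar) < 1e-9)  # True
```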
Matrix connection: β̂₁ = Sxy / Sxx is exactly the slope entry of β̂ = (XᵀX)⁻¹XᵀY when X holds a column of ones and a single predictor. Add more predictors and you need the full matrix; with a single x it collapses to this ratio.
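To see the matrix formula collapse to the ratio, here is a hedged sketch that writes out (XᵀX)⁻¹XᵀY for a design matrix with a column of ones and one predictor, using the closed-form inverse of a 2×2 matrix. No linear-algebra library is assumed, and the names beta0, beta1 and det are illustrative.

```python
home_runs = [245, 221, 198, 214, 177, 162]
runs_scored = [807, 758, 726, 749, 691, 668]
n = len(home_runs)

# Entries of X'X = [[n, sum_x], [sum_x, sum_xx]] and X'Y = [sum_y, sum_xy]
# for X = [column of ones | x].
sum_x = sum(home_runs)
sum_y = sum(runs_scored)
sum_xx = sum(x * x for x in home_runs)
sum_xy = sum(x * y for x, y in zip(home_runs, runs_scored))

# Determinant of X'X, needed for the 2x2 inverse.
det = n * sum_xx - sum_x ** 2

# (X'X)^-1 X'Y multiplied out by hand.
beta0 = (sum_xx * sum_y - sum_x * sum_xy) / det
beta1 = (n * sum_xy - sum_x * sum_y) / det

# The same slope from the deviation form Sxy / Sxx.
x_bar = sum_x / n
y_bar = sum_y / n
s_xx = sum((x - x_bar) ** 2 for x in home_runs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, runs_scored))

print(f"matrix slope = {beta1:.4f}, ratio slope = {s_xy / s_xx:.4f}")
print(f"matrix intercept = {beta0:.2f}")
```

Both slopes agree because n·Sxy and n·Sxx are the numerator and determinant of the matrix form, so the factor n cancels in the ratio.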
Step 5 of 8

Fitted values ŷᵢ and residuals eᵢ

Plug each xᵢ into the line to get ŷᵢ. The residual eᵢ = yᵢ − ŷᵢ is how far off the model is for that team.

[Figure: Scatter points with dashed lines to a fitted line. Caption: Each dashed gap is a residual.]

ŷᵢ = 400.73 + 1.639 × xᵢ

Team      yᵢ    ŷᵢ      eᵢ     eᵢ²
Yankees   807   802.3   +4.7   22.1
Dodgers   758   763.0   −5.0   25.0
Red Sox   726   725.2   +0.8   0.64
Astros    749   751.5   −2.5   6.25
Cubs      691   690.8   +0.2   0.04
Padres    668   666.2   +1.8   3.24
Σ                       ≈ 0    57.3
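Residuals always sum to zero in OLS when the model includes an intercept. The Astros' −2.5 means they scored 2.5 fewer runs than predicted; home runs alone can't capture everything (a few stranded runners, say). The eᵢ² column squares the residuals after rounding them to one decimal, so its total of 57.3 sits slightly above the full-precision value of about 56.5.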
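A short Python sketch of this step follows. It refits the line at full precision and does not use the rounded coefficients 400.73 and 1.639, so individual fitted values may differ from a hand table in the last digit. The team list and variable names are illustrative.

```python
teams = ["Yankees", "Dodgers", "Red Sox", "Astros", "Cubs", "Padres"]
home_runs = [245, 221, 198, 214, 177, 162]
runs_scored = [807, 758, 726, 749, 691, 668]

n = len(home_runs)
x_bar = sum(home_runs) / n
y_bar = sum(runs_scored) / n
s_xx = sum((x - x_bar) ** 2 for x in home_runs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, runs_scored))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted value: plug each x into the line. Residual: observed minus fitted.
fitted = [intercept + slope * x for x in home_runs]
residuals = [y - y_hat for y, y_hat in zip(runs_scored, fitted)]

for team, y, y_hat, e in zip(teams, runs_scored, fitted, residuals):
    print(f"{team:<8} y={y}  y_hat={y_hat:6.1f}  e={e:+5.1f}")

# Zero up to floating-point error, because the model has an intercept.
print(f"sum of residuals = {sum(residuals):.1e}")
```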
Step 6 of 8

SSE and the variance estimate s²

SSE is the total squared error after fitting. s² estimates σ² — the natural scatter around the true line.

[Figure: A residual gap drawn as a literal square. Caption: Square each residual, then add.]

Sum of squared errors

SSE = Σ eᵢ² = 22.1 + 25.0 + 0.64 + 6.25 + 0.04 + 3.24 = 57.3

Variance estimate

s² = SSE / (n − 2) = 57.3 / 4 = 14.33
s = √14.33 = 3.79 runs
Why n − 2? We used up 2 degrees of freedom estimating β̂₀ and β̂₁, so we divide by n − 2, not n. Writing n here is the most common slip on this topic.
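The same calculation can be sketched in Python, dividing by n − 2 as explained above. Working from unrounded residuals, this sketch gives SSE ≈ 56.5 and s ≈ 3.76, a little below the hand values, which square residuals already rounded to one decimal. Variable names are illustrative.

```python
import math

home_runs = [245, 221, 198, 214, 177, 162]
runs_scored = [807, 758, 726, 749, 691, 668]

n = len(home_runs)
x_bar = sum(home_runs) / n
y_bar = sum(runs_scored) / n
s_xx = sum((x - x_bar) ** 2 for x in home_runs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, runs_scored))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(home_runs, runs_scored)]

sse = sum(e * e for e in residuals)  # total squared error after fitting
s_squared = sse / (n - 2)            # two parameters estimated, so n - 2
s = math.sqrt(s_squared)             # back in units of runs

print(f"SSE = {sse:.2f}, s^2 = {s_squared:.2f}, s = {s:.2f}")
# SSE = 56.51, s^2 = 14.13, s = 3.76
```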
Step 7 of 8

R² — how much we explained

R² answers: what proportion of the total variation in runs does the model account for?

[Figure: A donut chart showing explained versus unexplained variation. Caption: Explained vs unexplained.]

Total sum of squares

TSS = Σ(yᵢ − ȳ)² = 5450.9 + 616.5 + 51.4 + 250.6 + 1778.3 + 4247.1 = 12394.8

R² = 1 − SSE / TSS = 1 − 57.3 / 12394.8 = 0.9954

TSS (total): 12394.8
SSE (unexplained): 57.3
R² ≈ 0.995
Home runs explain 99.5% of the variation in runs scored across these six teams. The other 0.5% is everything else — walks, singles, stolen bases. (Real research rarely gets this high; here it's genuine, since HR teams simply score more.)
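As a cross-check, R² can be computed two ways in Python: from 1 − SSE / TSS as above, and from the identity R² = Sxy² / (Sxx · TSS), which holds for simple linear regression. This is a sketch with illustrative variable names, using unrounded intermediate values.

```python
home_runs = [245, 221, 198, 214, 177, 162]
runs_scored = [807, 758, 726, 749, 691, 668]

n = len(home_runs)
x_bar = sum(home_runs) / n
y_bar = sum(runs_scored) / n
s_xx = sum((x - x_bar) ** 2 for x in home_runs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, runs_scored))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Total variation in y, and the part left over after fitting.
tss = sum((y - y_bar) ** 2 for y in runs_scored)
sse = sum((y - (intercept + slope * x)) ** 2
          for x, y in zip(home_runs, runs_scored))

r_squared = 1 - sse / tss
r_squared_alt = s_xy ** 2 / (s_xx * tss)  # same quantity, different route

print(f"TSS = {tss:.2f}")        # TSS = 12394.83
print(f"R^2 = {r_squared:.4f}")  # R^2 = 0.9954
```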
Step 8 of 8

Putting it into words

Every quantity, its value, and what it actually means — plus how to phrase it in an exam.

[Figure: A lightbulb representing the interpretation. Caption: From numbers to meaning.]

Summary

Quantity   Value    Meaning
β̂₁         1.639    +1 HR → +1.639 runs
β̂₀         400.73   Baseline runs at x = 0
SSE        57.3     Unexplained squared error
s          3.79     Typical miss (± runs)
R²         0.9954   99.5% of variation explained

How to write it in an exam

Slope: For each additional home run in a season, a team is estimated to score 1.639 more runs on average.

Intercept: A team hitting zero home runs is predicted to score 400.73 runs (non-HR scoring) — but x = 0 is outside the data, so it isn't practically meaningful.

R²: Home runs explain 99.5% of the variation in runs scored — an excellent fit.