
Data Analysis of Correlation and Regression Notebooks

Ross Bulat · Full Stack Engineer · 12 min read

Notebook 1: Covariance, Pearson Correlation and Regression

Google Colab notebook: Open in Google Colab

Purpose of the notebook

I expanded a basic correlation and regression example into controlled experiments to explore how dataset properties affect statistical results. I varied sample size, noise, outliers, non-linearity, and data range, calculating and visualizing correlations, regressions, and R² for each.

Changes and experiments

The first change was to make the notebook reusable by adding helper functions instead of writing all calculations in one code block.

The first experiment varied sample size: 10, 30, 100 and 1,000 points. Small datasets produced unstable estimates because individual points heavily influenced the slope and correlation. Larger samples stabilized the relationship.
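This instability can be sketched with synthetic data. The generator below (y = 2x plus Gaussian noise) is illustrative rather than the notebook's actual data; the point is that the spread of Pearson r across resamples shrinks as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson_r(n, noise_sd=10.0):
    """Simulate y = 2x + noise and return the sample Pearson correlation."""
    x = rng.uniform(0, 100, n)
    y = 2 * x + rng.normal(0, noise_sd, n)
    return np.corrcoef(x, y)[0, 1]

# Small samples give unstable estimates; large samples stabilise.
for n in (10, 30, 100, 1000):
    rs = [pearson_r(n) for _ in range(200)]
    print(f"n={n:5d}  mean r={np.mean(rs):.3f}  sd={np.std(rs):.4f}")
```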

The second experiment changed noise levels. As noise increased, the scatter plot spread out, correlation decreased, and R² weakened, showing how noisy data obscures real relationships.
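The same direction of effect shows up in a quick synthetic check (again an illustrative y = 2x generator, not the notebook's data): as the noise standard deviation grows, r and R² fall.

```python
import numpy as np

rng = np.random.default_rng(4)

def corr_at(noise_sd, n=500):
    """Correlation of y = 2x + noise for a given noise level."""
    x = rng.uniform(0, 100, n)
    y = 2 * x + rng.normal(0, noise_sd, n)
    return np.corrcoef(x, y)[0, 1]

for noise_sd in (2, 10, 50):
    r = corr_at(noise_sd)
    print(f"σ={noise_sd:3d}: r={r:.2f}, R²={r * r:.2f}")
```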

The third experiment added outliers. Extreme values shifted the regression line noticeably, especially when distant from the main cluster.

The fourth experiment used a non-linear relationship. The plot showed a clear pattern, but Pearson correlation and linear regression didn't describe it well, demonstrating that low correlation may reflect an unsuitable method, not the absence of a relationship.
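A small synthetic check makes this concrete (quadratic data with illustrative parameters): Pearson r comes out near zero even though a quadratic fit explains almost all of the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-10, 10, 300)
y = x**2 + rng.normal(0, 5, 300)  # strong, but curved, relationship

# Linear measure: near zero, because the pattern is symmetric in x.
r = np.corrcoef(x, y)[0, 1]

# Quadratic fit: explains almost all of the variance.
coeffs = np.polyfit(x, y, 2)
resid = y - np.polyval(coeffs, x)
r2 = 1 - resid.var() / y.var()
print(f"Pearson r = {r:.2f}, quadratic R² = {r2:.2f}")
```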

The fifth experiment restricted the input range, making the relationship appear weaker. This mirrors a practical problem: models trained on narrow data don't generalize well to real-world variation.
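Range restriction is also easy to reproduce synthetically (illustrative slope and noise level): the same linear generator looks far weaker when only a narrow slice of x is observed.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 100, 500)
y = 2 * x + rng.normal(0, 40, 500)

mask = x < 20  # keep only a narrow slice of the input range
r_full = np.corrcoef(x, y)[0, 1]
r_narrow = np.corrcoef(x[mask], y[mask])[0, 1]
print(f"full range r = {r_full:.2f}, restricted r = {r_narrow:.2f}")
```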

What the results showed

Correlation and regression results depend heavily on dataset conditions. More data points made estimates more reliable; more noise obscured relationships; outliers distorted the fitted line; non-linearity limited linear methods; restricted ranges masked real patterns.

| Experiment | n | Pearson r | R² | Notes |
|---|---|---|---|---|
| Baseline (linear, σ=10) | 1,000 | 0.89 | 0.79 | Clear positive trend |
| Small sample | 10 | 0.95 | 0.91 | Looks strong but unstable across reseeds |
| Large sample | 1,000 | 0.89 | 0.79 | Stable estimate |
| Low noise (σ=2) | 1,000 | 0.99 | 0.99 | Almost deterministic |
| High noise (σ=50) | 1,000 | 0.36 | 0.13 | Relationship largely obscured |
| Clean data | 100 | 0.89 | 0.79 | Reference for outlier test |
| With 3 high-leverage outliers | 103 | 0.21 | 0.04 | Outliers collapse the fit |
| Curved (y ≈ x² + noise) | 300 | -0.05 | 0.00 | Linear methods miss the pattern |
| Full x range | 500 | 0.78 | 0.61 | Reference for range test |
| Restricted x range | 500 | 0.29 | 0.09 | Narrow range hides the trend |

This showed that statistics shouldn't be interpreted in isolation. Scatter plots mattered as much as numbers because they revealed data structure. Effective practice requires combining statistical measures with visual inspection and careful documentation.

Analysis and conclusions

The results show that statistical outputs cannot be interpreted in isolation from the data that produced them. Sample size, noise, outliers, non-linearity, and the range of observed values each changed correlation strength and regression fit, even though the underlying generative relationship was held constant.

There are also practical implications beyond the numbers. Models used in sensitive areas (finance, healthcare, recruitment) need reliable data to produce defensible predictions. Restricted datasets can create systems that work for some groups but not others. Outlier handling is a judgement call: unusual points may be errors, but they may also represent important rare cases. And in any setting, assumptions and limitations should be documented rather than hidden behind a single headline statistic.

Critical reflection

The most important takeaway from the notebook is that data analysis is not purely a technical task. A high correlation value or a neat regression line can be misleading if the dataset is too small, too narrow or affected by unexamined bias. Likewise, a low correlation value can be misleading if the relationship is non-linear. This means that machine learning professionals need to understand the context of the dataset before choosing or evaluating an algorithm.

This notebook serves as an introduction to how basic statistical methods raise important professional questions. It demonstrates why dataset quality, representativeness and transparency are central to responsible machine learning.


Notebook 2: Linear Regression

Google Colab notebook: Open in Google Colab

Purpose of the notebook

I explored how dataset changes affect linear regression by starting with a 13-point baseline model and then systematically altering the data: varying sample size, adding consistent points, introducing noise, including outliers, and restricting range. For each variation, I recalculated correlation, refitted the regression line, and observed how predictions changed.

Changes and experiments

I tested five variations of the original 13-point dataset:

Sample size: Repeated random sampling from a simulated population showed the slope's standard deviation falling from ~1.13 (n=5) to ~0.16 (n=100). With only 5 points the fitted line looks plausible but varies wildly across resamples; with 100 it stabilizes.

Trend-consistent points: Adding 10 points that followed the existing negative pattern strengthened the correlation (-0.76 → -0.88) and raised R² (0.58 → 0.78). Prediction at x=10 stayed essentially unchanged (~85.5).

Noisy data: Adding inconsistent points effectively destroyed the relationship: Pearson r flipped to +0.04 and R² collapsed to 0.001. The algorithm still produced a line and prediction, but both were meaningless.

Outlier: A single contradictory point (x=18, y=120) almost completely wiped out the negative trend (r: -0.76 → -0.10, R²: 0.58 → 0.009). Outliers deserve investigation; they may be errors or represent important rare cases.

Restricted range: Using only x ≤ 8 produced stronger local statistics (r=-0.90, R²=0.82) but the predicted value at x=10 drifted to 76.4 (from 85.6 baseline). High correlation did not guarantee reliable extrapolation.
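The baseline and outlier experiments above can be reproduced with scikit-learn. The 13 points below are the classic car age/speed dataset, which matches the quoted baseline (r ≈ −0.76, prediction ≈ 85.6 at x = 10); the extra (18, 120) point is the outlier described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 13-point baseline dataset (car age vs speed).
x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
y = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])

def fit_and_predict(x, y, x_new=10):
    """Fit a simple linear model; return (r, R², prediction at x_new)."""
    model = LinearRegression().fit(x.reshape(-1, 1), y)
    r = np.corrcoef(x, y)[0, 1]
    return r, r**2, model.predict([[x_new]])[0]

print("baseline:    ", fit_and_predict(x, y))

# One contradictory point almost erases the negative trend.
x_out = np.append(x, 18)
y_out = np.append(y, 120)
print("with outlier:", fit_and_predict(x_out, y_out))
```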

What the results showed

The same algorithm produced drastically different results depending on data quality. Consistent data strengthened relationships; noise destroyed them; outliers distorted them; narrow ranges masked extrapolation problems. Most critically, the algorithm couldn't detect when data was unsuitable; it generated predictions regardless.

| Experiment | n | Pearson r | R² | Predict at x=10 |
|---|---|---|---|---|
| Original | 13 | -0.76 | 0.58 | 85.6 |
| + Trend-consistent points | 23 | -0.88 | 0.78 | 85.5 |
| + Noisy points | 23 | +0.04 | 0.001 | 89.6 |
| + 1 Outlier (18, 120) | 14 | -0.10 | 0.009 | 91.5 |
| Restricted range (x ≤ 8) | 8 | -0.90 | 0.82 | 76.4 |

Analysis and conclusions

The headline finding is that data quality matters as much as algorithm choice. The same LinearRegression call produced a defensible model on the original data, a stronger model when trend-consistent points were added, and an essentially meaningless model once noise or a single contradictory outlier was introduced. Restricting the range produced the most subtle failure mode: the local fit improved (r=−0.90) while the prediction at x=10 drifted by nearly 10 units, illustrating that a high in-sample R² is not a guarantee of reliable extrapolation.

In practical settings such as loan scoring, pricing and risk assessment, regression models trained on biased, incomplete or unrepresentative data will still produce confident predictions, and those predictions will still be acted on. The algorithm cannot tell the operator when its inputs are unsuitable.

Critical reflection

Regression is fundamentally fitting a line to a specific dataset. A high R² might seem to validate the model, but statistics can't guarantee practical utility. The professional responsibility is understanding data limitations, validating assumptions, investigating anomalies, and always acknowledging the conditions predictions depend on.


Notebook 3: Multiple Linear Regression

Google Colab notebook: Open in Google Colab

Purpose of the notebook

I explored how dataset changes affect multiple linear regression using a small car dataset. I predicted CO2 emissions from Weight and Volume, then systematically tested variations: different sample sizes, trend-consistent points, noisy data, outliers, and restricted ranges to see how each change affected correlations, coefficients, R², and predictions.

Changes and experiments

I added helper functions and expanded the original single-prediction example into a series of controlled experiments:

Baseline: Fitted a model with Weight and Volume predicting CO2, establishing baseline correlations (Weight–CO2: 0.55, Volume–CO2: 0.59, R²: 0.38). Predicted CO2 at Weight=2300, Volume=1300 ≈ 107.2 g.

Fewer data points: Reduced to 12 samples; estimates became unstable, correlations weakened toward zero (Weight–CO2 even flipped sign), and R² dropped to ~0.18.

Trend-consistent points: Added points following the expected pattern, strengthening correlations (~0.62 / 0.64) and lifting R² to ~0.44.

Noisy data: Added conflicting points that reversed relationships, collapsing R² to ~0.02, showing how poor data quality can mask real patterns.

Outlier: One unusual point pulled coefficients sharply, dropping R² to ~0.04 and weakening both correlations.

Restricted range: Using only lower weight values produced moderate local statistics (R² ~0.22) but unreliable extrapolation to heavier cars.
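The fitting step itself is a small amount of code. The data below is a synthetic stand-in (illustrative coefficients and noise, not the notebook's car dataset), but the call pattern, two predictors and one target, is the same:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the car dataset: Weight and Volume -> CO2.
rng = np.random.default_rng(3)
weight = rng.uniform(800, 2500, 36)
volume = 0.6 * weight + rng.normal(0, 300, 36)
co2 = 90 + 0.008 * weight + 0.01 * volume + rng.normal(0, 5, 36)

X = np.column_stack([weight, volume])
model = LinearRegression().fit(X, co2)
r2 = model.score(X, co2)
pred = model.predict([[2300, 1300]])[0]
print(f"coefficients={model.coef_}, R²={r2:.2f}, prediction={pred:.1f}")
```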

What the results showed

The same algorithm produced drastically different results across experiments. Data quality, sample size, and range coverage all shaped the model's apparent reliability. Notably, R² ranged from ~0.02 to ~0.44 depending on data alterations, yet the algorithm produced confident predictions in every case, unable to signal when data was unsuitable.

| Experiment | n | Corr W–CO2 | Corr V–CO2 | R² | Pred CO2 (2300/1300) |
|---|---|---|---|---|---|
| Baseline | 36 | 0.55 | 0.59 | 0.38 | 107.2 |
| Fewer points | 12 | -0.17 | 0.08 | 0.18 | 68.6 |
| + Trend-consistent | 41 | 0.62 | 0.64 | 0.44 | 109.3 |
| + Noisy points | 41 | 0.06 | 0.13 | 0.02 | 95.6 |
| + Outlier | 37 | 0.17 | 0.20 | 0.04 | 105.5 |
| Restricted (low weight) | 18 | 0.25 | 0.43 | 0.22 | 85.1 |

Analysis and conclusions

Multiple linear regression appears mathematically objective: two coefficients, an intercept, an R². Yet every one of those numbers shifted substantially across the experiments. Even the baseline R² of 0.38 already signals that Weight and Volume alone are incomplete predictors of CO2; fuel type, engine efficiency, vehicle age and driving conditions are all absent from the dataset. That gap, rather than the algorithm itself, is what limits the model's usefulness.

This matters because models of this shape are routinely used in emissions reporting, insurance pricing and regulatory compliance. A practitioner deploying such a model would need to justify the choice of predictors, document how the dataset was assembled, and be transparent about the proportion of variation the model leaves unexplained.

Critical reflection

This notebook reinforced a key lesson: regression is not proof; it is curve-fitting adapted to a specific dataset. The models were too small and simplified for real decision-making. Their value lies in showing why professional judgment matters as much as technical execution: understanding data limitations, investigating anomalies, documenting assumptions, and refusing to overstate what statistics can prove.


Notebook 4: Polynomial Regression

Google Colab notebook: Open in Google Colab

Purpose of the notebook

I explored how dataset changes affect polynomial regression using tollbooth data showing a curved relationship between time of day and vehicle speed. I fitted polynomial models to the data, then systematically tested variations: sample size, trend-consistent points, noisy data, outliers, restricted ranges, and different polynomial degrees to see how each affected the model's reliability.

Changes and experiments

I added helper functions and expanded the original third-degree polynomial example into controlled experiments:

Baseline: Fitted a degree-3 polynomial to predict speed at 17:00 (≈88.9). R² was 0.94 but Pearson r was only 0.43, because the relationship is curved, not linear.

Fewer data points: Used every other observation. R² stayed high (~0.96) and the prediction barely moved (~88.5), but the model was based on much less evidence and would be more vulnerable to sampling bias.

Trend-consistent points: Added points following the curved pattern; R² stayed at ~0.95 and predictions remained stable.

Noisy data: Added conflicting points that distorted the curve, dropping R² to ~0.65 and pulling the 17:00 prediction down to ~81.

Outlier: One extreme value (x=11, y=130) pulled the curve sharply, collapsing R² to ~0.39 and shifting predictions to ~92.

Restricted range: Using only 05:00–16:00 data produced strong local fit (R² ~0.94) but the 17:00 prediction dropped to ~77, showing how extrapolation outside the observed range becomes unreliable.

Polynomial degrees: Compared degrees 1 through 6. Degree 1 underfitted (R²=0.18); degree 3 captured the pattern well (R²=0.94); degree 6 fitted training data even more closely (R²=0.98) but risked overfitting noise rather than learning generalizable structure.
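The degree comparison can be reproduced with numpy.polyfit. The 18 tollbooth points below match the quoted degree-3 baseline (R² ≈ 0.94, prediction ≈ 88.9 at 17:00):

```python
import numpy as np

# Tollbooth data: hour of day vs vehicle speed (18 observations).
x = np.array([1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22])
y = np.array([100, 90, 80, 60, 60, 55, 60, 65, 70, 70,
              75, 76, 78, 79, 90, 99, 99, 100])

def fit_degree(d):
    """Fit a degree-d polynomial; return (R², prediction at x=17)."""
    coeffs = np.polyfit(x, y, d)
    fitted = np.polyval(coeffs, x)
    r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return r2, np.polyval(coeffs, 17)

for d in (1, 2, 3, 4, 6):
    r2, pred = fit_degree(d)
    print(f"degree {d}: R²={r2:.2f}, predict(17)={pred:.1f}")
```

Note that the training R² rises monotonically with degree, which is exactly why it cannot be used on its own to pick the degree.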

What the results showed

Polynomial regression can model curved patterns effectively, but results depend critically on data quality and model complexity. The same algorithm produced different predictions across experiments. Pearson correlation alone was insufficient for evaluating curved relationships. Outliers and noise distorted curves dramatically. Restricted ranges masked extrapolation problems. Higher-degree polynomials improved training fit but risked overfitting.

| Experiment | n | Pearson r | R² | RMSE | Predict at x=17 |
|---|---|---|---|---|---|
| Baseline (degree 3) | 18 | 0.43 | 0.94 | 3.5 | 88.9 |
| Fewer points (every other) | 9 | 0.26 | 0.96 | 2.9 | 88.5 |
| + Trend-consistent | 22 | 0.49 | 0.95 | 3.2 | 88.9 |
| + Noisy points | 22 | 0.22 | 0.65 | 9.2 | 81.1 |
| + 1 Outlier (11, 130) | 19 | 0.33 | 0.39 | 14.5 | 92.3 |
| Restricted (05:00–16:00) | 11 | 0.95 | 0.94 | 2.0 | 77.1 |

Polynomial degree comparison (original data):

| Degree | R² | RMSE | Predict at x=17 |
|---|---|---|---|
| 1 | 0.18 | 13.4 | 83.9 |
| 2 | 0.76 | 7.3 | 81.2 |
| 3 | 0.94 | 3.5 | 88.9 |
| 4 | 0.95 | 3.2 | 87.2 |
| 6 | 0.98 | 2.1 | 84.3 |

Analysis and conclusions

The degree comparison is the clearest illustration of the bias–variance trade-off in this set of notebooks. A degree-1 model underfits the curved tollbooth pattern (R²=0.18) and a degree-6 model achieves the best training R² (0.98), but the degree-6 fit is shaped as much by noise as by signal. This is visible in how its 17:00 prediction drifts to 84.3 while the more conservative degree-3 model predicts 88.9. The right choice is the simplest model that captures the structure, not the one with the highest R².

The other experiments reinforce points seen in the linear cases (outliers and noisy points distort the curve, restricted ranges break extrapolation), but they land harder here because polynomial models are more flexible and therefore more sensitive to each of these problems. In a real traffic-prediction setting, an over-fitted model deployed against limited or biased data could produce confidently wrong outputs that feed into planning or safety decisions.

Critical reflection

The most important lesson was that more flexible models are not automatically better. Polynomial regression handles curves linear regression cannot, but it is also more sensitive to outliers and prone to overfitting. Model selection requires balancing expressiveness with reliability, always considering data limitations. High R² does not guarantee practical utility, especially when data is limited, noisy, or used beyond its observed range.