
Seminar 2 - Exploratory Data Analysis with Auto MPG Dataset

7 min read
Ross Bulat
Full Stack Engineer

Google Colab notebook: Open in Google Colab

Introduction

In this activity, I carried out exploratory data analysis (EDA) on the Auto MPG dataset to judge whether it was suitable for machine learning and to identify key uncertainties before modelling.

The process followed a standard-practice sequence: importing standard analysis libraries, loading the data, inspecting structure and summary statistics, separating numeric and categorical variables, checking missing values, and using visualisations (including heat maps and scatter plots) to examine relationships.

The dataset contains 398 rows and 9 original columns. The main variables are fuel economy (mpg), engine size (displacement), engine power (horsepower), vehicle weight, acceleration, model year, origin, and car name.

1. Loading and inspecting the data

I began by loading auto-mpg.csv in pandas and reviewing the first and last rows, dataset shape, and inferred data types. This quickly exposed an important quality issue: horsepower had been read as an object column instead of a numeric variable.

The reason was that missing values were represented by a question mark (?) rather than standard NaN. This is a useful practical reminder that missing data is not always immediately visible through isna() on raw input, especially when pandas has inferred a text type.
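A minimal sketch of how this kind of hidden missingness looks in pandas, using a few illustrative rows rather than the real file (the notebook loads auto-mpg.csv instead):

```python
import pandas as pd

# Illustrative sample only -- the notebook reads these values from auto-mpg.csv
df = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": ["130", "?", "95"],  # '?' stands in for a missing value
})

# Because of the '?', pandas infers an object (text) dtype for horsepower
print(df.dtypes["horsepower"])

# isna() reports nothing missing -- the '?' is invisible to it
print(df["horsepower"].isna().sum())
```

This is why inspecting dtypes alongside `isna()` matters: a numeric column that arrives as `object` is often the first sign of non-standard missing-value markers.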

2. Identifying missing values

After converting horsepower using pd.to_numeric(..., errors='coerce'), the question marks were correctly converted to missing values. The missing-value summary was:

| column | missing_count | missing_percent |
| --- | --- | --- |
| horsepower | 6 | 1.51 |
| mpg | 0 | 0.00 |
| cylinders | 0 | 0.00 |
| displacement | 0 | 0.00 |
| weight | 0 | 0.00 |
| acceleration | 0 | 0.00 |
| model year | 0 | 0.00 |
| origin | 0 | 0.00 |
| car name | 0 | 0.00 |
| horsepower_imputed | 0 | 0.00 |
| origin_label | 0 | 0.00 |
| origin_encoded | 0 | 0.00 |

This indicates a small but non-trivial missing-data issue: six missing horsepower values (approximately 1.51% of records). To keep the workflow transparent, I retained the original horsepower column and created a separate horsepower_imputed feature using median imputation for analysis that required complete numeric inputs.

I selected median imputation because it is straightforward, reproducible, and generally more robust to skew than mean imputation. That said, it remains an assumption; in a full modelling pipeline, I would compare multiple imputation strategies rather than treat one approach as definitive.
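The coercion-plus-imputation step can be sketched as follows, again with illustrative rows standing in for the full dataset:

```python
import pandas as pd

# Illustrative values; the notebook applies this to the full horsepower column
df = pd.DataFrame({"horsepower": ["130", "?", "95", "110"]})

# errors='coerce' turns the '?' markers into NaN
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# Keep the raw column intact and add a median-imputed copy for analysis
median_hp = df["horsepower"].median()
df["horsepower_imputed"] = df["horsepower"].fillna(median_hp)

print(df["horsepower"].isna().sum())          # missingness is now visible
print(df["horsepower_imputed"].isna().sum())  # imputed copy is complete
```

Preserving both columns keeps the cleaning decision auditable: anyone reviewing the notebook can see exactly which values were imputed and re-run the step with a different strategy.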

3. Encoding categorical values

The task required categorical values to be represented numerically (for example, America = 1, Europe = 2). In this dataset, origin was already stored as numeric codes. To make the transformation explicit and interpretable, I first mapped values to labels and then encoded them:

| Origin | Encoded value |
| --- | --- |
| America | 1 |
| Europe | 2 |
| Japan | 3 |

This satisfies the module requirement, but it also highlights a methodological issue. Integer encoding (1, 2, 3) can imply artificial order or distance between nominal categories. Depending on algorithm choice, this may introduce bias in representation. For later modelling, one-hot encoding may be more appropriate where the goal is to avoid false ordinal assumptions.
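Both representations can be produced in a few lines; the sketch below assumes illustrative rows, with `pd.get_dummies` shown as the one-hot alternative:

```python
import pandas as pd

# Illustrative codes, matching how origin is stored in the raw data
df = pd.DataFrame({"origin": [1, 3, 2, 1]})

# Make the mapping explicit and human-readable before encoding
origin_map = {1: "America", 2: "Europe", 3: "Japan"}
df["origin_label"] = df["origin"].map(origin_map)
df["origin_encoded"] = df["origin"]  # integer encoding, per the task

# One-hot encoding avoids implying order or distance between categories
dummies = pd.get_dummies(df["origin_label"], prefix="origin")
print(dummies.columns.tolist())
```

The one-hot columns would then be concatenated onto the feature matrix for models that are sensitive to ordinal assumptions, while tree-based models can often use the integer encoding directly.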

The origin counts were:

| origin_label | count |
| --- | --- |
| America | 249 |
| Japan | 79 |
| Europe | 70 |

4. Skewness and kurtosis

I calculated skewness and kurtosis for numeric variables. Skewness indicates asymmetry in a distribution, while kurtosis indicates tail behaviour relative to a normal distribution.

| variable | skewness | kurtosis |
| --- | --- | --- |
| mpg | 0.457 | -0.511 |
| cylinders | 0.527 | -1.377 |
| displacement | 0.72 | -0.747 |
| horsepower | 1.087 | 0.697 |
| horsepower_imputed | 1.106 | 0.764 |
| weight | 0.531 | -0.786 |
| acceleration | 0.279 | 0.419 |
| model year | 0.012 | -1.181 |
| origin_encoded | 0.924 | -0.818 |

The clearest result was the strong positive skew in horsepower, suggesting many vehicles in low-to-mid power ranges and fewer high-powered outliers creating a right tail. displacement, weight, and cylinders also showed positive skew, which aligns with a dataset dominated by moderate vehicles and fewer extreme large-engine observations.

Most kurtosis values were near-normal or negative, indicating that distributions were not heavily tail-dominated overall. This matters because skew and outliers can affect algorithm performance and interpretation; depending on model choice, later stages may require transformation, scaling, or more robust estimators.
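For reference, pandas exposes both statistics directly on a Series. The sketch below uses a synthetic right-skewed sample standing in for horsepower, since the real column lives in the notebook:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (gamma distribution) as a stand-in for horsepower
rng = np.random.default_rng(0)
hp = pd.Series(rng.gamma(shape=2.0, scale=40.0, size=500))

print(round(hp.skew(), 2))  # positive value -> right tail, as seen for horsepower
print(round(hp.kurt(), 2))  # excess kurtosis relative to a normal distribution
```

Note that pandas reports excess kurtosis (normal distribution = 0), which is why near-zero and negative values in the table above read as "near-normal or lighter-tailed".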

5. Correlation heat map

I used a correlation heat map to examine linear relationships between numeric features. Correlation coefficients with mpg were:

| feature | correlation with mpg |
| --- | --- |
| mpg | 1.000 |
| model year | 0.579 |
| origin_encoded | 0.563 |
| acceleration | 0.420 |
| horsepower_imputed | -0.773 |
| cylinders | -0.775 |
| displacement | -0.804 |
| weight | -0.832 |

The strongest negative relationship was between mpg and weight, which is intuitive: heavier vehicles tend to consume more fuel. displacement, horsepower, and cylinders were also strongly negatively associated with mpg.

model year showed a positive relationship with mpg, suggesting that newer vehicles in this historical sample were generally more fuel efficient. I treated this carefully in interpretation, since correlation alone does not establish causation and may reflect broader historical factors such as regulation, design shifts, or market priorities.
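The underlying computation is a Pearson correlation matrix; the sketch below uses synthetic data with an mpg-vs-weight relationship built in (the real coefficients come from the full dataset):

```python
import numpy as np
import pandas as pd

# Synthetic data: mpg falls roughly linearly with weight, plus noise
rng = np.random.default_rng(1)
weight = rng.uniform(1500, 4500, 200)
mpg = 45 - 0.008 * weight + rng.normal(0, 2, 200)
df = pd.DataFrame({"mpg": mpg, "weight": weight})

# Pearson correlation matrix across numeric columns
corr = df.corr()
print(round(corr.loc["mpg", "weight"], 2))  # strongly negative, as in the real data
```

In the notebook, the same matrix is rendered as a heat map (for example with seaborn's `heatmap`), which makes clusters of correlated mechanical features easy to spot at a glance.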

6. Scatter plots

I then used scatter plots to inspect pairwise relationships involving mpg, particularly:

  • mpg vs weight
  • mpg vs horsepower_imputed
  • mpg vs displacement
  • mpg vs acceleration
  • mpg vs model year

The visual trend was clear: mpg decreases as weight, horsepower_imputed, and displacement increase. I also used origin as a colour grouping, which showed clusters occupying different parts of feature space. In this dataset, American-origin vehicles appeared more frequently in heavier, larger-displacement regions, while Japanese and European vehicles were more common in lighter, higher-mpg regions.
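A colour-grouped scatter plot of this kind can be sketched as below, with a handful of illustrative rows in place of the full dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the notebook displays inline instead
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows only; the notebook plots all 398 records
df = pd.DataFrame({
    "weight": [3500, 2100, 2300, 3800, 2000, 2400],
    "mpg": [16, 31, 27, 14, 33, 26],
    "origin_label": ["America", "Japan", "Europe", "America", "Japan", "Europe"],
})

# One scatter series per origin group, so each gets its own colour and legend entry
fig, ax = plt.subplots()
for label, group in df.groupby("origin_label"):
    ax.scatter(group["weight"], group["mpg"], label=label)
ax.set_xlabel("weight")
ax.set_ylabel("mpg")
ax.legend()
```

Grouping by a categorical column before plotting is a simple way to surface cluster structure without fitting anything, which suits the exploratory stage.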

This is informative for modelling, but I interpreted it with caution. Origin may be acting as a proxy for other factors (vehicle size, production era, market segment, or design choices), so it should not be treated as a standalone causal explanation.

7. Grouped summaries

To complement the plots, I produced grouped summaries by origin:

| origin_label | count | avg_mpg | avg_weight | avg_horsepower |
| --- | --- | --- | --- | --- |
| Japan | 79 | 30.45 | 2221.23 | 79.84 |
| Europe | 70 | 27.89 | 2423.30 | 80.93 |
| America | 249 | 20.08 | 3361.93 | 118.64 |

These grouped results supported the visual findings. In this sample, vehicles from Japan and Europe had higher average MPG than those from America, and they also tended to have lower average weight and horsepower. This was a useful example of how EDA methods work together: correlation gives a quantitative overview, while grouped summaries improve interpretability.
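A grouped summary of this shape is a one-liner with pandas named aggregation; the sketch uses a few illustrative rows rather than the full dataset:

```python
import pandas as pd

# Illustrative rows; the notebook aggregates all 398 records
df = pd.DataFrame({
    "origin_label": ["Japan", "Japan", "America", "Europe", "America"],
    "mpg": [31.0, 29.0, 18.0, 28.0, 20.0],
    "weight": [2100, 2300, 3400, 2400, 3300],
})

# Named aggregation gives each output column a clear, report-ready name
summary = (
    df.groupby("origin_label")
      .agg(count=("mpg", "size"),
           avg_mpg=("mpg", "mean"),
           avg_weight=("weight", "mean"))
      .sort_values("avg_mpg", ascending=False)
)
print(summary)
```

Sorting by the summary statistic of interest makes the ranking in the table above immediately readable.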

8. Reflection against the learning outcomes

At face value, Auto MPG appears straightforward, but the horsepower issue demonstrated why EDA is essential before any model development. Hidden missingness in a nominally numeric feature can directly undermine model reliability.

Before selecting any model, the EDA surfaced technical concerns including missing values, skewed feature distributions, multicollinearity among mechanical variables, and encoding decisions for categorical data. Each of these has implications for model validity, interpretability, and defensibility.

Although this dataset is less sensitive than personal or clinical data, there are still professional and ethical considerations. It is historical data, so findings should not be overgeneralised to contemporary vehicle populations. In the same way, origin-based comparisons should be framed carefully to avoid simplistic conclusions, since origin may co-vary with broader engineering and market factors.

From a collaboration perspective, I treated reproducibility as part of professional practice: documenting cleaning decisions, preserving raw versus transformed columns, and structuring analysis so another team member could review assumptions and continue the workflow into modelling.

Conclusion

Overall, this EDA suggests that the Auto MPG dataset is suitable for introductory machine learning tasks, particularly MPG prediction, but only after careful preparation. The hidden missing values in horsepower, skew in engine-related variables, strong inter-feature correlations, and categorical encoding choices all require explicit handling.

My main takeaway from this activity is that EDA is critical as a structured risk-reduction stage. It helps identify data quality problems early, supports transparent methodological decisions, and provides a stronger foundation for responsible model development.