Exploratory Data Analysis with Auto MPG Dataset

Ross Bulat
Full Stack Engineer

Google Colab notebook: Open in Google Colab

Introduction

In this activity, I carried out exploratory data analysis (EDA) on the Auto MPG dataset to judge whether it was suitable for machine learning and to identify key uncertainties before modelling.

The process followed a conventional sequence: importing the usual analysis libraries, loading the data, inspecting structure and summary statistics, separating numeric and categorical variables, checking missing values, and using visualisations (including heat maps and scatter plots) to examine relationships.

The dataset contains 398 rows and 9 original columns: fuel economy (mpg), cylinder count (cylinders), engine size (displacement), engine power (horsepower), vehicle weight (weight), acceleration, model year, origin, and car name.

1. Loading and inspecting the data

I began by loading auto-mpg.csv in pandas and reviewing the first and last rows, dataset shape, and inferred data types. This quickly exposed an important quality issue: horsepower had been read as an object column instead of a numeric variable.

The reason was that missing values were represented by a question mark (?) rather than standard NaN. This is a useful practical reminder that missing data is not always immediately visible through isna() on raw input, especially when pandas has inferred a text type.
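A minimal sketch of how this kind of hidden marker can be surfaced. The rows below are made up for illustration and stand in for auto-mpg.csv; the variable names are assumptions, not taken from the notebook.

```python
import io

import pandas as pd

# Illustrative rows only (not the real file); "?" marks a missing horsepower.
raw_csv = io.StringIO(
    "mpg,cylinders,horsepower,weight\n"
    "18.0,8,130,3504\n"
    "25.0,4,?,2046\n"
    "21.0,6,?,2875\n"
    "32.0,4,65,1985\n"
)
df = pd.read_csv(raw_csv)

# The stray "?" forces pandas to read the whole column as text.
horsepower_is_numeric = pd.api.types.is_numeric_dtype(df["horsepower"])

# isna() sees nothing, because "?" is an ordinary string.
visible_missing = int(df["horsepower"].isna().sum())

# Entries that fail numeric parsing reveal the hidden placeholders.
parsed = pd.to_numeric(df["horsepower"], errors="coerce")
hidden_markers = df.loc[parsed.isna() & df["horsepower"].notna(), "horsepower"]

print(horsepower_is_numeric, visible_missing)
print(hidden_markers.value_counts())
```

Counting the values that fail to parse, rather than searching for "?" specifically, would also catch other placeholders such as "NA" or "-" if a file used them.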

2. Identifying missing values

After converting horsepower using pd.to_numeric(..., errors='coerce'), the question marks were correctly converted to missing values. The missing-value summary was:

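| column | missing_count | missing_percent |
|---|---|---|
| horsepower | 6 | 1.51 |
| mpg | 0 | 0 |
| cylinders | 0 | 0 |
| displacement | 0 | 0 |
| weight | 0 | 0 |
| acceleration | 0 | 0 |
| model year | 0 | 0 |
| origin | 0 | 0 |
| car name | 0 | 0 |
| horsepower_imputed | 0 | 0 |
| origin_label | 0 | 0 |
| origin_America | 0 | 0 |
| origin_Europe | 0 | 0 |
| origin_Japan | 0 | 0 |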

This indicates a small but non-trivial missing-data issue: six missing horsepower values (approximately 1.51% of records). To keep the workflow transparent, I retained the original horsepower column and created a separate horsepower_imputed feature using median imputation for analysis that required complete numeric inputs.

I selected median imputation because it is straightforward, reproducible, and generally more robust to skew than mean imputation. That said, it remains an assumption; in a full modelling pipeline, I would compare multiple imputation strategies rather than treat one approach as definitive.
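The conversion, missing-value summary, and median imputation described in this section can be sketched roughly as follows. The data frame is a small made-up stand-in for the real data, and the names missing_summary and hp_median are assumptions for illustration.

```python
import pandas as pd

# Illustrative values only; "?" stands in for the hidden missing marker.
df = pd.DataFrame(
    {
        "mpg": [18.0, 25.0, 21.0, 32.0, 15.0],
        "horsepower": ["130", "?", "95", "65", "150"],
    }
)

# Coerce unparseable entries such as "?" to NaN.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# Per-column missing count and percentage, largest first.
missing_summary = pd.DataFrame(
    {
        "missing_count": df.isna().sum(),
        "missing_percent": (df.isna().mean() * 100).round(2),
    }
).sort_values("missing_count", ascending=False)

# Keep the raw column untouched and impute into a separate one.
hp_median = df["horsepower"].median()
df["horsepower_imputed"] = df["horsepower"].fillna(hp_median)

print(missing_summary)
```

Writing the imputed values to a new column keeps the original gaps visible, so a different strategy can be substituted later without reloading the data.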

3. Encoding categorical values

The task required categorical values to be represented numerically. In this dataset, origin was already stored as integer codes (America = 1, Europe = 2, Japan = 3). However, integer encoding (1, 2, 3) implies an artificial order and distance between what are actually nominal categories—there is no meaningful sense in which Japan is "three times" America, or in which Europe sits "between" the other two. Depending on the algorithm, this can introduce spurious structure into the model.

For this reason, I first mapped the integer codes to readable labels and then applied one-hot encoding using pd.get_dummies(...), which produced three binary indicator columns:

| Original origin | origin_America | origin_Europe | origin_Japan |
|---|---|---|---|
| America | 1 | 0 | 0 |
| Europe | 0 | 1 | 0 |
| Japan | 0 | 0 | 1 |

One-hot encoding avoids the false ordinal assumption that integer encoding can introduce, and it is generally the more appropriate choice for nominal categorical features in machine learning pipelines. Each indicator column is itself a valid numeric feature (a Bernoulli 0/1 variable), so unlike a single arbitrary integer code, the one-hot columns can legitimately appear in a Pearson correlation matrix against mpg and other continuous variables.

The origin counts were:

| origin_label | count |
|---|---|
| America | 249 |
| Japan | 79 |
| Europe | 70 |
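The label mapping, one-hot encoding, and counts in this section can be sketched as below. The five rows are illustrative, and the dictionary name origin_names is an assumption; only the code-to-label mapping comes from the text above.

```python
import pandas as pd

# Integer codes as described above: 1 = America, 2 = Europe, 3 = Japan.
df = pd.DataFrame({"origin": [1, 3, 2, 1, 3]})

origin_names = {1: "America", 2: "Europe", 3: "Japan"}
df["origin_label"] = df["origin"].map(origin_names)

# One 0/1 indicator column per category; dtype=int avoids boolean columns.
dummies = pd.get_dummies(df["origin_label"], prefix="origin", dtype=int)
df = pd.concat([df, dummies], axis=1)

# Frequency of each label, most common first.
origin_counts = df["origin_label"].value_counts()

print(df)
print(origin_counts)
```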

4. Skewness and kurtosis

I calculated skewness and kurtosis for the non-binary numeric variables only (cylinders and model year are discrete, but are included alongside the continuous measurements). Skewness indicates asymmetry in a distribution, while kurtosis, reported here as excess kurtosis so that a normal distribution scores 0, indicates tail behaviour relative to a normal distribution. The one-hot origin_* columns are excluded because they are binary indicator variables: shape statistics such as skewness and kurtosis are not informative for 0/1 features, whose distribution is fully described by their proportion.

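| variable | skewness | kurtosis |
|---|---|---|
| mpg | 0.457 | -0.511 |
| cylinders | 0.527 | -1.377 |
| displacement | 0.72 | -0.747 |
| horsepower | 1.087 | 0.697 |
| horsepower_imputed | 1.106 | 0.764 |
| weight | 0.531 | -0.786 |
| acceleration | 0.279 | 0.419 |
| model year | 0.012 | -1.181 |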

The clearest result was the strong positive skew in horsepower, suggesting many vehicles in low-to-mid power ranges and fewer high-powered outliers creating a right tail. displacement, weight, and cylinders also showed positive skew, which aligns with a dataset dominated by moderate vehicles and fewer extreme large-engine observations.

Most kurtosis values were close to zero or negative, indicating that the distributions were not heavily tail-dominated overall. This matters because skew and outliers can affect algorithm performance and interpretation; depending on model choice, later stages may require transformation, scaling, or more robust estimators.
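As a rough sketch, shape statistics of this kind can be computed with pandas' built-in skew() and kurt() methods. The values below are invented to show one right-skewed and one symmetric column; they are not the Auto MPG data, and shape_stats is an assumed name.

```python
import pandas as pd

# Illustrative values only: a right-skewed column and a symmetric one.
df = pd.DataFrame(
    {
        "horsepower": [60, 65, 70, 75, 80, 90, 100, 150, 220],
        "acceleration": [12.0, 13.5, 14.5, 15.0, 15.5, 16.0, 16.5, 17.5, 19.0],
    }
)

shape_stats = pd.DataFrame(
    {
        "skewness": df.skew(),
        # pandas reports excess kurtosis: a normal distribution scores about 0.
        "kurtosis": df.kurt(),
    }
).round(3)

print(shape_stats)
```

The long right tail in the made-up horsepower column yields a clearly positive skewness, while the symmetric acceleration column comes out at approximately zero.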

5. Correlation heat map

I used a correlation heat map to examine linear relationships between numeric features. Because origin is represented as three binary one-hot columns, each of those columns can be included in the correlation matrix as a point-biserial-style relationship with mpg—this is a valid use of Pearson correlation between a continuous variable and a 0/1 indicator.

The correlations with mpg for the continuous variables were:

| variable | mpg |
|---|---|
| mpg | 1 |
| model year | 0.579 |
| acceleration | 0.42 |
| horsepower_imputed | -0.773 |
| cylinders | -0.775 |
| displacement | -0.804 |
| weight | -0.832 |

The strongest negative relationship was between mpg and weight, which is intuitive: heavier vehicles tend to consume more fuel. displacement, horsepower, and cylinders were also strongly negatively associated with mpg.

model year showed a positive relationship with mpg, suggesting that newer vehicles in this historical sample were generally more fuel efficient. I treated this carefully in interpretation, since correlation alone does not establish causation and may reflect broader historical factors such as regulation, design shifts, or market priorities.

The one-hot origin_* columns showed the direction expected from the grouped summaries in Section 7: origin_America was negatively correlated with mpg, while origin_Japan and origin_Europe were positively correlated with mpg. Because the three indicators are mutually exclusive and sum to 1 for every row, they are linearly dependent (any one can be derived from the other two), so they should be interpreted together rather than as separate features. This is best understood as a categorical effect surfaced through a continuous-style metric, and the grouped summaries below remain the more interpretable view.
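A minimal sketch of the correlation step, assuming a handful of invented rows rather than the real dataset. It shows the one-hot indicators sitting in the same Pearson matrix as the continuous columns, and checks the sum-to-one property mentioned above.

```python
import pandas as pd

# Illustrative rows only, not the real dataset.
df = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 24.0, 31.0, 27.0, 36.0],
        "weight": [3504, 4100, 2430, 1985, 2300, 1800],
        "origin_label": ["America", "America", "Europe", "Japan", "Europe", "Japan"],
    }
)
dummies = pd.get_dummies(df["origin_label"], prefix="origin", dtype=int)
numeric = pd.concat([df[["mpg", "weight"]], dummies], axis=1)

# Pearson correlation matrix; the mpg column is the slice of interest.
corr = numeric.corr()
mpg_corr = corr["mpg"].sort_values(ascending=False)

# The three indicators sum to 1 on every row, so they are linearly dependent.
indicators_sum_to_one = bool((dummies.sum(axis=1) == 1).all())

print(mpg_corr.round(3))
```

The full corr matrix is what a heat map would display, for example by passing it to seaborn's heatmap function.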

6. Scatter plots

I then used scatter plots to inspect pairwise relationships involving mpg, particularly:

  • mpg vs weight
  • mpg vs horsepower_imputed
  • mpg vs displacement
  • mpg vs acceleration
  • mpg vs model year

The visual trend was clear: mpg decreases as weight, horsepower_imputed, and displacement increase. I also used origin as a colour grouping, which showed clusters occupying different parts of feature space. In this dataset, American-origin vehicles appeared more frequently in heavier, larger-displacement regions, while Japanese and European vehicles were more common in lighter, higher-mpg regions.

This is informative for modelling, but I interpreted it with caution. Origin may be acting as a proxy for other factors (vehicle size, production era, market segment, or design choices), so it should not be treated as a standalone causal explanation.
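Scatter plots coloured by origin can be sketched with matplotlib along these lines. The rows are invented, only two of the five feature pairs are drawn, and the headless "Agg" backend is an assumption so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows only, not the real dataset.
df = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 24.0, 31.0, 27.0, 36.0],
        "weight": [3504, 4100, 2430, 1985, 2300, 1800],
        "displacement": [307, 400, 121, 85, 97, 79],
        "origin_label": ["America", "America", "Europe", "Japan", "Europe", "Japan"],
    }
)

features = ["weight", "displacement"]
fig, axes = plt.subplots(1, len(features), figsize=(10, 4), sharey=True)

for ax, feature in zip(axes, features):
    # One scatter call per origin gives each group its own colour and legend entry.
    for label, group in df.groupby("origin_label"):
        ax.scatter(group[feature], group["mpg"], label=label, alpha=0.7)
    ax.set_xlabel(feature)

axes[0].set_ylabel("mpg")
axes[0].legend(title="origin")
fig.tight_layout()
```

In a notebook the default backend would normally be left in place so the figure renders inline.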

7. Grouped summaries

To complement the plots, I produced grouped summaries by origin:

| origin_label | count | avg_mpg | avg_weight | avg_horsepower |
|---|---|---|---|---|
| Japan | 79 | 30.45 | 2221.23 | 79.84 |
| Europe | 70 | 27.89 | 2423.3 | 80.93 |
| America | 249 | 20.08 | 3361.93 | 118.64 |

These grouped results supported the visual findings. In this sample, vehicles from Japan and Europe had higher average MPG than those from America, and they also tended to have lower average weight and horsepower. This was a useful example of how EDA methods work together: correlation gives a quantitative overview, while grouped summaries improve interpretability.
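A grouped summary of this shape can be produced with pandas named aggregation, sketched here on invented rows. Whether the notebook averaged the raw or the imputed horsepower column is not stated, so the use of horsepower_imputed below is an assumption.

```python
import pandas as pd

# Illustrative rows only, not the real dataset.
df = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 24.0, 31.0, 27.0, 36.0],
        "weight": [3504, 4100, 2430, 1985, 2300, 1800],
        "horsepower_imputed": [130.0, 165.0, 95.0, 65.0, 88.0, 60.0],
        "origin_label": ["America", "America", "Europe", "Japan", "Europe", "Japan"],
    }
)

summary = (
    df.groupby("origin_label")
    .agg(
        count=("mpg", "size"),
        avg_mpg=("mpg", "mean"),
        avg_weight=("weight", "mean"),
        # Assumption: the imputed column is averaged, not the raw one.
        avg_horsepower=("horsepower_imputed", "mean"),
    )
    .round(2)
    .sort_values("avg_mpg", ascending=False)
    .reset_index()
)

print(summary)
```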

8. Reflection against the learning outcomes

At face value, Auto MPG appears straightforward, but the horsepower issue demonstrated why EDA is essential before any model development. Hidden missingness in a nominally numeric feature can directly undermine model reliability.

Before selecting any model, the EDA surfaced technical concerns including missing values, skewed feature distributions, multicollinearity among mechanical variables, and encoding decisions for categorical data. Each of these has implications for model validity, interpretability, and defensibility.

Although this dataset is less sensitive than personal or clinical data, there are still professional and ethical considerations. It is historical data, so findings should not be overgeneralised to contemporary vehicle populations. In the same way, origin-based comparisons should be framed carefully to avoid simplistic conclusions, since origin may co-vary with broader engineering and market factors.

From a collaboration perspective, I treated reproducibility as part of professional practice: documenting cleaning decisions, preserving raw versus transformed columns, and structuring analysis so another team member could review assumptions and continue the workflow into modelling.

Conclusion

Overall, this EDA suggests that the Auto MPG dataset is suitable for introductory machine learning tasks, particularly MPG prediction, but only after careful preparation. The hidden missing values in horsepower, skew in engine-related variables, strong inter-feature correlations, and categorical encoding choices all require explicit handling.

My main takeaway from this activity is that EDA serves as a structured risk-reduction stage. It helps identify data quality problems early, supports transparent methodological decisions, and provides a stronger foundation for responsible model development.