Global Population and GDP per Capita - Correlation and Regression
Google Colab notebook: Open in Google Colab
This portfolio activity investigates whether there is a linear relationship between a country's mean population and its mean GDP per capita across the years 2001–2020. The analysis uses two World Bank datasets, global_gdp.csv and global_population.csv, and applies both a Pearson correlation test and a simple linear regression model.
Data and preprocessing
The two CSV files are loaded directly from URLs, so the notebook runs end-to-end in Google Colab without manual uploads. A single helper function, prepare_country_year_values, handles preprocessing for both datasets. It:
- Selects the country name, country code, and year columns.
- Drops World Bank regional and income aggregates (e.g. EUU, HIC, WLD) by country code, so that only individual countries remain in the analysis.
- Converts year columns to numeric values.
- Interpolates across years to fill isolated missing values.
- Drops any rows that are still entirely missing after interpolation.
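The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `Country Name` and `Country Code`, the three-code aggregate list, and the choice to interpolate only interior gaps are all assumptions made for the example.

```python
import pandas as pd

# Assumed World Bank wide-format identifier columns; the notebook's may differ.
ID_COLS = ["Country Name", "Country Code"]
# Illustrative subset only; the real list of aggregate codes is much longer.
AGGREGATE_CODES = {"EUU", "HIC", "WLD"}


def prepare_country_year_values(df, start=2001, end=2020):
    """Hypothetical re-implementation of the preprocessing steps listed above."""
    year_cols = [str(y) for y in range(start, end + 1) if str(y) in df.columns]
    # Keep identifier and year columns, dropping aggregates by country code.
    is_country = ~df["Country Code"].isin(AGGREGATE_CODES)
    out = df.loc[is_country, ID_COLS + year_cols].copy()
    # Coerce year columns to floats; unparseable entries become NaN.
    out[year_cols] = out[year_cols].apply(pd.to_numeric, errors="coerce").astype(float)
    # Linear interpolation along each row (across years), interior gaps only.
    out[year_cols] = out[year_cols].interpolate(axis=1, limit_area="inside")
    # Drop rows that have no data at all in the window.
    return out.dropna(subset=year_cols, how="all").reset_index(drop=True)
```

Because the same function is applied to both files, the GDP and population tables end up with the same shape and the same set of country codes to match on.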
GDP per capita is then calculated year-by-year as total GDP divided by population for each country, and the mean across 2001–2020 is taken for each country.
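A sketch of that calculation, assuming both tables are in wide format with one row per country and one column per year. The function name `mean_gdp_per_capita` and the `Country Code` key are hypothetical, chosen for illustration.

```python
import pandas as pd


def mean_gdp_per_capita(gdp, pop, id_col="Country Code"):
    """Divide GDP by population year by year, then average across years."""
    gdp = gdp.set_index(id_col)
    pop = pop.set_index(id_col)
    # Use only the year columns and countries present in both tables.
    years = [c for c in gdp.columns if str(c).isdigit() and c in pop.columns]
    common = gdp.index.intersection(pop.index)
    per_capita = gdp.loc[common, years] / pop.loc[common, years]
    # mean(axis=1) skips NaN years by default.
    return per_capita.mean(axis=1).rename("mean_gdp_per_capita")
```

Averaging the yearly ratios is not the same as dividing mean GDP by mean population; this sketch follows the order described above.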
Outlier removal
Two countries, China and India, have mean populations roughly four times that of the next largest country (the USA, at around 300 million). They act as high-leverage points that distort both the scatter plot and the regression line. They were removed so that the scatter plot of the remaining 211 countries is easier to read and the regression is not dominated by two extreme values.
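One way to make "high leverage" concrete is the standard leverage formula for a one-predictor regression, h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The sketch below computes it and then drops countries by code. Both helper names are hypothetical, and filtering on the codes `CHN` and `IND` is an assumption about how the notebook identified the two countries.

```python
import numpy as np
import pandas as pd


def leverage(x):
    """Leverage of each observation in a simple (one-predictor) regression."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return 1.0 / x.size + centred**2 / np.sum(centred**2)


def drop_countries(df, codes=("CHN", "IND"), code_col="Country Code"):
    """Return a copy of df without the listed country codes."""
    return df[~df[code_col].isin(codes)].reset_index(drop=True)
```

Leverage values always sum to 2 in this setting (one per fitted parameter), so a single country with a value close to 1 is effectively fixing the line on its own.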
Task A: Correlation
The Pearson correlation coefficient was computed between mean population (in millions) and mean GDP per capita:
- Pearson r = −0.0596
- p-value = 0.3888
This indicates a very weak negative linear relationship, and the p-value is well above the conventional 0.05 significance threshold. In other words, the correlation is not statistically significant.
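A test of this kind can be run with `scipy.stats.pearsonr`. The wrapper below is a hypothetical helper for illustration; the 0.05 threshold is passed in as a parameter rather than hard-coded.

```python
from scipy import stats


def pearson_report(x, y, alpha=0.05):
    """Pearson r, two-sided p-value, and a significance flag at level alpha."""
    r, p = stats.pearsonr(x, y)
    return {"r": float(r), "p": float(p), "significant": bool(p < alpha)}
```

Called as `pearson_report(mean_population_millions, mean_gdp_per_capita)`, where both arguments are assumed to be equal-length numeric sequences, it returns the two figures reported above plus the decision at the chosen threshold.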
The scatter plot reinforces this: points are widely scattered, with a dense cluster of low-population countries spanning the full range of GDP per capita values, and larger countries concentrated near the bottom of the GDP per capita scale. A log-scaled x-axis plot was added to make medium and small countries easier to inspect, but the underlying pattern does not change.
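A log-scaled version of the scatter plot could be produced along these lines. This is a sketch using matplotlib; the function name, figure size, and axis labels are illustrative choices rather than the notebook's own.

```python
import matplotlib.pyplot as plt


def scatter_log_x(population_millions, gdp_per_capita):
    """Scatter plot with a logarithmic x-axis to spread out small countries."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.scatter(population_millions, gdp_per_capita, alpha=0.6)
    ax.set_xscale("log")  # population must be strictly positive
    ax.set_xlabel("Mean population (millions, log scale)")
    ax.set_ylabel("Mean GDP per capita")
    return fig, ax
```

The log scale changes only how the points are spaced on the page; the correlation and regression are still computed on the untransformed population values.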
The key takeaway from Task A is that a country's population size alone does not explain its wealth level.
Task B: Linear regression
A simple linear regression model was fitted with mean population (in millions) as the independent variable and mean GDP per capita as the dependent variable:
Mean GDP per capita = β₀ + β₁ × Mean population (millions)
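A model of this form can be fitted with `scipy.stats.linregress`, which returns the intercept (β₀), the slope (β₁), and the correlation coefficient. The wrapper below is a hypothetical helper; the notebook may well use a different library.

```python
from scipy import stats


def fit_simple_regression(x, y):
    """Ordinary least squares with one predictor: y = b0 + b1 * x."""
    res = stats.linregress(x, y)
    return {
        "intercept": float(res.intercept),   # b0
        "slope": float(res.slope),           # b1, change in y per unit of x
        "r_squared": float(res.rvalue**2),   # share of variance explained
        "p_value": float(res.pvalue),        # test of slope == 0
        "n": len(x),
    }
```

With population expressed in millions, the slope reads directly as the change in mean GDP per capita per additional million people.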
Results
| Metric | Value |
|---|---|
| Slope per 1 million people | −33.77 |
| R-squared | ~0.004 |
| Number of countries | 211 |
Interpretation
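- The fitted slope of −33.77 means that each additional 1 million people in mean population is associated with an estimated decrease of about $33.77 in mean GDP per capita. With a single predictor, the test on the slope is equivalent to the correlation test in Task A, so this estimate is not statistically distinguishable from zero.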
- The R² of ~0.004 means the model explains only about 0.4% of the variation in mean GDP per capita - a negligible share.
- Mean population is therefore a very weak predictor of a country's wealth.
- This is not a causal finding. GDP per capita is shaped by many factors: productivity, institutions, natural resources, trade, education, infrastructure, and historical development.
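As a quick consistency check on the figures above: in a one-predictor least-squares fit, R² equals the square of Pearson r, so the two tasks should agree. The snippet uses only the reported values; the 100-million gap is an arbitrary illustration.

```python
# Values as reported in the write-up.
pearson_r = -0.0596
slope_per_million = -33.77

# With one predictor, R-squared is the square of Pearson r.
r_squared = pearson_r ** 2
print(f"r^2 = {r_squared:.4f}")  # r^2 = 0.0036

# Fitted difference in mean GDP per capita for a 100-million population gap.
gap_millions = 100
predicted_difference = slope_per_million * gap_millions
print(f"{predicted_difference:.0f}")  # -3377
```

The squared correlation of about 0.0036 rounds to the reported R² of ~0.004. The fitted difference is a description of the line, not a prediction worth relying on given how little variance the model explains.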
Reflections
Legal, social, ethical and professional issues
Working with country-level economic data raises issues that go beyond model accuracy:
- Risk of misinterpretation. Even a weak negative slope can be picked up and presented as "more people means less wealth", which is misleading. A responsible practitioner should report effect sizes alongside p-values and R², and frame the relationship in plain language.
- Data provenance and quality. World Bank data is aggregated from national statistical offices whose data collection capacity varies widely. Results for smaller or lower-income countries may rest on weaker measurements than those for high-income countries.
- Equity of representation. Removing China and India as outliers is defensible for visualisation and regression stability, but it also removes roughly a third of the world's population from the headline result. Any communication of findings should disclose this.
Applicability and challenges of the dataset
- Skewed distributions. Population is highly right-skewed, which is why a log-scale plot was useful and why the two largest countries acted as high-leverage points.
- Missing values. Year-by-year interpolation handled isolated gaps, but rows that were entirely empty had to be dropped. This is acceptable for an introductory linear analysis, but more sophisticated imputation would be needed for serious modelling.
- Aggregate codes. World Bank files include regional and income-group aggregates that look like countries but are not. Filtering these out by country code is essential to avoid double-counting.
- Limits of a single predictor. The very low R² confirms that population alone is not a useful predictor. Any meaningful model of GDP per capita would need multiple features and likely a non-linear approach.
Summary
Across 211 countries between 2001 and 2020, with China and India excluded as outliers, mean population and mean GDP per capita show a very weak, non-significant negative correlation (Pearson r = −0.0596, p = 0.39), and a linear regression of GDP per capita on population explains essentially none of the variation (R² ≈ 0.004). The headline finding is clear: population size alone is not a meaningful predictor of national wealth, and any further analysis would need a richer feature set and careful treatment of skew, missing data, and outliers.
