Global Population and GDP - Correlation and Regression
Google Colab notebook: Open in Google Colab
This portfolio activity investigates the relationship between a country or entity's mean population and its mean total GDP across the years 2001-2020. The analysis uses two World Bank datasets, global_gdp.csv and global_population.csv, then applies descriptive analysis, Pearson correlation, log-scale visualisation, and simple linear regression.
Data and preprocessing
The two CSV files are loaded directly from URLs, so the notebook runs end-to-end in Google Colab without manual uploads:
- global_gdp.csv
- global_population.csv
The raw files contain 266 GDP rows and 272 population rows. A shared helper function, prepare_country_year_values, handles preprocessing for both datasets. It:
- Selects the country/entity name, country code, and year columns for 2001-2020.
- Converts yearly values to numeric values.
- Interpolates across year columns to fill isolated missing values.
- Drops rows where all selected year values are still missing.
- Renames columns into a consistent format for merging.
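The steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: only the function name comes from this write-up, while the signature, the `Country Name` / `Country Code` headers, the output column naming, and the use of default linear interpolation are all assumptions.

```python
import pandas as pd


def prepare_country_year_values(raw, value_prefix, start_year=2001, end_year=2020):
    """Hypothetical sketch of the shared preprocessing helper."""
    year_cols = [str(year) for year in range(start_year, end_year + 1)]
    id_cols = ["Country Name", "Country Code"]  # assumed World Bank headers

    # 1. Keep only the identifier columns and the selected year columns.
    df = raw[id_cols + year_cols].copy()

    # 2. Convert yearly values to numbers; unparseable cells become NaN.
    df[year_cols] = df[year_cols].apply(pd.to_numeric, errors="coerce")

    # 3. Interpolate linearly along each row, i.e. across the year columns.
    df[year_cols] = df[year_cols].interpolate(axis=1)

    # 4. Drop rows where every selected year is still missing.
    df = df.dropna(subset=year_cols, how="all")

    # 5. Rename columns into a consistent format, e.g. gdp_2001 (assumed naming).
    rename = {"Country Name": "country_name", "Country Code": "country_code"}
    rename.update({year: f"{value_prefix}_{year}" for year in year_cols})
    return df.rename(columns=rename).reset_index(drop=True)
```

Under these assumptions, calling the helper once with `value_prefix="gdp"` and once with `value_prefix="population"` would yield two tables that share a `country_code` column for merging.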
The cleaned GDP and population tables are merged by country code, producing 257 rows and 42 columns. The analysis then calculates:
- Mean population, 2001-2020.
- Mean GDP, 2001-2020.
- Mean population in millions.
- Mean GDP in billions and trillions of US dollars.
Rows are kept only where both mean population and mean GDP are positive, which is also required for the later log-scale plot. The final analysis contains 257 countries/entities.
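A hedged sketch of the merge, averaging, and positivity filter is shown below. The function name `merge_and_average`, the `gdp_YYYY` / `population_YYYY` column names, and the inner join are assumptions made for illustration; the write-up only states that the tables are merged by country code.

```python
import pandas as pd


def merge_and_average(gdp, pop, years=range(2001, 2021)):
    """Hypothetical sketch: merge cleaned tables and compute 2001-2020 means."""
    # Merge on country code; drop the duplicate name column from one side
    # so the result has 2 identifier columns plus the yearly value columns.
    merged = gdp.merge(pop.drop(columns="country_name"), on="country_code", how="inner")

    gdp_cols = [f"gdp_{year}" for year in years]
    pop_cols = [f"population_{year}" for year in years]

    # Row-wise means over the selected years.
    merged["mean_gdp"] = merged[gdp_cols].mean(axis=1)
    merged["mean_population"] = merged[pop_cols].mean(axis=1)

    # Rescaled copies for readability.
    merged["mean_population_millions"] = merged["mean_population"] / 1e6
    merged["mean_gdp_billions"] = merged["mean_gdp"] / 1e9
    merged["mean_gdp_trillions"] = merged["mean_gdp"] / 1e12

    # Keep rows where both means are strictly positive (needed for log scales).
    positive = (merged["mean_population"] > 0) & (merged["mean_gdp"] > 0)
    return merged[positive].reset_index(drop=True)
```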
Scope of the rows
The notebook analyses countries/entities, not only sovereign countries. World Bank aggregate rows such as World, High income, OECD members, and regional or income-group totals remain in the dataset. This is important for interpretation because those aggregate entities are much larger than individual countries and are not statistically independent of the countries they contain.
Descriptive analysis
Both variables are highly right-skewed. The median of mean population is about 9.43 million and the median of mean GDP is about US$39.46 billion, while the largest entity, World, has a mean population of about 6.97 billion and a mean GDP of about US$66.70 trillion.
The largest mean-population observations are aggregate entities:
| Entity | Mean population | Mean GDP |
|---|---|---|
| World | 6,971.25 million | US$66.70 trillion |
| IDA & IBRD total | 5,842.84 million | US$22.86 trillion |
| Low & middle income | 5,790.14 million | US$21.93 trillion |
| Middle income | 5,266.19 million | US$21.55 trillion |
| Early-demographic dividend | 2,927.29 million | US$9.85 trillion |
The highest mean-GDP observations are also mostly aggregate entities, including World, High income, OECD members, and Post-demographic dividend.
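One way to produce this kind of descriptive summary is sketched below. The function name and column names are assumptions carried over for illustration; a mean far above the median, together with a positive skewness, is a quick numerical sign of the right skew described above.

```python
import pandas as pd


def describe_skew(df, cols=("mean_population_millions", "mean_gdp_trillions"), top_n=5):
    """Hypothetical sketch: summary statistics plus the largest rows."""
    cols = list(cols)
    # Median vs mean vs max, and sample skewness, for each column.
    summary = df[cols].agg(["median", "mean", "max", "skew"])
    # Rows with the largest values in the first column (here, population).
    largest = df.nlargest(top_n, cols[0])
    return summary, largest
```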
Task A: Correlation and visual pattern
The Pearson correlation coefficient was computed between mean population, measured in millions, and mean total GDP, measured in trillions of US dollars:
- Pearson r = 0.7213
- p-value = 1.459e-42
This indicates a strong positive linear relationship in the analysed dataset. Entities with larger populations generally have larger total GDP.
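A coefficient and p-value of this kind can be obtained with `scipy.stats.pearsonr`. The write-up does not say which library the notebook uses, so the sketch below is an assumption, and the wrapper name `pearson_summary` is hypothetical.

```python
from scipy import stats


def pearson_summary(x, y):
    """Return (r, p_value) for two equal-length numeric sequences."""
    # pearsonr returns the correlation coefficient and a two-sided p-value.
    r, p_value = stats.pearsonr(x, y)
    return float(r), float(p_value)
```

In the notebook's setting, `x` would be mean population in millions and `y` mean GDP in trillions of US dollars.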
The result is economically intuitive because total GDP measures aggregate output. A larger population can mean a larger workforce, larger domestic market, and greater total production. However, the relationship is not perfect: population size alone does not determine how productive an economy is or how much output is produced per person.
Linear-scale plot
The linear-scale scatter plot shows a visible upward pattern between mean population and mean total GDP. The largest aggregate entities sit toward the upper-right of the plot, while many smaller countries/entities cluster near the lower-left.
The plot also shows substantial spread around the trend. Some smaller-population economies have relatively high GDP, and some larger-population entities have lower GDP than the trend would predict. This is the first sign that population is useful but incomplete as a single explanatory variable.
Log-scale visual check
Because both population and GDP are highly skewed, the notebook also plots both axes on a log scale. The log transformation makes small and medium-sized entities easier to compare.
After applying the log transform:
- Pearson r = 0.8818
- p-value = 3.15e-85
The positive relationship remains and becomes visually clearer on the log-log plot. The notebook also identifies a small top-left cluster: countries/entities with smaller populations but comparatively high GDP. In the displayed examples, Switzerland appears in this group, with a mean population of about 7.90 million and mean GDP of about US$0.582 trillion.
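The log-scale correlation can be sketched as below. The base of the logarithm is not stated in this write-up; base 10 is assumed here, and the choice does not affect Pearson r because changing the base only rescales each variable by a constant. The function name is hypothetical.

```python
import numpy as np
from scipy import stats


def log_pearson(x, y):
    """Pearson correlation of log10(x) and log10(y); inputs must be positive."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Logs are undefined for zero or negative values, hence the earlier filter.
    if np.any(x <= 0) or np.any(y <= 0):
        raise ValueError("log transform requires strictly positive values")
    r, p_value = stats.pearsonr(np.log10(x), np.log10(y))
    return float(r), float(p_value)
```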
The log-scale plot reinforces the main point: population is related to total GDP, but GDP also depends on productivity, industry structure, trade, institutions, education, infrastructure, technology, natural resources, and historical development.
Task B: Linear regression
A simple linear regression model was fitted with mean population as the independent variable and mean total GDP as the dependent variable:
$$\text{Mean GDP} = \beta_0 + \beta_1 \times \text{Mean population}$$
For readability:
- Mean population is measured in millions of people.
- Mean GDP is measured in trillions of US dollars.
This means the slope represents the expected change in mean GDP, in trillions of US dollars, for each additional one million people.
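A minimal sketch of such a fit, assuming ordinary least squares via `scipy.stats.linregress` (the notebook's actual library is not stated here). The function name is hypothetical, and RMSE is assumed to be the square root of the mean squared residual over all observations.

```python
import numpy as np
from scipy import stats


def fit_simple_regression(pop_millions, gdp_trillions):
    """Fit GDP = intercept + slope * population and report fit metrics."""
    x = np.asarray(pop_millions, dtype=float)
    y = np.asarray(gdp_trillions, dtype=float)

    fit = stats.linregress(x, y)  # ordinary least squares with an intercept
    predicted = fit.intercept + fit.slope * x

    # Assumed RMSE definition: sqrt(mean(residual^2)), dividing by n.
    rmse = float(np.sqrt(np.mean((y - predicted) ** 2)))

    return {
        "intercept": float(fit.intercept),
        "slope": float(fit.slope),
        "r_squared": float(fit.rvalue ** 2),
        "rmse": rmse,
        "n": int(x.size),
    }
```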
Results
| Metric | Value |
|---|---|
| Intercept | US$0.4497 trillion |
| Slope, trillions per 1 million people | US$0.006036 trillion |
| Slope, billions per 1 million people | US$6.036 billion |
| R-squared | 0.5203 |
| RMSE | US$4.958 trillion |
| Number of observations | 257 |
The fitted equation is:
$$\text{Mean GDP (US\$ trillions)} = 0.4497 + 0.006036 \times \text{mean population (millions)}$$
Interpretation
- The fitted slope is positive.
- Each additional 1 million people is associated with an estimated US$6.036 billion increase in mean total GDP.
- The R-squared value of 0.5203 means the model explains about 52.03% of the variation in mean total GDP across the analysed countries/entities.
- The RMSE of US$4.958 trillion is large relative to the median mean GDP of about US$39.46 billion, so the model's predictions for individual countries/entities can be far from the observed GDP values.
- The intercept is a mathematical anchor for the regression line. A country or entity with zero population is not a meaningful real-world case.
The model is therefore informative but limited. Population explains a substantial share of the variation in total GDP in this dataset, but it is still a one-variable model and should not be treated as a complete explanation of economic output.
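As a consistency check, assuming the model was fitted by ordinary least squares with an intercept, the R-squared of a one-predictor regression equals the square of the Pearson correlation between the predictor and the response. The Task A correlation reproduces the reported value:

$$R^2 = r^2 = 0.7213^2 \approx 0.5203$$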
Example predictions
The notebook demonstrates predictions for several population sizes:
| Mean population | Predicted mean GDP |
|---|---|
| 10 million | US$0.510 trillion |
| 50 million | US$0.751 trillion |
| 100 million | US$1.053 trillion |
| 500 million | US$3.468 trillion |
These are demonstrations of the fitted regression line, not forecasts. They use population as the only input, so they ignore productivity and other economic differences between countries/entities.
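The table can be reproduced approximately from the reported coefficients, as in the sketch below. The constant and function names are hypothetical, and because the coefficients are used as rounded in the Results table, the last printed digit may differ slightly from the notebook's values.

```python
# Coefficients as reported (rounded) in the Results table.
INTERCEPT_TRILLIONS = 0.4497
SLOPE_TRILLIONS_PER_MILLION = 0.006036


def predict_mean_gdp_trillions(pop_millions):
    """Evaluate the fitted line for a population given in millions."""
    return INTERCEPT_TRILLIONS + SLOPE_TRILLIONS_PER_MILLION * pop_millions


for pop in (10, 50, 100, 500):
    gdp = predict_mean_gdp_trillions(pop)
    print(f"{pop:>4} million -> US${gdp:.3f} trillion")
```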
Reflections
Legal, social, ethical and professional issues
Working with country-level and aggregate economic data raises issues that go beyond model accuracy:
- Risk of misinterpretation. A positive relationship between population and total GDP should not be presented as "larger populations make countries richer". Total GDP is not the same as GDP per capita or living standards.
- Aggregate rows. The dataset includes World Bank aggregate entities such as World, income groups, and regions. These rows are useful for a broad descriptive exercise, but they overlap with country rows and are not independent observations.
- Data provenance and quality. World Bank data is compiled from national and institutional sources. Measurement quality can vary by country, year, and economic context.
- Causal limits. The regression is associational. It does not prove that population growth causes GDP growth, nor does it account for productivity, institutions, trade, education, infrastructure, conflict, or resource endowments.
Applicability and challenges of the dataset
- Skewed distributions. Population and total GDP are both heavily right-skewed, so log-scale visualisation is useful for revealing patterns among small and medium-sized entities.
- Single predictor. Population alone explains about half of the observed variation in total GDP, but the remaining variation is still large.
- Unit interpretation. The model uses population in millions and GDP in trillions for readability. The unit conversion does not change the relationship, but it matters for interpreting the slope correctly.
- Entity definition. Because aggregate entities are included, the results should be described as applying to countries/entities rather than strictly to countries.
- GDP versus GDP per capita. Total GDP captures aggregate economic output. It should not be used as a direct measure of individual prosperity.
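The unit-interpretation point can be illustrated with a small sketch. The data below are synthetic values invented purely for the demonstration, not World Bank figures, and the helper name is hypothetical. Rescaling both variables changes the slope by a known factor but leaves the correlation unchanged.

```python
import numpy as np

# Synthetic, made-up values for illustration only.
pop_people = np.array([2e6, 8e6, 5e7, 3e8, 1.4e9])
gdp_usd = np.array([5e10, 4e11, 3e11, 2e13, 1.2e13])


def slope_and_r(x, y):
    """Least-squares slope and Pearson r for two arrays."""
    slope, _intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(r)


# Raw units: US dollars per person.
slope_raw, r_raw = slope_and_r(pop_people, gdp_usd)

# Rescaled units: trillions of US dollars per million people.
slope_scaled, r_scaled = slope_and_r(pop_people / 1e6, gdp_usd / 1e12)

# The slope rescales by 1e6 / 1e12 = 1e-6; the correlation does not change.
print(np.isclose(slope_scaled, slope_raw * 1e-6), np.isclose(r_scaled, r_raw))
```

By the same arithmetic, the reported slope of US$0.006036 trillion per million people corresponds to US$6.036 billion per million people, or about US$6,036 per person.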
Summary
Across 257 countries/entities between 2001 and 2020, mean population and mean total GDP show a strong positive correlation (Pearson r = 0.7213, p = 1.459e-42). After log transformation, the relationship is even stronger (Pearson r = 0.8818, p = 3.15e-85).
The simple linear regression model estimates that each additional 1 million people is associated with about US$6.036 billion more mean total GDP, and population alone explains about 52.03% of the variation in mean GDP. The headline finding is therefore clear: population is meaningfully related to total GDP, but it is not a complete explanation of economic output or prosperity.
