Merged
3 changes: 3 additions & 0 deletions .markdownlint.yml
@@ -1,3 +1,6 @@
ignore:
- venv
- .github


MD013: false
129 changes: 128 additions & 1 deletion 1_datasets/README.md
@@ -1 +1,128 @@
# Datasets
# 📂 Datasets Documentation

This folder contains all the datasets used to model the impact of
the COVID-19 pandemic on math learning outcomes in primary education.
Most of our data was collected from **UNESCO's UIS (Institute for Statistics)**
open data platform. To add important contextual variables, we included supplementary
datasets from **UNICEF** and the **World Bank**.

---

## 🎯 Research Question

**How did the COVID-19 pandemic affect math learning outcomes in primary
education systems across countries of different income levels,
and what policy responses can support recovery?**

---

## 🔄 Research Question Adjustment

Our original focus was on **low- and middle-income countries (LMICs)** only:

> *“How did students in low- and middle-income countries experience changes
in their learning proficiency during the COVID-19 pandemic, and what kind of
support systems can address those gaps effectively?”*

We adjusted the scope and phrasing of the question to:

> **“How did the COVID-19 pandemic affect math learning outcomes in primary
education systems across countries of different income levels, and what policy
responses can support recovery?”**

### Why the adjustment?

1. **Data availability constraints**: We found limited assessment
data for LMICs, especially during 2020–2021. Expanding to include all income
groups ensured more consistent and complete data coverage.

2. **Comparative value**: By classifying countries by income level, we could
still highlight LMIC challenges while enabling global comparisons.

---

## 🧠 Modeling Approach

To model the impact of COVID-19 on primary-level math learning outcomes,
we relied on **country-level education system indicators** instead of
individual student-level data.
This was due to data privacy concerns and the scarcity of standardized
testing during the pandemic.

Our dependent variable is:

- **Proportion of students achieving minimum math proficiency
at the end of primary school**

Our independent variables include:

- Government expenditure on education (% of GDP)
- Completion rate of primary education
- Out-of-school rate (primary level)
- Pupil-to-trained-teacher ratio
- Digital connectivity for school-age children
- Duration of school closures

Countries are grouped by World Bank income levels for comparative analysis.

![Modeling Diagram](../notes/visuals/model.png)

*Figure: Data flow and feature grouping used to model COVID-related learning loss.*
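
The income-level comparison described above could be sketched as follows. This is a minimal illustration, not the project's actual model: the country codes, values, and the function name `mean_change_by_group` are placeholders invented for the example, and it uses only the Python standard library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (ISO code, income group, proficiency % in 2019, in 2023).
# Codes and values are placeholders, not real data.
rows = [
    ("AAA", "Low income", 30.0, 24.0),
    ("BBB", "Low income", 40.0, 36.0),
    ("CCC", "High income", 80.0, 78.0),
    ("DDD", "High income", 90.0, None),  # missing 2023 value stays missing
]


def mean_change_by_group(rows):
    """Average change in proficiency (percentage points) per income group."""
    changes = defaultdict(list)
    for _iso, group, before, after in rows:
        if before is None or after is None:
            continue  # skip incomplete pairs rather than imputing a value
        changes[group].append(after - before)
    return {group: mean(values) for group, values in changes.items()}


result = mean_change_by_group(rows)
print(result)
```

Countries with a missing endpoint are skipped instead of filled in, which matches the no-imputation policy stated elsewhere in this documentation.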

---

## ⚠️ Constraints and Limitations

- **Lack of standardized assessment data for 2020–2021**
- **Missing data** for some countries and years
- **Some indicators are modeled**, not directly observed (e.g., completion rate)
- **No individual-level data**
- **Definitions of “school closures” vary**
- **Digital connectivity estimates** are based on the most recent
household survey available (often pre-2020)
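
One way to quantify the missing-data constraint is to compute, per variable, the share of country-year rows that hold an observed value. The sketch below assumes the merged data has been loaded as a list of dictionaries with missing values stored as `NaN`; the records and the helper name `coverage` are hypothetical.

```python
import math

# Hypothetical slice of the merged dataset; codes and values are placeholders.
records = [
    {"ISO": "AAA", "year": 2019, "math_proficiency": 41.0, "school_closure": math.nan},
    {"ISO": "AAA", "year": 2020, "math_proficiency": math.nan, "school_closure": 120.0},
    {"ISO": "BBB", "year": 2019, "math_proficiency": 63.0, "school_closure": math.nan},
    {"ISO": "BBB", "year": 2020, "math_proficiency": math.nan, "school_closure": 95.0},
]


def coverage(records, variable):
    """Return the share of rows in which `variable` is not NaN."""
    observed = sum(1 for row in records if not math.isnan(row[variable]))
    return observed / len(records)


shares = {name: coverage(records, name) for name in ("math_proficiency", "school_closure")}
print(shares)
```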

---

## 📊 Dataset Summary

| # | Variable | Description | Source | Time Range |
|---|---------------------------|----------------------------------------------------------|------------|--------------|
| 1 | `math_proficiency` | % of students achieving minimum level | UNESCO UIS | 2019–2023 |
| 2 | `gov_edu_spending` | Government spending on education (% of GDP) | UNESCO | 2019–2023 |
| 3 | `completion_rate` | % completing primary education | UNESCO | 2019–2023 |
| 4 | `out_of_school_rate` | % of primary-aged children not enrolled | UNESCO | 2019–2023 |
| 5 | `trained_teacher_ratio` | Pupils per trained teacher | UNESCO | 2019–2023 |
| 6 | `digital_connectivity` | % of school-age children with digital access (latest) | UNICEF | 2010–2020 |
| 7 | `school_closure` | Number of days schools were closed (full/partial) | UNESCO | 2020–2022 |

---

## 🔁 Reproducibility

- 📁 Raw data in: `datasets/raw_data/`
- 📁 Cleaned individual files in: `datasets/cleaned_data/`
- 📜 Cleaning/merging script: `scripts/data_cleaning_script.ipynb`
- 📄 Final dataset: `datasets/final_dataset.csv`
- 🌍 All countries identified by ISO 3-letter codes
- ⛔ Missing values preserved as `NaN` to ensure transparency
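
As a rough illustration of the merge step, the sketch below outer-joins per-variable tables on country code and year, and fills gaps with `NaN` instead of dropping rows. It is an assumption about how such a merge might look, not a copy of the notebook: the function name `merge_on_keys`, the in-memory table layout, and the sample values are all hypothetical.

```python
import math


def merge_on_keys(tables):
    """Outer-join {variable: {(iso, year): value}} into one row per (iso, year)."""
    keys = sorted({key for table in tables.values() for key in table})
    merged = []
    for iso, year in keys:
        row = {"ISO": iso, "year": year}
        for variable, table in tables.items():
            # Keep the row even when a variable is absent; mark the gap as NaN.
            row[variable] = table.get((iso, year), math.nan)
        merged.append(row)
    return merged


# Placeholder inputs: one observation per variable, in different years.
tables = {
    "math_proficiency": {("AAA", 2019): 41.0},
    "school_closure": {("AAA", 2020): 120.0},
}
merged = merge_on_keys(tables)
print(merged)
```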

---

## 📍 Supplementary Case Study – Bangladesh School

To complement the global data, we analyzed math scores from 450 seventh-grade students
in a school in Bangladesh for 2019, 2020, and 2021.

Average scores fell sharply in 2020 and only partly recovered in 2021:

- **2019**: 77.3
- **2020**: 51.7
- **2021**: 61.2

While not nationally representative, this micro-level example
offers additional insight into
learning disruption in a lower-middle-income country during the pandemic.

*Note: This case study was not included in the main model,
but it is consistent with our global findings.*
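
The size of the drop and the partial rebound can be worked out directly from the three averages reported above. The variable names are our own; the scores are the ones listed in this section.

```python
# Average math scores reported for the Bangladesh case study.
scores = {2019: 77.3, 2020: 51.7, 2021: 61.2}

drop = scores[2019] - scores[2020]            # points lost from 2019 to 2020
rebound = scores[2021] - scores[2020]         # points regained from 2020 to 2021
remaining_gap = scores[2019] - scores[2021]   # points still below the 2019 level
drop_pct = 100 * drop / scores[2019]          # drop relative to the 2019 average

print(f"Drop 2019-2020: {drop:.1f} points ({drop_pct:.1f}%)")
print(f"Rebound 2020-2021: {rebound:.1f} points")
print(f"Gap to 2019 level in 2021: {remaining_gap:.1f} points")
```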
42 changes: 42 additions & 0 deletions 1_datasets/cleaned_data/README.md
@@ -0,0 +1,42 @@
# 🧹 Cleaned Data

This folder contains cleaned versions of the raw datasets used in our project.

Each file here has undergone basic preprocessing steps such as:

- Removing regional aggregates and keeping only countries
- Standardizing column names
- Converting country names to ISO 3-letter codes
- Dropping irrelevant columns
- Filtering out years we don't use in our analysis
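
The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the raw column names (`Country Name`, `Year`, `Value`, `Footnote`), the name-to-ISO lookup, the placeholder countries, and the kept years are invented for the example, and the actual notebook may be organized differently.

```python
import csv
import io
import math

# Hypothetical raw extract; column names and values are placeholders.
RAW = """Country Name,Year,Value,Footnote
Country A,2019,50.0,
World,2019,60.0,
Country A,2015,45.0,
Country B,2023,,estimate
"""

NAME_TO_ISO = {"Country A": "AAA", "Country B": "BBB"}  # placeholder lookup
YEARS = {2019, 2023}                                    # years kept for analysis


def clean(raw_text, name_to_iso, years):
    """Keep countries only, map names to ISO codes, filter years, drop extra columns."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        iso = name_to_iso.get(row["Country Name"])
        if iso is None:
            continue  # regional aggregates such as "World" are not in the lookup
        year = int(row["Year"])
        if year not in years:
            continue  # outside the analysis window
        value = float(row["Value"]) if row["Value"] else math.nan  # keep gaps as NaN
        cleaned.append({"ISO": iso, "year": year, "value": value})
    return cleaned


cleaned = clean(RAW, NAME_TO_ISO, YEARS)
print(cleaned)
```

Using the lookup table both to convert names and to filter out aggregates keeps the two steps consistent: anything without a country code never reaches the cleaned file.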

---

## 📁 Files

| File Name | Based on Raw File | Description |
|------------------------------------|--------------------------------------|--------------------------------------------------|
| `math_proficiency.cleaned.csv` | `minimum_proficiency.raw.csv` | Cleaned math proficiency data (2019, 2023 only) |
| `gov_spending.cleaned.csv` | `gov_expending.raw.csv` | Cleaned gov education spending data |
| `completion_rate.cleaned.csv` | `completion_rate.raw.csv` | Cleaned primary completion rates |
| `out_of_school.cleaned.csv` | `out_of_school.raw.csv` | Cleaned out-of-school rates |
| `trained_teacher_ratio.cleaned.csv` | `pupil_teacher_ratio.raw.csv` | Cleaned pupil-to-trained-teacher ratios |
| `digital_connectivity.cleaned.csv` | `School-Age-Digital-Connectivity...` | Filtered for primary level + standardized ISO |
| `school_closure.cleaned.csv` | `duration-of-school-closures.raw...` | Filtered, summarized school closure durations |

---

## ✅ Purpose

The cleaned files are used as input to the final merging process, where we combine all variables into a single dataset:
📄 `datasets/final_dataset.csv`

These files ensure consistent structure and comparability across countries and years before modeling.

---

## 📌 Notes

- Cleaning was performed using the script: `scripts/data_cleaning_script.ipynb`
- No imputation or interpolation was applied; missing values are retained
- All files use ISO 3-letter country codes (`ISO`) and standard column names