Migration Flows to Europe
Baris Alan
Final Data Science Tutorial
CMPS 6790 / Data Science - Prof. Nicholas Mattei
Project Topic and Goals
The advanced liberal democracies of the Western world are an attractive destination for populations in the developing world, driven by a myriad of pull and push factors. Pull factors include the employment opportunities, rule of law, and equal treatment before the law that developed liberal democracies offer. Push factors, conversely, include armed and social conflict, unemployment, poverty, corruption, poor governance, and climate-related risks. European countries in particular are a desirable destination for people from many nations in Africa, the Middle East, and West Asia.
The primary objective of this project is to conduct a comprehensive analysis and visualization of migration flows to the European Union (EU) countries from regions outside Europe. The key areas of focus include examining the demographic structure and educational background of migrants, identifying their countries of origin and the EU countries they choose for settlement, and mapping out the migration routes and transit countries they navigate.
By addressing these aspects, the project aims to provide valuable insights into the dynamics of migration to the EU, shedding light on the factors influencing migration patterns and contributing to a nuanced understanding of the complex interplay between push and pull factors in the context of global migration.
Later on, this project builds a regression model to measure the impact and significance of various factors on migration flows to Europe: economic (GDP per capita and multidimensional poverty), political (political stability and government effectiveness), gender (gender inequality), conflict (armed conflict and social unrest), and climate (climate-related risks).
Project Datasets
This project will utilize various datasets from different institutions.
The First Dataset
The primary dataset is sourced from Eurostat, the statistical office of the EU. Specifically, I obtained the dataset from the Migration and Population Statistics section, focusing on immigration by age group, sex, and citizenship. This dataset provides the total number of migrants based on specified filters, allowing researchers to analyze immigration by receiving country, immigrant citizenship, year, age group, and gender.
Due to Eurostat's data download limitations, careful selection of attributes was necessary. Specifically, (1) I narrowed down the country of citizenship options to 218, excluding EU countries to focus on immigration from other regions to Europe. Regional groupings such as Africa and South Asia were also included for future analysis. (2) Receiving countries were limited to the 27 EU member states. (3) All available gender options were selected (Male, Female, Total). (4) For age, the total and specific age brackets were selected. (5) The dataset was filtered for the year 2022, with plans to include data from previous years for a comprehensive analysis of changing migration flows.
Key questions addressed with the first dataset include: "What is the total number of arrivals in EU countries in 2022?", "What are the demographic characteristics of immigrants based on gender and age?", "Which EU countries received the highest number of immigrants?", and lastly "Which countries sent the highest number of immigrants?"
The Second Set of Datasets: ISO Codes
The second dataset, from Datahub.io, supplies 2-letter country codes. Its sole purpose is to associate country names with the country codes in the immigration dataset.
Some of the independent-variable datasets use 3-letter country codes. To merge the main migration dataset with them more reliably, the World Bank ISO3 dataset was also used to bring in ISO3 codes.
The remaining datasets operationalize the explanatory variables that help answer the main question: "What is the most important predictor of migration flows to Europe in 2022?"
The Third Set of Datasets: Economic Indicators
To measure the effect of economic indicators, this project will use the World Bank's GDP per capita dataset (purchasing power parity, constant 2017 international dollars). GDP per capita is the most common and one of the best indicators of a country's overall economic well-being.
Yet, because average income might not reflect the well-being of the whole population, this research will also include the World Bank Multidimensional Poverty Measure.
The Fourth Dataset: ACLED Conflict Datasets
The Armed Conflict Location & Event Data Project (ACLED) provides datasets on events ranging from mob violence to military conflict, which give a sense of the level of conflict in a specific country or region. This project will use three of them, covering event types that are noted as drivers of migration outflows: Battles, Riots, and Violence against Civilians.
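ACLED records one row per event, while the analysis needs one value per country. A minimal sketch of that aggregation is shown here on invented rows; the `country` column name is assumed to match ACLED's export (it is hidden behind the `...` in the previews further down), and `n_events` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Toy event-level records. The column names follow ACLED's layout
# ('country' is an assumption), but the rows are invented for illustration.
events = pd.DataFrame({
    'country': ['Mozambique', 'Mozambique', 'Kenya'],
    'event_type': ['Battles', 'Battles', 'Riots'],
    'fatalities': [2, 2, 1],
})

# Collapse to one row per country: number of events and total fatalities.
conflict = (
    events.groupby('country')
          .agg(n_events=('event_type', 'size'),
               fatalities=('fatalities', 'sum'))
          .reset_index()
)
print(conflict)
```

Either the event count or the fatality total could serve as the conflict indicator; which one suits the regression better is a modeling choice that this sketch does not settle.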
The Fifth Dataset: Governance Indicators
A government's failure to provide services such as health, education, or social welfare can create public dissatisfaction, and people may want to migrate to developed nations for these reasons. Poor governance is also linked with worse economic outcomes and with conflict. The World Bank Worldwide Governance Indicators will be used to assess political stability and government effectiveness.
The Sixth Dataset: Germanwatch Climate Risk Index
The Germanwatch Climate Risk Index will be used to assess climate-related factors.
The Seventh Dataset: Population
A country's population is also a crucial factor in explaining the size of its migration flows. This project uses the CIA World Factbook population dataset.
ETL (Extract, Transform, Load)
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Clone the repository, change the wd
!git clone https://github.com/barisalan00/barisalan00.github.io
%cd /home/jovyan/barisalan00.github.io
!pwd
Cloning into 'barisalan00.github.io'... remote: Enumerating objects: 141, done. remote: Counting objects: 100% (89/89), done. remote: Compressing objects: 100% (86/86), done. remote: Total 141 (delta 43), reused 2 (delta 2), pack-reused 52 (from 1) Receiving objects: 100% (141/141), 19.18 MiB | 105.00 KiB/s, done. Resolving deltas: 100% (57/57), done. /home/jovyan/barisalan00.github.io /home/jovyan/barisalan00.github.io
Import Datasets
# Main Immigration Dataset: Import Eurostat Immigration/2022 Dataset
euim22 = pd.read_csv('Eurostat-2022Migration-migr_imm1ctz__custom_10841676_linear.csv')
display(euim22.head(3))
# Total number of observations: 119479
display(len(euim22))
# Import Datahub.io 2-letter country codes dataset
country_codes2 = pd.read_csv('Datahub-CountryCodes-data_csv.csv')
display(country_codes2.head(3))
# Import 3-letter (ISO3) country codes datasets
country_codes3 = pd.read_csv('UN-iso3.csv')
display(country_codes3.head(3))
country_codes3_ = pd.read_excel('WB-CountryCodes.xlsx')
# Economic Indicator1: WB 2022 GDP/PC
gdppc = pd.read_csv('WB-2022GDPPC-Const.csv')
display(gdppc.head(3))
# Economic Indicator2: WB 2023 Multidimensional Poverty Measure
mpm = pd.read_excel('WB-2022MPM-Data-AM2022.xlsx')
display(mpm.head(3))
# Conflict Indicator1: ACLED 2022 Battles Dataset
battle = pd.read_csv('ACLED-2022Battles.csv')
display(battle.head(3))
# Conflict Indicator2: ACLED 2022 Riots Dataset
riot = pd.read_csv('ACLED-2022Riots.csv')
display(riot.head(3))
# Conflict Indicator3: ACLED 2022 Violence Dataset
violence = pd.read_csv('ACLED-2022ViolencesCivilians.csv')
display(violence.head(3))
# Political Indicators: WB Governance Indicators
govern = pd.read_csv('WB-2022GovIndic.csv')
display(govern.head(3))
# Climate Indicator: German Watch Climate Risk Index
climate = pd.read_csv('GermanWatch-2018CRI.csv')
display(climate.head(3))
# Population Indicator: CIA World Factbook - Population
population = pd.read_csv('CIA-Population.csv', encoding='latin1')
display(population.head(3))
DATAFLOW | LAST UPDATE | freq | citizen | agedef | age | unit | sex | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ESTAT:MIGR_IMM1CTZ(1.0) | 27/03/24 11:00:00 | A | AD | REACH | TOTAL | NR | F | AT | 2022 | 0 | NaN |
1 | ESTAT:MIGR_IMM1CTZ(1.0) | 27/03/24 11:00:00 | A | AD | REACH | TOTAL | NR | F | BG | 2022 | 0 | NaN |
2 | ESTAT:MIGR_IMM1CTZ(1.0) | 27/03/24 11:00:00 | A | AD | REACH | TOTAL | NR | F | CZ | 2022 | 0 | NaN |
119479
Name | Code | |
---|---|---|
0 | Afghanistan | AF |
1 | Åland Islands | AX |
2 | Albania | AL |
iso3 | name | |
---|---|---|
0 | BEL | Belgium |
1 | CH_ | China, mainland |
2 | GGY | Guernsey |
Country Name | Country Code | Series Name | Series Code | 2022 [YR2022] | |
---|---|---|---|---|---|
0 | Afghanistan | AFG | GDP per capita, PPP (constant 2017 internation... | NY.GDP.PCAP.PP.KD | .. |
1 | Africa Eastern and Southern | AFE | GDP per capita, PPP (constant 2017 internation... | NY.GDP.PCAP.PP.KD | 3566.269439 |
2 | Africa Western and Central | AFW | GDP per capita, PPP (constant 2017 internation... | NY.GDP.PCAP.PP.KD | 4066.48323 |
Region | Country code | Economy | Reporting year | Survey name | Survey year | Survey coverage | Welfare type | Survey comparability | Monetary (%) | Educational attainment (%) | Educational enrollment (%) | Electricity (%) | Sanitation (%) | Drinking water (%) | Multidimensional poverty headcount ratio (%) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ECA | ALB | Albania | 2018 | HBS | 2018 | N | c | 3.0 | 0.048107 | 0.192380 | - | 0.06025 | 6.579772 | 9.594966 | 0.293161 |
1 | SSA | AGO | Angola | 2018 | IDREA | 2018 | N | c | 2.0 | 31.122005 | 29.753423 | 27.44306 | 52.639532 | 53.637516 | 32.106507 | 47.203606 |
2 | LAC | ARG | Argentina | 2021 | EPHC-S2 | 2021 | U | i | 2.0 | 0.958847 | 1.085320 | 0.731351 | 0 | 0.193965 | 0.364048 | 0.971202 |
event_id_cnty | event_date | year | time_precision | disorder_type | event_type | sub_event_type | actor1 | assoc_actor_1 | inter1 | ... | location | latitude | longitude | geo_precision | source | source_scale | notes | fatalities | tags | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | DRC27768 | 31-Dec-22 | 2022 | 1 | Political violence | Battles | Armed clash | M23: March 23 Movement | NaN | 2 | ... | Karenga | -1.4724 | 29.0655 | 2 | Mediacongo.net; Radio Okapi | National | On 31 December 2022, during a two-day battle, ... | 0 | NaN | 1673291085 |
1 | MZM3154 | 31-Dec-22 | 2022 | 1 | Political violence | Battles | Armed clash | Islamist Militia (Mozambique) | NaN | 3 | ... | Namacule | -11.8567 | 39.8000 | 1 | AIM; Pinnacle News; Twitter; Zitamar | New media-National | On 31 December 2022, Islamist militia clashed ... | 2 | NaN | 1673291088 |
2 | MZM3155 | 31-Dec-22 | 2022 | 1 | Political violence | Battles | Armed clash | Islamist Militia (Mozambique) | NaN | 3 | ... | Namande | -11.8278 | 39.7416 | 1 | AIM; Pinnacle News; Twitter; VOA; Zitamar | New media-National | On 31 December 2022, Islamist militia clashed ... | 2 | NaN | 1673291088 |
3 rows × 31 columns
event_id_cnty | event_date | year | time_precision | disorder_type | event_type | sub_event_type | actor1 | assoc_actor_1 | inter1 | ... | location | latitude | longitude | geo_precision | source | source_scale | notes | fatalities | tags | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KEN9717 | 31 December 2022 | 2022 | 1 | Political violence | Riots | Mob violence | Rioters (Kenya) | Vigilante Group (Kenya) | 5 | ... | Kutus | -0.5753 | 37.3269 | 2 | Kenya Standard; NTV (Kenya) | New media-National | On 31 December 2022, a mob lynched a man, part... | 1 | crowd size=no report | 1673291087 |
1 | BRA62473 | 31 December 2022 | 2022 | 1 | Political violence | Riots | Mob violence | Rioters (Brazil) | Vigilante Group (Brazil) | 5 | ... | Maues | -3.3795 | -57.7196 | 1 | Portal do Holanda | Subnational | On 31 December 2022, in Maues (Amazonas), a su... | 0 | crowd size=no report | 1673295343 |
2 | BRA62488 | 31 December 2022 | 2022 | 1 | Political violence | Riots | Mob violence | Rioters (Brazil) | PL: Liberal Party | 5 | ... | Catalao | -18.1670 | -47.9448 | 1 | Estado de Minas | National | Property destruction: On 31 December 2022, in ... | 0 | crowd size=no report | 1673295343 |
3 rows × 31 columns
event_id_cnty | event_date | year | time_precision | disorder_type | event_type | sub_event_type | actor1 | assoc_actor_1 | inter1 | ... | location | latitude | longitude | geo_precision | source | source_scale | notes | fatalities | tags | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | DRC27766 | 31 December 2022 | 2022 | 1 | Political violence | Violence against civilians | Abduction/forced disappearance | Twirwaneho Ethnic Militia (Democratic Republic... | Banyamulenge Ethnic Militia (Democratic Republ... | 4 | ... | Mikenge | -3.4497 | 28.4476 | 1 | Kivu Times | Subnational | On 31 December 2022, Twirwaneho abducted a wom... | 0 | NaN | 1673291085 |
1 | SAF18067 | 31 December 2022 | 2022 | 1 | Political violence | Violence against civilians | Attack | Unidentified Armed Group (South Africa) | NaN | 3 | ... | Johannesburg | -26.2023 | 28.0436 | 1 | Zambia Reports | International | On 31 December 2022, unknown suspects shot and... | 1 | NaN | 1673291088 |
2 | SOM38915 | 31 December 2022 | 2022 | 1 | Political violence | Violence against civilians | Abduction/forced disappearance | Al Shabaab | NaN | 2 | ... | Ted | 4.4000 | 43.9167 | 2 | Undisclosed Source | Local partner-Other | On 31 December 2022, Al Shabaab abducted three... | 0 | NaN | 1673291088 |
3 rows × 31 columns
Country Name | Country Code | Series Name | Series Code | 2022 [YR2022] | |
---|---|---|---|---|---|
0 | Afghanistan | AFG | Political Stability and Absence of Violence/Te... | PV.EST | -2.550801754 |
1 | Afghanistan | AFG | Voice and Accountability: Estimate | VA.EST | -1.751587272 |
2 | Korea, Dem. People's Rep. | PRK | Voice and Accountability: Percentile Rank | VA.PER.RNK | 0 |
CRI\rRank | Country | CRI\rscore | Fatalities\rin 2018\r(Rank) | Fatalities per\r100 000 inhab-\ritants (Rank) | Losses in mil-\rlion US$ (PPP)\r(Rank) | Losses per\runit GDP in\r% (Rank) | |
---|---|---|---|---|---|---|---|
0 | 1 | Japan | 5.50 | 2 | 2 | 3 | 12 |
1 | 2 | Philippines | 11.17 | 4 | 14 | 7 | 14 |
2 | 3 | Germany | 13.83 | 3 | 1 | 6 | 36 |
name | slug | value | date_of_information | ranking | region | |
---|---|---|---|---|---|---|
0 | Afghanistan | afghanistan | 38,346,720 | 2022 est. | 37.0 | South Asia |
1 | Albania | albania | 3,095,344 | 2022 est. | 136.0 | Europe |
2 | Algeria | algeria | 44,178,884 | 2022 est. | 34.0 | Africa |
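Two details in the previews above affect later numeric work: the World Bank file marks a missing value with `..`, and the CIA population values carry thousands separators, so both columns would load as text. The following is a minimal sketch of how `pd.read_csv` options could handle this at load time. It uses small inline CSV snippets rather than the project files, and the `*_demo` names are illustrative only.

```python
import io
import pandas as pd

# Toy CSV mimicking the World Bank export, where '..' marks a missing value.
wb_csv = io.StringIO(
    "Country Name,Country Code,2022 [YR2022]\n"
    "Afghanistan,AFG,..\n"
    "Africa Eastern and Southern,AFE,3566.269439\n"
)
# na_values tells pandas to read '..' as NaN, so the column stays numeric.
gdppc_demo = pd.read_csv(wb_csv, na_values='..')

# Toy CSV mimicking the CIA export, where numbers carry thousands separators.
cia_csv = io.StringIO(
    'name,value\n'
    'Afghanistan,"38,346,720"\n'
    'Albania,"3,095,344"\n'
)
# thousands=',' strips the separators while parsing.
population_demo = pd.read_csv(cia_csv, thousands=',')

print(gdppc_demo.dtypes)
print(population_demo.dtypes)
```

Cleaning the columns after loading (for example with `pd.to_numeric`) would work as well; handling it in `read_csv` simply keeps the fix next to the import.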
Transform and Tidy Data
# Check the dtypes for euim22
# The year (TIME_PERIOD) and flow (OBS_VALUE) columns are integers, and the rest are objects, as expected.
euim22.dtypes
DATAFLOW object LAST UPDATE object freq object citizen object agedef object age object unit object sex object geo object TIME_PERIOD int64 OBS_VALUE int64 OBS_FLAG object dtype: object
# Keep only necessary columns and drop redundant ones
euim22 = euim22[['citizen', 'age', 'sex', 'geo', 'TIME_PERIOD', 'OBS_VALUE']]
euim22.head()
citizen | age | sex | geo | TIME_PERIOD | OBS_VALUE | |
---|---|---|---|---|---|---|
0 | AD | TOTAL | F | AT | 2022 | 0 |
1 | AD | TOTAL | F | BG | 2022 | 0 |
2 | AD | TOTAL | F | CZ | 2022 | 0 |
3 | AD | TOTAL | F | EE | 2022 | 0 |
4 | AD | TOTAL | F | FI | 2022 | 0 |
# Rename columns for readability
euim22.rename(columns={'citizen':'Migrant_Citizenship',
'age': 'Age',
'sex': 'Gender',
'geo':'Receiving_CCode',
'TIME_PERIOD':'Year',
'OBS_VALUE':'Flow'},inplace=True)
euim22.head()
Migrant_Citizenship | Age | Gender | Receiving_CCode | Year | Flow | |
---|---|---|---|---|---|---|
0 | AD | TOTAL | F | AT | 2022 | 0 |
1 | AD | TOTAL | F | BG | 2022 | 0 |
2 | AD | TOTAL | F | CZ | 2022 | 0 |
3 | AD | TOTAL | F | EE | 2022 | 0 |
4 | AD | TOTAL | F | FI | 2022 | 0 |
Although the Eurostat data dashboard displays country names alongside the country codes, the downloaded dataset does not include them. I will therefore use the "country_codes2" dataset from datahub.io to retrieve country names for both the citizenship codes and the receiving-country codes.
# Bring country name information for the Migrant_Citizenship column (left join to keep all observations in euim22)
euim22 = pd.merge(euim22, country_codes2, how='left', left_on='Migrant_Citizenship', right_on='Code')
# Drop the redundant "Code" column
euim22.drop('Code', axis=1, inplace=True)
# Rename the 'Name' column to 'Sending_Country'
euim22.rename(columns={'Name':'Sending_Country'}, inplace=True)
# Move Sending_Country after Migrant_Citizenship
col = euim22.pop('Sending_Country')
euim22.insert(1, col.name, col)
euim22.head(3)
Migrant_Citizenship | Sending_Country | Age | Gender | Receiving_CCode | Year | Flow | |
---|---|---|---|---|---|---|---|
0 | AD | Andorra | TOTAL | F | AT | 2022 | 0 |
1 | AD | Andorra | TOTAL | F | BG | 2022 | 0 |
2 | AD | Andorra | TOTAL | F | CZ | 2022 | 0 |
# Bring country name information for the Receiving_CCode column (left join to keep all observations in euim22)
euim22 = pd.merge(euim22, country_codes2, how='left',left_on='Receiving_CCode', right_on='Code')
# Drop the redundant "Code" column
euim22.drop('Code', axis=1, inplace=True)
# Rename the 'Name' column to 'Receiving_Country'
euim22.rename(columns={'Name':'Receiving_Country'}, inplace=True)
# Move Receiving_Country after Receiving_CCode
col = euim22.pop('Receiving_Country')
euim22.insert(5, col.name, col)
euim22.head(3)
Migrant_Citizenship | Sending_Country | Age | Gender | Receiving_CCode | Receiving_Country | Year | Flow | |
---|---|---|---|---|---|---|---|---|
0 | AD | Andorra | TOTAL | F | AT | Austria | 2022 | 0 |
1 | AD | Andorra | TOTAL | F | BG | Bulgaria | 2022 | 0 |
2 | AD | Andorra | TOTAL | F | CZ | Czech Republic | 2022 | 0 |
# Check whether the merge operations added or dropped any rows.
# We had 119479 observations at the beginning and still have 119479, so nothing is missing.
len(euim22)
119479
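The row-count check above can also be enforced at merge time. The sketch below is one way to do that on toy data, not the notebook's actual frames: `validate='many_to_one'` makes `pd.merge` raise an error if the lookup table holds duplicate codes (which would multiply rows), and `indicator=True` adds a `_merge` column that flags codes with no match. The `flows` and `codes` frames are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for the migration data and the code lookup table.
flows = pd.DataFrame({'Receiving_CCode': ['AT', 'EL', 'EU27_2020'],
                      'Flow': [0, 5, 7]})
codes = pd.DataFrame({'Code': ['AT', 'GR'],
                      'Name': ['Austria', 'Greece']})

merged = pd.merge(
    flows, codes, how='left',
    left_on='Receiving_CCode', right_on='Code',
    validate='many_to_one',   # fails if 'Code' is not unique in the lookup
    indicator=True,           # adds '_merge': 'both' or 'left_only'
)

# Codes that found no name in the lookup table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Receiving_CCode'].tolist()
print(len(merged), unmatched)
```

With these options the merge itself reports both failure modes, so the length comparison and the later search for missing names become a single step.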
# Are there any missing values under Receiving_Country?
# 'EL' is the country code for Greece, which uses both 'GR' (in international systems) and 'EL' (in European systems).
# 'EU27_2020' is the code for the 27 EU countries as an aggregate.
euim22[euim22['Receiving_Country'].isna()]['Receiving_CCode'].unique()
array(['EL', 'EU27_2020'], dtype=object)
# Fill these NaN values: 'EL' with Greece and 'EU27_2020' with EU27
euim22.loc[euim22['Receiving_CCode'] == 'EL', 'Receiving_Country'] = 'Greece'
euim22.loc[euim22['Receiving_CCode'] == 'EU27_2020', 'Receiving_Country'] = 'EU27'
# Are all receiving countries EU27 members? Iceland, Liechtenstein, Norway, and Switzerland are not.
display(euim22['Receiving_Country'].unique())
array(['Austria', 'Bulgaria', 'Czech Republic', 'Estonia', 'Finland', 'Croatia', 'Hungary', 'Iceland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Netherlands', 'Norway', 'Romania', 'Sweden', 'Slovenia', 'Slovakia', 'Spain', 'France', 'Belgium', 'Switzerland', 'Cyprus', 'Germany', 'Denmark', 'Greece', 'EU27', 'Ireland', 'Liechtenstein', 'Malta', 'Poland', 'Portugal'], dtype=object)
# Drop these 4 countries: Now we have 27 EU countries + 1 EU27 Aggregated observation
countries_to_drop = ['Iceland', 'Liechtenstein', 'Norway', 'Switzerland']
euim22 = euim22[~euim22['Receiving_Country'].isin(countries_to_drop)]
eu27 = (euim22['Receiving_Country'].unique())
print(eu27)
['Austria' 'Bulgaria' 'Czech Republic' 'Estonia' 'Finland' 'Croatia' 'Hungary' 'Italy' 'Lithuania' 'Luxembourg' 'Latvia' 'Netherlands' 'Romania' 'Sweden' 'Slovenia' 'Slovakia' 'Spain' 'France' 'Belgium' 'Cyprus' 'Germany' 'Denmark' 'Greece' 'EU27' 'Ireland' 'Malta' 'Poland' 'Portugal']
# Are there any NaN cells under the Sending_Country column? --> 20017 observations are missing.
euim22['Sending_Country'].isna().sum()
20017
# Let's check the unique values for these 20017 NaN observations.
euim22[euim22['Sending_Country'].isna()]['Migrant_Citizenship'].unique()
array(['AFR', 'AFR_C', 'AFR_E', 'AFR_N', 'AFR_S', 'AFR_W', 'AME', 'AME_C', 'AME_N', 'AME_S', 'ASI', 'ASI_C', 'ASI_E', 'ASI_S', 'ASI_S_E', 'ASI_W', 'AU_NZ', 'CC8_22_FOR', 'CRB', 'CZ_SK', 'EFTA_FOR', 'EL', 'EU27_2020_FOR', 'EUR', 'EX_SU', 'EX_YU', 'FOR_STLS', 'MEL', 'MIC', 'NAT', 'NEU27_2020_FOR', 'OCE', 'POL', 'RNC', 'RS_ME', 'STLS', 'TOTAL', 'UK', 'UNK', 'XK'], dtype=object)
The NaN values under the "Sending_Country" column correspond to the codes displayed in the array above. Notably, most of these codes are longer than the 2-letter country codes.
As per the Eurostat system, most of these codes represent continents or continental subregions, such as 'AFR' = Africa and 'ASI_W' = West Asia, which aggregate the countries within them. Keeping these aggregates alongside the country rows would double-count immigrants, but they remain crucial for continental flow analysis.
Additionally, specific codes represent regions and special categories: 'AU_NZ': Australia-New Zealand, 'CC8_22_FOR': 8 Candidate Countries, 'CZ_SK': Czechoslovakia, 'EFTA_FOR': European Free Trade Association Countries, 'EL': Greece, 'EU27_2020_FOR': EU27 countries except the reporting country, 'EUR': Europe, 'EX_SU': Soviet Union, 'EX_YU': Yugoslavia, 'FOR_STLS': Foreign country and stateless, 'NAT': Reporting Country, 'NEU27_2020_FOR': Non-EU27 countries, excluding the reporting country, 'OCE': Oceania, 'RNC': Recognized Non-Citizens, 'RS_ME': Serbia and Montenegro, 'STLS': Stateless, 'TOTAL': Total, 'UNK': Unknown, 'XK': Kosovo.
For analytical purposes, all continental and regional observations will be excluded from the primary analysis. A secondary continental dataset will be created, and these observations will be removed from the original "euim22" dataset to prevent duplication. However, the 'STLS': Stateless, 'RNC': Recognized Non-Citizens, and 'UNK': Unknown observations will be retained in the original dataset, as they are not represented under any country observation and can be treated as distinct entities. 'EU27_2020_FOR', 'NEU27_2020_FOR', and 'TOTAL' will also be kept in the dataset for calculations.
# Replace the NaN values under Sending_Country for the special Migrant_Citizenship codes
# ('STLS': Stateless, 'RNC': Recognized Non-Citizens, 'UNK': Unknown) and for the aggregates kept for calculations.
# After this step, all non-NaN observations under the Sending_Country column are part of our analysis.
euim22.loc[euim22['Migrant_Citizenship'] == 'STLS', 'Sending_Country'] = 'Stateless'
euim22.loc[euim22['Migrant_Citizenship'] == 'RNC', 'Sending_Country'] = 'Non-Citizens'
euim22.loc[euim22['Migrant_Citizenship'] == 'UNK', 'Sending_Country'] = 'Unkown'
euim22.loc[euim22['Migrant_Citizenship'] == 'EU27_2020_FOR', 'Sending_Country'] = 'EU27'
euim22.loc[euim22['Migrant_Citizenship'] == 'NEU27_2020_FOR', 'Sending_Country'] = 'Non-EU27'
euim22.loc[euim22['Migrant_Citizenship'] == 'TOTAL', 'Sending_Country'] = 'Total'
euim22.loc[euim22['Migrant_Citizenship'] == 'EL', 'Sending_Country'] = 'Greece'
# How many missing values now: 15129
display(euim22['Sending_Country'].isna().sum())
15129
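The relabeling of special codes can also be written as a single lookup. The sketch below shows that alternative on a toy frame, assuming the same column names as the notebook: a dictionary maps the special codes to labels, and `fillna` applies it only where `Sending_Country` is still missing. The `special_labels` and `demo` names are hypothetical.

```python
import pandas as pd

# Special Eurostat codes and the labels used in this notebook
# ('Unkown' is spelled as in the notebook's own outputs).
special_labels = {
    'STLS': 'Stateless',
    'RNC': 'Non-Citizens',
    'UNK': 'Unkown',
    'EU27_2020_FOR': 'EU27',
    'NEU27_2020_FOR': 'Non-EU27',
    'TOTAL': 'Total',
    'EL': 'Greece',
}

# Toy frame: one ordinary country, one special code, one continental
# aggregate that should stay missing, and the overall total.
demo = pd.DataFrame({
    'Migrant_Citizenship': ['AD', 'STLS', 'AFR', 'TOTAL'],
    'Sending_Country': ['Andorra', None, None, None],
})

# Fill only the missing names; codes absent from the dictionary stay NaN.
demo['Sending_Country'] = demo['Sending_Country'].fillna(
    demo['Migrant_Citizenship'].map(special_labels)
)
print(demo)
```

Keeping the labels in one dictionary makes it easier to see at a glance which codes are retained and which are left as NaN for the later `dropna`.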
# Create a 2nd dataset to keep continental observations (an explicit copy, so later edits to euim22 cannot affect it).
euim22_continents = euim22.copy()
# Now we can delete the continent/region observations from euim22 (the NaN observations under the Sending_Country column)
euim22 = euim22.dropna(subset=['Sending_Country'])
# The new dataframe has 93370 rows.
len(euim22)
93370
# Some extra renaming for easier coding
euim22.loc[euim22['Age'] == 'TOTAL', 'Age'] = 'Total'
euim22.loc[euim22['Gender'] == 'T', 'Gender'] = 'Total'
euim22.loc[euim22['Migrant_Citizenship'] == 'TOTAL', 'Migrant_Citizenship'] = 'Total'
Basic Summary Statistics
How many immigrants did arrive in the EU countries in 2022 from non-European countries?
According to Frontex (the European Border and Coast Guard Agency) and Eurostat, a total of 5.1 million immigrants entered EU countries from non-EU countries, an increase of 117% compared to 2021 (2.7 million).
Our dataset (code below) shows that total arrivals to the EU27 amount to almost 7 million individuals. Of these, 4.8 million immigrants came from non-EU27 countries, while 1.1 million arrived from other EU27 countries. Considering the challenges of managing and compiling data across 27 countries, the 300K difference between the 5.1 million reported above and the 4.8 million in our dataset is negligible.
Germany emerges as the top destination, with a total of 2.1 million immigrants arriving, followed by Spain (1.2 million), France (430K), and Italy (410K). The ranking is similar for arrivals from non-EU countries, except that the Czech Republic (330K) takes third place.
# Total immigration to the EU27, and the split between EU27 and non-EU27 origins
euim22[(euim22['Sending_Country'].isin(['Total', 'EU27', 'Non-EU27'])) & (euim22['Receiving_Country']=='EU27') & (euim22['Age']=='Total') & (euim22['Gender']=='Total')].sort_values(by='Flow', ascending=False)
Migrant_Citizenship | Sending_Country | Age | Gender | Receiving_CCode | Receiving_Country | Year | Flow | |
---|---|---|---|---|---|---|---|---|
106898 | Total | Total | Total | Total | EU27_2020 | EU27 | 2022 | 6977742 |
78980 | NEU27_2020_FOR | Non-EU27 | Total | Total | EU27_2020 | EU27 | 2022 | 4777475 |
38564 | EU27_2020_FOR | EU27 | Total | Total | EU27_2020 | EU27 | 2022 | 1098032 |
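The three rows above do not add up on their own, so it is worth working through the arithmetic. The sketch below uses only the figures from this table plus the 5.1 million cited in the text. Reading the remainder as arrivals outside the two foreign groups (for example returning nationals, stateless persons, or unknown citizenship) is an assumption, not something the table states.

```python
# Figures copied from the table above.
total = 6_977_742       # all arrivals to the EU27
non_eu = 4_777_475      # citizens of non-EU27 countries
other_eu = 1_098_032    # citizens of other EU27 countries

# Arrivals not covered by the two foreign groups.
remainder = total - non_eu - other_eu

# Shares of all arrivals.
share_non_eu = non_eu / total
share_other_eu = other_eu / total
share_remainder = remainder / total

# Gap to the rounded 5.1 million figure cited in the text.
gap_to_reported = 5_100_000 - non_eu

print(remainder, gap_to_reported)
print(round(share_non_eu, 3), round(share_other_eu, 3), round(share_remainder, 3))
```

The remainder is about the same size as the intra-EU flow, which is worth keeping in mind when the "almost 7 million" headline is compared with the non-EU figure.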
# Total number of arrivals by receiving EU27 country
euim22[(euim22['Sending_Country']=='Total') & (euim22['Age']=='Total') & (euim22['Gender']=='Total')].sort_values(by='Flow', ascending=False).head(5)
Migrant_Citizenship | Sending_Country | Age | Gender | Receiving_CCode | Receiving_Country | Year | Flow | |
---|---|---|---|---|---|---|---|---|
106898 | Total | Total | Total | Total | EU27_2020 | EU27 | 2022 | 6977742 |
106893 | Total | Total | Total | Total | DE | Germany | 2022 | 2071690 |
106897 | Total | Total | Total | Total | ES | Spain | 2022 | 1258894 |
106900 | Total | Total | Total | Total | FR | France | 2022 | 431017 |
106905 | Total | Total | Total | Total | IT | Italy | 2022 | 410985 |
# Total number of arrivals from non-EU27 countries
euim22[(euim22['Sending_Country']=='Non-EU27') & (euim22['Age']=='Total') & (euim22['Gender']=='Total')].sort_values(by='Flow', ascending=False).head(5)
Migrant_Citizenship | Sending_Country | Age | Gender | Receiving_CCode | Receiving_Country | Year | Flow | |
---|---|---|---|---|---|---|---|---|
78980 | NEU27_2020_FOR | Non-EU27 | Total | Total | EU27_2020 | EU27 | 2022 | 4777475 |
78975 | NEU27_2020_FOR | Non-EU27 | Total | Total | DE | Germany | 2022 | 1630619 |
78979 | NEU27_2020_FOR | Non-EU27 | Total | Total | ES | Spain | 2022 | 925587 |
78974 | NEU27_2020_FOR | Non-EU27 | Total | Total | CZ | Czech Republic | 2022 | 330997 |
78987 | NEU27_2020_FOR | Non-EU27 | Total | Total | IT | Italy | 2022 | 287010 |
# Create a new dataset by dropping the 'Total', 'EU27', and 'Non-EU27' observations under 'Sending_Country', and 'EU27' under 'Receiving'
# Total number of observations decreased to 90958.
immig = euim22[~euim22['Sending_Country'].isin(['Total', 'EU27', 'Non-EU27'])]
len(immig)
90958
# Drop 'EU27' under 'Receiving_Country'
# Total number of observations decreased to 90952.
immig = immig[~immig['Receiving_Country'].isin(['EU27'])]
len(immig)
90952
Crucial Note on Migration Dataset
The Eurostat immigration dataset includes aggregate observations ('Total' for all arrivals, 'EU27' for arrivals from other EU27 countries, and 'Non-EU27' for arrivals from non-EU27 countries) that give a comprehensive view of the overall numbers. According to these observations, total arrivals to EU27 countries amount to 7 million, with 4.8 million originating from non-EU27 countries.
However, after removing these aggregate observations to eliminate duplicates and summing the arrivals over the individual sending countries, we find a significantly lower figure of 3.4 million. Of this immigration flow, 1.1 million arrivals are from other EU27 countries, while 2.2 million are from non-EU27 countries.
These numbers contrast starkly with the aggregate figures from Frontex and Eurostat, primarily because some Sending_Country observations are missing from the dataset. For instance, while the aggregate data shows that Germany received 2.1 million immigrants in 2022, of which 1.6 million were from non-EU countries, filtering arrivals to Germany by Sending_Country yields only 6226 migrants, all recorded as Stateless or Unknown.
This means that Germany did not release its arrivals by sending country. The same situation exists for some other EU member states. It is therefore evident that our dataset contains missing values under the Sending_Country column, which makes a country-level analysis of EU members difficult.
I checked datasets from various organizations and projects focusing on international migration flows, including the International Organization for Migration, the Global Migration Data Portal, the UN Global Migration Database, the OECD International Migration Database, the World Bank Global Bilateral Migration database, and some other independent projects and academic research. None of these sources provides a complete dataset of bilateral migration flows. At present, the Eurostat dataset stands as the most comprehensive option. Eurostat officials stated that the current version of the dataset is the most comprehensive one and that member countries have discretion not to release the full details. Therefore, this project will use the total arrivals to the EU27 by sending country instead of a country-level analysis of individual EU countries.
# What is total number of arrivals? 3.4 million.
immig[(immig['Age']=='Total') & (immig['Gender']=='Total')]['Flow'].sum()
3406513
# Total number of arrivals from EU27 countries: 1.1 million
immig[(immig['Sending_Country'].isin(eu27)) & (immig['Age']=='Total') & (immig['Gender']=='Total')]['Flow'].sum()
1171307
# Total number of arrivals from non-EU27 countries: 2.2 million.
immig_noneu = immig[~immig['Sending_Country'].isin(eu27)]
immig_noneu[(immig_noneu['Age']=='Total') & (immig_noneu['Gender']=='Total')]['Flow'].sum()
2235206
# Total number of arrivals in Germany from non-EU27 countries: 6226
display(immig_noneu[(immig_noneu['Receiving_Country']=='Germany') & (immig_noneu['Age']=='Total') & (immig_noneu['Gender']=='Total')]['Flow'].sum())
display(immig_noneu[immig_noneu['Receiving_Country']=='Germany']['Sending_Country'].unique())
6226
array(['Stateless', 'Unkown'], dtype=object)
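The gap described in this note can be measured for each receiving country as a coverage ratio: itemized arrivals divided by the reported aggregate. The sketch below runs on toy data. The Germany aggregate (1,630,619) and the itemized sum (6226) come from the outputs above, while the split between Germany's two rows and the entire 'Examplia' country are invented for illustration.

```python
import pandas as pd

# Reported non-EU27 aggregates per receiving country.
# 'Examplia' is a hypothetical country whose arrivals are fully itemized.
aggregate = pd.Series({'Germany': 1_630_619, 'Examplia': 1_000})

# Itemized arrivals by sending country. Only Germany's sum (6226) is taken
# from the notebook; the individual row values are made up.
detail = pd.DataFrame({
    'Receiving_Country': ['Germany', 'Germany', 'Examplia', 'Examplia'],
    'Sending_Country': ['Stateless', 'Unkown', 'Syria', 'Morocco'],
    'Flow': [4_000, 2_226, 600, 400],
})

# Sum the itemized rows per receiver and compare with the aggregate.
itemized = detail.groupby('Receiving_Country')['Flow'].sum()
coverage = (itemized / aggregate).sort_values()
print(coverage)
```

Receivers with coverage near zero are the ones that did not release arrivals by sending country, so they contribute almost nothing to the itemized totals.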
How does the gender distribution among immigrants break down?
In this dataset, slightly more women (1.14 million) than men (1.09 million) arrived in the EU from non-EU27 countries in 2022, as the output below shows. This near balance stands in contrast to irregular migration, where migrants enter the EU illegally by crossing the Mediterranean and Aegean seas with the assistance of smugglers. Due to the perilous, life-threatening nature of these routes, men often aim to arrive first and secure asylum before bringing their families. Additionally, in regions such as the Middle East, Africa, and South Asia, young unmarried men are more likely to immigrate to Europe than young unmarried women.
The gender breakdown holds importance for various reasons. Some groups advocate for the inclusion of women, children, and the elderly while excluding men, while others argue that there is a labor shortage in the European labor market, making adult men crucial in filling this gap. More conservative groups express concerns about the potential impact of adult male immigrants on distorting European society. Therefore, understanding the gender and age demographics is crucial to assessing the validity of such perceived threats.
# Group by 'Gender' column and sum 'Flow' column
immig_gender = immig_noneu[(immig_noneu['Age'] == 'Total')].groupby('Gender')['Flow'].sum()
print(immig_gender)
plt.figure(figsize=(8, 6))
plot_gender = immig_gender.plot(kind='bar', color=['pink','blue', 'green'])
plt.xlabel('Gender')
plt.ylabel('Total Arrivals')
plt.title('Total Migration Flow from Non-EU27 Countries Based on Gender')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Gender
F        1142144
M        1092917
Total    2235206
Name: Flow, dtype: int64
What about age demographics?
Approximately 52% (about 1.17 million) of arriving immigrants are 34 years old or younger, with 323K younger than 15. From the perspective of certain groups within the EU, these numbers may be perceived as a potential threat to European society. An additional observation is that, beyond the 25-29 bracket, the number of arrivals generally decreases as age increases, as depicted in the accompanying plot.
However, from a humanitarian standpoint, these figures underscore the desperation of immigrants who, in the face of civil war, economic hardships, or climate-related challenges, flee their home countries with their children, aspiring to reach Europe. Those in the age group of 20-29 are often individuals who have completed their education or recently started a family but struggle to make a living in their home countries. Frustration with poverty, corruption, and economic challenges compels them to seek better living conditions in Europe.
As future milestones incorporate additional datasets into the analysis, a clearer picture will emerge regarding the underlying reasons behind these migration patterns.
# Keep the rows where Gender is 'Total', then group by 'Age'.
# Y_LT15: younger than 15; Y_GE65: 65 and older. The other age brackets are self-explanatory.
immig_age = immig_noneu[(immig_noneu['Gender'] == 'Total')].groupby('Age')['Flow'].sum()
# Reindex the age brackets from youngest to oldest (the aggregate row is labeled 'Total')
display(immig_age.reindex(['Total','Y_LT15', 'Y15-19', 'Y20-24', 'Y25-29', 'Y30-34', 'Y35-39', 'Y40-44', 'Y45-49', 'Y50-54', 'Y55-59', 'Y60-64', 'Y_GE65']))
#Plot
plot_age = immig_age.reindex(['Y_LT15', 'Y15-19', 'Y20-24', 'Y25-29', 'Y30-34', 'Y35-39', 'Y40-44', 'Y45-49', 'Y50-54', 'Y55-59', 'Y60-64', 'Y_GE65']).plot(kind='bar', color='skyblue', edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Total Flow')
plt.title('Total Flow Based on Age Breaks')
Age
Total     2235206
Y_LT15     323229
Y15-19     145583
Y20-24     199836
Y25-29     258376
Y30-34     247847
Y35-39     212671
Y40-44     162976
Y45-49     117268
Y50-54      83435
Y55-59      56993
Y60-64      45407
Y_GE65      67215
Name: Flow, dtype: int64
Text(0.5, 1.0, 'Total Flow Based on Age Breaks')
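As a quick arithmetic check on the 52% figure quoted in the text, the sketch below re-enters the age totals printed above by hand (rather than reading them from `immig_noneu`) and divides the under-35 sum by the 2,235,206 overall total reported earlier. The variable names are illustrative only.

```python
import pandas as pd

# Age totals copied by hand from the output printed above
age_totals = pd.Series({
    'Y_LT15': 323229, 'Y15-19': 145583, 'Y20-24': 199836,
    'Y25-29': 258376, 'Y30-34': 247847, 'Y35-39': 212671,
    'Y40-44': 162976, 'Y45-49': 117268, 'Y50-54': 83435,
    'Y55-59': 56993, 'Y60-64': 45407, 'Y_GE65': 67215,
})
overall_total = 2235206  # arrivals with Age == 'Total' and Gender == 'Total'

# Sum of the five youngest brackets (34 years old or younger)
under_35 = age_totals.loc[['Y_LT15', 'Y15-19', 'Y20-24', 'Y25-29', 'Y30-34']].sum()

share_of_overall = under_35 / overall_total          # relative to all arrivals
share_of_known_ages = under_35 / age_totals.sum()    # relative to arrivals with a listed bracket

print(under_35, round(share_of_overall, 3), round(share_of_known_ages, 3))
```

The listed brackets add up to roughly 1.92 million, which is less than the overall total, so the share is about 53% of all arrivals but about 61% of the arrivals that fall into one of the listed brackets.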
Which EU countries receive the highest number of immigrants?
Before going into the details, it is worth recalling the limited coverage of this dataset. As clarified above, Germany received a total of 1.6 million immigrants from non-EU27 countries, yet this dataset records only 6,226 of them.
Initially, both the plot and the list below indicate that Spain and Italy have received more than half of the total immigrants. This observation underscores that Africa and the Middle East remain the primary regions of origin for migrants.
Additionally, the substantial influx of immigrants into Central and Eastern EU countries is noteworthy, signifying the impact of the Invasion of Ukraine. This surge in migration patterns in these regions is a notable consequence of the geopolitical events in Ukraine.
# Total immigrants by receiving country
total_by_receiving = immig_noneu[(immig_noneu['Age'] == 'Total') & (immig_noneu['Gender'] == 'Total')].groupby('Receiving_Country')['Flow'].sum()
# Sort them
total_by_receiving = total_by_receiving.sort_values(ascending=False)
print(total_by_receiving)
#Plot
total_by_receiving.plot(kind='bar', color='skyblue', edgecolor='black')
plt.xlabel('Receiving Country')
plt.ylabel('Total Flow')
plt.title('Total Flow Based on Receiving Country')
Receiving_Country
Spain             857915
Czech Republic    330362
Italy             283740
Netherlands       189198
Austria           115919
Romania            91019
Lithuania          66139
Sweden             54356
Hungary            43601
Croatia            40073
Estonia            38898
Finland            33014
Latvia             29800
Slovenia           24269
Luxembourg         14555
Bulgaria           13885
Germany             6226
Belgium              922
Slovakia             561
Denmark              380
Ireland              297
Poland                77
France                 0
Malta                  0
Cyprus                 0
Portugal               0
Greece                 0
Name: Flow, dtype: int64
Text(0.5, 1.0, 'Total Flow Based on Receiving Country')
Which country has sent the highest number of immigrants to EU countries?
Consistent with the analysis above, the table and plot below reveal that Ukraine, Latin American countries, and North African countries are the primary sources of immigration. It is not unexpected to find India in the top ten, given that India and China are the two most populous countries globally; China, however, does not appear among the ten largest senders shown here.
# Total number of immigrants by sending country
total_by_sending = immig_noneu[(immig_noneu['Age'] == 'Total') & (immig_noneu['Gender'] == 'Total')].groupby('Sending_Country')['Flow'].sum()
# Rename countries with long names
total_by_sending.index = total_by_sending.index.str.replace('Venezuela, Bolivarian Republic of', 'Venezuela')
# Sort and display
total_by_sending = total_by_sending.sort_values(ascending=False)
display(total_by_sending.head(10))
# Plot the immigrants by top- 10 sending country
total_by_sending.head(10).plot(kind='bar', color='skyblue', edgecolor='black')
plt.xlabel('Sending Country')
plt.ylabel('Total Flow')
plt.title('Total Flow Based on Sending Country')
Sending_Country
Ukraine                 764356
Colombia                176955
Morocco                 138886
Venezuela                85238
Peru                     74722
India                    59999
Russian Federation       52933
Argentina                49531
Pakistan                 42479
Syrian Arab Republic     42466
Name: Flow, dtype: int64
Text(0.5, 1.0, 'Total Flow Based on Sending Country')
Further ETLΒΆ
As clarified above, a second goal of this project is to statistically analyze the effects of various push factors, measured by economic, political, climate, conflict, and gender-related indicators.
From this point on, we will use the total flows and will not need the age and gender breakdowns. Therefore, we will drop the male-female and age breakdowns and keep only the total for each sending country. Because we are now working with the total flow to the EU, there is one observation per sending country, so the dataset has 174 observations.
# Keep one aggregate sum per sending country
immigration = immig_noneu.groupby('Sending_Country').agg({'Flow': 'sum','Migrant_Citizenship': 'first'}).reset_index()
immigration
| | Sending_Country | Flow | Migrant_Citizenship |
---|---|---|---|
0 | Afghanistan | 53037 | AF |
1 | Albania | 131294 | AL |
2 | Algeria | 58852 | DZ |
3 | Andorra | 30 | AD |
4 | Angola | 828 | AO |
... | ... | ... | ... |
169 | Viet Nam | 22106 | VN |
170 | Western Sahara | 0 | EH |
171 | Yemen | 8918 | YE |
172 | Zambia | 698 | ZM |
173 | Zimbabwe | 1610 | ZW |
174 rows × 3 columns
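The Flow values used from here on are larger than the totals in the top-10 list earlier (for example, Venezuela shows 85,238 there and 340,468 in later outputs), which suggests that the group-by above also sums the age and gender breakdown rows. A minimal sketch of aggregating only the `Total` rows follows, using a toy frame; the column names mirror the ones used here, but the numbers are made up for illustration.

```python
import pandas as pd

# Toy data: one 'Total' row plus gender breakdown rows per sending country
toy = pd.DataFrame({
    'Sending_Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Migrant_Citizenship': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'Age': ['Total'] * 6,
    'Gender': ['Total', 'F', 'M', 'Total', 'F', 'M'],
    'Flow': [100, 60, 40, 50, 20, 30],
})

# Summing every row counts each migrant more than once
naive = toy.groupby('Sending_Country')['Flow'].sum()

# Keeping only the Total/Total rows gives one count per migrant
totals_only = toy[(toy['Age'] == 'Total') & (toy['Gender'] == 'Total')]
clean = (totals_only.groupby('Sending_Country')
         .agg({'Flow': 'sum', 'Migrant_Citizenship': 'first'})
         .reset_index())

print(naive.to_dict())
print(clean)
```

In the toy frame the unfiltered sum doubles each country's flow, while the filtered version returns the `Total` rows unchanged.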
#Rename countries with long names
immigration['Sending_Country'] = immigration['Sending_Country'].replace({
'Venezuela, Bolivarian Republic of': 'Venezuela',
'Syrian Arab Republic': 'Syria',
"Korea, Democratic People's Republic of": 'North Korea',
'Taiwan, Province of China': 'Taiwan',
'Holy See (Vatican City State)': 'Vatican City',
'Tanzania, United Republic of': 'Tanzania',
'Macedonia, the Former Yugoslav Republic of': 'Macedonia',
'Iran, Islamic Republic of': 'Iran',
'Bolivia, Plurinational State of': 'Bolivia'
})
Merge with Economic Indicator1: GDP/PCΒΆ
The GDP/PC (GDP per capita) dataset has country names and three-letter ISO country codes. To merge it with the immigration dataset, we need these three-letter codes for the sending countries. Therefore, the UN dataset will first be used to bring in the three-letter codes for Sending_Country.
# Merge immigration dataframe with UN country_codes3
immigration = pd.merge(immigration, country_codes3, left_on='Sending_Country', right_on='name', how='left')
immigration.drop(columns=['name'], inplace=True)
immigration.rename(columns={'iso3': 'Sending_iso3'}, inplace=True)
immigration.rename(columns={'Migrant_Citizenship': 'Sending_iso2'}, inplace=True)
display(len(immigration))
immigration.head(5)
174
| | Sending_Country | Flow | Sending_iso2 | Sending_iso3 |
---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG |
1 | Albania | 131294 | AL | ALB |
2 | Algeria | 58852 | DZ | DZA |
3 | Andorra | 30 | AD | AND |
4 | Angola | 828 | AO | AGO |
# Check for missing ISO3 codes: 21 missing values
display(immigration['Sending_iso3'].isna().sum())
# These countries have longer and shorter version of their names.
immigration[immigration['Sending_iso3'].isna()]['Sending_Country'].unique()
21
array(['Bolivia', 'Cape Verde', 'Congo, the Democratic Republic of the', 'Vatican City', 'Iran', 'North Korea', 'Korea, Republic of', 'Macedonia', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Non-Citizens', 'Palestine, State of', 'Stateless', 'Swaziland', 'Syria', 'Taiwan', 'Tanzania', 'Turkey', 'United States', 'Unkown', 'Venezuela'], dtype=object)
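One way to spot unmatched names right at merge time is the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking rows found only in the left table. The sketch below uses small made-up frames; `flows` and `codes` are hypothetical stand-ins for `immigration` and `country_codes3`.

```python
import pandas as pd

flows = pd.DataFrame({'Sending_Country': ['Albania', 'Turkey', 'Syria'],
                      'Flow': [10, 20, 30]})
codes = pd.DataFrame({'name': ['Albania', 'Turkiye', 'Syrian Arab Republic'],
                      'iso3': ['ALB', 'TUR', 'SYR']})

merged = pd.merge(flows, codes, left_on='Sending_Country', right_on='name',
                  how='left', indicator=True)

# Rows tagged 'left_only' found no match in the code table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Sending_Country'].tolist()
print(unmatched)
```

The resulting list plays the same role as the array of names above and can feed directly into a manual lookup dictionary.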
# Manually supply the three-letter ISO codes for the missing countries
missing_iso3 = {
'Bolivia': 'BOL',
'Congo, the Democratic Republic of the': 'COD',
'Cape Verde': 'CPV',
'Micronesia, Federated States of': 'FSM',
'Syria': 'SYR',
'Iran': 'IRN',
'North Korea': 'PRK',
'Korea, Republic of': 'KOR',
'Moldova, Republic of': 'MDA',
'Macedonia': 'MKD',
'Palestine, State of': 'PSE',
'Non-Citizens': 'XXX',
'Stateless': 'XXX',
'Swaziland': 'SWZ',
'Turkey': 'TUR',
'Taiwan': 'TWN',
'Tanzania': 'TZA',
'Unkown': 'XXX',
'United States': 'USA',
'Vatican City': 'VAT',
'Venezuela': 'VEN'
}
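# Update the ISO3 column in the immigration dataset using the dictionary
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
immigration['Sending_iso3'] = immigration['Sending_iso3'].fillna(immigration['Sending_Country'].map(missing_iso3))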
display(immigration['Sending_iso3'].isna().sum())
0
# Merge WB GDP/PC with Immigration Dataset and bring GDP/PC information
immigration = pd.merge(immigration, gdppc[['Country Code', '2022 [YR2022]']], left_on='Sending_iso3', right_on='Country Code', how='left')
immigration.drop(columns=['Country Code'], inplace=True)
# Rename Sending_gdppc
immigration.rename(columns={'2022 [YR2022]': 'Sending_gdppc'}, inplace=True)
immigration
| | Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc |
---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | .. |
1 | Albania | 131294 | AL | ALB | 15491.961 |
2 | Algeria | 58852 | DZ | DZA | 11198.23348 |
3 | Andorra | 30 | AD | AND | .. |
4 | Angola | 828 | AO | AGO | 5906.115677 |
... | ... | ... | ... | ... | ... |
169 | Viet Nam | 22106 | VN | VNM | 11396.5313 |
170 | Western Sahara | 0 | EH | ESH | NaN |
171 | Yemen | 8918 | YE | YEM | .. |
172 | Zambia | 698 | ZM | ZMB | 3365.87378 |
173 | Zimbabwe | 1610 | ZW | ZWE | 2207.957033 |
174 rows × 5 columns
Check the data type and unique values for the Sending_gdppc column.
In the dataframe above, Andorra has a value of "..", yet it is not included in missing values. It is better to check the data type of this GDP/PC column, and the unique values.
# Data type of Sending_gdppc = object
print(immigration['Sending_gdppc'].dtype)
# Turn datatype of GDP/PC column into numeric
immigration['Sending_gdppc'] = pd.to_numeric(immigration['Sending_gdppc'], errors='coerce')
print(immigration.dtypes)
object
Sending_Country     object
Flow                 int64
Sending_iso2        object
Sending_iso3        object
Sending_gdppc      float64
dtype: object
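To see why the `..` placeholders only show up as missing after the conversion, here is a small sketch on a made-up Series: before `pd.to_numeric(..., errors='coerce')` the placeholder is an ordinary string, and afterwards it becomes `NaN`.

```python
import pandas as pd

# Made-up values imitating the raw GDP/PC column
raw = pd.Series(['..', '15491.961', None, '5906.115677'])
print(raw.isna().sum())      # only the real None counts as missing

numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.isna().sum())  # the '..' placeholder is now NaN as well
print(numeric.dtype)
```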
# How many NaN values under GDP/PC column
missing_gdppc = immigration[immigration['Sending_gdppc'].isna()]['Sending_Country'].unique()
display(len(missing_gdppc))
display("Countries with NaN GDP/PC: ", missing_gdppc)
24
'Countries with NaN GDP/PC: '
array(['Afghanistan', 'Andorra', 'Bhutan', 'Cuba', 'Eritrea', 'Vatican City', 'Isle of Man', 'Jersey', 'North Korea', 'Lebanon', 'Liechtenstein', 'Monaco', 'Non-Citizens', 'Palau', 'San Marino', 'South Sudan', 'Stateless', 'Syria', 'Taiwan', 'Tonga', 'Unkown', 'Venezuela', 'Western Sahara', 'Yemen'], dtype=object)
# How many migrants arrived from these countries with missing GDP/PC column?
immigration[(immigration['Sending_Country'].isin(missing_gdppc))].groupby('Sending_Country')['Flow'].sum().sort_values(ascending=False)
Sending_Country
Venezuela         340468
Syria             138876
Cuba               82456
Afghanistan        53037
Unkown             35760
Eritrea            14295
Stateless          11142
Lebanon             9025
Yemen               8918
Non-Citizens        4090
Taiwan              3608
South Sudan          761
San Marino           118
Bhutan                98
Liechtenstein         88
North Korea           78
Andorra               30
Monaco                 6
Palau                  4
Vatican City           2
Jersey                 0
Tonga                  0
Isle of Man            0
Western Sahara         0
Name: Flow, dtype: int64
Important Note on Imputation: As seen above, some of the countries with missing GDP/PC information did not send any migrants to the EU, or sent only a handful. Some of them are located in Europe and sent only a handful of migrants (Vatican City, Monaco, Andorra, etc.). For the purposes of this project, GDP/PC values for the sending countries outside Europe will be taken from other reliable sources (such as the IMF and the World Bank) and kept in the dataset, whereas the countries in Europe will be dropped.
Besides these countries, the migrants recorded as Unknown, Stateless, and Non-Citizens also have missing GDP/PC information, because they are not associated with a state. For these observations we will simply replace the missing GDP/PC with the average GDP/PC of the dataset.
# Define GDP per capita values for missing countries
missing_gdp = {
'Venezuela': 3420,
'Syria': 752,
'Cuba': 7449,
'Afghanistan': 372,
'Eritrea': 1921,
'Lebanon': 4467,
'Yemen': 1017,
'Taiwan': 32716,
'South Sudan': 340,
'Western Sahara': 2500,
'Tonga': 4681,
'Palau': 14565,
'North Korea': 1217,
'Bhutan': 3704
}
# Fill missing GDP per capita values based on Sending_Country
immigration['Sending_gdppc'] = immigration.apply(
lambda row: missing_gdp[row['Sending_Country']] if pd.isna(row['Sending_gdppc']) and row['Sending_Country'] in missing_gdp else row['Sending_gdppc'],
axis=1
)
# Check the remaining missing countries
display(immigration[immigration['Sending_gdppc'].isna()]['Sending_Country'].unique())
array(['Andorra', 'Vatican City', 'Isle of Man', 'Jersey', 'Liechtenstein', 'Monaco', 'Non-Citizens', 'San Marino', 'Stateless', 'Unkown'], dtype=object)
# For the Unkown, Stateless, Recognized Non-Citizens observations, fill the GDP/PC with the average of dataset
average_gdppc = immigration['Sending_gdppc'].mean()
# 3 observations
countries_to_fill = ['Unkown', 'Stateless', 'Non-Citizens']
# Fill missing GDP per capita for specified countries with the dataset's average GDP per capita
immigration['Sending_gdppc'] = immigration.apply(
lambda row: average_gdppc if pd.isna(row['Sending_gdppc']) and row['Sending_Country'] in countries_to_fill else row['Sending_gdppc'],
axis=1
)
# Check the remaining missing countries
display(immigration[immigration['Sending_gdppc'].isna()]['Sending_Country'].unique())
array(['Andorra', 'Vatican City', 'Isle of Man', 'Jersey', 'Liechtenstein', 'Monaco', 'San Marino'], dtype=object)
# Drop NA values
immigration = immigration.dropna(subset=['Sending_gdppc'])
# Verify that no missing GDP per capita values remain
immigration['Sending_gdppc'].isna().sum()
0
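The row-wise `apply` calls above can also be written with `Series.map` and `fillna`, which is usually shorter and faster on larger frames. This is an alternative sketch on toy data, not the code used to produce the outputs here; the GDP values in it are placeholders.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Sending_Country': ['A', 'B', 'C', 'Stateless'],
    'Sending_gdppc': [1000.0, np.nan, np.nan, np.nan],
})
manual_values = {'B': 3000.0}      # placeholder external estimates
use_average_for = ['Stateless']    # entries without a state

# Step 1: fill from the manual dictionary where a value is missing
toy['Sending_gdppc'] = toy['Sending_gdppc'].fillna(toy['Sending_Country'].map(manual_values))

# Step 2: fill the stateless categories with the mean of the known values
mean_gdppc = toy['Sending_gdppc'].mean()
mask = toy['Sending_gdppc'].isna() & toy['Sending_Country'].isin(use_average_for)
toy.loc[mask, 'Sending_gdppc'] = mean_gdppc

# Step 3: drop whatever is still missing
toy = toy.dropna(subset=['Sending_gdppc'])
print(toy)
```

The three steps follow the same order as the cells above: manual values first, then the dataset average for the stateless categories, then dropping the remaining rows.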
Scatter Plot of Migration Flows and GDP/PC: This project hypothesizes that economic conditions, measured by GDP/PC, are a consistent and stable long-term cause of migration outflows to Europe.
An early scatter plot with a regression line for the whole dataset, shown below, has a small negative slope, which indicates that GDP/PC and migration flows are negatively correlated. However, I argue that the effect would be much stronger for African countries. Hence, if a dummy variable for the continent were included, or if separate models were fitted for different country groups (such as African countries, Gulf countries, and developed countries), economic conditions would be expected to have a larger negative coefficient, and therefore a stronger negative effect on migration flows, especially from underdeveloped and developing countries located at the periphery of Europe.
Individual scatter plots for each explanatory variable will be provided once all of the explanatory variables are merged into the main dataset.
import numpy as np
import matplotlib.pyplot as plt
# Scatter plot with dot size proportional to GDP/PC
plt.figure(figsize=(10, 6))
plt.scatter(
immigration['Sending_gdppc'],
immigration['Flow'],
s=immigration['Sending_gdppc'] / 100, # Size of the dot
alpha=0.1,
color='blue'
)
# Calculate the regression line
x = immigration['Sending_gdppc']
y = immigration['Flow']
coefficients = np.polyfit(x, y, 1)
regression_line = np.poly1d(coefficients)
# Plot the regression line
plt.plot(
x,
regression_line(x),
color='red',
linewidth=2,
label=f'Regression Line: y = {coefficients[0]:.2f}x + {coefficients[1]:.2f}'
)
# Add labels and title
plt.xlabel('Sending GDP per Capita')
plt.ylabel('Migration Flow')
plt.title('Scatter Plot of Sending GDP/PC vs Migration Flow (Dot Size = GDP/PC)')
plt.grid(True)
plt.legend()
# Show the plot
plt.show()
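The idea of fitting separate lines for different country groups can be sketched as follows. The `region` column and all numbers here are hypothetical (the dataset above has no such column yet); the point is only to show the mechanics of one `np.polyfit` call per group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups with different GDP/PC-to-flow relationships
toy = pd.DataFrame({
    'region': ['Africa'] * 3 + ['Other'] * 3,
    'Sending_gdppc': [1000.0, 2000.0, 3000.0, 10000.0, 20000.0, 30000.0],
    'Flow': [9000.0, 7000.0, 5000.0, 3000.0, 2900.0, 2800.0],
})

# Fit one straight line per group and keep its slope
slopes = {}
for region, group in toy.groupby('region'):
    slope, intercept = np.polyfit(group['Sending_gdppc'], group['Flow'], 1)
    slopes[region] = slope

print(slopes)
```

Comparing the per-group slopes with the pooled slope is one simple way to see whether a single regression line hides differences between country groups.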
Merge with Economic Indicator2: Multi-dimensional Poverty MetricΒΆ
The second economic indicator is the World Bank Multidimensional Poverty Measure (MPM), which shows the percentage of individuals who suffer from multidimensional poverty (a measure that combines monetary, educational attainment, electricity, sanitation, and water-related poverty measures).
There are a total of 46 missing observations, which is approximately 28% of the dataset. The total number of migrants coming from these countries is 1.2 million, which is a large proportion of total migration flows.
Among the countries without MPM information, Venezuela tops the list with 340K migrants, followed by India with 224K and Syria with 139K.
Removing the observations with missing MPM (and missing other indicator variables) would shrink the dataset significantly, which would harm its generalization power. Yet simply filling in the overall average Multidimensional Poverty Measure would not be appropriate: highly developed countries like Canada or New Zealand have missing Sending_mpm, and assigning them the average Sending_mpm value would not reflect reality. Therefore, this project will use a K-NN-like approach (group-mean imputation) to fill these missing values. Because GDP/PC is a good proxy for poverty in a country, we will use the World Bank income classification to fill the missing poverty measures. The World Bank classifies countries into four income groups; its thresholds are defined on GNI per capita, and this project applies them to GDP/PC as an approximation: Low Income (less than $1,085), Lower-middle Income ($1,086 - $4,255), Upper-middle Income ($4,256 - $13,205), and High Income (higher than $13,205). To fill a missing value, we determine the income group of the country and then assign the average Sending_mpm of the countries in the same income group.
# Merge WB MPM with Immigration Dataset and bring Poverty information
immigration = pd.merge(immigration, mpm[['Country code', 'Multidimensional poverty headcount ratio (%)']], left_on='Sending_iso3', right_on='Country code', how='left')
immigration.drop(columns=['Country code'], inplace=True)
# Rename Sending_mpm
immigration.rename(columns={'Multidimensional poverty headcount ratio (%)': 'Sending_mpm'}, inplace=True)
immigration.head()
| | Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm |
---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | NaN |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | NaN |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | NaN |
# How many missing: 46 observations are missing.
missing_mpm = immigration[immigration['Sending_mpm'].isna()]
display(len(missing_mpm['Sending_Country']))
# Which countries have missing Multidimensional poverty information?
display(missing_mpm['Sending_Country'].unique())
# How many migrants came from these countries with missing MPM: 1,239,427.
display(immigration[immigration['Sending_Country'].isin(missing_mpm['Sending_Country'])]['Flow'].sum())
46
array(['Afghanistan', 'Algeria', 'Antigua and Barbuda', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Barbados', 'Belize', 'Bosnia and Herzegovina', 'Brunei Darussalam', 'Cambodia', 'Canada', 'Central African Republic', 'China', 'Cuba', 'Dominica', 'Equatorial Guinea', 'Eritrea', 'Grenada', 'Guyana', 'India', 'Jamaica', 'North Korea', 'Kuwait', 'Libya', 'New Zealand', 'Non-Citizens', 'Oman', 'Palau', 'Qatar', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Vincent and the Grenadines', 'Saudi Arabia', 'Singapore', 'Somalia', 'Stateless', 'Suriname', 'Syria', 'Trinidad and Tobago', 'Turkmenistan', 'United Arab Emirates', 'Unkown', 'Uzbekistan', 'Venezuela', 'Western Sahara'], dtype=object)
1239427
# How many migrants arrived from the countries with missing MPM:
immigration[immigration['Sending_mpm'].isna()].groupby('Sending_Country')['Flow'].sum().sort_values(ascending=False)[:20]
Sending_Country
Venezuela                 340468
India                     224303
Syria                     138876
China                     135900
Cuba                       82456
Bosnia and Herzegovina     63130
Algeria                    58852
Afghanistan                53037
Unkown                     35760
Uzbekistan                 17739
Somalia                    14831
Eritrea                    14295
Stateless                  11142
Equatorial Guinea          10398
Canada                      9145
Azerbaijan                  7853
Suriname                    5812
Non-Citizens                4090
Libya                       2338
Turkmenistan                1849
Name: Flow, dtype: int64
# Create Income_group variable
bins = [0, 1085, 4255, 13205, float('inf')]
labels = ['Low Income', 'Lower-middle Income', 'Upper-middle Income', 'High Income']
immigration['Sending_incomegroup'] = pd.cut(immigration['Sending_gdppc'], bins=bins, labels=labels, right=False)
immigration.head(3)
| | Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup |
---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.00000 | NaN | Low Income |
1 | Albania | 131294 | AL | ALB | 15491.96100 | 0.293161 | High Income |
2 | Algeria | 58852 | DZ | DZA | 11198.23348 | NaN | Upper-middle Income |
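Because `right=False` makes each bin closed on the left and open on the right, a value that sits exactly on a threshold falls into the higher group. A short sketch with made-up boundary values shows this:

```python
import pandas as pd

bins = [0, 1085, 4255, 13205, float('inf')]
labels = ['Low Income', 'Lower-middle Income', 'Upper-middle Income', 'High Income']

# Values chosen to sit just below or exactly on each threshold
values = pd.Series([1084.99, 1085, 4255, 13205])
groups = pd.cut(values, bins=bins, labels=labels, right=False)
print(groups.tolist())
```

With these bins a GDP/PC of exactly 1,085 is labeled Lower-middle Income, so the cut points differ by one unit from the ranges quoted in the text.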
# Fill missing Sending_mpm values
# Calculate the average mpm for each income group (observed=False makes the current pandas default explicit)
mpm_avg = immigration.groupby('Sending_incomegroup', observed=False)['Sending_mpm'].transform('mean')
# Fill missing Sending_mpm values with the average values based on Income_group
immigration['Sending_mpm'] = immigration['Sending_mpm'].fillna(mpm_avg)
immigration.head(3)
| | Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup |
---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.00000 | 68.499527 | Low Income |
1 | Albania | 131294 | AL | ALB | 15491.96100 | 0.293161 | High Income |
2 | Algeria | 58852 | DZ | DZA | 11198.23348 | 12.653047 | Upper-middle Income |
Merge with Conflict Variables: Fatalities in Battles, Riots, and Violence against CiviliansΒΆ
The ACLED Project (Armed Conflict Location and Event Data) provides various measures of armed conflict. Its datasets on battles, riots, and violence against civilians will be imported to measure the effect of conflict on migration flows. The datasets are event-based, so each country has multiple observations, and we will calculate the total number of fatalities in each country.
Since the ACLED datasets identify countries by numeric ISO codes, this project will use the UN country codes dataset, which includes these numeric codes.
As shown below, with a total of 13,414 fatalities, Ukraine tops the list of fatalities in battles because of the Russian invasion. Myanmar comes second, following the recent coup d'etat. Countries experiencing civil war follow these two. Similar tables are presented below for fatalities in riots and violence against civilians.
If a country has an NA value for any of these conflict variables, it means that no fatalities from battles, riots, or violence against civilians were recorded for that country. Hence we can simply fill the NA values with 0.
The literature shows that battles are a much more important determinant of migration outflows than riots and violence against civilians. To take this into account, a new casualties variable will be created by weighting the existing three variables (battle fatalities 70%, and riot and violence fatalities 15% each).
# Numeric iso3 codes
immigration = immigration.merge(country_codes3_[['Iso3', 'Code']], how='left', left_on='Sending_iso3', right_on='Iso3')
immigration = immigration.drop(columns=['Iso3'])
immigration.rename(columns={'Code': 'Sending_iso3_num'}, inplace=True)
immigration
| | Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
162 | Viet Nam | 22106 | VN | VNM | 11396.531300 | 1.166175 | Upper-middle Income | 704.0 |
163 | Western Sahara | 0 | EH | ESH | 2500.000000 | 46.103936 | Lower-middle Income | 732.0 |
164 | Yemen | 8918 | YE | YEM | 1017.000000 | 35.411829 | Low Income | 887.0 |
165 | Zambia | 698 | ZM | ZMB | 3365.873780 | 66.403395 | Lower-middle Income | 894.0 |
166 | Zimbabwe | 1610 | ZW | ZWE | 2207.957033 | 42.397930 | Lower-middle Income | 716.0 |
167 rows × 8 columns
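The numeric codes appear as floats (`4.0`, `8.0`), most likely because the left merge leaves `NaN` for rows without a match, and a plain integer column cannot hold `NaN`. If integer display is preferred, pandas' nullable `Int64` dtype is one option; the sketch below uses a made-up Series.

```python
import numpy as np
import pandas as pd

# Made-up numeric country codes with one missing entry
codes = pd.Series([4.0, 8.0, np.nan, 716.0], name='Sending_iso3_num')

# The nullable integer dtype keeps the missing value as <NA>
codes_int = codes.astype('Int64')
print(codes_int.tolist())
```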
# Countries with the most fatalities in battles.
battle1 = battle.groupby('country').agg({
'fatalities': 'sum',
'iso': 'first'
}).reset_index()
battle1.sort_values(by='fatalities', ascending=False)[:10]
| | country | fatalities | iso |
---|---|---|---|
95 | Ukraine | 13414 | 804 |
65 | Myanmar | 12901 | 104 |
69 | Nigeria | 5274 | 566 |
81 | Somalia | 4498 | 706 |
33 | Ethiopia | 3618 | 231 |
98 | Yemen | 3576 | 887 |
13 | Brazil | 3008 | 76 |
26 | Democratic Republic of Congo | 2919 | 180 |
86 | Syria | 2599 | 760 |
0 | Afghanistan | 2424 | 4 |
# Merge with battle1 df
immigration = immigration.merge(battle1, how='left', left_on='Sending_iso3_num', right_on='iso')
immigration = immigration.drop(columns=['country', 'iso'])
immigration = immigration.rename(columns={'fatalities': 'battle_fatalities'})
immigration
| | Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | battle_fatalities |
---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 2424.0 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | NaN |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 19.0 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 48.0 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
162 | Viet Nam | 22106 | VN | VNM | 11396.531300 | 1.166175 | Upper-middle Income | 704.0 | NaN |
163 | Western Sahara | 0 | EH | ESH | 2500.000000 | 46.103936 | Lower-middle Income | 732.0 | NaN |
164 | Yemen | 8918 | YE | YEM | 1017.000000 | 35.411829 | Low Income | 887.0 | 3576.0 |
165 | Zambia | 698 | ZM | ZMB | 3365.873780 | 66.403395 | Lower-middle Income | 894.0 | NaN |
166 | Zimbabwe | 1610 | ZW | ZWE | 2207.957033 | 42.397930 | Lower-middle Income | 716.0 | 0.0 |
167 rows × 9 columns
# Merge riot casualties
riot1 = riot.groupby('country').agg({'fatalities': 'sum','iso': 'first'}).reset_index()
print(riot1.sort_values(by='fatalities', ascending=False)[:10])
immigration = immigration.merge(riot1, how='left', left_on='Sending_iso3_num', right_on='iso')
immigration = immigration.drop(columns=['country', 'iso'])
immigration = immigration.rename(columns={'fatalities': 'riot_fatalities'})
immigration
                          country  fatalities  iso
65                           Iran         430  364
36   Democratic Republic of Congo         244  180
63                          India         214  356
75                          Kenya         211  404
106                       Nigeria         208  566
74                     Kazakhstan         185  398
136                  South Africa         169  710
10                     Bangladesh         151   50
64                      Indonesia         147  360
110                      Pakistan         142  586
| | Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | battle_fatalities | riot_fatalities |
---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 2424.0 | 24.0 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | NaN | 0.0 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 19.0 | 0.0 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 48.0 | 34.0 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
162 | Viet Nam | 22106 | VN | VNM | 11396.531300 | 1.166175 | Upper-middle Income | 704.0 | NaN | NaN |
163 | Western Sahara | 0 | EH | ESH | 2500.000000 | 46.103936 | Lower-middle Income | 732.0 | NaN | NaN |
164 | Yemen | 8918 | YE | YEM | 1017.000000 | 35.411829 | Low Income | 887.0 | 3576.0 | 0.0 |
165 | Zambia | 698 | ZM | ZMB | 3365.873780 | 66.403395 | Lower-middle Income | 894.0 | NaN | 8.0 |
166 | Zimbabwe | 1610 | ZW | ZWE | 2207.957033 | 42.397930 | Lower-middle Income | 716.0 | 0.0 | 27.0 |
167 rows × 10 columns
# Merge Violence casualties
violence1 = violence.groupby('country').agg({'fatalities': 'sum','iso': 'first'}).reset_index()
print(violence1.sort_values(by='fatalities', ascending=False)[:10])
immigration = immigration.merge(violence1, how='left', left_on='Sending_iso3_num', right_on='iso')
immigration = immigration.drop(columns=['country', 'iso'])
immigration = immigration.rename(columns={'fatalities': 'violence_fatalities'})
immigration
                          country  fatalities  iso
80                         Mexico        6561  484
16                         Brazil        4034   76
90                        Nigeria        3701  566
31   Democratic Republic of Congo        3046  180
39                       Ethiopia        2614  231
84                        Myanmar        2188  104
76                           Mali        2151  466
26                       Colombia        1680  170
134                       Ukraine        1348  804
17                   Burkina Faso        1177  854
| | Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | battle_fatalities | riot_fatalities | violence_fatalities |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 2424.0 | 24.0 | 741.0 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | NaN | 0.0 | 0.0 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 19.0 | 0.0 | 8.0 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 48.0 | 34.0 | 22.0 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
162 | Viet Nam | 22106 | VN | VNM | 11396.531300 | 1.166175 | Upper-middle Income | 704.0 | NaN | NaN | NaN |
163 | Western Sahara | 0 | EH | ESH | 2500.000000 | 46.103936 | Lower-middle Income | 732.0 | NaN | NaN | NaN |
164 | Yemen | 8918 | YE | YEM | 1017.000000 | 35.411829 | Low Income | 887.0 | 3576.0 | 0.0 | 294.0 |
165 | Zambia | 698 | ZM | ZMB | 3365.873780 | 66.403395 | Lower-middle Income | 894.0 | NaN | 8.0 | 3.0 |
166 | Zimbabwe | 1610 | ZW | ZWE | 2207.957033 | 42.397930 | Lower-middle Income | 716.0 | 0.0 | 27.0 | 6.0 |
167 rows × 11 columns
# Fill all NAs with 0
immigration[['battle_fatalities', 'riot_fatalities', 'violence_fatalities']] = immigration[['battle_fatalities', 'riot_fatalities', 'violence_fatalities']].fillna(0)
immigration
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | battle_fatalities | riot_fatalities | violence_fatalities | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 2424.0 | 24.0 | 741.0 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | 0.0 | 0.0 | 0.0 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 19.0 | 0.0 | 8.0 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 48.0 | 34.0 | 22.0 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
162 | Viet Nam | 22106 | VN | VNM | 11396.531300 | 1.166175 | Upper-middle Income | 704.0 | 0.0 | 0.0 | 0.0 |
163 | Western Sahara | 0 | EH | ESH | 2500.000000 | 46.103936 | Lower-middle Income | 732.0 | 0.0 | 0.0 | 0.0 |
164 | Yemen | 8918 | YE | YEM | 1017.000000 | 35.411829 | Low Income | 887.0 | 3576.0 | 0.0 | 294.0 |
165 | Zambia | 698 | ZM | ZMB | 3365.873780 | 66.403395 | Lower-middle Income | 894.0 | 0.0 | 8.0 | 3.0 |
166 | Zimbabwe | 1610 | ZW | ZWE | 2207.957033 | 42.397930 | Lower-middle Income | 716.0 | 0.0 | 27.0 | 6.0 |
167 rows × 11 columns
# Create conflict_casualties
immigration['Sending_conflict_casualties'] = (immigration['battle_fatalities'] * 0.70) + (immigration['riot_fatalities']*0.15) + (immigration['violence_fatalities']*0.15)
immigration = immigration.drop(columns=['battle_fatalities', 'riot_fatalities', 'violence_fatalities'])
immigration
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | Sending_conflict_casualties | |
---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 1811.55 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | 0.00 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 14.50 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 42.00 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
162 | Viet Nam | 22106 | VN | VNM | 11396.531300 | 1.166175 | Upper-middle Income | 704.0 | 0.00 |
163 | Western Sahara | 0 | EH | ESH | 2500.000000 | 46.103936 | Lower-middle Income | 732.0 | 0.00 |
164 | Yemen | 8918 | YE | YEM | 1017.000000 | 35.411829 | Low Income | 887.0 | 2547.30 |
165 | Zambia | 698 | ZM | ZMB | 3365.873780 | 66.403395 | Lower-middle Income | 894.0 | 1.65 |
166 | Zimbabwe | 1610 | ZW | ZWE | 2207.957033 | 42.397930 | Lower-middle Income | 716.0 | 4.95 |
167 rows × 9 columns
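As a quick check of the weighting, the composite value can be recomputed by hand for a single country. The sketch below uses a hypothetical helper, `conflict_index` (not part of the notebook), with the 0.70/0.15/0.15 weights from the cell above and the Afghanistan fatality counts from the preceding table.

```python
def conflict_index(battle, riot, violence, weights=(0.70, 0.15, 0.15)):
    """Weighted sum of the three fatality counts (hypothetical helper)."""
    w_battle, w_riot, w_violence = weights
    return w_battle * battle + w_riot * riot + w_violence * violence

# Afghanistan: 2424 battle, 24 riot, and 741 violence-against-civilians fatalities
afghanistan = conflict_index(2424.0, 24.0, 741.0)
print(round(afghanistan, 2))  # 1811.55, the value shown for Afghanistan in the table
```

Because battle fatalities carry 70% of the weight, countries with large battle counts dominate the resulting variable.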
Merge with Political Indicators: WB Governance Indicators
The World Bank (WB) Governance Indicators dataset contains six measures: "Political Stability and Absence of Violence", "Voice and Accountability", "Control of Corruption", "Regulatory Quality", "Government Effectiveness", and "Rule of Law".
The dataset is in long format: it has one column for the country and a "Series Name" column that holds (1) the estimate and (2) the percentile rank of each measure listed above. All 12 variables (6 estimates and 6 percentile ranks) therefore sit under "Series Name", giving 12 rows per country. To tidy the data, the pivot_table function is used to spread these 12 variables into columns.
Because the dataset includes both estimates and percentile ranks, a single governance quality variable is calculated from the estimates only, for the sake of simplicity.
The estimates range from approximately -2.5 to +2.5. To make every estimate positive, 3.5 is added to each one. The six shifted estimates are then summed into a new Sending_govern variable, which serves as a single governance indicator for each sending country.
As before, missing governance scores are filled with the average of the country's income group.
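The tidy-then-aggregate steps described above can be sketched on a toy frame. The country names and values below are illustrative only, and the ".." marker standing in for a missing value is an assumption about the raw export, not something confirmed by the notebook.

```python
import pandas as pd

# Toy long-format data mimicking the layout described above (illustrative values)
long_df = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Series Name": ["Rule of Law: Estimate", "Voice and Accountability: Estimate"] * 2,
    "2022 [YR2022]": ["-1.5", "0.5", "1.0", ".."],  # ".." stands in for a missing value
})

# Non-numeric markers become NaN
long_df["2022 [YR2022]"] = pd.to_numeric(long_df["2022 [YR2022]"], errors="coerce")

# Spread the series names into columns: one row per country
wide = long_df.pivot_table(index="Country Name", columns="Series Name",
                           values="2022 [YR2022]").reset_index()

estimate_cols = [c for c in wide.columns if c.endswith(": Estimate")]
# Shift by 3.5 so every estimate is positive, then sum across indicators.
# min_count keeps the total as NaN whenever any component is missing.
wide["Sending_govern"] = (wide[estimate_cols] + 3.5).sum(
    axis=1, min_count=len(estimate_cols)
)
print(wide)
```

Keeping the total as NaN when a component is missing mirrors what plain column addition does, and it leaves those countries to be handled by the income-group fill rather than silently receiving a lower score.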
# What kind of measures does WB Governance Indicators dataset have?
display(govern.head())
govern["Series Name"].unique()
Country Name | Country Code | Series Name | Series Code | 2022 [YR2022] | |
---|---|---|---|---|---|
0 | Afghanistan | AFG | Political Stability and Absence of Violence/Te... | PV.EST | -2.550801754 |
1 | Afghanistan | AFG | Voice and Accountability: Estimate | VA.EST | -1.751587272 |
2 | Korea, Dem. People's Rep. | PRK | Voice and Accountability: Percentile Rank | VA.PER.RNK | 0 |
3 | Afghanistan | AFG | Control of Corruption: Estimate | CC.EST | -1.183776498 |
4 | Korea, Dem. People's Rep. | PRK | Regulatory Quality: Percentile Rank | RQ.PER.RNK | 0 |
array(['Political Stability and Absence of Violence/Terrorism: Estimate', 'Voice and Accountability: Estimate', 'Voice and Accountability: Percentile Rank', 'Control of Corruption: Estimate', 'Regulatory Quality: Percentile Rank', 'Government Effectiveness: Estimate', 'Rule of Law: Percentile Rank', 'Control of Corruption: Percentile Rank', 'Regulatory Quality: Estimate', 'Government Effectiveness: Percentile Rank', 'Rule of Law: Estimate', 'Political Stability and Absence of Violence/Terrorism: Percentile Rank', nan], dtype=object)
# Tidy data using pivot_table function
govern['2022 [YR2022]'] = pd.to_numeric(govern['2022 [YR2022]'], errors='coerce')
govern_p = govern.pivot_table(index='Country Name', columns='Series Name', values='2022 [YR2022]').reset_index()
govern_p.head()
Series Name | Country Name | Control of Corruption: Estimate | Control of Corruption: Percentile Rank | Government Effectiveness: Estimate | Government Effectiveness: Percentile Rank | Political Stability and Absence of Violence/Terrorism: Estimate | Political Stability and Absence of Violence/Terrorism: Percentile Rank | Regulatory Quality: Estimate | Regulatory Quality: Percentile Rank | Rule of Law: Estimate | Rule of Law: Percentile Rank | Voice and Accountability: Estimate | Voice and Accountability: Percentile Rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | -1.183776 | 12.264151 | -1.879552 | 1.886792 | -2.550802 | 0.471698 | -1.271806 | 8.962264 | -1.658442 | 5.188679 | -1.751587 | 2.415459 |
1 | Albania | -0.407876 | 38.679245 | 0.065063 | 56.603775 | 0.114945 | 50.471699 | 0.159354 | 57.547169 | -0.165779 | 47.169811 | 0.139466 | 52.173912 |
2 | Algeria | -0.637930 | 28.301888 | -0.513090 | 32.547169 | -0.741772 | 19.339623 | -1.063573 | 14.150944 | -0.832473 | 22.641510 | -1.003874 | 21.739130 |
3 | American Samoa | 1.270204 | 88.679245 | 0.667918 | 74.528305 | 1.128859 | 91.037735 | 0.545900 | 70.754715 | 1.221118 | 86.320755 | 0.957648 | 77.294685 |
4 | Andorra | 1.270204 | 88.679245 | 1.495305 | 92.452827 | 1.587736 | 98.584908 | 1.398334 | 90.094337 | 1.485450 | 90.566040 | 1.102833 | 85.507248 |
# Add 3.5 to all estimates so that estimates will be above 0.
govern_p[['Control of Corruption: Estimate', 'Government Effectiveness: Estimate', 'Political Stability and Absence of Violence/Terrorism: Estimate', 'Regulatory Quality: Estimate', 'Rule of Law: Estimate', 'Voice and Accountability: Estimate']] += 3.5
# Calculate one governance variable by simply summing up all of the estimate variables
govern_p['Sending_govern'] = (govern_p['Control of Corruption: Estimate']) + (govern_p['Government Effectiveness: Estimate']) + (govern_p['Political Stability and Absence of Violence/Terrorism: Estimate']) + (govern_p['Regulatory Quality: Estimate']) + (govern_p['Rule of Law: Estimate']) + (govern_p['Voice and Accountability: Estimate'])
govern_p.head()
Series Name | Country Name | Control of Corruption: Estimate | Control of Corruption: Percentile Rank | Government Effectiveness: Estimate | Government Effectiveness: Percentile Rank | Political Stability and Absence of Violence/Terrorism: Estimate | Political Stability and Absence of Violence/Terrorism: Percentile Rank | Regulatory Quality: Estimate | Regulatory Quality: Percentile Rank | Rule of Law: Estimate | Rule of Law: Percentile Rank | Voice and Accountability: Estimate | Voice and Accountability: Percentile Rank | Sending_govern |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2.316224 | 12.264151 | 1.620448 | 1.886792 | 0.949198 | 0.471698 | 2.228194 | 8.962264 | 1.841558 | 5.188679 | 1.748413 | 2.415459 | 10.704034 |
1 | Albania | 3.092124 | 38.679245 | 3.565063 | 56.603775 | 3.614945 | 50.471699 | 3.659354 | 57.547169 | 3.334221 | 47.169811 | 3.639466 | 52.173912 | 20.905173 |
2 | Algeria | 2.862070 | 28.301888 | 2.986910 | 32.547169 | 2.758228 | 19.339623 | 2.436427 | 14.150944 | 2.667527 | 22.641510 | 2.496126 | 21.739130 | 16.207288 |
3 | American Samoa | 4.770204 | 88.679245 | 4.167918 | 74.528305 | 4.628859 | 91.037735 | 4.045900 | 70.754715 | 4.721118 | 86.320755 | 4.457648 | 77.294685 | 26.791646 |
4 | Andorra | 4.770204 | 88.679245 | 4.995305 | 92.452827 | 5.087736 | 98.584908 | 4.898334 | 90.094337 | 4.985450 | 90.566040 | 4.602833 | 85.507248 | 29.339862 |
# Merge govern_p with Immigration Dataset and bring governance indicator.
immigration = pd.merge(immigration, govern_p[['Country Name', 'Sending_govern']], left_on='Sending_Country', right_on='Country Name', how='left')
immigration.drop(columns=['Country Name'], inplace=True)
immigration.head(3)
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | Sending_conflict_casualties | Sending_govern | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.00000 | 68.499527 | Low Income | 4.0 | 1811.55 | 10.704034 |
1 | Albania | 131294 | AL | ALB | 15491.96100 | 0.293161 | High Income | 8.0 | 0.00 | 20.905173 |
2 | Algeria | 58852 | DZ | DZA | 11198.23348 | 12.653047 | Upper-middle Income | 12.0 | 14.50 | 16.207288 |
# Fill missing Sending_govern values
# Calculate the average governance score for each income group
govern_average = immigration.groupby('Sending_incomegroup')['Sending_govern'].transform('mean')
# Fill missing Sending_govern values with the income-group average (assign back instead of using inplace on a column)
immigration['Sending_govern'] = immigration['Sending_govern'].fillna(govern_average)
immigration.head()
/tmp/ipykernel_145/784547835.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. govern_average = immigration.groupby('Sending_incomegroup')['Sending_govern'].transform('mean')
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | Sending_conflict_casualties | Sending_govern | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 1811.55 | 10.704034 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | 0.00 | 20.905173 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 14.50 | 16.207288 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 42.00 | 16.285654 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | 0.00 | 23.496961 |
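The income-group fill used above can be illustrated on a small frame with made-up values. This is only a sketch of the pattern: `transform('mean')` returns one group average per row, aligned to the original index, so it can be passed straight to `fillna`. The result is assigned back to the column, since newer pandas versions discourage calling `fillna(..., inplace=True)` on a single selected column.

```python
import numpy as np
import pandas as pd

# Made-up scores; two of them are missing
toy = pd.DataFrame({
    "income_group": ["Low", "Low", "High", "High", "High"],
    "score": [10.0, np.nan, 20.0, 24.0, np.nan],
})

# Mean of the non-missing scores within each income group, one value per row
group_mean = toy.groupby("income_group")["score"].transform("mean")

# Fill each gap with the mean of its own group and assign the result back
toy["score"] = toy["score"].fillna(group_mean)
print(toy)
```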
Merge with Climate Indicator: Germanwatch Climate Risk Index
Although Germanwatch's Climate Risk Index is one of the most comprehensive climate risk datasets, many countries still have no climate risk value. Social, political, and economic factors do help explain climate risk (low state capacity, for example, increases the risk a country faces), but climate risk is far less affected by these factors. Even the Netherlands, a highly developed country, faces climate risk.
For this reason, the missing values are simply filled with the overall average score rather than an income-group average, so that a large share of the observations is not lost.
# Merge the climate dataset with the immigration dataset to bring in the Climate Risk Index score.
immigration = pd.merge(immigration, climate[['Country', 'CRI\rscore']], left_on='Sending_Country', right_on='Country', how='left')
immigration.drop(columns=['Country'], inplace=True)
# Rename the 'CRI\rscore' column (its name contains a carriage return) to 'Sending_cri'
immigration.rename(columns={'CRI\rscore': 'Sending_cri'}, inplace=True)
immigration
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | Sending_conflict_casualties | Sending_govern | Sending_cri | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 1811.55 | 10.704034 | NaN |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | 0.00 | 20.905173 | 108.00 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 14.50 | 16.207288 | 93.83 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 42.00 | 16.285654 | 76.00 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | 0.00 | 23.496961 | 125.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
162 | Viet Nam | 22106 | VN | VNM | 11396.531300 | 1.166175 | Upper-middle Income | 704.0 | 0.00 | 18.866787 | NaN |
163 | Western Sahara | 0 | EH | ESH | 2500.000000 | 46.103936 | Lower-middle Income | 732.0 | 0.00 | 16.658763 | NaN |
164 | Yemen | 8918 | YE | YEM | 1017.000000 | 35.411829 | Low Income | 887.0 | 2547.30 | 11.075789 | NaN |
165 | Zambia | 698 | ZM | ZMB | 3365.873780 | 66.403395 | Lower-middle Income | 894.0 | 1.65 | 18.777528 | 125.00 |
166 | Zimbabwe | 1610 | ZW | ZWE | 2207.957033 | 42.397930 | Lower-middle Income | 716.0 | 4.95 | 13.841291 | 114.50 |
167 rows × 11 columns
# Missing values
missing_cri = immigration[immigration['Sending_cri'].isna()]
display(missing_cri['Sending_Country'].unique())
display(missing_cri.groupby('Sending_Country')['Flow'].sum().sort_values(ascending=False))
display(missing_cri['Flow'].sum())
array(['Afghanistan', 'Bahamas', 'Congo', 'Congo, the Democratic Republic of the', 'Cuba', 'Equatorial Guinea', 'Gambia', 'Iran', 'North Korea', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Macedonia', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Nauru', 'Non-Citizens', 'Palau', 'Palestine, State of', 'Russian Federation', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Vincent and the Grenadines', 'Sao Tome and Principe', 'Somalia', 'Stateless', 'Swaziland', 'Syria', 'Taiwan', 'Timor-Leste', 'Turkmenistan', 'Unkown', 'Viet Nam', 'Western Sahara', 'Yemen'], dtype=object)
Sending_Country | Flow |
---|---|
Russian Federation | 196630 |
Syria | 138876 |
Cuba | 82456 |
Afghanistan | 53037 |
Iran | 43182 |
Moldova, Republic of | 40375 |
Unkown | 35760 |
Macedonia | 28024 |
Gambia | 22640 |
Viet Nam | 22106 |
Somalia | 14831 |
Stateless | 11142 |
Equatorial Guinea | 10398 |
Kyrgyzstan | 9474 |
Yemen | 8918 |
Congo, the Democratic Republic of the | 6594 |
Non-Citizens | 4090 |
Taiwan | 3608 |
Palestine, State of | 2424 |
Turkmenistan | 1849 |
Congo | 1426 |
Lao People's Democratic Republic | 328 |
Saint Kitts and Nevis | 108 |
North Korea | 78 |
Sao Tome and Principe | 68 |
Saint Lucia | 62 |
Swaziland | 52 |
Timor-Leste | 42 |
Bahamas | 34 |
Saint Vincent and the Grenadines | 30 |
Palau | 4 |
Nauru | 0 |
Western Sahara | 0 |
Micronesia, Federated States of | 0 |
Name: Flow, dtype: int64
738646
# Replace the missing values in the Sending_cri column with the column average
# The average of the 'Sending_cri' column
average_cri = immigration['Sending_cri'].mean()
# Fill missing values in the 'Sending_cri' column with the calculated average (assign back instead of using inplace on a column)
immigration['Sending_cri'] = immigration['Sending_cri'].fillna(average_cri)
immigration
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | Sending_conflict_casualties | Sending_govern | Sending_cri | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 1811.55 | 10.704034 | 82.451203 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | 0.00 | 20.905173 | 108.000000 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 14.50 | 16.207288 | 93.830000 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 42.00 | 16.285654 | 76.000000 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | 0.00 | 23.496961 | 125.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
162 | Viet Nam | 22106 | VN | VNM | 11396.531300 | 1.166175 | Upper-middle Income | 704.0 | 0.00 | 18.866787 | 82.451203 |
163 | Western Sahara | 0 | EH | ESH | 2500.000000 | 46.103936 | Lower-middle Income | 732.0 | 0.00 | 16.658763 | 82.451203 |
164 | Yemen | 8918 | YE | YEM | 1017.000000 | 35.411829 | Low Income | 887.0 | 2547.30 | 11.075789 | 82.451203 |
165 | Zambia | 698 | ZM | ZMB | 3365.873780 | 66.403395 | Lower-middle Income | 894.0 | 1.65 | 18.777528 | 125.000000 |
166 | Zimbabwe | 1610 | ZW | ZWE | 2207.957033 | 42.397930 | Lower-middle Income | 716.0 | 4.95 | 13.841291 | 114.500000 |
167 rows × 11 columns
Population Indicator
The population of a country is a natural determinant of its total outflow of migrants. Therefore, this project imports population data from the CIA.
Some countries have missing population values after the merge, but this is simply due to mismatched country names between the two datasets (such as "Syria" vs. "Syrian Arab Republic"). These NaN values are filled manually using the CIA population dataset.
For observations that do not correspond to a country (Stateless, Unkown, and Recognized Non-Citizens), the average population is used.
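Name mismatches of this kind can be listed automatically rather than spotted by eye. The sketch below uses made-up miniature frames (the mismatched spellings on the right are hypothetical) and the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column marking rows found only in the left frame.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
flows = pd.DataFrame({"Sending_Country": ["Albania", "Syria", "Bolivia"]})
pop = pd.DataFrame({
    "name": ["Albania", "Syrian Arab Republic", "Bolivia (Plurinational State of)"],
    "value": ["3,095,344", "23,865,423", "12,311,974"],
})

# indicator=True records whether each row matched ('both') or not ('left_only')
merged = flows.merge(pop, how="left", left_on="Sending_Country",
                     right_on="name", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "Sending_Country"].tolist()
print(unmatched)  # ['Syria', 'Bolivia']
```

When both datasets carry ISO country codes, joining on the code instead of the name would usually avoid the problem altogether; whether the population file includes such codes is not shown here.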
# Merge the population dataset with the immigration dataset to bring in the population values.
#immigration = immigration.drop(columns=['Sending_pop', 'value'])
immigration = pd.merge(immigration, population[['name', 'value']], left_on='Sending_Country', right_on='name', how='left')
immigration.drop(columns=['name'], inplace=True)
immigration.head()
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | Sending_conflict_casualties | Sending_govern | Sending_cri | value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 1811.55 | 10.704034 | 82.451203 | 38,346,720 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | 0.00 | 20.905173 | 108.000000 | 3,095,344 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 14.50 | 16.207288 | 93.830000 | 44,178,884 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 42.00 | 16.285654 | 76.000000 | 34,795,287 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | 0.00 | 23.496961 | 125.000000 | 100,335 |
# Rename 'value' Column with 'Sending_pop'
immigration.rename(columns={'value': 'Sending_pop'}, inplace=True)
immigration.head()
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | Sending_conflict_casualties | Sending_govern | Sending_cri | Sending_pop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 53037 | AF | AFG | 372.000000 | 68.499527 | Low Income | 4.0 | 1811.55 | 10.704034 | 82.451203 | 38,346,720 |
1 | Albania | 131294 | AL | ALB | 15491.961000 | 0.293161 | High Income | 8.0 | 0.00 | 20.905173 | 108.000000 | 3,095,344 |
2 | Algeria | 58852 | DZ | DZA | 11198.233480 | 12.653047 | Upper-middle Income | 12.0 | 14.50 | 16.207288 | 93.830000 | 44,178,884 |
3 | Angola | 828 | AO | AGO | 5906.115677 | 47.203606 | Upper-middle Income | 24.0 | 42.00 | 16.285654 | 76.000000 | 34,795,287 |
4 | Antigua and Barbuda | 42 | AG | ATG | 22321.870020 | 2.721764 | High Income | 28.0 | 0.00 | 23.496961 | 125.000000 | 100,335 |
# 10 countries with missing population information. This is because of name mismatch.
print(immigration['Sending_pop'].isna().sum())
immigration[immigration['Sending_pop'].isna()]
10
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | Sending_conflict_casualties | Sending_govern | Sending_cri | Sending_pop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
17 | Bolivia | 36624 | BO | BOL | 8244.235658 | 4.539775 | Upper-middle Income | 68.0 | 2.20 | 16.553992 | 63.500000 | NaN |
38 | Côte d'Ivoire | 8292 | CI | CIV | 5537.369758 | 37.273455 | Upper-middle Income | 384.0 | 21.05 | 18.866787 | 89.500000 | NaN |
63 | Iran | 43182 | IR | IRN | 15461.079340 | 1.027940 | High Income | 364.0 | 136.15 | 22.317889 | 82.451203 | NaN |
72 | North Korea | 78 | KP | PRK | 1217.000000 | 46.103936 | Lower-middle Income | 408.0 | 5.10 | 16.658763 | 82.451203 | NaN |
81 | Macedonia | 28024 | MK | MKD | 17128.642860 | 3.205135 | High Income | 807.0 | 0.00 | 22.317889 | 82.451203 | NaN |
105 | Non-Citizens | 4090 | RNC | XXX | 15549.957897 | 2.721764 | High Income | NaN | 0.00 | 22.317889 | 82.451203 | NaN |
140 | Syria | 138876 | SY | SYR | 752.000000 | 68.499527 | Low Income | 760.0 | 1945.75 | 11.075789 | 82.451203 | NaN |
141 | Taiwan | 3608 | TW | TWN | 32716.000000 | 0.061490 | High Income | 158.0 | 0.00 | 22.317889 | 82.451203 | NaN |
143 | Tanzania | 1536 | TZ | TZA | 2623.861572 | 54.589677 | Lower-middle Income | 834.0 | 9.15 | 18.300585 | 69.830000 | NaN |
161 | Venezuela | 340468 | VE | VEN | 3420.000000 | 46.103936 | Lower-middle Income | 862.0 | 445.85 | 16.658763 | 104.170000 | NaN |
# Fill the missing population information manually
missing_pop = {
    'Bolivia': 12311974,
    "Côte d'Ivoire": 29981758,
    'Iran': 88386937,
    'North Korea': 26298666,
    'Macedonia': 2135622,
    'Syria': 23865423,
    'Taiwan': 23595274,
    'Tanzania': 67462121,
    'Venezuela': 31250306
}
# Fill missing population values based on Sending_Country
immigration['Sending_pop'] = immigration.apply(
    lambda row: missing_pop[row['Sending_Country']] if pd.isna(row['Sending_pop']) and row['Sending_Country'] in missing_pop else row['Sending_pop'],
    axis=1
)
# Check the remaining missing countries
print(immigration[immigration['Sending_pop'].isna()])
Sending_Country | Flow | Sending_iso2 | Sending_iso3 | Sending_gdppc | Sending_mpm | Sending_incomegroup | Sending_iso3_num | Sending_conflict_casualties | Sending_govern | Sending_cri | Sending_pop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
105 | Non-Citizens | 4090 | RNC | XXX | 15549.957897 | 2.721764 | High Income | NaN | 0.0 | 22.317889 | 82.451203 | NaN |
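# Calculate the average of Sending_pop
# The merged values are strings with thousands separators (e.g. '38,346,720'); remove the commas first,
# otherwise pd.to_numeric(..., errors='coerce') would turn them into NaN
immigration['Sending_pop'] = pd.to_numeric(immigration['Sending_pop'].astype(str).str.replace(',', '', regex=False), errors='coerce')
average_pop = immigration['Sending_pop'].mean()
# Fill missing Sending_pop values with the average
immigration['Sending_pop'] = immigration['Sending_pop'].fillna(average_pop)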
# Missing values
print(immigration['Sending_pop'].isna().sum())
0
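One detail is worth checking at this step: the population values shown in the tables above carry thousands separators (for example "38,346,720"). The sketch below, on a made-up series, shows how `pd.to_numeric` with `errors='coerce'` treats such strings and how stripping the commas first preserves them.

```python
import pandas as pd

# Two comma-formatted strings, one plain integer, one genuinely missing value
raw = pd.Series(["38,346,720", "3,095,344", 12311974, None], dtype="object")

# Direct coercion: the comma-formatted strings cannot be parsed and become NaN
naive = pd.to_numeric(raw, errors="coerce")

# Removing the separators first keeps the values; only the real gap stays NaN
cleaned = pd.to_numeric(
    raw.astype(str).str.replace(",", "", regex=False), errors="coerce"
)

print(naive.isna().sum(), cleaned.isna().sum())
```

A zero count of missing values after the fill therefore does not by itself confirm that the original figures survived the conversion; comparing a few rows before and after is a cheap safeguard.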
Statistical Modelling
As indicated above, this research uses a multiple linear regression model to predict migration flows to Europe from economic, political, conflict, climate-related, and population variables.
Furthermore, a K-NN regressor model will be built to predict the migration flows, and its performance will then be evaluated.
First, to examine the association between the outcome variable and each explanatory variable, and the correlations among the variables themselves, we present individual scatterplots and a correlation heatmap.
As the scatterplots show, conflict casualties has the steepest regression line, with a p-value below the significance threshold.
In the correlation heatmap, conflict fatalities show a stronger positive correlation with migration flow than the other variables do.
Moreover, in line with conventional wisdom, GDP per capita is negatively correlated with the poverty measure and positively correlated with the governance indicator.
Consequently, it is reasonable to hypothesize that an increase in fatalities resulting from civil and armed conflict leads to an increase in migration outflows.
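The slope, intercept, and r-value reported for each scatterplot come from a simple one-predictor regression. As a rough illustration with made-up numbers (not the project data), the sketch below reproduces these quantities with NumPy alone. The r-value is the same Pearson correlation that a correlation heatmap reports for that pair of variables.

```python
import numpy as np

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares slope: covariance of x and y divided by the variance of x
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
# The fitted line passes through the point of means
intercept = y.mean() - slope * x.mean()
# Pearson correlation coefficient
r_value = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r_value:.3f}")
```

Note that the slope depends on the units of the predictor, so slopes are not directly comparable across variables measured on different scales, whereas r-values are.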
# Individual scatter plots
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
observations = ['Sending_gdppc', 'Sending_mpm', 'Sending_conflict_casualties', 'Sending_govern', 'Sending_cri', 'Sending_pop']
def regress_with_stats(immigration, observations):
    fig, ax = plt.subplots(3, 2, figsize=(20, 10), sharex=False)
    ax = ax.ravel()
    for i, o in enumerate(observations):
        slope, intercept, r_value, p_value, std_err = stats.linregress(
            immigration[o],
            immigration['Flow']
        )
        # A title with statistics
        diag_str = (
            f"p-value={p_value:.1g}\n"
            f"r-value={r_value:.3f}\n"
            f"std err={std_err:.3f}\n"
            f"slope={slope:.3f}\n"
            f"intercept={intercept:.3f}"
        )
        # Scatter plot with regression line
        immigration.plot.scatter(x=o, y='Flow', title=diag_str, ax=ax[i])
        pts = np.linspace(immigration[o].min(), immigration[o].max(), 500)
        line = slope * pts + intercept
        ax[i].plot(pts, line, lw=1, color='red')
    # Remove any unused subplots
    for i in range(len(observations), len(ax)):
        fig.delaxes(ax[i])
    plt.tight_layout()
    plt.show()
regress_with_stats(immigration, observations)
import seaborn as sns
# Coerce variables into numeric values
immigration['Flow'] = pd.to_numeric(immigration['Flow'], errors='coerce')
immigration['Sending_gdppc'] = pd.to_numeric(immigration['Sending_gdppc'], errors='coerce')
immigration['Sending_mpm'] = pd.to_numeric(immigration['Sending_mpm'], errors='coerce')
immigration['Sending_conflict_casualties'] = pd.to_numeric(immigration['Sending_conflict_casualties'], errors='coerce')
immigration['Sending_govern'] = pd.to_numeric(immigration['Sending_govern'], errors='coerce')
immigration['Sending_cri'] = pd.to_numeric(immigration['Sending_cri'], errors='coerce')
immigration['Sending_pop'] = pd.to_numeric(immigration['Sending_pop'], errors='coerce')
# Correlation Heatmap
variables = ['Flow', 'Sending_gdppc', 'Sending_mpm', 'Sending_conflict_casualties', 'Sending_govern', 'Sending_cri', 'Sending_pop']
# Correlation Matrix
corr_matrix = immigration[variables].corr()
# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm_r', fmt=".2f")
plt.title('Correlation Matrix of Selected Columns in Immigration Dataset')
plt.show()
The results of the multiple linear regression model are outlined below.
To begin, the R-squared value of 0.368 (adjusted R-squared: 0.344) indicates that the model accounts for roughly 37% of the variation in the outcome variable, which is fair but not optimal.
Among the explanatory variables, Sending_conflict_casualties is the only one with a p-value below 0.05, indicating a statistically significant association with the outcome variable. Specifically, a one-unit increase in the weighted conflict casualties measure is associated with an increase of about 116 migrants to EU countries, holding the other variables constant. The poverty measure (Sending_mpm) is only marginally significant, with a p-value of 0.059.
# Multilinear Regression Model
import statsmodels.api as sm
# X and Y variables
predictors = ['Sending_gdppc', 'Sending_mpm', 'Sending_conflict_casualties', 'Sending_govern', 'Sending_cri', 'Sending_pop']
outcome = 'Flow'
# constant term
X = sm.add_constant(immigration[predictors])
# the regression model
model = sm.OLS(immigration[outcome], X)
results = model.fit()
# Print the summary of the regression results
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   Flow   R-squared:                       0.368
Model:                            OLS   Adj. R-squared:                  0.344
Method:                 Least Squares   F-statistic:                     15.51
Date:                Sun, 08 Dec 2024   Prob (F-statistic):           5.54e-14
Time:                        23:13:44   Log-Likelihood:                -2256.0
No. Observations:                 167   AIC:                             4526.
Df Residuals:                     160   BIC:                             4548.
Df Model:                           6
Covariance Type:            nonrobust
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                       -6.987e+04   1.25e+05     -0.559      0.577   -3.17e+05    1.77e+05
Sending_gdppc                  -1.1049      1.055     -1.048      0.296      -3.187       0.978
Sending_mpm                 -1356.6816    713.712     -1.901      0.059   -2766.193      52.830
Sending_conflict_casualties   115.7646     12.551      9.223      0.000      90.977     140.552
Sending_govern               4533.4749   4510.002      1.005      0.316   -4373.335    1.34e+04
Sending_cri                   508.7402    447.709      1.136      0.258    -375.441    1392.921
Sending_pop                  -9.46e-05      0.002     -0.040      0.968      -0.005       0.005
==============================================================================
Omnibus:                      166.355   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            12738.316
Skew:                           3.093   Prob(JB):                         0.00
Kurtosis:                      45.337   Cond. No.                     3.06e+08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.06e+08. This might indicate that there are strong multicollinearity or other numerical problems.
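The two R-squared figures in the summary are linked by the standard adjustment for the number of predictors. As a small worked check, the hypothetical helper below (not part of the notebook) recomputes the adjusted value from the reported R-squared, the 167 observations, and the 6 predictors.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared = 0.368, n = 167 observations, p = 6 predictors (from the summary above)
print(round(adjusted_r2(0.368, 167, 6), 3))  # 0.344
```

The adjustment penalizes each added predictor, which is why the adjusted figure is the safer one to quote when comparing models with different numbers of variables.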
There are various ways to assess the performance of the model. We begin with a residual analysis, which examines the differences between the actual and predicted y values.
Based on the plots below and a Durbin-Watson statistic close to 2, the residuals show no sign of autocorrelation. The normality assumption is less secure: the Omnibus and Jarque-Bera tests in the regression summary above (skew 3.093, kurtosis 45.337) indicate strongly right-skewed, heavy-tailed residuals. The model may also be affected by heteroscedasticity, as the residuals do not appear to be randomly scattered.
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.stattools as st
import matplotlib.pyplot as plt
import seaborn as sns
# Calculate Residuals
predicted_values = results.predict()
residuals = immigration['Flow'] - predicted_values
# Check for Linearity
plt.scatter(predicted_values, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted Values')
plt.show()
# Check for Homoscedasticity
plt.scatter(predicted_values, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted Values')
plt.axhline(y=0, color='r', linestyle='-')
plt.show()
# Normality of Residuals
sns.histplot(residuals, kde=True)
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
# Check for Autocorrelation
# Durbin-Watson statistic using the durbin_watson function from statsmodels.stats.stattools
dw_statistic = st.durbin_watson(residuals)
print("Durbin-Watson Statistic:", dw_statistic)
# You can also plot autocorrelation function (ACF) of residuals if needed
sm.graphics.tsa.plot_acf(residuals)
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.title('Autocorrelation Function (ACF) of Residuals')
plt.show()
Durbin-Watson Statistic: 1.982986745408299
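For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The following is a minimal NumPy sketch of that definition, using a hypothetical function name and made-up residuals. Values near 2 suggest no first-order autocorrelation, values near 0 positive autocorrelation, and values near 4 negative autocorrelation.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Perfectly alternating residuals: strong negative autocorrelation
print(durbin_watson_stat([1, -1, 1, -1]))  # 3.0
# Constant residuals: consecutive differences are all zero
print(durbin_watson_stat([1, 1, 1]))       # 0.0
```

Because the statistic depends on the order of the rows, it is most informative for time-ordered data. For a cross-section of countries, the ordering is arbitrary, so the value should be read with some caution.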
Conclusion
This project investigated migration flows to the EU and the key determinants of these flows.
Initially, in the exploratory analysis, migration flows were analyzed by sending and receiving country, as well as by gender and age bracket. Migrants arriving in EU countries were evenly split between males and females, and more than half were considered young. Notably, Ukraine emerged as the primary sending country, with the effects of the invasion significantly shaping migration patterns.
Next, the project incorporated additional indicators into the migration dataset (immig_noneu27), encompassing socio-economic and political variables such as GDP per capita, the Multidimensional Poverty Measure, Political Stability Indicators, the Gender Inequality Index, Conflict Indicators (Battles, Riots, and Violence against Civilians), the Climate Risk Index, and Population.
The focus then shifted to constructing a statistical model to identify which indicators have a statistically significant effect on migration flows to EU27 countries.
This analysis covered GDP per capita, poverty, social and armed conflict, governance, climate risk, and population. The regression results highlighted the statistically significant effect of conflict casualties on migration flows; the poverty measure was only marginally significant (p = 0.059), and the remaining variables were not significant.
For future studies, factors such as geographical proximity and common language and culture could be operationalized and integrated into the model to enhance its predictive capacity.