Migration Flows to EuropeΒΆ



Baris Alan

Final Data Science Tutorial

CMPS 6790 / Data Science - Prof. Nicholas Mattei

Online Access to the Project

Project Datasets


Project Topic and GoalsΒΆ

The advanced liberal democracies in the Western world stand as an attractive destination for populations in the developing world, driven by a myriad of pull and push factors. Pull factors encompass liberal developed democracies offering employment opportunities, and providing rule of law and equal treatment before the law. Conversely, push factors comprise issues such as armed and social conflict, unemployment, poverty, corruption, poor governance, and the climate related risks. European countries, in particular, emerge as a desirable destination for numerous nations in Africa, the Middle East, and West Asia.

The primary objective of this project is to conduct a comprehensive analysis and visualization of migration flows to the European Union (EU) countries from regions outside Europe. The key areas of focus include examining the demographic structure and educational background of migrants, identifying their countries of origin and the EU countries they choose for settlement, and mapping out the migration routes and transit countries they navigate.

By addressing these aspects, the project aims to provide valuable insights into the dynamics of migration to the EU, shedding light on the factors influencing migration patterns and contributing to a nuanced understanding of the complex interplay between push and pull factors in the context of global migration.

Later on, this project builds a regression model to measure the impact and the significance of various factors on migration flows to Europe: economic (GPD per capita and multidimensional poverty), political (political stability and effectiveness of government), gender (gender inequality), conflict (armed conflict and social unrest), and climate (climate related risks).

Project DatasetsΒΆ

This project will utilize various datasets from different institutions.

The First DatasetΒΆ

The primary dataset is sourced from Eurostat, the official statistical administration of the EU. Specifically, I obtained the dataset from the Migration and Population Statistics section, focusing on immigration by age group, sex, and citizenship. This dataset provides the total number of migrants based on specified filters, allowing researchers to analyze immigration by receiving country, immigrant citizenship, year, age group, and gender.

Due to Eurostat's data download limitations, careful selection of attributes was necessary. Specifically, (1) I narrowed down the country of citizenship options to 218, excluding EU countries to focus on immigration from other regions to Europe. Additionally, regional groupings such as Africa and South Asia were included for future analysis. (2) Receiving countries were limited to 27 EU nations. (3) Gender analysis was conducted for all available options (Male-Female-Total). (4) Age-based analysis was performed by selecting total and specific age brackets. (5) The dataset was filtered for the year 2021, with plans to include data from previous years for a comprehensive analysis of changing migration flows.

Key questions addressed with the first dataset include: "What is the total number of arrivals in EU countries in 2022?", "What are the demographic characteristics of immigrants based on gender and age?", and "Which EU countries received the highest number of immigrants?", and lastly "Which sent the highest number of immigrants?"

The Second Set of Datasets: ISO CodesΒΆ

For the second dataset, Datahub.io, to incorporate 2-digit country codes, was utilized. This dataset serves the sole purpose of associating country names witb country codes in the immigration dataset.

Some of the independent variable datasets use 3-digit country codes. Therefore, for better merging the main migration dataset with these independent variable datasets, the World Bank ISO3 Dataset was also utilized to bring iso3 codes.

The following datasets will be used to operationalize the explanatory variables which would help to answer the main question: "What is the most important predictor of migration flows to Europe in 2022?". Therefore this project utilizes various datasets to bring independent variables into the analysis.

The Third-set of Datasets: Economic IndicatorsΒΆ

To measure the effect of economic indicators, this project will utilize the GDP per capita (Purchasing Power Parity in 2017 Constant USD) dataset by the World Bank. GPD/PC is the most common and one of the best indicator of overall economic wellbeing of a country.

Yet, because the average income might not reflect the well-being of the whole population, this research will include World Bank Multidimensional Poverty Measurement as well.

The Fourth Dataset: ACLED Conflict DatasetsΒΆ

Armed Conflict Location & Event Data Project provides various datasets, ranging from mob violence to military conflict to have a sense of conflict in a specific country or region. This project will utilize three datasets which are noted as factors creating migration outflows: the Battles, Riots, and Violence against Civilians.

The Fifth Dataset: Governance IndicatorsΒΆ

Sometimes goverments failure to provide services, such as health, education, or social welfare, might create dissatisfaction in public, and they might want to migrate to developed nations for these reasons. Furthermore, poor governance is also linked with worse economic outcomes and conflicts as well. World Bank World Governance Indicator will be used to asses the political stability and government effectiveness.

The Fifth Dataset: German Watch Climate Risk IndexΒΆ

German Watch Climate Risk Index will be used to asses climate-related factors.

The Seventh Dataset: PopulationΒΆ

Population of a country is also a crucial factor explaining the amount of migration flows. This project uses the CIA's population dataset.

ETL (Extract, Transform, Load)ΒΆ

InΒ [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
InΒ [2]:
# Clone the repository, change the wd
!git clone https://github.com/barisalan00/barisalan00.github.io
%cd /home/jovyan/barisalan00.github.io
!pwd
Cloning into 'barisalan00.github.io'...
remote: Enumerating objects: 141, done.
remote: Counting objects: 100% (89/89), done.
remote: Compressing objects: 100% (86/86), done.
remote: Total 141 (delta 43), reused 2 (delta 2), pack-reused 52 (from 1)
Receiving objects: 100% (141/141), 19.18 MiB | 105.00 KiB/s, done.
Resolving deltas: 100% (57/57), done.
/home/jovyan/barisalan00.github.io
/home/jovyan/barisalan00.github.io

Import DatasetsΒΆ

InΒ [3]:
# Main Immigration Dataset: Import Eurostat Immigration/2022 Dataset
euim22 = pd.read_csv('Eurostat-2022Migration-migr_imm1ctz__custom_10841676_linear.csv')
display(euim22.head(3))

# Total number of observations: 119479
display(len(euim22))

# Import Datahub.io 2-digit Country Codes dataset
country_codes2 = pd.read_csv('Datahub-CountryCodes-data_csv.csv')
display(country_codes2.head(3))

# Import WB 3-digit Country Codes dataset
country_codes3 = pd.read_csv('UN-iso3.csv')
display(country_codes3.head(3))

country_codes3_ = pd.read_excel('WB-CountryCodes.xlsx')

# Economic Indicator1: WB 2022 GDP/PC
gdppc = pd.read_csv('WB-2022GDPPC-Const.csv')
display(gdppc.head(3))

# Economic Indicator2: WB 2023 Multidimensional Poverty Measure
mpm = pd.read_excel('WB-2022MPM-Data-AM2022.xlsx')
display(mpm.head(3))

# Conflict Indicator1: ACLED 2022 Battles Dataset
battle = pd.read_csv('ACLED-2022Battles.csv')
display(battle.head(3))

# Conflict Indicator2: ACLED 2022 Riots Dataset
riot = pd.read_csv('ACLED-2022Riots.csv')
display(riot.head(3))

# Conflict Indicator3: ACLED 2022 Violence Dataset
violence = pd.read_csv('ACLED-2022ViolencesCivilians.csv')
display(violence.head(3))

# Political Indicators: WB Governance Indicators
govern = pd.read_csv('WB-2022GovIndic.csv')
display(govern.head(3))

# Climate Indicator: German Watch Climate Risk Index
climate = pd.read_csv('GermanWatch-2018CRI.csv')
display(climate.head(3))

# Population Indicator: CIA World Factbook - Population
population = pd.read_csv('CIA-Population.csv', encoding='latin1')
display(population.head(3))
DATAFLOW LAST UPDATE freq citizen agedef age unit sex geo TIME_PERIOD OBS_VALUE OBS_FLAG
0 ESTAT:MIGR_IMM1CTZ(1.0) 27/03/24 11:00:00 A AD REACH TOTAL NR F AT 2022 0 NaN
1 ESTAT:MIGR_IMM1CTZ(1.0) 27/03/24 11:00:00 A AD REACH TOTAL NR F BG 2022 0 NaN
2 ESTAT:MIGR_IMM1CTZ(1.0) 27/03/24 11:00:00 A AD REACH TOTAL NR F CZ 2022 0 NaN
119479
Name Code
0 Afghanistan AF
1 Γ…land Islands AX
2 Albania AL
iso3 name
0 BEL Belgium
1 CH_ China, mainland
2 GGY Guernsey
Country Name Country Code Series Name Series Code 2022 [YR2022]
0 Afghanistan AFG GDP per capita, PPP (constant 2017 internation... NY.GDP.PCAP.PP.KD ..
1 Africa Eastern and Southern AFE GDP per capita, PPP (constant 2017 internation... NY.GDP.PCAP.PP.KD 3566.269439
2 Africa Western and Central AFW GDP per capita, PPP (constant 2017 internation... NY.GDP.PCAP.PP.KD 4066.48323
Region Country code Economy Reporting year Survey name Survey year Survey coverage Welfare type Survey comparability Monetary (%) Educational attainment (%) Educational enrollment (%) Electricity (%) Sanitation (%) Drinking water (%) Multidimensional poverty headcount ratio (%)
0 ECA ALB Albania 2018 HBS 2018 N c 3.0 0.048107 0.192380 - 0.06025 6.579772 9.594966 0.293161
1 SSA AGO Angola 2018 IDREA 2018 N c 2.0 31.122005 29.753423 27.44306 52.639532 53.637516 32.106507 47.203606
2 LAC ARG Argentina 2021 EPHC-S2 2021 U i 2.0 0.958847 1.085320 0.731351 0 0.193965 0.364048 0.971202
event_id_cnty event_date year time_precision disorder_type event_type sub_event_type actor1 assoc_actor_1 inter1 ... location latitude longitude geo_precision source source_scale notes fatalities tags timestamp
0 DRC27768 31-Dec-22 2022 1 Political violence Battles Armed clash M23: March 23 Movement NaN 2 ... Karenga -1.4724 29.0655 2 Mediacongo.net; Radio Okapi National On 31 December 2022, during a two-day battle, ... 0 NaN 1673291085
1 MZM3154 31-Dec-22 2022 1 Political violence Battles Armed clash Islamist Militia (Mozambique) NaN 3 ... Namacule -11.8567 39.8000 1 AIM; Pinnacle News; Twitter; Zitamar New media-National On 31 December 2022, Islamist militia clashed ... 2 NaN 1673291088
2 MZM3155 31-Dec-22 2022 1 Political violence Battles Armed clash Islamist Militia (Mozambique) NaN 3 ... Namande -11.8278 39.7416 1 AIM; Pinnacle News; Twitter; VOA; Zitamar New media-National On 31 December 2022, Islamist militia clashed ... 2 NaN 1673291088

3 rows Γ— 31 columns

event_id_cnty event_date year time_precision disorder_type event_type sub_event_type actor1 assoc_actor_1 inter1 ... location latitude longitude geo_precision source source_scale notes fatalities tags timestamp
0 KEN9717 31 December 2022 2022 1 Political violence Riots Mob violence Rioters (Kenya) Vigilante Group (Kenya) 5 ... Kutus -0.5753 37.3269 2 Kenya Standard; NTV (Kenya) New media-National On 31 December 2022, a mob lynched a man, part... 1 crowd size=no report 1673291087
1 BRA62473 31 December 2022 2022 1 Political violence Riots Mob violence Rioters (Brazil) Vigilante Group (Brazil) 5 ... Maues -3.3795 -57.7196 1 Portal do Holanda Subnational On 31 December 2022, in Maues (Amazonas), a su... 0 crowd size=no report 1673295343
2 BRA62488 31 December 2022 2022 1 Political violence Riots Mob violence Rioters (Brazil) PL: Liberal Party 5 ... Catalao -18.1670 -47.9448 1 Estado de Minas National Property destruction: On 31 December 2022, in ... 0 crowd size=no report 1673295343

3 rows Γ— 31 columns

event_id_cnty event_date year time_precision disorder_type event_type sub_event_type actor1 assoc_actor_1 inter1 ... location latitude longitude geo_precision source source_scale notes fatalities tags timestamp
0 DRC27766 31 December 2022 2022 1 Political violence Violence against civilians Abduction/forced disappearance Twirwaneho Ethnic Militia (Democratic Republic... Banyamulenge Ethnic Militia (Democratic Republ... 4 ... Mikenge -3.4497 28.4476 1 Kivu Times Subnational On 31 December 2022, Twirwaneho abducted a wom... 0 NaN 1673291085
1 SAF18067 31 December 2022 2022 1 Political violence Violence against civilians Attack Unidentified Armed Group (South Africa) NaN 3 ... Johannesburg -26.2023 28.0436 1 Zambia Reports International On 31 December 2022, unknown suspects shot and... 1 NaN 1673291088
2 SOM38915 31 December 2022 2022 1 Political violence Violence against civilians Abduction/forced disappearance Al Shabaab NaN 2 ... Ted 4.4000 43.9167 2 Undisclosed Source Local partner-Other On 31 December 2022, Al Shabaab abducted three... 0 NaN 1673291088

3 rows Γ— 31 columns

Country Name Country Code Series Name Series Code 2022 [YR2022]
0 Afghanistan AFG Political Stability and Absence of Violence/Te... PV.EST -2.550801754
1 Afghanistan AFG Voice and Accountability: Estimate VA.EST -1.751587272
2 Korea, Dem. People's Rep. PRK Voice and Accountability: Percentile Rank VA.PER.RNK 0
CRI\rRank Country CRI\rscore Fatalities\rin 2018\r(Rank) Fatalities per\r100 000 inhab-\ritants (Rank) Losses in mil-\rlion US$ (PPP)\r(Rank) Losses per\runit GDP in\r% (Rank)
0 1 Japan 5.50 2 2 3 12
1 2 Philippines 11.17 4 14 7 14
2 3 Germany 13.83 3 1 6 36
name slug value date_of_information ranking region
0 Afghanistan afghanistan 38,346,720 2022 est. 37.0 South Asia
1 Albania albania 3,095,344 2022 est. 136.0 Europe
2 Algeria algeria 44,178,884 2022 est. 34.0 Africa

Transform and Tidy DataΒΆ

InΒ [4]:
# Check the dtypes for euim22
# The year (TIME_PERIOD) and flow (OBS_VALUE) columns are integer, and the rest is object as expected.
euim22.dtypes
Out[4]:
DATAFLOW       object
LAST UPDATE    object
freq           object
citizen        object
agedef         object
age            object
unit           object
sex            object
geo            object
TIME_PERIOD     int64
OBS_VALUE       int64
OBS_FLAG       object
dtype: object
InΒ [5]:
# Keep only necessary columns and drop redundant ones
euim22 = euim22[['citizen', 'age', 'sex', 'geo', 'TIME_PERIOD', 'OBS_VALUE']]
euim22.head()
Out[5]:
citizen age sex geo TIME_PERIOD OBS_VALUE
0 AD TOTAL F AT 2022 0
1 AD TOTAL F BG 2022 0
2 AD TOTAL F CZ 2022 0
3 AD TOTAL F EE 2022 0
4 AD TOTAL F FI 2022 0
InΒ [6]:
# Rename columns for readability
euim22.rename(columns={'citizen':'Migrant_Citizenship',
                        'age': 'Age',
                        'sex': 'Gender',
                        'geo':'Receiving_CCode',
                        'TIME_PERIOD':'Year',
                        'OBS_VALUE':'Flow'},inplace=True)
euim22.head()
Out[6]:
Migrant_Citizenship Age Gender Receiving_CCode Year Flow
0 AD TOTAL F AT 2022 0
1 AD TOTAL F BG 2022 0
2 AD TOTAL F CZ 2022 0
3 AD TOTAL F EE 2022 0
4 AD TOTAL F FI 2022 0

Despite the Eurostat data dashboard displaying country names for the country codes, the downloaded dataset does not include country names. This is why I will leverage the "country_codes2" dataset from datahub.io to retrieve country names for both Citizenship and Receiving Country Code codes. This step ensures a comprehensive and accurate representation of country names in the analysis.


InΒ [7]:
# Bring country name information for Migrant Citizenship column (left join to keep all observations at euim_21)
euim22 = pd.merge(euim22, country_codes2, how='left', left_on='Migrant_Citizenship', right_on='Code')

# Drop the redundant "Code" column
euim22.drop('Code', axis=1, inplace=True)

# Rename the 'Name' column to 'Sending_Country'
euim22.rename(columns={'Name':'Sending_Country'}, inplace=True)

# Move Migrant_Country after Migrant_Citizenship
col = euim22.pop('Sending_Country')
euim22.insert(1, col.name, col)

euim22.head(3)
Out[7]:
Migrant_Citizenship Sending_Country Age Gender Receiving_CCode Year Flow
0 AD Andorra TOTAL F AT 2022 0
1 AD Andorra TOTAL F BG 2022 0
2 AD Andorra TOTAL F CZ 2022 0
InΒ [8]:
# Bring country name information for Receiving Country Column (left join to keep all observations at euim_22)
euim22 = pd.merge(euim22, country_codes2, how='left',left_on='Receiving_CCode', right_on='Code')

# Drop the redundant "Code" column
euim22.drop('Code', axis=1, inplace=True)

# Rename the 'Name' column to 'Receiving_Country'
euim22.rename(columns={'Name':'Receiving_Country'}, inplace=True)

# Move Receiving_Country after Receiving_CCode
col = euim22.pop('Receiving_Country')
euim22.insert(5, col.name, col)


euim22.head(3)
Out[8]:
Migrant_Citizenship Sending_Country Age Gender Receiving_CCode Receiving_Country Year Flow
0 AD Andorra TOTAL F AT Austria 2022 0
1 AD Andorra TOTAL F BG Bulgaria 2022 0
2 AD Andorra TOTAL F CZ Czech Republic 2022 0
InΒ [9]:
# Check if we lost any cells at the merge operations.
# We had 119479 observations at the beginning, and and it is still there, we are not missing anything.
len(euim22)
Out[9]:
119479
InΒ [10]:
# Is there any missing values under Receiving_Country?
# 'EL' is country code for Greece. Greece is using both 'GR' (in international systems) and 'EL' (in European systems) as its country code.
# 'EU27_2020' is the code for 27 EU countries.
euim22[euim22['Receiving_Country'].isna()]['Receiving_CCode'].unique()
Out[10]:
array(['EL', 'EU27_2020'], dtype=object)
InΒ [11]:
# Fill these NaN values for 'EL' with Greece
euim22.loc[euim22['Receiving_CCode'] == 'EL', 'Receiving_Country'] = 'Greece'
euim22.loc[euim22['Receiving_CCode'] == 'EU27_2020', 'Receiving_Country'] = 'EU27'
InΒ [12]:
# Are all receiving countries EU27? Iceland, Liechstein, Norway, Switzerland are not EU27.
display(euim22['Receiving_Country'].unique())
array(['Austria', 'Bulgaria', 'Czech Republic', 'Estonia', 'Finland',
       'Croatia', 'Hungary', 'Iceland', 'Italy', 'Lithuania',
       'Luxembourg', 'Latvia', 'Netherlands', 'Norway', 'Romania',
       'Sweden', 'Slovenia', 'Slovakia', 'Spain', 'France', 'Belgium',
       'Switzerland', 'Cyprus', 'Germany', 'Denmark', 'Greece', 'EU27',
       'Ireland', 'Liechtenstein', 'Malta', 'Poland', 'Portugal'],
      dtype=object)
InΒ [13]:
# Drop these 4 countries: Now we have 27 EU countries + 1 EU27 Aggregated observation
countries_to_drop = ['Iceland', 'Liechtenstein', 'Norway', 'Switzerland']
euim22 = euim22[~euim22['Receiving_Country'].isin(countries_to_drop)]
eu27 = (euim22['Receiving_Country'].unique())
print(eu27)
['Austria' 'Bulgaria' 'Czech Republic' 'Estonia' 'Finland' 'Croatia'
 'Hungary' 'Italy' 'Lithuania' 'Luxembourg' 'Latvia' 'Netherlands'
 'Romania' 'Sweden' 'Slovenia' 'Slovakia' 'Spain' 'France' 'Belgium'
 'Cyprus' 'Germany' 'Denmark' 'Greece' 'EU27' 'Ireland' 'Malta' 'Poland'
 'Portugal']
InΒ [14]:
# Is there any NaN cells under Sending_Country column? --> 20017 observations are missing.
euim22['Sending_Country'].isna().sum()
Out[14]:
20017
InΒ [15]:
# Let's check the unique values for these 20017 NaN observations.
euim22[euim22['Sending_Country'].isna()]['Migrant_Citizenship'].unique()
Out[15]:
array(['AFR', 'AFR_C', 'AFR_E', 'AFR_N', 'AFR_S', 'AFR_W', 'AME', 'AME_C',
       'AME_N', 'AME_S', 'ASI', 'ASI_C', 'ASI_E', 'ASI_S', 'ASI_S_E',
       'ASI_W', 'AU_NZ', 'CC8_22_FOR', 'CRB', 'CZ_SK', 'EFTA_FOR', 'EL',
       'EU27_2020_FOR', 'EUR', 'EX_SU', 'EX_YU', 'FOR_STLS', 'MEL', 'MIC',
       'NAT', 'NEU27_2020_FOR', 'OCE', 'POL', 'RNC', 'RS_ME', 'STLS',
       'TOTAL', 'UK', 'UNK', 'XK'], dtype=object)

The NaN values under the "Sending_Country" column correspond to the codes displayed in the array above. Notably, these codes are not 2-digit but rather 3-digit.

As per the Eurostat system, most of these codes represent continents such as 'AFR'=Africa, 'ASI_W'=West Asia, which aggregate the sum of countries within these continents. While the immigrant numbers for continents may introduce duplicates, they remain crucial for continental flow analysis.

Additionally, specific codes represent regions such as 'AU_NZ': Australia-New Zealand, 'CC8_22_FOR':8 Candidate Countries, 'CZ_SK': Czechoslovakia, 'EFTA_FOR':European Free Trade Association Countries, 'EL':Greece, 'EU27_2020_FOR':EU27 Countries except reporting country,'EUR':Europe, 'EX_SU':Soviet Union, 'EX_YU':Yugoslavia, 'FOR_STLS':Foreign country and stateless, 'NAT': Reporting Country, 'NEU27_2020_FOR':Non-EU27 countries nor reporting country, Oceania, 'RNC': Recognized Non-Citizens, 'RS_ME':Serbia and Montenegro, 'STLS': Stateless, 'TOTAL': Total, 'UNK': Unknown, 'XK':Kosovo.

For analytical purposes, all continents and regional observations will be excluded from the primary analysis. A secondary continental dataset will be created, and these observations will be removed from the original "euim22" dataset to prevent duplication. However, 'STLS': Stateless, 'RNC': Recognized Non-Citizens, 'UNK': Unknown observations will be retained in the original dataset, as these observations are not represented under any country-observations and can be treated as distinct entities. 'EU27_2020_FOR', 'NEU27_2020_FOR', and 'TOTAL' will also kept in the dataset for calculations.


InΒ [16]:
# Replace the NaN Values under Migrant_Country for these ('STLS':Stateless, 'RNC':Recognized Non-Citizens and 'UNK':Unknown) under Migrant_Ciizenship
# Therefore all non-NaN observatoins under Migrant_Country column are part of our analysis.
euim22.loc[euim22[euim22['Migrant_Citizenship'] == 'STLS'].index, 'Sending_Country'] = 'Stateless'
euim22.loc[euim22[euim22['Migrant_Citizenship'] == 'RNC'].index, 'Sending_Country'] = 'Non-Citizens'
euim22.loc[euim22[euim22['Migrant_Citizenship'] == 'UNK'].index, 'Sending_Country'] = 'Unkown'
euim22.loc[euim22[euim22['Migrant_Citizenship'] == 'EU27_2020_FOR'].index, 'Sending_Country'] = 'EU27'
euim22.loc[euim22[euim22['Migrant_Citizenship'] == 'NEU27_2020_FOR'].index, 'Sending_Country'] = 'Non-EU27'
euim22.loc[euim22[euim22['Migrant_Citizenship'] == 'TOTAL'].index, 'Sending_Country'] = 'Total'
euim22.loc[euim22[euim22['Migrant_Citizenship'] == 'EL'].index, 'Sending_Country'] = 'Greece'

# How many missing values now: 15129
display(euim22['Sending_Country'].isna().sum())
15129
InΒ [17]:
# Create a 2nd dataset to keep continental observations.
euim22_continents = euim22

# Now we can delete the continent/region observations from the euim22 (which are NA observations under Sending_Country column)
euim22 = euim22.dropna(subset=['Sending_Country'])

# NEw dataframe is 93370 length.
len(euim22)
Out[17]:
93370
InΒ [18]:
#Some extra re-naming for easier coding
euim22.loc[euim22[euim22['Age'] == 'TOTAL'].index, 'Age'] = 'Total'
euim22.loc[euim22[euim22['Gender'] == 'T'].index, 'Gender'] = 'Total'
euim22.loc[euim22[euim22['Migrant_Citizenship'] == 'TOTAL'].index, 'Migrant_Citizenship'] = 'Total'

Basic Summary StatisticsΒΆ

How many immigrants did arrive in the EU countries in 2022 from non-European countries?

According to Frontex (EU Border and Coast Guard AgencySecurity) and Eurostat a total of 5.1 million immigrants entered to EU countries from non-EU countries, which is a 117% compared to 2021 (2.7 million).

Our dataset (code below) reveals that the total number of arrivals to EU27 amounts to almost 7 million individuals. Among these, 4.8 million immigrants originated from non-EU27 countries, while 1.1 million arrived from other EU27 countries. Considering the challenges of managing and compiling data between 27 countries, the 300K difference between Eurostat and Frontex data is ignorable.

Germany emerges as the top destination, with a total of 2.1 million immigrants arriving, followed by Spain (1.2 million), France (430K), and Italy (410K). The table is similar if we look at the arrivals from Non-EU countries.

InΒ [19]:
# Total number of immigration to EU27 from and total number of immigration from non-EU27
euim22[(euim22['Sending_Country'].apply(lambda x: x in ['Total', 'EU27', 'Non-EU27'])) & (euim22['Receiving_Country']=='EU27') & (euim22['Age']=='Total') & (euim22['Gender']=='Total')].sort_values(by='Flow', ascending=False)
Out[19]:
Migrant_Citizenship Sending_Country Age Gender Receiving_CCode Receiving_Country Year Flow
106898 Total Total Total Total EU27_2020 EU27 2022 6977742
78980 NEU27_2020_FOR Non-EU27 Total Total EU27_2020 EU27 2022 4777475
38564 EU27_2020_FOR EU27 Total Total EU27_2020 EU27 2022 1098032
InΒ [20]:
# Total number of arrivals by reciving EU27 countries
euim22[(euim22['Sending_Country']=='Total') & (euim22['Age']=='Total') & (euim22['Gender']=='Total')].sort_values(by='Flow', ascending=False).head(5)
Out[20]:
Migrant_Citizenship Sending_Country Age Gender Receiving_CCode Receiving_Country Year Flow
106898 Total Total Total Total EU27_2020 EU27 2022 6977742
106893 Total Total Total Total DE Germany 2022 2071690
106897 Total Total Total Total ES Spain 2022 1258894
106900 Total Total Total Total FR France 2022 431017
106905 Total Total Total Total IT Italy 2022 410985
InΒ [21]:
# Total number of arrivals from non-EU27 countries
euim22[(euim22['Sending_Country']=='Non-EU27') & (euim22['Age']=='Total') & (euim22['Gender']=='Total')].sort_values(by='Flow', ascending=False).head(5)
Out[21]:
Migrant_Citizenship Sending_Country Age Gender Receiving_CCode Receiving_Country Year Flow
78980 NEU27_2020_FOR Non-EU27 Total Total EU27_2020 EU27 2022 4777475
78975 NEU27_2020_FOR Non-EU27 Total Total DE Germany 2022 1630619
78979 NEU27_2020_FOR Non-EU27 Total Total ES Spain 2022 925587
78974 NEU27_2020_FOR Non-EU27 Total Total CZ Czech Republic 2022 330997
78987 NEU27_2020_FOR Non-EU27 Total Total IT Italy 2022 287010
InΒ [22]:
# Create a new dataset by dropping the 'Total', 'EU27', and 'Non-EU27' observations under 'Sending_Country', and 'EU27' under 'Receiving'
# Total number of observations decreased to 90958.
immig = euim22[~euim22['Sending_Country'].isin(['Total', 'EU27', 'Non-EU27'])]
len(immig)
Out[22]:
90958
InΒ [23]:
# Drop 'EU27' under 'Receiving_Country' 
# Total number of observations decreased to 90952.
immig = immig[~immig['Receiving_Country'].isin(['EU27'])]
len(immig)
Out[23]:
90952

Crucial Note on Migration Dataset

The Eurostat immigration dataset offers observations that provide the total number of arrivals (including 'Total' for total arrivals, 'EU27' for total arrivals within the EU27, and 'Non-EU27' for total arrivals from non-EU27 countries) enabling a comprehensive view of aggregate numbers. According to these observations the total arrivals to EU27 countries amount to 7 million, with 4.8 million originating from non-EU27 countries.

However, upon removing these 'Total' observations to eliminate duplicates and examining the total number of arrivals by filtering the receiving country, we find a significantly lower figure of 3.4 million. Of this immigration flow, 1.1 million arrivals are from other EU27 countries, while 2.2 million are from non-EU27 countries.

These numbers starkly contrast with aggregate observations from Frontex and Eurostat, primarily due to the exclusion of some Sending_Country observations in the dataset. For instance, while aggregate data suggests Germany received 2.1 million immigrants in 2022, of which 1.6 million were from non-EU countries, a closer examination of arrivals to Germany by filtering the Sending_Country reveals only 6226 migrants, recorded as Stateless or Unknown, arrived in Germany.

This means that Germany didnot released/shared the arrivals by sending countries. Same situation exists for some other EU member states too. Therefore, it is evident that our dataset contains missing values under the Sending_Country column, which makes it difficult to have a country-level analysis for EU members.

The datasets by various organizations and projects focusing on international migration flows, including the International Migration Organization, Global Migration Data Portal, UN Global Migration Database, OECD International Migration Database, and World Bank Global Bilateral Migration and some other independent projects and academic research, were checked. However, none of these sources provide a complete dataset of bilateral migration flows. At present, the Eurostat dataset stands as the most comprehensive option. The Eurostat officials stated that the current version of the dataset is the most comprehensive one, and the member countries have discreation not to release the full details. Therefore this project will utilize the Total arrivals to the EU27 by sending cobservations, instead of having a country level analysis of the EU countries.

InΒ [24]:
# What is total number of arrivals? 3.4 million.
immig[(immig['Age']=='Total') & (immig['Gender']=='Total')]['Flow'].sum()
Out[24]:
3406513
InΒ [25]:
# Total number of arrivals from EU27 countries: 1.1 million
immig[(immig['Sending_Country'].isin(eu27)) & (immig['Age']=='Total') & (immig['Gender']=='Total')]['Flow'].sum()
Out[25]:
1171307
InΒ [26]:
# Total number of arrivals from non-EU27 countries: 2.2 million.
immig_noneu = immig[~immig['Sending_Country'].isin(eu27)]
immig_noneu[(immig_noneu['Age']=='Total') & (immig['Gender']=='Total')]['Flow'].sum()
/tmp/ipykernel_145/547588598.py:3: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  immig_noneu[(immig_noneu['Age']=='Total') & (immig['Gender']=='Total')]['Flow'].sum()
Out[26]:
2235206
InΒ [27]:
# Total number of arrivals in Germany from non-EU27 countries: 6226
display(immig_noneu[(immig_noneu['Receiving_Country']=='Germany') & (immig_noneu['Age']=='Total') & (immig_noneu['Gender']=='Total')]['Flow'].sum())
display(immig_noneu[immig_noneu['Receiving_Country']=='Germany']['Sending_Country'].unique())
6226
array(['Stateless', 'Unkown'], dtype=object)

How does the gender distribution among immigrants break down?

It appears that there were more men than women arriving in the EU from non-EU27 countries in 2022. A significant contributing factor to this trend is the presence of irregular migrants, who enter the EU illegally by crossing the Mediterranean and Aegean seas with the assistance of smugglers. Due to the perilous nature of these routes and the life-threatening aspects of the journey, men often aim to arrive first to secure asylum before bringing their families. Additionally, in regions such as the Middle East, Africa, and South Asia, young unmarried men are more likely to immigrate to Europe compared to young unmarried women.

The gender breakdown holds importance for various reasons. Some groups advocate for the inclusion of women, children, and the elderly while excluding men, while others argue that there is a labor shortage in the European labor market, making adult men crucial in filling this gap. More conservative groups express concerns about the potential impact of adult male immigrants on distorting European society. Therefore, understanding the gender and age demographics is crucial to assessing the validity of such perceived threats.

InΒ [28]:
# Group by 'Gender' column and sum 'Flow' column
immig_gender = immig_noneu[(immig_noneu['Age'] == 'Total')].groupby('Gender')['Flow'].sum()
print(immig_gender)

plt.figure(figsize=(8, 6))

plot_gender = immig_gender.plot(kind='bar', color=['pink','blue', 'green'])
plt.xlabel('Gender')
plt.ylabel('Total Arrivals')
plt.title('Total Migration Flow from Non-EU27 Countries Based on Gender')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Gender
F        1142144
M        1092917
Total    2235206
Name: Flow, dtype: int64
No description has been provided for this image

What about age demographics?

Approximately 52% (around 1.1 million) of arriving immigrants are 34 years old or younger, with 323K falling within the age range of 15 years or younger. From the perspective of certain groups within the EU, these numbers may be perceived as a potential threat to European society. An additional observation is that as age increases, the number of immigrants arriving decreases, as depicted in the accompanying plot.

However, from a humanitarian standpoint, these figures underscore the desperation of immigrants who, in the face of civil war, economic hardships, or climate-related challenges, flee their home countries with their children, aspiring to reach Europe. Those in the age group of 20-29 are often individuals who have completed their education or recently started a family but struggle to make a living in their home countries. Frustration with poverty, corruption, and economic challenges compels them to seek better living conditions in Europe.

As future milestones incorporate additional datasets into the analysis, a clearer picture will emerge regarding the underlying reasons behind these migration patterns.

InΒ [29]:
# Take the total ('T') from Gender column, and groupby 'Age'.
# Y_LT15: those below 15, and Y_GE65: those above 65. Other age breaks already make sense.
# reindex the age brackets in order
immig_age = immig_noneu[(immig_noneu['Gender'] == 'Total')].groupby('Age')['Flow'].sum()

# Reindex the age breaks from smallest to biggest
display(immig_age.reindex(['TOTAL','Y_LT15', 'Y15-19', 'Y20-24', 'Y25-29', 'Y30-34', 'Y35-39', 'Y40-44', 'Y45-49', 'Y50-54', 'Y55-59', 'Y60-64', 'Y_GE65']))

#Plot
plot_age = immig_age.reindex(['Y_LT15', 'Y15-19', 'Y20-24', 'Y25-29', 'Y30-34', 'Y35-39', 'Y40-44', 'Y45-49', 'Y50-54', 'Y55-59', 'Y60-64', 'Y_GE65']).plot(kind='bar', color='skyblue', edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Total Flow')
plt.title('Total Flow Based on Age Breaks')
Age
TOTAL          NaN
Y_LT15    323229.0
Y15-19    145583.0
Y20-24    199836.0
Y25-29    258376.0
Y30-34    247847.0
Y35-39    212671.0
Y40-44    162976.0
Y45-49    117268.0
Y50-54     83435.0
Y55-59     56993.0
Y60-64     45407.0
Y_GE65     67215.0
Name: Flow, dtype: float64
Out[29]:
Text(0.5, 1.0, 'Total Flow Based on Age Breaks')
No description has been provided for this image

Which EU countries receive the highest number of immigrants?

Before going into the details it is worth reminding the limited nature of this dataset. As clarified above, Germany received a total number of 1.6 immigrants from non-EU27 countries, yet this dataset only represent 6226 immigrants.

Initially, both the plot and the list below indicate that Spain and Italy have received more than half of the total immigrants. This observation underscores that Africa and the Middle East remain the primary regions of origin for migrants.

Additionally, the substantial influx of immigrants into Central and Eastern EU countries is noteworthy, signifying the impact of the Invasion of Ukraine. This surge in migration patterns in these regions is a notable consequence of the geopolitical events in Ukraine.

InΒ [30]:
# Total immigrants by receiving country
total_by_receiving = immig_noneu[(immig_noneu['Age'] == 'Total') & (immig_noneu['Gender'] == 'Total')].groupby('Receiving_Country')['Flow'].sum()

# Sort them
total_by_receiving = total_by_receiving.sort_values(ascending=False)

print(total_by_receiving)

#Plot
total_by_receiving.plot(kind='bar', color='skyblue', edgecolor='black')
plt.xlabel('Receiving Country')
plt.ylabel('Total Flow')
plt.title('Total Flow Based on Receiving Country')
Receiving_Country
Spain             857915
Czech Republic    330362
Italy             283740
Netherlands       189198
Austria           115919
Romania            91019
Lithuania          66139
Sweden             54356
Hungary            43601
Croatia            40073
Estonia            38898
Finland            33014
Latvia             29800
Slovenia           24269
Luxembourg         14555
Bulgaria           13885
Germany             6226
Belgium              922
Slovakia             561
Denmark              380
Ireland              297
Poland                77
France                 0
Malta                  0
Cyprus                 0
Portugal               0
Greece                 0
Name: Flow, dtype: int64
Out[30]:
Text(0.5, 1.0, 'Total Flow Based on Receiving Country')
No description has been provided for this image

Which country has sent the highest number of immigrants to EU countries?

Concurrently with the aforementioned analysis, the table and plot below reveal that Ukraine, Latin American Countries, and North African Countries are the primary sources of immigration. It is not unexpected to find China and India on these lists, given that they are the two most populous countries globally.

InΒ [31]:
# Total number of immigrants by sending country
total_by_sending = immig_noneu[(immig_noneu['Age'] == 'Total') & (immig_noneu['Gender'] == 'Total')].groupby('Sending_Country')['Flow'].sum()

# Rename countries with long names
total_by_sending.index = total_by_sending.index.str.replace('Venezuela, Bolivarian Republic of', 'Venezuela')

# Sort and display
total_by_sending = total_by_sending.sort_values(ascending=False)
display(total_by_sending.head(10))

# Plot the immigrants by top- 10 sending country
total_by_sending.head(10).plot(kind='bar', color='skyblue', edgecolor='black')
plt.xlabel('Sending Country')
plt.ylabel('Total Flow')
plt.title('Total Flow Based on Sending Country')
Sending_Country
Ukraine                 764356
Colombia                176955
Morocco                 138886
Venezuela                85238
Peru                     74722
India                    59999
Russian Federation       52933
Argentina                49531
Pakistan                 42479
Syrian Arab Republic     42466
Name: Flow, dtype: int64
Out[31]:
Text(0.5, 1.0, 'Total Flow Based on Sending Country')
No description has been provided for this image

Further ETLΒΆ

As clarified above, a second goal of this project is to statistically analyze the effects of various push factors, such as economic, political, climate, and conflict, and gender-related indicators.

From this point and on, we will be using the total flows, and we will not need the age and gender breaks. Therefore, will drop the Male-Female, and Age breaks and will only get the total observations for each sending country. Additionally we will have 1 observation per sending country, because from now on we will be working on the total flow to the EU. Therefore, dataset has 174 total observations.

InΒ [32]:
# Keep one aggregate sums for sending countries
immigration = immig_noneu.groupby('Sending_Country').agg({'Flow': 'sum','Migrant_Citizenship': 'first'}).reset_index()
immigration
Out[32]:
Sending_Country Flow Migrant_Citizenship
0 Afghanistan 53037 AF
1 Albania 131294 AL
2 Algeria 58852 DZ
3 Andorra 30 AD
4 Angola 828 AO
... ... ... ...
169 Viet Nam 22106 VN
170 Western Sahara 0 EH
171 Yemen 8918 YE
172 Zambia 698 ZM
173 Zimbabwe 1610 ZW

174 rows Γ— 3 columns

InΒ [33]:
#Rename countries with long names
immigration['Sending_Country'] = immigration['Sending_Country'].replace({
    'Venezuela, Bolivarian Republic of': 'Venezuela',
    'Syrian Arab Republic': 'Syria',
    "Korea, Democratic People's Republic of": 'North Korea',
    'Taiwan, Province of China': 'Taiwan',
    'Holy See (Vatican City State)': 'Vatican City',
    'Tanzania, United Republic of': 'Tanzania',
    'Macedonia, the Former Yugoslav Republic of': 'Macedonia',
    'Iran, Islamic Republic of': 'Iran',
    'Bolivia, Plurinational State of': 'Bolivia'
})

Merge with Economic Indicator1: GDP/PCΒΆ

GDP/PC (GDP Per Capita) Dataset has country names and 3 digit country codes. In order to merge this dataset with immigration dataset, we need 3 digit countries. Therefore, initialy the UN dataset will be used to bring the 3digit country codes for Sending_Countries.

InΒ [34]:
# Merge immigration dataframe with UN country_codes3
immigration = pd.merge(immigration, country_codes3, left_on='Sending_Country', right_on='name', how='left')
immigration.drop(columns=['name'], inplace=True)
immigration.rename(columns={'iso3': 'Sending_iso3'}, inplace=True)
immigration.rename(columns={'Migrant_Citizenship': 'Sending_iso2'}, inplace=True)


display(len(immigration))
immigration.head(5)
174
Out[34]:
Sending_Country Flow Sending_iso2 Sending_iso3
0 Afghanistan 53037 AF AFG
1 Albania 131294 AL ALB
2 Algeria 58852 DZ DZA
3 Andorra 30 AD AND
4 Angola 828 AO AGO
InΒ [35]:
# Check for missing ISO3 codes: 21 missing values
display(immigration['Sending_iso3'].isna().sum())

# These countries have longer and shorter version of their names.
immigration[immigration['Sending_iso3'].isna()]['Sending_Country'].unique()
21
Out[35]:
array(['Bolivia', 'Cape Verde', 'Congo, the Democratic Republic of the',
       'Vatican City', 'Iran', 'North Korea', 'Korea, Republic of',
       'Macedonia', 'Micronesia, Federated States of',
       'Moldova, Republic of', 'Non-Citizens', 'Palestine, State of',
       'Stateless', 'Swaziland', 'Syria', 'Taiwan', 'Tanzania', 'Turkey',
       'United States', 'Unkown', 'Venezuela'], dtype=object)
InΒ [36]:
# Manually bring the 3 digit country codes for the missing countries
missing_iso3 = {
    'Bolivia': 'BOL',
    'Congo, the Democratic Republic of the': 'COD',
    'Cape Verde': 'CPV',
    'Micronesia, Federated States of': 'FSM',
    'Syria': 'SYR',
    'Iran': 'IRN',
    'North Korea': 'PRK',
    'Korea, Republic of': 'KOR',
    'Moldova, Republic of': 'MDA',
    'Macedonia': 'MKD',
    'Palestine, State of': 'PSE',
    'Non-Citizens': 'XXX',
    'Stateless': 'XXX',
    'Swaziland': 'SWZ',
    'Turkey': 'TUR',
    'Taiwan': 'TWN',
    'Tanzania': 'TZA',
    'Unkown': 'XXX',
    'United States': 'USA',
    'Vatican City': 'VAT',
    'Venezuela': 'VEN'
}

# Update ISO3 column in immig_noneu27 dataset using the dictionary
immigration['Sending_iso3'].fillna(immigration['Sending_Country'].map(missing_iso3), inplace=True)
display(immigration['Sending_iso3'].isna().sum())
0
InΒ [37]:
# Merge WB GDP/PC with Immigration Dataset and bring GDP/PC information
immigration = pd.merge(immigration, gdppc[['Country Code', '2022 [YR2022]']], left_on='Sending_iso3', right_on='Country Code', how='left')
immigration.drop(columns=['Country Code'], inplace=True)

# Rename Sending_gdppc
immigration.rename(columns={'2022 [YR2022]': 'Sending_gdppc'}, inplace=True)

immigration
Out[37]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc
0 Afghanistan 53037 AF AFG ..
1 Albania 131294 AL ALB 15491.961
2 Algeria 58852 DZ DZA 11198.23348
3 Andorra 30 AD AND ..
4 Angola 828 AO AGO 5906.115677
... ... ... ... ... ...
169 Viet Nam 22106 VN VNM 11396.5313
170 Western Sahara 0 EH ESH NaN
171 Yemen 8918 YE YEM ..
172 Zambia 698 ZM ZMB 3365.87378
173 Zimbabwe 1610 ZW ZWE 2207.957033

174 rows Γ— 5 columns

Check the data type and unique values for Sending_gdppc column.

In the dataframe above, Andorra has a value of "..", yet it is not included in missing values. It is better to check the data type of this GDP/PC column, and the unique values.

InΒ [38]:
# Data type of Sending_gdppc = object
print(immigration['Sending_gdppc'].dtype)

# Turn datatype of GDP/PC column into numeric
immigration['Sending_gdppc'] = pd.to_numeric(immigration['Sending_gdppc'], errors='coerce')
print(immigration.dtypes)
object
Sending_Country     object
Flow                 int64
Sending_iso2        object
Sending_iso3        object
Sending_gdppc      float64
dtype: object
InΒ [39]:
# How many NaN values under GDP/PC column
missing_gdppc = immigration[immigration['Sending_gdppc'].isna()]['Sending_Country'].unique()
display(len(missing_gdppc))
display("Countries with NaN GDP/PC: ", missing_gdppc)
24
'Countries with NaN GDP/PC: '
array(['Afghanistan', 'Andorra', 'Bhutan', 'Cuba', 'Eritrea',
       'Vatican City', 'Isle of Man', 'Jersey', 'North Korea', 'Lebanon',
       'Liechtenstein', 'Monaco', 'Non-Citizens', 'Palau', 'San Marino',
       'South Sudan', 'Stateless', 'Syria', 'Taiwan', 'Tonga', 'Unkown',
       'Venezuela', 'Western Sahara', 'Yemen'], dtype=object)
InΒ [40]:
# How many migrants arrived from these countries with missing GDP/PC column?
immigration[(immigration['Sending_Country'].isin(missing_gdppc))].groupby('Sending_Country')['Flow'].sum().sort_values(ascending=False)
Out[40]:
Sending_Country
Venezuela         340468
Syria             138876
Cuba               82456
Afghanistan        53037
Unkown             35760
Eritrea            14295
Stateless          11142
Lebanon             9025
Yemen               8918
Non-Citizens        4090
Taiwan              3608
South Sudan          761
San Marino           118
Bhutan                98
Liechtenstein         88
North Korea           78
Andorra               30
Monaco                 6
Palau                  4
Vatican City           2
Jersey                 0
Tonga                  0
Isle of Man            0
Western Sahara         0
Name: Flow, dtype: int64

Important Note on Imputation: As it can be seen above, some of these countries with missing GDP/PC information didnot send any migrants to the EU, or only send a handful. Some of these countries geographically located in Europe and they send only a handful of migrants(Vatican City, Monaco, Andorra etc). For the purpose of this project, the GDP/PC information of the sending countries outside Europe will be brought from other reliable resources (such as IMF, and World Bank) and be kept in the dataset and whereas those in Europe will be dropped.

Besides these countries, the migrants recorded as Unknown, Stateless, and Non-Citizens also have missing GDP/PC information, because they do not have state information. For these observations we will simply replace the GDP/PC with the average GDP/PC.

InΒ [41]:
# Define GDP per capita values for missing countries
missing_gdp = {
    'Venezuela': 3420,
    'Syria': 752,
    'Cuba': 7449,
    'Afghanistan': 372,
    'Eritrea': 1921,
    'Lebanon': 4467,
    'Yemen': 1017,
    'Taiwan': 32716,
    'South Sudan': 340,
    'Western Sahara': 2500,
    'Tonga': 4681,
    'Palau': 14565,
    'North Korea': 1217,
    'Bhutan': 3704
    
}

# Fill missing GDP per capita values based on Sending_Country
immigration['Sending_gdppc'] = immigration.apply(
    lambda row: missing_gdp[row['Sending_Country']] if pd.isna(row['Sending_gdppc']) and row['Sending_Country'] in missing_gdp else row['Sending_gdppc'],
    axis=1
)

# Check the remaining missing countries
display(immigration[immigration['Sending_gdppc'].isna()]['Sending_Country'].unique())
array(['Andorra', 'Vatican City', 'Isle of Man', 'Jersey',
       'Liechtenstein', 'Monaco', 'Non-Citizens', 'San Marino',
       'Stateless', 'Unkown'], dtype=object)
InΒ [42]:
# For the Unkown, Stateless, Recognized Non-Citizens observations, fill the GDP/PC with the average of dataset
average_gdppc = immigration['Sending_gdppc'].mean()

# 3 observations
countries_to_fill = ['Unkown', 'Stateless', 'Non-Citizens']

# Fill missing GDP per capita for specified countries with the dataset's average GDP per capita
immigration['Sending_gdppc'] = immigration.apply(
    lambda row: average_gdppc if pd.isna(row['Sending_gdppc']) and row['Sending_Country'] in countries_to_fill else row['Sending_gdppc'],
    axis=1
)

# Check the remaining missing countries
display(immigration[immigration['Sending_gdppc'].isna()]['Sending_Country'].unique())
array(['Andorra', 'Vatican City', 'Isle of Man', 'Jersey',
       'Liechtenstein', 'Monaco', 'San Marino'], dtype=object)
InΒ [43]:
# Drop NA values
immigration = immigration.dropna(subset=['Sending_gdppc'])

# Verify that no missing GDP per capita values remain
immigration['Sending_gdppc'].isna().sum()
Out[43]:
0

Scatter Plot of Migration Flows and GDP/PC: My project hypothesizes the economic conditions, GDP/PC, are the consistent and stable long-term cause of migration outflows to Europe.

An early scatterplot with regression line for the whole dataset below shows a small negative slope below, which indicates that the GDP/PC and the migration flows are negatively correlated. However, I argue that the effect would be much higher for African countries. Hence, if a dummy variable for the continent is included, or if seperate models were created for different country groups (such as African countries, Gulf countries, developed countries) then the economic conditions would have a higher negative coefficient and therefore a higher negative impact on the migration flows especially from underdeveloped and developing countries which located at the periphery of the Europe.

Individual scatter plots for each explanatory variable will be provided once all of the y-variables are merged into the main dataset.

InΒ [44]:
import numpy as np
import matplotlib.pyplot as plt

# Scatter plot with size proportional to Flow
plt.figure(figsize=(10, 6))
plt.scatter(
    immigration['Sending_gdppc'], 
    immigration['Flow'], 
    s=immigration['Sending_gdppc'] / 100,  # Size of the dot
    alpha=0.1, 
    color='blue'
)

# Calculate the regression line
x = immigration['Sending_gdppc']
y = immigration['Flow']
coefficients = np.polyfit(x, y, 1) 
regression_line = np.poly1d(coefficients)

# Plot the regression line
plt.plot(
    x, 
    regression_line(x), 
    color='red', 
    linewidth=2, 
    label=f'Regression Line: y = {coefficients[0]:.2f}x + {coefficients[1]:.2f}'
)

# Add labels and title
plt.xlabel('Sending GDP per Capita')
plt.ylabel('Migration Flow')
plt.title('Scatter Plot of Sending GDP/PC vs Migration Flow (Dot Size = GDP/PC)')
plt.grid(True)
plt.legend() 

# Show the plot
plt.show()
No description has been provided for this image

Merge with Economic Indicator2: Multi-dimensional Poverty MetricΒΆ

The second economic indicator which will be used is World Bank Multidimensional Poverty Measure, which shows percentage of individuals who suffer from multdimensional poverty (a measure includes monetary, educational attainment, electricty, sanitation, and water-related poverty measures).

There are a total of 46 missing observations which is approximately 30% of the dataset. The total number of migrants coming from these countries is 1.2 million which is a big proportion of total migration flows.

Among these countries without MPM information, Venezuela comes at the top of the list with 340K migrants, followed by India with 224K migrants, Syria with 138K.

Removing these observations with missing MPM (and missing other indicator variables) would cause our dataset to shrink significantly, which would harm the generealization power of the dataset. Yet, simply filling the average Multidimensional Poverty Measure and replacing would not be appropriate. Because, as it can be seen highly developed countries like Canada or Monaco have missing Sending_mpm, and simply assigning the average Sending_mpm value would be against the reality. Therefore, this project will utilize a K-NN-like measure to fill these missing values. Becuase GDP/PC is a good measure of poverty in a country, we will use World Bank country classification by income information to fill the missing poverty measures. World Bank groups countries into 4 groups based on GDP/PC: Low Income (less than $1,085), Lower-middle Income (1,086 - 4,255), Upper-middle Income (4,256 - 13,205) and High income (higher than 13,206). While filling out the missing values, we will determine the income group of the country, and then fill the missing Sending_mpm with the average of the countries at the same income group.

InΒ [45]:
# Merge WB GDP/MPM with Immigration Dataset and bring Poverty information
immigration = pd.merge(immigration, mpm[['Country code', 'Multidimensional poverty headcount ratio (%)']], left_on='Sending_iso3', right_on='Country code', how='left')
immigration.drop(columns=['Country code'], inplace=True)

# Rename Sending_mpm
immigration.rename(columns={'Multidimensional poverty headcount ratio (%)': 'Sending_mpm'}, inplace=True)

immigration.head()
Out[45]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm
0 Afghanistan 53037 AF AFG 372.000000 NaN
1 Albania 131294 AL ALB 15491.961000 0.293161
2 Algeria 58852 DZ DZA 11198.233480 NaN
3 Angola 828 AO AGO 5906.115677 47.203606
4 Antigua and Barbuda 42 AG ATG 22321.870020 NaN
InΒ [46]:
# How many missing: 46 observations are missing.
missing_mpm = immigration[immigration['Sending_mpm'].isna()]
display(len(missing_mpm['Sending_Country']))

# Which countries have missing Multidimensional poverty information?
display(missing_mpm['Sending_Country'].unique())

# How many migrants came from these countries with missing MPM: 335,542.
display(immigration[immigration['Sending_Country'].isin(missing_mpm['Sending_Country'])]['Flow'].sum())
46
array(['Afghanistan', 'Algeria', 'Antigua and Barbuda', 'Azerbaijan',
       'Bahamas', 'Bahrain', 'Barbados', 'Belize',
       'Bosnia and Herzegovina', 'Brunei Darussalam', 'Cambodia',
       'Canada', 'Central African Republic', 'China', 'Cuba', 'Dominica',
       'Equatorial Guinea', 'Eritrea', 'Grenada', 'Guyana', 'India',
       'Jamaica', 'North Korea', 'Kuwait', 'Libya', 'New Zealand',
       'Non-Citizens', 'Oman', 'Palau', 'Qatar', 'Saint Kitts and Nevis',
       'Saint Lucia', 'Saint Vincent and the Grenadines', 'Saudi Arabia',
       'Singapore', 'Somalia', 'Stateless', 'Suriname', 'Syria',
       'Trinidad and Tobago', 'Turkmenistan', 'United Arab Emirates',
       'Unkown', 'Uzbekistan', 'Venezuela', 'Western Sahara'],
      dtype=object)
1239427
InΒ [47]:
# How many migrants arrived from the countries with missing MPM:
immigration[immigration['Sending_mpm'].isna()].groupby('Sending_Country')['Flow'].sum().sort_values(ascending=False)[:20]
Out[47]:
Sending_Country
Venezuela                 340468
India                     224303
Syria                     138876
China                     135900
Cuba                       82456
Bosnia and Herzegovina     63130
Algeria                    58852
Afghanistan                53037
Unkown                     35760
Uzbekistan                 17739
Somalia                    14831
Eritrea                    14295
Stateless                  11142
Equatorial Guinea          10398
Canada                      9145
Azerbaijan                  7853
Suriname                    5812
Non-Citizens                4090
Libya                       2338
Turkmenistan                1849
Name: Flow, dtype: int64
InΒ [48]:
# Create Income_group variable
bins = [0, 1085, 4255, 13205, float('inf')]
labels = ['Low Income', 'Lower-middle Income', 'Upper-middle Income', 'High Income']

immigration['Sending_incomegroup'] = pd.cut(immigration['Sending_gdppc'], bins=bins, labels=labels, right=False)
immigration.head(3)
Out[48]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup
0 Afghanistan 53037 AF AFG 372.00000 NaN Low Income
1 Albania 131294 AL ALB 15491.96100 0.293161 High Income
2 Algeria 58852 DZ DZA 11198.23348 NaN Upper-middle Income
InΒ [49]:
# Fill missing Sending_mpm values

#Calculate average mpm for each income group
mpm_avg = immigration.groupby('Sending_incomegroup')['Sending_mpm'].transform('mean')

# Fill missing Sending_mpm values with the average values based on Income_group
immigration['Sending_mpm'] = immigration['Sending_mpm'].fillna(mpm_avg)

immigration.head(3)
/tmp/ipykernel_145/1772377715.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  mpm_avg = immigration.groupby('Sending_incomegroup')['Sending_mpm'].transform('mean')
Out[49]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup
0 Afghanistan 53037 AF AFG 372.00000 68.499527 Low Income
1 Albania 131294 AL ALB 15491.96100 0.293161 High Income
2 Algeria 58852 DZ DZA 11198.23348 12.653047 Upper-middle Income

Merge with Conflict Variables: Fatalities in Battles, Riots, and Violence against CiviliansΒΆ

The ACLED Project (Armed Conflict Location and Event Data) provides various measures of armed conflict, and the datasets on battles, riots and conflict against civilians will be imported to our analysis to measure the effect of conflict on the migration flows. The datasets are based on incidents or events. As a result there are multiple observations for countries, and we will calculate the total number of casualties in each country.

Since ACLED datasets include numeric iso3 codes, this project will use UN country codes datasets which include numeric iso3 codes.

As shown below with a total casualty of 13,414 Ukraine tops the list of fatalities in battles, becuase of Russian invasion. Myanmar comes the second, becuase of recent coup d'etat. And the countries with civil war follows these two countries. Similar tables are presented below for casualties in riots and violence against civilians.

If a country has a NA value for any of these conflict variables, this means that that country didnot have any casualties because of battles, riots, and violence. Hence we can simply fill the NA values with 0.

The literature shows that the battles are much more important determinant of migration outflows than riots and violence towards civilians. To take this into consideration a new casualties variable will be created by weighting the existing three variables (battle fatalities 70%, and riot and vilence fatalities 15% each).

InΒ [50]:
# Numeric iso3 codes
immigration = immigration.merge(country_codes3_[['Iso3', 'Code']], how='left', left_on='Sending_iso3', right_on='Iso3')
immigration = immigration.drop(columns=['Iso3'])
immigration.rename(columns={'Code': 'Sending_iso3_num'}, inplace=True)
immigration
Out[50]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0
... ... ... ... ... ... ... ... ...
162 Viet Nam 22106 VN VNM 11396.531300 1.166175 Upper-middle Income 704.0
163 Western Sahara 0 EH ESH 2500.000000 46.103936 Lower-middle Income 732.0
164 Yemen 8918 YE YEM 1017.000000 35.411829 Low Income 887.0
165 Zambia 698 ZM ZMB 3365.873780 66.403395 Lower-middle Income 894.0
166 Zimbabwe 1610 ZW ZWE 2207.957033 42.397930 Lower-middle Income 716.0

167 rows Γ— 8 columns

InΒ [51]:
# Countries with the most fatalities in battles.
battle1 = battle.groupby('country').agg({
    'fatalities': 'sum',
    'iso': 'first'
}).reset_index()

battle1.sort_values(by='fatalities', ascending=False)[:10]
Out[51]:
country fatalities iso
95 Ukraine 13414 804
65 Myanmar 12901 104
69 Nigeria 5274 566
81 Somalia 4498 706
33 Ethiopia 3618 231
98 Yemen 3576 887
13 Brazil 3008 76
26 Democratic Republic of Congo 2919 180
86 Syria 2599 760
0 Afghanistan 2424 4
InΒ [52]:
# Merge with battle1 df
immigration = immigration.merge(battle1, how='left', left_on='Sending_iso3_num', right_on='iso')
immigration = immigration.drop(columns=['country', 'iso']) 
immigration = immigration.rename(columns={'fatalities': 'battle_fatalities'})
immigration
Out[52]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num battle_fatalities
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 2424.0
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 NaN
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 19.0
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 48.0
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 NaN
... ... ... ... ... ... ... ... ... ...
162 Viet Nam 22106 VN VNM 11396.531300 1.166175 Upper-middle Income 704.0 NaN
163 Western Sahara 0 EH ESH 2500.000000 46.103936 Lower-middle Income 732.0 NaN
164 Yemen 8918 YE YEM 1017.000000 35.411829 Low Income 887.0 3576.0
165 Zambia 698 ZM ZMB 3365.873780 66.403395 Lower-middle Income 894.0 NaN
166 Zimbabwe 1610 ZW ZWE 2207.957033 42.397930 Lower-middle Income 716.0 0.0

167 rows Γ— 9 columns

InΒ [53]:
# Merge riot casualties
riot1 = riot.groupby('country').agg({'fatalities': 'sum','iso': 'first'}).reset_index()
print(riot1.sort_values(by='fatalities', ascending=False)[:10])

immigration = immigration.merge(riot1, how='left', left_on='Sending_iso3_num', right_on='iso')
immigration = immigration.drop(columns=['country', 'iso']) 
immigration = immigration.rename(columns={'fatalities': 'riot_fatalities'})
immigration
      
                          country  fatalities  iso
65                           Iran         430  364
36   Democratic Republic of Congo         244  180
63                          India         214  356
75                          Kenya         211  404
106                       Nigeria         208  566
74                     Kazakhstan         185  398
136                  South Africa         169  710
10                     Bangladesh         151   50
64                      Indonesia         147  360
110                      Pakistan         142  586
Out[53]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num battle_fatalities riot_fatalities
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 2424.0 24.0
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 NaN 0.0
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 19.0 0.0
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 48.0 34.0
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 NaN NaN
... ... ... ... ... ... ... ... ... ... ...
162 Viet Nam 22106 VN VNM 11396.531300 1.166175 Upper-middle Income 704.0 NaN NaN
163 Western Sahara 0 EH ESH 2500.000000 46.103936 Lower-middle Income 732.0 NaN NaN
164 Yemen 8918 YE YEM 1017.000000 35.411829 Low Income 887.0 3576.0 0.0
165 Zambia 698 ZM ZMB 3365.873780 66.403395 Lower-middle Income 894.0 NaN 8.0
166 Zimbabwe 1610 ZW ZWE 2207.957033 42.397930 Lower-middle Income 716.0 0.0 27.0

167 rows Γ— 10 columns

InΒ [54]:
# Merge Violence casualties
violence1 = violence.groupby('country').agg({'fatalities': 'sum','iso': 'first'}).reset_index()
print(violence1.sort_values(by='fatalities', ascending=False)[:10])

immigration = immigration.merge(violence1, how='left', left_on='Sending_iso3_num', right_on='iso')
immigration = immigration.drop(columns=['country', 'iso']) 
immigration = immigration.rename(columns={'fatalities': 'violence_fatalities'})
immigration
                          country  fatalities  iso
80                         Mexico        6561  484
16                         Brazil        4034   76
90                        Nigeria        3701  566
31   Democratic Republic of Congo        3046  180
39                       Ethiopia        2614  231
84                        Myanmar        2188  104
76                           Mali        2151  466
26                       Colombia        1680  170
134                       Ukraine        1348  804
17                   Burkina Faso        1177  854
Out[54]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num battle_fatalities riot_fatalities violence_fatalities
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 2424.0 24.0 741.0
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 NaN 0.0 0.0
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 19.0 0.0 8.0
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 48.0 34.0 22.0
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
162 Viet Nam 22106 VN VNM 11396.531300 1.166175 Upper-middle Income 704.0 NaN NaN NaN
163 Western Sahara 0 EH ESH 2500.000000 46.103936 Lower-middle Income 732.0 NaN NaN NaN
164 Yemen 8918 YE YEM 1017.000000 35.411829 Low Income 887.0 3576.0 0.0 294.0
165 Zambia 698 ZM ZMB 3365.873780 66.403395 Lower-middle Income 894.0 NaN 8.0 3.0
166 Zimbabwe 1610 ZW ZWE 2207.957033 42.397930 Lower-middle Income 716.0 0.0 27.0 6.0

167 rows Γ— 11 columns

InΒ [55]:
# Fill all NAs with 0
immigration[['battle_fatalities', 'riot_fatalities', 'violence_fatalities']] = immigration[['battle_fatalities', 'riot_fatalities', 'violence_fatalities']].fillna(0)
immigration
Out[55]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num battle_fatalities riot_fatalities violence_fatalities
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 2424.0 24.0 741.0
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 0.0 0.0 0.0
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 19.0 0.0 8.0
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 48.0 34.0 22.0
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
162 Viet Nam 22106 VN VNM 11396.531300 1.166175 Upper-middle Income 704.0 0.0 0.0 0.0
163 Western Sahara 0 EH ESH 2500.000000 46.103936 Lower-middle Income 732.0 0.0 0.0 0.0
164 Yemen 8918 YE YEM 1017.000000 35.411829 Low Income 887.0 3576.0 0.0 294.0
165 Zambia 698 ZM ZMB 3365.873780 66.403395 Lower-middle Income 894.0 0.0 8.0 3.0
166 Zimbabwe 1610 ZW ZWE 2207.957033 42.397930 Lower-middle Income 716.0 0.0 27.0 6.0

167 rows Γ— 11 columns

InΒ [56]:
# Create conflict_casualties
immigration['Sending_conflict_casualties'] = (immigration['battle_fatalities'] * 0.70) + (immigration['riot_fatalities']*0.15) + (immigration['violence_fatalities']*0.15)
immigration = immigration.drop(columns=['battle_fatalities', 'riot_fatalities', 'violence_fatalities']) 
immigration
Out[56]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num Sending_conflict_casualties
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 1811.55
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 0.00
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 14.50
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 42.00
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 0.00
... ... ... ... ... ... ... ... ... ...
162 Viet Nam 22106 VN VNM 11396.531300 1.166175 Upper-middle Income 704.0 0.00
163 Western Sahara 0 EH ESH 2500.000000 46.103936 Lower-middle Income 732.0 0.00
164 Yemen 8918 YE YEM 1017.000000 35.411829 Low Income 887.0 2547.30
165 Zambia 698 ZM ZMB 3365.873780 66.403395 Lower-middle Income 894.0 1.65
166 Zimbabwe 1610 ZW ZWE 2207.957033 42.397930 Lower-middle Income 716.0 4.95

167 rows Γ— 9 columns

Merge with Political Indicators: WB Governance IndicatorsΒΆ

WB Governance Indicators dataset simply has measures of "Political Stability and Absence of Violence", "Voice and Accountability", "Control of Corruption", "Regulatory Quality", "Government Effectiveness", and "Rule of Law".

This dataset mainly has one column for country, and "Series Name" column for (1) estimates and (2) percentiles of above listed measures. This means that all 12 variables (6 estimates and 6 percentiles) are under "Series Name", and there are 12 total observation for each country. To tidy data, pivot_table function is used to spread these 12 variables to columns.

This dataset includes both estimates and percentile rankings. For the sake of simplicity, one single governance quality variable will be calculated only using the estimates variables.

The estimates range between -3.5 and +3.5. So as to make all of the estimates higher than 0, each estimate will be added 3.5 points. After this, all of the 6 estimates will be summed into a new Sending_govern variable as 1 single governance indicator of sending countries.

Again, the missing governance score variable will be filled with the average of the income group

InΒ [57]:
# What kind of measures does WB Governance Indicators dataset have?
display(govern.head())
govern["Series Name"].unique()
Country Name Country Code Series Name Series Code 2022 [YR2022]
0 Afghanistan AFG Political Stability and Absence of Violence/Te... PV.EST -2.550801754
1 Afghanistan AFG Voice and Accountability: Estimate VA.EST -1.751587272
2 Korea, Dem. People's Rep. PRK Voice and Accountability: Percentile Rank VA.PER.RNK 0
3 Afghanistan AFG Control of Corruption: Estimate CC.EST -1.183776498
4 Korea, Dem. People's Rep. PRK Regulatory Quality: Percentile Rank RQ.PER.RNK 0
Out[57]:
array(['Political Stability and Absence of Violence/Terrorism: Estimate',
       'Voice and Accountability: Estimate',
       'Voice and Accountability: Percentile Rank',
       'Control of Corruption: Estimate',
       'Regulatory Quality: Percentile Rank',
       'Government Effectiveness: Estimate',
       'Rule of Law: Percentile Rank',
       'Control of Corruption: Percentile Rank',
       'Regulatory Quality: Estimate',
       'Government Effectiveness: Percentile Rank',
       'Rule of Law: Estimate',
       'Political Stability and Absence of Violence/Terrorism: Percentile Rank',
       nan], dtype=object)
InΒ [58]:
# Tidy data using pivot_table function
govern['2022 [YR2022]'] = pd.to_numeric(govern['2022 [YR2022]'], errors='coerce')
govern_p = govern.pivot_table(index='Country Name', columns='Series Name', values='2022 [YR2022]').reset_index()
govern_p.head()
Out[58]:
Series Name Country Name Control of Corruption: Estimate Control of Corruption: Percentile Rank Government Effectiveness: Estimate Government Effectiveness: Percentile Rank Political Stability and Absence of Violence/Terrorism: Estimate Political Stability and Absence of Violence/Terrorism: Percentile Rank Regulatory Quality: Estimate Regulatory Quality: Percentile Rank Rule of Law: Estimate Rule of Law: Percentile Rank Voice and Accountability: Estimate Voice and Accountability: Percentile Rank
0 Afghanistan -1.183776 12.264151 -1.879552 1.886792 -2.550802 0.471698 -1.271806 8.962264 -1.658442 5.188679 -1.751587 2.415459
1 Albania -0.407876 38.679245 0.065063 56.603775 0.114945 50.471699 0.159354 57.547169 -0.165779 47.169811 0.139466 52.173912
2 Algeria -0.637930 28.301888 -0.513090 32.547169 -0.741772 19.339623 -1.063573 14.150944 -0.832473 22.641510 -1.003874 21.739130
3 American Samoa 1.270204 88.679245 0.667918 74.528305 1.128859 91.037735 0.545900 70.754715 1.221118 86.320755 0.957648 77.294685
4 Andorra 1.270204 88.679245 1.495305 92.452827 1.587736 98.584908 1.398334 90.094337 1.485450 90.566040 1.102833 85.507248
InΒ [59]:
# Add 3.5 to all estimates so that estimates will be above 0.
govern_p[['Control of Corruption: Estimate', 'Government Effectiveness: Estimate', 'Political Stability and Absence of Violence/Terrorism: Estimate', 'Regulatory Quality: Estimate', 'Rule of Law: Estimate',	'Voice and Accountability: Estimate']] += 3.5
InΒ [60]:
# Calculate one governance variable by simply summing up all of the estimate variables
govern_p['Sending_govern'] = (govern_p['Control of Corruption: Estimate']) + (govern_p['Government Effectiveness: Estimate']) + (govern_p['Political Stability and Absence of Violence/Terrorism: Estimate']) + (govern_p['Regulatory Quality: Estimate']) + (govern_p['Rule of Law: Estimate']) + (govern_p['Voice and Accountability: Estimate'])

govern_p.head()
Out[60]:
Series Name Country Name Control of Corruption: Estimate Control of Corruption: Percentile Rank Government Effectiveness: Estimate Government Effectiveness: Percentile Rank Political Stability and Absence of Violence/Terrorism: Estimate Political Stability and Absence of Violence/Terrorism: Percentile Rank Regulatory Quality: Estimate Regulatory Quality: Percentile Rank Rule of Law: Estimate Rule of Law: Percentile Rank Voice and Accountability: Estimate Voice and Accountability: Percentile Rank Sending_govern
0 Afghanistan 2.316224 12.264151 1.620448 1.886792 0.949198 0.471698 2.228194 8.962264 1.841558 5.188679 1.748413 2.415459 10.704034
1 Albania 3.092124 38.679245 3.565063 56.603775 3.614945 50.471699 3.659354 57.547169 3.334221 47.169811 3.639466 52.173912 20.905173
2 Algeria 2.862070 28.301888 2.986910 32.547169 2.758228 19.339623 2.436427 14.150944 2.667527 22.641510 2.496126 21.739130 16.207288
3 American Samoa 4.770204 88.679245 4.167918 74.528305 4.628859 91.037735 4.045900 70.754715 4.721118 86.320755 4.457648 77.294685 26.791646
4 Andorra 4.770204 88.679245 4.995305 92.452827 5.087736 98.584908 4.898334 90.094337 4.985450 90.566040 4.602833 85.507248 29.339862
InΒ [61]:
# Merge govern_p with Immigration Dataset and bring governance indicator.
immigration = pd.merge(immigration, govern_p[['Country Name', 'Sending_govern']], left_on='Sending_Country', right_on='Country Name', how='left')
immigration.drop(columns=['Country Name'], inplace=True)

immigration.head(3)
Out[61]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num Sending_conflict_casualties Sending_govern
0 Afghanistan 53037 AF AFG 372.00000 68.499527 Low Income 4.0 1811.55 10.704034
1 Albania 131294 AL ALB 15491.96100 0.293161 High Income 8.0 0.00 20.905173
2 Algeria 58852 DZ DZA 11198.23348 12.653047 Upper-middle Income 12.0 14.50 16.207288
InΒ [62]:
# Fill missing Sending_mpm values

#Calculate average mpm for each income group
govern_average = immigration.groupby('Sending_incomegroup')['Sending_govern'].transform('mean')

# Fill missing Sending_mpm values with the average values based on Income_group
immigration['Sending_govern'].fillna(govern_average, inplace=True)

immigration.head()
/tmp/ipykernel_145/784547835.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  govern_average = immigration.groupby('Sending_incomegroup')['Sending_govern'].transform('mean')
Out[62]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num Sending_conflict_casualties Sending_govern
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 1811.55 10.704034
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 0.00 20.905173
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 14.50 16.207288
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 42.00 16.285654
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 0.00 23.496961

Merge with Climate Indicator: German Watch Climate Risk IndexΒΆ

Even though German Watch's Climate Risk Index is the most comprehensive climate risk dataset, there are still many countries with a missing climate risk value. Even the social, political, and economic factors are influential in explaining climate risk, such as a low state capacity would increase the climate risk a country face, still climate risks are far less effected by these factors. Netherlands, as a highly developed country, has a climate risk.

Before of this reason, the missing values will simply be filled with the average score so as not to miss a big chunk of observations.

InΒ [63]:
# Merge govern_p with Immigration Dataset and bring governance indicator.
immigration = pd.merge(immigration, climate[['Country', 'CRI\rscore']], left_on='Sending_Country', right_on='Country', how='left')
immigration.drop(columns=['Country'], inplace=True)

# Rename 'CRI/rscore' Column with 'Sending_cri'
immigration.rename(columns={'CRI\rscore': 'Sending_cri'}, inplace=True)

immigration
Out[63]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num Sending_conflict_casualties Sending_govern Sending_cri
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 1811.55 10.704034 NaN
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 0.00 20.905173 108.00
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 14.50 16.207288 93.83
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 42.00 16.285654 76.00
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 0.00 23.496961 125.00
... ... ... ... ... ... ... ... ... ... ... ...
162 Viet Nam 22106 VN VNM 11396.531300 1.166175 Upper-middle Income 704.0 0.00 18.866787 NaN
163 Western Sahara 0 EH ESH 2500.000000 46.103936 Lower-middle Income 732.0 0.00 16.658763 NaN
164 Yemen 8918 YE YEM 1017.000000 35.411829 Low Income 887.0 2547.30 11.075789 NaN
165 Zambia 698 ZM ZMB 3365.873780 66.403395 Lower-middle Income 894.0 1.65 18.777528 125.00
166 Zimbabwe 1610 ZW ZWE 2207.957033 42.397930 Lower-middle Income 716.0 4.95 13.841291 114.50

167 rows Γ— 11 columns

InΒ [64]:
# Missing values
missing_cri = immigration[immigration['Sending_cri'].isna()]

display(missing_cri['Sending_Country'].unique())

display(missing_cri.groupby('Sending_Country')['Flow'].sum().sort_values(ascending=False))

display(missing_cri['Flow'].sum())
array(['Afghanistan', 'Bahamas', 'Congo',
       'Congo, the Democratic Republic of the', 'Cuba',
       'Equatorial Guinea', 'Gambia', 'Iran', 'North Korea', 'Kyrgyzstan',
       "Lao People's Democratic Republic", 'Macedonia',
       'Micronesia, Federated States of', 'Moldova, Republic of', 'Nauru',
       'Non-Citizens', 'Palau', 'Palestine, State of',
       'Russian Federation', 'Saint Kitts and Nevis', 'Saint Lucia',
       'Saint Vincent and the Grenadines', 'Sao Tome and Principe',
       'Somalia', 'Stateless', 'Swaziland', 'Syria', 'Taiwan',
       'Timor-Leste', 'Turkmenistan', 'Unkown', 'Viet Nam',
       'Western Sahara', 'Yemen'], dtype=object)
Sending_Country
Russian Federation                       196630
Syria                                    138876
Cuba                                      82456
Afghanistan                               53037
Iran                                      43182
Moldova, Republic of                      40375
Unkown                                    35760
Macedonia                                 28024
Gambia                                    22640
Viet Nam                                  22106
Somalia                                   14831
Stateless                                 11142
Equatorial Guinea                         10398
Kyrgyzstan                                 9474
Yemen                                      8918
Congo, the Democratic Republic of the      6594
Non-Citizens                               4090
Taiwan                                     3608
Palestine, State of                        2424
Turkmenistan                               1849
Congo                                      1426
Lao People's Democratic Republic            328
Saint Kitts and Nevis                       108
North Korea                                  78
Sao Tome and Principe                        68
Saint Lucia                                  62
Swaziland                                    52
Timor-Leste                                  42
Bahamas                                      34
Saint Vincent and the Grenadines             30
Palau                                         4
Nauru                                         0
Western Sahara                                0
Micronesia, Federated States of               0
Name: Flow, dtype: int64
738646
InΒ [65]:
# Replace the missing values under Sending_cri column with the average of the column

# The average of the 'Sending_gdppc' column
average_cri = immigration['Sending_cri'].mean()

# Fill missing values in the 'Sending_gdppc' column with the calculated average
immigration['Sending_cri'].fillna(average_cri, inplace=True)
immigration
Out[65]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num Sending_conflict_casualties Sending_govern Sending_cri
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 1811.55 10.704034 82.451203
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 0.00 20.905173 108.000000
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 14.50 16.207288 93.830000
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 42.00 16.285654 76.000000
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 0.00 23.496961 125.000000
... ... ... ... ... ... ... ... ... ... ... ...
162 Viet Nam 22106 VN VNM 11396.531300 1.166175 Upper-middle Income 704.0 0.00 18.866787 82.451203
163 Western Sahara 0 EH ESH 2500.000000 46.103936 Lower-middle Income 732.0 0.00 16.658763 82.451203
164 Yemen 8918 YE YEM 1017.000000 35.411829 Low Income 887.0 2547.30 11.075789 82.451203
165 Zambia 698 ZM ZMB 3365.873780 66.403395 Lower-middle Income 894.0 1.65 18.777528 125.000000
166 Zimbabwe 1610 ZW ZWE 2207.957033 42.397930 Lower-middle Income 716.0 4.95 13.841291 114.500000

167 rows Γ— 11 columns

Population IndicatorΒΆ

The population of a country is a natural determinant of total flow of migrants. Therefore, this project will import population information from CIA.

Some countries have missing population information, however this is simply because of mismatch between the columns (such as "Syria" vs "Syrian Arab Republic"). We will manually fill these NaN values using the CIA population dataset.

For the Stateless, Unkown, and Recognized Non-Citizen observations, we will simply use the average population.

InΒ [66]:
# Merge population with Immigration Dataset and bring governance indicator.
#immigration = immigration.drop(columns=['Sending_pop', 'value'])
immigration = pd.merge(immigration, population[['name', 'value']], left_on='Sending_Country', right_on='name', how='left')
immigration.drop(columns=['name'], inplace=True)

immigration.head()
Out[66]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num Sending_conflict_casualties Sending_govern Sending_cri value
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 1811.55 10.704034 82.451203 38,346,720
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 0.00 20.905173 108.000000 3,095,344
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 14.50 16.207288 93.830000 44,178,884
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 42.00 16.285654 76.000000 34,795,287
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 0.00 23.496961 125.000000 100,335
InΒ [67]:
# Rename 'value' Column with 'Sending_pop'
immigration.rename(columns={'value': 'Sending_pop'}, inplace=True)
immigration.head()
Out[67]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num Sending_conflict_casualties Sending_govern Sending_cri Sending_pop
0 Afghanistan 53037 AF AFG 372.000000 68.499527 Low Income 4.0 1811.55 10.704034 82.451203 38,346,720
1 Albania 131294 AL ALB 15491.961000 0.293161 High Income 8.0 0.00 20.905173 108.000000 3,095,344
2 Algeria 58852 DZ DZA 11198.233480 12.653047 Upper-middle Income 12.0 14.50 16.207288 93.830000 44,178,884
3 Angola 828 AO AGO 5906.115677 47.203606 Upper-middle Income 24.0 42.00 16.285654 76.000000 34,795,287
4 Antigua and Barbuda 42 AG ATG 22321.870020 2.721764 High Income 28.0 0.00 23.496961 125.000000 100,335
InΒ [71]:
# 10 countries with missing population information. This is because of name mismatch.
print(immigration['Sending_pop'].isna().sum())
immigration[immigration['Sending_pop'].isna()]
10
Out[71]:
Sending_Country Flow Sending_iso2 Sending_iso3 Sending_gdppc Sending_mpm Sending_incomegroup Sending_iso3_num Sending_conflict_casualties Sending_govern Sending_cri Sending_pop
17 Bolivia 36624 BO BOL 8244.235658 4.539775 Upper-middle Income 68.0 2.20 16.553992 63.500000 NaN
38 CΓ΄te d'Ivoire 8292 CI CIV 5537.369758 37.273455 Upper-middle Income 384.0 21.05 18.866787 89.500000 NaN
63 Iran 43182 IR IRN 15461.079340 1.027940 High Income 364.0 136.15 22.317889 82.451203 NaN
72 North Korea 78 KP PRK 1217.000000 46.103936 Lower-middle Income 408.0 5.10 16.658763 82.451203 NaN
81 Macedonia 28024 MK MKD 17128.642860 3.205135 High Income 807.0 0.00 22.317889 82.451203 NaN
105 Non-Citizens 4090 RNC XXX 15549.957897 2.721764 High Income NaN 0.00 22.317889 82.451203 NaN
140 Syria 138876 SY SYR 752.000000 68.499527 Low Income 760.0 1945.75 11.075789 82.451203 NaN
141 Taiwan 3608 TW TWN 32716.000000 0.061490 High Income 158.0 0.00 22.317889 82.451203 NaN
143 Tanzania 1536 TZ TZA 2623.861572 54.589677 Lower-middle Income 834.0 9.15 18.300585 69.830000 NaN
161 Venezuela 340468 VE VEN 3420.000000 46.103936 Lower-middle Income 862.0 445.85 16.658763 104.170000 NaN
InΒ [72]:
# Fill the missing population information manually
missing_pop = {
    'Bolivia': 12311974,
    "CΓ΄te d'Ivoire": 29981758,
    'Iran': 88386937,
    'North Korea': 26298666,
    'Macedonia': 2135622,
    'Syria': 23865423,
    'Taiwan': 23595274,
    'Tanzania': 67462121,
    'Venezuela': 31250306
}

# Fill missing GDP per capita values based on Sending_Country
immigration['Sending_pop'] = immigration.apply(
    lambda row: missing_pop[row['Sending_Country']] if pd.isna(row['Sending_pop']) and row['Sending_Country'] in missing_pop else row['Sending_pop'],
    axis=1
)

# Check the remaining missing countries
print(immigration[immigration['Sending_pop'].isna()])
    Sending_Country  Flow Sending_iso2 Sending_iso3  Sending_gdppc  \
105    Non-Citizens  4090          RNC          XXX   15549.957897   

     Sending_mpm Sending_incomegroup  Sending_iso3_num  \
105     2.721764         High Income               NaN   

     Sending_conflict_casualties  Sending_govern  Sending_cri Sending_pop  
105                          0.0       22.317889    82.451203         NaN  
InΒ [75]:
# Calculate the average of Sending_pop
immigration['Sending_pop'] = pd.to_numeric(immigration['Sending_pop'], errors='coerce')
average_pop = immigration['Sending_pop'].mean()

# Fill missing Sending_pop values with the average
immigration['Sending_pop'] = immigration['Sending_pop'].fillna(average_pop)


# Missing values
print(immigration['Sending_pop'].isna().sum())
0

Statistical ModellingΒΆ

As indicated above, this research will use a multiple linear regression model to predict the migration flows to Europe, based on various economic, political, conflict, climate-related, and population variables.

Furthermore, a K-NN regressor model will be built to predict the migration flows. Later on, the model's performance will be evaluated.

First of all, to see the association between the outcome variable and the explanatory variables, and also to see how each variable is correlated with each other we will present individual scatterplots and also a correlation heatmap.

As it can be seen from the scatterplot, the conflict casualties has the steepest regression line with the a p-value lower than the threshold value.

At the correlation matrix heatmap, conflict fatalities exhibit a positive correlation with the migration flow compared to other variables.

Moreover, in line with conventional wisdom, GDP/PC demonstrates a negative correlation with the poverty measure and a positive correlation with governance indicators.

Consequently, it would be reasonable to hypothesize that an increase in fatalities resulting from civil and armed conflict would lead to an increase in migration outflows.

InΒ [80]:
# Individual scatter plots
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

observations = ['Sending_gdppc', 'Sending_mpm', 'Sending_conflict_casualties', 'Sending_govern', 'Sending_cri', 'Sending_pop']

def regress_with_stats(immigration, observations):
    fig, ax = plt.subplots(3, 2, figsize=(20, 10), sharex=False)
    ax = ax.ravel() 
    
    for i, o in enumerate(observations):
        slope, intercept, r_value, p_value, std_err = stats.linregress(
            immigration[o],
            immigration['Flow']
        )
        # A title with statistics
        diag_str = (
            f"p-value={p_value:.1g}\n"
            f"r-value={r_value:.3f}\n"
            f"std err={std_err:.3f}\n"
            f"slope={slope:.3f}\n"
            f"intercept={intercept:.3f}"
        )
        
        # Scatter plot with regression line
        immigration.plot.scatter(x=o, y='Flow', title=diag_str, ax=ax[i])
        pts = np.linspace(immigration[o].min(), immigration[o].max(), 500)
        line = slope * pts + intercept
        ax[i].plot(pts, line, lw=1, color='red')

    for i in range(len(observations), len(ax)):
        fig.delaxes(ax[i])
    
    plt.tight_layout()
    plt.show()

regress_with_stats(immigration, observations)
No description has been provided for this image
InΒ [83]:
import seaborn as sns
# Coerce variables into numeric values
immigration['Flow'] = pd.to_numeric(immigration['Flow'], errors='coerce')
immigration['Sending_gdppc'] = pd.to_numeric(immigration['Sending_gdppc'], errors='coerce')
immigration['Sending_mpm'] = pd.to_numeric(immigration['Sending_mpm'], errors='coerce')
immigration['Sending_conflict_casualties'] = pd.to_numeric(immigration['Sending_conflict_casualties'], errors='coerce')
immigration['Sending_govern'] = pd.to_numeric(immigration['Sending_govern'], errors='coerce')
immigration['Sending_cri'] = pd.to_numeric(immigration['Sending_cri'], errors='coerce')
immigration['Sending_pop'] = pd.to_numeric(immigration['Sending_pop'], errors='coerce')

# Correlation Heatmap
variables = ['Flow', 'Sending_gdppc', 'Sending_mpm', 'Sending_conflict_casualties', 'Sending_govern', 'Sending_cri', 'Sending_pop']

# Correlation Matrix
corr_matrix = immigration[variables].corr()

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm_r', fmt=".2f")
plt.title('Correlation Matrix of Selected Columns in Immigration Dataset')
plt.show()
No description has been provided for this image

The results of the multilinear regression model are outlined below.

To begin, the adjusted R-squared value of 0.368 indicates that our model accounts almost for 37% of the variation in the outcome variable, which is considered fair but not optimal.

Among the indicator variables, Sending_conflict_casualties is the only one with a p-value less than 0.5, indicating that this variable has a statistically significant effect on the outcome variable. Specifically, a one unit increase in conflict fatalities leads to a migration increase of 115 individuals to EU countries.

InΒ [85]:
# Multilinear Regression Model
import statsmodels.api as sm


# X and Y variables
predictors = ['Sending_gdppc', 'Sending_mpm', 'Sending_conflict_casualties', 'Sending_govern', 'Sending_cri', 'Sending_pop']
outcome = 'Flow'

# constant term
X = sm.add_constant(immigration[predictors])

# the regression model
model = sm.OLS(immigration[outcome], X)
results = model.fit()

# Print the summary of the regression results
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Flow   R-squared:                       0.368
Model:                            OLS   Adj. R-squared:                  0.344
Method:                 Least Squares   F-statistic:                     15.51
Date:                Sun, 08 Dec 2024   Prob (F-statistic):           5.54e-14
Time:                        23:13:44   Log-Likelihood:                -2256.0
No. Observations:                 167   AIC:                             4526.
Df Residuals:                     160   BIC:                             4548.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                       -6.987e+04   1.25e+05     -0.559      0.577   -3.17e+05    1.77e+05
Sending_gdppc                  -1.1049      1.055     -1.048      0.296      -3.187       0.978
Sending_mpm                 -1356.6816    713.712     -1.901      0.059   -2766.193      52.830
Sending_conflict_casualties   115.7646     12.551      9.223      0.000      90.977     140.552
Sending_govern               4533.4749   4510.002      1.005      0.316   -4373.335    1.34e+04
Sending_cri                   508.7402    447.709      1.136      0.258    -375.441    1392.921
Sending_pop                  -9.46e-05      0.002     -0.040      0.968      -0.005       0.005
==============================================================================
Omnibus:                      166.355   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            12738.316
Skew:                           3.093   Prob(JB):                         0.00
Kurtosis:                      45.337   Cond. No.                     3.06e+08
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.06e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

There are various methods to assess the performance of our model. We will begin by conducting a residual analysis, which involves examining the differences between the actual and predicted y values.

Based on the plots below, our model appears to satisfy the normality and autocorrelation assumptions. However, it is possible that the model may be affected by heteroscedasticity, as the residuals do not appear to be randomly scattered.

InΒ [86]:
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.stattools as st
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate Residuals
predicted_values = results.predict()
residuals = immigration['Flow'] - predicted_values

# Check for Linearity
plt.scatter(predicted_values, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted Values')
plt.show()

# Check for Homoscedasticity
plt.scatter(predicted_values, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted Values')
plt.axhline(y=0, color='r', linestyle='-')
plt.show()

# Normality of Residuals
sns.histplot(residuals, kde=True)
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

# Check for Autocorrelation
# Durbin-Watson statistic using DurbinWatson function
dw_statistic = st.durbin_watson(residuals)
print("Durbin-Watson Statistic:", dw_statistic)

# You can also plot autocorrelation function (ACF) of residuals if needed
sm.graphics.tsa.plot_acf(residuals)
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.title('Autocorrelation Function (ACF) of Residuals')
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Durbin-Watson Statistic: 1.982986745408299
No description has been provided for this image

ConclusionΒΆ

This project aims to investigate migration flows to the EU and the key determinants of these flows.

Initially, in the exploratory analysis, migration flows by sending and receiving countries, as well as by gender and age brackets, were analyzed. The migrants coming to EU countries were evenly split between males and females, with more than half of the arriving migrants considered to be young. Notably, Ukraine emerged as the primary sending country, with its invasion effect significantly shaping migration patterns.

In the subsequent section, this research statistically analyzed the influence of various factors on migration flows, including GDP per capita, poverty, social and armed conflicts, governance, climate risks, and population. The results of the regression analysis highlighted the statistically significant effects of poverty measures and conflict variables on migration flows.

Further, the project aimed to incorporate additional indicators into the migration dataset (immig_noneu27), encompassing socio-economic and political variables such as GDP per capita, the Multidimensional Poverty Measure, Political Stability Indicators, Gender Inequality Index, Conflict Indicators (Battles, Riots, and Violence against Civilians), Climate Risk Index, and Population.

Subsequently, the project's focus shifted towards constructing a statistical model to analyze the indicators which have a statistical significant effect on migration flows to EU27 countries.

For future studies, considerations such as geographical proximity and common language and culture could be operationalized and integrated into the model to enhance its predictive capacity.