Machine Learning - Classification problem

A data classification problem using the scikit-learn library.

This experiment was split into two posts/notebooks because the text, code and outputs are extensive.

To complement this experiment, I wish to propose a real classification prediction scenario for this problem and, in the near future, a minimum viable product for machine-learning-based data analysis.

This first post covers Data Analysis, and the second one << yet to come >> covers Selecting the Model for Prediction.

Data Analysis

Right after loading the data, you have to analyse it, then clean, merge and transform it to make it machine-learning friendly.

Check every variable to get a good idea of the kind of data in the dataset. Is it numerical or categorical? Continuous or discrete?

And once the data fits the right layout, you have to analyse other aspects, like its dimensionality, correlations and repetitions, and apply further transformations so the data works well with your models.
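
As a quick first inspection, the snippet below is a minimal sketch that checks column types, missing values and basic statistics. It assumes the raw CSV has already been loaded into a pandas DataFrame named df_total, the same name used further down.


df_total.info()                           # column names, dtypes and non-null counts
print(df_total.isna().sum())              # missing values per column
print(df_total.describe(include='all'))   # basic statistics for numeric and categorical columns
print(df_total.nunique())                 # distinct values per column (hint for continuous vs discrete)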

So, let's jump into it.

Commented Code

Common data science libraries

Importing Python libraries to analyse data and try machine learning models.
  • Scikit Learn
  • NumPy
  • Pandas
  • Matplotlib
  • Seaborn

About the data

The data is publicly available through the open government initiative of the State of São Paulo, Brazil.

Disclaimer - The sole purpose of this presentation is to carry out tests with machine learning, using public and open data, to improve the use of technology for future application in the FAPESP Virtual Library.
Any results, partial information or charts presented here should not be interpreted as official information. This is just a test of machine learning technology. For official data, check the links above.

About the problem

Given a set of research grants and their features, let's try to classify them into two classes. Will a grant generate a scientific publication? Yes or no? A binary classification.

The dataset downloaded above is a subset containing research grants that ended in 2016, 2017, 2018, 2019 and 2020.


from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Silence the SettingWithCopyWarning, which is often a false positive here
pd.options.mode.chained_assignment = None

# Two files downloaded from the BV (FAPESP Virtual Library):
# - all regular grants completed in 2016, 17, 18, 19, 20
# - regular grants with scientific publications completed in 2016, 17, 18, 19, 20
aux_total = <all_research_grants>
aux_com_pub = <research_granth_with_publications>


# Clean up the data: drop unnamed columns and normalise the column names
# (spaces replaced with underscores, dots removed)
def load_and_clean(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, sep=";", header=0)
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    df.columns = df.columns.map(lambda x: x.replace(' ', '_').replace('.', ''))
    return df

df_total = load_and_clean(aux_total)
df_com_pub = load_and_clean(aux_com_pub)

# Target label com_pub: 1 if the grant number appears in the publications file, 0 otherwise
df_y = df_total.assign(com_pub=df_total.N_Processo.isin(df_com_pub.N_Processo).astype(int))

# Keep only the subset of columns (by position) used in this analysis
cols = [0, 3, 4, 9, 11, 16, 17, 21, 22, 25, 29, 37, 38, 39, 40]
df_clean = df_y[df_y.columns[cols]]

# Keep only the institution acronym, dropping the '(Brasil)' suffix
df_clean['Instituição'].replace(r'\(Brasil\)', '', regex=True, inplace=True)
df_clean['Instituição'] = df_clean['Instituição'].apply(lambda st: st[st.find("(")+1:st.find(")")])

# Convert datetime and add a new feature duration_date
df_clean['Data_de_Início'] = pd.to_datetime(df_clean['Data_de_Início'])
df_clean['Data_de_Término'] = pd.to_datetime(df_clean['Data_de_Término'])
df_clean['duration_date'] = df_clean['Data_de_Término'] - df_clean['Data_de_Início']
# Store the duration as an integer (nanoseconds); it is min-max normalized below
df_clean['duration_date'] = df_clean['duration_date'].astype(int)

# Target class - actual value counts
df_clean['com_pub'].value_counts()

One Hot Encoding

Moving the categorical data in one column into related columns with numeric values 0 or 1 (one-hot encoding), and scaling numeric features to values between zero and one (min-max normalization).

This is done to avoid distortion when comparing features with very different scales, among other reasons.

There is a lot of information about this on the Internet, from the basics of one-hot encoding to normalizing data to values between zero and one.
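
As a toy illustration of one-hot encoding (the column and values below are made up, not taken from the dataset), pd.get_dummies turns a single categorical column into one 0/1 indicator column per category:


import pandas as pd

# Hypothetical example column
area = pd.Series(['Engenharias', 'Ciências da Saúde', 'Engenharias'], name='area')
# Each category becomes its own 0/1 indicator column
print(pd.get_dummies(area, dtype=int))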


def normalize_0_1(col: pd.Series) -> pd.Series:
    # Min-max normalization to the [0, 1] range; missing values are treated as 0
    col = col.fillna(0)
    return (col - col.min()) / (col.max() - col.min())

df_normalized = df_clean.copy(deep=True)

# Pesquisadores_Associados: count of associated researchers, then normalized
df_normalized['Pesquisadores_Associados'] = df_normalized['Pesquisadores_Associados'].str.split('-').str.len()    
df_normalized['Pesquisadores_Associados'] = normalize_0_1(df_normalized['Pesquisadores_Associados'])

# Processos_Vinculados: count of linked grant processes, then normalized
df_normalized['Processos_Vinculados'] = df_normalized['Processos_Vinculados'].str.split(';').str.len()
df_normalized['Processos_Vinculados'] = normalize_0_1(df_normalized['Processos_Vinculados'])

# País_(Instituições_no_Exterior): count of foreign-institution countries, then normalized
df_normalized['País_(Instituições_no_Exterior)'] = df_normalized['País_(Instituições_no_Exterior)'].str.split(',').str.len()
df_normalized['País_(Instituições_no_Exterior)'] = normalize_0_1(df_normalized['País_(Instituições_no_Exterior)'])

# duration_date
df_normalized['duration_date'] = normalize_0_1(df_normalized['duration_date'])

# Grantee profile identifiers (Google My Citations, ResearcherID, ORCID): 1 if present, 0 if missing
df_normalized[['GoogleMyCitations_(Beneficiário)', 'ResearcherID_(Beneficiário)', 'Orcid_(Beneficiário)']] = np.where(df_normalized[['GoogleMyCitations_(Beneficiário)', 'ResearcherID_(Beneficiário)', 'Orcid_(Beneficiário)']].isnull(), 0, 1)

df_normalized

	N_Processo	Beneficiário	Instituição	Pesquisador_Responsável	Pesquisadores_Associados	Linha_de_Fomento	Grande_Área_do_Conhecimento	Data_de_Início	Data_de_Término	País_(Instituições_no_Exterior)	Processos_Vinculados	GoogleMyCitations_(Beneficiário)	ResearcherID_(Beneficiário)	Orcid_(Beneficiário)	com_pub	duration_date
0	19/14807-0	Sérgio Kurokawa	UNESP	Sérgio Kurokawa	0.043478	Auxílio à Pesquisa - Regular	Engenharias	2019-12-01	2020-11-30	0.000000	0.000000	1	1	1	1	0.134380
1	19/21108-0	Luis Carlos Uta Nakano	UNIFESP	Luis Carlos Uta Nakano	0.000000	Auxílio à Pesquisa - Regular	Ciências da Saúde	2019-12-01	2020-11-30	0.000000	0.000000	0	0	0	0	0.134380
2	19/01649-7	Eduardo José Grin	FGV	Eduardo José Grin	0.000000	Auxílio à Pesquisa - Regular	Ciências Sociais Aplicadas	2019-11-01	2020-12-31	0.000000	0.166667	1	1	1	0	0.164296
3	19/17890-5	Massimo Di Felice	USP	Massimo Di Felice	0.000000	Auxílio à Pesquisa - Regular	Ciências Sociais Aplicadas	2019-11-01	2020-10-31	0.000000	0.000000	0	1	0	0	0.134380
4	19/08972-8	Diogo Teruo Hashimoto	UNESP	Diogo Teruo Hashimoto	0.043478	Auxílio à Pesquisa - Regular	Ciências Agrárias	2019-09-01	2020-08-31	0.333333	0.000000	1	1	1	1	0.134380
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6205	11/23781-2	Rosa Maria Rodrigues Pereira	USP	Rosa Maria Rodrigues Pereira	0.000000	Auxílio à Pesquisa - Regular	Ciências da Saúde	2012-05-01	2016-01-31	0.000000	0.083333	1	1	1	1	0.627268
6206	11/51843-2	Jose Antonio Marengo Orsini	INPE	Jose Antonio Marengo Orsini	0.000000	Auxílio à Pesquisa - Regular	Ciências Exatas e da Terra	2012-04-01	2016-03-31	0.333333	0.000000	1	0	1	1	0.671408
6207	11/20435-6	Gilles Landman	UNIFESP	Gilles Landman	0.000000	Auxílio à Pesquisa - Regular	Ciências da Saúde	2012-03-01	2016-02-29	0.000000	0.000000	1	1	0	1	0.671408
6208	11/10062-8	Paula Ayako Tiba	UFABC	Paula Ayako Tiba	0.130435	Auxílio à Pesquisa - Regular	Ciências Humanas	2011-11-01	2016-02-29	0.000000	0.000000	1	1	1	1	0.730750
6209	10/52212-3	Wagner Cotroni Valenti	UNESP	Wagner Cotroni Valenti	0.000000	Auxílio à Pesquisa - Programa de Pesquisa sobr...	Ciências Agrárias	2011-03-01	2016-02-29	0.000000	0.000000	1	1	1	1	0.850907
6210 rows × 16 columns

Heat Map

After one-hot encoding and normalization, let's check the correlation between fields.

Check the good explanation below on how to read a correlation matrix:

https://www.statology.org/how-to-read-a-correlation-matrix/

This one is more about feature selection based on the correlation matrix:

https://medium.com/analytics-vidhya/feature-selection-feature-engineering-3bb09c67d8c5
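
As a complement to the heatmap, the features can also be ranked by the strength of their correlation with the target. This is a minimal sketch that assumes the df DataFrame built in the block below (normalized features plus the one-hot-encoded knowledge areas) and pandas >= 1.5 for numeric_only:


# Absolute correlation of each numeric feature with the target, strongest first
corr_with_target = df.corr(numeric_only=True)['com_pub'].abs().sort_values(ascending=False)
print(corr_with_target)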


df = df_normalized

dummy_gac = pd.get_dummies(df['Grande_Área_do_Conhecimento'])

df = pd.merge(
    left=df,
    right=dummy_gac,
    left_index=True,
    right_index=True,
)
fig, ax = plt.subplots(figsize=(15, 7))
# numeric_only (pandas >= 1.5) restricts the correlation matrix to numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='cool')
plt.show()
Heatmap chart for correlation analysis

Split Train and Test Data

Split the data into train and test samples.


X = df.drop(['N_Processo', 'Instituição' , 'Linha_de_Fomento' , \
             'Grande_Área_do_Conhecimento' , 'com_pub', \
             'Data_de_Início', 'Data_de_Término', \
             'Beneficiário', 'Pesquisador_Responsável'], axis=1)

y = df[['com_pub']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
y_train = y_train.values.ravel()

Principal component analysis (PCA)

Linear dimensionality reduction using Singular Value Decomposition

A page explaining PCA and dimensionality reduction:

https://en.wikipedia.org/wiki/Principal_component_analysis

At first glance, in the 2D plot, the points look very mixed. In the 3D plot, however, the values look more spread out when viewed with one more dimension.


from sklearn.preprocessing import StandardScaler
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 - registers the 3D projection on older matplotlib
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

scaler = StandardScaler()
scaler.fit(X)
X_Scale = scaler.transform(X)

# 2 dimensions plot
pca2 = PCA(n_components=2, random_state=2020)
principalComponents = pca2.fit_transform(X_Scale)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])

finalDf = pd.concat([principalDf, y], axis = 1)

plt.figure(figsize=(10,7))
sns.scatterplot(x = finalDf['principal component 1'],
                y = finalDf['principal component 2'],
                s=70, hue=finalDf['com_pub'],
                palette=['green', 'blue'])


# 3 dimensions plot
pca3 = PCA(n_components=3)
principalComponents = pca3.fit_transform(X_Scale)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2', 'principal component 3'])

finalDf = pd.concat([principalDf, df[['com_pub']]], axis = 1)
finalDf.head()

fig = plt.figure(figsize=(9,9))
axes = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches the axes in recent matplotlib
axes.set_title('PCA Representation', size=14)
axes.set_xlabel('PC1')
axes.set_ylabel('PC2')
axes.set_zlabel('PC3')

axes.scatter(finalDf['principal component 1'],finalDf['principal component 2'],finalDf['principal component 3'],c=finalDf['com_pub'], cmap = 'prism', s=10)

2D chart for dimensionality reduction
3D chart for dimensionality reduction
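
To gauge how much information the 2D and 3D projections actually retain, it helps to check the explained variance ratio of each component. This short check assumes the pca2 and pca3 objects fitted above:


# Fraction of the total variance captured by each principal component
print(pca2.explained_variance_ratio_, pca2.explained_variance_ratio_.sum())
print(pca3.explained_variance_ratio_, pca3.explained_variance_ratio_.sum())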

Pair plot analysis

Creating a pairplot with Seaborn

For an introduction to visualizing data with pair plots:

https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166


cols = ['duration_date', 'Orcid_(Beneficiário)', 'ResearcherID_(Beneficiário)', \
        'GoogleMyCitations_(Beneficiário)', 'Processos_Vinculados', 'País_(Instituições_no_Exterior)',\
        'Pesquisadores_Associados', 'com_pub']
df_clean = df[cols]
sns.pairplot(df_clean, kind='kde', hue='com_pub', corner=True)#, diag_kind="hist")
plt.show()
Pair plot chart

Save DataFrame to CSV

After analysing and cleaning the data, save the DataFrame to CSV, to be used in the next steps of this machine learning experiment on how to make a product that predicts data series.


save_file = <cleaned_dataset.csv>

# Drop identifier, free-text and date columns that will not be used as features
df.drop(['N_Processo', 'Instituição' , 'Linha_de_Fomento' , \
             'Grande_Área_do_Conhecimento' ,\
             'Data_de_Início', 'Data_de_Término', \
             'Beneficiário', 'Pesquisador_Responsável'], axis=1, inplace=True)

df.to_csv(save_file, sep=';')
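
In the next post, the cleaned dataset can be loaded back with the same separator. A minimal sketch, using the same save_file placeholder as above:


# to_csv wrote the index as the first column, so restore it with index_col=0
df = pd.read_csv(save_file, sep=';', index_col=0)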