Machine Learning - Classification problem

A data classification problem using the scikit-learn library.

This experiment was split into two posts/notebooks because the text, code and outputs are extensive.

To complement this experiment, I wish to propose a real classification prediction scenario for this problem and, in the near future, a minimum viable product for machine-learning-based data analysis.

This first post covers Data Analysis, and the second one << yet to come >> covers Selecting the Model for Prediction.

Data Analysis

Right after loading the data, you have to analyse it, then clean, merge and transform it to make it machine-learning friendly.

Check every variable to get a good idea of the kind of data in the dataset. Is it numerical or categorical? Continuous or discrete?

And once the data fits the right layout, you have to analyse other aspects, like its dimensionality, correlations and repetitions, and apply further transformations so the data works well with your models.
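
As a quick first inspection, the snippet below is a minimal sketch that checks column types, missing values and basic statistics. It assumes the raw CSV has already been loaded into a pandas DataFrame named df_total, the same name used further down.


df_total.info()                           # column names, dtypes and non-null counts
print(df_total.isna().sum())              # missing values per column
print(df_total.describe(include='all'))   # basic statistics for numeric and categorical columns
print(df_total.nunique())                 # distinct values per column (hint for continuous vs discrete)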

So, let's jump into it.

Commented Code

Common data science libraries

Importing Python libraries to analyse data and try machine learning models.
  • Scikit Learn
  • NumPy
  • Pandas
  • Matplotlib
  • Seaborn

About the data

The data is publicly available through the open government initiative of the State of São Paulo, Brazil.

Disclaimer - The sole purpose of this presentation is to carry out tests with machine learning, using public and open data, to improve the use of technology for future application in the FAPESP Virtual Library.
Any results, partial information or charts presented here should not be interpreted as official information. This is just a test of machine learning technology. For official data, check the links above.

About the problem

Given a set of research grants and their features, let's try to classify them into two classes. Will a grant generate a scientific publication? Yes or no? A binary classification.

The dataset downloaded above is a subset containing research grants that ended in 2016, 2017, 2018, 2019 and 2020.


from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Silence the SettingWithCopyWarning, which is often a false positive here
pd.options.mode.chained_assignment = None

# Two files downloaded from the BV (FAPESP Virtual Library):
# - all regular grants completed in 2016, 17, 18, 19, 20
# - regular grants with scientific publications completed in 2016, 17, 18, 19, 20
aux_total = <all_research_grants>
aux_com_pub = <research_granth_with_publications>


# Clean up the data: drop unnamed columns and normalise the column names
# (spaces replaced with underscores, dots removed)
def load_and_clean(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, sep=";", header=0)
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    df.columns = df.columns.map(lambda x: x.replace(' ', '_').replace('.', ''))
    return df

df_total = load_and_clean(aux_total)
df_com_pub = load_and_clean(aux_com_pub)

# Target label com_pub: 1 if the grant number appears in the publications file, 0 otherwise
df_y = df_total.assign(com_pub=df_total.N_Processo.isin(df_com_pub.N_Processo).astype(int))

# Keep only the subset of columns (by position) used in this analysis
cols = [0, 3, 4, 9, 11, 16, 17, 21, 22, 25, 29, 37, 38, 39, 40]
df_clean = df_y[df_y.columns[cols]]

# Keep only the institution acronym, dropping the '(Brasil)' suffix
df_clean['Instituição'].replace(r'\(Brasil\)', '', regex=True, inplace=True)
df_clean['Instituição'] = df_clean['Instituição'].apply(lambda st: st[st.find("(")+1:st.find(")")])

# Convert datetime and add a new feature duration_date
df_clean['Data_de_Início'] = pd.to_datetime(df_clean['Data_de_Início'])
df_clean['Data_de_Término'] = pd.to_datetime(df_clean['Data_de_Término'])
df_clean['duration_date'] = df_clean['Data_de_Término'] - df_clean['Data_de_Início']
# Store the duration as an integer (nanoseconds); it is min-max normalized below
df_clean['duration_date'] = df_clean['duration_date'].astype(int)

# Target class - actual value counts
df_clean['com_pub'].value_counts()

One Hot Encoding

Moving the categorical data in one column into related columns with numeric values 0 or 1 (one-hot encoding), and scaling numeric features to values between zero and one (min-max normalization).

This is done to avoid distortion when comparing features with very different scales, among other reasons.

There is a lot of information about this on the Internet, from the basics of one-hot encoding to normalizing data to values between zero and one.
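
As a toy illustration of one-hot encoding (the column and values below are made up, not taken from the dataset), pd.get_dummies turns a single categorical column into one 0/1 indicator column per category:


import pandas as pd

# Hypothetical example column
area = pd.Series(['Engenharias', 'Ciências da Saúde', 'Engenharias'], name='area')
# Each category becomes its own 0/1 indicator column
print(pd.get_dummies(area, dtype=int))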


def normalize_0_1(col: pd.Series) -> pd.Series:
    # Min-max normalization to the [0, 1] range; missing values are treated as 0
    col = col.fillna(0)
    return (col - col.min()) / (col.max() - col.min())

df_normalized = df_clean.copy(deep=True)

# Pesquisadores_Associados: count of associated researchers, then normalized
df_normalized['Pesquisadores_Associados'] = df_normalized['Pesquisadores_Associados'].str.split('-').str.len()    
df_normalized['Pesquisadores_Associados'] = normalize_0_1(df_normalized['Pesquisadores_Associados'])

# Processos_Vinculados: count of linked grant processes, then normalized
df_normalized['Processos_Vinculados'] = df_normalized['Processos_Vinculados'].str.split(';').str.len()
df_normalized['Processos_Vinculados'] = normalize_0_1(df_normalized['Processos_Vinculados'])

# País_(Instituições_no_Exterior): count of foreign-institution countries, then normalized
df_normalized['País_(Instituições_no_Exterior)'] = df_normalized['País_(Instituições_no_Exterior)'].str.split(',').str.len()
df_normalized['País_(Instituições_no_Exterior)'] = normalize_0_1(df_normalized['País_(Instituições_no_Exterior)'])

# duration_date
df_normalized['duration_date'] = normalize_0_1(df_normalized['duration_date'])

# Grantee profile identifiers (Google My Citations, ResearcherID, ORCID): 1 if present, 0 if missing
df_normalized[['GoogleMyCitations_(Beneficiário)', 'ResearcherID_(Beneficiário)', 'Orcid_(Beneficiário)']] = np.where(df_normalized[['GoogleMyCitations_(Beneficiário)', 'ResearcherID_(Beneficiário)', 'Orcid_(Beneficiário)']].isnull(), 0, 1)

df_normalized

	N_Processo	Beneficiário	Instituição	Pesquisador_Responsável	Pesquisadores_Associados	Linha_de_Fomento	Grande_Área_do_Conhecimento	Data_de_Início	Data_de_Término	País_(Instituições_no_Exterior)	Processos_Vinculados	GoogleMyCitations_(Beneficiário)	ResearcherID_(Beneficiário)	Orcid_(Beneficiário)	com_pub	duration_date
0	19/14807-0	Sérgio Kurokawa	UNESP	Sérgio Kurokawa	0.043478	Auxílio à Pesquisa - Regular	Engenharias	2019-12-01	2020-11-30	0.000000	0.000000	1	1	1	1	0.134380
1	19/21108-0	Luis Carlos Uta Nakano	UNIFESP	Luis Carlos Uta Nakano	0.000000	Auxílio à Pesquisa - Regular	Ciências da Saúde	2019-12-01	2020-11-30	0.000000	0.000000	0	0	0	0	0.134380
2	19/01649-7	Eduardo José Grin	FGV	Eduardo José Grin	0.000000	Auxílio à Pesquisa - Regular	Ciências Sociais Aplicadas	2019-11-01	2020-12-31	0.000000	0.166667	1	1	1	0	0.164296
3	19/17890-5	Massimo Di Felice	USP	Massimo Di Felice	0.000000	Auxílio à Pesquisa - Regular	Ciências Sociais Aplicadas	2019-11-01	2020-10-31	0.000000	0.000000	0	1	0	0	0.134380
4	19/08972-8	Diogo Teruo Hashimoto	UNESP	Diogo Teruo Hashimoto	0.043478	Auxílio à Pesquisa - Regular	Ciências Agrárias	2019-09-01	2020-08-31	0.333333	0.000000	1	1	1	1	0.134380
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6205	11/23781-2	Rosa Maria Rodrigues Pereira	USP	Rosa Maria Rodrigues Pereira	0.000000	Auxílio à Pesquisa - Regular	Ciências da Saúde	2012-05-01	2016-01-31	0.000000	0.083333	1	1	1	1	0.627268
6206	11/51843-2	Jose Antonio Marengo Orsini	INPE	Jose Antonio Marengo Orsini	0.000000	Auxílio à Pesquisa - Regular	Ciências Exatas e da Terra	2012-04-01	2016-03-31	0.333333	0.000000	1	0	1	1	0.671408
6207	11/20435-6	Gilles Landman	UNIFESP	Gilles Landman	0.000000	Auxílio à Pesquisa - Regular	Ciências da Saúde	2012-03-01	2016-02-29	0.000000	0.000000	1	1	0	1	0.671408
6208	11/10062-8	Paula Ayako Tiba	UFABC	Paula Ayako Tiba	0.130435	Auxílio à Pesquisa - Regular	Ciências Humanas	2011-11-01	2016-02-29	0.000000	0.000000	1	1	1	1	0.730750
6209	10/52212-3	Wagner Cotroni Valenti	UNESP	Wagner Cotroni Valenti	0.000000	Auxílio à Pesquisa - Programa de Pesquisa sobr...	Ciências Agrárias	2011-03-01	2016-02-29	0.000000	0.000000	1	1	1	1	0.850907
6210 rows × 16 columns

Heat Map

After one-hot encoding and normalization, let's check the correlation between fields.

Check the good explanation below on how to read a correlation matrix:

https://www.statology.org/how-to-read-a-correlation-matrix/

This one is more about feature selection based on the correlation matrix:

https://medium.com/analytics-vidhya/feature-selection-feature-engineering-3bb09c67d8c5
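
As a complement to the heatmap, the features can also be ranked by the strength of their correlation with the target. This is a minimal sketch that assumes the df DataFrame built in the block below (normalized features plus the one-hot-encoded knowledge areas) and pandas >= 1.5 for numeric_only:


# Absolute correlation of each numeric feature with the target, strongest first
corr_with_target = df.corr(numeric_only=True)['com_pub'].abs().sort_values(ascending=False)
print(corr_with_target)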


df = df_normalized

dummy_gac = pd.get_dummies(df['Grande_Área_do_Conhecimento'])

df = pd.merge(
    left=df,
    right=dummy_gac,
    left_index=True,
    right_index=True,
)
fig, ax = plt.subplots(figsize=(15, 7))
# numeric_only (pandas >= 1.5) restricts the correlation matrix to numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='cool')
plt.show()
Heatmap chart for correlation analysis

Split Train and Test Data

Split the data into train and test samples.


X = df.drop(['N_Processo', 'Instituição' , 'Linha_de_Fomento' , \
             'Grande_Área_do_Conhecimento' , 'com_pub', \
             'Data_de_Início', 'Data_de_Término', \
             'Beneficiário', 'Pesquisador_Responsável'], axis=1)

y = df[['com_pub']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
y_train = y_train.values.ravel()

Principal component analysis (PCA)

Linear dimensionality reduction using Singular Value Decomposition

A page explaining PCA and dimensionality reduction:

https://en.wikipedia.org/wiki/Principal_component_analysis

At first glance, in the 2D plot, the points look very mixed. In the 3D plot, however, the values look more spread out when viewed with one more dimension.


from sklearn.preprocessing import StandardScaler
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 - registers the 3D projection on older matplotlib
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

scaler = StandardScaler()
scaler.fit(X)
X_Scale = scaler.transform(X)

# 2 dimensions plot
pca2 = PCA(n_components=2, random_state=2020)
principalComponents = pca2.fit_transform(X_Scale)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])

finalDf = pd.concat([principalDf, y], axis = 1)

plt.figure(figsize=(10,7))
sns.scatterplot(x = finalDf['principal component 1'],
                y = finalDf['principal component 2'],
                s=70, hue=finalDf['com_pub'],
                palette=['green', 'blue'])


# 3 dimensions plot
pca3 = PCA(n_components=3)
principalComponents = pca3.fit_transform(X_Scale)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2', 'principal component 3'])

finalDf = pd.concat([principalDf, df[['com_pub']]], axis = 1)
finalDf.head()

fig = plt.figure(figsize=(9,9))
axes = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches the axes in recent matplotlib
axes.set_title('PCA Representation', size=14)
axes.set_xlabel('PC1')
axes.set_ylabel('PC2')
axes.set_zlabel('PC3')

axes.scatter(finalDf['principal component 1'],finalDf['principal component 2'],finalDf['principal component 3'],c=finalDf['com_pub'], cmap = 'prism', s=10)

2D chart for dimensionality reduction
3D chart for dimensionality reduction
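
To gauge how much information the 2D and 3D projections actually retain, it helps to check the explained variance ratio of each component. This short check assumes the pca2 and pca3 objects fitted above:


# Fraction of the total variance captured by each principal component
print(pca2.explained_variance_ratio_, pca2.explained_variance_ratio_.sum())
print(pca3.explained_variance_ratio_, pca3.explained_variance_ratio_.sum())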

Pair plot analysis

Creating a pairplot with Seaborn

For an introduction to visualizing data with pair plots:

https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166


cols = ['duration_date', 'Orcid_(Beneficiário)', 'ResearcherID_(Beneficiário)', \
        'GoogleMyCitations_(Beneficiário)', 'Processos_Vinculados', 'País_(Instituições_no_Exterior)',\
        'Pesquisadores_Associados', 'com_pub']
df_clean = df[cols]
sns.pairplot(df_clean, kind='kde', hue='com_pub', corner=True)#, diag_kind="hist")
plt.show()
Pair plot chart

Save DataFrame to CSV

After analysing and cleaning the data, save the DataFrame to CSV, to be used in the next steps of this machine learning experiment on how to make a product that predicts data series.


save_file = <cleaned_dataset.csv>

# Drop identifier, free-text and date columns that will not be used as features
df.drop(['N_Processo', 'Instituição' , 'Linha_de_Fomento' , \
             'Grande_Área_do_Conhecimento' ,\
             'Data_de_Início', 'Data_de_Término', \
             'Beneficiário', 'Pesquisador_Responsável'], axis=1, inplace=True)

df.to_csv(save_file, sep=';')
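
In the next post, the cleaned dataset can be loaded back with the same separator. A minimal sketch, using the same save_file placeholder as above:


# to_csv wrote the index as the first column, so restore it with index_col=0
df = pd.read_csv(save_file, sep=';', index_col=0)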