Machine Learning - Linear Regression

A simple generic prediction 

Maybe the most simple generic prediction is a collection of points, and a linear regression (when applicable) pointing to the future.

However, aside from the uncertainty that surrounds most predictions, problem happens all the time along the way. In the very recent years, Covid-19, Russia invasion, and finally, China's lockdown again, and maybe this is not the last impacting event for the early 2020's.

Which means that predictions have to be updated all the time. And it is also important to get track of the previous predictions for comparison and for impact analysis.

Linear Regression

The chart bellow can represent anything that relates to the stated use case above. Two lines which slopes varies, due to a great variation caused by an extremes events. 

This model was used for the purpose of the exploration of the Scikit Learn library. The depicted problem is a time series, but the result variable has influence of other factors beyond the years interval. It is not just a simple linear relation, but it applies for the purpose of the example. 

There are many other use cases to predict future data, based on impacting events. Some examples can be the number of security incidents after the adoption of security measures, the production of some part after an industry buys new machinery.


Linear Regression - COVID19 impact

Commented Code

Common data science libraries

Importing Scikit Learn, NumPy, Pandas, Matplotlib

import numpy as np 
import pandas as pd
import json, datetime

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

About the dataset

This test relies on public data available on the open government initiative of the State of São Paulo - Brazil.

State of São Paulo / Brazil - Transparency Portal

Quicknote - The information below can be used as a feature, in an updated version of this regression. This is an open data of the financial funds transfered monthly, from the State of São Paulo to FAPESP, and may have a high correlation with the related regression.

Repasses Financeiros à FAPESP_2005 a 2022 CONSOLIDADO
Disclaimer - The sole purpose of this presentation is to carry out tests with machine learning, using public and open data to improve the use of technology for future application in the FAPESP Virtual Library.
Any results, partial information or charts presented here should not be interpreted as official information. These is just machine learning technology test. For official data, check the links above.

data_from_json = json.loads(
'{"format_compatible": [["1992", 1148, 716], ["1993", 1647, 1273], ["1994", 1928, 1232], ["1995", 3141, 1855], ["1996", 4177, 3204], ["1997", 4268, 4351], ["1998", 3964, 4952], ["1999", 4013, 5081], ["2000", 3848, 5880], ["2001", 3222, 4428], ["2002", 3057, 4536], ["2003", 2821, 4253], ["2004", 3412, 4850], ["2005", 3265, 4764], ["2006", 4218, 6108], ["2007", 4374, 5892], ["2008", 4495, 6638], ["2009", 3977, 6912], ["2010", 4420, 6953], ["2011", 4589, 7468], ["2012", 4515, 8471], ["2013", 4256, 7962], ["2014", 4118, 7216], ["2015", 3656, 6153], ["2016", 3646, 6496], ["2017", 3471, 6643], ["2018", 3483, 7165], ["2019", 3235, 7256], ["2020", 1749, 4991], ["2021", 1700, 5058], ["2022", 645, 1608]]}')
df = pd.DataFrame(data_from_json['format_compatible'], columns = ['date', 'research_grants', 'scholarships'])
df['total'] = df['research_grants'] + df['scholarships']
df = df.apply(pd.to_numeric)

Model Training

SciKit Learn methods to split and train data and test the model. To make possible the comparison between the two models, the same date were used for training. Except that for the model of the first period, the data didn't include the Covid-19 most critical period of the years 2020 and 2021.

The second model include all data from the first model plus the Covid-19 critical years.

The last point is the current year, 2022. And of course this data can be attached to an updated data, daily for instance, so you can follow what the model has predicted. And never forget to filter the data to be predicted from the training model.


""" Training data until 2019 - Before COVID """

df_up_to_2019 = df[df['date']<= 2019]

# Get X values
X_2019 = (df_up_to_2019.index -  df_up_to_2019.index[0]).values.reshape(-1, 1)
y_2019 = df_up_to_2019.total

# Linear Regression
X_2019_train, X_2019_test, y_2019_train, y_2019_test = train_test_split(
    X_2019, y_2019, test_size=0.33, random_state=42)
regr_2019 = linear_model.LinearRegression()
regr_2019.fit(X_2019_train, y_2019_train)

# Normalize lenght
b = np.array([[28], [29], [30]]) 
X_2019 = np.concatenate((X_2019, b))
y_2019_pred = regr_2019.predict(X_2019)


""" Training full data, up to 2021 to measure COVID impact """
df_up_to_2021 = df[df['date'] < 2022]

# Get X values
X = (df_up_to_2021.index -  df_up_to_2021.index[0]).values.reshape(-1, 1)
y = df_up_to_2021.total

# Linear Regression
_, X_test, _, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
regr = linear_model.LinearRegression()

# Add 2020/2021 data 
X_train = np.concatenate((X_2019_train, X[-1:-3:-1]))
y_train = np.concatenate((y_2019_train, y[-1:-3:-1]))
regr.fit(X_train, y_train)

# Normalize lenght
b = np.array([[30]]) 
X = np.concatenate((X, b))
y_pred = regr.predict(X)


# Current value - 2022
total_corrente = df[df['date'] == 2022]['total']

Plot chart

The most important informations ploted on the chart are placed in the legend area, at the top left corner. But it is important to highlight some already writen topics.

The orange line was drawn over the green dots, used to train the model between 1922 and 2019 (inclusive). The blue line was trained with the same dots of the orange line, including the years 2020 and 2021.

That is because the sample data is small and can make the comparison bad, in the case that the training data for both cases differentiates much.


 import warnings
    warnings.filterwarnings('ignore')
    
    plt.figure(figsize=(14,7))
    plt.scatter([X[-1]], 
                total_corrente, 
                color="blue", s=50, 
                marker="x", 
                label = '2022 - Current updated value')
    
    ## Before COVID
    plt.plot(X_2019, 
             y_2019_pred, 
             color="orange", 
             linewidth=1, 
             label = 'Before COVID - Excluded 2020 and 2021 data')
    
    plt.scatter(X_2019_train, y_2019_train, color="#00FF00")

    
    ## After COVID
    plt.plot(X, 
             y_pred, 
             color="blue", 
             linewidth=1, 
             label = 'After COVID - Full data')
    
    plt.scatter(X_train, y_train, color="#00FF00", label = 'Scatter plot (model training)')

    ## Full data = Test data + train data
    plt.scatter(X_train, y_train, color="k", s=6)
    plt.scatter(X_2019_train, y_2019_train, color="k", s=6)
    plt.scatter(X_2019_test, y_2019_test, color="k", s=6)
    plt.scatter(X_test, y_test, color="k", s=6, label = 'Scatter plot (full data)')

    
    # Chart details - grid, axis, labels
    plt.grid(lw=0.3)
    ax = plt.gca()
    ax.set_xticklabels(['0', '1992', '1997', '2002', '2007', '2012', '2017', '2022'])

    def human_format(num, pos):
        magnitude = 0
        while abs(num) >= 1000:
            magnitude += 1
            num /= 1000.0
        # Format Y tick label volums
        return '%.1f%s' % (num, ['', 'K', 'M', 'G', 'T', 'P'][magnitude])

    formatter = FuncFormatter(human_format)
    ax.yaxis.set_major_formatter(formatter)

    last_update = 'Last update: {}'.format(str(datetime.date.today().strftime("%b %d %Y")))
    plt.xlabel(last_update, labelpad=15)
    plt.rc('axes', labelsize=14) 

    handles, labels = plt.gca().get_legend_handles_labels()
    order = [1,2,3,4,0]
    plt.legend([handles[idx] for idx in order],[labels[idx] for idx in order]) 

    plt.show()

Model Evaluation

SciKit Learn methods to evaluate Machine Learning models.

This study aimed to make the right usage of the librarie's API, and to make use of a simple datasource to demonstrate in a simple manner, how to obtain useful information.

The use of the same data for trainning for both models, only adding data to the second model, may have caused the Coefficient of determination of the second model decrease. 

Gradually, I will test other models to try to incremente the evaluation result and improve the overall results. 


print("Coefficients: ", human_format(regr_2019.coef_[0], 0))
print("Intercept: ", human_format(regr_2019.intercept_, 0))
mae = mean_absolute_error(y_2019, y_2019_pred[:len(y_2019)])
print("Mean absolute error: %s" % human_format(mae, 0))
print("Coefficient of determination: %.2f" % r2_score(y_2019, y_2019_pred[:len(y_2019)]))

print("\n\n")

print("COVID years included - Up to 2021")
print("Coefficients: ", human_format(regr.coef_[0], 0))
print("Intercept: ", human_format(regr.intercept_, 0))
mae = mean_absolute_error(y,  y_pred[:len(y)])
print("Mean absolute error: %s" % human_format(mae, 0))
print("Coefficient of determination: %.2f" % r2_score(y, y_pred[:len(y)],))
Pre-COVID years - Up to 2019
Coefficients:  251.9
Intercept:  5.6K
Mean absolute error: 1.5K
Coefficient of determination: 0.62

COVID years included - Up to 2021
Coefficients:  155.9
Intercept:  6.5K
Mean absolute error: 1.7K
Coefficient of determination: 0.36

Popular posts from this blog

Atom - Jupyter / Hydrogen

Metodologias em ação

Design Patterns - Observer