End to End Machine Learning Project on Student Performance Dataset

End To End Project
Regression Project

Before I begin, I want to thank Krish Naik sir (you can learn from him on YouTube) for creating a series around end-to-end machine learning projects. I have learnt a lot from his videos and have tried to implement the same in this notebook. I have also referred to his notebook for this project and added more details and explanations wherever I felt necessary.

A brief intro of this project work

  • These days it is getting more common for a person in a data team to wear multiple hats when it comes to bringing insights out of a sea of data. To gain experience of what it means to wear multiple hats, I created this end-to-end project on the simple Students Performance dataset from Kaggle.

  • The goal of this project is not to derive any magical insights from the data; rather, it is to do comprehensive work on building an end-to-end project, which includes but is not limited to:

  • Follow the actions mentioned below to build your own copy of this end-to-end project

  • You can try the web app created for this project before you get your hands dirty

  • Notebook checkpoints

    • STAR ANALYSIS
    • Every point of the STAR method is explained step by step in detail.
      • E.g. Action is broken down into A1, A2, ... to follow along in the notebook.
      • We use shorthand like T1 for Task 1 and A1.1 for sub-action 1 of Action 1.

STAR ANALYSIS

  • Situation: To gain experience in end-to-end machine learning project development and deployment
  • Task: Create a machine learning project from scratch, following the entire pipeline from data collection to model deployment
  • Action: Develop a project idea, collect and preprocess the data, perform EDA on the data, design and train the machine learning model, evaluate the model's performance, and deploy the model into a production environment
  • Result: Gain hands-on experience in end-to-end machine learning project development and deployment, with a fully functional machine learning system that can be used for real-world applications

Situation

  • S1. The need to gain exposure to real-world ML project development and deployment
  • S2. A way to improve my Data Science profile with such projects
  • S3. Building a skillset that is of use in the real world, not one limited to books

With the situation clear, let's jump to the tasks that needed to be done for it.

Tasks

  • T1. Creating a folder structure for a real-world project.
    • Uses: introduces modularity to the project, rather than having a single Jupyter notebook do all the work.
  • T2. Creating an environment and setup file to run this ML pipeline from scratch.
  • A. Developing an end-to-end ML pipeline and then performing web deployment so the ML model can be used.

With this basic overview of the tasks, let's look at each task in detail.

Task T1: Creation of Project Structure

  • Creating a folder structure for our real-world project. This is an essential part of any real-world code project, as it introduces modularity to our code. This modularity helps us deal with the complexity of huge projects in a simple way, where a team can work together on different parts of the project, re-use each other's work and combine it all at the end.
T1.1: Folder Structure Creation
  • First set up a GitHub repo (ML_Web_Project is my repo), keeping all the options at their defaults.
  • Locally set up a base folder (END_TO_END_ML_PROJECT is my base local folder, set up on WSL, but one can use Windows or macOS as well)
    • Open this directory in vscode
    • Open a terminal
  • Next, create a conda environment named venv inside this local folder, so that the packages needed to run the project live locally.
    conda create -p venv python==3.8 -y
    • Activate this environment from the base folder
    conda activate venv/ # don't forget the '/': it tells conda that this environment lives in a folder named venv
  • Link the local folder to the GitHub repo
    • First run git init in the local folder.
    • Follow the steps shown in the GitHub repo you created to sync the local folder with the repo.
    • Since GitHub dropped password authentication in 2021, one needs to set up SSH keys (or personal access tokens) to push to the repo. I prefer SSH keys; follow the steps here.
    • Create a default Python .gitignore in the GitHub repo online.
    • Finally do a git pull to sync those changes locally as well.
    • Later on, whenever there are enough changes to the local code, run git add, git commit (with a useful commit message) and git push.
  • By now the local repo should contain a .gitignore, README.md and venv/. After this, create the following folder structure locally.
- END_TO_END_ML_PROJECT
    - setup.py # The setup script is the center of all activity in building, distributing, and installing modules that are necessary to run the ML pipeline. # Consider it as the core of the ML Pipeline. This setup.py will help to use our ML pipeline as a package itself and can even be deployed to Pypi.
    - requirements.txt # All packages that need to be installed before running the project. # This is the part that gives energy to the core.
    - assets
        - data # The folder which consists of the datasets used in the project.
            - StudentsPerformance.csv
        - files
            - notebook # jupyter notebooks, consisting of all codes which helps to find patterns in data and give a big picture code, later to be broken down into src folders.
                - EDA_notebook.ipynb 
                - Model_train.ipynb
        - images # to store plot images        
    - src # The backbone containing all the source code for creating the ML pipeline package.
        - __init__.py
        - exception.py # Helps in producing custom exceptions.
        - logger.py # Contains the code that creates logs, helping to trace any errors that occur at runtime.
        - utils.py # Contains all the utilities that can be reused across the whole project.
        - components # The major components of the project, which deal with data cleaning, transformation, model training etc.
            - __init__.py
            - data_ingestion.py
            - data_transformation.py
            - model_trainer.py
        - pipeline # The complete pipelines built via use of components for further deployment of the model.
            - __init__.py
            - predict_pipeline.py
            - train_pipeline.py

Task T2: Environment creation and setup

  • Creating an environment and a setup file which can later be used to package our ML pipeline. In this part we build the foundation for the ML pipeline by writing the code for the setup.py file.
Task T2.1: Setup File
Code
from setuptools import find_packages,setup
from typing import List

def get_requirements(file_path:str)->List[str]:
    '''
    This function returns the list of requirements read from the given file
    '''
    requirements = []
    with open(file_path, 'r') as file:
        for line in file:
            if "-e ." not in line:
                requirements.append(line.strip('\n'))

    return requirements
    
# With this setup we parse our requirements file to get the requirements installed for our project, one can make this static via use of package names in form of a list, instead of parsing a requirements file.
setup(
    name='mlproject',
    version='0.0.1',
    author='<Your Name>',
    author_email='<Your Email>',
    packages=find_packages(), # find_packages() picks up every module of our ML pipeline; the __init__.py in src (and in any directory meant to be reused) marks it as a package.
    install_requires=get_requirements('requirements.txt') 
)
  • contents of requirements.txt file
pandas
numpy
seaborn
matplotlib
scikit-learn
catboost
xgboost
dill
tensorboard
-e . # This triggers setup.py automatically, but this line itself is skipped when setup.py parses the file, as per the code above.
  • Once these two files are set up, simply run:
pip install -r requirements.txt
  • This will install all the necessary packages into our virtual environment and create a new .egg-info directory containing the metadata for our ML pipeline package.
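If the editable install worked, the src package should be importable from anywhere inside the environment. A minimal sanity check (assuming the src modules described later in this post are already in place):

# sanity_check.py -- run from the project root after `pip install -r requirements.txt`
from src.logger import logging
from src.exception import CustomException

logging.info("mlproject installed in editable mode and src is importable")
print("setup OK")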

Actions

  • A1. Project Idea: Use student performance data to predict a student's grades or scores from the other features of the dataset.
  • A2. Data Collection and Preprocessing: We first do all the EDA in a Jupyter notebook to find patterns in the data and to understand what kind of preprocessing the dataset requires.
    • For this simple application the data is imported as a CSV file, but the same could also be done by pulling the data from a data warehouse.
  • A3. Design and Development of ML pipeline components: After the EDA, we write simple modular code in a Jupyter notebook that covers development, training and evaluation of the ML model. Later this code is split, more or less, into the folder structure we created earlier.
  • A4. Deployment of the model into a production environment: We use tools like AWS, Streamlit, Flask or Django, or any other web service to deploy the ML model online, so it can be used on real-time data provided by a user or fetched from a source.

Action A1 & A2: Project Idea, Data collection and Preprocessing

  • Project Idea
    • We will use a simple student performance dataset to predict a student's maths score from the rest of the features of the dataset.
    • We use this dataset because it has a mix of categorical and numerical features, it allows a good amount of EDA on simple data, and, last but not least, many regression algorithms can be trained on it easily.
  • Data Collection & Preprocessing
    • We will use Jupyter notebooks to do the majority of the EDA and pattern finding; a minimal loading and inspection sketch is shown right after this list.
    • Once the EDA is done, we will also run basic models on the data in another Jupyter notebook, so that we have basic model pipeline code in place as well.
  • I have briefly documented the insights from the EDA and model training on my GitHub; you can view them here. With the insights in place, let's begin the design and development of the ML pipeline.
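A minimal loading and inspection sketch of the kind used in the EDA notebook (the path follows the folder structure above; nothing here is specific to the final pipeline):

import pandas as pd

# Load the Kaggle Students Performance dataset kept under assets/data/
df = pd.read_csv("./assets/data/StudentsPerformance.csv")

print(df.shape)            # number of rows and columns
print(df.head())           # first few records
print(df.isnull().sum())   # check for missing values
print(df.describe())       # summary statistics for the score columns

# Categorical vs numerical split, used later in the transformation component
numerical_columns = df.select_dtypes(exclude="object").columns.tolist()
categorical_columns = df.select_dtypes(include="object").columns.tolist()
print(numerical_columns, categorical_columns)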

Action A3: Design and Development of ML pipeline components

  • Design and development of the ML pipeline components in the form of modular code
  • Steps
    • Creation of the utility code and the logging and exception-handling modules that are used across all components and pipelines.
    • Creation of the component modules inside the package, consisting of the Data Ingestion, Data Transformation and Model Trainer components.
    • Creation of the train and predict pipeline modules that connect to the above components and act as the pipeline between the frontend user and the backend machine learning model.
Action A3.1: Creation of Utilities, Loggers and Exceptions
Action A3.1.1: Creation of Utilities
Code
#Common functionalities for the whole project.
import os
import sys

import dill
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
import seaborn as sns

from src.exception import CustomException
from sklearn.model_selection import GridSearchCV

def save_object(file_path, obj):
    '''Serialize any Python object to disk with dill.'''
    try:
        dir_path = os.path.dirname(file_path)
        os.makedirs(dir_path, exist_ok=True)

        with open(file_path, "wb") as file_obj:
            dill.dump(obj, file_obj)

    except Exception as e:
        raise CustomException(e, sys)

def evaluate_models(X_train, y_train, X_test, y_test, models, param):
    '''Grid-search each model, refit it with the best params and report its r2 score on the test set.'''
    try:
        report = {}

        for model_name, model in models.items():
            para = param[model_name]

            gs = GridSearchCV(model, para, cv=3)
            gs.fit(X_train, y_train)

            model.set_params(**gs.best_params_)
            model.fit(X_train, y_train)

            y_train_pred = model.predict(X_train)
            y_test_pred = model.predict(X_test)

            train_model_score = r2_score(y_train, y_train_pred)  # kept for debugging/inspection
            test_model_score = r2_score(y_test, y_test_pred)

            report[model_name] = test_model_score

        return report

    except Exception as e:
        raise CustomException(e, sys)
            
def load_object(file_path):
    '''Load an object previously saved with save_object.'''
    try:
        with open(file_path, "rb") as file_obj:
            return dill.load(file_obj)

    except Exception as e:
        raise CustomException(e, sys)

def create_plot(y_test, y_pred, type, model_name, xlabel = "Actual Math Score", ylabel = "Predicted Math Score", file_name = "Actual vs Predicted"):
    """
    Create an actual-vs-predicted plot for a model and save it under assets/images/.
    """
    directory = "./assets/images/"
    plt.figure()  # start a fresh figure so consecutive calls don't draw on top of each other

    if type == "scatter":
        plt.scatter(y_test, y_pred)
        plt.title(f"{model_name}'s Actual vs Predicted Values Scatterplot")
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.savefig(f"{directory}{file_name}")

    elif type == "reg":
        sns.regplot(x=y_test, y=y_pred, ci=None, color='red')
        plt.title(f"{model_name}'s Actual vs Predicted Values Regplot")
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.savefig(f"{directory}{file_name}_regplot")

    plt.close()
Action A3.1.2: Creation of Logger
Code
# Logger is for the purpose of logging all the events in the program from execution to termination.
# For example, whenever there is an exception, we can log the exception info in a file via use of logger.

# Read logger documentation at https://docs.python.org/3/library/logging.html
import logging
import os
from datetime import datetime

LOG_FILE_NAME = f"{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}.log"
logs_path = os.path.join(os.getcwd(), "logs") # This creates a logs folder in the current working directory
os.makedirs(logs_path, exist_ok=True) # Keep appending log files into the same folder across multiple runs of the program

LOG_FILE_PATH = os.path.join(logs_path, LOG_FILE_NAME)

logging.basicConfig(filename=LOG_FILE_PATH,
                    level=logging.INFO,
                    format="[%(asctime)s] %(lineno)d %(name)s - %(levelname)s: %(message)s",
                    datefmt='%m/%d/%Y %I:%M:%S %p'
                    ) #This is the change of basic configuration for the logger

if __name__ == '__main__':
    logging.info("This is a test log")
    logging.warning("This is a warning log")
    logging.error("This is an error log")
    logging.critical("This is a critical log")
Action A3.1.3: Creation of Exception.py
Code
# We use custom exception handling so that every error raised anywhere in the project is formatted and handled in a single place.

import sys

# The sys module provides functions and variables used to interact with the Python runtime environment; here we use it to access the current exception's traceback.
# Read more about sys module here: https://docs.python.org/3/library/sys.html
from src.logger import logging


def error_message_detail(error,error_detail:sys):
    _,_,exec_tb = error_detail.exc_info()
    file_name = exec_tb.tb_frame.f_code.co_filename
    error_message = f"Error occured in python script name {file_name} on line number {exec_tb.tb_lineno} and error is {str(error)}"
    
    return error_message
    
class CustomException(Exception):
    def __init__(self,error_message,error_detail:sys):
        super().__init__(error_message)
        self.error_message = error_message_detail(error_message,error_detail= error_detail)
        #self.error_detail = error_detail        
        
    def __str__(self):
        return f"{self.error_message}"

# Read more about custom exception handling here: https://www.programiz.com/python-programming/user-defined-exception

if __name__ == '__main__':
    try:
        a = 10
        b = 0
        c = a/b
        print(c)
    except Exception as e:
        logging.error(e)
        raise CustomException(e,error_detail=sys)
Action A3.2: Creation of Components
A3.2.1 Data Ingestion Component
  • Data is the central component of any project. In this component, we write classes such as DataIngestionConfig and DataIngestion.
    • DataIngestionConfig holds public path variables for the train, test and raw data.
    • DataIngestion creates an object which instantiates a DataIngestionConfig during initialization and retrieves those public path variables.
    • Using those paths, the initiate_data_ingestion method reads the data, splits it and saves the resulting files to the artifacts directory.
  • Data ingestion is a crucial step in any project that involves handling data. This process involves extracting data from different sources, such as databases or warehouses, and loading it into a centralized location, such as a data warehouse, data lake, or data mart. Typically, this task is performed by a specialized big data team, whose responsibility is to ensure that data is obtained from various sources and stored in systems such as Hadoop or MongoDB.
  • As Data Scientists, it's essential to know how to extract data from different sources, such as Hadoop, MongoDB, MySQL, or Oracle, and make it available for analysis. Since data is a critical asset in any project, understanding the data ingestion process is vital to ensure that the data is organized and stored in a way that facilitates analysis.
Code
import os
import sys
from dataclasses import dataclass

import pandas as pd
from sklearn.model_selection import train_test_split

from src.components.data_transformation import (DataTransformation,
                                                DataTransformationConfig)
from src.components.model_trainer import ModelTrainer, ModelTrainerConfig
from src.exception import CustomException
from src.logger import logging


@dataclass
class DataIngestionConfig:
    '''
    Used for defining the configuration for data ingestion.
    '''
    train_data_path: str = os.path.join('artifacts', 'train.csv')
    test_data_path: str = os.path.join('artifacts', 'test.csv') 
    raw_data_path: str = os.path.join('artifacts', 'data.csv')

class DataIngestion:
    '''
    Used for ingesting data by making use of the configuration defined in DataIngestionConfig.
    '''
    def __init__(self,ingestion_config: DataIngestionConfig = DataIngestionConfig()):
        self.ingestion_config = ingestion_config
    
    def initiate_data_ingestion(self,raw_data_path: str = None):
        try:
            # Reading data here.
            logging.info("Initiating data ingestion")
            if raw_data_path is not None:
                self.ingestion_config.raw_data_path = raw_data_path
                data = pd.read_csv(self.ingestion_config.raw_data_path)
            else:
                data = pd.read_csv('./assets/data/NewSPerformance.csv')
                        
            os.makedirs(os.path.dirname(self.ingestion_config.train_data_path),exist_ok=True)
            data.to_csv(self.ingestion_config.raw_data_path,index=False,header=True) # keep a copy of the raw data at artifacts/data.csv
            logging.info("Data ingestion completed")
            
            logging.info("Train test split initiated")
            train_set, test_set = train_test_split(data,test_size = 0.2, random_state = 18)

            train_set.to_csv(self.ingestion_config.train_data_path,index = False, header = True)
            test_set.to_csv(self.ingestion_config.test_data_path,index = False, header = True)
            logging.info("Train test split ingestion completed")
            
            return (
                self.ingestion_config.train_data_path,
                self.ingestion_config.test_data_path
            )
        except Exception as e:
            logging.error("Error occured in data ingestion")
            raise CustomException(e,sys)
    

if __name__ == '__main__':
    obj = DataIngestion()
    train_data, test_data = obj.initiate_data_ingestion()
    
    data_transformation = DataTransformation() # We call DataTransformation here, just for the sake of demonstration.
    train_arr, test_arr,_ = data_transformation.initiate_data_transformation(train_data,test_data)
    
    modeltrainer = ModelTrainer()
    print(modeltrainer.initiate_model_trainer(train_arr, test_arr))
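As noted in the description above, ingestion does not have to start from a local CSV; the same initiate_data_ingestion flow could read from a source such as MongoDB instead. A hypothetical sketch (pymongo, the connection URI and the database/collection names are assumptions, not part of this project):

import pandas as pd
from pymongo import MongoClient  # assumption: pymongo would need to be installed separately


def read_from_mongodb(uri: str, db_name: str, collection_name: str) -> pd.DataFrame:
    '''Pull all documents from a MongoDB collection into a DataFrame.'''
    client = MongoClient(uri)
    collection = client[db_name][collection_name]
    records = list(collection.find({}, {"_id": 0}))  # drop the Mongo _id field
    client.close()
    return pd.DataFrame(records)

# data = read_from_mongodb("mongodb://localhost:27017", "school", "students_performance")
# `data` could then replace the pd.read_csv(...) call inside initiate_data_ingestion.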
A3.2.2 Data Transformation Component
  • Once data ingestion is done, the data transformation component is used to transform the data so that it is useful for analysis and for training models.
    • The DataTransformationConfig class in this component stores a public path variable for saving the preprocessing object as a pickle file, to be used later when building the web app.
    • The DataTransformation class creates an object which instantiates a DataTransformationConfig object to get access to the preprocessor object path.
    • It has a get_data_transformer_object method that returns a preprocessor object which can preprocess both numerical and categorical columns.
    • The initiate_data_transformation method uses get_data_transformer_object to do all the preprocessing on the train and test files, whose paths come from the Data Ingestion component. After preprocessing, it returns train and test arrays consisting of the feature and target variables.
  • A data transformation component is a crucial part of the data science process, which involves transforming raw data into a format that can be used for analysis. Data Scientists play a vital role in this process as they use various techniques such as feature engineering, feature selection, feature scaling, data cleaning, and handling null values to ensure the quality of data used for analysis. By understanding the process of data transformation, Data Scientists can generate valuable insights from raw data and make informed business decisions.
Code
import os
import sys
from dataclasses import dataclass

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from src.exception import CustomException
from src.logger import logging
from src.utils import save_object


@dataclass # This decorator auto-generates the boilerplate for a class that simply holds configuration values.
class DataTransformationConfig:
    '''
    We store the path at which the data transformation (preprocessor) object will be saved.
    '''
    preprocessor_obj_file_path = os.path.join("artifacts","preprocessor.pkl")

class DataTransformation:
    
    def __init__(self,transformation_config: DataTransformationConfig = DataTransformationConfig()):
        self.data_transformation_config = transformation_config

    def get_data_transformer_object(self):
        '''
        This function is responsible for creating a preprocessing data transformation object.
        '''
        try:
            numerical_columns = ["writing_score","reading_score"]
            categorical_columns = [
                "gender",
                "race_ethnicity",
                "parental_level_of_education",
                "lunch",
                "test_preparation_course"
                ]
            
            num_pipeline = Pipeline(
                steps=[
                    ("imputer",SimpleImputer(strategy = 'median')),
                    ("scaler",StandardScaler())
                ]
            )
            
            cat_pipeline = Pipeline(
                steps = [
                    ("imputer",SimpleImputer(strategy = 'most_frequent')),
                    ("one_hot_encoder",OneHotEncoder()),
                    ('scaler',StandardScaler(with_mean=False))
                ]
            )
            
            logging.info(f"Numerical columns:{numerical_columns}")
            logging.info(f"Categorical columns:{categorical_columns}")
            
            preprocessor = ColumnTransformer(
                [
                    ("num_pipeline",num_pipeline,numerical_columns),
                    ('cat_pipeline',cat_pipeline,categorical_columns)
                ]
            )
            
            return preprocessor
        except Exception as e:
            raise CustomException(e,sys)
        
    def initiate_data_transformation(self,train_path,test_path):
        '''
        Here we use the preprocessing object to transform the data.
        '''
            
        try:
            train_df = pd.read_csv(train_path)
            test_df= pd.read_csv(test_path)
            
            
            logging.info("Read train and test data completed") 
            
            logging.info("Obtaining preprocessing object and starting processing.")
            preprocessing_obj = self.get_data_transformer_object()
            target_column_name = "math_score"
            numerical_columns = ["writing_score","reading_score"]
            
            input_feature_train_df = train_df.drop(columns = [target_column_name],axis=1)
            target_feature_train_df = train_df[target_column_name]
            
            input_feature_test_df = test_df.drop(columns = [target_column_name],axis=1)
            target_feature_test_df = test_df[target_column_name]
            
            input_feature_train_arr = preprocessing_obj.fit_transform(input_feature_train_df)
            input_feature_test_arr = preprocessing_obj.transform(input_feature_test_df) # only transform here: the preprocessor must be fitted on the training data alone
            
            train_arr = np.c_[
                input_feature_train_arr,np.array(target_feature_train_df)
                ]
            
            test_arr = np.c_[
                input_feature_test_arr,np.array(target_feature_test_df)
                ]
            
            logging.info(f"Saved Preprocessing object at a particular filepath ")
            save_object(
                file_path = self.data_transformation_config.preprocessor_obj_file_path,
                obj = preprocessing_obj
            )
            return(
                train_arr,
                test_arr,
                self.data_transformation_config.preprocessor_obj_file_path,
            )
            
                
        except Exception as e:
            raise CustomException(e,sys)
A3.2.3 Model Trainer Component
  • Model Trainer (MT): once the above components have turned the data into the desired format, we can train various models. This component consists of two classes, as follows.
    • The ModelTrainerConfig class stores a public path variable for saving the trained model object in pickle format.
    • The ModelTrainer class uses the initiate_model_trainer method, which takes the train and test arrays from the Data Transformation component. The method trains several models on the train array, picks the best one by r2 score, and makes predictions on the test array with it. It also uses the ModelTrainerConfig object to store the trained model in a local directory, and finally returns the best model's r2 score on the test data.
  • The model trainer component is responsible for training machine learning models on the transformed data. Data Scientists use this component to select an appropriate algorithm, tune hyperparameters, and train the model on the data. The trained model is then evaluated for its performance, and if it meets the desired level of accuracy, it is deployed for production use. The role of Data Scientists in this component is to select and fine-tune the machine learning models that best fit the problem at hand, and ensure that the models meet the business requirements. Ultimately, the model trainer component helps Data Scientists to generate insights and make predictions that can drive business decisions.
Code
import os
import sys
import pandas as pd
from dataclasses import dataclass

from catboost import CatBoostRegressor
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from src.exception import CustomException
from src.logger import logging
from src.utils import evaluate_models, save_object, create_plot

@dataclass
class ModelTrainerConfig:
    # This class is used to store the configs, or any other files generated in this particular python file.
    trained_model_file_path = os.path.join("artifacts","model.pkl")

class ModelTrainer:
    def __init__(self,model_train_config:ModelTrainerConfig = ModelTrainerConfig() ) -> None:
        self.model_trainer_config = model_train_config
        
    
    def initiate_model_trainer(self,train_array,test_array):
        try:
            logging.info("Split training and test input data")
            X_train, y_train, X_test, y_test = (
                train_array[:,:-1],
                train_array[:,-1],
                test_array[:,:-1],
                test_array[:,-1]
            )
            
            models = {
                "Random Forest": RandomForestRegressor(),
                "Decision Tree": DecisionTreeRegressor(),
                "Gradient Boosting": GradientBoostingRegressor(),
                "Linear Regression": LinearRegression(),
                "XGBRegressor": XGBRegressor(),
                "CatBoosting Regressor": CatBoostRegressor(verbose=False),
                "AdaBoost Regressor": AdaBoostRegressor(),
                "Ridge": Ridge(),
                "Lasso": Lasso()
            }
            
            params = {
                "Ridge": {
                    "alpha": [0.1, 1, 10],
                    "fit_intercept": [True, False],
                },
                "Lasso": {
                    "alpha": [0.1, 1, 10],
                    "fit_intercept": [True, False],
                },
                "Decision Tree": {
                    'criterion':['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
                    'splitter':['best','random'],
                    #'max_features':['sqrt','log2'],
                    #"max_depth": [None, 5, 10],
                    #"min_samples_split": [2, 5, 10],
                    
                },
                "Random Forest":{
                    'criterion':['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
                     #'max_features':['sqrt','log2',None],
                    'n_estimators': [8,16,32,64,128,256],
                    #"max_depth": [None, 5, 10],
                    #"min_samples_split": [2, 5, 10],
                },
                "Gradient Boosting":{
                    #'loss':['squared_error', 'huber', 'absolute_error', 'quantile'],
                    'learning_rate':[.1,.01,.05,.001],
                    'subsample':[0.6,0.7,0.75,0.8,0.85,0.9],
                    'criterion':['squared_error', 'friedman_mse'],
                    #'max_features':['auto','sqrt','log2'],
                    'n_estimators': [8,16,32,64,128,256],
                    #"max_depth": [None, 5, 10],
                    #"min_samples_split": [2, 5, 10],
                },
                "Linear Regression":{ 
                    "fit_intercept": [True, False],
                    #"normalize": [True, False],
                    },
                "XGBRegressor":{
                    'learning_rate':[.1,.01,.05,.001],
                    'n_estimators': [8,16,32,64,128,256],
                    #"max_depth": [None, 5, 10],
                    #"min_child_weight": [1, 3, 5],
                },
                "CatBoosting Regressor":{
                    'depth': [6,8,10],
                    'learning_rate': [0.01, 0.05, 0.1],
                    'iterations': [30, 50, 100],
                    #"n_estimators": [50,100,250],
                    #"max_depth": [None, 5, 10],
                    #"reg_lambda": [0.1, 1, 10],
                },
                "AdaBoost Regressor":{
                    'learning_rate':[.1,.01,0.5,.001],
                    #'loss':['linear','square','exponential'],
                    'n_estimators': [8,16,32,64,128,256],
                }
                                
                }
            
            model_report: dict = evaluate_models(
                X_train = X_train,
                y_train =  y_train, 
                X_test  = X_test,
                y_test = y_test,
                models = models,
                param = params
                )
            print(model_report)
            
            model_report_df = pd.DataFrame(model_report, index=[0])
            model_report_df.to_csv("./assets/files/model_report.csv",index=False)
            
            # Pick the best model by its r2 score on the test data
            best_model_score = max(model_report.values())
            best_model_name = list(model_report.keys())[
                list(model_report.values()).index(best_model_score)
                ]
            best_model = models[best_model_name]

            if best_model_score < 0.6:
                # Raise a plain exception here; the outer handler wraps it into a CustomException with traceback info
                raise Exception("No best model found with an acceptable r2 score")
            logging.info(f"Best model found on both training and testing dataset: {best_model_name}")
            
            save_object(
                file_path = self.model_trainer_config.trained_model_file_path,
                obj = best_model
            )

            predicted = best_model.predict(X_test)
            r2_square = r2_score(y_test,predicted)
            create_plot(y_test,predicted,type = 'scatter',model_name = best_model_name)
            create_plot(y_test,predicted, type = 'reg',model_name = best_model_name)
            
            return r2_square
            

        except Exception as e:
            raise CustomException(e,sys)
Action A3.3: Creation of train and predict pipelines
A3.3.1 Train Pipeline
  • A pipeline that interacts with the DI, DT and MT components to process the raw data and train the model.
    • This pipeline has a TrainPipeline class, which takes the raw data path and whose train method interacts with the DI, DT and MT components, finally returning the best model's r2 score.
Currently this pipeline is not in a production environment; it is simply used locally to train the model on local data
  • To run this pipeline and train the models, simply run the file with the appropriate raw data:
python3 ./src/pipeline/train_pipeline.py
Code
# The train pipeline uses the components from src to train the model on the dataset.
import sys
import pandas as pd
from src.logger import logging
from src.exception import CustomException
from src.components import data_ingestion as di
from src.components import data_transformation as dt
from src.components import model_trainer as mt


class TrainPipeline:
    def __init__(self, raw_data_path=None):
        self.raw_data_path = raw_data_path

    def train(self):
        try:
            logging.info("Initiating data ingestion")
            di_obj = di.DataIngestion()
            train_data, test_data = di_obj.initiate_data_ingestion(raw_data_path=self.raw_data_path)
            logging.info("Data ingestion completed")
            
            logging.info("Initiating data transformation")
            dt_obj = dt.DataTransformation()
            train_arr, test_arr,_ = dt_obj.initiate_data_transformation(train_data,test_data)
            logging.info("Data transformation completed and saved preprocessor object")
            
            
            logging.info("Training the model")
            mt_obj = mt.ModelTrainer()
            print(f"Best Models r2_score: {mt_obj.initiate_model_trainer(train_arr, test_arr)}")
            logging.info("Model training completed and saved the best model")
    
        except Exception as e:
            raise CustomException(e, sys)

if __name__ == "__main__":
    train_pipeline_obj = TrainPipeline("data/NewSPerformance.csv")
    train_pipeline_obj.train()
A3.3.2 Predict Pipeline Component
  • Predict Pipeline: A pipeline that takes the user inputs and makes predictions on them using the trained model and the other objects, such as the preprocessor, created via the train pipeline.
    • This pipeline contains a CustomData class, which takes the user inputs submitted to our application and returns a data frame built from them.
    • The PredictPipeline class takes the data frame returned by CustomData as features, scales it with the transformer generated by the DT component and finally makes predictions using the best model from the MT component, showing them back to the user.
Code
# A prediction pipeline file.
import sys
import pandas as pd
from src.exception import CustomException
from src.utils import load_object
from src.logger import logging


class PredictPipeline:
    def __init__(self):
        pass
    
    def predict(self,features):
        try:
            logging.info("Predicting the data")
            model_path  = "artifacts/model.pkl"
            preprocessor_path = "artifacts/preprocessor.pkl"
            
            model = load_object(file_path = model_path)
            preprocessor = load_object(file_path = preprocessor_path)
        
            data_scaled = preprocessor.transform(features)
            predictions = model.predict(data_scaled)
            logging.info("Predictions completed")
            return pd.DataFrame(predictions,columns=["predictions"])
        
        except Exception as e:
            raise CustomException(e,sys)    


class CustomData:
    def __init__(self,
        gender: str,
        race_ethnicity: str,
        parental_level_of_education: str,
        lunch: str,
        test_preparation_course: str,
        reading_score: int,
        writing_score: int):

        self.gender = gender
        self.race_ethnicity = race_ethnicity
        self.parental_level_of_education = parental_level_of_education
        self.lunch = lunch
        self.test_preparation_course = test_preparation_course
        self.reading_score = reading_score
        self.writing_score = writing_score

    def get_data_as_data_frame(self):
        try:
            logging.info("Creating a data frame from the custom data")
            custom_data_input_dict = {
                "gender": [self.gender],
                "race_ethnicity": [self.race_ethnicity],
                "parental_level_of_education": [self.parental_level_of_education],
                "lunch": [self.lunch],
                "test_preparation_course": [self.test_preparation_course],
                "reading_score": [self.reading_score],
                "writing_score": [self.writing_score],
            }
            logging.info("Data frame created")
            return pd.DataFrame(custom_data_input_dict)

        except Exception as e:
            raise CustomException(e, sys)
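Outside of the web app, these two classes can be exercised directly. A small usage sketch (it assumes the artifacts from a training run, model.pkl and preprocessor.pkl, already exist, and the category values are illustrative ones from the Kaggle dataset):

from src.pipeline.predict_pipeline import CustomData, PredictPipeline

# Build a single-row data frame from raw, user-style inputs...
data = CustomData(
    gender="female",
    race_ethnicity="group B",
    parental_level_of_education="bachelor's degree",
    lunch="standard",
    test_preparation_course="none",
    reading_score=72,
    writing_score=74,
)
features = data.get_data_as_data_frame()

# ...and score it with the persisted preprocessor and best model.
print(PredictPipeline().predict(features))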
  • The train and predict pipelines are a critical component of the machine learning process. The train pipeline is responsible for training the machine learning model on the training data. This process involves selecting an appropriate algorithm, fine-tuning hyperparameters, and fitting the model to the training data.

  • Once the model is trained, it is deployed to the predict pipeline, which is responsible for making predictions on new data. The predict pipeline involves processing the data, applying any necessary transformations, and using the trained model to generate predictions.

  • Data Scientists play a crucial role in both the train and predict pipelines. They must ensure that the training data is representative of the problem at hand, and that the model is trained and optimized to meet the desired level of accuracy. In addition, they must ensure that the predict pipeline is efficient and reliable, and that the model generates accurate predictions in real-time.

  • Ultimately, the train and predict pipelines are essential to the machine learning process, as they allow Data Scientists to build and deploy models that can generate valuable insights and drive business decisions.

Action A4: Model Deployment

  • Deployment of Model into a production environment:
    • Deployment of the model via a Flask API locally
    • Deployment of the model via CodePipeline and Elastic Beanstalk
    • Deployment of the model via Docker, ECR, GitHub Action runners and EC2
Action A4.1: Deployment of Model via Flask API locally
  • Following is the code of the Flask application that uses the predict pipeline and the Flask API.
    • The Flask API is used to move data from the frontend to the backend and vice versa.
    • The prediction pipeline makes predictions in the backend, and the results are finally shared with the user through the Flask API.
    • This local app can be run directly with python application.py in the terminal.
Code
from flask import Flask, request, render_template
import pandas as pd
from src.pipeline.predict_pipeline import CustomData, PredictPipeline
from src.exception import CustomException
from src.logger import logging
import os

application = Flask(__name__)

app = application

## Route for a home page
@app.route('/')
def index():
    return render_template('index.html')


@app.route('/predictdata', methods=['GET','POST'])
def predict_datapoint():
    if request.method == 'GET':
        return render_template('home.html')
    else:
        try:
            data = CustomData(
                gender = request.form.get('gender'),
                race_ethnicity=request.form.get('ethnicity'),
                parental_level_of_education=request.form.get('parental_level_of_education'),
                lunch = request.form.get('lunch'),
                test_preparation_course=request.form.get('test_preparation_course'),
                reading_score=float(request.form.get('reading_score')),
                writing_score=float(request.form.get('writing_score')),
                
            )
            pred_df = data.get_data_as_data_frame()
            
            print(pred_df)
            
            predict_pipeline = PredictPipeline()
            results = predict_pipeline.predict(pred_df)
            return render_template('home.html',results = results)
        
        except Exception as e:
            logging.error(f"Error occured while predicting the data:{e}")
            return render_template('home.html',results = e)

if __name__=='__main__':
    port = int(os.environ.get("PORT", 8080))
    app.run(host='0.0.0.0', port=port) # Use the port provided by Beanstalk/AWS if set; otherwise default to port 8080 for local runs and the Docker container.
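With python application.py running locally, the /predictdata endpoint can be smoke-tested without opening a browser. A hypothetical check using requests (not part of requirements.txt; the field values are illustrative):

import requests  # assumption: install separately, it is not in requirements.txt

form = {
    "gender": "female",
    "ethnicity": "group B",
    "parental_level_of_education": "bachelor's degree",
    "lunch": "standard",
    "test_preparation_course": "none",
    "reading_score": "72",
    "writing_score": "74",
}

# POST the same fields the home.html form would submit.
response = requests.post("http://localhost:8080/predictdata", data=form)
print(response.status_code)   # 200 if the pipeline ran
print(response.text[:500])    # rendered home.html containing the prediction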

Local Deployment Video by Krish Naik Sir

Action A4.2: Deployment of Model via Codepipeline, and Beanstalk.
A4.2.1: Setting up Beanstalk
  • Elastic Beanstalk is a service that allows you to deploy your application on AWS on managed EC2-based Linux instances.
    • To set up Beanstalk, you need to have an AWS account. If you don’t have one, you can create one for free.
    • Open the AWS console and search for Elastic Beanstalk. Click on it and then click on create a new application.
    • Follow the steps one by one, but remember to create an IAM role for the application with the permissions AWSElasticBeanstalkFullAccess, AWSElasticBeanstalkWebTier, AWSElasticBeanstalkEnhancedHealth and AdministratorAccess-AWSElasticBeanstalk.
    • In one of the steps you will also need an EC2 key pair; you can create one in the AWS console while creating the application.
    • For the rest, set up the EC2 instance profile via the IAM role in those steps, then keep the default options and create the application.
    • Locally we also need a config file inside a folder called .ebextensions; it contains the configuration for the Beanstalk instance.
    • The code for the config file is given below.
  • After all the setup, move on to the steps below, but before starting the CodePipeline make sure that the Beanstalk environment is running.
    • Also, to view our application, make sure you set up the port in Beanstalk by following the steps given below.
    • To do that navigate to the Beanstalk console, find your environment, and click on the “Configuration” tab.
    • Scroll down to the “Software” section and click on “Edit”.
    • In the “Environment Properties” section, add a new property called PORT with a value of 5000.
    • Save the changes and wait for the environment to restart before you run the code-pipeline.
  • .ebextensions config file
option_settings:
  "aws:elasticbeanstalk:container:python":
    WSGIPath: application:application
A4.2.2: Setting up the Github Repo
  • After setting up Beanstalk, simply push to the GitHub repo that we created at the beginning of the project.
    • This repo contains all the code of our application. To deploy the code on the Linux instance, there has to be a pipeline connecting the GitHub repo and the Beanstalk server. Such a pipeline is called a code pipeline in AWS. Whenever any change happens in our code, the code pipeline applies it to the deployed application at the push of a button.
    • Such a pipeline is generally called a continuous delivery pipeline.
A4.2.3: Setting up the Code Pipeline
  • To set up the code pipeline, we use the CodePipeline service in AWS.
    • We will use AWS CodePipeline to create a pipeline connecting our Git repo to Elastic Beanstalk.
    • Follow the steps with the default options while setting up the pipeline; make sure it is connected to the GitHub repo and the branch on which the code is present, using the GitHub version 1 connection for simplicity.
    • Once the pipeline is connected, we need to build and deploy it.
      • We skip the build stage, but we do deploy our application to Elastic Beanstalk; while deploying we specify the name of the application and the environment as well.
      • Now, ensuring that the Beanstalk environment is already running on port 5000, start the code pipeline, which will connect to the GitHub repo and deploy the application to the Beanstalk environment.
      • Once the pipeline has run, we can view the application by clicking the link in the domain column of the Elastic Beanstalk application console.
      • The link will be similar to http://studperformance-env.eba-wmtvi3wb.eu-central-1.elasticbeanstalk.com/

Beanstalk Deployment Video by Krish Naik Sir

A4.3 Deployment of model via Docker, Github-Action-Runners, AWS ECR, EC2 instances.

A4.3.1 Setting up Docker Containers
  • Assuming the reader knows the basics of Docker images and containers, the following steps ensure that our application is ready for deployment via AWS.
  • Setup Docker container:
    • Build an image using a Dockerfile (a minimal sketch is shown right after this list) and add a .dockerignore file so that the venv/ environment is not copied into the image.
    • Run the image in a container using the command docker run -p 8080:8080 -v /path/to/ml_application:/app my-app
      • The -p flag is used to expose a port on the host machine, so that you can access it from outside the container. Here, we are exposing port 8080 on the host machine, which is mapped to port 8080 on the container.
      • The -v flag is used to mount a volume, which allows the container to access a directory on the host machine. The first path is the path to the directory on the host machine, and the second path is the path to the directory in the container.
      • The last argument my-app is the name of the image that you want to run.
      • The container is listening on port 8080, so the application.py file should have the port number set to 8080.
      • Access the application by visiting http://localhost:8080 in your web browser.
      • If you want to run the container in the background, you can use the -d flag.
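The Dockerfile referred to above would look roughly as follows; this is a minimal sketch only (base image, system packages and port are assumptions, adapt it to your own setup):

# Dockerfile (sketch)
FROM python:3.8-slim-buster
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 8080
CMD ["python3", "application.py"]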

A4.3.2 Setting AWS IAM role, ECR repo, EC2 instance

  • Setup of AWS IAM role
    • Create a new user and attach the AmazonEC2ContainerRegistryFullAccess and AmazonEC2FullAccess permissions to the user.
    • Set up access keys for the user and download them in CSV format.
  • Setup of ECR repo
    • Go to Elastic Container Registry, create a new repository named studentperformance, and copy the repository URI into the aws yml file.
  • Setup of EC2 instance
    • Use the default settings with an Ubuntu instance, and allow all HTTP connections to this instance in one of the steps while setting up the instance.
    • Once the instance is up and running, connect to the instance.
    • On the instance run the following commands to install docker.
      • sudo apt-get update -y
      • sudo apt-get upgrade
      • curl -fsSL https://get.docker.com -o get-docker.sh
      • sudo sh get-docker.sh
      • sudo usermod -aG docker ubuntu
      • newgrp docker
    • Single Command:
    sudo apt-get update -y && sudo apt-get upgrade -y && curl -fsSL https://get.docker.com -o get-docker.sh && sudo sh get-docker.sh && sudo usermod -aG docker ubuntu && newgrp docker 
    • Also, in the instance configuration, under security groups in the network settings, add a new inbound rule (a custom TCP rule) so that we can access our application on port 8080.

A4.3.3 Setup Github Workflow, Github Runner Actions and Github secrets to be used by the workflow.

  • Github workflow:
    • The idea is that once we update our code on GitHub, a Docker image is built and pushed to the ECR repository; this image is then pulled onto the EC2 instance we created, and the application runs on that instance from the new image.
    • This same idea is expressed in the aws.yaml file in our GitHub workflow folder.
  • Setting up runner in github to run the workflow:
    • This self-hosted runner executes the workflow whenever there is a change in the code.
    • Go to the EC2 instance and run all the commands shown in the GitHub repo settings when you create a self-hosted runner from the Actions tab of the repo settings.
    • Keep the default options as they are during the setup; when asked for the name of the runner, I gave it as self-hosted.
    • Remember that if the runner is unused for a while it will go offline; to bring it back online, run ./run.sh from the actions-runner folder on the EC2 instance.
  • After this, add the following GitHub secrets under the Actions tab in the repo settings. These are used by the workflow itself.
    • AWS_ACCESS_KEY_ID: Created when we created the user in IAM
    • AWS_SECRET_ACCESS_KEY: Created when we created the user in IAM
    • AWS_DEFAULT_REGION: See it in the ec2 instance details
    • AWS_ECR_LOGIN_URI: See it in the ECR repository details (this is the registry URI; it does not include the repository name).
    • AWS_ECR_REPOSITORY: studentperformance or whatever name you have given to the repository

A4.3.4 Running the application

  • Now simply go to the EC2 instance URL and append :8080 to it to see the application running. Play with it and have fun predicting students' marks.

AWS CI/CD Pipeline Deployment Video by Krish Naik Sir.

Gradio Webapp

Code
%%html
<script type="module" src="https://gradio.s3-us-west-2.amazonaws.com/3.24.1/gradio.js"
></script><gradio-app src="https://yuvidhepe-studentperformance.hf.space"></gradio-app>

Thanks to Krish Naik sir for his support throughout this project journey

  • The outline and execution of this whole project follow Krish Naik sir, whose efforts to make Data Science simple through such projects have been enormous. You can give him a big shoutout on LinkedIn and learn from him on YouTube.