End to End Machine Learning Project on Student Performance Dataset
Before I begin, I want to thank Krish Naik Sir, whom I learn from on YouTube, for creating a series around end-to-end machine learning projects. I have learnt a lot from his videos and am trying to implement the same in this notebook. I have also referred to his notebook for this project and have added more details and explanations wherever I felt necessary.
A brief intro of this project work
These days it is getting more common for a person on a data team to wear multiple hats when bringing insights out of a sea of data. To gain experience of what it means to wear multiple hats, I created this end-to-end project on the simple Students Performance Dataset from Kaggle.
The goal of this project is not to derive any magical insights from the data, but rather to do comprehensive work on building an end-to-end project, which includes but is not limited to:
Designing and developing components and pipelines, where the components interact with the data in the backend, while the pipelines connect the user and the components to get predictions from a trained ML model and finally return the result to the user.
Follow the actions mentioned below to make your own copy of this end to end project
You can try the web app created for this project before you get your hands dirty.
Notebook checkpoints
STAR ANALYSIS
Every point of the STAR method is explained step by step in detail.
For example, Action is broken down into A1, A2, and so on, so you can follow along in the notebook.
We use acronyms like T1 for Task 1 and A1.1 for sub-action 1 of Action 1.
STAR ANALYSIS
Situation: To gain experience in end-to-end machine learning project development and deployment
Task: Create a machine learning project from scratch, following the entire pipeline from data collection to model deployment
Action: Develop a project idea, collect and preprocess the data, perform EDA on the data, decide on the design and training of the machine learning model, evaluate the model’s performance, and deploy the model into a production environment
Result: Gain hands-on experience in end-to-end machine learning project development and deployment, with a fully functional machine learning system that can be used for real-world applications
Situation
S1. The need to gain exposure to real-world ML project development and deployment
S2. A way to improve my Data Science profile with such projects
S3. Building a skill set that is useful in the real world and not limited to books
With the situation clear, let’s jump to the tasks that needed to be done for it.
Tasks
T1. Creating a folder structure, for a real-world entity project.
Uses: introduces modularity to the project, rather than having a single Jupyter notebook file do all the work.
T2. Creating an environment and a setup file to run this ML pipeline from scratch.
A. Developing an end-to-end ML pipeline and then performing web deployment to serve the ML model.
With this basic overview of the tasks, let’s look at every task in detail.
Task T1: Creation of Project Structure
Creating a folder structure for our real-world project. This is an essential part of any real-world code project, as it introduces modularity to our code. This modularity helps us deal with the complexity of large projects in a simple way: a team can work on different parts of the project in parallel, re-use each other’s work, and combine it all at the end.
T1.1: Folder Structure Creation
First, set up a GitHub repo (ML_Web_Project is my repo), keeping all the options at their defaults.
Locally, set up a folder (END_To_END_ML_PROJECT is my base local folder, set up on WSL, but one can use Windows or macOS as well).
Open this directory in vscode
Open a terminal
Secondly, let’s create a conda environment named venv inside this local folder, so the packages needed to run the project live locally.
```bash
conda create -p venv python==3.8 -y
```
Activate this environment from the base folder
```bash
conda activate venv/  # don't forget the '/': it tells conda that this environment lives in a folder named venv
```
Link the local folder to the github repo
First do git init in the local folder
Follow all the steps shown on the GitHub repo page you created to sync the local folder with the repo.
After GitHub’s authentication update back in 2021, one needs to set up SSH keys or personal access tokens to use a GitHub repo. I prefer SSH keys; follow the steps here (typical commands are sketched after this list).
Create a default .gitignore for python in github repo online.
Finally do a git pull, to sync the changes locally as well.
Later on, whenever there are enough changes to the local code, follow the usual git add, commit, and push steps with a useful commit message, as sketched below.
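For reference, here is a minimal sketch of the commands involved; the key type, email address, and commit message are illustrative and should be adapted to your setup.

```bash
# Generate an SSH key and register it with GitHub (Settings -> SSH and GPG keys), then verify:
ssh-keygen -t ed25519 -C "you@example.com"
cat ~/.ssh/id_ed25519.pub    # copy this public key into GitHub
ssh -T git@github.com        # verify the connection

# Typical sync cycle once the remote is configured:
git add .
git commit -m "Add data ingestion component"
git push origin main
```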
By now the local repo should contain a .gitignore, README.md, and venv/. After this, create the following folder structure locally.
```
END_TO_END_ML_PROJECT/
- setup.py            # The setup script is the center of all activity in building, distributing, and
                      # installing the modules necessary to run the ML pipeline. Consider it the core of
                      # the ML pipeline. setup.py lets us use the pipeline as a package itself, which can
                      # even be deployed to PyPI.
- requirements.txt    # All packages that need to be installed before running the project.
                      # This is the part that gives energy to the core.
- assets/
  - data/             # The folder which consists of datasets used in the project.
    - StudentsPerformance.csv
  - files/
  - notebook/         # Jupyter notebooks with all the code that helps find patterns in the data and gives
                      # a big-picture view, later broken down into the src folders.
    - EDA_notebook.ipynb
    - Model_train.ipynb
  - images/           # To store plot images.
- src/                # The backbone containing all the source code for the ML pipeline package.
  - __init__.py
  - exception.py      # Helps in producing custom exceptions.
  - logger.py         # Creates logs and helps trace errors at runtime.
  - utils.py          # Utilities that can be reused across the whole project.
  - components/       # The major components of the project: data cleaning, transformation, model training, etc.
    - __init__.py
    - data_ingestion.py
    - data_transformation.py
    - model_trainer.py
  - pipeline/         # The complete pipelines built from the components, for deployment of the model.
    - __init__.py
    - predict_pipeline.py
    - train_pipeline.py
```
Task T2: Environment creation and setup
Creating an environment and a setup file which can later be used to package our ML pipeline. In this part we build the foundation for the ML pipeline by writing the setup.py file.
Task T2.1: Setup File
Code
```python
from setuptools import find_packages, setup
from typing import List


def get_requirements(file_path: str) -> List[str]:
    '''
    This function will return the list of requirements
    '''
    requirements = []
    file = open(file_path, 'r')
    for line in file:
        if "-e ." not in line:
            requirements.append(line.strip('\n'))
    file.close()
    # print(requirements)
    return requirements


# With this setup we parse our requirements file to get the requirements installed for our project.
# One can make this static via a list of package names instead of parsing a requirements file.
setup(
    name='mlproject',
    version='0.0.1',
    author='<Your Name>',
    author_email='<Your Email>',
    packages=find_packages(),  # Picks up the modules we write for our ML pipeline. To ensure every module
                               # can be used when building the package, we keep an __init__.py in src and
                               # in any directory that is meant to be reused.
    install_requires=get_requirements('requirements.txt')
)
```
contents of requirements.txt file
```
pandas
numpy
seaborn
matplotlib
scikit-learn
catboost
xgboost
dill
tensorboard
# The line below triggers setup.py automatically, but it is skipped when setup.py itself parses this file, as per the code above.
-e .
```
Once these two files are set up, simply run:
pip install -r requirements.txt
This will install all the necessary packages in our virtual environment and create a new directory .egg-info which will help to create the ML pipeline package for us.
Actions
A1. Project Idea: Use the student performance data to predict a student’s grades or scores from the other features of the dataset.
A2. Data Collection and Preprocessing: We first do all the EDA in a Jupyter notebook to find patterns in the data and understand the preprocessing that the dataset requires.
For this simple application the data is imported as a CSV file, but all of this could also be done by pulling data from a data warehouse.
A3. Design and Development of ML pipeline components: After the EDA, we write simple modular code in a Jupyter notebook that handles the development, training, and evaluation of the ML model. Later this modular code is split into the folder structure that we created earlier.
A4. Deployment of the model into a production environment: We use cloud services such as AWS together with web frameworks like Streamlit, Flask, or Django to deploy the ML model online, so it can be used on real-time data provided by the user or fetched from a source.
Action A1 & A2: Project Idea, Data collection and Preprocessing
Project Idea
We will use a simple student performance dataset to predict a student’s math score from the rest of the features of the dataset.
We use this dataset because it has a mix of categorical and numerical features, it allows a good amount of EDA to be done on simple data, and, last but not least, many regression algorithms can easily be trained on it.
Data Collection & Preprocessing
We will use Jupyter notebooks to do the majority of the EDA and find the patterns.
Once the EDA is done, we also run basic models on the data in another Jupyter notebook, so that we have the basic model pipeline code in place as well.
I have briefly described the insights from the EDA and model training on my GitHub; you can view them here. With the insights in place, let’s begin with the design and development of the ML pipeline.
Action A3: Design and Development of ML pipeline components
Design and development of ML pipeline components in the form of modular code
Steps
Creation of the utility, logging, and exception-handling modules that will be used across all components and pipelines.
Creation of the component modules inside the package, consisting of the Data Ingestion, Data Transformation, and Model Trainer components.
Creation of the train and predict pipeline modules that connect to the above components and act as the bridge between the frontend user and the backend machine learning model.
Action A3.1: Creation of Utilities, Loggers and Exceptions
Action A3.1.1: Creation of Utilities
Code
```python
# Common functionalities for the whole project.
import os
import sys

import dill
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

from src.exception import CustomException


def save_object(file_path, obj):
    try:
        dir_path = os.path.dirname(file_path)
        os.makedirs(dir_path, exist_ok=True)
        with open(file_path, "wb") as file_obj:
            dill.dump(obj, file_obj)
    except Exception as e:
        raise CustomException(e, sys)


def evaluate_models(X_train, y_train, X_test, y_test, models, param):
    try:
        report = {}
        for i in range(len(list(models))):
            model = list(models.values())[i]
            para = param[list(models.keys())[i]]

            # Hyperparameter search on the training data, then refit with the best parameters.
            gs = GridSearchCV(model, para, cv=3)
            gs.fit(X_train, y_train)

            model.set_params(**gs.best_params_)
            model.fit(X_train, y_train)

            y_train_pred = model.predict(X_train)
            y_test_pred = model.predict(X_test)

            train_model_score = r2_score(y_train, y_train_pred)
            test_model_score = r2_score(y_test, y_test_pred)

            report[list(models.keys())[i]] = test_model_score
        return report
    except Exception as e:
        raise CustomException(e, sys)


def load_object(file_path):
    try:
        with open(file_path, "rb") as file_obj:
            return dill.load(file_obj)
    except Exception as e:
        raise CustomException(e, sys)


def create_plot(y_test, y_pred, type, model_name,
                xlabel="Actual Math Score", ylabel="Predicted Math Score",
                file_name="Actual vs Predicted"):
    """
    A function to create a plot and save it to a file.
    """
    directory = "./assets/images/"
    if type == "scatter":
        title = f"{model_name}'s Actual vs Predicted Values Scatterplot"
        plt.scatter(y_test, y_pred)
        plt.title(title)
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.savefig(f"{directory}{file_name}")
    elif type == "reg":
        title = f"{model_name}'s Actual vs Predicted Values Regplot"
        sns.regplot(x=y_test, y=y_pred, ci=None, color='red')
        plt.title(title)
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.savefig(f"{directory}{file_name}_regplot")
```
Action A3.1.2: Creation of Logger
Code
```python
# Logger is for the purpose of logging all the events in the program from execution to termination.
# For example, whenever there is an exception, we can log the exception info in a file via the logger.
# Read the logging documentation at https://docs.python.org/3/library/logging.html
import logging
import os
from datetime import datetime

LOG_FILE_NAME = f"{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}.log"
logs_path = os.path.join(os.getcwd(), "logs")  # Creates a logs folder in the current working directory
os.makedirs(logs_path, exist_ok=True)          # Keep appending logs to the same directory across multiple runs

LOG_FILE_PATH = os.path.join(logs_path, LOG_FILE_NAME)

logging.basicConfig(
    filename=LOG_FILE_PATH,
    level=logging.INFO,
    format="[%(asctime)s] %(lineno)d %(name)s - %(levelname)s: %(message)s",
    datefmt='%m/%d/%Y %I:%M:%S %p',
)  # Basic configuration for the logger

if __name__ == '__main__':
    logging.info("This is a test log")
    logging.warning("This is a warning log")
    logging.error("This is an error log")
    logging.critical("This is a critical log")
```
Action A3.1.3: Creation of Exception.py
Code
```python
# We use this custom exception class so that all the errors arising in the project are handled in a single place.
import sys
# The sys module provides access to variables used or maintained by the interpreter and to functions that
# interact strongly with it, such as the current exception info.
# Read more about the sys module here: https://docs.python.org/3/library/sys.html

from src.logger import logging


def error_message_detail(error, error_detail: sys):
    _, _, exec_tb = error_detail.exc_info()
    file_name = exec_tb.tb_frame.f_code.co_filename
    error_message = (
        f"Error occurred in python script name {file_name} "
        f"on line number {exec_tb.tb_lineno} and error is {str(error)}"
    )
    return error_message


class CustomException(Exception):
    def __init__(self, error_message, error_detail: sys):
        super().__init__(error_message)
        self.error_message = error_message_detail(error_message, error_detail=error_detail)

    def __str__(self):
        return f"{self.error_message}"


# Read more about custom exception handling here:
# https://www.programiz.com/python-programming/user-defined-exception
if __name__ == '__main__':
    try:
        a = 10
        b = 0
        c = a / b
        print(c)
    except Exception as e:
        logging.error(e)
        raise CustomException(e, error_detail=sys)
```
Action A3.2: Creation of Components
A3.2.1 Data Ingestion Component
Data is a central component of any project. In this component, we write two classes: DataIngestionConfig and DataIngestion.
DataIngestionConfig holds the public path variables for the train, test, and raw data.
DataIngestion creates an object that instantiates a DataIngestionConfig during initialization and retrieves those path variables.
Using those paths, the initiate_data_ingestion method reads the data, splits it into train and test sets, and saves them to the artifacts directory.
Data ingestion is a crucial step in any project that involves handling data. This process involves extracting data from different sources, such as databases or warehouses, and loading it into a centralized location, such as a data warehouse, data lake, or data mart. Typically, this task is performed by a specialized big data team, whose responsibility is to ensure that data is obtained from various sources and stored in different formats, such as Hadoop or MongoDB.
As Data Scientists, it’s essential to have knowledge of how to extract data from different sources, such as Hadoop, MongoDB, MySQL, or Oracle, and make it available for analysis. Since data is a critical asset in any project, understanding the process of data ingestion is vital to ensure that the data is organized and stored in a way that facilitates analysis.
Code
```python
import os
import sys
from dataclasses import dataclass

import pandas as pd
from sklearn.model_selection import train_test_split

from src.components.data_transformation import DataTransformation, DataTransformationConfig
from src.components.model_trainer import ModelTrainer, ModelTrainerConfig
from src.exception import CustomException
from src.logger import logging


@dataclass
class DataIngestionConfig:
    '''
    Used for defining the configuration for data ingestion.
    '''
    train_data_path: str = os.path.join('artifacts', 'train.csv')
    test_data_path: str = os.path.join('artifacts', 'test.csv')
    raw_data_path: str = os.path.join('artifacts', 'data.csv')


class DataIngestion:
    '''
    Used for ingesting data by making use of the configuration defined in DataIngestionConfig.
    '''
    def __init__(self, ingestion_config: DataIngestionConfig = DataIngestionConfig()):
        self.ingestion_config = ingestion_config

    def initiate_data_ingestion(self, raw_data_path: str = None):
        try:
            # Reading data here.
            logging.info("Initiating data ingestion")
            if raw_data_path is not None:
                self.ingestion_config.raw_data_path = raw_data_path
                data = pd.read_csv(self.ingestion_config.raw_data_path)
            else:
                data = pd.read_csv('./assets/data/NewSPerformance.csv')

            os.makedirs(os.path.dirname(self.ingestion_config.train_data_path), exist_ok=True)
            # Save a raw copy of the data before splitting.
            data.to_csv(self.ingestion_config.raw_data_path, index=False, header=True)
            logging.info("Data ingestion completed")

            logging.info("Train test split initiated")
            train_set, test_set = train_test_split(data, test_size=0.2, random_state=18)
            train_set.to_csv(self.ingestion_config.train_data_path, index=False, header=True)
            test_set.to_csv(self.ingestion_config.test_data_path, index=False, header=True)
            logging.info("Train test split ingestion completed")

            return (
                self.ingestion_config.train_data_path,
                self.ingestion_config.test_data_path,
            )
        except Exception as e:
            logging.error("Error occurred in data ingestion")
            raise CustomException(e, sys)


if __name__ == '__main__':
    obj = DataIngestion()
    train_data, test_data = obj.initiate_data_ingestion()

    data_transformation = DataTransformation()  # Called here just for the sake of demonstration.
    train_arr, test_arr, _ = data_transformation.initiate_data_transformation(train_data, test_data)

    modeltrainer = ModelTrainer()
    print(modeltrainer.initiate_model_trainer(train_arr, test_arr))
```
A3.2.2 Data Transformation Component
Once data ingestion is done, the data transformation component transforms the data so that it is useful for analysis and for training models.
The DataTransformationConfig class stores the public path variable for the preprocessing object, saved as a pickle file to be used later when building the web app.
The DataTransformation class creates an object that uses a DataTransformationConfig object to get the preprocessor path.
The get_data_transformer_object method returns a preprocessor object that can preprocess the numerical and categorical columns.
The initiate_data_transformation method uses get_data_transformer_object to do all the preprocessing on the train and test files, whose paths come from the Data Ingestion component. After preprocessing, it returns train and test arrays containing the features and the target variable.
A data transformation component is a crucial part of the data science process, which involves transforming raw data into a format that can be used for analysis. Data Scientists play a vital role in this process as they use various techniques such as feature engineering, feature selection, feature scaling, data cleaning, and handling null values to ensure the quality of data used for analysis. By understanding the process of data transformation, Data Scientists can generate valuable insights from raw data and make informed business decisions.
Code
```python
import os
import sys
from dataclasses import dataclass

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from src.exception import CustomException
from src.logger import logging
from src.utils import save_object


@dataclass  # This decorator turns the class into a dataclass.
class DataTransformationConfig:
    '''
    Dataclass that stores the path for the data transformation (preprocessor) object.
    '''
    preprocessor_obj_file_path = os.path.join("artifacts", "preprocessor.pkl")


class DataTransformation:
    def __init__(self, transformation_config: DataTransformationConfig = DataTransformationConfig()):
        self.data_transformation_config = transformation_config

    def get_data_transformer_object(self):
        '''
        This function is responsible for creating a preprocessing data transformation object.
        '''
        try:
            numerical_columns = ["writing_score", "reading_score"]
            categorical_columns = [
                "gender",
                "race_ethnicity",
                "parental_level_of_education",
                "lunch",
                "test_preparation_course",
            ]

            num_pipeline = Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy='median')),
                    ("scaler", StandardScaler()),
                ]
            )
            cat_pipeline = Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy='most_frequent')),
                    ("one_hot_encoder", OneHotEncoder()),
                    ("scaler", StandardScaler(with_mean=False)),
                ]
            )

            logging.info(f"Numerical columns: {numerical_columns}")
            logging.info(f"Categorical columns: {categorical_columns}")

            preprocessor = ColumnTransformer(
                [
                    ("num_pipeline", num_pipeline, numerical_columns),
                    ("cat_pipeline", cat_pipeline, categorical_columns),
                ]
            )
            return preprocessor
        except Exception as e:
            raise CustomException(e, sys)

    def initiate_data_transformation(self, train_path, test_path):
        '''
        Here we use the preprocessing object to transform the data.
        '''
        try:
            train_df = pd.read_csv(train_path)
            test_df = pd.read_csv(test_path)
            logging.info("Read train and test data completed")

            logging.info("Obtaining preprocessing object and starting processing.")
            preprocessing_obj = self.get_data_transformer_object()

            target_column_name = "math_score"
            numerical_columns = ["writing_score", "reading_score"]

            input_feature_train_df = train_df.drop(columns=[target_column_name], axis=1)
            target_feature_train_df = train_df[target_column_name]

            input_feature_test_df = test_df.drop(columns=[target_column_name], axis=1)
            target_feature_test_df = test_df[target_column_name]

            input_feature_train_arr = preprocessing_obj.fit_transform(input_feature_train_df)
            # Only transform (not fit) the test data so it uses the statistics learnt on the train set.
            input_feature_test_arr = preprocessing_obj.transform(input_feature_test_df)

            train_arr = np.c_[input_feature_train_arr, np.array(target_feature_train_df)]
            test_arr = np.c_[input_feature_test_arr, np.array(target_feature_test_df)]

            logging.info("Saved preprocessing object at the configured file path")
            save_object(
                file_path=self.data_transformation_config.preprocessor_obj_file_path,
                obj=preprocessing_obj,
            )

            return (
                train_arr,
                test_arr,
                self.data_transformation_config.preprocessor_obj_file_path,
            )
        except Exception as e:
            raise CustomException(e, sys)
```
A3.2.3 Model Trainer Component
Model Trainer (MT): once the above components have converted the data into the desired format, we can train various models. This component consists of two classes, as follows.
The ModelTrainerConfig class stores the public path variable under which the trained model object is saved in pickle format.
The ModelTrainer class exposes an initiate_model_trainer method that takes the train and test arrays from the Data Transformation component. It trains several models on the train array, selects the best model based on the r2 score, and makes predictions on the test array with it. The method also uses the ModelTrainerConfig object to store the trained model in a local directory and, finally, returns the r2 score of the best model on the test data.
The model trainer component is responsible for training machine learning models on the transformed data. Data Scientists use this component to select an appropriate algorithm, tune hyperparameters, and train the model on the data. The trained model is then evaluated for its performance, and if it meets the desired level of accuracy, it is deployed for production use. The role of Data Scientists in this component is to select and fine-tune the machine learning models that best fit the problem at hand, and ensure that the models meet the business requirements. Ultimately, the model trainer component helps Data Scientists to generate insights and make predictions that can drive business decisions.
Code
```python
import os
import sys
from dataclasses import dataclass

import pandas as pd
from catboost import CatBoostRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

from src.exception import CustomException
from src.logger import logging
from src.utils import create_plot, evaluate_models, save_object


@dataclass
class ModelTrainerConfig:
    # Stores the path of the trained model object generated by this module.
    trained_model_file_path = os.path.join("artifacts", "model.pkl")


class ModelTrainer:
    def __init__(self, model_train_config: ModelTrainerConfig = ModelTrainerConfig()) -> None:
        self.model_trainer_config = model_train_config

    def initiate_model_trainer(self, train_array, test_array):
        try:
            logging.info("Split training and test input data")
            X_train, y_train, X_test, y_test = (
                train_array[:, :-1],
                train_array[:, -1],
                test_array[:, :-1],
                test_array[:, -1],
            )
            models = {
                "Random Forest": RandomForestRegressor(),
                "Decision Tree": DecisionTreeRegressor(),
                "Gradient Boosting": GradientBoostingRegressor(),
                "Linear Regression": LinearRegression(),
                "XGBRegressor": XGBRegressor(),
                "CatBoosting Regressor": CatBoostRegressor(verbose=False),
                "AdaBoost Regressor": AdaBoostRegressor(),
                "Ridge": Ridge(),
                "Lasso": Lasso(),
            }
            params = {
                "Ridge": {
                    "alpha": [0.1, 1, 10],
                    "fit_intercept": [True, False],
                },
                "Lasso": {
                    "alpha": [0.1, 1, 10],
                    "fit_intercept": [True, False],
                },
                "Decision Tree": {
                    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
                    'splitter': ['best', 'random'],
                    # 'max_features': ['sqrt', 'log2'],
                    # "max_depth": [None, 5, 10],
                    # "min_samples_split": [2, 5, 10],
                },
                "Random Forest": {
                    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
                    # 'max_features': ['sqrt', 'log2', None],
                    'n_estimators': [8, 16, 32, 64, 128, 256],
                    # "max_depth": [None, 5, 10],
                    # "min_samples_split": [2, 5, 10],
                },
                "Gradient Boosting": {
                    # 'loss': ['squared_error', 'huber', 'absolute_error', 'quantile'],
                    'learning_rate': [.1, .01, .05, .001],
                    'subsample': [0.6, 0.7, 0.75, 0.8, 0.85, 0.9],
                    'criterion': ['squared_error', 'friedman_mse'],
                    # 'max_features': ['auto', 'sqrt', 'log2'],
                    'n_estimators': [8, 16, 32, 64, 128, 256],
                    # "max_depth": [None, 5, 10],
                    # "min_samples_split": [2, 5, 10],
                },
                "Linear Regression": {
                    "fit_intercept": [True, False],
                    # "normalize": [True, False],
                },
                "XGBRegressor": {
                    'learning_rate': [.1, .01, .05, .001],
                    'n_estimators': [8, 16, 32, 64, 128, 256],
                    # "max_depth": [None, 5, 10],
                    # "min_child_weight": [1, 3, 5],
                },
                "CatBoosting Regressor": {
                    'depth': [6, 8, 10],
                    'learning_rate': [0.01, 0.05, 0.1],
                    'iterations': [30, 50, 100],
                    # "n_estimators": [50, 100, 250],
                    # "max_depth": [None, 5, 10],
                    # "reg_lambda": [0.1, 1, 10],
                },
                "AdaBoost Regressor": {
                    'learning_rate': [.1, .01, 0.5, .001],
                    # 'loss': ['linear', 'square', 'exponential'],
                    'n_estimators': [8, 16, 32, 64, 128, 256],
                },
            }

            model_report: dict = evaluate_models(
                X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                models=models, param=params,
            )
            print(model_report)
            model_report_df = pd.DataFrame(model_report, index=[0])
            model_report_df.to_csv("./assets/files/model_report.csv", index=False)

            # To get the best model score from the dict
            best_model_score = max(sorted(model_report.values()))
            # To get the best model name from the dict
            best_model_name = list(model_report.keys())[
                list(model_report.values()).index(best_model_score)
            ]
            best_model = models[best_model_name]

            if best_model_score < 0.6:
                raise CustomException("No best model found", sys)
            logging.info("Best found model on both training and testing dataset")

            save_object(
                file_path=self.model_trainer_config.trained_model_file_path,
                obj=best_model,
            )

            predicted = best_model.predict(X_test)
            r2_square = r2_score(y_test, predicted)

            create_plot(y_test, predicted, type='scatter', model_name=best_model_name)
            create_plot(y_test, predicted, type='reg', model_name=best_model_name)

            return r2_square
        except Exception as e:
            raise CustomException(e, sys)
```
Action A3.3: Creation of train and predict pipelines
A3.3.1 Train Pipeline
A pipeline that interacts with the DI, DT, and MT components to process the raw data made available from the frontend.
This pipeline has a TrainPipeline class, which takes in the raw data; its train method interacts with the DI, DT, and MT components and simply returns the best model’s r2 score at the end.
Currently this pipeline is not in a production environment; it is only used locally to train the model on local data.
To run this pipeline and train the models, simply run the file with the appropriate raw data:
python3 ./src/pipeline/train_pipeline.py
Code
```python
# Uses the components from src in the train pipeline to train the model on the dataset.
import sys

import pandas as pd

from src.logger import logging
from src.exception import CustomException
from src.components import data_ingestion as di
from src.components import data_transformation as dt
from src.components import model_trainer as mt


class TrainPipeline:
    def __init__(self, raw_data_path=None):
        self.raw_data_path = raw_data_path

    def train(self):
        try:
            logging.info("Initiating data ingestion")
            di_obj = di.DataIngestion()
            train_data, test_data = di_obj.initiate_data_ingestion(raw_data_path=self.raw_data_path)
            logging.info("Data ingestion completed")

            logging.info("Initiating data transformation")
            dt_obj = dt.DataTransformation()
            train_arr, test_arr, _ = dt_obj.initiate_data_transformation(train_data, test_data)
            logging.info("Data transformation completed and saved preprocessor object")

            logging.info("Training the model")
            mt_obj = mt.ModelTrainer()
            print(f"Best model's r2_score: {mt_obj.initiate_model_trainer(train_arr, test_arr)}")
            logging.info("Model training completed and saved the best model")
        except Exception as e:
            raise CustomException(e, sys)


if __name__ == "__main__":
    train_pipeline_obj = TrainPipeline("data/NewSPerformance.csv")
    train_pipeline_obj.train()
```
A3.3.2 Predict Pipeline Component
Predict Pipeline: a pipeline that takes the user inputs and makes a prediction on them using the trained model and the other objects, such as the preprocessor object, created via the train pipeline.
This pipeline contains a CustomData class, which takes the user inputs submitted to our application and returns a data frame built from them.
The PredictPipeline class takes the data frame returned by CustomData as features, scales it with the transformer generated by the DT component, and finally makes predictions using the best model from the MT component, which are shown back to the user. A minimal sketch of these two classes is given below.
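The source of predict_pipeline.py is not reproduced in this write-up, so here is a minimal sketch of what the two classes can look like, assuming the artifact paths used by the components above (artifacts/model.pkl, artifacts/preprocessor.pkl) and the load_object utility from utils.py; treat it as an illustration rather than the exact file.

```python
# Hypothetical sketch of src/pipeline/predict_pipeline.py, based on the description above.
import os
import sys

import pandas as pd

from src.exception import CustomException
from src.utils import load_object


class PredictPipeline:
    def predict(self, features):
        try:
            # Paths assumed to match the ModelTrainer and DataTransformation configs.
            model_path = os.path.join("artifacts", "model.pkl")
            preprocessor_path = os.path.join("artifacts", "preprocessor.pkl")
            model = load_object(file_path=model_path)
            preprocessor = load_object(file_path=preprocessor_path)
            data_scaled = preprocessor.transform(features)
            return model.predict(data_scaled)
        except Exception as e:
            raise CustomException(e, sys)


class CustomData:
    """Maps the form inputs from the web app to a single-row DataFrame."""

    def __init__(self, gender, race_ethnicity, parental_level_of_education,
                 lunch, test_preparation_course, reading_score, writing_score):
        self.gender = gender
        self.race_ethnicity = race_ethnicity
        self.parental_level_of_education = parental_level_of_education
        self.lunch = lunch
        self.test_preparation_course = test_preparation_course
        self.reading_score = reading_score
        self.writing_score = writing_score

    def get_data_as_data_frame(self):
        try:
            return pd.DataFrame({
                "gender": [self.gender],
                "race_ethnicity": [self.race_ethnicity],
                "parental_level_of_education": [self.parental_level_of_education],
                "lunch": [self.lunch],
                "test_preparation_course": [self.test_preparation_course],
                "reading_score": [float(self.reading_score)],
                "writing_score": [float(self.writing_score)],
            })
        except Exception as e:
            raise CustomException(e, sys)
```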
The train and predict pipelines are a critical component of the machine learning process. The train pipeline is responsible for training the machine learning model on the training data. This process involves selecting an appropriate algorithm, fine-tuning hyperparameters, and fitting the model to the training data.
Once the model is trained, it is deployed to the predict pipeline, which is responsible for making predictions on new data. The predict pipeline involves processing the data, applying any necessary transformations, and using the trained model to generate predictions.
Data Scientists play a crucial role in both the train and predict pipelines. They must ensure that the training data is representative of the problem at hand, and that the model is trained and optimized to meet the desired level of accuracy. In addition, they must ensure that the predict pipeline is efficient and reliable, and that the model generates accurate predictions in real-time.
Ultimately, the train and predict pipelines are essential to the machine learning process, as they allow Data Scientists to build and deploy models that can generate valuable insights and drive business decisions.
Action A4: Model Deployment
Deployment of Model into a production environment:
Deployment of model via Flask API locally
Deployment of model via Codepipeline, and Beanstalk.
Deployment of model via Docker, ECR, GitHub Action runners, and EC2.
Action A4.1: Deployment of Model via Flask API locally
Below is the code of the Flask application that uses the predict pipeline and the Flask API.
The Flask API is used to pass data from the frontend to the backend and vice versa.
The prediction pipeline makes the predictions in the backend, and the results are then shared with the user through the Flask API.
This local app can be run directly with python application.py in the terminal.
Code
```python
import os

import numpy as np
import pandas as pd
from flask import Flask, render_template, request
from sklearn.preprocessing import StandardScaler

from src.exception import CustomException
from src.logger import logging
from src.pipeline.predict_pipeline import CustomData, PredictPipeline

application = Flask(__name__)
app = application


## Route for a home page
@app.route('/')
def index():
    return render_template('index.html')


@app.route('/predictdata', methods=['GET', 'POST'])
def predict_datapoint():
    if request.method == 'GET':
        return render_template('home.html')
    else:
        try:
            data = CustomData(
                gender=request.form.get('gender'),
                race_ethnicity=request.form.get('ethnicity'),
                parental_level_of_education=request.form.get('parental_level_of_education'),
                lunch=request.form.get('lunch'),
                test_preparation_course=request.form.get('test_preparation_course'),
                reading_score=request.form.get('reading_score'),
                writing_score=request.form.get('writing_score'),
            )
            pred_df = data.get_data_as_data_frame()
            print(pred_df)

            predict_pipeline = PredictPipeline()
            results = predict_pipeline.predict(pred_df)
            return render_template('home.html', results=results)
        except Exception as e:
            logging.error(f"Error occurred while predicting the data: {e}")
            return render_template('home.html', results=e)


if __name__ == '__main__':
    # Use the port assigned by Beanstalk / the AWS instance if provided; otherwise fall back to port 8080.
    port = int(os.environ.get("PORT", 8080))
    app.run(host='0.0.0.0', port=port)
```
Action A4.2: Deployment of Model via Codepipeline, and Beanstalk.
A4.2.1: Setting up Beanstalk
Elastic Beanstalk is a service that allows you to deploy your application on AWS on managed instances, such as EC2-based Linux machines.
To set up Beanstalk, you need to have an AWS account. If you don’t have one, you can create one for free.
Open the AWS console and search for Elastic Beanstalk. Click on it and then click on create a new application.
Follow the steps one by one, but remember to create an IAM role for the application with the permissions of AWSElasticBeanstalkFullAccess, AWSElasticBeanstalkWebTier, AWSElasticBeanstalkEnhancedHealth and AdministratorAccess-AWSElasticBeanstalk.
Also, during the setup steps, you will need an EC2 key pair. You can create one from the AWS console while creating the application.
Set up the EC2 instance profile using the IAM role itself, keep the remaining options at their defaults, and create the application.
Locally we also need a config folder called .ebextensions, which contains the configuration file for the Beanstalk environment.
A sketch of the config file is given below.
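Since the original config file is not reproduced here, the following is a minimal sketch, assuming the Amazon Linux 2 Python platform; the file name python.config and the WSGI target are illustrative and must match your application module (here, the application object inside application.py).

```yaml
# .ebextensions/python.config  (file name is illustrative)
option_settings:
  "aws:elasticbeanstalk:container:python":
    WSGIPath: "application:application"   # Flask app object `application` inside application.py
```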
After all this setup, move on to the steps below, but before starting the CodePipeline, make sure that the Beanstalk environment is running.
Also, to be able to view our application, make sure you set up the port in Beanstalk by following the steps below.
To do that navigate to the Beanstalk console, find your environment, and click on the “Configuration” tab.
Scroll down to the “Software” section and click on “Edit”.
In the “Environment Properties” section, add a new property called PORT with a value of 5000.
Save the changes and wait for the environment to restart before you run the code-pipeline.
After setting up Beanstalk, simply push the GitHub repo that we created at the beginning of the project.
It contains all the code of our application. To deploy that code on the Linux instance, there has to be a pipeline connecting the GitHub repo and the Beanstalk server. Such a pipeline is called a code pipeline in AWS. Whenever our code changes, the code pipeline pushes the changes to the deployed application with the help of a button.
Such a pipeline is generally called a continuous delivery pipeline.
A4.2.2: Setting up the Code Pipeline
To set up the code pipeline, we use the AWS CodePipeline service.
We will use AWS CodePipeline to create a pipeline connecting our git repo to Elastic Beanstalk.
Follow the steps with the default options. While setting up the pipeline, make sure you connect to the GitHub repo and the branch on which the code is present, using the GitHub version 1 connection for simplicity.
Once the pipeline is connected, we need to build and deploy it.
We skip the build stage, but we do deploy our application to Elastic Beanstalk; while deploying, we specify the name of the application and the environment as well.
Now, ensuring that the Beanstalk environment is already running on port 5000, start the code pipeline, which will connect to the GitHub repo and deploy the application on the Beanstalk environment.
Once the pipeline is running, we can view the application by clicking on the link in the Domain column of the Elastic Beanstalk application console.
The link will be similar to http://studperformance-env.eba-wmtvi3wb.eu-central-1.elasticbeanstalk.com/
A4.3 Deployment of model via Docker, Github-Action-Runners, AWS ECR, EC2 instances.
A4.3.1 Setting up Docker Containers
Assuming that the reader knows the basics of Docker images and containers, following the steps below ensures that our application is ready for deployment via AWS.
Setup Docker container:
Build an image using a Dockerfile (a sketch of its contents is given after this list). Also add a .dockerignore file so that the venv/ environment is not added to the image.
Run the image in a container using the command docker run -p 8080:8080 -v /path/to/ml_application:/app my-app
The -p flag is used to expose a port on the host machine, so that you can access it from outside the container. Here, we are exposing port 8080 on the host machine, which is mapped to port 8080 on the container.
The -v flag is used to mount a volume, which allows the container to access a directory on the host machine. The first path is the path to the directory on the host machine, and the second path is the path to the directory in the container.
The last argument my-app is the name of the image that you want to run.
The container is listening on port 8080, so the application.py file should have the port number set to 8080.
Access the application by visiting http://localhost:8080 in your web browser.
If you want to run the container in the background, you can use the -d flag.
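Since the Dockerfile itself is not reproduced in this write-up, here is a minimal sketch, assuming the application.py entry point and requirements.txt described earlier; the base image tag is illustrative.

```dockerfile
# Hypothetical Dockerfile sketch; adjust the base image and paths to your setup.
FROM python:3.8-slim-buster
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 8080
CMD ["python3", "application.py"]
```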
A4.3.2 Setting AWS IAM role, ECR repo, EC2 instance
Setup of AWS IAM role
Create a new user and attach the AmazonEC2ContainerRegistryFullAccess and AmazonEC2FullAccess permissions to it.
Set up access keys for the user and download them in CSV format.
Setup of ECR repo
Go to Elastic Container Registry, create a new repository named studentperformance, and copy the repository URI into the aws.yaml file.
Setup of EC2 instance
Use the default settings with an Ubuntu instance, and allow all HTTP connections to the instance in one of the setup steps.
Once the instance is up and running, connect to it.
On the instance, run the commands to install Docker (a typical set of commands is sketched after this list).
Also, in the instance configuration, under Security Groups in Networking, add a new inbound custom TCP rule that lets us access our application on port 8080.
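The exact commands are not listed in this write-up; the following is a typical set for installing Docker on an Ubuntu instance via Docker's convenience script. Treat it as a sketch and adapt it as needed.

```bash
sudo apt-get update -y
sudo apt-get upgrade -y

# Install Docker using the official convenience script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Allow the default 'ubuntu' user to run docker without sudo
sudo usermod -aG docker ubuntu
newgrp docker
```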
A4.3.3 Setup of the GitHub Workflow, GitHub Action Runner, and GitHub Secrets used by the workflow
Github workflow:
The idea is that once we update our code on GitHub, a Docker image should be built and pushed to the ECR repository, this image should then be pulled onto the EC2 instance we created, and from that installation our application should run on the EC2 instance.
This same idea is expressed in the aws.yaml file in our GitHub workflow folder; a sketch of such a workflow is given below.
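The aws.yaml file itself is not reproduced here; the sketch below shows the general shape such a workflow can take, assuming the GitHub secrets listed later in this section and a self-hosted runner on the EC2 instance. Job names, action versions, branch names, and the image tag are illustrative.

```yaml
# Hypothetical .github/workflows/aws.yaml sketch; adapt branches, versions, and names to your repo.
name: Deploy to EC2 via ECR

on:
  push:
    branches: [ main ]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_DEFAULT_REGION }}
      - name: Login to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v1
      - name: Build, tag, and push the image
        run: |
          docker build -t ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.AWS_ECR_REPOSITORY }}:latest .
          docker push ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.AWS_ECR_REPOSITORY }}:latest

  deploy:
    needs: build-and-push
    runs-on: self-hosted
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_DEFAULT_REGION }}
      - name: Login to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v1
      - name: Pull and run the latest image
        run: |
          docker pull ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.AWS_ECR_REPOSITORY }}:latest
          docker run -d -p 8080:8080 ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.AWS_ECR_REPOSITORY }}:latest
```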
Setting up runner in github to run the workflow:
This runner triggers the workflow whenever there is a change in the code.
Go to the instance and run all the commands shown in the repo settings when you create a self-hosted runner from the GitHub Actions tab of the ML app repo; this sets up the runner on the EC2 instance.
Keep the default options as they are during the setup; when asked for the name of the runner, I used self-hosted.
Remember that after some idle time the runner goes offline; to bring it back online, run ./run.sh from the actions-runner folder on the EC2 instance.
After this, add the following GitHub secrets under the Actions tab in the repo settings. These are used inside the workflow itself.
AWS_ACCESS_KEY_ID: Created when we created the user in IAM
AWS_SECRET_ACCESS_KEY: Created when we created the user in IAM
AWS_DEFAULT_REGION: see it in the EC2 instance details
AWS_ECR_LOGIN_URI: see it in the ECR repository details; this is the registry URI and does not include the repository name
AWS_ECR_REPOSITORY: studentperformance, or whatever name you have given to the repository
A4.3.4 Running the application
Now simply go to the EC2 instance URL, append :8080 to it, and see the application running. Play with it and have fun predicting students’ marks.
Thanks to Krish Naik sir for his support on this project journey.
The overall project outline and execution come from Krish Naik sir, whose efforts to make Data Science simple through such projects have been enormous. You can give him a big shoutout on LinkedIn and learn from him on YouTube.