The preprocessing step separates the features we will work with throughout the modeling process from those we will leave out. To do so, let us:
get rid of the unused features,
specify the features we require.
✅ As determined earlier, the features we are going to include are:
and Fare,
❌ whereas those we are going to deliberately drop are: PassengerId, Name, Cabin, Ticket and Embarked.
The processing operations will all be done through the Data Processor.
The "Data Processor" is SmartPredict's powerful console for treating a dataset's defects. Among other things, it can handle missing values and select which columns to keep or drop.
To access it from the dataset list, click on the cog icon next to the dataset name. The data processing goes through two steps:
The first is to eliminate all the unused columns.
The second is to deal with the missing values.
As a reminder, the dispensable columns are: PassengerId, Name, Cabin, Ticket and Embarked. The "Cabin" column is chosen as an illustration, as it is full of missing data anyway.
We may continue with the other columns using the same method, or combine all the deletion operations into a single step.
To perform this multiple deletion, select the column names in the drop-down menu.
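For readers who want to reproduce the column deletion outside the Data Processor, here is a minimal pandas sketch. It is only an analogy for what the interface does, not SmartPredict's own implementation. The sample rows are made up and contain just a subset of the dataset's columns.

```python
import pandas as pd

# Tiny made-up sample using Titanic-style column names (illustrative values only).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Age": [22.0, None, 26.0],
    "Ticket": ["T1", "T2", "T3"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# Columns listed as dispensable in this tutorial.
unused_columns = ["PassengerId", "Name", "Cabin", "Ticket", "Embarked"]

# Drop all of them in a single step; the original frame is left untouched.
reduced = df.drop(columns=unused_columns)
print(list(reduced.columns))
```

Passing the whole list to `drop` mirrors selecting several column names at once in the drop-down menu, rather than deleting them one by one.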
We need to handle missing values because they may distort our results. To do so, let us remove the affected rows. See the section about data processing for how to do this.
Since we previously deleted all the superfluous columns (some of which also contained missing values), the "Age" column is all that is left to process.
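The row removal can be sketched in pandas as follows. This is an assumed equivalent of the Data Processor step, shown on made-up values, and it only illustrates the idea of dropping rows whose "Age" is missing.

```python
import numpy as np
import pandas as pd

# Made-up sample in which only the "Age" column still has gaps.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count the missing values per column to confirm where the gaps are.
print(df.isna().sum())

# Remove every row whose "Age" is missing, then renumber the remaining rows.
clean = df.dropna(subset=["Age"]).reset_index(drop=True)
print(clean)
```

Dropping rows is the simplest strategy and loses some data; it is used here because it is the approach this tutorial follows.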
After this, we obtain a clean dataset, which we may save as a new dataset for later use. It is also the dataset from which we are going to create and export our processing pipeline.
To export a processing pipeline, click on the export icon. This exports it directly into SmartPredict's collection of processing pipelines.
We can then retrieve it under the "Processing pipelines" sub-tab.
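Conceptually, a processing pipeline packages the two steps above so they can be replayed on other data. The pandas sketch below illustrates that idea under stated assumptions: the `process` function and the sample rows are hypothetical and say nothing about how SmartPredict stores or runs an exported pipeline.

```python
import numpy as np
import pandas as pd

# Columns listed as dispensable in this tutorial.
UNUSED_COLUMNS = ["PassengerId", "Name", "Cabin", "Ticket", "Embarked"]


def process(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop unused columns, then drop rows with missing values."""
    # errors="ignore" tolerates inputs in which some of the columns are already absent.
    reduced = df.drop(columns=UNUSED_COLUMNS, errors="ignore")
    return reduced.dropna().reset_index(drop=True)


# Made-up "new" data run through the same two steps.
new_data = pd.DataFrame({
    "PassengerId": [10, 11],
    "Name": ["X", "Y"],
    "Age": [np.nan, 40.0],
    "Fare": [8.05, 13.0],
    "Cabin": [None, None],
})

processed = process(new_data)
print(processed)
```

Keeping both steps in one function means the same cleaning is applied in the same order every time it runs.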