Step 3. Preprocess the dataset

This section covers the initial data analysis: the processing and cleaning steps that our dataset must undergo before our model can ingest it.

The preprocessing step determines which characteristics (features) we will keep throughout our modeling process and which we will omit. To do so, we will:

  1. get rid of the unused features

  2. specify the features we require

As determined earlier, the features we are going to include are:

  • Pclass

  • Sex

  • Age

  • SibSp

  • Parch

  • Fare

whereas those we are deliberately going to drop are:

  • PassengerId

  • Name

  • Ticket

  • Cabin

  • Embarked

The processing operations will all be done through the Data Processor.

The Data Processor

The "Data Processor" is SmartPredict's powerful console for treating a dataset's defects. Among other things, it can handle missing values and select which columns to keep or drop.

To access it from the dataset list, click on the cog icon next to the dataset name. The data processing goes through two steps:

  1. First, eliminate all the unused columns.

  2. Then, deal with the missing values.

Delete unused columns

As a reminder, the dispensable columns are: PassengerId, Name, Ticket, Cabin, and Embarked. The "Cabin" column is used here as an illustration, since it is full of missing data anyway.

We may directly delete needless columns one by one.

We may proceed with the other columns using the same method, or combine all the deletions in a single step.

To perform this multiple deletion, select the column names in the drop-down menu.

Multiple deletion is also possible.
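For readers who want to see what this step amounts to outside the Data Processor, here is a minimal pandas sketch. It is not SmartPredict's own implementation; the variable names and the three toy rows are made up for illustration, and only the column names come from this tutorial.

```python
import pandas as pd

# Toy rows for illustration only; they are not taken from the real dataset.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["T1", "T2", "T3"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# Deleting one column at a time, as with "Cabin" above.
without_cabin = df.drop(columns="Cabin")

# Deleting all the unused columns in a single step.
unused = ["PassengerId", "Name", "Ticket", "Cabin", "Embarked"]
features = df.drop(columns=unused)

print(list(features.columns))
```

Both calls return a new DataFrame and leave `df` untouched, so the original data stays available if a column turns out to be needed later.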

Handle missing values

We need to handle missing values because they may distort our outputs. To do so, let us remove the affected rows. See the section about data processing for how to do this.

Since we previously deleted all the superfluous columns (some of which also contained missing values), the "Age" column is all that is left to process.
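As a rough equivalent in code, the sketch below counts missing values per column and then removes the rows where "Age" is missing. It assumes pandas and uses made-up toy rows; it illustrates the idea and is not how the Data Processor works internally.

```python
import pandas as pd

# Toy rows for illustration only; two of the four ages are missing.
features = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 26.0, None],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})

# Number of missing values in each column.
missing_per_column = features.isna().sum()
print(missing_per_column)

# Drop every row whose "Age" is missing, then renumber the rows.
clean = features.dropna(subset=["Age"]).reset_index(drop=True)
print(len(features), "rows before,", len(clean), "rows after")
```

Removing rows is the simplest option, but it shrinks the dataset, so it is worth checking how many rows are lost before committing to it.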

Export the processing pipeline

After this, we obtain a clean dataset, which we may save as a new dataset for later use. This clean dataset is also the one from which we are going to create and export our processing pipeline.

To export a processing pipeline, click on the export icon. This exports it directly into the SmartPredict collection of processing pipelines.

The pipeline is exported from the dataset processor into the SmartPredict collection.

We may retrieve it under the "Processing pipelines" sub-tab.

The processing pipeline becomes a drag-and-drop module.
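Conceptually, an exported processing pipeline is a recorded sequence of steps that can be replayed on new data. The sketch below expresses the two steps of this section as one reusable function. It assumes pandas, and the names `UNUSED_COLUMNS` and `preprocess` are hypothetical; it does not reflect SmartPredict's internal format for pipelines.

```python
import pandas as pd

# Columns dropped in this tutorial.
UNUSED_COLUMNS = ["PassengerId", "Name", "Ticket", "Cabin", "Embarked"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Replay the two processing steps on any dataset with the same layout."""
    # Step 1: delete the unused columns (ignore any that are already absent).
    trimmed = df.drop(columns=UNUSED_COLUMNS, errors="ignore")
    # Step 2: remove the rows that still contain missing values.
    return trimmed.dropna().reset_index(drop=True)


# Toy input for illustration only.
new_data = pd.DataFrame({
    "PassengerId": [10, 11],
    "Pclass": [2, 3],
    "Sex": ["female", "male"],
    "Age": [14.0, None],
    "SibSp": [1, 0],
    "Parch": [0, 0],
    "Fare": [30.07, 8.05],
    "Cabin": [None, None],
})

processed = preprocess(new_data)
print(processed)
```

Packaging the steps this way means the same cleaning is applied consistently to training data and to any data the model sees later.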