Step 3. Preprocess the dataset

Preprocessing is an important stage in any data analytics process. For this purpose, SmartPredict has a powerful data processor.

Having uploaded our iris dataset, we can now clean it with the SmartPredict Data Processor.

  1. To access it, first click on “Applications” in the left pane's menu.

  2. Then click on “Dataset Processing and Visualization”.

A new dashboard opens. It is very similar to the first one (seen in the dataset application), except that this time the data are filtered by date of creation and size.

From this new dashboard, we can also add new datasets by clicking on the round yellow + button, as an alternative to uploading them directly from the dataset table.

Wherever on the platform we retrieve a dataset from, clicking on the processor's cog icon leads to the Dataset Processor console.

The Dataset Processor allows us to sample, filter and download datasets.
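For readers who think in code, the three operations named above can be approximated with pandas. This is only a rough sketch of the same ideas, not how SmartPredict implements them; the tiny table, its column names and the `Setosa` filter are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature iris-like table (values and column names are assumptions).
df = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 7.0, 6.4, 6.3],
    "Variety": ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica"],
})

# Sample: draw 3 random rows; random_state makes the draw reproducible.
sample = df.sample(n=3, random_state=0)

# Filter: keep only the rows of one variety.
filtered = df[df["Variety"] == "Setosa"]

# Download: serialize the filtered rows as CSV text.
# An in-memory buffer stands in for a file on disk here.
buffer = io.StringIO()
filtered.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing a file path to `to_csv` instead of the buffer would write the same content to disk.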

Flaws in a ‘dirty’ dataset are highlighted, as here, where missing items are shown in red. To deal with these impurities, we can choose the kind of processing we want to apply.

Here, our strategy is to systematically drop the rows (and columns) that contain missing values. To do so, we just need to select the corresponding processor type and then, under "Strategy", choose Drop.

  1. Click on the + button to add a processing step.

  2. When asked to select a processor, scroll to 'Handle missing values'.

  3. Click on it.
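As a point of comparison, the effect of the 'Handle missing values' processor with the Drop strategy resembles `dropna` in pandas. The sketch below is an analogy under assumed data, not the platform's own code; the table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature iris-like table with two missing cells.
df = pd.DataFrame({
    "sepal.length": [5.1, np.nan, 6.3, 5.8],
    "sepal.width": [3.5, 3.0, np.nan, 2.7],
    "Variety": ["Setosa", "Versicolor", "Virginica", "Virginica"],
})

# Count the missing cells per column (the ones a processor would flag in red).
missing_per_column = df.isna().sum()

# "Drop" strategy: remove every row that contains at least one missing value.
cleaned = df.dropna(axis=0, how="any").reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```

Using `axis=1` instead would drop the affected columns rather than the rows.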

We can continue with further data cleansing and processing by clicking once more on the + button.

  1. Click on the + sign to add a processing step.

  2. Scroll to 'Sort' and select it.

  3. When choosing the column, select 'Variety', since it is the scrambled one.

  4. Validate.
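The 'Sort' step above corresponds roughly to sorting a table by one column. A minimal pandas sketch, assuming a small hypothetical table whose 'Variety' column is out of order:

```python
import pandas as pd

# Hypothetical rows with the 'Variety' column in scrambled order.
df = pd.DataFrame({
    "sepal.length": [6.3, 5.1, 7.0, 4.9],
    "Variety": ["Virginica", "Setosa", "Versicolor", "Setosa"],
})

# Sort alphabetically by 'Variety'. kind="stable" keeps rows of the same
# variety in their original relative order.
sorted_df = df.sort_values(by="Variety", kind="stable").reset_index(drop=True)

print(sorted_df)
```

After sorting, all rows of a given variety sit next to each other, which makes the table easier to inspect.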

The table now reflects the steps we asked the processor to apply, and the quality of our data has visibly improved.

Now that we have a clean dataset, let us keep in mind that our main aim is to produce a processing pipeline: this is our next step.

The processing pipeline pane appears on the right side of the Dataset Processor. It can be folded and unfolded, and it contains its own set of functions.

Once we have finished the cleaning operations, we can use the new dataset for our processing pipeline, which is listed in the right sidebar next to the Data Processor's dashboard.

  1. Look for the export icon (third icon from the left) in the menu below the right pane. Its tooltip reads "Export processing pipeline to SmartPredict".

  2. Click on it.

  3. Then, click on "Export".

We can choose which processing steps to include in a processing pipeline. If we do not want some steps to be included, we can delete them individually after ticking the check-box next to them.
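Conceptually, a processing pipeline is an ordered list of steps, and choosing which steps to keep changes what the pipeline does to a dataset. The sketch below illustrates that idea in pandas; the function names, the `run_pipeline` helper and the sample data are hypothetical and say nothing about SmartPredict's internal format.

```python
from typing import Callable, List, Set, Tuple

import numpy as np
import pandas as pd

# A step is a (label, function) pair; labels mirror the processor names used above.
Step = Tuple[str, Callable[[pd.DataFrame], pd.DataFrame]]


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every row that has at least one missing value."""
    return df.dropna()


def sort_by_variety(df: pd.DataFrame) -> pd.DataFrame:
    """Sort rows by the 'Variety' column, keeping ties in their original order."""
    return df.sort_values("Variety", kind="stable")


steps: List[Step] = [
    ("Handle missing values", drop_missing),
    ("Sort", sort_by_variety),
]


def run_pipeline(df: pd.DataFrame, steps: List[Step], include: Set[str]) -> pd.DataFrame:
    """Apply, in order, only the steps whose label is in `include`."""
    for name, step in steps:
        if name in include:
            df = step(df)
    return df.reset_index(drop=True)


# Hypothetical raw data: one missing value, varieties out of order.
raw = pd.DataFrame({
    "sepal.length": [6.3, np.nan, 5.1, 7.0],
    "Variety": ["Virginica", "Setosa", "Setosa", "Versicolor"],
})

full = run_pipeline(raw, steps, {"Handle missing values", "Sort"})
sort_only = run_pipeline(raw, steps, {"Sort"})
```

Here `full` has the incomplete row removed and the rest sorted, while `sort_only` keeps all four rows, including the one with the missing value.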

Now let us get back to our workspace and click on the flowchart icon on the left to switch to project view. In the right sidebar, clicking on the third button, named ‘Processing pipelines’, displays the pipeline we have just configured.

It is worth noting the name under which we exported the pipeline we want to use in the build, to avoid picking a different one later as the number of builds grows.

Just like any other module, we can now drag and drop it into the layout in order to attach it to the flowchart.
