Introduction

The data preprocessing modules perform a wide range of data transformation operations. Use them to produce clean datasets for feeding your models.

Why preprocess data first?

Data preprocessing is a generic term covering all kinds of dataset treatment, such as arranging, sorting, cleansing, and encoding, that renders the data fit to feed into ML models.

This can include:

  • Data Quality Assessment (sorting and cleansing the dataset, including handling missing values)

  • Feature Sampling (augmenting the data to obtain more samples)

  • Dimensionality Reduction (transforming features into a lower-dimensional space), including:

    • Feature Aggregation (combining several features into a smaller set)

    • Feature Selection (eliminating uninformative features from the input)

    • Principal Component Analysis (PCA)

  • Feature Encoding (for categorical data)
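
Outside SmartPredict, these steps map onto well-known operations. Here is a minimal sketch using scikit-learn as a generic illustration of missing-value handling, feature encoding, and PCA; this is not SmartPredict's own module API:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

# Data quality assessment: fill missing numeric values with the column mean.
X_num = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_num = SimpleImputer(strategy="mean").fit_transform(X_num)

# Feature encoding: turn a categorical column into one-hot vectors.
X_cat = np.array([["red"], ["blue"], ["red"]])
X_cat = OneHotEncoder().fit_transform(X_cat).toarray()

# Dimensionality reduction: project the combined features onto 2 components.
X = np.hstack([X_num, X_cat])
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (3, 2)
```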

Data preprocessing is also an indispensable stage in building a processing pipeline.

As a reminder, a processing pipeline is an ordered sequence of processing steps that shapes the data before it reaches the model. In other words, if we want a well-prepared dataset and a model that performs accurately, it is wise not to overlook this step.
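
To make the idea concrete, here is a short sketch of a processing pipeline using scikit-learn's `Pipeline`; this is a generic analogue chosen for illustration, not how SmartPredict itself implements pipelines:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Each step runs in order: impute missing values -> scale -> fit the model.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

X = np.array([[1.0, 2.0], [np.nan, 0.0], [3.0, 4.0], [5.0, np.nan]])
y = np.array([0, 0, 1, 1])
pipe.fit(X, y)
pred = pipe.predict([[4.0, 4.0]])
```

Because every transformation is inside the pipeline, new data passed to `predict` is imputed and scaled with the same parameters learned during training.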

SmartPredict contains everything needed to perform these steps seamlessly. To do so, we can choose from the various modules included in the platform, such as:

  1. Array reshaper

  2. Generic Data Preprocessor

  3. Missing data handler

  4. Normalizer

  5. One Hot Encoder

  6. Ordinal Encoder

  7. and more...
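
The module names above suggest familiar operations. As a rough sketch of what each one does, here are generic analogues in NumPy and scikit-learn; these are illustrations of the concepts, not SmartPredict's module interfaces:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Array reshaping: change an array's dimensions without changing its data.
a = np.arange(6).reshape(2, 3)

# Normalization: rescale each feature to the [0, 1] range.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_norm = MinMaxScaler().fit_transform(X)

# One-hot vs. ordinal encoding of the same categorical column.
colors = np.array([["red"], ["green"], ["red"]])
one_hot = OneHotEncoder().fit_transform(colors).toarray()  # one column per category
ordinal = OrdinalEncoder().fit_transform(colors)           # one integer per category
```

One-hot encoding suits unordered categories (it avoids implying "red > green"), while ordinal encoding is more compact and appropriate when the categories have a natural order.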
