Data Pre-processing

Data pre-processing transforms raw, real-world data into a clean, model-ready format before machine learning algorithms are applied. Real-world data is often incomplete, noisy, inconsistent, or unstructured, so the quality of pre-processing directly affects model accuracy, performance, and generalizability.

1. Data Cleaning

Fixes incomplete, noisy, and inconsistent data.

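  • Handling Missing Values: Remove the affected rows (listwise deletion), fill gaps with the column mean, median, or mode (imputation), or predict the missing values from other features (e.g., KNN imputer).
  • Handling Outliers: Z-Score method (flag values with |z| > 3) or IQR method (flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1). Flagged values can then be removed or capped.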
  • Noise Reduction: Smoothing using moving averages.
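
The three cleaning steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, missing values are assumed to be represented as None, and quartiles are computed with the standard library's "inclusive" method (other quartile conventions give slightly different fences).

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def iqr_filter(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: "inclusive" quartiles; other conventions shift the fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def moving_average(values, window=3):
    """Smooth a sequence with a simple moving average over `window` points."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]
```

For example, iqr_filter([10, 12, 11, 13, 12, 100]) drops the value 100, since the fences for that sample work out to 9.0 and 15.0. Note that the moving average returns fewer points than it receives, because the first window - 1 positions have no complete window.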

2. Data Transformation

Converts data into an appropriate scale or format for models.

  • Normalization (Min-Max Scaling): Scales data to a [0, 1] range.
    x' = (x - min) / (max - min)
  • Standardization (Z-Score): Rescales data so that the result has zero mean and unit variance.
    x' = (x - μ) / σ, where μ and σ are the mean and standard deviation of the original feature.
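  • Log Transformation: Reduces right-skewness by compressing large values. The 1 + x form keeps x = 0 valid and is typically applied to non-negative data.
    x' = log(1 + x)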
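
The three formulas above translate almost directly into code. The sketch below uses only the standard library; the function names are illustrative, and standardization is assumed to use the population standard deviation (libraries differ on population vs. sample).

```python
import math
import statistics


def min_max_scale(values):
    """Scale values to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Apply x' = (x - mu) / sigma using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standardization is undefined for a constant feature")
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """Apply x' = log(1 + x); log1p stays accurate for small x."""
    return [math.log1p(v) for v in values]
```

For example, min_max_scale([10, 20, 30]) gives [0.0, 0.5, 1.0]. In practice the scaling parameters (min, max, μ, σ) are usually computed on the training data only and then reused for validation and test data.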

3. Data Encoding

Most machine learning models require numerical input. Encoding converts categorical data to numeric data.

  • One-Hot Encoding: Converts nominal categories into separate binary columns. (e.g., Color: [1,0,0] for Red, [0,1,0] for Green).
  • Label Encoding: Assigns an integer to each category. Because integers imply an order, it is best suited to ordinal categories (e.g., Low=0, Medium=1, High=2).
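
Both encodings above can be sketched with plain dictionaries. The function names are hypothetical, and the category lists are assumed to be known in advance and passed in explicitly, which fixes the column order for one-hot encoding and the rank order for label encoding.

```python
def one_hot_encode(values, categories):
    """Return one binary row per value, with a single 1 in the category's column."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # raises KeyError for an unseen category
        rows.append(row)
    return rows


def ordinal_encode(values, order):
    """Map each value to its position in `order` (lowest rank first)."""
    mapping = {category: rank for rank, category in enumerate(order)}
    return [mapping[value] for value in values]
```

For example, one_hot_encode(["Red", "Green"], ["Red", "Green", "Blue"]) gives [[1, 0, 0], [0, 1, 0]], and ordinal_encode(["Low", "High", "Medium"], ["Low", "Medium", "High"]) gives [0, 2, 1].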

4. Data Integration & Discretization

  • Data Integration: Combining data from multiple sources (databases, files, APIs) while resolving entity identification problems and handling redundant attributes.
  • Data Discretization (Binning): Converting continuous data into categorical groups with non-overlapping ranges (e.g., grouping ages: 0-18 = child, 19-60 = adult, 61+ = senior).
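
The age-binning example can be sketched with the standard library's bisect module. The function name and the default edges are illustrative; the sketch assumes that age 60 belongs to the adult group, so the senior group starts at 61.

```python
import bisect


def bin_age(age, edges=(19, 61), labels=("child", "adult", "senior")):
    """Return the label of the bin containing `age`.

    `edges` holds the lower bound of every bin except the first, so the
    defaults give: below 19 = child, 19 up to 60 = adult, 61 and above = senior.
    """
    return labels[bisect.bisect_right(edges, age)]
```

For example, bin_age(18) returns "child", bin_age(60) returns "adult", and bin_age(61) returns "senior". Keeping the edges in one sorted tuple makes it hard to define overlapping bins by accident.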
