Data Pre-processing
Data pre-processing transforms raw, real-world data into a clean and suitable format before applying machine learning algorithms. Real-world data is often incomplete, noisy, inconsistent, and unstructured. Proper pre-processing directly impacts model accuracy, performance, and generalizability.
1. Data Cleaning
Fixes incomplete, noisy, and inconsistent data.
- Handling Missing Values: Remove rows (listwise deletion), Mean/Median/Mode Imputation, or Predictive Imputation (e.g., KNN imputer).
- Handling Outliers: Z-Score method (|z| > 3) or IQR method (removing values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR).
- Noise Reduction: Smoothing using moving averages.
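The three cleaning steps above can be sketched with Python's standard library alone. This is a minimal illustration, not a reference implementation: the function names, the use of None to mark a missing value, the "inclusive" quartile method, and the trailing window of 3 are all assumptions made for the example.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def iqr_filter(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartile conventions differ between tools; "inclusive" is one choice.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def moving_average(values, window=3):
    """Smooth a sequence with a trailing moving average of size `window`."""
    return [
        statistics.mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


print(impute_mean([1.0, None, 3.0]))
print(iqr_filter([10, 12, 11, 13, 12, 100]))
print(moving_average([1, 2, 3, 4, 5]))
```

Mean imputation is sensitive to outliers, so in practice it may make sense to handle outliers first or to impute with the median instead.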
2. Data Transformation
Converts data into an appropriate scale or format for models.
- Normalization (Min-Max Scaling): Scales data to a [0, 1] range.
  x' = (x - min) / (max - min)
- Standardization (Z-Score): Scales data to have zero mean (μ=0) and unit variance (σ=1).
  x' = (x - μ) / σ
- Log Transformation: Reduces right-skewness in data.
  x' = log(1 + x)
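The three formulas translate almost directly into code. The sketch below uses only the standard library; the function names are made up for illustration, the population standard deviation is assumed for σ (the formula does not say whether population or sample is meant), and the inputs are assumed to be non-constant (so max ≠ min and σ ≠ 0) and, for the log transform, greater than -1.

```python
import math
import statistics


def min_max_scale(values):
    """x' = (x - min) / (max - min); assumes max != min."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """x' = (x - mu) / sigma, using the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population, not sample
    return [(x - mu) / sigma for x in values]


def log_transform(values):
    """x' = log(1 + x); log1p stays accurate for x close to 0."""
    return [math.log1p(x) for x in values]


print(min_max_scale([10, 20, 30]))
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
print(log_transform([0, 9, 99]))
```

When scaling is part of a modelling workflow, min, max, μ, and σ are normally computed on the training data only and then reused on the test data, so that no information leaks from the test set.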
3. Data Encoding
Most machine learning models require numerical input. Encoding converts categorical data to numeric data.
- One-Hot Encoding: Converts nominal categories into separate binary columns. (e.g., Color: [1,0,0] for Red, [0,1,0] for Green).
- Label Encoding: Assigns integer values to ordinal categories. (e.g., Low=0, Medium=1, High=2).
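Both encodings can be illustrated in a few lines of plain Python. The function names are hypothetical, and the third color, "Blue", is an assumption added so that the one-hot vectors have three columns as in the example above.

```python
def one_hot_encode(values, categories):
    """Map each value to a binary vector with a single 1 at its category index."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


def ordinal_encode(values, order):
    """Map each value to its position in an explicitly ordered category list."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[value] for value in values]


colors = ["Red", "Green"]
print(one_hot_encode(colors, ["Red", "Green", "Blue"]))  # "Blue" is assumed

levels = ["Low", "High", "Medium"]
print(ordinal_encode(levels, ["Low", "Medium", "High"]))
```

Passing the category order explicitly matters for ordinal data: an encoder that sorts labels alphabetically would rank High below Low, which would misrepresent the order.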
4. Data Integration & Discretization
- Data Integration: Combining data from multiple sources (databases, files, APIs) while resolving entity identification problems and handling redundant attributes.
- Data Discretization (Binning): Converting continuous data into categorical groups. (e.g., grouping ages: 0-18 = child, 19-60 = adult, 61+ = senior).
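As a rough sketch of both ideas, the example below merges records from two hypothetical sources on a shared key and then bins the resulting ages. The field names ("id", "age", "city"), the rule that the first source wins when both supply the same attribute, and the treatment of the boundary ages 18 and 60 as belonging to the lower bin are all choices made for this illustration.

```python
def integrate(primary, secondary, key="id"):
    """Merge two lists of records on `key`; primary values win on conflicts."""
    merged = {record[key]: dict(record) for record in primary}
    for record in secondary:
        target = merged.setdefault(record[key], {})
        for field, value in record.items():
            # Only fill attributes the primary source did not already provide.
            target.setdefault(field, value)
    return list(merged.values())


def bin_age(age):
    """Map a numeric age to a categorical group."""
    if age <= 18:
        return "child"
    if age <= 60:
        return "adult"
    return "senior"


source_a = [{"id": 1, "age": 15}]
source_b = [{"id": 1, "age": 16, "city": "Pune"}, {"id": 2, "age": 70}]

for record in integrate(source_a, source_b):
    record["age_group"] = bin_age(record["age"])
    print(record)
```

Real integration work usually also has to reconcile differing key formats and units across sources; this sketch assumes the keys already match exactly.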