Machine Learning

Introduction to Machine Learning

Introduction to Machine Learning explores how systems can learn from experience without being explicitly programmed. This chapter covers human learning, the formal definition of ML, types of algorithms, well-posed learning problems, applications, issues, and data quality.

Human Learning & Its Types

Human learning is the process by which a person acquires new knowledge, skills, behaviours, and understanding through experience, study, or instruction.

Rote Learning

Memorising information through repetition without deep understanding (e.g., memorising multiplication tables).

Meaningful Learning

Understanding and relating new concepts to existing knowledge (e.g., applying Newton's laws to real-world physics problems).

Discovery Learning

Constructing knowledge through exploration and inquiry independently (e.g., discovering patterns in nature).

Analogical Learning

Understanding a new concept by comparing it to a previously known concept (e.g., comparing electrical current to water flow).

Machine Learning Definition

Machine Learning is a subset of Artificial Intelligence (AI) that enables systems to learn from data and improve over time.

Tom Mitchell's Definition (1997)

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Task (T): What the program is doing (e.g., classifying emails).
Experience (E): The data the system learns from (e.g., historical labelled emails).
Performance (P): The evaluation metric (e.g., accuracy percentage).
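
The three components can be made concrete with a toy example. The following is a minimal sketch, assuming a made-up spam task in which each email is reduced to a count of suspicious words; the function names and the data are hypothetical and chosen only to show P improving as E grows.

```python
# Hypothetical toy task: an email is represented by its count of "spammy" words.

def learn_threshold(examples):
    """Learn from experience E: (spam_word_count, is_spam) pairs.
    Returns the midpoint between the mean count of spam and of non-spam."""
    spam = [count for count, is_spam in examples if is_spam]
    ham = [count for count, is_spam in examples if not is_spam]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2

def classify(count, threshold):
    """Task T: predict spam (True) or not spam (False)."""
    return count > threshold

def accuracy(examples, threshold):
    """Performance P: fraction of correct predictions."""
    correct = sum(classify(count, threshold) == label for count, label in examples)
    return correct / len(examples)

# Illustrative data only, constructed so that more experience helps.
little_experience = [(2, False), (9, True)]
more_experience = [(0, False), (1, False), (2, False), (6, True), (7, True), (9, True)]
held_out = [(1, False), (3, False), (5, True), (8, True)]

acc_little = accuracy(held_out, learn_threshold(little_experience))
acc_more = accuracy(held_out, learn_threshold(more_experience))
print(acc_little, acc_more)  # 0.75 1.0
```

On real data, more experience does not guarantee a better score on every run; the data here were picked so that the improvement is visible.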

The Hierarchy: Artificial Intelligence is the broadest concept. Machine Learning is a subset of AI. Deep Learning is a further subset of ML that uses multi-layered neural networks.

Types of Machine Learning


Supervised Learning

Definition: Learning from a labelled training dataset where each example is paired with the correct output.

Sub-types:
- Classification: Discrete output labels (e.g., Spam vs Not-Spam).
- Regression: Continuous numerical values (e.g., Predicting house prices).
Feedback: Immediate and direct. The algorithm knows the correct answer.
Examples: Credit risk scoring, image recognition, medical diagnosis.
Algorithms: Linear Regression, SVM, KNN, Naive Bayes.
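
The two sub-types can be illustrated with small hand-written versions of two of the listed algorithms. This is a sketch under stated assumptions: the data are invented, the features are one-dimensional, and `fit_line` and `knn_predict` are illustrative helpers rather than any library's API.

```python
from collections import Counter

def fit_line(xs, ys):
    """Simple linear regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def knn_predict(train, query, k=3):
    """k-nearest-neighbours classification on 1-D features by majority vote."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Regression: made-up (area, price) pairs lying exactly on price = 2 * area + 10.
areas, prices = [50, 80, 100, 120], [110, 170, 210, 250]
slope, intercept = fit_line(areas, prices)
predicted_price = slope * 90 + intercept      # continuous output

# Classification: made-up (suspicious-word count, label) pairs.
emails = [(0, "not-spam"), (1, "not-spam"), (2, "not-spam"),
          (7, "spam"), (8, "spam"), (9, "spam")]
predicted_label = knn_predict(emails, 6)      # discrete output

print(predicted_price, predicted_label)
```

In both cases the correct output accompanies every training example, which is what makes the feedback direct.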

Unsupervised Learning

Definition: Working with unlabelled data to discover hidden patterns, structures, or groupings without guidance.

Sub-types:
- Clustering: Grouping similar data points together (e.g., customer segmentation).
- Association: Discovering rules between items (e.g., market basket analysis).
- Dimensionality Reduction: Reducing features while retaining info (e.g., PCA).
Feedback: No labels or ground truth provided.
Examples: Anomaly detection, document clustering, gene grouping.
Algorithms: K-Means, Hierarchical Clustering, Apriori.
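
Clustering can be sketched with a one-dimensional K-Means. The example assumes invented customer-spend figures and hand-picked starting centroids; `kmeans_1d` is an illustrative helper, and real implementations handle initialisation and convergence checks more carefully.

```python
def kmeans_1d(points, centroids, iterations=10):
    """K-Means (Lloyd's algorithm) on 1-D data, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Made-up monthly spend per customer; no labels are supplied.
spend = [10, 12, 11, 50, 52, 51]
centroids, clusters = kmeans_1d(spend, centroids=[10, 52])
print(centroids)  # [11.0, 51.0]
print(clusters)   # [[10, 12, 11], [50, 52, 51]]
```

The algorithm is never told which customers belong together; the two groups emerge from the distances alone.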

Semi-Supervised Learning

Definition: Lies between supervised and unsupervised; uses a small amount of labelled data with a large amount of unlabelled data.

Use Case: When labelling data is too expensive or time-consuming.
Examples: Web content classification (few labelled pages), Medical imaging (few labelled X-rays).
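
One common way to use the unlabelled pool is self-training (pseudo-labelling), which the source does not name; it is shown here only as one possible approach. The sketch assumes a nearest-centroid classifier on invented 1-D data, and every helper name is hypothetical. Practical self-training usually keeps only high-confidence pseudo-labels, which this sketch omits.

```python
def centroids_of(labelled):
    """Mean feature value per class, from (value, label) pairs."""
    sums = {}
    for value, label in labelled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def predict(centroids, value):
    """Assign the class whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

def self_train(labelled, unlabelled, rounds=3):
    """Label the unlabelled pool with the current model, then refit on everything."""
    centroids = centroids_of(labelled)
    for _ in range(rounds):
        pseudo = [(v, predict(centroids, v)) for v in unlabelled]
        centroids = centroids_of(labelled + pseudo)
    return centroids

labelled = [(1.0, "A"), (9.0, "B")]        # small, expensive labelled set
unlabelled = [2.0, 3.0, 7.0, 8.0, 4.0]     # larger, cheap unlabelled pool
print(self_train(labelled, unlabelled))    # {'A': 2.5, 'B': 8.0}
```

The two labelled points fix the class identities, while the unlabelled points refine where each class centre sits.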

Reinforcement Learning

Definition: An agent learns to make decisions by interacting with an environment through trial and error to maximise cumulative rewards.

Key Concepts: Agent, Environment, State, Action, Reward, Policy.
Feedback: Reward/Penalty signal, often delayed until after a sequence of actions.
Examples: Game AI (AlphaGo), Self-driving cars, Robotics, Stock trading.
Algorithms: Q-Learning, DQN, SARSA.
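
The key concepts can be seen together in a small tabular Q-Learning sketch. The environment is an assumption made for illustration: five states on a line, with a reward of 1 only on reaching the rightmost state. The agent explores with purely random actions, which is enough here because Q-Learning is off-policy; the hyperparameters are arbitrary choices.

```python
import random

N_STATES, GOAL = 5, 4      # states 0..4; reaching state 4 ends the episode
ACTIONS = (-1, +1)         # move left or move right

def step(state, action):
    """Environment: deterministic move; reward 1 only when the goal is reached."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)   # trial and error: random exploration
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Q-Learning update: move the estimate towards reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = q_learning()
# Policy: in each non-terminal state, pick the action with the highest Q-value.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)  # {0: 1, 1: 1, 2: 1, 3: 1}
```

The reward arrives only at the goal, yet the learned Q-values propagate it backwards, so the policy moves right from every state.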

Comparison of ML Types

Aspect	Supervised	Unsupervised	Reinforcement
Training Data	Labelled	Unlabelled	No fixed dataset
Feedback	Direct	None	Reward/Penalty
Goal	Predict output	Find patterns	Maximise reward

Well-Posed Learning Problems

A problem is well-posed when it is formally defined with a Task (T), Performance measure (P), and Experience (E).

Example: Checkers Game

T: Playing checkers
P: Percentage of games won
E: Practice games played against itself

Example: Email Spam

T: Classifying emails as spam or not
P: Percentage of correct classifications
E: Database of manually labelled emails

Real-World Applications

Healthcare: Disease diagnosis (cancer, diabetes), drug discovery, medical image analysis.
Finance: Fraud detection, credit risk scoring, algorithmic stock trading.
Natural Language Processing: Virtual assistants (Siri, Alexa), sentiment analysis, language translation.
Computer Vision: Face recognition, object detection, optical character recognition (OCR).
Recommendation Systems: Netflix movie suggestions, Amazon product recommendations.

Issues in Machine Learning

Overfitting: The model fits the training data, including its noise, so closely that it fails to generalise to new data (low bias, high variance). Mitigated with regularisation, more training data, or a simpler model.

Underfitting: The model is too simple to capture the underlying patterns, so it performs poorly on both training and test data (high bias, low variance). Addressed by increasing model complexity or adding more informative features.

Data Quality: Missing values, noisy data, outliers, and inconsistent records heavily impact performance (Garbage In, Garbage Out).
Bias-Variance Tradeoff: Finding the sweet spot between a model that is too simple to fit the data (high bias) and one that is overly sensitive to the particular training sample (high variance).
Lack of Interpretability: Deep neural networks often act as "black boxes," making it hard to explain their decisions.
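
Overfitting and underfitting show up as a characteristic pattern of training and test error. The sketch below assumes a synthetic dataset (a line y = 2x with alternating noise of +4 and -4 on the training points, and noise-free test points to keep the arithmetic simple) and three deliberately chosen models; all names are illustrative.

```python
def mse(model, data):
    """Mean squared error of a model (a function x -> prediction) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def mean_model(data):
    """Too simple: always predicts the average training target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def linear_model(data):
    """Matches the true structure: least-squares straight line."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def nearest_neighbour_model(data):
    """Too flexible: memorises every training point, noise included."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

train = [(x, 2 * x + (4 if x % 4 == 0 else -4)) for x in range(0, 20, 2)]
test = [(x, 2 * x) for x in range(1, 19, 2)]

results = {}
for name, build in [("mean", mean_model), ("linear", linear_model),
                    ("1-NN", nearest_neighbour_model)]:
    model = build(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:7s} train MSE = {results[name][0]:7.2f}  test MSE = {results[name][1]:7.2f}")
```

Under these assumptions the mean predictor has high error on both sets (underfitting), the nearest-neighbour model has zero training error but a clearly higher test error (overfitting), and the straight line accepts some training error in exchange for the lowest test error.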

Types of Data

Understanding data types is critical for selecting the right algorithms.

Qualitative (Categorical)

Nominal: Named categories with NO inherent order. (e.g., Blood Group, Gender, Nationality). Operations: Counting, Mode.
Ordinal: Named categories with a meaningful ORDER, but exact differences are unknown. (e.g., Grades A/B/C, Customer Satisfaction). Operations: Median, Mode.

Quantitative (Numerical)

Interval: Numeric data with meaningful order AND exact differences, but NO true zero. (e.g., Temperature in °C, Calendar Years). Operations: Addition, Subtraction, Mean.
Ratio: Highest level of measurement. Has a true zero (meaning complete absence), so ratios are meaningful. (e.g., Height, Weight, Salary, Age). Operations: All arithmetic, including Multiplication and Division.
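
The measurement level determines which summary statistics make sense. The following sketch uses Python's standard `statistics` module on small invented samples; the variable names and values are illustrative only.

```python
import statistics

# Nominal: only counting and the mode are meaningful.
blood_groups = ["A", "O", "B", "O", "AB", "O"]
mode_blood = statistics.mode(blood_groups)

# Ordinal: order is meaningful, so a median can be taken over ranks.
grade_order = ["C", "B", "A"]                      # worst to best
grades = ["B", "A", "C", "B", "B"]
ranks = sorted(grade_order.index(g) for g in grades)
median_grade = grade_order[statistics.median_low(ranks)]

# Interval: differences and means are meaningful, ratios are not.
temps_c = [10.0, 20.0, 30.0]
mean_temp = statistics.mean(temps_c)
naive_ratio = temps_c[1] / temps_c[0]              # 2.0, but 20 °C is not "twice as hot"
kelvin_ratio = (temps_c[1] + 273.15) / (temps_c[0] + 273.15)

# Ratio: a true zero exists, so ratios are meaningful.
weights_kg = [40.0, 80.0]
weight_ratio = weights_kg[1] / weights_kg[0]       # 80 kg really is twice 40 kg

print(mode_blood, median_grade, mean_temp, weight_ratio)
```

The temperature ratio changes when the unit changes from °C to kelvin, which is the practical sign of a scale without a true zero; the weight ratio stays at 2 in any unit.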

Data Quality & Remediation

Strategies to clean imperfect real-world datasets:

Missing Values: Remedied by deletion, mean/median/mode imputation, or predictive imputation using other ML models.
Noisy Data: Addressed using binning (smoothing by grouping), regression fitting, or clustering to identify and remove outliers.
Duplicate Data: Solved using deduplication algorithms based on key attributes to keep a single authoritative record.
Irrelevant Features: Handled using Feature Selection (statistical correlation tests) or Dimensionality Reduction (PCA) to compress feature space.
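
Three of the strategies above can be sketched in a few lines each. The example assumes missing values are encoded as `None`, that duplicates share a key attribute, and that smoothing uses equal-frequency bins; the helper names and the data are hypothetical.

```python
import statistics

def impute_median(values):
    """Missing values: replace None with the median of the observed values."""
    median = statistics.median([v for v in values if v is not None])
    return [median if v is None else v for v in values]

def deduplicate(records, key):
    """Duplicate data: keep the first record seen for each value of the key attribute."""
    seen, unique = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

def smooth_by_bin_means(values, bin_size):
    """Noisy data: sort, split into equal-frequency bins, replace values by their bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([sum(bin_values) / len(bin_values)] * len(bin_values))
    return smoothed

ages = impute_median([25, None, 31, 40, None])
customers = deduplicate(
    [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ben"}, {"id": 1, "name": "Asha K."}],
    key="id",
)
prices = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)

print(ages)                           # [25, 31, 31, 40, 31]
print([c["name"] for c in customers]) # ['Asha', 'Ben']
print(prices)                         # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Keeping the first record is only one possible rule for choosing the authoritative copy; in practice the choice often depends on timestamps or record completeness.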
