Machine Learning
Introduction to Machine Learning
This chapter explores how systems can learn from experience without being explicitly programmed. It covers human learning, the formal definition of ML, types of ML, well-posed learning problems, applications, common issues, data types, and data quality.
Human Learning & Its Types
Human learning is the process by which a person acquires new knowledge, skills, behaviours, and understanding through experience, study, or instruction.
Rote Learning
Memorising information through repetition without deep understanding (e.g., memorising multiplication tables).
Meaningful Learning
Understanding and relating new concepts to existing knowledge (e.g., applying Newton's laws to real-world physics problems).
Discovery Learning
Constructing knowledge through exploration and inquiry independently (e.g., discovering patterns in nature).
Analogical Learning
Understanding a new concept by comparing it to a previously known concept (e.g., comparing electrical current to water flow).
Machine Learning Definition
Machine Learning is a subset of Artificial Intelligence (AI) that enables systems to learn from data and improve over time.
Tom Mitchell's (1997) formal definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E."
- Task (T): What the program is doing (e.g., classifying emails).
- Experience (E): The data the system learns from (e.g., historical labelled emails).
- Performance (P): The evaluation metric (e.g., accuracy percentage).
Types of Machine Learning
Supervised Learning
Definition: Learning from a labelled training dataset where each example is paired with the correct output.
- Sub-types:
  - Classification: Discrete output labels (e.g., spam vs not-spam).
  - Regression: Continuous numerical values (e.g., predicting house prices).
- Feedback: Immediate and direct. The algorithm knows the correct answer.
- Examples: Credit risk scoring, image recognition, medical diagnosis.
- Algorithms: Linear Regression, SVM, KNN, Naive Bayes.
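The two supervised sub-types above can be sketched with two of the listed algorithms: k-nearest neighbours for classification and a one-variable linear regression. This is a minimal illustration, not a production implementation; the toy datasets, feature choices, and function names (`knn_predict`, `fit_line`) are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classification: majority vote among the k nearest labelled points."""
    # train is a list of (feature_vector, label) pairs
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def fit_line(xs, ys):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical labelled emails: (number of links, exclamation marks) -> label
emails = [((0, 0), "not-spam"), ((1, 0), "not-spam"), ((0, 1), "not-spam"),
          ((5, 4), "spam"), ((6, 5), "spam"), ((4, 6), "spam")]
label = knn_predict(emails, (5, 5))

# Hypothetical labelled houses: area in square metres -> price in thousands
areas = [50, 60, 70, 80]
prices = [150, 180, 210, 240]
slope, intercept = fit_line(areas, prices)
predicted_price = slope * 65 + intercept
```

In both cases the correct output for every training example is known in advance, which is the "immediate and direct" feedback the section describes.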
Unsupervised Learning
Definition: Working with unlabelled data to discover hidden patterns, structures, or groupings without guidance.
- Sub-types:
  - Clustering: Grouping similar data points together (e.g., customer segmentation).
  - Association: Discovering rules between items (e.g., market basket analysis).
  - Dimensionality Reduction: Reducing the number of features while retaining most of the information (e.g., PCA).
- Feedback: No labels or ground truth provided.
- Examples: Anomaly detection, document clustering, gene grouping.
- Algorithms: K-Means, Hierarchical Clustering, Apriori.
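As a rough sketch of clustering without labels, the K-Means loop can be written in a few lines: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The customer data and the fixed starting centroids are assumptions chosen to keep the example deterministic; real implementations usually pick starting centroids randomly or with a smarter seeding scheme.

```python
import math

def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[idx].append(p)
    return clusters

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, centroids, iterations=10):
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid
        updated = [mean_point(c) if c else old for c, old in zip(clusters, centroids)]
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical customers: (monthly visits, monthly spend in hundreds)
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(customers, [(1.0, 1.0), (8.0, 8.0)])
```

No label is ever supplied; the two customer segments emerge only from distances between the points.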
Semi-Supervised Learning
Definition: Lies between supervised and unsupervised; uses a small amount of labelled data with a large amount of unlabelled data.
- Use Case: When labelling data is too expensive or time-consuming.
- Examples: Web content classification (few labelled pages), Medical imaging (few labelled X-rays).
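One common way to combine a few labels with many unlabelled points is self-training: fit a model on the labelled data, pseudo-label only the unlabelled points it is confident about, and refit. The notes do not name a specific method, so the sketch below is an assumption; it uses a nearest-centroid classifier on a single made-up feature, and the `margin` threshold is a hypothetical confidence rule.

```python
def centroids_of(labelled):
    """Mean feature value per class from (value, label) pairs."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Label of the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def self_train(labelled, unlabelled, rounds=3, margin=1.0):
    """Self-training for two or more classes on a 1-D feature."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        cents = centroids_of(labelled)
        confident, remaining = [], []
        for x in pool:
            dists = sorted(abs(x - c) for c in cents.values())
            # Confident when the nearest centroid clearly beats the runner-up
            if dists[1] - dists[0] >= margin:
                confident.append((x, predict(cents, x)))
            else:
                remaining.append(x)
        if not confident:
            break
        labelled.extend(confident)
        pool = remaining
    return centroids_of(labelled), pool

# Hypothetical scan scores: only two are labelled by an expert
labelled = [(1.0, "normal"), (9.0, "abnormal")]
unlabelled = [1.5, 2.0, 8.0, 8.5, 5.2]
model, still_unlabelled = self_train(labelled, unlabelled)
```

Points near the decision boundary stay unlabelled rather than being guessed, which limits how much a wrong pseudo-label can reinforce itself.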
Reinforcement Learning
Definition: An agent learns to make decisions by interacting with an environment through trial and error to maximise cumulative rewards.
- Key Concepts: Agent, Environment, State, Action, Reward, Policy.
- Feedback: A reward or penalty signal, which may arrive long after the action that caused it (delayed reinforcement).
- Examples: Game AI (AlphaGo), Self-driving cars, Robotics, Stock trading.
- Algorithms: Q-Learning, DQN, SARSA.
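The key concepts above (agent, environment, state, action, reward, policy) can be seen together in a small tabular Q-Learning sketch. The five-cell corridor environment, the reward of 1.0 at the goal, and the hyperparameter values are all assumptions made for illustration.

```python
import random

N_STATES = 5            # states 0..4 on a line; state 4 is the goal
GOAL = N_STATES - 1
ACTIONS = ["left", "right"]

def step(state, action):
    """Environment: move one cell; reward 1.0 only on reaching the goal."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

def choose(q_row, rng, epsilon):
    """Epsilon-greedy action choice with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_row.values())
    return rng.choice([a for a in ACTIONS if q_row[a] == best])

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap keeps every episode finite
            action = choose(q[state], rng, epsilon)
            nxt, reward = step(state, action)
            # Q-Learning update: bootstrap from the best action in the next state
            target = reward + gamma * max(q[nxt].values())
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if state == GOAL:
                break
    return q

q_table = q_learning()
# Greedy policy: the highest-valued action in each non-goal state
policy = [max(q_table[s], key=q_table[s].get) for s in range(GOAL)]
```

The reward is only seen at the final step, yet the discount factor `gamma` propagates its value back to earlier states, which is how the agent copes with delayed feedback.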
Comparison of ML Types
| Aspect | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Training Data | Labelled | Unlabelled | No fixed dataset |
| Feedback | Direct | None | Reward/Penalty |
| Goal | Predict output | Find patterns | Maximise reward |
Well-Posed Learning Problems
A learning problem is well-posed when it is formally defined by three components: a Task (T), a Performance measure (P), and a source of Experience (E).
Example: Checkers Game
- T: Playing checkers
- P: Percentage of games won
- E: Practice games played against itself
Example: Email Spam
- T: Classifying emails as spam or not
- P: Percentage of correct classifications
- E: Database of manually labelled emails
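The email spam example maps directly onto code: one function consumes E, one performs T, and one computes P. The word-count classifier and the handful of emails below are hypothetical stand-ins, chosen only to make the three components concrete.

```python
from collections import Counter

def train(examples):
    """E: accumulate per-class word counts from manually labelled emails."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """T: label an email by which class has seen its words more often."""
    words = text.lower().split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

def accuracy(counts, test_set):
    """P: percentage of correct classifications."""
    correct = sum(classify(counts, text) == label for text, label in test_set)
    return 100.0 * correct / len(test_set)

# Hypothetical labelled database (E)
experience = [
    ("win free prize now", "spam"),
    ("meeting agenda attached", "ham"),
    ("free money offer", "spam"),
    ("lunch tomorrow at noon", "ham"),
    ("claim your prize money", "spam"),
    ("project report attached", "ham"),
]
# Hypothetical held-out emails used only to measure P
held_out = [
    ("free prize offer", "spam"),
    ("agenda for project meeting", "ham"),
    ("claim free money", "spam"),
    ("report attached", "ham"),
]

p_without_experience = accuracy(train([]), held_out)
p_with_experience = accuracy(train(experience), held_out)
```

Comparing the two accuracy values shows the definition at work: the program has learned if P on task T is higher after it has seen E.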
Real-World Applications
- Healthcare: Disease diagnosis (cancer, diabetes), drug discovery, medical image analysis.
- Finance: Fraud detection, credit risk scoring, algorithmic stock trading.
- Natural Language Processing: Virtual assistants (Siri, Alexa), sentiment analysis, language translation.
- Computer Vision: Face recognition, object detection, optical character recognition (OCR).
- Recommendation Systems: Netflix movie suggestions, Amazon product recommendations.
Issues in Machine Learning
- Data Quality: Missing values, noisy data, outliers, and inconsistent records heavily impact performance (Garbage In, Garbage Out).
- Bias-Variance Tradeoff: Finding the sweet spot between a model that is too simple and underfits the data (high bias) and a model that is too sensitive to the training set and overfits it (high variance).
- Lack of Interpretability: Deep neural networks often act as "black boxes," making it hard to explain their decisions.
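The bias-variance tradeoff listed above can be made tangible by comparing training and test error for two extreme models. In this sketch a constant "predict the mean" model stands in for high bias and a 1-nearest-neighbour model stands in for high variance; both models and the noisy toy data are assumptions picked for illustration.

```python
def mean_model(train):
    """High bias: ignores x and always predicts the average training target."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def nearest_neighbour_model(train):
    """High variance: returns the target of the single closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(predict, data):
    """Mean squared error of a predictor on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

# Underlying relationship y = 2x; training targets carry alternating noise of +2 / -2
noise = [2, -2] * 5
train = [(x, 2 * x + n) for x, n in zip(range(10), noise)]
# Noise-free test points that lie between the training inputs
test = [(x + 0.5, 2 * (x + 0.5)) for x in range(9)]

underfit = mean_model(train)
overfit = nearest_neighbour_model(train)

errors = {
    "mean model": (mse(underfit, train), mse(underfit, test)),
    "1-NN model": (mse(overfit, train), mse(overfit, test)),
}
```

The constant model has a large error on both sets because it cannot represent the trend, while the nearest-neighbour model reproduces the training set exactly, noise included, and so does worse on unseen points than its training error suggests.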
Types of Data
Understanding data types is critical for selecting the right algorithms.
Qualitative (Categorical)
- Nominal: Named categories with NO inherent order. (e.g., Blood Group, Gender, Nationality). Operations: Counting, Mode.
- Ordinal: Named categories with a meaningful ORDER, but exact differences are unknown. (e.g., Grades A/B/C, Customer Satisfaction). Operations: Median, Mode.
Quantitative (Numerical)
- Interval: Numeric data with meaningful order AND exact differences, but NO true zero. (e.g., Temperature in °C, Calendar Years).
- Ratio: Highest level of measurement. Has a true zero (meaning complete absence). Ratios are meaningful. (e.g., Height, Weight, Salary, Age).
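The four measurement levels differ in which summary statistics make sense. The sketch below uses Python's standard `statistics` module on small made-up samples; the sample values and the grade ordering are assumptions for illustration.

```python
import statistics

# Nominal: only counting and the mode are meaningful
blood_groups = ["A", "O", "B", "O", "AB", "O"]
mode_group = statistics.mode(blood_groups)

# Ordinal: order is meaningful, so the median is too (computed via ranks)
grade_order = ["C", "B", "A"]                 # lowest to highest
grades = ["B", "A", "C", "B", "A"]
ranks = [grade_order.index(g) for g in grades]
median_grade = grade_order[statistics.median_low(ranks)]

# Interval: differences and means are meaningful, ratios are not
temps_c = [10.0, 20.0, 30.0]
mean_temp = statistics.mean(temps_c)
naive_ratio = temps_c[1] / temps_c[0]         # 2.0, but 20 °C is not "twice as hot"
kelvin_ratio = (temps_c[1] + 273.15) / (temps_c[0] + 273.15)

# Ratio: a true zero makes ratios meaningful
weights_kg = [40.0, 80.0]
weight_ratio = weights_kg[1] / weights_kg[0]  # 80 kg really is twice 40 kg
```

Converting Celsius to Kelvin, a scale with a true zero, shows why the naive ratio on interval data is misleading: the same two temperatures differ by only a few percent.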
Data Quality & Remediation
Strategies to clean imperfect real-world datasets:
- Missing Values: Remedied by deletion, mean/median/mode imputation, or predictive imputation using other ML models.
- Noisy Data: Addressed using binning (smoothing by grouping), regression fitting, or clustering to identify and remove outliers.
- Duplicate Data: Solved using deduplication algorithms based on key attributes to keep a single authoritative record.
- Irrelevant Features: Handled using Feature Selection (statistical correlation tests) or Dimensionality Reduction (PCA) to compress feature space.
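Three of the remediation strategies above are simple enough to sketch with the standard library: mean imputation for missing values, smoothing by bin means for noisy values, and key-based deduplication. The sample data, the `None` marker for missing values, and the `customer_id` key are assumptions for the example.

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort, split into equal-size bins, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        smoothed.extend([statistics.mean(current_bin)] * len(current_bin))
    return smoothed

def deduplicate(records, key):
    """Keep the first record seen for each value of the key attribute."""
    unique = {}
    for record in records:
        unique.setdefault(record[key], record)
    return list(unique.values())

ages = impute_mean([25, None, 35, 30])

prices = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)

customers = deduplicate(
    [
        {"customer_id": 1, "name": "Asha"},
        {"customer_id": 2, "name": "Ben"},
        {"customer_id": 1, "name": "Asha K."},
    ],
    key="customer_id",
)
```

Keeping the first record per key is only one possible policy; choosing which duplicate is authoritative (most recent, most complete) depends on the dataset.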