Machine Learning
Unsupervised Learning / Apriori Algorithm
5. Apriori Algorithm & Association Rules
Association Rule Mining discovers interesting relationships between variables in large databases, famously used for Market Basket Analysis (finding which products are frequently bought together).
1. Support
How frequently an itemset appears in the dataset.
2. Confidence
How often the rule is true. Given A, how likely is B?
3. Lift
Strength of association over random chance. Lift > 1 = positive correlation.
The Apriori Property (Anti-Monotonicity)
"If an itemset is frequent, then all of its subsets must also be frequent."
Conversely: If an itemset is infrequent, all its supersets will also be infrequent. This rule allows the algorithm to aggressively prune the search space without counting every combination.
Apriori Algorithm Steps
- GENERATE C₁: Create all 1-itemsets.
- SCAN & PRUNE: Count support. Remove items < Minimum Support to create L₁.
- GENERATE Cₖ₊₁ from Lₖ: Join Lₖ with itself (self-join).
- PRUNE Cₖ₊₁ (Using Apriori Property): Remove candidates having ANY infrequent k-subset.
- SCAN & PRUNE: Count support in database. Remove itemsets < Minimum Support to get Lₖ₊₁.
- REPEAT: Until no new frequent itemsets can be found.
- GENERATE RULES: From frequent itemsets, generate rules satisfying Minimum Confidence.
Solved Example (Min Support = 2, Min Conf = 50%)
Dataset: A,B,C (T1), A,C (T2), A,D (T3), B,E,F (T4) ... total 9 transactions.
- Step 1 (Generate L₁): Scan DB. Suppose counts are: A:6, B:7, C:6, D:2. All ≥ 2, so L₁ = {{A}, {B}, {C}, {D}}.
- Step 2 (Generate C₂ → L₂): Join L₁. Candidate pairs are {A,B}, {A,C}, {B,C}, {B,D} etc. Scan DB. Only those appearing ≥ 2 times survive. L₂ = {{A,B}:4, {A,C}:4, {B,C}:4, {B,D}:2}.
- Step 3 (Generate C₃ → L₃): Join L₂ where first item matches. Candidate {A,B,C}. Scan DB. Count = 2. L₃ = {{A,B,C}:2}.
- Step 4 (Rules from L₃): For {A,B,C}:
- A,B → C: Conf = Sup(A,B,C) / Sup(A,B) = 2/4 = 50% (Valid)
- A,C → B: Conf = Sup(A,B,C) / Sup(A,C) = 2/4 = 50% (Valid)
- B,C → A: Conf = Sup(A,B,C) / Sup(B,C) = 2/4 = 50% (Valid)