Q1
(a)
L1 =
{
{ Avator2 },
{ Fabelmans },
{ Scream6 },
{ TheFirstSlamDunk },
{ Quantumania }
}
// Join/Prune step
C2 =
{
{ Avator2, Fabelmans },
{ Avator2, Scream6 },
{ Avator2, TheFirstSlamDunk },
{ Avator2, Quantumania },
{ Fabelmans, Scream6 },
{ Fabelmans, TheFirstSlamDunk },
{ Fabelmans, Quantumania },
{ Scream6, TheFirstSlamDunk },
{ Scream6, Quantumania },
{ TheFirstSlamDunk, Quantumania }
}
// Counting step
L2 =
{
{ Avator2, Fabelmans },
{ Avator2, Scream6 },
{ Avator2, Quantumania },
{ Fabelmans, TheFirstSlamDunk },
{ Fabelmans, Quantumania },
{ Scream6, TheFirstSlamDunk },
{ TheFirstSlamDunk, Quantumania }
}
// Join step
C3 =
{
{ Avator2, Fabelmans, Scream6 },
{ Avator2, Fabelmans, Quantumania },
{ Avator2, Scream6, Quantumania },
{ Fabelmans, TheFirstSlamDunk, Quantumania }
}
// Prune step
C3 =
{
{ Avator2, Fabelmans, Quantumania },
{ Fabelmans, TheFirstSlamDunk, Quantumania }
}
// Counting step
L3 =
{
{ Fabelmans, TheFirstSlamDunk, Quantumania }
}
// Join step
C4 = {}
Large itemsets =
{
{ Avator2 },
{ Fabelmans },
{ Scream6 },
{ TheFirstSlamDunk },
{ Quantumania },
{ Avator2, Fabelmans },
{ Avator2, Scream6 },
{ Avator2, Quantumania },
{ Fabelmans, TheFirstSlamDunk },
{ Fabelmans, Quantumania },
{ Scream6, TheFirstSlamDunk },
{ TheFirstSlamDunk, Quantumania },
{ Fabelmans, TheFirstSlamDunk, Quantumania }
}
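The join and prune steps that turn L2 into C3 can be sketched in Python as follows. This is a minimal illustration, not the required solution format: the function names `join` and `prune` and the fixed item order `ORDER` are assumptions introduced here, and the counting step is omitted because the transactions are not reproduced in this answer.

```python
from itertools import combinations

# Assumed item order, taken from the order in which items are listed in L1.
ORDER = ["Avator2", "Fabelmans", "Scream6", "TheFirstSlamDunk", "Quantumania"]
RANK = {item: i for i, item in enumerate(ORDER)}


def join(prev_large):
    """Join step: merge two k-itemsets that agree on their first k-1 items."""
    itemsets = [tuple(sorted(s, key=RANK.get)) for s in prev_large]
    candidates = []
    for a, b in combinations(itemsets, 2):
        if a[:-1] == b[:-1] and a[-1] != b[-1]:
            tail = tuple(sorted((a[-1], b[-1]), key=RANK.get))
            candidates.append(a[:-1] + tail)
    return candidates


def prune(candidates, prev_large):
    """Prune step: drop candidates that have a k-subset missing from L_k."""
    prev = {frozenset(s) for s in prev_large}
    return [
        c for c in candidates
        if all(frozenset(sub) in prev for sub in combinations(c, len(c) - 1))
    ]


L2 = [
    ("Avator2", "Fabelmans"),
    ("Avator2", "Scream6"),
    ("Avator2", "Quantumania"),
    ("Fabelmans", "TheFirstSlamDunk"),
    ("Fabelmans", "Quantumania"),
    ("Scream6", "TheFirstSlamDunk"),
    ("TheFirstSlamDunk", "Quantumania"),
]

c3_joined = join(L2)
c3_pruned = prune(c3_joined, L2)
for itemset in c3_pruned:
    print(itemset)
```

On the L2 listed above, the join yields four candidates and the prune keeps two, matching the two C3 listings.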
(b)
{ Fabelmans, TheFirstSlamDunk } -> { Quantumania }
{ TheFirstSlamDunk, Quantumania } -> { Fabelmans }
Q2
(a)
Frequency table of all items
| Item | Frequency |
|---|---|
| a | 3 |
| b | 3 |
| c | 5 |
| d | 5 |
| e | 1 |
| f | 6 |
| g | 1 |
| h | 1 |
| i | 1 |
| j | 1 |
| k | 1 |
| l | 1 |
| m | 1 |
| n | 1 |
| o | 1 |
| p | 1 |
| q | 1 |
Frequency table of all frequent items
| Item | Frequency |
|---|---|
| a | 3 |
| b | 3 |
| c | 5 |
| d | 5 |
| f | 6 |
Frequency table of all frequent items (sorted in descending order of frequencies)
| Item | Frequency |
|---|---|
| f | 6 |
| c | 5 |
| d | 5 |
| a | 3 |
| b | 3 |
Ordered frequent items
| TID | Ordered Frequent Items |
|---|---|
| 1 | f |
| 2 | f |
| 3 | d, b |
| 4 | a |
| 5 | c |
| 6 | c |
| 7 | f |
| 8 | c, d, b |
| 9 | f |
| 10 | c |
| 11 | d, a |
| 12 | c |
| 13 | f, d, b |
| 14 | f, d, a |
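The filtering and ordering step behind the table above can be sketched as follows. The sketch assumes a support threshold of 2 (the actual value from the question is not shown here; any threshold that excludes the items with frequency 1 gives the same order), and the raw transaction in the usage line is hypothetical. Ties are broken alphabetically, which reproduces the order f, c, d, a, b.

```python
# Frequencies copied from the table of all items (only some infrequent items listed).
FREQ = {"a": 3, "b": 3, "c": 5, "d": 5, "e": 1, "f": 6, "g": 1}
MIN_SUPPORT = 2  # assumed threshold, not stated in this answer

# Frequent items in descending frequency, ties broken alphabetically.
FREQUENT_ORDER = sorted(
    (item for item, count in FREQ.items() if count >= MIN_SUPPORT),
    key=lambda item: (-FREQ[item], item),
)


def ordered_frequent_items(transaction):
    """Drop infrequent items and order the rest by the global frequency order."""
    present = set(transaction)
    return [item for item in FREQUENT_ORDER if item in present]


# Hypothetical raw transaction containing one infrequent item (g).
example = ordered_frequent_items(["b", "d", "f", "g"])
print(FREQUENT_ORDER)
print(example)
```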
Frequent itemsets =
{
{ b },
{ b, d },
{ a },
{ a, d },
{ d },
{ d, f },
{ c },
{ f }
}
(b)
{ b } -> { d }
Q3
(a)
(i)
k = 2
| Cluster | Size | Average Distance | Data Points | Mean |
|---|---|---|---|---|
| Cluster 1 | 4 | 7.448289596 | | (14.5, 16.25) |
| Cluster 2 | 4 | 5.236630928 | | (4.25, 8) |
(ii)
k = 3
| Cluster | Size | Average Distance | Data Points | Mean |
|---|---|---|---|---|
| Cluster 1 | 3 | 2.687415275 | | (2.333, 10.333) |
| Cluster 2 | 1 | 0 | | (10, 30) |
| Cluster 3 | 4 | 4.663317009 | | (14.5, 9) |
(iii)
k = 5
| Cluster | Size | Average Distance | Data Points | Mean |
|---|---|---|---|---|
| Cluster 1 | 2 | 1.802775638 | | (2, 8.5) |
| Cluster 2 | 3 | 1.800481345 | | (16, 11.667) |
| Cluster 3 | 1 | 0 | | (3, 14) |
| Cluster 4 | 1 | 0 | | (10, 30) |
| Cluster 5 | 1 | 0 | | (10, 1) |
(b)
- Advantages
  - The algorithm is simple to implement.
  - The algorithm scales easily to large data sets.
- Disadvantages
  - The number of clusters k must be chosen in advance, but it is usually unknown before clustering, so the parameter is not user-friendly.
  - A "bad" random initialization of the centroids can leave a cluster with no points assigned to it.
  - The algorithm tends to find only clusters with roughly spherical shapes.
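The empty-cluster problem can be illustrated with a short k-means sketch. The six points and both sets of initial centroids are hypothetical and are not the data from part (a). The choice to let an empty cluster keep its old centroid is also an assumption, since implementations differ on this.

```python
import math


def assign(points, centroids):
    """Assign every point to its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        clusters[nearest].append(p)
    return clusters


def update(clusters, centroids):
    """Recompute each centroid as the mean of its members."""
    new_centroids = []
    for members, old in zip(clusters, centroids):
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop changing."""
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        updated = update(clusters, centroids)
        if updated == centroids:
            break
        centroids = updated
    return clusters, centroids


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

good_clusters, _ = kmeans(points, [(0, 0), (10, 10)])
bad_clusters, _ = kmeans(points, [(0, 0), (100, 100)])

print([len(c) for c in good_clusters])
print([len(c) for c in bad_clusters])
```

With the second initialization, the centroid at (100, 100) is farther from every point than the other centroid, so its cluster stays empty for the whole run.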
Q4
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | | | | | | | |
| 2 | 11 | 0 | | | | | | |
| 3 | 5 | 13 | 0 | | | | | |
| 4 | 12 | 2 | 14 | 0 | | | | |
| 5 | 7 | 17 | [ 1 ] | 18 | 0 | | | |
| 6 | 13 | 4 | 15 | 5 | 20 | 0 | | |
| 7 | 9 | 15 | 12 | 16 | 15 | 19 | 0 | |
| 8 | 11 | 20 | 12 | 21 | 17 | 22 | 30 | 0 |

| | 1 | 2 | 3, 5 | 4 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| 1 | 0 | | | | | | |
| 2 | 11 | 0 | | | | | |
| 3, 5 | 6 | 15 | 0 | | | | |
| 4 | 12 | [ 2 ] | 16 | 0 | | | |
| 6 | 13 | 4 | 17.5 | 5 | 0 | | |
| 7 | 9 | 15 | 13.5 | 16 | 19 | 0 | |
| 8 | 11 | 20 | 14.5 | 21 | 22 | 30 | 0 |

| | 1 | 2, 4 | 3, 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|
| 1 | 0 | | | | | |
| 2, 4 | 11.5 | 0 | | | | |
| 3, 5 | 6 | 15.5 | 0 | | | |
| 6 | 13 | [ 4.5 ] | 17.5 | 0 | | |
| 7 | 9 | 15.5 | 13.5 | 19 | 0 | |
| 8 | 11 | 20.5 | 14.5 | 22 | 30 | 0 |

| | 1 | 2, 4, 6 | 3, 5 | 7 | 8 |
|---|---|---|---|---|---|
| 1 | 0 | | | | |
| 2, 4, 6 | 12 | 0 | | | |
| 3, 5 | [ 6 ] | 16.167 | 0 | | |
| 7 | 9 | 16.667 | 13.5 | 0 | |
| 8 | 11 | 21 | 14.5 | 30 | 0 |

| | 1, 3, 5 | 2, 4, 6 | 7 | 8 |
|---|---|---|---|---|
| 1, 3, 5 | 0 | | | |
| 2, 4, 6 | 14.778 | 0 | | |
| 7 | [ 12 ] | 16.667 | 0 | |
| 8 | 13.333 | 21 | 30 | 0 |

| | 1, 3, 5, 7 | 2, 4, 6 | 8 |
|---|---|---|---|
| 1, 3, 5, 7 | 0 | | |
| 2, 4, 6 | [ 15.25 ] | 0 | |
| 8 | 17.5 | 21 | 0 |

| | 1, 2, 3, 4, 5, 6, 7 | 8 |
|---|---|---|
| 1, 2, 3, 4, 5, 6, 7 | 0 | |
| 8 | [ 19 ] | 0 |

| | 1, 2, 3, 4, 5, 6, 7, 8 |
|---|---|
| 1, 2, 3, 4, 5, 6, 7, 8 | 0 |
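The merged distances in the matrices above are consistent with group-average linkage, where the distance between two clusters is the mean of all pairwise distances between their members. Under that assumption, the merge sequence can be checked with the short sketch below. The names `ROWS`, `group_average`, and `agglomerate` are introduced here for illustration only.

```python
from itertools import combinations

# Lower triangle of the initial distance matrix: ROWS[i][j - 1] is d(i, j) for j < i.
ROWS = {
    2: [11],
    3: [5, 13],
    4: [12, 2, 14],
    5: [7, 17, 1, 18],
    6: [13, 4, 15, 5, 20],
    7: [9, 15, 12, 16, 15, 19],
    8: [11, 20, 12, 21, 17, 22, 30],
}
DIST = {frozenset((i, j + 1)): d for i, row in ROWS.items() for j, d in enumerate(row)}


def group_average(a, b):
    """Mean of all pairwise distances between members of clusters a and b."""
    return sum(DIST[frozenset((p, q))] for p in a for q in b) / (len(a) * len(b))


def agglomerate(n_points):
    """Repeatedly merge the two closest clusters and record each merge."""
    clusters = [(i,) for i in range(1, n_points + 1)]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: group_average(*pair))
        merges.append((a, b, group_average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges


merges = agglomerate(8)
for a, b, d in merges:
    print(sorted(a + b), round(d, 3))
```

The recorded merge distances are 1, 2, 4.5, 6, 12, 15.25, and 19, which are the bracketed entries in the matrices.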
Q5
(a)
(i)
(ii)
The new customer will not buy a Mac Studio.
(b)
Compared to ID3, C4.5 has a lower tendency to choose an attribute containing more values.
The reason is that C4.5 takes SplitInfo into account when calculating Gain: the Gain in C4.5 is the Gain in ID3 divided by SplitInfo. SplitInfo tends to grow with the number of distinct values of an attribute, so the division penalizes attributes containing more values. In other words, it lowers their Gain and makes them less likely to be chosen.
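The effect can be seen in a small numeric sketch. The four-row data set and the attribute names `customer_id` and `student` are hypothetical and are not taken from part (a). SplitInfo is computed here as the entropy of the attribute's own value distribution.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of the value distribution."""
    total = len(values)
    return -sum(n / total * math.log2(n / total) for n in Counter(values).values())


def info_gain(attribute, labels):
    """ID3 criterion: label entropy minus the weighted entropy after the split."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [lab for a, lab in zip(attribute, labels) if a == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(attribute, labels):
    """C4.5 criterion: information gain divided by SplitInfo."""
    split_info = entropy(attribute)
    return info_gain(attribute, labels) / split_info if split_info > 0 else 0.0


# Hypothetical data: a unique ID per row versus a two-valued attribute.
labels = ["yes", "yes", "no", "no"]
customer_id = [1, 2, 3, 4]
student = ["y", "y", "n", "n"]

print(info_gain(customer_id, labels), info_gain(student, labels))
print(gain_ratio(customer_id, labels), gain_ratio(student, labels))
```

Both attributes separate the classes perfectly, so ID3 scores them equally, while the many-valued `customer_id` has a SplitInfo of 2 bits and its score under C4.5 is halved.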