Q1

(a)

L1 =
{
    { Avator2 },
    { Fabelmans },
    { Scream6 },
    { TheFirstSlamDunk },
    { Quantumania }
}

// Join/Prune step
C2 =
{
    { Avator2, Fabelmans },
    { Avator2, Scream6 },
    { Avator2, TheFirstSlamDunk },
    { Avator2, Quantumania },
    { Fabelmans, Scream6 },
    { Fabelmans, TheFirstSlamDunk },
    { Fabelmans, Quantumania },
    { Scream6, TheFirstSlamDunk },
    { Scream6, Quantumania },
    { TheFirstSlamDunk, Quantumania }
}

// Counting step
L2 =
{
    { Avator2, Fabelmans },
    { Avator2, Scream6 },
    { Avator2, Quantumania },
    { Fabelmans, TheFirstSlamDunk },
    { Fabelmans, Quantumania },
    { Scream6, TheFirstSlamDunk },
    { TheFirstSlamDunk, Quantumania }
}

// Join step
C3 =
{
    { Avator2, Fabelmans, Scream6 },
    { Avator2, Fabelmans, Quantumania },
    { Avator2, Scream6, Quantumania },
    { Fabelmans, TheFirstSlamDunk, Quantumania }
}

// Prune step
C3 =
{
    { Avator2, Fabelmans, Quantumania },
    { Fabelmans, TheFirstSlamDunk, Quantumania }
}

// Counting step
L3 =
{
    { Fabelmans, TheFirstSlamDunk, Quantumania }
}

// Join step
C4 = {}

Large itemsets =
{
    { Avator2 },
    { Fabelmans },
    { Scream6 },
    { TheFirstSlamDunk },
    { Quantumania },
    { Avator2, Fabelmans },
    { Avator2, Scream6 },
    { Avator2, Quantumania },
    { Fabelmans, TheFirstSlamDunk },
    { Fabelmans, Quantumania },
    { Scream6, TheFirstSlamDunk },
    { TheFirstSlamDunk, Quantumania },
    { Fabelmans, TheFirstSlamDunk, Quantumania }
}
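
The join and prune steps that lead from L2 to C3 can be sketched in Python as below. This is a minimal illustration, not the full Apriori algorithm: the counting step is left out because the transactions are not listed here. The item order is assumed to be the order in which the items appear in L1, and the names `ORDER`, `RANK`, `join` and `prune` are made up for this sketch.

```python
from itertools import combinations

# Assumed item order: the order in which the items are listed in L1.
ORDER = ["Avator2", "Fabelmans", "Scream6", "TheFirstSlamDunk", "Quantumania"]
RANK = {item: i for i, item in enumerate(ORDER)}

# L2 as listed in the answer above.
L2 = [
    ("Avator2", "Fabelmans"),
    ("Avator2", "Scream6"),
    ("Avator2", "Quantumania"),
    ("Fabelmans", "TheFirstSlamDunk"),
    ("Fabelmans", "Quantumania"),
    ("Scream6", "TheFirstSlamDunk"),
    ("TheFirstSlamDunk", "Quantumania"),
]


def join(prev):
    """Join step: merge two itemsets that agree on all but their last item."""
    ordered = [tuple(sorted(s, key=RANK.get)) for s in prev]
    joined = []
    for a, b in combinations(ordered, 2):
        if a[:-1] == b[:-1]:
            joined.append(tuple(sorted(set(a) | set(b), key=RANK.get)))
    return joined


def prune(candidates, prev):
    """Prune step: keep a candidate only if every (k-1)-subset is in prev."""
    prev_sets = {frozenset(s) for s in prev}
    return [
        c for c in candidates
        if all(frozenset(sub) in prev_sets for sub in combinations(c, len(c) - 1))
    ]


c3_joined = join(L2)
c3_pruned = prune(c3_joined, L2)
for itemset in c3_pruned:
    print(itemset)
```

Under these assumptions the join produces the four candidates listed for C3, and the prune removes the two candidates that contain { Fabelmans, Scream6 } or { Scream6, Quantumania }, neither of which is in L2.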

(b)

{ Fabelmans, TheFirstSlamDunk } -> { Quantumania }
{ TheFirstSlamDunk, Quantumania } -> { Fabelmans }

Q2

(a)

Frequency table of all items

Item  Frequency
a     3
b     3
c     5
d     5
e     1
f     6
g     1
h     1
i     1
j     1
k     1
l     1
m     1
n     1
o     1
p     1
q     1

Frequency table of all frequent items

Item  Frequency
a     3
b     3
c     5
d     5
f     6

Frequency table of all frequent items (sorted in descending order of frequencies)

Item  Frequency
f     6
c     5
d     5
a     3
b     3

Ordered frequent items

TID  Ordered Frequent Items
1    f
2    f
3    d, b
4    a
5    c
6    c
7    f
8    c, d, b
9    f
10   c
11   d, a
12   c
13   f, d, b
14   f, d, a

Frequent itemsets =
{
    { b },
    { b, d },
    { a },
    { a, d },
    { d },
    { d, f },
    { c },
    { f }
}
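
The list above can be cross-checked by brute force against the ordered frequent items table. The sketch below is not FP-growth itself; it simply counts every subset of every transaction. The minimum support count of 2 is an assumption inferred from the tables (items with frequency 1 are dropped, while { a, d } with support 2 is kept), and the names `TRANSACTIONS` and `MIN_SUPPORT` are made up for the sketch.

```python
from collections import Counter
from itertools import combinations

# Ordered frequent items per TID (1..14), one letter per item.
TRANSACTIONS = [
    "f", "f", "db", "a", "c", "c", "f",
    "cdb", "f", "c", "da", "c", "fdb", "fda",
]
MIN_SUPPORT = 2  # assumption: inferred from the tables, not stated in this answer

# Count the support of every non-empty subset of every transaction.
counts = Counter()
for transaction in TRANSACTIONS:
    items = sorted(transaction)
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            counts[subset] += 1

frequent = {s: n for s, n in counts.items() if n >= MIN_SUPPORT}
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

With this threshold the sketch yields the same eight itemsets as listed above, with supports 3 for { b, d } and 2 for both { a, d } and { d, f }.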

(b)

{ b } -> { d }
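
The confidence of this rule can be computed from the ordered frequent items table as support({ b, d }) / support({ b }). The sketch below does this for the rule and for its reverse. The confidence threshold is not stated in this answer, so the code only reports the values; `support` and `confidence` are helper names made up for the sketch.

```python
# Ordered frequent items per TID (1..14), one letter per item.
TRANSACTIONS = [
    set(t) for t in [
        "f", "f", "db", "a", "c", "c", "f",
        "cdb", "f", "c", "da", "c", "fdb", "fda",
    ]
]


def support(items):
    """Number of transactions that contain every item in `items`."""
    return sum(items <= t for t in TRANSACTIONS)


def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)


print(confidence({"b"}, {"d"}))
print(confidence({"d"}, {"b"}))
```

Every transaction that contains b also contains d, so { b } -> { d } has confidence 3/3, while the reverse rule only reaches 3/5.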

Q3

(a)

(i)

k = 2

Cluster    Size  Average Distance  Data Points  Mean
Cluster 1  4     7.448289596                    (14.5, 16.25)
Cluster 2  4     5.236630928                    (4.25, 8)

(ii)

k = 3

Cluster    Size  Average Distance  Data Points  Mean
Cluster 1  3     2.687415275                    (2.333, 10.333)
Cluster 2  1     0                              (10, 30)
Cluster 3  4     4.663317009                    (14.5, 9)

(iii)

k = 5

Cluster    Size  Average Distance  Data Points  Mean
Cluster 1  2     1.802775638                    (2, 8.5)
Cluster 2  3     1.800481345                    (16, 11.667)
Cluster 3  1     0                              (3, 14)
Cluster 4  1     0                              (10, 30)
Cluster 5  1     0                              (10, 1)

(b)

  • Advantages
    • The algorithm is simple to implement.
    • The algorithm can be scaled to large data sets easily.
  • Disadvantages
    • The number of clusters k must be chosen in advance, but it is usually unknown before clustering, so setting k is not user-friendly.
    • If a randomly initialized centroid is placed badly, no points may be assigned to its cluster.
    • It tends to find only roughly spherical clusters.
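
The points about simplicity and bad initialization can be illustrated with a minimal k-means sketch. The data points and the initial centroids below are hypothetical and are not the data set from part (a). Leaving an empty cluster's centroid where it was is just one possible choice made for this sketch.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2-D tuples; returns (centroids, clusters)."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its old centroid (a choice made for this sketch).
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

good_centroids, good_clusters = kmeans(points, [(0, 0), (10, 10)])
bad_centroids, bad_clusters = kmeans(points, [(0, 0), (100, 100)])

print([len(c) for c in good_clusters])
print([len(c) for c in bad_clusters])
```

With the first initialization both groups are recovered. With the second, the centroid at (100, 100) is farther from every point than the other centroid, so its cluster stays empty.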

Q4

    1       2       3       4       5       6       7       8
1   0
2   11      0
3   5       13      0
4   12      2       14      0
5   7       17      [ 1 ]   18      0
6   13      4       15      5       20      0
7   9       15      12      16      15      19      0
8   11      20      12      21      17      22      30      0

      1       2       3, 5    4       6       7       8
1     0
2     11      0
3, 5  6       15      0
4     12      [ 2 ]   16      0
6     13      4       17.5    5       0
7     9       15      13.5    16      19      0
8     11      20      14.5    21      22      30      0

      1       2, 4    3, 5    6       7       8
1     0
2, 4  11.5    0
3, 5  6       15.5    0
6     13      [ 4.5 ] 17.5    0
7     9       15.5    13.5    19      0
8     11      20.5    14.5    22      30      0

         1        2, 4, 6  3, 5     7        8
1        0
2, 4, 6  12       0
3, 5     [ 6 ]    16.167   0
7        9        16.667   13.5     0
8        11       21       14.5     30       0

         1, 3, 5  2, 4, 6  7        8
1, 3, 5  0
2, 4, 6  14.778   0
7        [ 12 ]   16.667   0
8        13.333   21       30       0

            1, 3, 5, 7  2, 4, 6     8
1, 3, 5, 7  0
2, 4, 6     [ 15.25 ]   0
8           17.5        21          0

                     1, 2, 3, 4, 5, 6, 7  8
1, 2, 3, 4, 5, 6, 7  0
8                    [ 19 ]               0

                        1, 2, 3, 4, 5, 6, 7, 8
1, 2, 3, 4, 5, 6, 7, 8  0
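
The sequence of merges can be reproduced with a short Python sketch. It assumes group-average linkage (the unweighted mean of all pairwise distances between two clusters), which is what the updated entries in the matrices appear to follow; the linkage is not named in this answer. The names `LOWER`, `DIST`, `group_average` and `agglomerate` are made up for the sketch.

```python
from itertools import combinations

# Lower triangle of the initial distance matrix: row i holds d(i, 1), d(i, 2), ...
LOWER = {
    2: [11],
    3: [5, 13],
    4: [12, 2, 14],
    5: [7, 17, 1, 18],
    6: [13, 4, 15, 5, 20],
    7: [9, 15, 12, 16, 15, 19],
    8: [11, 20, 12, 21, 17, 22, 30],
}
DIST = {
    frozenset((i, j + 1)): d
    for i, row in LOWER.items()
    for j, d in enumerate(row)
}


def group_average(a, b):
    """Mean of all pairwise distances between clusters a and b (assumed linkage)."""
    total = sum(DIST[frozenset((p, q))] for p in a for q in b)
    return total / (len(a) * len(b))


def agglomerate(points):
    """Repeatedly merge the closest pair of clusters; return the merge history."""
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: group_average(*pair))
        merged = tuple(sorted(a + b))
        merges.append((merged, round(group_average(a, b), 3)))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


merges = agglomerate(range(1, 9))
for cluster, distance in merges:
    print(cluster, distance)
```

Under this assumption the merge distances are 1, 2, 4.5, 6, 12, 15.25 and 19, matching the bracketed entries in the matrices.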

Q5

(a)

(i)

(ii)

The new customer will not buy a Mac Studio.

(b)

Compared to ID3, C4.5 has a lower tendency to choose an attribute containing more values.

The reason is that C4.5 takes SplitInfo into account when calculating Gain: the Gain used by C4.5 (the gain ratio) is the Gain used by ID3 divided by SplitInfo. SplitInfo is the entropy of the partition induced by the attribute, so it tends to grow with the number of distinct values. Dividing by it penalizes an attribute containing more values, lowering its Gain and making that attribute less likely to be chosen.
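
The effect of dividing by SplitInfo can be seen on a small made-up example. The data below is hypothetical and is not the data set from part (a): eight records, one attribute with a unique value per record and one binary attribute, both of which separate the classes perfectly. The function names are made up for the sketch, and `gain_ratio` assumes the attribute takes at least two distinct values so that SplitInfo is non-zero.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def info_gain(attribute, labels):
    """Information gain as used by ID3."""
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(attribute, labels):
    """Gain divided by SplitInfo; SplitInfo is the entropy of the attribute values."""
    return info_gain(attribute, labels) / entropy(attribute)


# Hypothetical data: both attributes split the classes perfectly.
labels = ["yes"] * 4 + ["no"] * 4
record_id = list(range(8))       # 8 distinct values
binary = ["x"] * 4 + ["y"] * 4   # 2 distinct values

print(info_gain(record_id, labels), info_gain(binary, labels))
print(gain_ratio(record_id, labels), gain_ratio(binary, labels))
```

In this example both attributes have the same information gain, so ID3 cannot prefer the binary attribute, while the gain ratio of the 8-valued attribute is divided by a SplitInfo of log2(8) = 3 and drops to one third of the binary attribute's.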