CORE1232 HW1

Q1

(a)

L1 =
{
    { Avator2 },
    { Fabelmans },
    { Scream6 },
    { TheFirstSlamDunk },
    { Quantumania }
}

// Join/Prune step
C2 =
{
    { Avator2, Fabelmans },
    { Avator2, Scream6 },
    { Avator2, TheFirstSlamDunk },
    { Avator2, Quantumania },
    { Fabelmans, Scream6 },
    { Fabelmans, TheFirstSlamDunk },
    { Fabelmans, Quantumania },
    { Scream6, TheFirstSlamDunk },
    { Scream6, Quantumania },
    { TheFirstSlamDunk, Quantumania }
}

// Counting step
L2 =
{
    { Avator2, Fabelmans },
    { Avator2, Scream6 },
    { Avator2, Quantumania },
    { Fabelmans, TheFirstSlamDunk },
    { Fabelmans, Quantumania },
    { Scream6, TheFirstSlamDunk },
    { TheFirstSlamDunk, Quantumania }
}

$\pagebreak$

// Join step
C3 =
{
    { Avator2, Fabelmans, Scream6 },
    { Avator2, Fabelmans, Quantumania },
    { Avator2, Scream6, Quantumania },
    { Fabelmans, TheFirstSlamDunk, Quantumania }
}

// Prune step
C3 =
{
    { Avator2, Fabelmans, Quantumania },
    { Fabelmans, TheFirstSlamDunk, Quantumania }
}

// Counting step
L3 =
{
    { Fabelmans, TheFirstSlamDunk, Quantumania }
}

// Join step
C4 = {}

Large itemsets =
{
    { Avator2 },
    { Fabelmans },
    { Scream6 },
    { TheFirstSlamDunk },
    { Quantumania },
    { Avator2, Fabelmans },
    { Avator2, Scream6 },
    { Avator2, Quantumania },
    { Fabelmans, TheFirstSlamDunk },
    { Fabelmans, Quantumania },
    { Scream6, TheFirstSlamDunk },
    { TheFirstSlamDunk, Quantumania },
    { Fabelmans, TheFirstSlamDunk, Quantumania }
}
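
The join and prune steps from L2 to C3 can be reproduced with a short sketch. The function name `apriori_gen` and the alphabetical item ordering inside each itemset are assumptions made for illustration; the counting step is not shown because it needs the transaction data.

```python
from itertools import combinations

# Frequent 2-itemsets (L2) as listed in the answer above.
L2 = [
    ("Avator2", "Fabelmans"),
    ("Avator2", "Scream6"),
    ("Avator2", "Quantumania"),
    ("Fabelmans", "TheFirstSlamDunk"),
    ("Fabelmans", "Quantumania"),
    ("Scream6", "TheFirstSlamDunk"),
    ("TheFirstSlamDunk", "Quantumania"),
]


def apriori_gen(frequent, k):
    """Return (joined, pruned) size-k candidates built from frequent (k-1)-itemsets."""
    prev = {tuple(sorted(s)) for s in frequent}
    # Join step: combine two itemsets that agree on their first k-2 items.
    joined = set()
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                joined.add(a + (b[-1],))
    # Prune step: keep a candidate only if every (k-1)-subset is frequent.
    pruned = {
        c for c in joined
        if all(sub in prev for sub in combinations(c, k - 1))
    }
    return joined, pruned


C3_JOINED, C3_PRUNED = apriori_gen(L2, 3)
for itemset in sorted(C3_PRUNED):
    print(itemset)
```

The join yields four candidates, and the prune removes the two that contain an infrequent pair such as { Fabelmans, Scream6 } or { Scream6, Quantumania }.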

(b)

{ Fabelmans, TheFirstSlamDunk } -> { Quantumania }
{ TheFirstSlamDunk, Quantumania } -> { Fabelmans }

Q2

(a)

Frequency table of all items

Item	Frequency
a	3
b	3
c	5
d	5
e	1
f	6
g	1
h	1
i	1
j	1
k	1
l	1
m	1
n	1
o	1
p	1
q	1

$\pagebreak$

Frequency table of all frequent items

Item	Frequency
a	3
b	3
c	5
d	5
f	6

Frequency table of all frequent items (sorted in descending order of frequencies)

Item	Frequency
f	6
c	5
d	5
a	3
b	3

$\pagebreak$

Ordered frequent items

TID	Ordered Frequent Items
1	f
2	f
3	d, b
4	a
5	c
6	c
7	f
8	c, d, b
9	f
10	c
11	d, a
12	c
13	f, d, b
14	f, d, a

$\pagebreak$

Frequent itemsets =
{
    { b },
    { b, d },
    { a },
    { a, d },
    { d },
    { d, f },
    { c },
    { f }
}
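
As a cross-check, the frequent itemsets can be recomputed by brute-force support counting over the ordered frequent items. This is not FP-growth itself, only a verification sketch. The support threshold of 2 is an assumption inferred from the tables above (items with frequency 1 are dropped, and pairs with count 2 are kept); the name `frequent_itemsets` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Ordered frequent items per TID, copied from the table above.
TRANSACTIONS = [
    ["f"], ["f"], ["d", "b"], ["a"], ["c"], ["c"], ["f"],
    ["c", "d", "b"], ["f"], ["c"], ["d", "a"], ["c"],
    ["f", "d", "b"], ["f", "d", "a"],
]
MIN_SUPPORT = 2  # assumed threshold, not stated in this answer


def frequent_itemsets(transactions, min_support):
    """Count every non-empty subset of each transaction and keep the frequent ones."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: n for s, n in counts.items() if n >= min_support}


FREQUENT = frequent_itemsets(TRANSACTIONS, MIN_SUPPORT)
for itemset, count in sorted(FREQUENT.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

Dropping the infrequent single items first does not change the result, because any itemset containing an infrequent item is itself infrequent.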

(b)

{ b } -> { d }
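
The confidence of a rule is the support of the whole itemset divided by the support of the left-hand side. A minimal sketch over the ordered frequent items from part (a) follows; the helper names `support` and `confidence` are hypothetical, and the confidence threshold used in the question is not restated here.

```python
# Ordered frequent items per TID, copied from part (a).
TRANSACTIONS = [
    {"f"}, {"f"}, {"d", "b"}, {"a"}, {"c"}, {"c"}, {"f"},
    {"c", "d", "b"}, {"f"}, {"c"}, {"d", "a"}, {"c"},
    {"f", "d", "b"}, {"f", "d", "a"},
]


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)


print(confidence({"b"}, {"d"}))  # b always occurs together with d
print(confidence({"d"}, {"b"}))
print(confidence({"a"}, {"d"}))
```

Every transaction containing b also contains d, so { b } -> { d } has confidence 3/3, while the reverse rule only reaches 3/5.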

Q3

(a)

(i)

k = 2

Cluster	Size	Average Distance	Data Points	Mean
Cluster 1	4	7.448289596	$x_{1}, x_{3}, x_{5}, x_{8}$	(14.5, 16.25)
Cluster 2	4	5.236630928	$x_{2}, x_{4}, x_{6}, x_{7}$	(4.25, 8)

(ii)

k = 3

Cluster	Size	Average Distance	Data Points	Mean
Cluster 1	3	2.687415275	$x_{2}, x_{4}, x_{6}$	(2.333, 10.333)
Cluster 2	1	0	$x_{8}$	(10, 30)
Cluster 3	4	4.663317009	$x_{1}, x_{3}, x_{5}, x_{7}$	(14.5, 9)

$\pagebreak$

(iii)

k = 5

Cluster	Size	Average Distance	Data Points	Mean
Cluster 1	2	1.802775638	$x_{2}, x_{6}$	(2, 8.5)
Cluster 2	3	1.800481345	$x_{1}, x_{3}, x_{5}$	(16, 11.667)
Cluster 3	1	0	$x_{4}$	(3, 14)
Cluster 4	1	0	$x_{8}$	(10, 30)
Cluster 5	1	0	$x_{7}$	(10, 1)
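
The Mean and Average Distance columns can be computed as sketched below, assuming that Average Distance is the mean Euclidean distance from each member to the cluster mean (the tables do not state this explicitly). The function name `cluster_summary` is hypothetical. Only $x_{8} = (10, 30)$ is taken from the tables, where it forms a singleton cluster; the two-point cluster is made-up data for illustration.

```python
import math


def cluster_summary(points):
    """Return (mean, average Euclidean distance of the members to the mean)."""
    n = len(points)
    mean = tuple(sum(p[i] for p in points) / n for i in range(2))
    avg_dist = sum(math.dist(p, mean) for p in points) / n
    return mean, avg_dist


# Singleton cluster { x8 }: the mean is the point itself and the distance is 0.
print(cluster_summary([(10, 30)]))

# Hypothetical two-point cluster, not taken from the homework data.
print(cluster_summary([(0, 0), (0, 2)]))
```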

(b)

Advantages
- The algorithm is simple to implement.
- The algorithm can be scaled to large data sets easily.
Disadvantages
- The number of clusters k must be chosen in advance, but it is usually unknown before clustering, so the parameter k is not user-friendly.
- If the random initial centroids are chosen poorly, a cluster may end up with no points assigned to it.
- Only clusters with spherical shapes can be found.

$\pagebreak$

Q4

	1	2	3	4	5	6	7	8
1	0
2	11	0
3	5	13	0
4	12	2	14	0
5	7	17	[ 1 ]	18	0
6	13	4	15	5	20	0
7	9	15	12	16	15	19	0
8	11	20	12	21	17	22	30	0

	1	2	3, 5	4	6	7	8
1	0
2	11	0
3, 5	6	15	0
4	12	[ 2 ]	16	0
6	13	4	17.5	5	0
7	9	15	13.5	16	19	0
8	11	20	14.5	21	22	30	0

$\pagebreak$

	1	2, 4	3, 5	6	7	8
1	0
2, 4	11.5	0
3, 5	6	15.5	0
6	13	[ 4.5 ]	17.5	0
7	9	15.5	13.5	19	0
8	11	20.5	14.5	22	30	0

	1	2, 4, 6	3, 5	7	8
1	0
2, 4, 6	12	0
3, 5	[ 6 ]	16.167	0
7	9	16.667	13.5	0
8	11	21	14.5	30	0

	1, 3, 5	2, 4, 6	7	8
1, 3, 5	0
2, 4, 6	14.778	0
7	[ 12 ]	16.667	0
8	13.333	21	30	0

	1, 3, 5, 7	2, 4, 6	8
1, 3, 5, 7	0
2, 4, 6	[ 15.25 ]	0
8	17.5	21	0

$\pagebreak$

	1, 2, 3, 4, 5, 6, 7	8
1, 2, 3, 4, 5, 6, 7	0
8	[ 19 ]	0

	1, 2, 3, 4, 5, 6, 7, 8
1, 2, 3, 4, 5, 6, 7, 8	0
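
The merge sequence above is consistent with group-average linkage, where the distance between two clusters is the mean of all pairwise distances between their members. The sketch below replays the procedure from the first distance matrix; the linkage choice is inferred from the numbers rather than stated, and the names `group_average` and `agglomerate` are hypothetical.

```python
from itertools import combinations

# Lower-triangular entries of the first distance matrix (row i, columns 1..i-1).
ROWS = {
    2: [11],
    3: [5, 13],
    4: [12, 2, 14],
    5: [7, 17, 1, 18],
    6: [13, 4, 15, 5, 20],
    7: [9, 15, 12, 16, 15, 19],
    8: [11, 20, 12, 21, 17, 22, 30],
}
DIST = {
    frozenset((i, j + 1)): d
    for i, row in ROWS.items()
    for j, d in enumerate(row)
}


def group_average(a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(DIST[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))


def agglomerate(points):
    """Repeatedly merge the closest pair; return (merged cluster, distance) per step."""
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: group_average(*pair))
        merges.append((sorted(a | b), group_average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


MERGES = agglomerate(range(1, 9))
for members, height in MERGES:
    print(members, round(height, 3))
```

Each printed distance corresponds to the bracketed entry of the matching matrix above.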

Q5

(a)

(i)

The first split:

$$\mathrm{Info}(T) = 1 - \left(\frac{4}{8}\right)^{2} - \left(\frac{4}{8}\right)^{2} = \frac{1}{2}$$

For attribute Income,

$$\mathrm{Info}(T_{high}) = 1 - \left(\frac{3}{3}\right)^{2} - \left(\frac{0}{3}\right)^{2} = 0$$

$$\mathrm{Info}(T_{medium}) = 1 - \left(\frac{1}{5}\right)^{2} - \left(\frac{4}{5}\right)^{2} = \frac{8}{25}$$

$$\mathrm{Info}(\mathrm{Income}, T) = \frac{3}{8} \cdot 0 + \frac{5}{8} \cdot \frac{8}{25} = \frac{1}{5}$$

$$\mathrm{Gain}(\mathrm{Income}, T) = \frac{1}{2} - \frac{1}{5} = 0.3$$

For attribute Age,

$$\mathrm{Info}(T_{old}) = 1 - \left(\frac{2}{4}\right)^{2} - \left(\frac{2}{4}\right)^{2} = \frac{1}{2}$$

$$\mathrm{Info}(T_{young}) = 1 - \left(\frac{2}{4}\right)^{2} - \left(\frac{2}{4}\right)^{2} = \frac{1}{2}$$

$$\mathrm{Info}(\mathrm{Age}, T) = \frac{4}{8} \cdot \frac{1}{2} + \frac{4}{8} \cdot \frac{1}{2} = \frac{1}{2}$$

$$\mathrm{Gain}(\mathrm{Age}, T) = \frac{1}{2} - \frac{1}{2} = 0$$

For attribute Have_MacBook,

$$\mathrm{Info}(T_{yes}) = 1 - \left(\frac{2}{2}\right)^{2} - \left(\frac{0}{2}\right)^{2} = 0$$

$$\mathrm{Info}(T_{no}) = 1 - \left(\frac{2}{6}\right)^{2} - \left(\frac{4}{6}\right)^{2} = \frac{4}{9}$$

$$\mathrm{Info}(\mathrm{Have\_MacBook}, T) = \frac{2}{8} \cdot 0 + \frac{6}{8} \cdot \frac{4}{9} = \frac{1}{3}$$

$$\mathrm{Gain}(\mathrm{Have\_MacBook}, T) = \frac{1}{2} - \frac{1}{3} \approx 0.167$$

Income has the largest gain, so it is chosen for the first split.

The second split (on the branch Income = medium):

$$\mathrm{Info}(T) = 1 - \left(\frac{1}{5}\right)^{2} - \left(\frac{4}{5}\right)^{2} = \frac{8}{25}$$

For attribute Age,

$$\mathrm{Info}(T_{old}) = 1 - \left(\frac{0}{2}\right)^{2} - \left(\frac{2}{2}\right)^{2} = 0$$

$$\mathrm{Info}(T_{young}) = 1 - \left(\frac{1}{3}\right)^{2} - \left(\frac{2}{3}\right)^{2} = \frac{4}{9}$$

$$\mathrm{Info}(\mathrm{Age}, T) = \frac{2}{5} \cdot 0 + \frac{3}{5} \cdot \frac{4}{9} = \frac{4}{15}$$

$$\mathrm{Gain}(\mathrm{Age}, T) = \frac{8}{25} - \frac{4}{15} \approx 0.0533$$

For attribute Have_MacBook, no record in this branch has Have_MacBook = yes, so $T_{yes}$ is empty. Its impurity is undefined, but its weight is $\frac{0}{5}$, so it contributes nothing to the weighted sum.

$$\mathrm{Info}(T_{no}) = 1 - \left(\frac{1}{5}\right)^{2} - \left(\frac{4}{5}\right)^{2} = \frac{8}{25}$$

$$\mathrm{Info}(\mathrm{Have\_MacBook}, T) = \frac{5}{5} \cdot \frac{8}{25} = \frac{8}{25}$$

$$\mathrm{Gain}(\mathrm{Have\_MacBook}, T) = \frac{8}{25} - \frac{8}{25} = 0$$

Age has the larger gain, so it is chosen for the second split.
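
The gains above can be re-derived from the class counts alone. The sketch below uses the same impurity measure as the working, $1 - \sum_i p_i^2$; the function names `info` and `gain` are hypothetical, and each pair of counts is read off the fractions in the equations. Empty branches are skipped, which matches giving them zero weight.

```python
def info(counts):
    """Impurity 1 - sum(p_i^2) for a tuple of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def gain(parent, branches):
    """Impurity of the parent minus the size-weighted impurity of the branches."""
    total = sum(parent)
    weighted = sum(sum(b) / total * info(b) for b in branches if sum(b) > 0)
    return info(parent) - weighted


# First split: 8 records, 4 in each class.
print("Income      ", gain((4, 4), [(3, 0), (1, 4)]))
print("Age         ", gain((4, 4), [(2, 2), (2, 2)]))
print("Have_MacBook", gain((4, 4), [(2, 0), (2, 4)]))

# Second split on the Income = medium branch: 5 records, split 1 to 4.
print("Age         ", gain((1, 4), [(0, 2), (1, 2)]))
print("Have_MacBook", gain((1, 4), [(0, 0), (1, 4)]))
```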

$\pagebreak$

(ii)

The new customer will not buy a Mac Studio.

(b)

Compared to ID3, C4.5 has a lower tendency to choose an attribute containing more values.

The reason is that C4.5 takes SplitInfo into account when scoring a split. SplitInfo tends to grow with the number of distinct values of an attribute. Dividing the ID3 Gain by SplitInfo gives the Gain used in C4.5, which penalizes attributes with many values: their Gain is lowered, making them less likely to be chosen.
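
A small sketch of the penalty, assuming the usual definition $\mathrm{SplitInfo} = -\sum_j \frac{|T_j|}{|T|} \log_2 \frac{|T_j|}{|T|}$ over the branches of a split. The function names and the branch sizes are illustrative assumptions, not data from the question.

```python
import math


def split_info(branch_sizes):
    """Entropy of the branch sizes; larger when the split has more branches."""
    total = sum(branch_sizes)
    return -sum(n / total * math.log2(n / total) for n in branch_sizes if n > 0)


def gain_ratio(gain, branch_sizes):
    """Gain as used by C4.5: the ID3 gain divided by SplitInfo."""
    return gain / split_info(branch_sizes)


# Eight records split evenly into 2, 4, or 8 branches (hypothetical sizes).
for sizes in ([4, 4], [2, 2, 2, 2], [1] * 8):
    print(len(sizes), "branches:", split_info(sizes))

# With the same raw gain, the attribute with more values gets a lower ratio.
print(gain_ratio(0.3, [4, 4]), gain_ratio(0.3, [1] * 8))
```

For even splits SplitInfo equals $\log_2$ of the number of branches, so the same raw gain is divided by a larger number as the attribute takes more values.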
