Characteristics
- Lazy learning (the test point is not given)
- Non-parametric algorithm (no assumption)
- Heavy computation, not real-time
- A Low K-value is sensitive to outliers and a higher K-value is more resilient to outliers as it considers more voters to decide prediction
- Higher K-value has higher computational time
Pros
- Easy to understand
- No assumptions about data
- Classification and regression (average of the output value for the top K nearest neighbors)
- Works easily on multi-class problems
Cons
- Memory/Computationally expensive
- Sensitive to scale of data
- Struggle when high number of attributes (do standardization before knn)
- Does not work well with categorical features since it is difficult to find the distance between dimensions with categorical features (use one-hot encoding for hamming distances)
Improve
- Principle-Component Analysis (PCA) to reduce dimension
- KD-tree
- Parallelization for distance computations
Applications
- Hand-written character recognition
- Fast content-based image retrieval
- Intrusion detection
- Fault detection for semiconductor manufacturing processes
Steps
- Prepare training and test dataset
- Select K
- Determine distance function
- Compute distances to n training samples
- Sort the distances and find K-nearest data samples
- Assign to class based on majority vote
Standardization
Distance Measures
Euclidean distance
Manhattan distance
- Reduce computation time
- Might not get good result
Cosine distance
- Smaller the distance, the more similar
Cosine distance vs cosine similarity
Hamming Distance
- The number of bit positions in which the bits are different
- For binary data string
- Categorical data
K Value
Suppose we have even number of classes and if we choose K to be an odd number, we can prevent the tie condition
D-fold Cross-Validation
- Split into d groups
- Select 1 group for testing, remaining for testing
- For each value of K, classify the data and record the error
- Repeat 1-3 for different value of K
- For each value of K, find the average error across validation sets, choose K with the lowest error