Characteristics

  • Lazy learning (the test point is not given)
  • Non-parametric algorithm (no assumption)
  • Heavy computation, not real-time
  • A Low K-value is sensitive to outliers and a higher K-value is more resilient to outliers as it considers more voters to decide prediction
  • Higher K-value has higher computational time

Pros

  • Easy to understand
  • No assumptions about data
  • Classification and regression (average of the output value for the top K nearest neighbors)
  • Works easily on multi-class problems

Cons

  • Memory/Computationally expensive
  • Sensitive to scale of data
  • Struggle when high number of attributes (do standardization before knn)
  • Does not work well with categorical features since it is difficult to find the distance between dimensions with categorical features (use one-hot encoding for hamming distances)

Improve

  • Principle-Component Analysis (PCA) to reduce dimension
  • KD-tree
  • Parallelization for distance computations

Applications

  • Hand-written character recognition
  • Fast content-based image retrieval
  • Intrusion detection
  • Fault detection for semiconductor manufacturing processes

Steps

  1. Prepare training and test dataset
  2. Select K
  3. Determine distance function
  4. Compute distances to n training samples
  5. Sort the distances and find K-nearest data samples
  6. Assign to class based on majority vote

Standardization

Distance Measures

Euclidean distance

Manhattan distance

  • Reduce computation time
  • Might not get good result

Cosine distance

  • Smaller the distance, the more similar

Cosine distance vs cosine similarity

Hamming Distance

  • The number of bit positions in which the bits are different
  • For binary data string
  • Categorical data

K Value

Suppose we have even number of classes and if we choose K to be an odd number, we can prevent the tie condition

D-fold Cross-Validation

  1. Split into d groups
  2. Select 1 group for testing, remaining for testing
  3. For each value of K, classify the data and record the error
  4. Repeat 1-3 for different value of K
  5. For each value of K, find the average error across validation sets, choose K with the lowest error

Error Measurement for Classification Problems

Precision

Recall

F1-Score

Error Measurement for Regression Problems

Mean Absolute Error (MAE)

Mean Square Error (MSE)

Mean Absolute Percentage Error (MAPE)