Artificial Neural Network
- Universal function approximator
- Inspired by the neurons in the brain
- Basis of many of the most powerful artificial intelligence and machine learning algorithms
| Biological Neuron | Artificial Neuron |
|---|---|
| Dendrites | Inputs |
| Cell Nucleus (computation unit) | Node (linear function and activation function) |
| Axon | Output |
| Synapse | Weight |
Perceptron
If the output and the target differ, the weights are updated so that the output moves closer to the target.
| Variable | Meaning |
|---|---|
| Inputs | |
| Weights | |
| Bias | |
| Activation function | |
| Learning rate | |
| Output | |
| Target | |
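The update described above can be written out as a worked rule. The symbols here are assumptions chosen for illustration, since the notes do not fix a notation: inputs $x_i$, weights $w_i$, bias $b$, activation function $f$, learning rate $\eta$, output $y$, and target $t$.

$$
y = f\Big(\sum_i w_i x_i + b\Big), \qquad
w_i \leftarrow w_i + \eta\,(t - y)\,x_i, \qquad
b \leftarrow b + \eta\,(t - y)
$$

When $y = t$ the correction term is zero and nothing changes; otherwise each weight moves in the direction that brings $y$ closer to $t$.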
Stopping Rules
- Maximum training time
- Maximum number of training cycles (epochs)
- High enough accuracy
- Low enough error
- Weight change threshold
Cons
- Can only represent a limited set of functions
- Can only separate classes that are linearly separable
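The learning rule, two of the stopping rules, and the linear-separability limitation can be seen together in a small sketch. Everything here is an illustrative assumption rather than the notes' own code: the function names, the step activation, zero initial weights, and a learning rate of 1.

```python
def step(z):
    """Threshold activation: output 1 when the weighted sum is positive."""
    return 1 if z > 0 else 0

def train_perceptron(samples, lr=1.0, max_epochs=100):
    """Train on (inputs, target) pairs; return (weights, bias, converged)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):            # stopping rule: maximum number of epochs
        errors = 0
        for x, target in samples:
            output = step(sum(w * xi for w, xi in zip(weights, x)) + bias)
            if output != target:
                # Move the weights so the output gets closer to the target
                delta = lr * (target - output)
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                bias += delta
                errors += 1
        if errors == 0:                    # stopping rule: error is low enough
            return weights, bias, True
    return weights, bias, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # linearly separable
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not linearly separable

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
print("AND learned:", and_converged)
print("XOR learned:", xor_converged)
```

AND is learned within a few epochs, while XOR runs until the epoch limit because no single line separates its classes.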
Multilayer Perceptron
- Feed-forward neural network (no cycles)
- Input layer, one or more hidden layers, and an output layer (indexed i, j, k)
- Trained to minimize the error/loss function
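Steps
1. Initialize the weights and biases to small random values
2. Forward propagation (from input to output)
   - For the n-th node in the m-th layer, compute the weighted sum of its inputs plus the bias
   - Compute the output by applying the activation function to that sum
3. Backward propagation (from output to hidden)
   - Compute the error term of each output node
   - Update the hidden-to-output weights
   - Update the output biases
4. Backward propagation (from hidden to input)
   - Compute the error term of each hidden node
   - Update the input-to-hidden weights and hidden biases
5. Repeat from step 2 until a stopping criterion is met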
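As a worked reference for the steps above, the standard backpropagation equations can be sketched as follows. The notation is an assumption made for illustration: $o_i$, $o_j$, $o_k$ are the outputs of the input, hidden, and output nodes, $t_k$ is the target, $\eta$ is the learning rate, and the loss is taken to be the squared error $E = \tfrac{1}{2}\sum_k (t_k - o_k)^2$.

$$
\begin{aligned}
net_j &= \sum_i w_{ij}\, o_i + b_j, & o_j &= \frac{1}{1 + e^{-net_j}} \\
\delta_k &= o_k (1 - o_k)(t_k - o_k), & w_{jk} &\leftarrow w_{jk} + \eta\, \delta_k\, o_j, \quad b_k \leftarrow b_k + \eta\, \delta_k \\
\delta_j &= o_j (1 - o_j) \sum_k \delta_k\, w_{jk}, & w_{ij} &\leftarrow w_{ij} + \eta\, \delta_j\, o_i, \quad b_j \leftarrow b_j + \eta\, \delta_j
\end{aligned}
$$

The factor $o(1 - o)$ is the derivative of the activation with respect to its input.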
The formulas are for the sigmoid function
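A single training step for one example might look like the sketch below. It assumes a sigmoid activation in both layers and a squared-error loss; the function names, the tiny 2-2-1 network, and the starting weights are all hypothetical values picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward propagation: weighted sum plus bias, then sigmoid, layer by layer."""
    hidden = [sigmoid(sum(W1[i][j] * x[i] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]
    out = [sigmoid(sum(W2[j][k] * hidden[j] for j in range(len(hidden))) + b2[k])
           for k in range(len(b2))]
    return hidden, out

def loss(x, t, W1, b1, W2, b2):
    """Squared error for one example (an assumed loss; the notes do not name one)."""
    _, out = forward(x, W1, b1, W2, b2)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, out))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.5):
    """Update all weights and biases in place for a single example."""
    hidden, out = forward(x, W1, b1, W2, b2)
    # Error terms of the output nodes
    delta_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(out, t)]
    # Error terms of the hidden nodes, computed before W2 is modified
    delta_hidden = [hj * (1 - hj) * sum(delta_out[k] * W2[j][k] for k in range(len(out)))
                    for j, hj in enumerate(hidden)]
    for j, hj in enumerate(hidden):
        for k, dk in enumerate(delta_out):
            W2[j][k] += lr * dk * hj       # hidden-to-output weights
    for k, dk in enumerate(delta_out):
        b2[k] += lr * dk                   # output biases
    for i, xi in enumerate(x):
        for j, dj in enumerate(delta_hidden):
            W1[i][j] += lr * dj * xi       # input-to-hidden weights
    for j, dj in enumerate(delta_hidden):
        b1[j] += lr * dj                   # hidden biases

# W1[i][j] is the weight from input i to hidden node j; W2[j][k] from hidden j to output k
W1 = [[0.1, -0.2], [0.3, 0.4]]
b1 = [0.0, 0.0]
W2 = [[0.2], [-0.1]]
b2 = [0.0]
x, t = [1.0, 0.0], [1.0]

loss_before = loss(x, t, W1, b1, W2, b2)
backprop_step(x, t, W1, b1, W2, b2)
loss_after = loss(x, t, W1, b1, W2, b2)
print("loss before:", loss_before, "loss after:", loss_after)
```

A full training loop would repeat this step over every example and epoch until one of the stopping criteria is met.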
Stopping Criteria
- Fixed number of iterations
- Error falls below a threshold
- Error on the validation set reaches its minimum
Gradient Descent
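To reach a local minimum of the error function $E$, we follow the negative of the gradient and update the current guess by $w \leftarrow w - \eta \nabla E(w)$, where $\eta$ is the learning rate.
- Positive slope → decrease the weights and biases
- Negative slope → increase the weights and biases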
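The slope rule can be checked on a one-dimensional example. This is a minimal sketch: the error function $E(w) = (w - 3)^2$, the learning rate, and the function names are assumptions chosen so the minimum is known to be at $w = 3$.

```python
def gradient_descent(grad, w0, lr=0.1, steps=50):
    """Repeatedly step against the gradient, starting from the initial guess w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def grad(w):
    """Derivative of the illustrative error function E(w) = (w - 3)**2."""
    return 2 * (w - 3)

# Starting left of the minimum: the slope is negative, so w increases
w_from_left = gradient_descent(grad, w0=0.0)
# Starting right of the minimum: the slope is positive, so w decreases
w_from_right = gradient_descent(grad, w0=5.0)
print(w_from_left, w_from_right)
```

Both starting points end up close to 3, approaching it from opposite sides.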
Implementation
Steps
- Import the required libraries and define a global variable
- Load the data
- Explore the data
- Build the model
- Compile the model
- Train the model
- Evaluate the model accuracy
- Save the model
- Use the model
- Plot the confusion matrix
Code
- `kernel_regularizer=regularizers.l2(0.002)` to avoid overfitting
- `activation=activations.relu` or `activation='relu'`
How to Calculate the Parameters of a Dense Layer
- Flatten: (None, 784), no parameters
- Dense: (None, 128), parameters: 784 × 128 weights + 128 biases = 100,480
- Dense: (None, 128), parameters: 128 × 128 weights + 128 biases = 16,512
- Dense: (None, 10), parameters: 128 × 10 weights + 10 biases = 1,290
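The same arithmetic can be sketched in a few lines of plain Python. The helper name `dense_params` is hypothetical; it encodes only the rule used above, one weight per input-unit pair plus one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

# Flattened 784-feature input followed by Dense(128), Dense(128), Dense(10)
layer_sizes = [784, 128, 128, 10]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [100480, 16512, 1290]
print(total)      # 118282
```

These figures should match the per-layer and total parameter counts reported in a model summary for this architecture.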
Validation
- Validation loss
  - Average of the per-sample error (error = 1 - x/n); a perfect prediction assigns a probability of 1 to the correct label
- Validation accuracy
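Model Saving
- HDF5 (Hierarchical Data Format version 5)
```python
from tensorflow.keras.models import load_model  # assumes TensorFlow's bundled Keras

model_name = ""  # placeholder: set this to the target file path
model.save(model_name, save_format="h5")  # write the trained model in HDF5 format
loaded_model = load_model(model_name)  # restore the model from disk
```
Prediction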
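```python
import numpy as np

predictions = loaded_model.predict(x_test)  # one row of class probabilities per sample
print('predictions:', predictions.shape)
prediction_results = np.argmax(predictions, axis=1)  # most probable class per sample
```
Confusion Matrix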
```python
from sklearn.metrics import confusion_matrix

# First parameter is the actual label, second one is the prediction
cm = confusion_matrix(y_test, prediction_results)
```
Layers and Neurons
- 1 input layer
  - Number of neurons = number of features
- 1 output layer
  - Number of neurons = usually 1 (unless softmax is used)
- Hidden layers
  - Number of layers
    - Linearly separable → 0
    - Less complex → 1 to 2
    - More complex → 3 to 5
  - Number of neurons
    - Between the size of the input and output layer
    - Decreasing in subsequent layers to get closer to pattern and feature extraction
Weights and Biases
- Weights control the steepness of the activation function
  - Higher weight → steeper slope
  - Lower weight → gentler slope
- Biases shift the activation function left or right
  - Smaller bias → shifts right
  - Larger bias → shifts left
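Both effects can be checked numerically for a single sigmoid neuron with one input. This is a minimal sketch assuming the neuron computes `sigmoid(w * x + b)` with a positive weight; the function names are hypothetical.

```python
import math

def neuron(x, w, b):
    """Single-input sigmoid neuron."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def midpoint(w, b):
    """Input at which the output crosses 0.5, i.e. where w * x + b = 0."""
    return -b / w

def slope_at_midpoint(w):
    """Derivative of sigmoid(w * x + b) at the midpoint: w * 0.5 * (1 - 0.5)."""
    return w / 4.0

# Larger bias moves the curve left, smaller bias moves it right
print(midpoint(w=1.0, b=2.0), midpoint(w=1.0, b=-2.0))
# Higher weight gives a steeper curve around the midpoint
print(slope_at_midpoint(1.0), slope_at_midpoint(4.0))
```

With a negative weight the direction of the shift reverses, which is why the sketch restricts itself to the positive case.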
Problems
Vanishing Gradient
- Parameters of the higher layers change significantly
- Parameters of the lower layers barely change
- Gradients, and hence weight updates, may shrink to zero
- Learning is slow and may stagnate
Exploding Gradient
- All parameters grow exponentially
- Weights may become NaN
- Learning process avalanches and becomes unstable
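Both problems come from multiplying many per-layer factors together during backpropagation. The sketch below uses a deliberately simplified model, which is an assumption and not a full derivation: every layer contributes the same factor of weight magnitude times activation slope, with 0.25 being the maximum slope of the sigmoid.

```python
def gradient_scale(weight, n_layers, activation_slope=0.25):
    """Rough size of the gradient after passing back through n_layers layers."""
    return (abs(weight) * activation_slope) ** n_layers

# Per-layer factor 0.25 (< 1): the gradient shrinks toward zero
vanishing = gradient_scale(weight=1.0, n_layers=10)
# Per-layer factor 2.0 (> 1): the gradient grows exponentially
exploding = gradient_scale(weight=8.0, n_layers=10)

print(vanishing)  # below 1e-6
print(exploding)  # 1024.0
```

The layers closest to the input sit behind the most factors, which is why they are the ones that stop learning or blow up first.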
Overfitting
- Learns the details and noise of the training data
- Use a regularizer, which adds a penalty to the loss, to avoid overfitting
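An L2 regularizer is one common way to add such a penalty: it adds the sum of squared weights, scaled by a strength factor, to the loss. The sketch below is an illustration with hypothetical function names and made-up loss values; the strength of 0.002 is the one used in the Keras snippet earlier in these notes.

```python
def l2_penalty(weights, strength=0.002):
    """L2 penalty: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, strength=0.002):
    """Total loss = loss on the data + penalty on large weights."""
    return data_loss + l2_penalty(weights, strength)

# Same data loss, but the second model relies on much larger weights
small = regularized_loss(0.30, [0.1, -0.2, 0.05])
large = regularized_loss(0.30, [3.0, -4.0, 2.5])
print(small, large)
```

Because large weights now cost something, the optimizer is nudged toward simpler solutions that are less likely to fit noise.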
Underfitting
- Too simple to capture the underlying pattern, so it fits the training data poorly and cannot generalize to new data
Activation Functions
| Neural Network | Commonly Used Activation Function |
|---|---|
| MLP | ReLU |
| CNN | ReLU |
| RNN | Tanh/Sigmoid |
| Scenario | Activation Function for Output Layer |
|---|---|
| Regression | Linear |
| Binary Classification | One node, sigmoid |
| Multiclass Classification | One node per class, softmax |
| Multilabel Classification | One node per class, sigmoid |
Linear
Softmax
- Outputs probability values that sum to 1
- For multi-class classification problems
- Negative inputs still give positive outputs
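These properties can be verified with a short plain-Python sketch. The function name and the sample scores are illustrative; subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(scores):
    """Convert raw scores into positive values that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-2.0, 0.0, 1.5])
print(probs)
print(sum(probs))
```

Even the negative score receives a small positive probability, and the largest score receives the largest share.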
ReLU
- Most common and simple
- Less susceptible to vanishing gradient
- Use “He Normal” or “He Uniform” weight initialization; scale inputs to the range 0 to 1
Sigmoid
- Hidden layer: use “Glorot Normal” or “Glorot Uniform” (also called Xavier) weight initialization; scale inputs to the range -1 to 1
- Output layer: produces values in the range 0 to 1
Tanh
- Use “Glorot Normal” or “Glorot Uniform” (also called Xavier) weight initialization; scale inputs to the range -1 to 1
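The initializers and input ranges mentioned for ReLU, sigmoid, and tanh can be made concrete with a small sketch. The formulas follow the usual definitions of He and Glorot initialization in terms of a layer's fan-in and fan-out; the function names and the min-max scaling helper are hypothetical and are not taken from any library.

```python
import math

def he_uniform_limit(fan_in):
    """He Uniform draws weights from [-limit, limit]."""
    return math.sqrt(6.0 / fan_in)

def he_normal_stddev(fan_in):
    """He Normal draws weights with this standard deviation."""
    return math.sqrt(2.0 / fan_in)

def glorot_uniform_limit(fan_in, fan_out):
    """Glorot (Xavier) Uniform draws weights from [-limit, limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def glorot_normal_stddev(fan_in, fan_out):
    """Glorot (Xavier) Normal draws weights with this standard deviation."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale inputs linearly so the smallest maps to lo and the largest to hi."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

# A layer with 784 inputs and 128 units
print(he_uniform_limit(784), glorot_uniform_limit(784, 128))

pixels = [0, 128, 255]
print(min_max_scale(pixels))             # range 0 to 1, e.g. ahead of ReLU
print(min_max_scale(pixels, -1.0, 1.0))  # range -1 to 1, e.g. ahead of tanh
```

The initializer sets the spread of the starting weights, while the scaling step is applied to the data itself, so the two are independent choices.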