Neural Network

Artificial Neural Network

Universal function approximator
Inspired by the neurons in the brain
Among the most powerful artificial intelligence and machine learning algorithms

Biological Neuron	Artificial Neuron
Dendrites	Inputs
Cell Nucleus (computation unit)	Node (linear function and activation function)
Axon	Output
Synapse	Weight

Perceptron

If the output differs from the target, the weights are updated so that the output moves closer to the target

$$O = f (w_{1} \times x_{1} + w_{2} \times x_{2} + θ)$$

$$w_{i} \leftarrow w_{i} + η (T - O) x_{i}$$

$$θ \leftarrow θ + η (T - O)$$

Variable	Meaning
$x_{1}, x_{2}$	Inputs
$w_{1}, w_{2}$	Weights
$θ$	Bias
$f (x)$	Activation function
$η$	Learning rate
$O$	Output
$T$	Target
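
The update rule above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: it assumes a step function for $f$, zero initial weights, and logical AND as a made-up linearly separable dataset. The names `step`, `train_perceptron`, and `AND_DATA` are invented here.

```python
def step(z):
    """Threshold activation: 1 if z >= 0, else 0 (an assumed choice of f)."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, eta=0.1, max_epochs=100):
    """Train a two-input perceptron with the update rule from the notes.

    samples: list of ((x1, x2), target) pairs.
    Returns the learned weights [w1, w2] and the bias theta.
    """
    w = [0.0, 0.0]
    theta = 0.0
    for _ in range(max_epochs):  # stopping rule: maximum number of epochs
        errors = 0
        for (x1, x2), t in samples:
            o = step(w[0] * x1 + w[1] * x2 + theta)
            if o != t:
                errors += 1
                # w_i <- w_i + eta (T - O) x_i ; theta <- theta + eta (T - O)
                w[0] += eta * (t - o) * x1
                w[1] += eta * (t - o) * x2
                theta += eta * (t - o)
        if errors == 0:  # stopping rule: every sample classified correctly
            break
    return w, theta


# Logical AND is linearly separable, so a perceptron can learn it.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(AND_DATA)
and_outputs = [step(w[0] * x1 + w[1] * x2 + theta) for (x1, x2), _ in AND_DATA]
```

After training, `and_outputs` reproduces the AND targets. Swapping in XOR targets would never reach zero errors, which is the linear-separability limitation.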

Stopping Rules

Maximum training time
Maximum number of training cycles (epoch)
High enough accuracy
Low enough error
Weight change threshold

Cons

Can only represent a limited set of functions
Can only separate classes that are linearly separable (for example, it cannot learn XOR)

Multilayer Perceptron

Feed-forward neural network (no cycle)
Input layer, one or more hidden layers, and output layer (indexed by i, j, and k respectively)
Minimizes the error/loss function $E = \frac{1}{2} \sum_{k} (O_{k} - T_{k})^{2}$, summed over all output nodes $k$

Steps

1. Initialize the weights and biases to small random values
2. Forward propagation (from input to output)
   1. For the n-th node in the m-th layer, compute the weighted sum of its $p$ inputs, $S_{m, n} = w_{1} x_{1} + \dots + w_{p} x_{p}$
   2. Compute the output $O_{m, n} = f (S_{m, n} + θ_{m, n})$
3. Backward propagation (from output to hidden)
   1. Compute $δ_{k} = (O_{k} - T_{k}) O_{k} (1 - O_{k})$ for each output node $k$
   2. Update $w_{jk} \leftarrow w_{jk} - η δ_{k} O_{j}$
   3. Update $θ_{k} \leftarrow θ_{k} - η δ_{k}$
4. Backward propagation (from hidden to input)
   1. Compute $δ_{j} = O_{j} (1 - O_{j}) \sum_{k \in K} δ_{k} w_{jk}$ for each hidden node $j$
   2. Update $w_{ij} \leftarrow w_{ij} - η δ_{j} O_{i}$ and $θ_{j} \leftarrow θ_{j} - η δ_{j}$

Output layer ($k$):

$$δ_{k} = (O_{k} - T_{k}) O_{k} (1 - O_{k})$$

$$w_{jk} \leftarrow w_{jk} - η δ_{k} O_{j}$$

$$θ_{k} \leftarrow θ_{k} - η δ_{k}$$

Hidden layer ($j$):

$$δ_{j} = O_{j} (1 - O_{j}) \sum_{k \in K} δ_{k} w_{jk}$$

$$w_{ij} \leftarrow w_{ij} - η δ_{j} O_{i}$$

$$θ_{j} \leftarrow θ_{j} - η δ_{j}$$

Repeat from step 2 (forward propagation) until a stopping criterion is met

These formulas assume the sigmoid activation function $σ (x) = \frac{1}{1 + e ^{- x}}$, whose derivative $σ (x) (1 - σ (x))$ is the source of the $O (1 - O)$ factors
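
The forward and backward passes can be sketched in plain Python for one hidden layer. This is only an illustration under stated assumptions: sigmoid activations everywhere, a made-up 2-3-1 network, a single made-up training pair, and η = 0.5. The function names `forward` and `train_step` are invented here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_ij, theta_j, w_jk, theta_k):
    """Return the hidden outputs O_j and the output-layer outputs O_k."""
    o_j = [sigmoid(sum(w_ij[i][j] * x[i] for i in range(len(x))) + theta_j[j])
           for j in range(len(theta_j))]
    o_k = [sigmoid(sum(w_jk[j][k] * o_j[j] for j in range(len(o_j))) + theta_k[k])
           for k in range(len(theta_k))]
    return o_j, o_k


def train_step(x, t, w_ij, theta_j, w_jk, theta_k, eta=0.5):
    """One forward and backward pass. Updates in place, returns E before the update."""
    o_j, o_k = forward(x, w_ij, theta_j, w_jk, theta_k)
    # Output layer: delta_k = (O_k - T_k) O_k (1 - O_k)
    delta_k = [(o_k[k] - t[k]) * o_k[k] * (1 - o_k[k]) for k in range(len(o_k))]
    # Hidden layer: delta_j = O_j (1 - O_j) sum_k delta_k w_jk (computed with the old w_jk)
    delta_j = [
        o_j[j] * (1 - o_j[j]) * sum(delta_k[k] * w_jk[j][k] for k in range(len(o_k)))
        for j in range(len(o_j))
    ]
    for j in range(len(o_j)):
        for k in range(len(o_k)):
            w_jk[j][k] -= eta * delta_k[k] * o_j[j]
    for k in range(len(o_k)):
        theta_k[k] -= eta * delta_k[k]
    for i in range(len(x)):
        for j in range(len(o_j)):
            w_ij[i][j] -= eta * delta_j[j] * x[i]
    for j in range(len(o_j)):
        theta_j[j] -= eta * delta_j[j]
    return 0.5 * sum((o_k[k] - t[k]) ** 2 for k in range(len(o_k)))


random.seed(0)
n_in, n_hidden, n_out = 2, 3, 1
w_ij = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
w_jk = [[random.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]
theta_j = [0.0] * n_hidden
theta_k = [0.0] * n_out

# Repeat the step on one (input, target) pair and record the error E each time.
errors = [train_step([1.0, 0.0], [1.0], w_ij, theta_j, w_jk, theta_k) for _ in range(200)]
```

The recorded `errors` shrink as the step is repeated. A real training loop would iterate over a whole dataset and apply one of the stopping criteria instead of a fixed 200 steps.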

Stopping Criteria

Fixed number of iterations
Error falls below threshold
Error on the validation set reaches its minimum (early stopping)

Gradient Descent

To reach a local minimum of $E = \frac{1}{2} \sum_{k} (O_{k} - T_{k})^{2}$, we follow the negative of the gradient: starting from an initial guess $α_{0}$, the parameters are repeatedly updated by $α_{t + 1} = α_{t} - η \nabla E (α_{t})$

Positive slope → decrease the weights and biases
Negative slope → increase the weights and biases
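
A one-dimensional sketch of the update rule may help. The error surface $E (α) = (α - 3)^{2}$, the starting points, and the name `gradient_descent` are made up for illustration. Only the update $α_{t + 1} = α_{t} - η \nabla E (α_{t})$ comes from the notes.

```python
def gradient_descent(grad, alpha0, eta=0.1, steps=100):
    """Iterate alpha <- alpha - eta * grad(alpha) and return the final alpha."""
    alpha = alpha0
    for _ in range(steps):
        alpha = alpha - eta * grad(alpha)
    return alpha


def grad_e(alpha):
    """Gradient of the hypothetical error E(alpha) = (alpha - 3)^2."""
    return 2.0 * (alpha - 3.0)


from_right = gradient_descent(grad_e, alpha0=10.0)  # positive slope, so alpha decreases
from_left = gradient_descent(grad_e, alpha0=-4.0)   # negative slope, so alpha increases
```

Both runs end close to the minimum at α = 3, approached from opposite sides.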

Implementation

Steps

Import the required libraries and define a global variable
Load the data
Explore the data
Build the model
Compile the model
Train the model
Evaluate the model accuracy
Save the model
Use the model
Plot the confusion matrix

Code

kernel_regularizer=regularizers.l2(0.002) adds L2 regularization to a layer to reduce overfitting
activation=activations.relu or activation='relu' selects the ReLU activation

How to Calculate the Parameters of a Dense Layer

Flatten: (None, 784), param: 0
Dense: (None, 128), param: 784 × 128 weights + 128 biases = 100,480
Dense: (None, 128), param: 128 × 128 weights + 128 biases = 16,512
Dense: (None, 10), param: 128 × 10 weights + 10 biases = 1,290
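
The same arithmetic can be checked with a few lines of Python. The helper `dense_params` is a name invented here. It assumes a fully connected layer with one bias per unit, as in the listing above.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [784, 128, 128, 10]  # Flatten output followed by three Dense layers
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
```

Here `per_layer` is `[100480, 16512, 1290]` and `total` is `118282`.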

Validation

Validation loss
- Average of the per-sample errors (error = 1 - x/n); a perfectly predicted label has a probability of 1
Validation accuracy

Model Saving

HDF5 (Hierarchical Data Format version 5)

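from tensorflow.keras.models import load_model

model_name = ""  # path of the HDF5 file to write
model.save(model_name, save_format="h5")  # model is the trained Keras model

loaded_model = load_model(model_name)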

Prediction

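import numpy as np

predictions = loaded_model.predict(x_test)  # one row of class probabilities per sample
print('predictions:', predictions.shape)
# The index of the highest probability in each row is the predicted class
prediction_results = np.argmax(predictions, axis=1)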
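
To make the `argmax` step concrete, here is a small pure-Python equivalent. The probabilities are made up, and `argmax_rows` is a name invented here to mimic `np.argmax(..., axis=1)`.

```python
def argmax_rows(probabilities):
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Made-up softmax outputs for three samples and three classes.
example_predictions = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
]
example_labels = argmax_rows(example_predictions)
```

Each row collapses to the class with the highest probability, giving `[1, 0, 2]` here.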

Confusion Matrix

from sklearn.metrics import confusion_matrix

# The first parameter is the actual label, the second is the prediction
cm = confusion_matrix(y_test, prediction_results)
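
What the matrix contains can be illustrated without scikit-learn. This sketch uses made-up labels and an invented helper, `confusion_matrix_counts`. It follows the same convention as above: rows are actual labels, columns are predictions.

```python
def confusion_matrix_counts(actual, predicted, n_classes):
    """cm[a][p] counts the samples whose actual label is a and predicted label is p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm


actual = [0, 0, 1, 1, 2, 2]      # made-up true labels
predicted = [0, 1, 1, 1, 2, 0]   # made-up predictions
cm_example = confusion_matrix_counts(actual, predicted, n_classes=3)
accuracy = sum(cm_example[c][c] for c in range(3)) / len(actual)
```

Correct predictions sit on the diagonal, so the accuracy is the diagonal sum divided by the number of samples.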

Layers and Neurons

1 input layer
- Number of neurons = number of features
1 output layer
- Number of neurons = usually 1 (one per class when softmax is used)
Hidden layers
- Number of layers
  - Linearly separable → 0
  - Less complex → 1 to 2
  - More complex → 3 to 5
- Number of neurons
  - Rule of thumb: $\sqrt{\text{input layer nodes} \times \text{output layer nodes}}$
  - Between the size of the input and output layer
  - Decreasing in subsequent layers, so that each layer moves closer to pattern and feature extraction

Weights and Biases

Weights control the steepness of the activation function
- Higher weight → steeper slope
- Lower weight → gentler slope
Biases shift the activation function left or right
- Smaller bias → shifts right
- Larger bias → shifts left
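
Both effects can be checked numerically for a single sigmoid neuron with one input, $f (w x + b)$. The sketch assumes a positive weight. The helper names `neuron`, `slope_at_midpoint`, and `midpoint` are invented here.

```python
import math


def neuron(x, w, b):
    """Sigmoid neuron output for a single input x."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def slope_at_midpoint(w):
    """Derivative of sigmoid(w * x + b) with respect to x where the output is 0.5."""
    return w * 0.25


def midpoint(w, b):
    """Input x at which the output crosses 0.5, i.e. where w * x + b = 0."""
    return -b / w


steeper = slope_at_midpoint(4.0) > slope_at_midpoint(1.0)    # higher weight, steeper slope
shifted_left = midpoint(1.0, 2.0) < midpoint(1.0, 0.0)       # larger bias moves the curve left
shifted_right = midpoint(1.0, -2.0) > midpoint(1.0, 0.0)     # smaller bias moves the curve right
```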

Problems

Vanishing Gradient

Parameters of the higher layers (near the output) change drastically
Parameters of the lower layers (near the input) barely change
Weight updates may shrink to nearly zero
Learning is slow and may stagnate
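
One way to see why the lower layers barely move: with sigmoid activations, every layer the error passes through multiplies the gradient by a factor $O (1 - O)$, which is at most 0.25. The sketch below looks only at that factor and ignores the weights, which is a simplifying assumption.

```python
import math


def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


# Upper bound on the factor contributed by the activations alone when the
# gradient is propagated back through n sigmoid layers (weights ignored).
max_factor = sigmoid_derivative(0.0)  # the largest possible value, reached at x = 0
bounds = {n: max_factor ** n for n in (1, 5, 10)}
```

After ten layers the bound is already below one in a million, so unless the weights are large the lower layers receive almost no gradient.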

Exploding Gradient

Gradients, and with them all parameters, grow exponentially
Weights may become NaN
Learning becomes unstable (avalanche effect)

Overfitting

Learns the details and noise of the training data, so it performs poorly on new data
Use a regularizer, which adds a penalty term to the loss, to avoid overfitting
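
As a rough illustration of the added penalty, an L2 regularizer contributes a factor times the sum of squared weights to the loss. The factor 0.002 matches the `regularizers.l2(0.002)` setting mentioned earlier. The data loss of 0.30 and the weight values are made up, and the function names are invented here.

```python
def l2_penalty(weights, factor=0.002):
    """Penalty added to the loss: factor * sum of squared weights."""
    return factor * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, factor=0.002):
    """Loss that the optimizer minimizes once the penalty is included."""
    return data_loss + l2_penalty(weights, factor)


small_weights_loss = regularized_loss(0.30, [0.1, -0.2, 0.1])
large_weights_loss = regularized_loss(0.30, [3.0, -4.0, 2.0])
```

With the same data loss, the model with larger weights is penalized more, which pushes training toward smaller weights and smoother fits.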

Underfitting

Fails to capture the pattern even in the training data, so it cannot generalize to new data either

Activation Functions

Neural Network	Commonly Used Activation Function
MLP	ReLU
CNN	ReLU
RNN	Tanh/Sigmoid

Scenario	Activation Function for Output Layer
Regression	Linear
Binary Classification	One node, sigmoid
Multiclass Classification	One node per class, softmax
Multilabel Classification	One node per class, sigmoid

Linear

$$f (x) = x$$

Softmax

$$x = [x_{0}, x_{1}, \dots, x_{n - 1}]$$

$$f (x_{i}) = \frac{e ^{x_{i}}}{\sum _{j = 0}^{n - 1} e ^{x_{j}}}$$

Outputs are probability values that sum to 1
For multi-class classification problems
Because $e^{x}$ is positive even for negative $x$, every output is positive
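
A minimal softmax in plain Python shows these properties. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above. It leaves the result unchanged because the common factor cancels. The input values are made up.

```python
import math


def softmax(xs):
    """Softmax of a list of scores; shifting by max(xs) avoids overflow."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([-1.0, 0.0, 2.0])
```

Even the negative input receives a positive probability, and the largest input receives the largest share.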

ReLU

$$f (x) = \max (0, x)$$

Most common and simple
Less susceptible to vanishing gradient
Use “He Normal” or “He Uniform” weight initialization and scale the input data to the range 0 to 1
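
A small sketch of ReLU together with He uniform initialization, which draws each weight from $U (- l, l)$ with $l = \sqrt{6 / \text{fan\_in}}$. The layer sizes and the helper name `he_uniform` are made up for illustration. In Keras the initializer is selected by name rather than written by hand.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def he_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)
weights = he_uniform(fan_in=6, fan_out=4, rng=rng)  # here the limit is sqrt(6 / 6) = 1
relu_outputs = [relu(v) for v in (-2.0, 0.0, 3.5)]
```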

Sigmoid

$$f (x) = \frac{1}{1 + e ^{- x}}$$

$(- \infty, + \infty) \to (0, 1)$

Hidden layer: use “Glorot Normal” or “Glorot Uniform” (also called Xavier) weight initialization and scale the input data to the range -1 to 1
Output layer: outputs lie in the range 0 to 1

Tanh

$$f (x) = \frac{e ^{x} - e ^{- x}}{e ^{x} + e ^{- x}}$$

$(- \infty, + \infty) \to (- 1, 1)$

Use “Glorot Normal” or “Glorot Uniform” (also called Xavier) weight initialization and scale the input data to the range -1 to 1
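
The output ranges of sigmoid and tanh, and the Glorot uniform limit $\sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$, can be checked with a short sketch. The sample inputs and the 784-to-128 layer size are chosen only for illustration, and `glorot_uniform_limit` is a name invented here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn from U(-limit, limit) with this limit."""
    return math.sqrt(6.0 / (fan_in + fan_out))


inputs = [-5.0, 0.0, 5.0]
sigmoid_values = [sigmoid(x) for x in inputs]  # all strictly between 0 and 1
tanh_values = [tanh(x) for x in inputs]        # all strictly between -1 and 1
limit = glorot_uniform_limit(fan_in=784, fan_out=128)
```

Sigmoid is centered at 0.5 and tanh at 0, which is one reason the two are paired with differently scaled inputs.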
