[cs231n/Spring 2023] Lecture 4: Neural Networks and Backpropagation

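Activation functions

	Sigmoid: no longer used in practice. Range (0, 1). Drawback 1: saturated neurons kill the gradients (vanishing gradient). Drawback 2: outputs are not zero-centered, which slows convergence. Drawback 3: exp() is computationally expensive.
	tanh: avoid where possible. Range (-1, 1). Zero-centered (good), but saturated neurons still kill the gradients.
	ReLU (default): does not saturate for x > 0, computationally efficient, and converges faster than sigmoid/tanh. Drawback 1: output is not zero-centered. Drawback 2: the gradient is 0 for x < 0 (it vanishes), 1 for x > 0, and undefined at x = 0, so a unit can die (dead ReLU). One mitigation is to initialize the bias to a small positive value such as 0.01.
	Leaky ReLU: an improvement on ReLU.
	Maxout: generalizes ReLU and Leaky ReLU. Does not saturate and does not die. Drawback 1: each neuron has two sets of parameters, which doubles the computation.
	ELU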
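
The activation functions listed above can be sketched in NumPy as follows. This is a minimal sketch, not the lecture's code: the slope `alpha=0.01` for Leaky ReLU, `alpha=1.0` for ELU, and the two-piece form of `maxout` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is an assumed small slope for x <= 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # clamp the exponent so large positive x cannot overflow
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # element-wise max of two linear functions: two parameter sets per neuron
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # the gradient is undefined at x = 0; 0 is used there by convention
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print("sigmoid grad:", sigmoid_grad(x))  # close to 0 at both ends (saturation)
print("relu grad:   ", relu_grad(x))     # exactly 0 for x <= 0, 1 for x > 0
```

Printing the gradients at large |x| shows the saturation problem directly: the sigmoid gradient shrinks toward zero on both sides, while the ReLU gradient stays at 1 for positive inputs.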

Neural networks

All layers are fully connected, and each layer is evaluated as a single operation.
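
As a rough illustration of "one layer, one operation", here is a minimal sketch of the forward pass of a two-layer fully-connected network. The sizes `N, D, H, C`, the ReLU nonlinearity, and the random initialization are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 4, 6, 5, 3  # batch size, input dim, hidden dim, classes (assumed toy sizes)

x = rng.standard_normal((N, D))
W1 = 0.01 * rng.standard_normal((D, H))
b1 = np.zeros(H)
W2 = 0.01 * rng.standard_normal((H, C))
b2 = np.zeros(C)

# Each layer is one matrix multiply plus bias, followed by a nonlinearity
h = np.maximum(0.0, x @ W1 + b1)  # hidden layer with ReLU
scores = h @ W2 + b2              # output layer (class scores)

print(scores.shape)
```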

Do not use size of neural network as a regularizer.

→ Network size is not a substitute for regularization. To prevent overfitting, do not shrink the network; increase the regularization strength instead. In other words, provided it is well regularized, a larger neural network is better.
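
A minimal sketch of what "increasing the regularization strength" means in code, assuming an L2 penalty (the notes do not say which regularizer is meant). The helper `regularized_loss` and the toy weights are hypothetical.

```python
import numpy as np

def regularized_loss(data_loss, weights, reg):
    """Total loss = data loss + reg * sum of squared weights (assumed L2 penalty)."""
    penalty = sum(np.sum(W * W) for W in weights)
    return data_loss + reg * penalty

# Toy weight matrices standing in for the layers of a network (assumed values)
weights = [np.ones((2, 2)), np.ones((2, 3))]

# The network size stays fixed; only the strength `reg` is tuned
for reg in (0.0, 0.01, 0.1):
    print(reg, regularized_loss(1.0, weights, reg))
```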

dendrite: input, axon: output, cell body (soma): activation function

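import math
import numpy as np

class Neuron:  # self.weights and self.bias are assumed to be set elsewhere
    def neuron_tick(self, inputs):
        cell_body_sum = np.sum(inputs * self.weights) + self.bias  # weighted sum of the inputs
        firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum))  # sigmoid activation function
        return firing_rate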

Be very careful with your brain analogies!

We should be cautious about claiming that artificial neural networks resemble the actual human brain.

Biological Neurons:

Many different types
Dendrites can perform complex non-linear computations
Synapses are not a single weight but a complex non-linear dynamical system

Backpropagation

How to compute gradients?

Idea #1: Derive $\nabla_W L$ on paper

Problem 1: Very tedious: lots of matrix calculus, need lots of paper
Problem 2: What if we want to change the loss? E.g. use softmax instead of SVM? → Need to re-derive from scratch =(
Problem 3: Not feasible for very complex models!

Idea #2 (Better Idea): Computational graphs + Backpropagation

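Local gradient: computed during the forward pass and cached in memory (a value we can already compute at that point). Upstream gradient (global): only available during the backward pass.

Upstream gradient: the gradient of the loss with respect to the node's output.
Local gradient: the gradient of the node's output with respect to its inputs, computed entirely within the node.
Downstream gradient: the gradient of the loss with respect to the node's inputs, obtained as local gradient × upstream gradient.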
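
Written out for a single node, the relationship between the three gradients is the chain rule. The node $z = f(x, y)$ and the loss symbol $L$ are notation assumed here for illustration:

$$
\underbrace{\frac{\partial L}{\partial x}}_{\text{downstream}}
= \underbrace{\frac{\partial z}{\partial x}}_{\text{local}}
\cdot \underbrace{\frac{\partial L}{\partial z}}_{\text{upstream}},
\qquad
\frac{\partial L}{\partial y}
= \frac{\partial z}{\partial y} \cdot \frac{\partial L}{\partial z}
$$

For example, if $z = x \cdot y$ then $\partial z / \partial x = y$, so the downstream gradient for $x$ is $y$ times the upstream gradient.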

Backpropagation: worked example

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f(w0, x0, w1, x1, w2):
    # forward pass: compute output
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2 + w2
    L = sigmoid(s3)
    # backward pass: compute grads
    grad_L = 1.0
    grad_s3 = grad_L * (1 - L) * L  # sigmoid local gradient (dsigmoid/dx): (1 - sigmoid) * sigmoid
    grad_w2 = grad_s3  # add gate
    grad_s2 = grad_s3  # add gate
    grad_s0 = grad_s2  # add gate
    grad_s1 = grad_s2  # add gate
    grad_w1 = grad_s1 * x1  # mul gate
    grad_x1 = grad_s1 * w1  # mul gate
    grad_w0 = grad_s0 * x0  # mul gate
    grad_x0 = grad_s0 * w0  # mul gate
    return L, (grad_w0, grad_x0, grad_w1, grad_x1, grad_w2)
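
One way to sanity-check a hand-written backward pass like the one above is to compare it against centered finite differences. The sketch below does that for the same function; the helper names `analytic_grads` and `numeric_grads`, the step size `h`, and the input values are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w0, x0, w1, x1, w2):
    return sigmoid(w0 * x0 + w1 * x1 + w2)

def analytic_grads(w0, x0, w1, x1, w2):
    # Same chain of add/mul gates as the worked example, collapsed:
    # add gates pass grad_s3 through, mul gates multiply by the other input.
    L = forward(w0, x0, w1, x1, w2)
    grad_s3 = (1.0 - L) * L
    # order: dL/dw0, dL/dx0, dL/dw1, dL/dx1, dL/dw2
    return [grad_s3 * x0, grad_s3 * w0, grad_s3 * x1, grad_s3 * w1, grad_s3]

def numeric_grads(args, h=1e-5):
    # centered difference: (f(a + h) - f(a - h)) / (2h), one argument at a time
    grads = []
    for i in range(len(args)):
        plus = list(args)
        minus = list(args)
        plus[i] += h
        minus[i] -= h
        grads.append((forward(*plus) - forward(*minus)) / (2.0 * h))
    return grads

args = [2.0, -1.0, -3.0, -2.0, -3.0]  # w0, x0, w1, x1, w2 (assumed example inputs)
analytic = analytic_grads(*args)
numeric = numeric_grads(args)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print("max abs difference:", max_error)
```

If the backward pass is correct, the difference should be tiny compared with the gradients themselves.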

Patterns in gradient flow

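import torch

# Wrapper class assumed here; the ctx calls follow the torch.autograd.Function API
class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)  # need to cache some values for use in backward
        z = x * y
        return z

    @staticmethod
    def backward(ctx, grad_z):  # grad_z is the upstream gradient dL/dz
        x, y = ctx.saved_tensors
        grad_x = y * grad_z  # dz/dx * dL/dz = dL/dx
        grad_y = x * grad_z  # dz/dy * dL/dz = dL/dy
        return grad_x, grad_y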
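
The forward/backward pair above is a multiply gate. The same idea can be sketched without any framework for the other common gates, showing how each one routes the upstream gradient. The function names and the tie-breaking rule in `max_backward` are assumptions for illustration.

```python
def add_backward(upstream):
    # add gate: gradient distributor, both inputs receive the upstream gradient
    return upstream, upstream

def mul_backward(x, y, upstream):
    # mul gate: "swap multiplier", each input gets the other input times upstream
    return y * upstream, x * upstream

def copy_backward(upstream_a, upstream_b):
    # copy gate: gradient adder, gradients from both branches are summed
    return upstream_a + upstream_b

def max_backward(x, y, upstream):
    # max gate: gradient router, only the larger input receives the gradient
    # (ties go to x here, an assumed convention)
    return (upstream, 0.0) if x >= y else (0.0, upstream)

print(add_backward(2.0))
print(mul_backward(3.0, -4.0, 2.0))
print(copy_backward(1.5, 2.5))
print(max_backward(1.0, 5.0, 2.0))
```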

What about vector-valued functions?

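1. When both the input and the output are scalars, the derivative is also a scalar.
2. When the output is a scalar and the input is a vector, the derivative is the gradient, which has the same dimensionality as the input. Its n-th component tells how much the output changes when the n-th input changes by a small amount.
3. When both the input and the output are vectors, the derivative is the Jacobian. Its (n, m) component tells how much the m-th output changes when the n-th input changes.

In practice the Jacobian is rarely formed explicitly; the product with it can be expressed and computed much more simply and efficiently.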

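For an elementwise function such as ReLU, the Jacobian is diagonal: the diagonal entry is 1 where x > 0 and 0 where x ≤ 0. Because the Jacobian is sparse (all off-diagonal entries are 0), it is never formed explicitly; the multiplication is performed implicitly instead.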
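
A minimal sketch of the difference between explicit and implicit Jacobian multiplication, assuming the elementwise function is ReLU. The input `x` and the upstream gradient values are made up for illustration.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
upstream = np.array([4.0, -1.0, 5.0, 9.0])  # dL/dy (assumed values)

# Explicit: build the full N x N Jacobian, which is diagonal for an elementwise op
jacobian = np.diag((x > 0).astype(float))
explicit = jacobian @ upstream

# Implicit: pass the upstream gradient where x > 0 and zero it elsewhere.
# Same result, but no N x N matrix is ever stored.
implicit = np.where(x > 0, upstream, 0.0)

print(np.allclose(explicit, implicit))
```

For a 4096-dimensional input the explicit Jacobian would hold 4096 × 4096 entries, almost all of them zero, while the implicit version touches only 4096 values.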

Q. If the input is a 4096-dimensional vector and the output is a 4096-dimensional vector, what is the size of the Jacobian matrix?

A. 4096 x 4096

Q. For the matrix multiply y = xW, what parts of y are affected by one element of x?

A. $x_{n,d}$ affects the whole row $y_n$.

Q. How much does $x_{n,d}$ affect $y_{n,m}$?

A. $w_{d,m}$
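
Summing these per-element effects over m gives the standard backward pass for a matrix multiply. The sketch below assumes y = xW with x of shape (N, D) and W of shape (D, M), and checks one entry against a finite difference; the sizes, the random values, and the surrogate `loss` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 2, 3, 4  # assumed toy sizes
x = rng.standard_normal((N, D))
w = rng.standard_normal((D, M))
grad_y = rng.standard_normal((N, M))  # upstream gradient dL/dy

# dL/dx[n, d] = sum_m dL/dy[n, m] * w[d, m]
grad_x = grad_y @ w.T  # shape (N, D), same as x
# dL/dw[d, m] = sum_n x[n, d] * dL/dy[n, m]
grad_w = x.T @ grad_y  # shape (D, M), same as w

def loss(x_):
    # surrogate scalar loss whose gradient with respect to y is exactly grad_y
    return np.sum((x_ @ w) * grad_y)

# Finite-difference check of a single entry, x[0, 1]
h = 1e-6
x_plus = x.copy()
x_minus = x.copy()
x_plus[0, 1] += h
x_minus[0, 1] -= h
numeric = (loss(x_plus) - loss(x_minus)) / (2.0 * h)

print(grad_x.shape, grad_w.shape)
print(abs(numeric - grad_x[0, 1]))
```

Neither gradient requires forming the Jacobian of y with respect to x; two matrix multiplies are enough.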

Summary

(Fully-connected) Neural Networks are stacks of linear functions and nonlinear activation functions;

they have much more representational power than linear classifiers

backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
implementations maintain a graph structure, where the nodes implement the forward() / backward() API
forward: compute result of an operation and save any intermediates needed for gradient computation in memory
backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs


Stanford University - CS231n (Convolutional Neural Networks for Visual Recognition)
