# ConvNets!

TODO: How to draw the figures?

## Mathematical Principles

Universal approximation: for any continuous function $f(x)$ and any $\epsilon > 0$, there is a network computing some $g(x)$ such that

$\forall x, \mid f(x) - g(x) \mid < \epsilon$

## Data Preprocessing

### Batch Normalization

$\hat{x}^{(k)} = \frac{x^{(k)} - \operatorname{E}[x^{(k)}]}{\sqrt{\operatorname{Var}[x^{(k)}]}}$
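A minimal NumPy sketch of this normalization as a training-time forward pass; the learnable scale `gamma` and shift `beta`, and the small epsilon inside the square root, are standard additions beyond the bare formula:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) mini-batch; normalize each feature over the batch dimension
    mean = x.mean(axis=0)                      # E[x^(k)]
    var = x.var(axis=0)                        # Var[x^(k)]
    x_hat = (x - mean) / np.sqrt(var + eps)    # the formula above
    return gamma * x_hat + beta                # learnable scale and shift
```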

### Regularization

#### L2 Regularization

That is, for every weight $w$ in the network, we add the term $\frac{1}{2} \lambda w^2$ to the objective, where $\lambda$ is the regularization strength.

#### L1 Regularization

$$\lambda |w|$$

#### Elastic Net Regularization

$$\lambda_1 |w| + \lambda_2 w^2$$
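The three penalties above, written as functions of a weight matrix (a sketch; the gradient contributions noted in the comments are what backprop adds to `dW`):

```python
import numpy as np

def l2_penalty(W, lam):
    # 0.5 * lambda * w^2 summed over all weights; gradient contribution: lam * W
    return 0.5 * lam * np.sum(W * W)

def l1_penalty(W, lam):
    # lambda * |w| summed over all weights; gradient contribution: lam * sign(W)
    return lam * np.sum(np.abs(W))

def elastic_net_penalty(W, lam1, lam2):
    # lambda_1 * |w| + lambda_2 * w^2, combining both penalties
    return lam1 * np.sum(np.abs(W)) + lam2 * np.sum(W * W)
```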

#### Max-Norm Constraints

Constrain the weight vector of every neuron to satisfy $\Vert \vec w \Vert_2 < c$; typical values of $c$ are on the order of 3 to 4.
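This constraint is typically enforced by projecting each weight vector back onto the ball of radius $c$ after the parameter update; a sketch, assuming the columns of `W` are the per-unit weight vectors:

```python
import numpy as np

def max_norm_project(W, c=3.0):
    # W: (D, H); constrain the incoming weight vector of each hidden unit
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / (norms + 1e-12))  # only shrink, never grow
    return W * scale
```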

## Loss Functions

### Classification

The multiclass SVM (hinge) loss:

$L_i = \sum_{j\neq y_i} \max(0, f_j - f_{y_i} + 1)$
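A vectorized sketch of this loss over a mini-batch, assuming `scores` holds the class scores $f_j$ and `y` the correct class indices:

```python
import numpy as np

def svm_loss(scores, y):
    # scores: (N, C) class scores f_j; y: (N,) correct class indices
    N = scores.shape[0]
    correct = scores[np.arange(N), y][:, None]       # f_{y_i} for each example
    margins = np.maximum(0, scores - correct + 1.0)  # max(0, f_j - f_{y_i} + 1)
    margins[np.arange(N), y] = 0                     # the sum skips j == y_i
    return margins.sum() / N
```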

The cross-entropy loss used by the Softmax classifier:

$L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right)$
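A numerically stable sketch: subtracting the per-row maximum leaves the softmax unchanged but prevents overflow in `exp`:

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    # scores: (N, C) class scores f_j; y: (N,) correct class indices
    shifted = scores - scores.max(axis=1, keepdims=True)  # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()        # mean of -log softmax at y_i
```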

For binary logistic regression, the positive-class probability is

$P(y = 1 \mid x; w, b) = \frac{1}{1 + e^{-(w^Tx +b)}} = \sigma (w^Tx + b)$

The per-example loss over independent binary attributes (each with its own sigmoid) is

$L_i = -\sum_j \left[ y_{ij} \log(\sigma(f_j)) + (1 - y_{ij}) \log(1 - \sigma(f_j)) \right]$
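A sketch of this multi-attribute loss, treating each of the $j$ outputs as an independent binary classifier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_loss(f, y):
    # f: (N, J) attribute scores; y: (N, J) binary labels in {0, 1}
    p = sigmoid(f)
    per_example = -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1)
    return per_example.mean()
```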

## Learning and Training

### Caveats When Computing Gradients

#### Use the Centered Difference Formula

$\frac{d f(x)}{d x} \approx \frac{f(x+h)-f(x-h)}{2 h}$
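A sketch of a numerical gradient routine built on this formula (the step size `h = 1e-5` is a common choice, not prescribed by the formula):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # f: scalar-valued function of the array x; perturb each entry in turn
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; fp = f(x)        # f(x + h)
        x[i] = old - h; fm = f(x)        # f(x - h)
        x[i] = old                       # restore the original value
        grad[i] = (fp - fm) / (2 * h)    # centered difference
        it.iternext()
    return grad
```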

#### Compare Analytic and Numerical Gradients with Relative Error

$\frac{\left|f_{a}^{\prime}-f_{n}^{\prime}\right|}{\max \left(\left|f_{a}^{\prime}\right|,\left|f_{n}^{\prime}\right|\right)}$
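A sketch of the comparison (the small `eps` guarding the denominator against division by zero is an addition beyond the formula):

```python
import numpy as np

def rel_error(fa, fn, eps=1e-12):
    # fa: analytic gradient f'_a, fn: numerical gradient f'_n
    return np.abs(fa - fn) / np.maximum(np.maximum(np.abs(fa), np.abs(fn)), eps)
```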

#### Kinks¶

Kinks are the non-differentiable parts of the objective function (introduced by operations such as ReLU or max); numerical differentiation is inaccurate near them.
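A concrete illustration with $f(x) = \max(0, x)$: just below the kink the true derivative is 0, but a centered-difference probe whose step straddles $x = 0$ reports a spurious slope:

```python
f = lambda x: max(0.0, x)
x, h = -1e-6, 1e-5                        # the probe straddles the kink at 0
numeric = (f(x + h) - f(x - h)) / (2 * h)
analytic = 0.0                            # f is identically 0 for x < 0
# numeric comes out around 0.45 here rather than 0
```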

### During Training

#### Vanilla Update

$x_{t+1} = x_t - \mathrm{lr} \times dx_t$

#### Momentum Update

$v_{t+1} = \mu \times v_t - \mathrm{lr} \times dx_t \\ x_{t+1} = x_t + v_{t+1}$
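A runnable sketch of this update minimizing a toy quadratic (the objective, `mu`, `lr`, and step count are assumptions for illustration):

```python
import numpy as np

# gradient descent with momentum on f(x) = sum(x^2)
x = np.array([3.0, -2.0])
v = np.zeros_like(x)
mu, lr = 0.9, 0.05

for _ in range(200):
    dx = 2 * x              # gradient of sum(x^2)
    v = mu * v - lr * dx    # v_{t+1} = mu * v_t - lr * dx_t
    x = x + v               # x_{t+1} = x_t + v_{t+1}
```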

#### Nesterov Momentum

In the lookahead form, the gradient is evaluated at the predicted position `x_ahead` rather than at `x`:

```python
x_ahead = x + mu * v
# evaluate dx_ahead, the gradient at x_ahead instead of at x
v = mu * v - learning_rate * dx_ahead
x += v
```

With a variable transform that stores the lookahead position in `x`, the update can be written in terms of the ordinary gradient `dx`:

```python
v_prev = v                       # back this up
v = mu * v - learning_rate * dx  # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
```

#### Adagrad

```python
# assume the gradient dx and parameter vector x; cache starts at zero
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
```

#### RMSProp

```python
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
```

#### Adam

```python
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
```

The settings recommended in the Adam paper are `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`. Learning rates typically range from 5e-3 (training from scratch) down to 1e-5 (transfer learning). Note that the snippet above omits the bias-correction terms from the full Adam update.
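A runnable sketch of the simplified (no bias-correction) Adam update on a toy quadratic (the objective, learning rate, and step count are assumptions for illustration):

```python
import numpy as np

# minimize f(x) = sum(x^2) with the simplified Adam update
x = np.array([3.0, -2.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
learning_rate, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8

for _ in range(2000):
    dx = 2 * x                                   # gradient of sum(x^2)
    m = beta1 * m + (1 - beta1) * dx             # first-moment estimate
    v = beta2 * v + (1 - beta2) * dx**2          # second-moment estimate
    x += -learning_rate * m / (np.sqrt(v) + eps)
```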