Loss function: mean squared error
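The snippets below assume the following setup; the concrete hyperparameter values are placeholders chosen for illustration, not values from the original note:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# placeholder hyperparameters (assumed values, adjust as needed)
data_size, feature_size = 1000, 4
batch_size, lr, train_epoch = 32, 0.03, 5
```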
```python
true_w = torch.arange(feature_size + 1, dtype=torch.float)
X = torch.randn(data_size, feature_size + 1)
X[:, -1] = 1.0  # constant column, so the last entry of w acts as the bias b
y = X @ true_w + torch.normal(0, 0.1, (X.shape[0],))

def loss_func(w, xb, yb):
    # mean squared error, halved so the gradient carries no factor of 2
    return ((xb @ w - yb) ** 2).mean() / 2
```
The bias b is absorbed into w by appending an extra dimension (a constant column of ones) to X.
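A quick sanity check of that trick, using the `X` and `true_w` defined above:

```python
# with the constant column in place, the last entry of w plays the role of the bias
print(torch.allclose(X @ true_w, X[:, :-1] @ true_w[:-1] + true_w[-1]))  # True
```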
$$ w^* = (X^TX)^{-1}X^Ty $$
```python
# closed-form (normal-equation) solution
wstar = (X.T @ X).inverse() @ X.T @ y
```
We take $X \in \mathbb{R}^{\text{batch size}\times \text{feature}}$ to have full column rank, so that $X^TX$ is full rank and therefore invertible.
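As a sanity check (a minimal sketch; `torch.linalg.solve` is used here instead of forming the inverse explicitly, which is the numerically preferable way to solve the normal equations), the recovered weights should sit close to `true_w`:

```python
# solve (X^T X) w = X^T y without forming the inverse explicitly
wstar_stable = torch.linalg.solve(X.T @ X, X.T @ y)
print((wstar_stable - true_w).abs().max())  # small, on the order of the noise level
```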
```python
w = torch.randn(feature_size + 1, requires_grad=True)
train_dl = DataLoader(TensorDataset(X, y), batch_size=batch_size)
for epo in range(train_epoch):
    for xb, yb in train_dl:
        if w.grad is not None:
            w.grad.zero_()       # clear the gradient accumulated in the previous step
        loss = loss_func(w, xb, yb)
        loss.backward()          # populate w.grad
        with torch.no_grad():    # update w without autograd tracking the update
            w -= lr * w.grad
```
A few notes on the manual loop:

- `requires_grad=True` tells autograd to track operations on `w`, so that `loss.backward()` can fill in `w.grad`.
- *if* `w.grad is not None`: at the very beginning, before anything has happened, `w.grad` is still `None`, so it can only be zeroed once a first backward pass has created it.
- `w.grad.zero_()` clears the gradient from the previous step; with an optimizer the equivalent call is `optim.zero_grad()`.
- *with* `torch.no_grad()`: only inside this context can we use the gradient to update `w` in place, without autograd recording the update.
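A throwaway example (not part of the original code) showing why both the `None` check and the zeroing are needed:

```python
t = torch.randn(3, requires_grad=True)
print(t.grad)        # None: no backward pass has run yet
t.sum().backward()
print(t.grad)        # tensor([1., 1., 1.])
t.sum().backward()
print(t.grad)        # tensor([2., 2., 2.]) -- gradients accumulate unless cleared
t.grad.zero_()       # manual equivalent of optim.zero_grad()
```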
The same model, written with the `nn` API:

```python
net = nn.Sequential(nn.Linear(feature_size, 1))
criterion = nn.MSELoss()
optim = torch.optim.SGD(net.parameters(), lr=lr)
# nn.Linear has its own bias, so drop the constant column from X
train_dl = DataLoader(TensorDataset(X[:, :-1], y), batch_size=batch_size)
for epo in range(train_epoch):
    for xb, yb in train_dl:
        optim.zero_grad()
        # net(xb) has shape (batch, 1); squeeze it to match yb of shape (batch,)
        loss = criterion(net(xb).squeeze(-1), yb)
        loss.backward()
        optim.step()
```
Using the squared loss amounts to assuming that the observations contain normally distributed noise:
$$ y = w^Tx + b + \epsilon,\quad\epsilon \sim \mathcal{N}(0, \sigma^2) $$
Writing $y_{\text{pred}} = w^Tx + b$, this means $y\sim \mathcal{N}(y_\text{pred}, \sigma^2)$.
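Under that assumption, maximizing the likelihood of the observations is the same as minimizing the squared error; a short (standard) derivation:

$$ -\log p(y \mid x) = \frac{1}{2}\log\left(2\pi\sigma^2\right) + \frac{\left(y - y_{\text{pred}}\right)^2}{2\sigma^2} $$

Since $\sigma$ does not depend on $w$ or $b$, minimizing the negative log-likelihood over the data is equivalent to minimizing $\tfrac{1}{2}\left(y - y_{\text{pred}}\right)^2$, which is exactly the halved squared error used in `loss_func` above.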