Loss function: mean squared error
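The snippets below assume the following setup; the concrete hyperparameter values are placeholders chosen for illustration, not values from the original note:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# placeholder hyperparameters (assumed values, adjust as needed)
data_size, feature_size = 1000, 4
batch_size, lr, train_epoch = 32, 0.03, 5
```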
```python
true_w = torch.arange(feature_size + 1, dtype=torch.float)
X = torch.randn(data_size, feature_size + 1)
X[:, -1] = 1.0  # constant column, so the last entry of w acts as the bias b
y = X @ true_w + torch.normal(0, 0.1, (X.shape[0],))

def loss_func(w, xb, yb):
    # mean squared error, halved so the gradient carries no factor of 2
    return ((xb @ w - yb) ** 2).mean() / 2
```
The bias b is absorbed into w by appending an extra dimension (a constant column of ones) to X.
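A quick sanity check of that trick, using the `X` and `true_w` defined above:

```python
# with the constant column in place, the last entry of w plays the role of the bias
print(torch.allclose(X @ true_w, X[:, :-1] @ true_w[:-1] + true_w[-1]))  # True
```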
$$ w^* = (X^TX)^{-1}X^Ty $$
```python
# closed-form (normal-equation) solution
wstar = (X.T @ X).inverse() @ X.T @ y
```
We take $X \in \mathbb{R}^{\text{batch size}\times \text{feature}}$ to have full column rank, so that $X^TX$ is full rank and therefore invertible.
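As a sanity check (a minimal sketch; `torch.linalg.solve` is used here instead of forming the inverse explicitly, which is the numerically preferable way to solve the normal equations), the recovered weights should sit close to `true_w`:

```python
# solve (X^T X) w = X^T y without forming the inverse explicitly
wstar_stable = torch.linalg.solve(X.T @ X, X.T @ y)
print((wstar_stable - true_w).abs().max())  # small, on the order of the noise level
```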
```python
w = torch.randn(feature_size + 1, requires_grad=True)
train_dl = DataLoader(TensorDataset(X, y), batch_size=batch_size)
for epo in range(train_epoch):
    for xb, yb in train_dl:
        if w.grad is not None:
            w.grad.zero_()       # clear the gradient accumulated in the previous step
        loss = loss_func(w, xb, yb)
        loss.backward()          # populate w.grad
        with torch.no_grad():    # update w without autograd tracking the update
            w -= lr * w.grad
```
A few notes on the manual loop:

- `requires_grad=True` tells autograd to track operations on `w`, so that `loss.backward()` can fill in `w.grad`.
- *if* `w.grad is not None`: at the very beginning, before anything has happened, `w.grad` is still `None`, so it can only be zeroed once a first backward pass has created it.
- `w.grad.zero_()` clears the gradient from the previous step; with an optimizer the equivalent call is `optim.zero_grad()`.
- *with* `torch.no_grad()`: only inside this context can we use the gradient to update `w` in place, without autograd recording the update.
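A throwaway example (not part of the original code) showing why both the `None` check and the zeroing are needed:

```python
t = torch.randn(3, requires_grad=True)
print(t.grad)        # None: no backward pass has run yet
t.sum().backward()
print(t.grad)        # tensor([1., 1., 1.])
t.sum().backward()
print(t.grad)        # tensor([2., 2., 2.]) -- gradients accumulate unless cleared
t.grad.zero_()       # manual equivalent of optim.zero_grad()
```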
The same model, written with the `nn` API:

```python
net = nn.Sequential(nn.Linear(feature_size, 1))
criterion = nn.MSELoss()
optim = torch.optim.SGD(net.parameters(), lr=lr)
# nn.Linear has its own bias, so drop the constant column from X
train_dl = DataLoader(TensorDataset(X[:, :-1], y), batch_size=batch_size)
for epo in range(train_epoch):
    for xb, yb in train_dl:
        optim.zero_grad()
        # net(xb) has shape (batch, 1); squeeze it to match yb of shape (batch,)
        loss = criterion(net(xb).squeeze(-1), yb)
        loss.backward()
        optim.step()
```
Using the squared loss amounts to assuming that the observations contain normally distributed noise:
$$ y = w^Tx + b + \epsilon,\quad\epsilon \sim \mathcal{N}(0, \sigma^2) $$
Writing $y_{\text{pred}} = w^Tx + b$, this means $y\sim \mathcal{N}(y_\text{pred}, \sigma^2)$.
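Under that assumption, maximizing the likelihood of the observations is the same as minimizing the squared error; a short (standard) derivation:

$$ -\log p(y \mid x) = \frac{1}{2}\log\left(2\pi\sigma^2\right) + \frac{\left(y - y_{\text{pred}}\right)^2}{2\sigma^2} $$

Since $\sigma$ does not depend on $w$ or $b$, minimizing the negative log-likelihood over the data is equivalent to minimizing $\tfrac{1}{2}\left(y - y_{\text{pred}}\right)^2$, which is exactly the halved squared error used in `loss_func` above.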