Homework 1

1. Theory of Hard-Margin Support Vector Machines

(a) Show that the equation can be rewritten as the dual optimization problem:
$$
\max_{\lambda_i \ge 0}\;\left(\sum_{i=1}^n \lambda_i - \frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j X_i\cdot X_j\right) \quad\text{subject to}\quad \sum_{i=1}^n \lambda_i y_i = 0. \tag{4}
$$

Proof: We already proved this result in the "Support Vector Machines" section; here is a brief proof again.
We take the partial derivatives of the primal Lagrangian and set each parameter's derivative to zero:
$$
\frac{\partial L}{\partial w} = 2w - \sum_{i=1}^n \lambda_i y_i X_i = 0 \;\Rightarrow\; w = \frac{1}{2}\sum_{i=1}^n \lambda_i y_i X_i \tag{1-1}
$$

$$
\frac{\partial L}{\partial \alpha} = -\sum_{i=1}^n \lambda_i y_i = 0 \tag{1-2}
$$

Substituting (1-1) back into $L$, and using $\|w\|^2 = w \cdot w^T$:
$$
L = \sum_{i=1}^n \lambda_i + \left(\frac{1}{2}\sum_{i=1}^n \lambda_i y_i X_i\right)\left(\frac{1}{2}\sum_{i=1}^n \lambda_i y_i X_i\right)^T - \sum_{i=1}^n \lambda_i y_i X_i \cdot \left(\frac{1}{2}\sum_{i=1}^n \lambda_i y_i X_i\right)
$$

Converting the dot products into double sums gives the expression to be proved.
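Written out, the two quadratic terms combine into a single double sum:

$$
L = \sum_{i=1}^n \lambda_i + \frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j X_i\cdot X_j - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j X_i\cdot X_j = \sum_{i=1}^n \lambda_i - \frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j X_i\cdot X_j,
$$

which, maximized over $\lambda_i \ge 0$ subject to the constraint $\sum_{i=1}^n \lambda_i y_i = 0$ from (1-2), is exactly the dual problem (4).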
(b) Suppose we know the values $\lambda^*_i$ and $\alpha^*$ that optimize equation (3). Show that the decision rule below
$$
r(x)= \begin{cases} +1, & \text{if } \mathbf{w}\cdot\mathbf{x}+\alpha \ge 0,\\ -1, & \text{otherwise} \end{cases}
$$

can be written as
$$
r(x)= \begin{cases} +1, & \text{if } \alpha^{*} + \dfrac{1}{2}\sum_{i=1}^n \lambda_i^{*}\,y_i\,\mathbf{X}_i\cdot\mathbf{x} \ge 0,\\ -1, & \text{otherwise.} \end{cases}
$$

Proof: We simply substitute the expression for the optimal $w$ derived in part (a) into the decision rule.
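As a quick illustration, the dual-form rule needs only the training data, the $\lambda^*_i$, and $\alpha^*$. Below is a minimal sketch (assuming NumPy arrays `X`, `y`, `lam` and a scalar `alpha` holding the optimal values; these names are illustrative, not from the assignment):

```python
import numpy as np

def r(x, X, y, lam, alpha):
    """Dual-form decision rule from part (b).

    x     : (d,) query point
    X     : (n, d) training points X_i
    y     : (n,) labels in {-1, +1}
    lam   : (n,) optimal dual variables lambda*_i
    alpha : optimal bias alpha*
    """
    # score = alpha* + 1/2 * sum_i lambda*_i y_i (X_i . x)
    score = alpha + 0.5 * np.sum(lam * y * (X @ x))
    return 1 if score >= 0 else -1
```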
(c) Applying the KKT conditions, any pair of optimal primal and dual solutions $w^*$, $\alpha^*$, $\lambda^*$ for a linear, hard-margin SVM must satisfy the following condition:
$$
\lambda^*_i \left(y_i(X_i \cdot w^* + \alpha^*) - 1\right) = 0 \quad \forall i \in \{1, \dots, n\}
$$

This condition is called complementary slackness. Explain what this implies for points corresponding to $\lambda^*_i > 0$. That is, what relationship must exist between the data points, labels, and decision rule parameters? (You can provide a one sentence answer.)
Answer: Since $\lambda^*_i > 0$, the condition $\lambda^*_i \left(y_i(X_i \cdot w^* + \alpha^*) - 1\right) = 0$ forces $y_i(X_i \cdot w^* + \alpha^*) - 1 = 0$, i.e. $y_i(X_i \cdot w^* + \alpha^*) = 1$: such a point lies exactly on the margin boundary.
(d) The training points $X_i$ for which $\lambda^*_i > 0$ are called the support vectors. In practice, we frequently encounter training data sets for which the support vectors are a small minority of the training points, especially when the number of training points is much larger than the number of features. Explain why the support vectors are the only training points needed to use the decision rule.
Answer: We use the decision rule derived in part (b):
$$
r(x)= \begin{cases} +1, & \text{if } \alpha^{*} + \dfrac{1}{2}\sum_{i=1}^n \lambda_i^{*}\,y_i\,\mathbf{X}_i\cdot\mathbf{x} \ge 0,\\ -1, & \text{otherwise.} \end{cases}
$$

For each term $\lambda^*_i y_i \mathbf{X}_i \cdot \mathbf{x}$ in the sum, if $\mathbf{X}_i$ is not a support vector then $\lambda^*_i = 0$, so that term contributes nothing to the result. The sum in the decision rule can therefore be restricted to the support vectors:
$$
\sum_{i=1}^n \lambda^*_i y_i \mathbf{X}_i \cdot \mathbf{x} = \sum_{i \in V} \lambda^*_i y_i \mathbf{X}_i \cdot \mathbf{x},
$$

where $V$ is the index set of the support vectors.
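A small numerical check of this claim (a sketch, assuming `cvxpy` and NumPy are available; the synthetic data set and variable names are illustrative): solve the dual (4) on a tiny separable data set, recover $w^*$ and $\alpha^*$, and confirm that summing over the support vectors alone gives the same predictions as summing over all points.

```python
import cvxpy as cp
import numpy as np

# Tiny separable data set: two Gaussian blobs labeled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2, -2], size=(20, 2)),
               rng.normal(loc=[+2, +2], size=(20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
n = X.shape[0]

# Dual problem (4): maximize sum_i lambda_i - 1/4 ||sum_i lambda_i y_i X_i||^2
# subject to lambda_i >= 0 and sum_i lambda_i y_i = 0.
lam = cp.Variable(n)
obj = cp.Maximize(cp.sum(lam) - 0.25 * cp.sum_squares(X.T @ cp.multiply(lam, y)))
cp.Problem(obj, [lam >= 0, y @ lam == 0]).solve()
lam_star = lam.value

sv = lam_star > 1e-6                       # support vectors: lambda*_i > 0
w_star = 0.5 * (lam_star * y) @ X          # w* = 1/2 sum_i lambda*_i y_i X_i, from (1-1)
i = int(np.argmax(lam_star))               # a support vector lies on the margin, so
alpha_star = y[i] - X[i] @ w_star          # y_i (X_i . w* + alpha*) = 1 gives alpha*

full = np.sign(X @ w_star + alpha_star)                                       # all points
sv_only = np.sign(0.5 * (X @ X[sv].T) @ (lam_star[sv] * y[sv]) + alpha_star)  # SVs only
print(sv.sum(), "support vectors; predictions agree:", np.allclose(full, sv_only))
```

In runs like this only a handful of the $\lambda^*_i$ come out positive, which is why storing the support vectors alone is enough to apply the decision rule.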