Support Vector Machines (SVMs)

Geometry, Convex Optimization, Duality, Kernel Methods, and Extensions to SVC and SVR.

Notation

$x_i \in \mathbb{R}^d$ — input (feature) vector of the $i$-th data point

$w \in \mathbb{R}^d$ — normal vector to the separating hyperplane

$\phi(x) \in \mathcal{H}$ — feature-space representation of input $x$

$y_i \in \{-1,+1\}$ — class label of the $i$-th data point

$b \in \mathbb{R}$ — bias (offset) term of the hyperplane

$\gamma \in \mathbb{R}_+$ — geometric margin

$\xi_i \in \mathbb{R}_+$ — slack variable

$C \in \mathbb{R}_+$ — regularization parameter

$\epsilon \in \mathbb{R}_+$ — SVR tube radius

$\alpha_i \ge 0$ — Lagrange multipliers (classification)

$\alpha_i^{\ast} \ge 0$ — Lagrange multipliers (SVR)

$\mu_i \ge 0$ — Lagrange multipliers for slack constraints

$f(x)=\operatorname{sign}(\langle w,x \rangle + b)$ — decision function

$\mathcal{K}(x,z)$ — kernel function

⟨⋅,⋅⟩ — Euclidean inner product

$\left\lVert \cdot \right\rVert_2$ — Euclidean norm

$n$ — number of training samples

$d$ — dimension of the input space

$i,j$ — data point indices

$w^{\ast}, b^{\ast}$ — optimal primal solution

$\alpha^{\ast}$ — optimal dual variables

1. The Binary Classification Problem as Geometry

We begin with the binary classification problem from a purely geometric perspective. Let the training set be

$\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$.

The goal is to construct a decision rule that separates points of different labels with maximal robustness: we maximize the minimum distance from any data point to the separating hyperplane, which yields a simpler model and hence better generalization. SVMs approach this problem by explicitly reasoning about geometry in the input space, rather than by introducing a loss function from the outset.

1.1. Linear Separability and Hyperplanes

A hyperplane in $\mathbb{R}^d$ is defined as the set

\[H = \{x \in \mathbb{R}^d \mid \langle w, x \rangle + b = 0\},\]

where $w \in \mathbb{R}^d$ is the normal vector to the hyperplane and $b \in \mathbb{R}$ is the bias term.

For a binary classification problem, the hyperplane induces a classifier of the form

\[f(x) = \operatorname{sign}(\langle w, x \rangle + b).\]

A small definition:

\[\operatorname{sign}(f(\cdot)) = \begin{cases} -1, & \text{if } f(\cdot) < 0 \\ 0, & \text{if } f(\cdot) = 0 \\ 1, & \text{if } f(\cdot) > 0 \\ \end{cases}\]

The space is partitioned into two half-spaces:

$\langle w, x \rangle + b > 0$ and $\langle w, x \rangle + b < 0$,

which correspond to the two class labels $\{-1,+1\}$.
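As a tiny concrete example, the decision rule can be written in a few lines of NumPy; the hyperplane $(w, b)$ below is an assumed toy value, not anything derived in this article:

```python
import numpy as np

# A hypothetical hyperplane in R^2 (assumed values for illustration).
w = np.array([1.0, -1.0])
b = 0.0

def decide(x, w, b):
    """Classify x by the sign of <w, x> + b."""
    return int(np.sign(np.dot(w, x) + b))

# Points on opposite sides of the line x1 = x2 get opposite labels.
print(decide(np.array([2.0, 0.0]), w, b))   # positive half-space -> +1
print(decide(np.array([0.0, 2.0]), w, b))   # negative half-space -> -1
```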

Linear Separability

The dataset is said to be linearly separable if there exists at least one pair $(w,b)$ such that

\[y_i(\langle w, x_i \rangle + b) > 0 .\]

for all

\[i = 1,2,...,n.\]
Justification:
a correct prediction means $y_i = +1 \iff \langle w,x_i \rangle + b > 0$ and
$y_i = -1 \iff \langle w,x_i \rangle + b < 0$. In either case the label and the score share the same sign, so multiplying them gives
\(y_i(\langle w,x_i \rangle + b) > 0.\)

This condition guarantees that a hyperplane exists which correctly classifies all training points.

At this stage, infinitely many separating hyperplanes may exist. The central question becomes: which one should we choose?

SVMs answer this by introducing the notion of margin!

1.2. Functional and Geometric Margins

Functional Margin

Given a classifier $(w,b)$, the functional margin of a labeled point $(x_i,y_i)$ is defined as

\[\hat{\gamma_i} = y_i(\langle w, x_i \rangle + b).\]

The functional margin of the dataset is

\[\hat{\gamma} = \min_i \hat{\gamma_i}.\]

While convenient algebraically, the functional margin depends on the scale of $w$ and $b$. Multiplying $(w,b)$ by any positive constant scales the functional margin arbitrarily, without changing the decision boundary.

This makes the functional margin unsuitable as a geometric notion of confidence.

Geometric Margin

The geometric margin is the distance between the hyperplane (the decision boundary) and the support vectors (the data points closest to the hyperplane).

Figure: maximum-margin hyperplane and support vectors.

The distance from a point $x$ to the hyperplane $H$ is

\[\operatorname{dist}(x,H) = d = \frac{\lvert \langle w, x \rangle + b \rvert}{\left\lVert w \right\rVert_{2}}.\]
Proof:

Let us take an arbitrary point $x_0$ and a hyperplane defined by the equation $\langle w, x \rangle + b = 0$. We need to find the perpendicular distance $d$ from $x_{0}$ to this hyperplane.

Consider the point $x_{p}$ on the hyperplane that is the orthogonal projection of $x_0$ onto it. The vector $(x_{0} - x_{p})$ is parallel to the normal vector $w$, so the distance $d$ is the length of $(x_{0} - x_{p})$. Projecting $(x_{0} - x_{p})$ onto the unit normal vector $\frac{w}{\left\lVert w \right\rVert_{2}}$ gives the desired distance $d$:

\[d = \left\lvert \left\langle x_{0} - x_{p}, \frac{w}{\left\lVert w \right\rVert_{2}} \right\rangle \right\rvert\]

Since point $x_{p}$ lies on the hyperplane, its coordinates satisfy the equation

\[\langle w, x_{p} \rangle + b = 0\]

or

\[\langle w,x_{p} \rangle = -b\]

Let’s expand the dot product in the distance formula:

\[d = \frac{\vert \langle w, x_{0} \rangle - \langle w, x_{p} \rangle \vert }{\left\lVert w \right\rVert_{2}}\]

Substitute $\langle w, x_{p} \rangle = -b$ :

\[d = \frac{\vert \langle w, x_{0} \rangle - (-b) \vert }{\left\lVert w \right\rVert_{2}}\] \[d = \frac{\vert \langle w, x_{0} \rangle + b \vert }{\left\lVert w \right\rVert_{2}}\]
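The closed-form distance just derived can be cross-checked numerically against an explicit projection; the hyperplane and point below are made-up values for illustration:

```python
import numpy as np

# Hyperplane <w, x> + b = 0 and an arbitrary point x0 (assumed values).
w, b = np.array([3.0, 4.0]), -5.0
x0 = np.array([4.0, 6.0])

# Closed-form distance from the derivation above.
dist = abs(np.dot(w, x0) + b) / np.linalg.norm(w)

# Cross-check: project x0 onto the hyperplane, then measure the gap.
x_p = x0 - ((np.dot(w, x0) + b) / np.dot(w, w)) * w
print(dist, np.linalg.norm(x0 - x_p))
assert np.isclose(np.dot(w, x_p) + b, 0.0)   # x_p lies on the hyperplane
```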

Accordingly, the geometric margin of a labeled point $(x_{i},y_{i})$ is

\[\gamma_i = \frac{y_i(\langle w, x_{i} \rangle +b)}{\left\lVert w \right\rVert_{2}}.\]

The geometric margin of the dataset is

\[\gamma = \min_i \gamma_i.\]

Unlike the functional margin, the geometric margin is invariant under rescaling of $(w,b)$ and has a direct geometric interpretation.
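This scale behavior is easy to verify numerically; the data point and hyperplane below are assumed toy values:

```python
import numpy as np

# Illustrative labeled point and (assumed) hyperplane.
x, y = np.array([2.0, 1.0]), 1
w, b = np.array([1.0, 1.0]), -1.0

def functional_margin(w, b, x, y):
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# Rescaling (w, b) by 10 inflates the functional margin tenfold...
print(functional_margin(w, b, x, y), functional_margin(10 * w, 10 * b, x, y))
# ...but leaves the geometric margin unchanged.
print(geometric_margin(w, b, x, y), geometric_margin(10 * w, 10 * b, x, y))
```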

1.3. Margin Maximization Principle

Among all separating hyperplanes, Support Vector Machines choose the one that maximizes the geometric margin :

\[\max_{w,b} \gamma.\]

Geometrically, this corresponds to selecting the hyperplane that is as far as possible from the closest data points of either class.

Why Margin Matters ?

Maximizing the margin has several fundamental consequences:

  1. Robustness to perturbations

    A larger margin implies that small perturbations of the input vectors are less likely to change the classification outcome.

  2. Capacity control

    From statistical learning theory, classifiers with larger margins have lower effective capacity, leading to better generalization guarantees.

  3. Uniqueness of solution

    While many separating hyperplanes may exist, the maximum-margin hyperplane is unique (up to translation of $b$) under mild conditions.

This principle provides the conceptual bridge between geometry and generalization, which will later be formalized through convex optimization and duality.

1.4. Scale Normalization

Because the decision boundary defined by $(w,b)$ is invariant under positive rescaling, we may impose a normalization constraint without loss of generality, namely $\hat{\gamma} = 1$. This means

\[\min_i y_i(\langle w, x_i \rangle + b) = 1\]
Proof :

We can always rescale $(w,b)$ so that $\min_i\lvert \langle w,x_i \rangle + b \rvert = 1$.

Let $\min_i\lvert \langle w,x_i \rangle + b \rvert = c$, with $c > 0$.

Then, if we substitute $w,b$ with $w_\text{new} = \frac{w}{c}$ and $b_\text{new} = \frac{b}{c}$, we get

\[\min_i\lvert \langle w_{\text{new}},x_i \rangle + b_{\text{new}} \rvert = \min_i\left\lvert \left\langle \frac{w}{c},x_i \right\rangle + \frac{b}{c} \right\rvert = \min_i\frac{\lvert \langle w,x_i \rangle + b \rvert}{c} = \frac{c}{c} = 1\]

Now let’s maximize the margin, keeping those derivations in mind:

\[\max_{w,b}\frac{\min_i y_i(\langle w, x_i \rangle + b)}{\left\lVert w \right\rVert_{2}} = \max_{w,b}\frac{\min_i \hat{\gamma_i}}{\left\lVert w \right\rVert_{2}} = \max_{w,b}\frac{1}{\left\lVert w \right\rVert_{2}}\]

Under this normalization, maximizing the geometric margin is equivalent to minimizing $\left\lVert w \right\rVert_{2}$.


This observation leads directly to the primal optimization formulation of the hard-margin SVM.

2. Hard-Margin Support Vector Machine

2.1. Primal Optimization Problem

Under the normalization constraint, the problem of finding the maximum-margin separating hyperplane can be written as the following optimization problem:

$\min_{w,b} \frac{1}{2} \left\lVert w \right\rVert_2^2$ s.t. $y_i(\langle w, x_i \rangle + b) \ge 1,$ $i = 1,2,…,n.$

This is known as the hard-margin Support Vector Machine. The factor $\frac{1}{2}$ and the squared term $\left\lVert w \right\rVert_{2}^{2}$ are used for mathematical convenience (smoothness and differentiability when optimizing). Minimizing $\left\lVert w \right\rVert_{2}$ is equivalent to minimizing $\left\lVert w \right\rVert_{2}^{2}$ because both yield the same optimal solution.

Remarks
  1. Quadratic objective

    The objective function $\frac{1}{2} \left\lVert w \right\rVert_2^2$ is a strictly convex quadratic function in $w$.

  2. Linear Constraints

    The constraints are affine functions of $(w,b)$, defining a convex feasible region.

  3. Convexity

    Since the objective is convex and the feasible set is convex, this is a convex optimization problem. Any local minimum is also a global minimum.

  4. Uniqueness

    The solution for $w$ is unique whenever the data are linearly separable. The bias term $b$ may not be unique though.

2.2. Geometric Interpretation of the Constraints

Each constraint

\[y_i(\langle w, x_i \rangle + b) \ge 1\]

enforces that the point $x_i$ lies on the correct side of the hyperplane and at least a distance $\frac{1}{\left\lVert w \right\rVert_2}$ from it.

The points that satisfy

\[y_i(\langle w, x_i \rangle + b) = 1\]

lie exactly on the margin boundaries. These points are the support vectors. All other points satisfy the constraints strictly and do not influence the optimal solution.

2.3. Support Vectors and Sparsity

A fundamental consequence of the margin maximization formulation is that only a subset of the training points determines the solution!!


Definition.

A data point $(x_i,y_i)$ is called a support vector if

\[y_i(\langle w^{\ast}, x_i \rangle + b^{\ast}) = 1,\]

where $(w^{\ast},b^{\ast})$ denotes the optimal solution of the primal problem.

Figure: support vectors.


Geometrically, support vectors are the points closest to the decision boundary. Removing any non-support vector does not change the optimal hyperplane.

This sparsity property will later emerge naturally from the DUAL formulation.

You may ask: “Why is the primal formulation not enough?”

While the primal problem provides clear geometric intuition, it has two important limitations:

1.It relies explicitly on the assumption of linear separability.

2.It depends directly on the inner product $\langle w, x \rangle$, which restricts us to linear decision boundaries.

Both limitations will be addressed in what follows: the dual formulation with kernel methods removes the linearity restriction, and slack variables (the soft margin) remove the separability assumption.


To expose the structure of the solution and prepare the ground for kernel methods, we now derive the dual formulation of the hard-margin Support Vector Machine using Lagrangian duality and the Karush–Kuhn–Tucker conditions.


3. Lagrangian Duality and The Dual Problem

The primal formulation of the hard-margin Support Vector Machine is a convex optimization problem with inequality constraints. To analyze its structure and derive a more flexible representation, we apply Lagrangian duality. Before we start, let me first introduce several concepts.


* — WHAT IS A CONVEX OPTIMIZATION PROBLEM? —

A constrained optimization problem is called a convex optimization problem if

$\min_x{f(x)}$ <- $f(x)$ is a convex function

s.t.

$g_i(x)\le 0$ for all $i = 1,…,m$ <- each $g_i(x)$ is also a convex function

and $h_j(x)=0$ for all $j=1,…,n$ <- each $h_j(x)=0$ is an affine equality (so its solution set is a convex set)


For a general optimization problem, the KKT conditions are necessary (under mild constraint qualifications) but in most cases NOT sufficient. However, a special case arises for convex constrained optimization problems: there the KKT conditions remain necessary AND become sufficient, fully characterizing the solution.


* — Karush-Kuhn-Tucker(KKT) Conditions —

  1. Primal Feasibility

The solution must satisfy the constraints :

\[g_i(x)\le 0, h_j(x)=0\]
  2. Dual Feasibility

Lagrange multipliers must satisfy

\[\alpha_i \ge 0\]
  3. Stationarity

Gradient of the Lagrangian with respect to primal variables is zero :

\[\nabla_x{\mathcal{L}(x,\alpha,v)} = 0\]

Stationarity connects primal variables to dual variables

  4. Complementary Slackness

For each component, either the constraint is active or its multiplier is zero, never both nonzero :

\[\alpha_i g_i(x) = 0\]
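The four conditions can be checked on a toy problem worked out by hand: minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$, whose Lagrangian $x^2 + \alpha(1-x)$ gives the solution $x^{\ast} = 1$, $\alpha^{\ast} = 2$. This is a minimal sketch, not part of the SVM derivation itself:

```python
# Toy convex problem: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.
# Claimed solution (from stationarity 2x - alpha = 0 and the active constraint).
x_star, alpha_star = 1.0, 2.0

g = 1 - x_star                      # constraint value at x*
grad_L = 2 * x_star - alpha_star    # d/dx [x^2 + alpha * (1 - x)]

assert g <= 0                       # 1. primal feasibility
assert alpha_star >= 0              # 2. dual feasibility
assert grad_L == 0                  # 3. stationarity
assert alpha_star * g == 0          # 4. complementary slackness
print("all four KKT conditions hold at (x*, alpha*) = (1, 2)")
```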

3.1. The Primal Optimization Problem, The Lagrangian

We introduce a non-negative Lagrange multiplier $\alpha_i \ge 0$ for each constraint

\[y_i(\langle w, x_i \rangle + b) \ge 1.\]

The Lagrangian associated with the primal problem is

\[\mathcal{L}(w, b, \alpha) = \frac{1}{2}\left\lVert w \right\rVert_2^2 - \sum^n_{i=1}\alpha_i[y_i(\langle w, x_i \rangle + b)-1],\]

where $\alpha = (\alpha_1,…,\alpha_n)$ is the vector of Lagrange multipliers.

The primal problem corresponds to minimizing $\mathcal{L}$ with respect to $(w,b)$ and maximizing it with respect to $\alpha \ge 0$.

To obtain the dual formulation, we minimize the Lagrangian with respect to the primal variables $w$ and $b$.

Let’s compute the derivatives with respect to the primal variables and apply the stationarity condition.

Derivative with respect to $w$
\[\frac{\partial \mathcal{L}}{\partial w} = w - \sum^n_{i=1} \alpha_i y_i x_i.\]

Setting the derivative to zero yields

\[w = \sum^n_{i=1} \alpha_i y_i x_i.\]

This equation already reveals an important fact: the normal vector $w$ lies in the span of the training points.

Derivative with respect to $b$
\[\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^n \alpha_i y_i\]

Setting it to zero gives

\[\sum_{i=1}^n \alpha_i y_i = 0\]

3.2. The Dual Optimization Problem

Substituting the stationarity conditions back into the Lagrangian eliminates the primal variables $(w,b)$. The resulting dual objective depends only on the dual variables $\alpha_i$.

It is convenient to maximize the dual while minimizing the primal, so let’s simplify and derive the dual problem. First, we expand the primal Lagrangian (hard margin):

\[\mathcal{L}_{primal}(w,b,\alpha) = \frac{1}{2}\left\lVert w \right\rVert_2^2 - \sum_i\alpha_iy_i\langle w, x_i \rangle - b\sum_i\alpha_iy_i + \sum_i\alpha_i\]

Now let’s substitute the derivatives w.r.t. primal variables

\[\mathcal{L}(w,b,\alpha) = \frac{1}{2}\left\lVert \sum_i\alpha_iy_ix_i\right\rVert_2^2 - \langle w, w \rangle - b \cdot 0 + \sum_i\alpha_i\]

Keep in mind that

\[\left\lVert\sum_i\alpha_iy_ix_i\right\rVert_2^2 = \langle \sum_i\alpha_iy_ix_i, \sum_j\alpha_jy_jx_j \rangle = \sum_i\sum_j\alpha_i\alpha_jy_iy_j\langle x_i,x_j\rangle\]

Thus,

\[\mathcal{L}(w,b,\alpha) = \frac{1}{2}[\sum_i\sum_j\alpha_i\alpha_jy_iy_j\langle x_i,x_j\rangle] - \sum_i\sum_j\alpha_i\alpha_jy_iy_j\langle x_i,x_j\rangle + \sum_i\alpha_i\]

Our Dual

\[\mathcal{L}_{dual}=\sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_j\langle x_i,x_j\rangle\]

The dual optimization problem becomes

\[\max_\alpha \sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_j\langle x_i,x_j\rangle\]

subject to

\[\alpha_i \ge 0, \sum\alpha_iy_i = 0\]

This is a quadratic programming problem in the variables $\alpha_i$.

Remarks

1.Dependence on inner products only

The data appear exclusively through inner products $\langle x_i,x_j \rangle$. This observation is the key to kernel methods.

2.Dimensionality

The optimization is performed in $\mathbb{R}^n$, independent of the ambient dimension $d$.

3.Convexity

The dual objective is concave, and the feasible region is convex. Therefore every local maximum of the dual problem is a global maximum.
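To see the dual at work, here is a minimal sketch that solves it by projected gradient ascent on a tiny hand-made dataset. To keep the projection trivial, it uses the bias-free variant ($b = 0$), which drops the equality constraint $\sum_i \alpha_i y_i = 0$ and leaves only $\alpha_i \ge 0$; the data and step size are assumptions for illustration, not a production solver:

```python
import numpy as np

# Tiny separable dataset: the points closest to the boundary should become
# the support vectors, the far points should end up with alpha = 0.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j <x_i, x_j>
alpha = np.zeros(len(y))
eta = 0.01                                        # step size < 2 / lambda_max(Q)

for _ in range(2000):
    grad = 1.0 - Q @ alpha                        # gradient of the dual objective
    alpha = np.maximum(0.0, alpha + eta * grad)   # project onto alpha >= 0

w = (alpha * y) @ X                               # stationarity: w = sum_i alpha_i y_i x_i
margins = y * (X @ w)
print("alpha =", np.round(alpha, 4))              # far points get alpha ~ 0 (sparsity)
print("w =", w, "min margin =", margins.min())
```

Note how sparsity emerges on its own: the multipliers of the two distant points are driven to zero by the projection.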

3.3. Connecting the Dual

Using the stationarity condition for $w$, the classifier can be written as

\[f(x) = \operatorname{sign}(\sum_i\alpha_iy_i\langle x_i,x \rangle + b).\]

Only points with $\alpha_i > 0$ contribute to the sum, yielding a sparse representation.


The dual formulation reveals that the Support Vector Machine depends on the data only through inner products. This observation allows the extension of SVMs to nonlinear decision boundaries via kernel methods.


4. Kernel Methods: From Inner Products to Nonlinear Decision Boundaries

4.1. From Feature Maps to Kernels

Suppose we map inputs into a (possibly high-dimensional) feature space:

\[\phi : \mathbb{R}^d \rightarrow \mathcal{H},\]

Figure: the kernel trick.

where $\mathcal{H}$ is a Hilbert space (a complete inner-product space). A linear classifier in feature space has the form

\[f(x) = \operatorname{sign}(\langle w , \phi(x) \rangle_\mathcal{H} + b ).\]

If we try to work directly with the feature mapping $\phi(x)$, we may need to handle extremely high-dimensional vectors. The kernel method avoids explicit computation of those mappings.

4.2. Kernel Trick

Let’s firstly introduce the concept of a kernel


— * What is a kernel K ? —

A kernel is a function

\[\mathcal{K} : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\]

such that

\[\mathcal{K}(x,z) = \langle \phi(x), \phi(z) \rangle_\mathcal{H}\]

for some feature mapping $\phi$ into some inner-product space $\mathcal{H}$


In the SVM dual, we replace every $\langle x_i,x_j \rangle$ by $\mathcal{K}(x_i,x_j)$. The hard-margin dual becomes:

\[\max_\alpha \sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_j\mathcal{K}(x_i,x_j)\]

subject to

\[\alpha_i \ge 0, \sum_i\alpha_iy_i = 0.\]

The corresponding decision function is

\[f(x) = \operatorname{sign}(\sum_i\alpha_iy_i\mathcal{K}(x_i,x) + b).\]

Only support vectors (those with $\alpha_i > 0$) contribute to the sum.

4.3. Which Functions Are Valid Kernels?

Not every symmetric function $\mathcal{K}(x,z)$ is a valid kernel. The key requirement is that the kernel corresponds to an inner product in some (possibly infinite-dimensional) feature space. Let me formalize it by introducing the definition of the Gram matrix and a simple version of Mercer’s Theorem.


Definition

Given points $x_1,…,x_n$, we define the Gram matrix $G \in \mathbb{R}^{n \times n}$ by

\[G_{ij} = \mathcal{K}(x_i,x_j).\]

A function $\mathcal{K}$ is a valid (Mercer) kernel if, for any finite set ${x_1,…,x_n}$, the Gram matrix $G$ is positive semidefinite (PSD), i.e.

\[c^{\intercal} G c \ge 0 \quad \text{for all } c \in \mathbb{R}^n\]

(equivalently, all eigenvalues of $G$ are nonnegative). This PSD condition is what ensures that $\mathcal{K}$ behaves like an inner product.

In general, a Gram matrix $G$ of vectors $x_1,x_2,…,x_n$ is the matrix of all pairwise inner products of those vectors:

\[G(x_1,x_2,...,x_n) = \begin{bmatrix} \langle x_1,x_1 \rangle& \langle x_1,x_2 \rangle& ...& \langle x_1,x_n \rangle\\ \langle x_2,x_1 \rangle& \langle x_2,x_2 \rangle& ...& \langle x_2,x_n \rangle\\ \vdots& \vdots& \ddots& \vdots\\ \langle x_n,x_1 \rangle& \langle x_n,x_2 \rangle& ...& \langle x_n,x_n \rangle\\ \end{bmatrix}\]
Mercer’s Theorem in a nutshell

If $\mathcal{K}$ is symmetric and positive semidefinite (under mild regularity assumptions), then there exists a feature mapping $\phi$ into a Hilbert space $\mathcal{H}$ such that

\[\mathcal{K}(x,z) = \langle \phi(x),\phi(z) \rangle_\mathcal{H}.\]

Thus, using $\mathcal{K}$ in the dual is the same as running a linear SVM in the feature space $\mathcal{H}$, without explicitly knowing the mapping $\phi$.
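The Mercer condition is easy to probe empirically: build the Gram matrix of a kernel on a random point set and inspect its spectrum. The point set and bandwidth below are arbitrary illustration values:

```python
import numpy as np

# Sanity check of the Mercer condition for the Gaussian (RBF) kernel:
# its Gram matrix on any point set should be symmetric and PSD.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
sigma = 1.5

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G = np.exp(-sq_dists / (2 * sigma ** 2))

eigvals = np.linalg.eigvalsh(G)            # real spectrum (G is symmetric)
print("min eigenvalue:", eigvals.min())    # nonnegative up to numerical noise
```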


4.4. Examples of Most Common Kernels

1.Linear Kernel

\[\mathcal{K}(x,z) = \langle x,z \rangle\]

2.Polynomial Kernel

\[\mathcal{K}(x,z) = (\langle x,z \rangle + c)^p\]

$c \ge 0, p \in \mathbb{N}$.

3.Gaussian(RBF) kernel

\[\mathcal{K}(x,z) = \exp(-\frac{\left\lVert x-z\right\rVert^2_2}{2\sigma^2}),\]

$\sigma > 0$.

Figure: examples of kernels.
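For the polynomial kernel with $d=2$, $p=2$, $c=1$, the feature map can be written out explicitly and checked against the kernel shortcut; the test points are assumed values:

```python
import numpy as np

# For x in R^2, (<x, z> + 1)^2 equals the inner product of these
# 6-dimensional explicit features -- small enough to verify directly.
def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])

k_direct = (np.dot(x, z) + 1.0) ** 2        # kernel trick: O(d) work
k_explicit = np.dot(phi(x), phi(z))         # explicit map: O(d^2) features
print(k_direct, k_explicit)
```

The two numbers agree, which is exactly why the kernel trick lets us skip computing $\phi$ at all.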


The kernelized dual allows nonlinear decision boundaries; however, it still assumes perfect separability in feature space. In practice, such ideal separability is rarely achieved, so let me now tell you about slack variables and the soft-margin SVM.


5. Soft-Margin Support Vector Machines

5.1. Slack Variables and Constraint Relaxation

For each training point , we introduce a slack variable $\xi_i \ge 0$ that measures the degree of constraint violation. The margin constraints become

\[y_i(\langle w,x_i \rangle + b) \ge 1 - \xi_i , i = 1,...,n.\]

Interpretation of $\xi_i$ : if $\xi_i = 0$, the point satisfies the original margin constraint; if $0 < \xi_i \le 1$, the point lies inside the margin but is still correctly classified; if $\xi_i > 1$, the point is misclassified.

Figure: soft-margin SVM.

5.2. Primal Soft-Margin Optimization Problem

The soft-margin SVM trades off margin maximization against constraint violations:

\[\min_{w,b,\xi}\frac{1}{2}\left\lVert w \right\rVert_2^2 + C\sum_i\xi_i\]

subject to

\[y_i(\langle w,x_i \rangle + b) \ge 1 - \xi_i\]

$i = 1,…,n$ and $\xi_i \ge 0$

Here the hyperparameter $C > 0$ controls the bias-variance tradeoff: a large $C$ penalizes violations heavily (narrower margin, risk of overfitting), while a small $C$ tolerates violations (wider margin, stronger regularization).


Now, before we continue, I will show you another, but FOUNDATIONAL, concept for SVMs - the hinge loss.


Definition
\[\ell_{hinge}(y,f(x)) = \max\{0,1 - yf(x)\}.\]

5.3. Connection to Hinge Loss

The constrained soft-margin problem can be rewritten as an unconstrained regularized risk minimization problem. The soft-margin SVM is then equivalent to:

\[\min_{w,b}\frac{1}{2}\left\lVert w \right\rVert_2^2 + C\sum_i\max\{0,1 - y_i(\langle w,x_i \rangle + b)\}.\]

This makes explicit the connection between SVMs and empirical risk minimization (ERM), while preserving a geometric interpretation via the margin.
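The hinge loss itself is a one-liner; the three scores below are made-up cases covering the three regimes:

```python
# The hinge loss penalizes margin violations linearly and ignores points
# beyond the margin; three illustrative cases with assumed scores f(x).
def hinge(y, score):
    return max(0.0, 1.0 - y * score)

print(hinge(+1, 2.5))   # well beyond the margin: loss 0.0
print(hinge(+1, 0.5))   # correct but inside the margin: loss 0.5
print(hinge(-1, 1.0))   # misclassified: loss 2.0
```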

5.4. Dual Formulation of the Soft-Margin SVM

We follow the same procedure as for the hard-margin case, but introduce an additional Lagrange multiplier $\mu_i \ge 0$ for each constraint $\xi_i \ge 0$. Let’s derive it:

Starting from the primal problem introduced above, we construct the primal Lagrangian:

\[\mathcal{L}_{primal}(w,b,\alpha,\mu) = \frac{1}{2}\left\lVert w \right\rVert_2^2 + C\sum_i\xi_i - \sum_i\alpha_i[y_i(\langle w, x_i \rangle + b) - 1 + \xi_i] - \sum_i\mu_i\xi_i\]
stationarity w.r.t. primal variables
\[\frac{\partial \mathcal{L}}{\partial \xi_i} = C - \alpha_i - \mu_i\]

thus, $C - \alpha_i - \mu_i = 0 \Rightarrow \mu_i = C - \alpha_i$.

And since $\mu_i \ge 0$, we get $C - \alpha_i \ge 0 \Rightarrow \alpha_i \le C$, which together with $\alpha_i \ge 0$ gives $0 \le \alpha_i \le C$.

\[\frac{\partial \mathcal{L}}{\partial w} = w - \sum_i\alpha_ix_iy_i\]

$w = \sum_i\alpha_ix_iy_i$

\[\frac{\partial \mathcal{L}}{\partial b} = \sum_i\alpha_iy_i\]

$\sum_i\alpha_iy_i = 0$

Substituting into primal yields the dual
\[\mathcal{L}_{dual}(\alpha,\mu) = \sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_j\mathcal{K}(x_i,x_j)\]

And finally the dual optimization problem :

\[\max_\alpha \sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_j\mathcal{K}(x_i,x_j)\]

subject to

\[0 \le \alpha_i \le C , \sum_i\alpha_iy_i = 0.\]

Compared to the hard-margin case, the dual variables are now box-constrained.

5.5. Decision Function (Kernelized)

The final classifier has the form

\[f(x) = \operatorname{sign}(\sum_i\alpha_iy_i\mathcal{K}(x_i,x) + b).\]

The bias term $b$ can be computed using any support vector with $0 < \alpha_i < C$.


The soft-margin formulation provides robustness to noise while preserving convexity and sparsity. We now extend the same principles to regression, leading to Support Vector Regression.
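As a quick sanity check of the box constraint, here is a sketch using scikit-learn's `SVC` (a library solver, not the from-scratch class later in this article); the dataset parameters are arbitrary:

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

# Fit a linear soft-margin SVM and inspect its dual variables: every
# alpha_i should satisfy the box constraint 0 < alpha_i <= C.
X, y = datasets.make_blobs(n_samples=100, centers=2, random_state=0)
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only.
alphas = np.abs(clf.dual_coef_.ravel())
print("support vectors:", len(clf.support_), "of", len(X))
print("all alphas within (0, C]:", bool(np.all(alphas > 0) and np.all(alphas <= C + 1e-8)))
```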


6. Support Vector Regression (SVR)

Support Vector Regression extends the maximum-margin principle to real-valued outputs by introducing an ε-insensitive loss. Instead of separating classes, SVR seeks a function that approximates the data while remaining as flat as possible.

6.1. Regression as Function Approximation

Given training data

\[\{x_i,y_i\}^n_{i=1}, x_i \in \mathbb{R}^d, y_i \in \mathbb{R},\]

We consider linear predictors of the form

\[f(x) = \langle w,x \rangle + b.\]

As in classification, flatness is measured by the norm $\left\lVert w\right\rVert_2^2$. SVR therefore minimizes $\left\lVert w\right\rVert_2^2$ subject to an *error tolerance*.

6.2. $\epsilon$-Insensitive Tube

SVR introduces a tube of radius $\epsilon > 0$ around the regression function

\[\vert y_i - f(x_i)\vert \le \epsilon.\]

Errors inside this tube are ignored (only deviations $> \epsilon$ are penalized), which gives better robustness to outliers.

We also include two slack variables $\xi_i, \xi_i^{\ast} \ge 0$ to allow violations.

The constraints become :

\[y_i - (\langle w,x_i \rangle + b) \le \epsilon + \xi_i,\] \[(\langle w,x_i \rangle + b) - y_i \le \epsilon + \xi_i^{\ast}.\]

Figure: SVR.

6.3. Primal Optimization Problem for SVR

The $\epsilon$-SVR primal problem is:

\[\min_{w,b,\xi,\xi^{\ast}}\frac{1}{2}\left\lVert w \right\rVert_2^2 + C\sum_i(\xi_i + \xi_i^{\ast})\]

subject to :

\[y_i - (\langle w,x_i \rangle + b) \le \epsilon + \xi_i,\] \[(\langle w,x_i \rangle + b) - y_i \le \epsilon + \xi_i^{\ast},\] \[\xi_i, \xi_i^{\ast} \ge 0.\]

As in classification, the hyperparameter $C > 0$ trades off flatness against the amount of violation.

$\epsilon$-Insensitive Loss

The primal problem is equivalent to minimizing a regularized empirical risk with $\epsilon$-insensitive loss:

\[\ell_\epsilon(y,f(x)) = \max\{0, \vert y - f(x)\vert - \epsilon\}.\]

Thus:

\[\min_{w,b}\frac{1}{2}\left\lVert w \right\rVert_2^2 + C\sum_i\max\{0, \vert y_i - (\langle w,x_i \rangle + b)\vert - \epsilon\}.\]
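The $\epsilon$-insensitive loss is just as simple as the hinge loss; the targets and predictions below are made-up cases with an assumed $\epsilon = 0.5$:

```python
# The epsilon-insensitive loss ignores residuals inside the tube and
# grows linearly outside it; three illustrative cases with eps = 0.5.
def eps_loss(y, pred, eps=0.5):
    return max(0.0, abs(y - pred) - eps)

print(eps_loss(1.0, 1.2))   # |residual| = 0.2 <= eps: loss 0.0
print(eps_loss(1.0, 1.5))   # exactly on the tube boundary: loss 0.0
print(eps_loss(1.0, 2.0))   # 0.5 beyond the tube: loss 0.5
```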

6.4. Dual Formulation of SVR

Introduce Lagrange multipliers $\alpha_i, \alpha_i^{\ast} \ge 0$, one for each of the two tube constraints.

After minimizing the primal Lagrangian w.r.t. $w$ and $b$ , the dual problem becomes:

\[\max_{\alpha,\alpha^{\ast}} -\frac{1}{2}\sum_i\sum_j(\alpha_i - \alpha^{\ast}_i)(\alpha_j - \alpha_j^{\ast})\mathcal{K}(x_i,x_j) - \epsilon\sum_i(\alpha_i + \alpha_i^{\ast}) + \sum_iy_i(\alpha_i-\alpha_i^{\ast})\]

subject to:

\[0 \le \alpha_i, \alpha_i^{\ast} \le C,\] \[\sum_i(\alpha_i - \alpha_i^{\ast}) = 0.\]

6.5. Decision Function (SVR)

The regression function is:

\[f(x)= \sum_i(\alpha_i-\alpha_i^{\ast})\mathcal{K}(x_i,x) + b.\]

Only points with $\vert\alpha_i - \alpha_i^{\ast}\vert > 0$ contribute (these are the support vectors).

6.6. Geometrical Interpretation

Points strictly inside the $\epsilon$-tube have $\alpha_i = \alpha_i^{\ast} = 0$ and do not affect the solution. Thus, as in SVC, SVR yields a sparse solution.
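This sparsity is easy to observe with scikit-learn's `SVR` (again a library solver, used here only for illustration); the dataset, noise level, and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.svm import SVR

# Fit SVR on noisy sine data and count support vectors: points strictly
# inside the epsilon-tube do not appear among them.
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.05 * rng.normal(size=80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(reg.support_), "of", len(X))
```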

Figure: SVR.


Implementing SVM from scratch in Python(Soft-margin)

Requirements : Python (>2.7), NumPy, matplotlib, scikit-learn

Core algorithm

# svm.py
import numpy as np


class SVM:

   def __init__(self, C = 1.0):
      # C = hyperparameter for penalty
      self.C = C
      self.w = 0
      self.b = 0
   def hingeloss(self, w, b, x, y):
      # Regularization term (L2): 0.5 * ||w||^2 as a scalar
      loss = 0.5 * np.dot(w, w.T)

      for i in range(x.shape[0]):
         # Margin term for sample i
         opt_term = y[i] * (np.dot(w, x[i]) + b)

         # Accumulate the hinge penalty over all samples
         loss = loss + self.C * max(0, 1 - opt_term)
      return loss.item()
   def fit(self, X, Y, batch_size=100, learning_rate=0.001, epochs=1000):

      number_of_features = X.shape[1]
      number_of_samples = X.shape[0]

      c = self.C

      # Creating ids from 0 to number_of_samples - 1
      ids = np.arange(number_of_samples)

      # Random shuffle of samples
      np.random.shuffle(ids)

      w = np.zeros((1, number_of_features))
      b = 0
      losses = []

      # Gradient Descent logic
      for i in range(epochs):
         # Calculating the Hinge Loss
         l = self.hingeloss(w, b, X, Y)
         losses.append(l)

         for batch_initial in range(0, number_of_samples, batch_size):
            grad_w = 0
            grad_b = 0

            for j in range(batch_initial, batch_initial + batch_size):
               if j < number_of_samples:
                  x = ids[j]
                  ti = Y[x] * (np.dot(w, X[x].T) + b)

                  if ti > 1:
                     grad_w += 0
                     grad_b += 0
                  else:
                     # Calculating the gradients
                     grad_w += c * Y[x] * X[x]
                     grad_b += c * Y[x]

            # Updating weights and bias
            w = w - learning_rate * w + learning_rate * grad_w
            b = b + learning_rate * grad_b

      self.w = w
      self.b = b

      return self.w, self.b, losses
   def predict(self, X):
      prediction = np.dot(X, self.w[0]) + self.b
      return np.sign(prediction)

You may be confused why we update w and b like that. The reason is quite simple: let the learning rate be $\eta$ and our cost function (objective function) be

\[\mathcal{J}(w,b) = \frac{1}{2}\left\lVert w \right\rVert_{2}^{2} + \operatorname{hingeloss}(w,b)\]

Also let

\[\ell_i = \max(0, 1-y_if_i), \qquad f_i = \langle w,x_i \rangle + b,\]

be the loss per sample, and let

\[\operatorname{hingeloss} = C\sum_i \ell_i\]

be the loss over the whole dataset. Taking the gradients with respect to $w$ and $b$ yields:

\[\nabla_{w}{\mathcal{J}} = w + \nabla_{w}\operatorname{hingeloss}\]

and

\[\nabla_{b}{\mathcal{J}} = \nabla_{b}\operatorname{hingeloss}\]

Also, taking the (sub)gradients of the hinge loss per sample w.r.t. $w$ and $b$ gives, for margin-violating samples ($y_if_i < 1$; the gradient is zero otherwise):

\[\nabla_{w}{\ell_i} = \frac{\partial}{\partial w}\max[0,1-y_i(\langle w,x_i \rangle + b)] = -y_ix_i\]

and

\[\nabla_{b}{\ell_i} = \frac{\partial}{\partial b}\max[0,1-y_i(\langle w,x_i \rangle + b)] = -y_i\]

Thus, summing over the margin-violating samples,

\[\nabla_{w}{\operatorname{hingeloss}} = -C \sum_{i : y_if_i < 1} x_iy_i\]

and

\[\nabla_{b}{\operatorname{hingeloss}} = -C \sum_{i : y_if_i < 1} y_i\]

We defined the negatives of those gradients as grad_w and grad_b respectively. Recalling the gradient descent update rule :

\[\theta \leftarrow \theta - \eta\nabla_{\theta}{\mathcal{J}}\]

Substituting our variables of interest and their gradients gives us :

\[w \leftarrow w - \eta(w + \nabla_{w}\operatorname{hingeloss})\] \[w \leftarrow w - \eta w - \eta \Big(-C\sum_{i : y_if_i < 1} y_ix_i\Big)\] \[w \leftarrow w - \eta w + \eta\,\text{grad\_w}\]

and

\[b \leftarrow b - \eta\nabla_{b}\operatorname{hingeloss}\] \[b \leftarrow b + \eta\,\text{grad\_b}\]
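The subgradient formulas above can be sanity-checked with finite differences; the sample, weights, and $C$ below are made-up values chosen so that the margin is violated (away from the hinge's kink):

```python
import numpy as np

# Numerical check of the analytic subgradient -C * y_i * x_i used above.
C = 1.0
x_i, y_i = np.array([1.0, 2.0]), 1.0
w, b = np.array([0.1, -0.2]), 0.0          # chosen so that y_i * f_i < 1

def loss(w):
    return C * max(0.0, 1.0 - y_i * (np.dot(w, x_i) + b))

grad_analytic = -C * y_i * x_i
eps = 1e-6
grad_numeric = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(grad_analytic, grad_numeric)         # the two should agree closely
```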

Now let’s create a random dataset (binary classification)

import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = datasets.make_blobs(
      n_samples = 100,
      n_features = 2,
      centers = 2,
      cluster_std = 1,
      random_state=67
    )

# Relabel classes {0, 1} as {-1, +1} for binary SVM
y = np.where(y == 0, -1, 1)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=228)

Training our SVM

from svm import SVM
svm = SVM()
w, b, losses = svm.fit(X_train, y_train)

Producing the predictions and checking

preds = svm.predict(X_test)

# Loss
err = losses.pop()

print("Loss:", err)
print("Prediction:", preds)
print("Accuracy:", accuracy_score(preds, y_test))
print("w, b:", [w, b])

Output[1]

Visualizing SVM on the test set

import matplotlib.pyplot as plt
def visualize_svm():
   def get_hyperplane_value(x, w, b, offset):
      # Solve w0*x0 + w1*x1 + b = offset for the x1 coordinate
      return (offset - b - w[0][0] * x) / w[0][1]

   fig = plt.figure()
   ax = fig.add_subplot(1,1,1)
   plt.scatter(X_test[:, 0], X_test[:, 1], marker="o", c=y_test)

   x0_1 = np.amin(X_test[:, 0])
   x0_2 = np.amax(X_test[:, 0])

   x1_1 = get_hyperplane_value(x0_1, w, b, 0)
   x1_2 = get_hyperplane_value(x0_2, w, b, 0)

   x1_1_m = get_hyperplane_value(x0_1, w, b, -1)
   x1_2_m = get_hyperplane_value(x0_2, w, b, -1)

   x1_1_p = get_hyperplane_value(x0_1, w, b, 1)
   x1_2_p = get_hyperplane_value(x0_2, w, b, 1)

   ax.plot([x0_1, x0_2], [x1_1, x1_2], "y--")
   ax.plot([x0_1, x0_2], [x1_1_m, x1_2_m], "k")
   ax.plot([x0_1, x0_2], [x1_1_p, x1_2_p], "k")

   x1_min = np.amin(X[:, 1])
   x1_max = np.amax(X[:, 1])
   ax.set_ylim([x1_min - 3, x1_max + 3])

   plt.show()
visualize_svm()

Output[2]


I hope this article works for you as an aid to understanding the SVM algorithm under the hood. Thank you for reading <3 !!
