Calculus

1 Calculus for Machine Learning

This section covers core calculus concepts used in machine learning — especially in optimization, backpropagation, and probability.

1.1 Completing the Square

A general method for rewriting quadratic expressions:

\[ x^2 - 2bx + c = \left(x - b\right)^2 + c - b^2 \]

This transformation is widely used in:

  • Gaussian probability density derivations
  • KL divergence simplification
  • Optimization of quadratic objectives
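
As a quick worked instance (coefficients chosen arbitrarily for illustration), take \(b = 3\) and \(c = 11\):

\[ x^2 - 6x + 11 = (x - 3)^2 + 11 - 9 = (x - 3)^2 + 2 \]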

1.2 Derivatives

The derivative of a function measures the rate of change:

\[ \frac{d}{dx} f(x) \]

Common derivatives:

  • \(\frac{d}{dx}(x^n) = nx^{n-1}\) (power rule)
  • \(\frac{d}{dx}(e^x) = e^x\)
  • \(\frac{d}{dx}(\ln x) = \frac{1}{x}\)
  • \(\frac{d}{dx}(\sin x) = \cos x\)
  • \(\frac{d}{dx}(\cos x) = -\sin x\)
  • \(\frac{d}{dx}(\tanh x) = 1 - \tanh^2 x\) (used in activation functions)

Rules:

  • Product rule: \((fg)' = f'g + fg'\)
  • Quotient rule: \(\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}\)
  • Chain rule: \(\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)\)
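
A minimal numerical sanity check of the chain rule, as a sketch assuming NumPy is available (the function names and test point are illustrative):

```python
import numpy as np

def g(x):                # inner function
    return x**2

def f(u):                # outer function
    return np.sin(u)

def chain_rule(x):
    # d/dx sin(x^2) = cos(x^2) * 2x
    return np.cos(g(x)) * 2 * x

def central_diff(h, x, eps=1e-6):
    return (h(x + eps) - h(x - eps)) / (2 * eps)

x = 1.3
print(chain_rule(x))                          # analytic derivative
print(central_diff(lambda t: f(g(t)), x))     # numerical check: should agree
```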

1.3 Partial Derivatives

Used when dealing with multivariable functions:

\[ \frac{\partial f}{\partial x}, \quad \frac{\partial f}{\partial y} \]

Example: For \(f(x,y) = x^2y + 3y^2\):

  • \(\frac{\partial f}{\partial x} = 2xy\)
  • \(\frac{\partial f}{\partial y} = x^2 + 6y\)

ML relevance: Computing gradients for each parameter in neural networks
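
A small sketch (assuming NumPy; the helper names and test point are illustrative) comparing the analytic partials of the example above against central differences:

```python
import numpy as np

def f(x, y):
    return x**2 * y + 3 * y**2

def analytic_grad(x, y):
    return np.array([2 * x * y, x**2 + 6 * y])   # [df/dx, df/dy]

def numeric_grad(x, y, eps=1e-6):
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return np.array([dfdx, dfdy])

print(analytic_grad(2.0, -1.0))   # [-4. -2.]
print(numeric_grad(2.0, -1.0))    # should match to ~6 decimals
```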

1.4 Gradient Vector

The gradient is a vector of all partial derivatives:

\[ \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \]

Properties:

  • Points in direction of steepest ascent
  • Magnitude indicates steepness
  • Perpendicular to level curves/surfaces

ML relevance: Gradient descent uses \(-\nabla f\) to minimize loss functions

1.5 Jacobian Matrix

For a vector-valued function \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian is:

\[ J_{ij} = \frac{\partial f_i}{\partial x_j} \]

Dimensions: \(m \times n\) matrix

Example: For \(\mathbf{f}(x,y) = [x^2y, xy^2]\):

\[ J = \begin{bmatrix} 2xy & x^2 \\ y^2 & 2xy \end{bmatrix} \]

ML relevance:

  • Backpropagation through layers
  • Computing derivatives of vector outputs
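
The example Jacobian above can be checked column by column with central differences; the sketch below assumes NumPy, and the helper names are illustrative:

```python
import numpy as np

def f(v):
    x, y = v
    return np.array([x**2 * y, x * y**2])

def analytic_jacobian(v):
    x, y = v
    return np.array([[2 * x * y, x**2],
                     [y**2, 2 * x * y]])

def numeric_jacobian(func, v, eps=1e-6):
    v = np.asarray(v, dtype=float)
    J = np.zeros((func(v).size, v.size))
    for j in range(v.size):
        step = np.zeros_like(v)
        step[j] = eps
        J[:, j] = (func(v + step) - func(v - step)) / (2 * eps)   # column j
    return J

v = np.array([1.5, -0.5])
print(analytic_jacobian(v))
print(numeric_jacobian(f, v))   # should agree with the analytic 2x2 matrix
```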

1.6 Hessian Matrix

Second-order partial derivatives:

\[ H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \]

Properties:

  • Symmetric matrix (if \(f\) is twice differentiable)
  • Diagonal elements: \(\frac{\partial^2 f}{\partial x_i^2}\)
  • Off-diagonal: mixed partials \(\frac{\partial^2 f}{\partial x_i \partial x_j}\)

Used in curvature analysis and second-order optimization.

ML relevance:

  • Newton’s method optimization
  • Analyzing convergence properties
  • Identifying saddle points vs local minima
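
As a sketch (assuming NumPy, and reusing the earlier example \(f(x,y) = x^2y + 3y^2\); helper names are illustrative), the Hessian can be formed analytically and checked against a finite-difference stencil, which also exhibits its symmetry:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 * y + 3 * y**2

def analytic_hessian(v):
    x, y = v
    # second partials: f_xx = 2y, f_xy = f_yx = 2x, f_yy = 6
    return np.array([[2 * y, 2 * x],
                     [2 * x, 6.0]])

def numeric_hessian(func, v, eps=1e-4):
    v = np.asarray(v, dtype=float)
    n = v.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            # central-difference stencil for d^2 f / (dx_i dx_j)
            H[i, j] = (func(v + ei + ej) - func(v + ei - ej)
                       - func(v - ei + ej) + func(v - ei - ej)) / (4 * eps**2)
    return H

v = np.array([1.0, 2.0])
print(analytic_hessian(v))     # [[4. 2.], [2. 6.]] -- symmetric
print(numeric_hessian(f, v))   # should agree; mixed partials are equal
```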

1.7 Chain Rule (Multivariate)

For composed functions \(f(g(x))\):

\[ \frac{\partial f}{\partial x_i} = \sum_j \frac{\partial f}{\partial g_j} \cdot \frac{\partial g_j}{\partial x_i} \]

Matrix form:

\[ \frac{\partial f}{\partial \mathbf{x}} = \frac{\partial f}{\partial \mathbf{g}} \cdot \frac{\partial \mathbf{g}}{\partial \mathbf{x}} \]

ML relevance: Foundation of backpropagation algorithm
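
A minimal sketch of the matrix form (assuming NumPy; the particular \(f\), \(\mathbf{g}\), and test point are illustrative): the chain-rule product \(\frac{\partial f}{\partial \mathbf{g}} \cdot \frac{\partial \mathbf{g}}{\partial \mathbf{x}}\) matches a direct numerical gradient of the composition:

```python
import numpy as np

def g(x):                        # inner function g: R^2 -> R^2
    return np.array([x[0] * x[1], x[0] + x[1]])

def f(u):                        # outer function f: R^2 -> R
    return u[0]**2 + u[1]

def grad_f_wrt_g(u):             # df/dg, shape (2,)
    return np.array([2 * u[0], 1.0])

def jacobian_g(x):               # dg/dx, shape (2, 2)
    return np.array([[x[1], x[0]],
                     [1.0,  1.0]])

x = np.array([1.5, -2.0])
chain = grad_f_wrt_g(g(x)) @ jacobian_g(x)   # (df/dg) . (dg/dx)

eps = 1e-6
numeric = np.array([
    (f(g(x + d)) - f(g(x - d))) / (2 * eps)
    for d in (np.array([eps, 0.0]), np.array([0.0, eps]))
])
print(chain)     # chain-rule gradient of the composition
print(numeric)   # direct numerical gradient: should agree
```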

1.8 Directional Derivative

Rate of change of \(f\) in direction of unit vector \(\mathbf{u}\):

\[ D_\mathbf{u} f = \nabla f \cdot \mathbf{u} \]

Properties:

  • Maximum when \(\mathbf{u}\) aligned with \(\nabla f\)
  • Zero when \(\mathbf{u}\) perpendicular to \(\nabla f\)

ML relevance: Understanding optimization landscapes

1.9 Taylor Series Expansion

Approximating functions using derivatives:

\[ f(x + \Delta x) \approx f(x) + f'(x)\Delta x + \frac{1}{2}f''(x)\Delta x^2 + \cdots \]

First-order (linear) approximation:

\[ f(x + \Delta x) \approx f(x) + f'(x)\Delta x \]

Second-order (quadratic) approximation:

\[ f(x + \Delta x) \approx f(x) + f'(x)\Delta x + \frac{1}{2}f''(x)\Delta x^2 \]

Multivariate Taylor expansion:

\[ f(\mathbf{x} + \mathbf{\Delta x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^T \mathbf{\Delta x} + \frac{1}{2}\mathbf{\Delta x}^T H \mathbf{\Delta x} \]

ML relevance:

  • Local approximations in optimization
  • Newton’s method derivation
  • Trust region methods
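
A short sketch comparing the first- and second-order approximations of \(e^x\) around \(x_0 = 0\) (assuming NumPy; the expansion point and step sizes are illustrative):

```python
import numpy as np

# First- and second-order Taylor approximations of exp(x) around x0 = 0,
# where f(x0) = f'(x0) = f''(x0) = 1.
x0 = 0.0
f0 = df0 = d2f0 = np.exp(x0)

for dx in (0.1, 0.5, 1.0):
    exact  = np.exp(x0 + dx)
    first  = f0 + df0 * dx
    second = f0 + df0 * dx + 0.5 * d2f0 * dx**2
    print(f"dx={dx}: exact={exact:.4f}  linear={first:.4f}  quadratic={second:.4f}")
# The quadratic approximation stays accurate over a wider range than the linear one.
```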

1.10 Integration

Area under a curve:

\[ \int_a^b f(x) \, dx \]

Common integrals:

  • \(\int x^n \, dx = \frac{x^{n+1}}{n+1} + C\) (for \(n \neq -1\))
  • \(\int e^x \, dx = e^x + C\)
  • \(\int \frac{1}{x} \, dx = \ln|x| + C\)
  • \(\int \sin x \, dx = -\cos x + C\)
  • \(\int \cos x \, dx = \sin x + C\)

ML relevance:

  • Computing expectations in probability
  • Normalizing probability distributions
  • Deriving closed-form solutions
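
As a sketch of the "normalizing probability distributions" point (assuming NumPy; the grid and bounds are illustrative), a simple Riemann sum recovers the Gaussian normalizing constant \(\sqrt{2\pi}\):

```python
import numpy as np

# The unnormalized Gaussian exp(-x^2 / 2) integrates to sqrt(2*pi);
# this is the constant the standard normal density divides by.
x = np.linspace(-8.0, 8.0, 100_001)    # wide grid; the tails are negligible
dx = x[1] - x[0]
integral = np.sum(np.exp(-x**2 / 2)) * dx   # simple Riemann sum

print(integral)             # ~ 2.5066
print(np.sqrt(2 * np.pi))   # 2.5066...
```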

1.11 Integration by Parts

A useful transformation:

\[ \int u \, dv = uv - \int v \, du \]

ML relevance:

  • Deriving variational bounds
  • Evidence lower bound (ELBO) derivations

1.12 Fundamental Theorem of Calculus

Links differentiation and integration:

\[ \frac{d}{dx} \int_a^x f(t) \, dt = f(x) \]

\[ \int_a^b f'(x) \, dx = f(b) - f(a) \]

1.13 Optimization Conditions

1.13.1 First-Order Necessary Condition

At a local minimum/maximum \(x^*\):

\[ \nabla f(x^*) = \mathbf{0} \]

Critical points: Where gradient vanishes

1.13.2 Second-Order Sufficient Condition

For a local minimum at \(x^*\):

  • \(\nabla f(x^*) = \mathbf{0}\) (first-order condition)
  • Hessian \(H(x^*)\) is positive definite (all eigenvalues > 0)

For a local maximum at \(x^*\):

  • \(\nabla f(x^*) = \mathbf{0}\) (first-order condition)
  • Hessian \(H(x^*)\) is negative definite (all eigenvalues < 0)

Saddle point: Hessian has both positive and negative eigenvalues

ML relevance: Analyzing loss function landscapes
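
A minimal sketch (assuming NumPy) that classifies critical points from Hessian eigenvalues; the two example Hessians are written down by hand for the illustrative functions named in the comments:

```python
import numpy as np

def classify_critical_point(hessian):
    # Classify a critical point from the eigenvalues of the Hessian there.
    eigvals = np.linalg.eigvalsh(hessian)   # Hessian is symmetric
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    if np.any(eigvals > 0) and np.any(eigvals < 0):
        return "saddle point"
    return "inconclusive (zero eigenvalue present)"

# f(x, y) = x^2 + y^2 has Hessian [[2, 0], [0, 2]] at its critical point (0, 0)
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))   # local minimum

# f(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]] at (0, 0)
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle point
```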

1.14 Convexity

A function \(f\) is convex if:

\[ f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y) \]

for all \(x, y\) and \(\lambda \in [0,1]\)

Equivalent conditions:

  • \(f''(x) \geq 0\) (for univariate functions)
  • Hessian \(H\) is positive semi-definite (for multivariate)
  • Any local minimum is a global minimum

Strictly convex: the defining inequality is strict (\(<\) instead of \(\leq\)) for all \(x \neq y\) and \(\lambda \in (0,1)\)

ML relevance:

  • Guarantees for optimization convergence
  • Linear regression and logistic regression have convex objectives
  • Neural network loss functions are generally non-convex
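
A brute-force sketch of the defining inequality (assuming NumPy; the grids and test functions are illustrative): \(x^2\) produces no violations, while \(\sin x\) does:

```python
import numpy as np

def convexity_violations(f, points, lams, tol=1e-12):
    # Count violations of f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    bad = 0
    for x in points:
        for y in points:
            for lam in lams:
                lhs = f(lam * x + (1 - lam) * y)
                rhs = lam * f(x) + (1 - lam) * f(y)
                if lhs > rhs + tol:
                    bad += 1
    return bad

points = np.linspace(-3, 3, 25)
lams = np.linspace(0, 1, 11)

print(convexity_violations(lambda t: t**2, points, lams))   # 0: x^2 is convex
print(convexity_violations(np.sin, points, lams))           # > 0: sin is not convex
```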

1.15 Lagrange Multipliers

For constrained optimization:

\[ \min f(\mathbf{x}) \quad \text{subject to} \quad g(\mathbf{x}) = 0 \]

Method: Solve:

\[ \nabla f(\mathbf{x}) = \lambda \nabla g(\mathbf{x}) \]

and \(g(\mathbf{x}) = 0\)

Lagrangian:

\[ \mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) - \lambda g(\mathbf{x}) \]
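
A small worked instance (the objective and constraint are chosen for illustration): minimize \(f(x,y) = x^2 + y^2\) subject to \(g(x,y) = x + y - 1 = 0\).

\[ \nabla f = \lambda \nabla g \;\Rightarrow\; (2x, 2y) = \lambda(1, 1) \;\Rightarrow\; x = y = \tfrac{\lambda}{2} \]

Substituting into \(x + y = 1\) gives \(x = y = \tfrac{1}{2}\) and \(\lambda = 1\).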

ML relevance:

  • Support Vector Machines (SVM)
  • Constrained optimization problems
  • Dual formulations

1.16 L’Hôpital’s Rule

For indeterminate forms \(\frac{0}{0}\) or \(\frac{\infty}{\infty}\):

\[ \lim_{x \to c} \frac{f(x)}{g(x)} = \lim_{x \to c} \frac{f'(x)}{g'(x)} \]

ML relevance: Analyzing limiting behavior of loss functions

1.17 Common Activation Function Derivatives

Sigmoid: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)

\[ \sigma'(x) = \sigma(x)(1 - \sigma(x)) \]

Tanh: \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)

\[ \tanh'(x) = 1 - \tanh^2(x) \]

ReLU: \(\text{ReLU}(x) = \max(0, x)\)

\[ \text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \\ \text{undefined} & \text{if } x = 0 \end{cases} \]

Softmax: For vector \(\mathbf{z}\), \(\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\)

\[ \frac{\partial \text{softmax}(\mathbf{z})_i}{\partial z_j} = \text{softmax}(\mathbf{z})_i (\delta_{ij} - \text{softmax}(\mathbf{z})_j) \]

where \(\delta_{ij}\) is the Kronecker delta
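
A minimal sketch implementing the sigmoid and softmax derivatives above and checking them with central differences (assuming NumPy; the test inputs are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)   # J_ij = s_i * (delta_ij - s_j)

eps = 1e-6

# Sigmoid derivative vs. central difference
x = 0.7
print(sigmoid_prime(x))
print((sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps))   # should agree

# First column of the softmax Jacobian vs. perturbing the first logit
z = np.array([1.0, 2.0, 0.5])
d = np.array([eps, 0.0, 0.0])
print(softmax_jacobian(z)[:, 0])
print((softmax(z + d) - softmax(z - d)) / (2 * eps))       # should agree
```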


2 ML Applications Summary

2.1 Gradient Descent

Update rule:

\[ \mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f(\mathbf{x}_t) \]

where \(\alpha\) is the learning rate

Variants:

  • Stochastic Gradient Descent (SGD): use the gradient from a single sample
  • Mini-batch GD: use the gradient from a small batch
  • Momentum: \(\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla f(\mathbf{x}_t)\), then update \(\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{v}_{t+1}\)
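
A minimal sketch of the plain update rule on a toy least-squares objective (assuming NumPy; the matrix, learning rate, and iteration count are illustrative, not tuned):

```python
import numpy as np

# Minimize f(x) = 0.5 * ||A x - b||^2 with plain gradient descent.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0])

def grad(x):
    # gradient of 0.5 * ||A x - b||^2 is A^T (A x - b)
    return A.T @ (A @ x - b)

x = np.zeros(2)
alpha = 0.05                      # learning rate
for _ in range(500):
    x = x - alpha * grad(x)       # x_{t+1} = x_t - alpha * grad f(x_t)

print(x)
print(np.linalg.solve(A, b))      # exact minimizer, for comparison
```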

2.2 Backpropagation

Chain rule application:

For loss \(L\) and layer outputs \(\mathbf{z}^{(l)}\):

\[ \frac{\partial L}{\partial \mathbf{w}^{(l)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \cdot \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{w}^{(l)}} \]

Recursive gradient flow:

\[ \frac{\partial L}{\partial \mathbf{z}^{(l)}} = \frac{\partial L}{\partial \mathbf{z}^{(l+1)}} \cdot \frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{z}^{(l)}} \]
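
A minimal sketch of these two equations on a one-hidden-layer regression network (assuming NumPy; the shapes, initialization, and squared-error loss are illustrative), with a finite-difference check on one weight:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(3,))          # input
y  = np.array([1.0])                # target
W1 = rng.normal(size=(4, 3)) * 0.5  # layer 1 weights
W2 = rng.normal(size=(1, 4)) * 0.5  # layer 2 weights

# Forward pass
z1 = W1 @ x                         # pre-activation, layer 1
h1 = np.tanh(z1)                    # activation
z2 = W2 @ h1                        # output
L  = 0.5 * np.sum((z2 - y) ** 2)    # squared-error loss

# Backward pass: dL/dz2 -> dL/dW2, then propagate back to dL/dz1 -> dL/dW1
dL_dz2 = z2 - y                               # (1,)
dL_dW2 = np.outer(dL_dz2, h1)                 # (1, 4)
dL_dh1 = W2.T @ dL_dz2                        # (4,)
dL_dz1 = dL_dh1 * (1 - np.tanh(z1) ** 2)      # tanh'(z) = 1 - tanh^2(z)
dL_dW1 = np.outer(dL_dz1, x)                  # (4, 3)

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp = 0.5 * np.sum((W2 @ np.tanh(W1p @ x) - y) ** 2)
print(dL_dW1[0, 0], (Lp - L) / eps)   # should agree to several decimals
```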

2.3 Loss Functions and Their Derivatives

Mean Squared Error (MSE):

\[ L = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

\[ \frac{\partial L}{\partial \hat{y}_i} = -\frac{2}{n}(y_i - \hat{y}_i) \]

Cross-Entropy Loss (binary):

\[ L = -\frac{1}{n}\sum_{i=1}^n [y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)] \]

\[ \frac{\partial L}{\partial \hat{y}_i} = -\frac{1}{n}\left[\frac{y_i}{\hat{y}_i} - \frac{1-y_i}{1-\hat{y}_i}\right] \]

Categorical Cross-Entropy (with softmax), summed over \(K\) classes for a single sample:

\[ L = -\sum_{i=1}^K y_i \log(\hat{y}_i) \]

Combined softmax + cross-entropy derivative simplifies to:

\[ \frac{\partial L}{\partial z_i} = \hat{y}_i - y_i \]
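
A quick numerical sketch of this simplification for a one-hot target (assuming NumPy; the logits and target are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    # y is a one-hot target vector; z are pre-softmax logits
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])     # one-hot target

analytic = softmax(z) - y         # claimed simplification: dL/dz = y_hat - y

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * e, y) - cross_entropy(z - eps * e, y)) / (2 * eps)
    for e in np.eye(3)
])
print(analytic)
print(numeric)                    # should agree to ~6 decimals
```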

2.4 Regularization Terms

L2 Regularization (Ridge):

\[ R(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|_2^2 = \frac{\lambda}{2}\sum_i w_i^2 \]

\[ \frac{\partial R}{\partial w_i} = \lambda w_i \]

L1 Regularization (Lasso):

\[ R(\mathbf{w}) = \lambda\|\mathbf{w}\|_1 = \lambda\sum_i |w_i| \]

\[ \frac{\partial R}{\partial w_i} = \lambda \cdot \text{sign}(w_i) \quad \text{for } w_i \neq 0 \]

(\(|w_i|\) is not differentiable at \(w_i = 0\); in practice a subgradient in \([-\lambda, \lambda]\) is used there.)

2.5 Quick Reference Table

| Concept | Formula | ML Application |
|---|---|---|
| Gradient | \(\nabla f\) | Gradient descent direction |
| Jacobian | \(J_{ij} = \frac{\partial f_i}{\partial x_j}\) | Backpropagation |
| Hessian | \(H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\) | Second-order optimization |
| Chain Rule | \(\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}\) | Backpropagation |
| Taylor Expansion | \(f(x+\Delta x) \approx f(x) + f'(x)\Delta x\) | Local approximation |
| Convexity | \(f''(x) \geq 0\) | Optimization guarantees |
| Lagrange Multipliers | \(\nabla f = \lambda \nabla g\) | Constrained optimization |
| Sigmoid Derivative | \(\sigma'(x) = \sigma(x)(1-\sigma(x))\) | Neural network activations |