Probability

1 Probability for Machine Learning

Probability is central to machine learning — particularly in generative models, classification, and Bayesian methods.

1.1 Random Variables and Distributions

1.1.1 Types of Random Variables

  • Discrete: Bernoulli, Binomial, Poisson, Categorical
  • Continuous: Normal (Gaussian), Exponential, Uniform, Beta

Notation:

  • \(P(X = x)\) (discrete): Probability mass function (PMF)
  • \(f_X(x)\) (continuous): Probability density function (PDF)
  • \(F_X(x) = P(X \leq x)\): Cumulative distribution function (CDF)
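
To make the notation concrete, here is a minimal sketch (assuming `numpy`/`scipy` are available; the particular distributions and parameters are arbitrary choices for illustration):

```python
# Minimal sketch: PMF/PDF/CDF evaluation with scipy.stats (illustrative only).
from scipy import stats

# Discrete: Binomial(n=10, p=0.3)
binom = stats.binom(n=10, p=0.3)
print(binom.pmf(3))   # PMF: P(X = 3)
print(binom.cdf(3))   # CDF: P(X <= 3)

# Continuous: Normal(mu=0, sigma=1)
norm = stats.norm(loc=0.0, scale=1.0)
print(norm.pdf(0.0))  # PDF: density f_X(0), not a probability
print(norm.cdf(1.96)) # CDF: P(X <= 1.96), roughly 0.975
```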

1.2 Expectation and Variance

1.2.1 Expected Value

The average or mean value:

Discrete: \[ \mathbb{E}[X] = \sum_x x P(X = x) \]

Continuous: \[ \mathbb{E}[X] = \int x f_X(x) \, dx \]

Properties:

  • Linearity: \(\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]\)
  • \(\mathbb{E}[c] = c\) for constant \(c\)
  • \(\mathbb{E}[g(X)] = \sum_x g(x)P(X=x)\) or \(\int g(x)f_X(x)dx\)
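
A quick numerical check of these definitions, as a sketch assuming `numpy` (the fair-die example is an illustrative choice, not from the text above):

```python
import numpy as np

# Fair six-sided die: E[X] = sum_x x P(X = x)
x = np.arange(1, 7)
p = np.full(6, 1 / 6)
E_X = np.sum(x * p)            # 3.5

# E[g(X)] = sum_x g(x) P(X = x), here g(x) = x^2
E_X2 = np.sum(x**2 * p)        # ~15.17

# Linearity: E[aX + b] = a E[X] + b
a, b = 2.0, 1.0
assert np.isclose(np.sum((a * x + b) * p), a * E_X + b)
print(E_X, E_X2)
```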

1.2.2 Variance

Measure of spread around the mean:

\[ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]

Properties:

  • \(\text{Var}(aX + b) = a^2 \text{Var}(X)\)
  • For independent \(X, Y\): \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\)

Standard Deviation: \(\sigma = \sqrt{\text{Var}(X)}\)
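
These properties are easy to confirm by simulation; a minimal sketch, assuming `numpy` and arbitrary values for \(a\) and \(b\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=1_000_000)
Y = rng.exponential(scale=1.5, size=1_000_000)  # independent of X

a, b = 2.0, 5.0
# Var(aX + b) = a^2 Var(X); the shift b has no effect on spread
print(np.var(a * X + b), a**2 * np.var(X))
# For independent X, Y: Var(X + Y) ~= Var(X) + Var(Y)
print(np.var(X + Y), np.var(X) + np.var(Y))
```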

1.2.3 Covariance and Correlation

Covariance: Measures linear relationship between two variables

\[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \]

Properties:

  • \(\text{Cov}(X, X) = \text{Var}(X)\)
  • \(\text{Cov}(X, Y) = \text{Cov}(Y, X)\) (symmetric)
  • If \(X, Y\) independent: \(\text{Cov}(X, Y) = 0\) (converse not always true)

Correlation coefficient:

\[ \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \]

where \(-1 \leq \rho \leq 1\)

ML relevance: Feature correlation analysis, covariance matrices in PCA
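
As an illustration, a sketch (assuming `numpy`; the data-generating choices are arbitrary) that estimates covariance and correlation, including a dependent-but-uncorrelated case showing why the converse above fails:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.5, size=10_000)  # linearly related to x

print(np.cov(x, y)[0, 1])       # sample covariance Cov(X, Y)
print(np.corrcoef(x, y)[0, 1])  # correlation rho, close to +1 here

# Dependent but uncorrelated: Z = X^2 with symmetric X
z = x**2
print(np.corrcoef(x, z)[0, 1])  # near 0 even though z is a function of x
```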

1.3 Joint, Marginal, and Conditional Probability

1.3.1 Joint Probability

Probability of multiple events occurring together:

\[ P(X = x, Y = y) \quad \text{or} \quad f_{X,Y}(x, y) \]

1.3.2 Marginal Probability

Probability of one variable, ignoring others:

Discrete: \[ P(X = x) = \sum_y P(X = x, Y = y) \]

Continuous: \[ f_X(x) = \int f_{X,Y}(x, y) \, dy \]

1.3.3 Conditional Probability

Probability of \(A\) given that \(B\) has occurred:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A, B)}{P(B)} \]

Chain rule: \[ P(A, B) = P(A \mid B) P(B) = P(B \mid A) P(A) \]

General chain rule (for multiple variables): \[ P(X_1, X_2, \ldots, X_n) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \ldots, X_{n-1}) \]
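
The following sketch works through joint, marginal, and conditional probabilities on a small made-up joint table, and checks the chain rule numerically (assumes `numpy`):

```python
import numpy as np

# Made-up joint PMF P(X, Y) over X in {0,1} (rows), Y in {0,1,2} (cols)
P = np.array([[0.10, 0.20, 0.10],
              [0.25, 0.15, 0.20]])
assert np.isclose(P.sum(), 1.0)

P_X = P.sum(axis=1)        # marginal: P(X = x) = sum_y P(x, y)
P_Y = P.sum(axis=0)        # marginal over X

# Conditional: P(Y | X = x) = P(x, y) / P(x)
P_Y_given_X = P / P_X[:, None]

# Chain rule check: P(x, y) = P(y | x) P(x)
assert np.allclose(P, P_Y_given_X * P_X[:, None])
print(P_X, P_Y)
```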

1.4 Independence

1.4.1 Statistical Independence

\(X\) and \(Y\) are independent if:

\[ P(X, Y) = P(X) P(Y) \]

Equivalent conditions:

  • \(P(X \mid Y) = P(X)\)
  • \(P(Y \mid X) = P(Y)\)

Independence also implies \(\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]\), but this alone does not imply independence: uncorrelated variables need not be independent.

1.4.2 Conditional Independence

\(X\) and \(Y\) are conditionally independent given \(Z\) if:

\[ P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z) \]

Notation: \(X \perp Y \mid Z\)

ML relevance:

  • Naive Bayes assumes features are conditionally independent given class
  • Graphical models encode conditional independence

1.5 Bayes’ Theorem

Bayes’ Theorem lets us update what we believe about a situation after seeing new evidence.

The formula:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Components:

  • \(P(A \mid B)\): Posterior — probability of \(A\) given that \(B\) happened
  • \(P(B \mid A)\): Likelihood — how likely \(B\) is if \(A\) is true
  • \(P(A)\): Prior — our initial belief about \(A\)
  • \(P(B)\): Evidence — total probability of \(B\) happening under all possibilities

Expanded form (Law of Total Probability):

\[ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)} \]

1.5.1 What is Bayes’ Theorem really doing?

Bayes’ Theorem is about updating your belief:

  1. Start with your prior belief \(P(A)\)
  2. Then observe new evidence \(B\)
  3. Update your belief using how likely that evidence is under \(A\) (\(P(B \mid A)\))

1.5.2 Example: Medical Test

Suppose a disease affects 1% of the population.

You take a test with:

  • \(P(\text{Positive} \mid \text{Disease}) = 0.99\) (true positive rate)
  • \(P(\text{Positive} \mid \text{No Disease}) = 0.05\) (false positive rate)

You test positive. What is the chance you actually have the disease?

We want to compute:

\[ P(\text{Disease} \mid \text{Positive}) = \frac{P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})} \]

Let’s plug in the values:

  • \(P(\text{Disease}) = 0.01\)
  • \(P(\text{No Disease}) = 0.99\)
  • \(P(\text{Positive}) = 0.99 \cdot 0.01 + 0.05 \cdot 0.99 = 0.0594\)

So:

\[ P(\text{Disease} \mid \text{Positive}) = \frac{0.99 \cdot 0.01}{0.0594} \approx 0.167 \]

Surprising result: Even after a positive test, the chance of having the disease is only about 16.7%: because the disease is rare, false positives outnumber true positives.
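
The same computation in code, as a one-screen sketch using the numbers from this example:

```python
# Bayes' theorem for the medical-test example above.
p_disease = 0.01
p_pos_given_disease = 0.99     # true positive rate
p_pos_given_healthy = 0.05     # false positive rate

# Evidence via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = p_pos_given_disease * p_disease / p_pos
print(round(p_pos, 4), round(posterior, 4))  # 0.0594, ~0.1667
```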

1.5.3 Why It Matters

Bayes’ Theorem is widely used in:

  • Medical diagnosis
  • Spam detection
  • Probabilistic machine learning (e.g., Naive Bayes classifiers)
  • Updating beliefs in AI models
  • A/B testing and experimental design

1.5.4 Proportional Version of Bayes’ Rule

In many machine learning applications, the denominator \(P(B)\) is the same for all outcomes and can be ignored:

\[ P(A \mid B) \propto P(B \mid A) \cdot P(A) \]

This version is often used for ranking outcomes instead of calculating exact probabilities.

ML relevance: Naive Bayes classification, Bayesian inference

1.6 Law of Total Probability

For a partition of the sample space \(\{A_1, A_2, \ldots, A_n\}\):

\[ P(B) = \sum_{i=1}^n P(B \mid A_i) P(A_i) \]

Continuous version:

\[ P(B) = \int P(B \mid A = a) \, f_A(a) \, da \]

where \(f_A\) is the density of \(A\)

ML relevance: Computing marginal probabilities, evidence in Bayesian inference

1.7 Probability Distributions

Probability distributions describe how values of a random variable are distributed.

1.7.1 Discrete Distributions

1.7.1.1 Bernoulli Distribution

Single binary trial (success/failure):

\[ P(X = 1) = p, \quad P(X = 0) = 1 - p \]

  • \(\mathbb{E}[X] = p\)
  • \(\text{Var}(X) = p(1-p)\)

ML relevance: Binary classification outputs

1.7.1.2 Binomial Distribution

Number of successes in \(n\) Bernoulli trials:

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]

  • \(\mathbb{E}[X] = np\)
  • \(\text{Var}(X) = np(1-p)\)

1.7.1.3 Categorical (Multinoulli) Distribution

Generalization of Bernoulli to \(k\) categories:

\[ P(X = i) = p_i, \quad \sum_{i=1}^k p_i = 1 \]

ML relevance: Multi-class classification

1.7.1.4 Poisson Distribution

Number of events in fixed interval:

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \]

  • \(\mathbb{E}[X] = \lambda\)
  • \(\text{Var}(X) = \lambda\)

ML relevance: Count data, rare events
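
A sanity check of the stated means and variances by sampling, as a sketch assuming `scipy`/`numpy` (parameter values are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_samples = 200_000

# Bernoulli(p): E = p, Var = p(1-p)
s = stats.bernoulli(p=0.3).rvs(n_samples, random_state=rng)
print(s.mean(), s.var())       # ~0.3, ~0.21

# Binomial(n, p): E = np, Var = np(1-p)
s = stats.binom(n=10, p=0.3).rvs(n_samples, random_state=rng)
print(s.mean(), s.var())       # ~3.0, ~2.1

# Poisson(lambda): E = Var = lambda
s = stats.poisson(mu=4.0).rvs(n_samples, random_state=rng)
print(s.mean(), s.var())       # ~4.0, ~4.0
```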

1.7.2 Continuous Distributions

1.7.2.1 Uniform Distribution

Equal probability over interval \([a, b]\):

\[ f(x) = \frac{1}{b - a}, \quad a \leq x \leq b \]

  • \(\mathbb{E}[X] = \frac{a + b}{2}\)
  • \(\text{Var}(X) = \frac{(b-a)^2}{12}\)

ML relevance: Random initialization, data augmentation

1.7.2.2 Exponential Distribution

Time between events in Poisson process:

\[ f(x) = \lambda e^{-\lambda x}, \quad x \geq 0 \]

  • \(\mathbb{E}[X] = \frac{1}{\lambda}\)
  • \(\text{Var}(X) = \frac{1}{\lambda^2}\)

1.7.2.3 Normal (Gaussian) Distribution

PDF of normal distribution in 1D:

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

Notation: \(X \sim \mathcal{N}(\mu, \sigma^2)\)

Parameters:

  • \(\mu\): mean
  • \(\sigma^2\): variance
  • \(\sigma\): standard deviation

Properties:

  • Symmetric around \(\mu\)
  • 68-95-99.7 rule: ~68% within 1σ, ~95% within 2σ, ~99.7% within 3σ
  • Sum of independent Gaussians is Gaussian
  • Central Limit Theorem: sum of many i.i.d. variables → Gaussian

Standard Normal: \(\mathcal{N}(0, 1)\)

\[ Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1) \]
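
A short sketch (assuming `scipy`/`numpy`) verifying the 68-95-99.7 rule and the standardization identity:

```python
import numpy as np
from scipy import stats

# 68-95-99.7 rule: P(|Z| <= k) for k = 1, 2, 3
for k in (1, 2, 3):
    print(k, stats.norm.cdf(k) - stats.norm.cdf(-k))

# Standardization: X ~ N(5, 2^2)  ->  Z = (X - mu) / sigma ~ N(0, 1)
rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=100_000)
Z = (X - 5.0) / 2.0
print(Z.mean(), Z.std())  # ~0, ~1
```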

1.7.2.4 Multivariate Normal Distribution

The normal (Gaussian) distribution can be extended to multiple variables — for example, when \(\mathbf{x}\) is a vector instead of just a number.

When \(\mathbf{x}\) is a vector in \(\mathbb{R}^d\) (i.e., a list of \(d\) values), the multivariate Gaussian looks like:

\[ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right) \]

What do all these symbols mean?

  • \(\mathbf{x}\): A \(d\)-dimensional vector (e.g., an image flattened into a 1D array)
  • \(\boldsymbol{\mu}\): The mean vector (the “center” of the distribution)
  • \(\boldsymbol{\Sigma}\): The \(d \times d\) covariance matrix, which captures the spread and correlation of the variables
  • \(|\boldsymbol{\Sigma}|\): The determinant of \(\boldsymbol{\Sigma}\), representing the volume scaling
  • \((2\pi)^{d/2}\): Comes from extending the 1D normalizing constant to \(d\) dimensions
  • \((\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\): A generalized squared distance from \(\mathbf{x}\) to the mean — see below!

What’s the Mahalanobis Distance?

The term \((\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\) measures how far \(\mathbf{x}\) is from the mean \(\boldsymbol{\mu}\), taking into account how spread out the data is in different directions. It’s like Euclidean distance, but adjusted for direction-dependent variance: the more variance the data has in a direction, the less a deviation in that direction contributes to the distance.

Special Case: Spherical Gaussian

If all variables are independent and have equal variance \(\sigma^2\), then \(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\) and the formula simplifies to:

\[ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \sigma^2 \mathbf{I}) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \boldsymbol{\mu}\|^2\right) \]

This looks more like the familiar 1D bell curve — but extended to \(d\) dimensions.
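
To connect the formula to code, a sketch (assuming `scipy`/`numpy`; the mean and covariance are made up) that evaluates the density and the Mahalanobis term directly:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])  # made-up covariance with correlation

x = np.array([1.0, 1.0])
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))

# Mahalanobis distance squared: (x - mu)^T Sigma^{-1} (x - mu)
d = x - mu
m2 = d @ np.linalg.solve(Sigma, d)
print(m2)  # compare with the Euclidean ||x - mu||^2 = 2.0
```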

ML relevance:

  • Gaussian Mixture Models (GMM)
  • Latent variable models (VAE)
  • Gaussian processes
  • Discriminant analysis

1.7.2.5 Beta Distribution

Probability distribution over probabilities \([0, 1]\):

\[ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \leq x \leq 1 \]

where \(B(\alpha, \beta)\) is the beta function

ML relevance: Bayesian inference (conjugate prior for Bernoulli/Binomial)

1.7.2.6 Gamma Distribution

Generalization of exponential distribution:

\[ f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x \geq 0 \]

ML relevance: Prior distributions in Bayesian models

1.8 Maximum Likelihood Estimation (MLE)

Used to estimate parameters of models:

\[ \hat{\theta}_{\text{MLE}} = \arg\max_\theta P(\mathcal{D} \mid \theta) \]

Equivalently (using log-likelihood):

\[ \hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log P(x_i \mid \theta) \]

Why log-likelihood?

  • Products become sums (easier to optimize)
  • Numerically more stable
  • Same maximum as likelihood

Example: MLE for Gaussian

Given data \(\{x_1, \ldots, x_n\}\) from \(\mathcal{N}(\mu, \sigma^2)\):

\[ \hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{(sample mean)} \]

\[ \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2 \quad \text{(sample variance)} \]

Note: the MLE variance divides by \(n\), not \(n - 1\), so it is a biased estimator.

ML relevance: Training probabilistic models, logistic regression
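
In code, the closed-form estimators are one-liners; note that `ddof=0` gives the MLE (biased) variance. A sketch assuming `numpy`, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu_mle = data.mean()
var_mle = data.var(ddof=0)   # MLE divides by n (ddof=0), not n-1

print(mu_mle, var_mle)       # ~3.0, ~4.0
```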

1.9 Maximum A Posteriori (MAP) Estimation

Incorporates prior knowledge:

\[ \hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid \mathcal{D}) = \arg\max_\theta P(\mathcal{D} \mid \theta) P(\theta) \]

Using log:

\[ \hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log P(\mathcal{D} \mid \theta) + \log P(\theta)\right] \]

Difference from MLE: MAP includes prior \(P(\theta)\)

ML relevance:

  • Regularization (L2 = Gaussian prior, L1 = Laplace prior)
  • Bayesian neural networks
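
As a concrete instance, the sketch below (assuming `numpy`) computes the MAP estimate of a Gaussian mean with known variance under a Gaussian prior. This is the conjugate case, where the posterior is itself Gaussian and the MAP estimate has a closed form; all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 4.0                        # known data variance (assumed)
data = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)

mu0, tau2 = 0.0, 1.0                # Gaussian prior N(mu0, tau2) on the mean
n, xbar = len(data), data.mean()

# Conjugate Gaussian case: posterior is Gaussian, so MAP = posterior mean
mu_map = (mu0 / tau2 + n * xbar / sigma2) / (1 / tau2 + n / sigma2)
print(xbar, mu_map)  # MAP is shrunk from the MLE toward the prior mean mu0
```

Note how the prior acts like a regularizer: with few data points the estimate stays near \(\mu_0\), and as \(n\) grows it approaches the MLE.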

1.10 KL Divergence

Measures how one distribution diverges from another:

\[ D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \]

Continuous:

\[ D_{\text{KL}}(P \| Q) = \int P(x) \log \frac{P(x)}{Q(x)} \, dx \]

Properties:

  • \(D_{\text{KL}}(P \| Q) \geq 0\) (Gibbs’ inequality)
  • \(D_{\text{KL}}(P \| Q) = 0\) iff \(P = Q\)
  • Not symmetric: \(D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)\)
  • Not a true distance metric

Equivalent form:

\[ D_{\text{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \mathbb{E}_{x \sim P}[\log P(x)] - \mathbb{E}_{x \sim P}[\log Q(x)] \]

\[ = -H(P) - \mathbb{E}_{x \sim P}[\log Q(x)] \]

where \(H(P)\) is the entropy of \(P\)

ML relevance:

  • Variational inference
  • Training VAEs (ELBO)
  • Information-theoretic learning
  • Model comparison

1.11 Cross-Entropy

Measures the expected negative log-likelihood under model distribution \(Q\) when the data come from the true distribution \(P\):

\[ H(P, Q) = -\sum_x P(x) \log Q(x) \]

Continuous:

\[ H(P, Q) = -\int P(x) \log Q(x) \, dx \]

Relation to KL divergence:

\[ H(P, Q) = H(P) + D_{\text{KL}}(P \| Q) \]

where \(H(P) = -\sum_x P(x) \log P(x)\) is the entropy
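
A numerical check of this identity, and of the asymmetry of KL, on two made-up discrete distributions (sketch assuming `numpy`):

```python
import numpy as np

P = np.array([0.6, 0.3, 0.1])   # made-up discrete distributions
Q = np.array([0.4, 0.4, 0.2])

H_P = -np.sum(P * np.log(P))                # entropy H(P)
H_PQ = -np.sum(P * np.log(Q))               # cross-entropy H(P, Q)
KL_PQ = np.sum(P * np.log(P / Q))           # KL divergence

assert np.isclose(H_PQ, H_P + KL_PQ)        # H(P,Q) = H(P) + KL(P||Q)
print(KL_PQ, np.sum(Q * np.log(Q / P)))     # asymmetric: KL(P||Q) != KL(Q||P)
```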

ML relevance:

  • Cross-entropy loss in classification
  • Minimizing cross-entropy ≡ minimizing KL divergence (since \(H(P)\) does not depend on \(Q\))

1.12 Entropy

Measures uncertainty or information content:

Shannon Entropy:

\[ H(X) = -\sum_x P(x) \log P(x) = \mathbb{E}[-\log P(X)] \]

Continuous (Differential Entropy):

\[ H(X) = -\int f(x) \log f(x) \, dx \]

Properties:

  • \(H(X) \geq 0\) for discrete \(X\) (differential entropy can be negative)
  • Over a fixed finite support, the uniform distribution has maximum entropy
  • Higher entropy = more uncertainty

Conditional Entropy:

\[ H(Y \mid X) = \sum_x P(x) H(Y \mid X = x) \]

ML relevance:

  • Information gain in decision trees
  • Entropy regularization
  • Information theory foundations

1.13 Mutual Information

Measures dependence between variables:

\[ I(X; Y) = D_{\text{KL}}(P(X,Y) \| P(X)P(Y)) \]

Equivalent forms:

\[ I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) \]

\[ I(X; Y) = H(X) + H(Y) - H(X, Y) \]

Properties:

  • \(I(X; Y) \geq 0\)
  • \(I(X; Y) = 0\) iff \(X\) and \(Y\) independent
  • Symmetric: \(I(X; Y) = I(Y; X)\)
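
A sketch (assuming `numpy`; the joint table is made up) computing \(I(X;Y)\) directly from the KL definition:

```python
import numpy as np

# Made-up joint PMF P(X, Y); rows index X, columns index Y
P = np.array([[0.30, 0.10],
              [0.10, 0.50]])
P_X = P.sum(axis=1, keepdims=True)   # marginal of X
P_Y = P.sum(axis=0, keepdims=True)   # marginal of Y

# I(X; Y) = KL( P(X,Y) || P(X) P(Y) )
I = np.sum(P * np.log(P / (P_X * P_Y)))
print(I)  # > 0 here, since X and Y are dependent
```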

ML relevance:

  • Feature selection
  • Information bottleneck theory
  • Variational information maximization

1.14 Central Limit Theorem (CLT)

For i.i.d. random variables \(X_1, \ldots, X_n\) with mean \(\mu\) and variance \(\sigma^2\):

\[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \]

where \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\)

Practical form: For large \(n\):

\[ \bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \]
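
A simulation sketch (assuming `numpy`) showing sample means of a skewed exponential distribution behaving like the Gaussian the CLT predicts:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 1.0, 1.0            # Exponential(lambda=1): mean 1, std 1
n, reps = 100, 50_000

# Draw `reps` sample means, each from n i.i.d. exponentials
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# CLT: means ~ N(mu, sigma^2 / n) approximately
print(means.mean(), means.std())            # ~1.0, ~sigma/sqrt(n) = 0.1
z = (means - mu) / (sigma / np.sqrt(n))
print((np.abs(z) <= 1.96).mean())           # ~0.95
```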

ML relevance:

  • Justifies Gaussian assumptions
  • Confidence intervals
  • Bootstrap methods

1.15 Law of Large Numbers (LLN)

1.15.1 Weak LLN

For i.i.d. \(X_1, \ldots, X_n\) with mean \(\mu\):

\[ \bar{X}_n \xrightarrow{P} \mu \]

(Sample mean converges in probability to true mean)

1.15.2 Strong LLN

\[ \bar{X}_n \xrightarrow{\text{a.s.}} \mu \]

(Sample mean converges almost surely)

ML relevance:

  • Monte Carlo methods
  • Justifies empirical risk minimization

1.16 Jensen’s Inequality

For a convex function \(f\) and a random variable \(X\):

\[ f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)] \]

For a concave function, the inequality is reversed.

Example: \(\log\) is concave, so:

\[ \log(\mathbb{E}[X]) \geq \mathbb{E}[\log(X)] \]

ML relevance:

  • Derives EM algorithm
  • Variational inference bounds
  • Information theory inequalities

2 ML Applications Summary

2.1 Classification

Logistic Regression (binary):

\[ P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}} \]

Softmax (multi-class):

\[ P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T\mathbf{x}}} \]
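
In practice softmax is computed with a max-subtraction trick for numerical stability, since softmax is invariant to shifting all logits by a constant; a minimal sketch assuming `numpy`:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Stable softmax: subtracting max(logits) avoids overflow in exp
    and leaves the result unchanged (shift-invariance of softmax)."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # stand-ins for w_k^T x, one per class
probs = softmax(scores)
print(probs, probs.sum())            # a valid distribution over classes
```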

2.2 Naive Bayes Classifier

Assumes features conditionally independent:

\[ P(y \mid \mathbf{x}) \propto P(y) \prod_{i=1}^d P(x_i \mid y) \]

Decision rule:

\[ \hat{y} = \arg\max_y P(y) \prod_{i=1}^d P(x_i \mid y) \]
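
Products of many small probabilities underflow, so the decision rule is usually evaluated in log space. A toy Bernoulli naive Bayes sketch (assuming `numpy`; all tables are made up for illustration):

```python
import numpy as np

# Made-up Bernoulli naive Bayes: 2 classes, 3 binary features
log_prior = np.log(np.array([0.6, 0.4]))          # log P(y)
# P(x_i = 1 | y), one row per class
p_feat = np.array([[0.8, 0.1, 0.5],
                   [0.3, 0.7, 0.4]])

x = np.array([1, 0, 1])                           # observed feature vector

# Log-space decision rule: log P(y) + sum_i log P(x_i | y)
log_lik = (x * np.log(p_feat) + (1 - x) * np.log(1 - p_feat)).sum(axis=1)
scores = log_prior + log_lik
print(scores.argmax())                            # arg max_y of the rule above
```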

2.3 Gaussian Mixture Models (GMM)

Mixture of \(K\) Gaussians:

\[ P(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

where \(\sum_{k=1}^K \pi_k = 1\) (mixing coefficients)

ML relevance: Clustering, density estimation, generative models

2.4 Expectation-Maximization (EM) Algorithm

Iterative method for MLE with latent variables:

E-step: Compute the expected complete-data log-likelihood under the current posterior over the latent variables

\[ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}[\log P(X, Z \mid \theta)] \]

M-step: Maximize w.r.t. \(\theta\)

\[ \theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)}) \]

ML relevance: Training GMMs, hidden Markov models
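
A minimal EM loop for a two-component 1D GMM, as a sketch assuming `numpy`/`scipy`; the initialization and the fixed iteration count are simplistic choices for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
# Synthetic 1D data from two Gaussians (ground truth, for illustration)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# Crude initialization of mixing weights, means, variances
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities r[n, k] = P(z = k | x_n, current params)
    dens = np.stack([norm.pdf(data, mu[k], np.sqrt(var[k])) for k in range(2)],
                    axis=1)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)

    # M-step: closed-form updates maximizing Q(theta | theta^(t))
    Nk = r.sum(axis=0)
    pi = Nk / len(data)
    mu = (r * data[:, None]).sum(axis=0) / Nk
    var = (r * (data[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)  # should approach the generating parameters
```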

2.5 Sampling Methods

2.5.1 Monte Carlo Estimation

Approximate expectation by sampling:

\[ \mathbb{E}[f(X)] \approx \frac{1}{N}\sum_{i=1}^N f(x_i), \quad x_i \sim P(X) \]
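
For example, estimating \(\mathbb{E}[X^2]\) for \(X \sim \mathcal{N}(0,1)\), whose exact value is 1, as a sketch assuming `numpy`:

```python
import numpy as np

rng = np.random.default_rng(8)

# Monte Carlo estimate of E[X^2] for X ~ N(0, 1); exact value is 1
samples = rng.normal(size=100_000)
estimate = np.mean(samples**2)
print(estimate)  # ~1.0; the error shrinks like 1/sqrt(N) by the LLN/CLT
```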

2.5.2 Markov Chain Monte Carlo (MCMC)

Generate samples from complex distributions:

  • Metropolis-Hastings: Accept/reject samples based on ratio
  • Gibbs Sampling: Sample each variable conditional on others

ML relevance: Bayesian inference, probabilistic programming
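
A minimal random-walk Metropolis-Hastings sketch (assuming `numpy`), targeting a standard normal through its unnormalized log-density:

```python
import numpy as np

rng = np.random.default_rng(9)

def log_target(x):
    """Unnormalized log-density of the target (here a standard normal)."""
    return -0.5 * x**2

x, samples = 0.0, []
for _ in range(50_000):
    proposal = x + rng.normal(scale=1.0)        # symmetric random-walk proposal
    # Accept with probability min(1, target(proposal) / target(x))
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

samples = np.array(samples[5_000:])             # discard burn-in
print(samples.mean(), samples.std())            # ~0.0, ~1.0
```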

2.6 Quick Reference Table

| Concept | Formula | ML Application |
|---|---|---|
| Bayes’ Theorem | \(P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}\) | Classification, inference |
| KL Divergence | \(D_{\text{KL}}(P \Vert Q) = \sum P(x)\log\frac{P(x)}{Q(x)}\) | VAE training, model comparison |
| Cross-Entropy | \(H(P,Q) = -\sum P(x)\log Q(x)\) | Classification loss |
| Entropy | \(H(X) = -\sum P(x)\log P(x)\) | Information theory, decision trees |
| Gaussian PDF | \(\mathcal{N}(\mu, \sigma^2)\) | Continuous modeling |
| MLE | \(\hat{\theta} = \arg\max P(\mathcal{D} \mid \theta)\) | Parameter estimation |
| MAP | \(\hat{\theta} = \arg\max P(\theta \mid \mathcal{D})\) | Bayesian estimation |
| Covariance | \(\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\) | Feature correlation |
| CLT | \(\bar{X}_n \to \mathcal{N}(\mu, \sigma^2/n)\) | Statistical inference |