Notes on Inference and Information
Table of contents
- Notations
- Inequalities
- Exponential families
- Typicality
- Hypothesis testing
- Estimation
- Information measures
- Information geometry
- Non-Bayesian modeling
- Approximation
This post is based on a course I took in spring 2020. I revise it from time to time to consolidate and deepen my understanding.
Updates
July 22, 2020: added Beta distributions.
Notations
| Notation | Meaning |
| --- | --- |
| upper-case letters $X, Y$ | random variables |
| lower-case letters $x, y$ | real numbers |
| $\mathcal{X}, \mathcal{Y}$ | alphabets |
| lower-case letters $p, q$ | PDFs or PMFs |
Inequalities
- Jensen’s inequality: for convex $f$, $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$
- Log-sum inequality: $\sum_i a_i \log (a_i / b_i) \ge \sum_{i} a_i \log (\sum_{i} a_i / \sum_{i} b_i)$
- Proof: Jensen’s inequality, either by the convexity of $x\log x$ or the convexity of $-\log x$
- Gibbs’ inequality: $\mathbb{E} _{p} [\log p] \ge \mathbb{E} _{p}[\log q]$
Exponential families
- One-parameter exponential family (EF): $p_{Y}(y; x) = \exp\lbrace \lambda(x) t(y) - \alpha(x) + \beta(y) \rbrace$
- $\lambda(x)$: natural parameter, $t(y)$: natural statistic, $\alpha(x)$: log partition function, $\beta(y)$: log base function
- Regular EF: the support of $p_{Y}(\cdot; x)$ does not depend on $x$
- Canonical EF: $\lambda(x) = x$, called linear in MIT 6.437
- Geometric mean of two distributions
- $q(y) = p_{Y}(y; 0)$: base distribution
- Natural EF: $\lambda(x) = x, t(y)=y$, called canonical in MIT 6.437
- Tilting a distribution
- Properties of 1D canonical EF (verified numerically in the sketch at the end of this section)
- $\dot{\alpha}(x) = \eta(x) := \mathbb{E} _{p _{Y}(\cdot; x)}[t(Y)]$: mean parameter
- $\ddot{\alpha}(x) = \dot{\eta}(x) = J_{Y}(x)$: Fisher information
- $D(p_{Y}(\cdot; x) \Vert q(\cdot)) + \alpha(x) = x\eta(x)$
- $d D(p_{Y}(\cdot; x) \Vert q(\cdot)) / d\eta = x$
- $d^2 D(p_{Y}(\cdot; x) \Vert q(\cdot)) / d\eta^2 = 1/J_{Y}(x)$
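A minimal numerical check of the identities above, using a made-up finite-alphabet example: the base PMF `q` and statistic `t` below are arbitrary choices, and the derivatives of $\alpha$ are approximated by finite differences.

```python
import numpy as np

# Canonical (natural-parameter) EF on a finite alphabet, built by
# exponentially tilting a base PMF q:
#   p(y; x) = q(y) exp{x t(y) - alpha(x)},
#   alpha(x) = log sum_y q(y) exp{x t(y)}.
q = np.array([0.2, 0.5, 0.3])   # base distribution (arbitrary example)
t = np.array([0.0, 1.0, 3.0])   # natural statistic t(y)

def alpha(x):
    """Log partition function."""
    return np.log(np.sum(q * np.exp(x * t)))

def tilt(x):
    """Tilted distribution p(.; x)."""
    return q * np.exp(x * t - alpha(x))

x = 0.7
p = tilt(x)
eta = np.sum(p * t)              # mean parameter E[t(Y)]
J = np.sum(p * t**2) - eta**2    # Fisher information = Var[t(Y)]

# alpha_dot should equal eta, and alpha_ddot should equal J.
h = 1e-4
alpha_dot = (alpha(x + h) - alpha(x - h)) / (2 * h)
alpha_ddot = (alpha(x + h) - 2 * alpha(x) + alpha(x - h)) / h**2
print(np.isclose(alpha_dot, eta), np.isclose(alpha_ddot, J))
```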
Typicality
- Normalized log-likelihood $L_{p}(y_{1}^{N}) = 1/N \sum_{n=1}^{N} \log p(y_n)$
- Typical set $\mathcal{T} _{\epsilon}(p; N)$: the set of sequences $y _{1} ^{N}$ such that $\lvert L _{p}(y_1^{N}) + H(p)\rvert \le \epsilon$
- $P \lbrace \mathcal{A} \rbrace := \sum_{y_{1}^{N} \in \mathcal{A}} p^{N}(y_1^{N})$
- Asymptotic equipartition property (AEP); a numerical illustration follows at the end of this section
- AEP-1: $\lim_{N\to \infty} P\lbrace\mathcal{T}_{\epsilon}(p; N)\rbrace = 1$
- AEP-2: for all $y_1^N \in \mathcal{T}_{\epsilon}(p; N)$, $2^{-N(H(p) + \epsilon)} \le p^{N}(y_1^N) \le 2^{-N(H(p) - \epsilon)}$
- AEP-3: for sufficiently large $N$, $(1 - \epsilon) 2^{N(H(p) - \epsilon)} \le \lvert \mathcal{T}_{\epsilon}(p; N) \rvert \le 2^{N(H(p) + \epsilon)}$
- AEP for continuous distributions: replace entropy $H(p)$ with differential entropy $h(p)$, and cardinality $\lvert \cdot \rvert$ with volume $\mathrm{vol}[\cdot]$
- Normalized log-likelihood ratio $L_{p \vert q}(y_1^N) = 1/N \sum_{n=1}^{N} \log (p(y_n) / q(y_n))$
- Divergence typical set $\mathcal{T} _{\epsilon}(p\vert q; N)$: the set of sequences $y _{1} ^{N}$ such that $\lvert L _{p\vert q}(y _1 ^{N}) - D(p \Vert q)\rvert \le \epsilon$
- Equivalent in the sense that $\lim _{N\to \infty} P \lbrace\mathcal{T} _{\epsilon}(p; N) \cap \mathcal{T} _{\epsilon}(p\vert q; N) \rbrace = 1$
- For sufficiently large $N$, $(1 - \epsilon) 2^{-N(D(p \Vert q) + \epsilon)} \le Q\lbrace\mathcal{T}_{\epsilon}(p\vert q; N)\rbrace \le 2^{-N(D(p \Vert q) - \epsilon)}$
- Cramer’s theorem: large deviation probability of sample averages
- $Y_1^N$ i.i.d. from $q$ with finite mean $\mu < \gamma$
- Chernoff exponent $E_C(\gamma) := D(p(\cdot; x) \Vert q)$, where $p(y; x) = q(y) \exp\lbrace xy - \alpha(x)\rbrace$ such that $\mathbb{E}_{p(\cdot; x)}[y] = \gamma$
- $\lim_{N\to \infty} -1/N \log \mathbb{P}(1/N \sum_{n=1}^N Y_n \ge \gamma) = E_C(\gamma)$
- Extensions to Cramer’s theorem
- Averages of functions of i.i.d. samples
- Vector version
- Beyond i.i.d. to samples from an ergodic random process: Gärtner-Ellis theorem
- $\mathcal{P}_{N}^{\mathcal{Y}}$: the set of all types for sequences of length $N$ generated from an alphabet $\mathcal{Y}$
- Type class $\mathcal{T}_{N}^{\mathcal{Y}}(p)$: the set of sequences $y_1^N$ such that $\hat{p}(\cdot; y_1^N) = p(\cdot)$
- Properties of types
- Polynomial number of types: $\lvert \mathcal{P}_{N}^{\mathcal{Y}}\rvert \le (N + 1)^{\lvert \mathcal{Y} \rvert}$
- Probability of sequence as a function of its type: $q^N(y_1^N) = 2^{-N(D(\hat{p}(\cdot; y_1^N) \Vert q) + H(\hat{p}(\cdot; y_1^N)))}$
- Exponentially many sequences in each nondegenerate type class: $cN^{-\lvert \mathcal{Y} \rvert} 2^{NH(p)} \le \lvert\mathcal{T}_{N}^{\mathcal{Y}}(p)\rvert \le 2^{NH(p)}$, where $c$ depends on $p$ and $\mathcal{Y}$
- Probability that a sequence in type $p$ is generated by $q$: $cN^{-\lvert \mathcal{Y} \rvert } 2^{-ND(p \Vert q)} \le Q\lbrace\mathcal{T}_{N}^{\mathcal{Y}}(p)\rbrace \le 2^{-ND(p \Vert q)}$
- Sanov’s theorem: large deviation probability specified by $\mathcal{S}$
- For $\mathcal{S} \subset \mathcal{P}^{\mathcal{Y}}$ and $q \in \mathcal{P}^{\mathcal{Y}}$, Chernoff exponent $D_{\star} := \inf_{p\in \mathcal{S}} D(p \Vert q)$
- $Q \lbrace \mathcal{S} \cap \mathcal{P} _{N} ^{\mathcal{Y}} \rbrace \le (N + 1) ^{\lvert \mathcal{Y} \rvert} 2 ^{-N D _{\star}}$
- For closed $\mathcal{S}$, the dominant atypical behavior: conditioned on the atypical event, the empirical distribution converges in probability to the I-projection $p_{\star} = \arg\min_{p\in \mathcal{S}} D(p \Vert q)$
- More powerful than the scalar version of Cramer’s theorem, but less powerful than its vector version
- Strongly typical set $\mathcal{T} _{\epsilon} ^{\mathrm{S}}(p; N)$: the set of sequences $y_1^N$ such that $\Vert \hat{p}(\cdot; y_1^N) - p(\cdot) \Vert _{\infty} \le \epsilon$
- Called “strong” not because of the SLLN, but because it implies the weaker notions of typicality
- Implies divergence typicality
- Implies typicality, because for finite alphabets, typicality is an instance of divergence typicality where $q$ is the uniform distribution
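A quick Monte Carlo illustration of typicality and divergence typicality, under an arbitrary three-letter source chosen here for concreteness:

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. source on a finite alphabet; the normalized log-likelihood
# -L_p concentrates at the entropy H(p) (AEP).
p = np.array([0.1, 0.2, 0.7])
H = -np.sum(p * np.log2(p))          # entropy in bits

N = 10_000
y = rng.choice(len(p), size=N, p=p)
L = np.mean(np.log2(p[y]))           # normalized log-likelihood L_p(y_1^N)
print(f"-L = {-L:.4f}, H(p) = {H:.4f}")   # should be close

# Divergence typicality: the normalized log-likelihood ratio of samples
# from p concentrates at D(p || q).
q = np.array([1/3, 1/3, 1/3])
D = np.sum(p * np.log2(p / q))
Lr = np.mean(np.log2(p[y] / q[y]))
print(f"L_ratio = {Lr:.4f}, D(p||q) = {D:.4f}")
```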
Hypothesis testing
- Bayesian hypothesis testing: likelihood ratio test (LRT); a Monte Carlo sketch follows at the end of this section
- Neyman-Pearson hypothesis testing: no prior & no cost
- Randomized test
- Operating characteristic (OC)
- Efficient frontiers
- Achievable operating points
- Minimax hypothesis testing: unknown prior $p$
- In general $\min \max \ge \max \min$
- $\min_{f} \max_{p\in [0, 1]} \phi(f, p) = \max_{p\in [0, 1]} \min_{f} \phi(f, p)$
- Asymptotics: error exponents via the large deviation results above (e.g., Stein’s lemma)
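A concrete instance of the Bayesian LRT, as a minimal sketch: two unit-variance Gaussian hypotheses with uniform costs; the means and the prior are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bayesian binary hypothesis test between H0: N(0, 1) and H1: N(1, 1),
# prior P(H1) = pi1, uniform costs. The optimal rule is the LRT:
# decide H1 iff L(y) >= (1 - pi1) / pi1.
pi1 = 0.3
mu0, mu1, sigma = 0.0, 1.0, 1.0
threshold = (1 - pi1) / pi1

def likelihood_ratio(y):
    """p(y; H1) / p(y; H0) for the two Gaussian densities."""
    log_lr = (y - mu0)**2 / (2 * sigma**2) - (y - mu1)**2 / (2 * sigma**2)
    return np.exp(log_lr)

# Monte Carlo estimate of the probability of error.
n = 100_000
h = rng.random(n) < pi1                          # true hypothesis
y = np.where(h, mu1, mu0) + sigma * rng.standard_normal(n)
decide_h1 = likelihood_ratio(y) >= threshold
print("P(error) ~", np.mean(decide_h1 != h))
```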
Estimation
Bayesian estimation
- Sufficient statistics
- Bernstein-von Mises theorem: asymptotics of Bayesian estimation
Non-Bayesian estimation
- Valid: does not depend on $x$
- Admissible: valid and unbiased
- Minimum-variance unbiased (MVU) estimator
- Score function $S(y; x) = \partial \log p_{Y}(y; x) / \partial x$
- Fisher information $J_{Y}(x) = \mathbb{E}[S^2(Y; x)] = - \mathbb{E}[\partial^2 \log p_{Y}(Y; x) / \partial x^2]$
- Cramer-Rao (CR) bound: for all admissible $\hat{x}(\cdot)$, the error variance satisfies $\lambda_{\hat{x}}(x) \ge 1/J_{Y}(x)$ (illustrated in the sketch at the end of this section)
- Efficient estimator: satisfies CR bound with equality
- Exists iff exponential family
- Exists iff $\hat{x}(y) = x + S(y; x) / J_{Y}(x)$ is valid
- Admissible, MVU, ML estimator
- Sufficient statistics
- Wald’s theorem
- Cramer’s theorem
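A small simulation of the CR bound for the sample-mean estimator of a Gaussian mean, which is efficient; the specific parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimating the mean x of N(x, sigma^2) from N i.i.d. samples.
# Fisher information per observation is 1/sigma^2, so for N samples
# J = N / sigma^2 and the CR bound for unbiased estimators is sigma^2 / N.
# The sample mean is unbiased and achieves it, i.e., it is efficient.
x_true, sigma, N = 1.5, 2.0, 50
cr_bound = sigma**2 / N

trials = 20_000
y = x_true + sigma * rng.standard_normal((trials, N))
x_hat = y.mean(axis=1)
print("empirical variance:", x_hat.var(), "CR bound:", cr_bound)
```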
Information measures
Discrete alphabets
- Bayesian modeling
- Entropy $H(p) = -\sum_{a} p(a) \log p(a)$
- Nonnegative, since $0 \le p(a) \le 1$ implies $-p(a)\log p(a) \ge 0$ for every $a$
- Strictly concave, due to log-sum inequality
- Mutual information $I(X; Y) = I(Y; X) = H(X) - H(X \vert Y) = H(Y) - H(Y \vert X)$
- Nonnegative, due to the nonnegativity of information divergence
- $I(X; Y) = D(p_{X, Y} \Vert p_{X} p_{Y}) = \sum_{x, y} p_{X, Y}(x, y) \log \frac{p_{X, Y}(x, y)}{p_{X}(x) p_{Y}(y)}$
- $I(X; Y) = \sum_{x} p_{X}(x) D(p_{Y\vert X}(\cdot\vert x) \Vert p_{Y}(\cdot)) = \sum_{y} p_{Y}(y) D(p_{X\vert Y}(\cdot\vert y) \Vert p_{X}(\cdot))$
- $I(p_{X}, p_{Y\vert X})$ is concave, but not strictly concave, in $p_{X}$, due to the concavity of the entropy
- Information divergence $D(p \Vert q) = \sum_{a} p(a) \log \frac{p(a)}{q(a)}$
- Nonnegative, due to Gibbs’ inequality
- Jointly convex in $(p, q)$, due to log-sum inequality
- Strictly convex in $p$ for a fixed $q$
- Strictly convex in $q$ for a fixed $p$
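The discrete measures above in a few lines of code, as a minimal sketch (conventions: bits, $0\log 0 = 0$, and $\mathrm{supp}(p) \subseteq \mathrm{supp}(q)$ for the divergence); the joint PMF is an arbitrary example.

```python
import numpy as np

def entropy(p):
    """H(p) in bits; uses the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def divergence(p, q):
    """D(p || q) in bits; assumes supp(p) is a subset of supp(q)."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(p_xy):
    """I(X; Y) = D(p_XY || p_X p_Y) for a joint PMF given as a matrix."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return divergence(p_xy.ravel(), (p_x * p_y).ravel())

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
print(entropy(p_xy.sum(axis=1)), mutual_information(p_xy))
```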
Continuous alphabets
- Differential entropy $h(X) = -\int p(x) \log p(x) dx$
- Not necessarily nonnegative
- Not invariant to coordinate transformation $X = g(S)$
- Conditional differential entropy $h(X\vert Y) = \mathbb{E} _{Y}[h(X\vert Y = y)] = -\int \int p _{X, Y}(x, y) \log p _{X\vert Y}(x\vert y) dx dy$
- Mutual information $I(X; Y) = h(X) - h(X\vert Y)$
- Nonnegative, due to the nonnegativity of information divergence
- Invariant to coordinate transformation
- Information divergence $D(p \Vert q) = \int p(x) \log (p(x)/q(x)) dx$
- Nonnegative, due to Gibbs’ inequality
- Invariant to coordinate transformation
- Information measures for Gaussian distributions, scalar and vector cases (the scalar divergence formula is checked numerically below)
- $h(X) = \frac{1}{2} \log (2\pi e \sigma^2)$; vector case $h(X) = \frac{1}{2} \log \lvert 2\pi e \Lambda \rvert$
- $h(X\vert Y) = \frac{1}{2} \log (2\pi e (\lambda_X - \lambda_{XY}^2 / \lambda_Y))$; vector case $h(X\vert Y) = \frac{1}{2} \log \lvert 2\pi e (\Lambda_X - \Lambda_{XY}{\Lambda_Y}^{-1} {\Lambda_{XY}}^{\top})\rvert$
- $I(X; Y) = \frac{1}{2} \log (1 - \rho_{XY}^2)^{-1}$; vector case $I(X; Y) = \frac{1}{2} \log \lvert I - B_{XY} B_{XY}^{\top}\rvert ^{-1}$, with normalized cross-covariance $B_{XY} = \Lambda_X^{-1/2}\Lambda_{XY}\Lambda_Y^{-1/2}$
- For equal (co)variances, $D(p_1 \Vert p_2) = \frac{\log e}{2} (\mu_1 - \mu_2)^2 / \sigma^2$; vector case $D(p_1 \Vert p_2) = \frac{\log e}{2} (\mu_1 - \mu_2)^{\top}{\Lambda}^{-1} (\mu_1 - \mu_2)$
- SNR: $(\mu_1 - \mu_2)^2 / \sigma^2$, $(\mu_1 - \mu_2)^{\top}{\Lambda}^{-1} (\mu_1 - \mu_2)$
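A sanity check of the scalar Gaussian divergence formula, written in nats so the $\log e$ factor disappears, against a plain Riemann-sum integration; the means and variance are arbitrary.

```python
import numpy as np

# Closed form D(p1 || p2) = (mu1 - mu2)^2 / (2 sigma^2) in nats,
# for two Gaussians with a common variance.
mu1, mu2, sigma = 0.5, -0.3, 1.2
closed_form = (mu1 - mu2)**2 / (2 * sigma**2)

# Numerical integration of p1 log(p1/p2) on a fine grid.
y = np.linspace(-10, 10, 200_001)
dy = y[1] - y[0]
p1 = np.exp(-(y - mu1)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
p2 = np.exp(-(y - mu2)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
numeric = np.sum(p1 * np.log(p1 / p2)) * dy
print(closed_form, numeric)   # should agree to several digits
```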
Information geometry
- Finite alphabet $\mathcal{Y}$ with cardinality $M$
- Each probability distribution: a point in an $M$-dimensional space
- Probability simplex $\mathcal{P}^{\mathcal{Y}}$: an $(M - 1)$-dimensional subset of a hyperplane
- Boundary $\partial(\mathcal{P}^{\mathcal{Y}})$ and interior $\mathrm{int}(\mathcal{P}^{\mathcal{Y}})$
- Information divergence $D(p \Vert q)$ on the boundary of the simplex: define $0\log 0 = 0$
- Information projection (I-projection): $p^{\star} = \arg\min_{p\in \mathcal{P}} D(p \Vert q)$
- Reverse I-projection: $q^{\star} = \arg\min_{q\in \mathcal{P}} D(p \Vert q)$
- Pythagorean theorem: for convex $\mathcal{P}$ and all $p \in \mathcal{P}$, $D(p \Vert q) \ge D(p \Vert p^{\star}) + D(p^{\star} \Vert q)$
- Linear family $\mathcal{L}$: for $K < M$ and all $i \in [K]$, $\mathbb{E}_{p}[t_i(Y)] = \overline{t}_i$
- In matrix form, $\widetilde{T} p = 0$, where row $i$ of $\widetilde{T}$ has entries $t_i(y) - \overline{t}_i$; i.e., $\mathcal{L}$ is the intersection of the simplex with the null space of $\widetilde{T}$
- $\dim(\mathcal{L}) = M - 1 - K'$, where $K'$ is the rank of $\widetilde{T}$
- Minimal representation if $K' = K$, i.e., the rows of $\widetilde{T}$ are linearly independent
- Closed under linear combinations that remain probability distributions
- Pythagorean identity: for $\mathcal{P} = \mathcal{L}$
- Exponential family $\mathcal{E}_{t}(p^{\star})$: $q(y) = p^{\star}(y)\exp\lbrace x^{\top}t(y) - \alpha(x)\rbrace$
- All distributions whose I-projection onto $\mathcal{L} _{t}(p^{\star})$ is $p^{\star}$ form $\mathcal{E} _{t}(p^{\star})$
- $\mathcal{E} _{t}(p^{\star})$ is called the orthogonal family to $\mathcal{L} _{t}(p^{\star})$ at $p^{\star}$ (see the I-projection sketch below)
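A sketch of I-projection onto a linear family with a single mean constraint, using the exponential-family form above: the projection reduces to a one-dimensional root-find for the natural parameter (bisection here; the base PMF, statistic, and target mean are arbitrary example values).

```python
import numpy as np

# I-projection of a base PMF q onto L = { p : E_p[t(Y)] = t_bar }.
# The projection is the exponential tilt p*(y) = q(y) exp{x t(y) - alpha(x)}
# whose mean parameter eta(x) = E[t(Y)] equals t_bar; eta is increasing
# in x, so bisection on x suffices.
q = np.array([0.25, 0.25, 0.25, 0.25])
t = np.array([0.0, 1.0, 2.0, 3.0])
t_bar = 2.2                      # target mean; must lie in (min t, max t)

def tilt(x):
    w = q * np.exp(x * t)
    return w / w.sum()

lo, hi = -50.0, 50.0
for _ in range(200):             # bisection on eta(x) - t_bar
    mid = (lo + hi) / 2
    if np.sum(tilt(mid) * t) < t_bar:
        lo = mid
    else:
        hi = mid
p_star = tilt((lo + hi) / 2)
print(p_star, np.sum(p_star * t))   # mean should be ~2.2
```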
- ML as reverse I-projection / divergence minimization
- Type / empirical distribution: $\hat{p} _{Y}(\cdot; y _{1}^{N})$
- Type identity: $\frac{1}{N} \sum _{n=1}^{N} f(y_n) = \mathbb{E} _{\hat{p} _{Y}(\cdot; y _{1}^{N})} [f(Y)]$
- $\hat{x} _{\mathrm{ML}} = \arg\min _{x} D(\hat{p} _{Y}(\cdot; y _{1}^{N}) \Vert p _{Y}(\cdot; x))$
- EM: complete data $Z$ from $p_{Z}(\cdot; x)$, observation $Y = g(Z)$; a minimal EM sketch follows at the end of this section
- Nondecreasing likelihood in each iteration due to Gibbs’ inequality
- Expectation: $U(x; \hat{x} ^{(n)}) = \mathbb{E} _{p _{Z\vert Y}(\cdot\vert y; \hat{x} ^{(n)})} [\log p _{Z}(z; x) \vert Y=y; \hat{x} ^{(n)}]$
- Maximization: $\hat{x}^{(n+1)} = \arg\max_{x} U(x; \hat{x}^{(n)})$
- Advantage: the maximization can be solved analytically for Gaussian mixtures
- Decision form of the data processing inequality (DPI): $D(p_{Z} \Vert q_{Z}) \ge D(p_{Y} \Vert q_{Y})$, with equality iff $p_{Z}(z) / q_{Z}(z) = p_{Y}(g(z)) / q_{Y}(g(z))$
- EM as alternating projections on the probability simplex
- $\hat{\mathcal{P}} ^{\mathcal{Z}}(y_{1}^{N}) = \left\lbrace \hat{p} _{Z}(\cdot) \in \mathcal{P}^{\mathcal{Z}} : \sum _{z \in g^{-1}(b)} \hat{p} _{Z}(z) = \hat{p} _{Y}(b; y_1^N), \forall b \in \mathcal{Y} \right\rbrace$
- $\hat{p} _{Z} ^{\star}(\cdot; x) = \arg\min _{\hat{p} _{Z} \in \hat{\mathcal{P}} ^{\mathcal{Z}}(y _{1}^{N})} D(\hat{p} _{Z} \Vert p _{Z}(\cdot; x)) = p _{Z}(z; x) \hat{p} _{Y}(g(z)) / p _{Y}(g(z); x)$, due to the DPI
- E-step / I-projection finds the correct empirical distribution from all candidates: $\hat{p} _{Z}^{\star}(\cdot; \hat{x} ^{(n)}) = \arg\min _{\hat{p} _{Z} \in \hat{\mathcal{P}} ^{\mathcal{Z}}(y _{1}^{N})} D(\hat{p} _{Z} \Vert p _{Z}(\cdot; \hat{x} ^{(n)}))$
- M-step / reverse I-projection: $\hat{x} ^{(n + 1)} = \arg\min_{x} D(\hat{p} _{Z}^{\star}(\cdot; \hat{x} ^{(n)}) \Vert p _{Z}(\cdot; x))$
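A minimal EM sketch for a two-component Gaussian mixture with known unit variances; the data-generating parameters and the initialization are assumptions of this example. The M-step is in closed form, as noted above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Unknown parameters: mixture weight w of component 1 and means m0, m1.
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])

w, m0, m1 = 0.5, -1.0, 1.0            # initial guess
for _ in range(100):
    # E-step: posterior responsibility of component 1 for each sample
    # (the common 1/sqrt(2 pi) factor cancels).
    l0 = (1 - w) * np.exp(-(y - m0)**2 / 2)
    l1 = w * np.exp(-(y - m1)**2 / 2)
    r = l1 / (l0 + l1)
    # M-step: closed-form maximization of the expected complete-data
    # log-likelihood.
    w = r.mean()
    m0 = np.sum((1 - r) * y) / np.sum(1 - r)
    m1 = np.sum(r * y) / np.sum(r)
print(w, m0, m1)   # should approach ~0.7, ~-2, ~2
```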
Non-Bayesian modeling
Modeling via mixtures
- Modeling is distinct from estimation even for parametric models
- Hypothesis testing: decision
- Estimation: parameter
- Modeling: distribution
- Modeling via mixtures $q_{w}(y) = \sum_{x\in \mathcal{X}} w(x) p_{Y}(y; x)$
- For any admissible $q\in \mathcal{P}^{\mathcal{Y}}$, there exists $w(\cdot)$ such that $D(p_{Y}(\cdot; x) \Vert q_{w}(\cdot)) \le D(p_{Y}(\cdot; x) \Vert q(\cdot))$ for all $x\in \mathcal{X}$
- Redundancy-capacity theorem
- $\max_{x\in \mathcal{X}} D(p_{Y}(\cdot; x) \Vert q(\cdot)) = \max_{w\in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x) D(p_{Y}(\cdot; x) \Vert q(\cdot))$
- $C = R^{+} = \min_{q\in \mathcal{P}^{\mathcal{Y}}} \max_{w\in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x) D(p_{Y}(\cdot; x) \Vert q(\cdot)) = \max_{w\in \mathcal{P}^{\mathcal{X}}} \min_{q\in \mathcal{P}^{\mathcal{Y}}} \sum_{x} w(x) D(p_{Y}(\cdot; x) \Vert q(\cdot)) = R^{-}$
- The number of $x$ such that $w^{\star}(x) > 0$ is at most $\lvert \mathcal{Y}\rvert$: geometrically, $q^{\star}$ is the center of the smallest divergence ball containing all $p_{Y}(\cdot; x)$, and only the points on its surface receive positive weight
- Equidistance property: $D(p_{Y}(\cdot; x) \Vert q^{\star}) \le C$ for all $x \in \mathcal{X}$, with equality for all $x$ such that $w^{\star}(x) > 0$
- Interpreting $w = p_{X}$: the inner minimizer is $q = p_{Y}$, and $R^{-} = \max_{p_{X}} I(X; Y)$
- Least informative prior for $p_{Y\vert X}$: $p_{X}^{\star} = \arg\max_{p_{X}} I(X; Y)$
- Model capacity of the model $p_{Y\vert X}$: $C = \max_{p_{X}} I(X; Y)$, computable via the Blahut-Arimoto iteration sketched at the end of this subsection
- $0 \le C \le \log \lvert \mathcal{X}\rvert$
- Inference with the optimized mixture model: $w ^{\star}(x \vert y _{-}) = \frac{w ^{\star}(x) L _{y _{-}}(x)}{\sum _{a} w ^{\star}(a) L _{y _{-}}(a)}$
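The capacity $C = \max_{p_X} I(X; Y)$ and the least informative prior can be computed with the Blahut-Arimoto iteration, a standard algorithm that these notes do not derive; below is a sketch on an arbitrary $3 \times 2$ model $W[x, y] = p_{Y\vert X}(y\vert x)$, in bits.

```python
import numpy as np

# Blahut-Arimoto: alternate between the induced mixture q_w and a
# multiplicative update of the prior w, which converges to the
# capacity-achieving (least informative) prior.
W = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
w = np.full(W.shape[0], 1 / W.shape[0])     # initial prior over x

for _ in range(500):
    q = w @ W                               # induced mixture q_w(y)
    d = np.sum(W * np.log2(W / q), axis=1)  # D(p_Y(.; x) || q_w) per x
    w = w * np.exp2(d)
    w /= w.sum()

q = w @ W
d = np.sum(W * np.log2(W / q), axis=1)
print("capacity ~", np.sum(w * d), "optimal prior:", w)
# Equidistance: D(p_Y(.; x) || q*) = C for all x with w*(x) > 0.
print("divergences:", d)
```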
Maximally ignorant priors
- Motivation of maximally ignorant priors (MIP): the detailed structure of the prior is not critical to good performance, and the prior can be chosen independently of $\mathcal{Y}$
- MIP for finite alphabets
- Maximum entropy distributions: I-projection of the uniform distribution onto the constraint family
- For linear constraints via linear family $\mathcal{L} _{t}(p ^{\star})$: $p ^{\star}(y) = \exp\lbrace \sum _{i=1}^{K} x_i t_i(y) - \alpha(x) \rbrace$
- MIP for infinite alphabets
- For linear constraints, the form of $p^{\star}$, when it exists, is the same as that in the case of finite alphabets
- $\mathbb{E}[Y^2] = \sigma^2$: MIP is $\mathrm{N}(0, \sigma^2)$
- $p(y) = 0$ for $y < 0$ and $\mathbb{E}[Y] = \mu$: MIP is $\mathrm{Exp}(1/\mu)$ (a short derivation follows)
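For the second example, a short Lagrangian derivation in nats, assuming the maximizer exists and the usual calculus-of-variations steps apply:

```latex
% Maximize h(p) subject to support [0, infinity), normalization, and mean mu.
% Lagrangian with multipliers x_0 (normalization) and x_1 (mean):
\mathcal{L}[p] = -\int_0^{\infty} p \log p \, dy
  + x_0 \Big( \int_0^{\infty} p \, dy - 1 \Big)
  + x_1 \Big( \int_0^{\infty} y \, p \, dy - \mu \Big)
% Stationarity in p(y):
-\log p(y) - 1 + x_0 + x_1 y = 0
  \;\Longrightarrow\; p(y) = e^{(x_0 - 1) + x_1 y}
% This is the claimed exponential-family form. Normalizability forces
% x_1 < 0, and the two constraints then give
p(y) = \tfrac{1}{\mu} \, e^{-y/\mu}, \qquad \text{i.e., } Y \sim \mathrm{Exp}(1/\mu)
```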
Conjugate prior families
- Conjugate prior families (CPF) $\mathcal{Q} = \lbrace q(\cdot; \theta): \theta \in \mathbb{R}^{K}\rbrace$: for $y\in \mathcal{Y}$, $p_{X\vert Y}(\cdot\vert y) \in \mathcal{Q}$ whenever $p_{X}(\cdot) \in \mathcal{Q}$
- Equivalently, $p_{X\vert Y_1, \cdots, Y_N} \in \mathcal{Q}$ whenever $p_{X}(\cdot) \in \mathcal{Q}$
- Natural CPF: a CPF that has the same form as the model
- Convex hull of a natural CPF (mixtures of natural conjugate priors): still conjugate, but unnatural and not an exponential family
- Equivalent statements
- CPF exists
- $\exists$ sufficient statistic $t_{N}(y_1, \cdots, y_N)$ for model $p_{Y\vert X}$ with finite dimension independent of $N$
- $p_{Y\vert X}$ is an exponential family with continuous natural statistics
- For model $p_{Y\vert X}(y\vert x) = \exp\lbrace \lambda(x)^{\top} t(y) - \alpha(x) + \beta(y) \rbrace$, the CPF is $\mathcal{Q} = \lbrace q: q(x; t, N) = \exp\lbrace [t^{\top}\lambda(x) - N \alpha(x)] - \gamma(t, N) \rbrace\rbrace$
- $N$ is a parameter, not the number of observations
- The CPF is natural & a linear exponential family
- The Gaussian family is self-conjugate
Example. Beta priors for Bernoulli distributions.
- Gamma function: $\Gamma(z) := \int_{0}^{\infty} u^{z-1} e^{-u} du$ for $z > 0$.
- $\Gamma(z + 1) = z\Gamma(z)$.
- For natural number $z$, $\Gamma(z) = (z - 1)!$.
- Beta function: $\mathrm{B}(\alpha, \beta) = \int_{0}^{1} x^{\alpha - 1} (1 - x)^{\beta - 1} dx = \Gamma(\alpha) \Gamma(\beta) \Gamma^{-1}(\alpha + \beta)$ for $\alpha, \beta > 0$.
- Beta distribution: $X \sim \mathrm{Beta}(\alpha, \beta)$ for $\alpha, \beta > 0$, if $p_{X}(x) = x^{\alpha - 1} (1 - x)^{\beta - 1} \mathrm{B}^{-1}(\alpha, \beta)$ for $x \in [0, 1]$ (the conjugate posterior update is sketched below).
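A tiny sketch of the resulting Bayesian update for the Bernoulli model: with prior $X \sim \mathrm{Beta}(\alpha, \beta)$ and $k$ heads in $N$ flips, the posterior is $\mathrm{Beta}(\alpha + k, \beta + N - k)$. The prior hyperparameters and the true bias below are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Conjugacy of Beta priors for the Bernoulli model: the posterior
# only tracks the sufficient statistic (k, N).
a, b = 2.0, 2.0                # prior hyperparameters (arbitrary example)
x_true = 0.7
y = rng.random(100) < x_true   # N = 100 Bernoulli(x_true) observations
k, N = int(y.sum()), y.size

a_post, b_post = a + k, b + N - k
posterior_mean = a_post / (a_post + b_post)
print(f"k = {k}, posterior Beta({a_post}, {b_post}), mean {posterior_mean:.3f}")
```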
Asymptotics of non-Bayesian modeling
- Asymptotics of model capacity
- Universal predictor
- Universal prediction theorem
Approximation
Stochastic approximation
- Stochastic approximation methods / Monte Carlo methods / sampling methods: construct estimates from data samples
- Importance sampling (IS): approximate $\mathbb{E}_{p}[f(X)]$ using weighted samples from a proposal distribution $q$
- $w(x) = p_{0}(x) / q_{0}(x)$, where $p_{0}, q_{0}$ are the unnormalized versions of $p, q$ with partition functions $Z_p, Z_q$
- Self-normalized estimate $\hat{\overline{f}} = \sum_{i=1}^{n} w(x_i) f(x_i) / \sum_{j=1}^{n} w(x_{j})$
- $\mathbb{E}_{q}[w(X)] = Z_p / Z_q$: a way to estimate the partition function $Z_p$ if $Z_q$ is known
- Disadvantage: convergence rate deteriorates as $\mathcal{X}$ or dimension increases
- Rejection sampling (RS): generate i.i.d. samples from $p$
- Assume a known $c$ such that $c q_{0}(x) \ge p_{0}(x)$ for all $x$
- Sample $x$ from $q$, and retain it with probability $p_{0}(x) / (cq_{0}(x))$
- Acceptance rate: $Z_p / (c Z_q)$
- Disadvantage: $c$ grows as $\mathcal{X}$ or dimension increases
- Markov chain Monte Carlo (MCMC): asymptotically produce correlated samples from $p$
- Advantage: viable in high dimensions, the most powerful among sampling methods
- Idea: construct a reversible, ergodic / irreducible & aperiodic, Markov chain with stationary distribution $p$
- Metropolis-Hastings algorithm: an MCMC method (sketched at the end of this subsection)
- Generate $x'$ from the proposal distribution $V(\cdot\vert x)$
- Move to $x'\neq x$ with probability $a(x\to x') = \min\lbrace 1, p_0(x') V(x\vert x') / (p_0(x) V(x'\vert x)) \rbrace$
- A local version of rejection sampling, with a local constant $c_{x, x'}$
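A minimal Metropolis-Hastings sketch with a symmetric Gaussian random-walk proposal, so the $V$ terms cancel in the acceptance ratio; the unnormalized bimodal target is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(5)

def p0(x):
    """Unnormalized target density: a bimodal mixture (arbitrary example)."""
    return np.exp(-(x - 2)**2 / 2) + 0.5 * np.exp(-(x + 2)**2 / 2)

def metropolis_hastings(n_samples, step=1.0, x_init=0.0):
    """Random-walk MH; only the unnormalized density p0 is needed."""
    x = x_init
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_prop = x + step * rng.standard_normal()
        # Symmetric proposal: accept with probability min{1, p0(x')/p0(x)}.
        if rng.random() < min(1.0, p0(x_prop) / p0(x)):
            x = x_prop               # accept; otherwise stay at x
        samples[i] = x
    return samples

s = metropolis_hastings(50_000)[5_000:]   # drop burn-in
print("sample mean ~", s.mean())          # pulled toward the heavier mode at +2
```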
Deterministic approximation