4.3 Advanced MCMC (wk08)

1, 3, 5번이 자주 사용됨 2, 3, 4의 목적은 proposal density 개선

1번은 missing data handling

5번은 variable selection

기본적으로 $f(x) = c \psi (x) = c \exp \left( -\frac{U(x)}{t} \right)$의 개형을 따름.

4.3.1 Data Augmentation

Missing Pattern

	MCAR	MAR	NMAR
in missingness in missing on a variable	No patterns	can be predicted by other variables
Missing values related to variable?	not to any	other variables	the variable itself
	complete data considered as a random subsample from the original target sample		whether data is NMAR is a theoretical and conceptual considerate
Assumption power	3	2	1
missing을 의식할 필요가 있는 상황인가?	No. missing 자체를 무시 (ignore).	Yes. predict missing by other variable	Yes. impute with seperate model

bayesian 분석에서 Handling Missing Data에 자주 사용된다.

Data Augmentation(DA) 알고리즘은, 불완전 데이터에 대한 bayesian 분석으로 설명될 수 있다.

$X_{obs}$	observed data
$X_{mis}$	missing data
$X_{com}= \left(X_{obs},X_{mis} \right)$	complete data

assume complete 데이터 모델 $X_{com} = (X_{obs}, X_{mis}) \sim g \left( X_{obs}, X_{mis} \rvert \theta \right)$, where 패러미터 $\theta \in \Theta \subseteq \mathbb{R}^d, d \in \mathbb{Z}^+$.

여기서 목적은 패러미터 $\theta$에 대한 prior 분포 $\pi (\theta)$와 함께 bayesian 추론을 만드는 것. 이는 곧 $g \left( X_{obs}, X_{mis} \rvert \theta \right) \ast \pi(\theta)$ 를 말한다.

Multiple Imputation	데이터 impute, 그 impute한걸로 패러미터 업데이트, 다시 impute, …. 여기선 impute 셋을 여러개를 만들어놓고, 그걸 패러미터를 계산을 해서 패러미터 분포를 갖고 inference.
Data Augmentation	interatively하게 패러미터 impute, 업데이트, impute, 업데이트, ….

proceeds: MCMC로 DA

let observed-data model $f(X_{obs} \rvert \theta)$. 이는 아래와 같이 joint pdf에서 마지널을 뽑아냄으로써 (integrate) 단독 pdf를 획득하는 것이 가능함.

MCMC 방법론을 사용해 $\theta$에 대한 Bayesian 추론을 진행하면 true or observed-data post를 샘플링하거나, 혹은 더욱 일반적으로는 $\theta$와 $X_{mis}$의 joint 분포를 샘플링할 것을 요구한다. 이의 각각을 수식으로 나타내면 다음과 같다.

$$ \[\begin{alignat}{4} & &&f(X_{obs} \rvert \theta) && &&= \int_{\mathbb{X}_{mis}} g(X_{obs}, X_{mis} \rvert \theta) \; dX_{mis}, \; \; \; && \theta \in \Theta \\ \\ \pi (\theta \lvert X_{obs}) &\propto \; &&f (X_{obs} \lvert \theta) && \pi(\theta), && \; \; \; \; \; \; \; && \theta \in \Theta \tag{1}\\ \\ \pi (\theta, X_{mis} \lvert X_{obs}) &\propto \; &&g (X_{obs}, X_{mis} \lvert \theta) && \pi(\theta), && \; \; \; \; \; \; \; && \theta \in \Theta \tag{2}\\ \end{alignat}\] $$

integrate하여 획득한 위의 likelihood $f(X_{obs} \rvert \theta)$에 post를 곱하여 direct하게 업데이트하는 1번 방법
missing을 하나의 unknown 패러미터로 두고, 이 missing 업데이트 한 다음에 missing까지 포함하여 $\theta$를 inference하고, 그 후에 다시 missing을 업데이트 하는 방법.
1번에서는 마지널化한 pdf를 사용했고, 2번에서는 complete의 pdf를 통채로 사용했다. 1번에서는 마지널 시켜서 missing의 영향력을 삭제했다는 것을 발견할 수 있어야 한다.

의 2가지 방법을 취하는 것이 가능하다.

let $h( X_{mis} \lvert \theta, X_{obs})$는 $X_{mis}, X_{obs}$의 conditional 분포. 이때, 이 분포와

\[ \pi (\theta \lvert X_{obs}, X_{mis}) \propto g (X_{mis}, X_{obs} \rvert \theta) \pi(\theta) \tag{Full Posterior} \]

상기의 분포 2개가 양쪽 모두 샘플링이 간단하다고 가정하자.

이 두 조건부 분포 2개에 기반한 2단계 GS로 이를 해결하고자 하는 것은 당연한 귀결이다. 이를 Data Augmentation 알고리즘 (DA)라 칭한다. 이는 이하와 같이 설명될 수 있다.

take $\theta^{(0)} \in \Theta$, and iterate for $t=1,2, \cdots$.

Imputation-step:
generate $X_{mis}^{(t)} \sim f_{mis}(X_{mis} \rvert \theta^{(t-1)}, X_{obs})$. (이는 정식적인 likelihood가 아니라 missing variable과 observed variable 간의 관계를 의미하는 것).
해당 스텝에서 missing을 대체. 다른 조건들 (패러미터, obs) 들이 주어졌을 때의 mis의 조건부분포를 구하여 이를 기반으로 mis를 imputation.
Parameter update-step:
generate $\theta^{(t)} \sim \pi (\theta \lvert X_{obs}, X_{mis} )$.
해당 스텝에서 패러미터를 갱신. 위에서 채운 mis와 obs를 합쳐 com 데이터로 삼고 이를 토대로 패러미터 추정한다.

As a two-step GS, DA creates two interleaving Markov Chain:

$$ \[\begin{align*} \{ \theta(t) &: t=1,2,\cdots \} \\ \{ X_{mis}^{(t)} &: t=1,2,\cdots \} \end{align*}\] $$

Example: Multivariate Normal Distribution

4.3.2 Hit-and-Run Algorithm

for improving the inefficiency of the RW

이를 위해 proposal 분포에 대한 수정이 요구된다.

이는 RW에 추가적인 요소를 넣어 변형하는 것인데, RW의 거리와 방향을 거리에 대응하는 pdf, 방향에 대응하는 pdf를 각각 만들어 거기서 랜덤하게 생산하는 것이다.

draw below two parameters, and compute an MH acceptance probability $\alpha(x,y)$, where $x = x^{(t)} = X^{(t)}$.
- direction $d \sim g(d) \; \; \;(d \in \mathbb{O})$.
- distance $\lambda \sim I(\lambda \rvert d,x )$ over $\mathcal{X}_{x,d}$.
$X^{(t+1)} = \begin{cases} x^{(t)}+ \lambda d, & \text{if } \; U \le \alpha(x,y) = \frac{f(x')}{f(x^{(t)})}\frac{g(x' -> x^{(t)})}{g(x^{(t)}-> x' ))}\\ x^{(t)}, & o.w.\\ \end{cases}$.

이때, $g(d)$에 대한 가장 흔한 choice는 $\mathbb{O}$에 대한 $U$. 이외에 $g(\cdot \rvert x, d)$, $\alpha(x,y)$에 대한 가장 흔한 choice도 논해졌던 바가 있다.

이는 특히 sharply constrained 패러미터 space ($\Theta$) 와 함께하는 문제에 효과적이다. - Wrapped Normal Distribution

4.3.3 Metropolis-Adjusted Langevin Algorithm

how do we propose a new value? proposal improvement에 자주 사용된다. 또한, Hamiltonian MC와 아주 밀접한 관련이 있다.

Langevin 방정식에 기반하여 생성된 알고리즘. 해당 알고리즘은 기본적으로 $f$를 정적 (stationary) 분포로서 내버려두게 된다.

gradient flow에 의해 발생하면 얘도 로컬트랩 가능성 있는거 아님?

\[ dX_t = dB_t + \frac {1}{2} \bigtriangledown \log f(X_t ) \]

이때 $B_t$는 표준 브라운 운동.

이의 실적용은 Langevin diffusion process를 RW-like Transition으로 대체한 discretion step을 포함하는 아래의 식으로 이루어진다.

\[ x^{(t+1)} = x^{(t)} + \frac {\sigma^2}{2} \bigtriangledown \log f(X^{(t)} ) + \sigma \epsilon_t, \; \; \; \; \; \; \epsilon_t \sim N_d (0, I_d) \]

이때 $\sigma$는 step size of discretization.

하지만 discretized 된 프로세스는 transient (일시적, 해당 성질이 이후에도 이어질 것이라 장담 불가) 할 우려가 있으며 $f$에 대해 더이상 reversible 하지 않음. 이 negative behavior (악영향)을 보정 (correct)하기 위해서 discretization step을 MH acceptance-rejection rule에 기반하여 완화하는 것이 제기됨. 이인즉슨 실적용되는 수식을 conventional MH 제안 (proposal) 분포로서 취급하자는 것.

새로운 state: Langevin Dynamics 사용해서 제안
새로운 state의 accept 여부: MH 알고리즘 사용해서 평가

따라서 Langevin 알고리즘의 1회의 이터레이션은:

새로운 state $x^\ast = x^{(t)} + \frac {\sigma^2} {2} \bigtriangledown \log f(x^{(t)}) + \sigma \epsilon_t$. 이때 $\sigma$는 user-specified 패러미터.
For limited classes of target distributions, the optimal acceptance rate for this algorithm can be shown to be 0.574.
MH ra tio $r = \frac {f(x^\ast)}{f(x^{(t)})} \ast \frac {\exp \left\{ - \frac {1} {2\sigma^2} \left[ x^{(t)} - x^\ast - \frac {\sigma^2} {2} \bigtriangledown \log f(x^\ast) \right]^2 \right\}}{\exp \left\{ - \frac {1} {2\sigma^2} \left[ x^{\ast} - x^{(t)} - \frac {\sigma^2} {2} \bigtriangledown \log f(x^{(t)}) \right]^2 \right\}}$.
set $x^{(t+1)} = x^\ast$ with probability $\min (1, r)$, 남는 확률로는 $x^{(t+1)} = x^{(t)}$

Advantages
- gradient flow 방법론에 따라 높은 density region으로 향하도록 RW를 충동질함. RW 대비 mixing이나 convergence 관점에서 효과적.
- 고차원 분포에서 유리
Disadvantages
- gradient 계산에 자원 다쳐먹는 경우 있음
- multi-mode 관장 불가

이때 with probability $\min (1, r)$라는 것은 $u \sim U(0,1): u < min(1,r)$과 동치라는 것을 알아두자. 앞으로는 모두 이렇게 서술할 것.

4.3.4 Multiple-Try Metropolis Algorithm

H&R, MALA, MTMA + HMC는 모두 proposal의 성능을 높이기 위함

MH Transition rule에 기반한 MC에서, 이의 효율성은 제안 (proposal) 분포에 크게 의존한다.

여러번의 이동을 거치면 과거의 실값과는 멀어지기 때문에 independent하기 때문에 어떤 측면에선 효율적이지만 acceptance 측면에선 떨어짐.

let $\lambda(x,y)$ non-negative symmetric function. $q(y \rvert x) >0$ 이면 항상 $\lambda(x,y)>0$임을 가정. define $w(x,y)=f(x) \ast q(y \rvert x) \ast \lambda(x,y)$.

At here, when $q(y \rvert x)$ is symmetric, for example, one can choose $\lambda(x,y) = \frac {1} {q(y \rvert x)}$, and then $w(x, y) = f(x)$. 이 경우, MTM 알고리즘은 orientational bias MC 알고리즘으로 축소되는데, 이는 molecular simulation에서 사용되는 방법론 중 하나다. See Liu, Liang and Wong (2000) for the details. 위의 수식에서 proposal인 $q(y \vert x)$에 symmetric이라는 조건을 붙이자. 그러면 $q(y \vert x) = {1}{\lambda(x,y)}$라는 상황을 가정하는 것이 가능하다.

Current state is at $\pmb x$. Proceeds:
1. draw $y_1 , \cdots, y_k \sim T(\pmb x \rightarrow \pmb y)$, $T$ is proposal.
2. select $\pmb y = y_j$ with probability $\propto w(y_j , \pmb x)$.
3. draw $x_1^\ast, \cdots, x_{k-1}^\ast$ from $T(y \rightarrow x)$. let $x_k^\ast = \pmb x$.
  - 이때 3번에서 원본 데이터로 돌아가는 프로세스가 포함된 이유는 MHMCMC 알고리즘에서는 얼마만큼 original proposal로 잘 돌아갈 수 있는지를 항상 고려해주어야 함. for reversability.
4. accept proposed $\pmb y$ with probability $p = \min \left\{ 1, \frac{w(y_1, \pmb x) + \cdots + w(y_k, \pmb x)} {w(x_1^\ast, \pmb y) + \cdots + w(x_k^\ast, \pmb y)} \right \} \tag{1}$.
  - (1)의 확률 자체가 결정된 메커니즘은 고급확률론이 필요하므로 이해 불가능함

current state $\pmb x$에서, $\pmb x \rightarrow y_1\rightarrow y_2 \rightarrow \cdots \rightarrow y_k$로 가도록 한다. 이때 각 y에 대해 weight값이 존재.

4.3.5 Reversible Jump MCMC Algorithm

number of variable이 작을 때, 베이지안 Variable Selection에 자주 사용됨

4.3.5.1 Bayesian Variable Dimension Model

A Bayesian variable dimension model is defined as a collection of models

\[ \mathcal{M}_k = \left\{ f(\cdot \lvert \theta_k ) ; \; \; \theta_k \in \Theta_k \right\}, \; \; \; \; \; k=1:K \]

with a collection of priors $\pi_k (\theta_k)$ on the 패러미터 of these models. 이는 곧 model의 변화에 따라 각각의 model들 또한 다른 prior를 갖는다는 이야기이다. and a prior distribution $\rho_k, \; \; \; k=1, \cdots, K$ on the indices of these models.

※ Note: 당연히 각각의 model space $\Theta_k$들은 may have different dimensions. may라고 했지만 대부분의 경우 다름.

In this setting one can compute the posterior probability of models, i.e.

\[ p \left( \mathcal{M}_k \rvert \pmb y \right) = \frac {\rho_k \ast \int f_k (y \lvert \theta_k) \ast \pi_k (\theta_k) d\theta_k } {\sum_j \rho_j \ast \int f_j (y \lvert \theta_j) \ast \pi_j (\theta_j) d\theta_j} \]

RJMCMC 알고리즘이란, 패러미터 space의 dim이 정해지지 않은 상황에서, 이 dim과 상관없이 모델 space 자체를 이동할 수 있게 만들어준 MCMC 알고리즘.

단, 패러미터 space의 dim은 모르지만, 여기서는 모델의 총 갯수인 $k$로 fix가 되어 있다.

모델 간을 이동시키면서 모델 셀렉션
이와 동시에 패러미터 estimation까지 동시에 한다

는 것이 RJMCMC 알고리즘. 즉 RJMCMC 중에는 2가지 공정이 동시에 돌아간다

A variable dimension model is a “model where one among things you do not know is the number of things you do now know”, 연구자 모르는 것들 중 하나에, 니가 지금 알고 있는 것들의 갯수가 있는 모델. 즉, 내가 지금 알고 있는 것이 총 몇개인지조차도 모른다는 이야기이다. 그럼 대체 아는게 뭐야; e.g., 패러미터 space의 dimension이 고정되어 있지 않음. model selection, checking, improvement, 등등 다양한 상황에서 발생 가능. 모델 $\mathcal{M}_k$ 사이에서의 움직임을 설계하는데 있어 적절한 framework를 구축하고 싶다.

즉슨 모델에 대한 다양한 예상들이 있고, 이러한 모델의 예상도를 확률적으로 옮겨다니면서 이게 맞나? 이게 맞나? 를 체크한다는 이야기.

이를 위한 Green의 원칙:

모델 한 쌍 간의 움직임(채택 모델의 변경)만을 고려.
“dimension matching” moves를 설계.
MH 알고리즘과 유사하게 움직임을 with probability로 수용. 여기서의 probability 노테이션은 $q$.

let $x_t = (k^{(t)}, \theta_k^{(t)} )$가 현 상태를 나타내고, $x^{(t+1)}$에 대한 proposed state $x' = (k', \theta_k')$.

if $k'=k$, proposed move 가 같은 subspace $\mathcal{X}_k$에서 다른 위치를 탐색한다는 것이다. 따라서 dimension-matching problem 자체가 발생하지 않는다.
if $k' \not= k$, 분포 $\psi_{k^{(t)} \rightarrow k'} (u)$로부터의 $s$개의 rv $\pmb u = (u_1 , \cdots, u_s)$ 를 생산한다. 그리고 bijection $(\theta_k ' , u' ) = T(\theta_k^{(t)}, u)$를 생각하자. 이때 $s'$ 차원의 rvec$u' = (u_1 , \cdots, u_{s'})$이며, $s$와 $s'$는 dimension-matching condition $s+d_k = s' + d_{k'}$를 만족한다.

Proceeds:

모델 $\mathcal{M}_k$ with probability $q(k^{(t)}, k')$에 의해 선택.
generate $u_1 , \cdots, u_s \sim \psi_{k^{(t)} \rightarrow k'} (u)$
$(\theta_{k'}', u') = T(\theta_k^{(t)}, u)$.
Compute MH ratio $r = \frac {f(k', \theta'_{k'} \rvert Y) } {f(k^{(t)}, \theta^{(t)}_{k} \rvert Y) } \frac {g(k^{(t)} \rvert k')} {g(k' \rvert k^{(t)})} \frac {\psi_{k' \rightarrow k^{(t)}} (u')} {\psi_{k^{(t)} \rightarrow k'} (u)} \left\lvert {\frac{\partial (\theta'_{k'}, u')}{\partial (\theta^{(t)}_{k}, u)}} \right\rvert$
where $\frac{\partial (\theta'_{k'}, u')}{\partial (\theta^{(t)}_{k} , u ) }$ is Jacobian of Transformation.
set $X^{(t+1)} = (k', \theta'_{k})$ with probability $\min (1,r)$.

그러나 이렇게 무제한 모델을 만들면 너무 어려움. 따라서 보통 제약식을 추가해서 간단한 모델을 사용함. 사용되는 제약식은 각 이터레이션에서 이전 것과 이후 것 간의 dim 차이가 ±1 이라는 것.

Example: Reversible Jump MCMC Algorithm

\(X_{obs}\)	observed data
\(X_{mis}\)	missing data
\(X_{com}= \left(X_{obs},X_{mis} \right)\)	complete data