4.6 Hamiltonian Monte Carlo

A concept introduced to compensate for the weakness of random-walk Metropolis, namely the inefficiency that comes from its randomness.

Hamiltonian Monte Carlo borrows an idea from physics to suppress the local random walk behavior in the Metropolis algorithm, thus allowing it to move much more rapidly through the target distribution.

Version of Metropolis where you take many “steps” per “iteration”

Use “Hamiltonian dynamics” and a latent “momentum (\(\phi_j\))” vector so the steps within an iteration move along a “trajectory”.

Requires computation of the gradient of the log target density, because the gradient determines how far each step moves.

Take \(L\) leapfrog steps (how many leapfrog steps to take), each of distance \(\epsilon\) (how far to move in one leapfrog step), then accept/reject.

In a leapfrog step, both \(\theta\) and \(\phi\) are changed, each in relation to the other.

Effective sample size per unit time \(= ESS/S\): the ESS is the number of draws the chain would be worth if they were uncorrelated. MCMC draws depend on the past of the chain and are therefore necessarily correlated, so if we produce 10,000 draws, the ESS measures how many independent draws those 10,000 are actually worth; the larger the ESS, the better. Dividing by the computation time \(S\) puts the comparison in terms of samples per unit time. This quantity is used very heavily in algorithm comparison.
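As a concrete illustration, here is a minimal sketch of computing \(ESS\) and \(ESS/S\) in Python, assuming a single chain and a simple truncate-at-first-negative-autocorrelation estimator; production tools (e.g. rstan, arviz) use more careful multi-chain estimators.

```python
import numpy as np

def effective_sample_size(draws):
    """ESS = n / (1 + 2 * sum of autocorrelations), a single-chain sketch."""
    n = len(draws)
    x = draws - draws.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n  # autocovariances, lags 0..n-1
    rho = acov / acov[0]
    tau = 1.0
    for k in range(1, n):
        if rho[k] < 0:          # truncate at the first negative autocorrelation
            break
        tau += 2.0 * rho[k]
    return n / tau

# Example: 10,000 draws from an AR(1) chain are worth far fewer independent ones.
rng = np.random.default_rng(1)
draws = np.zeros(10_000)
for t in range(1, len(draws)):
    draws[t] = 0.9 * draws[t - 1] + rng.normal()

ess = effective_sample_size(draws)
seconds = 2.5                   # hypothetical computation time S
print(f"ESS = {ess:.0f} of {len(draws)} draws; ESS/S = {ess / seconds:.0f} per second")
```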

Rstan uses HMC rather than Gibbs sampling, so even a relatively small number of draws is OK.

How far to jump in each step?

  • If \(\epsilon\) is too small, you waste time shuffling along.
  • If \(\epsilon\) is too large, the physical approximation breaks and you find yourself rejecting, as illustrated in the sketch after the next list.

How many steps? (No-U-Turn Sampler)

  • If \(L\) is too small, you might not go far enough in each iteration.
  • If \(L\) is too large, you’ll waste time circling around and around.
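The following sketch illustrates the large-\(\epsilon\) failure mode on a standard normal target with unit mass: the leapfrog energy error \(\vert \Delta H \vert\) stays small for small stepsizes but blows up once \(\epsilon\) crosses the integrator’s stability limit (here \(\epsilon = 2\)), driving the acceptance probability to zero. All names are illustrative.

```python
import math

def energy_error(eps, L, q=1.0, p=1.0):
    # H(q, p) = V(q) + K(p) with V(q) = q^2/2 (standard normal) and K(p) = p^2/2
    H = lambda q, p: 0.5 * q**2 + 0.5 * p**2
    h0 = H(q, p)
    for _ in range(L):
        p -= 0.5 * eps * q      # half-step in momentum (grad V(q) = q)
        q += eps * p            # full step in position
        p -= 0.5 * eps * q      # half-step in momentum
    return abs(H(q, p) - h0)

for eps in [0.1, 0.5, 1.0, 1.9, 2.1]:
    err = energy_error(eps, L=20)
    # acceptance probability is roughly exp(-|Delta H|) when Delta H > 0
    print(f"eps={eps:4.1f}  |Delta H| = {err:.3e}  accept ~ {min(1.0, math.exp(-err)):.3f}")
```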

Still, HMC can be much better than Gibbs or Metropolis in high dimensions.

Energy barrier: HMC is vulnerable to multimodal targets, because the low-density region between modes acts as a potential-energy barrier that trajectories rarely cross.



  • The algorithm proceeds as follows:

First we must generate a momentum. The basic mechanism of Hamiltonian dynamics is the exchange of kinetic and potential energy, with kinetic + potential energy always held constant. The simplest way to generate the momentum is to take \(M = I\), although in some respects \(I\) is inefficient; using the inverse Fisher information has also been proposed, though that reportedly requires setting parameters via fixed-point iterations along the way. Where the gradient is gentle, the trajectory can cover more distance.

A. The iteration begins by updating \(\phi\) with a random draw from its posterior distribution - which is the same as its prior distribution \(\phi \sim N(0, M)\).

B. A simultaneous update of \((\theta, \phi)\) conducted in an elaborate but effective fashion via a discrete mimicking of physical dynamics. One iteration of step B corresponds to one leapfrog step.

  1. Use the gradient of the log-posterior density of \(\theta\) to make a half-step of \(\phi\).

\[
\phi_{half.new} \;\leftarrow\; \phi + \dfrac{1}{2}\, \epsilon \, \dfrac{d \log \pi(\theta \vert y)}{d \theta}
\]

  2. Use the ‘momentum’ vector \(\phi\) to update the ‘position’ vector \(\theta\):

\[
\theta_{new} \;\leftarrow\; \theta + \epsilon\, M^{-1} \phi_{half.new}
\]

  3. Use the gradient of the log-posterior density at the updated \(\theta\) to half-update \(\phi\):

\[
\phi_{new} \;\leftarrow\; \phi_{half.new} + \dfrac{1}{2}\, \epsilon \, \dfrac{d \log \pi(\theta \vert y)}{d \theta}
\]

C. Label \(\theta^{t-1}, \phi^{t-1}\) as the values of the parameter and momentum vectors at the start of the leapfrog process and \(\theta^{\ast}, \phi^{\ast}\) as their values after the \(L\) steps. In the accept-reject step, we compute

\[
r = \dfrac{\pi(\theta^{\ast} \vert y)\, \pi(\phi^{\ast})}{\pi(\theta^{t-1} \vert y)\, \pi(\phi^{t-1})}
\]

D. Set \(\theta^t = \begin{cases} \theta^\ast & \text{with probability } \min(1,r) \\ \theta^{t-1} & o.w. \end{cases}\)
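Putting steps A–D together, here is a minimal sketch of one HMC iteration in Python, assuming \(M = I\) and user-supplied `log_post` and `grad_log_post` functions (the names are illustrative, not from any particular library).

```python
import numpy as np

def hmc_iteration(theta, log_post, grad_log_post, eps, L, rng):
    # A. Draw a fresh momentum phi ~ N(0, M) with M = I.
    phi = rng.normal(size=theta.shape)
    theta_star, phi_star = theta.copy(), phi.copy()

    # B. L leapfrog steps: half-step phi, full step theta, half-step phi.
    for _ in range(L):
        phi_star += 0.5 * eps * grad_log_post(theta_star)
        theta_star += eps * phi_star            # M^{-1} = I here
        phi_star += 0.5 * eps * grad_log_post(theta_star)

    # C. r = [pi(theta*|y) pi(phi*)] / [pi(theta^{t-1}|y) pi(phi^{t-1})],
    #    computed on the log scale; pi(phi) is the N(0, I) density.
    log_r = (log_post(theta_star) - 0.5 * phi_star @ phi_star) \
          - (log_post(theta) - 0.5 * phi @ phi)

    # D. Accept theta* with probability min(1, r).
    if np.log(rng.uniform()) < log_r:
        return theta_star
    return theta

# Usage: sample a 2-D standard normal target.
rng = np.random.default_rng(0)
log_post = lambda th: -0.5 * th @ th
grad_log_post = lambda th: -th
theta = np.zeros(2)
draws = []
for _ in range(1000):
    theta = hmc_iteration(theta, log_post, grad_log_post, eps=0.3, L=10, rng=rng)
    draws.append(theta)
print(np.mean(draws, axis=0), np.std(draws, axis=0))    # ~[0, 0] and ~[1, 1]
```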


4.6.0.0.1 HMC Example Trajectory

  • Blue ellipse is contour of target distribution
  • Initial position at black solid circle.
  • Arrows indicate a U-turn in momentum

“One practical impediment to the use of Hamiltonian Monte Carlo is the need to select suitable values for the leapfrog stepsize, \(\epsilon\), and the number of leapfrog steps \(L\). Tuning HMC will usually require preliminary runs with trial values for \(\epsilon\) and \(L\). Unfortunately, preliminary runs can be misleading….”


4.6.1 Introduction to Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) was originally developed in the late 1980s as Hybrid Monte Carlo to tackle calculations in Lattice Quantum Chromodynamics, a field focused on understanding the structure of protons and neutrons.

Neal (1995, 2011) introduced Hamiltonian Monte Carlo into the mainstream of statistical computing.

HMC is built upon a rich theoretical foundation, which is formulated in terms of differential geometry, that makes it uniquely suited to the high-dimensional problems of applied interest.


4.6.1.0.1 Foundations of Hamiltonian Monte Carlo

Most Markov transitions are diffusive, concentrating around the initial point such that the corresponding Markov chains linger in small neighborhoods of the typical set for long periods of time. In order to maximize the utility of our computational resources, we need coherent Markov transitions that are able to glide across the typical set towards new, unexplored neighborhoods.

In order to make large jumps away from the initial point, and into new, unexplored regions of the typical set, we need to exploit information about the geometry of the typical set.

HMC is the unique procedure for automatically generating this coherent exploration for sufficiently well-behaved target distributions.

Introduce some intuition to motivate how we can generate the desired exploration by carefully exploiting the differential structure of the target probability density.

Discuss the procedure with the complete construction of the Hamiltonian Markov transition.

When the sample space is continuous, a natural way of encoding this direction information is with a vector field aligned with the typical set.

A vector field is the assignment of a direction at every point in parameter space, and if those directions are aligned with the typical set then they act as a guide through this neighborhood of largest target probability.

By this construction, we follow the direction assigned to each point for a small distance.

Continuing this process traces out a coherent trajectory through the typical set that efficiently moves us far away from the initial point to new, unexplored regions of the typical set as quickly as possible.

The gradient of the target pdf encodes information about the geometry of the typical set, but not enough to guide us through the typical set by itself. Following along the gradient instead pulls us away from the typical set and towards the mode of the target density. In order to generate motion through the typical set we need to introduce additional structure that carefully twists the gradient into alignment with the typical set.
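A quick numerical illustration of this pull toward the mode, assuming a D-dimensional standard normal target: a typical draw sits near radius \(\sqrt{D}\), but gradient ascent on the log density shrinks it toward the origin, away from the typical set.

```python
import numpy as np

D = 100
rng = np.random.default_rng(0)
q = rng.normal(size=D)          # a typical draw, ||q|| close to sqrt(D) = 10
print(f"typical radius sqrt(D) = {np.sqrt(D):.1f}, start ||q|| = {np.linalg.norm(q):.1f}")

for step in range(1, 51):
    q += 0.05 * (-q)            # gradient of log pi(q) = -q.q/2 is -q
    if step % 10 == 0:
        print(f"after {step:2d} gradient steps: ||q|| = {np.linalg.norm(q):.2f}")
# ||q|| decays toward 0: following the gradient leaves the typical set for the mode.
```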

Need to construct a vector field aligned with the typical set using only information that we can extract from the target distribution.

The natural information is the differential structure of the target distribution which we can query through the gradient of the target probability density.

The gradient defines a vector field in parameter space sensitive to the structure of the target distribution.

Unfortunately, that sensitivity is not sufficient as the gradient will never be aligned with the typical set. Following the guidance of the gradient pulls us away from the typical set and towards the mode of the target density.

Consider, as a physical analogy, a satellite orbiting a planet:

  1. Without enough transverse momentum to balance against the gravitational attraction of the planet, the satellite will crash into the planet.
  2. On the other hand, if the satellite is given too much momentum, then the gravitational attraction will be too weak to capture it in a stable orbit, and it will instead abandon the planet for the depths of space.

When we introduce exactly the right amount of momentum to the physical system, the equations describing the evolution of the satellite define a vector field aligned with the orbit. The subsequent evolution of the system then traces out orbital trajectories.

The gradient can direct us towards only parameterization-sensitive neighborhoods like that around the mode, and not the parameterization-invariant neighborhoods within the typical set.

To utilize the information in the gradient we need to complement it with additional geometric constraints, carefully removing the dependence on any particular parameterization while twisting the directions to align with the typical set.

If we add the right amount of momentum, then the momentum will exactly balance against the gradient information, and the corresponding dynamics of the system will be conservative.

The key to twisting the gradient vector field into a vector field aligned with the typical set, and hence one capable of generating efficient exploration, is to expand our original probabilistic system with the introduction of auxiliary momentum parameters.

There is only one procedure for introducing auxiliary momentum with such a probabilistic structure: Hamiltonian Monte Carlo.


4.6.1.0.2 Phase Space and Hamilton’s Equations

A defining feature of conservative dynamics is the preservation of volume in position-momentum phase space. For example, although the dynamics might compress volumes in position space, the corresponding volume in momentum space expands to compensate and ensure that the total volume is invariant. Such volume-preserving mappings are also known as shear transformations.

Conservative dynamics in physical systems requires that volumes are exactly preserved.

As the system evolves, any compression or expansion in position space must be compensated with a respective expansion or compression in momentum space to ensure that the volume of any neighborhood in position-momentum phase space is unchanged.
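A small sketch of this volume-preservation property, assuming a standard normal target (so \(\nabla V(q) = q\)): the Jacobian of a single leapfrog step, the discrete mimic of the dynamics, has determinant one in \((q, p)\) phase space, up to floating-point error.

```python
import numpy as np

def leapfrog_step(z, eps=0.3):
    q, p = z
    p = p - 0.5 * eps * q       # half-step in momentum (grad V(q) = q)
    q = q + eps * p             # full step in position
    p = p - 0.5 * eps * q       # half-step in momentum
    return np.array([q, p])

# Numerical Jacobian of the map (q, p) -> leapfrog_step(q, p).
z0, h = np.array([0.7, -0.4]), 1e-6
J = np.empty((2, 2))
for j in range(2):
    dz = np.zeros(2); dz[j] = h
    J[:, j] = (leapfrog_step(z0 + dz) - leapfrog_step(z0 - dz)) / (2 * h)

print(np.linalg.det(J))         # ~1.0: phase-space volume is preserved
```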

In order to mimic this behavior in our probabilistic system we need to introduce auxiliary momentum parameters, \(p_n\), to complement each dimension of our target parameter space, \(q_n\), expanding the D-dimensional parameter space into a 2D-dimensional phase space.

\[
q_n \rightarrow (q_n, p_n)
\]

Moreover, these auxiliary momenta have to be dual to the target parameters, transforming in the opposite way under any reparameterization so that phase space volumes are invariant.

Having expanded the target parameter space to phase space, we can lift the target distribution onto a joint probability distribution on phase space called the canonical distribution. We do this with a choice of conditional probability distribution over the auxiliary momentum, \(\pi(q, p) = \pi(p \vert q)\, \pi(q)\), which ensures that if we marginalize out the momentum we immediately recover our target distribution.

By constructing a probability distribution on phase space that marginalizes to the target distribution, we ensure that the typical set on phase space projects to the typical set of the target distribution. In particular, if we can construct trajectories that efficiently explore the joint distribution, they will project to trajectories that efficiently explore the target distribution.
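A minimal numerical check of this marginalization property, assuming a standard normal target \(\pi(q)\) and momentum distribution \(\pi(p \vert q) = N(0, 1)\): integrating the canonical density over \(p\) recovers \(\pi(q)\).

```python
import numpy as np

target = lambda q: np.exp(-0.5 * q**2) / np.sqrt(2 * np.pi)
momentum = lambda p: np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)

p_grid = np.linspace(-10, 10, 2001)
dp = p_grid[1] - p_grid[0]
for q in [0.0, 1.0, 2.5]:
    joint = momentum(p_grid) * target(q)    # pi(q, p) = pi(p|q) pi(q) on the grid
    marginal = np.sum(joint) * dp           # integrate out the momentum p
    print(f"q={q}: marginal = {marginal:.6f}, target = {target(q):.6f}")
```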

The canonical density \(\pi(q, p)\) does not depend on a particular choice of parameterization, and we can write it in terms of an invariant Hamiltonian function, \(H(q, p)\),

\[
\pi(q, p) = \exp \big( -H(q, p) \big)
\]

Because \(H(q, p)\) is independent of the details of any parameterization, it captures the invariant probabilistic structure of the phase space distribution and hence the geometry of its typical set.

The value of the Hamiltonian at any point in phase space is called the energy at that point.

The Hamiltonian decomposes into two terms: the term corresponding to the density over the auxiliary momentum, \(K(p, q)\), is called the kinetic energy (unconstrained, and specified by the implementation), while the term corresponding to the density of the target distribution, \(V(q)\), is known as the potential energy (determined by the target distribution).

\[
\begin{align}
H(q,p) &= -\log \pi(q, p) \\
&= -\log \pi(p \vert q) - \log \pi(q) \\
&= K(p, q) + V(q)
\end{align}
\]

Because the Hamiltonian captures the geometry of the typical set, we should be able to use it to generate a vector field oriented with the typical set of the canonical distribution.

The desired vector field can be generated from a given Hamiltonian with

\[
\dfrac{dq}{dt} = +\dfrac{\partial H}{\partial p} = \dfrac{\partial K}{\partial p},
\qquad
\dfrac{dp}{dt} = -\dfrac{\partial H}{\partial q} = -\dfrac{\partial K}{\partial q} - \dfrac{\partial V}{\partial q}
\]

By channeling the gradient through the momentum instead of the target parameter directly, Hamilton’s equations twist differential information to align with the typical set of the canonical distribution.

Following the Hamiltonian vector field for some time, \(t\), generates trajectories, \(\phi_t (q, p)\), that rapidly move through phase space while being constrained to the typical set.

Projecting these trajectories back down onto the target parameter space finally yields the efficient exploration of the target typical set for which we are searching.
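For intuition, here is a sketch of this flow in the simplest case, \(H(q, p) = q^2/2 + p^2/2\) (a standard normal target with unit mass): Hamilton’s equations reduce to \(dq/dt = p\), \(dp/dt = -q\), whose exact solution is a rotation in phase space along which \(H\) is exactly conserved, i.e. the trajectory glides along a level set rather than falling toward the mode.

```python
import numpy as np

def hamiltonian_flow(q0, p0, t):
    # Exact solution of dq/dt = p, dp/dt = -q: a rotation in phase space.
    return (q0 * np.cos(t) + p0 * np.sin(t),
            p0 * np.cos(t) - q0 * np.sin(t))

H = lambda q, p: 0.5 * q**2 + 0.5 * p**2
q0, p0 = 1.0, 0.5
for t in [0.0, 0.5, 1.0, 2.0, 4.0]:
    q, p = hamiltonian_flow(q0, p0, t)
    print(f"t={t:3.1f}  (q, p) = ({q:+.3f}, {p:+.3f})  H = {H(q, p):.6f}")
# H stays exactly constant while (q, p) moves far from its starting point.
```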

Every Hamiltonian Markov transition is comprised of a random lift from the target parameter space onto phase space, a deterministic Hamiltonian trajectory through phase space, and a projection back down to the target parameter space.



  • Need a mechanism for introducing momentum to a given point in the target parameter space.

To lift an initial point in parameter space into one on phase space, we simply sample from the conditional distribution over the momentum, \(p \sim \pi(p \vert q)\).

Once on phase space we can explore the joint typical set by integrating Hamilton’s equations for some time, \((q,p) \rightarrow \phi_t (q,p)\). By construction these trajectories coherently push the Markov transition away from the initial point, and neighborhoods that we have already explored, while staying confined to the joint typical set.

After integrating Hamilton’s equations, we can return to the target parameter space by simply projecting away the momentum, \((q,p) \rightarrow q\).
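The three stages (lift, integrate, project) fit in a few lines; this is a sketch assuming \(\pi(p \vert q) = N(0, I)\) and a leapfrog integrator standing in for the exact flow \(\phi_t\), which is why a real implementation adds the accept/reject correction of steps C–D above.

```python
import numpy as np

def hamiltonian_transition(q, grad_log_target, eps, n_steps, rng):
    p = rng.normal(size=q.shape)            # 1. lift: sample p ~ pi(p|q) = N(0, I)
    q = q.copy()
    for _ in range(n_steps):                # 2. integrate Hamilton's equations
        p += 0.5 * eps * grad_log_target(q)
        q += eps * p
        p += 0.5 * eps * grad_log_target(q)
    return q                                # 3. project: discard the momentum

# Usage on a 2-D standard normal target, grad log pi(q) = -q.
rng = np.random.default_rng(0)
q = np.zeros(2)
for _ in range(5):
    q = hamiltonian_transition(q, lambda q: -q, eps=0.1, n_steps=20, rng=rng)
    print(q)
```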