
Learning Hidden Markov Models with Simple Examples [Part 2]

Hidden Markov Model Diagram

In the previous part, we covered the first task of the presented example. In this part, I’ll present the solution to the second task of the problem and try to explain the intuition behind it.

Problem Recap

Consider an HMM with three hidden states and two observable symbols (“a” and “b”), defined by the following parameters:

Initial distribution $\pi$:

$$\pi = \begin{bmatrix} 0.2 \\ 0.4 \\ 0.4 \end{bmatrix}$$

Transition matrix $A$ (entry $a_{ij}$ is the probability of transitioning from state $i$ to state $j$):

$$A = \begin{bmatrix} 0.5 & 0.2 & 0.3 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.3 & 0.5 \end{bmatrix}$$

Emission matrix $B$ (row $i$ gives the probabilities of observing “a” and “b” from hidden state $i$):

$$B = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \\ 0.7 & 0.3 \end{bmatrix}$$

We consider a short horizon of $T = 3$ time steps with the observed sequence $O = (a, b, a)$.
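For readers who want to follow along in code, the parameters above can be written down directly. This is a minimal sketch in plain Python; the 0-based state indexing and the 0/1 encoding of “a”/“b” are my own conventions, not part of the problem statement:

```python
# HMM parameters from the problem statement (0-based state indices).
pi = [0.2, 0.4, 0.4]           # initial distribution over the 3 hidden states

A = [[0.5, 0.2, 0.3],          # A[i][j] = P(next state j | current state i)
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]

B = [[0.5, 0.5],               # B[i][k] = P(symbol k | state i); columns: "a", "b"
     [0.4, 0.6],
     [0.7, 0.3]]

O = [0, 1, 0]                  # observed sequence "a, b, a" encoded as 0/1
```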

The Tasks

  1. Use the Forward algorithm to calculate $P(O \mid \lambda)$. (Done in Part 1.)
  2. Find the optimal hidden state sequence $I^* = (i_1^*, i_2^*, i_3^*)$: which 3 hidden states are most likely to have produced the sequence “a → b → a”?

Solution to Part 2: The Viterbi Algorithm

To find the most probable hidden state sequence, we need, at each time step and for each state, the maximum product of three factors: the emission probability of the current observation, the transition probability from the previous state, and the best score carried over from the previous step. Two aspects make this recursion work:

Important Aspects

  1. We have evidence of the observed state at each time step, given in the task as O=(a,b,a)O = (a, b, a).
  2. We don’t have to look further back than the most recent hidden state, because conditioning on it “blocks” the influence of any earlier evidence in the Markov chain.

If you’re not certain about the second aspect, be sure to check out the concept of conditional independence. The Markov property ensures that the future is independent of the past given the present state.

Mathematical Definition

Once this is established, we can give the mathematical definition of what needs to be calculated at each step:

$$\mu_k(i_k) = \max_{i_{1:k-1}} P(i_{1:k}, O_{1:k}) = P(O_k \mid i_k) \cdot \max_{i_{k-1}} \left[ P(i_k \mid i_{k-1}) \cdot \mu_{k-1}(i_{k-1}) \right]$$

This is the highest joint probability of the observations up to time $k$ and any hidden state path that ends in state $i_k$ at time $k$.

Then it’s a matter of computing these values for every time step, while recording which previous state achieved each maximum so the best sequence can be read off at the end.
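The whole recursion, together with the bookkeeping of which predecessor achieved each maximum, fits in a few lines of Python. This is a sketch under my own naming (`viterbi`, `mu`, `back`); states are 0-based here, so state 3 in the text is index 2:

```python
# Viterbi algorithm for this example's 3-state, 2-symbol HMM.
pi = [0.2, 0.4, 0.4]
A = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]  # columns: "a", "b"

def viterbi(O, pi, A, B):
    n = len(pi)
    # mu[t][j]: highest joint probability of observations O[0..t]
    # and any state path ending in state j at time t.
    mu = [[pi[j] * B[j][O[0]] for j in range(n)]]
    # back[t][j]: predecessor state that achieved mu[t][j].
    back = [[0] * n]
    for t in range(1, len(O)):
        row, ptr = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: mu[t - 1][i] * A[i][j])
            row.append(mu[t - 1][i_best] * A[i_best][j] * B[j][O[t]])
            ptr.append(i_best)
        mu.append(row)
        back.append(ptr)
    # Backtrack from the best final state to recover the optimal path.
    path = [max(range(n), key=lambda j: mu[-1][j])]
    for t in range(len(O) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, max(mu[-1])

path, p = viterbi([0, 1, 0], pi, A, B)  # observations "a, b, a"
```

Note the backtracking step at the end: the optimal sequence is recovered by following the stored predecessors from the best final state, not by taking the argmax of `mu` at each time step independently.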

Step-by-Step Calculations

In the calculations below I evaluate all three $\mu$ values at once as a vector for readability, but the important thing to remember is that each $\mu_k(i_k)$ is a single scalar, and from the second step on it is a maximum over the previous step’s values.

First Time Step (Observation “a”)

For $t = 1$ there is nothing to maximize over yet, since there is no previous state; we simply weight the initial distribution by the emission probabilities of “a”:

$$\mu_1(i_1) = P(a \mid i_1) \cdot \pi_{i_1}: \quad \begin{bmatrix} 0.5 \\ 0.4 \\ 0.7 \end{bmatrix} \odot \begin{bmatrix} 0.2 \\ 0.4 \\ 0.4 \end{bmatrix} = \begin{bmatrix} 0.10 \\ 0.16 \\ 0.28 \end{bmatrix}$$

(Here $\odot$ denotes the elementwise product.)

Result: The highest value is $\mu_1(3) = 0.28$, i.e. hidden state 3 at 28%.
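This first step is just an elementwise product, which is easy to check numerically. A tiny sketch, where `B_a` is my shorthand for the “a” column of $B$:

```python
# First Viterbi step: weight the initial distribution by P(a | state).
pi = [0.2, 0.4, 0.4]
B_a = [0.5, 0.4, 0.7]  # P(a | state) for states 1..3
mu1 = [b * p for b, p in zip(B_a, pi)]
print([round(v, 2) for v in mu1])  # -> [0.1, 0.16, 0.28]
```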

Second Time Step (Observation “b”)

For $t = 2$:

For each possible current hidden state, we find the maximum over all possible previous states:

$$\mu_2(i_2) = \max_{i_1} \left[ P(b \mid i_2) \cdot P(i_2 \mid i_1) \cdot \mu_1(i_1) \right]$$

For hidden state 1:

$$\mu_2(1) = P(b \mid 1) \cdot \max_{i_1} \left[ P(1 \mid i_1) \cdot \mu_1(i_1) \right] = 0.5 \cdot \max\left( \begin{bmatrix} 0.5 \\ 0.3 \\ 0.2 \end{bmatrix} \odot \begin{bmatrix} 0.10 \\ 0.16 \\ 0.28 \end{bmatrix} \right) = 0.5 \cdot 0.056 = 0.028$$

For hidden state 2:

$$\mu_2(2) = P(b \mid 2) \cdot \max_{i_1} \left[ P(2 \mid i_1) \cdot \mu_1(i_1) \right] = 0.6 \cdot \max\left( \begin{bmatrix} 0.2 \\ 0.5 \\ 0.3 \end{bmatrix} \odot \begin{bmatrix} 0.10 \\ 0.16 \\ 0.28 \end{bmatrix} \right) = 0.6 \cdot 0.084 = 0.0504$$

For hidden state 3:

$$\mu_2(3) = P(b \mid 3) \cdot \max_{i_1} \left[ P(3 \mid i_1) \cdot \mu_1(i_1) \right] = 0.3 \cdot \max\left( \begin{bmatrix} 0.3 \\ 0.2 \\ 0.5 \end{bmatrix} \odot \begin{bmatrix} 0.10 \\ 0.16 \\ 0.28 \end{bmatrix} \right) = 0.3 \cdot 0.14 = 0.042$$

Result: The largest value at the second time step is $\mu_2(2) = 0.0504$. Note that for all three states, the maximizing predecessor is $i_1 = 3$; we record these backpointers for later.
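The same step can be checked in code, this time also recording which previous state achieved each maximum. A sketch, where `back2` holds the 1-based labels of the best predecessors:

```python
# Second Viterbi step: for each current state j, maximize over previous states i.
A = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B_b = [0.5, 0.6, 0.3]          # P(b | state) for states 1..3
mu1 = [0.10, 0.16, 0.28]
mu2, back2 = [], []
for j in range(3):
    i_best = max(range(3), key=lambda i: mu1[i] * A[i][j])
    mu2.append(B_b[j] * mu1[i_best] * A[i_best][j])
    back2.append(i_best + 1)   # 1-based state label of the best predecessor
print([round(v, 4) for v in mu2], back2)  # -> [0.028, 0.0504, 0.042] [3, 3, 3]
```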

Third Time Step (Observation “a”)

For $t = 3$:

For hidden state 1:

$$\mu_3(1) = P(a \mid 1) \cdot \max_{i_2} \left[ P(1 \mid i_2) \cdot \mu_2(i_2) \right] = 0.5 \cdot \max\left( \begin{bmatrix} 0.5 \\ 0.3 \\ 0.2 \end{bmatrix} \odot \begin{bmatrix} 0.028 \\ 0.0504 \\ 0.042 \end{bmatrix} \right) = 0.5 \cdot 0.01512 = 0.00756$$

For hidden state 2:

$$\mu_3(2) = P(a \mid 2) \cdot \max_{i_2} \left[ P(2 \mid i_2) \cdot \mu_2(i_2) \right] = 0.4 \cdot \max\left( \begin{bmatrix} 0.2 \\ 0.5 \\ 0.3 \end{bmatrix} \odot \begin{bmatrix} 0.028 \\ 0.0504 \\ 0.042 \end{bmatrix} \right) = 0.4 \cdot 0.0252 = 0.01008$$

For hidden state 3:

$$\mu_3(3) = P(a \mid 3) \cdot \max_{i_2} \left[ P(3 \mid i_2) \cdot \mu_2(i_2) \right] = 0.7 \cdot \max\left( \begin{bmatrix} 0.3 \\ 0.2 \\ 0.5 \end{bmatrix} \odot \begin{bmatrix} 0.028 \\ 0.0504 \\ 0.042 \end{bmatrix} \right) = 0.7 \cdot 0.021 = 0.0147$$

Collecting the three values:

$$\mu_3 = \begin{bmatrix} 0.00756 \\ 0.01008 \\ 0.0147 \end{bmatrix}$$

Result: The most probable hidden state at the third time step is hidden state 3 ($\mu_3(3) = 0.0147$), and its maximum was achieved with predecessor $i_2 = 3$.

Final Answer

To recover the optimal sequence, we take the argmax only at the final step and then follow the recorded backpointers: the best final state is $i_3^* = 3$, whose maximizing predecessor was $i_2 = 3$ (since $0.5 \cdot 0.042 = 0.021$ beats $0.2 \cdot 0.0504 = 0.01008$), and the maximizing predecessor of state 3 at $t = 2$ was $i_1 = 3$. This gives:

$$I^* = (i_1^*, i_2^*, i_3^*) = (3, 3, 3), \qquad P^* = \mu_3(3) = 0.0147$$

In words: if you observe the sequence “a → b → a” during the first 3 time steps, the single most probable hidden state sequence behind it is “3 → 3 → 3”. Note that naively taking the per-step argmax of $\mu$ would suggest (3, 2, 3), but that sequence has joint probability only $0.28 \cdot 0.3 \cdot 0.6 \cdot 0.2 \cdot 0.7 = 0.007056$, less than half of $P^*$. This is why the backpointers matter.
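As a sanity check, the joint probability of any fully specified path can be computed directly; comparing the backtracked path 3 → 3 → 3 with the per-step-argmax sequence 3 → 2 → 3 confirms that only the former is optimal. A sketch with 0-based indices, so state 3 is index 2:

```python
pi = [0.2, 0.4, 0.4]
A = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]
O = [0, 1, 0]  # "a, b, a"

def path_prob(path):
    # Joint probability P(path, O) of a fully specified state sequence.
    p = pi[path[0]] * B[path[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][O[t]]
    return p

print(round(path_prob([2, 2, 2]), 6))  # 3 -> 3 -> 3: 0.0147
print(round(path_prob([2, 1, 2]), 6))  # 3 -> 2 -> 3: 0.007056
```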


Comments, corrections, and other remarks are always appreciated.

