While studying for my PhD, I usually write down some notes on what I'm reading. Guillermo, a friend of mine, recommended that I put these notes on the blog, so I don't have to work so hard on this web site. I thought that this was a good idea, but I had a small problem: all of my notes are in LaTeX :-s. Fortunately, my friends are cleverer than me and they recommended that I use MathJax (http://www.mathjax.org), thanks Mariano :-).
Well, a disclaimer first: these notes are for my own use, they were not written to be public. So I would not recommend using them for any academic purpose :-).
HMM
We can talk about Hidden Markov Models (HMM) when we have a list of observations $O=\{o_1,o_2,...,o_k\}$ and a set of hidden states $S=\{s_1,s_2,...,s_t\}$, and we want to determine the sequence of states from $S$ that most probably led to $O$. For example, consider that if it rains you have a higher probability of staying at home; if it is sunny, you prefer going to the park; but if it is cloudy you would rather go to the cinema. These are your preferences, but there is a non-zero probability of staying at home if it is cloudy or sunny, and the same goes for going to the cinema and going to the park. Let's say you quantify the probability of doing something based on the weather, and your actions for the past three days were "going to the cinema, going to the park, going to the park". The question is: what was the weather like for the last three days? This is a Hidden Markov Model: your hidden states are the weather possibilities and your observations are your actions for the last three days.
An HMM is specified by a set of states $S=\{s_1,s_2,...,s_t\}$, a set of observations $O=\{o_1,o_2,...,o_k\}$, a transition probability matrix $A$ where each $a_{ij}$ represents the probability of moving from state $i$ to state $j$, an emission probability $E=e_i(o_t)$ which is the probability of an observation $o_t$ being generated by state $i$, and finally $q_0$ and $q_1$, which are the start and end states.
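To make these pieces concrete, here is a minimal sketch in Python of how the components of an HMM could be laid out for the weather example above; every name and number here is invented by me just for illustration:

```python
# A minimal sketch of the HMM pieces above, with numbers I invented for the
# weather example (the end state is omitted for simplicity).

states = ["rainy", "sunny", "cloudy"]        # hidden states S
observations = ["home", "park", "cinema"]    # possible observations O

# Transition probabilities A: A[i][j] = P(next state j | current state i)
A = {
    "rainy":  {"rainy": 0.5, "sunny": 0.2, "cloudy": 0.3},
    "sunny":  {"rainy": 0.1, "sunny": 0.6, "cloudy": 0.3},
    "cloudy": {"rainy": 0.3, "sunny": 0.3, "cloudy": 0.4},
}

# Emission probabilities E: E[s][o] = P(observation o | state s)
E = {
    "rainy":  {"home": 0.7, "park": 0.1, "cinema": 0.2},
    "sunny":  {"home": 0.1, "park": 0.7, "cinema": 0.2},
    "cloudy": {"home": 0.2, "park": 0.2, "cinema": 0.6},
}

# Start probabilities, playing the role of the transitions out of q_0
start = {"rainy": 1.0 / 3, "sunny": 1.0 / 3, "cloudy": 1.0 / 3}
```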
HMM taggers
What needs to be solved?
The main objective of POS tagging is: "given a set of words $w^n$, retrieve a set of tags $t^n$ (one for each word)". This is an HMM because we have a set of observations $w^n$ and we need to determine the set of hidden states $t^n$ that generated those observations. All the probability calculations are done based on a training set of texts that are already tagged.
HMM taggers achieve their objective by solving the following formula:
\begin{equation}
\max{P(t^n|w^n)}
\tag{1}
\label{eq:mainFormula}
\end{equation}
Which can be read as "based on the observations $w^n$, which is the most probable set $t^n$?". The problem with equation 1 is that it is very difficult to calculate $P(t^n|w^n)$ directly. So we need to transform this equation into a simpler one; to do so we use Bayes' rule:
\begin{equation}
P(x|y)=\frac{P(y|x) P(x)}{P(y)}
\tag{2}
\label{eq:bayesEq}
\end{equation}
So, substituting equation 2 into 1 we get:
\begin{equation}
\max{\frac{P(w^n|t^n) P(t^n)}{P(w^n)}}
\tag{3}
\label{eq:mainWithBayes}
\end{equation}
As $P(w^n)$ is the same for all $t^n$, we can remove it from equation 3:
\begin{equation}
\max{P(w^n|t^n) P(t^n)}
\tag{4}
\label{eq:finalTaggerEq}
\end{equation}
Assumptions
Equation 4 is still hard to compute directly. HMM taggers therefore make two simplifying assumptions. The first assumption is that the probability of a word appearing depends only on its own tag; that is, it is independent of the other words around it:
\begin{equation}
P(w^n|t^n)\approx \displaystyle\prod_{i=1}^{n}{P(w_i|t_i)}
\tag{5}
\label{eq:assuptionOnWords}
\end{equation}
The second assumption is that the probability of a tag appearing depends only on the previous tag:
\begin{equation}
P(t^n)\approx \displaystyle\prod_{i=1}^{n}{P(t_i|t_{i-1})}
\tag{6}
\label{eq:assumptionOnTags}
\end{equation}
So, if we combine 5 and 6 we can assume that
\begin{equation}
\max{P(t^n|w^n)}\approx\max{\displaystyle\prod_{i=1}^{n}{P(w_i|t_i)P(t_i|t_{i-1})}}
\tag{7}
\label{eq:assumptionOnBayes}
\end{equation}
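Just to see equation 7 in action, here is a small Python sketch that scores one candidate tag sequence with the right-hand side of 7; the probability tables are toy numbers I made up. Scoring every possible $t^n$ this way is exponential in $n$, which is exactly why the next section turns to dynamic programming.

```python
def sequence_score(words, tags, emission, transition, start="<q0>"):
    """Right-hand side of equation 7 for one candidate tag sequence:
    the product over i of P(w_i|t_i) * P(t_i|t_{i-1})."""
    score, prev = 1.0, start
    for w, t in zip(words, tags):
        score *= emission[t][w] * transition[prev][t]
        prev = t
    return score

# Toy tables (invented numbers): compare two candidate taggings of "the dog"
emission = {"DT": {"the": 0.6, "dog": 0.0}, "NN": {"the": 0.01, "dog": 0.3}}
transition = {"<q0>": {"DT": 0.5, "NN": 0.3},
              "DT":   {"DT": 0.05, "NN": 0.6},
              "NN":   {"DT": 0.2,  "NN": 0.2}}
print(sequence_score(["the", "dog"], ["DT", "NN"], emission, transition))  # higher score
print(sequence_score(["the", "dog"], ["NN", "NN"], emission, transition))  # much lower
```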
Comment on dynamic programming
Dynamic programming means dividing your complex problem into sub-problems. This is not applicable to all computing problems; you first have to realize whether what you want to solve can be broken into smaller tasks, and whether the combination of these tasks solves your major problem.
Dynamic programming takes far less time than naive methods that recompute every sub-solution from scratch.
The problems that can be solved by dynamic programming are the ones that exhibit the properties of overlapping subproblems (the problem can be divided into subproblems which are reused several times) AND optimal substructure (an optimal solution can be obtained by combining the optimal solutions of its subproblems).
Is dynamic programming the same as greedy algorithms? NO. A greedy algorithm is applicable when your problem exhibits an optimal substructure and it can be demonstrated that the solution is optimal at each step. If you want to apply dynamic programming you should also demonstrate that the subproblems overlap (basically, all the sub-solutions can be reached in the same way).
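As a tiny illustration of overlapping subproblems (the classic Fibonacci example, nothing to do with tagging yet): the naive recursion below recomputes the same subproblem over and over, while the memoized version solves each subproblem once and reuses it. Tagging adds an optimization (a max) on top of this reuse, which is where optimal substructure comes in.

```python
from functools import lru_cache

def fib_naive(n):
    # Recomputes fib(k) for the same k an exponential number of times.
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    # Each subproblem fib(k) is computed once and then reused.
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(35))  # instant; fib_naive(35) needs millions of recursive calls
```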
Why do we talk about dynamic programming?
To answer this question we need to look at equation 7. In order to maximize this equation we need to maximize each element of the product. So, the tagger problem has an optimal substructure and it also exhibits the property of overlapping subproblems (all the terms of the product have the same form).
So we can conclude that any tagging algorithm that makes the assumptions described in Assumptions can be implemented with dynamic programming.
Algorithms
Viterbi
I'll explain this algorithm with an example; I think it is the best way to see how it works:
The transition matrix:
\begin{equation}
\label{eq:transMatrix}
A=\begin{bmatrix}{0.1}&{0.5}&{0.4}\\{0.2}&{0.7}&{0.1}\\{0.3}&{0.31}&{0.39}\\{0.1}&{0.1}&{0.8}\end{bmatrix}
\end{equation}
The first row is the start state ($q_0$), the second row is $t_1$, the third row is $t_2$ and the fourth row is $t_3$. The first column is $t_1$, the second $t_2$ and the third $t_3$. So, by reading $A$ we know that $P(t_2|t_1)=0.7$.
The observation likelihoods:
\begin{equation}
\label{eq:obsLikelihoods}
B=\begin{bmatrix}{0.01}&{0.03}&{0.004}\\{0.25}&{0.007}&{0.15}\\{0}&{0}&{0.80}\end{bmatrix}
\end{equation}
The rows of the $B$ matrix are the tags and the columns are the observed words ($w_1$, $w_2$, $w_3$). So, by reading $B$ we know that $P(w_1|t_1)=0.01$.
Now let's calculate the first term of the product in 7, $P(t_i|q_0)P(w_1|t_i)$. The first iteration of the Viterbi algorithm will solve:
$v_1(t_1)=P(t_1|q_0)P(w_1|t_1)=0.1*0.01 = 0.001 $
$v_1(t_2)=P(t_2|q_0)P(w_1|t_2)=0.5*0.25 = 0.125 $
$v_1(t_3)=P(t_3|q_0)P(w_1|t_3)=0.4*0.00 = 0.000 $
So the best tag of this iteration will be $t_2$, and the maximum probability so far will be $0.125$.
For the second iteration we do the same: calculate $v_2(t_1)$, $v_2(t_2)$, $v_2(t_3)$, where each $v_2(t_j)=\max_i{v_1(t_i)P(t_j|t_i)P(w_2|t_j)}$, and choose the maximum. And so on.
So, as we can see, Viterbi does what we could do by hand. This is the simplest implementation of Viterbi; there are different variations, like Lazy Viterbi, which doesn't calculate a node's probability until it is needed.
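To tie the example together, here is a minimal sketch of the Viterbi recursion in Python using the $A$ and $B$ matrices above (the end state is ignored, as in the hand calculation). The first column of $v$ reproduces the hand-computed values $0.001$, $0.125$ and $0.000$.

```python
# Viterbi on the example above. Row 0 of A is the start state q_0,
# rows 1..3 are t_1..t_3; B[i][k] = P(w_{k+1} | t_{i+1}).
A = [
    [0.10, 0.50, 0.40],   # from q_0
    [0.20, 0.70, 0.10],   # from t_1
    [0.30, 0.31, 0.39],   # from t_2
    [0.10, 0.10, 0.80],   # from t_3
]
B = [
    [0.01, 0.030, 0.004],  # t_1
    [0.25, 0.007, 0.150],  # t_2
    [0.00, 0.000, 0.800],  # t_3
]

def viterbi(A, B, n_words):
    n_tags = len(B)
    # v[i][j]: probability of the best tag sequence for words 1..i+1 ending in tag j+1
    v = [[0.0] * n_tags for _ in range(n_words)]
    back = [[0] * n_tags for _ in range(n_words)]
    for j in range(n_tags):                  # first word: transitions from q_0
        v[0][j] = A[0][j] * B[j][0]
    for i in range(1, n_words):              # remaining words
        for j in range(n_tags):
            scores = [v[i - 1][k] * A[k + 1][j] * B[j][i] for k in range(n_tags)]
            back[i][j] = max(range(n_tags), key=lambda k: scores[k])
            v[i][j] = scores[back[i][j]]
    best = max(range(n_tags), key=lambda j: v[-1][j])
    path = [best]                            # follow the backpointers
    for i in range(n_words - 1, 0, -1):
        best = back[i][best]
        path.insert(0, best)
    return [t + 1 for t in path], max(v[-1])

tags, prob = viterbi(A, B, 3)
print(tags, prob)  # most probable tag sequence (1-based indices) for w_1..w_3 and its probability
```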
Adding context
The bigram assumption made in Assumptions (equation 6) is a simplification of the real-world problem; some modern taggers include more context information by using trigrams instead of bigrams. The procedure is quite simple; the main problem is the probability calculations.
By doing this change (using trigrams instead of bigrams) equation 6 can be rewritten as:
\begin{equation}
P(t^n)\approx\displaystyle\prod_{i=1}^{n}{P(t_i|t_{i-1},t_{i-2})}
\tag{8}
\label{eq:trigramExtension}
\end{equation}
There is one large problem with 8: data sparsity. This means that a trigram $t_{i-2}t_{i-1}t_i$ may never occur in the training data, so we cannot use the maximum likelihood estimate (we cannot just count the occurrences of the trigram in order to calculate $P(t_i|t_{i-1},t_{i-2})$, because many of the counts will be zero and we would predict that those trigrams can never occur).
One way to calculate $P(t_i|t_{i-1},t_{i-2})$ could be:
\begin{equation}
P(t_i|t_{i-1},t_{i-2})=\alpha_1\frac{C(t_{i-2},t_{i-1},t_i)}{C(t_{i-2},t_{i-1})}+\alpha_2\frac{C(t_{i-1},t_i)}{C(t_{i-1})}+\alpha_3\frac{C(t_i)}{N}
\label{eq:a}
\end{equation}
Where,
\begin{equation}
\alpha_1+\alpha_2+\alpha_3=1
\label{eq:v}
\end{equation}
This is called deleted interpolation.
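A small Python sketch of how this estimate could be computed from counts; the $\alpha$ weights and the tiny tag sequence below are placeholders I invented (in practice the weights are tuned on held-out data and must sum to 1):

```python
from collections import Counter

def interpolated_trigram_prob(t2, t1, t, unigrams, bigrams, trigrams, N,
                              alphas=(0.6, 0.3, 0.1)):
    """Deleted-interpolation style estimate of P(t | t2, t1):
    a weighted mix of the ML trigram, bigram and unigram estimates."""
    a1, a2, a3 = alphas
    tri = trigrams[(t2, t1, t)] / bigrams[(t2, t1)] if bigrams[(t2, t1)] else 0.0
    bi = bigrams[(t1, t)] / unigrams[t1] if unigrams[t1] else 0.0
    uni = unigrams[t] / N
    return a1 * tri + a2 * bi + a3 * uni

# Counts would come from a tagged training corpus; this is just a toy sequence.
tag_sequence = ["DT", "NN", "VB", "DT", "NN"]
unigrams = Counter(tag_sequence)
bigrams = Counter(zip(tag_sequence, tag_sequence[1:]))
trigrams = Counter(zip(tag_sequence, tag_sequence[1:], tag_sequence[2:]))
N = len(tag_sequence)
print(interpolated_trigram_prob("DT", "NN", "VB", unigrams, bigrams, trigrams, N))
```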