NLP
NLP基础

七月NLP课程笔记(5)-隐马尔科夫模型及其应用

ref: http://blog.csdn.net/zxm1306192988/article/details/78595933

# 马尔科夫链

每个状态的转移，只依赖于之前的n个状态（n阶）以一阶为例，包括三个部分:状态、初始向量、状态转移矩阵。

缺陷：比如股市，其状态不可观测，因而无法利用状态的前后关系。可观测的只是部分指标

# 隐马尔科夫链

包括可见状态链和隐含状态链，隐含状态之间的转移概率，以及从隐含状态到可见状态之间的输出概率。

关于HMM的问题主要分为三类： ①Recognition: 知道隐含状态数量，知道转移概率，根据可见状态链，猜测隐含状态链，例如语音识别的解码问题 ②Evaluation: 知道隐含状态数量，知道转移概率，根据可见状态链，想知道能生成这个可见状态链的概率 ③Training: 知道隐含状态数量，不知道转换概率，知道可见状态链，想学习得到转移概率。

例如，根据天气雨R晴S（隐含状态），决定当天的活动（可见状态）是散步购物或者打扫房间。 对应上面的三个问题： ①已知整个模型，我观测到连续三天做的事情是：散步W，购物S，收拾C，我想猜，这三天的天气是怎么样的。 ②已知整个模型，我观测到连续三天做的事情是：散步，购物，收拾。那么根据模型，计算产生这些行为的概率是多少。 ③最复杂的，我只知道这三天做了这三件事儿，而其他什么信息都没有。我得建立一个模型，晴雨转换概率，第一天天气情况的概率分布，根据天气情况选择做某事的概率分布。

# 隐马尔科夫链的解法

问题一：Viterbi Algorithm，维特比算法。问题二：Forword Algorithm，向前算法，或者 Backward Algorithm，向后算法。问题三：Baum-Welch Algo，鲍姆-韦尔奇算法

解法部分，配合给的博客链接来看，这里只写出直观通俗的解释，不给具体的形式化定义。

# Recognition问题的解法

Viterbi算法： 应用了动态规划的思想。（动态规划本质仍然是穷举了所有的可能性，但是只不过把计算的中间结果保留下来，不做重复的计算，其结果是最优的，但复杂度高。贪心算法只保留每一步的当前最优，容易陷入局部最优，但计算量小，复杂度较低）

# Evaluation问题的解法

解法①：遍历 遍历方式，即穷举三天的天气情况共有2^3=8种可能性。模型已知，所以根据模型参数，可以计算这8种的每一种情况下，产生这些行为的概率。最后加权起来（比如认为8种情况等概率出现），得到总的概率。

解法②：向前算法 即，先考虑第一天，计算出第一天下雨且散步，以及第一天晴天且散步的概率；然后考虑第二天，计算出第一天散步的情况下，第二天下雨且购物，以及第二天晴天且购物的概率……依次越来越长向前计算。

解法③：向后算法 先设最后第三天是R或S的概率均为1，然后根据模型，向前推出前各天是各种天气的概率。

# Training问题的解法

Baum-Welch算法 更详细可参考文章 http://blog.sina.com.cn/s/blog_8267db980102wpvz.html http://blog.csdn.net/u014688145/article/details/53046765

# HMM的应用：词性标注

词性是隐含状态，单词是观测

首先训练得到HMM模型，即Training问题（不过这里的Training不必用上面的BW算法，直接根据语料库，进行数量的统计和概率的计算就可以完成对转移矩阵和发射矩阵的训练）；然后可利用模型，进行词性标注，即Recognition问题，使用Viterbi算法即可。

下面即使用代码实现一个HMM的词性标注。

训练语料使用nltk自带的brown语料库，带有pos标注。形如（I,NOUN）,(LOVE,VERB),(YOU,NOUN)

import nltk
import sys
from nltk.corpus import brown

1
2
3

预处理词库，即给每一句加上开头和结尾标记。

brown_tags_words = [ ]
for sent in brown.tagged_sents():
    # 先加开头
    brown_tags_words.append( ("START", "START") )
    # 为了省事儿，我们把tag都省略成前两个字母
    brown_tags_words.extend([ (tag[:2], word) for (word, tag) in sent ])
    # 加个结尾
    brown_tags_words.append( ("END", "END") )

1
2
3
4
5
6
7
8

词统计，先计算一下某个POS发射出某个单词的概率，即观测概率矩阵B。只是简单粗暴地统计出单词和POS之间的个数关系即可。P(wi|ti)=count(wi,ti)/count(ti) 这里用nltk自带的统计工具：

# conditional frequency distribution
cfd_tagwords = nltk.ConditionalFreqDist(brown_tags_words)
# conditional probability distribution
cpd_tagwords = nltk.ConditionalProbDist(cfd_tagwords, nltk.MLEProbDist)

1
2
3
4

查看统计的效果:

print("The probability of an adjective (JJ) being 'new' is", cpd_tagwords["JJ"].prob("new"))
print("The probability of a verb (VB) being 'duck' is", cpd_tagwords["VB"].prob("duck"))

# output 
The probability of an adjective (JJ) being ‘new’ is 0.01472344917632025 
The probability of a verb (VB) being ‘duck’ is 6.042713350943527e-05

1
2
3
4
5
6

还需要计算状态转移矩阵A，这个只要简单粗暴地计算出由某个状态转移到某个状态的概率即可。 P(ti|ti−1)= count(ti−1,ti) / count(ti−1) 仍然使用nltk带的功能：

brown_tags = [tag for (tag, word) in brown_tags_words ]

# count(t{i-1} ti)
# bigram的意思是 前后两个一组，联在一起
cfd_tags= nltk.ConditionalFreqDist(nltk.bigrams(brown_tags))
# P(ti | t{i-1})
cpd_tags = nltk.ConditionalProbDist(cfd_tags, nltk.MLEProbDist)

1
2
3
4
5
6
7

效果

print("If we have just seen 'DT', the probability of 'NN' is", cpd_tags["DT"].prob("NN"))
print( "If we have just seen 'VB', the probability of 'JJ' is", cpd_tags["VB"].prob("DT"))
print( "If we have just seen 'VB', the probability of 'NN' is", cpd_tags["VB"].prob("NN"))

# output
If we have just seen ‘DT’, the probability of ‘NN’ is 0.5057722522030194 
If we have just seen ‘VB’, the probability of ‘JJ’ is 0.016885067592065053 
If we have just seen ‘VB’, the probability of ‘NN’ is 0.10970977711020183

1
2
3
4
5
6
7
8

所以就可以计算出某句话对应某个POS标注序列的概率了，即Evaluation问题。

# 一句话，”I want to race”， 一套tag，”PP VB TO VB” 
# 他们之间的匹配度有多高呢？ 
# 其实就是： 
# P(START) * P(PP|START) * P(I | PP) * P(VB | PP) * P(want | VB) * P(TO | VB) * P(to | TO) * P(VB | TO) * P(race | VB) * P(END | VB)

prob_tagsequence = cpd_tags["START"].prob("PP") * cpd_tagwords["PP"].prob("I") * \
    cpd_tags["PP"].prob("VB") * cpd_tagwords["VB"].prob("want") * \
    cpd_tags["VB"].prob("TO") * cpd_tagwords["TO"].prob("to") * \
    cpd_tags["TO"].prob("VB") * cpd_tagwords["VB"].prob("race") * \
    cpd_tags["VB"].prob("END")

print( "The probability of the tag sequence 'START PP VB TO VB END' for 'I want to race' is:", prob_tagsequence)

# output
The probability of the tag sequence ‘START PP VB TO VB END’ for ‘I want to race’ is: 1.0817766461150474e-14

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

下面要做Recognition，需要实现Viterbi算法。

# 拿到所有tag合集
distinct_tags = set(brown_tags)

# 找一句话来测试
sentence = ["I", "want", "to", "race" ]
sentlen = len(sentence)

# 下面，开始维特比： 
# 从1循环到句子的总长N，记为i 
# 每次都找出以tag X为最终节点，长度为i的tag链
viterbi = [ ]

# 同时，还需要一个回溯器： 
# 从1循环到句子的总长N，记为i 
# 把所有tag X 前一个Tag记下来
backpointer = [ ]

first_viterbi = { }
first_backpointer = { }
for tag in distinct_tags:
    # don't record anything for the START tag
    if tag == "START": continue
    first_viterbi[ tag ] = cpd_tags["START"].prob(tag) * cpd_tagwords[tag].prob( sentence[0] )
    first_backpointer[ tag ] = "START"

print(first_viterbi)
print(first_backpointer)

# 以上，是所有的第一个viterbi 和第一个回溯点。 
# 接下来，把楼上这些，存到Viterbi和Backpointer两个变量里去
viterbi.append(first_viterbi)
backpointer.append(first_backpointer)

# 我们可以先看一眼，目前最好的tag是啥：
currbest = max(first_viterbi.keys(), key = lambda tag: first_viterbi[ tag ])
print( "Word", "'" + sentence[0] + "'", "current best two-tag sequence:", first_backpointer[ currbest], currbest)

# output
Word ‘I’ current best two-tag sequence: START PP

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40

我们开始loop：

for wordindex in range(1, len(sentence)):
    this_viterbi = { }
    this_backpointer = { }
    prev_viterbi = viterbi[-1]

    for tag in distinct_tags:
        # START没有卵用的，我们要忽略
        if tag == "START": continue

        # 如果现在这个tag是X，现在的单词是w，
        # 我们想找前一个tag Y，并且让最好的tag sequence以Y X结尾。
        # 也就是说
        # Y要能最大化：
        # prev_viterbi[ Y ] * P(X | Y) * P( w | X)

        best_previous = max(prev_viterbi.keys(),
                            key = lambda prevtag: \
            prev_viterbi[ prevtag ] * cpd_tags[prevtag].prob(tag) * cpd_tagwords[tag].prob(sentence[wordindex]))

        this_viterbi[ tag ] = prev_viterbi[ best_previous] * \
            cpd_tags[ best_previous ].prob(tag) * cpd_tagwords[ tag].prob(sentence[wordindex])
        this_backpointer[ tag ] = best_previous

    # 每次找完Y 我们把目前最好的 存一下
    currbest = max(this_viterbi.keys(), key = lambda tag: this_viterbi[ tag ])
    print( "Word", "'" + sentence[ wordindex] + "'", "current best two-tag sequence:", this_backpointer[ currbest], currbest)


    # 完结
    # 全部存下来
    viterbi.append(this_viterbi)
    backpointer.append(this_backpointer)

# output
Word ‘want’ current best two-tag sequence: PP VB 
Word ‘to’ current best two-tag sequence: VB TO 
Word ‘race’ current best two-tag sequence: IN NN

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37

找END，结束：

# 找所有以END结尾的tag sequence
prev_viterbi = viterbi[-1]
best_previous = max(prev_viterbi.keys(),
                    key = lambda prevtag: prev_viterbi[ prevtag ] * cpd_tags[prevtag].prob("END"))

prob_tagsequence = prev_viterbi[ best_previous ] * cpd_tags[ best_previous].prob("END")

# 我们这会儿是倒着存的。。。。因为。。好的在后面
best_tagsequence = [ "END", best_previous ]
# 同理 这里也有倒过来
backpointer.reverse()

1
2
3
4
5
6
7
8
9
10
11

最终：回溯所有的回溯点此时，最好的 tag 就是 backpointer 里面的 current best

current_best_tag = best_previous
for bp in backpointer:
    best_tagsequence.append(bp[current_best_tag])
    current_best_tag = bp[current_best_tag]

1
2
3
4

显示上面测试样例的结果：

best_tagsequence.reverse()
print( "The sentence was:", end = " ")
for w in sentence: print( w, end = " ")
print("\n")
print( "The best tag sequence is:", end = " ")
for t in best_tagsequence: print (t, end = " ")
print("\n")
print( "The probability of the best tag sequence is:", prob_tagsequence)

1
2
3
4
5
6
7
8

输出

The sentence was: I want to race

The best tag sequence is: START PP VB IN NN END

The probability of the best tag sequence is: 5.71772824864617e-14

可能需要添加更多语料来训练，才能使得结果更好一点。

上次更新: 2020/09/19, 22:09:00

← 七月NLP课程笔记(4)-基于统计的翻译七月NLP课程笔记(6)-深度学习与NLP应用→