Saturday, August 16, 2014

Context-less POS tagging with NLTK


Last week I was handed an interesting task - flagging non-nouns among the single-word synonyms for concepts in our Medical Ontology. Now, the Part of Speech (POS) tag is not really an intrinsic attribute of a word - rather, the POS is determined by the context in which the word appears. For example, consider the two sentences "He had an infectious laugh", where laugh is a noun, and "He will laugh at the joke", where the same word is now a verb.
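
To see this in code, here is a tiny illustration using NLTK's built-in tagger (just a sketch; it assumes the default tagger model has been downloaded via nltk.download(), and the exact tags may vary with the tagger version):

import nltk

# same word, different POS depending on the surrounding context
print nltk.pos_tag(nltk.word_tokenize("He had an infectious laugh"))
# expect something like [..., ('infectious', 'JJ'), ('laugh', 'NN')]
print nltk.pos_tag(nltk.word_tokenize("He will laugh at the joke"))
# expect something like [..., ('will', 'MD'), ('laugh', 'VB'), ...]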

However, a word with multiple senses usually has one dominant sense, so in the absence of any context we could treat the POS of that dominant sense as the word's POS. To calculate the dominant POS for each word, I decided to use the words in the Brown Corpus, a manually POS-tagged text corpus of about 500 articles. The Brown Corpus is available (after nltk.download()) as part of the Natural Language Toolkit (NLTK).
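
As a quick check of the "dominant sense" intuition, we can count the raw Brown tags observed for a single word (a minimal sketch, assuming the Brown Corpus has already been downloaded):

import nltk
from nltk.corpus import brown

# frequency of raw Brown tags for one word across the whole corpus
laugh_fd = nltk.FreqDist(tag for word, tag in brown.tagged_words()
                         if word.lower() == "laugh")
print sorted(laugh_fd.items(), key=lambda x: -x[1])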

The idea is to create a frequency distribution of (word, tag) pairs over all the words and their tags in the Brown Corpus. Then, for each word, we normalize the frequencies across the different POS tags to derive the probability of the word carrying each of these tags. The output goes to a flat file. Here's the code to build this data.

# Source: dict_build.py
import nltk
from nltk.corpus import brown

DELIM = "_|_"
NORMED_TAGS = ["NN", "VB", "JJ", "RB", "DT", "IN", "OT"]
POSTAGS = {
    "NN" : "noun",
    "VB" : "verb",
    "JJ" : "adjective",
    "RB" : "adverb",
    "DT" : "determiner",
    "IN" : "preposition",
    "OT" : "other"
}

def normalize_brown_postags():
    brown_tags = open("../data/brown_tags.csv", 'rb')
    tag_map = dict()
    for line in brown_tags:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        tag_name, tag_description = line.split("\t")[0:2]
        tag_desc_words = set(nltk.word_tokenize(tag_description.lower()))
        is_tagged = False
        for normed_tag in NORMED_TAGS[:-1]:
            desc_pattern = POSTAGS[normed_tag]
            if desc_pattern in tag_desc_words:
                tag_map[tag_name] = normed_tag
                is_tagged = True
        if not is_tagged:
            tag_map[tag_name] = "OT"
    brown_tags.close()
    return tag_map

def retag_brown_words(tag_map):
    wordpos_fd = nltk.FreqDist()
    for word, tag in brown.tagged_words():
        if tag_map.has_key(tag):
            normed_pos = tag_map[tag]
            retagged_word = DELIM.join([word.lower(), normed_pos])
            wordpos_fd.inc(retagged_word)  
    return wordpos_fd
    
def compose_record(word, wordpos_fd):
    freqs = []
    for tag in NORMED_TAGS:
        wordpos = DELIM.join([word, tag])
        freqs.append(wordpos_fd[wordpos])
    sum_freqs = float(sum(freqs))
    nf = [float(f) / sum_freqs for f in freqs]
    return "%s\t%s\n" % (word, "\t".join(["%5.3f" % (x) for x in nf]))

# main
tag_map = normalize_brown_postags()
wordpos_fd = retag_brown_words(tag_map)
already_seen_words = set()
brown_dict = open("../data/brown_dict.csv", 'wb')
brown_dict.write("#WORD\t%s\n" % ("\t".join(NORMED_TAGS)))
for wordpos in wordpos_fd.keys():
    word, tag = wordpos.split(DELIM)
    if word in already_seen_words:
        continue
    brown_dict.write(compose_record(word, wordpos_fd))
    already_seen_words.add(word)
brown_dict.close()

I got the Brown tag descriptions from this page; I copied the table into OpenOffice Calc and wrote it out as the CSV file (brown_tags.csv) referenced above. I reduced the large set of Brown tags to a small set of 7 tags - Nouns (NN), Verbs (VB), Adjectives (JJ), Adverbs (RB), Determiners (DT), Prepositions (IN), and Other (OT). I had originally reduced it to just 4 tags (similar to WordNet) - NN, VB, JJ and RB - but extended it to account for the phrase sequences (see below). The output of the code above is a file that looks like this:

#WORD      NN      VB      JJ      RB      DT      IN      OT
as      0.000   0.000   0.000   0.000   0.000   0.017   0.983
his     0.000   0.000   0.000   0.000   0.995   0.000   0.005
be      0.000   1.000   0.000   0.000   0.000   0.000   0.000
on      0.000   0.000   0.000   0.083   0.000   0.917   0.000
at      0.000   0.000   0.000   0.000   0.000   1.000   0.000
...
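
Before using the file, a quick sanity check (a small sketch, reusing the same file path as above): each row's probabilities should sum to approximately 1, modulo rounding from the %5.3f format.

# verify that the per-word probabilities in brown_dict.csv sum to ~1.0
for line in open("../data/brown_dict.csv"):
    line = line.strip()
    if len(line) == 0 or line.startswith("#"):
        continue
    cols = line.split("\t")
    prob_sum = sum([float(x) for x in cols[1:]])
    assert abs(prob_sum - 1.0) < 0.01, "bad row for word: %s" % (cols[0])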

The data is consumed from Java, both for maintainability (we are a Java shop) and because part of the work involves checking against a middleware component that is Java-only. Still, here is some test code that illustrates how this information can be used.

# Source: predict.py
from __future__ import division
import numpy as np
import nltk

NORMTAGS = ["NN", "VB", "JJ", "RB", "DT", "IN", "OT"]

def load_word_dict(dict_file):
    word_dict = {}
    fdict = open(dict_file, "rb")
    for line in fdict:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        cols = line.split("\t")
        word = cols[0]
        probs = [float(x) for x in cols[1:]]
        word_dict[word] = probs
    fdict.close()
    return word_dict

def assert_true(fn, message):
    if fn != True:
        print "Assert failed:", message

def predict_pos(word, word_dict):
    if word_dict.has_key(word):
        probs = np.array(word_dict[word])
        return NORMTAGS[np.argmax(probs)]
    else:
        return "OT"
        
def predict_if_noun(word, word_dict):
    return predict_pos(word, word_dict) == "NN"

# test cases for individual words
word_dict = load_word_dict("../data/brown_dict.csv")
assert_true(predict_if_noun("hypothalamus", word_dict), 
            "Hypothalamus == NOUN!")
assert_true(not predict_if_noun("intermediate", word_dict), 
            "Intermediate != NOUN!")
assert_true(predict_if_noun("laugh", word_dict), "Laugh ~= NOUN!")

Having done this, I got to thinking that perhaps we could extend the idea to multi-word phrases as well, i.e., we could flag noun phrases instead of nouns. Like their single-word counterparts, these phrases are also synonyms of concepts, and do not come with any context to decide what kind of phrases they are.

For this, I model a phrase as a sequence of POS tags. I decided to use the Penn Treebank (PTB) corpus. The full PTB is not free, but NLTK provides a small subset of the PTB for educational purposes. The idea here is to scan the chunked sentences, pick out the noun phrase (NP) chunks, and create a frequency distribution of the tag sequences for noun phrases.
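
To get a feel for the data this works on, here is what one chunked sentence from NLTK's treebank sample looks like (a sketch; it assumes the treebank corpus has been downloaded, and the printed form differs slightly across NLTK versions):

from nltk.corpus import treebank_chunk

# each chunked sentence is a shallow tree whose NP subtrees are the chunks
print treebank_chunk.chunked_sents()[0]
# expect something like:
# (S
#   (NP Pierre/NNP Vinken/NNP)
#   ,/,
#   (NP 61/CD years/NNS)
#   old/JJ
#   ...)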

Since not every English word appears in the Brown Corpus, we also create a transition matrix that predicts which POS tag is likely to appear given the current POS tag. That way, we will be able to assign a likely POS tag to words that are not in our dictionary. As before, I reduce the PTB tags down to our set of 7 tags. The code for generating the POS sequences and the POS transition matrix is shown below:

# Source: phrase_seqs.py
from __future__ import division
import nltk
import numpy as np

from nltk.corpus import treebank_chunk

NORMTAGS = ["NN", "VB", "JJ", "RB", "DT", "IN", "OT"]
POSTAGS = {
    "NN" : "noun",
    "VB" : "verb",
    "JJ" : "adjective",
    "RB" : "adverb",
    "DT" : "determiner",
    "IN" : "preposition",
    "OT" : "other"
}

def normalize_ptb_tags():
    tag_map = {}
    ptb_tags = open("../data/ptb_tags.csv", 'rb')
    for line in ptb_tags:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        tag, desc = line.split("\t")
        desc_words = nltk.word_tokenize(desc.lower().replace("-", " "))
        is_tagged = False
        for key in NORMTAGS[:-1]:
            postag_desc = POSTAGS[key]
            if postag_desc in desc_words:
                tag_map[tag] = key
                is_tagged = True
        if not is_tagged:
            tag_map[tag] = "OT"
    ptb_tags.close()
    return tag_map
    
def get_chunks(tree, phrase_type, tags):
    try:
        tree.node
    except AttributeError:
        return 
    else:
        if tree.node == phrase_type:
            tags.append(tree)
        else:
            for child in tree:
                get_chunks(child, phrase_type, tags)

def index_of(tag):
    if tag == "START":
        return 0
    elif tag == "END":
        return len(NORMTAGS) + 1
    else:
        return NORMTAGS.index(tag) + 1
        
def update_trans_freqs(trans_freqs, tag_seq):
    tags = ["START"]
    tags.extend(tag_seq.split(" "))
    tags.append("END")
    bigrams = nltk.bigrams(tags)
    for bigram in bigrams:
        row = index_of(bigram[0])
        col = index_of(bigram[1])
        trans_freqs[row, col] += 1
    
# generate phrases as a sequence of (normalized) POS tags and
# transition probabilities across POS tags.
tag_map = normalize_ptb_tags()
np_fd = nltk.FreqDist()
trans_freqs = np.zeros((len(NORMTAGS) + 2, len(NORMTAGS) + 2))
for tree in treebank_chunk.chunked_sents():
    chunks = []
    get_chunks(tree, "NP", chunks)
    for chunk in chunks:
        tagged_poss = [tagged_word[1] for tagged_word in chunk]
        normed_tags = []
        for tagged_pos in tagged_poss:
            try:
                normed_tags.append(tag_map[tagged_pos])
            except KeyError:
                normed_tags.append("OT")
        np_fd.inc(" ".join(normed_tags))
        
fout = open("../data/np_tags.csv", 'wb')
for tag_seq in np_fd.keys():
    fout.write("%s\t%d\n" % (tag_seq, np_fd[tag_seq]))
    update_trans_freqs(trans_freqs, tag_seq)
fout.close()
# normalize each row of the transition counts; note that this divides by the
# row's L2 norm rather than the row sum, so the values are relative weights
# rather than true probabilities - which is fine here, since only the per-row
# argmax is used downstream
trans_probs = trans_freqs / np.linalg.norm(trans_freqs, axis=1)[:, np.newaxis]
trans_probs[~np.isfinite(trans_probs)] = 0.0
np.savetxt("../data/pos_trans.csv", trans_probs, fmt="%7.5f", delimiter="\t")

Thanks to this presentation and this project for teaching me how to parse an NLTK tree. The output of the code above is the following two files. The first is a matrix of inter-POS transition weights (row-normalized bigram frequencies), used to figure out what POS tag an unknown word is likely to have given the previous tag.

# START      NN      VB      JJ      RB      DT      IN      OT     END
0.00000 0.43902 0.06724 0.21358 0.04351 0.78312 0.02769 0.37574 0.00000 # START
0.00000 0.71857 0.07099 0.13159 0.00346 0.00693 0.00000 0.36708 0.57139 # NN
0.00000 0.83146 0.03079 0.18477 0.15397 0.01540 0.00000 0.27715 0.41573 # VB
0.00000 0.94283 0.06168 0.20266 0.00441 0.01322 0.00000 0.22029 0.13217 # JJ
0.00000 0.08762 0.04381 0.65716 0.04381 0.00000 0.00000 0.04381 0.74478 # RB
0.00000 0.79556 0.14763 0.50030 0.04921 0.03281 0.00000 0.29526 0.06561 # DT
0.00000 0.68599 0.00000 0.34300 0.00000 0.17150 0.00000 0.34300 0.51450 # IN
0.00000 0.66386 0.11246 0.33374 0.02177 0.06893 0.01814 0.59131 0.28296 # OT
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 # END

and the second is the list of noun-phrase POS tag sequences found in the limited PTB corpus, along with their counts.

DT JJ NN        1162
OT NN           1097
DT NN NN        875
DT              830
NN NN NN        547
JJ              359
...

Once again, the consumer of this data is in the Java layer, but we do the following tests for our own satisfaction.

# (continues predict.py above - reuses the numpy/nltk imports, NORMTAGS,
# word_dict and predict_pos() from that listing)
def load_phrase_tags(phrase_tag_file):
    phrase_tags = set()
    ftags = open(phrase_tag_file, 'rb')
    for line in ftags:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        phrase, count = line.split("\t")
        phrase_tags.add(phrase)
    ftags.close()
    return phrase_tags

def assert_true(fn, message):
    if fn != True:
        print "Assert failed:", message

def tag_to_index(tag):
    if tag == "START":
        return 0
    elif tag == "END":
        return len(NORMTAGS) + 1
    else:
        return NORMTAGS.index(tag) + 1

def index_to_tag(index):
    if index == 0: 
        return "START"
    elif index == len(NORMTAGS) + 1:
        return "END"
    else:
        return NORMTAGS[index - 1]

def predict_likely_pos(prev_tag, trans_probs):
    row = tag_to_index(prev_tag)
    probs = trans_probs[row, :]
    return index_to_tag(np.argmax(probs))

def predict_if_noun_phrase(phrase, trans_probs, phrase_tags):
    words = nltk.word_tokenize(phrase)
    tags = []
    for word in words:
        if word_dict.has_key(word):
            tags.append(predict_pos(word, word_dict))
        else:
            prev_tag = "START" if len(tags) == 0 else tags[-1]
            tags.append(predict_likely_pos(prev_tag, trans_probs))
    return " ".join(tags) in phrase_tags

# test cases for phrases
phrase_tags = load_phrase_tags("../data/np_tags.csv")
trans_probs = np.loadtxt("../data/pos_trans.csv", delimiter="\t")
assert_true(predict_if_noun_phrase("time flies", trans_probs, phrase_tags), 
            "time flies == NP!")
assert_true(not predict_if_noun_phrase(
            "were spoken", trans_probs, phrase_tags), 
            "were spoken == VP!")

The phrase matching seems to be a bit too loose at the moment: it accepts almost any phrase you throw at it as a noun phrase. One possibility is to manually combine the tag sequences into a smaller number of regex-style rules, with a cutoff mandating a minimum required frequency of a tag sequence for it to be considered, as sketched below. Alternatively, we could extract Verb, Adjective and Adverb phrase sequences as well and attempt to match against all of them, choosing the phrase type with the highest frequency.
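
Here is a minimal sketch of the frequency cutoff idea (the 50-count threshold and the helper name are arbitrary choices of mine, not tuned values); the returned set could be dropped in for phrase_tags in predict_if_noun_phrase() above.

MIN_COUNT = 50   # hypothetical cutoff, would need tuning

def load_frequent_phrase_tags(phrase_tag_file, min_count=MIN_COUNT):
    # keep only tag sequences seen at least min_count times in the PTB sample
    frequent_tags = set()
    ftags = open(phrase_tag_file, 'rb')
    for line in ftags:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        tag_seq, count = line.split("\t")
        if int(count) >= min_count:
            frequent_tags.add(tag_seq)
    ftags.close()
    return frequent_tags

phrase_tags = load_frequent_phrase_tags("../data/np_tags.csv")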

One other problem with this approach is that the data sets we have considered are a bit on the small side, and they don't match our domain. An option is to POS tag and chunk a large amount of relevant text with something like OpenNLP, then use the above approach to find free-standing noun phrases; a rough sketch follows.
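
As a very rough sketch of what that bootstrapping might look like, here is the same frequency-distribution step using NLTK's off-the-shelf tagger and a hand-written chunk grammar as a stand-in for OpenNLP. The domain_sentences list is a placeholder for real domain text (say, PubMed abstracts), and the resulting raw PTB-style tags would still need to be reduced via tag_map as in phrase_seqs.py.

import nltk

# placeholder for a large collection of relevant sentences
domain_sentences = ["The hypothalamus regulates body temperature."]

# a deliberately rough NP chunk grammar: optional determiner, any number
# of adjectives, one or more nouns
chunker = nltk.RegexpParser("NP: {<DT>?<JJ.*>*<NN.*>+}")

np_fd = nltk.FreqDist()
for sentence in domain_sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    for subtree in tree.subtrees():
        if subtree.node == "NP":    # subtree.label() in newer NLTK versions
            np_fd.inc(" ".join(tag for word, tag in subtree.leaves()))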

8 comments (moderated to prevent spam):

Anonymous said...

Interesting!! Got me thinking: what if you could use Stanford NLP, which gives richer POS tagging than OpenNLP?

http://nlp.stanford.edu/software/dependencies_manual.pdf

And also combine it with the dependency tree to glean more about the semantic structure... Note that it will be a memory hog though.

http://nlp.stanford.edu/pubs/LREC06_dependencies.pdf

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/SemanticGraphFactory.html

Let me know what you think. Am I barking up the wrong tree ??

Ravi Kiran Bhaskar

Sujit Pal said...

Thanks Ravi. This is the second time someone has recommended the Stanford Parser, so I guess I really should try it again; I found it a bit slow the last time I did. In my current scenario, I am trying to "normalize" the tags to a smaller set so I can (manually, after counting and sorting by frequency) construct POS sequence templates that translate to different phrase types, so a richer set of tags may not necessarily help.

Anonymous said...

I don't know your exact scenario or design, so I won't comment on what you should use. However, I will lay out my thought process (at least I might learn something from you that I overlooked :-) ).

At least from the limited testing I did, HMM-based taggers (OpenNLP) are slightly less accurate than the CRF-based POS taggers. As you noted, Stanford NLP is memory hungry and slower.

Coming to paring down the nouns... I believe sometimes, to narrow down, you need increased richness in tagging so you can really strip down to the correct ones.

Let's take for example "Genetically modified food", which could not be detected as a whole by common HMM taggers; however, the Stanford Parser can tag it as an "adverbial modifier".

Similarly, an "adjectival modifier" adds more context for you to work with... look at the first link, you might get some ideas.

Thanks,
Ravi Kiran Bhaskar

Sujit Pal said...

Hi Ravi, thanks for the detailed comment. Yes, completely agree: from my very limited testing of the Stanford parser, it looks to be much more accurate than OpenNLP (and that in itself may be a good argument for using Stanford).

For context, the application is to flag possible "bad" entries from a large (approx 20M) list of free-standing words (or phrases) by computing the POS (or phrase type) - the intuition is that nouns and noun phrases are "good" and others should be reviewed manually.

OpenNLP came up because the tagged corpora in NLTK are limited, and I figured I could manufacture some training data by tagging/chunking a largish corpus like PubMed using OpenNLP's OOB trained model. As you rightly pointed out, I could use the Stanford Parser as well and get more accurate and richer tagging.

However, the reason I said it may be of limited use is that my current approach is to reduce the PTB (and in future the OpenNLP/Stanford) tags to a smaller set (for example NN, NNS, NNP, etc. all become NN), in order to generalize them into candidate POS sequences for matching. I haven't had great success with this approach (at least not so far), but I have a few alternatives in mind - one is to find candidate sequences for other phrase types and then do a bake-off instead of the yes/no decision I have in place currently. Another is to get more training data (that's where OpenNLP came in :-)).

I am beginning to see your point about Stanford being able to tag at a more granular level, and I guess it sort of makes sense to do now even with my current approach (as long as the performance is acceptable and the resulting accuracy cuts down manual review time drastically) - in the sense that it's better to round things off as late as possible to keep errors low.

Anonymous said...

BTW don't forget to check the caseless models. I found them very useful when the nouns in the text are not properly camel cased. :-)

Ravi Kiran Bhaskar

Sujit Pal said...

Thanks Ravi. I am guessing these should be used as a fallback to report additional tags, right?

Anonymous said...

Sujit, I used caseless as the primary itself, not just as a fallback. I sent sample code to your email.

Ravi Kiran Bhaskar

Sujit Pal said...

I see, thank you. Let me look at the code and I will ask if I have questions.