Sunday, March 24, 2013

Implementing the RAKE Algorithm with NLTK


The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training; the only inputs are a list of stop words for a given language and a tokenizer that splits the text into sentences and sentences into words.
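
As a toy illustration of the candidate phrase step (the stopword list here is a tiny made-up one, not the NLTK list used in the full implementation below):

stopwords = set(["of", "a", "and", "are", "for", "the"])  # tiny example list
sentence = "Criteria of compatibility of a system of linear Diophantine equations"
phrases, phrase = [], []
for token in sentence.lower().split():
  if token in stopwords:        # a stopword ends the current run
    if len(phrase) > 0:
      phrases.append(" ".join(phrase))
      phrase = []
  else:                         # non-stopwords accumulate into a candidate phrase
    phrase.append(token)
if len(phrase) > 0:
  phrases.append(" ".join(phrase))
print(phrases)
# ['criteria', 'compatibility', 'system', 'linear diophantine equations']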

The RAKE algorithm is described in the book Text Mining: Applications and Theory by Michael W. Berry (free PDF). There is a (relatively) well-known Python implementation and a somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn't improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.
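
For context, here is a minimal sketch of what I mean by passing concatenated keywords into the vectorizer; the docs list is hypothetical, and RakeKeywordExtractor is the class defined in the listing further down:

from sklearn.feature_extraction.text import TfidfVectorizer

rake = RakeKeywordExtractor()                  # defined in the listing below
docs = ["first block of raw text ...",         # hypothetical input documents
        "second block of raw text ..."]
keyword_docs = [" ".join(rake.extract(doc)) for doc in docs]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(keyword_docs)     # features for the predictive model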

I had started off using the Python implementation directly from my application code (by importing it as a module), but I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. Since I was already using NLTK for other things in this application, it made sense to convert the code to use NLTK as well, handing some of the work off to NLTK's built-in functions. So here is another RAKE implementation, this time using Python and NLTK.

# Adapted from: github.com/aneesha/RAKE/rake.py
from __future__ import division
import operator
import nltk
import string

def isPunct(word):
  return len(word) == 1 and word in string.punctuation

def isNumeric(word):
  try:
    float(word) if '.' in word else int(word)
    return True
  except ValueError:
    return False

class RakeKeywordExtractor:

  def __init__(self):
    self.stopwords = set(nltk.corpus.stopwords.words())
    self.top_fraction = 1  # return the top 1/top_fraction of candidate phrases by score (1 = all, 3 = top third)

  def _generate_candidate_keywords(self, sentences):
    phrase_list = []
    for sentence in sentences:
      words = map(lambda x: "|" if x in self.stopwords else x,
        nltk.word_tokenize(sentence.lower()))
      phrase = []
      for word in words:
        if word == "|" or isPunct(word):
          if len(phrase) > 0:
            phrase_list.append(phrase)
            phrase = []
        else:
          phrase.append(word)
    return phrase_list

  def _calculate_word_scores(self, phrase_list):
    word_freq = nltk.FreqDist()
    word_degree = nltk.FreqDist()
    for phrase in phrase_list:
      degree = len(filter(lambda x: not isNumeric(x), phrase)) - 1
      for word in phrase:
        word_freq.inc(word)
        word_degree.inc(word, degree) # other words
    for word in word_freq.keys():
      word_degree[word] = word_degree[word] + word_freq[word] # itself
    # word score = deg(w) / freq(w)
    word_scores = {}
    for word in word_freq.keys():
      word_scores[word] = word_degree[word] / word_freq[word]
    return word_scores

  def _calculate_phrase_scores(self, phrase_list, word_scores):
    phrase_scores = {}
    for phrase in phrase_list:
      phrase_score = 0
      for word in phrase:
        phrase_score += word_scores[word]
      phrase_scores[" ".join(phrase)] = phrase_score
    return phrase_scores
    
  def extract(self, text, incl_scores=False):
    sentences = nltk.sent_tokenize(text)
    phrase_list = self._generate_candidate_keywords(sentences)
    word_scores = self._calculate_word_scores(phrase_list)
    phrase_scores = self._calculate_phrase_scores(
      phrase_list, word_scores)
    sorted_phrase_scores = sorted(phrase_scores.iteritems(),
      key=operator.itemgetter(1), reverse=True)
    n_phrases = len(sorted_phrase_scores)
    if incl_scores:
      return sorted_phrase_scores[0:int(n_phrases/self.top_fraction)]
    else:
      return map(lambda x: x[0],
        sorted_phrase_scores[0:int(n_phrases/self.top_fraction)])

def test():
  rake = RakeKeywordExtractor()
  keywords = rake.extract("""
Compatibility of systems of linear constraints over the set of natural 
numbers. Criteria of compatibility of a system of linear Diophantine 
equations, strict inequations, and nonstrict inequations are considered. 
Upper bounds for components of a minimal set of solutions and algorithms 
of construction of minimal generating sets of solutions for all types of 
systems are given. These criteria and the corresponding algorithms for 
constructing a minimal supporting set of solutions can be used in solving 
all the considered types of systems and systems of mixed types.
  """, incl_scores=True)
  print keywords
  
if __name__ == "__main__":
  test()

The results are nearly identical, and I hope you will agree that the code is much more readable (if you are familiar with the NLTK API, of course). Here are the results from the original RAKE implementation, compared to my NLTK-based implementation.

sujit@cyclone:src$ python rake.py
[('minimal generating sets', 8.6666666666666661), 
 ('linear diophantine equations', 8.5), 
 ('minimal supporting set', 7.6666666666666661), 
 ('minimal set', 4.6666666666666661), 
 ('linear constraints', 4.5), 
 ('upper bounds', 4.0), 
 ('natural numbers', 4.0), 
 ('nonstrict inequations', 4.0)]
sujit@cyclone:src$ python rake_nltk.py
[('minimal generating sets', 8.6666666666666661), 
 ('linear diophantine equations', 8.5), 
 ('minimal supporting set', 7.6666666666666661), 
 ('minimal set', 4.6666666666666661), 
 ('linear constraints', 4.5), 
 ('upper bounds', 4.0), 
 ('natural numbers', 4.0), 
 ('nonstrict inequations', 4.0), 
 ('strict inequations', 4.0)]

As you can see, the NLTK-based version returns one more candidate keyword. This is because it finds 27 candidate keywords instead of 24. I believe it's related to the way the original version handles punctuation (i.e., as part of the preceding word) compared to NLTK's approach (as a separate token). In any case, I am not convinced it's a bug, because one of the extra keywords returned was "corresponding algorithms" with a score of 3.5, which seems reasonable.
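
To see the tokenization difference, note that NLTK's word_tokenize emits punctuation as separate tokens (exact output may vary slightly by NLTK version):

import nltk
print(nltk.word_tokenize("strict inequations, and nonstrict inequations are considered."))
# ['strict', 'inequations', ',', 'and', 'nonstrict', 'inequations', 'are',
#  'considered', '.']

Here "inequations" survives as a clean token, whereas a tokenizer that leaves the comma attached would produce "inequations," and the phrase boundaries shift accordingly.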

An optimization on top of this is to find candidate keywords that span a stopword. The next step would involve storing the text in a Lucene index and hitting it with n² SpanQuery calls, where n is the number of extracted keywords, something along these lines.
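
I haven't built the Lucene index yet, but just to make the idea concrete, here is a rough pure-Python stand-in for the n² SpanQuery calls: for every ordered pair of extracted keywords, check whether some sentence contains the two phrases separated only by stopwords.

import itertools
import re
import nltk

def joined_across_stopwords(kw1, kw2, sentences, stopwords):
  # True if kw1 and kw2 occur in a sentence separated only by stopwords
  pattern = re.compile(r"\b%s((?:\s+\w+)+?)\s+%s\b" %
                       (re.escape(kw1), re.escape(kw2)))
  for sentence in sentences:
    match = pattern.search(sentence.lower())
    if match and all(w in stopwords for w in match.group(1).split()):
      return True
  return False

rake = RakeKeywordExtractor()
text = "..."  # the document being analyzed (placeholder)
sentences = nltk.sent_tokenize(text)
keywords = rake.extract(text)
for kw1, kw2 in itertools.permutations(keywords, 2):  # the n^2 pairs
  if joined_across_stopwords(kw1, kw2, sentences, rake.stopwords):
    print("%s <stopwords> %s" % (kw1, kw2))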


21 comments (moderated to prevent spam):

Unknown said...

Very nice work, can you please share it in a repository? Or should I just use it by copy-pasting?

Sujit Pal said...

Hi Muhammad, thanks for the kind words. It's available on GitHub here: https://github.com/sujitpal/mlia-examples/blob/master/src/salary_pred/rake_nltk.py. It's part of another mini-project which did not work out too well :-), but this part works fine. Oh, and BTW, no worries about the multiple comments, I get that a lot; I guess the message is not very visible...

Vinod Nambiar said...

Dear Sujith,

I am a new entrant into the text analysis area and am exploring it out of curiosity. I would like to ask a question: can we have any control over engines like RAKE so that they use our own vocabulary to search for? From searching the net, it appears that RAKE uses its own logic to pick keywords.

Looking forward to hearing from you.

Sujit Pal said...

Hi Vinod, you can't use your own dictionary with RAKE; it relies on a statistical property of keywords. But there are other extraction tools, like KEA, which make use of a controlled vocabulary.

Vinod Nambiar said...

Dear Sujith,

Much appreciated for the speedy response. I'll explore KEA instead of RAKE.

Thank you once again.

Thanks & Regards,
Vinod Nambiar.

Sujit Pal said...

You're welcome Vinod. Kea is on my list of things to look at but I haven't gotten around to it yet. If you write up your findings, would appreciate a pointer to the URL.

Vinod Nambiar said...

Sure. Will keep you posted.

One question, Is there are way we can link custom dictionary with RAKE ? Any possibilities of adding additional code or modules ?

Just a thought.

Thanks & Regards,
Vinod Nambiar.

Sujit Pal said...

Thanks. To answer your question, Rake looks for frequent runs of words delimited by stopwords in sentences. I guess you could redefine the stopwords for your corpus, or filter the output through a custom dictionary. Alternatively you could cap the number of words you will allow a keyword to have (top and bottom). Can't think of any others, I guess it would depend on your use case.
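
For example, the dictionary filter plus length cap could be as simple as this sketch (custom_vocabulary is hypothetical; rake and text are as in the post above):

# hypothetical filtering pass over the extracted keywords
custom_vocabulary = set(["linear constraints", "natural numbers"])  # your own dictionary
keywords = [kw for kw in rake.extract(text)
            if kw in custom_vocabulary and 1 <= len(kw.split()) <= 3]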

lima subramanian said...

Is there any pseudocode available?

Sujit Pal said...

Hi Lima, I believe there is pseudocode in the Berry book.

Victoria Stuart said...

Nice work; thanks! Updated code (Python 3; ...) here: https://github.com/victoriastuart/mlia-examples

Sujit Pal said...

Thank you, this is going to be very useful for me... I was planning to use RAKE for keyword extraction in Spark, and our PySpark platform is Python 3.x, so this would be a drop-in replacement.

Unknown said...

Hi Sujit,

Excellent work, and your algorithm responds amazingly well. I tried your code on one of my problem statements and am glad that I could solve 50% of it. I thought I would ask for your suggestions on the remaining part of my problem.

The dataset is pure unlabeled text (review comments from users). I have to categorize the review comments into 22 pre-defined categories or labels (e.g., label1-Positive Feedback; label2-Negative Feedback; label3-Pricing; label4-Build Quality; label5-Compatibility, and so on). The product owner wants to know how well users are accepting his latest released product.

Keyword extraction gave a very good response, with the keywords pointing towards one or another of those 22 pre-defined labels, but I am stuck on how to utilize the keyword scores to do classification or clustering.

Sujit Pal said...

Hi Adarsha, this seems like a very interesting problem, glad RAKE gave you good keywords and helped solve the first part. For the second part, I have a few suggestions, but all will need some manual work. The first is to assign the "best" class of the 22 classes to each of your keywords -- for example the keyword "reliable" or "long lasting" might point to label4-build quality. Positive and negative may be trickier, typically looking for adjectives on the review text may be a better approach. Once you have these, just find these keywords in incoming text, add up the scores for each class and score it as the class with the highest score. The second approach would be to abandon the keywords and manually label the reviews into one of 22 classes, train a classifier and predict on new text using that. The third approach is slightly long term -- use the keywords as hints to build Snorkel labeling functions, train a generative model to reconcile the noisy labels, and train a classifier on the reconciled labels. Labeling functions are little Python functions that you would run against the reviews. They detect some kind of pattern, such as the existence of one or more keywords, adjectives preceding them, sentiment words (drawn from a sentiment lexicon), ontologies (for technical terms around compatibility or build quality for example) and assign one of the 22 labels. This is a slightly long term approach but is likely to get you better results than #1. Also unlike #2 you are not throwing away the work you did to generate keywords. Good luck with your project!
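
To make suggestion #1 a bit more concrete, here is a minimal sketch; the keyword-to-label mapping below is hypothetical and would need to be curated by hand:

# hypothetical hand-curated mapping from RAKE keywords to the predefined labels
keyword_to_label = {
  "long lasting": "label4-Build Quality",
  "reliable": "label4-Build Quality",
  "too expensive": "label3-Pricing",
}

def classify(review_text, rake):
  # sum the RAKE scores of known keywords per label, return the best label
  label_scores = {}
  for keyword, score in rake.extract(review_text, incl_scores=True):
    label = keyword_to_label.get(keyword)
    if label is not None:
      label_scores[label] = label_scores.get(label, 0) + score
  if len(label_scores) == 0:
    return None  # no known keywords found in this review
  return max(label_scores.items(), key=lambda x: x[1])[0]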

Anonymous said...

How can I extract words from an HTML file and save them in an Excel file?

Unknown said...

Dear Sujith,
You did a great job, congrats!
I tried to run the above code and am getting some errors. Kindly check once.

Anonymous said...

This post was very useful. Thanks for sharing. I am looking for a way to extract noun phrases from bilingual files. For example, English and its corresponding Spanish translation. Can RAKE handle this? Do you know of any reliable solution for this? Thanks in advance.

Sujit Pal said...

@Unknown -- since Python is an interpreted language, it might be giving errors based on data it doesn't know how to handle. Can you post your error message so I can try to debug remotely?

@Anonymous -- the idea behind RAKE is that contiguous sequences of non-stopwords are keyword candidates. I suspect that might be true for Spanish as well. If so, RAKE could be used; you would just need to give it a Spanish stopword set.

Sujit Pal said...

Answer to @Anonymous for comment dated 7/6/2018: you can save it as a CSV file and import it into Excel.

Unknown said...

Sujit -
Excellent work, and very fast processing. I have found a few neat additions to keyword extraction for myself:
1. Dealing with negatives:
Example: "No problem was found" generates a list ['problem', 'found'].

I do an extra check to store negative particles, such as ['no', 'not', 'never', 'neither', 'nor', 'none', 'nothing', 'nowhere', 'nobody'], and reattach each one to the word it modifies.
It results in ['no problem', 'found']

2. I keep all keywords - uni-, bi-, and tri-grams - in the list for further processing; you never know upfront which version will work better for the next steps. For the same example, I would generate ['no problem', 'found', 'no problem found']. Note that 'no' and 'problem' are not in the list on their own.
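
A rough sketch of the two ideas above (here applied to the raw token list for simplicity): reattach negative particles to the following word, then keep the individual phrases plus their concatenation.

negatives = set(['no', 'not', 'never', 'neither', 'nor', 'none',
                 'nothing', 'nowhere', 'nobody'])

def reattach_negatives(tokens):
  # merge each negative particle with the token that follows it
  merged, i = [], 0
  while i < len(tokens):
    if tokens[i] in negatives and i + 1 < len(tokens):
      merged.append(tokens[i] + " " + tokens[i + 1])
      i += 2
    else:
      merged.append(tokens[i])
      i += 1
  return merged

phrases = reattach_negatives(['no', 'problem', 'found'])
phrases = phrases + [" ".join(phrases)]   # also keep the full concatenation
print(phrases)   # ['no problem', 'found', 'no problem found']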

Thank you for sharing your code.

Vladimir L.

Sujit Pal said...

Thank you Vladimir, this is a cool and really useful approach -- thanks for sharing! Sorry for the delay in responding.