The keywords library extracts the most important words from a text. It uses a list of connectives to filter out words like "the" and "because". Tags are stripped as well. Optionally, it queries WordNet to filter out any words that are not nouns. The remaining words are counted. For example, if the word "typography" occurs five times in the text and the word "workshop" occurs three times, and no word occurs more often than that, the text is likely about typography workshops.
This algorithm is part of the NodeBox Linguistics library. If you need more linguistic power, we recommend trying out that library.
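The counting approach described above can be sketched in a few lines of standalone Python. This is only an illustration, not the library's actual source: the function name `top_keywords`, the abbreviated `CONNECTIVES` set and the tag-stripping regular expression are assumptions made for the example, and the optional WordNet noun filter and singularization are left out.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated stop list.
# The real library ships its own list of connectives.
CONNECTIVES = set(["the", "and", "because", "a", "an", "of", "in", "on", "to", "is"])

def top_keywords(text, n=10, filters=()):
    """Return up to n (count, word) tuples, most frequent first."""
    # Replace anything that looks like a markup tag with a space.
    text = re.sub(r"<[^>]*>", " ", text)
    # Lowercase the text and split it into runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    # Ignore connectives plus any caller-supplied words.
    ignore = CONNECTIVES.union(w.lower() for w in filters)
    counts = Counter(w for w in words if w not in ignore)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(count, word) for word, count in ranked[:n]]

# Five occurrences of "typography", three of "workshop", plus noise.
sample = "<b>The</b> " + "typography " * 5 + "and the " + "workshop " * 3
print(top_keywords(sample, n=2))
```

With the sample text, "typography" and "workshop" come out on top because the tag and the connectives are discarded before counting.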
Download: keywords.zip (5KB)
Last updated for NodeBox 1.9.0
First import the library in your project. You can do this by putting the keywords.py file in the same folder as your script or by putting it in ~/Library/Application Support/NodeBox/. Then load it:
keywords = ximport("keywords")
The library has a single command, top(), which returns a list of 2-tuples, for example [(5,"typography"), (2,"workshop")]. Each tuple holds a count followed by a word, and together they represent the most important words in the text str. The command returns the top n words.
top(str, n=10, nouns=True, singularize=True, filters=[])
When WordNet is present, the command returns only nouns by default and tries to singularize them. Optionally, you can also supply a filters list of words to ignore. For example, here we filter keywords from Lucas Nijs' text on Ideas from the Heart.
import keywords
str = open("ideasfromtheheart.txt").read()
top = keywords.top(str, nouns=True)
for count, word in top:
    print count, word
>>> 10 typography
>>> 7 student
>>> 6 workshop
>>> 6 project
>>> 5 teacher
>>> 4 idea
>>> 4 grade
>>> 4 design
>>> 4 communication
>>> 3 course
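Because each tuple puts the count first and the word second, it is easy to unpack them in the wrong order when post-processing the result. The following standalone sketch works on a hard-coded list in the same format, with counts copied from the output above, so it runs without the library or the text file. The helper name `at_least` is hypothetical.

```python
# Sample data in the (count, word) format returned by top().
top = [(10, "typography"), (7, "student"), (6, "workshop"),
       (6, "project"), (5, "teacher")]

# Build a word -> count lookup table (note the swapped order).
counts = dict((word, count) for count, word in top)

def at_least(pairs, minimum):
    """Return the words that occur at least `minimum` times."""
    return [word for count, word in pairs if count >= minimum]

print(counts["workshop"])
print(at_least(top, 6))
```

The dictionary is convenient for looking up a single word, while the list keeps the ranking order returned by the command.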