Saturday, February 28, 2009

tf-idf library

Recently I needed tf-idf for a personal project. I couldn't find a suitable Python library online that didn't come with complicated dependencies, so I ended up writing a simple one. Here it is, in case anyone else finds it useful.

http://code.google.com/p/tfidf/

No n-grams or stemming, but it computes basic tf-idf. Thanks to Alex for reviewing.
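
For anyone new to the term, here is a minimal sketch of what "basic tf-idf" means. It is not the code from the library linked above. The names `tokenize` and `tf_idf`, the whitespace tokenizer, and the unsmoothed natural-log idf are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    # No stemming and no n-grams, matching the scope described above.
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per input document.

    Assumed weighting: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    tokenized = [tokenize(doc) for doc in documents]
    num_docs = len(tokenized)

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(num_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
```

With this weighting, a word that appears in every document (like "the" here) gets a weight of zero, while a word unique to one document (like "sat") gets the highest weight in that document. Real implementations often smooth the idf term so that common words are not zeroed out entirely.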

4 comments:

david1082 said...

A girl phoned me
From the moon;
Asking me for
Fork and spoon.

Dave said...

I have been happy with Xapian and its Python bindings, but they add an external dependency on the Xapian C++ library.

I recently heard about http://whoosh.ca/ but haven't tried it yet.

-- Dave

PS
Can I give a shout out? Hi Carl.

Anonymous said...

From the wiki:

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining.

I wonder, how would you pronounce the term in conversation with others?

"And would you believe, I spent half the friggin night rewriting those damned tee-eff-eye-dee-eff routines? (laughter) Wtf? You know, the uh, tiff-i-diff stuff from last night? (roaring laughter) Damn it! Fine, have it your way: Cool Whip."

kkuhl said...

@ Quinn: I usually just pronounce it tee-eff-eye-dee-eff. Just like it sounds. It's too difficult to get your point across when you try to make a word out of it :).

Niniane, thanks for putting this library together! I'm working on a web mining undergrad research project and this is exactly what I was looking for! Nice work!