Wednesday, January 20, 2010

tfidf package updated

I changed my TF-IDF open-source package to use the MIT license, so that it is more freely re-usable.

I also fixed a couple of issues, and uploaded a new version. Thanks to the people who reported the bugs!

If you use the package and enjoy it, please drop me an email to let me know.

5 comments:

Justin K said...

http://en.wikipedia.org/wiki/Tf-idf
"The tf-idf weighting scheme is often used in the vector space model together with cosine similarity to determine the similarity between two documents."

Wow. I am not worthy.

N said...

All that means is:
- you create a vector of word weights for each document
- you figure out how similar two vectors are
- that tells you how similar the documents are

The description just sounds more challenging when condensed.
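
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the tfidf package: the function names `tfidf_vectors` and `cosine_similarity` are made up for this example, tokenization is assumed to be a plain whitespace split, and the idf is assumed to be log(N / df), which is just one common variant.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document (hypothetical helper)."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Weight = term frequency * inverse document frequency.
        vectors.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_cats = cosine_similarity(vecs[0], vecs[1])       # shares several words
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # shares no words
```

With these sample documents, the two cat sentences score higher than the cat sentence against the stock sentence, which share no words at all. Words that appear in every document get a weight of zero under this idf variant, so they do not contribute to the similarity.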

Anonymous said...

Thanks. Have you hosted more of your projects elsewhere? I just wanted to read code written by an awesome Xoogler. :)

N said...

Haha. If you want to read awesome Googler code, you should look at code by Jeff Dean and Sanjay Ghemawat. Some of their libraries are so beautiful: impossibly short, yet they perform every function you'd want from a library. It is art!

The only public code of theirs I can think of is the protocol buffer code:
http://code.google.com/apis/protocolbuffers

Anonymous said...

Sigh. Looks like the hosted code might not have been originally written by Sanjay Ghemawat and Jeff Dean.
I see a lot of comments like the following in the code:
// Author: kenton@google.com (Kenton Varda)
// Based on original Protocol Buffers design by
// Sanjay Ghemawat, Jeff Dean, and others.

(Also, their names don't seem to be mentioned in the project members list:
http://code.google.com/p/protobuf/people/list)

Thanks, anyway!