I changed my TF-IDF open-source package to use the MIT license, so that it is more freely re-usable.
I also fixed a couple of issues, and uploaded a new version. Thanks to the people who reported the bugs!
If you use the package and enjoy it, please drop me an email to let me know.
Wednesday, January 20, 2010
5 comments:
http://en.wikipedia.org/wiki/Tf-idf
"The tf-idf weighting scheme is often used in the vector space model together with cosine similarity to determine the similarity between two documents."
Wow. I am not worthy.
All that means is:
- you create a vector of words for each document
- you figure out how similar two vectors are
- that tells you how similar the documents are
The description just sounds more challenging when condensed.
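To make those three steps concrete, here is a minimal sketch in plain Python. It is not the code from the package discussed in this post. The function names `tfidf_vectors` and `cosine_similarity` are made up for illustration, tokenizing is a naive whitespace split, and the weighting (term frequency times log(N / document frequency)) is just one common variant of tf-idf.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per document (hypothetical helper)."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Weight = term frequency * inverse document frequency.
        vectors.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An all-zero vector has no direction to compare.
    return dot / (norm_a * norm_b)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # the two cat sentences share words
print(cosine_similarity(vecs[0], vecs[2]))  # no shared words
```

A word that appears in every document gets a weight of zero under this variant, so only the distinguishing words affect the similarity score.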
Thanks. Have you hosted more of your projects elsewhere? I just wanted to read code written by an awesome Xoogler. :)
Haha. If you want to read awesome Googler code, you should look at code by Jeff Dean and Sanjay Ghemawat. Some of their libraries are so beautiful -- impossibly short, yet they perform every function you'd want from the library. It is art!
The only public code of theirs I can think of is the protocol buffer code:
http://code.google.com/apis/protocolbuffers
Sigh. It looks like the hosted code might not have been originally written by Sanjay Ghemawat and Jeff Dean.
I see a lot of comments like the following in the code:
// Author: kenton@google.com (Kenton Varda)
// Based on original Protocol Buffers design by
// Sanjay Ghemawat, Jeff Dean, and others.
(Also, their names don't seem to be mentioned in the project members list:
http://code.google.com/p/protobuf/people/list)
Thanks, anyway!