Wednesday, January 20, 2010

tfidf package updated

I changed my TF-IDF open-source package to use the MIT license, so that it is more freely re-usable.

I also fixed a couple of issues, and uploaded a new version. Thanks to the people who reported the bugs!

If you use the package and enjoy it, please drop me an email to let me know.

5 comments:

  1. http://en.wikipedia.org/wiki/Tf-idf
    "The tf-idf weighting scheme is often used in the vector space model together with cosine similarity to determine the similarity between two documents."

    Wow. I am not worthy.

  2. All that means is:
    - you create a vector of word weights for each document
    - you figure out how similar two vectors are
    - that tells you how similar the documents are

    The description just sounds more challenging when condensed.

  3. Thanks. Have you hosted more of your projects elsewhere? I just wanted to read code written by an awesome Xoogler. :)

  4. Haha. If you want to read awesome Googler code, you should look at code by Jeff Dean and Sanjay Ghemawat. Some of their libraries are so beautiful -- impossibly short, yet they perform every function you'd want from the library. It is art!

    The only public code of theirs I can think of is the protocol buffer code:
    http://code.google.com/apis/protocolbuffers

  5. Sigh. It looks like the hosted code might not have been originally written by Sanjay Ghemawat and Jeff Dean.
    I see a lot of comments like the following in the code:
    // Author: kenton@google.com (Kenton Varda)
    // Based on original Protocol Buffers design by
    // Sanjay Ghemawat, Jeff Dean, and others.

    (Also, their names don't seem to be mentioned in the project members list:
    http://code.google.com/p/protobuf/people/list)

    Thanks, anyway!

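
As a rough illustration of the three steps listed in comment 2 (build a vector per document, compare two vectors, read the result as document similarity), here is a minimal Python sketch. It is not the API of the tfidf package discussed in this post. The function names tokenize, tfidf_vectors, and cosine_similarity are made up for the example, and it assumes naive whitespace tokenization, raw term counts for tf, and idf = log(N / df). Other weighting variants are common.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {word: weight} dict per document.

    Assumed weighting: raw term count times log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: c * math.log(n_docs / df[w]) for w, c in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
print("doc 0 vs doc 1:", cosine_similarity(vecs[0], vecs[1]))
print("doc 0 vs doc 2:", cosine_similarity(vecs[0], vecs[2]))
```

The first pair shares several words and should score above zero, while the second pair shares none and scores zero. Note that with this idf formula a word that appears in every document gets weight zero, so it contributes nothing to the similarity.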
