I've experimented several times with this, and most results were actually NOT great. I've given some thought to this, and would be interested to hear other people's insights.
The components I perceive are:
1. Deciding what counts as the user's "current context". If they're typing a long email or document, do you use the last sentence? The last paragraph? Or a time window (the last 20 seconds of typing)?
2. Extracting the most interesting parts of that context.
3. Matching information to the extracted parts of the user context. If I'm writing to Neha about hiking in SF, the matched information might be links (the top Google result is the useful bahiker.com), my own documents (previous communications with Neha about hiking), products (hiking shoes), or ads (such as Gmail Ads, which I worked on).
4. Ranking the information results by their quality and relevance to the extracted user context.
5. Deciding when the user is in the right mindset to be presented with information. If they're focused on finishing their writing, they may want to minimize interruptions. My instinct is to use typing speed (faster typing = less interest in interruptions).
6. Displaying the information in a non-obtrusive way. If the information changes every 30 seconds as the user types, it can get annoying to have an area of the screen updating so frequently.
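To make the shape of the pipeline concrete, here's a loose sketch of the six components. Everything in it is hypothetical: the function names, the 20-second window, the typing-speed threshold, and the naive term extraction are all placeholders for the hard parts discussed below, not real solutions.

```python
import time
from dataclasses import dataclass

@dataclass
class Suggestion:
    text: str
    score: float

def current_context(keystrokes, window_seconds=20.0):
    """Component 1: one possible policy -- keep only text typed in the
    last N seconds. `keystrokes` is a list of (timestamp, char) pairs."""
    now = time.time()
    return "".join(ch for t, ch in keystrokes if now - t <= window_seconds)

def extract_key_terms(context):
    """Component 2: a naive stand-in for the hard extraction problem --
    just keep longer words and hope they carry the gist."""
    return [w for w in context.split() if len(w) > 4]

def match_and_rank(terms, corpus):
    """Components 3 and 4: score each candidate by raw term overlap,
    then sort best-first. Real systems would use proper retrieval."""
    scored = [Suggestion(doc, sum(doc.count(t) for t in terms)) for doc in corpus]
    return sorted((s for s in scored if s.score > 0), key=lambda s: -s.score)

def user_receptive(chars_per_second, threshold=4.0):
    """Component 5: the typing-speed heuristic -- fast typing means the
    user is in flow, so don't interrupt. Threshold is a guess."""
    return chars_per_second < threshold
```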
#3 (matching) and #4 (ranking) are very similar to web search, so the existing solutions are mature. #1 (determining the user context) can be handled with some tweaking. The parts I found most challenging were #2 (extracting the key parts of the context) and #5 (deciding when to show results).
I think #6 (displaying) depends on #5 (user mindset). Solving #5 would make #6 tractable.
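One way that dependency could play out: gate the display on a signal from #5, such as a pause in typing, plus a refresh cooldown so the panel isn't constantly churning. This is a sketch of the idea only; the class name and both thresholds are invented, not tuned values.

```python
import time

class SuggestionPanel:
    """Refresh the displayed results only when the user seems receptive:
    typing has paused, and the panel hasn't refreshed too recently."""

    def __init__(self, pause_seconds=2.0, min_refresh_seconds=30.0):
        self.pause_seconds = pause_seconds          # how long a "pause" is (a guess)
        self.min_refresh_seconds = min_refresh_seconds  # UI churn limit (a guess)
        self.last_keystroke = 0.0
        self.last_refresh = float("-inf")
        self.shown = None

    def on_keystroke(self, now=None):
        self.last_keystroke = time.time() if now is None else now

    def offer(self, results, now=None):
        """Try to show new results; return True only if actually displayed."""
        now = time.time() if now is None else now
        paused = now - self.last_keystroke >= self.pause_seconds
        cooled = now - self.last_refresh >= self.min_refresh_seconds
        if paused and cooled and results:
            self.shown = results[0]
            self.last_refresh = now
            return True
        return False
```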
I did an experiment a couple years ago to automatically match images as users typed in an IM conversation. I wanted to see if IM conversations could be summarized via images. Then if you had a lobby of real-time group chats, you could look at the image streams to decide which of the chats was most interesting to you.
I preprocessed a few million online images to match each one with semantic word-clusters. During an IM conversation, I took lines of typing and converted them to the same clusters, and then correlated that with my image corpus.
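The matching step can be sketched as: represent both images and chat lines as vectors of cluster activations, then take the image with the highest cosine similarity. This is my reconstruction of the general technique, not the original code; the cluster vocabularies and image vectors below are invented examples.

```python
import math

def cluster_vector(text, clusters):
    """Map text to cluster activations: count how many words from each
    cluster's vocabulary appear. `clusters` is {cluster_name: set_of_words}."""
    words = text.lower().split()
    return {name: sum(w in vocab for w in words) for name, vocab in clusters.items()}

def cosine(u, v):
    """Cosine similarity between two dicts sharing the same keys."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_image(chat_line, image_vectors, clusters):
    """Pick the preprocessed image whose cluster vector best matches the line.
    `image_vectors` is {image_name: precomputed cluster activations}."""
    query = cluster_vector(chat_line, clusters)
    return max(image_vectors, key=lambda img: cosine(query, image_vectors[img]))
```

The noise problem described next shows up here directly: a single off-topic word can activate the wrong cluster and dominate the match.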
The results were mixed. Sometimes the image matching worked serendipitously well. Once I was writing about my experiment, and the algorithm posted one of the standard image-processing photos (I think it was the baboon). The issue was the amount of noise, caused by images matching a tangential part of the conversation rather than its main gist (the extraction problem discussed in point 2).
I've also run this experiment using email instead of IM, and another time even hooked into the user's keystrokes so that I captured whatever they were typing at that moment. Increasing the amount of input context didn't actually help much. At any moment, the user is still only interested in the small amount of information they're typing, and it's hard to capture the gist from such a short piece of text.
The academic papers in this area (at least the ones I read a few years ago) didn't seem to have good solutions to this problem.
I think this is an interesting area. Seeing advancements in this space would be pretty cool.