I m trying to do Entity Extraction (more like matching) in Lucene. Here is a sample workflow:
Given some text (from a URL) AND a list people names, try to extract names of people from the text.
Note:
Names of people are not completely normalized. e.g. Some are Mr. X, Mrs. Y and some are just John Doe, X and Y. Other prefixes and suffixes to think about are Jr., Sr., Dr., I, II ... etc. (dont let me get started with non US names).
I am using Lucene MemoryIndex to create an in memory index of the text from each Url (stripping html tags) and am using StandardAnalyzer to query for the list of all names, one at a time (100k names, Is there any other way to do this? On an avg. this takes about 8 secs. on the average text I have).
A major problem is that to eliminate noise I m using a score of 0.01 as a base score and queries like "Mr. John Doe" have a significantly lower score as compared to "John Doe" if the text contains "John Doe" and in many cases miss the 0.01 threshold.
The other problem is that If I normalize all names and start removing all occurences of Dr. Mr. Mrs. etc. then I start missing good matches like "Dr. John Edward II" and end up with a lot of junk matches like "Mr. John Edward".
I understand that Lucene might not be the right tool for the job either, but so far it hasnt proved to be too bad. Any help appreciated.