4

I'm trying to do Entity Extraction (more like matching) in Lucene. Here is a sample workflow:

Given some text (from a URL) AND a list of people's names, try to extract names of people from the text.

Note:

Names of people are not completely normalized, e.g. some are "Mr. X", "Mrs. Y" and some are just "John Doe", "X" and "Y". Other prefixes and suffixes to think about are Jr., Sr., Dr., I, II, etc. (don't get me started on non-US names).

I am using Lucene's MemoryIndex to create an in-memory index of the text from each URL (stripping HTML tags), and am using StandardAnalyzer to query for the list of all names, one at a time (100k names; is there any other way to do this? On average this takes about 8 seconds per text).

A major problem is that, to eliminate noise, I'm using a score of 0.01 as a base score, and queries like "Mr. John Doe" score significantly lower than "John Doe" when the text contains "John Doe", so in many cases they miss the 0.01 threshold.
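For reference, here is roughly what my per-URL loop looks like (a minimal sketch assuming a recent Lucene; 3.x-era constructors also take a Version argument, and the field name, sample names and 0.01 cut-off are just illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    import java.util.List;

    public class NameMatcher {
        public static void main(String[] args) throws Exception {
            String pageText = "... text extracted from the URL, HTML already stripped ...";
            List<String> names = List.of("John Doe", "Dr. John Edward II"); // ~100k in practice

            StandardAnalyzer analyzer = new StandardAnalyzer();

            // One throwaway in-memory index per fetched page, as described above.
            MemoryIndex index = new MemoryIndex();
            index.addField("content", pageText, analyzer);

            QueryParser parser = new QueryParser("content", analyzer);
            float threshold = 0.01f; // the noise cut-off mentioned above

            for (String name : names) {
                // Quote the (escaped) name so it runs as a phrase query.
                Query query = parser.parse("\"" + QueryParser.escape(name) + "\"");
                float score = index.search(query); // raw score, 0.0 if no match
                if (score > threshold) {
                    System.out.println(name + " -> " + score);
                }
            }
        }
    }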

The other problem is that if I normalize all names and start removing all occurrences of Dr., Mr., Mrs., etc., then I start missing good matches like "Dr. John Edward II" and end up with a lot of junk matches like "Mr. John Edward".

I understand that Lucene might not be the right tool for the job, but so far it hasn't proved to be too bad. Any help appreciated.

ankimal

5 Answers

2

Named entity extraction (NEE) is an NLP task that is not part of Lucene. For open source, you can look at LingPipe, GATE and OpenNLP. There are various for-money alternatives.

GATE is entirely rule-based and will be hard to use for high precision. You'll need a statistical engine for that; LingPipe has one, but you have to supply the training data. I'm not up to date on what OpenNLP includes in this area.

bmargulies
  • We've tried GATE but are running into the same problem of sorts: high false positives. This is not necessarily bad, but we're trying to tweak this as much as possible. – ankimal Nov 29 '10 at 21:18
  • Well, there's a reason some of us sell this stuff for money. If you are interested in a commercial solution, comment to that effect and I'll send you a point of contact. – bmargulies Nov 29 '10 at 21:19
0

OpenNLP is useful. http://opennlp.apache.org/

The site has documentation and examples.

For the completely uninitiated, the book Taming Text (http://www.manning.com/ingersoll/) provides a good overview. You can also download the book's source code from the link above.

Jason
0

You can try this: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html

The documentation is clear. You can also use the DBpedia Spotlight web service:

http://spotlight.dbpedia.org/rest/spot/?text=
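A minimal sketch of calling that endpoint from Java 11+ (the URL is the one above and the hosted service may have moved, so treat it as an assumption to verify):

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class SpotlightSpotter {
        public static void main(String[] args) throws Exception {
            String text = "Dr. John Edward II met John Doe in New York.";

            // Endpoint copied from the answer above; verify it before relying on it.
            String url = "http://spotlight.dbpedia.org/rest/spot/?text="
                    + URLEncoder.encode(text, StandardCharsets.UTF_8);

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Accept", "application/json")
                    .GET()
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // raw response with the spotted surface forms
        }
    }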

Sreedhar GS
0

Disambiguation of human names is notoriously difficult. If you have other information, such as locations or co-occurrence of names, it will be valuable. But there is still a lot of work going into author disambiguation, and it cannot normally be solved from a list of names alone.

Here is a typical project: http://code.google.com/p/bibapp/wiki/AuthorAuthorities. And a typical publication: http://www.springerlink.com/content/lk07h1m311t130w4/.

Here is a project on record deduplication which we find useful for author disambiguation: http://datamining.anu.edu.au/projects/linkage.html

peter.murray.rust
0

These projects could be useful for you:

http://nlp.stanford.edu/ner/index.shtml

http://cogcomp.cs.illinois.edu/page/software_view/4
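For the Stanford tool, a rough sketch of the usual CRF classifier usage (the model file name is an assumption; the 3-class English model ships with the Stanford NER download):

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class StanfordNerDemo {
        public static void main(String[] args) throws Exception {
            // Model path is an assumption; adjust to wherever the download puts it.
            CRFClassifier<CoreLabel> classifier =
                    CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz");

            String text = "Dr. John Edward II met John Doe in New York.";
            // Wraps recognized entities (PERSON, LOCATION, ORGANIZATION) in inline XML tags.
            System.out.println(classifier.classifyWithInlineXML(text));
        }
    }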

jmmata