Semi-automatic annotation tool - How to find RDF Triplets

Question

I'm developing a semi-automatic annotation tool for medical texts, and I am completely lost on how to find the RDF triples for annotation.

I am currently trying to use an NLP-based approach. I have already looked into Stanford NER and OpenNLP, and neither has a model for extracting disease names.

My questions are:

* How can I create a new NER model for extracting disease names, and can I get any help from the OpenNLP or Stanford NER tools?
* Is there another approach altogether, other than NLP, for extracting RDF triples from text?

Any help would be appreciated! Thanks.

score 4 · Accepted Answer · answered Apr 29 '12 at 14:53

I have done something similar to what you need with both OpenNLP and LingPipe. LingPipe's exact dictionary-based chunking was good enough for my use case, so I used that. Documentation is available here: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html

You can find a small demo here:

https://github.com/castagna/nerdf
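To illustrate the idea behind dictionary-based chunking, here is a minimal Python sketch of exact, longest-match gazetteer lookup. It is not LingPipe's implementation or API; the gazetteer entries, the `DISEASE` label, and all function names are assumptions made up for illustration.

```python
import re

# Hypothetical gazetteer: phrase -> entity type. A real one would come from a
# curated vocabulary; these entries are invented for illustration.
GAZETTEER = {
    "diabetes": "DISEASE",
    "type 2 diabetes": "DISEASE",
    "hypertension": "DISEASE",
}

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Return (lowercased token, start offset, end offset) for each word."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def build_index(gazetteer):
    """Map token tuples to entity types and record the longest phrase length."""
    index = {}
    for phrase, label in gazetteer.items():
        key = tuple(tok for tok, _, _ in tokenize(phrase))
        index[key] = label
    max_len = max((len(key) for key in index), default=0)
    return index, max_len


def chunk(text, gazetteer=GAZETTEER):
    """Find exact dictionary matches, preferring the longest match at each position."""
    index, max_len = build_index(gazetteer)
    tokens = tokenize(text)
    chunks = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in index:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                match = (text[start:end], index[key], start, end)
                i += n  # skip past the matched tokens
                break
        if match:
            chunks.append(match)
        else:
            i += 1
    return chunks


if __name__ == "__main__":
    for found in chunk("The patient has Type 2 diabetes and hypertension."):
        print(found)
```

Matching is case-insensitive and works on whole tokens, so "diabetes" inside "Type 2 diabetes" is reported once as the longer phrase rather than twice. Whether that is good enough depends on how well the dictionary covers the spellings in your texts.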

If a gazetteer/dictionary approach isn't good enough for you, you can try creating your own model; OpenNLP has an API for training models as well. Documentation is here: http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training
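As I understand the format described in the OpenNLP manual, the name finder is trained on plain text with one whitespace-tokenized sentence per line, where each entity is wrapped in `<START:type>` and `<END>` markers. The entity type is whatever label appears in the data. The Python sketch below converts token-level annotations into such lines; the `disease` label, the example sentence, and the helper name `to_opennlp_line` are assumptions for illustration, so check the output against the manual before training on it.

```python
def to_opennlp_line(tokens, spans):
    """Render one sentence as a name finder training line.

    tokens: list of token strings for one sentence.
    spans:  list of (start, end_exclusive, entity_type) token-index triples,
            assumed to be non-overlapping.
    """
    starts = {start: entity_type for start, _, entity_type in spans}
    ends = {end for _, end, _ in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close a span that stops before this token
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # close a span that runs to the end of the sentence
    return " ".join(out)


if __name__ == "__main__":
    sentence = ["The", "patient", "was", "diagnosed", "with",
                "type", "2", "diabetes", "."]
    annotations = [(5, 8, "disease")]  # hypothetical annotation
    print(to_opennlp_line(sentence, annotations))
```

Writing one such line per annotated sentence gives a file that can be fed to the training tool, and the same annotations can be reused to train and compare other systems.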

Extracting RDF triples from natural language is a different problem from identifying named entities. NER is a related and perhaps necessary step, but it is not enough. To extract an RDF statement from natural language, you need to identify not only the entities that form the subject and the object of the statement, but also the verb or relationship that connects them, and then map all of these to URIs.
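To make those steps concrete, here is a deliberately naive Python sketch: it finds known entity mentions, looks for a known relation word between two consecutive mentions, maps all three to URIs, and emits an N-Triples line. Everything in it is an assumption for illustration: the `http://example.org/` URIs, the lookup tables, and the function name. Real relation extraction needs far more than a word lookup.

```python
# Hypothetical lookup tables: surface form -> URI. In practice these would
# come from a vocabulary or ontology, not a hand-written dict.
ENTITY_URIS = {
    "metformin": "http://example.org/drug/metformin",
    "type 2 diabetes": "http://example.org/disease/type-2-diabetes",
}
RELATION_URIS = {
    "treats": "http://example.org/relation/treats",
}


def extract_triples(sentence):
    """Return N-Triples lines for 'entity relation entity' patterns in a sentence."""
    text = sentence.lower()
    # Step 1: locate entity mentions (first occurrence only, a simplification).
    mentions = sorted((text.find(name), name) for name in ENTITY_URIS if name in text)
    triples = []
    # Step 2: look for a known relation word between consecutive mentions.
    for (s_pos, s_name), (o_pos, o_name) in zip(mentions, mentions[1:]):
        between = text[s_pos + len(s_name):o_pos].split()
        for phrase, predicate_uri in RELATION_URIS.items():
            if phrase in between:
                # Step 3: map subject, predicate, and object to URIs.
                triples.append(
                    f"<{ENTITY_URIS[s_name]}> <{predicate_uri}> <{ENTITY_URIS[o_name]}> ."
                )
    return triples


if __name__ == "__main__":
    for line in extract_triples("Metformin treats type 2 diabetes."):
        print(line)
```

A sentence that mentions both entities without a known relation word between them yields no triple, which is the gap that NER alone leaves open.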

castagna


Hmm, OK. I've already looked into the OpenNLP training tool, but my question is: can I train the model to discover disease names when it was originally designed for person names? – Gavin Spencer Apr 29 '12 at 19:57
You can try it and measure how good it is. Whatever tool you use, you will probably need a dataset for training it (unless you use a gazetteer/dictionary approach). So you can use the same dataset to train different systems and compare them. The OpenNLP training API is simple enough that running an experiment with it isn't expensive. But you do need a training dataset. – castagna Apr 30 '12 at 09:49
Yes, exactly. I have looked a little for training/test datasets and found a couple of free ones, the best of which seems to be the PubMed database. Do you know of any other training datasets I can use? Thank you so much! – Gavin Spencer Apr 30 '12 at 18:21
Hi Gavin, my use case was finding ingredients, tools and cooking techniques in recipes (so, quite a different scenario). I am afraid I do not have good suggestions on training datasets for your disease use case. – castagna May 04 '12 at 11:09
