How to get topdown topics for terms?

Question

For example, the terms experience, yrs, ctc should imply the subject jobs, while badge, unlocked should be associated with foursquare.

How do I get the subject from its terms? I want to analyse less-than-formal English such as emails, tweets, etc. Is there a data repository and API for this? Can I query Freebase for this? I would prefer something that can be self-hosted.

Are these implications (e.g., cv -> jobs) specific to your data, or general? Within your collection, do you have labelled documents (e.g., in jobs) where those terms appear? — miguelmalvarez, Jun 05 '13 at 09:10
No, we don't. Hence the need for an external source. My bad for not mentioning it clearly. — Jesvin Jose, Jun 05 '13 at 09:20

score 1 · Answer 1 · answered Jun 05 '13 at 13:51

Freebase includes WordNet but doesn't really have much which will help with this task -- at least directly. As Miguel implies with his question, if you had gold standard data you could train a classifier, or set of classifiers, for your problem. The other option would be to pay for a commercial service to do this.

score 0 · Answer 2 · answered Jun 06 '13 at 08:43

@TomMorris has been very clear with his answer, and I agree that Freebase (or similar resources) can only be used indirectly, because a global taxonomy might not have a direct mapping to your problem.

My advice, and what I would do if no topic information can be provided, is the following:

1. Apply clustering techniques to your data.
2. Try to decide (automatically or not) the meaning of each cluster.
3. Assume that all the documents in a cluster belong to that "class".
4. Use that information to train a classifier.
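
The steps above could be sketched roughly as follows in Python. This is only a minimal illustration using the standard library, not a tuned implementation: the four toy documents, the hand-picked seeds, the cluster names in `cluster_names`, and the helper functions (`tokenize`, `cosine`, `kmeans`, `train`, `predict`) are all assumptions made up for the example. On real data you would more likely use a library such as scikit-learn, with random or k-means++ initialisation and a proper tokenizer.

```python
from collections import Counter
import math

# Toy collection (made up for illustration).
docs = [
    "5 yrs experience ctc negotiable",
    "unlocked a new badge on checkin",
    "experience required 3 yrs ctc",
    "badge unlocked at the cafe checkin",
]

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, seeds, iterations=10):
    """Very small k-means variant: assign each vector to the most similar centroid."""
    centroids = [Counter(vectors[i]) for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # New centroid = summed counts of its members (cosine ignores scale).
        centroids = [Counter() for _ in seeds]
        for v, c in zip(vectors, assignment):
            centroids[c].update(v)
    return assignment, centroids

# Step 1: cluster. Seeds are fixed by hand so the run is reproducible.
vectors = [Counter(tokenize(d)) for d in docs]
assignment, centroids = kmeans(vectors, seeds=[0, 1])

# Step 2: inspect the top terms of each cluster and name it (here: by hand).
top_terms = [[t for t, _ in c.most_common(3)] for c in centroids]
cluster_names = {0: "jobs", 1: "foursquare"}  # hypothetical human decision

# Step 3: every document inherits the name of its cluster.
labels = [cluster_names[c] for c in assignment]

# Step 4: train a classifier on the derived labels (multinomial Naive Bayes).
def train(vectors, labels):
    vocab = {t for v in vectors for t in v}
    term_counts = {}
    for v, y in zip(vectors, labels):
        term_counts.setdefault(y, Counter()).update(v)
    return vocab, term_counts, Counter(labels)

def predict(model, text):
    vocab, term_counts, doc_counts = model
    n_docs = sum(doc_counts.values())
    scores = {}
    for y, counts in term_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[y] / n_docs)
        for t in tokenize(text):
            if t in vocab:  # unseen terms are ignored
                # Laplace (add-one) smoothing
                score += math.log((counts[t] + 1) / (total + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

model = train(vectors, labels)
print(predict(model, "ctc and yrs of experience"))
```

Note that the classifier can only be as good as the cluster labels it was trained on, so it is worth checking `top_terms` for each cluster before trusting the names.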

Main problems: 1. I have no idea about the size of your data, but it could be a problem for the clustering and/or the manual labelling of clusters. 2. The quality might be much lower than with manual judgments.

I hope this gives you some hints at least.
