3

I'm looking at writing a mashup app that will take submission titles from a subreddit and attempt to plot them on a map based on where they are likely to be relevant. I'd also like to add on things like Twitter later on.

What I'm having difficulty planning is how to detect the country a title is most likely to be relevant to. My first guess is to keep a list of countries along with their matching variants (e.g. "English" matches "England") and check for occurrences of those terms in the text. However, this is probably going to be quite slow, and it will require me to list the possessive* name for each country.

I'm planning on doing this in Python (so as to learn to use it), so I'm wondering whether there is a) a library that does this (and that I can learn from) or b) a more obvious way to do it.

To give an idea of the types of input I'm working with, here are some samples and what I'm trying to get out of them:

  • "Well they can't arrest all of us - Giving the middle finger to the British legal system (pic)"
    • Keyword: British (Great Britain)
  • "Poll: Wikileaks Assange leading Time 'Person of the Year' - Assange, an Australian who has become a thorn in the side of the Pentagon with his releases of secret US military documents about the wars in Iraq and Afghanistan, had received 21,736 votes as of Friday."
    • Keywords: Afghanistan, Iraq, [Australian] (Afghanistan, Iraq, [Australia]) - Australia would be difficult to catch out as mainly irrelevant but this is acceptable for my purposes
  • "Cyber attack on Nobel peace prize website launched. Stay classy, China."
    • Keyword: China (China)
  • "A Jewish surgeon refuses to operate on a patient and walks out of the operating room after discovering a nazi tattoo on the patient's arm."
    • Keywords: none - acceptable for my purposes

* This is probably the wrong word to use
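
The list-based idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `COUNTRY_TERMS` table and the `countries_in` function are hypothetical names, the table is hand-written and covers just the sample titles, and a real one would need far more entries. Compiling all the terms into a single regular expression means each title is scanned once rather than once per term, which should ease the speed concern somewhat.

```python
import re

# Hypothetical, hand-written table mapping surface forms (lowercase)
# to the country they point at. Only covers the sample titles.
COUNTRY_TERMS = {
    "british": "Great Britain",
    "britain": "Great Britain",
    "afghanistan": "Afghanistan",
    "iraq": "Iraq",
    "australian": "Australia",
    "australia": "Australia",
    "china": "China",
    "chinese": "China",
}

# One alternation of all terms, bounded by \b so that only whole
# words match; IGNORECASE lets "CHINA" and "china" both match.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in COUNTRY_TERMS) + r")\b",
    re.IGNORECASE,
)


def countries_in(title):
    """Return the countries mentioned in title, in order of first appearance."""
    found = []
    for match in _PATTERN.finditer(title):
        country = COUNTRY_TERMS[match.group(1).lower()]
        if country not in found:  # report each country once
            found.append(country)
    return found


if __name__ == "__main__":
    print(countries_in("Cyber attack on Nobel peace prize website launched. "
                       "Stay classy, China."))
```

This does nothing to resolve ambiguity (a term such as "English" could mean the country, the language, or the people), so it is a starting point for experimenting rather than a complete solution.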

Ross
  • Using the API can you get the submitted user's details? – alex Nov 13 '10 at 02:19
  • Alex: I'll probably use the user's geoloc info with twitter but in this case I'm basically working with just a headline. I'm looking into subject indexing which looks just as complicated as last time I read about it :) – Ross Nov 13 '10 at 02:29
  • (1) s/possessive/adjective/ (2) How will you distinguish whether "English" is referring to the country, the language, or the people? – John Machin Nov 13 '10 at 04:36

3 Answers

3

You can look into the Yahoo! Placemaker API:

Placemaker provides geo-enrichment for the hugely significant proportion of Web content that is geographically relevant but not geographically discoverable. Provided with free-form text, the service identifies places mentioned in text, disambiguates those places, and returns unique identifiers (WOEIDs) for each, as well as information about how many times the place was found in the text, and where in the text it was found. The WOEIDs returned by the service can be passed to Yahoo!'s GeoPlanet™ API for further geographic enrichment and discovery.

Russell Dias
  • Correct me if I'm wrong but it looks like you need to give them a place name and not just text containing a place name somewhere in it. Regardless I'll probably use that or Google's variant somewhere. – Ross Nov 13 '10 at 03:19
  • It states `Provided with free-form text, the service identifies places mentioned in text, disambiguates those places, and returns unique identifiers` in my above quote, which is in turn quoted from the Yahoo! page itself. So, I'm assuming that it does in fact gather place names *within* a body of text. – Russell Dias Nov 13 '10 at 03:25
0

Use a FULLTEXT search index in MySQL, then use AJAX calls to query against your database.

Dex
  • I know this will sound odd but I'd like to know more about how it's done, rather than actually get it done. Also I'm not quite sure but wouldn't this mean I'd have to query for every country? I'd like to be able to know what country a story is relevant to by just running a function on the headline. – Ross Nov 13 '10 at 03:31
0

Please see if this answer may help:

[The package geograpy3] allows you to extract place names from a URL or text, and add context to those names -- for example distinguishing between a country, region or city.

Matteo Gamboz