
I have a logic dilemma.

I'm trying to retrieve all restaurants in France using Yelp. To do so, I'm writing a web crawler that queries the Yelp API and retrieves the data. The query looks like this:

https://api.yelp.com/v3/businesses/search?cc=fr&location=Paris&categories=restaurants
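
For reference, a minimal sketch of issuing that request from Python with the requests library; API_KEY is a placeholder, the query parameters simply mirror the URL above, and the "businesses" key follows Yelp's documented response shape:

import requests

API_KEY = "your-yelp-api-key"  # placeholder: issued by the Yelp developer console

def search_restaurants(city, country_code):
    # one search call for a single city; parameters mirror the URL above
    resp = requests.get(
        "https://api.yelp.com/v3/businesses/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"cc": country_code, "location": city, "categories": "restaurants"},
    )
    resp.raise_for_status()
    return resp.json()["businesses"]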

I have a SQL database with all the cities in France (nice isn't it?):

[screenshot of the table of French cities]

At the moment I run my web crawler by typing each city name in manually. Instead, I would like my script to:

  1. query SQL Server
  2. take the Country and City values
  3. pass those values into the URL (I know how to do this)
  4. move on to the next row and repeat (I don't know how to do this)

So my question is more about logic: how should I structure the whole thing?

How do I query the database, process one row, and then move on to the next?

Francesco Mantovani

2 Answers

Assuming there's a Python DB-API compliant connector for your database (you didn't mention the vendor...), you just have to iterate over the cursor:

def get_restaurants_for(country, city):
    # your api calls etc here
    ...

def main():
    # "yourdbmodule" is a placeholder for your actual DB-API connector
    db = yourdbmodule.connect(**connexion_infos)
    cursor = db.cursor()
    try:
        cursor.execute("select country, city from yourtable")
        # the cursor yields one (country, city) tuple per row
        for country, city in cursor:
            get_restaurants_for(country, city)
    finally:
        cursor.close()
        db.close()
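
Since the question's first step mentions SQL Server, a concrete version of the loop above could use pyodbc; the connection string, table, and column names here are illustrative assumptions:

import pyodbc

def main():
    # illustrative connection string; adjust driver, server and credentials
    db = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=geo;Trusted_Connection=yes;"
    )
    try:
        cursor = db.cursor()
        cursor.execute("SELECT country, city FROM cities")
        # pyodbc rows unpack like tuples
        for country, city in cursor:
            get_restaurants_for(country, city)
    finally:
        db.close()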
bruno desthuilliers

OK, so let's start with a basic point: play to the strengths of each environment, and understand the tradeoffs of each approach.

Python is very consistent about how things are handled via iteration, so ideally you want to structure your code like this:

for city in cities():
    crawl_yelp(city)

Now the obvious, simple approach is to return a cursor and then as you iterate over it, you get a record you pass into your crawl function. If your database is not under high write load, my suspicion is that holding the read transaction open for this long is probably not a terrible thing.
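
A minimal sketch of that cursor-based cities(), reusing the placeholder names from the first answer; written as a generator, it streams one row at a time while the cursor (and its read transaction) stays open:

def cities():
    # generator: rows are streamed from the open cursor one at a time,
    # so the read transaction lasts for the whole crawl
    db = yourdbmodule.connect(**connexion_infos)
    try:
        cursor = db.cursor()
        cursor.execute("select city from yourtable")
        for (city,) in cursor:
            yield city
    finally:
        db.close()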

However, if it is, then cities() could collect the rows into a list and return that instead. You would trade memory and initial latency for something easier on the database server under a higher write load. If memory is a problem, you could write the rows to a CSV file, then open that file and iterate over it (though doing that in a backwards-compatible way may take more work).
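
For illustration, both alternatives sketched as code, again with placeholder names; the CSV variant spools the rows to disk first so that neither memory use nor the database transaction grows with the size of the crawl:

import csv

def cities_as_list():
    # fetch everything up front: the transaction closes quickly,
    # at the cost of holding all rows in memory
    db = yourdbmodule.connect(**connexion_infos)
    try:
        cursor = db.cursor()
        cursor.execute("select city from yourtable")
        return [city for (city,) in cursor.fetchall()]
    finally:
        db.close()

def cities_via_csv(path="cities.csv"):
    # spool the rows to a file, then stream them back from disk
    db = yourdbmodule.connect(**connexion_infos)
    try:
        cursor = db.cursor()
        cursor.execute("select city from yourtable")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(cursor)
    finally:
        db.close()
    with open(path, newline="") as f:
        for (city,) in csv.reader(f):
            yield city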

There's a saying that premature optimization is the root of all evil, and it applies here. The key thing is to separate the logic so that you can change it later without breaking anything. So I would start by looping over the cursor unless I knew that would be a problem; if it is (or turns out to be), I would then explore the other options, such as fetching everything from the cursor into a list of tuples (or dictionaries) and returning that, or serializing to a file and iterating over that.

Chris Travers