
I have a logic dilemma.

I'm trying to retrieve all restaurants in France using Yelp. To do so, I'm writing a web crawler that queries the Yelp API and retrieves the data. The query looks like this:

https://api.yelp.com/v3/businesses/search?cc=fr&location=Paris&categories=restaurants
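
For reference, a minimal sketch of issuing that request from Python with the requests library; API_KEY is a placeholder, the query parameters simply mirror the URL above, and the "businesses" key follows Yelp's documented response shape:

import requests

API_KEY = "your-yelp-api-key"  # placeholder: issued by the Yelp developer console

def search_restaurants(city, country_code):
    # one search call for a single city; parameters mirror the URL above
    resp = requests.get(
        "https://api.yelp.com/v3/businesses/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"cc": country_code, "location": city, "categories": "restaurants"},
    )
    resp.raise_for_status()
    return resp.json()["businesses"]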

I have a SQL database with all the cities in France (nice isn't it?):

[screenshot of the table of French cities]

At the moment I run my web crawler by typing each city name in manually. Instead, I would like my script to:

  1. query SQL Server
  2. take the Country and City values
  3. pass those values into the URL (I know how to do this)
  4. move on to the next row and repeat (I don't know how to do this)

So my question is more about logic: how should I structure the whole thing?

How do I query the database, process one row, and then move on to the next?

Francesco Mantovani

2 Answers

Assuming there's a Python DB-API compliant connector for your database (you didn't mention the vendor...), you just have to iterate over the cursor:

def get_restaurants_for(country, city):
    # your api calls etc here
    ...

def main():
    # "yourdbmodule" is a placeholder for your actual DB-API connector
    db = yourdbmodule.connect(**connexion_infos)
    cursor = db.cursor()
    try:
        cursor.execute("select country, city from yourtable")
        # the cursor yields one (country, city) tuple per row
        for country, city in cursor:
            get_restaurants_for(country, city)
    finally:
        cursor.close()
        db.close()
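
Since the question's first step mentions SQL Server, a concrete version of the loop above could use pyodbc; the connection string, table, and column names here are illustrative assumptions:

import pyodbc

def main():
    # illustrative connection string; adjust driver, server and credentials
    db = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=geo;Trusted_Connection=yes;"
    )
    try:
        cursor = db.cursor()
        cursor.execute("SELECT country, city FROM cities")
        # pyodbc rows unpack like tuples
        for country, city in cursor:
            get_restaurants_for(country, city)
    finally:
        db.close()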
bruno desthuilliers

OK, so let's start with a basic point: play to the strengths of each environment, and understand the tradeoffs of each approach.

Python is very consistent about how things are handled via iteration, so ideally you want to structure your code like this:

for city in cities():
    crawl_yelp(city)

Now the obvious, simple approach is to return a cursor and then as you iterate over it, you get a record you pass into your crawl function. If your database is not under high write load, my suspicion is that holding the read transaction open for this long is probably not a terrible thing.
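
A minimal sketch of that cursor-based cities(), reusing the placeholder names from the first answer; written as a generator, it streams one row at a time while the cursor (and its read transaction) stays open:

def cities():
    # generator: rows are streamed from the open cursor one at a time,
    # so the read transaction lasts for the whole crawl
    db = yourdbmodule.connect(**connexion_infos)
    try:
        cursor = db.cursor()
        cursor.execute("select city from yourtable")
        for (city,) in cursor:
            yield city
    finally:
        db.close()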

However, if it is, then cities() could collect the rows into a list and return that instead. You would trade memory and initial latency for something easier on the database server under a higher write load. If memory is a problem, you could write the rows to a CSV file, then open that file and iterate over it (though doing that in a backwards-compatible way may take more work).
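
For illustration, both alternatives sketched as code, again with placeholder names; the CSV variant spools the rows to disk first so that neither memory use nor the database transaction grows with the size of the crawl:

import csv

def cities_as_list():
    # fetch everything up front: the transaction closes quickly,
    # at the cost of holding all rows in memory
    db = yourdbmodule.connect(**connexion_infos)
    try:
        cursor = db.cursor()
        cursor.execute("select city from yourtable")
        return [city for (city,) in cursor.fetchall()]
    finally:
        db.close()

def cities_via_csv(path="cities.csv"):
    # spool the rows to a file, then stream them back from disk
    db = yourdbmodule.connect(**connexion_infos)
    try:
        cursor = db.cursor()
        cursor.execute("select city from yourtable")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(cursor)
    finally:
        db.close()
    with open(path, newline="") as f:
        for (city,) in csv.reader(f):
            yield city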

There's a saying that premature optimization is the root of all evil, and it applies here. The key thing is to separate the logic so that you can change it later without breaking anything. So I would start by looping over the cursor unless I knew that would be a problem; if it is (or turns out to be), I would then explore the other options, such as fetching everything from the cursor into a list of tuples (or dictionaries) and returning that, or serializing to a file and iterating over that.

Chris Travers