8

I have a list of about 1 million addresses, and a function to find their latitudes and longitudes. Since some of the records are improperly formatted (or for whatever reason), sometimes the function is not able to return the latitudes and longitudes of some addresses. This would lead to the for loop breaking. So, for each address whose latitude and longitude is successfully retrieved, I want to write it to the output CSV file. Or, perhaps instead of writing line by line, writing in small chunk sizes would also work. For this, I am using df.to_csv in "append" mode (mode='a') as shown below:

for i in range(len(df)):
    place = df['ADDRESS'][i]
    try:
        lat, lon, res = gmaps_geoencoder(place)
    except:
        pass

    df['Lat'][i] = lat
    df['Lon'][i] = lon
    df['Result'][i] = res

    df.to_csv(output_csv_file,
          index=False,
          header=False,
          mode='a', #append data to csv file
          chunksize=chunksize) #size of data to append for each loop

But the problem with this is that, it is printing the whole dataframe for each append. So, for n lines, it would write the whole dataframe n^2 times. How to fix this?

Kristada673
  • 2,895
  • 3
  • 21
  • 59
  • 1
    Why not just assign `NaN` or something in the except case, and then just write the entire DataFrame at the end? You can even subset it to where it's not null if you don't want to include the bad data in the csv. – ALollz Jul 12 '18 at 03:12
  • Regardless, you can use `df.iloc[i:i+1].to_csv(...)` to write only the single line you are working with if you truly need to do it line by line. – ALollz Jul 12 '18 at 03:23
  • 1
    just declare default values for `lat`, `lon` and `res` before your `try` block. – Burhan Khalid Jul 12 '18 at 03:40

1 Answers1

6

If you really want to print line by line. (You should not).

for i in range(len(df)):
    df.loc[[i]].to_csv(output_csv_file,
        index=False,
        header=False,
        mode='a')
Robert Altena
  • 695
  • 1
  • 9
  • 25
  • If printing line by line is not a good idea, what would you suggest? Printing in chunks? That's what I was trying in my code, but it was appending the whole dataframe every time, that was the issue. – Kristada673 Jul 12 '18 at 04:03
  • Save the entire data frame at the end of the loop? And instead of a python loop use something like this: https://stackoverflow.com/questions/46799234/fastest-way-to-loop-over-pandas-dataframe-for-api-calls – Robert Altena Jul 12 '18 at 05:10
  • 5
    But as I explained in the question, if there is some problem, any problem, due to which the code breaks, then the whole time spent until then is basically wasted (for a million records, it would take 6 days to process the entire dataset). I want to print in chunks or line-by-line only to avoid this. – Kristada673 Jul 12 '18 at 05:57