
I have the code below that reads a number of ticker symbols from a csv file into a dataframe.
Each ticker calls the Web API, which returns a dataframe df that is then appended to the previous one until complete.
The code works, but when a large number of tickers is used the code slows down tremendously.
I understand I can use multiprocessing and threads to speed up my code, but I don't know where to start and what would be best suited in my particular case.

What code should I use to get my data into a combined dataframe in the fastest possible manner?

import pandas as pd
import numpy as np
import json

# read the ticker symbols from the csv file
tickers = pd.read_csv("D:/verhuizen/pensioen/MULTI.csv", names=['symbol', 'company'])

# fetch one statement first, only to copy its column layout into an empty dataframe
read_str = 'https://financialmodelingprep.com/api/v3/income-statement/AAPL?limit=120&apikey=demo'
df = pd.read_json(read_str)
df = pd.DataFrame(columns=df.columns)

# fetch the income statement for each ticker and append it to the combined dataframe
for ind in range(len(tickers)):
    read_str = 'https://financialmodelingprep.com/api/v3/income-statement/' + tickers['symbol'][ind] + '?limit=120&apikey=demo'
    df1 = pd.read_json(read_str)
    df = pd.concat([df, df1], ignore_index=True)

df.set_index(['date', 'symbol'], inplace=True)
df.sort_index(inplace=True)

df.to_csv('D:/verhuizen/pensioen/MULTI_out.csv')


The code provided works fine for smaller data sets, but when I use a large number of tickers (>4,000) I get the error below at some point. Is this because the Web API gets overloaded, or is there another problem?

Traceback (most recent call last):
  File "D:/Verhuizen/Pensioen/Equity_Extractor_2021.py", line 43, in <module>
    data = pool.starmap(download_data, enumerate(TICKERS, start=1))
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00C33E30>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'

Process finished with exit code 1


It keeps giving the same error (for a larger number of tickers); the code is exactly as provided:

def download_data(pool_id, symbols):
    df = []
    for symbol in symbols:
        print("[{:02}]: {}".format(pool_id, symbol))
        #do stuff here
        read_str = BASEURL.format(symbol)
        df.append(pd.read_json(read_str))
        #df.append(pd.read_json(fake_data(symbol)))
    return pd.concat(df, ignore_index=True)

It failed again with pool.map, but I noticed one strange thing: each time it fails, it does so around 12,500 tickers (the total is around 23,000 tickers). Similar error:

Traceback (most recent call last):
  File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 21, in <module>
    data = pool.map(download_data, TICKERS)
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x078D1BF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'

Process finished with exit code 1


I also get the tickers from an API call, https://financialmodelingprep.com/api/v3/financial-statement-symbol-lists?apikey=demo (I noticed it does not work without a subscription). I wanted to attach the data as a csv file, but I don't have sufficient rights. I don't think it's a good idea to paste the returned data here...


I tried adding time.sleep(0.2) before the return as suggested, but again I get the same error, at ticker 12,510. Strange that every time it is around the same location. As there are multiple processes going on, I cannot see at what point it is breaking.

Traceback (most recent call last):
  File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 24, in <module>
    data = pool.map(download_data, TICKERS)
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00F32C90>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'

Process finished with exit code 1


Something very, very strange is going on. I have split the data into chunks of 10,000 / 5,000 / 4,000 and 2,000, and each time the code breaks approximately 100 tickers from the end. Clearly something is going on that is not right.

import time
import pandas as pd
import multiprocessing

# get tickers from your csv
df = pd.read_csv('D:/Verhuizen/Pensioen/All_Symbols.csv', header=None)

# convert the dataframe to a list (in total 23,000 tickers)
df = df[0]
TICKERS = df.tolist()

# select how many tickers I want
TICKERS = TICKERS[0:2000]

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def download_data(symbol):
    print(symbol)
    # do stuff here
    read_str = BASEURL.format(symbol)
    df = pd.read_json(read_str)
    #time.sleep(0.2)
    return df

if __name__ == "__main__":
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.map(download_data, TICKERS)
        df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
    df.to_csv('D:/verhuizen/pensioen/Income_2000.csv')

In this particular example the code breaks at position 1,903


RPAI
Traceback (most recent call last):
  File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_testing.py", line 27, in <module>
    data = pool.map(download_data, TICKERS)
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0793EAF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
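
One way to see exactly which ticker is failing, without the whole pool crashing (a sketch only; it assumes the error originates from an exception raised inside the worker, for example an urllib HTTPError whose attached response buffer cannot be pickled when the pool tries to send the failure back to the parent process), is to catch exceptions in the worker and return them as ordinary values:

import pandas as pd
import multiprocessing

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def download_data(symbol):
    # catch any failure inside the worker so the pool only ever has to send back
    # plain (symbol, message) tuples instead of a live exception object
    try:
        return pd.read_json(BASEURL.format(symbol))
    except Exception as exc:
        return (symbol, repr(exc))

if __name__ == "__main__":
    TICKERS = ["AAPL", "MSFT", "RPAI"]  # hypothetical short list for illustration
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        results = pool.map(download_data, TICKERS)

    frames = [r for r in results if isinstance(r, pd.DataFrame)]
    failures = [r for r in results if not isinstance(r, pd.DataFrame)]
    print("failed symbols:", failures)

    if frames:
        df = pd.concat(frames).set_index(["date", "symbol"]).sort_index()

The failing symbols then show up in the printed list instead of aborting the run.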

  • Can you copy/paste the content of your `download_data` function, please? – Corralien Mar 19 '21 at 07:07
  • It keeps giving the same error, code is exactly as provided: def download_data(pool_id, symbols): df = [] for symbol in symbols: print("[{:02}]: {}".format(pool_id, symbol)) #do stuff here read_str = BASEURL.format(symbol) df.append(pd.read_json(read_str)) #df.append(pd.read_json(fake_data(symbol))) return pd.concat(df, ignore_index=True) – M LUY Mar 19 '21 at 08:03
  • Use the second simpler version with `pool.map` instead of the first one with `pool.starmap`. We will get there! I think you have a rate limitation from your api. – Corralien Mar 19 '21 at 08:29
  • It failed again with pool.map, but one strange thing I noticed: each time it fails, it does so around 12,500 tickers (the total is around 23,000 tickers). I get the tickers from an API call; I will attach it as a csv file – M LUY Mar 19 '21 at 09:18
  • API calls are limited depending on subscription, except for the Enterprise plan. If each request takes 100ms you can do 600 calls / minute. However, you must not exceed 300 calls for the Start plan, but it's fine for the Professional one (< 750 calls). You can introduce a delay with `time.sleep(0.2)` before `return`. – Corralien Mar 19 '21 at 14:12
  • Hmm thanks, tried that but again it gives an error at the ticker at position 12,510. Maybe there is something wrong with the returned data? – M LUY Mar 19 '21 at 15:18
  • Tried again with time.sleep(0.5) but again failed at 12,507 tickers – M LUY Mar 19 '21 at 16:57
  • Split your tickers file manually. Try a first run with 10k symbols, wait a few minutes, then the next 10k. – Corralien Mar 19 '21 at 17:36
  • Hi Corralien, I have split in 10k/5k/4k/2k but each time approx 100 tickers from the end the code breaks...how is this possible? – M LUY Mar 20 '21 at 15:18
  • Would it be possible to add some code whereby we check whether we have a valid json response, or at which ticker the code breaks? I am suspecting that there might be some tickers which do not exist or which cause the error – M LUY Mar 20 '21 at 16:49
  • You can try something like [that](https://stackoverflow.com/a/16877439/15239951) and check the `json_data` before `read_json`. Try also the version without the multiprocessing with error symbol: `for symbol in ["RPAI"]`. – Corralien Mar 20 '21 at 19:08
  • I have tried everything possible this weekend and it keeps failing. All tickers in the dataset provide a valid json response. I am now suspecting it has to do with memory allocation. – M LUY Mar 21 '21 at 15:41
  • I am now using the initial code you provided with starmap (which is the fastest). I am splitting up the symbols manually in chunks of 5,000, which works ok. I tried to write a function which does it automatically and writes the chunks to separate files which are then later stitched together. But I cannot get it to work, as the if __name__ == "__main__" is complicating things; could you suggest how to best do this (maybe some sample code, like the sketch just below this thread)? – M LUY Mar 21 '21 at 17:42
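
A possible shape for the chunked run asked about in the comment above (a sketch only, not tested against the real API: the 5,000 chunk size and the All_Symbols.csv path come from the thread, while the chunk file names are hypothetical; for simplicity it uses the plain pool.map worker rather than the starmap one, and everything that touches the pool sits under the if __name__ == "__main__": guard):

import glob
import multiprocessing
import pandas as pd

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
CHUNK_SIZE = 5000  # symbols per run, as in the comment

def download_data(symbol):
    # one API call per symbol, same as the simple version in the answer below
    return pd.read_json(BASEURL.format(symbol))

if __name__ == "__main__":
    tickers = pd.read_csv("D:/Verhuizen/Pensioen/All_Symbols.csv", header=None)[0].tolist()

    # one pool per chunk, one output file per chunk
    for i in range(0, len(tickers), CHUNK_SIZE):
        chunk = tickers[i:i + CHUNK_SIZE]
        with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
            data = pool.map(download_data, chunk)
        out = pd.concat(data).set_index(["date", "symbol"]).sort_index()
        out.to_csv("D:/Verhuizen/Pensioen/Income_chunk_{}.csv".format(i // CHUNK_SIZE))  # hypothetical name

    # stitch the chunk files back together into one csv
    parts = [pd.read_csv(f) for f in sorted(glob.glob("D:/Verhuizen/Pensioen/Income_chunk_*.csv"))]
    pd.concat(parts, ignore_index=True).to_csv("D:/Verhuizen/Pensioen/Income_all.csv", index=False)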

1 Answer


The first optimization is to avoid concatenating your dataframe at each iteration.
You can try something like this:

url = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
df = []

for symbol in tickers["symbol"]:
    read_str = url.format(symbol)
    df.append(pd.read_json(read_str))

df = pd.concat(df, ignore_index=True)

If that's not sufficient, we will look at using async, threading, or multiprocessing.
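
Because the bottleneck here is network I/O rather than CPU, threads are often the lightest thing to try before processes. A minimal sketch with concurrent.futures.ThreadPoolExecutor (the worker count of 8 is an arbitrary assumption and should be tuned against the API's rate limit):

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

tickers = pd.read_csv("D:/verhuizen/pensioen/MULTI.csv", names=["symbol", "company"])
url = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def fetch(symbol):
    # one HTTP request per symbol; pandas parses the JSON response directly
    return pd.read_json(url.format(symbol))

# threads share memory, so the resulting dataframes come back without pickling;
# executor.map preserves the order of the input symbols
with ThreadPoolExecutor(max_workers=8) as executor:
    frames = list(executor.map(fetch, tickers["symbol"]))

df = pd.concat(frames, ignore_index=True)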

Edit:
The code below can do the job:

import pandas as pd
import numpy as np
import multiprocessing
import time
import random

PROCESSES = 4  # number of parallel processes
CHUNKS = 6  # each call to download_data handles n symbols

# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
           "XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
           "TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
           "AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
           "MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
           "CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]

# split TICKERS into sublists of CHUNKS symbols each
TICKERS = [TICKERS[i:i + CHUNKS] for i in range(0, len(TICKERS), CHUNKS)]

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"


def fake_data(symbol):
    dti = pd.date_range("1985", "2020", freq="Y")
    df =  pd.DataFrame({"date": dti, "symbol": symbol,
                        "A": np.random.randint(0, 100, size=len(dti)),
                        "B": np.random.randint(0, 100, size=len(dti))})
    time.sleep(random.random())  # to simulate network delay
    return df.to_json()


def download_data(pool_id, symbols):
    df = []
    for symbol in symbols:
        print("[{:02}]: {}".format(pool_id, symbol))
        # do stuff here
        # read_str = BASEURL.format(symbol)
        # df.append(pd.read_json(read_str))
        df.append(pd.read_json(fake_data(symbol)))
    return pd.concat(df, ignore_index=True)


if __name__ == "__main__":
    with multiprocessing.Pool(PROCESSES) as pool:
        data = pool.starmap(download_data, enumerate(TICKERS, start=1))
        df = pd.concat(data).set_index(["date", "symbol"]).sort_index()

In this example, I split the list of tickers into sublists so that each process retrieves data for several symbols; this limits the overhead of creating and destroying processes.

The delay is there to simulate the response time of the network connection and to highlight the multiprocessing behaviour.
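
For the real API, the two commented lines in download_data replace the fake_data call; an optional sleep after each request keeps the request rate down (the 0.2 s value is only an assumption based on the rate limits discussed in the comments, not a documented figure):

def download_data(pool_id, symbols):
    df = []
    for symbol in symbols:
        print("[{:02}]: {}".format(pool_id, symbol))
        read_str = BASEURL.format(symbol)   # real API call instead of fake_data
        df.append(pd.read_json(read_str))
        time.sleep(0.2)                     # assumed throttle, adjust to your plan
    return pd.concat(df, ignore_index=True)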

Edit 2: a simpler but naive version for your needs

import pandas as pd
import multiprocessing

# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
           "XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
           "TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
           "AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
           "MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
           "CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"


def download_data(symbol):
    print(symbol)
    # do stuff here
    read_str = BASEURL.format(symbol)
    df = pd.read_json(read_str)
    return df


if __name__ == "__main__":
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.map(download_data, TICKERS)
        df = pd.concat(data).set_index(["date", "symbol"]).sort_index()

Note about pool.map: each symbol in TICKERS is submitted as a separate task to the pool, and one of the worker processes calls download_data on it.
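
If task-dispatch overhead becomes noticeable, pool.map also accepts a chunksize argument that hands each worker a batch of symbols per dispatch, similar in spirit to the CHUNKS sublists in the first version (the value 50 below is an arbitrary starting point, not a tuned figure):

# batch 50 symbols per task sent to each worker process
data = pool.map(download_data, TICKERS, chunksize=50)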

  • Works (and I often work like this), but I would not call the list `df` but `ldf`(for list of dataframes) to avoid confusion. – 576i Mar 17 '21 at 10:04
  • You use 2 variables `ldf` and `df`, so you consume twice as much memory. `df` as a list is only used in the loop, while `df` as a dataframe can be used anywhere later in the code, so I prefer to overwrite the temporary list with the final dataframe. – Corralien Mar 17 '21 at 11:00
  • Thanks, there is some 15% speed increase, but I was hoping for a multiple. How would I use multiprocessing and what speed increase could I expect there? – M LUY Mar 17 '21 at 19:49
  • How many symbols do you have to download? – Corralien Mar 18 '21 at 10:03
  • I have around 10,000 symbols and each symbol has three separate web calls to get a json with the three financial statements. But looking at your code, wow, I think I am in way over my head.... I (think) I understand why you are simulating the response time, but I really don't know how to implement this now. I see you commented out the code that calls my real data. But how do I actually call the download_data function? I suppose the pool_id is one of the 4 processes? If I get this to work it's a black box I really need to study! – M LUY Mar 18 '21 at 17:10
  • Remove the two lines of comments below `# do stuff here` and comment out the third line. Try it. I have no API key to test it. `download_data` function is called by the `pool.starmap`. – Corralien Mar 18 '21 at 17:35
  • Wow, amazingly fast! It's working, thanks so much. Initially I tried to run it in a Jupyter notebook but somehow it did not work, but when I ran it inside the PyCharm IDE it worked fine! I don't understand most of the code but I am getting results. I tried changing the processes (set at 4) and chunks (set at 6), but for each change (upwards) it gives an error. Is there a maximum to these sizes, and is there still a way to optimise the speed by adjusting these, or is this the top speed now? – M LUY Mar 18 '21 at 20:08
  • Don't forget to accept and/or upvote if the answer fits your needs :-) – Corralien Mar 18 '21 at 21:02
  • I am completely new at this; I think I upvoted by clicking on the up arrow on the left of your first response. One last question please: why do I get an error with larger datasets? I described the problem in my original post, thanks – M LUY Mar 18 '21 at 21:17
  • To help you, read this [post](https://meta.stackexchange.com/a/5235/943629) – Corralien Mar 19 '21 at 07:08