10

To cut a long story short: I need to open a .bi5 file and read its contents. The problem: I have tens of thousands of .bi5 files containing time-series data that I need to decompress and process (read, dump into pandas).

I ended up installing Python 3 (I normally use 2.7) specifically for the lzma library, after running into compilation nightmares with the lzma back-ports for Python 2.7. So I conceded and went with Python 3, but still had no success. The problems are too numerous to list here, and no one reads long questions!

I have included one of the .bi5 files; if someone could manage to get it into a pandas DataFrame and show me how they did it, that would be ideal.

P.S. The file is only a few KB; it will download in a second. Thanks very much in advance.

(The file) http://www.filedropper.com/13hticks

ajsp
  • Do you know the format of the original data? Is it for example: (int, int, int, float, float) in each line? – ptrj Dec 18 '16 at 04:03
  • 32-bit integer: milliseconds since epoch, 32-bit float: Ask price, 32-bit float: Bid price, 32-bit float: Ask volume, 32-bit float: Bid volume. There is a C++ library out there specifically for this task, but I need to work from Python. You can look at it here https://github.com/ninety47/dukascopy – ajsp Dec 18 '16 at 04:08
  • Python 2.7 that is. – ajsp Dec 18 '16 at 04:09

3 Answers

15

The code below should do the trick. First it opens the file and decompresses it with lzma, then it uses struct to unpack the binary data.

import lzma
import struct
import pandas as pd


def bi5_to_df(filename, fmt):
    chunk_size = struct.calcsize(fmt)  # size of one record in bytes
    data = []
    # lzma.open decompresses the .bi5 file transparently
    with lzma.open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk:
                data.append(struct.unpack(fmt, chunk))
            else:
                break
    df = pd.DataFrame(data)
    return df

The most important thing is to know the right format. I googled around and tried to guess, and '>3i2f' (or '>3I2f') works quite well. (It's big-endian: 3 ints, then 2 floats. The format you suggest, 'i4f', doesn't produce sensible floats, regardless of endianness.) For struct and format syntax, see the docs.

df = bi5_to_df('13h_ticks.bi5', '>3i2f')
df.head()
Out[177]: 
      0       1       2     3     4
0   210  110218  110216  1.87  1.12
1   362  110219  110216  1.00  5.85
2   875  110220  110217  1.00  1.12
3  1408  110220  110218  1.50  1.00
4  1884  110221  110219  3.94  1.00
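To turn the raw columns into named columns with real timestamps and decimal prices, the frame can be post-processed. A minimal sketch, assuming the base hour comes from the Dukascopy folder structure and a per-instrument price factor (here 1000); neither value is stored in the file, and `label_ticks` is a hypothetical helper name, not part of the original answer:

```python
import pandas as pd


def label_ticks(df, base_ts, price_factor=1000):
    """Name the raw columns and convert offsets and prices.

    base_ts: start of the hour, taken from the folder structure
    (not stored in the .bi5 file itself).
    price_factor: instrument-specific scale, also not stored in the file.
    """
    df = df.rename(columns={0: 'time', 1: 'ask', 2: 'bid',
                            3: 'ask_vol', 4: 'bid_vol'})
    # column 0 holds milliseconds since the start of the hour
    df['time'] = pd.to_datetime(base_ts) + pd.to_timedelta(df['time'], unit='ms')
    df[['ask', 'bid']] = df[['ask', 'bid']] / price_factor
    return df
```

For example, for the file used in the update below one could call `label_ticks(bi5_to_df('01h_ticks.bi5', '>3I2f'), '2012-12-03 01:00')`.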

Update

To compare the output of bi5_to_df with https://github.com/ninety47/dukascopy, I compiled and ran test_read_bi5 from there. The first lines of the output are:

time, bid, bid_vol, ask, ask_vol
2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5

And bi5_to_df on the same input file gives:

bi5_to_df('01h_ticks.bi5', '>3I2f').head()
Out[295]: 
      0       1       2     3    4
0  3581  131966  131945  1.50  1.5
1  5142  131964  131943  1.50  1.5
2  5202  131964  131943  2.25  1.5
3  5321  131964  131944  1.50  1.5
4  5441  131964  131944  1.50  1.5

So everything seems to be fine (ninety47's code reorders columns).

Also, it's probably more accurate to use '>3I2f' instead of '>3i2f' (i.e. unsigned int instead of int).
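As a side note, on Python 3.4+ the read loop can also be expressed with struct.iter_unpack, which slices the decompressed buffer in fmt-sized steps. A sketch of an equivalent variant (the name `bi5_to_df_iter` is mine):

```python
import lzma
import struct

import pandas as pd


def bi5_to_df_iter(filename, fmt):
    # read the whole decompressed stream at once and let struct do the chunking
    with lzma.open(filename) as f:
        raw = f.read()
    return pd.DataFrame(struct.iter_unpack(fmt, raw))
```

Note that iter_unpack requires the buffer length to be an exact multiple of struct.calcsize(fmt), just as the loop version would fail on a short trailing chunk.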

ptrj
  • It looks plausible ptrj. The 1st column is supposed to be a timestamp, but if you are correct, and that is a true representation of the data I will have to take the initial timestamp from the dated folder structure and add to it (long story). I will check it out tomorrow just to be sure, but it looks like you banked the bounty. Speak soon. – ajsp Dec 18 '16 at 04:57
  • 2
    The first column is "milliseconds since epoch". If you run `pd.TimedeltaIndex(df[0], 'ms')`, you'll see that it covers 1 hour. To get the timestamps, do for example `ts + pd.TimedeltaIndex(df[0], 'ms')` where `ts` is your timestamp. – ptrj Dec 18 '16 at 05:02
  • the first two columns are supposed to be floats though. It's the price of a currency (EURUSD). Would there be any chance that it has been compressed this way, to save space perhaps? – ajsp Dec 18 '16 at 21:50
  • When I say the first two columns, I mean the first two columns after the milliseconds, so that would be the second and third to be clear. Or `1` and `2` this case. – ajsp Dec 18 '16 at 22:03
  • I am going with the answer regardless, it's just a case of munging from here. Thanks for your time. – ajsp Dec 18 '16 at 22:12
  • @ajsp These two columns indeed look like the EURUSD prices * 10000. And if I'm not mistaken it's always `df[1] > df[2]`. I doubt it's a matter of space - maybe rather the resolution of prices matters. – ptrj Dec 18 '16 at 22:14
  • the convention is usually to put the bid price before the ask price which would normally result in `df[1] < df[2]`, where the difference between is the spread, but that's not even a consideration at this point, I suppose it's down to broker preference, or whoever captures their data for them in this case. Anyway thanks for your efforts, much appreciated. – ajsp Dec 19 '16 at 02:14
  • @ajsp Glad I could help. See also the update - comparison with ninety47's C++ code. – ptrj Dec 19 '16 at 02:43
  • Thank you @ptrj. One thing though: how do I transform columns 1 and 2 so that they show the decimals as well? Here is an example: from your function I get the number 7106260, but the actual price is 7106.26 – RaduS Jan 13 '17 at 12:00
  • 2
    @RaduS Not sure what you mean. You can simply divide the numbers in both columns by 100, for example `df.iloc[:, [1, 2]] = df.iloc[:, [1, 2]] / 100`. For this currency pair the factor is 100. It can be different for another pair - it's not encoded in the input file - you have to know it. – ptrj Jan 14 '17 at 14:54
  • thank you @ptrj; i just realized that my question was a bit stupid. Of course, i just had to divide it :) – RaduS Jan 14 '17 at 16:11
2
import requests
import struct
from lzma import LZMADecompressor, FORMAT_AUTO

# download the compressed EURUSD 2020/06/15/10h_ticks.bi5 file
res = requests.get("https://www.dukascopy.com/datafeed/EURUSD/2020/06/15/10h_ticks.bi5", stream=True)
print(res.headers.get('content-type'))

rawdata = res.content

decomp = LZMADecompressor(FORMAT_AUTO, None, None)
decompresseddata = decomp.decompress(rawdata)

firstrow = struct.unpack('!IIIff', decompresseddata[0: 20])
print("firstrow:", firstrow)
# firstrow: (436, 114271, 114268, 0.9399999976158142, 0.75)
# time = 2020/06/15/10h + (1 month, since URL months are zero-based) + 436 milliseconds

secondrow = struct.unpack('!IIIff', decompresseddata[20: 40])
print("secondrow:", secondrow)
# secondrow: (537, 114271, 114267, 4.309999942779541, 2.25)

# time = 2020/06/15/10h + (1 month, since URL months are zero-based) + 537 milliseconds
# ask = 114271 / 100000 = 1.14271
# bid = 114267 / 100000 = 1.14267
# askvolume = 4.31
# bidvolume = 2.25

# note that month 00 in the URL is January
# "https://www.dukascopy.com/datafeed/EURUSD/2020/00/15/10h_ticks.bi5" for january
# "https://www.dukascopy.com/datafeed/EURUSD/2020/01/15/10h_ticks.bi5" for february

#  iterating
print(len(decompresseddata), int(len(decompresseddata) / 20))
for i in range(0, int(len(decompresseddata) / 20)):
    print(struct.unpack('!IIIff', decompresseddata[i * 20: (i + 1) * 20]))
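To load those rows into pandas rather than print them, the same unpacking can feed a DataFrame directly. A sketch, where `ticks_to_df` is my own helper name, and the column names and the 100000 price factor follow the comments above rather than anything stored in the file:

```python
import struct

import pandas as pd


def ticks_to_df(decompressed, fmt='!IIIff'):
    size = struct.calcsize(fmt)  # 20 bytes per tick
    # drop any incomplete trailing record before unpacking
    usable = len(decompressed) - len(decompressed) % size
    rows = struct.iter_unpack(fmt, decompressed[:usable])
    df = pd.DataFrame(rows, columns=['ms', 'ask', 'bid', 'ask_vol', 'bid_vol'])
    df[['ask', 'bid']] = df[['ask', 'bid']] / 100000  # EURUSD price scale
    return df
```

Calling `ticks_to_df(decompresseddata)` on the buffer above would yield one row per tick with the ask/bid columns already scaled to decimal prices.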
xatz3npki
  • Thanks for that, much appreciated, do you know how often they update their feed by any chance? – ajsp Jul 20 '20 at 09:49
0

Did you try using numpy to parse the data before transferring it to pandas? It may be a longer way around, but it will let you manipulate and clean the data before you do the analysis in pandas, and the integration between them is pretty straightforward.
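A minimal sketch of that numpy route, assuming the same big-endian 20-byte record layout as the accepted answer (`bi5_to_df_numpy` is a hypothetical name):

```python
import lzma

import numpy as np
import pandas as pd

# structured dtype for one 20-byte big-endian tick record
tick_dtype = np.dtype([('ms', '>u4'), ('ask', '>u4'), ('bid', '>u4'),
                       ('ask_vol', '>f4'), ('bid_vol', '>f4')])


def bi5_to_df_numpy(filename):
    with lzma.open(filename) as f:
        raw = f.read()
    ticks = np.frombuffer(raw, dtype=tick_dtype)  # vectorized parse, no Python loop
    return pd.DataFrame(ticks)
```

Since np.frombuffer parses the whole buffer in one call, this avoids the per-record struct.unpack loop, which may matter across tens of thousands of files.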

dsapandora