10

To cut a long story short: I need to open a .bi5 file and read its contents. The problem: I have tens of thousands of .bi5 files containing time-series data that I need to decompress and process (read, dump into pandas).

I ended up installing Python 3 (I normally use 2.7) specifically for the lzma library, after running into compilation nightmares with the lzma back-ports for Python 2.7. So I conceded and went with Python 3, but still had no success. The problems are too numerous to list here, and no one reads long questions!

I have included one of the .bi5 files; if someone could manage to get it into a pandas DataFrame and show me how they did it, that would be ideal.

P.S. The file is only a few KB; it will download in a second. Thanks very much in advance.

(The file) http://www.filedropper.com/13hticks

ajsp
  • Do you know the format of the original data? Is it for example: (int, int, int, float, float) in each line? – ptrj Dec 18 '16 at 04:03
  • 32-bit integer: milliseconds since epoch, 32-bit float: Ask price, 32-bit float: Bid price, 32-bit float: Ask volume, 32-bit float: Bid volume. There is a C++ library out there specifically for this task, but I need to work from Python. You can look at it here https://github.com/ninety47/dukascopy – ajsp Dec 18 '16 at 04:08
  • Python 2.7 that is. – ajsp Dec 18 '16 at 04:09

3 Answers

15

The code below should do the trick. First it opens the file and decompresses it with lzma, then it uses struct to unpack the binary data.

import lzma
import struct
import pandas as pd


def bi5_to_df(filename, fmt):
    chunk_size = struct.calcsize(fmt)  # size of one record in bytes
    data = []
    # lzma.open decompresses the .bi5 file transparently
    with lzma.open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk:
                data.append(struct.unpack(fmt, chunk))
            else:
                break
    df = pd.DataFrame(data)
    return df

The most important thing is to know the right format. I googled around and tried to guess, and '>3i2f' (or '>3I2f') works quite well. (It's big-endian: 3 ints, then 2 floats. The format you suggest, 'i4f', doesn't produce sensible floats, regardless of endianness.) For struct and format syntax, see the docs.

df = bi5_to_df('13h_ticks.bi5', '>3i2f')
df.head()
Out[177]: 
      0       1       2     3     4
0   210  110218  110216  1.87  1.12
1   362  110219  110216  1.00  5.85
2   875  110220  110217  1.00  1.12
3  1408  110220  110218  1.50  1.00
4  1884  110221  110219  3.94  1.00
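To turn the raw columns into named columns with real timestamps and decimal prices, the frame can be post-processed. A minimal sketch, assuming the base hour comes from the Dukascopy folder structure and a per-instrument price factor (here 1000); neither value is stored in the file, and `label_ticks` is a hypothetical helper name, not part of the original answer:

```python
import pandas as pd


def label_ticks(df, base_ts, price_factor=1000):
    """Name the raw columns and convert offsets and prices.

    base_ts: start of the hour, taken from the folder structure
    (not stored in the .bi5 file itself).
    price_factor: instrument-specific scale, also not stored in the file.
    """
    df = df.rename(columns={0: 'time', 1: 'ask', 2: 'bid',
                            3: 'ask_vol', 4: 'bid_vol'})
    # column 0 holds milliseconds since the start of the hour
    df['time'] = pd.to_datetime(base_ts) + pd.to_timedelta(df['time'], unit='ms')
    df[['ask', 'bid']] = df[['ask', 'bid']] / price_factor
    return df
```

For example, for the file used in the update below one could call `label_ticks(bi5_to_df('01h_ticks.bi5', '>3I2f'), '2012-12-03 01:00')`.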

Update

To compare the output of bi5_to_df with https://github.com/ninety47/dukascopy, I compiled and ran test_read_bi5 from there. The first lines of the output are:

time, bid, bid_vol, ask, ask_vol
2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5

And bi5_to_df on the same input file gives:

bi5_to_df('01h_ticks.bi5', '>3I2f').head()
Out[295]: 
      0       1       2     3    4
0  3581  131966  131945  1.50  1.5
1  5142  131964  131943  1.50  1.5
2  5202  131964  131943  2.25  1.5
3  5321  131964  131944  1.50  1.5
4  5441  131964  131944  1.50  1.5

So everything seems to be fine (ninety47's code reorders columns).

Also, it's probably more accurate to use '>3I2f' instead of '>3i2f' (i.e. unsigned int instead of int).
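As a side note, on Python 3.4+ the read loop can also be expressed with struct.iter_unpack, which slices the decompressed buffer in fmt-sized steps. A sketch of an equivalent variant (the name `bi5_to_df_iter` is mine):

```python
import lzma
import struct

import pandas as pd


def bi5_to_df_iter(filename, fmt):
    # read the whole decompressed stream at once and let struct do the chunking
    with lzma.open(filename) as f:
        raw = f.read()
    return pd.DataFrame(struct.iter_unpack(fmt, raw))
```

Note that iter_unpack requires the buffer length to be an exact multiple of struct.calcsize(fmt), just as the loop version would fail on a short trailing chunk.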

ptrj
  • It looks plausible ptrj. The 1st column is supposed to be a timestamp, but if you are correct, and that is a true representation of the data I will have to take the initial timestamp from the dated folder structure and add to it (long story). I will check it out tomorrow just to be sure, but it looks like you banked the bounty. Speak soon. – ajsp Dec 18 '16 at 04:57
  • 2
    The first column is "milliseconds since epoch". If you run `pd.TimedeltaIndex(df[0], 'ms')`, you'll see that it covers 1 hour. To get the timestamps, do for example `ts + pd.TimedeltaIndex(df[0], 'ms')` where `ts` is your timestamp. – ptrj Dec 18 '16 at 05:02
  • the first two columns are supposed to be floats though. It's the price of a currency (EURUSD). Would there be any chance that it has been compressed this way, to save space perhaps? – ajsp Dec 18 '16 at 21:50
  • When I say the first two columns, I mean the first two columns after the milliseconds, so that would be the second and third to be clear. Or `1` and `2` this case. – ajsp Dec 18 '16 at 22:03
  • I am going with the answer regardless, it's just a case of munging from here. Thanks for your time. – ajsp Dec 18 '16 at 22:12
  • @ajsp These two columns indeed look like the EURUSD prices * 10000. And if I'm not mistaken it's always `df[1] > df[2]`. I doubt it's a matter of space - maybe rather the resolution of prices matters. – ptrj Dec 18 '16 at 22:14
  • the convention is usually to put the bid price before the ask price which would normally result in `df[1] < df[2]`, where the difference between is the spread, but that's not even a consideration at this point, I suppose it's down to broker preference, or whoever captures their data for them in this case. Anyway thanks for your efforts, much appreciated. – ajsp Dec 19 '16 at 02:14
  • @ajsp Glad I could help. See also the update - comparison with ninety47's C++ code. – ptrj Dec 19 '16 at 02:43
  • Thank you @ptrj. One thing though: how do I transform columns 1 and 2 so that they show the decimals as well? Here is an example: from your function I get the number 7106260, but the actual price is 7106.26 – RaduS Jan 13 '17 at 12:00
  • 2
    @RaduS Not sure what you mean. You can simply divide the numbers in both columns by 100, for example `df.iloc[:, [1, 2]] = df.iloc[:, [1, 2]] / 100`. For this currency pair the factor is 100. It can be different for another pair - it's not encoded in the input file - you have to know it. – ptrj Jan 14 '17 at 14:54
  • thank you @ptrj; i just realized that my question was a bit stupid. Of course, i just had to divide it :) – RaduS Jan 14 '17 at 16:11
2
import requests
import struct
from lzma import LZMADecompressor, FORMAT_AUTO

# download the compressed EURUSD 2020/06/15/10h_ticks.bi5 file
res = requests.get("https://www.dukascopy.com/datafeed/EURUSD/2020/06/15/10h_ticks.bi5", stream=True)
print(res.headers.get('content-type'))

rawdata = res.content

decomp = LZMADecompressor(FORMAT_AUTO, None, None)
decompresseddata = decomp.decompress(rawdata)

firstrow = struct.unpack('!IIIff', decompresseddata[0: 20])
print("firstrow:", firstrow)
# firstrow: (436, 114271, 114268, 0.9399999976158142, 0.75)
# time = 2020/06/15/10h + (1 month, since URL months are zero-based) + 436 milliseconds

secondrow = struct.unpack('!IIIff', decompresseddata[20: 40])
print("secondrow:", secondrow)
# secondrow: (537, 114271, 114267, 4.309999942779541, 2.25)

# time = 2020/06/15/10h + (1 month, since URL months are zero-based) + 537 milliseconds
# ask = 114271 / 100000 = 1.14271
# bid = 114267 / 100000 = 1.14267
# askvolume = 4.31
# bidvolume = 2.25

# note that month 00 in the URL is January
# "https://www.dukascopy.com/datafeed/EURUSD/2020/00/15/10h_ticks.bi5" for january
# "https://www.dukascopy.com/datafeed/EURUSD/2020/01/15/10h_ticks.bi5" for february

#  iterating
print(len(decompresseddata), int(len(decompresseddata) / 20))
for i in range(0, int(len(decompresseddata) / 20)):
    print(struct.unpack('!IIIff', decompresseddata[i * 20: (i + 1) * 20]))
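To load those rows into pandas rather than print them, the same unpacking can feed a DataFrame directly. A sketch, where `ticks_to_df` is my own helper name, and the column names and the 100000 price factor follow the comments above rather than anything stored in the file:

```python
import struct

import pandas as pd


def ticks_to_df(decompressed, fmt='!IIIff'):
    size = struct.calcsize(fmt)  # 20 bytes per tick
    # drop any incomplete trailing record before unpacking
    usable = len(decompressed) - len(decompressed) % size
    rows = struct.iter_unpack(fmt, decompressed[:usable])
    df = pd.DataFrame(rows, columns=['ms', 'ask', 'bid', 'ask_vol', 'bid_vol'])
    df[['ask', 'bid']] = df[['ask', 'bid']] / 100000  # EURUSD price scale
    return df
```

Calling `ticks_to_df(decompresseddata)` on the buffer above would yield one row per tick with the ask/bid columns already scaled to decimal prices.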
xatz3npki
  • Thanks for that, much appreciated, do you know how often they update their feed by any chance? – ajsp Jul 20 '20 at 09:49
0

Did you try using numpy to parse the data before transferring it to pandas? It may be a longer way around, but it will let you manipulate and clean the data before you do the analysis in pandas, and the integration between them is pretty straightforward.
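A minimal sketch of that numpy route, assuming the same big-endian 20-byte record layout as the accepted answer (`bi5_to_df_numpy` is a hypothetical name):

```python
import lzma

import numpy as np
import pandas as pd

# structured dtype for one 20-byte big-endian tick record
tick_dtype = np.dtype([('ms', '>u4'), ('ask', '>u4'), ('bid', '>u4'),
                       ('ask_vol', '>f4'), ('bid_vol', '>f4')])


def bi5_to_df_numpy(filename):
    with lzma.open(filename) as f:
        raw = f.read()
    ticks = np.frombuffer(raw, dtype=tick_dtype)  # vectorized parse, no Python loop
    return pd.DataFrame(ticks)
```

Since np.frombuffer parses the whole buffer in one call, this avoids the per-record struct.unpack loop, which may matter across tens of thousands of files.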

dsapandora