How to write a custom binary file handler in C++ with serialisation of custom objects?

Question

I have some structures I want to serialise and deserialise to be able to pass them from program to program (as a save file), and to be manipulated by other programs (make minor changes....).

I've read through:

Document that describes isocpp serialisation explanation
SO questions that show how to read blocks
SO question how to reading and writing binary files
Benchmarking different file handlers speed and reliance
Serialisation "intro"

But I haven't found anywhere how to get from having some class or struct to a serialised structure that you can then read, write, manipulate... be it singular (1 structure per file) or in sequence (multiple lists of multiple structure types per file).

How to write a custom binary file handler in C++ with serialisation of custom objects?

you would need more than a year or two to reimplement what boost can offer you for de/serialization, just saying... — 463035818_is_not_a_number, Aug 27 '19 at 09:38
If you don't care about portability and backwards compatibility then you can just cast ([POD](https://stackoverflow.com/questions/146452/what-are-pod-types-in-c)) structures to `char*` and write them to a file, otherwise use boost serialisation, google protobuf etc. — Alan Birtles, Aug 27 '19 at 09:49
Technically, `static_cast(&some_data_obj)` is legal way for "serialization", yielding pointer to array of bytes with length `sizeof(some_data)` - but it does not address complexity of the task, e.g. endianness, reference/pointers as members, containers, (and few others I have never met). This approach in Python world is called `pickling` (https://pythontips.com/2013/08/02/what-is-pickle-in-python/). But it all depends what are your needs, how serialized data will be used? (only on the same machine?) — R2RT, Aug 27 '19 at 09:49
@formerlyknownas_463035818 you are missing the point. This isn't `boost vs anything` question. This is question of serialisation. Please stick to the topic. — Danilo, Aug 27 '19 at 09:56
btw, just in case... c++ doesnt have classes and structs, it has classes that can be declared with either of the two keywords `struct` or `class` — 463035818_is_not_a_number, Aug 27 '19 at 10:05
true, but there are two keywords. I am maximising searching options. — Danilo, Aug 27 '19 at 10:08
I just realized that you linked SO Q/As only. May be, this is interesting as well (non-SO): [C++ FAQ: Serialization and Unserialization](https://isocpp.org/wiki/faq/serialization). — Scheff's Cat, Aug 27 '19 at 10:59
from above: your questions reads as if you want to de/serialize classes in all generality, but from your comment I understand that you need it only for one specific type. Which is it? — 463035818_is_not_a_number, Aug 27 '19 at 12:38
good observation. Specific type, preferably. I wanted to share how can users serialise their data, whatever they have. To give an set of tools that they can use, depending on their own data. So perhaps to add `specialised custom structured binary data` ? — Danilo, Aug 27 '19 at 12:40
but your answer can only store a `zoo` to file, if you wanted to store a `supermarket` to file you would basically have to rewrite most of it, no? — 463035818_is_not_a_number, Aug 27 '19 at 12:42
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/198533/discussion-between-danilo-and-formerlyknownas-463035818). — Danilo, Aug 27 '19 at 12:42

score -1 · Accepted Answer · edited Jul 06 '20 at 12:02

Before we start

Most new users aren't familiar with the different data types in C++ and often use plain int, char, etc. in their code. To serialise successfully, one needs to think thoroughly about data types. These are therefore your first steps if you have an int lying around somewhere.

Know your data
- What is maximum value that variable should hold?
- Can it be negative?
Limit your data
- Implement decisions from above
- Limit amount of objects your file can hold.

Know your data

If you have a struct or a class with some data such as:

struct cat {
    int weight = 0; // In kg (pounds)
    int length = 0; // In cm (or feet)
    std::string voice = "meow.mp3";
    cat() {}
    cat(int weight, int length): weight(weight), length(length) {}
};

Can your cat really weigh anywhere near 255 kg (the maximum value of a 1-byte unsigned integer)? Can it be as long as 255 cm (2.55 m)? Does the voice change with every cat object?

Members that do not change between objects should be declared static, and you should limit the size of each member to fit your needs. In this example the answer to all the questions above is no.

So our cat struct now looks like this:

struct cat {
    uint8_t weight = 0; // Non negative 8 bit (1 byte) integer (or unsigned char)
    uint8_t length = 0; // Same for length
    static std::string voice;
    cat() {}
    cat(uint8_t w, uint8_t l):weight(w), length(l) {}
};
std::string cat::voice = "meow.mp3"; // Out-of-class definition of the static member

Files are written byte by byte (often as character sets), and your data can vary, so you need to assume or limit the maximum value your data can hold.
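To make that limit explicit in code, a range check can sit at the point where a plain int is narrowed to one byte. This is only a minimal sketch: `to_byte` and `fits_in_byte` are hypothetical helper names, not part of the code in this answer.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: reports whether a plain int can be stored in one
// unsigned byte (0..255) without losing information.
inline bool fits_in_byte(int value) {
    return value >= 0 && value <= std::numeric_limits<std::uint8_t>::max();
}

// Hypothetical helper: converts an int to uint8_t and throws instead of
// silently wrapping around when the value is out of range.
inline std::uint8_t to_byte(int value) {
    if (!fits_in_byte(value)) {
        throw std::out_of_range("value does not fit in one byte");
    }
    return static_cast<std::uint8_t>(value);
}
```

With something like this, a 390 kg tiger is rejected when the object is built rather than being stored as a wrapped-around weight.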

But not every project (or structure) is the same, so let's talk about differences of your code data and binary structured data. When thinking about serialisation you need to think in this manner "what is bare minimum of data that this structure needs to be unique?".

Our cat object can represent anything besides:

tigers: max 390 kg, and 340 cm
lions : max 315 kg, and 365 cm

Anything else is eligible. You can choose the "meow.mp3" depending on the length and weight, so the most important data that makes a cat unique is its length and weight. Those are the data we need to save to our file.

Limit your data

The largest zoo in the world has 5000 animals and 700 species, which means that on average each species in the zoo has a population of about 7. This means that a 1-byte count per species (up to 255 cats) is enough, with no fear of overflowing it.

So it is safe to assume that our zoo project should hold up to 200 elements per species. This leaves us with two different byte-sized fields, so the serialised data for our struct is at most two bytes.

Approach to serialisation

Constructing our cat block

Thinking about your data this way is a good start: it gives custom serialisation the right foundation. Now all that is left is to define a structured binary format. For that we need a way to recognise whether our two bytes belong to a cat or to some other structure, which can be done either with a same-type collection (every two bytes are a cat) or with an identifier.

- Same-type collection: a single file (or part of a file) holds all cats. We only need the start offset and the size of a cat in bytes, and then read every two bytes from the start offset to get all the cats.
- Identifier: the starting character tells us whether the object is a cat or something else. This is commonly done with the TLV (Type Length Value) format, where the type would be Cats, the length would be two bytes, and the value would be those two bytes.

As you can see, the first option contains fewer bytes and is therefore more compact, but the second option lets us store multiple kinds of animals in one file and make a zoo. How you structure your binary files depends a lot on your project. Since keeping the whole zoo in a single file is the most practical to work with, I will implement the second option.
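To illustrate the identifier option, here is a minimal sketch of a generic TLV writer and parser working on an in-memory byte buffer. The names `tlv_record`, `tlv_append` and `tlv_parse` are made up for this example, and the 1-byte type plus 1-byte length layout is an assumption that matches the cat block described in this answer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One decoded Type-Length-Value record (illustrative names).
struct tlv_record {
    std::uint8_t type;                 // e.g. 'C' for a cat
    std::vector<std::uint8_t> value;   // payload, as many bytes as the length field says
};

// Appends one record as: 1 byte type, 1 byte length, then the payload.
// Returns false if the payload is too long for a 1-byte length field.
inline bool tlv_append(std::vector<std::uint8_t>& out, std::uint8_t type,
                       const std::vector<std::uint8_t>& value) {
    if (value.size() > 255) return false;
    out.push_back(type);
    out.push_back(static_cast<std::uint8_t>(value.size()));
    out.insert(out.end(), value.begin(), value.end());
    return true;
}

// Walks the buffer and splits it into records. Parsing stops at the first
// truncated record, so a damaged tail is never read out of bounds.
inline std::vector<tlv_record> tlv_parse(const std::vector<std::uint8_t>& in) {
    std::vector<tlv_record> records;
    std::size_t pos = 0;
    while (pos + 2 <= in.size()) {
        const std::size_t length = in[pos + 1];
        if (pos + 2 + length > in.size()) break;   // truncated payload
        tlv_record rec;
        rec.type = in[pos];
        rec.value.assign(in.begin() + pos + 2, in.begin() + pos + 2 + length);
        records.push_back(rec);
        pos += 2 + length;
    }
    return records;
}
```

Because every record carries its own length, a reader can skip types it does not understand, which is what makes a mixed zoo file possible.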

The most important thing about the "identifier" approach is to make it logical for us first, and for our machine second. I come from a world where reading from left to right is the norm, so it is logical that the first thing I want to read about a cat is its type, then its length, and then its value.

    char type         = 'C';       // C shorten for Cat, 0x43
    uint8_t length    = 2;          // It holds 2 bytes, 0x02
    uint8_t c_length  = '?';        // cats length
    uint8_t c_weight  = '?';        // cats weight

And to represent it as a chunk (block):

+00        4B        43-02-LL-WW        ('C\x02LW')

Where this means:

+00: offset from the start, 0 means it is the start of the file
4B: size of our data block, 4 bytes.
43-02-LL-WW: actual value of cat
- 43: hexadecimal representation of character 'C'
- 02: hexadecimal representation of length of this type (2)
- LL: length of this cat of 1 byte value
- WW: weight of this cat of 1 byte value
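The block above can be sketched directly as four bytes in file order. This is only an illustration: `cat_block`, `pack_cat` and `unpack_cat` are hypothetical names, and the byte positions follow the 43-02-LL-WW layout described above.

```cpp
#include <array>
#include <cstdint>

// Four bytes in file order: type, length of payload, cat length, cat weight.
using cat_block = std::array<std::uint8_t, 4>;

// Builds the block 43-02-LL-WW for one cat.
inline cat_block pack_cat(std::uint8_t length_cm, std::uint8_t weight_kg) {
    return cat_block{{0x43, 0x02, length_cm, weight_kg}};
}

// Reads a block back. Returns false (and leaves the outputs untouched)
// when the block does not start with 'C' and a payload length of 2.
inline bool unpack_cat(const cat_block& block,
                       std::uint8_t& length_cm, std::uint8_t& weight_kg) {
    if (block[0] != 0x43 || block[1] != 0x02) return false;
    length_cm = block[2];
    weight_kg = block[3];
    return true;
}
```

Working on individual bytes like this sidesteps byte-order questions entirely, at the cost of handling one byte at a time.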

But since it is easier for me to read data from left to right, my data should be written big endian (most significant byte first), while most desktop computers are little endian.

Endianness and its importance

The main issue here is the endianness of our machine: because the same struct/class is laid out differently in memory depending on endianness, we need a base type that hides it. The layout above is big endian, but machines can use either byte order, and you can find out how to determine which one your machine uses here.

For users experienced with bit fields I would strongly suggest that you use them for this. But for unfamiliar users:

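#include <cstdint>  // uint8_t, uint16_t, uint32_t
#include <iostream> // Just for std::ostream, std::cout, and std::endl

// True when the least significant byte is stored first (little-endian machine).
// Reading the inactive union member is a widely supported compiler extension.
bool is_little() {
    union {
        uint16_t w;
        uint8_t p[2];
    } p;
    p.w = 0x0001;
    return p.p[0] == 0x1;
}

union chunk {
    uint32_t space;
    uint8_t parts[4];
};

chunk make_chunk(uint32_t VAL) {
    union chunk ret;
    ret.space = VAL;
    return ret;
}

// Print the four bytes most significant first, whatever the machine's byte order
std::ostream& operator<<(std::ostream& os, const union chunk &c) {
    if(is_little()) {
        return os << c.parts[3] << c.parts[2] << c.parts[1] << c.parts[0];
    }else {
        return os << c.parts[0] << c.parts[1] << c.parts[2] << c.parts[3];
    }
}

// Reverse the byte order of a 32-bit value
uint32_t swap_bytes(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00) | ((v << 8) & 0x00ff0000) | (v << 24);
}

// VAL holds four bytes exactly as they were read from the file (big endian)
void read_as_binary(union chunk &t, uint32_t VAL) {
    t.space = is_little() ? swap_bytes(VAL) : VAL;
}

// VAL receives four bytes laid out in file order (big endian)
void write_as_binary(union chunk t, uint32_t &VAL) {
    VAL = is_little() ? swap_bytes(t.space) : t.space;
}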
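If you would rather not detect endianness at run time, the same byte order can be produced with shifts alone, because shifts work on values and not on memory layout. A minimal sketch, assuming the file stores the most significant byte first (so a cat reads 43-02-LL-WW); `store_be32` and `load_be32` are names made up for illustration:

```cpp
#include <cstdint>

// Splits a 32-bit value into 4 bytes, most significant byte first.
// The result is the same on little-endian and big-endian machines.
inline void store_be32(std::uint32_t value, std::uint8_t out[4]) {
    out[0] = static_cast<std::uint8_t>(value >> 24);
    out[1] = static_cast<std::uint8_t>(value >> 16);
    out[2] = static_cast<std::uint8_t>(value >> 8);
    out[3] = static_cast<std::uint8_t>(value);
}

// Rebuilds the 32-bit value from 4 bytes stored most significant byte first.
inline std::uint32_t load_be32(const std::uint8_t in[4]) {
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8)  |
            static_cast<std::uint32_t>(in[3]);
}
```

The four bytes filled by `store_be32` can be passed straight to `std::ostream::write`, and the four bytes returned by `std::istream::read` can be handed to `load_be32`.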

So now we have our chunk that prints its characters in an order we can recognise at first glance. Now we need a way to convert between uint32_t and our cat, since our chunk size is 4 bytes, i.e. one uint32_t.

struct cat {
    uint8_t weight = 0; // Non negative 8 bit (1 byte) integer (or unsigned char)
    uint8_t length = 0; // The same for length
    static std::string voice;
    cat() {}
    cat(uint8_t w, uint8_t l): weight(w), length(l) {}
    cat(union chunk cat_chunk) {
        // Accept the chunk only if it starts with 'C' (0x43) and length 2 (0x02)
        if((cat_chunk.space & 0xffff0000) == 0x43020000) {
           // Bit shifts work on values, so they are independent of endianness
           this->length = (cat_chunk.space >> 8) & 0xff; // LL, third byte
           this->weight = cat_chunk.space & 0xff;        // WW, fourth byte
        }else {
           // Some error handling: not a cat, fall back to an empty one
           this->weight = 0;
           this->length = 0;
        }
    }
    operator uint32_t() const {
        return 0x43020000u | (uint32_t(this->length) << 8) | this->weight;
    }
};
std::string cat::voice = "meow.mp3"; // Out-of-class definition of the static member

Zoo file structure

So now we have our cat object ready to be cast back and forth between chunk and cat. Now we need to structure a whole file with a header, footer, data, and checksums. Let's say we are building an application that keeps track of how many animals each zoo facility has. The data of our zoo is which animals it has and how many. The footer can be omitted (or it can hold the timestamp of when the file was created), and in the header we save instructions on how to read our file, versioning, and checks for corruption.

For more information on how I structured these files, you can find sources here and this shameless plug.

// File structure: all big endian (most significant byte first)
------------
HEADER:
+00    4B    89-5A-4F-4F   ('\211ZOO')     Our magic number for the zoo file
+04    4B    XX-XX-XX-XX   ('????')        Whole file checksum
+08    4B    47-0D-1A-0A   ('G\r\032\n')   Catches CRLF <-> LF conversion, 032 is END OF FILE
+12    4B    YY-YY-00-ZZ   ('??\0?')       Y versioning, Z usage
+16    4B    AA-BB-BB-BB   ('X???')        A start offset, B data length

------------
DATA:
Animals: // For each animal type (block identifier)
+20+??   4B     ??-XX-XX-LL   ('????')   : ? animal type identifier, X start offset from header, L number of animal objects
+24+??+4 4B     XX-XX-XX-XX   ('????')   : Checksum for animal type
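To make the header layout concrete, here is a rough sketch that builds and validates it as a sequence of 32-bit words. Several things are assumptions on my part and not fixed by the layout above: the checksum covers only the data words, the data length counts 4-byte chunks, and all names (`make_zoo_image`, `check_zoo_image`, `sum_bytes`, the `ZOO_*` constants) are invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

const std::uint32_t ZOO_MAGIC        = 0x895A4F4Fu; // '\211ZOO'
const std::uint32_t ZOO_GUARD        = 0x470D1A0Au; // line-ending and EOF guard
const std::uint32_t ZOO_VERSION      = 0x01000001u; // version 1.0, usage 1
const std::uint32_t ZOO_HEADER_BYTES = 20u;         // five 4-byte header words

// Adds up every byte of every word (the simple checksum).
inline std::uint32_t sum_bytes(const std::vector<std::uint32_t>& words) {
    std::uint32_t sum = 0;
    for (std::uint32_t w : words) {
        sum += (w >> 24) + ((w >> 16) & 0xFFu) + ((w >> 8) & 0xFFu) + (w & 0xFFu);
    }
    return sum;
}

// Returns header + data, ready to be written word by word.
inline std::vector<std::uint32_t> make_zoo_image(const std::vector<std::uint32_t>& data) {
    std::vector<std::uint32_t> image;
    image.push_back(ZOO_MAGIC);
    image.push_back(sum_bytes(data));   // assumption: checksum of the data only
    image.push_back(ZOO_GUARD);
    image.push_back(ZOO_VERSION);
    image.push_back((ZOO_HEADER_BYTES << 24) |
                    (static_cast<std::uint32_t>(data.size()) & 0xFFFFFFu));
    image.insert(image.end(), data.begin(), data.end());
    return image;
}

// Checks every header field against the words that follow it.
inline bool check_zoo_image(const std::vector<std::uint32_t>& image) {
    if (image.size() < 5) return false;
    if (image[0] != ZOO_MAGIC || image[2] != ZOO_GUARD) return false;
    if ((image[3] & 0xFFFF0000u) != 0x01000000u) return false; // not version 1.0
    if ((image[4] >> 24) != ZOO_HEADER_BYTES) return false;
    const std::size_t count = image[4] & 0xFFFFFFu;
    if (count != image.size() - 5) return false;
    const std::vector<std::uint32_t> data(image.begin() + 5, image.end());
    return image[1] == sum_bytes(data);
}
```

Keeping the header logic in two small free functions like this makes it easy to test without touching the disk.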

For checksums, you can use a simple one (add up every byte) or, among others, CRC-32. The choice is yours and depends on the size of your files and data. So now we have the data for our file. Of course, I must warn you:
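Both checksum options mentioned above fit in a few lines. A minimal sketch, with `byte_sum` and `crc32_bitwise` as illustrative names; the CRC uses the common reflected polynomial 0xEDB88320 (the variant used by zlib and PNG) in its slow but table-free bitwise form:

```cpp
#include <cstddef>
#include <cstdint>

// Simple checksum: add up every byte.
inline std::uint32_t byte_sum(const std::uint8_t* data, std::size_t size) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < size; ++i) {
        sum += data[i];
    }
    return sum;
}

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
inline std::uint32_t crc32_bitwise(const std::uint8_t* data, std::size_t size) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < size; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // If the lowest bit is set, shift and apply the polynomial,
            // otherwise only shift.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The byte sum is cheap but cannot detect swapped bytes, while CRC-32 catches most reordering and burst errors, which is why it is the usual choice for larger files.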

Having only one structure or class that requires serialisation means that in general this type of serialisation isn't needed. You can just cast the whole object to the integer of desirable size and then to a binary character sequence, and then read that character sequence of some size into an integer and back to the object. The real value of serialisation is that we can store multiple data and find our way in that binary mess.
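For that single-structure case, a sketch of the direct approach could look like the following. It assumes the type is trivially copyable and that the file is read back on a machine with the same byte order and padding; `small_cat`, `write_raw`, `read_raw` and `round_trip` are hypothetical names used only here.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>

// Hypothetical stand-in for a structure that is the whole save file.
struct small_cat {
    std::uint8_t weight;
    std::uint8_t length;
};

// Writes the object's bytes as they are in memory.
template <typename T>
bool write_raw(std::ostream& os, const T& object) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "raw writes are only safe for trivially copyable types");
    os.write(reinterpret_cast<const char*>(&object), sizeof(T));
    return os.good();
}

// Reads sizeof(T) bytes back into the object; false on a short read.
template <typename T>
bool read_raw(std::istream& is, T& object) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "raw reads are only safe for trivially copyable types");
    is.read(reinterpret_cast<char*>(&object), sizeof(T));
    return is.gcount() == static_cast<std::streamsize>(sizeof(T));
}

// Demonstrates a write followed by a read through an in-memory stream.
inline bool round_trip(const small_cat& in, small_cat& out) {
    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    return write_raw(buffer, in) && read_raw(buffer, out);
}
```

The same two functions work with `std::ofstream` and `std::ifstream` opened with `std::ios::binary`. The `static_assert` is there to stop this shortcut from being used on types that hold pointers, strings or containers.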

But a zoo can hold more data than just which animals we have, and that data can vary in size from chunk to chunk, so we need an interface or abstract class for file handling.

#include <cstdint>      // uint8_t, uint32_t
#include <iostream>     // std::ostream for printing a handle
#include <fstream>      // File input output ...
#include <vector>       // Collection for writing data
#include <sys/types.h>  // Gets the types for struct stat
#include <sys/stat.h>   // Struct stat
#include <string>       // String manipulations

struct handle {

    // Members
    protected: // Inherited in private
        std::string extn = "o";
        bool acces = false;
        struct stat buffer;
        std::string filename = "";
        std::vector<chunk> data;

    public: // Inherited in public
        std::string name = "genesis";
        std::string path = "";

    // Methods
    protected:
        void remake_name() {
            this->filename = this->path;
            if(this->filename != "") {
                this->filename.append("//");
            }
            this->filename.append(this->name);
            this->filename.append(".");
            this->filename.append(this->extn);
        }

        void recheck() {
            this->acces = (
                stat(
                    this->filename.c_str(),
                    &this->buffer
                ) == 0);
        }

        // To be overwritten later on [override]
        virtual bool check_header() { return true;}
        virtual bool check_footer() { return true;}
        virtual bool load_header()  { return true;}
        virtual bool load_footer()  { return true;}

    public:
        handle()
            : acces(false),
            name("genesis"),
            extn("o"),
            filename(""),
            path(""),
            data(0) {}

        void operator()(const char *name, const char *ext, const char *path) {
            this->path = std::string(path);
            this->name = std::string(name);
            this->extn = std::string(ext);
            this->remake_name();
            this->recheck();
        }

        void set_prefix(const char *prefix) {
            std::string prn(prefix);
            prn.append(this->name);
            this->name = prn;
            this->remake_name();
        }
        void set_suffix(const char *suffix) {
            this->name.append(suffix);
            this->remake_name();
        }

        int write() {
            this->remake_name();
            this->recheck();

            if(!this->load_header()) {return 0;}
            if(!this->load_footer()) {return 0;}

            std::fstream file(this->filename.c_str(), std::ios::out | std::ios::binary);
            uint32_t temp = 0;

            for(int i = 0; i < this->data.size(); i++) {

                write_as_binary(this->data[i], temp);
                file.write((char *)(&temp), sizeof(temp));
            }

            if(!this->check_header()) { file.close();return 0; }
            if(!this->check_footer()) { file.close();return 0; }

            file.close();
            return 1;
        }

        int read() {
            this->remake_name();
            this->recheck();

            if(!this->acces) {return 0;}

            std::fstream file(this->filename.c_str(), std::ios::in | std::ios::binary);
            uint32_t temp = 0;
            chunk ctemp;
            size_t fsize = this->buffer.st_size/4;

            for(size_t i = 0; i < fsize; i++) {

                // Stop early if the file is shorter than stat() reported
                if(!file.read((char*)(&temp), sizeof(temp))) { break; }
                read_as_binary(ctemp, temp);
                this->data.push_back(ctemp);

            }

            if(!this->check_header()) {
                file.close();
                this->data.clear();
                return 0;
            }
            if(!this->check_footer()) {
                file.close();
                this->data.clear();
                return 0;
            }

            return 1;
        }

    // Friends
    friend std::ostream& operator<<(std::ostream& os, const handle& hand);
    friend handle& operator<<(handle& hand, chunk& c);
    friend handle& operator>>(handle& hand, chunk& c);
    friend struct zoo_file;
};

std::ostream& operator<<(std::ostream& os, const handle& hand) {
    for(int i = 0; i < hand.data.size(); i++) {
        os << "\t" << hand.data[i] << "\n";
    }
    return os;
}

handle& operator<<(handle& hand, chunk& c) {
    hand.data.push_back(c);
    return hand;
}

handle& operator>>(handle& hand, chunk& c) {
    c = hand.data[ hand.data.size() - 1 ];
    hand.data.pop_back();
    return hand;
}

From this we can derive our zoo object and, later on, whichever others we need. The file handle is just a template containing a data block (handle.data); headers and footers are implemented later on in the derived types.

Since headers describe the whole file, checking and loading them can carry whatever extra functionality your specific case needs. If you have two different object types to add to the file, then instead of changing the headers/footers, insert one type at the start of the data and push_back the other at the end of the data via the overloaded operator<</operator>>.

For multiple objects that have no relationship to each other, you can add more private members in the derived type to store the current position of the individual segments, while keeping things neat and organised for file writing and reading.

struct zoo_file: public handle {

    zoo_file() {this->extn = "zoo";}

    void operator()(const char *name,  const char *path) {
        this->path = std::string(path);
        this->name = std::string(name);
        this->remake_name();
        this->recheck();
    }

    protected:
        // Sum of every byte in the data chunks
        uint32_t sum_data() const {
            uint32_t sum = 0;
            for(size_t i = 0; i < this->data.size(); i++) {
                sum += this->data[i].parts[0];
                sum += this->data[i].parts[1];
                sum += this->data[i].parts[2];
                sum += this->data[i].parts[3];
            }
            return sum;
        }

        // Validates the five header chunks and strips them, leaving only data
        virtual bool check_header() {
            if(this->data.size() < 5) {
                this->data.clear();
                return false;
            }
            uint32_t magic    = this->data[0].space;
            uint32_t checksum = this->data[1].space;
            uint32_t vload    = this->data[2].space;
            uint32_t ver_flag = this->data[3].space;
            uint32_t off_data = this->data[4].space;
            this->data.erase(this->data.begin(), this->data.begin() + 5);

            bool valid =
                magic == 0x895A4F4F &&                        // Magic number
                vload == 0x470D1A0A &&                        // Valid load number
                (ver_flag & 0xffff0000) == 0x01000000 &&      // Version 1.0
                (off_data >> 24) == 20 &&                     // Header size in bytes
                (off_data & 0xffffff) == this->data.size() && // Number of data chunks
                checksum == this->sum_data();                 // Checksum of the data

            if(!valid) {
                this->data.clear();
            }
            return valid;
        }

        // Prepends the five header chunks to the data
        virtual bool load_header()  {

            chunk magic, checksum, vload, ver_flag, off_data;
            magic.space = 0x895A4F4F;
            checksum.space = this->sum_data();
            vload.space = 0x470D1A0A;
            ver_flag.space = 0x01000001; // 1.0, usage 1 (normal)
            off_data.space = (20u << 24) | (uint32_t(this->data.size()) & 0xffffff);

            this->data.insert(this->data.begin(), off_data);
            this->data.insert(this->data.begin(), ver_flag);
            this->data.insert(this->data.begin(), vload);
            this->data.insert(this->data.begin(), checksum);
            this->data.insert(this->data.begin(), magic);

            return true;
        }

    friend zoo_file& operator<<(zoo_file& zf, cat &sc);
    friend zoo_file& operator>>(zoo_file& zf, cat &sc);
    // Add matching friends for elephant and other types once they are defined
};

zoo_file& operator<<(zoo_file& zf, cat &sc) {
    union chunk temp;
    temp.space = (uint32_t)sc;
    zf.data.push_back(temp);
    return zf;
}

// Extracts the last cat stored in the file; sc is left untouched if there is none
zoo_file& operator>>(zoo_file& zf, cat &sc) {
    for(size_t pos = zf.data.size(); pos-- > 0; ) {
        if((zf.data[pos].space & 0xffff0000) == 0x43020000) {
            sc = cat(zf.data[pos]);
            zf.data.erase(zf.data.begin() + pos);
            break;
        }
    }
    return zf;
}
// same for elephants, coyotes, giraffes .... whatever you need

Please don't just copy code. The handle object is meant as a template, so how you structure your data block is up to you. If you have a different structure and just copy code of course it won't work.

And now we can have a zoo with only cats. Building a file is as easy as:

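// All necessary includes

// Writing the zoo file
zoo_file my_zoo;
cat tom(4, 46); // weight 4 kg, length 46 cm

// Push back to the std::vector some cats in
my_zoo("superb_zoo", ""); // name, path (empty means the current directory)
my_zoo << tom;
my_zoo.write();

// Reading the zoo file (in another program or scope)
zoo_file loaded_zoo;
loaded_zoo("superb_zoo", "");
loaded_zoo.read();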
