Before we start
Most new users aren't familiar with the different data types in C++ and often use plain int, char, and so on in their code. To serialise data successfully, you need to think carefully about your data types. So these are your first steps if you have an int lying around somewhere:
- Know your data
  - What is the maximum value that the variable should hold?
  - Can it be negative?
- Limit your data
  - Implement the decisions from above
  - Limit the number of objects your file can hold
Know your data
If you have a struct or a class with some data, such as:
struct cat {
int weight = 0; // In kg (pounds)
int length = 0; // In cm (or feet)
std::string voice = "meow.mp3";
cat() {}
cat(int weight, int length): weight(weight), length(length) {}
};
Can your cat really weigh more than 255 kg (the maximum value of a 1-byte unsigned integer)? Can it be longer than 255 cm (2.55 m)? Does the voice of your cat change with every object of cat?
Data that is the same for every object should be declared static, and you should limit the size of each member to fit your needs. In this example, the answer to every question above is no.
So our cat struct now looks like this:
struct cat {
uint8_t weight = 0; // Non negative 8 bit (1 byte) integer (or unsigned char)
uint8_t length = 0; // Same for length
static std::string voice;
cat() {}
cat(uint8_t w, uint8_t l):weight(w), length(l) {}
};
std::string cat::voice = "meow.mp3"; // Definition of the static member (needs <string> and <cstdint> above)
Files are written byte by byte (often as sequences of characters) while your data can vary in size, so you need to assume or enforce a maximum value that your data can hold.
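One way to enforce such a limit is to check the value at the moment it is narrowed to a single byte. The sketch below is only an illustration: to_u8, clamp_u8 and limit_example are hypothetical names that do not appear elsewhere in this topic, and whether you reject or clamp an out-of-range value is an assumption you should adapt to your project.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: converts an int to one byte and rejects values that do not fit.
std::uint8_t to_u8(int value) {
    if (value < 0 || value > std::numeric_limits<std::uint8_t>::max()) {
        throw std::out_of_range("value does not fit in one byte");
    }
    return static_cast<std::uint8_t>(value);
}

// Hypothetical alternative: saturates at the limits instead of throwing.
std::uint8_t clamp_u8(int value) {
    if (value < 0) { return 0; }
    if (value > std::numeric_limits<std::uint8_t>::max()) {
        return std::numeric_limits<std::uint8_t>::max();
    }
    return static_cast<std::uint8_t>(value);
}

// Usage sketch: a 4 kg cat fits, a 390 kg tiger is clamped to the 1-byte maximum.
void limit_example() {
    assert(to_u8(4) == 4);
    assert(clamp_u8(390) == 255);
}
```

Throwing makes bad input visible early, while clamping keeps the file writable; which one is right depends on how much you trust the data source.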
But not every project (or structure) is the same, so let's talk about the difference between the data in your code and binary structured data. When thinking about serialisation, you need to ask: "what is the bare minimum of data that this structure needs to be unique?".
Our cat object can represent any cat except:
- tigers: max 390 kg, and 340 cm
- lions : max 315 kg, and 365 cm
Anything else is eligible. You can vary "meow.mp3" depending on the length and weight, so the most important data that makes a cat unique is its length and weight. That is the data we need to save to our file.
Limit your data
The largest zoo in the world has 5000 animals and 700 species, which means that on average each species in the zoo has a population of roughly 7 animals (5000 / 700). So for our cat species a 1-byte count (up to 255 cats) is enough, with no fear of overflowing it.
So it is safe to assume that our zoo project should hold up to 200 elements per species. This leaves us with two byte-sized members, so the serialised data for our struct is at most two bytes.
Approach to serialisation
Constructing our cat block
This is a good way to start: it gives your custom serialisation the right foundation. Now all that is left is to define a structured binary format. For that we need a way to recognise whether our two bytes are part of a cat or of some other structure, which can be done either with a same-type collection (every two bytes are a cat) or with an identifier.
If we have a single file (or part of a file) that holds all the cats, we just need the start offset in the file and the size of the cat bytes, and then we read every two bytes from the start offset to get all the cats.
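The same-type collection can be sketched in a few lines. This is a minimal illustration rather than the format used later in this topic: flat_cat, pack_cats and unpack_cats are hypothetical names, and the byte order inside a record (length first, then weight) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical two-byte record: one byte of length, one byte of weight.
struct flat_cat {
    std::uint8_t length;
    std::uint8_t weight;
};

// Writes every cat as exactly two bytes, one after another.
std::vector<std::uint8_t> pack_cats(const std::vector<flat_cat>& cats) {
    std::vector<std::uint8_t> bytes;
    for (std::size_t i = 0; i < cats.size(); ++i) {
        bytes.push_back(cats[i].length);
        bytes.push_back(cats[i].weight);
    }
    return bytes;
}

// Reads up to `count` cats starting at byte offset `start`; stops early if the data is truncated.
std::vector<flat_cat> unpack_cats(const std::vector<std::uint8_t>& bytes,
                                  std::size_t start, std::size_t count) {
    std::vector<flat_cat> cats;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t pos = start + 2 * i;
        if (pos + 1 >= bytes.size()) { break; }
        cats.push_back({bytes[pos], bytes[pos + 1]});
    }
    return cats;
}

// Usage sketch: two cats become four bytes and come back unchanged.
void flat_example() {
    const std::vector<flat_cat> cats = {{46, 4}, {50, 5}};
    const std::vector<flat_cat> back = unpack_cats(pack_cats(cats), 0, 2);
    assert(back.size() == 2 && back[1].weight == 5);
}
```

Because there is no type byte, the reader has to know in advance that the region holds only cats.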
An identifier is a way to tell, from the leading byte, whether the object is a cat or something else. This is commonly done with the TLV (Type Length Value) format, where the type would be Cat, the length would be two bytes, and the value would be those two bytes.
As you can see, the first option needs fewer bytes and is therefore more compact, but the second option lets us store multiple kinds of animals in our file and make a zoo. How you structure your binary files depends a lot on your project. Since the "single file" option is the more straightforward one to work out on your own, I will implement the second one.
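To make the TLV idea concrete, here is a small generic sketch. It assumes a one-byte type and a one-byte length, which matches the description above but is still an assumption; tlv_record, append_tlv and parse_tlv are hypothetical names, and the 'E' record in the usage function is an invented second animal type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory form of one Type-Length-Value record.
struct tlv_record {
    char type;
    std::vector<std::uint8_t> value;
};

// Appends type, length and value to `out`; the length must fit in one byte.
void append_tlv(std::vector<std::uint8_t>& out, char type,
                const std::vector<std::uint8_t>& value) {
    if (value.size() > 255) { throw std::length_error("value too long for a 1-byte length"); }
    out.push_back(static_cast<std::uint8_t>(type));
    out.push_back(static_cast<std::uint8_t>(value.size()));
    out.insert(out.end(), value.begin(), value.end());
}

// Walks the buffer record by record; stops at the first truncated record.
std::vector<tlv_record> parse_tlv(const std::vector<std::uint8_t>& in) {
    std::vector<tlv_record> records;
    std::size_t pos = 0;
    while (pos + 2 <= in.size()) {
        const std::size_t len = in[pos + 1];
        if (pos + 2 + len > in.size()) { break; }
        tlv_record rec;
        rec.type = static_cast<char>(in[pos]);
        rec.value.assign(in.begin() + static_cast<std::ptrdiff_t>(pos + 2),
                         in.begin() + static_cast<std::ptrdiff_t>(pos + 2 + len));
        records.push_back(rec);
        pos += 2 + len;
    }
    return records;
}

// Usage sketch: a cat and a (hypothetical) three-byte 'E' record share one buffer.
void tlv_example() {
    std::vector<std::uint8_t> file;
    append_tlv(file, 'C', {46, 4});
    append_tlv(file, 'E', {1, 2, 3});
    assert(parse_tlv(file).size() == 2);
}
```

The length byte is what lets a reader skip records whose type it does not understand.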
The most important thing about the "identifier" approach is to first make it logical for us, and then make it logical for our machine. I come from a world where reading from left to right is the norm. So it is logical that the first thing I want to read about a cat is its type, then its length, and then its value.
char type = 'C'; // C, short for Cat, 0x43
uint8_t length = 2; // The value holds 2 bytes, 0x02
uint8_t c_length = '?'; // the cat's length
uint8_t c_weight = '?'; // the cat's weight
And to represent it as a chunk (block):
+00 4B 43-02-LL-WW ('C\x02LW')
Where this means:
+00: offset from the start; 0 means it is the start of the file
4B: size of our data block, 4 bytes
43-02-LL-WW: actual value of the cat
43: hexadecimal representation of the character 'C'
02: hexadecimal representation of the length of this type (2)
LL: length of this cat, a 1-byte value
WW: weight of this cat, a 1-byte value
But since it is easier for me to read data from left to right, with the most significant byte first, my data should be written as big endian, while most desktop computers are little endian.
Endianness and its importance
The main issue here is the endianness of our machine: because our struct/class depends on it, we need a base type with a fixed byte order. The way we wrote the block above corresponds to a big endian layout, but machines can have either endianness, and you can find out which one your machine has here.
For users experienced with bit fields I would strongly suggest that you use them for this. But for unfamiliar users:
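#include <cstdint>  // uint8_t, uint16_t, uint32_t
#include <iostream> // Just for std::ostream, std::cout, and std::endl
// True when the host stores the least significant byte first (little endian).
// Reading a union member other than the one last written relies on a
// compiler extension in C++ (GCC, Clang and MSVC support it).
bool is_little() {
    union {
        uint16_t w;
        uint8_t p[2];
    } p;
    p.w = 0x0001;
    return p.p[0] == 0x1;
}
union chunk {
    uint32_t space;
    uint8_t parts[4];
};
chunk make_chunk(uint32_t VAL) {
    union chunk ret;
    ret.space = VAL;
    return ret;
}
// Reverses the byte order of a 32 bit value
uint32_t swap_bytes(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00) | ((v << 8) & 0x00ff0000) | (v << 24);
}
// Prints the most significant byte first (as raw characters) on any host
std::ostream& operator<<(std::ostream& os, const union chunk &c) {
    if(is_little()) {
        return os << c.parts[3] << c.parts[2] << c.parts[1] << c.parts[0];
    }else {
        return os << c.parts[0] << c.parts[1] << c.parts[2] << c.parts[3];
    }
}
// VAL holds 4 bytes exactly as they were read from the (big endian) file
void read_as_binary(union chunk &t, uint32_t VAL) {
    t.space = is_little() ? swap_bytes(VAL) : VAL;
}
// VAL receives the 4 bytes in (big endian) file order, ready to be written
void write_as_binary(union chunk t, uint32_t &VAL) {
    VAL = is_little() ? swap_bytes(t.space) : t.space;
}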
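An alternative that never has to ask which byte order the host uses is to build the bytes with shifts, since shifts operate on the value and not on its memory layout. The sketch below assumes the file stores the most significant byte first; store_be32, load_be32 and shift_example are hypothetical names, not part of the code above.

```cpp
#include <cassert>
#include <cstdint>

// Writes `value` into out[0..3], most significant byte first.
void store_be32(std::uint32_t value, std::uint8_t out[4]) {
    out[0] = static_cast<std::uint8_t>(value >> 24);
    out[1] = static_cast<std::uint8_t>(value >> 16);
    out[2] = static_cast<std::uint8_t>(value >> 8);
    out[3] = static_cast<std::uint8_t>(value);
}

// Rebuilds the value from four bytes stored most significant byte first.
std::uint32_t load_be32(const std::uint8_t in[4]) {
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8) |
            static_cast<std::uint32_t>(in[3]);
}

// Usage sketch: a cat block 'C', 0x02, length 0x2E, weight 0x04 survives a round trip.
void shift_example() {
    std::uint8_t bytes[4];
    store_be32(0x43022E04u, bytes);
    assert(bytes[0] == 0x43);
    assert(load_be32(bytes) == 0x43022E04u);
}
```

The result is the same on little and big endian machines, at the cost of handling every integer field byte by byte.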
So now we have our chunk, which prints its characters in an order we can recognise at first glance. Now we need a way to cast from uint32_t to our cat and back, since our chunk size is 4 bytes, the size of a uint32_t.
struct cat {
    uint8_t weight = 0; // Non negative 8 bit (1 byte) integer (or unsigned char)
    uint8_t length = 0; // The same for length
    static std::string voice;
    cat() {}
    cat(uint8_t w, uint8_t l): weight(w), length(l) {}
    // Block layout: 0x43 ('C'), 0x02 (length of the value), LL, WW
    cat(union chunk cat_chunk) {
        if((cat_chunk.space & 0xffff0000) == 0x43020000) {
            // Bit shifts work on the value, so they do not depend on endianness
            this->length = (cat_chunk.space >> 8) & 0xff;
            this->weight = cat_chunk.space & 0xff;
        }else {
            // Some error handling: this chunk is not a cat
            this->weight = 0;
            this->length = 0;
        }
    }
    operator uint32_t() const {
        return 0x43020000 | (uint32_t(this->length) << 8) | this->weight;
    }
};
std::string cat::voice = "meow.mp3";
Zoo file structure
So now we have our cat object ready to be cast back and forth between chunk and cat. Now we need to structure a whole file with a header, data, a footer, and checksums. Let's say we are building an application that zoo facilities use to keep track of how many animals they have. The data of our zoo is which animals they have and how many. The footer can be omitted (or it can hold the timestamp of when the file was created), and in the header we save instructions on how to read our file, versioning, and corruption checks.
For more information on how I structured these files, you can find sources here and in this shameless plug.
// File structure: all big endian (most significant byte first)
------------
HEADER:
+00 4B 89-5A-4F-4F ('\211ZOO') Our magic number for the zoo file
+04 4B XX-XX-XX-XX ('????') Whole file checksum
+08 4B 47-0D-1A-0A ('G\r\032\n') // CRLF <-> LF conversion and END OF FILE 032
+12 4B YY-YY-00-ZZ ('??\0?') Versioning and usage
+16 4B AA-BB-BB-BB ('X???') Start offset + data length
------------
DATA:
Animals: // For each animal type (block identifier)
+20+?? 4B ??-XX-XX-LL ('????') : ? animal type identifier, X start offset from header, L number of animals (struct objects)
+24+??+4 4B XX-XX-XX-XX ('????') : Checksum for animal type
For checksums, you can use a simple one (adding up every byte) or, among others, CRC-32. The choice is yours, and it depends on the size of your files and data. So now we have the layout of our file. Of course, I must warn you:
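Both kinds of checksum fit in a few lines. The sketch below shows a plain byte sum and a bitwise CRC-32 using the common reflected polynomial 0xEDB88320 (the variant used by zlib and PNG); byte_sum, crc32_ieee and checksum_example are hypothetical names, and a table-driven CRC would be faster for large files.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simple checksum: the sum of all bytes, wrapping around at 32 bits.
std::uint32_t byte_sum(const std::uint8_t* data, std::size_t size) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < size; ++i) {
        sum += data[i];
    }
    return sum;
}

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and final XOR 0xFFFFFFFF).
std::uint32_t crc32_ieee(const std::uint8_t* data, std::size_t size) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < size; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // mask is all ones when the lowest bit is set, otherwise zero
            const std::uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

// Usage sketch: the standard check string "123456789" gives the well-known CRC-32 value.
void checksum_example() {
    const std::uint8_t msg[] = {'1', '2', '3', '4', '5', '6', '7', '8', '9'};
    assert(crc32_ieee(msg, sizeof(msg)) == 0xCBF43926u);
}
```

A byte sum misses reordered bytes, because the sum stays the same, while a CRC catches most such errors.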
Having only one structure or class that requires serialisation means that in general this type of serialisation isn't needed. You can just cast the whole object to an integer of the desired size and then to a binary character sequence, and later read that character sequence back into an integer and then into the object. The real value of serialisation is that we can store many kinds of data and still find our way in that binary mess.
But since a zoo can hold more data than just which animals we have, and that data can vary in size, we need to make an interface or abstract class for file handling.
#include <fstream> // File input output ...
#include <vector> // Collection for writing data
#include <sys/types.h> // Gets the types for struct stat
#include <sys/stat.h> // Struct stat
#include <string> // String manipulations
struct handle {
// Members
protected: // Inherited in private
std::string extn = "o";
bool acces = false;
struct stat buffer;
std::string filename = "";
std::vector<chunk> data;
public: // Inherited in public
std::string name = "genesis";
std::string path = "";
// Methods
protected:
void remake_name() {
this->filename = this->path;
if(this->filename != "") {
this->filename.append("//");
}
this->filename.append(this->name);
this->filename.append(".");
this->filename.append(this->extn);
}
void recheck() {
this->acces = (
stat(
this->filename.c_str(),
&this->buffer
) == 0);
}
// To be overwritten later on [override]
virtual bool check_header() { return true;}
virtual bool check_footer() { return true;}
virtual bool load_header() { return true;}
virtual bool load_footer() { return true;}
public:
handle() // Members are listed in declaration order, which is the order they are initialised in
: extn("o"),
acces(false),
buffer(),
filename(""),
data(),
name("genesis"),
path("") {}
void operator()(const char *name, const char *ext, const char *path) {
this->path = std::string(path);
this->name = std::string(name);
this->extn = std::string(ext);
this->remake_name();
this->recheck();
}
void set_prefix(const char *prefix) {
std::string prn(prefix);
prn.append(this->name);
this->name = prn;
this->remake_name();
}
void set_suffix(const char *suffix) {
this->name.append(suffix);
this->remake_name();
}
int write() {
this->remake_name();
this->recheck();
if(!this->load_header()) {return 0;}
if(!this->load_footer()) {return 0;}
std::fstream file(this->filename.c_str(), std::ios::out | std::ios::binary);
uint32_t temp = 0;
for(int i = 0; i < this->data.size(); i++) {
write_as_binary(this->data[i], temp);
file.write((char *)(&temp), sizeof(temp));
}
if(!this->check_header()) { file.close();return 0; }
if(!this->check_footer()) { file.close();return 0; }
file.close();
return 1;
}
int read() {
this->remake_name();
this->recheck();
if(!this->acces) {return 0;}
std::fstream file(this->filename.c_str(), std::ios::in | std::ios::binary);
uint32_t temp = 0;
chunk ctemp;
size_t fsize = this->buffer.st_size/4; // Number of whole 4 byte chunks in the file
for(size_t i = 0; i < fsize; i++) {
if(!file.read((char*)(&temp), sizeof(temp))) {break;} // Stop on a short or failed read
read_as_binary(ctemp, temp);
this->data.push_back(ctemp);
}
if(!this->check_header()) {
file.close();
this->data.clear();
return 0;
}
if(!this->check_footer()) {
file.close();
this->data.clear();
return 0;
}
return 1;
}
// Friends
friend std::ostream& operator<<(std::ostream& os, const handle& hand);
friend handle& operator<<(handle& hand, chunk& c);
friend handle& operator>>(handle& hand, chunk& c);
friend struct zoo_file;
};
std::ostream& operator<<(std::ostream& os, const handle& hand) {
for(int i = 0; i < hand.data.size(); i++) {
os << "\t" << hand.data[i] << "\n";
}
return os;
}
handle& operator<<(handle& hand, chunk& c) {
hand.data.push_back(c);
return hand;
}
handle& operator>>(handle& hand, chunk& c) {
if(hand.data.empty()) {return hand;} // Nothing to take out, c is left untouched
c = hand.data.back();
hand.data.pop_back();
return hand;
}
From this we can initialise our zoo object and, later on, whichever other one we need. The file handle is just a file template containing a data block (handle.data); headers and footers are implemented later on.
Since headers describe whole files, checking and loading can have whatever added functionality your specific case needs. If you have two different kinds of objects to add to the file, then instead of changing the headers/footers you can insert one type at the start of the data and push_back the other type at the end of the data via overloaded operator<</operator>>.
For multiple objects that have no relationship to each other, you can add more private members in the derived class to store the current position of each segment, while keeping things neat and organised for file writing and reading.
struct zoo_file: public handle {
zoo_file() {this->extn = "zoo";}
void operator()(const char *name, const char *path) {
this->path = std::string(path);
this->name = std::string(name);
this->remake_name();
this->recheck();
}
protected:
virtual bool check_header() {
    // The header is 5 chunks: magic, checksum, valid load number, version + flag, offset + data length
    if(this->data.size() < 5) {
        this->data.clear();
        return false;
    }
    chunk magic    = this->data[0];
    chunk checksum = this->data[1];
    chunk vload    = this->data[2];
    chunk ver_flag = this->data[3];
    chunk off_data = this->data[4];
    // Strip the header so that only the data chunks remain
    this->data.erase(this->data.begin(), this->data.begin() + 5);
    bool valid =
        magic.space == 0x895A4F4F &&                      // Magic number
        vload.space == 0x470D1A0A &&                      // Valid load number
        (ver_flag.space & 0xffff0000) == 0x01000000 &&    // Version 1.0
        (off_data.space >> 24) == 20 &&                   // Header size in bytes
        (off_data.space & 0xffffff) == this->data.size(); // Number of data chunks
    // Checksum: plain sum of every data byte
    uint32_t sum = 0;
    for(size_t i = 0; i < this->data.size(); i++) {
        for(int j = 0; j < 4; j++) { sum += this->data[i].parts[j]; }
    }
    if(!valid || sum != checksum.space) {
        this->data.clear();
        return false;
    }
    return true;
}
virtual bool load_header() {
    chunk magic    = make_chunk(0x895A4F4F);
    chunk checksum = make_chunk(0);
    chunk vload    = make_chunk(0x470D1A0A);
    chunk ver_flag = make_chunk(0x01000001); // 1.0, usage 1 (normal)
    // Header size in bytes (20) + number of data chunks
    chunk off_data = make_chunk((20u << 24) | (uint32_t(this->data.size()) & 0xffffff));
    for(size_t i = 0; i < this->data.size(); i++) {
        for(int j = 0; j < 4; j++) { checksum.space += this->data[i].parts[j]; }
    }
    // Inserted back to front so that magic ends up first
    this->data.insert(this->data.begin(), off_data);
    this->data.insert(this->data.begin(), ver_flag);
    this->data.insert(this->data.begin(), vload);
    this->data.insert(this->data.begin(), checksum);
    this->data.insert(this->data.begin(), magic);
    return true;
}
friend zoo_file& operator<<(zoo_file& zf, cat &sc);
friend zoo_file& operator>>(zoo_file& zf, cat &sc);
// Declare the same pair of friends for every other animal type you define
};
// Appends the cat to the data block as one 4 byte chunk
zoo_file& operator<<(zoo_file& zf, cat &sc) {
    zf.data.push_back(make_chunk((uint32_t)sc));
    return zf;
}
// Takes the last cat chunk out of the data block; sc is left untouched if there is none
zoo_file& operator>>(zoo_file& zf, cat &sc) {
    size_t pos = zf.data.size();
    while(pos > 0) {
        pos--;
        if((zf.data[pos].space & 0xffff0000) == 0x43020000) {
            sc = cat(zf.data[pos]);
            zf.data.erase(zf.data.begin() + pos);
            break;
        }
    }
    return zf;
}
// same for elephants, coyotes, giraffes .... whatever you need
Please don't just copy the code. The handle object is meant as a template, so how you structure your data block is up to you. If you have a different structure and just copy the code, of course it won't work.
And now we can have a zoo with only cats. Building a file is as easy as:
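// All necessary includes
// Writing the zoo file
zoo_file my_zoo;
my_zoo("superb_zoo", ""); // Name and path (an empty path means the current directory)
cat tom(4, 46); // 4 kg, 46 cm
my_zoo << tom; // Push some cats back into the std::vector
my_zoo.write();
// Reading the zoo file (a separate object, e.g. in another program)
zoo_file loaded_zoo;
loaded_zoo("superb_zoo", "");
loaded_zoo.read();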