
I am working to manually parse a binary database file. The data is stored in fixed-width, fixed-length flat files where integers and dates are stored as little-endian binary values. There are several types I am dealing with, ranging from a single byte for an integer up to 7 bytes for a timestamp.

I am new to C++ and trying to recreate a working system I did in C#, but there are some things that are tripping me up here. So the first thing I am doing is just reading 300 bytes from a file into a wchar_t array.

int LineLength = 300;
int counter = 0;
wint_t c;
wchar_t* wcs = new wchar_t[LineLength];
FILE* fp = fopen("BinaryFile.bin", "rb");   // fgetwc reads from a FILE*, not a file name
while ((c = fgetwc(fp)) != WEOF) {
    wcs[counter] = c;
    counter++;
    if (counter > LineLength - 1) {
        break;
    }
}
fclose(fp);

That seems to be working just fine. Next I need to create arrays whose size varies per column, since some data like an int will be 1 byte while strings will be much longer. Normally I would set ColumnName, StartingPosition and ColumnWidth by looping over a JSON file, but for simplicity I just defined them as arrays here:

string* ColumnName = new string[3]{ "Int1", "Int2", "String1" };
int* StartingPosition = new int[3]{ 1, 10, 11 };
int* ColumnWidth = new int[3]{ 4, 1, 25 };
for (int i = 0; i < 3; ++i)
{
    wchar_t* CurrentColumnBytes = new wchar_t[ColumnWidth[i]];
    cout << "\nBytes Requested: " << ColumnWidth[i] << ", ArraySize: " << sizeof(CurrentColumnBytes) << "\n";

    int Counter = 0;
    // copy this column's bytes out of the 300-byte record
    for (int C = StartingPosition[i]; C < ColumnWidth[i] + StartingPosition[i]; ++C)
    {
        CurrentColumnBytes[Counter] = wcs[C];
        Counter++;
    }

    delete[] CurrentColumnBytes;
}
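
Eventually, once I have the raw bytes for a column such as Int1, I need to assemble the little-endian value. A rough sketch of what I have in mind (DecodeLittleEndian is just a name I made up, and it assumes the raw bytes end up in an unsigned char buffer rather than wchar_t):

unsigned long long DecodeLittleEndian(const unsigned char* bytes, int width)
{
    // The first byte is the least significant, so work backwards from the
    // last byte and shift as we go. Handles the 1 to 7 byte values described above.
    unsigned long long value = 0;
    for (int b = width - 1; b >= 0; --b)
    {
        value = (value << 8) | bytes[b];
    }
    return value;
}

So the 4-byte Int1 column would be something like DecodeLittleEndian(columnBytes, 4).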

For some reason CurrentColumnBytes always gets a size of 4. I simplified things a little to:

int arraySize = 8;
char* locale = setlocale(LC_ALL, "UTF-8");
char* CurrentColumnBytes = new char[arraySize];
cout << "\nBytes Requested: " << arraySize << ", ArraySize: " << sizeof(CurrentColumnBytes) << "\n";

This prints out: Bytes Requested: 8, ArraySize: 4. Changing it a little to be:

char* locale = setlocale(LC_ALL, "UTF-8");
char* CurrentColumnBytes[8];
cout << "ArraySize: " << sizeof(CurrentColumnBytes) << "\n";

This prints out: ArraySize: 32. There seems to be something basic I am missing here. Why can't I create an array with the width that I want? And in the last example, why does it multiply the size by 4 bytes?

Is there a better way to duplicate what I did using C# like this:

FileStream WorkingFile = new FileStream(fileName, FileMode.Open, FileAccess.Read);

while (WorkingFile.Position < WorkingFile.Length)
{
    byte[] ByteArray = new byte[300];
    WorkingFile.Read(ByteArray, 0, 300);
}

I also tried using ifstream, but I think I have a fundamental misunderstanding of how array creation works in C++. I am including all the rest of the information in case someone has a suggestion for a better datatype or method.
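
For reference, my ifstream attempt looked roughly like this (a sketch of the idea rather than my exact code; it reads the same 300-byte records as the C# loop):

#include <fstream>
#include <vector>

int main()
{
    std::ifstream WorkingFile("BinaryFile.bin", std::ios::binary);
    while (WorkingFile)
    {
        // Read up to 300 raw bytes per record.
        std::vector<char> ByteArray(300);
        WorkingFile.read(ByteArray.data(), ByteArray.size());
        if (WorkingFile.gcount() == 0)
            break;   // nothing left to read
        // ... split ByteArray into columns here ...
    }
}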

Alan
  • `sizeof(CurrentColumnBytes)` will return the size in bytes of `CurrentColumnBytes`, a pointer. Pointers have no knowledge of the size of the object at which they point. – user4581301 Oct 15 '19 at 19:36
  • You have to realize that an `int` and a pointer are each 4 bytes on a 32-bit machine. You're doing `sizeof(CurrentColumnBytes)`, which is a pointer and thus 4 bytes. Also, `wchar_t` is 2 bytes; maybe use `char` instead. – Trinopoty Oct 15 '19 at 19:36
  • *"Is there a better way..."* - yeah, use a `std::vector` or other suitable container. – WhozCraig Oct 15 '19 at 19:37
  • @WhozCraig What about performance? Right now I am parsing around 20,000 records per second using C#, but when I was using dictionaries and lists, it slowed WAY down compared to basic byte arrays. Thank you Trinopoty and user4581301 for explaining that to me; I was trying to do the equivalent of `CurrentColumnBytes.Count` and I guess I did it wrong. – Alan Oct 15 '19 at 19:39
  • C++ and C# are very different languages and the backing idioms do not carry over well between them. Don't forget everything you learned from C#, but do get a good book on C++ and make sure you are familiar with the differences. Recommended reading: [Why should C++ programmers minimize use of 'new'?](https://stackoverflow.com/questions/6500313/why-should-c-programmers-minimize-use-of-new) – user4581301 Oct 15 '19 at 19:40
  • The only noticeable performance hit you'll get from a `vector`, assuming you turn the optimizer on, is during initialization where `vector` will default initialize all of the elements. If you don't turn the optimizer on, all complaints about performance will fall on deaf ears. – user4581301 Oct 15 '19 at 19:43
  • @user4581301 that's a bit of an oversimplification. At least one should reserve enough space up front. Frequent reallocations can be expensive (maybe not much, but compared to no reallocations it is a price you pay when you don't use vectors right) – 463035818_is_not_a_number Oct 15 '19 at 19:47
  • @Alan If performance is an issue (honestly I can't see how it would be when the real bottleneck in the problematic code is going to be file I/O), then cross that bridge as we come to it. You should be able to use a `std::vector` anywhere you're using manually managed dynamic allocation, and with the benefit of carrying around size information, accessor operators (which are all inlined by the time you hit release-quality code), etc. I've yet to hit a scenario outside of value-initialization where a vector vs raw management is a problem. Get a good reference. It's worth it. – WhozCraig Oct 15 '19 at 19:48
  • Agreed, @formerly . I should have made that clear. In my head, the asker would use `vector` almost exactly as they are now with the array and see almost no change. And value initialization, not default initialization. My bad. – user4581301 Oct 15 '19 at 19:50
  • Thanks guys, I usually load up a pre-buffer with multiple records (5-100 depending on length) to optimize the disk IO, then break that buffer up into a smaller array for each row, then break that into columns and transfer the results to a `DataTable`. I also can't say where the bottleneck will be, just trying to do it "right" where I can. So imagine an Excel spreadsheet with 100,000 rows, 110 columns, and each cell would be its own vector or byte/char array. – Alan Oct 15 '19 at 19:53
  • you can only know where the bottleneck is once you have working code that you can profile, hence first getting it right is the way to go – 463035818_is_not_a_number Oct 15 '19 at 19:56
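
Following up on the std::vector suggestion from the comments, here is a rough sketch of what I understand the column loop would look like without any manual new/delete (same made-up column layout as above, and I have not measured the performance of this):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    // Same example layout as in the question.
    std::vector<std::string> ColumnName       = { "Int1", "Int2", "String1" };
    std::vector<int>         StartingPosition = { 1, 10, 11 };
    std::vector<int>         ColumnWidth      = { 4, 1, 25 };

    std::ifstream WorkingFile("BinaryFile.bin", std::ios::binary);
    std::vector<unsigned char> Record(300);
    WorkingFile.read(reinterpret_cast<char*>(Record.data()), Record.size());

    for (std::size_t i = 0; i < ColumnName.size(); ++i)
    {
        // Copy this column's bytes; the vector carries its own size.
        std::vector<unsigned char> CurrentColumnBytes(
            Record.begin() + StartingPosition[i],
            Record.begin() + StartingPosition[i] + ColumnWidth[i]);

        // CurrentColumnBytes.size() is the equivalent of .Count in C#.
    }
}

With this, sizeof never comes into it; CurrentColumnBytes.size() always reports the number of bytes in the column.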

0 Answers