
I have numbers stored in a file of the form:

12766 961 2595
19427 11518 9233

But there are 400,000 such sets. How can I quickly read them from a file?

    ifstream file_for_reading("C:\\Tests\\21");
    short number_of_vertexes;
    int edge;
    file_for_reading >> number_of_vertexes >> edge;
    if (number_of_vertexes < 1 || number_of_vertexes > 30000 || edge < 0 || edge>400000) { cout << "Correct your vallues"; exit(1); };
    int tmp = 0;
    short i;
    short** matrix = new short* [edge];
    for (tmp = 0; tmp < edge; tmp++)
        matrix[tmp] = new short[3];
    unsigned int first_vertex, second_vertex, edge_size;
    i = 0;
    while (!file_for_reading.eof()) {
        for (tmp = 0; tmp < edge; tmp++) {
            file_for_reading >> matrix[tmp][i] >> matrix[tmp][i + 1] >> matrix[tmp][i + 2];
            i = 0;
        }
    }
    for (tmp = 0; tmp < edge; tmp++) {
        for (i = 0; i < 3; i++) {
            cout << matrix[tmp][i] << " ";
        }
        cout << endl;
    }
    file_for_reading.close();
    //Dijkstra(matrix, 0, number_of_vertexes);
    Best to choose one language. How do you plan to store the numbers in memory? Is converting the file to binary an option? – Retired Ninja Mar 28 '21 at 08:40
    What does _quickly_ translate to? You can't use stream operators? Could you add what you've tried and found to be not performant? Would help. – Zoso Mar 28 '21 at 08:43
    Are there always three integers per line? – Ted Lyngmo Mar 28 '21 at 08:44
    What have you tried? It'd be pointless to give an answer using a solution you've already rejected. – Ted Lyngmo Mar 28 '21 at 08:51
  • I tried to read it with the fstream library (file >> matrix[i][j]), but it took me 6 seconds to read this matrix from the file (1,200,000 elements). SSD - 512 GB. Time required - no more than 2 seconds. Yes, there are always 3 elements. – Eugenos_Programos Mar 28 '21 at 09:06
    Sure, but _how_ did you do it? Please show your code. Are the number of sets in the file known? – Ted Lyngmo Mar 28 '21 at 09:10
    Btw, I just made a test of reading 400000 of these sets using an `ifstream` into a `vector`. Nothing fancy. It took 123 ms on an old 5400 rpm HD. – Ted Lyngmo Mar 28 '21 at 09:22
    Do not tag both C and C++ except when asking about differences or interactions between the two languages. Consider what happens when you tag both C and C++, somebody answers for C++, you accept it, and then other people do not answer. Later, somebody else looking for an answer for C may search Stack Overflow for questions tagged C, come to this question, and be disappointed there is no C answer. A purpose of Stack Overflow is to create a durable repository of questions and answers for the future. It is not just to solve your immediate problems. Tag only the one language you are writing in. – Eric Postpischil Mar 28 '21 at 09:27
  • Follow-up to my previous comment. [Here's the code](https://godbolt.org/z/jhY6e5TY7) I used. – Ted Lyngmo Mar 28 '21 at 09:28
    I made a very simple test of this too and it's right around 500ms in a release build. Debug is ~4.5 seconds. Do you have optimization on? BTW, it's best to edit the code into your question. – Retired Ninja Mar 28 '21 at 09:43
  • Using your code from the link I get similar results. ~500ms release, ~4.5 sec debug. – Retired Ninja Mar 28 '21 at 09:51
  • @RetiredNinja You mean my code? Perhaps my caches were hot when I got 123 ms on my old HD. I got 122 ms on an SSD with `-O3` and 160 ms with `-O0` :-) I noticed I took `tripple` _by-value_ when printing the result. It doesn't affect the reading time but it's annoying to leave it like that, so: [small update](https://godbolt.org/z/brKedPh7d) – Ted Lyngmo Mar 28 '21 at 09:58
  • @TedLyngmo Ah, I mistakenly thought the OP eventually gave up their code. Sorry about that. :) I thought it might have been that I was generating the file immediately before reading it but skipping that step didn't change much. I'm using a fast SSD, Windows 10, Visual Studio 2019. *shrug* Love to see the code that takes 6 seconds for such a small task. – Retired Ninja Mar 28 '21 at 10:12
  • @RetiredNinja No worries! :-) Yes, 6 seconds seems like a lot even with optimizing turned off. It'd be interesting to hear what time my program takes on the OP's computer. – Ted Lyngmo Mar 28 '21 at 10:22
  • I added my code to the question. – Eugenos_Programos Mar 28 '21 at 10:26
  • @Eugenos_Programos It's incomplete though. What time do you get with [this](https://godbolt.org/z/brKedPh7d) reading the same file? – Ted Lyngmo Mar 28 '21 at 10:31
  • @TedLyngmo I have 21 files which test my program, and one of the conditions for a correct test is to finish within 2 seconds. For the last three it reports that the time has been exceeded. I have a special exe file that runs with a batch file along with the tests and controls the time. – Eugenos_Programos Mar 28 '21 at 10:35
  • @Eugenos_Programos You're doing something wrong then but I can't say what without seeing the code. Your little loop reading the values doesn't look like it could slow it down by a factor 12-20 compared to the tests we made. I assume this is some online judge you're testing against? Can you link to the question? – Ted Lyngmo Mar 28 '21 at 10:37
  • @TedLyngmo Ok, I added all the code that is in my .cpp file. – Eugenos_Programos Mar 28 '21 at 10:40
  • @Eugenos_Programos That's not all the code. I can't compile that _as-is_. Anyway, it's enough code to question your question. _Why_ do you think it's _reading the file_ that takes time? Have you measured that part alone? Also, why `while (!file_for_reading.eof())`? It makes it loop twice. You also have memory leaks. `short** matrix = new short* [edge];` is never `delete[]`d. – Ted Lyngmo Mar 28 '21 at 10:42
  • @TedLyngmo I didn't know that this condition makes it loop twice. I'm a beginner. Maybe you can help me find the solution. – Eugenos_Programos Mar 28 '21 at 10:48
  • @Eugenos_Programos I can try to help, but I think I need to see the full problem description. `while (!file_for_reading.eof())` is `true` just when you've opened the file, so it'll loop and read the 400000 sets. It will also be `true` after you've read the 400000 sets. See [Why is `iostream::eof()` inside a loop condition (i.e. `while (!stream.eof())`) considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons) – Ted Lyngmo Mar 28 '21 at 10:54
  • I could see the loop that prints the data to the console taking several seconds. What happens if you remove that along with the `while (!file_for_reading.eof()) {` statement and the closing brace for it? – Retired Ninja Mar 28 '21 at 11:08
  • @RetiredNinja Do you mean to delete the loop? – Eugenos_Programos Mar 28 '21 at 11:20
  • @TedLyngmo I will try to make a complete description of the problem. I read a matrix from a file, and based on this matrix I need to compute the distances from the first vertex of the graph to all the others and output the data to another file, all of this within 2 seconds. – Eugenos_Programos Mar 28 '21 at 11:38

2 Answers


If you just want to read and print, this function will help:

#include <stdio.h>

void readMatrix(int dimension, char *path, int *data)
{
    FILE *file = fopen(path, "r");
    if (file == NULL)
    {
        fprintf(stderr, "error: while trying to open `%s' for reading\n", path);
        return;
    }

    /* read at most dimension * dimension integers, stopping on the first parse failure */
    for (int i = 0; (i < dimension * dimension) && (fscanf(file, "%d ", &data[i]) == 1); ++i)
        printf("data[%d] = %d\n", i, data[i]);

    fclose(file);
}
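
For completeness, a minimal caller could look like the sketch below, assuming the function above is in the same file; the file name matrix.txt, the dimension of 3, and the heap allocation are placeholders chosen only for illustration:

/* Hypothetical caller for readMatrix(); "matrix.txt" and dimension = 3
   are placeholders, not part of the original function. */
#include <stdlib.h>

int main(void)
{
    int dimension = 3;
    int *data = malloc(sizeof *data * dimension * dimension);
    if (data == NULL)
        return 1;

    char path[] = "matrix.txt";
    readMatrix(dimension, path, data);   /* reads and prints the values */

    free(data);
    return 0;
}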
zazz

You need to define what "quickly read a matrix from a file" means to you.

On what computer, with what operating system, what hardware?

A possible approach, if the dataset is written once a day and read by your application a dozen times each day, is to specify some binary format (in a written document, using EBNF notation, perhaps inspired by the ELF specification) and convert your textual file to a binary file.

On Linux, you could mmap(2) that binary file. See also readahead(2) and posix_fadvise(2). On Windows, read about file mapping.
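
As a rough sketch of the mmap(2) route (under my own assumptions: the text file has already been converted to a flat binary array of 32-bit ints, three per edge, and is named edges.bin):

/* Sketch only: edges.bin is a hypothetical binary file containing the
   edge triples as raw 32-bit ints, produced by a separate conversion step. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("edges.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    /* map the whole file read-only; the kernel pages it in on demand */
    int *edges = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (edges == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    size_t count = st.st_size / sizeof(int);   /* total ints, 3 per edge */
    if (count >= 3)
        printf("first edge: %d %d %d\n", edges[0], edges[1], edges[2]);

    munmap(edges, st.st_size);
    close(fd);
    return 0;
}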

If you use C <stdio.h> functions, be aware of setvbuf(3). You want (in 2021) I/O buffers of at least 64 KB (because of the page cache).
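
For example, a 64 KB buffer can be installed right after fopen and before the first read (a sketch; the file name is a placeholder):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("input.txt", "r");     /* hypothetical input file */
    if (f == NULL)
        return 1;

    /* full buffering with a 64 KB buffer; must be set before the first read */
    static char iobuf[64 * 1024];
    setvbuf(f, iobuf, _IOFBF, sizeof iobuf);

    int value;
    while (fscanf(f, "%d", &value) == 1)
        ;                                  /* consume the integers */

    fclose(f);
    return 0;
}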

Also consider, if you are allowed to, converting the textual file to some XDR format. C and C++ code generators for serializing and deserializing XDR data do exist. See also ASN.1.

Another approach could be to split that dataset into a dozen smaller textual files (e.g. using utilities like csplit(1) or your own equivalent) and read them with multiple threads (one thread per file).

A third approach would be inspired by assemblers: a first pass just finds the line endings and stores their file offsets, and a second pass uses several threads to parse "segments" of the file.
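
A sketch of such a first pass might look like this (the multi-threaded second pass is omitted; the file name and the generous size of the offset index are my assumptions):

/* First pass only: load the file and record the offset of the start of each
   line, so that a second pass (or several threads) can parse independent
   segments. "input.txt" is a placeholder. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("input.txt", "rb");
    if (f == NULL)
        return 1;

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    if (size <= 0)
    {
        fclose(f);
        return 1;
    }

    char *buf = malloc(size);
    long *line_start = malloc(sizeof *line_start * (size + 1)); /* generous upper bound */
    if (buf == NULL || line_start == NULL || fread(buf, 1, size, f) != (size_t)size)
    {
        fclose(f);
        return 1;
    }
    fclose(f);

    long lines = 0;
    line_start[lines++] = 0;
    for (long i = 0; i < size; ++i)
        if (buf[i] == '\n' && i + 1 < size)
            line_start[lines++] = i + 1;

    printf("%ld lines indexed\n", lines);
    /* each of N threads could now parse lines [k*lines/N, (k+1)*lines/N) independently */

    free(line_start);
    free(buf);
    return 0;
}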

My personal opinion is that you should benchmark first and probably not bother with any of this.

With SSDs, on a typical Linux desktop or server in 2021, reading a million integers encoded in decimal in a textual file should take less than a second of CPU time.

You could also convert (or store) that data into an SQLite, PostgreSQL, or Redis database.

Basile Starynkevitch
  • Windows, SSD - 512 GB; it took me 6 seconds to read this matrix from the file (1,200,000 elements) using the fstream library. – Eugenos_Programos Mar 28 '21 at 09:08
    Ehm I recently discovered this: "You should post also [X]" -- if the question does not have enough information, then it should not be answered yet. [How to Answer](https://stackoverflow.com/questions/how-to-answer) – MatG Mar 28 '21 at 09:12