3

In my program I want to read several text files (more than ~800 files), each with 256 lines and filenames running from 1.txt to n.txt, and store them in a database after several processing steps. My problem is the reading speed of the data. I could speed the program up to about twice its previous speed by using OpenMP multithreading for the reading loop. Is there a way to speed it up a bit more? My actual code is

std::string CCD_Folder = CCDFolder; //CCDFolder is a pointer to a char array
int b = 0;
int PosCounter = 0;
int WAVENUMBER, WAVELUT;
std::vector<std::string> tempstr;
std::string inputline;
//Input
omp_set_num_threads(YValue);
#pragma omp parallel for private(WAVENUMBER) private(WAVELUT) private(PosCounter) private(tempstr) private(inputline)
    for(int i = 1; i < (CCD_Filenumbers+1); i++)
    {
        //std::cout << omp_get_thread_num() << ' ' << i << '\n';
        //Convert the index to a string, build the file name and open the input stream
        std::string CCD_Filenumber = boost::lexical_cast<std::string>(i);
        std::string CCD_Filename = CCD_Folder + '\\' + CCD_Filenumber + ".txt";
        std::ifstream datain(CCD_Filename, std::ifstream::in);  
        while(!datain.eof())
        {
            std::getline(datain, inputline);
            //Processing

        };

    };

All variables not defined here are defined elsewhere in my code, and the code works. So is there a possibility to speed this code up a bit more?
Thank you very much!

arc_lupus
  • 3,317
  • 4
  • 36
  • 66
  • 6
    `while(!datain.eof())` Argggggghhhhhhhhhh – Tony The Lion Aug 20 '13 at 14:24
  • What platform, filesystem, hardware etc. are you concerned with? – Useless Aug 20 '13 at 14:25
  • 3
    hmm, you will soon hit the disk access limitation: your hard drive is fundamentally a sequential mechanism. That's why database managers like SQL use optimized storage subsystems. – lucasg Aug 20 '13 at 14:29
  • @TonyTheLion: Is there a better way for my purpose? – arc_lupus Aug 20 '13 at 14:40
  • @Useless: Platform: Win7/8 x64, Filesystem: NTFS, Hardware: HDD, standard pc. – arc_lupus Aug 20 '13 at 14:42
  • 1
    @georgesl: Still, you can read a file in other (more convoluted) ways faster than with `std::iostream` if this proves to be a bottleneck. A combination of `fopen/fread` with a manual implementation of `getline` and `strtol` can be faster than iostreams. – David Rodríguez - dribeas Aug 20 '13 at 14:43
  • @georgesl: The problem with using optimized storage subsystems is, that I get these files from an external program I can not modify... – arc_lupus Aug 20 '13 at 14:43
  • 3
    @arc_lupus you should use `while (std::getline(...))` and then do something with the read-in line inside your loop – Tony The Lion Aug 20 '13 at 14:44
  • @TonyTheLion: Ok, I will try this. Why exactly is this faster? – arc_lupus Aug 20 '13 at 14:45
  • 3
    @arc_lupus It has nothing to do with speed, but with correctness. Your code is plain wrong. – Tony The Lion Aug 20 '13 at 14:47
  • what speed you achieved? how many ms and average file size? – evilruff Aug 20 '13 at 14:49
  • @arc_lupus : in that case do not get high hopes of radically speeding your process with software optimizations alone. – lucasg Aug 20 '13 at 14:52
  • There's a slight performance increase for `while` loops over `for` loops. `for (int i=0; i < insanely_huge_number; i++) {...}` is slower than `int i = 0; while (i < insanely_huge_number) {... i++;}` On my system at least. – Wolfgang Skyler Aug 20 '13 at 14:54
  • @DavidRodríguez-dribeas : I agree. When dealing with filesystems on embedded devices, I did use a double buffer on RAM ( to make the hdd work at nearly 100% ). – lucasg Aug 20 '13 at 14:59
  • @evilruff: Speed with only one thread: In average 8.5 seconds for 800 files with 256 lines and in average 20 chars. Speed with 20 (not more because of the processing routines in the background) threads: In average 4 seconds for the same files. – arc_lupus Aug 20 '13 at 15:21
  • 2
    When you profiled your code, where did you find the bulk of the time was being spent? – Johnsyweb Aug 20 '13 at 15:24
  • @Johnsyweb: By measuring at first the whole function and then excluding every part, and looking for the part which takes most time for computing. The part for reading the files took most of the time. – arc_lupus Aug 20 '13 at 15:30
  • You may want to look into memory-mapped files, too. Though if your files are all pretty small, it may not help that much. – Cornstalks Aug 20 '13 at 16:09
  • That's an interesting way to profile! Were you transforming and writing a similar amount of data? Was your reading [CPU-bound or I/O-bound](http://stackoverflow.com/q/868568/78845)? – Johnsyweb Aug 20 '13 at 21:30
  • 1
    Some people, when confronted with a problem, think "I know, I'll use threads." Now they have two problems. – Johnsyweb Aug 20 '13 at 21:46

4 Answers

8

Some experiment:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>

void generateFiles(int n) {
    char fileName[32];
    char fileStr[1032];

    for (int i=0;i<n;i++) {
        sprintf( fileName, "c:\\t\\%i.txt", i );
        FILE * f = fopen( fileName, "w" );
        for (int j=0;j<256;j++) {
            int lineLen = rand() % 1024;
            memset(fileStr, 'X', lineLen );
            fileStr[lineLen] = 0x0D;
            fileStr[lineLen+1] = 0x0A;
            fileStr[lineLen+2] = 0x00;
            fwrite( fileStr, 1, lineLen+2, f );         
        }
        fclose(f);
    }
}

void readFiles(int n) {
    char fileName[32];

    for (int i=0;i<n;i++) {
        sprintf( fileName, "c:\\t\\%i.txt", i );
        FILE * f = fopen( fileName, "r" );
        fseek(f, 0L, SEEK_END);
        int size = ftell(f);
        fseek(f, 0L, SEEK_SET);
        char * data = (char*)malloc(size);
        fread(data, size, 1, f);
        free(data);
        fclose(f);
    }   
}

DWORD WINAPI readInThread( LPVOID lpParam ) 
{ 
    int * number = (int *)lpParam;
    char fileName[32];

    sprintf( fileName, "c:\\t\\%i.txt", *number );
    FILE * f = fopen( fileName, "r" );
    fseek(f, 0L, SEEK_END);
    int size = ftell(f);
    fseek(f, 0L, SEEK_SET);
    char * data = (char*)malloc(size);
    fread(data, size, 1, f);
    free(data);
    fclose(f);

    return 0; 
} 


int main(int argc, char ** argv) {
    long t1 = GetTickCount();
    generateFiles(256);
    printf("Write: %li ms\n", GetTickCount() - t1 );

    t1 = GetTickCount();
    readFiles(256);
    printf("Read: %li ms\n", GetTickCount() - t1 );

    t1 = GetTickCount();

    const int MAX_THREADS = 256;

    int     pDataArray[MAX_THREADS];
    DWORD   dwThreadIdArray[MAX_THREADS];
    HANDLE  hThreadArray[MAX_THREADS]; 

    for( int i=0; i<MAX_THREADS; i++ )
    {

        pDataArray[i] = i;   // index of the file this thread will read

        hThreadArray[i] = CreateThread( 
            NULL,                   // default security attributes
            0,                      // default stack size
            readInThread,           // thread function
            &pDataArray[i],         // argument passed to the thread
            0,                      // run immediately
            &dwThreadIdArray[i]);   // receives the thread identifier
    } 

    // WaitForMultipleObjects can wait on at most MAXIMUM_WAIT_OBJECTS (64)
    // handles at once, so wait for the threads in batches.
    for (int i = 0; i < MAX_THREADS; i += MAXIMUM_WAIT_OBJECTS)
    {
        int count = (MAX_THREADS - i < MAXIMUM_WAIT_OBJECTS) ? (MAX_THREADS - i) : MAXIMUM_WAIT_OBJECTS;
        WaitForMultipleObjects(count, &hThreadArray[i], TRUE, INFINITE);
    }

    printf("Read (threaded): %li ms\n", GetTickCount() - t1 );

}

The first function is just an ugly way to make a test dataset (I know it can be done much better, but I honestly have no time).

1st experiment: sequential read; 2nd experiment: read all files in parallel

results:

256 files:

Write: 250 ms
Read: 140 ms
Read (threaded): 78 ms

1024 files:

Write: 1250 ms
Read: 547 ms
Read (threaded): 843 ms

I think the second attempt clearly shows that in the long run 'dumb' thread creation just makes things even worse. Of course it needs improvements in the sense of preallocated workers, some thread pool etc., but I think that with such a fast operation as reading 100-200k from disk there is no real benefit in moving this functionality into a thread. I have no time to write a more 'clever' solution, but I have my doubts that it will be much faster, because you will have to add system calls for mutexes etc...
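As an illustration of the pre-spawned worker idea, a rough sketch using C++11 std::thread (rather than the raw Win32 calls above) could look like this; the folder path mirrors the test folder used here, and the fixed worker count is just an assumption to be tuned:

#include <atomic>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Fixed pool of workers: each worker repeatedly claims the next file index
// from a shared atomic counter and reads that file, so no per-file thread
// creation cost is paid.
void readAllFiles(int fileCount, int workerCount)
{
    std::atomic<int> next(0);                       // test files are 0.txt .. (n-1).txt
    std::vector<std::thread> workers;

    for (int w = 0; w < workerCount; ++w) {
        workers.emplace_back([&]() {
            for (int i = next.fetch_add(1); i < fileCount; i = next.fetch_add(1)) {
                std::string name = "c:\\t\\" + std::to_string(i) + ".txt";
                FILE* f = std::fopen(name.c_str(), "rb");
                if (!f) continue;
                std::fseek(f, 0, SEEK_END);
                long size = std::ftell(f);
                std::fseek(f, 0, SEEK_SET);
                std::vector<char> data(size > 0 ? static_cast<size_t>(size) : 1);
                std::fread(data.data(), 1, data.size(), f);
                std::fclose(f);
                // ... process the file contents here ...
            }
        });
    }
    for (std::thread& t : workers) t.join();
}

Whether this beats a plain sequential read still depends on the disk; on a single spinning disk the sequential version may well stay faster, as the numbers above suggest.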

Going to the extreme you could think of preallocating memory pools etc., but as mentioned before, the code you posted is just wrong. It's a matter of milliseconds, but for sure not seconds.

800 files (20 chars per line, 256 lines)

Write: 250 ms
Read: 63 ms
Read (threaded): 500 ms

Conclusion:

ANSWER IS:

Your reading code is wrong; you are reading the files so slowly that there is a significant increase in speed when you make the tasks run in parallel. In the code above, reading is actually faster than the expense of spawning a thread.
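For reference, the loop from the question becomes correct once the result of std::getline is tested directly instead of calling eof() before reading, as the comments above already point out:

std::ifstream datain(CCD_Filename);
std::string inputline;
while (std::getline(datain, inputline))
{
    // Processing of inputline
}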

evilruff
  • 3,849
  • 1
  • 13
  • 27
  • But why is my code faster when I am using multithreading, then? 200 files, 256 lines: single thread: 1.5 seconds on average, multithreaded: 0.8 seconds on average; 800 files, 256 lines: single thread: 8.5 seconds on average, multithreaded: 4 seconds on average... – arc_lupus Aug 20 '13 at 15:28
  • 1
    because your code is WRONG; you are reading files in SUCH A SLOW, WRONG way that there is a significant increase in speed when you make the tasks run in parallel. In my code, reading is actually faster than the expense of spawning a thread – evilruff Aug 20 '13 at 15:31
  • My code is wrong because I use `.eof()`? Or where else is my code wrong, too? – arc_lupus Aug 20 '13 at 15:32
  • 1
    your code is wrong for a lot of reasons.. you should learn about direct block reads if you want to achieve maximum performance rather than using the iostream layer. If you want to go even faster than fopen(), fread() etc., look at direct WIN32 API calls to remove the extra wrapper. – evilruff Aug 20 '13 at 15:34
  • Ok, then I will read a bit about block direct read and stuff like that. I did not know that before, so thank you very much for telling me that! – arc_lupus Aug 20 '13 at 15:38
  • My past findings agree with those here. Whenever I'm doing tasks like this, I usually have two threads for reading files, and two more for processing. Additional threads do not usually improve performance. – Mooing Duck Aug 20 '13 at 19:25
4

Your primary bottleneck is physically reading from the hard disk.

Unless you have the files on separate drives, the drive can only read data from one file at a time. Your best bet is to read each file as a whole rather than read a portion of one file, tell the drive to seek to another file, read from there, and repeat. Repositioning the drive head to other locations, especially in other files, is usually more expensive than letting the drive finish reading the single file.
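As an illustration, one way to pull each file in with a single read using standard C++ (a minimal sketch; error handling is omitted):

#include <fstream>
#include <string>

// Read the whole file into one string instead of line by line.
std::string readWholeFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::string contents;
    in.seekg(0, std::ios::end);
    contents.resize(static_cast<std::size_t>(in.tellg()));
    in.seekg(0, std::ios::beg);
    in.read(&contents[0], static_cast<std::streamsize>(contents.size()));
    return contents;
}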

The next bottleneck is the data channel between the processor and the hard drive. If your hard drives share any kind of communications channel, you will see a bottleneck, as data from each drive must come through the communications channel to your processor. Your processor will be sending commands to the drive(s) through this communications channel (PATA, SATA, USB, etc.).

The objective of the next steps is to reduce the overhead of the "middle men" between your program's memory and the hard drive communications interface. The most efficient approach is to access the controller directly; less efficient are the OS functions, then the "C" functions (fread and family), and least efficient are the C++ streams. With increased efficiency comes tighter coupling with the platform and reduced safety (and simplicity).
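For example, the "OS functions" level on Windows means CreateFile/ReadFile; a minimal sketch (error handling is reduced to early returns, and the sequential-scan hint is an optional flag):

#include <Windows.h>
#include <vector>

// Read a whole file through the Win32 API, skipping the C/C++ runtime layers.
std::vector<char> readWithWin32(const char* path)
{
    std::vector<char> data;
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return data;

    LARGE_INTEGER size;
    if (GetFileSizeEx(h, &size) && size.QuadPart > 0) {
        data.resize(static_cast<size_t>(size.QuadPart));
        DWORD bytesRead = 0;
        ReadFile(h, data.data(), static_cast<DWORD>(data.size()), &bytesRead, NULL);
        data.resize(bytesRead);
    }
    CloseHandle(h);
    return data;
}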

I suggest the following:

  1. Create multiple buffers in memory, large enough to save time, small enough to prevent the OS from paging the memory to the hard drive.
  2. Create a thread that reads the files into memory, as necessary. Search the web for "double buffering"; a minimal sketch of this producer/consumer hand-off follows this list. As long as there is space in the buffer, this thread will read data.
  3. Create multiple "outgoing" buffers.
  4. Create a second thread that removes data from memory and "processes" it, and inserts into the "outgoing" buffers.
  5. Create a third thread that takes the data in the "outgoing" buffers and sends to the databases.
  6. Adjust the size of the buffers for the best efficiency within the limitations of memory.
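A minimal sketch of the hand-off between the reading thread and the processing thread described above (the class name, the std::string payload, and the fixed capacity are illustrative choices, not requirements):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Bounded queue: the reading thread blocks when the buffer is full, the
// processing thread blocks when it is empty ("double buffering", generalised).
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(std::string item)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push(std::move(item));
        notEmpty_.notify_one();
    }

    std::string pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !queue_.empty(); });
        std::string item = std::move(queue_.front());
        queue_.pop();
        notFull_.notify_one();
        return item;
    }

private:
    std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
    std::queue<std::string> queue_;
};

One such queue sits between the reading thread and the processing thread, and a second one between processing and the database writer; an agreed-upon sentinel value (for example an empty string) can signal that no more data will arrive.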

If you can access the DMA channels, use them to read from the hard drive into the "read buffers".

Next, you can optimize your code to use the processor's data cache efficiently. For example, set up your "processing" so the data structures do not exceed a cache line. Also, optimize your code to use registers (either specify the register keyword or use statement blocks so that the compiler knows when variables can be reused).

Other optimizations that may help:

  • Align data to the processor's native word size, pad if necessary. For example, prefer using 32 bytes instead of 13 or 24.
  • Fetch data in quantities of the processor's word size. For example, access 4 octets (bytes) at a time on a 32-bit processor rather than 4 accesses of 1 byte.
  • Unroll loops: put more work inside each iteration, since branch instructions slow down processing (a small illustration follows).
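A small illustration of the last point: a 4-way unrolled accumulation loop (note that optimizing compilers often do this on their own, so measure before and after):

// Sum an array with the loop body unrolled four times to reduce the number
// of branches taken per element.
int sumUnrolled(const int* data, int n)
{
    int sum = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        sum += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    for (; i < n; ++i)          // handle the remaining elements
        sum += data[i];
    return sum;
}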
Thomas Matthews
  • 52,985
  • 12
  • 85
  • 144
1

You are probably hitting the read limit of your disks, which means your options are somewhat limited. If this is a constant problem you could consider a different RAID structure, which will give you greater read throughput because more than one drive can serve reads at the same time.

To see if disk access really is the bottleneck, run your program with the time command:

>> /usr/bin/time -v <my program>

In the output you'll see how much CPU time you were utilizing compared to the amount of time required for things like disk access.

guyrt
  • 917
  • 7
  • 12
  • Depending on your Windows version, there are alternatives: http://stackoverflow.com/questions/673523/how-to-measure-execution-time-of-command-in-windows-command-line – guyrt Aug 20 '13 at 14:53
1

I would try going with C code for reading the file. I suspect that it'll be faster.

FILE* f = ::fopen( CCD_Filename.c_str(), "rb" );
if( f == NULL )
{
    return;
}

::fseek( f, 0, SEEK_END );
const long lFileBytes = ::ftell( f );
::fseek( f, 0, SEEK_SET );

char* fileContents = new char[lFileBytes + 1];
const size_t numObjectsRead = ::fread( fileContents, lFileBytes, 1, f );
::fclose( f );

if( numObjectsRead < 1 )
{
    delete [] fileContents;
    return;
}

fileContents[lFileBytes] = '\0';

// assign char buffer of file contents here

delete [] fileContents;
Paul Dardeau
  • 2,381
  • 1
  • 10
  • 9
  • 3
    -1 'Suspecting' that the C API is principally faster than the C++ API is a shaky ground for an optimization. – ComicSansMS Aug 20 '13 at 15:43
  • I 'suspect' it's faster because my solution pre-allocates a buffer large enough to accommodate the full size of the file. It appears that the poster's solution does not. My answer is not pure speculation. – Paul Dardeau Aug 20 '13 at 16:14
  • 1
    @PaulDardeau: You could do the same thing using `fstream`. There's no need for C in this C++ code. – Cornstalks Aug 20 '13 at 17:40