
For a university project, I have to sort a CSV file of 20 million records (the values fit in 64 bits, for example 10000000 or 7000000, so I used `unsigned long long`) using MergeSort. So, I wrote this C file:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

// Path to the dataset
#define DATASET_PATH "/Volumes/HDD/Lorenzo/Unito/2 Anno/ASD/Progetto/Progetto 2017-2018/laboratorio-algoritmi-2017-18/Datasets/ex1/integers.csv"
#define ELEMENTS_TO_SCAN 1000000 // the numbers of elements to be scanned

void mergeSort(unsigned long long * arrayToSort, int leftIndex, int rightIndex);
void merge(unsigned long long * arrayToSort, int left, int center, int right);
void read(char pathToDataset[], unsigned long long arrayToFill[]);
void printArray(unsigned long long * arrayToPrint, int arrayLength);

// from "Introduction to Algorithms" of T. H. Cormen
void mergeSort(unsigned long long * arrayToSort, int leftIndex, int rightIndex){
    if(leftIndex < rightIndex){
        int center = (leftIndex + rightIndex) / 2;
        mergeSort(arrayToSort, leftIndex, center);
        mergeSort(arrayToSort, center + 1, rightIndex);
        merge(arrayToSort, leftIndex, center, rightIndex);
    }
}

// from "Introduction to Algorithms" of T. H. Cormen
void merge(unsigned long long * arrayToSort, int left, int center, int right){
    int n1 = center - left + 1;
    int n2 = right - center; 

    unsigned long long leftSubArray[n1+1];
    unsigned long long rightSubArray[n2+1];

    leftSubArray[n1] = ULLONG_MAX; // here Cormen use infinite
    rightSubArray[n2] = ULLONG_MAX; // here Cormen use infinite

    for(int i = 0; i < n1; i++)
        leftSubArray[i] = arrayToSort[left + i];
    for(int j = 0; j < n2; j++)
        rightSubArray[j] = arrayToSort[center + j + 1];

    int i = 0;
    int j = 0;
    int k = 0;

    for(k = left; k <= right; k++){
        if(leftSubArray[i] <= rightSubArray[j]){
            arrayToSort[k] = leftSubArray[i];
            i++;
        } else {
            arrayToSort[k] = rightSubArray[j];
            j++;
        }
    }
}

// it reads all the dataset, and saves every line (which contains a single element)
// in a position of an array to sort by MergeSort.
void read(char pathToDataset[], unsigned long long arrayToFill[]) {
    FILE* dataset = fopen(pathToDataset, "r");
    if(dataset == NULL ) { 
        printf("Error while opening the file.\n");
        exit(EXIT_FAILURE); // failure, it closes the program
    }

    int i = 0;
    while (i < ELEMENTS_TO_SCAN && fscanf(dataset, "%llu", &arrayToFill[i]) == 1) { 
        //printf("%llu\n", arrayToFill[i]); // ONLY FOR DEBUG, it will print 20 million lines!
        i++;
    }
    printf("\nRead %d lines.\n", i); 
    fclose(dataset);
}

void printArray(unsigned long long * arrayToPrint, int arrayLength){
    printf("[");
    for(int i = 0; i < arrayLength; i++) {
        if (i == arrayLength-1) {
        printf("%llu]", arrayToPrint[i]);
        }
        else {
            printf("%llu, ", arrayToPrint[i]);
        }
    }
}

int main() {
    unsigned long long toSort [ELEMENTS_TO_SCAN] = {0};
    read(DATASET_PATH, toSort);

    mergeSort(toSort,0,ELEMENTS_TO_SCAN-1);
    printf("Merge finished\n");

    return 0;
}

After some testing: if ELEMENTS_TO_SCAN is bigger than 500000 (1/40 of 20 million), I don't know why, but the output on the terminal is

Segmentation fault: 11

Can someone help me?

  • Local variables (including arrays) are usually stored on the stack. The stack is a limited resource; on Linux it's by default 8 MiB per process. Your array in the `main` function is eight million bytes, *very* close to the limit on Linux. Add a couple more variables and a few function calls (which are also handled by the stack) and you will quite quickly run out. – Some programmer dude Oct 12 '18 at 10:03
  • @Someprogrammerdude ok, so is there something I can do to make this program work? – Lorenzo Tabasso Oct 12 '18 at 10:04
  • Surely with mergesort the idea is to sort sections of the file, then merge the sections? The final merge can surely be done file-to-file? – Gem Taylor Oct 12 '18 at 10:05
  • Don't put large arrays on the stack? Either use global variables (bad idea) or use dynamic memory allocation. – Some programmer dude Oct 12 '18 at 10:05
  • Dynamic allocation with malloc is the right thing to do anyway. – Gem Taylor Oct 12 '18 at 10:06
  • Or split the data into smaller chunks; don't read all of the data into memory immediately. Or use memory-mapped files to sort in place. – Some programmer dude Oct 12 '18 at 10:06

2 Answers


You're declaring a large array as a local variable (i.e., on the stack). If you're dealing with larger arrays, consider making them global, or use dynamic allocation; in general, dynamic allocation is better, since globals make it easy to get into bad habits.
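A minimal sketch of the dynamic option (the helper name `allocate_sort_buffer` is made up for illustration, not from the question): allocate on the heap, and check `malloc`'s result instead of crashing later:

```c
#include <stdio.h>
#include <stdlib.h>

// Allocate a sort buffer on the heap instead of declaring it as a
// local array; returns NULL (after printing a message) on failure.
// The caller must free() the returned pointer.
unsigned long long *allocate_sort_buffer(size_t count) {
    unsigned long long *buffer = malloc(count * sizeof *buffer);
    if (buffer == NULL)
        fprintf(stderr, "Could not allocate %zu elements.\n", count);
    return buffer;
}
```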

Why are global variables bad, in a single threaded, non-os, embedded application

Segmentation fault 11 because of a 40 MB array in C

James C.

As people have pointed out, an allocation this large can't be done on the stack. I would allocate it dynamically; for that you just need to change the code like so:

int main() {
    unsigned long long *toSort;
    toSort = (unsigned long long *) malloc(ELEMENTS_TO_SCAN*sizeof(unsigned long long));
    read(DATASET_PATH, toSort);

    mergeSort(toSort,0,ELEMENTS_TO_SCAN-1);
    printf("Merge finished\n");

    free(toSort);
    return 0;
}

As you pointed out, the merge is the one causing problems. Just to note, if you use things like:

int array[n];

You will run into problems eventually; that's a given. If you don't know at compile time how much memory you will use, either use a data structure that supports resizing, like a linked list, or allocate the memory dynamically.
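Since the question's `merge` also builds its two temporary sub-arrays as VLAs on the stack, the same fix applies there. A sketch with the temporaries moved to the heap (same logic as the question's `merge`, otherwise unchanged):

```c
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

// Same merge as in the question, but the temporary sub-arrays are
// heap-allocated, so their size no longer depends on the stack limit.
void merge(unsigned long long *arrayToSort, int left, int center, int right) {
    int n1 = center - left + 1;
    int n2 = right - center;

    unsigned long long *leftSub  = malloc((n1 + 1) * sizeof *leftSub);
    unsigned long long *rightSub = malloc((n2 + 1) * sizeof *rightSub);
    if (leftSub == NULL || rightSub == NULL) {
        free(leftSub);
        free(rightSub);
        fprintf(stderr, "Out of memory in merge.\n");
        exit(EXIT_FAILURE);
    }

    for (int i = 0; i < n1; i++)
        leftSub[i] = arrayToSort[left + i];
    for (int j = 0; j < n2; j++)
        rightSub[j] = arrayToSort[center + 1 + j];
    leftSub[n1]  = ULLONG_MAX; // sentinels, as in the Cormen version
    rightSub[n2] = ULLONG_MAX;

    int i = 0, j = 0;
    for (int k = left; k <= right; k++) {
        if (leftSub[i] <= rightSub[j])
            arrayToSort[k] = leftSub[i++];
        else
            arrayToSort[k] = rightSub[j++];
    }

    free(leftSub);
    free(rightSub);
}
```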

João Areias
  • I just tested it out, but with 10000000 elements (I have to sort 2 × 10000000 elements) it still gives me Segmentation Fault: 11 – Lorenzo Tabasso Oct 12 '18 at 10:15
  • Try using a linked list then; it's probably the right way to go. I posted a link for it in my answer – João Areias Oct 12 '18 at 10:18
  • Using `malloc` is a good approach; however, the return value must be checked: if `malloc` can't satisfy the request, it will return NULL (hence the seg fault when the result is used without having been checked). – Paul Ogilvie Oct 12 '18 at 10:45
  • @JoãoAreias, I don't see why a linked list would consume less memory than an array. In fact, it won't. – Paul Ogilvie Oct 12 '18 at 10:46
  • It doesn't use less memory but it breaks down the array into smaller chunks. Instead of trying to allocate one single chunk of memory for the vector you allocate smaller chunks for each node. – João Areias Oct 12 '18 at 10:50
  • While malloc will give you an array that must be, or at least look like it is, contiguous, the linked list will be scattered all throughout memory in smaller pieces – João Areias Oct 12 '18 at 10:54
  • @JoãoAreias, assuming that still all elements are read into memory, a linked list will have the overhead of the pointer to the next element, so effectively consuming _twice_ (1.5x on 32-bit platforms) the memory of an array. So if `malloc` can't get the memory, a linked list will by far not get the memory. – Paul Ogilvie Oct 12 '18 at 10:56
  • It doesn't look to me like the problem he is having is not enough memory; it seems the problem is how much memory the OS will let you allocate at once. The linked list does use more memory, and it does have a performance overhead too, but it solves the issue of not being able to allocate one large block even though you have enough memory in total. – João Areias Oct 12 '18 at 10:59
  • @JoãoAreias, I have never heard of that (the OS not allowing you to allocate that much memory). I consider it wrong. – Paul Ogilvie Oct 12 '18 at 11:06
  • Correct me if I'm wrong, but when you run malloc, you actually do a system call in user mode; the OS tries to allocate the amount of memory you requested in kernel mode and returns a pointer to the beginning of the block, or an error if it's not able to find enough free space. If you are trying to allocate one single monolithic block, it may not find space where it would fit contiguously. – João Areias Oct 12 '18 at 11:12
  • Wrong. It is called virtual memory, so there is always enough space (up to the addressable space), and it is contiguous. – Paul Ogilvie Oct 12 '18 at 11:27
  • Maybe the problem is the sub-arrays that the merge function creates on the stack? I did some debugging, and I see that the program perfectly reads all the records (20000000), but while merging the sub-arrays I think it goes into stack overflow, and this is the cause of Segmentation Fault: 11 – Lorenzo Tabasso Oct 12 '18 at 11:58
  • @PaulOgilvie Sorry, my mistake, I knew about virtual memory but misunderstood how it treats the allocation I've removed that part from my answer. – João Areias Oct 12 '18 at 12:06
  • @LorenzoTabasso it could be, use dynamic memory allocation there too – João Areias Oct 12 '18 at 12:06
  • You still refer to linked lists. Don't do that, as explained earlier. – Paul Ogilvie Oct 12 '18 at 12:34
  • I referred to it not as a solution to the memory consumption but to the variable-sized array; it does add an overhead, but it is quite useful sometimes – João Areias Oct 12 '18 at 12:53