18

I've got a text file that contains 1 000 002 numbers in the following format:

123 456
1 2 3 4 5 6 .... 999999 100000

Now I need to read that data and assign the first two numbers to int variables, and the remaining 1 000 000 numbers to an int[] array.

It's not a hard task, but it's horribly slow.

My first attempt was java.util.Scanner:

Scanner stdin = new Scanner(new File("./path"));
int n = stdin.nextInt();
int t = stdin.nextInt();
int array[] = new int[n];

for (int i = 0; i < n; i++) {
    array[i] = stdin.nextInt();
}

It works as expected, but it takes about 7500 ms to execute, and I need to fetch that data in at most a few hundred milliseconds.

Then I tried java.io.BufferedReader:

Using BufferedReader.readLine() and String.split() I got the same result in about 1700 ms, but that's still too slow.
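That attempt looked roughly like this (a reconstruction, since the original snippet wasn't included; the path and class name are placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SplitReader {
    public static int[] read(String path) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            // First line: the two header integers
            String[] header = br.readLine().split(" ");
            int n = Integer.parseInt(header[0]);
            int t = Integer.parseInt(header[1]);

            // Second line: n integers separated by single spaces
            String[] tokens = br.readLine().split(" ");
            int[] array = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = Integer.parseInt(tokens[i]);
            }
            return array;
        }
    }
}
```

The split() call is the expensive part here: it materializes a million Strings before any parsing happens.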

How can I read this amount of data in under 1 second? The final result should be equal to:

int n = 123;
int t = 456;
int array[] = { 1, 2, 3, 4, ..., 999999, 100000 };

Update, following trashgod's answer:

The StreamTokenizer solution is fast (about 1400 ms), but it's still too slow:

StreamTokenizer st = new StreamTokenizer(new FileReader("./test_grz"));
st.nextToken();
int n = (int) st.nval;

st.nextToken();
int t = (int) st.nval;

int array[] = new int[n];

for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
    array[i] = (int) st.nval;
}

PS. There is no need for validation. I'm 100% sure that the data in the ./test_grz file is correct.

Crozin
  • 41,538
  • 12
  • 84
  • 134
  • Why don't you store them in a LinkedList (if you're going to move them around in the list or sort them) or ArrayList (if you want random access), depending on how you're going to use them? That's a large volume of data, and I'm assuming you'll be using them later. – Humphrey Bogart Apr 22 '10 at 18:06
  • 1
    I've changed my question - it's only about reading from file. ;) I **don't need** any Collection - simple array is what I really need but the problem is how to populate this array with data from file in the fastest possible way. – Crozin Apr 22 '10 at 18:19
  • 2
    How much time is spent on the allocation of the large array (nearly 4MB) vs. the parsing? Can you allocate the array outside of that call? And assuming you're using Integer.parseInt, have you looked for any other libs that might have optimized integer parsing for base 10? This claims to: http://www.cs.ou.edu/~weaver/improvise/downloads/javadoc/oblivion/oblivion/util/NumberUtilities.html#parseInt%28java.lang.String%29 – NG. Apr 22 '10 at 18:26
  • Just throwing some ideas around - could you create a separate program to split the file in multiple files, and then use separate threads to read in the data, then combine the results ? Would make the whole design much more complex though. Not sure it'd be faster either. – JRL Apr 22 '10 at 19:03
  • @JRL: That is not possible. Everything must be done within this one program in the simplest, most primitive but fast way. – Crozin Apr 22 '10 at 19:10
  • @Crozin: since StreamTokenizer's `nVal` returns a double, have you tried using doubles instead of ints? – JRL Apr 22 '10 at 19:24
  • If I remove the integer cast I won't gain any performance in this case. Also, I think the few milliseconds I'd save here I would lose later (when I do some complex operations on this data). – Crozin Apr 22 '10 at 19:35

7 Answers

14

Thanks for all the answers, but I've already found a method that meets my criteria:

BufferedInputStream bis = new BufferedInputStream(new FileInputStream("./path"));
int n = readInt(bis);
int t = readInt(bis);
int array[] = new int[n];
for (int i = 0; i < n; i++) {
    array[i] = readInt(bis);
}

private static int readInt(InputStream in) throws IOException {
    int ret = 0;
    boolean dig = false;

    for (int c = 0; (c = in.read()) != -1; ) {
        if (c >= '0' && c <= '9') {
            dig = true;
            ret = ret * 10 + c - '0';
        } else if (dig) break;
    }

    return ret;
}

It takes only about 300 ms to read 1 million integers!

Crozin
  • 41,538
  • 12
  • 84
  • 134
  • what does your `int t` variable do? – Adam Johns Jun 27 '14 at 15:48
  • @AdamJohns Absolutely nothing, it's just a second number from a file (see format of the file from the question). `array` variable also does nothing. ;) – Crozin Jun 28 '14 at 06:36
  • Excellent. This was approx. 2 times faster than using `StringTokenizer` in my problem (reading 1 million integers, each up to 1 million). – jbarrameda Feb 28 '17 at 22:26
2

StreamTokenizer may be faster, as suggested here.

Community
  • 1
  • 1
trashgod
  • 196,350
  • 25
  • 213
  • 918
  • In fact StreamTokenizer seems to be the fastest solution so far (please check my question update). But it still needs about 1400 ms to read necessary data. – Crozin Apr 22 '10 at 18:59
  • Excellent. See also @Kevin Brock's informative answer: http://stackoverflow.com/questions/2693223/read-large-amount-of-data-from-file-in-java/2694507#2694507 – trashgod Apr 23 '10 at 03:00
2

You can reduce the time for the StreamTokenizer result by using a BufferedReader:

Reader r = null;
try {
    r = new BufferedReader(new FileReader(file));
    final StreamTokenizer st = new StreamTokenizer(r);
    ...
} finally {
    if (r != null)
        r.close();
}

Also, don't forget to close your files, as I've shown here.

You can also shave some more time off by using a custom tokenizer just for your purposes:

public class CustomTokenizer {

    private final Reader r;

    public CustomTokenizer(final Reader r) {
        this.r = r;
    }

    public int nextInt() throws IOException {
        int i = r.read();
        if (i == -1)
            throw new EOFException();

        char c = (char) i;

        // Skip any whitespace
        while (c == ' ' || c == '\n' || c == '\r') {
            i = r.read();
            if (i == -1)
                throw new EOFException();
            c = (char) i;
        }

        int result = (c - '0');
        while ((i = r.read()) >= 0) {
            c = (char) i;
            if (c == ' ' || c == '\n' || c == '\r')
                break;
            result = result * 10 + (c - '0');
        }

        return result;
    }

}

Remember to use a BufferedReader for this. This custom tokenizer assumes the input data is always completely valid and contains only spaces, new lines, and digits.

If you read these results a lot and those results do not change much, you should probably save the array and keep track of the last file modified time. Then, if the file has not changed just use the cached copy of the array and this will speed up the results significantly. For example:

public class ArrayRetriever {

    private File inputFile;
    private long lastModified;
    private int[] lastResult;

    public ArrayRetriever(File file) {
        this.inputFile = file;
    }

    public int[] getResult() {
        if (lastResult != null && inputFile.lastModified() == lastModified)
            return lastResult;

        lastModified = inputFile.lastModified();

        // do logic to actually read the file here

        lastResult = array; // the array variable from your examples
        return lastResult;
    }

}
Kevin Brock
  • 8,429
  • 1
  • 31
  • 37
1

How much memory do you have in the computer? You could be running into GC issues.

The best thing to do is to process the data one line at a time if possible. Don't load it into an array. Load what you need, process, write it out, and continue.

This will reduce your memory footprint and still use the same amount of file I/O.
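For instance, a sketch of the streaming idea (summing is just a stand-in for whatever per-value processing is actually needed; no array of the full data set is ever allocated):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

public class StreamingSum {
    // Consume tokens one at a time and fold each value into a running
    // total as it is parsed, instead of storing everything first.
    public static long sum(Reader input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new BufferedReader(input));
        long total = 0;
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            total += (long) st.nval;
        }
        return total;
    }
}
```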

Pyrolistical
  • 26,088
  • 21
  • 78
  • 104
  • It looks like his second line is one looong line that contains a million numbers.. – NG. Apr 22 '10 at 18:39
  • If my calculations are correct 1 mln of `int` costs me only 7 MB of memory - that's not so much. I just need to load that data from file to memory - I'll need that for some calculations that requires whole data to be loaded. – Crozin Apr 22 '10 at 18:45
1

If it's possible to reformat the input so that each integer is on a separate line (instead of one long line with a million integers), you should see much improved performance using Integer.parseInt(BufferedReader.readLine()), due to smarter buffering by line and not having to split one long string into a separate array of Strings.

Edit: I tested this and managed to read the output produced by seq 1 1000000 into an array of int in well under half a second, but of course this depends on the machine.
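A sketch of that approach (the reader and count would come from wherever you load the reformatted file; the class name is illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class LinePerIntReader {
    // Assumes the input has been reformatted to one integer per line,
    // e.g. the output of `seq 1 1000000`.
    public static int[] read(Reader input, int count) throws IOException {
        BufferedReader br = new BufferedReader(input);
        int[] array = new int[count];
        for (int i = 0; i < count; i++) {
            array[i] = Integer.parseInt(br.readLine());
        }
        return array;
    }
}
```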

Arkku
  • 37,604
  • 10
  • 57
  • 79
  • Unfortunately I cannot change file format. It has to be two integers separated by a single space in the first line and 1 mln of integers in the second line (also separated by a single space). – Crozin Apr 22 '10 at 19:04
0

I would extend FilterReader and parse the string as it is read in the read() method. Have a getNextNumber method return the numbers. Code left as an exercise for the reader.
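One possible shape of that exercise (a sketch; `NumberReader` and `getNextNumber` are made-up names, and it assumes valid, non-negative input):

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

public class NumberReader extends FilterReader {

    public NumberReader(Reader in) {
        super(new BufferedReader(in));
    }

    // Parses digits from the underlying stream until a non-digit
    // separator is hit; throws EOFException when no number is left.
    public int getNextNumber() throws IOException {
        int c;
        // Skip any separators before the number.
        while ((c = in.read()) != -1 && (c < '0' || c > '9')) { }
        if (c == -1)
            throw new EOFException();
        int value = c - '0';
        while ((c = in.read()) >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
        }
        return value;
    }
}
```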

Skip Head
  • 6,800
  • 1
  • 27
  • 34
0

Using a StreamTokenizer on a BufferedReader will give you quite good performance already. You shouldn't need to write your own readInt() function.

Here is the code I used to do some local performance testing:

import java.io.*;
import java.util.Scanner;

/**
 * Created by zhenhua.xu on 11/27/16.
 */
public class MyReader {

    private static final String FILE_NAME = "./1m_numbers.txt";
    private static final int n = 1000000;

    public static void main(String[] args) {
        try {
            readByScanner();
            readByStreamTokenizer();
            readByStreamTokenizerOnBufferedReader();
            readByBufferedInputStream();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void readByScanner() throws Exception {
        long startTime = System.currentTimeMillis();

        Scanner stdin = new Scanner(new File(FILE_NAME));
        int array[] = new int[n];
        for (int i = 0; i < n; i++) {
            array[i] = stdin.nextInt();
        }

        long endTime = System.currentTimeMillis();
        System.out.println(String.format("Total time by Scanner: %d ms", endTime - startTime));
    }

    public static void readByStreamTokenizer() throws Exception {
        long startTime = System.currentTimeMillis();

        StreamTokenizer st = new StreamTokenizer(new FileReader(FILE_NAME));
        int array[] = new int[n];

        for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
            array[i] = (int) st.nval;
        }

        long endTime = System.currentTimeMillis();
        System.out.println(String.format("Total time by StreamTokenizer: %d ms", endTime - startTime));
    }

    public static void readByStreamTokenizerOnBufferedReader() throws Exception {
        long startTime = System.currentTimeMillis();

        StreamTokenizer st = new StreamTokenizer(new BufferedReader(new FileReader(FILE_NAME)));
        int array[] = new int[n];

        for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
            array[i] = (int) st.nval;
        }

        long endTime = System.currentTimeMillis();
        System.out.println(String.format("Total time by StreamTokenizer with BufferedReader: %d ms", endTime - startTime));
    }

    public static void readByBufferedInputStream() throws Exception {
        long startTime = System.currentTimeMillis();

        BufferedInputStream bis = new BufferedInputStream(new FileInputStream(FILE_NAME));
        int array[] = new int[n];
        for (int i = 0; i < n; i++) {
            array[i] = readInt(bis);
        }

        long endTime = System.currentTimeMillis();
        System.out.println(String.format("Total time with BufferedInputStream: %d ms", endTime - startTime));
    }

    private static int readInt(InputStream in) throws IOException {
        int ret = 0;
        boolean dig = false;

        for (int c = 0; (c = in.read()) != -1; ) {
            if (c >= '0' && c <= '9') {
                dig = true;
                ret = ret * 10 + c - '0';
            } else if (dig) break;
        }

        return ret;
    }
}

Results I got:

  • Total time by Scanner: 789 ms
  • Total time by StreamTokenizer: 226 ms
  • Total time by StreamTokenizer with BufferedReader: 80 ms
  • Total time by BufferedInputStream: 95 ms