I have a requirement to read tab-separated data from a file and put it in a map under particular keys. Right now I split each line into a String[] and put the values into a HashMap. Sample code showing how I do it is below:

while ((line = bufferedReader.readLine()) != null) {
    String[] data = line.split("\t");
    myHashMap = new HashMap<String, String>();
    for (int i = 0; i < data.length; i++) {
        myHashMap.put("<key>" + i, data[i]);
    }
    // <additional logic>
}

The problem is that I have to read almost 2500 files, each with around 1000 lines. This results in high memory usage and performance issues. As far as I know, Strings are always costly in terms of memory, so I am looking for a better approach to this requirement.

Thanks for your help.

justcurious
  • Use `int` instead of `""+int`? – Thomas Weller Jul 17 '15 at 11:47
  • If you regard your input as just a collection of text fields, then the best you can do is to store that text. Can the text fields be interpreted as having a meaning and thus a more compact representation? Are any of them numbers, or drawn from a small set of values? – Raedwald Jul 17 '15 at 11:49
  • Why does the number of files matter for memory usage? You are processing the files one by one anyway. – Manu Jul 17 '15 at 11:49
  • Take a look at [this](http://stackoverflow.com/questions/5965767/performance-of-stringtokenizer-class-vs-split-method-in-java) previous SO post. It should do what you are after. – npinti Jul 17 '15 at 11:53
  • You can encode them with an 8-bit encoding (Unicode 8-bit, ASCII), but they will lose functionality in Java that way. You can compress each string with some other kind of encoding, and decompress them whenever you want to use them. But your "high memory usage and performance" and "always costlier" statements sound vague, and my guess is that you are doing what is called "premature optimization"; that is, you are attempting to solve a problem that has not been defined well enough and may not even exist. – arcy Jul 17 '15 at 11:54
  • @Thomas Apologies if that created confusion. I just mocked that up to show some key logic in the example. In the real application, keys are decided based on different conditions and are essentially plain Strings, without any concatenation with an integer. – justcurious Jul 17 '15 at 11:55
  • Like @Manu said, if you are having trouble maybe you missed something, are you closing the file/buffer? – Rodolfo Jul 17 '15 at 11:55
  • @npinti, Raedwald : The sample file can be considered as having these data : ROLE(String) PERSONNAME(String) SALARY(Double) along with many other data. – justcurious Jul 17 '15 at 11:57
  • I guess `myHashMap` is defined outside the while loop. Please provide a real minimum example that demonstrates the problem, not some pseudo code where we need to guess what you're actually doing. – Thomas Weller Jul 17 '15 at 11:59
  • @Manu: I also had the impression that processing files one by one should not impact memory. But if I profile my application, the size of char[] keeps growing file after file. And yes, I am closing the file and buffer. – justcurious Jul 17 '15 at 12:00

2 Answers

Do you really need to store your information into a Map?

If the answer is no you can store it in a list.

while ((line = bufferedReader.readLine()) != null) {
    String[] data = line.split("\t");
    // Pre-size the list to the number of fields on this line.
    List<String> myList = new ArrayList<String>(data.length);
    for (int i = 0; i < data.length; i++) {
        myList.add(data[i]);
    }
    // To get an element, instead of using:
    //     myHashMap.get("<key>" + i)
    // use:
    //     myList.get(i)
    // <additional logic>
}

I'm assuming you are closing your file/buffer. The same list can be built more briefly with Arrays.asList:

while ((line = bufferedReader.readLine()) != null) {
    // Arrays.asList wraps the array in a fixed-size list without copying it.
    List<String> myList = Arrays.asList(line.split("\t"));
    // To get an element, instead of using:
    //     myHashMap.get("<key>" + i)
    // use:
    //     myList.get(i)
    // <additional logic>
}

Unless using a different key in the hash map really improves your access performance, you should just use the index into your array.

while ((line = bufferedReader.readLine()) != null) {
    String[] data = line.split("\t");
    // To get an element, instead of using:
    //     myHashMap.get("<key>" + i)
    // use:
    //     data[i]
    // <additional logic>
}
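
As a rough, self-contained sketch of the index-based approach: the class name, the column constants, and the column order (ROLE, PERSONNAME, SALARY, taken from the asker's comment) are assumptions made for illustration, and the sample data is invented. Each field is read by its position in the array, so no per-line map is allocated.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class TsvIndexAccess {
    // Hypothetical column positions; the real file layout may differ.
    static final int ROLE = 0;
    static final int PERSON_NAME = 1;
    static final int SALARY = 2;

    // Sums the SALARY column, reading each field by array index
    // instead of looking it up in a per-line HashMap.
    static double totalSalary(BufferedReader reader) {
        double total = 0.0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] data = line.split("\t");
                total += Double.parseDouble(data[SALARY]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Invented sample data standing in for one input file.
        String sample = "ENGINEER\tAlice\t1000.5\nMANAGER\tBob\t2000.0\n";
        BufferedReader reader = new BufferedReader(new StringReader(sample));
        System.out.println(totalSalary(reader));
    }
}
```

Named constants keep the readability that string keys would give, while the data itself stays in the String[] that split already produced.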
Rodolfo
  • This won't even compile, as ArrayList does not have two type parameters. Also it's the same as Arrays.asList, which would be better. Also, there's no advantage compared to the String[] in this scenario. – Joeri Hendrickx Jul 17 '15 at 12:22
  • @JoeriHendrickx My bad, I copied and pasted the code and missed the two parameters in the generics. – Rodolfo Jul 17 '15 at 12:29

If you have a problem with a large number of Strings (by your estimates you'll read about 2.5 million lines, so you probably will), read up on String interning and set -XX:StringTableSize=N accordingly.
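
A minimal sketch of what interning could look like here, assuming many field values (for example a role name) repeat across lines. The class and method names are hypothetical and the sample lines are invented. String.intern() returns one canonical instance per distinct value, so equal fields from different lines end up sharing a single object; -XX:StringTableSize sizes the JVM table that backs those canonical instances.

```java
public class InternFields {
    // Splits a tab-separated line and replaces each field with its
    // canonical interned instance, so repeated values are shared.
    static String[] splitAndIntern(String line) {
        String[] data = line.split("\t");
        for (int i = 0; i < data.length; i++) {
            data[i] = data[i].intern();
        }
        return data;
    }

    public static void main(String[] args) {
        // Invented sample lines; the first field repeats.
        String[] first = splitAndIntern("ENGINEER\tAlice");
        String[] second = splitAndIntern("ENGINEER\tBob");
        // Reference comparison: checks whether both lines share one String object.
        System.out.println(first[0] == second[0]);
    }
}
```

Whether this saves memory depends on how often values actually repeat; for mostly unique fields such as person names, interning adds lookup cost without much benefit, so it may be worth interning only the columns known to come from a small set of values.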

I don't understand why you would like to use a HashMap effectively as an array, but if you really want to do that, make sure you initialize it with approximately the right initialCapacity and loadFactor.
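
If the map is kept, a sketch of pre-sizing it might look like the following. The class name, the `toMap` helper, and the key names are assumptions for illustration. The initial capacity is chosen so that the expected number of entries fits under the load factor, which avoids rehashing while the map is filled.

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    // 0.75 is HashMap's default load factor; it is spelled out here for clarity.
    static final float LOAD_FACTOR = 0.75f;

    // Builds a map from hypothetical column keys to the fields of one line,
    // sized up front so that inserting every entry does not trigger a resize.
    static Map<String, String> toMap(String[] keys, String[] data) {
        int count = Math.min(keys.length, data.length);
        int initialCapacity = (int) Math.ceil(count / LOAD_FACTOR);
        Map<String, String> map = new HashMap<String, String>(initialCapacity, LOAD_FACTOR);
        for (int i = 0; i < count; i++) {
            map.put(keys[i], data[i]);
        }
        return map;
    }

    public static void main(String[] args) {
        // Assumed key names and invented sample data.
        String[] keys = {"ROLE", "PERSONNAME", "SALARY"};
        String[] data = "ENGINEER\tAlice\t1000.5".split("\t");
        System.out.println(toMap(keys, data).get("SALARY"));
    }
}
```

With only a handful of fields per line the gain from pre-sizing is small; it matters more when a single map holds many entries.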

claj