
What is a reasonable time to load a CSV file into a 2-dimensional array in memory, where the number of columns is fixed (406) and the number of rows is about 87,000? In Perl it takes about 12 seconds, from either hard disk (SATA) or SSD. Other languages are fine if the speed can be greatly improved. I expected the time to be much less! The size of the referenced CSV file on disk is 302 MB.

A snippet of the relevant Perl is below:

while (my $iline = <$CSVFILE>)
{
    chomp($iline);
    my @csv_values = split /,/, $iline;
    # Use $csv_values[0] (the CODE/label) as the hash key
    $Greeks{$csv_values[0]} = [@csv_values];   # [@csv_values] copies the row into a new array ref
}

For the above, the majority of the time is consumed by the `split` and by the line that adds the new hash key.

I tried a similar test in Python (not my strong suit), and the performance was much, much worse! FYI: the CPU is an Intel 3.2 GHz i7-3930K with 32 GB RAM on a 64-bit OS (Windows 10), for the referenced measurements. Thanks for constructive ideas!
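For comparison, here is a plain-Python version of the same load using the standard-library `csv` module (whose parser is implemented in C) instead of splitting lines by hand. This is a minimal sketch, not the OP's code: the function name `load_greeks` is made up for illustration, and it assumes, as the question does, that column 0 is a unique key.

```python
import csv

def load_greeks(path):
    """Read a CSV into a dict keyed on column 0, one list of strings per row."""
    greeks = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            # csv.reader yields a fresh list per row, so no extra copy is needed
            greeks[row[0]] = row
    return greeks
```

Note that this keeps every field as a string, like the Perl version; if the 405 numeric columns need converting to floats, that conversion will likely dominate the runtime.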

  • 5
    I recommend justifying that collection of language tags. In general, don't use a language tag unless you have good reason and explicitly call it out. Often multiple language tags doom your question to being too broad. If you truly do not care what language is used, consider tagging `language-agnostic`. – user4581301 Feb 03 '19 at 02:38
  • 1
    There are a number of factors that play in here that aren't just language specific. The speed of your hard disk is relevant, but also the speed of your RAM, the speed of your CPU, the encoding of your data file, and the underlying algorithm your loader uses. Then, of course, there's the language you use. I know this isn't what you were hoping for, but the answer to your question is "it depends." – Jordan Singer Feb 03 '19 at 03:42
  • "I tried a similar test in python" --> post code used for that test. – chux - Reinstate Monica Feb 03 '19 at 03:58
  • 2
    [Please stop writing faulty CSV parsers!](https://stackoverflow.com/questions/14274259/read-csv-with-scanner/24950812#24950812), instead stop wasting your time and use [Text::CSV](https://metacpan.org/pod/Text::CSV) (or whatever module is available for your language of choice). Thank you. – Stefan Becker Feb 03 '19 at 05:37
  • I wrote my version of the Perl code and get for `time perl dummy.pl` – Stefan Becker Feb 03 '19 at 06:08
  • Actually about 1 second of that is just for de-allocating all that memory. For the actual loop [Time::HiRes](https://metacpan.org/pod/Time::HiRes) gives me `2.69s`. If I use OPs code, I get `4.00s`. – Stefan Becker Feb 03 '19 at 06:32
  • @chux: the Python code I tried: `import pandas as pd`; `dict = {row[0]: row for _, row in pd.read_csv(filename, header=None).iterrows()}`, with `filename = "/backtesting/historicaldata/SPX/SPX_Greeks/SPX_20190124_proc_greeks.csv"`. – StepAndFetchit Feb 03 '19 at 09:31
  • @StepAndFetchit and why did you not update your question and wrote a comment instead? – Stefan Becker Feb 03 '19 at 09:41
  • @Stefan Becker: What are the processor type, speed, and memory size on your laptop? Your results (if it is an apples-to-apples comparison) are about 3 to 3.3x better than mine, which is still slower than I expected. – StepAndFetchit Feb 03 '19 at 09:45
  • @Stefan Becker: This is my first post on this site, and I am "green" on the proper process. I moved the info update to the original post. – StepAndFetchit Feb 03 '19 at 09:48
  • 1
    The point of my comment was that it is as pointless as, IMHO, your question is. Obviously your research/assumption is incorrect too, because I left the `split()` untouched, changed 2 other lines, and already got ~33% faster execution time. – Stefan Becker Feb 03 '19 at 09:48
  • 1
    Seconding the recommendation to use Text::CSV instead of `split`. Not only will Text::CSV give you correct results for a wider range of inputs, it will also use an XS (compiled C) internal implementation by default (assuming you also install Text::CSV_XS), which will almost certainly give you better performance if raw speed is your main concern. – Dave Sherohman Feb 03 '19 at 10:25
  • @DaveSherohman good suggestion with [Text::CSV_XS](https://metacpan.org/pod/Text::CSV_XS). That gives me `4.64s`. Not bad considering it will handle a wide range of input... – Stefan Becker Feb 03 '19 at 13:44
  • Thank you all for your contribution! – StepAndFetchit Feb 03 '19 at 18:10
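Regarding the pandas attempt quoted in the comments: `iterrows()` constructs a full `Series` object for every row, which is what makes that version so slow (it also shadows the built-in `dict`). Reading the whole file with `read_csv` and converting in bulk avoids the per-row Python objects. A hedged sketch, assuming as in the question that column 0 is a unique key; the function name `load_greeks_pandas` is made up for illustration:

```python
import pandas as pd

def load_greeks_pandas(path):
    # read_csv uses pandas' C parser; header=None because the file has no header row
    df = pd.read_csv(path, header=None)
    # Build {key: row-as-list} in bulk instead of iterating with iterrows()
    return dict(zip(df[0], df.values.tolist()))
```

Unlike the Perl version, `read_csv` infers dtypes, so the numeric columns come back as floats rather than strings.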

0 Answers