Parsing file of two formats and parse the line as well

Question

I have a huge file which can have lines in the two formats below:

Format1:

*1 <int_1/string_1>:<int/string> <int_2/string_2>:<int/string> <float>

Format2:

*1 <int/string>:<int/string> <float>

So, possible cases for the above formats are:

*1 1:2 3:4 2.3
*1 1:foo 3:bar 2.3
*1 foo:1 bar:4 2.3
*1 foo:foo bar:bar 2.3
*1 foo:foo 2.3

Of these two formats, I only need to consider 'Format1' in my code; while reading the huge file, lines in 'Format2' should be skipped. Of the possible cases above, I will consider the first 4, not the last one, since it matches 'Format2'. So, the regex should be something like this:

(\d+)(\s+)(\\*\S+:\S+)(\s+)(\\*\S+:\S+)(\s+)(\d+)

where

\d is any digit. \d+ is one or more digits.
\s is whitespace. \s+ is one or more whitespace characters.
\S is any non-whitespace character. \S+ is one or more non-whitespace characters.

After considering the 'Format1' line, I will have to take two values from it:

int_1/string_1
int_2/string_2

What would be the optimal way to deal with this?

Read the file line by line (`std::getline()`) and compare line buffer (`std::string`) whether it starts with `"*1 — Scheff's Cat, Apr 04 '19 at 05:31
@Scheff, int_1 could be any integer.. Regular expression should come to the rescue here but I am not sure about the optimal solution. — Rahul Bhargava, Apr 04 '19 at 05:44
Will the entire file be either of these two lines? There won't be any other line or different strings in the same format etc? Or do you mean `int` as in any integer and `string` as any string? — Wander3r, Apr 04 '19 at 05:44
@Wander3r, Thanks for interest. Yes, lets say that entire file would be either of these two type of lines. Yes int is any integer and string as any string. — Rahul Bhargava, Apr 04 '19 at 05:46
I suspected this but I feel it's worth to be mentioned in question... ;-) OK. Read the file line by line, and read each line with a `std::istringstream` and input operators (`>>`) matching your first format. If reading fails you can discard the line. In that case, it's either second format or something else. — Scheff's Cat, Apr 04 '19 at 05:47
Yet another approach: [SO: How to rearrange a string equation?](https://stackoverflow.com/a/50021308/7478597) - a hand-knitted parser. — Scheff's Cat, Apr 04 '19 at 05:49
`*1 1/foo:2/bar 3/baz:4/Foo 2.3` Is this a valid example for your interested format? — Wander3r, Apr 04 '19 at 05:49
At least, I like your attempt to use something else than regex (seeing that you're "at home" in JavaScript). ;-) — Scheff's Cat, Apr 04 '19 at 05:53
@Wander3r, Edited the question to answer. It answers Scheff as well. :) — Rahul Bhargava, Apr 04 '19 at 05:54
IMHO, in your input sample, line 1 and 2 would match your format 1. Do you mean something like an identifier with _string_ or any arbitrary sequence of characters (uhm... `:` excluded)? — Scheff's Cat, Apr 04 '19 at 05:58
@Scheff, Yes. In possible cases examples, line 1-4 will match Format1. Line 5 will match Format2 and hence should be discarded. — Rahul Bhargava, Apr 04 '19 at 06:01
May be, it could help if you provide a regex in your question what exactly should match. (To me, it's still not clear what _string_ can be / cannot be.) In the link above, I demonstrated how to write a simple LA parser from a syntax diagram. I believe there is no simpler alternative than that (except a sequence of loops and ifs but actually that's the same). — Scheff's Cat, Apr 04 '19 at 06:06
@Scheff, I see. Sorry about the confusion. Edited the question to have regex of that structure. Do let me know if you need more information. — Rahul Bhargava, Apr 04 '19 at 06:13
Your regex doesn't match your sample code: [**Live Demo on regex101**](https://regex101.com/r/VX8v0a/1). — Scheff's Cat, Apr 04 '19 at 06:16
2 upvotes without any code attempt? This smells like "Please, write the code." — Scheff's Cat, Apr 04 '19 at 06:17
@Scheff, Well.. I know that regex expression could solve this problem easily. The one which I wrote was just an example. But I wanted to know the optimal solution to it since the file can be huge. Even 50G. — Rahul Bhargava, Apr 04 '19 at 06:21
@Wander3r, I am open to that as well but never had experience to use that. — Rahul Bhargava, Apr 04 '19 at 06:39
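The `std::istringstream` approach suggested in the comments could be sketched roughly as below. This is only an illustration under assumptions: `parseFormat1` is a hypothetical helper name, every line is assumed to start with the literal tag `*1`, and a field is assumed to be a non-empty key and value joined by a single `:`.

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: returns the two keys (int_1/string_1 and
// int_2/string_2) of a Format1 line, or std::nullopt for any other line.
std::optional<std::pair<std::string, std::string>>
parseFormat1(const std::string& line) {
    std::istringstream in(line);
    std::string tag, pair1, pair2, extra;
    double value = 0.0;

    // Format1 has exactly four fields: tag, key:value, key:value, float.
    // A Format2 line fails here because it has no fourth field to read.
    if (!(in >> tag >> pair1 >> pair2 >> value)) return std::nullopt;
    if (in >> extra) return std::nullopt;   // unexpected trailing field
    if (tag != "*1") return std::nullopt;   // assumption: literal "*1" tag

    // Both middle fields must look like key:value with non-empty parts.
    const auto c1 = pair1.find(':');
    const auto c2 = pair2.find(':');
    if (c1 == std::string::npos || c1 == 0 || c1 + 1 == pair1.size() ||
        c2 == std::string::npos || c2 == 0 || c2 + 1 == pair2.size()) {
        return std::nullopt;
    }
    return std::make_pair(pair1.substr(0, c1), pair2.substr(0, c2));
}
```

Called once per `std::getline()` result, `parseFormat1("*1 foo:1 bar:4 2.3")` would yield the pair `foo` and `bar`, while `parseFormat1("*1 foo:foo 2.3")` yields an empty optional so the line can be skipped.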

score 1 · Accepted Answer · answered Apr 04 '19 at 06:14

You could first count the number of space-separated fields

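// s is the current line as a NUL-terminated string; isspace() needs <cctype>
struct Field {
    int start, stop; // half-open range [start, stop) into s
};
Field fields[4];
int i = 0, nf = 0;
while (s[i]) {
    // Skip whitespace before the next field
    while (s[i] && isspace((unsigned char)s[i])) i++;
    if (!s[i]) break;
    int start = i;
    // Advance to the end of the field
    while (s[i] && !isspace((unsigned char)s[i])) i++;
    nf++;
    if (nf == 5) break; // Too many fields
    fields[nf-1].start = start;
    fields[nf-1].stop = i;
}
if (nf == 4) {
    // We got 4 fields, line could be acceptable
    ...
}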

Adding a pre-check that the first characters are '*', '1' and a space could speed up skipping invalid lines if there are many of them.
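Putting the pre-check and the field count together, and then pulling out the two keys, might look like the following sketch. It is not the answer's original code: `Keys` and `extractKeys` are hypothetical names, the line is assumed to start with exactly `*1` followed by a space (no leading whitespace), and the float field is not validated.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical result type: the text before ':' in fields 2 and 3.
// The views point into the caller's line buffer, so they are only valid
// while that buffer is alive and unchanged.
struct Keys {
    std::string_view first, second;
};

// Returns true and fills `out` if `line` has the shape of Format1.
bool extractKeys(std::string_view line, Keys& out) {
    // Cheap pre-check: the line must start with '*', '1' and a space.
    if (line.size() < 3 || line[0] != '*' || line[1] != '1' || line[2] != ' ')
        return false;

    std::string_view fields[4];
    std::size_t i = 0, nf = 0;
    while (i < line.size()) {
        // Skip whitespace before the next field
        while (i < line.size() &&
               std::isspace(static_cast<unsigned char>(line[i]))) ++i;
        if (i == line.size()) break;
        const std::size_t start = i;
        // Advance to the end of the field
        while (i < line.size() &&
               !std::isspace(static_cast<unsigned char>(line[i]))) ++i;
        if (nf == 4) return false;  // a fifth field: too many
        fields[nf++] = line.substr(start, i - start);
    }
    if (nf != 4) return false;      // Format2 lines have only 3 fields

    // Split fields 2 and 3 at their first ':'
    const auto c1 = fields[1].find(':');
    const auto c2 = fields[2].find(':');
    if (c1 == std::string_view::npos || c2 == std::string_view::npos)
        return false;
    out.first = fields[1].substr(0, c1);
    out.second = fields[2].substr(0, c2);
    return true;
}
```

Because nothing is copied or allocated per line, this keeps the work per line close to a single pass over its characters, which is the point of the hand-written loop for very large inputs.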

6502


Thanks for the interest. There seem to be many operations. I am not sure how it will behave on a 50G file. – Rahul Bhargava Apr 04 '19 at 06:23
@RahulBhargava A _50G file_ will be heavy stuff for code written in any language. I cannot imagine any much faster approach. I don't know whether you are aware of [`<regex>`](https://en.cppreference.com/w/cpp/regex). However, I'm quite sure that a regex wouldn't outperform the approach of this answer. – Scheff's Cat Apr 04 '19 at 07:13
@RahulBhargava: you're confusing the size of the source code with the number of machine instructions the CPU will have to perform. Actually, short source code normally means the use of high-level abstractions that may not be 100% optimized for the specific work. Of course, long and convoluted source code can be slow, but the opposite is also possible, i.e. long and apparently convoluted source code can be orders of magnitude faster than a two-liner. Unfortunately, specific and optimized code is rarely short and simple. – 6502 Apr 04 '19 at 09:35

score 0 · Answer 2 · answered Apr 04 '19 at 09:35

Using boost

#include <iostream>
#include <array>
#include <vector>
#include <string>

#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>

int main() {
    std::array<std::string, 5> x = { "*1 1:2 3:4 2.3",
        "*1 1:foo 3:bar 2.3",
        "*1 foo:1 bar:4 2.3",
        "*1 foo:foo bar:bar 2.3",
        "*1 foo:foo 2.3"
    };

    for (const auto& item : x) {

        std::vector<std::string> Words;
        // split based on <space> and :

        boost::split(Words, item, boost::is_any_of(" :"));
        std::cout << item << std::endl;

       // Only consider the Format1
        if (Words.size() > 4) {
            std::cout << Words[1] << ":" << Words[2] << std::endl;
            std::cout << Words[3] << ":" << Words[4] << std::endl;
        }
        std::cout << std::endl;
    }
    return 0;
}

Using std::regex

#include <array>
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::array<std::string, 5> x = { "*1 1:2 3:4 2.3",
        "*1 1:foo 3:bar 2.3",
        "*1 foo:1 bar:4 2.3",
        "*1 foo:foo bar:bar 2.3",
        "*1 foo:foo 2.3"
    };

    std::regex re(R"(\*1\s+(\w+):(\w+)\s+(\w+):(\w+).*)");

    for (const auto& item : x) {
        std::smatch sm;
        if (std::regex_match(item, sm, re)) {
            std::cout << sm[1] << ":" << sm[2] << std::endl;
            std::cout << sm[3] << ":" << sm[4] << std::endl;
        }
    }

    return 0;
}
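To apply the same pattern to a file instead of a fixed array, the loop could read from a stream line by line, roughly as sketched here. `collectKeys` is a hypothetical helper name, and collecting the keys into a vector is only for illustration; for a very large file it would probably be better to process each pair as it is found.

```cpp
#include <istream>
#include <regex>
#include <sstream>  // only needed when feeding the helper from memory
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects (int_1/string_1, int_2/string_2) from every
// Format1 line of a stream and silently skips all other lines.
std::vector<std::pair<std::string, std::string>> collectKeys(std::istream& in) {
    // Same pattern as above; constructed once rather than per line.
    static const std::regex re(R"(\*1\s+(\w+):(\w+)\s+(\w+):(\w+).*)");

    std::vector<std::pair<std::string, std::string>> keys;
    std::string line;
    std::smatch sm;
    while (std::getline(in, line)) {
        if (std::regex_match(line, sm, re)) {
            // Groups 1 and 3 are the parts before each ':'
            keys.emplace_back(sm[1].str(), sm[3].str());
        }
    }
    return keys;
}
```

Any `std::istream` works as input, for example a `std::ifstream` opened on the data file or a `std::istringstream` holding a few sample lines.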
