
I have a string delimited by the pipe character. It is a repeatable sequence:

<machinenr>|<controldone>|<nrofitems>|<items>

However, where you see the items tag, there are item numbers that are also delimited by the pipe character. It's not a smart format, but I have to parse it, and I want to do it with regex in C#. Assuming the above format, let's look at a real example:

446408|0|2|111|6847|446408||0||

Note that, theoretically, there doesn't need to be a value between the pipes, nor are the contents limited in length. An item id can be 111 or 877333, or even a mixed alphanumeric id such as XB111. So here we have two machines with no items:

446408|0|0||447400||0||

Here we have a few machines with no or some items. Note, the pipe character is also used to delimit the items, so you have pipes within pipes:

446408|0|1|111|446408|0|3|99884|111|73732|446408|0|0||

This machine has three items: 446408|0|3|99884|111|73732|

The item ids:

99884|111|73732

What should the regex look like? I've tried with the below named groups (easier to read), but it just doesn't work:

^(?P<machinenr>.*?)\|
(?P<controldone>.*?)\|
(?P<nrofitems>.*?)\|
(?P<items>.*?)\|

Here is a clarification for @Atterson @sln and @. Note that the number of items can be 0-n; there is no limit. Let's take this example, a long string with machines and their items:

446408|0|1|111|446408|0|3|99884|111|73732|446408|0|0||

What I expect the regex to do is break this string into three matches/parts and their values.

The first match: 446408|0|1|111|
The second match: 446408|0|3|99884|111|73732|
The third match: 446408|0|0||

As an example of the values each part is supposed to be split into, take the second match. It is a machine with nr 446408, it has not been controlled (0), and it has 3 items with the item ids 99884|111|73732. After these items, a new sequence of:

<machinenr>|<controldone>|<nrofitems>|<items>

can follow. @Sanxofon please check your regex here: https://regex101.com/r/kC3gH0/87 and you'll see that unfortunately it does not match.

Baal
  • Please type clearly: 2 or 3 example strings and the exact matches you expect or the final strings after substitution (if you want to substitute) – Attersson May 19 '18 at 17:27
  • So far you've not given a proper description of a _pipe delimited field_ vs. a _pipe delimited value_. This is needed, even if it's a fixed number of fields. –  May 19 '18 at 18:26
  • You are using `.` instead of `[^\|]`. And being [greedy](http://www.rexegg.com/regex-quantifiers.html#greedytrap), your regex does not work. – Sanxofon May 19 '18 at 19:12
  • This is not clear at all. Please provide some very clear explanation of the format, your illustrations just "do not click". – Wiktor Stribiżew May 19 '18 at 20:40
  • Please check clarification in the edited question. @Sanxofon – Baal May 19 '18 at 20:58
  • Why use a regular expression at all? Why not just use string.split on | and call it a day? – aquinas May 19 '18 at 21:08
  • 446408|0|3|99884|111|73732| should be one match. The same regex should be able to match the next machine: 446408|0|0|| As you can see, the last machine has no items; || is empty. With one item, the last part would contain one item id, |111|, and with two items: |111|222| – Baal May 19 '18 at 21:11
  • @aquinas well, of course I could do that. But there has got to be a way to do this with regex. It's both easier to read if you use named groups and most likely faster. – Baal May 19 '18 at 21:13
  • Rereading this, there are multiple items within each group. How would you group the variable number of items within each group? There is certainly NOT A way to do this with a regex. This *is* a pretty easy problem to solve with a straight string.split though. I think it would be about 4 lines of code. – aquinas May 19 '18 at 21:21
  • If you could let the regex grab the complete group of items without splitting: |111|222|333|444| it would basically be grabbing anything in between the two last pipes: \|(?P<items>.*?\|)\| – Baal May 19 '18 at 21:30
  • Two questions: when you have consecutive records (like in your example: `446408|0|1|111|446408|0|3|99884|111|73732|446408|0|0||`), is it always for the same machine? Also, is it possible for the fields "controldone", "nrofitems", "items" to contain the machine number delimited by pipes? – Casimir et Hippolyte May 19 '18 at 21:43
  • @Casimir et Hippolyte. It's not always for the same machine. You can have different machines. Only the <items> field can contain a list of items, also delimited by pipes. The machine number can only exist at the beginning of every sequence. – Baal May 19 '18 at 21:51
  • Another example containing two different machines: 66899766|0|0||56222|1|2|453|895| The second machine with nr: 56222 has two items: 453|895 – Baal May 19 '18 at 21:54
  • Is there a way to make the difference between a machine number and a part of the item field? – Casimir et Hippolyte May 19 '18 at 22:19
  • Without a way to differentiate the end of the items list and the beginning of the next machine, there would be no way to make a regular expression out of this. Your human eye is picking out the machine numbers, but what rule are you using? How is a machine number different from an item? – bigtlb May 19 '18 at 22:47
  • @Casimir et Hippolyte unfortunately not. There is no way to distinguish the part containing the items. If you remove all the values from the format, you get this: |||| This is the basic pattern. So there has to be a way of telling regex to accept anything/any chars in between these pipes: |||| – Baal May 19 '18 at 23:01
  • A regex that can handle a pattern like |||| would manage to break a string apart into exactly that many pipes no matter what it finds between the pipes, even other pipes, because the main pattern should be followed regardless of what is found between each pipe. Like: 0|0|0|0| or 0|0,a,b,c|||||||||||0| Are there other pattern matching techniques in C#? I'm not thinking about split. There has to be some really efficient and more readable way of doing this than with split. – Baal May 19 '18 at 23:10

2 Answers

This isn't solvable with a regex. There's no way to tell the regular expression something like: "Match .*?\| the same number of times as the value of a certain capturing group, which happens to contain a number." Here is the straightforward solution to this problem using plain old C#, though.

string items = "446408|0|1|111|446408|0|3|99884|111|73732|446408|0|0|";
var fields = items.Split('|');
// Walk the flat field array: each record is machinenr, controldone, nrofitems, then the items.
for (int i = 0; i < fields.Length;) {
    Console.WriteLine("machinenr:" + fields[i++]);
    Console.WriteLine("controldone:" + fields[i++]);
    int numSubItems = Int32.Parse(fields[i++]);
    Console.WriteLine("num subitems:" + numSubItems);
    if (numSubItems == 0) {
        i++; // a record with no items still has one empty items field; skip it
        continue;
    }

    // nrofitems says how many of the following fields belong to this record
    for (int subItemIndex = 0; subItemIndex < numSubItems; subItemIndex++) {
        Console.WriteLine("\tItem:" + (subItemIndex + 1) + ": " + fields[i++]);
    }
}

FYI, I trimmed the trailing "|" that your original string had, so

string items = "446408|0|1|111|446408|0|3|99884|111|73732|446408|0|0|";

instead of

string items = "446408|0|1|111|446408|0|3|99884|111|73732|446408|0|0||";
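
For comparison, here is a minimal sketch of the same split-based idea that leaves the original trailing `||` in place and returns structured records instead of printing. The `MachineRecord` and `MachineParser` names are hypothetical, and the sketch assumes that every record ends with a pipe and that a record with zero items still carries one empty items field, as in the question's examples.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record type, named here for illustration only.
public sealed class MachineRecord
{
    public string MachineNr = "";
    public string ControlDone = "";
    public List<string> Items = new List<string>();
}

// Hypothetical parser; assumes each record is terminated by '|' and that
// a record with zero items still has one empty items field ("...|0||").
public static class MachineParser
{
    public static List<MachineRecord> Parse(string input)
    {
        string[] fields = input.Split('|');
        var records = new List<MachineRecord>();
        int i = 0;

        // A record needs machinenr, controldone, nrofitems and at least one more field.
        while (i + 3 < fields.Length)
        {
            var record = new MachineRecord();
            record.MachineNr = fields[i++];
            record.ControlDone = fields[i++];

            // count stays 0 when the field is empty or not a number.
            int.TryParse(fields[i++], out int count);

            if (count <= 0)
            {
                i++; // skip the empty items placeholder
            }
            else
            {
                for (int n = 0; n < count && i < fields.Length; n++)
                    record.Items.Add(fields[i++]);
            }

            records.Add(record);
        }

        return records;
    }
}
```

Calling `MachineParser.Parse` on the untrimmed example string from the question yields three records, with one, three and zero items respectively. The loop condition stops before the empty field that `Split` produces after the final pipe, which is why no trimming is needed.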
aquinas
  • Thanks, but doing it by regex is more challenging. Here is a similar issue: https://stackoverflow.com/questions/28200875/regular-expression-to-match-pipe-separated-strings-with-pipe-escaping However, it's an easier scenario because the pipe sequence to ignore is escaped. In the format I have to parse, there are no special tags/chars to distinguish the part of the string that contains the items. C# regex has special features like balancing groups etc. It has to be possible somehow with regex, like wrapping the subgroup containing the items? @Wiktor Stribiżew – Baal May 19 '18 at 22:56
  • Although this is not a regular expression this is probably the more straightforward and performant approach. – bigtlb May 19 '18 at 23:10
  • You can of course keep trying with a regex, but I can tell you in advance with the way you have defined the problem, it is impossible. "Thanks, but doing it by regex is more challenging." So, you're intentionally trying to avoid a simple solution? That seems weird, but ok :) – aquinas May 20 '18 at 01:26
  • I think the split solution will have to do if there is no possible way of solving this with regex. Thanks everyone for your help/answers and examples from: @aquinas – Baal May 20 '18 at 15:50

Named capturing groups are written (?<name>...) in C#, not (?P<name>...). You also said you want repeating matches, so I have wrapped your regex in a repeating (?<grp>...) group.
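
To illustrate the syntax point on its own, here is a minimal sketch showing the `(?<name>...)` form and how .NET keeps every iteration of a repeated group in `Group.Captures`. The `NamedGroupDemo` class and `FieldsOf` method are hypothetical names used only for this example. It merely tokenizes pipe-terminated fields and does not solve the machine/item grouping.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the question or the regex below.
public static class NamedGroupDemo
{
    public static List<string> FieldsOf(string input)
    {
        // (?<field>...) is the .NET spelling of a named group.
        // The outer (?: ... )* repeats it; each repetition adds one capture.
        Match m = Regex.Match(input, @"^(?:(?<field>[^|]*)\|)*$");

        var result = new List<string>();
        // Captures is empty when the input does not match.
        foreach (Capture c in m.Groups["field"].Captures)
            result.Add(c.Value);
        return result;
    }
}
```

For the input `446408|0|0||`, `FieldsOf` returns four values: `446408`, `0`, `0` and an empty string for the empty items field.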

You need to figure out how to differentiate an item from a machine. For instance, if you could say all machine numbers were 6 digits, and items were 0-5 digits you could do something like this... You would still have to split out the items collection.

^(?<grp>(?<machinenr>[^\|]{6})\|
(?<controldone>[^\|]*)\|
(?<nrofitems>[^\|]*)\|
(?<items>(?:[^\|]{0,5}\|){1,}))*$

Sample C# implementation:

using System;
using System.Text.RegularExpressions;

class Program
{

    static void Main(string[] args)
    {
        string strRegex = 
@"^(?<grp>(?<machinenr>[^\|]{6})\|
(?<controldone>[^\|]*)\|
(?<nrofitems>[^\|]*)\|
(?<items>(?:[^\|]{0,5}\|){1,}))*$";
        Regex myRegex = new Regex(strRegex, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
        string strTargetString = @"446408|0|1|111|446408|0|3|99884|111|73732|446408|0|0||";

        MatchCollection matches = myRegex.Matches(strTargetString);

        foreach (Match m in matches)
        {
            for (int idx = 0; idx < m.Groups["grp"].Captures.Count; idx++)
            {
                Console.WriteLine("Group:");
                Console.WriteLine($"\tmachinenr={m.Groups["machinenr"].Captures[idx]}");
                Console.WriteLine($"\tcontroldone={m.Groups["controldone"].Captures[idx]}");
                Console.WriteLine($"\tnrofitems={m.Groups["nrofitems"].Captures[idx]}");
                Console.WriteLine($"\titems={m.Groups["items"].Captures[idx]}");
            }
        }
    }
}



Using C# IEnumerable<T> Algorithm

It would seem easier just to split the string and parse the subsequent array. But, if you are concerned about dealing with large strings or don't wish to use String.Split(), you can use an IEnumerable<T> method. Here is one approach...

using System;
using System.Collections.Generic;

class Program
{

    public class Entry
    {
        public string MachineNr { get; set; }
        public string ControlDone { get; set; }
        public int Count { get; set; }
        public List<string> Items { get; set; }

        private static IEnumerable<string> fields(string list)
        {
            int idx = 0;
            do
            {
                int ndx = list.IndexOf('|', idx);
                if (ndx == -1) // no further pipe: return the rest of the string
                    yield return list.Substring(idx);
                else
                    yield return list.Substring(idx, ndx - idx);                        

                idx = ++ndx;
            }
            while (idx > 0 && idx < list.Length-1) ;
        }

        public static IEnumerable<Entry> parseList(string list)
        {
            int idx =0;
            var fields = Entry.fields(list).GetEnumerator();
            while (fields.MoveNext())
            {
                var e = new Entry();
                e.MachineNr = fields.Current;
                if (fields.MoveNext())
                {
                    e.ControlDone = fields.Current;
                    if (fields.MoveNext())
                    {
                        int val = 0;
                        e.Count = int.TryParse(fields.Current, out val) ? val : 0;
                        e.Items = new List<string>();
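                        for (int x = e.Count; x > 0; x--)
                        {
                            if (fields.MoveNext())
                                e.Items.Add(fields.Current);
                        }
                        // A record with zero items still has one empty items field;
                        // skip it so it is not read as the next machine number.
                        if (e.Count == 0)
                            fields.MoveNext();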
                    }
                }

                yield return e;
            }
        }
    }
    static void Main(string[] args)
    {
        string strTargetString = @"446408|0|1|111|446408|0|3|99884|111|73732|446408|0|0||";
        foreach (var entry in Entry.parseList(strTargetString))
        {
            Console.WriteLine(
$@"Group:
    Machine:        {entry.MachineNr}
    ControlDone:    {entry.ControlDone}
    Count:          {entry.Count}
    Items:          {string.Join(", ",entry.Items)}");
        }

    }
}
bigtlb
  • Splitting the items collection would be no problem. Have you tested your regex? I can't seem to make it work. From reading it, it's very close to a solution, but because I cannot differentiate between the machinenr and itemnrs it won't work. A machinenr could be anything from 0-10000000000 and the same goes for any item number. But I know that the C# version of regex has some unique features and within these is the key to the solution. However, I'm not enough of a regex master yet to figure out how to use them. – Baal May 19 '18 at 23:24
  • I have tested with Multiline and IgnorePatternWhiteSpace – bigtlb May 19 '18 at 23:25
  • **REMEMBER:** If you're using a C# string, make sure it is a verbatim literal `@"^(?<grp>(?<machinenr>[^\|]{6})\| (?<controldone>[^\|]*)\| (?<nrofitems>[^\|]*)\| (?<items>(?:[^\|]{0,5}\|){1,}))*$"` or else double the backslashes to escape them. – bigtlb May 19 '18 at 23:31
  • The string from the database will be one long single line string containing tens or hundreds of machines and their respective items. – Baal May 19 '18 at 23:59
  • I updated my answer with a sample implementation. This assumes you have multiple lines. If not, just remove the Multiline RegexOption. – bigtlb May 20 '18 at 01:02
  • To use balancing groups you would need to be able to quantify your `nrofitems` into discrete captures. For instance, if you could turn `446408|0|1|111|446408|0|3|99884|111|73732|446408|0|0||` into `446408|0|+|111|446408|0|+++|99884|111|73732|446408|0|||` then you could use a balancing capture group where you push `+` and pop `|`... But at that point you are really better off just parsing the string in C#. – bigtlb May 20 '18 at 06:13
  • I think the split solution will have to do if there is no possible way of solving this with regex. Thanks everyone for your help/answers and examples from: @bigtlb – Baal May 20 '18 at 15:50