I've spent about 3 hours trying to create a regex to validate a coordinate string that's homebrew (but based on MGRS). I've always had a rough time with regex and this one's starting to make me slam my head on my desk.

I normally wouldn't ask for something this specific, but none of my friends who know regex were available to help, and my own research/attempts to learn at this point are taking too long.

How would I go about creating a regex to validate the following (or is it even possible to do)?

AZ AA 0123456789 0123456789 HG-20

  • All whitespace is optional
  • The first and second set of 2 characters is case insensitive, but must be a-z and be 2 characters for each section (first is grid zone identifier, second is subsection identifier).
  • The easting and northing digit portion can be any arbitrary length, so long as the total number of digits between the 2 "sections" is even (e.g. "1 1" is valid, "04563213245 54986187995" is valid, "5598 549" is invalid). This is the part that I'm really just at a loss for.
  • The 'H' is mandatory, but case insensitive (it's a delimiter for the "height" component of the coordinate string)
  • The 'G' is optional (it indicates whether to use the location's ground height or not when converting to Unity coordinates)
  • The last bit is a float, which can be negative. If a period is there, then a digit must follow.
  • The entire height section is optional. If "H" is present, at a minimum a digit must come after (same rules as the last 2 bullet points basically).

I'm using C# and System.Text.RegularExpressions for the regex engine.

The part that's really getting to me is the main digit component. I'm not sure if this is possible (especially considering that the easting and northing can be separated by whitespace).

So far I've been able to come up with:

/^[a-z]{2}\s?[a-z]{2}\s?\d+\s?\d+\s?hg?[-+]?\d+.?\d+?/ig

But it doesn't actually validate whether the length of the coordinates sans whitespace is even or not and isn't able to tell if the entire height section is optional or not (it technically makes the H required...no clue how to subsection it...).

Ultimately, since I break the string apart in code I could validate it with actual code (if the digit sub string is not even throw InvalidArgument or ArgumentOutOfRange exception since this logic happens in a constructor for a Location class). That seems like it would be bad juju if the validation can be done in regex though.

Examples:

  • AZ AA 012345 012345 HG-20 (valid)
  • AZAA 012345 012345 HG-20 (valid)
  • AZAA012345012345HG-20 (valid)
  • AZ AA 012345 012345 (valid)
  • AZ AA 012345 01234 (invalid, easting and northing are different lengths)
  • AZ AA 012345 012345 HG (invalid, must have digits if 'H' is present)
  • AZ AA 012345 012345 H (invalid, must have digits if "H" is present)

Thanks!

In case anyone is curious about how my code currently looks:

public Location(string coordinateString)
{
    //TODO: Validate string

    //Calling ToUpper() as we never want to worry about casing. Everything is upper case.
    var stripped = coordinateString.ToUpper().Replace(" ", "");
    var gridZonedesignator = stripped.Substring(0, 2);
    var subLocationId = stripped.Substring(2, 2);
    var identifiersRemoved = stripped.Remove(0, 4);
    var heightParsed = identifiersRemoved.Split('H');

    float height = 0;
    bool useGroundHeight = false;

    //If the height component is there, parse ground height flag (if there) and set 
    //height.
    if (heightParsed.Length > 1)
    {
        if (heightParsed[1].StartsWith("G"))
        {
            useGroundHeight = true;
            //Remove only the leading 'G'; Remove(0) would delete the whole string.
            heightParsed[1] = heightParsed[1].Remove(0, 1);
        }

        height = float.Parse(heightParsed[1], CultureInfo.InvariantCulture);
    }

    //Since the total digits of the easting/northing section must be equal, 
    //simply divide by 2 to separate the number of digits the easting and
    //northing each consist of (accuracy).
    var accuracy = heightParsed[0].Length / 2;

    // It's possible to end up with accuracy 1, 2, or 3, in which case we want 
    //to pad 0s to the right as a 1 digit coordinate translates to thousands,
    //not ones as a grid zone is currently 10k x 10k meters.
    //TODO: Base the digits off the scale of a grid zone instead of hard
    //coding to 4. If we change the scale the following will no longer be
    //valid.
    var eastingString = heightParsed[0].Substring(0, accuracy).PadRight(4,'0');
    var northingString = heightParsed[0].Substring(accuracy, accuracy).PadRight(4,'0');

    CoordinateString = coordinateString;
    GridZoneDesignation = gridZonedesignator;
    SubLocationId = subLocationId;
    Easting = eastingString;
    EastingInt = int.Parse(Easting);
    Northing = northingString;
    NorthingInt = int.Parse(Northing);
    IsStartFromGroundHeight = useGroundHeight;
    Height = height;
}
bobble bubble
TylerWStx
  • Does the entire validation have to happen in the regex? I don't think the .NET regex syntax provides a mechanism to verify _length_ of a substring against some other matched substring. But even if it does, I would think the code might be easier to comprehend if you simply matched the sections of the string, and then compared the lengths of the two respective match groups after getting a regex match. Assuming I'm right and it doesn't, then obviously that'd be the _only_ way to do it. – Peter Duniho Dec 31 '16 at 23:43
  • If you just want to check for an even amount of digits optionally separated by space [see like this demo](https://regex101.com/r/FSFuOP/1) probably not what you're after :p – bobble bubble Dec 31 '16 at 23:54
  • Why do you have to do it in a single regex? regex + a few *if*s would solve the problem... – L.B Dec 31 '16 at 23:55
  • These values have internal structure. A data type that lets you take advantage of that radically simplifies your problem. *"That seems like it would be bad juju if the validation can be done in regex though."* [Now you have two problems.](http://softwareengineering.stackexchange.com/a/223640/24529) – Mike Sherrill 'Cat Recall' Jan 01 '17 at 00:42
  • The biggest reason is for performance. This is going to be in a game, and this system is an abstraction of Unity's coordinate system, which means it's going to be used heavily. I could simply attempt to break apart the string as I'm doing now with no validation at the string level and do validation conditionals at every property, but when thousands of these are being processed every frame that would be an unnecessary slowdown given the performance benefits of regex vs a bunch of conditionals. – TylerWStx Jan 01 '17 at 03:25
  • Other reasons include not having to attempt to create a Location object just to validate the string, early validation instead of late validation (can validate every specified location quickly and easily at game start to ensure modded files won't throw exceptions for locations when they're eventually used), and being able to use the same validation across applications (e.g. modding tools which may not be written in .Net in addition to the game itself). – TylerWStx Jan 01 '17 at 03:36
  • @TylerWStx AFAIK RegEx is usually the slowest way to validate input (especially when there is backtracking, which seems to be the case for Balancing Groups). I can't imagine a case where RegEx would be faster than a bunch of conditions (unless there are too many replace/split/join memory allocations) – Slai Jan 01 '17 at 06:41
  • @Slai Unity has some serious GC issues because it's running an older Mono version and as a result allocations can actually be very costly (they're working on updating, thank god). Plus, as I understand it, compiled regex is often faster than complex string operations (though as you said, with backtracking that may not be the case, though I'd like to think balancing groups are optimized). In the big picture, one (heavily commented) line of regex/1 method call is at the very least more readable, and will (given Unity's kinks) likely be more performant. – TylerWStx Jan 01 '17 at 23:08
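
The approach suggested in the comments above (match the sections with a regex, then compare the lengths of the two digit groups in code) could be sketched roughly as follows. This is only an illustration under assumptions: the class name PortableValidator, the method IsValid, and the group names first and second are hypothetical, and the height sub-pattern is one reading of the rules in the question. It avoids .NET-only regex features, so the same idea should carry over to other regex engines.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the regex checks the overall shape only,
// and ordinary code checks the digit-length rule.
public static class PortableValidator
{
    private static readonly Regex Shape = new Regex(
        @"^(?:[A-Z]{2} ?){2}(?<first>\d+)(?: (?<second>\d+))? ?(?:HG?[+-]?\d+(?:\.\d+)?)?$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static bool IsValid(string s)
    {
        if (s == null) return false;

        Match m = Shape.Match(s);
        if (!m.Success) return false;

        Group first = m.Groups["first"];
        Group second = m.Groups["second"];

        // Digits separated by a space: both runs must have the same length.
        if (second.Success) return first.Length == second.Length;

        // No separating space: the single run must split into two equal halves.
        return first.Length % 2 == 0;
    }

    public static void Main()
    {
        string[] samples =
        {
            "AZ AA 012345 012345 HG-20",
            "AZAA012345012345HG-20",
            "AZ AA 012345 01234",
            "AZ AA 012345 012345 H"
        };
        foreach (string s in samples)
            Console.WriteLine("{0} -> {1}", s, IsValid(s));
    }
}
```

The trade-off is one extra length comparison in code in exchange for a pattern that does not rely on engine-specific constructs.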

2 Answers

Let's try to put it together.

  • The first part to match AZ AA, AZAA is pretty straightforward and self explanatory.
    ^(?:[A-Z]{2} ?){2} where ^ anchors the match at the start of the string and (?: opens a non-capturing group. Use RegexOptions.IgnoreCase to accept lowercase letters as well.

  • The second part, where you tried \d+\s?\d+, is the challenge. As I read it, you want to match two groups of digits that have the same length and are separated by an optional space in the middle. Something like (?:\d ?\d ?)+ would not work for your input, because it would also match strings such as 0 123 01 23 when only strings such as 0123 0123 are allowed.
    You could use a special feature of .NET regex called balancing groups. Dig a bit into it. Basically, each match of one group pushes a capture onto a stack and each match of another group pops one off again, and the pattern fails unless the stack is empty at the end. Have a try with (?'x'\d)+ ?(?'-x'\d)+(?(x)(?!)) at regexstorm.

  • Finally we have the last optional part: H followed by an optional G, an optional - or +, and a number whose period, if one occurs, must be followed by at least one digit: (?:HG?[+-]?\d+(?:\.\d+)?)?

Assembled with optional spaces, the total pattern could be

^(?:[A-Z]{2} ?){2}(?'x'\d)+ ?(?'-x'\d)+(?(x)(?!)) ?(?:HG?[+-]?\d+(?:\.\d+)?)?$

See demo at regexplanet (click on green .NET button) or regexstorm.
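
A minimal C# sketch of how the assembled pattern might be wired up and run against the examples from the question. The names CoordinateValidator and IsValid are hypothetical. Two assumptions are made here: the sign is written as [+-]? so that it is optional (as the description says and as the comment below confirms), and RegexOptions.IgnoreCase is used to satisfy the case-insensitivity requirement.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the balancing-group pattern.
public static class CoordinateValidator
{
    // (?'x'\d)+ pushes one capture per digit, (?'-x'\d)+ pops one per digit,
    // and (?(x)(?!)) forces a failure if any capture is left on the stack.
    private static readonly Regex Pattern = new Regex(
        @"^(?:[A-Z]{2} ?){2}(?'x'\d)+ ?(?'-x'\d)+(?(x)(?!)) ?(?:HG?[+-]?\d+(?:\.\d+)?)?$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static bool IsValid(string coordinateString)
    {
        return coordinateString != null && Pattern.IsMatch(coordinateString);
    }

    public static void Main()
    {
        string[] samples =
        {
            "AZ AA 012345 012345 HG-20", // valid
            "AZAA 012345 012345 HG-20",  // valid
            "AZAA012345012345HG-20",     // valid
            "AZ AA 012345 012345",       // valid, height section omitted
            "AZ AA 012345 01234",        // invalid, different lengths
            "AZ AA 012345 012345 HG",    // invalid, no digits after H
            "AZ AA 012345 012345 H"      // invalid, no digits after H
        };

        foreach (string s in samples)
            Console.WriteLine("{0,-28} {1}", s, IsValid(s) ? "valid" : "invalid");
    }
}
```

Keeping the Regex in a static field means it is constructed once rather than on every call, which matters if the check runs frequently.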

bobble bubble
  • I needed to add a ? after [+-] to make the symbols optional, and with that it works exactly as expected. The only downside I see to balancing groups is that it's .net specific, so if in the future we need the same validation in another language/library it'll be back to square one on this part, but this works great for the game itself (which is the most important thing). Thank you very much! – TylerWStx Jan 01 '17 at 03:55
  • @TylerWStx You're welcome! Great you got it going. Have overlooked the optional `[+-]` – bobble bubble Jan 01 '17 at 17:56

The pattern below should do what you need. It uses balancing groups, which the .NET regex engine (and therefore C#) supports.

It increments a counter group named N for the digits before the space and decrements it for the digits after the space. At the end N must be zero or there is no match.
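
To see the counter idea in isolation, here is a deliberately simplified sketch that checks only the digit portion. It is not the full pattern from this answer; the class name BalanceDemo and the method HasEqualRuns are hypothetical, and the test strings are the ones given in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo of a balancing group used as a counter.
public static class BalanceDemo
{
    // (?<N>\d)+   pushes one capture onto N per leading digit.
    // (?<-N>\d)+  pops one capture from N per trailing digit
    //             (and cannot match once N is empty).
    // (?(N)(?!))  fails the match if N still holds a capture.
    private static readonly Regex EqualRuns =
        new Regex(@"^(?<N>\d)+ ?(?<-N>\d)+(?(N)(?!))$");

    public static bool HasEqualRuns(string digits)
    {
        return EqualRuns.IsMatch(digits);
    }

    public static void Main()
    {
        string[] samples = { "1 1", "04563213245 54986187995", "5598 549" };
        foreach (string s in samples)
            Console.WriteLine("{0} -> {1}", s, HasEqualRuns(s));
    }
}
```

The first two samples balance and match; the third leaves one capture on N and is rejected.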

^([a-zA-Z]{2}\s?){2}((\d((?>\d(?<N>))*\s?(?>\d(?<-N>))*)*(?(N)(?!))\d))\s?[Hh][Gg]?[-+]?\d+\.?\d*

The breakdown...

^([a-zA-Z]{2}\s?){2} Two letters (case insensitive) followed by an optional space, twice

((\d((?>\d(?<N>))*\s?(?>\d(?<-N>))*)*(?(N)(?!))\d)) x digits + optional space + x digits

\s?[Hh][Gg]? H with optional G

[-+]?\d+\.?\d* +/-##.#### (the period is escaped so that it matches only a literal '.')

Hope this helps.

Sunsetquest
  • You were right on with balanced groups, but the regex didn't end up validating correctly, so I ended up marking bobble bubble's as correct. In my code I'll be crediting both you and bobble though, as you were both right w/ balancing groups. Thanks! – TylerWStx Jan 01 '17 at 03:57
  • Somehow some extra spaces got in the regex. I removed them so it should work now. – Sunsetquest Jan 01 '17 at 06:36