
I am validating input strings using regex.

The input string must be 8 characters long, it must begin with an 'E', followed by six digits then either another digit or 'X'.

My PowerShell comparison is as follows:

$Input -match "E\d{6}[\dX]"

This returns true for valid values and false for invalid values, as long as they are 8 characters or fewer. Unfortunately, it also returns true for values where the first 8 characters match the expression but additional unwanted characters follow, e.g.:

input                  desired return   actual return
Invalid1               False            False
E012345X               True             True
E012345Xblahblahblah   False            True

No doubt this is by design, but I would like it to return false for the last example. I cannot simply truncate the input, because the script is validating existing data and needs to highlight the error.

Is there a way to do what I want in regex alone, or do I have to combine the regex with a separate test of the string length?

blackworx
  • You should add the beginning (`^`) and the end (`$`) of the string to your pattern, so it becomes `$Input -match '^E\d{6}[\dX]$'` – Olaf May 23 '21 at 16:31
  • Also, if the leading `E` and the trailing `X` must be upper case (I don't know whether that's the case), use the `-cmatch` operator – Theo May 23 '21 at 16:47

3 Answers


The input string must be 8 characters long, it must begin with an 'E', followed by six digits then either another digit or 'X'.

The real answer is

$Input -cmatch '^E[0-9]{6}[0-9X]\z'

PowerShell's -match operator is case-insensitive. Use the case-sensitive -cmatch instead; otherwise input such as e123456x will match, too.

In addition, \d matches any Unicode decimal digit, whereas [0-9] matches only the ASCII digits.

^ matches the start of the string, and \z matches the very end of the string.
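The effect of each choice can be checked with a short sketch. The variable names and the extra sample inputs (the lower-case value and the value with a trailing newline) are assumptions added for illustration; only the pattern itself comes from the answer above.

```powershell
# Pattern from the answer: case-sensitive use, ASCII digits only, strict end-of-string anchor.
$pattern = '^E[0-9]{6}[0-9X]\z'

$valid            = 'E012345X' -cmatch $pattern              # exactly 8 valid characters
$tooLong          = 'E012345Xblahblahblah' -cmatch $pattern  # extra trailing characters
$lowerCase        = 'e012345x' -cmatch $pattern              # wrong case, case-sensitive operator
$lowerInsensitive = 'e012345x' -match $pattern               # same input, case-insensitive operator
$trailingNewline  = "E012345X`n" -cmatch $pattern            # \z does not tolerate a final newline

"valid=$valid tooLong=$tooLong lowerCase=$lowerCase lowerInsensitive=$lowerInsensitive trailingNewline=$trailingNewline"
# prints: valid=True tooLong=False lowerCase=False lowerInsensitive=True trailingNewline=False
```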

Ryszard Czech
  • For single-line input, `^` and `$` as anchors work just fine. If you always want to match the very start and the very end of the input - even with multiline input and/or a trailing newline and/or the multiline matching option (`(?m)`) - use `\A` and `\z` (`\Z` to also match before a trailing newline). Combining `^` and `\z` is confusing. – mklement0 May 24 '21 at 19:07

All the relevant information is in the existing answers, but let me try to present a systematic overview:

  • PowerShell's -match operator by default matches any substring of its input string(s)[1] (albeit only ever the first occurrence), so in order to match the input in full, enclosing a regex pattern in ^...$ is required, where anchor ^ represents the start, and $ the end of the input.

    • With single-line input, ^ and $ work as described, ditto with multiline input by default, except that $ also matches before a trailing newline, if any.
    • To always match the very start and the very end - even with multiline input and the multiline matching option - use \A and \z instead (\Z to also match before a trailing newline, if any) - see the anchors quick-reference.
  • -match (and its rarely used alias, -imatch) is case-insensitive, as PowerShell is in general; use the -cmatch variant for case-sensitive matching.

  • Unless you truly need string interpolation, i.e. you need to embed the value of a PowerShell variable or expression in a regex pattern string[2], it's best to use single-quoted strings ('...') to define regex patterns, so as to prevent confusion over what elements PowerShell's expandable strings ("...") interpolate up front vs. what the .NET regex engine will end up seeing.
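The difference between the anchor pairs is easiest to see on input that contains newlines. The following is a minimal sketch; the two sample strings and the variable names are made up for illustration.

```powershell
$trailingNewline = "E012345X`n"          # valid code followed by one newline
$twoLines        = "E012345X`nE999999X"  # two valid codes on separate lines

# ^...$ : $ also matches just before a trailing newline.
$dollarTrailing = $trailingNewline -match '^E\d{6}[\dX]$'      # True
# \A...\z : \z matches only at the very end of the input.
$zTrailing      = $trailingNewline -match '\AE\d{6}[\dX]\z'    # False
# \A...\Z : \Z behaves like the default $ and tolerates a trailing newline.
$bigZTrailing   = $trailingNewline -match '\AE\d{6}[\dX]\Z'    # True

# Without the multiline option, ^ and $ refer to the input as a whole.
$defaultTwoLines   = $twoLines -match '^E\d{6}[\dX]$'          # False
# With (?m), ^ and $ match at every line, so a single line is enough to match.
$multilineTwoLines = $twoLines -match '(?m)^E\d{6}[\dX]$'      # True
# \A and \z are unaffected by (?m).
$strictTwoLines    = $twoLines -match '(?m)\AE\d{6}[\dX]\z'    # False
```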

To put it all together:

# Case-sensitive match; to match case-*insensitively*, use -match instead.
$str -cmatch '^E\d{6}[\dX]$'

Note: I've avoided use of $Input as a variable name, because it is an automatic variable that shouldn't be used as a user variable.
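Since the question is about flagging bad values in existing data, the same test can be applied to each value in turn. This is only a sketch: the variable names `$values` and `$results` and the property names are assumptions, and the sample values are the ones from the question's table.

```powershell
# Sample values taken from the question.
$values = 'Invalid1', 'E012345X', 'E012345Xblahblahblah'

# Record each value together with its validation result.
$results = foreach ($value in $values) {
    [pscustomobject]@{
        Value   = $value
        IsValid = $value -cmatch '^E\d{6}[\dX]$'
    }
}

# Highlight only the values that fail validation.
$invalid = @($results | Where-Object { -not $_.IsValid })
$invalid | ForEach-Object { "Invalid value: $($_.Value)" }
```

With the sample data, the two values reported as invalid are `Invalid1` and `E012345Xblahblahblah`.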

Note:

  • Strictly speaking, character class \d matches all Unicode characters classified as decimal digits, not just the ASCII-range digits 0 through 9.

  • While you can limit matching to ASCII-range digits with [0-9], \d is a convenient shorthand that in practice is likely to work just fine, given that the non-ASCII-range decimal digits are from scripts rarely used in English contexts - see the full list here.
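To illustrate the difference, the sketch below builds a candidate string in which one of the six digits is a non-ASCII decimal digit. The choice of U+0665 (ARABIC-INDIC DIGIT FIVE) and the variable names are assumptions made for the example.

```powershell
# U+0665 is classified as a Unicode decimal digit but is not in the range 0-9.
$arabicIndicFive = [char]0x0665
$candidate = "E01234${arabicIndicFive}X"   # 8 characters: E, six digits, X

$unicodeDigits = $candidate -match '^E\d{6}[\dX]$'        # True: \d accepts the non-ASCII digit
$asciiDigits   = $candidate -match '^E[0-9]{6}[0-9X]$'    # False: [0-9] accepts ASCII digits only
```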


[1] The LHS of -match can be a collection (array), in which case matching is performed against each element in that collection, and the return value then isn't $true or $false, but the sub-array of matching items. Also note that the automatic $Matches variable is then not populated.
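A brief sketch of the collection case, using the values from the question (the variable names are assumptions): the result is the list of matching elements rather than a Boolean.

```powershell
$values = 'Invalid1', 'E012345X', 'E012345Xblahblahblah'

# With an array on the LHS, -match acts as a filter.
$matching = @($values -match '^E\d{6}[\dX]$')

$matching.Count   # 1
$matching[0]      # E012345X
```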

[2] Even then you may choose not to use an expandable string ("...") and to instead construct the pattern based on a single-quoted string combined with -f, the format operator, combined with escaping the variable value / expression via [regex]::Escape() to ensure literal treatment inside the pattern; e.g. '^{0}' -f [regex]::Escape($PSHOME)
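The following sketch shows why the escaping matters. The variable `$version` and its value are hypothetical, chosen because the dot is a regex metacharacter.

```powershell
$version = '1.5'   # hypothetical value to be matched literally

# Interpolated as-is: the dot matches any character, so '105' matches too.
$unescaped = '105' -match "^$version"                               # True

# Escaped and inserted via -f: the dot is treated literally.
$pattern       = '^{0}' -f [regex]::Escape($version)               # ^1\.5
$escapedWrong  = '105' -match $pattern                              # False
$escapedRight  = '1.5.2' -match $pattern                            # True
```

One caveat with -f: literal braces in the pattern, such as a `{6}` quantifier, have to be doubled (`{{6}}`) so that the format operator does not treat them as placeholders.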

mklement0

This should answer your question:

$Input -match "^E\d{6}[\dX]$"
user10722100