
I'm working through the RegexOne regex tutorial, and it has a question about writing a regular expression to remove unnecessary whitespace.

The solution provided in the tutorial is

We can just skip all the starting and ending whitespace by not capturing it in a line. For example, the expression ^\s*(.*)\s*$ will catch only the content.

The setup for the question does indicate the use of the hat at the beginning and the dollar sign at the end, so it makes sense that this is the expression that they want:

We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.

That said, using \S instead, I was able to come up with what seems like a simpler solution - (\S.*\S).

I've found an SO answer that matches the one in the tutorial - Regex Email - Ignore leading and trailing spaces? - and I've seen other guides that recommend the same format, but I'm struggling to find an explanation for why the \S is bad.

Additionally, this validates as correct in their tool... so, are there cases where this would not work as well as the provided solution? Or is the recommended version just a standard format?

Catija
  • 1
    The two expressions would produce different results if the remaining text, after stripping whitespace at both ends, was empty or consisted of a single character. Your version requires at least two matches of `\S` in order to match at all. – jasonharper Jul 25 '20 at 04:18
  • 1
    @WiktorStribiżew I don't really see how this is a duplicate at all. I'm asking why, not for someone to hand me code. That's a really useful resource but it doesn't at all fix my problem - and according to the answer here, the suggested answer in the tutorial I'm citing is actually *wrong*... so - again - I think the duplicate is incorrect. – Catija Aug 26 '20 at 16:57
  • 1
    http://regex101.com provides explanations for these basic patterns. No need to keep asking what `\s` or `\S` matches. – Wiktor Stribiżew Aug 26 '20 at 17:00

1 Answer


The tutorial's solution of ^\s*(.*)\s*$ is wrong. The .* inside the capture group is greedy, so it expands as far as it can, all the way to the end of the line, and captures the trailing spaces too. Because the \s* that follows can match zero characters and $ then matches at the end of the line, the .* never needs to backtrack, so that \s* never consumes anything.

https://regex101.com/r/584uVG/1
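As a rough illustration of the greedy behavior described above, here is a minimal sketch that runs the tutorial's pattern in JavaScript. The sample string is made up for the example; only the pattern comes from the tutorial.

```javascript
// Hypothetical sample line with three leading and three trailing spaces.
const sampleLine = '   foo bar   ';

// The tutorial's pattern: the greedy (.*) runs to the end of the line,
// leaving nothing for the trailing \s* to consume.
const tutorialResult = sampleLine.match(/^\s*(.*)\s*$/);
const captured = tutorialResult[1];

// JSON.stringify makes the trailing spaces visible in the output.
console.log(JSON.stringify(captured)); // "foo bar   "
```

The leading spaces are dropped because the first \s* is greedy too, but the trailing ones end up inside the capture group.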

Your solution is much better at actually matching only the non-whitespace content in the line, but there are a couple of odd cases in which it won't match at all. (\S.*\S) requires at least two non-whitespace characters, so it fails on a line whose content is a single character. The tutorial's (.*), by contrast, can capture a single character, and it captures an empty string if the input is composed entirely of whitespace.
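To make those edge cases concrete, here is a small sketch comparing the two patterns on a few made-up inputs. The helper name `compare` and the sample strings are assumptions for illustration, not part of the tutorial.

```javascript
// Hypothetical helper: returns what each pattern captures for one input line,
// or null when the pattern does not match at all.
function compare(line) {
  const asker = line.match(/(\S.*\S)/);
  const tutorial = line.match(/^\s*(.*)\s*$/);
  return {
    asker: asker ? asker[1] : null,
    tutorial: tutorial ? tutorial[1] : null,
  };
}

// Two characters, one character, and whitespace only.
for (const line of ['  ab  ', '  a  ', '     ']) {
  console.log(JSON.stringify(line), JSON.stringify(compare(line)));
}
```

For the single-character and all-whitespace lines, `asker` comes back as null, while the tutorial's pattern still matches (capturing `"a  "` and an empty string respectively).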

But, given the problem description at your link:

Occasionally, you'll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor's search a replace and a regular expression to extract the content of the lines without the extra whitespace.

Given that goal, matching only the non-whitespace content (as you're doing) probably wouldn't remove the undesirable leading and trailing spaces in a search and replace, because the spaces lie outside the match and are left untouched. The tutorial is probably trying to guide you toward a technique that matches a whole line with a particular pattern and then replaces that line with only the captured group, like:

Match ^\s*(.*\S)\s*$, replace with $1: https://regex101.com/r/584uVG/2/
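Outside an editor, the same match-and-replace idea can be sketched in JavaScript. This is only an approximation of what an editor does: it assumes the text is split on "\n" and the pattern is applied to each line separately, and the sample text and the name `trimLines` are made up for the example.

```javascript
// Hypothetical helper: apply the whole-line pattern to each line in turn and
// keep only the captured group ($1).
function trimLines(text) {
  return text
    .split('\n')
    .map((line) => line.replace(/^\s*(.*\S)\s*$/, '$1'))
    .join('\n');
}

// Made-up log-like sample with uneven indentation and trailing spaces.
const messy = '   foo   \nbar\n  baz qux  ';
console.log(JSON.stringify(trimLines(messy))); // "foo\nbar\nbaz qux"
```

Working line by line sidesteps the fact that \s also matches newlines, which matters if the pattern is run over multi-line text in one pass. A line made up only of whitespace does not match this pattern (it needs at least one \S), so such a line would be left unchanged.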

Your technique would work for this problem if you had a way to make a new text file containing only the captured groups (or all the full matches), e.g.:

const input = `   foo   
bar
  baz   
qux  `;
const newText = (input.match(/\S(?:$|.*\S)/gm) || [])
  .join('\n');
console.log(newText);

Using \S instead of . is not bad. If you know a particular position must be matched by a non-space character rather than by a space, using \S is more precise, makes the intent of the pattern clearer, and lets a bad match fail faster. It can also avoid problems with catastrophic backtracking in some cases. These patterns don't have backtracking issues, but it's still a good habit to get into.

CertainPerformance
  • 3
    I appreciate the thorough answer! I particularly appreciate your explanation of looking to match the entire line rather than just a segment of it. That made it clearer why the abbreviated versions - for example, `^\s*(.*\S)` - wouldn't work. It would, however, avoid the problem of only capturing two or more characters, I think? I also appreciate the (possibly unintended) introduction to the regex101 tool. I'd heard of it but hadn't seen an example of it in use, so that was appreciated, too. – Catija Jul 25 '20 at 06:56