I am looking for a regular expression that works in the JavaScript regexp engine and satisfies the following requirements.

I have a file with content structured in the following way (the text within the box):

       Column 1        Column 2     Column 3
       _______________________________________________________________________________________________
line  1|Heading 1     Heading 2     Heading 3                                                        |
line  2|      123           456     Quisque imperdiet nibh nec fermentum sollicitudin.               |
line  3|                            Vestibulum eu   elit rutrum, eleifend ligula eu, interdum massa. |
line  4|      789           012     Suspendisse vel   urna vulputate, porta ex ut, varius felis.     |
line  5|                            Praesent a metus faucibus, porttitor magna at, fermentum libero. |
line  6|                                                                                             |
line  7|                                                                                             |
line  8|Heading 1     Heading 2     Heading 3                                                        |
line  9|      123           456     Quisque imperdiet nibh nec fermentum sollicitudin.               |
line 10|                            Vestibulum eu   elit rutrum, eleifend ligula eu, interdum massa. |
line 11|      789           012     Suspendisse vel   urna vulputate, porta ex ut, varius felis.     |
line 12|                            Praesent a metus faucibus, porttitor magna at, fermentum libero. |
       |_____________________________________________________________________________________________|

Note that the file does not contain tabs, only spaces, but I would prefer if the regular expression was extended to be able to handle tabs.

Column Description:

  • The heading lines are simply letters. I already know how to create a regular expression to match the heading lines.

  • Each of the first two columns is either empty or contains a single number with an arbitrary number of digits.

  • The third column can have any combination of letters, numbers, and some special characters (brackets of any type: curly, round, angle; forward slash, period, hyphen, equals sign)

    • The third column may contain a number followed by a space followed by a word or special character (these examples are valid entries in the third column, 5 RANDOMWORD, 5 (10), 5 AND 10)

    • The third column will never contain: (1) a single number, (2) only numbers separated by spaces

I want a regular expression which will allow me to match extra spaces (either two or more spaces, tabs, or any combination of tabs or spaces) in the contents in the third column so I can easily delete them. The goal is to find multiple spaces in the third column and replace them with a single space.

I want to ignore the heading lines completely.

I also do not want to match the spaces around the numbers present in the first two columns. Note that the first two columns may not always contain numbers.

The regular expression I have been able to piece together so far looks like this:

/(?=^(?:(?!Heading 1 Heading 2 Heading 3).)*$)([ \t]*[\S]+[^\n]*)[ \t]{2,}/

  • The /(?=^(?:(?!Heading 1 Heading 2 Heading 3).)*$)/ allows me to ignore heading lines completely.

  • The /([ \t]*[\S]+[^\n]*)[ \t]{2,}/ allows me to find multiple spaces in the lines which do not have numbers in the first two columns. However, the problem with this one is that it will match the space after numbers in the second column (like in lines 2 and 9), which I do not want to do.

If JavaScript supported lookbehind, I think this problem would be easy to solve. Without it, I am at a loss as to how to solve it.

Edit 1: Apologies, my original question was not clear. I am not looking for Javascript code, but merely a regular expression that works in the Javascript regexp engine.

Also, my preference would be a single regexp expression as opposed to doing it in multiple steps.

Edit 2: More details added in the specifications.

Edit 3: Lookbehind assertions have been accepted into the JavaScript standard and are supported by some, but not all, JavaScript engines as of this writing. See: Javascript: negative lookbehind equivalent?. This might be possible with a single regexp using lookbehinds, but I have not tested this yet.
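
To illustrate the lookbehind idea from Edit 3, here is a minimal sketch of a single regexp. It is not a tested solution from this thread, and it rests on two loudly stated assumptions: column 3 always starts at character offset 28 (measured from the sample above), and no tabs appear before that offset. The name `collapseColumn3` is made up for the example, and the engine must support lookbehind.

```javascript
// Assumptions (not from the original post): column 3 starts at offset 28,
// heading lines start with the literal text "Heading", and only spaces
// (no tabs) appear before column 3.
const extraSpacesInColumn3 =
  /(?<=^.{28,})(?<!^Heading.*)[ \t]{2,}/gm;
//  ^^^^^^^^^^^^ at least 28 characters precede the match on the same line
//              ^^^^^^^^^^^^^^^ and that line does not start with "Heading"
//                             ^^^^^^^^^ a run of two or more spaces/tabs

// Hypothetical helper: replaces every such run with a single space.
function collapseColumn3(text) {
  return text.replace(extraSpacesInColumn3, ' ');
}
```

Because every match must begin at offset 28 or later, the gaps around the numbers in the first two columns are never touched. The hard-coded offset is the weak point: if the column width changes, or tabs appear in the first two columns, the pattern has to be adjusted.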

Thanks a lot for your help.

mr7432631
  • Could you define `column` precisely ? It will be the base of your regex requirements. – Logar Aug 24 '18 at 13:52
  • You shouldn't add spaces manually in regex as you are doing with `Heading 1 Heading 2` you should instead do `Heading\s1\s{5}Heading\s2` where `\s{5}` shows 5 spaces – Tom Aug 24 '18 at 14:13
  • Good point about the spaces. I'll keep that in mind. – mr7432631 Aug 24 '18 at 14:14
  • @logar Added description of columns. – mr7432631 Aug 24 '18 at 14:26
  • That's a good start. I'd add an obvious point: `The columns in all lines start at the same offset.` From there you could do it in 2 steps: first, locate the start offset of the third column; second, use a simple regex on each line to collapse each run of spaces into one, starting at that offset. Would it be OK for you to do it this way, and not with only one regular expression? – Logar Aug 24 '18 at 14:36
  • Apologies, my original question was not clear. I am not looking for Javascript code, but merely a regular expression that works in the Javascript regexp engine. – mr7432631 Aug 24 '18 at 15:00
  • Yeah, I was afraid of this. Deleted my answer – Logar Aug 24 '18 at 15:07
  • I don't think you can fulfill your requirements with merely ONE regular expression. It takes at least 3 steps for each line to achieve your goal: (1) extract column 3; (2) clean up excess whitespace in column 3; (3) put the result back into the line. – KaiserKatze Aug 24 '18 at 15:12
  • Is it possible that the third column would contain only numbers? – Logar Aug 24 '18 at 15:14
  • @logar No, the third column cannot only contain numbers. It may have a number followed by a word or a special character, but not a number by itself. To be clear, the third column will never contain: (1) a single number, (2) only numbers separated by spaces. – mr7432631 Aug 24 '18 at 15:20
  • Just to elaborate some more on my previous comment, the third column may contain a number followed by a space followed by a word or special character (these examples are valid entries in the third column, `5 RANDOMWORD`, `5 (10)`, `5 and 10`) – mr7432631 Aug 24 '18 at 15:55
  • Oh, well, sorry man I give up :D – Logar Aug 24 '18 at 16:01
  • I think you should put all these specifications in your question so other people can have a quick look at it without reading all the comments – Logar Aug 24 '18 at 16:02

3 Answers

A regular expression will not work in this context because of the fact that the first two columns may be omitted, and the fact that the character set for the first two columns is a subset of the character set of the third column. There is, therefore, no way to distinguish the start of the third column without knowing the width of the columns.

The only way I can think of to solve this problem is to examine the row with the headings to find out how wide each column is, and to use that to find the beginning of the third column. It should be pretty simple: you should be able to do it with some sort of substring function.
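
A minimal sketch of that idea, processing the file line by line. The helper names (`isHeadingLine`, `normalizeThirdColumn`) are hypothetical, and the heading texts `Heading 1` and `Heading 3` are assumed from the sample in the question; substitute whatever heading match you already have.

```javascript
// Assumption: a heading line starts with the literal text "Heading 1" and
// contains "Heading 3" exactly where column 3 begins.
const isHeadingLine = (line) => line.startsWith('Heading 1');

function normalizeThirdColumn(text) {
  let col3Start = -1; // offset of column 3; stays -1 until a heading is seen
  return text
    .split('\n')
    .map((line) => {
      if (isHeadingLine(line)) {
        col3Start = line.indexOf('Heading 3'); // re-measure for each block
        return line; // heading lines are left untouched
      }
      // Lines before any heading, and lines too short to reach column 3
      // (for example blank separator lines), pass through unchanged.
      if (col3Start < 0 || line.length <= col3Start) return line;
      const firstTwoColumns = line.slice(0, col3Start);
      const thirdColumn = line.slice(col3Start);
      // Collapse runs of two or more spaces/tabs in column 3 only.
      return firstTwoColumns + thirdColumn.replace(/[ \t]{2,}/g, ' ');
    })
    .join('\n');
}
```

Since the offset is re-read from every heading line, blocks with different column widths are handled, and blank lines between blocks are preserved as they are.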

TallChuck

I can't find a solution that'll use only one replace. I think you'll need several iterations over the string.

I believe this would work (/^(?= {20,}| +\d+ +\d+ +\S.* {2,})( +\d+ +\d+ +| +)(\S.*? ) +/gm) but I'm not absolutely sure:

var regex = /^(?= {20,}| +\d+ +\d+ +\S.* {2,})( +\d+ +\d+ +| +)(\S.*? ) +/gm;

const str = `Heading 1     Heading 2     Heading 3
      123           456     Quisque imperdiet nibh nec fermentum sollicitudin.
                            Vestibulum eu   elit rutrum, 5   RANDOM eleifend ligula eu, interdum massa.
      789           012     Suspendisse vel   urna vulputate, porta ex ut, varius felis.
                            Praesent a metus faucibus, porttitor magna at, fermentum libero.


Heading 1     Heading 2     Heading 3
      123           456     Quisque imperdiet nibh nec fermentum sollicitudin.
                            Vestibulum eu   elit rutrum, eleifend ligula eu, interdum massa.
      789           012     Suspendisse vel   urna vulputate, porta ex ut, varius felis.
                            Praesent a metus faucibus, porttitor magna at, fermentum libero.`;
const subst = `$1$2`;

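var result = str;

// The pattern is anchored at ^, so each pass collapses only the first run of
// extra spaces on a line; repeat until no line matches any more.
while (regex.test(result))
  result = result.replace(regex, subst);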


console.log('Substitution result: \n', result);

Side notes:

  • 20 is an arbitrary number that corresponds to what I consider to be the left margin of the paragraphs here;
  • This solution may not be fast at all;
  • That's an excellent first question!
Thomas Ayoub

This is (I think) not possible to accomplish with a mere JavaScript regex. Even if you managed to contort some Frankenstein's Monster of a regex, it would be difficult to maintain.

Given the input text

Heading 1     Heading 2     Heading 3                                                       
      123           456     Quisque imperdiet nibh nec fermentum sollicitudin.              
                            Vestibulum eu   elit rutrum, eleifend ligula eu, interdum massa.
      789           012     Suspendisse vel   urna vulputate, porta ex ut, varius felis.    
                            Praesent a metus faucibus, porttitor magna at, fermentum libero.


Heading 1     Heading 2     Heading 3                                                       
      123           456     Quisque imperdiet nibh nec fermentum sollicitudin.              
                            Vestibulum eu   elit rutrum, eleifend ligula eu, interdum massa.
      789           012     Suspendisse vel   urna vulputate, porta ex ut, varius felis.    
                            Praesent a metus faucibus, porttitor magna at, fermentum libero.

One can do

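// Blocks are separated by one or more blank (or whitespace-only) lines.
const blocks = text.split(/\n(?:[ \t]*\n)+/);
const result = blocks
  .map(block => {
    const [headingRow, ...rows] = block.split('\n');
    // Column 3 starts where its heading starts (note the space in 'Heading 3').
    const heading3index = headingRow.indexOf('Heading 3');
    const cleaned = rows.map(row => {
      const [start, col3] = [row.slice(0, heading3index), row.slice(heading3index)];
      // Collapse runs of spaces/tabs in column 3 only.
      return start + col3.replace(/[ \t]{2,}/g, ' ');
    });
    // Keep the heading row, untouched, at the top of its block.
    return [headingRow, ...cleaned].join('\n');
  })
  .join('\n\n'); // blocks are re-joined with a single blank line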
Jared Smith