Using Regex to Split Mathematical Formula into Array

Question

I want to take a flat formula string and split it into an array according to a few rules. I'm stuck on handling parentheses and would appreciate some help.

I've been using a regex scan plus a few filters to try to build the resulting array.

My current tests are:

describe 'split algorithm' do
  it 'can split a flat algorithm' do
    algo = 'ABC * DEF * GHI Round(3) = JKL * MNO * PQR Round(0) = SAVE'
    actual = split_algo(algo)
    expected = ['ABC', '* DEF', '* GHI', 'Round(3)', '= JKL', '* MNO', '* PQR', 'Round(0)', '= SAVE']
    expect(actual).to eq expected
  end

  it 'can split an algorithm with parenthesis' do
    algo = '(ABC + DEF + (GHI * JKL)) - ((MNO + PQR + (STU * VWX)) * YZ) Round(0) + SUM(AAA) = SAVE'
    actual = split_algo(algo)
    expected = ['(', 'ABC', '+ DEF', '+', '(', 'GHI', '* JKL', ')', ')', '-', '(', '(', 'MNO', '+ PQR', '+', '(', 'STU', '* VWX', ')', ')', '* YZ', ')', 'Round(0)', '+ SUM', '(', 'AAA', ')', '= SAVE']
    expect(actual).to eq expected
  end
end

With the following code, I can get the first test to pass just fine:

def split_algo(algorithm)
  # Either a space, an operator and a space followed by non-space characters,
  # or a plain run of non-space characters.
  pattern = /(?:(\ (\*\ |\+\ |\-\ |\\\ |\=\ )\S*))|(\S*)/
  # Drop the nil captures, then keep the longest capture of each match.
  matches = algorithm.scan(pattern).map(&:compact)
  arr = matches.map { |match| match.max_by(&:length).strip }
  arr.delete('')
  arr
end

I tried extending the pattern with a parenthesis matcher:

pattern = /(\(|\))|(?:(\ (\*\ |\+\ |\-\ |\\\ |\=\ )\S*))|(\S*)/

But that only captures the parenthesis at the beginning of the formula.
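
A likely reason the parentheses get swallowed is that `\S*` matches any run of non-space characters, parentheses included, so a parenthesis is only matched on its own when a scan attempt happens to start on it. The sketch below illustrates this with two simplified patterns (neither is the pattern from the question). Excluding parentheses from the character class is one possible direction, offered as an assumption rather than a drop-in fix.

```ruby
# Simplified illustration: \S+ treats parentheses like any other non-space character.
sample = '(GHI * JKL))'

greedy = sample.scan(/\S+/)
# => ["(GHI", "*", "JKL))"]

# Trying a single parenthesis first, and excluding parentheses from the
# word class, keeps every parenthesis as its own element.
separated = sample.scan(/[()]|[^\s()]+/)
# => ["(", "GHI", "*", "JKL", ")", ")"]
```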

Related: https://stackoverflow.com/questions/546433/regular-expression-to-match-balanced-parentheses and https://stackoverflow.com/questions/6331065/matching-balanced-parenthesis-in-ruby-using-recursive-regular-expressions-like-p — Jordan Running, Apr 02 '19 at 18:49

score 0 · Answer 1 · answered Apr 02 '19 at 18:51

I wound up doing the following, which seems to work:

I added a call to a new method, split_paren(arr), at the end of split_algo.

def split_paren(algo_arr)
  round_call = /Round\(\d*\)/
  algo_arr.flat_map do |step|
    # Leave Round(n) steps intact.
    next [step] if step =~ round_call
    # Split every other step around parentheses, keeping the parentheses.
    step.split(/(\(|\))/).reject(&:empty?).map(&:strip)
  end
end
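
As a possible single-pass alternative to post-processing the array, the scan pattern itself can be ordered so that Round(n) is tried first, then a lone parenthesis, then an optional operator followed by a word. This is only a sketch: the method name split_algo_scan and the local token are hypothetical, and the pattern has been checked only against the two formulas from the question.

```ruby
# Hypothetical helper (not from the original post): tokenize in one scan.
def split_algo_scan(algorithm)
  token = %r{
    Round\(\d*\)      # keep Round(n) in one piece
    | [()]            # any other parenthesis stands alone
    | (?:[-+*/=][ ])? # optional operator plus one space
      [^\s()]+        # attached to the following word, else the bare operator
  }x
  # No capture groups, so scan returns a flat array of matched strings.
  algorithm.scan(token)
end
```

For instance, split_algo_scan('A + (B * C)') returns ["A", "+", "(", "B", "* C", ")"]. When an operator is followed by a parenthesis, the optional group backtracks and the operator is emitted by itself.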

If anyone has a better way to do this, please feel free to respond. Otherwise I'll accept my own answer and close the question here in a bit.

Will do. As often as I peruse, I'm not well versed in the etiquette here. Always opportunities to learn! :) — PanoramicPanda, Apr 02 '19 at 22:02

Cary Swoveland · Accepted Answer · 2019-04-03T00:39:21.500

We can define the following regular expression.

R = /
    # split after an open paren if not followed by a digit
    (?<=\()      # match is preceded by an open paren, pos lookbehind
    (?!\d)       # match is not followed by a digit, neg lookahead
    [ ]*         # match >= 0 spaces
    |            # or
    # split before an open paren if paren not followed by a digit
    (?=          # begin pos lookahead
      \(         # match a left paren...
      (?!\d)     # ...not followed by a digit, neg lookahead
    )            # end pos lookahead
    [ ]*         # match >= 0 spaces        
    |            # or
    # split before a closed paren if paren not preceded by a digit
    (?<!\d)      # do not follow a digit, neg lookbehind
    (?=\))       # match a closed paren, pos lookahead
    [ ]*         # match >= 0 spaces        
    |            # or
    # split after a closed paren
    (?<=\))      # match a preceding closed paren, pos lookbehind
    [ ]*         # match >= 0 spaces        
    |            # or
    # split on spaces not preceded by an operator
    (?<![*=+\/-]) # match is not preceded by one of '*=+\/-', neg lookbehind
    [ ]+         # match one or more spaces
    |            # or
    # split on spaces followed by an open paren
    [ ]+         # match one or more spaces
    (?=\()       # match a left paren, pos lookahead
    /x           # free-spacing regex definition mode

In the first example we have the following.

algo1 = 'ABC * DEF * GHI Round(3) = JKL * MNO * PQR Round(0) = SAVE'
expected1 = ['ABC', '* DEF', '* GHI', 'Round(3)', '= JKL', '* MNO',
             '* PQR', 'Round(0)', '= SAVE']
algo1.split(R) == expected1
  #=> true

In the second example we have the following.

algo2 = '(ABC + DEF + (GHI * JKL)) - ((MNO + PQR + (STU * VWX)) * YZ) Round(0) + SUM(AAA) = SAVE'
expected2 = ['(', 'ABC', '+ DEF', '+', '(', 'GHI', '* JKL', ')', ')', '-',
             '(', '(', 'MNO', '+ PQR', '+', '(', 'STU', '* VWX', ')', ')',
             '* YZ', ')', 'Round(0)', '+ SUM', '(', 'AAA', ')', '= SAVE']
algo2.split(R) == expected2
  #=> true
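
Several branches of R are lookarounds that match at a position rather than on characters, which is how String#split can cut between adjacent characters without discarding either one. A minimal sketch with a much smaller pattern (not the R above):

```ruby
# Split after "(" and before ")" using zero-width lookarounds.
pieces = 'a(b)'.split(/(?<=\()|(?=\))/)
# => ["a(", "b", ")"]

# Splitting on the characters themselves consumes them instead.
consumed = 'a(b)'.split(/[()]/)
# => ["a", "b"]
```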

The regular expression is conventionally written as follows.

R = /(?<=\()(?!\d) *|(?=\((?!\d)) *|(?<!\d)(?=\)) *|(?<=\)) *|(?<![*=+\/-]) +| +(?=\()/

In free-spacing mode I enclosed each space in a character class ([ ]); otherwise the spaces would be stripped out before the expression is evaluated. That's not necessary when the regex is written conventionally.
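
A small sketch of that difference, using throwaway patterns unrelated to R:

```ruby
# In free-spacing (x) mode a bare space in the pattern is ignored,
# so /a b/x behaves like /ab/.
ignored = /a b/x.match?('ab')
# => true
bare = /a b/x.match?('a b')
# => false

# A space inside a character class is kept as a literal space.
kept = /a[ ]b/x.match?('a b')
# => true
```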

I did not know about free-spacing regex! That's so much cleaner to read. I knew there had to be a quicker way to get the same result. Thank you so much. This really expands my understanding of how Regex matchers work within split as well. — PanoramicPanda, Apr 02 '19 at 23:51
So to include division and subtraction in our matcher, we would simply change this group at the end `(? — PanoramicPanda, Apr 03 '19 at 00:04
Ooh! Even managed to get rid of needing the `.strip` that was there. This works by matching the non-space in between the things we actually care about keeping, yes? Wanting to make sure I'm getting the understanding of this correctly. — PanoramicPanda, Apr 03 '19 at 01:29
I got rid of `.map(&:strip)` by also splitting on spaces followed by a left parenthesis (at the end of the regex). But yes, some of the splitting involving parentheses is between adjacent characters. — Cary Swoveland, Apr 03 '19 at 02:26
