1

I have data and I need to split each block so as to store each block in separate row. The entire text looks like:

م
مطروح
الحمام
school
الصف
:
الصف الأول
 1
 458316219 
 30709101600371 
ابراهيم وليد ابراهيم ابوالحمد
منافذ فورى
 2
 458361688 
 30702263300318 
احمد ابوالريش فرج عبدالله
منافذ فورى
 3
 458312720 
 30703143300418 
اسلام فتحى محمد ناجى
منافذ فورى
 4
 458790904 
 30606101802299 
اسلام نصار حسين نصار حسين عبد الونيس
منافذ فورى
 5
 458312908 
 30612013300259 
ايمن راضى صالح سلومه
منافذ فورى
 6
 458884564 
 30802203300186 
بسمه محمد ابراهيم ظدم
منافذ فورى
 7
 477625786 
 30708263300235 
بشار نصر الله مصوف السايب
منافذ فورى

I used https://regex101.com/ and I could define the start of each block like that

\d{1,3}\n

This highlights the start of each block

How can I split and separate each block >> and each block has to be in one row?

Here's the HTML for the whole page: https://pastebin.com/nu0dLvch

Here's a link of the full data: https://pastebin.com/dWcu97Wt

I would highlight the needed parts(these are the groups to match). Starting with...

ending with...

There are 22 blocks of data (groups) in total.

Looking at the regex provided by @Wiktor Stribiżew in comments: https://regex101.com/r/dmCNuH/1

match 11 is the first real needed data (match group) though truncates the final line.

enter image description here

After the amazing pattern I got it from Wiktor, I tried to get all the matches

Sub Test()
    Dim a(), s As String, i As Long, j As Long
        Dim bot As New ChromeDriver
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        s = .FindElementByCss("table[id='all']").Text

    End With
        a = GetMatches(s, "^\s*\d{1,3}(?:(?:\r\n|[\r\n])(?!\s*\d{1,3}\n).*)+")
        For i = LBound(a) To UBound(a)
            Debug.Print a(i)
        Next i
End Sub

Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As Variant
    Dim arrMatches(), matches As Object, iMatch As Object, s As String, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern
        If .Test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                arrMatches(i) = iMatch.SubMatches.Item(0)
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

But this doesn't work for me and throws an error.

YasserKhalil
  • 5,029
  • 2
  • 13
  • 39
  • 1
    Try `^\s*\d{1,3}(?:(?:\r\n|[\r\n])(?!\s*\d{1,3}\n).*)*` with `regExp.Multiline = True`. – Wiktor Stribiżew Jun 12 '20 at 10:26
  • Where from do you have the string in the picture? From a text file, or from an Excel cell? I am asking that, due to the fact that splitting on rows can be easily done with standard VBA. And result to be pasted in excel, on separate cells... – FaneDuru Jun 12 '20 at 10:43
  • There are a lot of leading-trailing spaces. Would they always be there on the first and second line after the start of a block? This, so we can differentiate between the start of a block and the lines with digits below. – JvdV Jun 12 '20 at 10:51
  • Thanks a lot for replies. @Wiktor Stribiżew Can you show me how to extract each block and store each element in that block so as to get a valid row for each block? – YasserKhalil Jun 12 '20 at 11:01
  • Not sure what you mean. Show your current code please. – Wiktor Stribiżew Jun 12 '20 at 11:03
  • I am working on a code that I have used selenium and the website has credentials which is not allowed to be public. I have used this line `Debug.Print .FindElementByCss("table[id='all']").Text`. This is the line that has the data such as I posted in the post. I have updated the post with the whole HTML page. – YasserKhalil Jun 12 '20 at 11:05
  • Perhaps also show the desired output format – QHarr Jun 12 '20 at 15:55
  • The desired output is the same as the appeared table on the HTML page .. – YasserKhalil Jun 12 '20 at 21:06
  • @QHarr Any other questions my tutor? – YasserKhalil Jun 13 '20 at 12:06
  • @WiktorStribiżew Can you join the discussion, please? https://chat.stackoverflow.com/rooms/215907/discussion-between-qharr-and-yasserkhalil – YasserKhalil Jun 14 '20 at 18:21
  • I have updated the question .. – YasserKhalil Jun 14 '20 at 18:38
  • 1
    Do you need https://regex101.com/r/dmCNuH/2? – Wiktor Stribiżew Jun 14 '20 at 20:06
  • Amazing. Really amazing. I thought it is impossible to do that by regex. Regex is very powerful. Can you please guide me on how to extract those matches using VBA? Please put an answer to select it as the answer to the question. – YasserKhalil Jun 14 '20 at 20:08
  • Can you give me details and explanation of the pattern so as to learn as I am concerned more with learning such issues? – YasserKhalil Jun 14 '20 at 20:33
  • @WiktorStribiżew I have a problem with the last match which is Match 22 (this includes other data that I don't need .. as each block is of five parts only). Can we use regex to define the number of carriage returns? – YasserKhalil Jun 14 '20 at 22:47
  • 1
    @YasserKhalil You probably want the regex to stop also before a line that looks like a date string, right? Then you need https://regex101.com/r/dmCNuH/4, `^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n|\d{4}/\d{2}/\d{2}\n).*)+`. If you have line breaks inside a cell in Excel, those are probably CRs, so you need `\r` instead of `\n`. – Wiktor Stribiżew Jun 14 '20 at 23:04
  • Yes. That's amazing. But I found some of the pages don't have the date. Can we rely on the number of lines for each block? I mean to specify the needed number of lines. – YasserKhalil Jun 15 '20 at 07:19
  • And how many do you plan to match? Is the number constant? Is it 4? `^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}`? See [demo](https://regex101.com/r/dmCNuH/5) – Wiktor Stribiżew Jun 15 '20 at 07:21
  • Thank you very much. That's better. Last request, please put an answer to explain the pattern as I noticed in your posts. – YasserKhalil Jun 15 '20 at 07:22

2 Answers2

2

You may use

^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}

See the regex demo. Use with .Global = True and .MultiLine = True options, you do not need to set .IgnoreCase to True.

NOTE: Since \r, carriage return, is used inside Excel cell values to define a line break, you may need to replace all \n chars in the pattern with \r.

The regex matches a line that may be indented or not and contains 1, 2 or 3 digits, and then grabs the next four lines that do not match the initial pattern.

More details

  • ^ - start of a line
  • \s* - 0 or more whitespace characters
  • \d{1,3} - one to three digits
  • (?:\n(?!\s*\d{1,3}\n).*){4} - a non-capturing group matching four ({4}) occurrences of
    • \n - a newline character (\n) that is...
    • (?!\s*\d{1,3}\n) - (negative lookahead) not immediately followed with:
      • \s* - 0 or more whitespaces
      • \d{1,3} - one, two or three digits
      • \n - a newline char
    • .* - any 0 or more characters other than line break characters, as many as possible.

To extract detailed information with groups, you may use

^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)

See this regex demo

  • ^ - start of string
  • [^\S\n]* - 0 or more whitespace characters other than a newline char
  • (\d{1,3}) - one to three digits
  • \n - a newline
  • \s* - any 0+ whitespaces
  • (\d{6,}) - Group 2:
  • [^\S\n]*\n\s* - 0 or more whitespace characters other than a newline char, a newline and then any 0 or more whitespaces
  • (\d{14}) - Group 3: fourteen digits
  • [^\S\n]*\n - 0 or more whitespace characters other than a newline char and a newline char
  • (.+) - Group 4: any one or more characters other than line break chars, as many as possible
  • \n - a newline
  • (.+) - Group 5: any one or more characters other than line break chars, as many as possible
Wiktor Stribiżew
  • 484,719
  • 26
  • 302
  • 397
  • Thank you very much. Regex is very interesting but it looks difficult for me. How can I overcome this difficulty? – YasserKhalil Jun 15 '20 at 11:36
  • 1
    @YasserKhalil Practice as much as possible, you will feel more certain. Use http://regexone.com as a starter. Check http://regular-expressions.info and http://rexegg.com. – Wiktor Stribiżew Jun 15 '20 at 11:50
  • %c5%bcew I have read about grouping and I like to know if it is possible in that case to group each part of the five parts of the full match. I tried this `(^\s*\d{1,3})(\n(?!\s*(\d{1,3})\n).*){4}` but I got only the first part and the last part. I couldn't do more. – YasserKhalil Jun 16 '20 at 08:27
  • 1
    It is not possible with VBA regex. [Repeated capturing groups](https://www.regular-expressions.info/captureall.html) behave so in most regex libraries. – Wiktor Stribiżew Jun 16 '20 at 08:28
  • Thank you very much for your great help. I appreciate a lot this awesome help. – YasserKhalil Jun 16 '20 at 08:33
  • Is it possible to use groups if each part defined .. The first part is between 1 and 3 digits - The second part is a number between 6 and 12 digits - The third number is 14 digits - The fourth part is long text - the fifith is text too between 11 to 20 characters – YasserKhalil Jun 16 '20 at 10:39
  • 1
    @YasserKhalil Yes, but you will have to "unwrap" the quantified part and use a much longer pattern with the groups where you need them. – Wiktor Stribiżew Jun 16 '20 at 10:42
  • I have tried this pattern `(^\s*\d{1,3}\n)(\s\d{6,}\s\n)(\s\d{14}\s\n)(.+\n)(.+\n)` .. Can you review it for me and correct if there are any errors? – YasserKhalil Jun 16 '20 at 10:57
  • 1
    @YasserKhalil Try [this one](https://regex101.com/r/dmCNuH/6). – Wiktor Stribiżew Jun 16 '20 at 11:03
  • Amazing. Thanks a lot. And so sorry for these many questions as I am trying to learn about regex. I feel confident than before. Would you please edit the answer and add an explanation for this too, if you don't mind of course? – YasserKhalil Jun 16 '20 at 11:07
  • Is it possible to use this part `[^\S\n]*` like that `\s*`? – YasserKhalil Jun 16 '20 at 12:16
  • 1
    @YasserKhalil If you do not care if the match is on a single line, sure. – Wiktor Stribiżew Jun 16 '20 at 15:16
1

Thanks a lot for Wiktor and QHarr for helping me a lot with this issue. I appreciate a lot their help. Here is the final code and I welcome any other ideas or modifications to the code

Sub Test()
    Dim x, a(1 To 1000, 1 To 5), bot As New ChromeDriver, col As Object, sInput As String, sPattern As String, i As Long, j As Long, cnt As Long
    sPattern = "^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}"
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        sInput = .FindElementByCss("table[id='all']").Text
    End With
    With CreateObject("VBScript.RegExp")
        .Global = True: .MultiLine = True: .IgnoreCase = True
        .Pattern = sPattern
        If .Test(sInput) Then
            Set col = .Execute(sInput)
            For i = 0 To col.Count - 1
                x = Split(col.Item(i), vbLf)
                cnt = cnt + 1
                For j = LBound(x) To UBound(x)
                    a(i + 1, j + 1) = Application.WorksheetFunction.Clean(Trim(x(j)))
                Next j
            Next i
        End If
    End With
    ActiveSheet.Range("A1").Resize(cnt, UBound(a, 2)).Value = a
End Sub
YasserKhalil
  • 5,029
  • 2
  • 13
  • 39