-2

I have a data set that looks like this

abcdefg-1004958-2019-services

xyz-3104568-technology ltd-UG32594

xxfgdrtg-GH267384-FO(1082564)-2016-software

FO501117898-ahdndje-2016-service

I need to extract 1004958, 3104568, 1082564, 501117898 from this dataset. In other words, anything greater than 100000 must be returned.

Is there any formula to do this? Because I have more than 10000 entries like this, which are in uneven order, and I cannot use text to columns as well.

Any help will be greatly appreciated. Thank you!

JvdV
  • 41,931
  • 5
  • 24
  • 46
  • I think Regular Expressions are a better solution for this problem. Regex can be very powerful, but it has a disadvantage that it might become (processor)time consuming. In your case the regex is quite simple: `\d{6,}` which means "find any part in the string of 6 or more digits on a row". [Here](https://stackoverflow.com/a/22542835/1770778) is explained how to use regex in excel, but there are numerous other tools that can do this regular expression task for you. – ffonz Feb 26 '21 at 15:02

3 Answers3

3

Using Microsoft365, you could try:

enter image description here

Formula in B1:

=MAX(FILTERXML("<t><s>"&CONCAT(IFERROR(MID(A1,SEQUENCE(LEN(A1)),1)*1,"</s><s>"))&"</s></t>","//s[.*0=0]"))
JvdV
  • 41,931
  • 5
  • 24
  • 46
  • 1
    Thanks a million, this worked like a charm – Gagan Ganapathy Feb 26 '21 at 17:10
  • 1
    Very cool trick JvDV. Dig the post https://www.debugcn.com/en/article/21139068.html – Chris Feb 26 '21 at 21:08
  • @Chris, thanks. Seems like something/someone copied that over from SO. =) – JvdV Feb 26 '21 at 21:18
  • @JvdV - ha. Just spent the past hr putting all these usecases into a workbook. Sooo helpful!! Noticed a few minor edits though. I think '//s[count(node())=0]' should be '//s[count(node())=1]', '//s[contains(concat(., preceding::*[2]), 'FA')]' didn't produce any results based on the example string, and '//s[not(node())]' threw me a value error. Not sure if that's a copy and paste error on their part, or something in your original post. Thought I'd mention it regardless :) – Chris Feb 26 '21 at 22:21
  • @Chris, as crazy as it looks: `//s[count(node())=0]` is correct syntax. Let's imagine `"ABC|123|DEF||456|XY-1A|ZY-2F|XY-3F||XY-4f|xyz|123"` and you'll see it'll return two error values for the two empty nodes. `//s[not(node())]` is the equivalent of the other one. Both meant to return empty nodes. Excel won't really process these to empty string values but rather returns (multiple) errors unfortunately. Also, `//s[contains(concat(., preceding::*[2]), 'FA')]` is correct syntax and upon trying returns `"DEF"` to me as it should. Not sure if you made a small typo somewhere? – JvdV Feb 27 '21 at 08:59
  • @Chris, note that the link you found and shared does miss some information I added more recently about escaping certain characters. See the full answer over [here](https://stackoverflow.com/q/61837696/9758194). Thanks for sharing a mutual interest in `FILTERXML()` =) – JvdV Feb 27 '21 at 09:02
  • Ugh - realized my issue. I thought you said all nodes would be return, not NO nodes for `//s[count(node())=0]`. Apologies. But still getting a value errors on for `//s[contains(concat(., preceding::*[2]), 'FA')]`. I'm sure the issue is with my as you mention seeing you can't replicate. Many thanks for all the tips and explanations @JvdV. You open up my eyes to a whole other feature set of excel for when folks can't run VBA. – Chris Feb 27 '21 at 18:10
  • @Chris. It would still be interesting to see what exactly you did that caused the error. I re-checked time and time again and it works as it should copying right from what I wrote on SO. If you feel like trying to uncover what is wrong you can invite me to chat and we will try to fix this. – JvdV Feb 27 '21 at 18:30
1

You can use a user defined function. Here's what i came up to:

Function FunNumber(RngTarget As Range, Optional BlnReportAsNumber As Boolean = True, Optional BytPlacement As Byte = 1)
    
    'Declarations.
    Dim StrValue As String
    Dim BytPosition As Byte
    Dim StrString01 As String
    
    'Setting.
    StrValue = RngTarget.Cells(1, 1).Value
    
    'Covering each character in StrValue.
    For BytPosition = 1 To Len(StrValue)
        'Removing any non-numerica character from StrValue.
        If Not IsNumeric(Mid(StrValue, BytPosition, 1)) Then
            StrValue = Left(StrValue, BytPosition - 1) & " " & Right(StrValue, Len(StrValue) - BytPosition)
        End If
    Next
    
    'Removing leading and trailing spaces from StrValue.
    StrValue = Trim(StrValue)
    
    'Covering each character in StrValue but the last one.
    For BytPosition = 1 To Len(StrValue) - 1
        'Removing any multiple space.
        If Mid(StrValue, BytPosition, 2) = "  " Then
            StrValue = Left(StrValue, BytPosition - 1) & " " & Right(StrValue, Len(StrValue) - BytPosition - 1)
            BytPosition = BytPosition - 1
        End If
    Next
    
    'Setting FunNumber.
    FunNumber = ""
    
CP_Small_Number_Removed:

    'Covering each section of StrValue marked by a space.
    For BytPosition = 0 To UBound(Split(StrValue, " "))
        
        'Setting StrString01 as section of StrValue marked by space.
        StrString01 = Split(StrValue, " ")(BytPosition)
        
        'Checking for a number smaller than 100000.
        If Len(StrString01) < 6 Then
            
            'Removing the number smaller than 100000.
            Select Case BytPosition
                Case Is = 0
                    StrValue = Right(StrValue, Len(StrValue) - Len(StrString01) - 1)
                    GoTo CP_Small_Number_Removed
                Case Is < UBound(Split(StrValue, " "))
                    StrValue = Replace(StrValue, " " & StrString01 & " ", " ")
                    GoTo CP_Small_Number_Removed
                Case Is = UBound(Split(StrValue, " "))
                    StrValue = Left(StrValue, Len(StrValue) - Len(StrString01) - 1)
                    GoTo CP_Small_Number_Removed
            End Select
            
        End If
    Next
    
    
    'Setting FunNumber.
    FunNumber = Split(StrValue, " ")(BytPlacement - 1)
    
    'Resetting FunNumber as number if needed.
    If BlnReportAsNumber = True Then
        FunNumber = FunNumber * 1
    End If
    
End Function

Place it in a module. Let's say you string is in cell A1. In another cell write =FunNumber(A1) and you should obtain the desired result. If you want it reported as text (maybe leading zeroes are important) you can write =FunNumber(A1,FALSE). If there are more than one number greater than 100000 and you want to find for example the second, write something like =FunNumber(A1,,2) (you can of course specify if you want a string or a number while also specifying that you want the second or third or whatever number) .If no number greater than 100000 is found then the function returns #VALUE error.

Evil Blue Monkey
  • 1,096
  • 1
  • 5
  • 9
0

For those who can use LibreOffice Calc, the answer is much simpler since Calc supports Regular Expressions by default:

=REGEX(A1;"\d{6,}")

The question was about Excel, but since the difference in complexity of answers is so gigantic I didn't want to withhold the LibreOffice answer.

ffonz
  • 1,138
  • 1
  • 10
  • 24
  • Change to `\d{7,}` to get the right results though. – JvdV Feb 27 '21 at 23:00
  • @JvdV "Baie dankie" for your remark. I know counting so many zero's is difficult, but the questioner asked "anything greater then 100000 must be returned". IMHO that means \d{6,}, although there is a catch: This regex is **greater then or equal to** 100,000. If 100,000 has to be excluded the answer is much more complex. But the accepted answer is also including 100,000, so I don't think this is a problem. – ffonz Feb 28 '21 at 11:23
  • I understand what you mean but your current solution yields the wrong results on the third sample. It was just a heads-up – JvdV Feb 28 '21 at 12:12