-1

I am trying to count the n-digit numbers in a string using bash.

E.g., for 3-digit numbers:

I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3
I have 243 pens for sale #should return 1

Unfortunately, I will not be able to use sed or grep with perl-regexp.

Appreciate any suggestions!

Shrav
  • No. The restriction is because I have to do this on a remote server that doesn't support `perl-regexp`. – Shrav Oct 19 '20 at 23:23
  • 1
    what about `a123`, `#456` , `123-456`, do you want to consider any of them for the result? – thanasisp Oct 19 '20 at 23:29
  • 1
    @thanasisp For my use case, the numbers will never have anything outside [0-9] before/after/between them. – Shrav Oct 19 '20 at 23:40
  • You're not going to be able to match a bounded 3 digits that is separated by a single character `123 456`. Ever thought of using Perl ? You're going to need at least a lookahead capability. –  Oct 20 '20 at 00:01
  • @thanasisp `\b[0-9][0-9][0-9]\b` won't match 123 in `a123b` –  Oct 20 '20 at 00:28
  • @Maxt8r see comment above by Shrav, this case is not included into matches. – thanasisp Oct 20 '20 at 00:30
  • @thanasisp I don't understand `the numbers will never have anything outside [0-9] before/after/between them` If that means [a-zA-Z_] then `\b` is your answer if `\b` is a bash regex capability. –  Oct 20 '20 at 00:34
  • @Maxt8r I refer grep (without PCREs as requested), not bash. Either `\b` or `\< \>`. – thanasisp Oct 20 '20 at 00:38
  • Probably going to be the \< \>, old school. –  Oct 20 '20 at 00:41
  • @Shrav : How should 6 digits in a row be treated? I.e. `text text 555666 more text`. – user1934428 Oct 20 '20 at 05:25
  • @user1934428 I guess 6 digits are treated as a 6-digit number. – thanasisp Oct 20 '20 at 15:50
  • @thanasisp : Seeing from the pattern matching viewpoint, we could also regard it as 4 overlapping strings of 3-digit-numbers, i.e. 555, 556, 566, 666. – user1934428 Oct 21 '20 at 17:46

5 Answers

4

You can use regular expressions in bash.

#! /bin/bash
cat <<EOF |
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3, but should ignore 12345
I have 243 pens for sale #should return 1
123 should work at text boundaries, too 123
EOF
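while IFS= read -r line ; do
    c=0
    while [[ $line =~ ([^[:digit:]]|^)([0-9][0-9][0-9])([^[:digit:]]|$) ]] ; do
        m=${BASH_REMATCH[0]}
        ((++c))
        # An empty third group means the match ended at the end of the line.
        [[ ${BASH_REMATCH[3]} ]] || break
        # Quote $m so it is removed literally, not treated as a glob pattern.
        line=${line#*"$m"}
    done
    echo "$c"
done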

The regex explained:

([^[:digit:]]|^)([0-9][0-9][0-9])([^[:digit:]]|$)
~~~~~~~~~~~~~                                     non-digit
             ~~                                   or the very beginning
                 ~~~~~~~~~~~~~~~                  three digits
                                  ~~~~~~~~~~~~    non-digit
                                              ~~  or the very end

Bash's =~ operator reports only the first match in a string, so after each match we remove the already-processed part of the line before trying again.
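
The same loop can be generalized from three digits to n. The following is only a sketch: the function name `count_ndigit` is made up for illustration, and instead of trimming the line it carries the unprocessed remainder forward in a capture group.

```bash
#!/usr/bin/env bash

# count_ndigit N STRING  (hypothetical helper, not part of the answer above)
# Prints how many runs of exactly N digits STRING contains.
count_ndigit() {
    local n=$1 rest=$2 count=0
    # Reject anything but a positive integer; N=0 would make the loop spin forever.
    [[ $n =~ ^[1-9][0-9]*$ ]] || return 2
    # Group 2 captures the separator after the digits plus the rest of the string.
    local re="(^|[^0-9])[0-9]{$n}([^0-9].*|\$)"
    while [[ $rest =~ $re ]]; do
        rest=${BASH_REMATCH[2]}   # continue with the unprocessed remainder
        ((++count))
    done
    printf '%s\n' "$count"
}

count_ndigit 3 'I have 243 pens for sale'
```

Because the remainder starts with the separator that ended the previous match, two numbers separated by a single character (`123 456`) are both counted.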

choroba
  • Doesn't match the 456 here `123 456` –  Oct 19 '20 at 23:59
  • @Maxt8r: It shows "2" for `123 456`. – choroba Oct 20 '20 at 08:08
  • Oh, sorry about that, didn't realize bash regex has no way of doing global matches. Right I should have realized the only way is to remake / trim the _line_ each pass until no more matches. Note appending `${BASH_REMATCH[3]}` doesn't seem necessary. Also, it gets trickier if multiple overlapping is required. Wow, imagine having to match millions of these things at a time. The sheer overhead of making a new string each time, it would be prohibitively time consuming. So, it might work as a neat demo trick but it's not a method to do large amounts of matches. –  Oct 20 '20 at 18:40
  • @Maxt8r: You're right, I've removed the appending. If performance is an issue, don't use bash. – choroba Oct 20 '20 at 18:52
  • Very true. I've just tutored myself in bash in the last hour just to find out those string syntax you used. Very cryptic, almost a primitive perl. –  Oct 20 '20 at 18:58
  • @Maxt8r: Maybe that's because I think in Perl ;-) – choroba Oct 20 '20 at 20:28
2
echo "$str" | grep -o '\b[0-9]\{3\}\b' | wc -l

This matches 3-digit numbers delimited by word boundaries. A boundary character can be shared by two numbers (e.g. when they are separated by a single comma or space), because \b matches a position rather than consuming a character.

Or like this:

echo "$str" | grep -o '\<[[:digit:]]\{3\}\>' | wc -l
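
For an arbitrary n, the one-liner can be wrapped in a small function. This is a sketch under assumptions: `count_ndigit_grep` is a made-up name, and `-o` and `\<` / `\>` are extensions found in GNU grep rather than POSIX features, so other grep implementations may behave differently.

```bash
#!/usr/bin/env bash

# count_ndigit_grep N STRING  (hypothetical wrapper around the grep one-liner)
count_ndigit_grep() {
    local n=$1 str=$2 count
    # -o prints each match on its own line; wc -l then counts the matches.
    count=$(printf '%s\n' "$str" | grep -oE "\\<[0-9]{$n}\\>" | wc -l)
    # Arithmetic expansion strips the padding some wc implementations add.
    printf '%s\n' "$((count))"
}

count_ndigit_grep 3 'I have 243 pens for sale'
```

Note that word boundaries treat letters and underscores as part of a word, so `a123` is not counted.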
thanasisp
  • I would also offer the slightly more intuitive: ```echo "$str" | grep -o ' [0-9][0-9][0-9] ' | wc -l``` – Lenna Oct 20 '20 at 02:07
  • It would actually print `0`, but I see now that `\b` is needed at either end. – Lenna Oct 20 '20 at 02:17
  • Perhaps to fully answer the question for "n" digits, we must say: `echo "123," | grep -o "\b[0-9]\{${N}\}\b" | wc -l` where `N=3` for this example – Lenna Oct 20 '20 at 02:19
1

Using POSIX shell grammar only:

#!/usr/bin/env sh

# Should return 3
str1='I am trying to extract 3 digited numbers 333, 334, 335 from this string'

# Should return 1
str2='I have 243 pens for sale'

# should return 2
str3='This is 123 456'

_OIFS=$IFS
IFS=$IFS' ,.:;!?-_+=*#$§^&{}[]|`@"()\\/'\'

for str in "$str1" "$str2" "$str3"
do
  count=0
  for word in $str
  do
    case $word in
      [[:digit:]][[:digit:]][[:digit:]])
        count=$((count + 1 ))
        ;;
    esac
  done
  printf 'String:\n%s\n-> Count: %d\n\n' "$str" "$count"
done

IFS=$_OIFS

Output:

String:
I am trying to extract 3 digited numbers 333, 334, 335 from this string
-> Count: 3

String:
I have 243 pens for sale
-> Count: 1

String:
This is 123 456
-> Count: 2
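
The `case` pattern above hard-codes three digits. One possible way to handle any n in POSIX sh is to check that a word contains only digits and then compare its length. In this sketch the function name `count_ndigit_sh` and the shortened separator list are assumptions made for illustration; the function body is a subshell so that the changes to `IFS` and `set -f` do not leak into the caller.

```sh
#!/usr/bin/env sh

# count_ndigit_sh N STRING  (hypothetical helper)
count_ndigit_sh() (
    n=$1 str=$2 count=0
    # Assumed separators: whitespace plus a few punctuation characters; extend as needed.
    IFS=$(printf ' \t\n,.:;!?#()-')
    set -f    # no pathname expansion while the unquoted $str is split
    for word in $str; do
        case $word in
            ''|*[!0-9]*) ;;   # empty, or contains a non-digit: skip
            *) [ "${#word}" -eq "$n" ] && count=$((count + 1)) ;;
        esac
    done
    printf '%s\n' "$count"
)

count_ndigit_sh 3 'This is 123 456'
```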

Léa Gris
  • 1
    That will count words containing *at least* 3 digits, not words of *exactly* 3 digits which is what I interpret the OP to want. – glenn jackman Oct 19 '20 at 23:59
  • @glennjackman thank-you. It is fixed now! – Léa Gris Oct 20 '20 at 00:08
  • Will this match the 456 in this string `123 456` ? Its got to match the 123 as well. –  Oct 20 '20 at 00:14
  • @Maxt8r yes it will count `123 456` as 2 groups of 3 digits. – Léa Gris Oct 20 '20 at 00:18
  • Are you splitting by non digits ? When you say `word in` does that take into account 4 or more digits where three is a subset, or must it match the entire text in word ? Seems like there should be a better _utility_ to do this. –  Oct 20 '20 at 00:25
  • @Maxt8r My `IFS` is setup to split on space, tab, newline (its default) plus any punctuation. so the `case` pattern will only match 3 digits stand-alone. – Léa Gris Oct 20 '20 at 00:34
  • So `$word in` is the same as `$word equals` then ? –  Oct 20 '20 at 00:36
  • So if `$word in` is the same as `$word equals` or special syntax on the regex makes it so, then splitting if resulting in `abc123` or `4567` will not match then. If so, you have a winner. –  Oct 20 '20 at 00:39
  • @Maxt8r just go play with it: https://repl.it/@leagris/TornLopsidedUserinterface#main.sh – Léa Gris Oct 20 '20 at 00:45
0

This assumes the OP wants numbers of exactly 3 digits and is not interested in breaking longer numbers into 3-digit segments; e.g., the string 12345 should return a count of zero rather than 3 ( 123 / 234 / 345 ).


Some sample data:

$ cat numbers.dat
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3
I have 243 pens for sale #should return 1
123 xyz
def 456
def 789-345 abc                    # should match 7-8-9 and 3-4-5
tester876tester                    # should match 8-7-6
testing9999testing                 # should not match 9-9-9-9

$ str=$(cat numbers.dat)           # load data into a variable

A 2-pass grep solution:

NOTE: borrowed thanasisp's word boundary flag (\b)

Find patterns of 3-digits with non-digit book ends (including front/end of line)

$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}"

 333,
 334,
 335
 243
123
 456
 789-
345
r876t

Now strip off the non-digits:

$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}'
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}'

333
334
335
243
123
456
789
345
876

Pipe to wc -l for a count:

$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}' | wc -l
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}' | wc -l

9

Storing count in a variable:

$ counter=$(grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}' | wc -l)
# or
$ counter=$(grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}' | wc -l)

$ echo "${counter}"

9
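
Under the same assumption (every run of exactly n digits counts, whatever surrounds it), a different technique avoids the second grep: turn every non-digit into a newline so that each digit run sits on its own line, then count the lines that are exactly n digits long. This is a sketch, not the method used above; `count_ndigit_runs` is a made-up name.

```sh
#!/usr/bin/env sh

# count_ndigit_runs N STRING  (hypothetical helper)
count_ndigit_runs() {
    # tr -cs: replace every run of non-digits with a single newline.
    # grep -c -x: count the lines that match the pattern in full.
    printf '%s\n' "$2" | tr -cs '0-9' '\n' | grep -cxE "[0-9]{$1}"
}

count_ndigit_runs 3 'def 789-345 abc'
```

grep exits with status 1 when the count is 0, so check the printed number rather than the exit status.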
markp-fuso
0

Could you please try the following? It was written and tested at https://ideone.com/bh6zjR#stdin with the shown samples. Since the OP said in the comments that the digits can't have anything else before, between, or after them (apart from a comma, I believe, going by the samples), this traverses all fields of the current line and uses a regex to find the matches.

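awk '
{
  count=0                       # reset per line so lines without matches print 0
  for(i=1;i<=NF;i++){
    # {3} is an interval expression; it needs a POSIX-compliant awk
    if(match($i,/^[0-9]{3}[,]?$/)){
       count++
    }
  }
  print "Line " FNR " has " count " number of 3 digits."
}
' Input_file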

Output will be as follows.

Line 1 has 3 number of 3 digits.
Line 2 has 1 number of 3 digits.
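
To make the digit count a parameter, and to sidestep interval expressions such as `{3}` (which some older awk implementations do not support), the check can be written with `length()` instead. This is a sketch under those assumptions; the wrapper name `count_ndigit_awk` is made up, and it prints only the count for each input line.

```sh
#!/usr/bin/env sh

# count_ndigit_awk N  (hypothetical helper; reads text from stdin)
count_ndigit_awk() {
    awk -v n="$1" '
    {
        count = 0
        for (i = 1; i <= NF; i++) {
            w = $i
            sub(/,$/, "", w)    # allow one trailing comma, as in the answer
            # Count the field if it is exactly n characters long and all digits.
            if (length(w) == n && w ~ /^[0-9]+$/) {
                count++
            }
        }
        print count
    }'
}

printf '%s\n' 'I have 243 pens for sale' | count_ndigit_awk 3
```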
RavinderSingh13