Bash: How do I extract the count of all the "n" digit numbers in a string?

Question

I am trying to extract the total count of n-digit numbers from a string using bash.

E.g. For a 3 digit number,

I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3
I have 243 pens for sale #should return 1

Unfortunately, I will not be able to use sed or grep with perl-regexp.

Appreciate any suggestions!

No. The restriction is because I have to do this on a remote server that doesn't support `perl-regexp`. — Shrav, Oct 19 '20 at 23:23
what about `a123`, `#456` , `123-456`, do you want to consider any of them for the result? — thanasisp, Oct 19 '20 at 23:29
@thanasisp For my use case, the numbers will never have anything outside [0-9] before/after/between them. — Shrav, Oct 19 '20 at 23:40
You're not going to be able to match a bounded 3 digits that is separated by a single character `123 456`. Ever thought of using Perl ? You're going to need at least a lookahead capability. — , Oct 20 '20 at 00:01
@Maxt8r see comment above by Shrav, this case is not included into matches. — thanasisp, Oct 20 '20 at 00:30
@thanasisp I don't understand `the numbers will never have anything outside [0-9] before/after/between them` If that means [a-zA-Z_] then `\b` is your answer if `\b` is a bash regex capability. — , Oct 20 '20 at 00:34
@Maxt8r I refer grep (without PCREs as requested), not bash. Either `\b` or `\< \>`. — thanasisp, Oct 20 '20 at 00:38
@Shrav : How should 6 digits in a row be treated? I.e. `text text 555666 more text`. — user1934428, Oct 20 '20 at 05:25
@user1934428 I guess 6 digits are treated as a 6-digit number. — thanasisp, Oct 20 '20 at 15:50
@thanasisp : Seeing from the pattern matching viewpoint, we could also regard it as 4 overlapping strings of 3-digit-numbers, i.e. 555, 556, 566, 666. — user1934428, Oct 21 '20 at 17:46

choroba · Answer 1 · 2020-10-20T18:51:47.163

4

You can use regular expressions in bash.

#! /bin/bash
cat <<EOF |
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3, but should ignore 12345
I have 243 pens for sale #should return 1
123 should work at text boundaries, too 123
EOF
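while IFS= read -r line ; do
    c=0
    while [[ $line =~ ([^[:digit:]]|^)([0-9][0-9][0-9])([^[:digit:]]|$) ]] ; do
        m=${BASH_REMATCH[0]}
        ((++c))
        # An empty third group means the match ended at the end of the line,
        # so nothing is left to scan.
        [[ -z ${BASH_REMATCH[3]} ]] && break
        # Drop everything up to and including the match; quoting "$m" keeps it literal.
        line=${line#*"$m"}
    done
    echo "$c"
done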

The regex explained:

([^[:digit:]]|^)([0-9][0-9][0-9])([^[:digit:]]|$)
~~~~~~~~~~~~~                                     non-digit
             ~~                                   or the very beginning
                 ~~~~~~~~~~~~~~~                  three digits
                                  ~~~~~~~~~~~~    non-digit
                                              ~~  or the very end

As bash's `=~` only reports the first match and has no global-match mode, we need to remove the already processed part from the string before trying another match.
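The question asks about "n" digits in general, so here is a minimal sketch of the same match-and-trim loop wrapped in a function that takes the digit count as a parameter. The function name `count_n_digit` is hypothetical, and the sketch assumes `n` is at least 1 and that the regex library behind `[[ =~ ]]` supports `{n}` intervals.

```bash
#!/usr/bin/env bash

# count_n_digit N STRING (hypothetical helper)
# Prints how many runs of exactly N digits STRING contains.
# Assumes N >= 1.
count_n_digit() {
    local n=$1 rest=$2 count=0
    # non-digit or start, N digits, non-digit or end
    local re="(^|[^0-9])[0-9]{$n}([^0-9]|\$)"
    while [[ $rest =~ $re ]]; do
        count=$((count + 1))
        # Empty trailing group: the match reached the end of the string.
        [[ -z ${BASH_REMATCH[2]} ]] && break
        # Trim everything up to and including the match (quoted, so literal).
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
    printf '%s\n' "$count"
}

count_n_digit 3 'I am trying to extract 3 digited numbers 333, 334, 335 from this string'   # prints 3
```

Stopping when the trailing group is empty avoids counting a match at the very end of the string a second time after trimming.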

edited Oct 20 '20 at 18:51

answered Oct 19 '20 at 23:24


Doesn't match the 456 here `123 456` – Oct 19 '20 at 23:59
@Maxt8r: It shows "2" for `123 456`. – choroba Oct 20 '20 at 08:08
Oh, sorry about that, didn't realize bash regex has no way of doing global matches. Right I should have realized the only way is to remake / trim the _line_ each pass until no more matches. Note appending `${BASH_REMATCH[3]}` doesn't seem necessary. Also, it gets trickier if multiple overlapping is required. Wow, imagine having to match millions of these things at a time. The sheer overhead of making a new string each time, it would be prohibitively time consuming. So, it might work as a neat demo trick but it's not a method to do large amounts of matches. – Oct 20 '20 at 18:40
@Maxt8r: You're right, I've removed the appending. If performance is an issue, don't use bash. – choroba Oct 20 '20 at 18:52
Very true. I've just tutored myself in bash in the last hour just to find out those string syntax you used. Very cryptic, almost a primitive perl. – Oct 20 '20 at 18:58
@Maxt8r: Maybe that's because I think in Perl ;-) – choroba Oct 20 '20 at 20:28

thanasisp · Accepted Answer · 2020-10-20T00:49:48.343

2

echo "$str" | grep -o '\b[0-9]\{3\}\b' | wc -l

This way we match 3-digit numbers between word boundaries. A boundary is zero-width, so it can be reused by adjacent matches (e.g. if two numbers are separated by a single boundary character, like a comma or a space).

Or like this:

echo "$str" | grep -o '\<[[:digit:]]\{3\}\>' | wc -l
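If the `\b` and `\<`/`\>` extensions are not available on the remote server's grep, one possible alternative is to split the string on non-digits first and then count the lines that are exactly n digits long. This is only a sketch: the function name `count_digit_runs` is hypothetical, and unlike the word-boundary version it also counts digits glued to letters (e.g. the `876` in `tester876tester`).

```bash
#!/usr/bin/env bash

# count_digit_runs N STRING (hypothetical helper)
# Turns every run of non-digits into one newline, then counts
# the lines made of exactly N digits.
count_digit_runs() {
    printf '%s\n' "$2" |
        tr -cs '0-9' '\n' |
        grep -c "^[0-9]\{$1\}\$"
}

count_digit_runs 3 'I have 243 pens for sale'   # prints 1
```

Note that `grep -c` exits with a non-zero status when the count is 0, even though it still prints `0`.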

edited Oct 20 '20 at 00:49

answered Oct 20 '20 at 00:04


I would also offer the slightly more intuitive: ```echo "$str" | grep -o ' [0-9][0-9][0-9] ' | wc -l``` – Lenna Oct 20 '20 at 02:07
It would actually print `0`, but I see now that `\b` is needed at either end. – Lenna Oct 20 '20 at 02:17
Perhaps to fully answer the question for "n" digits, we must say: `echo "123," | grep -o "\b[0-9]\{${N}\}\b" | wc -l` where `N=3` for this example – Lenna Oct 20 '20 at 02:19

Léa Gris · Answer 3 · 2020-10-20T00:19:25.117

1

Using POSIX shell grammar only:

#!/usr/bin/env sh

# Should return 3
str1='I am trying to extract 3 digited numbers 333, 334, 335 from this string'

# Should return 1
str2='I have 243 pens for sale'

# should return 2
str3='This is 123 456'

_OIFS=$IFS
IFS=$IFS' ,.:;!?-_+=*#$§^&{}[]|`@"()\\/'\'

for str in "$str1" "$str2" "$str3"
do
  count=0
  for word in $str
  do
    case $word in
      [[:digit:]][[:digit:]][[:digit:]])
        count=$((count + 1 ))
        ;;
    esac
  done
  printf 'String:\n%s\n-> Count: %d\n\n' "$str" "$count"
done

IFS=$_OIFS

Output:

String:
I am trying to extract 3 digited numbers 333, 334, 335 from this string
-> Count: 3

String:
I have 243 pens for sale
-> Count: 1

String:
This is 123 456
-> Count: 2
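The same word-splitting idea can be generalized to n digits by building the `case` pattern in a loop. The following is a sketch under a few assumptions: the function name `count_n_digit_words` is hypothetical, the punctuation added to `IFS` is a shortened illustrative list, and the caller's `IFS` is expected to hold its default value. The function body runs in a subshell, so the `IFS` and `set -f` changes do not leak out.

```bash
#!/usr/bin/env sh

# count_n_digit_words N STRING (hypothetical helper)
# The ( ... ) body is a subshell: IFS and set -f stay local to it.
count_n_digit_words() (
    n=$1
    str=$2

    # Build a glob of N digit classes, e.g. [0-9][0-9][0-9] for N=3.
    pat=''
    i=0
    while [ "$i" -lt "$n" ]; do
        pat="${pat}[0-9]"
        i=$((i + 1))
    done

    IFS="${IFS},.;:!?#"   # default whitespace plus some punctuation (assumed list)
    set -f                # no pathname expansion while splitting $str

    count=0
    for word in $str; do
        case $word in
            $pat) count=$((count + 1)) ;;
        esac
    done
    printf '%s\n' "$count"
)

count_n_digit_words 3 'This is 123 456'   # prints 2
```

Because the whole word must match the pattern, words such as `abc123` or `4567` are not counted.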

edited Oct 20 '20 at 00:19

answered Oct 19 '20 at 23:43


That will count words containing *at least* 3 digits, not words of *exactly* 3 digits which is what I interpret the OP to want. – glenn jackman Oct 19 '20 at 23:59
@glennjackman thank-you. It is fixed now! – Léa Gris Oct 20 '20 at 00:08
Will this match the 456 in this string `123 456` ? Its got to match the 123 as well. – Oct 20 '20 at 00:14
@Maxt8r yes it will count `123 456` as 2 groups of 3 digits. – Léa Gris Oct 20 '20 at 00:18
Are you splitting by non digits ? When you say `word in` does that take into account 4 or more digits where three is a subset, or must it match the entire text in word ? Seems like there should be a better _utility_ to do this. – Oct 20 '20 at 00:25
@Maxt8r My `IFS` is setup to split on space, tab, newline (its default) plus any punctuation. so the `case` pattern will only match 3 digits stand-alone. – Léa Gris Oct 20 '20 at 00:34
So `$word in` is the same as `$word equals` then ? – Oct 20 '20 at 00:36
So if `$word in` is the same as `$word equals` or special syntax on the regex makes it so, then splitting if resulting in `abc123` or `4567` will not match then. If so, you have a winner. – Oct 20 '20 at 00:39
@Maxt8r just go play with it: https://repl.it/@leagris/TornLopsidedUserinterface#main.sh – Léa Gris Oct 20 '20 at 00:45

markp-fuso · Answer 4 · 2020-10-20T00:28:43.643

This assumes the OP wants numbers of exactly 3 digits and is not interested in breaking longer numbers down into 3-digit segments; e.g., the string 12345 should return a count of zero as opposed to a count of 3 ( 123 / 234 / 345 ).

Some sample data:

$ cat numbers.dat
I am trying to extract 3 digited numbers 333, 334, 335 from this string #should return 3
I have 243 pens for sale #should return 1
123 xyz
def 456
def 789-345 abc                    # should match 7-8-9 and 3-4-5
tester876tester                    # should match 8-7-6
testing9999testing                 # should not match 9-9-9-9

$ str=$(cat numbers.dat)           # load data into a variable

A 2-pass grep solution:

NOTE: borrowed thanasisp's word boundary flag (\b)

Find patterns of 3 digits with non-digit bookends (including start/end of line):

$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}"

 333,
 334,
 335
 243
123
 456
 789-
345
r876t

Now strip off the non-digits:

$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}'
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}'

333
334
335
243
123
456
789
345
876

Pipe to wc -l for a count:

$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}' | wc -l
# or
$ grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}' | wc -l

9

Storing count in a variable:

$ counter=$(grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' numbers.dat | grep -Eo '[0-9]{3}' | wc -l)
# or
$ counter=$(grep -Eo '(^|[^0-9]|\b)[0-9]{3}(\b|[^0-9]|$)' <<< "${str}" | grep -Eo '[0-9]{3}' | wc -l)

$ echo "${counter}"

9

RavinderSingh13 · Answer 5 · 2020-10-20T00:27:07.793

Could you please try the following, written and tested at https://ideone.com/bh6zjR#stdin with the shown samples. Since the OP said in the comments that the digits can't have anything else before, between, or after them (apart from a comma, I believe, judging by the samples), this walks through all fields of the current line and uses a regex to find the ones that match.

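awk '
{
  count=0                                # start every line at zero so "0" is printed when nothing matches
  for(i=1;i<=NF;i++){
    if(match($i,/^[0-9]{3}[,]?$/)){      # field is exactly 3 digits, optionally followed by a comma
       count++
    }
  }
  print "Line " FNR " has " count " number of 3 digits."
}
' Input_file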

Output will be as follows.

Line 1 has 3 number of 3 digits.
Line 2 has 1 number of 3 digits.
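Some awk implementations do not support `{3}` interval expressions, so a variant that avoids them may be useful. The sketch below is not the answer's method: it splits each line on runs of non-digits and compares piece lengths against `n`, which means digits glued to letters are counted too. The function name `count_per_line` is hypothetical, and `n` is assumed to be at least 1.

```bash
#!/usr/bin/env bash

# count_per_line N (hypothetical helper)
# Reads lines on stdin and prints, for each line, how many runs of
# exactly N digits it contains.
count_per_line() {
    awk -v n="$1" '{
        c = 0
        # Split on runs of non-digits; the pieces are the digit runs
        # (plus possibly empty pieces at the line edges).
        m = split($0, part, /[^0-9]+/)
        for (i = 1; i <= m; i++)
            if (length(part[i]) == n) c++
        print c
    }'
}

printf '%s\n' 'I have 243 pens for sale' | count_per_line 3   # prints 1
```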
