811

Given a filename in the form someletters_12345_moreleters.ext, I want to extract the 5 digits and put them into a variable.

So to emphasize the point, I have a filename with x number of characters then a five digit sequence surrounded by a single underscore on either side then another set of x number of characters. I want to take the 5 digit number and put that into a variable.

I am very interested in the number of different ways that this can be accomplished.

codeforester
Berek Bryan
  • 5
    Most of the answers don't seem to answer your question because the question is ambiguous. *"I have a filename with x number of characters then a five digit sequence surrounded by a single underscore on either side then another set of x number of characters"*. By that definition `abc_12345_def_67890_ghi_def` is a valid input. What do you want to happen? Let's assume there is only one 5 digit sequence. You still have `abc_def_12345_ghi_jkl` or `1234567_12345_1234567` or `12345d_12345_12345e` as valid input based on your definition of input and most of the answers below will not handle this. – gman Apr 20 '18 at 04:42
  • 5
    This question has an example input that's too specific. Because of that, it got a lot of specific answers for *this particular case* (digits only, same `_` delimiter, input that contains the target string only once etc.). The [best (most generic and fastest) answer](https://stackoverflow.com/questions/428109/extract-substring-in-bash/436730#436730) has, after 10 years, only 7 upvotes, while other limited answers have hundreds. Makes me lose faith in developers – Dan Dascalescu May 08 '19 at 18:30
  • Clickbait title. The meaning of substring function is well established and means getting a part by numerical positions. All the other things, (indexOf, regex) are about search. A 3-month older question that asks precisely about substring in bash, answered the same, but w/o "substring" in the title. Not misleading, but not properly named. Results: the answer about built-in function in most voted question buried 5 screens down with activity sorting; older and more precise question, marked duplicate. https://stackoverflow.com/questions/219402/what-linux-shell-command-returns-a-part-of-a-string – user9999 Oct 23 '20 at 11:04

22 Answers

1193

If x is constant, the following parameter expansion performs substring extraction:

b=${a:12:5}

where 12 is the offset (zero-based) and 5 is the length
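
As a quick check of the fixed-offset form, here is a minimal sketch using the example filename from the question. It assumes the prefix is always 12 characters long (`someletters_`); the variable names are only illustrative.

```bash
#!/usr/bin/env bash
# Assumption: the digits always start at offset 12 (the length of "someletters_").
a='someletters_12345_moreleters.ext'

b=${a:12:5}      # 5 characters starting at zero-based offset 12
rest=${a:12}     # omitting the length takes everything from offset 12 onward

echo "$b"        # 12345
echo "$rest"     # 12345_moreleters.ext
```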

If the underscores around the digits are the only ones in the input, you can strip off the prefix and suffix (respectively) in two steps:

tmp=${a#*_}   # remove prefix ending in "_"
b=${tmp%_*}   # remove suffix starting with "_"

If there are other underscores, it's probably feasible anyway, albeit more tricky. If anyone knows how to perform both expansions in a single expression, I'd like to know too.

Both solutions presented are pure bash, with no process spawning involved, hence very fast.
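
When extra underscores appear after the digits, one possible adjustment is to strip the longest matching suffix with `%%` instead of the shortest with `%`. The input below is a made-up example, and the sketch still assumes the digits are the second underscore-separated field.

```bash
a='someletters_12345_more_leters.ext'   # hypothetical input with an extra "_"

tmp=${a#*_}      # shortest prefix ending in "_" removed -> 12345_more_leters.ext
b=${tmp%%_*}     # longest suffix starting with "_" removed -> 12345

echo "$b"
```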

JB.
  • You can do both expansions at the same time: `${${a#*_}%_*}`. I've used this before to string together multiple bash string ops to get a specific section of a substring. – Spencer Rathbun Jun 26 '13 at 12:21
  • 20
    @SpencerRathbun `bash: ${${a#*_}%_*}: bad substitution` on my GNU bash 4.2.45. – JB. Jun 28 '13 at 11:02
  • same here with bash 4.1.10(4): t="someletters_12345_moreleters.ext";echo ${${t#*_}%_*} ${${t#*_}%_*}: bad substitution @SpencerRathbun: I've never heard of a way to do this in one param substiution, can you tell us where you got that to work? – johnnyB Oct 29 '13 at 16:09
  • 2
    @jonnyB, Some time in the past that worked. I am told by my coworkers it stopped, and they changed it to be a sed command or something. Looking at it in the history, I was running it in a `sh` script, which was probably dash. At this point I can't get it to work anymore. – Spencer Rathbun Oct 29 '13 at 17:52
  • 23
    JB, you should clarify that "12" is the offset (zero-based) and "5" is the length. Also, +1 for @gontard 's link that lays it all out! – Doktor J Sep 12 '14 at 17:32
  • 1
    While running this inside a script as "sh run.sh", one might get Bad Substitution error. To avoid that, change permissions for run.sh (chmod +x run.sh) and then run the script as "./run.sh" – Ankur Jan 06 '15 at 10:13
  • @Ankur: what you write is mostly¹ correct, but quite off-topic here. The invocation method determines whether the shell runs using POSIX or bash semantics, but this question is tagged [tag:bash], so bash semantics are assumed. It's tag FAQ #2, see also [this question](http://stackoverflow.com/q/26717870/12274). [1: “mostly” because you'll still get a bad substitution if your script runs under a `#! /bin/sh` shebang] – JB. Jan 07 '15 at 09:32
  • giving only first number behaves like substr, or "from position x to the end" – Sergio Abreu Dec 08 '16 at 10:32
  • @Picrochole this is a pure bash answer. sed is obviously not pure bash. Would you mind moving your comment over to some answer where it could actually be appropriate? Or just make it an actual answer, if you think it's worth anything on its own. – JB. Aug 21 '18 at 20:46
  • Only works with the form `${a:12:5}`. Doesn't work with `${"some_string":12:5}` or `${$(basename $my_var):12:5}`. `${a:12:5}` is an operation on `$a`, and it makes sense that `$"some_string"` and `$$(basename $my_var)` are invalid. – Roger Dueck Dec 07 '18 at 20:19
  • A length param can be negative: `${a:12:-5}` trims 12 chars from the beginning and 5 chars from the end of a string. – Mike Shiyan Dec 28 '19 at 16:22
  • 5
    The offset param can be negative too, BTW. You just have to take care not to glue it to the colon, or bash will interpret it as a `:-` “Use Default Values” substitution. So `${a: -12:5}` yields the 5 characters 12 characters from the end, and `${a: -12:-5}` the 7 characters between end-12 and end-5. – JB. Dec 30 '19 at 17:21
  • Is there any documentation for this? – Maskim Jan 24 '20 at 09:25
  • @Maxime.D it's all in the bash manpage. “Parameter expansion” – JB. Jan 24 '20 at 10:06
  • @SpencerRathbun's embedded expansion works in zsh but not bash. I don't know about other shells. – Bruce Mar 31 '21 at 11:10
797

Use cut:

echo 'someletters_12345_moreleters.ext' | cut -d'_' -f 2

More generic:

INPUT='someletters_12345_moreleters.ext'
SUBSTRING=$(echo "$INPUT" | cut -d'_' -f 2)
echo "$SUBSTRING"
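
A slightly more defensive variant of the same idea, as a sketch: quote the variable so whitespace in the name survives, and pass `-s` so `cut` prints nothing when the input has no delimiter at all. The second filename is a made-up example.

```bash
INPUT='someletters_12345_moreleters.ext'
SUBSTRING=$(printf '%s\n' "$INPUT" | cut -s -d'_' -f 2)
echo "$SUBSTRING"    # 12345

# Without a delimiter, -s suppresses the line instead of echoing it back whole.
NODELIM='nodelimiter.ext'
EMPTY=$(printf '%s\n' "$NODELIM" | cut -s -d'_' -f 2)
echo "[$EMPTY]"      # []
```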
Victor Yarema
FerranB
  • 3
    the more generic answer is exactly what i was looking for, thanks – Berek Bryan Jan 09 '09 at 14:00
  • 88
    The -f flag takes 1-based indices, rather than the 0-based indices a programmer would be used to. – Matthew G Jul 23 '13 at 00:49
  • 2
    INPUT=someletters_12345_moreleters.ext SUBSTRING=$(echo $INPUT| cut -d'_' -f 2) echo $SUBSTRING – mani deepak Mar 24 '14 at 10:29
  • 3
    You should properly use double quotes around the arguments to `echo` unless you know for sure that the variables cannot contain irregular whitespace or shell metacharacters. See further http://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-variable – tripleee Jan 24 '17 at 09:30
  • 1
    The number '2' after '-f' is to tell shell to extract the 2nd set of substring. – Sandun Jul 10 '18 at 13:42
  • Maybe this is a stupid question, but why doesn't `SUBSTRING=$INPUT | cut -d'_' -f 2` work? – Neil Jul 18 '19 at 18:45
  • I'd also add '-s' ('--only-delimited') as without this flag `SUBSTRING` will include the whole string in case there's no delimiter in the file name. It would be safer. – Jean Spector Apr 19 '20 at 10:29
106

Generic solution where the number can be anywhere in the filename, using the first of such sequences:

number=$(echo "$filename" | grep -Eo '[[:digit:]]{5}' | head -n1)

Another solution to extract exactly a part of a variable:

number=${filename:offset:length}

If your filename always has the format stuff_digits_... you can use awk:

number=$(echo "$filename" | awk -F _ '{ print $2 }')

Yet another solution, to remove everything except digits, is to use

number=$(echo "$filename" | tr -cd '[:digit:]')
Johannes Schaub - litb
101

just try to use cut -c startIndx-stopIndx
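
For the filename in the question, a sketch of the `cut -c` approach might look like this. Note that `cut -c` positions are 1-based and inclusive at both ends, so zero-based offset 12 with length 5 becomes the range 13-17; the fixed positions are an assumption about the input.

```bash
filename='someletters_12345_moreleters.ext'

# Characters 13 through 17 (1-based, inclusive) hold the five digits.
number=$(printf '%s\n' "$filename" | cut -c 13-17)

echo "$number"   # 12345
```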

brown.2179
  • 3
    Is there something like startIndex-lastIndex - 1? – Niklas Jul 30 '15 at 08:00
  • 1
    @Niklas In bash, proly `startIndx-$((lastIndx-1))` – brown.2179 Jul 31 '15 at 13:19
  • 3
    `start=5;stop=9; echo "the rain in spain" | cut -c $start-$(($stop-1))` – brown.2179 Jul 31 '15 at 18:14
  • 1
    The problem is that the input is dynamic since I also use the pipe to get it so it's basically. `git log --oneline | head -1 | cut -c 9-(end -1)` – Niklas Jul 31 '15 at 18:19
  • This can be done with cut if break into two parts as `line=`git log --oneline | head -1` && echo $line | cut -c 9-$((${#line}-1))` but in this particular case, might be better to use [sed](https://en.wikipedia.org/wiki/Sed) as `git log --oneline | head -1 | sed -e 's/^[a-z0-9]* //g'` – brown.2179 Aug 03 '15 at 13:50
  • This command works great for getting timestamps, etc. from a command like stat! Time-saver! – Sean Halls Nov 17 '16 at 19:19
37

If you want more rigorous information, you can also search for it in man bash like this:

$ man bash [press return key]
/substring  [press return key]
[press "n" key]
[press "n" key]
[press "n" key]
[press "n" key]

Result:

${parameter:offset}
       ${parameter:offset:length}
              Substring Expansion.  Expands to  up  to  length  characters  of
              parameter  starting  at  the  character specified by offset.  If
              length is omitted, expands to the substring of parameter  start‐
              ing at the character specified by offset.  length and offset are
              arithmetic expressions (see ARITHMETIC  EVALUATION  below).   If
              offset  evaluates  to a number less than zero, the value is used
              as an offset from the end of the value of parameter.  Arithmetic
              expressions  starting  with  a - must be separated by whitespace
              from the preceding : to be distinguished from  the  Use  Default
              Values  expansion.   If  length  evaluates to a number less than
              zero, and parameter is not @ and not an indexed  or  associative
              array,  it is interpreted as an offset from the end of the value
              of parameter rather than a number of characters, and the  expan‐
              sion is the characters between the two offsets.  If parameter is
              @, the result is length positional parameters beginning at  off‐
              set.   If parameter is an indexed array name subscripted by @ or
              *, the result is the length members of the array beginning  with
              ${parameter[offset]}.   A  negative  offset is taken relative to
              one greater than the maximum index of the specified array.  Sub‐
              string  expansion applied to an associative array produces unde‐
              fined results.  Note that a negative offset  must  be  separated
              from  the  colon  by  at least one space to avoid being confused
              with the :- expansion.  Substring indexing is zero-based  unless
              the  positional  parameters are used, in which case the indexing
              starts at 1 by default.  If offset  is  0,  and  the  positional
              parameters are used, $0 is prefixed to the list.
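
To make the rules about negative values concrete, here is a small sketch against the filename from the question. The negative length form needs bash 4.2 or later; the variable names are illustrative only.

```bash
a='someletters_12345_moreleters.ext'   # 32 characters long

ext=${a: -4}        # the space keeps ": -4" from being parsed as ":-" (default value)
mid=${a:12:-15}     # from offset 12 up to 15 characters before the end

echo "$ext"         # .ext
echo "$mid"         # 12345
```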
jperelli
  • 3
    A very important caveat with negative values as stated above: **Arithmetic expressions starting with a - must be separated by whitespace from the preceding : to be distinguished from the Use Default Values expansion.** So to get last four characters of a var: `${var: -4}` – sshow Jul 27 '17 at 17:22
37

Here's how I'd do it:

FN=someletters_12345_moreleters.ext
[[ ${FN} =~ _([[:digit:]]{5})_ ]] && NUM=${BASH_REMATCH[1]}

Explanation:

Bash-specific:

Regular Expressions (RE): _([[:digit:]]{5})_

  • _ are literals to demarcate/anchor matching boundaries for the string being matched
  • () create a capture group
  • [[:digit:]] is a character class, I think it speaks for itself
  • {5} means exactly five of the prior character, class (as in this example), or group must match

In English, you can think of it behaving like this: the FN string is scanned character by character until we see an _, at which point the capture group is opened and we attempt to match five digits. If that matching is successful to this point, the capture group saves the five digits traversed. If the next character is an _, the condition is successful, the capture group is made available in BASH_REMATCH, and the next NUM= statement can execute. If any part of the matching fails, saved details are discarded and character-by-character processing continues after the _. For example, if FN were _1 _12 _123 _1234 _12345_, there would be four false starts before it found a match.
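
The same match can be wrapped in a small function that reports failure when no five-digit field is present. This is only a sketch; `extract_digits` is a hypothetical name, not something from the answer above.

```bash
# Print the first run of exactly five digits that is delimited by underscores.
# Returns non-zero (and prints nothing) when the name has no such field.
extract_digits() {
    local re='_([[:digit:]]{5})_'
    [[ $1 =~ $re ]] || return 1
    printf '%s\n' "${BASH_REMATCH[1]}"
}

NUM=$(extract_digits 'someletters_12345_moreleters.ext')
echo "$NUM"                                   # 12345

extract_digits 'no_digits_here.ext' || echo 'no match'
```

Keeping the pattern in a variable (`re`) avoids quoting surprises on the right-hand side of `=~`.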

nicerobot
  • 3
    This is a generic way that works even if you need to extract more than one thing, as I did. – zebediah49 Feb 05 '13 at 23:14
  • 3
    This is the most generic answer indeed, and should be accepted one. It works for a regular expression, not just a string of characters at a fixed position, or between the same delimiter (which enables `cut`). It also doesn't rely on executing an external command. – Dan Dascalescu May 08 '19 at 18:22
  • This is great! I adapted this to use different start/stop delimiters (replace the _) and variable length numbers (. for {5}) for my situation. Can someone break down this black magic and explain it? – Paul May 15 '20 at 17:37
  • 2
    @Paul I added more details to my answer. Hope that helps. – nicerobot May 16 '20 at 19:43
22

I'm surprised this pure bash solution didn't come up:

a="someletters_12345_moreleters.ext"
IFS="_"
set $a
echo $2
# prints 12345

You probably want to reset IFS to what value it was before, or unset IFS afterwards!
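
One way to avoid leaking the changed IFS, sketched here, is to do the splitting inside a function with a `local IFS`. The `set -f` call is an addition: it stops the unquoted `$1` from being glob-expanded if the name contains `*` or `?`. Note that `set +f` re-enables globbing unconditionally. `second_field` is a hypothetical helper name.

```bash
second_field() {
    local IFS='_'        # change is confined to this function
    set -f               # disable globbing while the value is split unquoted
    set -- $1            # split the first argument on "_" into $1, $2, ...
    set +f
    printf '%s\n' "$2"
}

a='someletters_12345_moreleters.ext'
second_field "$a"        # prints 12345
```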

user1338062
20

Building on jor's answer (which doesn't work for me):

substring=$(expr "$filename" : '.*_\([^_]*\)_.*')
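
Here is the same command applied to the filename from the question, as a sketch. The `expr STRING : REGEX` form is specified by POSIX; the pattern is implicitly anchored at the start of the string, and `expr` prints whatever the `\(...\)` group captured.

```bash
filename='someletters_12345_moreleters.ext'

# [^_]* cannot cross an underscore, so the group is the text between the last two underscores.
substring=$(expr "$filename" : '.*_\([^_]*\)_.*')

echo "$substring"   # 12345
```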
PEZ
12

Following the requirements

I have a filename with x number of characters then a five digit sequence surrounded by a single underscore on either side then another set of x number of characters. I want to take the 5 digit number and put that into a variable.

I found some grep ways that may be useful:

$ echo "someletters_12345_moreleters.ext" | grep -Eo "[[:digit:]]+" 
12345

or better

$ echo "someletters_12345_moreleters.ext" | grep -Eo "[[:digit:]]{5}" 
12345

And then with -Po syntax:

$ echo "someletters_12345_moreleters.ext" | grep -Po '(?<=_)\d+' 
12345

Or if you want to make it fit exactly 5 characters:

$ echo "someletters_12345_moreleters.ext" | grep -Po '(?<=_)\d{5}' 
12345

Finally, to store the result in a variable, just use the var=$(command) syntax.
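
Storing the result might look like the following sketch. It sticks to `grep -E`, since `-P` is a GNU extension that is not available everywhere, and adds `head -n1` (an addition, not part of the answer above) so that only the first match is kept when the name contains several.

```bash
name='someletters_12345_moreleters.ext'

# -o prints each match on its own line; head keeps only the first one.
number=$(printf '%s\n' "$name" | grep -Eo '[[:digit:]]{5}' | head -n1)

echo "$number"   # 12345
```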

weston
fedorqui 'SO stop harming'
  • 2
    I believe nowadays there is no need to use egrep, the command itself warns you: `Invocation as 'egrep' is deprecated; use 'grep -E' instead`. I've edited your answer. – Neurotransmitter Jun 16 '14 at 13:27
12

If we focus on the concept of:
"A run of (one or several) digits"

We could use several external tools to extract the numbers.
We could quite easily erase all other characters with either sed or tr:

name='someletters_12345_moreleters.ext'

echo $name | sed 's/[^0-9]*//g'    # 12345
echo $name | tr -c -d 0-9          # 12345

But if $name contains several runs of numbers, the above will fail:

If "name=someletters_12345_moreleters_323_end.ext", then:

echo $name | sed 's/[^0-9]*//g'    # 12345323
echo $name | tr -c -d 0-9          # 12345323

We need to use regular expressions (regex).
To select only the first run (12345 not 323) in sed and perl:

echo "$name" | sed 's/[^0-9]*\([0-9]\{1,\}\).*$/\1/'
echo "$name" | perl -ne 'print "$1\n" if /(\d+)/'

But we could as well do it directly in bash(1) :

regex='[^0-9]*([0-9]{1,}).*$'
[[ $name =~ $regex ]] && echo "${BASH_REMATCH[1]}"

This allows us to extract the FIRST run of digits of any length
surrounded by any other text/characters.

Note: regex='[^0-9]*([0-9]{5,5}).*$' will capture exactly 5 digits, although a longer run of digits still matches and yields its first five. :-)

(1): faster than calling an external tool for each short texts. Not faster than doing all processing inside sed or awk for large files.

10

Without any sub-processes you can:

shopt -s extglob
front=${input%%_+([a-zA-Z]).*}
digits=${front##+([a-zA-Z])_}

A very small variant of this will also work in ksh93.

Darron
9

Here's a prefix-suffix solution (similar to the solutions given by JB and Darron) that matches the first block of digits and does not depend on the surrounding underscores:

str='someletters_12345_morele34ters.ext'
s1="${str#"${str%%[[:digit:]]*}"}"   # strip off non-digit prefix from str
s2="${s1%%[^[:digit:]]*}"            # strip off non-digit suffix from s1
echo "$s2"                           # 12345
codist
7

I love sed's capability to deal with regex groups:

> var="someletters_12345_moreletters.ext"
> digits=$( echo "$var" | sed -n "s/.*_\([0-9]\+\).*/\1/p" )
> echo "$digits"
12345

A slightly more general option would be not to assume that you have an underscore _ marking the start of your digits sequence, hence for instance stripping off all non-numbers you get before your sequence: s/[^0-9]\+\([0-9]\+\).*/\1/p.


> man sed | grep s/regexp/replacement -A 2
s/regexp/replacement/
    Attempt to match regexp against the pattern space.  If successful, replace that portion matched with replacement.  The replacement may contain the special  character  &  to
    refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.

More on this, in case you're not too confident with regexps:

  • s is for _s_ubstitute
  • [0-9]+ matches 1+ digits
  • \1 links to the group n.1 of the regex output (group 0 is the whole match, group 1 is the match within parentheses in this case)
  • p flag is for _p_rinting

All escapes \ are there to make sed's regexp processing work.
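
For reference, the same substitution can be written with extended regular expressions, which drops most of the backslashes. This sketch assumes a sed that accepts `-E` (GNU and BSD sed both do).

```bash
var='someletters_12345_moreletters.ext'

# -n suppresses automatic printing; the trailing p prints only when the substitution succeeded.
digits=$(printf '%s\n' "$var" | sed -nE 's/.*_([0-9]+).*/\1/p')

echo "$digits"   # 12345
```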

Campa
6

My answer gives you more control over what you extract from your string. Here is the code to extract 12345 from your string:

str="someletters_12345_moreleters.ext"
str=${str#*_}
str=${str%_more*}
echo $str

This is more flexible if you want to extract something that contains arbitrary characters such as abc, or special characters like _ or -. For example, if your string looks like this and you want everything after someletters_ and before _moreleters.ext:

str="someletters_123-45-24a&13b-1_moreleters.ext"

With my code you can state exactly what you want. Explanation:

  • #* removes the preceding string, including the matching key. Here the key is _.
  • % removes the following string, including the matching key. Here the key is '_more*'.

Do some experiments yourself and you would find this interesting.
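
Running the two expansions on the second example string gives the following. This is just a worked trace of the steps described above.

```bash
str='someletters_123-45-24a&13b-1_moreleters.ext'

str=${str#*_}        # remove through the first "_"   -> 123-45-24a&13b-1_moreleters.ext
str=${str%_more*}    # remove from "_more" to the end -> 123-45-24a&13b-1

echo "$str"
```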

Alex Raj Kaliamoorthy
6

Given test.txt is a file containing "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

cut -b19-20 test.txt > test1.txt # This will extract chars 19 & 20 "ST"
while read -r; do
    x=$REPLY
done < test1.txt
echo "$x"
ST
Teemu Leisti
Rick Osman
  • This is extremely specific to that particular input. The only general solution to the general question (which the OP should have asked) is to [use a regexp](https://stackoverflow.com/questions/428109/extract-substring-in-bash/436730#436730). – Dan Dascalescu May 08 '19 at 18:28
4

shell cut - print specific range of characters or given part from a string

#method1) using bash

 str=2020-08-08T07:40:00.000Z
 echo ${str:11:8}

#method2) using cut

 str=2020-08-08T07:40:00.000Z
 cut -c12-19 <<< $str

#method3) when working with awk

 str=2020-08-08T07:40:00.000Z
 awk '{time=gensub(/.{11}(.{8}).*/,"\\1","g",$1); print time}' <<< $str
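
Since `gensub` is a gawk extension, a portable alternative sketch uses awk's built-in `substr`, which takes a 1-based start position and a length.

```bash
str='2020-08-08T07:40:00.000Z'

# Start at character 12 (1-based) and take 8 characters.
time=$(printf '%s\n' "$str" | awk '{ print substr($0, 12, 8) }')

echo "$time"   # 07:40:00
```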
hankyo
3

Ok, here goes pure Parameter Substitution with an empty string. Caveat is that I have defined someletters and moreletters as only characters. If they are alphanumeric, this will not work as it is.

shopt -s extglob    # needed for the @(...) and +(...) patterns
filename=someletters_12345_moreletters.ext
substring=${filename//@(+([a-z])_|_+([a-z]).*)}
echo "$substring"
12345
morbeo
2

similar to substr('abcdefg', 2-1, 3) in php:

echo 'abcdefg'|tail -c +2|head -c 3
diyism
  • This is extremely specific to that input. The only general solution to the general question (which the OP should have asked) is to [use a regexp](https://stackoverflow.com/questions/428109/extract-substring-in-bash/436730#436730). – Dan Dascalescu May 08 '19 at 18:27
1

A little late, but I just ran across this problem and found the following:

host:/tmp$ asd=someletters_12345_moreleters.ext 
host:/tmp$ expr "$asd" : '.*_\(.*\)_'
12345
host:/tmp$ 

I used it to get millisecond resolution on an embedded system that does not have %N for date:

set `grep "now at" /proc/timer_list`
nano=$3
fraction=`expr $nano : '.*\(...\)......'`
$debug nano is $nano, fraction is $fraction
russell
1

A bash solution:

IFS="_" read -r x digs x <<<'someletters_12345_moreleters.ext'

This will clobber a variable called x. The var x could be changed to the var _.

input='someletters_12345_moreleters.ext'
IFS="_" read -r _ digs _ <<<"$input"
1

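There's also the external expr command (it is not a bash builtin, and the match keyword is a GNU extension, so it may not work with other expr implementations):

INPUT="someletters_12345_moreleters.ext"
SUBSTRING=$(expr match "$INPUT" '.*_\([[:digit:]]*\)_.*')
echo "$SUBSTRING"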
jor
1

Takes start and end indices, similar to the JS and Java implementations, but with an inclusive end. Remove the +1 if you do not want this.

function substring() {
    local str="$1" start="${2}" end="${3}"

    # Default to the whole string when start or end is omitted.
    if [[ -z "$start" ]]; then start=0; fi
    if [[ -z "$end" ]]; then end=${#str}; fi

    # The +1 makes the end index inclusive.
    local length=$(( end - start + 1 ))

    echo "${str:start:length}"
}

Example:

    substring 01234 0
    01234
    substring 012345 0
    012345
    substring 012345 0 0
    0
    substring 012345 1 1
    1
    substring 012345 1 2
    12
    substring 012345 0 1
    01
    substring 012345 0 2
    012
    substring 012345 0 3
    0123
    substring 012345 0 4
    01234
    substring 012345 0 5
    012345

More example calls:

    substring 012345 0
    012345
    substring 012345 1
    12345
    substring 012345 2
    2345
    substring 012345 3
    345
    substring 012345 4
    45
    substring 012345 5
    5
    substring 012345 6
    
    substring 012345 3 5
    345
    substring 012345 3 4
    34
    substring 012345 2 4
    234
    substring 012345 1 3
    123
mmm