1

I have the following lines in an Apache access log:

/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah

and I want to extract only the MSISDN value, so the expected output would be:

647930229655
647930229656
647930229657
647930229658

I'm using the following sed command, but I can't get it to stop capturing at &:

sed 's/.*MSISDN=\(.*\)/\1/'
Manse
  • 1
    Try `sed 's/.*MSISDN=\([0-9]*\).*/\1/'` – Wiktor Stribiżew Mar 05 '18 at 09:37
  • 1
    `.*` is greedy: it matches as much as possible while still letting the whole regex succeed, so you have to tell it to stop before `&`. How you do that depends on whether the delimiter is a single character or several, and some cases need features sed lacks, such as lookarounds. https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean might help – Sundeep Mar 05 '18 at 09:51
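
To make the greedy behaviour described in the comments concrete, here is a minimal sketch that runs the question's command and a bounded variant on one of the sample lines. The variable names are only illustrative.

```shell
line='/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah'

# Greedy capture: \(.*\) runs to the end of the line, so "&blah" is kept.
greedy=$(printf '%s\n' "$line" | sed 's/.*MSISDN=\(.*\)/\1/')

# Bounded capture: [^&]* stops at the first "&"; the trailing .* consumes
# the rest of the line so it is not printed.
bounded=$(printf '%s\n' "$line" | sed 's/.*MSISDN=\([^&]*\).*/\1/')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The first result still carries `&blah`; the second is just the number.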

5 Answers

4

sed solution:

sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
  • & - is the key/value pair separator in a URL query string, so you should rely on it
  • ([^&]+) - 1st captured group containing any character sequence except &
  • \1 - backreference to the 1st captured group

The output:

647930229655
647930229656
647930229657
647930229658
RomanPerekhrest
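
A pattern that requires `&` before `MSISDN=` will not match when MSISDN is the first parameter (directly after `?`), and a plain `s///` prints non-matching lines unchanged. The following is a sketch of one way to cover both cases, assuming that is a concern for your log. The function name and the sample lines are hypothetical.

```shell
# Accept either "?" or "&" before the key, and print only lines where the
# substitution succeeded (-n plus the p flag).
extract_msisdn() {
  sed -nE 's/.*[?&]MSISDN=([^&]+).*/\1/p'
}

# Hypothetical sample: MSISDN first, MSISDN later, and no MSISDN at all.
printf '%s\n' \
  '/sms/receiveHLRLookup?MSISDN=647930229655&Ported=No' \
  '/sms/receiveHLRLookup?Ported=No&MSISDN=647930229656&blah' \
  '/sms/other?Status=Success' |
extract_msisdn
```

Only the two numbers are printed; the third line is dropped because it has no MSISDN parameter.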
3

-o: print only the matching part of the line, not the whole line.
-P: enable PCRE (Perl-compatible) regex syntax.
\K: drop everything matched so far from the reported match. That text must still be present in the input for the match to succeed.
\d: a digit; + means one or more of them.

grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658
P....
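
`-P` is a GNU grep extension and may not be available in other grep implementations. Where it is missing, one possible fallback is to match the key together with the digits using an extended regex and then strip the key with `cut`. This is a sketch, and the function name is hypothetical.

```shell
# Match "MSISDN=<digits>" with an extended regex, then keep only the part
# after the "=" sign.
extract_msisdn_portable() {
  grep -oE 'MSISDN=[0-9]+' | cut -d '=' -f2
}

printf '%s\n' \
  '/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah' |
extract_msisdn_portable
```

This still depends on `-o`, which is widely supported but worth checking on your system.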
2
$ grep -oP '(?<=&MSISDN=)\d+' file 
647930229655
647930229656
647930229657
647930229658

The -o option prints only the matched text. The -P option enables PCRE (Perl Compatible Regular Expressions). (?<=regex) is a positive lookbehind assertion. Unlike the rest of the pattern, lookarounds don't consume any characters while matching, so the only matched output you get is \d+, which is one or more digits.

or using sed:

$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file 
647930229655
647930229656
647930229657
647930229658
riteshtch
2

The following simple sed command may help:

sed 's/.*MSISDN=//;s/&.*//'  Input_file

Explanation:

s/.*MSISDN=//: s means substitute; this replaces everything up to and including MSISDN= with an empty string on the current line.

; the semicolon separates this from the next sed command.

s/&.*//: replaces everything from the first remaining & to the end of the line with an empty string.

RavinderSingh13
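
The two substitutions can be traced one at a time on a sample line to see what each one removes. This is a small sketch; the variable names are illustrative.

```shell
line='/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah'

# Step 1: delete everything up to and including "MSISDN=".
step1=$(printf '%s\n' "$line" | sed 's/.*MSISDN=//')

# Step 2: delete everything from the first remaining "&" to the end.
step2=$(printf '%s\n' "$step1" | sed 's/&.*//')

printf 'after step 1: %s\nafter step 2: %s\n' "$step1" "$step2"
```

After step 1 the line is `647930229655&blah`; after step 2 it is `647930229655`.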
0

You can also pipe one cut into another. This relies on MSISDN being the third &-separated field, as it is in the sample lines:

cut -d '&' -f3 Input_file |cut -d '=' -f2
once
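
Because `cut` selects fields by position, it returns the wrong value if the parameters appear in a different order. A possible order-independent variant splits the query string into one parameter per line and selects by key. This is a sketch; the function name and the reordered sample line are hypothetical.

```shell
# Turn every "?" and "&" into a newline so each key=value pair is on its
# own line, then print the value of the line that starts with "MSISDN=".
extract_msisdn_by_key() {
  tr '?&' '\n\n' | sed -n 's/^MSISDN=//p'
}

# Hypothetical line with MSISDN in second position instead of third.
printf '%s\n' \
  '/sms/receiveHLRLookup?Ported=No&MSISDN=647930229655&Status=Success' |
extract_msisdn_by_key
```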