
I'm trying to read bounced e-mails by connecting via PHP to an IMAP account and fetching all e-mails. I want to retrieve the "Diagnostic-Code" message from each e-mail, so I wrote the following regex:

/Diagnostic-Code:\s+?(.*)/i

The message that I'm trying to parse is this:

Diagnostic-Code: smtp; 550-5.1.1 The email account that you tried to reach does
    not exist. Please try 550-5.1.1 double-checking the recipient's email
    address for typos or 550-5.1.1 unnecessary spaces. Learn more at 550 5.1.1
    https://support.google.com/mail/?p=NoSuchUser 63si4621095ybi.465 - gsmtp

The regex only partly works: it captures just the first line of text. I want to capture the entire message, that is, all four lines.

Is it possible to update the expression to do this matching?

Thanks.

Cosmin

3 Answers

/Diagnostic-Code:\s(.*\n(?:(?!--).*\n)*)/i
  • result will be in capture group 1
  • first .*\n matches first line including trailing newline
  • (?:(?!--).*\n)* matches subsequent lines that don't begin with "--"
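
For reference, a minimal sketch of how this pattern could be used from PHP. The sample body, the boundary name `boundary42` and the variable names are made up for illustration, and collapsing the folded lines into one line is just one way to tidy the result.

```php
<?php
// Hypothetical sample body: a Diagnostic-Code field followed by a part boundary.
$body = "Diagnostic-Code: smtp; 550-5.1.1 The email account that you tried to reach does\n"
      . "    not exist. Please try 550-5.1.1 double-checking the recipient's email\n"
      . "    address for typos or 550-5.1.1 unnecessary spaces.\n"
      . "--boundary42\n"
      . "Content-Type: text/rfc822-headers\n";

$diagnostic = null;
if (preg_match('/Diagnostic-Code:\s(.*\n(?:(?!--).*\n)*)/i', $body, $m)) {
    // Capture group 1 holds every line up to (not including) the "--" line.
    // Collapse the line breaks and indentation into single spaces.
    $diagnostic = trim(preg_replace('/\s+/', ' ', $m[1]));
}

echo $diagnostic, PHP_EOL;
```

Note that the pattern expects each matched line to end with a newline, so a Diagnostic-Code on the very last line of a body without a trailing newline would not match.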
jhnc
  • It's almost perfect :) It seems like for the following message ": host gmail-smtp-in.l.google.com[1.1.1.1] said: 552-5.2.2 The email account that you tried to reach is over quota. Please direct 552-5.2.2 the recipient to 552 5.2.2 https://support.google.com/mail/?p=OverQuotaPerm u14si4562135ybj.341 - gsmtp (in reply to RCPT TO command)" the parsing stops at "gsmtp" and the "(in reply to RCPT TO command)" is left out. – Cosmin Jan 27 '19 at 21:45
  • It ought to be including that. What does your code look like? – jhnc Jan 27 '19 at 21:59
  • I think it should stop when it reaches the beginning of a new part, meaning the "--" characters. – Cosmin Jan 27 '19 at 22:15
  • Not sure what you mean by "new part" or "--" characters. I had assumed "Diagnostic-Code:" was a message header. If you're trying to match something in the body, then using `\n\s` may not be the appropriate way to match continuation lines. – jhnc Jan 27 '19 at 22:52
  • I'm not getting this information from the e-mail header, but from its body. So the parsing of the Diagnostic-Code message should stop when it reaches the "--" characters. – Cosmin Jan 28 '19 at 08:04
  • Thanks! Works fine now. – Cosmin Jan 28 '19 at 09:28

If there can be multiple messages starting with Diagnostic-Code: you could use:

^Diagnostic-Code:\K.*(?:\R(?!Diagnostic-Code:).*)*

See the regex demo | Php demo

Explanation

  • ^ Start of the string (or of each line when the m flag is set)
  • Diagnostic-Code: Match literally
  • \K.* Reset the start of the reported match, then match the rest of the line
  • (?: Non-capturing group
    • \R(?!Diagnostic-Code:).* Match a newline sequence, assert with a negative lookahead that what follows is not Diagnostic-Code:, then match the rest of that line
  • )* Close the non-capturing group and repeat it 0+ times
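
A rough sketch of how this could look with preg_match_all. The input text and variable names are invented for the example, and the m flag is an assumption on my part so that ^ can match at the start of every line rather than only at the start of the string.

```php
<?php
// Hypothetical input: two Diagnostic-Code fields, each folded over two lines.
$text = "Diagnostic-Code: smtp; 550-5.1.1 The email account does\n"
      . "    not exist.\n"
      . "Diagnostic-Code: smtp; 552-5.2.2 The email account is\n"
      . "    over quota.\n";

// Because of \K, $matches[0] holds only the text after "Diagnostic-Code:".
preg_match_all('/^Diagnostic-Code:\K.*(?:\R(?!Diagnostic-Code:).*)*/m', $text, $matches);

// Collapse line breaks and indentation so each message is a single line.
$codes = array_map(function ($s) {
    return trim(preg_replace('/\s+/', ' ', $s));
}, $matches[0]);

print_r($codes);
```

Keep in mind that the only stop condition is a following line that starts with Diagnostic-Code:, so any unrelated lines after the last message would be included in that match.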
The fourth bird

Add the s flag:

/Diagnostic-Code:\s+?(.*)/si

From this question:

In PHP... [t]he s at the end causes the dot to match all characters including newlines.

This will allow your regex to match the whole thing (see this regex101). Just remember to add some way to end it if you have more text after that.
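
As a sketch of what "some way to end it" might look like: the example below makes the capture lazy and stops at the first blank line or at the end of the input. The blank-line terminator is only an assumption about how the body is laid out, and the sample text and variable names are hypothetical.

```php
<?php
// Hypothetical body: the field is followed by a blank line and more text.
$body = "Diagnostic-Code: smtp; 550-5.1.1 The email account that you tried to reach does\n"
      . "    not exist.\n"
      . "\n"
      . "Some later part of the message\n";

// With the s flag alone, the greedy (.*) runs to the end of the input.
preg_match('/Diagnostic-Code:\s+?(.*)/si', $body, $greedy);

// Assumed terminator: capture lazily up to the first blank line or end of input.
preg_match('/Diagnostic-Code:\s+?(.*?)(?=\R\R|\z)/si', $body, $bounded);

// Collapse line breaks and indentation into single spaces.
$diagnostic = trim(preg_replace('/\s+/', ' ', $bounded[1]));

echo $diagnostic, PHP_EOL;
```

If the message ends at a different marker, such as a line starting with "--", the lookahead would need to be changed to match that instead.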

connectyourcharger