I am trying to parse Set-Cookie headers with a regex in Python. I read RFC 6265, Section 4.1, which describes how the Set-Cookie header is built, and tried to derive a regex from the specification. This is my current state:

([\x21\x23-\x27\x2A\x2B\x2D-\x39\x41-\x5A\x5E-\x7A\x7C\x7E]+)=[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*(;[\x20](((Expires|expires)=(Mon|Tue|Wed|Thu|Fri|Sat|Sun),[\x20][0-9]{2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-[0-9]{4}[\x20][0-9]{2}:[0-9]{2}:[0-9]{2}[\x20]GMT)|((Max-Age|max-age)=[1-9]+)|((Path|path)=[\x20-\x3A\x3C-\x7E]+)|(Secure|secure)|(HttpOnly|httponly)|([\x20-\x3A\x3C-\x7E]*)))*

I have problems with the recursive definition of the subdomain in the Set-Cookie header (domain=...), which is described in RFC 1034, Section 3.5, and need help expressing it as a regex.

My current regex also does not work entirely as expected. For example, this Set-Cookie header

VISITOR_INFO1_LIVE=M_6WYFFF_fo; path=/; domain=.youtube.com; secure; expires=Tue, 07-Jul-2020 00:17:35 GMT; httponly; samesite=None, GPS=1; path=/; domain=.youtube.com; expires=Thu, 09-Jan-2020 00:47:35 GMT, YSC=8sXes3YfFFF; path=/; domain=.youtube.com; httponly, VISITOR_INFO1_LIVE=M_6WYFFF_fo; path=/; domain=.youtube.com; secure; expires=Tue, 07-Jul-2020 00:17:35 GMT; httponly; samesite=None

includes 4 cookies (VISITOR_INFO1_LIVE twice, GPS, and YSC), but my regex only catches 3 of them (the YSC cookie is missing). I tested this on https://regex101.com/

Later I want to parse many Set-Cookie headers to get the names of the cookies (what the RFC calls cookie-name).

Thanks for help!

Basti G.

2 Answers

Short answer, as you asked how to parse the cookies with regex:

([^;]+);?

Then iterate through the matches.
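As a minimal sketch of that iteration in Python, assuming a header that holds a single cookie (the `header` string below is just the GPS cookie from the question, and the variable names are illustrative):

```python
import re

# One cookie from the question's example; a single-cookie header is assumed here.
header = "GPS=1; path=/; domain=.youtube.com; expires=Thu, 09-Jan-2020 00:47:35 GMT"

# Each match is one field; strip the space that follows each ';'.
fields = [field.strip() for field in re.findall(r"([^;]+);?", header)]

for field in fields:
    print(field)

# The first field of a cookie is the name=value pair.
cookie_name = fields[0].partition("=")[0]
```

With several cookies joined into one string, the fields of all cookies come back in one flat list, so the cookie boundaries still have to be found separately.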

The way you have formulated the question indicates that you would also like to validate the cookies and probably also separate them.
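For validating a single field such as the domain, the recursive rule in RFC 1034, Section 3.5 (`<subdomain> ::= <label> | <subdomain> "." <label>`) can be unrolled into a repetition, because it only ever appends one more label. The sketch below rests on a few assumptions that are not in the question: it accepts a digit as the first character of a label, it tolerates one leading dot as in `.youtube.com`, and it does not enforce the 63-character label limit. The names `LABEL`, `DOMAIN_VALUE`, and `is_valid_domain` are hypothetical.

```python
import re

# One label: starts and ends with a letter or digit, hyphens only in between.
# Assumption: a leading digit is accepted, which is looser than RFC 1034 itself.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

# The left-recursive subdomain rule unrolled: label ("." label)*.
# Assumption: one optional leading dot is tolerated, as in ".youtube.com".
DOMAIN_VALUE = re.compile(rf"\.?{LABEL}(?:\.{LABEL})*")


def is_valid_domain(value):
    """Return True if the whole string matches the unrolled subdomain rule."""
    return DOMAIN_VALUE.fullmatch(value) is not None


print(is_valid_domain(".youtube.com"))
print(is_valid_domain("a..b"))
```

The same pattern string can be embedded in a larger expression after `domain=` if validation has to happen in one pass.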

Pan
  • For my Set-Cookie header example above, your regex gives me everything, but not the cookie names on their own. I'm only interested in the cookie names, like `GPS` or `YSC`. – Basti G. Jan 10 '20 at 18:40
  • It gives the cookie names (with value) as well as all the other fields. I'll try to revise it to only return the cookie names but it's quite tricky as the string uses the same separators between cookies as it does inside individual fields (comma followed by space). – Pan Jan 10 '20 at 19:14
  • I see that you tagged Python as well. Why not try to solve this programmatically instead of through a one-liner? My answer returns all the fields. Iterate through them in your program and validate the contents of each (using regex if you desire) to find the information you seek. – Pan Jan 10 '20 at 19:58
After spending some more time on this question, I think that it is close to impossible to achieve what you desire using only regex.

There are no unique identifiers or delimiters for each cookie. The same delimiter (a comma followed by a space) is used inside fields, such as the expires date, as well as between cookies. There is also no fixed number of fields and no mandatory final field. That makes it very difficult to write the negative part of this expression (what not to match).
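A small amount of Python around a simple regex can get reasonably close, though. The sketch below is a heuristic, not a full RFC 6265 parser: it assumes that a comma starts a new cookie only when it is followed by something that looks like `name=`, which holds for the header in the question but would break if a cookie or attribute value itself contained such a sequence. The names `COOKIE_BOUNDARY` and `cookie_names` are hypothetical.

```python
import re

# Heuristic boundary: a comma followed by a token and "=" starts a new cookie.
# The comma inside "expires=Tue, 07-Jul-2020 ..." is followed by a date and a
# space, not by "name=", so it is not treated as a boundary.
COOKIE_BOUNDARY = re.compile(r",\s*(?=[^;,=\s]+=)")


def cookie_names(combined_header):
    """Return the cookie names found in a comma-joined Set-Cookie string."""
    names = []
    for cookie in COOKIE_BOUNDARY.split(combined_header):
        # The first field of each cookie is the name=value pair.
        name_value = cookie.split(";", 1)[0]
        name, sep, _value = name_value.partition("=")
        if sep:
            names.append(name.strip())
    return names


# The combined header from the question.
example = (
    "VISITOR_INFO1_LIVE=M_6WYFFF_fo; path=/; domain=.youtube.com; secure; "
    "expires=Tue, 07-Jul-2020 00:17:35 GMT; httponly; samesite=None, "
    "GPS=1; path=/; domain=.youtube.com; expires=Thu, 09-Jan-2020 00:47:35 GMT, "
    "YSC=8sXes3YfFFF; path=/; domain=.youtube.com; httponly, "
    "VISITOR_INFO1_LIVE=M_6WYFFF_fo; path=/; domain=.youtube.com; secure; "
    "expires=Tue, 07-Jul-2020 00:17:35 GMT; httponly; samesite=None"
)

print(cookie_names(example))
```

Splitting first and extracting the name afterwards keeps each step simple, and the individual fields can still be validated with their own regex if needed.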

Pan