
From the page source at view-source:https://www.amazon.com/dp/073532753X?smid=A3P5ROKL5A1OLE I want to get the string between var iframeContent = and obj.onloadCallback = onloadCallback;

I have this regex: iframeContent(.*?)obj.onloadCallback = onloadCallback;

But it does not work. I am not good at regex so please pardon my lack of knowledge.

I even tried iframeContent(.*?)obj.onloadCallback but that does not work either.

Umair Ayub
  • What do you mean by *it doesn't work*? Please post a minimal, complete, and verifiable example (http://stackoverflow.com/help/mcve), and excerpt from the original data the string you think should be matched. An Amazon page is crowded with possible match points, so it is going to be difficult to help you get what you want from this question. – Luis Colorado Oct 21 '16 at 11:06

3 Answers

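I suspect that the input string spans multiple lines. By default . does not match a newline, so try adding re.S (also spelled re.DOTALL) to the search call (i.e. re.findall('someString', text_Holder, re.S)). Note that re.M only changes how ^ and $ behave; it does not let . cross line breaks.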
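
To see how the flags differ, here is a minimal sketch. The sample string is a made-up stand-in for the page source (the real Amazon HTML is much larger), and the variable names are only for illustration.

```python
import re

# Hypothetical stand-in for the page source, with a newline inside the value.
sample = 'var iframeContent = "abc\ndef";\nobj.onloadCallback = onloadCallback;'

# Same idea as the regex in the question, with the literal dot escaped.
pattern = r'iframeContent(.*?)obj\.onloadCallback'

no_flag = re.findall(pattern, sample)          # "." stops at the first newline
multiline = re.findall(pattern, sample, re.M)  # re.M only affects ^ and $
dotall = re.findall(pattern, sample, re.S)     # re.S lets "." match newlines

print(no_flag, multiline, dotall)
```

Only the re.S call finds anything here, because the text between the two markers contains line breaks.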

Fejs

It looks like you just want that giant encoded string. I believe your regex fails for two reasons. First, you're not running in DOTALL mode, so . won't match across multiple lines. Second, it may be hitting catastrophic backtracking, which can happen when a very long variable-length match can match the same characters as the pattern that follows it.

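This should get what you want:

import re
m = re.search(r'var iframeContent = "([^"]+)"', html_source)  # html_source: the page HTML as a string
if m:  # re.search returns None when nothing matches
    print(m.group(1))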

The regex is just looking for any characters except double quotes [^"] in between two double quotes. Because the variable length match and the match immediately after it don't match any of the same characters, you don't run into the catastrophic backtracking issue.
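
As a side note, a negated class such as [^"] also matches newlines, so this approach needs no DOTALL flag. A small sketch, assuming a made-up sample string and a hypothetical helper name extract_iframe_content (neither comes from the real page):

```python
import re

# Hypothetical sample: the quoted value contains a line break.
html_source = 'var x = 1;\nvar iframeContent = "line1\nline2";\nobj.onloadCallback = onloadCallback;'


def extract_iframe_content(source):
    """Return the quoted value assigned to iframeContent, or None if absent."""
    match = re.search(r'var iframeContent = "([^"]+)"', source)
    return match.group(1) if match else None


found = extract_iframe_content(html_source)
missing = extract_iframe_content('no such variable here')

print(repr(found))
print(missing)
```

Returning None for the missing case avoids an AttributeError when the page layout changes and the variable is no longer present.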

Brendan Abel

You could try this regex too:

(?<=iframeContent =)(.*)(?=obj\.onloadCallback = onloadCallback)

You can test it at this site.

It is very important that you use DOTALL mode, which means that you will have single-line
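
A rough sketch of the lookaround pattern in Python, with and without DOTALL. The sample string is invented for the demonstration and is not the actual page content.

```python
import re

# Hypothetical stand-in for the page source.
sample = 'var iframeContent = "abc\ndef";\nobj.onloadCallback = onloadCallback;'

# Lookbehind and lookahead anchor the match without including the markers.
lookaround = re.compile(
    r'(?<=iframeContent =)(.*)(?=obj\.onloadCallback = onloadCallback)',
    re.DOTALL,
)

with_flag = lookaround.search(sample)
without_flag = re.search(lookaround.pattern, sample)  # same pattern, no flag

captured = with_flag.group(1).strip()
print(repr(captured))
print(without_flag)
```

Without the flag the search returns None for this sample, since (.*) cannot cross the line breaks that sit between the two markers.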