
I'm using RegExr to build a regular expression, but it doesn't match anything.

I wrote this regex: '.dataLayer = (.+)</script>'

My template has this structure:

<!DOCTYPE html>
<html lang="en-PH" dir="ltr">
<head> … </head>
<body class="is_full PdpV4">
<script> This is the script I need to get </script>
<script> window.dataLayer = window.dataLayer || []; dataLayer.push({"feature_test":"VariableControl:1"}); dataLayer.push({"feature_set":"Control"});
</script>
<script>....</script>
<script> … </script>
</body>
</html>

And I need to get the first <script>...</script> element:

<script>
    dataLayer = [
        {
            "agent_id": 558921,
            "agent_name": "The City Townhouse",
            "attributes": {
                "agent_ratings_enabled": 0,
                "approved": 1,
                "attribute_set_id": 1,
                "categories": JSON.parse("[15,19]"),
                "indoor_features": ["Balcony","Maid's room"],
                "is_agent": 1,
                "listing_type": "Classifieds",
                "other_features": [],
                "outdoor_features": ["Garage"],
                "price_formatted": "₱ 11,300,000 ",
                "price_not_shown": false,
                "seller_is_trusted": 1,
                "show_listing_address": 1,
                "show_mobile": 1
            }
        }
    ];
</script>

I need to get everything inside the <script> tags. Thanks a lot.

  • Is there anything else outside of the ` – trincot May 01 '21 at 14:12
  • There are several <script> elements in the body of the template. – try_to_code May 01 '21 at 14:28
  • – try_to_code May 01 '21 at 14:29
  • @try_to_code `/.dataLayer = (.+?)/gs` regex. Add the s flag to regex. – Example person May 01 '21 at 14:30
  • Can you edit your question and add this to the input? Is it the very *first* script content you need? How do you decide which script is of your interest? – trincot May 01 '21 at 14:30

2 Answers


Avoid parsing HTML with regular expressions. A seminal Stack Overflow answer explains why.

Instead, use a package like html5lib to parse the HTML, extract the contents of the <script> elements, and then pull out the part you want from those. That way you only need to look at the JavaScript code, which should be a much simpler task.
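As a rough sketch of that approach: the example below swaps html5lib for the standard library's `html.parser.HTMLParser` so it runs without installing anything. The names `ScriptCollector` and `find_datalayer_script` are made up for illustration, the sample HTML is a trimmed stand-in for the real page, and the "starts with `dataLayer =`" check is an assumption about how to pick the right script.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element, in document order."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # finished script bodies
        self._chunks = None  # text chunks while inside <script>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so accumulate it.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.scripts.append("".join(self._chunks))
            self._chunks = None


def find_datalayer_script(html):
    """Return the first script body that starts with 'dataLayer =', or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for body in collector.scripts:
        # Assumption: the wanted script begins with a plain assignment,
        # unlike the later one that begins with 'window.dataLayer'.
        if body.strip().startswith("dataLayer ="):
            return body.strip()
    return None


# Trimmed, hypothetical stand-in for the page in the question.
sample = """<body>
<script>
    dataLayer = [{"agent_id": 558921}];
</script>
<script> window.dataLayer = window.dataLayer || []; </script>
</body>"""

print(find_datalayer_script(sample))
```

Once the script body is isolated like this, any further extraction only has to deal with JavaScript text, not the surrounding markup.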

Paul Fisher

First, what Paul said: don't use a regex to parse HTML.

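Second, and I don't recommend it, but if you're really intent on doing this for whatever reason, this regex will match the characters inside a <script>...</script> element, provided the s (dotall) flag is set so that . also matches newlines. The lazy .*? stops at the first closing tag instead of running on to the last one:

(?<=<script>).*?(?=<\/script>)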

You have been warned.
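For anyone who goes down this road anyway, here is a small sketch of how this kind of lookaround pattern behaves in Python's `re` module. The sample HTML is a trimmed, hypothetical stand-in for the page in the question, and the variable names are only for illustration.

```python
import re

# Trimmed, hypothetical stand-in for the page in the question.
html = """<body>
<script>
    dataLayer = [{"agent_id": 558921}];
</script>
<script> window.dataLayer = window.dataLayer || []; </script>
</body>"""

# Greedy '.*' without DOTALL: '.' cannot cross newlines, so the multi-line
# first script is skipped and the single-line second script matches instead.
greedy_single_line = re.search(r"(?<=<script>).*(?=</script>)", html)

# re.DOTALL lets '.' match newlines; the lazy '.*?' stops at the first </script>.
lazy_dotall = re.search(r"(?<=<script>).*?(?=</script>)", html, re.DOTALL)

first_script = lazy_dotall.group(0).strip()
print(first_script)
```

This only works for bare `<script>` tags; a tag with attributes, or a `</script>` inside a JavaScript string, would break it, which is part of why a parser is the safer choice.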

Jamie Scott