-1

I need to get all script tags from an HTML string, separating the inline scripts from the "linked" scripts. By inline scripts I mean script tags without the src attribute.

Here is how I get the "linked scripts":

<script(.)+src=(.)+(/>|</script>)

so, having <script followed by one or more of any character, followed by src=, followed by one or more of any character, followed by /> or </script>.

This works as expected.

Now I want to get all the script tags without the src attribute, having some JavaScript code between <script .....> and </script>, but I can't figure out how to do that. I just started understanding regular expressions, so the help of a more experienced regex guru is needed :)

UPDATE Ok, so dear downvoters. I have the HTML code for a whole page in a variable. I want to extract the script tags from it. How do I do it, using jQuery for example?

var dom = $(html);
console.log(dom.find('script'));

will not work. So, what is the way to accomplish that?

UPDATE 2 I don't need to solve this problem with regex, but because I am learning about them now, I thought I would try it. I am open to any other solution.

Tamás Pap
  • 3
    You have access to the DOM, don't use regex to search for DOM elements. – zzzzBov Jan 28 '13 at 19:25
  • 4
    Please for God sake, don't use Regex to parse HTML. – Rohit Jain Jan 28 '13 at 19:25
  • Yes I have access to dom. This is why you downvoted? I want to learn regular expressions, and this time I want to solve the problem this way. What's the problem with that? – Tamás Pap Jan 28 '13 at 19:26
  • 2
    @TamasPap The problem is that you will abuse regex for no good, only to find that you can't parse HTML with it. And then maybe you stop learning it. Regex is not the right tool for this. It can only recognize regular languages. – Rohit Jain Jan 28 '13 at 19:28
  • 1
    @TamasPap http://stackoverflow.com/a/1732454/1640800 – CorrugatedAir Jan 28 '13 at 19:31
  • 3
    @TamasPap - It is absolutely possible to parse HTML with regex, and don't let anyone tell you otherwise. [But it is really, really, really hard](http://stackoverflow.com/a/4234491/211627) – JDB still remembers Monica Jan 28 '13 at 19:32
  • Ok, let's say I use jQuery for that. I can't just use: `var dom = $(html); console.log(dom.find('script'));` because `dom.find('script')` returns nothing. So how do I do it then? :) – Tamás Pap Jan 28 '13 at 19:34
  • @apsillers In my situation user writes html code in a textarea or whatever, and I want to extract script, style and link tags to process them. – Tamás Pap Jan 28 '13 at 19:36
  • 1
    @TamasPap Okay, so do something like `var dummyDoc = document.createElement("html"); dummyDoc.innerHTML = myTextArea.value;` and then use DOM methods to extract the elements from `dummyDoc`. – apsillers Jan 28 '13 at 19:38
  • I think you ought to clarify your question: do you want the *best* solution (i.e., using built-in DOM parsing methods) or a *regex* solution (e.g., so you can learn regex better, **not** because you're actually interested in solving this problem in the best way)? – apsillers Jan 28 '13 at 19:41

2 Answers

2

Create a DOM element using document.createElement, then set its innerHTML to the contents of your HTML string. This will automatically parse your HTML using the browser's built-in parser and fill your newly-created element with children.

var dummyDoc = document.createElement("html");
dummyDoc.innerHTML = "<body><script>alert('foo');</script></body>"; // or myInput.value
var dom = $(dummyDoc);
var scripts = dom.find('script');

(I only use jQuery because you do so in your question. This is certainly also possible without jQuery.)
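
As a rough illustration of the jQuery-free route, and of separating linked from inline scripts as the question asks, here is a minimal sketch. The helper name `splitScripts` and the sample markup are assumptions made for this example, not part of the original answer. It relies only on the standard `querySelectorAll`, `getAttribute` and `textContent` DOM APIs.

```javascript
// Hypothetical helper: groups the <script> elements found under `root`.
function splitScripts(root) {
  // script[src] matches linked scripts; script:not([src]) matches inline ones.
  var linked = Array.prototype.slice.call(root.querySelectorAll("script[src]"));
  var inline = Array.prototype.slice.call(root.querySelectorAll("script:not([src])"));
  return {
    linked: linked.map(function (el) { return el.getAttribute("src"); }),
    inline: inline.map(function (el) { return el.textContent; })
  };
}

// Browser-only usage; skipped where no DOM exists (for example plain Node.js).
if (typeof document !== "undefined") {
  var dummyDoc = document.createElement("html");
  // Sample markup, assumed for illustration; in practice use myInput.value.
  dummyDoc.innerHTML =
    "<body><script src='a.js'></script><script>alert('foo');</script></body>";
  console.log(splitScripts(dummyDoc));
}
```

Scripts inserted through innerHTML are parsed but not executed, so this only inspects the markup.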

apsillers
  • Thank you for giving me a solution, and not just commenting about what a wild and bad idea it is to parse HTML with regexp :). However, I wasn't "parsing HTML", I just tried to extract script tags from it, which is absolutely fine, and I don't think people can give me a real-world example of when it will not work. Thank you again! – Tamás Pap Jan 28 '13 at 19:50
1

If you are in a position where no DOM access is available (Node.js?), you may have to fall back on regex. Here is a solution that worked for me in similar circumstances:

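function scrapeInlineScripts(sHtml) {
    // Turn every opening <script ...> tag into a '</script>' delimiter, then split on it:
    // the pieces alternate between markup (even indexes) and script bodies (odd indexes).
    var a = sHtml.split(/<script[^>]*>/).join('</script>').split('</script>'),
        s = '';

    // Concatenate the script bodies; tags with a src attribute contribute an empty string.
    for (var n=1; n<a.length; n+=2) {
        s += a[n];
    }
    return s;
}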
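
If the inline and linked scripts need to be kept apart, as the question asks, a regex-based variant could look roughly like the sketch below. The function name `splitScriptTags` and its return shape are assumptions made for this example. It is not a general HTML parser: it will misfire on a `>` inside a quoted attribute value, on `</script>` inside a string literal, and on scripts inside HTML comments.

```javascript
// Hypothetical helper: separates <script> tags into inline bodies and linked src values.
function splitScriptTags(html) {
  // Group 1: the attribute text of the opening tag; group 2: the script body.
  var tagRe = /<script\b([^>]*)>([\s\S]*?)<\/script\s*>/gi;
  // Matches a src attribute only at the start or after whitespace, so data-src is ignored.
  var srcRe = /(?:^|\s)src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i;
  var result = { inline: [], linked: [] };
  var m;
  while ((m = tagRe.exec(html)) !== null) {
    var src = srcRe.exec(m[1]);
    if (src) {
      // Exactly one of the three groups is set: double-quoted, single-quoted or unquoted.
      result.linked.push(src[1] !== undefined ? src[1]
                       : src[2] !== undefined ? src[2] : src[3]);
    } else {
      result.inline.push(m[2]);
    }
  }
  return result;
}

var sample = '<script src="a.js"></script><script>alert(1);</script>';
console.log(splitScriptTags(sample));
```

Matching the attribute text and the body in separate groups avoids trying to express "no src attribute" as a single pattern, which is the part that is awkward to write as one regex.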
Lex