I was working on a homework problem that involves removing all of the HTML tags "<...>" from the text of an HTML document and then counting all of the tokens in that text.

I wrote a solution that works, but it all comes down to a single line of code that I didn't actually write, and I'm curious to learn more about how this kind of code works.

public static int tagStrip(Scanner in) {
     int count = 0; 

     while(in.hasNextLine()) {
         String line = in.nextLine();

         line = line.replaceAll("<[^>\r\n]*>", "");

         Scanner scan = new Scanner(line);

         while(scan.hasNext()) {
             scan.next(); // consume the token; only the count is needed
             count++;
         }
     }
     return count;
}  

Line 7 (the replaceAll() call) is the one I'm curious about. I understand how the replaceAll() method works, but I'm not sure how the String "<[^>\r\n]*>" works. I read a little bit about patterns and experimented with it.
I replaced it with "<[^>]+>" and it still works exactly the same. So I was hoping somebody could explain how these characters work and what they do, especially in the context of this type of program.

Emma
  • `<[^>\r\n]*>` A negated class turns its items into an _AND_: the character must not be `>` and not `\r` and not `\n`. So, it will stop matching if it finds a `>` or a `\r` or a `\n`; basically, it won't span lines. `<[^>]+>` will span lines since `\r\n` is removed. –  May 18 '19 at 20:42
    The regex you should be using though is this `"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>` https://regex101.com/r/ZE9Ayg/1 –  May 18 '19 at 20:44

1 Answer

RegEx

If you wish to explore or modify your expression, you can test and edit it at regex101.com.
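
For a piece-by-piece reading of the pattern without leaving Java, here is a minimal sketch that writes the same expression with `Pattern.COMMENTS`, a flag that makes the regex engine ignore whitespace and `#` comments inside the pattern. The class name `AnnotatedTagPattern` and the helper `stripTags` are hypothetical names chosen for illustration; they are not part of the original program.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the question.
public class AnnotatedTagPattern {
    // Same expression as "<[^>\r\n]*>", spread over several lines.
    // With Pattern.COMMENTS, whitespace and #-comments in the pattern are ignored.
    static final Pattern TAG = Pattern.compile(
          "<             # a literal '<' opens the tag\n"
        + "[^>\\r\\n]*   # zero or more characters that are not '>', CR, or LF\n"
        + ">             # a literal '>' closes the tag\n",
        Pattern.COMMENTS);

    // Removes every match of TAG from a single line of text.
    static String stripTags(String line) {
        return TAG.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints: Hello world
    }
}
```

The `^` right after `[` is what negates the class: without it, `[>\r\n]` would match only those three characters instead of everything except them.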

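`<[^>]+>` is not quite equivalent. Because `[^>]` matches any character other than `>`, including `\r` and `\n`, it can match a tag that spans several lines, which seems to be undesired and which `<[^>\r\n]*>` refuses to do. In your program the two behave the same only because each string comes from `nextLine()`, which never includes the line terminator. The other difference is the quantifier: `*` allows an empty body, so `<>` is removed, whereas `+` requires at least one character between the brackets.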
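
The difference between the two patterns only shows up on input that the question's program never produces, so a small side-by-side sketch may help. The class name `TagPatternDemo`, the constants, and the sample strings below are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two patterns from the question.
public class TagPatternDemo {
    // Original pattern: the tag body may not contain '>', CR, or LF, and may be empty.
    static final Pattern SINGLE_LINE = Pattern.compile("<[^>\\r\\n]*>");
    // Variant: the body may contain newlines, but must have at least one character.
    static final Pattern ANY_BODY = Pattern.compile("<[^>]+>");

    // Removes every match of the given pattern from the text.
    static String strip(Pattern p, String text) {
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String oneLine = "<p>Hello <b>world</b></p>";
        String twoLines = "<a\nhref=\"x\">link</a>"; // opening tag spans two lines
        String emptyTag = "a <> b";

        // On ordinary single-line tags both patterns agree.
        System.out.println(strip(SINGLE_LINE, oneLine)); // prints: Hello world
        System.out.println(strip(ANY_BODY, oneLine));    // prints: Hello world

        // Only the variant removes the tag that spans a line break.
        System.out.println(strip(ANY_BODY, twoLines));   // prints: link
        System.out.println(strip(SINGLE_LINE, twoLines).startsWith("<a")); // prints: true

        // Only the original removes the empty tag "<>".
        System.out.println(strip(ANY_BODY, emptyTag).equals(emptyTag)); // prints: true
    }
}
```

If tags in the real input can be split across lines, reading line by line hides them from either pattern, so the whole text would have to be matched at once for the variant to make a difference.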


RegEx Circuit

You can also visualize your expressions at jex.im.

Community
Emma