1

Is it possible in C# to check whether an HTML string actually contains some text, or is made up of HTML tags and entities only?

For example:

string str = @"<p xmlns=""http://www.w3.org/1999/xhtml"" />";

This contains only an HTML tag and no text.

Mr_Green
Ravi Gupta
  • I would recommend looking at http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack – Derek Nov 06 '12 at 10:15

5 Answers

1
// Requires: using System.Linq; using System.Xml.Linq;
XDocument doc = XDocument.Parse(yourString);
// True if any node below the root element is a text node.
bool containsText = doc.Root.DescendantNodes()
          .Any(node => node is XText);

Tip:

I often combine this approach with SGMLReader to ensure well-formed XML for XDocument.Parse(...)
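As a minimal sketch, the snippet above could be wrapped in a helper that also reports input XDocument cannot parse. The class and method names (XhtmlTextCheck, ContainsText) are hypothetical, and treating whitespace-only text nodes as "no text" is an assumption you may want to change:

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlTextCheck
{
    // Hypothetical helper: returns true if the markup has at least one
    // non-whitespace text node, false if it has none, and null if the
    // input is not well-formed XML (XDocument.Parse throws XmlException).
    public static bool? ContainsText(string xhtml)
    {
        try
        {
            XDocument doc = XDocument.Parse(xhtml);
            return doc.DescendantNodes()
                      .OfType<XText>() // also covers CDATA sections
                      .Any(t => !string.IsNullOrWhiteSpace(t.Value));
        }
        catch (XmlException)
        {
            return null;
        }
    }
}
```

Named HTML entities such as &nbsp; are not declared in plain XML, so input containing them will most likely come back as null here unless it is cleaned up first.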

Lukas Winzenried
0

If you only need to handle valid XHTML, you can use classes from the standard .NET library, such as XmlReader or XDocument.

You will need to parse your entire HTML string. For each element, simply check whether it contains any text.

However, as others have mentioned, this will only work for well-formed XML, which HTML often isn't. In that case you are probably better off with the libraries mentioned in the other answers.
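A rough sketch of the XmlReader variant, assuming the input is well-formed (anything else throws XmlException, which is deliberately not caught here). The names are made up for illustration, and whitespace-only text is again assumed not to count as content:

```csharp
using System.IO;
using System.Xml;

public static class XmlReaderTextCheck
{
    // Hypothetical helper: walks the document node by node and stops at
    // the first text or CDATA node that is not just whitespace.
    public static bool ContainsText(string xhtml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                bool isTextNode = reader.NodeType == XmlNodeType.Text
                               || reader.NodeType == XmlNodeType.CDATA;
                if (isTextNode && !string.IsNullOrWhiteSpace(reader.Value))
                {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Because the reader streams through the input, it can return as soon as it sees text instead of building the whole tree first.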

Steven Jeuris
  • @L.B He is using the XHTML namespace. If I'm not mistaken that should be strict XML? Good point though if he wants to parse _any_ HTML. – Steven Jeuris Nov 06 '12 at 10:17
0

If you parse your input with the HTML Agility Pack, you can then check the document.DocumentNode.InnerText property to see whether there is any text in the whole fragment.

Rawling
  • What if your fragment is `"I laugh "`? – TrueWill Apr 16 '13 at 21:08
  • I honestly don't know, but even if it treats `` as an HTML tag, the inner text should still contain "I laugh " rather than be empty. – Rawling Apr 17 '13 at 07:58
  • `var doc = new HtmlDocument(); doc.LoadHtml("I laugh "); Console.WriteLine(doc.DocumentNode.InnerText);` prints "I laugh ". – TrueWill Apr 17 '13 at 14:20
  • So the check "does it contain any text or is it all HTML tags and entities?" will correctly return "it contains some text". – Rawling Apr 17 '13 at 14:53
  • But the check will fail for the input "", which is not an HTML tag. (This is an edge case, and may be acceptable to the OP.) – TrueWill Apr 17 '13 at 15:00
  • But `` doesn't contain any text! Sure, it's not a regular HTML tag, but write that in an HTML file and open it in a browser and you won't see `` on the page. – Rawling Apr 17 '13 at 16:19
0

This is one case where using a regular expression with HTML would be a valid approach. It normally isn't, because HTML isn't a regular language. However, the features we care about can be expressed as a regular language: we don't care about the potentially limitless nesting of tags, which is what makes HTML not a regular language.

Or in other words, the rule that you can't parse HTML with a regular expression still applies, but you aren't actually parsing here. (Incidentally, a recursive regular expression also allows parsing HTML, in theory at least).

The tricky bit in writing it is that > is allowed in attribute values. Were it not for that, the simple expression ^(<[^>]*>)*$ would be all it would take to match a tags-only string (adjust to allow whitespace also if you want).
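To make the limitation concrete, here is a small sketch of that simple expression, written with a trailing * on the group so it accepts any number of tags, and with optional whitespace between them. The class name TagsOnlyRegex is hypothetical, and the pattern assumes > never appears inside an attribute value:

```csharp
using System.Text.RegularExpressions;

public static class TagsOnlyRegex
{
    // Zero or more "<...>" runs, with optional whitespace around them.
    // Assumption: '>' never occurs inside an attribute value.
    private static readonly Regex TagsOnly =
        new Regex(@"^\s*(<[^>]*>\s*)*$", RegexOptions.CultureInvariant);

    public static bool IsTagsOnly(string html) => TagsOnly.IsMatch(html);
}
```

With this pattern, @"<p xmlns=""http://www.w3.org/1999/xhtml"" />" is reported as tags-only and "<p>text</p>" is not. The input @"<a title=""a > b""></a>" is wrongly rejected, because the first > inside the attribute value is taken as the end of the tag.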

The fiddliness of dealing with > in attributes, though, makes me favour:

public static bool IsTagsOnly(string html)
{
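  bool inTag = false;   // true while we are between '<' and its closing '>'
  char attChar = '\0';  // quote character of the attribute value we are inside, or '\0' if none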
  foreach(char c in html)
  {
    if(char.IsWhiteSpace(c))//include or excise this bit depending on whether you count whitespace as "content"
    {
      continue;
    }
    if(!inTag)
    {
      if(c == '<')
        inTag = true;
      else
        return false;
    }
    switch(c)
    {
      case '\'':
        switch(attChar)
        {
          case '\'':
            attChar = '\0';
            break;
          case '\0':
            attChar = '\'';
            break;
        }
        break;
      case '"':
        switch(attChar)
        {
          case '"':
            attChar = '\0';
            break;
          case '\0':
            attChar = '"';
            break;
        }
        break;
      case '>':
        if(attChar == '\0')
          inTag = false;
        break;
    }
  }
  return true;
}
Jon Hanna
-1

Whenever you deal with HTML, it's quite tricky.

You could achieve that with a regular expression, but please note that PARSING HTML WITH REGULAR EXPRESSIONS IS A BAD IDEA!!! This is simply because HTML can be incorrectly formatted.

If you want to do it properly, I would suggest using an HTML parser like Argotic or HtmlAgilityPack (both are available on NuGet).

Hope it helps

Sebastian Siek
  • OP hasn't mentioned anything about Regex. No need to shout. [Oh Yes You Can Use Regexes to Parse HTML!](http://stackoverflow.com/a/4234491/932418) – L.B Nov 06 '12 at 10:23
  • I wasn't shouting, only emphasized that Regex should not be used. – Sebastian Siek Nov 06 '12 at 10:24
  • @SebastianSiek In that case, please don't _over_ emphasize. ;p – Steven Jeuris Nov 06 '12 at 10:27
  • "This is simply because HTML can be incorrectly formatted" is triply untrue. 1: Incorrect formatting could always result in different treatment by a test like this and a given browser for any test other than using that browser's parsing engine. That will be the case, whatever solution is used. 2: Even with guaranteed correct formatting, HTML cannot be parsed with a regular expression, because HTML is not a regular language (though there is a common subset that is both valid HTML and also a regular language). 3: The question does not require the HTML to be parsed. – Jon Hanna Nov 06 '12 at 11:33