Is it possible in C# to check whether an HTML string actually contains some text, or is made up only of HTML tags and entities?
For example:
string str = @"<p xmlns=""http://www.w3.org/1999/xhtml"" />";
This contains only an HTML tag and no text.
// Requires: using System.Linq; using System.Xml.Linq;
XDocument doc = XDocument.Parse(yourString);
bool containsText = doc.Root.DescendantNodes()
    .OfType<XText>().Any(); // true if any text node exists under the root
Tip:
I often combine this approach with SGMLReader to ensure valid XML for XDocument.Parse(...).
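A slightly fuller sketch of the XDocument idea, wrapped in a method so it can be reused. The class and method names (XhtmlTextCheck, ContainsText) are made up for illustration, and treating whitespace-only text nodes as "no text" is an assumption you may want to change. Note that XDocument.Parse throws an XmlException on input that is not well-formed XML, and also on named entities such as &nbsp; that are not declared in the document.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlTextCheck
{
    // Hypothetical helper: returns true if the well-formed XML/XHTML string
    // has at least one non-whitespace text node under its root element.
    public static bool ContainsText(string xhtml)
    {
        // Throws XmlException if the input is not well-formed XML.
        XDocument doc = XDocument.Parse(xhtml);

        return doc.Root.DescendantNodes()
            .OfType<XText>() // XText also covers CDATA sections (XCData)
            .Any(t => !string.IsNullOrWhiteSpace(t.Value));
    }
}
```

With the string from the question, XhtmlTextCheck.ContainsText(str) returns false, while "<p>Hello</p>" gives true.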
If you only need to parse valid XHTML, you can use classes from the standard .NET library, such as XmlReader or XDocument.
You will need to parse your entire HTML string. For each element, simply check whether it contains any text.
However, as others have mentioned, this will only work for valid XML, which HTML often isn't. In that case you are probably better off with the libraries mentioned in the other answers.
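As a rough sketch of the XmlReader route, assuming the input is well-formed XHTML with no DOCTYPE: the reader walks every node and stops at the first text or CDATA node that is not just whitespace. The names XmlReaderTextCheck and ContainsText are hypothetical, not part of .NET.

```csharp
using System.IO;
using System.Xml;

public static class XmlReaderTextCheck
{
    // Hypothetical helper: streams through the document and returns true
    // as soon as a non-whitespace text or CDATA node is found.
    public static bool ContainsText(string xhtml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        {
            // Read() throws XmlException if the input is not well-formed.
            while (reader.Read())
            {
                bool isText = reader.NodeType == XmlNodeType.Text
                           || reader.NodeType == XmlNodeType.CDATA;

                if (isText && !string.IsNullOrWhiteSpace(reader.Value))
                {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Because the reader is forward-only, this avoids building the whole tree in memory, which may matter for large inputs.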
If you parse your input with the HTML Agility Pack, you can then check the document.DocumentNode.InnerText property to see whether there is any text in the whole fragment.
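A minimal sketch of that check, assuming the HtmlAgilityPack NuGet package is referenced. The wrapper class and method names are made up for illustration. Decoding entities with HtmlEntity.DeEntitize before the whitespace test is my own addition, so that a fragment containing only something like &nbsp; is reported as having no text; drop it if you want entities to count as content.

```csharp
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class HapTextCheck
{
    // Hypothetical helper: true if the fragment has any visible text
    // once tags are stripped and entities are decoded.
    public static bool ContainsText(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // tolerant of malformed HTML

        // InnerText concatenates the text of all nodes in the fragment.
        string text = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);

        return !string.IsNullOrWhiteSpace(text);
    }
}
```

Depending on the library version, InnerText may also include the contents of script, style, or comment nodes, so you may need to remove those nodes first if they appear in your input.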
This is one case where using a regular expression on HTML would be a valid approach. It normally isn't, because HTML isn't a regular language. However, the feature we care about here can be expressed as a regular language: we don't care about the potentially limitless nesting of tags, which is what makes HTML non-regular.
In other words, the rule that you can't parse HTML with a regular expression still applies, but you aren't actually parsing here. (Incidentally, a recursive regular expression can also parse HTML, in theory at least.)
The tricky bit in writing it is that > is allowed in attribute values. Were it not for that, the simple expression ^(<[^>]*>)*$ would be all it would take to match a tags-only string (adjust it to allow whitespace as well if you want).
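For illustration, here is how the simple expression might look in C#, adjusted to allow whitespace before, between, and after tags. This is only a sketch under the assumption stated above, that > never appears inside an attribute value. The class name TagsOnlyRegex is hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class TagsOnlyRegex
{
    // Zero or more "<...>" runs, optionally separated by whitespace.
    // Assumes '>' never appears inside an attribute value.
    private static readonly Regex TagsOnly =
        new Regex(@"^\s*(<[^>]*>\s*)*$", RegexOptions.CultureInvariant);

    public static bool IsTagsOnly(string html)
    {
        return TagsOnly.IsMatch(html);
    }
}
```

This returns true for the string in the question and false for "<p>Hello</p>". It also returns false for a tags-only input such as <a title="a > b"></a>, because the > inside the attribute value ends the match early, which is exactly the weakness described above.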
The fiddliness of dealing with > in attributes, though, makes me favour:
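public static bool IsTagsOnly(string html)
{
    bool inTag = false;  // true while between '<' and its closing '>'
    char attChar = '\0'; // quote character of the attribute value we are inside, or '\0' if none
    foreach (char c in html)
    {
        // Include or excise this check depending on whether you count whitespace as "content".
        if (char.IsWhiteSpace(c))
        {
            continue;
        }
        if (!inTag)
        {
            if (c == '<')
                inTag = true;
            else
                return false; // a character outside any tag, so this is text
        }
        switch (c)
        {
            case '\'':
                switch (attChar)
                {
                    case '\'':
                        attChar = '\0'; // closing single quote
                        break;
                    case '\0':
                        attChar = '\''; // opening single quote
                        break;
                }
                break;
            case '"':
                switch (attChar)
                {
                    case '"':
                        attChar = '\0'; // closing double quote
                        break;
                    case '\0':
                        attChar = '"'; // opening double quote
                        break;
                }
                break;
            case '>':
                // A '>' inside a quoted attribute value does not close the tag.
                if (attChar == '\0')
                    inTag = false;
                break;
        }
    }
    return true;
}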
Dealing with HTML is always quite tricky.
You could achieve this with a regular expression, but please note that PARSING HTML WITH REGULAR EXPRESSIONS IS A BAD IDEA! This is simply because HTML is often not well-formed.
If you want to do it properly, I would suggest using an HTML parser such as Argotic or HtmlAgilityPack (both are available on NuGet).
Hope it helps