0

Strange question, but I won't waste time explaining why I need to do this, just that I need to do it.

I have the following:

<input type="radio" name="eq_9" id="eq_9_2"  onclick="nextStepID_load(this);" value="Installer." title="912" /><label for="eq_9_2">Installer</label> <br />

I need to turn that into:

<button type="button" name="eq_9" id="eq_9_2" onclick="nextStepID_load('912');">Installer</button><br />

I am using C#/ASP.NET (3.5 or below), and JavaScript for the performJS (which is a placeholder until I figure out how to replace the HTML).

Please note that the source providing this sends me a single string containing MANY rows of these inputs, and I need to replace each row with the values that apply to it.

Right now, I've tried adding a .Replace("","\">"); which does replace the radio tags, but obviously makes it look horrible codewise, and doesn't remove the label or put the label contents in between the tags.

I'm sure this is probably best solved by a regex, but I'm not very familiar with regex. I have been toying with regexlib to see if I can figure out a regex on my own... here is what I have so far, although I imagine I'm pretty far off.

string strRegex = @"<input type=""radio"" [\s]*=[\s]*""?[^\w_]*""?[^>]*>";
RegexOptions myRegexOptions = RegexOptions.IgnoreCase | RegexOptions.Multiline;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = @"<input type=""radio"" name=""eq_9"" id=""eq_9_2""  onclick=""nextStepID_load(this);"" value=""Installer."" title=""912"" /><label for=""eq_9_2"">Installer</label> <br />";
string strReplace = @"<button type=""button""></button>";
return myRegex.Replace(strTargetString, strReplace);
JClaspill
  • Ooh! Ooh! It's my turn to post this link detailing the folly of parsing HTML with regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Ben Nov 23 '10 at 21:59
  • This appears to be fixed HTML of well-known characteristics. If that is true, then it is trivial to parse with regexes, and you should probably simply do that. The real question is how fixed it actually is, and what goes where. – tchrist Nov 23 '10 at 22:30
  • @tchrist: Sure, his little example looks like fixed html, and I'm sure the document he's receiving is well-formed (to who knows what spec), and I'm equally sure that somewhere you can build a reg ex to parse the example (this time). But at the end of the day, Friends don't let Friends mix RegEx and HTML. Ever. – NotMe Nov 23 '10 at 23:02
  • @Chris: I have no qualms about using regexes on HTML *that I have myself generated*. On other peoples’, well, that’s when I start to get queasy. – tchrist Nov 23 '10 at 23:07
  • @Chris: Plus at the risk of sacrificing modesty for honesty, what *I* find feasible with regexes is probably a substantially larger set than this user would find, considering things like [this](http://stackoverflow.com/questions/4246077/simple-problem-with-regular-expression-only-digits-and-commas/4247184#4247184) and [this](http://stackoverflow.com/questions/4031112/regular-expression-matching/4034386#4034386), and [this](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491). See what I mean? :) – tchrist Nov 23 '10 at 23:15
  • @tchrist: from reading those links I'm pretty sure that what you find feasible with regex is beyond what any normal person could even consider.(yes, that's a compliment). So, I'll amend my statement to say, "Friends don't let Friends mix RegEx and HTML.. unless you are tchrist." – NotMe Nov 23 '10 at 23:20
  • @Chris: Thanks. ☺ And you're a faster reader, too. ☺² – tchrist Nov 23 '10 at 23:26

3 Answers

4

Do not use regular expressions to work with HTML. They are only flexible enough for 95% of all cases, and that should tell you that they are the wrong tool for the job.

Using the HTML Agility Pack, you can load in your document and use something like this to replace...

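HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Path\To\Page.html");

// SelectNodes is called on DocumentNode and returns null when nothing matches
HtmlNodeCollection radios = doc.DocumentNode.SelectNodes("//input[@type='radio']");

if (radios != null)
{
    foreach (HtmlNode node in radios)
    {
        HtmlAttribute name = node.Attributes["name"];

        // The attribute text is in name.Value
        if (name != null && name.Value.ToLower().StartsWith("eq_"))
        {
            //Build your button element and replace the radio using ReplaceChild
        }
    }
}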
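The "build your button and replace the radio" step could look roughly like the sketch below. It assumes the HTML Agility Pack assembly is referenced and that the markup arrives as a string (as the question describes), so it uses LoadHtml rather than Load. The class and method names (RadioToButton, ConvertRadios) are made up for illustration, and the serialized result may not be byte-for-byte identical to the original markup outside the replaced elements.

```csharp
using System;
using HtmlAgilityPack;

public static class RadioToButton
{
    // Hypothetical helper: turns each eq_* radio input and its label into a button.
    public static string ConvertRadios(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection radios = doc.DocumentNode.SelectNodes("//input[@type='radio']");
        if (radios == null)
        {
            return html;
        }

        foreach (HtmlNode radio in radios)
        {
            string name = radio.GetAttributeValue("name", "");
            if (!name.StartsWith("eq_", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            string id = radio.GetAttributeValue("id", "");
            string title = radio.GetAttributeValue("title", "");

            // The button text comes from the matching <label for="...">, if there is one.
            // Assumption: ids never contain an apostrophe, so the XPath stays valid.
            HtmlNode label = doc.DocumentNode.SelectSingleNode("//label[@for='" + id + "']");
            string text = label != null ? label.InnerHtml : "";

            HtmlNode button = HtmlNode.CreateNode(string.Format(
                "<button type=\"button\" name=\"{0}\" id=\"{1}\" onclick=\"nextStepID_load('{2}');\">{3}</button>",
                name, id, title, text));

            // Swap the radio for the button, then drop the now-redundant label.
            radio.ParentNode.ReplaceChild(button, radio);
            if (label != null)
            {
                label.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling RadioToButton.ConvertRadios(htmlString) on the string returned by the other class would then give back the rewritten markup in one pass, however many rows it contains.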
carla
Josh Stodola
2

I'm sure this is probably best solved by a regex, but I'm not very familiar with regex.

I'm afraid that is not a good sign, since if you are not very familiar with regexes, then it is rather unlikely that anything is best solved by a regex. :(

There isn't enough description of the problem to know for sure whether even a regex wizard could quickly craft a solution using regexes. I am pretty sure you need to do more than merely exchange one fixed string for another, because if you did, you'd've done that already. So parts of it must be parameterized. I just don't know which.

Not counting whitespace, would you say that your problem is one of transforming input of this form:

<input 
    type="radio"
    name="eq_X"
    id="eq_X_Y"
    onclick="nextStepID_load('XNY');"
    value="Z."
    title="XNY"  
/>
<label for="eq_X_Y"> Z </label>
<br />

into output of this form:

<button 
    type="button" 
    name="eq_X" 
    id="eq_X_Y" 
    onclick="nextStepID_load('XNY');" 
>
Z
</button>

<br />

As you see, I've parameterized X, Y, Z, and N. Here are questions I have for you:

  1. Would you say my parameterization of your problem describes it accurately?

  2. Does this validate under some particular DTD, and if so, which one?

  3. Are the attributes always none other than those?

  4. Are the attributes that occur always in that precise order?

  5. How many such things do you have to do?

  6. Does this occur in plain HTML, or is it actually hidden in some Javascript?

  7. Do you know whether there are any <script> or <style> elements, or <!-- ... --> comments that contain things that look just what you are looking for?

  8. Do you know whether there are any such elements intervening in the middle of the thing you are looking for?

  9. Are all the attributes of the form NAME="VALUE" with only double quotes around the value, never single quotes or omitted altogether?

  10. Is the casing of the identifiers always in lower case?

  11. Are they all in one file?

  12. Is there some reason that your output sample lost some of its non-significant whitespace?

It is questions like these that show why the problem is almost certainly much more complicated than it appears, which raises a couple of final questions:

  1. Have you ever used an HTML parsing class before?

  2. Would you like to learn how?

tchrist
  • +1. I think the questions you raise alone should push them to your conclusion about using an HTML parser. This is a very good run down of the problem space the OP is faced with. – NotMe Nov 23 '10 at 23:15
  • 2) Using xHTML 1.0 Transitional . 3) Those are always the order and general data contained in each element/attribute. 4) Yes. 5) This is a dynamic amount based on the current step. But I ALWAYS need to process ALL of them, be it 1 or 1000. 6) A C# class returns this(html code) as a string to the page I am trying to alter. 7) All is parsed on the C# class I cannot edit, nor remove from the process. 8) There is never a time when the attributes are available to me until the html is passed back as a string. – JClaspill Nov 24 '10 at 16:11
  • 9) Format is exactly as described above(an actual example). 10) Yes. Again, always contains the same format, just different attribute data. 11) One string, passed back to the page. 12) There is no whitespace outside of attributes on the returned data. _______ Part 2 1) No. 2) Would love to learn new tricks – JClaspill Nov 24 '10 at 16:12
  • @JClaspill: I hope you consider using a parsing class. I normally would write you a regex for this, but I’ve had my fill of paddling upstream for a while: I've just spent the last too-many-hours reïmplementing `\w`,`\W`, `\d`,`\D`,`\s`,`\S`, `\b`,`\B`, `\p{Space}`, `\p{Alpha}`, `\p{Hyphen}`,`\p{Dash}`, and `\p{QMark}` for Java because they’re all either **utterly broken** or completely missing in Java’s standard Pattern class. Regexes in Java — especially with Unicode — are too much like trying to teach a pig to sing: it wastes your time and annoys the pig. It also annoys everybody around you. – tchrist Nov 24 '10 at 20:27
1

This exercise does what you need using regular expressions.

This is how it works: I do not replace values in the original string. Instead, I take the goal string (the result we want to arrive at) and build it with the correct values. I believe this approach gives you flexibility, since you can format the output string however you want.

  • The inputList is a list of test cases.
  • The target is the goal string with placeholders, we can format it however we want/need/have to.
  • The GetValue() method uses the regex argument to look for a specific value. It matches the HTML attribute as a name="value" pair, takes the value, and removes the enclosing quotes.
  • Finally with the string.Format() we build the output string as desired.

It is full code, so you can try it. You can also turn the idea behind this piece of code into a method and integrate it into your solution.

Please let me know if it worked for you too.

    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            // Test cases: one string per row of input markup
            List<string> inputList = new List<string>();
            inputList.Add("<input type=\"radio\" name=\"eq_9\" id=\"eq_9_2\"  onclick=\"nextStepID_load(this);\" value=\"Installer.\" title=\"912\" /><label for=\"eq_9_2\">Installer</label> <br />");
            inputList.Add("<input type=\"radio\" name=\"eq_10\" id=\"eq_9_3\"  onclick=\"nextStepID_load(this);\" value=\"Installer1.\" title=\"913\" /><label for=\"eq_9_3\">InstallerA</label> <br />");
            inputList.Add("<input type=\"radio\" name=\"eq_11\" id=\"eq_9_4\"  onclick=\"nextStepID_load(this);\" value=\"Installer2.\" title=\"914\" /><label for=\"eq_9_4\">InstallerB</label> <br />");
            inputList.Add("<input type=\"radio\" name=\"eq_12\" id=\"eq_9_5\"  onclick=\"nextStepID_load(this);\" value=\"Installer3.\" title=\"915\" /><label for=\"eq_9_5\">InstallerC</label> <br />");
            inputList.Add("<input type=\"radio\" name=\"eq_13\" id=\"eq_9_6\"  onclick=\"nextStepID_load(this);\" value=\"Installer4.\" title=\"916\" /><label for=\"eq_9_6\">InstallerD</label> <br />");

            string output = string.Empty;
            // Goal string with placeholders: {0}=name, {1}=id, {2}=title, {3}=value
            string target = "<button type=\"button\" name=\"{0}\" id=\"{1}\" onclick=\"nextStepID_load('{2}');\">{3}</button><br />";

            foreach (string input in inputList)
            {
                string name = GetValue(@"(?<Value>name=[\S]+)", input);
                string id = GetValue(@"(?<Value>id=[\S]+)", input);
                string title = GetValue(@"(?<Value>title=[\S]+)", input);
                // Note: this reads the value attribute ("Installer."), not the label text ("Installer")
                string value = GetValue(@"(?<Value>value=[\S]+)", input);

                output = string.Format(target, name, id, title, value);
                System.Diagnostics.Debug.WriteLine(output);
            }
        }

        // Finds the first attr="value" match, keeps the part after '=' and strips the quotes
        private static string GetValue(string pattern, string input)
        {
            Regex regex = new Regex(pattern);
            Match match = regex.Match(input);
            return match.ToString().Split('=').Last().Replace("\"", string.Empty);
        }
    }

This is the input:

    <input type="radio" name="eq_9" id="eq_9_2"  onclick="nextStepID_load(this);" value="Installer." title="912" /><label for="eq_9_2">Installer</label> <br />
    <input type="radio" name="eq_10" id="eq_9_3"  onclick="nextStepID_load(this);" value="Installer1." title="913" /><label for="eq_9_3">InstallerA</label> <br />
    <input type="radio" name="eq_11" id="eq_9_4"  onclick="nextStepID_load(this);" value="Installer2." title="914" /><label for="eq_9_4">InstallerB</label> <br />
    <input type="radio" name="eq_12" id="eq_9_5"  onclick="nextStepID_load(this);" value="Installer3." title="915" /><label for="eq_9_5">InstallerC</label> <br />
    <input type="radio" name="eq_13" id="eq_9_6"  onclick="nextStepID_load(this);" value="Installer4." title="916" /><label for="eq_9_6">InstallerD</label> <br />

This is the output:

    <button type="button" name="eq_9" id="eq_9_2" onclick="nextStepID_load('912');">Installer.</button><br />
    <button type="button" name="eq_10" id="eq_9_3" onclick="nextStepID_load('913');">Installer1.</button><br />
    <button type="button" name="eq_11" id="eq_9_4" onclick="nextStepID_load('914');">Installer2.</button><br />
    <button type="button" name="eq_12" id="eq_9_5" onclick="nextStepID_load('915');">Installer3.</button><br />
    <button type="button" name="eq_13" id="eq_9_6" onclick="nextStepID_load('916');">Installer4.</button><br />
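Note that the output above carries the value attribute ("Installer.") rather than the label text ("Installer") that the question asked for. If the markup really is as fixed as the asker says (same attribute order, double-quoted values, label directly after the input), a single Regex.Replace over the whole string is one possible variation. This is only a sketch under those assumptions; RadioRowRewriter and Rewrite are names invented for the example, and any row that deviates from the format is simply left untouched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RadioRowRewriter
{
    // Assumes the exact shape from the question: type, name, id first, title last,
    // all values double-quoted, and the <label> immediately following the <input>.
    private static readonly Regex RadioRow = new Regex(
        @"<input\s+type=""radio""\s+name=""(?<name>[^""]*)""\s+id=""(?<id>[^""]*)""" +
        @"[^>]*?\stitle=""(?<title>[^""]*)""\s*/>\s*" +
        // \k<id> requires the label's for attribute to repeat the input's id
        @"<label\s+for=""\k<id>"">(?<text>[^<]*)</label>\s*",
        RegexOptions.IgnoreCase);

    // Rewrites every matching row in the string; the trailing <br /> is kept as-is.
    public static string Rewrite(string html)
    {
        return RadioRow.Replace(html,
            @"<button type=""button"" name=""${name}"" id=""${id}"" onclick=""nextStepID_load('${title}');"">${text}</button>");
    }
}
```

For the first sample row, RadioRowRewriter.Rewrite(input) yields `<button type="button" name="eq_9" id="eq_9_2" onclick="nextStepID_load('912');">Installer</button><br />`. Because the pattern is applied to the whole string, it handles one row or a thousand in the same call.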
  • This is a very interesting approach. +1. From other comments I've nearly decided this is best done by just putting my foot down and forcing a change to how we acquire the data. Or perhaps HTML parsing. But this code has some great education to it, so thanks again. – JClaspill Nov 24 '10 at 16:17