So this is a rather odd question, I know that. I use a tool called pdf2htmlEX, which converts a PDF to HTML. So far the results have been pretty damn impressive. I have yet to see a single error in all the PDFs I have converted to HTML.

With this HTML, I need to replace some strings dynamically with C#. However, I can't simply say line.Replace("#SOME_STRING", "Another string"), although I wrote #SOME_STRING in the document before exporting to PDF. Why not, you might ask? Because the output of pdf2htmlEX can look something like this:

<div class="t m0 x5 h5 ya ff4 fs3 fc0 sc0 ls0 ws0">#SOME_ST<span class="_ _5"></span>RING </div>

See that empty span-tag with a _ and _5 class? Yep, that prevents me from replacing my word. The _5 class simply has some width (like width: 0.9889px).

In this case, how would I replace #SOME_ST<span class="_ _5"></span>RING with something else?

Here are some cases:

(#SOME_STRING)          #SOME_ST<span class="_ _5"></span>RING
(#SOME_OTHER_STRING)    #SOME_<span class="_ _7"></span>OTHER_ST<span class="_ _5"></span>RING

I'm kind of lost here: I can't just remove all the _5 elements, because the class is randomized every time I change something in the document.

EDIT: So I basically need a way to filter out the HTML tags from my own Key-Value pair, so I can replace the words like #SOME_STRING -> SOMETHING_ELSE.

MortenMoulder

1 Answer

Try using regex to filter all empty spans:

using System.Text.RegularExpressions;

// Matches any empty <span ...></span>, with or without attributes.
var myRegex = new Regex(@"(?<emptyspan><span[^>]*></span>)", RegexOptions.None);
var strTargetString = @"<div class=""t m0 x5 h5 ya ff4 fs3 fc0 sc0 ls0 ws0"">#SOME_ST<span class=""_ _5""></span>RING </div> <span></span>";

foreach (Match myMatch in myRegex.Matches(strTargetString))
{
    var emptySpan = myMatch.Groups["emptyspan"].Value;
    // Replace or remove the matched empty span here.
}
richej
  • The problem is as I said: I simply cannot remove all empty span-tags. The ones I can "legally" remove are the ones between my keys (`#SOME_STRING` for example). The rest of the empty span-tags do have classes that are needed by the document. – MortenMoulder Apr 05 '18 at 11:08
  • Then what about removing all spans that are inside a word beginning with # ... that would be: `#\w*(?<emptyspan><span[^>]*></span>)\w` – richej Apr 05 '18 at 11:14
  • So simply something like `Regex.Replace(line, @"#\w*(?<emptyspan><span[^>]*></span>)\w", string.Empty, RegexOptions.Multiline)`, or? – MortenMoulder Apr 05 '18 at 11:20
  • Problem: What if the line has multiple `#KEY`s? This just gives me output I can't really explain. Half of each word is basically gone.
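
One possible way around the multiple-keys problem, offered as a sketch rather than a tested solution for every pdf2htmlEX output: instead of hunting for spans and deleting them, build one regex per key that tolerates empty spans between any two characters of that key, and replace the whole match with the new value. Empty spans elsewhere in the document are never touched. The names `PlaceholderReplacer`, `BuildKeyPattern`, and `ReplaceKeys` are hypothetical, and the sketch assumes the tool only ever inserts empty `<span ...></span>` elements inside a key (as in the examples in the question), not other markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PlaceholderReplacer
{
    // Zero or more empty spans such as <span class="_ _5"></span>.
    // Assumption: these are the only elements that split a key.
    private const string EmptySpans = @"(?:<span[^>]*></span>)*";

    // Builds a pattern that matches the key even when empty spans
    // sit between any two of its characters.
    public static Regex BuildKeyPattern(string key)
    {
        var escapedChars = key.Select(c => Regex.Escape(c.ToString()));
        return new Regex(string.Join(EmptySpans, escapedChars));
    }

    // Replaces every (possibly span-interrupted) key with its value.
    public static string ReplaceKeys(string html, IDictionary<string, string> replacements)
    {
        // Longer keys first, so a key that is a prefix of another
        // key cannot consume part of the longer one.
        foreach (var pair in replacements.OrderByDescending(p => p.Key.Length))
        {
            var value = pair.Value;
            // A MatchEvaluator keeps "$" in the value from being
            // interpreted as a regex substitution token.
            html = BuildKeyPattern(pair.Key).Replace(html, _ => value);
        }
        return html;
    }
}
```

Called with a dictionary mapping `#SOME_STRING` and `#SOME_OTHER_STRING` to their replacement texts, `ReplaceKeys` handles several keys on the same line, because each pattern only consumes the characters of its own key plus the spans inside it. If the replacement values can contain characters like `<` or `&`, they probably need HTML-encoding first (for example with `System.Net.WebUtility.HtmlEncode`).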