
IT is tight here, but I'm trying to figure out if there's a way to take a standard XML or HTML file and convert all of the characters to ASCII using just Notepad++. If I can do it without a plugin, that'd be nifty.

I see I can change the encoding to ANSI, but I don't see an option for ASCII, and I don't think they're exactly the same thing, are they? The XML/HTML has to go up on a server, and the ingestion stuff we use doesn't like special characters like apostrophes that don't seem to fit.

I'm guessing that's because HTTP servers like ASCII. Basically, a lot of time is wasted by techs right now manually poring over each and every file for these darn characters, which is causing a lot of eyebleed. I think the files are UTF-8 encoded by default when they're generated.

Peter Mortensen
Adam R. Turner
  • So you want the characters to be removed when you change encodings? Considering that that'd be destructive to the file (I know you want that, but other people might not) I don't think it's possible. Honestly, you could just write a quick Python script to remove/replace certain characters in the files. – mbomb007 Feb 09 '17 at 01:17
  • Also, an apostrophe is an ASCII character. Are you removing characters, or replacing them with an ASCII character? – mbomb007 Feb 09 '17 at 01:19
  • Surely, some basic PowerShell should be able to do what you need. But, please clarify the problem. HTTP servers treat content payload as bytes, so it's probably not the server per se that is having trouble. – Tom Blodget Feb 09 '17 at 17:46
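
The "quick Python script" suggested in the comments could look roughly like the sketch below. It assumes the troublesome characters are typographic punctuation such as curly apostrophes; the mapping and the function names are hypothetical and would need to be adjusted to whatever actually appears in the files.

```python
# Hypothetical mapping of typographic characters to ASCII stand-ins.
# Extend it with whatever characters actually show up in your files.
ASCII_REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u00a0": " ",   # no-break space
}

_TABLE = str.maketrans(ASCII_REPLACEMENTS)


def replace_typographic(text):
    """Swap the known typographic characters for their ASCII stand-ins."""
    return text.translate(_TABLE)


def leftover_non_ascii(text):
    """Return the distinct non-ASCII characters that are still present."""
    return sorted({ch for ch in text if ord(ch) > 127})


sample = "It\u2019s a \u201ctest\u201d"
print(replace_typographic(sample))
print(leftover_non_ascii(replace_typographic("caf\u00e9")))
```

Anything reported by `leftover_non_ascii` is a character the mapping does not cover, so it still needs a decision: replace it, drop it, or escape it.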

1 Answer


I'm guessing ASCII isn't listed because ASCII doesn't cover all byte values: it only defines bytes 0x00 - 0x7F. UTF-8 is a "super-set" of ASCII, in that the first 128 characters are encoded as the same single bytes, but it also uses bytes in the range 0x80 - 0xFF, in multi-byte sequences, for every other character.
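
A small Python sketch can make this concrete. It only uses the built-in `str.encode`; the helper name `utf8_bytes` is made up for illustration.

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of byte values."""
    return list(text.encode("utf-8"))


# A plain ASCII apostrophe (U+0027) is a single byte below 0x80.
print([hex(b) for b in utf8_bytes("'")])        # ['0x27']

# A curly apostrophe (U+2019) becomes three bytes, all of them >= 0x80.
print([hex(b) for b in utf8_bytes("\u2019")])   # ['0xe2', '0x80', '0x99']

# Pure ASCII text produces identical bytes under both encodings.
print("<p>ok</p>".encode("ascii") == "<p>ok</p>".encode("utf-8"))  # True
```

So a UTF-8 file that happens to contain only ASCII characters is already, byte for byte, an ASCII file.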

See UTF-8 codepage layout

Basically, if there is some reason that you can't rely on full UTF-8 (say the file is going to a program that only supports ASCII, so you don't want any bytes 0x80 - 0xFF), simply save it as UTF-8 without a BOM (the BOM itself is three non-ASCII bytes) and make sure those bytes are not contained in your file.
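
One way to check for those bytes without reading every file by eye is a short script. This is a minimal sketch, not a finished tool; `find_non_ascii` and `scan_file` are hypothetical names, and the path passed to `scan_file` is whatever file you want to check.

```python
def find_non_ascii(data):
    """Return (line_number, offset_in_line, byte_value) for every byte >= 0x80.

    Line numbers start at 1; offsets are 0-based byte offsets within the line.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for offset, byte in enumerate(line):
            if byte >= 0x80:
                hits.append((line_no, offset, byte))
    return hits


def scan_file(path):
    """Read a file as raw bytes and print each non-ASCII byte that is found."""
    with open(path, "rb") as handle:
        hits = find_non_ascii(handle.read())
    for line_no, offset, byte in hits:
        print(f"{path}: line {line_no}, byte offset {offset}: 0x{byte:02X}")
    return not hits  # True when the file is pure ASCII


# In-memory example: a curly apostrophe on the second line.
print(find_non_ascii(b"ok\nit\xe2\x80\x99s"))
```

Reading in binary mode is deliberate: the check is about bytes, so it works regardless of which encoding the file was saved in.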

Note that for the same reason as listed above, you can also use ANSI, which is also a super-set of ASCII. See this SO explanation.

Community
mbomb007
  • All characters in XML are Unicode. I'm pretty sure he's referring to the document encoding, which can be ASCII and still support the full Unicode character set due to XML's numeric character references. (For example, [🖖](http://www.fileformat.info/info/unicode/char/1f596/index.htm) can be written as &#x1F596;.) But, yes, not supporting UTF-8 means a system or process has some serious problems. – Tom Blodget Feb 09 '17 at 00:22
  • @TomBlodget He said he's using Notepad++, which is why I talked about the encodings. Honestly, the question should be tagged [notepad++]. I'll add the tag. – mbomb007 Feb 09 '17 at 01:14
  • Thanks for the explanation! – Adam R. Turner Jun 12 '17 at 12:16
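
The numeric character reference approach mentioned in the comments can be sketched with Python's built-in `xmlcharrefreplace` error handler. The function names and file paths here are hypothetical, and the sketch assumes the input really is UTF-8.

```python
def to_ascii_with_char_refs(text):
    """Encode text as ASCII, turning each non-ASCII character into &#NNNN;."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def convert_file(src_path, dst_path):
    """Read a UTF-8 file and write an ASCII-only copy using character references."""
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_ascii_with_char_refs(text))


print(to_ascii_with_char_refs("It\u2019s"))      # b'It&#8217;s'
print(to_ascii_with_char_refs("\U0001F596"))     # b'&#128406;'
```

This is a blunt, text-level conversion. Character references are only interpreted in element content and attribute values, so a non-ASCII character inside a comment, a CDATA section, or a tag name would not survive correctly, and an `encoding` attribute in the XML declaration is left as it was.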