
I need a regex that matches everything except for several words.

The input-string is something like:

This&nbsp;Is&nbsp;A&nbsp;&lt;Test&gt;

It should match

This Is A Test

So I want to have everything around &nbsp;, &lt; and &gt;.

I've tried something like [^&nbsp;] to ignore all appearances of &nbsp;, but this excludes every character.

mrzasa
Tomtom
  • What is the language? Do you mean you need to replace all `&nbsp;` with a space? Use a mere string replace method. – Wiktor Stribiżew Feb 22 '18 at 10:09
  • With PCRE, you may use [`&nbsp;(*SKIP)(*F)|(?:(?!&nbsp;).)+`](https://regex101.com/r/GZl31C/1) if you want to match any text but some multicharacter string and get separate matches, like `['This', 'Is', 'A', 'Test']`. – Wiktor Stribiżew Feb 22 '18 at 10:20
  • Possible duplicate of [Learning Regular Expressions](https://stackoverflow.com/questions/4736/learning-regular-expressions) – Biffen Feb 22 '18 at 10:53
  • Or even [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Wiktor Stribiżew Feb 22 '18 at 10:54
  • @WiktorStribiżew, They aren't trying to parse (X)HTML, they are trying to remove a set of known characters from a string. – KyleFairns Feb 22 '18 at 10:56
  • @KyleFairns `&nbsp;` is an [**HTML** character entity reference](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). – Wiktor Stribiżew Feb 22 '18 at 11:01
  • But the answer you are referring to is about the tags within HTML. This one is not – KyleFairns Feb 22 '18 at 11:10

1 Answer

/&[a-zA-Z]{2,8};/g

Breakdown:

  • & - match & literally
  • [a-zA-Z]{2,8} - match any letter in the ranges a-z and A-Z, 2 to 8 times
  • ; - match a semicolon literally

The longest entity name in the HTML 4 list is &thetasym; (ϑ), at 8 letters, so I've taken this into account in the regex. HTML5 defines longer names, so raise the upper bound if you need to cover those.
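To see what the generic pattern does and does not pick up, here is a minimal sketch. The `sample` string is a made-up input, not taken from the question. It assumes you only care about letter-only names: references whose names contain digits (such as `&frac12;`) and numeric references (such as `&#160;`) are not matched by this pattern.

```javascript
// Generic pattern from the answer: "&", 2 to 8 ASCII letters, then ";".
const namedEntity = /&[a-zA-Z]{2,8};/g;

// Hypothetical sample input mixing several kinds of references.
const sample = "a&nbsp;b&thetasym;c&frac12;d&#160;e";

// Only the letter-only names are matched.
const found = sample.match(namedEntity);
console.log(found); // [ '&nbsp;', '&thetasym;' ]
```

If digits matter for your input, the character class would need widening, for example to `[a-zA-Z0-9]`.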

The snippet below replaces each matched entity with a space, then collapses runs of multiple spaces into a single space:

let regex = /&[a-zA-Z]{2,8};/g,
    string = "This&nbsp;Is&nbsp;A&nbsp;&lt;Test&gt;",
    // Replace each entity with a space, then collapse runs of spaces.
    properlyFormatted = string.replace(regex, " ").replace(/ +/g, " ");

console.log(properlyFormatted);
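When the input ends with an entity, replacing it with a space leaves a trailing space in the result. A possible way to get exactly `This Is A Test`, and the separate words if you need them, is sketched below. `stripEntities` is a hypothetical helper name, and the `trim()` and `split()` steps are additions that are not part of the snippet above.

```javascript
// Hypothetical helper: replace named entities with spaces, then tidy whitespace.
function stripEntities(input) {
  return input
    .replace(/&[a-zA-Z]{2,8};/g, " ") // each entity becomes one space
    .replace(/ +/g, " ")              // collapse runs of spaces
    .trim();                          // drop the space left by a leading or trailing entity
}

const cleaned = stripEntities("This&nbsp;Is&nbsp;A&nbsp;&lt;Test&gt;");
console.log(cleaned);            // This Is A Test
console.log(cleaned.split(" ")); // [ 'This', 'Is', 'A', 'Test' ]
```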

The alternative:

/&(?:lt|gt|nbsp);/g

Breakdown:

  • & - match & literally
  • (?:lt|gt|nbsp) - match one of lt, gt or nbsp (non-capturing group)
  • ; - directly followed by a semicolon

This regex only takes into account the specific entities you described.
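If the list of entities is likely to grow, the alternation could be built from an array instead of being typed by hand. This is only a sketch: `buildEntityRegex` is a hypothetical helper, and it assumes the names contain only letters and digits, so no regex escaping is applied.

```javascript
// Hypothetical helper: build /&(?:name1|name2|...);/g from a list of entity names.
// Assumes the names contain only letters and digits, so no regex escaping is needed.
function buildEntityRegex(names) {
  return new RegExp("&(?:" + names.join("|") + ");", "g");
}

const entityRegex = buildEntityRegex(["lt", "gt", "nbsp"]);
console.log(entityRegex.source); // &(?:lt|gt|nbsp);

// Entities outside the list, such as &amp;, are left untouched.
console.log("a&nbsp;&amp;&nbsp;b".replace(entityRegex, " ")); // a &amp; b
```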

let regex = /&(?:lt|gt|nbsp);/g,
    string = "This&nbsp;Is&nbsp;A&nbsp;&lt;Test&gt;",
    // Replace each listed entity with a space, then collapse runs of spaces.
    properlyFormatted = string.replace(regex, " ").replace(/ +/g, " ");

console.log(properlyFormatted);
KyleFairns
  • This is really not generic and a bad way of doing it. You had better use an HTML decoder library (one that follows the RFCs) instead of writing your own quick-and-dirty decoder. – Arount Feb 22 '18 at 10:41
  • Added a specific example on how to get rid of the 3 special chars that need to be gotten rid of. The generic answer is still valid if you want to get rid of all of the special characters within a string – KyleFairns Feb 22 '18 at 10:53