Get everything except special words

Question

I need a regex which matches everything except for several words.

The input-string is something like:

This&nbsp;Is&nbsp;A&nbsp;&lt;Test&gt;

It should match

This Is A Test

So I want to have everything around &nbsp;, &lt; and &gt;

I've tried something like [^&nbsp;] to ignore all appearances of &nbsp; but this is a character class, so it excludes each of the characters &, n, b, s, p and ; individually.

What is the language? Do you mean you need to replace all `&nbsp;` with a space? Use a mere string replace method. — Wiktor Stribiżew, Feb 22 '18 at 10:09
With PCRE, you may use [`&nbsp;(*SKIP)(*F)|(?:(?!&nbsp;).)+`](https://regex101.com/r/GZl31C/1) if you want to match any text but some multicharacter string and get separate matches, like `['This', 'Is', 'A', 'Test']`. — Wiktor Stribiżew, Feb 22 '18 at 10:20
Possible duplicate of [Learning Regular Expressions](https://stackoverflow.com/questions/4736/learning-regular-expressions) — Biffen, Feb 22 '18 at 10:53
Or even [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — Wiktor Stribiżew, Feb 22 '18 at 10:54
@WiktorStribiżew, They aren't trying to parse (X)HTML, they are trying to remove a set of known characters from a string. — KyleFairns, Feb 22 '18 at 10:56
@KyleFairns `&nbsp;` is an [**HTML** character entity reference](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). — Wiktor Stribiżew, Feb 22 '18 at 11:01
But the answer you are referring to is about the tags within HTML. This one is not — KyleFairns, Feb 22 '18 at 11:10

KyleFairns · Answer 1 · 2018-02-22T10:51:00.193

/&[a-zA-Z]{2,8};/g

Breakdown:

& - match & literally
[a-zA-Z]{2,8} - match between 2 and 8 characters in the ranges a-z and A-Z
; - followed by a literal semicolon

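The longest named entity in the HTML 4 set is &thetasym; (ϑ), whose name is 8 letters long, so I've used 8 as the upper bound in the regex. HTML5 defines longer names, and some names contain digits (for example &frac12;), which this pattern does not match.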

The snippet below replaces each entity with a space, collapses runs of spaces into a single space, and trims the space left over at the end:

// Replace each entity with a space, collapse repeated spaces, then trim the ends.
let regex = /&[a-zA-Z]{2,8};/g,
string = "This&nbsp;Is&nbsp;A&nbsp;&lt;Test&gt;",
properlyFormatted = string.replace(regex, " ").replace(/ +/g, " ").trim();

console.log(properlyFormatted); // This Is A Test
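As a rough sketch of where the letters-only pattern stops working: it skips entity names containing digits (such as `&frac12;`) and numeric references (such as `&#160;`). The broader pattern below is an assumption of mine, not part of the original answer, and the names `anyEntity`, `stripEntities` and `sample` are purely illustrative.

```javascript
// Illustrative only: a broader pattern than the one in the answer.
// It accepts decimal references (&#160;), hex references (&#xA0;)
// and names that start with a letter and may contain digits (&frac12;).
const anyEntity = /&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);/g;

// Replace every entity with a space, collapse repeated spaces, trim the ends.
function stripEntities(text) {
  return text.replace(anyEntity, " ").replace(/ +/g, " ").trim();
}

const sample = "1&frac12;&#160;cups&nbsp;&lt;Test&gt;";

// The letters-only pattern leaves &frac12; and &#160; untouched.
const narrow = sample.replace(/&[a-zA-Z]{2,8};/g, " ").replace(/ +/g, " ").trim();
const broad = stripEntities(sample);

console.log(narrow); // 1&frac12;&#160;cups Test
console.log(broad);  // 1 cups Test
```

Neither pattern checks that a name is a real entity, so any `&word;` sequence in the text is removed as well.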

The alternative:

/&(?:lt|gt|nbsp);/g

Breakdown:

& - match & literally
(?:lt|gt|nbsp) - match exactly one of the alternatives lt, gt or nbsp
; - directly followed by a semicolon

This regex only takes into account the specific entities you described.

// Same approach, but only &lt;, &gt; and &nbsp; are replaced.
let regex = /&(?:lt|gt|nbsp);/g,
string = "This&nbsp;Is&nbsp;A&nbsp;&lt;Test&gt;",
properlyFormatted = string.replace(regex, " ").replace(/ +/g, " ").trim();

console.log(properlyFormatted); // This Is A Test
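If separate pieces are wanted rather than one cleaned-up string (the `['This', 'Is', 'A', 'Test']` shape mentioned in the comments on the question), one possible approach in JavaScript is to split on the entities instead of replacing them. This is a sketch under that assumption; `separators` and `wordsBetweenEntities` are hypothetical names.

```javascript
// Illustrative only: treat one or more consecutive entities as a separator.
// Non-capturing groups keep the separators out of the split() result.
const separators = /(?:&(?:lt|gt|nbsp);)+/;

// Split on the entities and drop the empty strings that appear when
// the input starts or ends with an entity.
function wordsBetweenEntities(text) {
  return text.split(separators).filter((piece) => piece.length > 0);
}

const words = wordsBetweenEntities("This&nbsp;Is&nbsp;A&nbsp;&lt;Test&gt;");

console.log(words);           // [ 'This', 'Is', 'A', 'Test' ]
console.log(words.join(" ")); // This Is A Test
```

Joining the pieces with a space gives the same result as the replace-based version, without the extra step of collapsing spaces.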

This is really not generic and a bad way of doing it. You'd better use an HTML decoder library (one that follows the RFCs) instead of writing your own quick-and-dirty decoder. — Arount, Feb 22 '18 at 10:41
Added a specific example on how to get rid of the 3 special chars that need to be gotten rid of. The generic answer is still valid if you want to get rid of all of the special characters within a string — KyleFairns, Feb 22 '18 at 10:53
