I have a text in which I would like to find different words. The following text is in Brazilian Portuguese and serves only as a test case:

Um dia eu conheci Pedro Álvares Cabral, e descobri muitas informações interessantes.

To find any single word in the text, I am using regular expressions like the following:

/\b(Cabral)\b/i // Finds Cabral
/\b(dia)\b/i    // Finds dia
/\b(Pedro)\b/i  // Finds Pedro
Etc...

If I need to find more than one word, I do as follows:

/\b(informações|muitas)\b/ig

I am testing the expression both in JavaScript and with this online utility. JavaScript code example:

var input = "Um dia eu conheci Pedro Álvares Cabral, e descobri muitas informações interessantes.";
var matchRegExp = new RegExp("\\b(coNHECi)\\b", "i");

// exec() returns an array describing the match, or null when nothing is found
var regs = matchRegExp.exec(input);
if (regs) {
  console.log('OK');
} else {
  console.log('NOPE');
}

THE PROBLEM

All the words I put into the expression are found, except Álvares. For example, I cannot find the word with the following expression:

/\b(Álvares)\b/i

If I remove the Á character, lvares is found. I would like to know:

  1. Why I can't find Álvares.
  2. How I can find any word in a text that contains the following characters: áàâãÁÀÂÃéèêÉÈÊíìîÍÌÎóòôõÓÒÔÕúùûÚÙÛñÑçÇ regardless of whether they are the first, last, or any other letter of the word.
Loa
  • I think it's the `\b` that's causing the problem. I suspect that it does not properly treat the accented "A" as being a "word" character. – Pointy Nov 28 '19 at 15:08
  • JavaScript regex support for Unicode is pretty sad. – Pointy Nov 28 '19 at 15:08
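
The suspicion about `\b` in the comments can be checked directly. The sketch below is only an illustration, not an accepted answer: `findWords` is a hypothetical helper name, and it assumes a JavaScript engine that supports Unicode property escapes (`\p{L}`) and lookbehind, which older browsers lack. Instead of `\b`, it requires that the word is neither preceded nor followed by a letter, combining mark, digit, or underscore.

```javascript
const input = "Um dia eu conheci Pedro Álvares Cabral, e descobri muitas informações interessantes.";

// \b is defined in terms of \w, and \w does not include accented letters,
// so there is no word boundary between the space and "Á".
console.log(/\w/.test("Á"));               // false
console.log(/\b(Álvares)\b/i.test(input)); // false
console.log(/\b(lvares)\b/i.test(input));  // true: "Á" counts as a non-word character

// Hypothetical helper: finds whole words using Unicode-aware boundaries.
function findWords(text, words) {
  // Escape regex metacharacters so each word is matched literally.
  const escaped = words.map(w =>
    w.normalize("NFC").replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
  );
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  const pattern = new RegExp(
    "(?<!" + wordChar + ")(" + escaped.join("|") + ")(?!" + wordChar + ")",
    "giu"
  );
  // Normalize so precomposed and decomposed accents compare equal.
  return Array.from(text.normalize("NFC").matchAll(pattern), m => m[1]);
}

console.log(findWords(input, ["Álvares"]));
console.log(findWords(input, ["informações", "muitas"]));
console.log(findWords(input, ["lvares"])); // no match: "lvares" is not a whole word here
```

Because the boundary is expressed with `\p{L}` rather than a fixed list of accented characters, the same pattern should cover all of the characters listed in the question without enumerating them.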