The file I'm reading contains one word per line. Some of these words cause problems because they seem to contain unusual characters. See the following example with the first word of my list:

stopwords <- read.csv("stopwords_fr.txt",stringsAsFactors = FALSE,header=FALSE,encoding="UTF-8")$V1
stopwords[1] # "a" , if you copy paste into R studio this character with the quotes around it, you'll see a little red dot preceding the a.
stopwords[1] == "a" # FALSE

How did this happen? How can I avoid it? And if I haven't avoided it, how do I convert this dotted "a" into a regular "a"?

EDIT:

You can reproduce the issue by copy-pasting this into RStudio:

"a" == "a" # FALSE
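
One way to see what separates the two strings is to look at their code points and bytes. The sketch below assumes the red dot is a byte order mark (U+FEFF), which the UTF-8-BOM encoding reported by Notepad++ would suggest; `dotted` is a hypothetical reconstruction of the problem value, not the actual value read from the file.

```r
# Hypothetical reconstruction of the problem string: U+FEFF followed by "a"
dotted <- "\ufeffa"

dotted == "a"       # FALSE, the same symptom as in the question
nchar(dotted)       # 2: one invisible character plus "a"
utf8ToInt(dotted)   # 65279 97, i.e. U+FEFF and then "a"
charToRaw(dotted)   # ef bb bf 61: the UTF-8 BOM bytes, then "a"
```

Running `utf8ToInt(stopwords[1])` on the real data would confirm or rule out this assumption.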

Here's where I got the file from: https://sites.google.com/site/kevinbouge/stopwords-lists/stopwords_fr.txt?attredirects=0&d=1

The encoding of the file, according to Notepad++, is UTF-8-BOM. But using "UTF-8-BOM" as the encoding doesn't help, though it seemed to work in this answer: Read a UTF-8 text file with BOM

stopwords <- read.csv("stopwords_fr.txt",stringsAsFactors = FALSE,header=FALSE,encoding="UTF-8-BOM")$V1
stopwords[1] # "a"
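
For comparison, here is a minimal sketch of the other argument, `fileEncoding`. As far as the `read.table` documentation goes, `encoding` only marks the strings that were read as Latin-1 or UTF-8 and does not re-encode the input, whereas `fileEncoding` is applied when the file is opened, and `"UTF-8-BOM"` is accepted there for reading. The temporary file and its ASCII-only words are hypothetical stand-ins for `stopwords_fr.txt`, and this has not been checked on R 3.0.2.

```r
# Hypothetical stand-in for stopwords_fr.txt: BOM bytes, then one word per line
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("a\nai\naie\n")), path)

# fileEncoding (not encoding) is handed to the file connection,
# which drops a leading BOM when asked for "UTF-8-BOM"
dropped <- read.csv(path, header = FALSE, stringsAsFactors = FALSE,
                    fileEncoding = "UTF-8-BOM")$V1

dropped[1] == "a"   # TRUE
unlink(path)        # remove the temporary file
```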

I have R version 3.0.2

Moody_Mudskipper
  • Can you link to your data file please? Or something similar. You also need to define what makes a word "unusual", since it might not be unusual in whatever language the word is written in. As a native Englishman, most words with accents of any kind are "unusual" :) – Spacedman Apr 05 '17 at 07:47
  • If it is the first character, it could be a Byte Order Mark - see this previous question... http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom The easiest way round it is probably just to set your first value to `"a"` (typed from the keyboard!) – Andrew Gustar Apr 05 '17 at 07:55
  • @Spacedman: I edited the question with the source and a reproducible example. I know "unusual" isn't a great term, but I haven't found a better one... – Moody_Mudskipper Apr 05 '17 at 07:59
  • @AndrewGustar I think you're right, opening the file in notepad++, I see that the encoding of the file is UTF-8-BOM, however this encoding doesn't seem to be available in R. I've read your link but I don't know how to solve my issue in R. – Moody_Mudskipper Apr 05 '17 at 08:02
  • There is a possible answer here... http://stackoverflow.com/questions/21624796/read-a-utf-8-text-file-with-bom - `fileEncoding` rather than `encoding` – Andrew Gustar Apr 05 '17 at 08:06
  • Yes I saw it, but in my case it doesn't help... – Moody_Mudskipper Apr 05 '17 at 08:09
  • Or you could perhaps just use `sub` to replace all occurrences of `""` with `""` – Andrew Gustar Apr 05 '17 at 08:16
  • Impossible, unfortunately: reading with this encoding messes up many characters (it's a French file :)). I can change the encoding in Notepad++ directly, but I was hoping for a more robust solution... – Moody_Mudskipper Apr 05 '17 at 08:24
  • I don't get this problem when I load the file from the link above using `read_csv` from the `readr` package `sw – Andrew Gustar Apr 05 '17 at 08:53
  • "I have R version 3.0.2" might be your problem. Try upgrading... – Spacedman Apr 05 '17 at 11:42
  • Have you seen [this note in the docs](https://cran.r-project.org/doc/manuals/r-release/R-data.html#Variations-on-read_002etable) (scroll down a bit to "12. Encoding")? It seems like your last approach is really the way to go, but – as @Spacedman pointed out – you might have to upgrade your R version. – lenz Apr 05 '17 at 20:39
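
If the mark has already ended up in the data, the `sub` idea from the comments can be sketched as below. This assumes the stray character is U+FEFF at the very start of the first element; the three-element vector is a hypothetical stand-in for the real list, and its name simply mirrors the question.

```r
# Hypothetical vector as it might come out of the read: BOM stuck to word one
stopwords <- c("\ufeffa", "ai", "\u00e0")

# Remove a leading U+FEFF where present; other elements are left untouched
stopwords <- sub("^\ufeff", "", stopwords)

stopwords[1] == "a"   # TRUE
```

Anchoring the pattern with `^` keeps the replacement from touching anything other than a mark at the start of a string.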

0 Answers