unusual characters when reading with read.csv

Question

The file I'm reading contains one word per line. Some of these words cause problems because they seem to contain unusual characters. See the following example with the first word of my list:

stopwords <- read.csv("stopwords_fr.txt",stringsAsFactors = FALSE,header=FALSE,encoding="UTF-8")$V1
stopwords[1] # "a", but if you copy-paste this character (quotes included) into RStudio, a little red dot appears before the a
stopwords[1] == "a" # FALSE

How did this happen? How can I avoid it? And if I haven't avoided it, how do I convert this dotted "a" into a regular "a"?

EDIT:

You can reproduce the issue by copy-pasting this into RStudio:

"a" == "a" # FALSE
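
One way to find out what the hidden character is would be to look at the code points and bytes of the string. The sketch below assumes the hidden character is a byte order mark (U+FEFF), as the comments suggest; it is written as the explicit escape `\ufeff` instead of being pasted, and the variable names are only illustrative.

```r
# Assumption: the hidden character is U+FEFF (byte order mark), written here
# as an explicit escape so that it is visible in the source.
suspect <- "\ufeffa"

suspect == "a"                      # FALSE: one hidden character precedes the a

code_points <- utf8ToInt(suspect)   # Unicode code point of each character
code_points                         # 65279 (0xFEFF) followed by 97 ("a")

charToRaw(suspect)                  # UTF-8 bytes: ef bb bf 61
```

Running `utf8ToInt(stopwords[1])` on the real data would show whether the same code point is present there.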

Here's where I got the file from: https://sites.google.com/site/kevinbouge/stopwords-lists/stopwords_fr.txt?attredirects=0&d=1

According to Notepad++, the encoding of the file is UTF-8-BOM. But using "UTF-8-BOM" as the encoding doesn't help, though it seemed to work in this answer: Read a UTF-8 text file with BOM

stopwords <- read.csv("stopwords_fr.txt",stringsAsFactors = FALSE,header=FALSE,encoding="UTF-8-BOM")$V1
stopwords[1] # "ï»¿a"

I have R version 3.0.2
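
For comparison, here is a minimal sketch of the `fileEncoding` approach from the linked answer. It assumes a recent R version in which connections accept the "UTF-8-BOM" encoding, and it uses a small temporary file with ASCII-only words in place of the real stop word list, so it says nothing about how accented characters behave on a given platform.

```r
# Build a small stand-in file: a UTF-8 byte order mark (EF BB BF) followed by
# one word per line. The words are placeholders, not the real list.
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("a\nai\nau\n"), con)
close(con)

# fileEncoding (unlike encoding) controls how the file is decoded while it is
# read; "UTF-8-BOM" reads it as UTF-8 and drops a leading byte order mark.
words <- read.csv(path, header = FALSE, stringsAsFactors = FALSE,
                  fileEncoding = "UTF-8-BOM")$V1

words[1] == "a"   # TRUE when the byte order mark has been dropped
unlink(path)
```

The `encoding` argument used in the question only marks the strings as UTF-8 after they have been read; it does not re-decode the file, which may be why it made no difference.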

Can you link to your data file please? Or something similar. You also need to define what makes a word "unusual", since it might not be unusual in whatever language the word is written in. As a native Englishman, most words with accents of any kind are "unusual" :) — Spacedman, Apr 05 '17 at 07:47
If it is the first character, it could be a Byte Order Mark - see this previous question... http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom The easiest way round it is probably just to set your first value to `"a"` (typed from the keyboard!) — Andrew Gustar, Apr 05 '17 at 07:55
@Spacedman: I edited the question with the source and a reproducible example. I know "unusual" isn't a great term, but I haven't found a better one... — Moody_Mudskipper, Apr 05 '17 at 07:59
@AndrewGustar I think you're right. Opening the file in Notepad++, I see that its encoding is UTF-8-BOM; however, this encoding doesn't seem to be available in R. I've read your link but I don't know how to solve my issue in R. — Moody_Mudskipper, Apr 05 '17 at 08:02
There is a possible answer here... http://stackoverflow.com/questions/21624796/read-a-utf-8-text-file-with-bom - `fileEncoding` rather than `encoding` — Andrew Gustar, Apr 05 '17 at 08:06
Or you could perhaps just use `sub` to replace all occurrences of `"ï»¿"` with `""` — Andrew Gustar, Apr 05 '17 at 08:16
Impossible, unfortunately: reading with this encoding messes up many characters (it's a French file :)). I can change the encoding in Notepad++ directly, but I was hoping for a more robust solution... — Moody_Mudskipper, Apr 05 '17 at 08:24
I don't get this problem when I load the file from the link above using `read_csv` from the `readr` package `sw — Andrew Gustar, Apr 05 '17 at 08:53
"I have R version 3.0.2" might be your problem. Try upgrading... — Spacedman, Apr 05 '17 at 11:42
Have you seen [this note in the docs](https://cran.r-project.org/doc/manuals/r-release/R-data.html#Variations-on-read_002etable) (scroll down a bit to "12. Encoding")? It seems like your last approach is really the way to go, but – as @Spacedman pointed out – you might have to upgrade your R version. — lenz, Apr 05 '17 at 20:39
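
If the file has already been read as UTF-8 and the byte order mark survived as a leading U+FEFF, the `sub` idea from the comments could be sketched as below. `strip_bom` is a hypothetical helper name, and the input vector is made up for illustration; the real data would be the `stopwords` vector.

```r
# Hypothetical helper: remove a leading U+FEFF (byte order mark) from each
# element of a character vector. Elements without one are left unchanged.
strip_bom <- function(x) {
  sub("^\ufeff", "", x)
}

# Made-up input: the first element carries the hidden byte order mark.
words <- c("\ufeffa", "ai", "au")
cleaned <- strip_bom(words)

cleaned[1] == "a"   # TRUE
```

This only targets the U+FEFF form. If the file was decoded with a single-byte encoding and the mark shows up as "ï»¿", the other non-ASCII characters are likely damaged too, so re-reading the file with the right encoding is probably the safer route.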
