Dataprep - accents and special characters

Question

How do I fix accents and special characters that are not displayed correctly in Dataprep? I need these characters to appear.

Thank you very much for your attention.

Alexandre Moraes · Accepted Answer · 2020-08-17T13:28:29.383


DataPrep has built-in recipe steps that let you remove or change special characters. For example, you can change accented letters to unaccented ones with Remove accents in text, or you can replace unrecognised characters with another character using Replace text or patterns.

Below are the steps to change a special character or accented letter.

Create your flow.
Add or import your data.
Click Add a recipe, as described in the documentation. In your case you can do one or both of the following:

First, if the column contains accented words, go to Search Transformations and select Remove accents in text. Then select the column that contains the accented words. The transformation replaces accented letters with their unaccented equivalents. A preview of your data is shown so you can check the result.

Second, if you have an unrecognised character, go to Search Transformations > Replace text or patterns and select the column you want to transform. In Find, write the letter or symbol between single quotes; in Replace with, write the character that should take its place. Finally, preview your data to see the transformation.
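To make the two transformations concrete, here is a minimal Python sketch of what they do conceptually. This is not DataPrep's implementation; the function names `remove_accents` and `replace_text` are hypothetical, and the approach (Unicode decomposition plus dropping combining marks) is just one common way to strip accents.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Strip accent marks from letters, e.g. 'não' becomes 'nao'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep the rest.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def replace_text(text: str, find: str, replace_with: str) -> str:
    """Replace every literal occurrence of `find` with `replace_with`."""
    return text.replace(find, replace_with)


rows = ["Non rec. char É", "Non rec. char ç", "Accented word não"]
for row in rows:
    print(remove_accents(row))

print(replace_text("Non rec. char ç", "ç", "c"))
```

Note that neither approach can recover a character that has already been turned into � during import, because at that point the original letter is no longer in the data.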

UPDATE: I was able to load a .csv file with the mentioned characters to DataPrep. Below are my steps and sample data:

The .csv file I used had the following content:

Test
Non rec. char É
Non rec. char ç
Accented word não

On the DataPrep home page, click Import Data (top right corner) and then Google Cloud Storage (left side of the screen). Then find and select your file (try importing a single file instead of using a parameterized dataset) and click the add (+) symbol. At this step you can already preview the characters; in my case they displayed normally. Finally, click Import & Wrangle and view your data. Using the data above, I was able to see the characters properly without any issues.
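If the characters show up as � after import, it may be worth checking outside DataPrep whether the file's bytes are really valid UTF-8. The symbol � (U+FFFD) is the Unicode replacement character, which typically appears when bytes cannot be decoded in the expected encoding. The sketch below is only an illustration under assumptions: the helper names are hypothetical, and the `latin-1` fallback is a guess at a likely legacy encoding for "É" and "ç", not something confirmed for any particular file.

```python
def decode_csv_bytes(raw: bytes, fallback: str = "latin-1") -> tuple[str, str]:
    """Return (text, encoding_used), trying UTF-8 first and then the fallback."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: the file was saved in a legacy single-byte encoding.
        return raw.decode(fallback), fallback


def to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Re-encode the file content as UTF-8 so it can be uploaded again."""
    text, _ = decode_csv_bytes(raw, fallback)
    return text.encode("utf-8")


# Simulate a CSV that was saved as Latin-1 instead of UTF-8.
sample = "Test\nNon rec. char É\nNon rec. char ç\n".encode("latin-1")

# Decoding those bytes as UTF-8 with replacement produces the � symbol.
print(sample.decode("utf-8", errors="replace"))

text, used = decode_csv_bytes(sample)
print(used)
print(to_utf8(sample).decode("utf-8"))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`; if the check falls back to the second encoding, re-saving the file as UTF-8 before uploading it to Cloud Storage is one thing to try.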

edited Aug 17 '20 at 13:28

answered Aug 17 '20 at 10:30

Alexandre Moraes



I tried to use a function that adjusts accents, but it didn't work. I think it's because the accented characters have already turned into a symbol. Do you have any other ideas? – Theorp Aug 17 '20 at 12:23

@Theorp, can you try using **Replace text or patterns** instead? Tell me if it works for you. Also, would you mind sharing the symbol that is not being recognised by DataPrep, so I can investigate it specifically? – Alexandre Moraes Aug 17 '20 at 12:28

I also tested it, but nothing changes. The symbol is �; in the CSV file it is "É" or "ç". – Theorp Aug 17 '20 at 12:50

@Theorp, I uploaded a *.txt* file with the characters you mentioned and they were recognised by DataPrep. I was also able to replace them. Can you tell me where your data is and how you are importing it? If it comes from a file, which format is it, and can you see the characters correctly within the file? – Alexandre Moraes Aug 17 '20 at 12:56
I created a dataset with parameters to import 3 CSV files from storage. The encoding is UTF-8. – Theorp Aug 17 '20 at 13:01

OK, I see. Now, if you open the .csv files, can you properly see the characters "É" and "ç"? – Alexandre Moraes Aug 17 '20 at 13:10
Yes, the CSV is correct. When I create the dataset, the characters turn into the symbol. – Theorp Aug 17 '20 at 13:13
I can extract only the CSV column with the problem and send you the file, in case you want to test it. – Theorp Aug 17 '20 at 13:22
@Theorp, I have updated my answer with the test I performed, and I did not find any issues. However, since you can share this column of data with me, can you upload just this column in your question, so I can test with it? – Alexandre Moraes Aug 17 '20 at 13:29
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219970/discussion-between-alexandre-moraes-and-theorp). – Alexandre Moraes Aug 17 '20 at 13:30
