
I have a dataframe with a column containing text. The data is read from and written back to a CSV file, and contains strings such as:

 Supporter🇨🇮
 🇮🇪🇪🇺
 📞061 300149 💻sdim.csdg@dsga.com

Is it possible to remove these strings from the textual data? If so what is the best way to do this?

I have tried:

 df['text'] = df['text'].replace(r'(?<![@\w])(^\W+)', '', regex=True)

But unfortunately it doesn't remove the strings.

Thanks!

  • Please check https://stackoverflow.com/questions/1276764 – Winner Mar 31 '19 at 18:27
  • Without an indication of the encoding of this file, it's hard to give good answers. This looks vaguely like you were incorrectly reading some legacy encoding. Can you [edit] the question to provide a hex dump of a few bytes around the apparent garbage, and/or an educated guess about the encoding? See also the [Stack Overflow `character-encoding` tag info page](http://stackoverflow.com/tags/character-encoding/info) which has troubleshooting tips and help for asking a well-defined encoding question. – tripleee Mar 31 '19 at 18:39
  • You can't read any kind of text file without knowing the character encoding. If you don't have it, that's data loss. (This is just one of the weaknesses of CSV as a data transfer format.) – Tom Blodget Apr 01 '19 at 18:19

2 Answers


For example for the following DataFrame

                Supporter
0                🇨🇮
1                     foo
2        🇮🇪🇪🇺
3          📞061 300149
4                     bar
5  💻sdim.csdg@dsga.com

we can use str.match (which anchors at the start of the string) to drop any row that begins with a non-ASCII character:

df.loc[~df['Supporter'].str.match(r'[\u0080-\U0010FFFF]')]

Output:

  Supporter
1       foo
4       bar

Also, if you want to just remove the special characters while keeping the rows themselves:

df['Supporter'] = df['Supporter'].str.replace(r'[\u0080-\U0010FFFF]', '', regex=True)

print(df)

Output:

    Supporter
0            
1         foo
2            
3  061 300149
4         bar

Note: If there are any NA values in the dataset, drop them first with:

df = df.dropna()
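If you would rather keep the NA rows instead of dropping them, a minimal sketch (using hypothetical sample data modeled on the frame above) is to `fillna('')` first and then strip every non-ASCII character with a negated character class:

```python
import pandas as pd

# Hypothetical sample data mirroring the DataFrame above
df = pd.DataFrame({'Supporter': ['🇨🇮', 'foo', None, '📞061 300149', 'bar']})

# Fill NA so the .str accessor never sees floats, then remove
# every character outside the ASCII range and trim leftover whitespace
df['Supporter'] = (
    df['Supporter']
    .fillna('')
    .str.replace(r'[^\x00-\x7F]', '', regex=True)
    .str.strip()
)
print(df['Supporter'].tolist())  # ['', 'foo', '', '061 300149', 'bar']
```

This avoids the `TypeError: bad operand type for unary ~: 'float'` from the comments, which occurs because NA cells are floats and break the string methods.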
perl
  • I'm getting `TypeError: bad operand type for unary ~: 'float'` – jackiegirl89 Mar 31 '19 at 18:30
  • Please try with `df[~df['Supporter'].fillna('').str.match('')]` (this adds `fillna` to replace `NA` values that you may have there) – perl Mar 31 '19 at 18:32
  • Doesn't remove them – jackiegirl89 Mar 31 '19 at 18:34
  • Here's another option to try, which removes every line containing at least one non-alphanumeric character (not just the Apple logo): `df[~df['Supporter'].dropna().str.match('[\u0080-\uFFFF]')]` – perl Mar 31 '19 at 18:50
  • I've added a note to my answer also, if there are any `NA` values in the dataset, we can just drop them with `df = df.dropna()` and then run everything without `fillna`s – perl Mar 31 '19 at 18:53

You can try the methods described here: Replace non-ASCII characters with a single space

Instead of replacing with a space, pass the empty string '' to get rid of the characters.
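One of the approaches from that question, sketched here with hypothetical sample strings, is to round-trip the column through ASCII bytes with `errors='ignore'`, which silently drops every non-ASCII character:

```python
import pandas as pd

# Hypothetical sample strings similar to the ones in the question
df = pd.DataFrame({'text': ['Supporter🇨🇮', '📞061 300149 💻x@y.com']})

# Encode to ASCII bytes, ignoring characters that cannot be represented,
# then decode back to strings
df['text'] = df['text'].str.encode('ascii', errors='ignore').str.decode('ascii')
print(df['text'].tolist())  # ['Supporter', '061 300149 x@y.com']
```

This is equivalent to replacing non-ASCII characters with the empty string, but needs no regex.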

rdas