4

I have a file, some lines in a .csv file that are jamming up a database import because of funky characters in some field in the line.

I have searched, found articles on how to replace non-ascii characters in Python 3, but nothing works.

When I open the file in vi and do :set list, there is a $ at the end of a line where there should not be, and ^I^I at the beginning of the next line. The two lines should be one joined line and no ^I there. I know that $ is end of line '\n' and have tried to replace those, but nothing works.

I don't know what the ^I represents, possibly a tab.


I have tried this function to no avail:

def remove_non_ascii(text):
    new_text = re.sub(r"[\n\t\r]", "", text)
    new_text = ''.join(new_text.split("\n"))
    new_text = ''.join([i if ord(i) < 128 else ' ' for i in new_text])
    new_text = "".join([x for x in new_text if ord(x) < 128])
    new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text)
    new_text = new_text.rstrip('\r\n')
    new_text = new_text.strip('\n')
    new_text = new_text.strip('\r')
    new_text = new_text.strip('\t')
    new_text = new_text.replace('\n', '')
    new_text = new_text.replace('\r', '')
    new_text = new_text.replace('\t', '')
    new_text = filter(lambda x: x in string.printable, new_text)
    new_text = "".join(list(new_text))

    return new_text

Is there some tool that will show me exactly what this offending character is, and a then find a method to replace it?

I am opening the file like so (the .csv was saved as UTF-8)

f_csv_in = open(csv_in, "r", encoding="utf-8")

Below are two lines that should be one with the problem non-ascii characters visible.

These two lines should be one line. Notice the $ at the end of line 37, and line 38 begins with ^I^I.

Part of the problem, that vi is showing, is that there is a new line $ on line 37 where I don't want it to be. This should be one line.

37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 @ 11:45am$
38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$
snakecharmerb
  • 28,223
  • 10
  • 51
  • 86
user10664542
  • 642
  • 6
  • 24
  • If you're using Windows, `^` is used to [escape characters](https://en.wikipedia.org/wiki/Caret#Escape_character)., while `$` usually denotes the end of the line, eg. in regular expressions. Could it be that your file is "pipe-delimited" and that the two first fields are empty? What editor are you using, can you disable the display of formatting chars? – hyperTrashPanda Feb 08 '19 at 11:11
  • The fields in the file are comma delimited. There are 150K lines in the file, and all are fine, except for 81 bad lines. Each of the 81 bad lines are split into two lines when I open in an IDE (see 37, 38 above). Two such lines (that should be one) are given in the example above. If any field is empty, that's OK, there is nothing between the commas (delimeters). I am not sure what disabling the display of formatting chars means. – user10664542 Feb 08 '19 at 11:51
  • I want to remove any escape characters such as ^I and also I want to remove the new line $ character at the end of line 37 above. If there is code that will remove the ^I escape character and the new line character in the middle of the line as in line 37, so that ultimately line 37 and 38 are one line, please share. I tried all of the options listed in the original post to remove both of these. – user10664542 Feb 08 '19 at 11:53

3 Answers3

3

A simple way to remove non-ascii chars could be doing:

new_text = "".join([c for c in text if c.isascii()])

NB: If you are reading this text from a file, make sure you read it with the correct encoding

olinox14
  • 5,087
  • 1
  • 14
  • 34
  • I tried that code snippet, but had no success. The .csv file was saved as UTF-8, I am opening it like this. f_csv_in = open(csv_in, "r") – user10664542 Feb 08 '19 at 10:48
  • Try with `f_csv_in = open(csv_in, "r", encoding='utf-8')` – olinox14 Feb 08 '19 at 10:51
  • No such luck so far with opening with encoding='utf-8'. Out of 150K lines in the .csv file, there are 162 that have some character in it that hang things up, the others nearly 150K lines are OK. – user10664542 Feb 08 '19 at 10:55
  • Are you using linux? If you run `file csv_in` what is the encoding reported? – hyperTrashPanda Feb 08 '19 at 11:06
  • I am using cygwin under Windows. I am using Python under cygwin. The file was saved in UTF-8 .csv using Excel. It is opened in Python with encoding='utf-8' – user10664542 Feb 08 '19 at 11:29
2

In the case of non-printable characters, the built-in string module has some ways of filtering out non-printable or non-ascii characters, eg. with the isprintable() functionality.
A concise way of filtering the whole string at once is presented below

>>> import string
>>>
>>> str1 = '\nsomestring'
>>> str1.isprintable()
False
>>> str2 = 'otherstring'
>>> str2.isprintable()
True
>>>
>>> res = filter(lambda x: x in string.printable, '\x01mystring')
>>> "".join(list(res))
'mystring'

This question has had some discussion on SO in the past, but there are many ways to do things, so I understand it may be confusing, since you can use anything from Regular Expressions to str.translate()

Another thing one could do is to take a look at Unicode Categories, and filter out your data based on the set of symbols you need.

hyperTrashPanda
  • 810
  • 3
  • 18
  • Based on your comments, I added these two lines to the method: remove_non_ascii(text), with no success. new_text = filter(lambda x: x in string.printable, new_text) new_text = "".join(list(new_text))``` – user10664542 Feb 08 '19 at 11:08
  • Can you open your file in another text editor? Are the characters still there? Can you provide a couple of example lines to work with real data? – hyperTrashPanda Feb 08 '19 at 11:13
  • I can open the file in a text editor, I can open it in vi. When I type :set list in vi, I see there is a $ at the end of the line, where there should not be. The line below it starts with ^I (control I), two of them in a row. The two lines should be joined to one line with no $ at the end of the first and no ^I^I at any point in the line. I can provide a line to work with , but how? – user10664542 Feb 08 '19 at 11:15
  • Vi is supposed to show `$` when you type `:set list` to denote line endings, so that's not an error. There is not any 'real' character there to remove, it's just `vi` styling for line ending. As for the `^|`, what are your csv files delimited by? – hyperTrashPanda Feb 08 '19 at 11:18
  • Below are two lines that should be one with non-ascii characters visible. These two lines should be one line. Notice the $ at the end of line 37, and line 38 begins with ^I^I. the problem, that vi is showing, is that there is a new line $ on line 37 where I don't want it to be. should be one line. ```37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 @ 11:45am$ 38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$``` – user10664542 Feb 08 '19 at 11:22
  • Nothing seems to work. I'm not sure what is in these errant lines that jam things up. Nothing seems to work to replace the new line character in the middle of the line or to replace the ^I control characters. No Python function seems to work. I have tried dozens of different approaches. – user10664542 Feb 08 '19 at 12:09
0

It looks as if you have a csv file that contains quoted values, that is values such as embedded commas or newlines which have to be surrounded with quotes so that csv readers handle them correctly.

If you look at the example data you can see there's an opening doublequote but no closing doublequote at the end of the first line, and a closing doublequote with no opening doublequote on the second line, indicating that the quotes contain a value with an embedded newline.

The fact that the lines are broken in two may be an artefact of the application used to view them, or the code that's processing them: if the software doesn't understand csv quoting it will assume each newline character denotes a new line.

It's not clear exactly what problem this is causing in the database, but it's quite likely that quote characters - especially unmatched quotes - could be causing a problem, particularly if the data isn't being properly escaped before insertion.

This snippet rewrites the file, removing embedded commas, newlines and tabs, and instructs the writer not to quote any values. It will fail with the error message _csv.Error: need to escape, but no escapechar set if it finds a value that needs to be escaped. Depending on your data, you may need to adjust the regex pattern.

with open('lines.csv') as f, open('fixed.csv', 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quoting=csv.QUOTE_NONE)
    for line in reader:
        new_row = [re.sub(r'\t|\n|,', ' ', x) for x in line]
        writer.writerow(new_row)
snakecharmerb
  • 28,223
  • 10
  • 51
  • 86