Removing  from the text

Question

I am converting a Word file to a text string using Python. The bullet points in the Word file come out as  in the converted string. How can I remove this character using Python, so that I have only the text without these boxes ()?

# -*- coding: utf-8 -*-
from docx import Document

document = Document(file_to_read)

text_string = ''
for paragraph in document.paragraphs:
    text_string += paragraph.text + "\n"

print text_string

The output is like:

 Computer Science fundamentals in data structures.

 Computer Science fundamentals in algorithm design, problem solving, and complexity analysis

@BhargavRao The problem is not that the output string has unicode characters, but it has actual squares in it. — Srinivasan A, Jul 01 '16 at 18:57
@SrinivasanA As you are on Python2, Test this [Replace non-ASCII characters with a single space](http://stackoverflow.com/q/20078816) and confirm if it works. I have reopened the post. — Bhargav Rao, Jul 01 '16 at 19:15

skyking · Answer 1 · 2016-07-01T05:02:10.800

3

Your attempt doesn't try to remove the characters. You can use the replace method to replace characters in a string; replacing with the empty string removes them.

The only difficulty is representing the character U+F0B7 properly in your source code. The right way depends on whether the paragraph text is a normal (byte) string or a unicode string (I'd recommend using Python 3 to avoid unicode problems). I assume unicode strings, in which case you write the code point as `u"\uF0B7"` (for normal strings it depends on the encoding).

Apart from that, the way you build text_string with repeated concatenation may be suboptimal. Another way to build a string from fragments is to put the fragments in a list and then join them with "".join(l).
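As a minimal sketch of the two ways of building the string (the `fragments` list is made-up sample data standing in for the paragraph texts, not something from the question):

```python
# Hypothetical sample data standing in for the paragraph texts.
fragments = ["first paragraph", "second paragraph", "third paragraph"]

# Repeated concatenation: every += creates a new string object.
concatenated = ""
for fragment in fragments:
    concatenated += fragment + "\n"

# Collect the pieces in a list, then build the result with one join call.
joined = "".join([fragment + "\n" for fragment in fragments])

# join accepts other iterables too, for example a tuple.
from_tuple = "\n".join(("a", "b"))

print(concatenated == joined)
```

Both approaches produce the same string; the difference is only in how much intermediate copying may happen along the way.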

Putting this together you get (assuming that each paragraph's text is a unicode string):

from docx import Document

document = Document(file_to_read)

# p is a Paragraph object; its text attribute holds the string
text_string = u"\n".join([p.text.replace(u"\uF0B7", u"")
                          for p in document.paragraphs])

print(text_string)

If you use Python 3, the u prefixes before the strings are unnecessary, since all strings are unicode there (Python 3.3 and later still accept the prefix, while 3.0 to 3.2 reject it). Also note that when printing you must make sure that your output encoding supports all the characters in the document (an unsupported character may have been the reason you wanted to remove the bullets in the first place).
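A Python 3 sketch of the same idea, written against a plain list of strings so it runs without a Word file. The helper name `strip_bullets` and the sample data are hypothetical, and the extra `strip()` call is an addition that removes the space left behind by the bullet:

```python
BULLET = "\uf0b7"  # the private-use code point U+F0B7 discussed above

def strip_bullets(paragraph_texts):
    """Join paragraph texts with newlines, dropping the bullet character."""
    return "\n".join(
        [text.replace(BULLET, "").strip() for text in paragraph_texts]
    )

# Hypothetical sample input mimicking the output shown in the question.
sample = [
    "\uf0b7 Computer Science fundamentals in data structures.",
    "\uf0b7 Computer Science fundamentals in algorithm design.",
]
cleaned = strip_bullets(sample)
print(cleaned)
```

With python-docx, the input would presumably be something like `[p.text for p in document.paragraphs]`.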

edited Jul 01 '16 at 05:02

answered Jun 30 '16 at 12:10


1

Use a list comprehension with a `join` instead of a generator expression. This is because `join` iterates over the list twice, hence it is faster to have a list there instead of a generator which has to be re-created for the second iteration. See [Raymond Hettinger's answer](http://stackoverflow.com/a/9061024/4099593). – Bhargav Rao Jun 30 '16 at 22:36
@BhargavRao I didn't know that before. Thanks for pointing that out, I've updated my answer. – skyking Jul 01 '16 at 05:03
Just clarifying Bhargav Rao's comment: the generator expression isn't run twice, since the second could yield different results. Instead, `.join` saves the output of the gen exp into a list; Martijn mentions this in his answer of the linked "possible duplicate" question. – PM 2Ring Jul 02 '16 at 01:10
@PM2Ring That's what Raymond Hettinger's answer says, yes. There's no other possibility (in Python) since an iterator is only guaranteed to be iterable once. However, the behavior needn't be that `join` creates a list and then does the same thing as if it were given a list to start with; that would need to iterate over the data three times. Instead it can build the list at the same time as it makes the first pass, and then use that list for the second iteration. Btw I guess that `join` works directly over tuples as well. – skyking Jul 02 '16 at 11:56
@skyking: I'm pretty sure you're correct; I haven't checked the source code for `.join`. And I'm almost certain that `.join` works directly on tuples, there's no (sane) reason why it wouldn't. – PM 2Ring Jul 02 '16 at 12:08

score 0 · Answer 2 · answered Jun 30 '16 at 12:10

If you only want English (ASCII) characters, this could do:

text_string = text_string.encode('ascii', 'ignore').decode('ascii')

I think the best solution would be to identify exactly which byte is causing issues and replace it.
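One way to identify the offending character is to list every non-ASCII character in the string together with its code point. This is only a sketch in Python 3; the function name `non_ascii_report` and the sample string are made up for illustration:

```python
import unicodedata

def non_ascii_report(text):
    """Return (code point, category, name) for each distinct non-ASCII character."""
    report = []
    for ch in sorted(set(text)):
        if ord(ch) > 127:
            # Private-use characters have no name, so supply a default.
            name = unicodedata.name(ch, "<unnamed>")
            report.append(("U+%04X" % ord(ch), unicodedata.category(ch), name))
    return report

# Hypothetical sample mimicking the output shown in the question.
sample = "\uf0b7 Computer Science fundamentals"
for entry in non_ascii_report(sample):
    print(entry)
```

A category of `Co` marks a private-use code point, which has no standard glyph and is typically why a box is displayed instead of a bullet.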

The `# -*- coding: utf-8 -*-` line specifies the encoding of your source file, not that of your string.
