Remove Unicode values that have spaces between them

Question

I have a file containing Unicode strings, one per line.

ജുഗുപ്‌സയോ നീരസമോ പരിഹാസമോ ദ്യോതിപ്പിക്കുന്ന മുഖഭാവം
വളവ്‌
വക്രത
തിരിവ്‌
കോട്ടം
നന്നേ ചെറുപ്രായത്തില്‍ അസാമന്യ ജീവിത വിജയം നേടുന്നയാള്‍
ഇന്റര്‍നെറ്റിലെ പ്രധാനപ്പെട്ട സേവനം
സ്‌ക്രീനില്‍ കാണുന്ന അതേ രൂപത്തിലും ഭാവത്തിലും പ്രിന്ററില്‍ നിന്ന്‌ ലഭിക്കുന്ന കോപ്പി
തെറ്റ്‌ എന്നു കാണിക്കുന്ന അടയാളം
യുണിക്‌സിനെ ആധാരമാക്കിയുള്ള പ്രവര്‍ത്തന കേന്ദ്രങ്ങളില്‍ ഉപയോഗപ്പെടുത്തുന്ന ഒരു നെറ്റവര്‍ക്ക്‌ വിന്‍ഡോ സ്ഥാപന അന്തരീക്ഷം
പ്രിന്ററിലൂടെ കടലാസ്‌ നീങ്ങിപ്പോകുന്ന ദിശക്ക്‌ ലംബമായുള്ള ദിശ
കമ്പ്യൂട്ടറിലെ ഒരു ഡിസ്‌കിലുള്ള വിവരങ്ങള്‍ മറ്റൊരു ഡിസ്‌കിലേക്ക്‌ കോപ്പിചെയ്‌തു വെക്കാന്‍ ഡോസ്‌ എന്ന ഓപ്പറേറ്റിംഗ്‌ സിസ്റ്റത്തിലുള്ള സംവിധാനം
ക്രിസ്‌തുമസ്‌
പ്രായപൂര്‍ത്തിയായവര്‍ക്കുള്ള ചലച്ചിത്രം
ചില പ്രത്യേക കിരണങ്ങളുടെ സഹായത്താല്‍ എടുക്കുന്ന ചിത്രങ്ങള്‍
എക്‌സറേ
അദൃശ്യാലക്തിക കിരണം
മരണം വരെയും സൗന്ദര്യം ഒരേപോലെ നിലനിര്‍ത്താന്‍ കഴിഞ്ഞവര്‍
കലഹപ്രിയ
ശണ്‌ഠക്കാരി

How can I remove the sentences from the file?

I need to keep only the single words:

  ക്രിസ്‌തുമസ്‌
 കലഹപ്രിയ
    ശണ്‌ഠക്കാരി
വളവ്‌
    വക്രത
    തിരിവ്‌
    കോട്ടം

and remove all the sentences, such as these:

പ്രിന്ററിലൂടെ കടലാസ്‌ നീങ്ങിപ്പോകുന്ന ദിശക്ക്‌ ലംബമായുള്ള ദിശ
    കമ്പ്യൂട്ടറിലെ ഒരു ഡിസ്‌കിലുള്ള വിവരങ്ങള്‍ മറ്റൊരു ഡിസ്‌കിലേക്ക്‌ കോപ്പിചെയ്‌തു വെക്കാന്‍ ഡോസ്‌ എന്ന ഓപ്പറേറ്റിംഗ്‌ സിസ്റ്റത്തിലുള്ള സംവിധാനം

The words in a sentence are separated by spaces.

I am using Python 2.7 and open the file like this:

m = open('olam-enml.txt','w')

The file is encoded as UTF-8.

When I tried this code

string = "നന്നേ ചെറുപ്രായത്തില്‍ അസാമന്യ ജീവിത വിജയം നേടുന്നയാള്‍"

if u' ' not in string .strip():
    print string

I got this error

Traceback (most recent call last):
  File "/home/akallararajappan/Music/Mycodeexp/d.py", line 3, in <module>
    if u' ' not in string .strip():
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

Define 'sentences'. Most of us here have some trouble reading this specific script, let alone divine what exactly you mean. What in your text should be removed, and why **exactly**? — Martijn Pieters, Feb 07 '14 at 11:56
Are all spaces simple ASCII codepoint 32 characters? How are you reading the file? Do you need to write back to the same file? Is this Python 2 or Python 3? — Martijn Pieters, Feb 07 '14 at 12:06
What have you tried so far? Exactly what part of the problem is giving you difficulty? — Wooble, Feb 07 '14 at 12:07
@karu: Your error is caused by the fact you tested a **byte** string. `string` is encoded data, not a unicode value. What encoding was used depends on your source code editor, and you then also need to tell python about the codec by setting a [special comment](http://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code). — Martijn Pieters, Feb 07 '14 at 12:21
If you're embedding non-ASCII text (really, even if you're using ASCII it's a good habit) in Python 2 source code, you need to specify the source encoding with `#coding: utf-8` at the top of your file, and use `u"unicode strings like this"`. (In Python 3 text is treated sensibly, which for some reason makes a lot of people angry.) — Wooble, Feb 07 '14 at 12:22
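
A minimal sketch of what the two comments above describe, assuming the source file is saved as UTF-8. The helper name `has_inner_space` and the two-word sample are illustrative, not from the question. Decoding the bytes to text before the membership test is what avoids the `UnicodeDecodeError`:

```python
# -*- coding: utf-8 -*-
# The coding line above tells Python 2 how this source file is encoded.


def has_inner_space(text):
    """Return True if text still contains a space after stripping the ends."""
    return u' ' in text.strip()


word = u"വക്രത"
sentence = u"കലഹപ്രിയ വക്രത"  # illustrative two-word line

# Bytes as they would come from a file opened without an encoding.
raw = sentence.encode('utf8')

# Decode to text first; only then compare against a unicode literal.
decoded = raw.decode('utf8')

print(has_inner_space(decoded))
print(has_inner_space(word))
```

Under Python 2, testing `u' ' in raw` directly would trigger the implicit ASCII decode shown in the traceback; under Python 3 the same test raises a `TypeError` instead.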

Martijn Pieters · Accepted Answer · 2014-02-07T12:20:39.157


You can strip whitespace from the start and end of the lines, and if there are still spaces in the string you have a sentence:

if u' ' not in line.strip():
    # line is *not* a sentence
    pass

Open your file with io.open() instead; to write just the lines that are not sentences you can use a simple generator expression:

import io

with io.open('olam-enml.txt', 'r', encoding='utf8') as infh:
    with io.open('olam-enml-words.txt', 'w', encoding='utf8') as outfh:
        outfh.writelines(line for line in infh if u' ' not in line.strip())
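
If the separators might not all be plain ASCII spaces (tabs or no-break spaces, for example), a variation based on `split()` could be used instead. This is only a sketch: the function names `is_single_word` and `keep_single_words` are made up for illustration, and unlike the snippet above it also drops blank lines and strips the indentation from the lines it keeps.

```python
# -*- coding: utf-8 -*-
import io


def is_single_word(line):
    """Return True if the line holds exactly one whitespace-free token."""
    # On unicode text, split() with no argument splits on any run of
    # whitespace and ignores leading and trailing whitespace.
    return len(line.split()) == 1


def keep_single_words(src_path, dst_path):
    """Copy only the single-word lines from src_path to dst_path (UTF-8)."""
    with io.open(src_path, 'r', encoding='utf8') as infh:
        with io.open(dst_path, 'w', encoding='utf8') as outfh:
            for line in infh:
                if is_single_word(line):
                    # Write the bare word followed by a newline.
                    outfh.write(line.strip() + u'\n')
```

It would be called as `keep_single_words('olam-enml.txt', 'olam-enml-words.txt')`, writing to a separate output file so the input is left untouched.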

edited Feb 07 '14 at 12:20

answered Feb 07 '14 at 12:07


How do I know whether there is white space between the strings? – Feb 07 '14 at 12:12
@karu: That's what `u' ' in string.strip()` *does*, test if there is any whitespace *between* the words. – Martijn Pieters Feb 07 '14 at 12:15
