I finally got some time to play with it. I had to rework it several times, but
here is how I would do it. Since you mentioned that you are new to programming,
I will describe some of the concepts involved in the code. I imagine it will be
a little hard at first, but I'm sure it's worth discovering all the power of
Python behind this concrete example.
Object Oriented Programming
The first concept you will find in the following program is introduced by the
keyword class. There is a nice introduction here, which you should read if you
are not familiar with the concept. I will not go into details, because the
above link does it much better. But to make the connection with the next
sections, here are some basic concepts.
The class keyword is used to define a class, in the same way the def keyword
defines functions. The code contained in it is not executed until it is
explicitly called. Classes contain variables and functions (called methods)
and form a blueprint used to create objects. Let's use an example. I can
define a class Cake. A Cake has a flavor and a number of slices; those would
be variables. The class contains a method take_slice that would decrease the
number of slices available.
From this blueprint, I can bake several Cakes that will all have different
flavors and numbers of slices. As you can imagine, if I call take_slice on the
cheesecake, I won't affect the chocolate cake.
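To make this concrete, here is a minimal sketch of that Cake blueprint (the
names Cake, flavor, slices and take_slice are only the illustration from this
section, they are not part of the formatter below):

class Cake:
    def __init__(self, flavor, slices):
        # Each Cake object gets its own flavor and its own number of slices
        self.flavor = flavor
        self.slices = slices

    def take_slice(self):
        # Decrease the number of slices of *this* cake only
        if self.slices > 0:
            self.slices -= 1

cheesecake = Cake("cheese", 8)
chocolate = Cake("chocolate", 6)
cheesecake.take_slice()
print(cheesecake.slices)  # 7
print(chocolate.slices)   # 6: the chocolate cake is not affected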
Lexing
A lexer transforms a text into a sequence of tokens. A token is a piece of the
text that has a meaning as a unit (for example a word, an operator, a
punctuation mark, ...). Combining the tokens to analyse the grammar and get
the meaning, or to manipulate the text, is generally out of the scope of the
lexer and is left to another part of the software, often called the parser.
This is what the _tokenize function does: it just "eats" the text to generate
tokens, without any further understanding of what it means that one token
comes after or before another.
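To give you a rough idea of what this produces (the token names are the ones
defined in the program further down), a line such as 'Hello , world !' would
come out of the lexer as a stream of (name, text) pairs, roughly:

("word", "Hello")
("white", " ")
("punct", ",")
("white", " ")
("word", "world")
("white", " ")
("punct", "!")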
Generators
If you are not familiar with this concept, this is probably where you will find
the code below hard to understand. There is a good post
on that topic. But overly simplified (and thus somewhat inexact), this is what
I would say:
You already know functions. When you call one, the code of the function is
executed, and it returns a value, giving control back to the caller. If a
function has more than one return statement, the first one encountered
terminates the function.
Now imagine a kind of function that, when you call it, executes some code and
returns a value, but when you call it again, resumes the execution where it
last stopped and returns a new value, and so on until some kind of processing
is finished. To be more precise, let's say that when you call this strange
function, you actually get back an object, a special one. To get the actual
"return values" of this strange function, you have to apply the built-in next
function to that object. Each time you do that, you get the next value
produced by the function. There you are! You have a generator.
The "function" is the generator, the object returned is an iterator.
>>> def simple_range(n):
    i = 0
    while i < n:
        yield i  # This is the magical word that returns a value
                 # and "pauses" the execution. Code will resume from here
                 # the next time "next" is applied.
        i += 1
    # When the end of the function is reached, StopIteration is raised
>>> rng = simple_range(3) # Calling a generator gives an iterator
>>> next(rng) # You access values by asking the *iterator*
0
>>> next(rng)
1
>>> next(rng)
2
>>> next(rng)
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    next(rng)
StopIteration
>>>
It is now worth noting that Python's for loop operates on iterators, so it
is very handy for working with generators:
>>> rng2 = simple_range(5)
>>> for i in rng2:
    print("Next:", i)
Next: 0
Next: 1
Next: 2
Next: 3
Next: 4
The code!
Ok, enough concepts, where is the result? Here it is!
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re # Oh yeah :)
class WhiteFormater:
    """
    A formatter that cleans up a text's whitespace around punctuation,
    quotes, parentheses, brackets and curly brackets. There should be no
    space before a punctuation mark or a closing delimiter, but at least a
    space after them. There should be at least a space before an opening
    delimiter and none after it. Quotes are handled in pairs: the first one
    encountered is an opening one, the second is a closing one.

    `strip`: If set, leading and trailing whitespace is removed.
    `keependline`: If set and the string ends with a newline, the newline
        is preserved, even if `strip` is set.
    `reduce_whitespace`: If set, runs of whitespace are merged into a
        single character: a space, unless the run contains a tab, in
        which case a tab is kept.
    """

    # List of tokens the lexer will understand: a name and its associated regex
    tokens = [
        ("word", re.compile(r"[\w\-_\d]+")),
        ("punct", re.compile(r"[\.,;:!?$]")),
        ("open", re.compile(r"[(\[\{]")),
        ("close", re.compile(r"[)\]\}]")),
        ("white", re.compile(r"\s+")),
        ("quote", re.compile(r'"'))
    ]
    def __init__(self, strip=True, keependline=True, reduce_whitespace=False):
        self.strip = strip
        self.keependline = keependline
        self.reduce_whitespace = reduce_whitespace
    def _do_reduce_whitespace(self, whitespace):
        """
        Merge a run of whitespace characters according to the settings
        and the priority of whitespace types.
        """
        # Do we need to take action?
        if len(whitespace) > 1 and self.reduce_whitespace:
            # Give higher priority to a tab
            if "\t" in whitespace:
                return "\t"
            else:
                return " "
        else:
            return whitespace
    def _tokenize(self, source, initpos=0):
        """
        This is the lexer. It is responsible for "eating" the source by
        matching tokens one after the other. It does only that. Tokens
        are yielded and will be processed by other functions.
        Objects yielded are 2-tuples containing the token name (or
        identifier) followed by the matched text. This format is kept by
        all the filters, so that they can be chained.
        """
        pos = initpos
        end = len(source)
        # Until the end of the source
        while pos < end:
            # Try all the regexes to find one that matches
            for token_name, regex in self.tokens:
                match = regex.match(source, pos)
                if match:
                    # Advance the reading cursor to just after the match
                    pos = match.end()
                    # Push the token and matched text to the parser
                    yield (token_name, match.group())
                    break
            else:
                # If no regex matches, this usually indicates that the
                # text contains a syntax error. In our case, it may just be
                # a character that is not a letter or a digit but is valid
                # (i.e. / or + or *...). Just push a fake "unknown" token
                # that will not be taken into account.
                yield ("unknown", source[pos])
                # Advance by just one character in the source
                pos += 1
    def _quote_sorter(self, tokenizer):
        """
        This is a filter that sorts quotes. The first one encountered is
        transformed into an opening delimiter, the second one into a
        closing one.
        """
        in_quote = False
        # Process all matched tokens
        for token, matched in tokenizer:
            # Transform quote tokens
            if token == "quote":
                if in_quote:
                    yield "close", matched
                else:
                    yield "open", matched
                in_quote = not in_quote
            # Other tokens are left untouched
            else:
                yield token, matched
    def _correcter(self, tokenizer):
        """
        The main filter, which cleans whitespace in the text.
        The main idea of the algorithm is to look at the previous and
        the next token to make a decision about the current token.
        """
        # At the beginning, there is no previous token
        prev_token, prev_matched = (None, None)
        # Initialise the current token with the first token.
        # Then, these "registers" will be shifted at each iteration.
        # ---
        # Note: if the string is empty, next() raises StopIteration,
        # which is fine because then our generator must also end. We just
        # return, which ends the generator cleanly (letting StopIteration
        # leak out of a generator is not allowed in recent Python versions).
        try:
            token, matched = next(tokenizer)
        except StopIteration:
            return
        # Iterate over the tokens, but the one we fetch is actually the
        # "next" one relative to the token we are analysing. It is then
        # shifted, so in a way the for loop is one step ahead of the
        # analysed token.
        for next_token, next_matched in tokenizer:
            if token == "white":
                # A whitespace is accepted if it is not after an opening
                # delimiter, and not before a punctuation mark or a closing
                # delimiter. It is also not accepted at the start of the
                # string if self.strip is set.
                if (
                    (prev_token != "open") and
                    (next_token not in ["close", "punct"]) and
                    ((prev_token is not None) or (not self.strip))
                ):
                    # Call _do_reduce_whitespace to merge whitespace if needed
                    yield token, self._do_reduce_whitespace(matched)
                # Else, reject the token
            elif token == "open":
                # If there was no whitespace before an opening delimiter,
                # we need to add one.
                if prev_token not in ["white", None]:
                    yield "white", " "
                # Output the opening delimiter itself
                yield token, matched
            elif token in ["close", "punct"]:
                # First output the delimiter
                yield token, matched
                # Then check that it is followed by a whitespace, else add one
                if next_token not in ["white", "close", "punct"]:
                    yield "white", " "
            else:
                # Other tokens ("word", "unknown") are passed through untouched
                yield token, matched
            # Shift the token reading position: previous = current, current = next
            prev_token, prev_matched = token, matched
            token, matched = next_token, next_matched
        # Handle the last token (remember, the for loop is one step ahead)
        if token != "white":
            # Only whitespace needs processing in the last token.
            # Is it so? I think so. Maybe not... I don't see a corner case, yet :)
            yield token, matched
        else:
            if self.strip:
                # If strip is set, we should remove the trailing whitespace,
                # but still keep the newline if the option is set.
                if matched.endswith("\n") and self.keependline:
                    yield "white", "\n"
            else:
                # If we do not strip, check however whether the option is set
                # to remove the newline. Merge the whitespace if needed.
                if matched.endswith("\n") and not self.keependline:
                    yield "white", self._do_reduce_whitespace(matched[:-1])
                else:
                    yield "white", self._do_reduce_whitespace(matched)
    def format_line(self, source):
        """
        Format a single line of text, returning the cleaned text as a
        string.
        Note that multi-line text will probably result in badly cleaned
        text around the newlines. If `reduce_whitespace` is set, it will
        even merge lines together.
        """
        # Create the tokenizer by chaining the filters
        tokenizer = self._correcter(self._quote_sorter(self._tokenize(source)))
        # Join all the pieces together into a single string
        return "".join(matched for token, matched in tokenizer)

# Test functions using the formatter
def test_simple_string():
    s = (
        ' Hello ( this is )not a "proper " sentence : punctuation , '
        'and (parenthesis) or brackets like [and/or ] are not '
        '{ correctly}spaced :please correct " if you can " ! \n'
    )
    formatter = WhiteFormater()
    print(formatter.format_line(s))
    # Hello (this is) not a "proper" sentence: punctuation, and (parenthesis)
    # or brackets like [and/or] are not {correctly} spaced: please correct
    # "if you can"!

def test_file():
    formatter = WhiteFormater(reduce_whitespace=True)
    with open("test_in.txt", "r") as fin, open("test_out.txt", "w") as fout:
        for line in fin:
            fout.write(formatter.format_line(line))

if __name__ == "__main__":
    test_file()

test_in.txt
Hello !
This is a" test " file
for a ( little )program that
[properly] ( ? ) formats text containing (parenthesis )
or other kinds of { curly brackets}...
Enjoy!
Cilyan
test_out.txt
Hello!
This is a "test" file
for a (little) program that
[properly] (?) formats text containing (parenthesis)
or other kinds of {curly brackets}...
Enjoy!
Cilyan
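If you want to play with the other options, here is the kind of quick check I
have in mind (the input string is just an example I made up):

formatter = WhiteFormater(strip=False, keependline=False, reduce_whitespace=True)
print(repr(formatter.format_line("  Hello\t  ( world ) !  \n")))
# If I did not miss anything, this should show something like:
# ' Hello\t(world)! '
# The leading run of spaces is kept but reduced to one (strip=False),
# the tab wins over the spaces around it, and the final newline is
# dropped (keependline=False).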
Good luck!