I have a project in which I run multiple data sets through a function that "cleans" them.

The cleaning function, in Misc.py, looks like this:

import sys

def clean(my_data):
    sys.stdout.write("Cleaning genes...\n")

    synonyms = FileIO("raw_data/input_data", 3, header=False).openSynonyms()
    clean_data = {}

    # Iterate over a snapshot of the keys, because entries are deleted below.
    for g in list(my_data):
        if g in synonyms:
            # Found a data point which appears in the synonym list.
            #print synonyms[g]
            for synonym in synonyms[g]:
                if synonym in my_data:
                    del my_data[synonym]
                    clean_data[g] = synonym
                    sys.stdout.write("\t%s is also known as %s\n" % (g, clean_data[g]))
    return my_data

FileIO is a custom class I made to open files.

My question: this function will be called many times throughout the program's life cycle, and I don't want to read input_data on every call, since its contents are the same every time. I know that I can read it once and pass it in as an argument, like this:

def clean(my_data, synonyms=None):
    if synonyms is None:
        ...
    else:
        ...

But is there another, better looking way of doing this?

My file structure is the following:

lib
    Misc.py
    FileIO.py
    __init__.py
    ...
raw_data
runme.py

From runme.py, I do `from lib import *` and call all the functions I made.

Is there a Pythonic way to do this? Something like a 'memory' for the function?

Edit: the line synonyms = FileIO("raw_data/input_data", 3, header=False).openSynonyms() returns a collections.OrderedDict built from input_data, using the 3rd column as the dictionary key.

The dictionary for the following dataset:

column1    column2    key    data
  ...        ...      A      B|E|Z
  ...        ...      B      F|W
  ...        ...      C      G|P
  ...

Will look like this:

OrderedDict([('A',['B','E','Z']), ('B',['F','W']), ('C',['G','P'])])

This tells my script that A is also known as B, E and Z, that B is also known as F and W, and so on.

So these are the synonyms. Since the synonym list never changes during the life of the program, I want to read it once and reuse it.

Pavlos Panteliadis
  • Another option would be to have a generator that calculates the result once and then just yields it forever. `def g(): x = getAnswer() while True: yield x` – Patrick Haugh Jan 24 '17 at 18:46
  • Side note: any particular reason you're using `sys.stdout.write()` instead of just `print()`? – skrrgwasme Jan 24 '17 at 18:47
  • @skrrgwasme I come from a `C` background, it's more like a `personal signature` type of thing. A `perk` if I may... – Pavlos Panteliadis Jan 24 '17 at 19:13
  • @Pavlos That's not a very good reason to do it. You should strive to write code that is idiomatic. When you switch languages, you should switch your techniques and constructs to match the new language as well. If you're not doing any kind of printing or stream handling that *requires* `sys.stdout.write()`, then you should just use `print()`. – skrrgwasme Jan 24 '17 at 22:15
  • @skrrgwasme Check this [link](http://stackoverflow.com/questions/3263672/python-the-difference-between-sys-stdout-write-and-print) basically, `print` is `sys.stdout.write()` with the added flexibility of knowing exactly what and how you are printing – Pavlos Panteliadis Jan 24 '17 at 22:32

3 Answers

Use a class with a __call__ method. Instances of such a class can be called like functions, and they keep their data between calls. Data that never changes is best loaded in the constructor. An object built this way is known as a 'functor' or 'callable object'.

Example:

class Incrementer:
    def __init__ (self, increment):
        self.increment = increment

    def __call__ (self, number):
        return self.increment + number

incrementerBy1 = Incrementer (1)

incrementerBy2 = Incrementer (2)

print (incrementerBy1 (3))
print (incrementerBy2 (3))

Output:

4
5

[EDIT]

Note that you can combine @Tagc's answer with mine to create exactly what you're looking for: a 'function' with built-in memory.

Name your class Clean rather than DataCleaner, and name the instance clean. Name the method __call__ rather than clean.
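Combining the two answers might look roughly like the sketch below. Note the assumptions: `load_synonyms` is a hypothetical stand-in for `FileIO(...).openSynonyms()`, and the body of `__call__` is a simplified guess at the cleaning logic in the question, not the original code.

```python
from collections import OrderedDict


def load_synonyms(path):
    """Hypothetical stand-in for FileIO(path, 3, header=False).openSynonyms()."""
    return OrderedDict([("A", ["B", "E", "Z"]), ("B", ["F", "W"]), ("C", ["G", "P"])])


class Clean(object):
    def __init__(self, synonym_file):
        # The synonym table is loaded once and kept on the instance.
        self.synonyms = load_synonyms(synonym_file)

    def __call__(self, data):
        # Iterate over a snapshot of the keys because entries are deleted below.
        for g in list(data):
            if g not in data:
                continue  # already removed as a synonym of an earlier key
            for synonym in self.synonyms.get(g, []):
                if synonym in data:
                    del data[synonym]
        return data


# The instance is called exactly like the original function.
clean = Clean("raw_data/input_data")
result = clean({"A": 1, "B": 2, "C": 3, "Q": 4})
```

Callers keep writing `clean(data)`, while the synonym table is read only when the `Clean` instance is created.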

Jacques de Hooge

Like a 'memory' for the function

Half-way to rediscovering object-oriented programming.

Encapsulate the data cleaning logic in a class, such as DataCleaner. Make it so that instances read synonym data once when instantiated and then retain that information as part of their state. Have the class expose a clean method that operates on the data:

class FileIO(object):
    def __init__(self, file_path, some_num, header):
        pass

    def openSynonyms(self):
        return []

class DataCleaner(object):
    def __init__(self, synonym_file):
        self.synonyms = FileIO(synonym_file, 3, header=False).openSynonyms()

    def clean(self, data):
        for g in data:
            if g in self.synonyms:
                # ...
                pass

if __name__ == '__main__':
    dataCleaner = DataCleaner('raw_data/input_file')
    dataCleaner.clean('some data here')
    dataCleaner.clean('some more data here')

As a possible future optimisation, you can expand on this approach to use a factory method to create instances of DataCleaner which can cache instances based on the synonym file provided (so you don't need to do expensive recomputation every time for the same file).
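A minimal sketch of such a caching factory method follows. It rests on assumptions: `load_synonyms`, `LOAD_COUNT`, the `for_file` method name and the path `raw_data/other_data` are all hypothetical and exist only to make the example self-contained.

```python
LOAD_COUNT = {"n": 0}  # counts simulated file reads, for demonstration only


def load_synonyms(path):
    """Hypothetical stand-in for FileIO(path, 3, header=False).openSynonyms()."""
    LOAD_COUNT["n"] += 1
    return {"A": ["B", "E", "Z"]}


class DataCleaner(object):
    _cache = {}  # maps synonym file path -> DataCleaner instance

    def __init__(self, synonym_file):
        self.synonyms = load_synonyms(synonym_file)

    @classmethod
    def for_file(cls, synonym_file):
        """Factory method: reuse the instance already built for this file."""
        if synonym_file not in cls._cache:
            cls._cache[synonym_file] = cls(synonym_file)
        return cls._cache[synonym_file]


first = DataCleaner.for_file("raw_data/input_data")
second = DataCleaner.for_file("raw_data/input_data")   # served from the cache
other = DataCleaner.for_file("raw_data/other_data")    # different file, new instance
```

Unlike a singleton, this keeps one instance per synonym file, so different data sets can still get their own cleaner.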

Tagc
  • this should be on a different post. but why `Factory` and not `Singleton` for this particular problem?? – Pavlos Panteliadis Jan 24 '17 at 19:18
  • Singletons are [widely regarded as an anti-pattern](http://stackoverflow.com/questions/137975/what-is-so-bad-about-singletons) and using factory methods allows for the creation of `DataCleaner` instances based on different sets of synonym data. – Tagc Jan 24 '17 at 19:21
  • Thank you. That is really helpful material! – Pavlos Panteliadis Jan 24 '17 at 19:24

I think the cleanest way to do this is to decorate your "clean" (pun intended) function with another function that supplies the synonyms to it as a keyword argument. This is IMO cleaner and more concise than creating another custom class, yet it still lets you easily change the "input_data" file if you need to (the decorator acts as a factory function):

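import functools

def defineSynonyms(datafile):
    # Read the synonym file once, when the decorator is applied,
    # rather than on every call to the wrapped function.
    synonyms = FileIO(datafile, 3, header=False).openSynonyms()
    def wrap(func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            kwargs['synonyms'] = synonyms
            return func(*args, **kwargs)
        return wrapped
    return wrap

@defineSynonyms("raw_data/input_data")
def clean(my_data, synonyms=None):
    # do stuff with synonyms and my_data...
    pass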
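To check how many times the synonym source is actually read, the decorator idea can be exercised with a stub loader. This is a sketch under assumptions: `load_synonyms` and `READS` are hypothetical stand-ins for `FileIO(...).openSynonyms()`, the decorator is renamed `define_synonyms`, and the read is placed in the outer function so that it happens once, when the decorator is applied.

```python
import functools

READS = []  # records each simulated file read


def load_synonyms(path):
    """Hypothetical stand-in for FileIO(path, 3, header=False).openSynonyms()."""
    READS.append(path)
    return {"A": ["B", "E", "Z"]}


def define_synonyms(datafile):
    # Runs once, when the decorator is applied to the function.
    synonyms = load_synonyms(datafile)

    def wrap(func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            # Callers may still pass their own synonyms explicitly.
            kwargs.setdefault("synonyms", synonyms)
            return func(*args, **kwargs)
        return wrapped
    return wrap


@define_synonyms("raw_data/input_data")
def clean(my_data, synonyms=None):
    # Toy body: keep only the items that have an entry in the synonym table.
    return [g for g in my_data if g in synonyms]


clean(["A", "Q"])
clean(["B"])
# READS now holds a single entry, however many times clean() was called.
```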
Aaron