
I have a sample Python class

import datetime

class bean:
    def __init__(self, puid, pogid, bucketId, dt, at):
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() - datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt = dt
        self.at = at

Now I know that in Java, to make a class serializable, we just implement Serializable and override a few methods, and life is simple. Though Python is so simplistic, I can't find a way to serialize the objects of this class.

This class should be serializable over the network, because objects of this class go to Apache Spark, which distributes the objects over the network.

What is the best way to do that?

I also found this, but I don't know if it is the best way to do it.

I also read:

> Classes, functions, and methods cannot be pickled -- if you pickle an object, the object's class is not pickled, just a string that identifies what class it belongs to.

So does that mean those classes can't be serialized?

PS: There would be millions of objects of this class, as the data is huge. So please provide two solutions: the easiest way, and the most efficient way of doing so.

EDIT:

For clarification, I have to use it something like this:

def myfun():
    # ... some logic ...
    t1 = bean(<params>)
    t2 = bean(<params2>)
    temp = list()
    temp.append(t1)
    temp.append(t2)
    return temp

Now, this is how it is finally called:

PairRDD.map(myfun).collect()

which throws this exception:

<function __init__ at 0x7f3549853c80> is not JSON serializable
iec2011007
  • If you're going to send the objects to something, shouldn't you look at the requirements of the receiver? I mean, you're asking here for *any* way to serialize a Python object, but you've linked to a question about *JSON* serialization. Do you need JSON serialization? Have you looked at `pickle`? – Lasse V. Karlsen Dec 31 '15 at 09:23
  • Yes, I looked at pickle and cPickle. I was hoping to find functionality like in Java, where we can get an instance of a serializable object. I linked that question because that's the only workaround I found. For pickle, all the examples I found had to specify a file save location for the pickled object, but in my case I have to return the serialized object instance from some other function. – iec2011007 Dec 31 '15 at 09:26
  • Could you explain what is the issue here? It looks like [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) to me. What are the types of the attributes (puid, pogid, bucketId, dt, at)? Also please fix the indentation. – zero323 Dec 31 '15 at 15:29

2 Answers


First, for your example pickle will work great. pickle doesn't serialize "functions", it only serializes "data" - so as long as you have the type you are trying to serialize available on the remote script, i.e. if you have the type "bean" imported on the receiving end, you can use pickle or cPickle and everything will work. The disclaimer you quoted means that pickle doesn't keep the code of the class: if you don't have it imported on the receiving end, pickle won't work for you.

All cross-language serialization solutions (i.e. JSON, XML) will never provide the ability to transfer class "code", because there's no reasonable way to represent it. If you're using the same language on both ends (like here), there are ways to get this to work - you could, for example, marshal the object's code, pickle the result, send it over, unpickle and unmarshal on the receiving end, and you have an object with its functions - this is, in fact, sending the code and eval()-ing it on the receiving end.
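For illustration, here's a minimal sketch of that marshal-then-pickle idea for a plain function (the function greet and the variable names are just assumptions for the example; shipping a whole class this way is considerably more involved):

import marshal
import pickle
import types

def greet(name):
    return "hello %s" % name

# sender side: marshal the raw code object, then pickle the resulting bytes
payload = pickle.dumps(marshal.dumps(greet.__code__))

# receiver side: unpickle, unmarshal, and rebuild a callable from the code
code = marshal.loads(pickle.loads(payload))
restored = types.FunctionType(code, globals(), "greet")
print restored("world")  # hello world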

Here's a quick example based on your class for pickling objects:

test.py

import datetime
import pickle

class bean:
    def __init__(self, puid, pogid, bucketId, dt, at) :
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() - datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt=dt
        self.at = at
    def whoami(self):
        return "%d %d"%(self.puid, self.pogid)

def myfun():
    t1 = bean(1,2,3,"2015-12-31 11:50:25",4)
    t2 = bean(5,6,7,"2015-12-31 12:50:25",8)
    tmp = list()
    tmp.append(t1)
    tmp.append(t2)
    return tmp

if __name__ == "__main__":
    # binary mode is the safe choice for pickle files
    with open("test.data", "wb") as f:
        pickle.dump(myfun(), f)
    with open("test.data", "rb") as f2:
        obj = pickle.load(f2)
    print "|".join([b.whoami() for b in obj])

running it:

ben@ben-lnx:~$ python test.py 
1 2|5 6

So you can see that pickle works, as long as you have the type imported on the receiving end.
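Since you also asked for the most efficient variant: on Python 2, the usual speed-up is cPickle with the highest pickle protocol (a sketch, reusing myfun() from the script above):

import cPickle as pickle

# protocol 2 is binary and much faster than the default text protocol 0
data = pickle.dumps(myfun(), pickle.HIGHEST_PROTOCOL)
objs = pickle.loads(data)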

ben.pere

As long as all the arguments you pass to __init__ (puid, pogid, bucketId, dt, at) can be serialized, there should be no need for any additional steps. If you experience any problems, it most likely means you didn't properly distribute your modules over the cluster.
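A quick in-memory round trip is an easy way to verify this (a sketch; Bean is assumed to live in some_module.py, as described below, and no file is needed):

import pickle
from some_module import Bean

b = Bean(1, 2, 3, "2015-12-31 11:50:25", 4)
restored = pickle.loads(pickle.dumps(b))  # serialize and deserialize in memory
assert restored.puid == b.puid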

While PySpark automatically distributes variables and functions referenced inside closures, distributing modules, libraries and classes is your responsibility. In the case of simple classes, creating a separate module and passing it via SparkContext.addPyFile should be enough:

# https://www.python.org/dev/peps/pep-0008/#class-names
from some_module import Bean

sc.addPyFile("some_module.py")  # distribute the module to the workers
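Putting it together, the driver script might look like this (a sketch; some_module.py, the app name and the sample row are assumptions for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="bean-demo")
sc.addPyFile("some_module.py")  # ship the class definition to every worker

from some_module import Bean

def to_bean(row):
    # the row layout is an assumption for the example
    puid, pogid, bucketId, dt, at = row
    return Bean(puid, pogid, bucketId, dt, at)

rdd = sc.parallelize([(1, 2, 3, "2015-12-31 11:50:25", 4)])
beans = rdd.map(to_bean).collect()  # plain pickle under the hood, no JSON involved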
zero323
  • Well, I think the use case is generic. How do I go about doing this when, say, my class is derived from a couple of parent classes and also has instances of other classes, some functions and stuff (what I mean is there are also custom objects)? Can you please add that to your answer, or some links? – iec2011007 Jan 03 '16 at 07:42
  • There is really nothing to add. All required classes should be placed in module(s) and distributed to the workers. The number of classes or nesting doesn't really affect that. The rest is pretty much standard pickle with [its limitations](https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled). – zero323 Jan 03 '16 at 10:47