
I wrote the following PySpark code in JupyterLab:

class Base:
    def __init__(self,line):
        self.line = line

def process_line(line):
    return Base(line)

input_path = 'the_path_of_input'
samples = sc.textFile(input_path).map(process_line)
print(samples.take(1))

An error was encountered when I executed the code above. The following is the error message:

_pickle.PicklingError: Can't pickle <class 'Base'>: attribute lookup Base on builtins failed

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
    at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2157)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2157)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I have tried all the methods I could think of, but the error persists.

  • How about `Base(line).line`? – mck Nov 06 '20 at 10:08
  • The main problem is that I can't return my own defined object Base. Base.line is just a built-in string, not a user-defined object. – Tengfei Tian Nov 06 '20 at 10:46
  • What do you mean by a 'built-in' string? Base.line is initialised when you instantiate a Base object. I'm asking why you can't return Base(line).line in the function process_line. – mck Nov 06 '20 at 10:54
  • Sorry, I didn't describe the problem clearly. I want the function process_line to return my custom object, but then the error above is encountered. Why can't I return my custom object from the function? – Tengfei Tian Nov 06 '20 at 16:49
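
For reference, the workaround mck is suggesting above would look roughly like this (a sketch: process_line returns the plain string held by Base, which pickles without trouble, rather than the Base instance itself):

def process_line(line):
    # a plain string round-trips through pickle without problems
    return Base(line).line

samples = sc.textFile(input_path).map(process_line)
print(samples.take(1))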

1 Answer

Based on the answers and comments here: Python: Can't pickle type X, attribute lookup failed

It seems you need to put the Base class into its own module, and then it will work. So here is Base.py:

class Base:
    def __init__(self,line):
        self.line = line

And mainscript.py:

from pyspark import SparkContext, SQLContext
from Base import Base

sc = SparkContext('local')
sqlContext = SQLContext(sc)

def process_line_baseclass(line):
    return Base(line)


input_path = 'path/to/inputfile'
samples = sc.textFile(input_path).map(process_line_baseclass)

print(samples.take(1))

Output:

[<Base.Base object at 0x7fc2188e64a8>]
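
If the job runs on a real cluster rather than in local mode, Base.py may also need to be shipped to the executors so that the import succeeds there; one possible way (a sketch, the path is an assumption) is SparkContext.addPyFile:

sc.addPyFile('path/to/Base.py')  # make the Base module importable on the worker side as well
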
– user238607
  • Thanks for your answer. It works. I read the solution at the URL you pasted, but I still have a doubt: can't the executor find the Base class if I define it in the same mainscript.py in PySpark? – Tengfei Tian Nov 06 '20 at 16:57
  • As the comment on the question states, this is a pickling issue. Pickle protocol 4 solves it, but Spark for Python 3 uses pickle protocol 3 by default. If you can find a way to use protocol 4, it should work. – user238607 Nov 06 '20 at 18:25
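
One untested alternative along those lines would be to have Spark serialize data with cloudpickle, which can pickle classes defined interactively in a notebook; a minimal sketch, not verified against the asker's Spark version:

from pyspark import SparkContext
from pyspark.serializers import CloudPickleSerializer

# use cloudpickle for data serialization instead of the default pickle serializer
sc = SparkContext(master='local', serializer=CloudPickleSerializer())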