
I'm trying to rewrite Scala code for computing random walks in Python, using the PySpark library. I've run into the question of how best to implement the Scala object GraphMap, which computes the routing table. The problem is that the values of GraphMap are both accessed and modified on each worker. Since GraphMap is a Scala object (i.e. a singleton), the outcome is a global modification of its values.

In Python, I've created a module graphmap.py:

import threading


def synchronized(func):
    # one lock per decorated function; serializes calls within a single process
    func.__lock__ = threading.Lock()

    def synced_func(*args, **kws):
        with func.__lock__:
            return func(*args, **kws)

    return synced_func


class GraphMap(object):
    # class-level state, intended to be shared by all instances
    scr_vertex_map = dict()
    offsets = []
    lengths = []
    edges = []
    index_counter = 0
    offset_counter = 0
    first_get = True
    vertex_partition_map = dict()

    @synchronized
    def add_vertex(self, vertex_id, neighbors):
        """
        Add vertex to the map
        :param vertex_id: <int>
        :param neighbors: <list(int,float)>
        :return:
        """
        if vertex_id not in self.scr_vertex_map:
            if len(neighbors):
                self.scr_vertex_map[vertex_id] = self.index_counter
                self.offsets.insert(self.index_counter, self.offset_counter)
                self.lengths.insert(self.index_counter, len(neighbors))
                self.index_counter += 1  # note: += rebinds index_counter as an instance attribute, shadowing the class attribute
                for element in neighbors:
                    self.edges.insert(self.offset_counter, element)
                    self.offset_counter += 1
            else:
                self.scr_vertex_map[vertex_id] = -1
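
For what it's worth, the decorator does serialize calls within a single process. A minimal self-contained check (inlining the decorator from graphmap.py, with a plain counter standing in for GraphMap) behaves as expected:

```python
import threading


def synchronized(func):
    # same decorator as in graphmap.py: one lock per decorated function
    func.__lock__ = threading.Lock()

    def synced_func(*args, **kws):
        with func.__lock__:
            return func(*args, **kws)

    return synced_func


class Counter:
    value = 0  # class-level, like the counters in GraphMap

    @synchronized
    def bump(self):
        # read-modify-write on the class attribute; the lock prevents lost updates
        current = Counter.value
        Counter.value = current + 1


def hammer(counter, n):
    for _ in range(n):
        counter.bump()


counter = Counter()
threads = [threading.Thread(target=hammer, args=(counter, 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(Counter.value)  # 8000
```

So within one process the locking is not the problem; the trouble only starts once Spark distributes the work.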

And a separate module to compute the routing table:

from graphmap import GraphMap
import pytest
from pyspark.sql import SparkSession


def test_graphmap():
    spark_session = SparkSession \
        .builder \
        .master('local[*]')\
        .getOrCreate()

    graph = spark_session.sparkContext\
        .parallelize([(1, [(2, 1.0)]), (2, []), (3, [])])\
        .partitionBy(3)
    graph_map = GraphMap()

    def f(splitIndex, iterator):
        for vid, neighbors in iterator:
            graph_map.add_vertex(vid, neighbors)
        yield []

    routing_table = graph.mapPartitionsWithIndex(f, preservesPartitioning=True)
    assert(routing_table.count() == 3)
    assert(all(len(i) == 0 for i in routing_table.collect()))
    assert(graph_map.index_counter == 1)
    assert (graph_map.offset_counter == 1)
    assert(len(graph_map.scr_vertex_map) == 3)

The last three assertions fail, since graph_map in my case is not a singleton: a separate instance gets created and modified on each worker. I've tried to follow recommendations from this post to create a singleton for graphmap (in particular, the module approach, the Borg pattern, and the metaclass approach). However, this did not help. I'm not sure what the reason is, but my guess is that it has something to do with pickling. I would really appreciate it if someone could explain what exactly happens here.
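My current understanding is that each Spark worker runs in its own Python process, so the class (and any "singleton" built on top of it) gets re-created in every worker when the task closure is unpickled, and mutations never flow back to the driver. A minimal sketch with multiprocessing standing in for the Spark workers seems to show the same effect:

```python
import multiprocessing as mp


class Singleton:
    # module-level "singleton": in reality, one independent copy per Python process
    state = dict()


def worker(_):
    # runs in a child process: this mutates only that process's copy of the class
    Singleton.state["x"] = 1
    return len(Singleton.state)


def main():
    with mp.Pool(2) as pool:
        results = pool.map(worker, range(2))
    # every worker saw only its own entry; the parent's copy is untouched
    return results, len(Singleton.state)


if __name__ == "__main__":
    print(main())  # ([1, 1], 0)
```

If this is right, no singleton pattern can help here, because there simply is no shared address space between the driver and the workers.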

In the end, I've worked around this issue by storing the state of the graphmap locally in a file: on each partition I first read in the state from the file, and after the modification I write the state back out. As I understand it, this approach will not work if I run Spark in cluster mode.

Is that the best I could do? Are there nicer approaches? Any suggestions are highly appreciated.
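
One approach I've been considering is to have each partition return its data instead of mutating shared state, and to build the map once on the driver after collecting. Here is a sketch of the merge logic, with plain lists standing in for the three RDD partitions (in Spark, `pieces` could presumably be obtained with something like `graph.mapPartitions(lambda it: [list(it)]).collect()`):

```python
# Plain lists stand in for the three RDD partitions from the test above.
partitions = [[(1, [(2, 1.0)])], [(2, [])], [(3, [])]]

# "On each worker": return the partition's pairs instead of mutating shared state.
pieces = [list(p) for p in partitions]

# On the driver: merge everything into one map, mirroring add_vertex.
scr_vertex_map, offsets, lengths, edges = {}, [], [], []
for piece in pieces:
    for vid, neighbors in piece:
        if vid in scr_vertex_map:
            continue  # first write wins, as in add_vertex
        if neighbors:
            scr_vertex_map[vid] = len(offsets)
            offsets.append(len(edges))
            lengths.append(len(neighbors))
            edges.extend(neighbors)
        else:
            scr_vertex_map[vid] = -1  # vertex with no outgoing edges

print(len(scr_vertex_map), offsets, lengths)  # 3 [0] [1]
```

But I'm not sure whether collecting everything to the driver defeats the purpose of partitioning the graph in the first place.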

npobedina
