
I have a DataFrame with document ids (doc_id), line ids for the set of lines in each document (line_id), and a dense vector representation of each line (vectors). For each document (doc_id), I want to convert the set of vectors representing its lines into a mllib.linalg.distributed.BlockMatrix.

It is relatively straightforward to convert the vectors of the entire DataFrame, or of the DataFrame filtered by doc_id, into a BlockMatrix by first converting the vectors column into an RDD of ((blockRowIndex, blockColIndex), DenseMatrix) tuples. A coded example of that is below.

However, I am having trouble converting the RDD of lists of ((blockRowIndex, blockColIndex), DenseMatrix) tuples returned by mapPartitions, which converted the vectors column for each doc_id partition, into a separate BlockMatrix per doc_id partition.

My cluster has 3 worker nodes with 16 cores and 62 GB of memory each.


Imports and start Spark

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import Window as W
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg import VectorUDT
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg import MatrixUDT
from pyspark.mllib.linalg.distributed import BlockMatrix

spark = (
    SparkSession.builder
    .master('yarn')
    .appName("linalg_test")
    .getOrCreate()
)
sc = spark.sparkContext

Create test dataframe

nRows = 25000

""" Create ids dataframe """
win = (W
    .partitionBy(F.col('doc_id'))    
    .rowsBetween(W.unboundedPreceding, W.currentRow)
)

df_ids = (
    spark.range(0, nRows, 1)
    .withColumn('rand1', (F.rand(seed=12345) * 50).cast(T.IntegerType()))
    .withColumn('doc_id', F.floor(F.col('rand1')/3).cast(T.IntegerType()) )
    .withColumn('int', F.lit(1))
    .withColumn('line_id', F.sum(F.col('int')).over(win))
    .select('id', 'doc_id', 'line_id')
)

""" Create vector dataframe """
df_vecSchema = T.StructType([
    T.StructField('vectors', T.StructType([T.StructField('vectors', VectorUDT())] ) ), 
    T.StructField('id', T.LongType()) 
])

vecDim = 50
df_vec = (
    spark.createDataFrame(
        RandomRDDs.normalVectorRDD(sc, numRows=nRows, numCols=vecDim, seed=54321)
        .map(lambda x: Row(vectors=Vectors.dense(x),))
        .zipWithIndex(), schema=df_vecSchema)
    .select('id', 'vectors.*')
)

""" Create final test dataframe """
df_SO = (
    df_ids.join(df_vec, on='id', how='left')
    .select('doc_id', 'line_id', 'vectors')
    .orderBy('doc_id', 'line_id')
)

numDocs = df_SO.agg(F.countDistinct(F.col('doc_id'))).collect()[0][0]
# numDocs = df_SO.groupBy('doc_id').agg(F.count(F.col('line_id'))).count()

df_SO = df_SO.repartition(numDocs, 'doc_id')
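
To see how the rows are actually distributed, one can count the distinct doc_ids that land in each partition (a check I added for illustration; glom collects each partition into a single Python list):

""" Hypothetical check: number of distinct doc_ids per partition """
(df_SO.rdd
    .map(lambda row: row.doc_id)
    .glom()
    .map(lambda ids: len(set(ids)))
    .collect())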

RDD function to create matrix blocks out of the vectors column

def vec2mat(row):
    """ Map one row to a ((blockRowIndex, blockColIndex), DenseMatrix) block tuple """
    return (
        (row.line_id-1, 0),
        Matrices.dense(1, vecDim, (row.vectors.toArray().tolist())), )

Create a dense matrix out of each line_id vector

mat = df_SO.rdd.map(vec2mat)

Create a distributed BlockMatrix from the RDD of DenseMatrix blocks

blk_mat = BlockMatrix(mat, 1, vecDim)

Check output

blk_mat
<pyspark.mllib.linalg.distributed.BlockMatrix at 0x7fe1da370a50>
blk_mat.blocks.take(1)
[((273, 0),
  DenseMatrix(1, 50, [1.749, -1.4873, -0.3473, 0.716, 2.3916, -1.5997, -1.7035, 0.0105, ..., -0.0579, 0.3074, -1.8178, -0.2628, 0.1979, 0.6046, 0.4566, 0.4063], 0))]
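
The dimensions can also be checked with the existing BlockMatrix methods numRows() and numCols(), which derive the matrix size from the block indices (an added check for illustration):

blk_mat.numRows(), blk_mat.numCols()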

Problem

I cannot get the same thing to work after converting each doc_id partition with mapPartitions. The mapPartitions function itself works, but I cannot get the RDD it returns converted into a BlockMatrix.

RDD function to create a dense matrix out of each line_id vector separately for each doc_id partition

def vec2mat_p(iter):
    """ Yield one list of ((blockRowIndex, blockColIndex), DenseMatrix) tuples per partition """
    yield [((row.line_id-1, 0),
            Matrices.dense(1, vecDim, (row.vectors.toArray().tolist())), )
        for row in iter]

Create a dense matrix out of each line_id vector separately for each doc_id partition

mat_doc = df_SO.rdd.mapPartitions(vec2mat_p, preservesPartitioning=True)

Check

mat_doc 
PythonRDD[4991] at RDD at PythonRDD.scala:48
mat_doc.take(1)
[[((0, 0),
   DenseMatrix(1, 50, [1.814, -1.1681, -2.1887, -0.5371, -0.7509, 2.3679, 0.2795, 1.4135, ..., -0.3584, 0.5059, -0.6429, -0.6391, 0.0173, 1.2109, 1.804, -0.9402], 0)),
  ((1, 0),
   DenseMatrix(1, 50, [0.3884, -1.451, -0.0431, -0.4653, -2.4541, 0.2396, 1.8704, 0.8471, ..., -2.5164, 0.1298, -1.2702, -0.1286, 0.9196, -0.7354, -0.1816, -0.4553], 0)),
  ((2, 0),
   DenseMatrix(1, 50, [0.1382, 1.6753, 0.9563, -1.5251, 0.1753, 0.9822, 0.5952, -1.3924, ..., 0.9636, -1.7299, 0.2138, -2.5694, 0.1701, 0.2554, -1.4879, -1.6504], 0)),
  ...]]

Check types

(mat_doc 
    .filter(lambda p: len(p) > 0)
    .map(lambda mlst: [(type(m[0]), (type(m[0][0]),type(m[0][1])), type(m[1])) for m in mlst] )
    .first()
)
[(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
 (tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
 (tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
 ...]

Seems correct; however, running:

(mat_doc 
    .filter(lambda p: len(p) > 0)
    .map(lambda mlst: [BlockMatrix((m[0], m[1])[0], 1, vecDim) for m in mlst] )
    .first()
)

results in the following type error:

TypeError: blocks should be an RDD of sub-matrix blocks as ((int, int), matrix) tuples, got 

Unfortunately, the error message stops short and does not tell me what it 'got'.
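
Checking the element type directly (an added diagnostic sketch) shows what it likely got: each element of mat_doc is a Python list of block tuples, not a ((int, int), matrix) tuple itself:

(mat_doc
    .filter(lambda p: len(p) > 0)
    .map(lambda p: type(p).__name__)
    .first()
)
'list'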

Also, I cannot call sc.parallelize() inside of a map() call, since the SparkContext is only available on the driver.

How do I convert each item in the RDD that mapPartitions returns into an RDD that BlockMatrix will accept?
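
For reference, the only workaround I can think of is a driver-side loop that filters by doc_id and builds one BlockMatrix per document (a sketch reusing df_SO and vec2mat from above; it avoids nested RDDs but launches one Spark job per document):

""" Hypothetical workaround: one BlockMatrix per doc_id via filtering """
doc_ids = [r.doc_id for r in df_SO.select('doc_id').distinct().collect()]
blk_mats = {
    d: BlockMatrix(
        df_SO.filter(F.col('doc_id') == d).rdd.map(vec2mat),
        1, vecDim)
    for d in doc_ids
}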

  • That's not going to work. You cannot have nested RDDs, and BlockMatrix is just a wrapper around an RDD. – user10938362 Jun 08 '19 at 20:36
  • I don't want nested RDDs. I want to make a separate BlockMatrix for each partition. So is the only way to filter the DataFrame or RDD into separate variables and convert each one separately? – Clay Jun 08 '19 at 20:53
  • `BlockMatrix` is an `RDD`, so yeah. Additionally, it seems like you misunderstood how [hash partitioner (`repartition`) works](https://stackoverflow.com/q/31424396/10958683). Each partition can contain multiple `doc_ids` (or none), not a single one. – user10938362 Jun 08 '19 at 21:16
