
I need to remove a single quote from a string in the Keywords column. The value is an array wrapped in a string, so I want to use a regex on a Spark DataFrame to remove the single quote at the beginning and at the end of the string. The string looks like this:

Keywords=
'
  [
      "shade perennials"," shade loving perennials"," perennial plants"," perennials"," perennial flowers"," perennial plants for shade"," full shade perennials"
  ]
'

I have tried the following:

remove_single_quote = udf(lambda x: x.replace(u"'",""))
cleaned_df = spark_df.withColumn('Keywords', remove_single_quote('Keywords'))

But the single quote is still there. I have also tried (u"\'","").

Mashiro
Sonya

2 Answers

from pyspark.sql.functions import regexp_replace

# spark_df is the DataFrame from the question; this removes every single quote in the column
new_df = spark_df.withColumn('Keywords', regexp_replace('Keywords', "'", ""))
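Note that replacing "'" removes every single quote in the value, not only the wrapping pair. If a keyword could itself contain an apostrophe, an anchored pattern may be safer. Below is a minimal sketch using Python's re module to show what such a pattern matches; the helper name strip_edge_quotes and the "gardener's choice" keyword are made up for illustration. The pattern uses only basic constructs, so passing the same pattern to regexp_replace should behave the same way, but that has not been verified here.

```python
import re

# Matches a single quote at the very start or very end of the string,
# allowing whitespace between the quote and the string boundary.
EDGE_QUOTES = re.compile(r"^\s*'|'\s*$")


def strip_edge_quotes(value):
    """Remove the wrapping single quotes but keep any quotes inside the value."""
    return EDGE_QUOTES.sub("", value)


# Hypothetical sample shaped like the value in the question.
sample = "'\n  [\"shade perennials\",\" gardener's choice\"]\n'"
cleaned = strip_edge_quotes(sample)
print(cleaned)
```

The apostrophe inside "gardener's" is left alone because it sits at neither end of the string.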
Ghost

Try regexp_replace

from pyspark.sql.functions import regexp_replace

cleaned_df = spark_df.withColumn('Keywords', regexp_replace('Keywords', "'", ""))

OR

from pyspark.sql import functions as f

cleaned_df = spark_df.withColumn('Keywords', f.regexp_replace('Keywords', "'", ""))

Alternatively, you could use ast.literal_eval. I have not tested it, but it should work:

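import ast
from pyspark.sql.functions import udf

# literal_eval must run on each row's value, so wrap it in a UDF.
# Assumes every value is a valid single-line Python string literal.
strip_quotes = udf(lambda s: ast.literal_eval(s))
cleaned_df = spark_df.withColumn('Keywords', strip_quotes('Keywords'))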
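One caveat: ast.literal_eval expects a complete Python literal, and a single-quoted string that spans several lines, like the value in the question, is not one. Since the inner text looks like a JSON array, another option is to strip the wrapping quotes and parse the rest with json.loads. The following is a plain-Python sketch of that idea; the function name parse_keywords is hypothetical, and it assumes the inner text is always valid JSON. It could be wrapped in a UDF with an array return type, which is not shown here.

```python
import json


def parse_keywords(raw):
    """Strip whitespace and the wrapping single quotes, then parse the JSON array."""
    text = raw.strip()
    # Drop the outer quotes only when both are present.
    if len(text) >= 2 and text[0] == "'" and text[-1] == "'":
        text = text[1:-1]
    # Trim the stray leading spaces that some keywords carry.
    return [item.strip() for item in json.loads(text)]


# Sample shaped like the multi-line value in the question (shortened).
raw = "'\n  [\n      \"shade perennials\",\" shade loving perennials\"\n  ]\n'"
keywords = parse_keywords(raw)
print(keywords)
```

This returns a real list rather than a string, which may be closer to what is needed if the keywords are processed individually later.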

Please refer

Strick