
I have a record that looks like this:

"_row"\n"<BR>Datetime:2018.06.30^
Name:ABC^
Se:4^
Machine:XXXXXXX^
InnerTrace:^
AdditionalInfo:^
<ER>

I want to remove everything before <BR> in each record. Is there an easy way to do this with a Spark DataFrame?

import pyspark.sql.functions as f

data.select(f.regexp_replace("row", pattern='\n<BR>', replacement="<BR>"))

something like this? What should the pattern be?

Florian
Gayatri

1 Answer


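To match everything from the beginning of the string up to <BR>, you could use .*<BR>. However, by default . does not match line breaks (\n). The (?s) flag makes . match line breaks as well, so our pattern could be (?s).*<BR>. Note that .* is greedy, so if a record contained several <BR> markers, everything up to the last one would be removed. A working example is given below, hope this helps!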

import pyspark.sql.functions as F

df = spark.createDataFrame([('''"_row"\n"<BR>Datetime:2018.06.30^
Name:ABC^
Se:4^
Machine:XXXXXXX^
InnerTrace:^
AdditionalInfo:^
<ER>''',), ],schema=['text'])

df = df.withColumn('text_cleaned',
               F.regexp_replace(F.col('text'),pattern='(?s).*<BR>',replacement="<BR>"))

Let's verify that it worked:

print(df.select('text').collect()[0][0])

outputs:

"_row"
"<BR>Datetime:2018.06.30^
Name:ABC^
Se:4^
Machine:XXXXXXX^
InnerTrace:^
AdditionalInfo:^
<ER>

and

print(df.select('text_cleaned').collect()[0][0])

outputs:

<BR>Datetime:2018.06.30^
Name:ABC^
Se:4^
Machine:XXXXXXX^
InnerTrace:^
AdditionalInfo:^
<ER>
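
To see why the (?s) flag matters, the same patterns can be tried outside Spark with Python's re module. This is only a sketch: Spark uses Java regular expressions, and the assumption here is that Python's re behaves the same way for these simple patterns. The shortened record and the two-marker string are made up for illustration, and the anchored lazy variant (?s)^.*?<BR> is an alternative I am suggesting, not part of the pattern above.

```python
import re

# Shortened version of the record from the question (illustrative only).
record = '"_row"\n"<BR>Datetime:2018.06.30^\nName:ABC^\n<ER>'

# Without (?s), "." stops at line breaks, so the match cannot start on the
# first line: only the quote right before <BR> is removed and "_row" survives.
without_dotall = re.sub(r'.*<BR>', '<BR>', record)

# With (?s), "." also matches "\n", so everything before <BR> is removed.
with_dotall = re.sub(r'(?s).*<BR>', '<BR>', record)

# Hypothetical record containing two <BR> markers.
two_markers = 'junk\n<BR>first^\n<BR>second^\n<ER>'

# Greedy ".*" runs up to the LAST <BR>, dropping the first block entirely.
greedy = re.sub(r'(?s).*<BR>', '<BR>', two_markers)

# Anchored, lazy ".*?" stops at the FIRST <BR>; "^" keeps the pattern from
# matching again further into the string.
lazy = re.sub(r'(?s)^.*?<BR>', '<BR>', two_markers)

print(without_dotall)
print(with_dotall)
print(greedy)
print(lazy)
```

If records can contain more than one <BR>, the anchored lazy pattern may be the safer choice to pass to F.regexp_replace, but it is worth checking on a few real rows first.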
Florian