To match everything from the beginning of the string up to and including <BR>, you could use the pattern .*<BR>. However, by default . does not match line breaks (\n). The inline flag (?s) (DOTALL mode) makes . match line breaks as well, so our pattern could be (?s).*<BR>. A working example is given below, hope this helps!
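import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session if there is one (e.g. in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()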
df = spark.createDataFrame([('''"_row"\n"<BR>Datetime:2018.06.30^
Name:ABC^
Se:4^
Machine:XXXXXXX^
InnerTrace:^
AdditionalInfo:^
<ER>''',), ],schema=['text'])
# (?s) lets "." match newlines, so everything up to <BR> is replaced by <BR>
df = df.withColumn('text_cleaned',
    F.regexp_replace(F.col('text'), pattern='(?s).*<BR>', replacement='<BR>'))
Let's verify that this worked:
print(df.select('text').collect()[0][0])
outputs:
"_row"
"<BR>Datetime:2018.06.30^
Name:ABC^
Se:4^
Machine:XXXXXXX^
InnerTrace:^
AdditionalInfo:^
<ER>
and
print(df.select('text_cleaned').collect()[0][0])
outputs:
<BR>Datetime:2018.06.30^
Name:ABC^
Se:4^
Machine:XXXXXXX^
InnerTrace:^
AdditionalInfo:^
<ER>
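To see what the (?s) flag changes without starting Spark, here is a minimal sketch using Python's built-in re module. This is an assumption for illustration: Spark's regexp_replace uses Java regular expressions rather than Python's, but (?s) means DOTALL in both. The sample strings are shortened, made-up versions of the data above. The sketch also shows that .* is greedy, so with several <BR> markers it removes everything up to the last one; a lazy, anchored variant (\A.*?<BR>) stops at the first.

```python
import re

# Shortened, hypothetical sample resembling the data above
text = '"_row"\n"<BR>Datetime:2018.06.30^\nName:ABC^\n<ER>'

# Without (?s), "." stops at line breaks, so the "_row" line is left in place
no_dotall = re.sub(r'.*<BR>', '<BR>', text)

# With (?s), "." also matches "\n", so everything before <BR> is removed
with_dotall = re.sub(r'(?s).*<BR>', '<BR>', text)

# Greedy vs. lazy when the marker occurs more than once
two_markers = 'a\n<BR>x\n<BR>y'
greedy = re.sub(r'(?s).*<BR>', '<BR>', two_markers)    # cuts up to the last <BR>
lazy = re.sub(r'(?s)\A.*?<BR>', '<BR>', two_markers)   # cuts up to the first <BR>

print(repr(no_dotall))
print(repr(with_dotall))
print(repr(greedy))
print(repr(lazy))
```

If the text can contain more than one <BR> and only the part before the first one should go, the lazy pattern is probably the safer choice.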