I have the following Python list:
lst = ['name', 'age', 'country']
The Spark DataFrame is below:
column_a
name Xxxx, age 23, country aaaa
name yyyy, age 25, country bbbb
I need to compare the list with the DataFrame's string column and remove every value in the list from that column.
Expected output is:
column_a
Xxxx, 23, aaaa
yyyy, 25, bbbb
CodePudding user response:
You can use regexp_replace with '|'.join(). The first replaces every substring that matches a regular expression. The second joins the elements of the list with |, which a regular expression treats as alternation ("match any one of these"). Combining the two removes every part of the column that appears in your list.
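import pyspark.sql.functions as F
# Match each list entry as a whole word plus trailing whitespace, so no stray spaces are left behind
pattern = r'\b(?:' + '|'.join(lst) + r')\b\s*'
df = df.withColumn('column_a', F.regexp_replace('column_a', pattern, ''))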
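To see what such a pattern does without starting a Spark session, here is a minimal sketch using Python's built-in re module on the two sample rows from the question. It is only an illustration: Spark evaluates regexp_replace with Java regular expressions, which should treat \b, \s and (?:...) the same way for input like this, but that is an assumption worth checking on your own data. The names rows, alternation and cleaned are made up for this example, and the re.escape step is an extra precaution in case the list ever contains regex metacharacters.

```python
import re

lst = ['name', 'age', 'country']
rows = ['name Xxxx, age 23, country aaaa',
        'name yyyy, age 25, country bbbb']

# Escape each entry so regex metacharacters in the list are matched literally
alternation = '|'.join(re.escape(word) for word in lst)

# \b keeps 'age' from matching inside a longer word such as 'manage';
# \s* also consumes the whitespace that follows the removed word
pattern = r'\b(?:' + alternation + r')\b\s*'

cleaned = [re.sub(pattern, '', row) for row in rows]
print(cleaned)  # ['Xxxx, 23, aaaa', 'yyyy, 25, bbbb']
```

Joining the list with | alone would remove the words but keep the space after each one, so the result would not match the expected output exactly. The word boundaries also protect values that merely contain one of the list entries.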