Could you please help me replace column values in a Spark DataFrame?
data = [["1", "xxx", "company 0"],
["2", "xxx", "company 1"],
["3", "company 44", "company 2"],
["4", "xxx", "company 1"],
["5", "bobby", "company 1"]]
dataframe = spark.createDataFrame(data)
I am trying to replace "company" with "cmp". "company" can appear in different columns.
CodePudding user response:
Because "company" may appear in any column, you'd have to loop through each column and apply regexp_replace to each of them:
from pyspark.sql import functions as F

# Apply the same regexp_replace to every column of the DataFrame
cols = dataframe.columns
for c in cols:
    dataframe = dataframe.withColumn(c, F.regexp_replace(c, 'company', 'cmp'))

dataframe.show()
+---+------+-----+
| _1|    _2|   _3|
+---+------+-----+
|  1|   xxx|cmp 0|
|  2|   xxx|cmp 1|
|  3|cmp 44|cmp 2|
|  4|   xxx|cmp 1|
|  5| bobby|cmp 1|
+---+------+-----+
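Side note: if the text can also show up capitalized (the question mentions "Company"), one option is a case-insensitive match. This is a minimal sketch, assuming the same dataframe as above, using the (?i) flag that regexp_replace's Java regex engine understands:

from pyspark.sql import functions as F

# (?i) makes the pattern case-insensitive, so "company" and "Company" are both replaced
for c in dataframe.columns:
    dataframe = dataframe.withColumn(c, F.regexp_replace(c, '(?i)company', 'cmp'))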
CodePudding user response:
Another way: put the columns into an array, use the higher-order function transform to do the regexp_replace, posexplode, and then pivot.
df
data = [["1", "xxx", "company 0"],
["2", "xxx", "company 1"],
["3", "company 44", "company 2"],
["4", "xxx", "company 1"],
["5", "bobby", "company 1"]]
df = spark.createDataFrame(data, ('col1','col2','col3'))
Solution
from pyspark.sql import functions as F
from pyspark.sql.functions import array, expr

df1 = (df.withColumn('col4', array('col2', 'col3'))  # put the columns that may contain "company" into an array
       .withColumn('col4', expr("transform(col4, x -> regexp_replace(x, '(company)', 'cmp'))"))  # higher-order function + regex to replace
       .select('col1', F.posexplode('col4').alias("pos", "val"))  # explode, keeping each element's position in the array
       .groupBy("col1").pivot("pos").agg(F.first("val"))  # pivot back to one column per position
      )
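Note that after the pivot the replaced columns come back named by their array position ("0", "1") rather than by their original names. If the original names are wanted, something like the following (a positional rename with toDF, my own addition) restores them:

# Rename the pivoted columns positionally back to the original names
df1 = df1.toDF('col1', 'col2', 'col3')
df1.show()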
CodePudding user response:
A functional programming approach, folding over the columns with reduce:
from functools import reduce
from pyspark.sql import functions as F

cols = dataframe.columns
# Fold over the columns, applying regexp_replace to each one in turn
reduce(lambda dataframe, c: dataframe.withColumn(c, F.regexp_replace(c, 'company', 'cmp')), cols, dataframe).show()
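For what it's worth, the same replacement can also be done in a single projection instead of chaining withColumn calls; a minimal sketch, assuming the same dataframe as in the question:

from pyspark.sql import functions as F

# Build all replaced columns in one select, keeping the original column names
dataframe.select(
    [F.regexp_replace(c, 'company', 'cmp').alias(c) for c in dataframe.columns]
).show()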