Could you please help me replace column values in a Spark DataFrame?
data = [["1", "xxx", "company 0"],
["2", "xxx", "company 1"],
["3", "company 44", "company 2"],
["4", "xxx", "company 1"],
["5", "bobby", "company 1"]]
dataframe = spark.createDataFrame(data)
I am trying to replace "company" with "cmp". "company" can appear in different columns.
CodePudding user response:
Because "company" may appear in any column, you'd have to loop through each column and apply regexp_replace to each of them:
from pyspark.sql import functions as F

# Apply the same regexp_replace to every column of the DataFrame
cols = dataframe.columns
for c in cols:
    dataframe = dataframe.withColumn(c, F.regexp_replace(c, 'company', 'cmp'))

dataframe.show()
+---+------+-----+
| _1|    _2|   _3|
+---+------+-----+
|  1|   xxx|cmp 0|
|  2|   xxx|cmp 1|
|  3|cmp 44|cmp 2|
|  4|   xxx|cmp 1|
|  5| bobby|cmp 1|
+---+------+-----+
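Side note: if the text can also show up capitalized (the question mentions "Company"), one option is a case-insensitive match. This is a minimal sketch, assuming the same dataframe as above, using the (?i) flag that regexp_replace's Java regex engine understands:

from pyspark.sql import functions as F

# (?i) makes the pattern case-insensitive, so "company" and "Company" are both replaced
for c in dataframe.columns:
    dataframe = dataframe.withColumn(c, F.regexp_replace(c, '(?i)company', 'cmp'))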
CodePudding user response:
Another way: put the columns into an array, use the higher-order function transform to do the regexp_replace, posexplode, and then pivot.
df
data = [["1", "xxx", "company 0"],
["2", "xxx", "company 1"],
["3", "company 44", "company 2"],
["4", "xxx", "company 1"],
["5", "bobby", "company 1"]]
df = spark.createDataFrame(data, ('col1','col2','col3'))
Solution
from pyspark.sql import functions as F
from pyspark.sql.functions import array, expr

df1 = (df.withColumn('col4', array('col2', 'col3'))  # put the columns that may contain "company" into an array
       .withColumn('col4', expr("transform(col4, x -> regexp_replace(x, '(company)', 'cmp'))"))  # higher-order function + regex to replace
       .select('col1', F.posexplode('col4').alias("pos", "val"))  # explode, keeping each element's position in the array
       .groupBy("col1").pivot("pos").agg(F.first("val"))  # pivot back to one column per position
      )
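Note that after the pivot the replaced columns come back named by their array position ("0", "1") rather than by their original names. If the original names are wanted, something like the following (a positional rename with toDF, my own addition) restores them:

# Rename the pivoted columns positionally back to the original names
df1 = df1.toDF('col1', 'col2', 'col3')
df1.show()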
CodePudding user response:
A functional programming approach, folding over the columns with reduce:
from functools import reduce
from pyspark.sql import functions as F

cols = dataframe.columns
# Fold over the columns, applying regexp_replace to each one in turn
reduce(lambda dataframe, c: dataframe.withColumn(c, F.regexp_replace(c, 'company', 'cmp')), cols, dataframe).show()
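For what it's worth, the same replacement can also be done in a single projection instead of chaining withColumn calls; a minimal sketch, assuming the same dataframe as in the question:

from pyspark.sql import functions as F

# Build all replaced columns in one select, keeping the original column names
dataframe.select(
    [F.regexp_replace(c, 'company', 'cmp').alias(c) for c in dataframe.columns]
).show()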