Home > Blockchain >  Replace a column value in the spark DataFrame
Replace a column value in the spark DataFrame

Time:03-02

could you please help me to replace column values in dataframes spark:

data = [["1", "xxx", "company 0"],
        ["2", "xxx", "company 1"],
        ["3", "company 44", "company 2"],
        ["4", "xxx", "company 1"],
        ["5", "bobby", "company 1"]]


dataframe = spark.createDataFrame(data)

I am trying to replace "company" with "cmp". "Company" can be met in different columns.

CodePudding user response:

Because the "Company" may appear in any columns, you'd have to loop through each column and apply regex_replace onto each of them:

from pyspark.sql import functions as F

cols = dataframe.columns

for c in cols:
    dataframe = dataframe.withColumn(c, F.regexp_replace(c, 'company', 'cmp'))

 --- ------ ----- 
| _1|    _2|   _3|
 --- ------ ----- 
|  1|   xxx|cmp 0|
|  2|   xxx|cmp 1|
|  3|cmp 44|cmp 2|
|  4|   xxx|cmp 1|
|  5| bobby|cmp 1|
 --- ------ ----- 

CodePudding user response:

Another way: Array, higher order functions; transform replace, psition explode and then pivot.

df

data = [["1", "xxx", "company 0"],
        ["2", "xxx", "company 1"],
        ["3", "company 44", "company 2"],
        ["4", "xxx", "company 1"],
        ["5", "bobby", "company 1"]]


df = spark.createDataFrame(data, ('col1','col2','col3'))

Solution

df1 =(df.withColumn('col4', array('col2','col3'))# array all columns with company
      .withColumn('col4', expr("transform(col4, x -> regexp_replace(x,'(company)','cmp'))"))#Use high order functions and regex to replace
      .select('col1', F.posexplode('col4').alias("pos", "val"))#Explode assigning index value to each array element
      .groupBy("col1").pivot("pos").agg(F.first("val"))#pivot
     
     )

CodePudding user response:

functional programming approach

from functools import reduce
from pyspark.sql import functions as F
cols = dataframe.columns
reduce(lambda dataframe, c: dataframe.withColumn(c, F.regexp_replace(c, 'company', 'cmp')), cols, dataframe).show()
  • Related