Pyspark replace string in every column name


I am converting Pandas commands into Spark ones, and I got stuck trying to convert this line to Apache Spark code:

This line replaces every double space in the column names with a single one.

df.columns = df.columns.str.replace('  ', ' ')

Is it possible to replace a string in every column name using Spark? I came across this, but it is not quite right.

df = df.withColumnRenamed('--', '-')

To be clear, I want to go from this

// +---+-------------+-----+
// |id |address__test|state|
// +---+-------------+-----+

to this

// +---+------------+-----+
// |id |address_test|state|
// +---+------------+-----+

CodePudding user response:

You can apply the replacement to all column names by iterating over them and then selecting, like so:

df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
|  1|            2|    3|
+---+-------------+-----+

from pyspark.sql.functions import col

new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
|  1|           2|    3|
+---+------------+-----+
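
Note that select returns a new DataFrame rather than renaming in place, so assign the result back if you want to keep working with the new names:

df = df.select(*new_cols)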


On a side note: calling withColumnRenamed makes Spark create a Projection for each call, while select creates just a single Projection, so for a large number of columns select will be much faster.
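
If you want something closer to the original pandas line (collapsing any run of spaces rather than a fixed "__"), the same single-select trick works with an arbitrary rename function. A minimal sketch; the rename_all helper and its name are just an illustration, not an existing API:

import re
from pyspark.sql.functions import col

def rename_all(df, rename):
    # Build one select that aliases every column with the renamed name
    return df.select(*[col(c).alias(rename(c)) for c in df.columns])

# Collapse runs of two or more spaces in column names down to one
df = rename_all(df, lambda name: re.sub(r" {2,}", " ", name))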

CodePudding user response:

Here's a suggestion.

We get all the target columns:

columns_to_edit = [col for col in df.columns if "__" in col]

Then we use a for loop to edit them all one by one:

for column in columns_to_edit:
    new_column = column.replace("__", "_")
    df = df.withColumnRenamed(column, new_column)

Would this solve your issue?
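
As a side note, if you are on Spark 3.4 or newer (worth checking for your version), DataFrame.withColumnsRenamed accepts a whole mapping at once, so the loop can be collapsed into a single call. A sketch under that assumption:

# Requires Spark 3.4+; renames every matching column in one call
mapping = {c: c.replace("__", "_") for c in df.columns if "__" in c}
df = df.withColumnsRenamed(mapping)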
