I am converting Pandas commands into Spark ones, and I got stuck trying to convert this line into Apache Spark code. It replaces every double space in the column names with a single one:
df.columns = df.columns.str.replace('  ', ' ')
Is it possible to replace a string in all column names using Spark? I came up with this, but it is not quite right:
df = df.withColumnRenamed('__', '_')
To be clear, I want to go from this:
+---+-------------+-----+
|id |address__test|state|
+---+-------------+-----+
to this:
+---+------------+-----+
|id |address_test|state|
+---+------------+-----+
CodePudding user response:
You can apply the replace method to all columns by iterating over them and then selecting, like so:
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
|  1|            2|    3|
+---+-------------+-----+
from pyspark.sql.functions import col
new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
|  1|           2|    3|
+---+------------+-----+
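Note that show() only prints the result; to actually keep the renamed columns, assign the selection back:
df = df.select(*new_cols)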
On a side note: each call to withColumnRenamed makes Spark create a separate Projection in the query plan, while a single select creates just one Projection, so for a large number of columns select will be much faster.
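If you would rather not build col expressions, the same single-projection rename can also be done with DataFrame.toDF, which takes the complete list of new column names. A minimal sketch, assuming the same sample df as above:
# Rename every column in one pass; names without "__" are passed through unchanged.
renamed = df.toDF(*[c.replace("__", "_") for c in df.columns])
renamed.show()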
CodePudding user response:
Here's a suggestion.
We get all the target columns:
columns_to_edit = [col for col in df.columns if "__" in col]
Then we use a for loop to edit them all one by one:
for column in columns_to_edit:
    new_column = column.replace("__", "_")
    df = df.withColumnRenamed(column, new_column)
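Putting it together, a small end-to-end sketch; the sample DataFrame below is only an assumption for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample data, just to demonstrate the rename loop.
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
columns_to_edit = [c for c in df.columns if "__" in c]
for column in columns_to_edit:
    df = df.withColumnRenamed(column, column.replace("__", "_"))
df.show()  # the address__test column now shows up as address_test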
Would this solve your issue?