I have a dataframe in PySpark that is the result of a `groupBy` with `agg`, like this:
df1 = df.groupBy(['data', 'id']).pivot('type').agg(F.sum('value').alias("Values"), F.count('value').alias("Quantity"))
But I need to put the aliases ("Values" and "Quantity") as the prefix of these columns, not as the suffix.
This is an example of the dataframe.
Result of my script:
data | id | some_type_Values | some_type_Quantity |
---|---|---|---|
2022-01-01 | 1234 | 12.50 | 2 |
Desired output:
data | id | Values some_type | Quantity some_type |
---|---|---|---|
2022-01-01 | 1234 | 12.50 | 2 |
What I've tried so far:
selected = df1.select([s for s in df1.columns if 'Values' in s])
select_volume = [col(col_name).alias("Values " + col_name) for col_name in selected.columns]
df2 = df1.select(*select_volume)
This works, but it splits my dataframe. And I will also need to cut the `_Values` and `_Quantity` at the end of the columns.
How can I rename the selected columns for each operation and also remove the alias from the end of each one?
CodePudding user response:
Python's `rfind` may be useful.
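For instance (a standalone illustration, not part of the original answer), slicing a pivoted column name around the index returned by `rfind` separates the aggregation alias from the type:
c = 'some_type_1_Values'
i = c.rfind('_')        # index of the last underscore
c[i + 1:]               # 'Values'      (the alias)
c[:i]                   # 'some_type_1' (the pivoted type)
f"{c[i + 1:]} {c[:i]}"  # 'Values some_type_1'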
Example dataframe:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('2022-01-01', 1234, 'some_type_1', 2),
     ('2022-01-01', 1234, 'some_type_2', 3)],
    ['data', 'id', 'type', 'value'])
df1 = df.groupBy(['data', 'id']).pivot('type').agg(F.sum('value').alias("Values"), F.count('value').alias("Quantity"))
df1.show()
# +----------+----+------------------+--------------------+------------------+--------------------+
# |      data|  id|some_type_1_Values|some_type_1_Quantity|some_type_2_Values|some_type_2_Quantity|
# +----------+----+------------------+--------------------+------------------+--------------------+
# |2022-01-01|1234|                 2|                   1|                 3|                   1|
# +----------+----+------------------+--------------------+------------------+--------------------+
Script for renaming:
df1 = df1.select(
    *['data', 'id'],
    *[F.col(c).alias(f"{c[c.rfind('_') + 1:]} {c[:c.rfind('_')]}") for c in df1.columns if c not in ['data', 'id']]
)
df1.show()
# +----------+----+------------------+--------------------+------------------+--------------------+
# |      data|  id|Values some_type_1|Quantity some_type_1|Values some_type_2|Quantity some_type_2|
# +----------+----+------------------+--------------------+------------------+--------------------+
# |2022-01-01|1234|                 2|                   1|                 3|                   1|
# +----------+----+------------------+--------------------+------------------+--------------------+
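An equivalent spelling of the same split (a sketch, not from the original answer, assuming the pivoted `df1` from before the rename) uses Python's `str.rsplit`, which splits on the last underscore in one call:
df1 = df1.select(
    *['data', 'id'],
    *[F.col(c).alias(' '.join(reversed(c.rsplit('_', 1))))  # 'some_type_1_Values' -> 'Values some_type_1'
      for c in df1.columns if c not in ['data', 'id']]
)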
`toDF` is also possible and less verbose, but it can be more error-prone in some cases, since it assigns the new names purely by position.
df1 = df1.toDF(
    *['data', 'id'],
    *[f"{c[c.rfind('_') + 1:]} {c[:c.rfind('_')]}" for c in df1.columns if c not in ['data', 'id']]
)
df1.show()
# +----------+----+------------------+--------------------+------------------+--------------------+
# |      data|  id|Values some_type_1|Quantity some_type_1|Values some_type_2|Quantity some_type_2|
# +----------+----+------------------+--------------------+------------------+--------------------+
# |2022-01-01|1234|                 2|                   1|                 3|                   1|
# +----------+----+------------------+--------------------+------------------+--------------------+
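For completeness, the same rename can also be done column by column with `withColumnRenamed` (a minimal sketch, again assuming the pivoted `df1` from before any renaming). Each call adds a projection to the query plan, so the single `select` above scales better for wide dataframes:
for c in df1.columns:
    if c not in ['data', 'id']:
        i = c.rfind('_')
        df1 = df1.withColumnRenamed(c, f"{c[i + 1:]} {c[:i]}")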