summing multiple columns values which has different column names by using pattern matching of column-CodePudding

for example: I have a dataframe with columns like

lens	plain-prod 102	plain-prod 105	plain-prod 107
First	1	3	4
Second	2	5	3
First	3	7	2
Second	4	8	1

so i need to do a pattern matching (^plain-prod.*) and pick up all 3 columns matching that and create new column plain_sum having the sum how to achieve this using pyspark or pandas.

lens	plain-prod 102	plain-prod 105	plain-prod 107	plain_sum
First	1	3	4	8
Second	2	5	3	10
First	3	7	2	12
Second	4	8	1	13

CodePudding user response：

Try this approach with Pandas (df is your data frame):

df['plain_sum'] = df.filter(regex='^plain-prod.*').sum(axis=1)

CodePudding user response：

You could do something like this in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [
    {"lens": "First", "plain-prod 102": 1, "plain-prod 105": 3, "plain-prod 107": 4},
    {"lens": "Second", "plain-prod 102": 2, "plain-prod 105": 3, "plain-prod 107": 4},
]
df = spark.createDataFrame(data)
df = df.withColumn(
    "plain_sum", sum([F.col(x) for x in df.columns if "plain-prod" in x])
)

Result:

 ------ -------------- -------------- -------------- ---------                  
|lens  |plain-prod 102|plain-prod 105|plain-prod 107|plain_sum|
 ------ -------------- -------------- -------------- --------- 
|First |1             |3             |4             |8        |
|Second|2             |3             |4             |9        |
 ------ -------------- -------------- -------------- ---------