Home > Mobile >  summing multiple columns values which has different column names by using pattern matching of column
summing multiple columns values which has different column names by using pattern matching of column

Time:11-26

for example: I have a dataframe with columns like

lens plain-prod 102 plain-prod 105 plain-prod 107
First 1 3 4
Second 2 5 3
First 3 7 2
Second 4 8 1

so i need to do a pattern matching (^plain-prod.*) and pick up all 3 columns matching that and create new column plain_sum having the sum how to achieve this using pyspark or pandas.

lens plain-prod 102 plain-prod 105 plain-prod 107 plain_sum
First 1 3 4 8
Second 2 5 3 10
First 3 7 2 12
Second 4 8 1 13

CodePudding user response:

Try this approach with Pandas (df is your data frame):

df['plain_sum'] = df.filter(regex='^plain-prod.*').sum(axis=1)

CodePudding user response:

You could do something like this in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [
    {"lens": "First", "plain-prod 102": 1, "plain-prod 105": 3, "plain-prod 107": 4},
    {"lens": "Second", "plain-prod 102": 2, "plain-prod 105": 3, "plain-prod 107": 4},
]
df = spark.createDataFrame(data)
df = df.withColumn(
    "plain_sum", sum([F.col(x) for x in df.columns if "plain-prod" in x])
)

Result:

 ------ -------------- -------------- -------------- ---------                  
|lens  |plain-prod 102|plain-prod 105|plain-prod 107|plain_sum|
 ------ -------------- -------------- -------------- --------- 
|First |1             |3             |4             |8        |
|Second|2             |3             |4             |9        |
 ------ -------------- -------------- -------------- --------- 
  • Related