For example, I have a DataFrame with columns like:
lens | plain-prod 102 | plain-prod 105 | plain-prod 107 |
---|---|---|---|
First | 1 | 3 | 4 |
Second | 2 | 5 | 3 |
First | 3 | 7 | 2 |
Second | 4 | 8 | 1 |
I need to do pattern matching (`^plain-prod.*`), pick up all three columns matching that pattern, and create a new column `plain_sum` holding their sum. How can I achieve this using PySpark or pandas? Expected output:
lens | plain-prod 102 | plain-prod 105 | plain-prod 107 | plain_sum |
---|---|---|---|---|
First | 1 | 3 | 4 | 8 |
Second | 2 | 5 | 3 | 10 |
First | 3 | 7 | 2 | 12 |
Second | 4 | 8 | 1 | 13 |
CodePudding user response:
Try this approach with pandas (`df` is your DataFrame):
df['plain_sum'] = df.filter(regex='^plain-prod.*').sum(axis=1)
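Put together as a runnable sketch (the sample frame below just mirrors the question's data):

```python
import pandas as pd

# sample frame matching the question's table
df = pd.DataFrame({
    "lens": ["First", "Second", "First", "Second"],
    "plain-prod 102": [1, 2, 3, 4],
    "plain-prod 105": [3, 5, 7, 8],
    "plain-prod 107": [4, 3, 2, 1],
})

# DataFrame.filter(regex=...) selects the columns whose names match the
# pattern; sum(axis=1) then adds them row-wise.
df["plain_sum"] = df.filter(regex="^plain-prod").sum(axis=1)
print(df["plain_sum"].tolist())  # → [8, 10, 12, 13]
```

Note that `filter` here selects columns by label, not rows, so the `lens` column is simply left out of the sum.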
CodePudding user response:
You could do something like this in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    {"lens": "First", "plain-prod 102": 1, "plain-prod 105": 3, "plain-prod 107": 4},
    {"lens": "Second", "plain-prod 102": 2, "plain-prod 105": 3, "plain-prod 107": 4},
]
df = spark.createDataFrame(data)

# Python's builtin sum() works here because Column defines __add__/__radd__,
# so the matching columns are added element-wise into one Column expression.
df = df.withColumn(
    "plain_sum", sum([F.col(x) for x in df.columns if "plain-prod" in x])
)
Result:
+------+--------------+--------------+--------------+---------+
|lens  |plain-prod 102|plain-prod 105|plain-prod 107|plain_sum|
+------+--------------+--------------+--------------+---------+
|First |1             |3             |4             |8        |
|Second|2             |3             |4             |9        |
+------+--------------+--------------+--------------+---------+
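The substring check `"plain-prod" in x` also matches a column like `"my plain-prod"`. If you want the anchored `^plain-prod` pattern from the question, you can filter the column names with Python's `re` module first (a sketch; the extra `"plainx prod"` name below is just an assumed counter-example):

```python
import re

# column names as they would appear in df.columns
columns = ["lens", "plain-prod 102", "plain-prod 105", "plain-prod 107", "plainx prod"]

# re.match anchors at the start of the string, so this is equivalent
# to the ^plain-prod pattern from the question
plain_cols = [c for c in columns if re.match(r"plain-prod", c)]
print(plain_cols)  # → ['plain-prod 102', 'plain-prod 105', 'plain-prod 107']

# the filtered list can then be summed exactly as above:
# df = df.withColumn("plain_sum", sum(F.col(c) for c in plain_cols))
```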