Below is the original PySpark DataFrame:
+---+-----+-----+-----+
|Idx| Name|Value|Wight|
+---+-----+-----+-----+
|  A|Peter|  123|  0.1|
|  B|Peter|  234|  0.1|
|  B| Mary|  287|  0.1|
|  B|Jason|  243|  0.1|
|  C| Mary|  109|  0.1|
|  D|David|  231|  0.1|
|  D| Sean|  101|  0.1|
|  D|  Tim|  238|  0.1|
|  E|  Ken|  910|  0.1|
|  E|  Jim|  112|  0.1|
+---+-----+-----+-----+
Idx values A and C each match only one name, while B, D, and E each match several names. I want to keep only the rows for idx B, D, and E, and multiply their Value and Wight into a new column named 'Result'.
Below is the desired output:
+---+-----+-----+-----+------+
|Idx| Name|Value|Wight|Result|
+---+-----+-----+-----+------+
|  B|Peter|  234|  0.1|  23.4|
|  B| Mary|  287|  0.1|  28.7|
|  B|Jason|  243|  0.1|  24.3|
|  D|David|  231|  0.1|  23.1|
|  D| Sean|  101|  0.1|  10.1|
|  D|  Tim|  238|  0.1|  23.8|
|  E|  Ken|  910|  0.1|  91.0|
|  E|  Jim|  112|  0.1|  11.2|
+---+-----+-----+-----+------+
Actually, I don't know what this operation is called, so I could not find a solution online. I only want to compute the result for Idx values that map to multiple names.
CodePudding user response:
It sounds like you need to keep only the rows whose Idx value appears more than once, then compute Value * Wight in a new column:
from pyspark.sql.functions import col
...
# Find Idx values that occur more than once, then keep only those rows.
dupes = df.groupBy("Idx").count().filter(col("count") > 1).select("Idx")
df.join(dupes, "Idx").withColumn("Result", col("Value") * col("Wight"))
CodePudding user response:
You can use value_counts
and boolean indexing with apply
in Pandas to get the desired output (this assumes Idx is the DataFrame index):
df1 = df.loc[df.index.value_counts() > 1].copy()  # .copy() avoids SettingWithCopyWarning
df1['Result'] = df1.apply(lambda row: row['Value'] * row['Wight'], axis=1)
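For reference, here is a minimal end-to-end sketch of the Pandas approach, rebuilding the sample data from the question with Idx as the index (and keeping the question's Wight column name). The row-wise apply is replaced by an equivalent vectorized multiply, which is idiomatic and faster:

```python
import pandas as pd

# Rebuild the sample data with Idx as the index.
df = pd.DataFrame(
    {
        "Name": ["Peter", "Peter", "Mary", "Jason", "Mary",
                 "David", "Sean", "Tim", "Ken", "Jim"],
        "Value": [123, 234, 287, 243, 109, 231, 101, 238, 910, 112],
        "Wight": [0.1] * 10,
    },
    index=pd.Index(["A", "B", "B", "B", "C", "D", "D", "D", "E", "E"], name="Idx"),
)

# Keep only rows whose Idx occurs more than once (B, D, E);
# .copy() avoids SettingWithCopyWarning when adding the new column.
df1 = df.loc[df.index.value_counts() > 1].copy()

# Vectorized equivalent of the row-wise apply above.
df1["Result"] = df1["Value"] * df1["Wight"]

print(df1)
```

The boolean Series returned by `df.index.value_counts() > 1` is aligned to the (duplicated) index by `.loc`, so each row is kept or dropped according to how often its Idx value occurs.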