Below is the original PySpark DataFrame:
+---+-----+-----+-----+
|Idx| Name|Value|Wight|
+---+-----+-----+-----+
|  A|Peter|  123|  0.1|
|  B|Peter|  234|  0.1|
|  B| Mary|  287|  0.1|
|  B|Jason|  243|  0.1|
|  C| Mary|  109|  0.1|
|  D|David|  231|  0.1|
|  D| Sean|  101|  0.1|
|  D|  Tim|  238|  0.1|
|  E|  Ken|  910|  0.1|
|  E|  Jim|  112|  0.1|
+---+-----+-----+-----+
Idx values A and C each match only one name, while B, D, and E each match several names. I want to keep only the rows for idx B, D, and E, and multiply their Value and Wight into a new column named 'Result'.
Below is the desired output:
+---+-----+-----+-----+------+
|Idx| Name|Value|Wight|Result|
+---+-----+-----+-----+------+
|  B|Peter|  234|  0.1|  23.4|
|  B| Mary|  287|  0.1|  28.7|
|  B|Jason|  243|  0.1|  24.3|
|  D|David|  231|  0.1|  23.1|
|  D| Sean|  101|  0.1|  10.1|
|  D|  Tim|  238|  0.1|  23.8|
|  E|  Ken|  910|  0.1|  91.0|
|  E|  Jim|  112|  0.1|  11.2|
+---+-----+-----+-----+------+
Actually, I don't know what this operation is called, so I could not find a solution online. I only want to compute the result for Idx values that map to multiple names.
CodePudding user response:
It sounds like you need to keep only the rows whose Idx value appears more than once, then compute Value * Wight in a new column:
from pyspark.sql.functions import col
...
# Find Idx values that occur more than once, then keep only those rows.
dupes = df.groupBy("Idx").count().filter(col("count") > 1).select("Idx")
df.join(dupes, "Idx").withColumn("Result", col("Value") * col("Wight"))
CodePudding user response:
You can use value_counts
and boolean indexing with apply
in Pandas to get the desired output (this assumes Idx is the DataFrame index):
df1 = df.loc[df.index.value_counts() > 1].copy()  # .copy() avoids SettingWithCopyWarning
df1['Result'] = df1.apply(lambda row: row['Value'] * row['Wight'], axis=1)
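For reference, here is a minimal end-to-end sketch of the Pandas approach, rebuilding the sample data from the question with Idx as the index (and keeping the question's Wight column name). The row-wise apply is replaced by an equivalent vectorized multiply, which is idiomatic and faster:

```python
import pandas as pd

# Rebuild the sample data with Idx as the index.
df = pd.DataFrame(
    {
        "Name": ["Peter", "Peter", "Mary", "Jason", "Mary",
                 "David", "Sean", "Tim", "Ken", "Jim"],
        "Value": [123, 234, 287, 243, 109, 231, 101, 238, 910, 112],
        "Wight": [0.1] * 10,
    },
    index=pd.Index(["A", "B", "B", "B", "C", "D", "D", "D", "E", "E"], name="Idx"),
)

# Keep only rows whose Idx occurs more than once (B, D, E);
# .copy() avoids SettingWithCopyWarning when adding the new column.
df1 = df.loc[df.index.value_counts() > 1].copy()

# Vectorized equivalent of the row-wise apply above.
df1["Result"] = df1["Value"] * df1["Wight"]

print(df1)
```

The boolean Series returned by `df.index.value_counts() > 1` is aligned to the (duplicated) index by `.loc`, so each row is kept or dropped according to how often its Idx value occurs.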