How to filter out one value in a column to one value in another column, but calculate one-to-many values

Below is the original PySpark DataFrame:

+-----+-----+-----+-----+
|  Idx| Name|Value|Wight|
+-----+-----+-----+-----+
|    A|Peter|  123|  0.1|
|    B|Peter|  234|  0.1|
|    B| Mary|  287|  0.1|
|    B|Jason|  243|  0.1|
|    C| Mary|  109|  0.1|
|    D|David|  231|  0.1|
|    D| Sean|  101|  0.1|
|    D|  Tim|  238|  0.1|
|    E|  Ken|  910|  0.1|
|    E|  Jim|  112|  0.1|
+-----+-----+-----+-----+

Idx A and C each match only one Name, while Idx B, D, and E each match several Names. I want to keep only Idx B, D, and E and multiply their Value and Wight into a new column named 'Result'.

Below is the desired output:

+-----+-----+-----+-----+------+
|  Idx| Name|Value|Wight|Result|
+-----+-----+-----+-----+------+
|    B|Peter|  234|  0.1|  23.4|
|    B| Mary|  287|  0.1|  28.7|
|    B|Jason|  243|  0.1|  24.3|
|    D|David|  231|  0.1|  23.1|
|    D| Sean|  101|  0.1|  10.1|
|    D|  Tim|  238|  0.1|  23.8|
|    E|  Ken|  910|  0.1|  91.0|
|    E|  Jim|  112|  0.1|  11.2|
+-----+-----+-----+-----+------+

Actually, I don't know what this operation is called, so I could not find a solution online. I only want to calculate the result for an Idx that maps to multiple Names.
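
For reference, the sample frame can be rebuilt with something like this (a minimal sketch; the SparkSession setup is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", "Peter", 123, 0.1), ("B", "Peter", 234, 0.1), ("B", "Mary", 287, 0.1),
     ("B", "Jason", 243, 0.1), ("C", "Mary", 109, 0.1), ("D", "David", 231, 0.1),
     ("D", "Sean", 101, 0.1), ("D", "Tim", 238, 0.1), ("E", "Ken", 910, 0.1),
     ("E", "Jim", 112, 0.1)],
    ["Idx", "Name", "Value", "Wight"],
)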

CodePudding user response:

It sounds like you need to filter the rows whose Idx value is duplicated and compute Value * Wight in a new column:

from pyspark.sql.functions import col

...
# exceptAll removes one row per Idx (the one dropDuplicates keeps), so only the
# extra rows of duplicated Idx groups remain; note the column is spelled "Wight" in the data
df.exceptAll(df.dropDuplicates(['Idx'])).withColumn("Result", col("Value") * col("Wight"))
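
Note that exceptAll drops one representative row per Idx, so a group like B keeps only two of its three rows rather than all of them as in the desired output above. If every row of a duplicated Idx should be kept, one alternative (a sketch using the standard Window API and reusing the df from the question) is to count rows per Idx with a window function:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Count how many rows share each Idx, keep only the groups with more than one row,
# then multiply Value by Wight into Result
w = Window.partitionBy("Idx")
result = (
    df.withColumn("cnt", F.count("*").over(w))
      .filter(F.col("cnt") > 1)
      .drop("cnt")
      .withColumn("Result", F.col("Value") * F.col("Wight"))
)
result.show()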

CodePudding user response:

You can use value_counts and apply in Pandas to get the desired output (here Idx is treated as a regular column, as in the Spark frame above):

# keep only the Idx values that occur more than once, then compute Value * Wight
counts = df['Idx'].value_counts()
df1 = df[df['Idx'].isin(counts[counts > 1].index)].copy()
df1['Result'] = df1.apply(lambda row: row['Value'] * row['Wight'], axis=1)
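
The same filter can also be written without apply, counting names per Idx with groupby/transform (a vectorised sketch, assuming Idx is an ordinary column as above):

# Boolean mask: True for rows whose Idx occurs more than once
mask = df.groupby("Idx")["Idx"].transform("count") > 1
df1 = df[mask].copy()
df1["Result"] = df1["Value"] * df1["Wight"]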