Home > Net >  How to get Percentage of Total counts for each item in the Map/Dict column?
How to get Percentage of Total counts for each item in the Map/Dict column?

Time:06-28

I have a PySpark dataframe which looks like this, I have a map datatype column Map<Str,Int>

 Date              Item (Map<Str,int>)                           Total Items   

2021-02-01    Item_A -> 3, Item_B -> 10, Item_C -> 2                 15
2021-02-02    Item_A -> 1, Item_B -> 5, Item_C ->  7                 13
2021-02-03    Item_A -> 8, Item_B -> 3, Item_C ->  1                 12

I want to create a new column which gives me the individual Item dominance percentage from the total number of items. Item_A / total number of items and so on to all other items. The resulting column should also be a map.

I want something like this:

 Date              Item (Map<Str,int>)                    Total Items     Item count % 
                                                                        (item/total items)*100

2021-02-01    Item_A -> 3, Item_B -> 10, Item_C -> 5           15     Item_A -> 20%, Item_B -> 66%, Item_c -> 33%
2021-02-02    Item_A -> 1, Item_B -> 5, Item_C ->  7           13    Item_A -> 7%, Item_B -> 38%, Item_C -> 53%
2021-02-03    Item_A -> 8, Item_B -> 3, Item_C ->  1           12    Item_A -> 66%, Item_B -> 25%, Item_C -> 8.3%

My approach:

df = df.withColumn('Item_count_percentage', F.expr('aggregate(map_values(Item), 0 , (acc, x) -> (acc / int(x)/100)'))
df.show(truncate=False)

CodePudding user response:

  • Spark 3.0

    Making use of transform_values

    from pyspark.sql import functions as F
    
    df = df.withColumn(
        'Item count',
        F.expr("transform_values(Item, (k, v) -> round(v / `Total Items` * 100, 1))")
    )
    
  • Spark 2.4

    Recreating the map (map_from_arrays) by first extracting keys (map_keys), then values (map_values) and applying a higher order function (transform) on values.

    from pyspark.sql import functions as F
    
    df = df.withColumn(
        'Item count',
        F.map_from_arrays(
            F.map_keys('Item'),
            F.expr("transform(map_values(Item), x -> round(x / `Total Items` * 100, 1))")
        )
    )
    
  • Related