I have a PySpark DataFrame with a map-typed column (Map<String, Int>) that looks like this:

Date        Item (Map<String, Int>)                  Total Items
2021-02-01  Item_A -> 3, Item_B -> 10, Item_C -> 2   15
2021-02-02  Item_A -> 1, Item_B -> 5, Item_C -> 7    13
2021-02-03  Item_A -> 8, Item_B -> 3, Item_C -> 1    12
I want to create a new column that gives each item's share of the total as a percentage, i.e. (item count / Total Items) * 100 for every key in the map. The resulting column should also be a map. I want something like this:
Date        Item (Map<String, Int>)                  Total Items  Item count % ((item / Total Items) * 100)
2021-02-01  Item_A -> 3, Item_B -> 10, Item_C -> 2   15           Item_A -> 20.0%, Item_B -> 66.7%, Item_C -> 13.3%
2021-02-02  Item_A -> 1, Item_B -> 5, Item_C -> 7    13           Item_A -> 7.7%, Item_B -> 38.5%, Item_C -> 53.8%
2021-02-03  Item_A -> 8, Item_B -> 3, Item_C -> 1    12           Item_A -> 66.7%, Item_B -> 25.0%, Item_C -> 8.3%
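For reference, here is a minimal way to reconstruct this frame (a sketch; the column names come from the table above, everything else is scaffolding):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, MapType, IntegerType,
)

spark = SparkSession.builder.getOrCreate()

# Hypothetical reconstruction of the frame shown above.
schema = StructType([
    StructField("Date", StringType()),
    StructField("Item", MapType(StringType(), IntegerType())),
    StructField("Total Items", IntegerType()),
])

df = spark.createDataFrame(
    [
        ("2021-02-01", {"Item_A": 3, "Item_B": 10, "Item_C": 2}, 15),
        ("2021-02-02", {"Item_A": 1, "Item_B": 5, "Item_C": 7}, 13),
        ("2021-02-03", {"Item_A": 8, "Item_B": 3, "Item_C": 1}, 12),
    ],
    schema,
)
```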
My approach, which does not give me what I want:

```python
from pyspark.sql import functions as F

df = df.withColumn(
    'Item_count_percentage',
    F.expr('aggregate(map_values(Item), 0, (acc, x) -> (acc / int(x) / 100))')
)
df.show(truncate=False)
```
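(For what it's worth, aggregate folds map_values(Item) down to one value per row, so it can only ever produce a scalar column, not a map; with a simple sum it just re-derives Total Items. A minimal sketch, assuming the frame above:)

```python
# aggregate() reduces the value array to a single number per row,
# so its result is a scalar column, never a map.
df.withColumn(
    "sum_check",
    F.expr("aggregate(map_values(Item), 0, (acc, x) -> acc + x)"),
).show(truncate=False)
```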
CodePudding user response:
Spark 3.0

Making use of transform_values:

```python
from pyspark.sql import functions as F

df = df.withColumn(
    'Item count',
    F.expr("transform_values(Item, (k, v) -> round(v / `Total Items` * 100, 1))")
)
```
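Applied to the sample rows above, the result should look approximately like this (one decimal place, per the round):

```python
df.select('Date', 'Item count').show(truncate=False)
# Expected output (approximately):
# 2021-02-01  {Item_A -> 20.0, Item_B -> 66.7, Item_C -> 13.3}
# 2021-02-02  {Item_A -> 7.7, Item_B -> 38.5, Item_C -> 53.8}
# 2021-02-03  {Item_A -> 66.7, Item_B -> 25.0, Item_C -> 8.3}
```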
Spark 2.4

Recreating the map (map_from_arrays) by first extracting the keys (map_keys) and the values (map_values), then applying a higher-order function (transform) to the values:

```python
from pyspark.sql import functions as F

df = df.withColumn(
    'Item count',
    F.map_from_arrays(
        F.map_keys('Item'),
        F.expr("transform(map_values(Item), x -> round(x / `Total Items` * 100, 1))")
    )
)
```
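One note on the re-pairing: map_keys and map_values return entries in the same underlying order, so map_from_arrays zips them back together correctly. If you prefer to avoid the two parallel arrays, a rough equivalent (a sketch assuming the same column names, not part of the original answer) can rebuild the map through map_entries / map_from_entries, both also available since Spark 2.4:

```python
from pyspark.sql import functions as F

# map_entries yields an array of structs with fields `key` and `value`;
# transform rewrites each entry, and map_from_entries rebuilds the map.
df = df.withColumn(
    'Item count',
    F.expr("""
        map_from_entries(
            transform(
                map_entries(Item),
                e -> struct(e.key, round(e.value / `Total Items` * 100, 1))
            )
        )
    """)
)
```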