I have a PySpark DataFrame with a map-typed column (Map<String, Int>) that looks like this:

Date        Item (Map<String, Int>)                  Total Items
2021-02-01  Item_A -> 3, Item_B -> 10, Item_C -> 2   15
2021-02-02  Item_A -> 1, Item_B -> 5, Item_C -> 7    13
2021-02-03  Item_A -> 8, Item_B -> 3, Item_C -> 1    12
I want to create a new column that gives each item's share of the total as a percentage, i.e. (item count / Total Items) * 100 for every key in the map. The resulting column should also be a map. I want something like this:
Date        Item (Map<String, Int>)                  Total Items  Item count % ((item / Total Items) * 100)
2021-02-01  Item_A -> 3, Item_B -> 10, Item_C -> 2   15           Item_A -> 20.0%, Item_B -> 66.7%, Item_C -> 13.3%
2021-02-02  Item_A -> 1, Item_B -> 5, Item_C -> 7    13           Item_A -> 7.7%, Item_B -> 38.5%, Item_C -> 53.8%
2021-02-03  Item_A -> 8, Item_B -> 3, Item_C -> 1    12           Item_A -> 66.7%, Item_B -> 25.0%, Item_C -> 8.3%
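For reference, here is a minimal way to reconstruct this frame (a sketch; the column names come from the table above, everything else is scaffolding):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, MapType, IntegerType,
)

spark = SparkSession.builder.getOrCreate()

# Hypothetical reconstruction of the frame shown above.
schema = StructType([
    StructField("Date", StringType()),
    StructField("Item", MapType(StringType(), IntegerType())),
    StructField("Total Items", IntegerType()),
])

df = spark.createDataFrame(
    [
        ("2021-02-01", {"Item_A": 3, "Item_B": 10, "Item_C": 2}, 15),
        ("2021-02-02", {"Item_A": 1, "Item_B": 5, "Item_C": 7}, 13),
        ("2021-02-03", {"Item_A": 8, "Item_B": 3, "Item_C": 1}, 12),
    ],
    schema,
)
```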
My approach, which does not give me what I want:

```python
from pyspark.sql import functions as F

df = df.withColumn(
    'Item_count_percentage',
    F.expr('aggregate(map_values(Item), 0, (acc, x) -> (acc / int(x) / 100))')
)
df.show(truncate=False)
```
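(For what it's worth, aggregate folds map_values(Item) down to one value per row, so it can only ever produce a scalar column, not a map; with a simple sum it just re-derives Total Items. A minimal sketch, assuming the frame above:)

```python
# aggregate() reduces the value array to a single number per row,
# so its result is a scalar column, never a map.
df.withColumn(
    "sum_check",
    F.expr("aggregate(map_values(Item), 0, (acc, x) -> acc + x)"),
).show(truncate=False)
```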
CodePudding user response:
Spark 3.0

Making use of transform_values:

```python
from pyspark.sql import functions as F

df = df.withColumn(
    'Item count',
    F.expr("transform_values(Item, (k, v) -> round(v / `Total Items` * 100, 1))")
)
```
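Applied to the sample rows above, the result should look approximately like this (one decimal place, per the round):

```python
df.select('Date', 'Item count').show(truncate=False)
# Expected output (approximately):
# 2021-02-01  {Item_A -> 20.0, Item_B -> 66.7, Item_C -> 13.3}
# 2021-02-02  {Item_A -> 7.7, Item_B -> 38.5, Item_C -> 53.8}
# 2021-02-03  {Item_A -> 66.7, Item_B -> 25.0, Item_C -> 8.3}
```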
Spark 2.4

Recreating the map (map_from_arrays) by first extracting the keys (map_keys) and the values (map_values), then applying a higher-order function (transform) to the values:

```python
from pyspark.sql import functions as F

df = df.withColumn(
    'Item count',
    F.map_from_arrays(
        F.map_keys('Item'),
        F.expr("transform(map_values(Item), x -> round(x / `Total Items` * 100, 1))")
    )
)
```
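One note on the re-pairing: map_keys and map_values return entries in the same underlying order, so map_from_arrays zips them back together correctly. If you prefer to avoid the two parallel arrays, a rough equivalent (a sketch assuming the same column names, not part of the original answer) can rebuild the map through map_entries / map_from_entries, both also available since Spark 2.4:

```python
from pyspark.sql import functions as F

# map_entries yields an array of structs with fields `key` and `value`;
# transform rewrites each entry, and map_from_entries rebuilds the map.
df = df.withColumn(
    'Item count',
    F.expr("""
        map_from_entries(
            transform(
                map_entries(Item),
                e -> struct(e.key, round(e.value / `Total Items` * 100, 1))
            )
        )
    """)
)
```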