Checking how many dictionaries a pyspark dataframe has and collecting values from them


Basically, I have a dataframe that looks exactly like this:

id values
01 [{"final_price":10.0,"currency":"USD"},{"final_price":18.0,"currency":"CAD"}]
02 [{"final_price":44.15,"currency":"USD"},{"final_price":60.0,"currency":"CAD"}]
03 [{"final_price":99.99,"currency":"USD"},{"final_price":115.0,"currency":"CAD"}]
04 [{"final_price":25.0,"currency":"USD"},{"final_price":32.0,"currency":"CAD"}]

The same product id has the price in both US dollars and Canadian dollars. However, I need to check how many dicts this column has, because some products only have the price in USD and others only in CAD. How can I check how many currencies there are and create a new column for each one of them?

Thanks!

CodePudding user response:

Convert the JSON strings into an array of structs using from_json. The number of dicts (currencies) corresponds to the size of the resulting array. To turn each currency into its own column, pivot on the currency like this:

from pyspark.sql import functions as F

df = spark.createDataFrame([
    ("01", "[{'final_price':10.0,'currency':'USD'},{'final_price':18.0,'currency':'CAD'}]"),
    ("02", "[{'final_price':44.15,'currency':'USD'},{'final_price':60.0,'currency':'CAD'}]"),
    ("03", "[{'final_price':99.99,'currency':'USD'},{'final_price':115.0,'currency':'CAD'}]"),
    ("04", "[{'final_price':25.0,'currency':'USD'},{'final_price':32.0,'currency':'CAD'}]")
], ["id", "values"])

df.selectExpr(
    "id",
    "inline(from_json(values, 'array<struct<final_price:float,currency:string>>'))"
).groupby("id").pivot("currency").agg(
    F.first("final_price")
).show()

# +---+-----+-----+
# | id|  CAD|  USD|
# +---+-----+-----+
# | 01| 18.0| 10.0|
# | 03|115.0|99.99|
# | 02| 60.0|44.15|
# | 04| 32.0| 25.0|
# +---+-----+-----+
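
If you also want to inspect the counts explicitly before pivoting, here is a minimal sketch (reusing the same df, and assuming the hypothetical column name n_prices) that uses size on the parsed array to count the dicts per row, and inline to list the distinct currencies:

schema = "array<struct<final_price:float,currency:string>>"

# Number of dicts (prices) per row: size() of the parsed array
# (n_prices is just an illustrative column name)
df.withColumn("n_prices", F.size(F.from_json("values", schema))).show(truncate=False)

# Distinct currencies present anywhere in the column
df.selectExpr(f"inline(from_json(values, '{schema}'))") \
    .select("currency").distinct().show()

Note that the pivot handles missing currencies automatically: a product that only has a USD price will simply get null in the CAD column.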