Basically, I have a dataframe that looks exactly like this:
id | values |
---|---|
01 | [{"final_price":10.0,"currency":"USD"},{"final_price":18.0,"currency":"CAD"}] |
02 | [{"final_price":44.15,"currency":"USD"},{"final_price":60.0,"currency":"CAD"}] |
03 | [{"final_price":99.99,"currency":"USD"},{"final_price":115.0,"currency":"CAD"}] |
04 | [{"final_price":25.0,"currency":"USD"},{"final_price":32.0,"currency":"CAD"}] |
The same product id has the price in US dollars and in Canadian dollars. However, I need to check how many dicts this column holds, because some products only have the price in USD and others only in CAD. How can I check how many currencies there are and create a new column for each one of them?
Thanks!
CodePudding user response:
Convert the JSON strings into an array of structs using from_json. The number of dicts (currencies) then corresponds to the size of the resulting array. To select them as new columns, you can pivot on the currency like this:
from pyspark.sql import functions as F

df = spark.createDataFrame([
    ("01", "[{'final_price':10.0,'currency':'USD'},{'final_price':18.0,'currency':'CAD'}]"),
    ("02", "[{'final_price':44.15,'currency':'USD'},{'final_price':60.0,'currency':'CAD'}]"),
    ("03", "[{'final_price':99.99,'currency':'USD'},{'final_price':115.0,'currency':'CAD'}]"),
    ("04", "[{'final_price':25.0,'currency':'USD'},{'final_price':32.0,'currency':'CAD'}]")
], ["id", "values"])

df.selectExpr(
    "id",
    # parse the JSON array, then explode each struct into its own row/columns
    "inline(from_json(values, 'array<struct<final_price:float,currency:string>>'))"
).groupby("id").pivot("currency").agg(
    F.first("final_price")  # one price per (id, currency), so first() is safe
).show()
# +---+-----+-----+
# | id|  CAD|  USD|
# +---+-----+-----+
# | 01| 18.0| 10.0|
# | 03|115.0|99.99|
# | 02| 60.0|44.15|
# | 04| 32.0| 25.0|
# +---+-----+-----+
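To answer the counting part directly in Spark, `F.size(F.from_json(...))` gives the number of dicts per row, and the pivot above already leaves a null for ids that are missing a currency. As a quick plain-Python sketch of the same counting logic (the one-currency row for id "02" below is made-up example data to mimic a product that only has USD):

```python
import json

# Hypothetical sample rows mirroring the question's column layout
rows = [
    ("01", '[{"final_price":10.0,"currency":"USD"},{"final_price":18.0,"currency":"CAD"}]'),
    ("02", '[{"final_price":44.15,"currency":"USD"}]'),  # USD only
]

# number of dicts (currencies) per product id
counts = {pid: len(json.loads(s)) for pid, s in rows}

# all distinct currency codes seen anywhere in the column
currencies = sorted({d["currency"] for _, s in rows for d in json.loads(s)})

print(counts)      # {'01': 2, '02': 1}
print(currencies)  # ['CAD', 'USD']
```

The distinct-currency set tells you which pivot columns to expect before running the Spark job.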