Home > Blockchain >  conver list of dictionaries as input to different columns in pyspark
conver list of dictionaries as input to different columns in pyspark

Time:04-08

i got the data as 3 string columns (looks like a list of dictionaries) and want to convert it into columns as shown below(example),

input data enter image description here

expected output,

enter image description here

basically I need help to convert

[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]

(as a string) to columns

|VAT|price|
|20|160|

CodePudding user response:

Please take a look at the following, if this helps:

_my_data = [{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]
# Using list comprehension
# to iterate over the dictionaries in the list
# picking values only, which are not alphabets
values = [dict_values for element in _my_data for dict_values in element.values() if not dict_values.isalpha()]
print(values)
#####
['20.00', '160.00']

CodePudding user response:

Use from_json to parse your string into JSON object and select the exact element you want for each column.

from pyspark.sql import functions as F

(df
    .withColumn('kpi', F.from_json('kpi', 'array<struct<name:string, value:string>>'))
    .withColumn('vat', F.col('kpi')[0]['value'])
    .withColumn('price', F.col('kpi')[1]['value'])
    .select('vat', 'price')
    .show(10, False)
)

 ----- ------ 
|vat  |price |
 ----- ------ 
|20.00|160.00|
|10.00|610.00|
 ----- ------ 

CodePudding user response:

Using Pyspark as below -

Input Data

from pyspark.sql.functions import *
from pyspark.sql.types import *
# StructType([StructField("name", StringType(), True), StructField("value", StringType(), True)])

schema = StructType([
                     StructField("kpi", StringType(), True),
                     StructField("productid", IntegerType(), True),
                     StructField("productname", StringType(), True)
                    ]
                   )

df = spark.createDataFrame([("""[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]""", 1, "tshirt"), ("""[{"name":"vat","value":"20.00"},{"name":"price","value":"120.00"}]""", 2, "cap"), ("""[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]""", 3, "shoes")], schema = schema)

df.createOrReplaceTempView("tbl")

df.show(truncate=False)

 ------------------------------------------------------------------ --------- ----------- 
|kpi                                                               |productid|productname|
 ------------------------------------------------------------------ --------- ----------- 
|[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]|1        |tshirt     |
|[{"name":"vat","value":"20.00"},{"name":"price","value":"120.00"}]|2        |cap        |
|[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]|3        |shoes      |
 ------------------------------------------------------------------ --------- ----------- 

Required Output -

(df.select(from_json("kpi", ArrayType(StringType())).alias("col"), "productid", "productname")
   .select(explode("col"),"productid", "productname")
   .select(from_json("col", MapType(StringType(), StringType())).alias("kpi"), "productid", "productname")
   .select("kpi.name", "kpi.value","productid", "productname")
   .groupBy("productid", "productname").pivot("name").agg(first("value"))
).show(truncate=False)

 --------- ----------- ------ ----- 
|productid|productname|price |vat  |
 --------- ----------- ------ ----- 
|1        |tshirt     |160.00|20.00|
|2        |cap        |120.00|20.00|
|3        |shoes      |160.00|20.00|
 --------- ----------- ------ ----- 

  • Related