i got the data as 3 string columns (looks like a list of dictionaries) and want to convert it into columns as shown below(example),
expected output,
basically I need help to convert
[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]
(as a string) to columns
|VAT|price|
|20|160|
CodePudding user response:
Please take a look at the following, if this helps:
_my_data = [{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]
# Using list comprehension
# to iterate over the dictionaries in the list
# picking values only, which are not alphabets
values = [dict_values for element in _my_data for dict_values in element.values() if not dict_values.isalpha()]
print(values)
#####
['20.00', '160.00']
CodePudding user response:
Use from_json
to parse your string into JSON object and select the exact element you want for each column.
from pyspark.sql import functions as F
(df
.withColumn('kpi', F.from_json('kpi', 'array<struct<name:string, value:string>>'))
.withColumn('vat', F.col('kpi')[0]['value'])
.withColumn('price', F.col('kpi')[1]['value'])
.select('vat', 'price')
.show(10, False)
)
----- ------
|vat |price |
----- ------
|20.00|160.00|
|10.00|610.00|
----- ------
CodePudding user response:
Using Pyspark as below -
Input Data
from pyspark.sql.functions import *
from pyspark.sql.types import *
# StructType([StructField("name", StringType(), True), StructField("value", StringType(), True)])
schema = StructType([
StructField("kpi", StringType(), True),
StructField("productid", IntegerType(), True),
StructField("productname", StringType(), True)
]
)
df = spark.createDataFrame([("""[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]""", 1, "tshirt"), ("""[{"name":"vat","value":"20.00"},{"name":"price","value":"120.00"}]""", 2, "cap"), ("""[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]""", 3, "shoes")], schema = schema)
df.createOrReplaceTempView("tbl")
df.show(truncate=False)
------------------------------------------------------------------ --------- -----------
|kpi |productid|productname|
------------------------------------------------------------------ --------- -----------
|[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]|1 |tshirt |
|[{"name":"vat","value":"20.00"},{"name":"price","value":"120.00"}]|2 |cap |
|[{"name":"vat","value":"20.00"},{"name":"price","value":"160.00"}]|3 |shoes |
------------------------------------------------------------------ --------- -----------
Required Output -
(df.select(from_json("kpi", ArrayType(StringType())).alias("col"), "productid", "productname")
.select(explode("col"),"productid", "productname")
.select(from_json("col", MapType(StringType(), StringType())).alias("kpi"), "productid", "productname")
.select("kpi.name", "kpi.value","productid", "productname")
.groupBy("productid", "productname").pivot("name").agg(first("value"))
).show(truncate=False)
--------- ----------- ------ -----
|productid|productname|price |vat |
--------- ----------- ------ -----
|1 |tshirt |160.00|20.00|
|2 |cap |120.00|20.00|
|3 |shoes |160.00|20.00|
--------- ----------- ------ -----