Is anyone familiar with using Spark SQL to query nested data? Is the explode() function the correct way? What should the code look like?
I want to query myData -> productData (array) -> data (string).
My wrong code is below; I am not sure about the select part. Can anyone help? Thanks a lot!
data = spark.read.parquet("s3:path").filter("Btype == 'a' and marketplaceId = 1").select(explode("myData.productData") as data)) ??
The schema of the data is here:
|-- myData: struct (nullable = true)
| |-- headline: string (nullable = true)
| |-- page1: string (nullable = true)
| |-- pageValue: string (nullable = true)
| |-- xxId: long (nullable = true)
| |-- productData: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- roomId: integer (nullable = true)
| | | |-- data: string (nullable = true)
|-- itemname: string (nullable = true)
CodePudding user response:
To query the data field in the nested productData array using Spark SQL, you can use the explode function. The explode function takes a column that contains an array and outputs a new row for each element in the array.
Here is an example of how you can use the explode function in your query:
from pyspark.sql.functions import explode

# Read the data from the parquet file and filter the rows
df = spark.read.parquet("s3:path").filter("Btype == 'a' and marketplaceId = 1")
# Use the explode function to flatten the productData array into one row per element
df = df.select(explode("myData.productData").alias("productData"))
# Select the data field from each exploded productData struct
df = df.select("productData.data")
The resulting DataFrame will contain a single column named data, which holds the values of the data field from the productData array.
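Since you mentioned Spark SQL, you can also write the same query as a SQL string by registering the DataFrame as a temporary view and using LATERAL VIEW explode. This is a minimal sketch; the view name mydata_table is hypothetical, and the Btype and marketplaceId columns are taken from your filter, so adjust them to your actual schema:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the raw data as a temporary view (view name is just an example)
df = spark.read.parquet("s3:path")
df.createOrReplaceTempView("mydata_table")

# LATERAL VIEW explode turns each element of myData.productData into its own row (aliased p),
# and p.data selects the data field from that exploded struct
result = spark.sql("""
    SELECT p.data
    FROM mydata_table
    LATERAL VIEW explode(myData.productData) t AS p
    WHERE Btype = 'a' AND marketplaceId = 1
""")
result.show()
Either approach should give you the same one-column result with the nested data values.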