Is anyone familiar with using Spark SQL to query nested data? Is the explode() function the correct way? What should the code look like?
I want to query myData -> productData (array) -> data (string).
My wrong code is below; I am not sure about the select part. Can anyone help? Thanks a lot!
data = spark.read.parquet("s3:path").filter("Btype == 'a' and marketplaceId = 1").select(explode("myData.productData") as data)) ??
The schema of the data is here:
|-- myData: struct (nullable = true)
| |-- headline: string (nullable = true)
| |-- page1: string (nullable = true)
| |-- pageValue: string (nullable = true)
| |-- xxId: long (nullable = true)
| |-- productData: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- roomId: integer (nullable = true)
| | | |-- data: string (nullable = true)
|-- itemname: string (nullable = true)
CodePudding user response:
To query the data field in the nested productData array using Spark SQL, you can use the explode function. The explode function takes a column that contains an array and outputs a new row for each element in the array.
Here is an example of how you can use the explode function in your query:
from pyspark.sql.functions import explode

# Read the data from the parquet file and filter the rows
df = spark.read.parquet("s3:path").filter("Btype == 'a' and marketplaceId = 1")
# Use the explode function to flatten the productData array into one row per element
df = df.select(explode("myData.productData").alias("productData"))
# Select the data field from each exploded productData struct
df = df.select("productData.data")
The resulting DataFrame will contain a single column named data, which holds the values of the data field from the productData array.
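Since you mentioned Spark SQL, you can also write the same query as a SQL string by registering the DataFrame as a temporary view and using LATERAL VIEW explode. This is a minimal sketch; the view name mydata_table is hypothetical, and the Btype and marketplaceId columns are taken from your filter, so adjust them to your actual schema:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the raw data as a temporary view (view name is just an example)
df = spark.read.parquet("s3:path")
df.createOrReplaceTempView("mydata_table")

# LATERAL VIEW explode turns each element of myData.productData into its own row (aliased p),
# and p.data selects the data field from that exploded struct
result = spark.sql("""
    SELECT p.data
    FROM mydata_table
    LATERAL VIEW explode(myData.productData) t AS p
    WHERE Btype = 'a' AND marketplaceId = 1
""")
result.show()
Either approach should give you the same one-column result with the nested data values.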