I'm currently learning Databricks and using a combination of Python (pyspark) and SQL for data transformations.
As of now I have a json file in the following format:
{
"issuccess": true,
"jobProcess": "usersList",
"data": {
"members": [
{
"id": "bot1",
"name": "databot",
"active": true,
"profile": {
"title": "Test Bot",
"phone": "1234"
},
"is_mailbox_active": true
},
{
....
}
]
}
}
I'm able to dump this data into a temp view via. following Python (PySpark) logic:
usersData = spark \
.read \
.option("multiLine", True) \
.option("mode", "PERMISSIVE") \
.json("C:\\Test\data.json") \
.createOrReplaceTempView("vw_TestView")
With above, the vw_TestView data is in struct<members:array<struct<.... format.
Column | DataType |
---|---|
issuccess | boolean |
jobProcess | string |
data | struct<members:array<struct<id:string, ....> |
As an output, I need to select/display data only from members of 'data' column array in a proper format.
Performing a select * from will expectedly return 'results too large....' error. Also, since I need to eventually select specific content from 'data' column how can I construct a proper select query for this above view.
The select query output must look something as below:
id | name | profile |
---|---|---|
bot1 | databot | { "title": "Test Bot","phone": "1234"} |
bot2 | userbot | { "title": "User Bot","phone": "7890"} |
How can this be achieved?
I tried performing
%sql
select data.members.* from vw_TestView
but this is not supported for 'data.members' column's data type and errors out with following message:
Can only star expand struct data types. ..........
CodePudding user response:
The problem is that members
is an array. In this case you need to do that via following operations:
- Select
members
field usingselect("members")
- Explode the
members
field using theexplode
function (doc) - extract data from the underlying structs
Something like this:
select col.* from (select explode(data.members) as col from vw_TestView)
P.S. All of this could be done directly via PySpark as well.