Databricks Delta Lake - Reading data from JSON file-CodePudding

I'm currently learning Databricks and using a combination of Python (pyspark) and SQL for data transformations.

As of now I have a json file in the following format:

{
    "issuccess": true,
    "jobProcess": "usersList",
    "data": {
        "members": [
            {
                "id": "bot1",     
                "name": "databot",
                "active": true,
                "profile": {
                    "title": "Test Bot",
                    "phone": "1234"
                 },
                 "is_mailbox_active": true
             },
             {
                ....
             }
         ]
     }
}

I'm able to dump this data into a temp view via. following Python (PySpark) logic:

 usersData = spark \
                .read \
                .option("multiLine", True) \
                .option("mode", "PERMISSIVE") \
                .json("C:\\Test\data.json") \
                .createOrReplaceTempView("vw_TestView")

With above, the vw_TestView data is in struct<members:array<struct<.... format.

Column	DataType
issuccess	boolean
jobProcess	string
data	struct<members:array<struct<id:string, ....>

As an output, I need to select/display data only from members of 'data' column array in a proper format.

Performing a select * from will expectedly return 'results too large....' error. Also, since I need to eventually select specific content from 'data' column how can I construct a proper select query for this above view.

The select query output must look something as below:

id	name	profile
bot1	databot	{ "title": "Test Bot","phone": "1234"}
bot2	userbot	{ "title": "User Bot","phone": "7890"}

How can this be achieved?

I tried performing

%sql
select data.members.* from vw_TestView

but this is not supported for 'data.members' column's data type and errors out with following message:

Can only star expand struct data types.  ..........

CodePudding user response：

The problem is that members is an array. In this case you need to do that via following operations:

Select members field using select("members")
Explode the members field using the explode function (doc)
extract data from the underlying structs

Something like this:

select col.* from (select explode(data.members) as col from vw_TestView)

P.S. All of this could be done directly via PySpark as well.