Home > Software design >  Databricks Delta Lake - Reading data from JSON file
Databricks Delta Lake - Reading data from JSON file

Time:10-17

I'm currently learning Databricks and using a combination of Python (pyspark) and SQL for data transformations.

As of now I have a json file in the following format:

{
    "issuccess": true,
    "jobProcess": "usersList",
    "data": {
        "members": [
            {
                "id": "bot1",     
                "name": "databot",
                "active": true,
                "profile": {
                    "title": "Test Bot",
                    "phone": "1234"
                 },
                 "is_mailbox_active": true
             },
             {
                ....
             }
         ]
     }
}

I'm able to dump this data into a temp view via. following Python (PySpark) logic:

 usersData = spark \
                .read \
                .option("multiLine", True) \
                .option("mode", "PERMISSIVE") \
                .json("C:\\Test\data.json") \
                .createOrReplaceTempView("vw_TestView")

With above, the vw_TestView data is in struct<members:array<struct<.... format.

Column DataType
issuccess boolean
jobProcess string
data struct<members:array<struct<id:string, ....>

As an output, I need to select/display data only from members of 'data' column array in a proper format.

Performing a select * from will expectedly return 'results too large....' error. Also, since I need to eventually select specific content from 'data' column how can I construct a proper select query for this above view.

The select query output must look something as below:

id name profile
bot1 databot { "title": "Test Bot","phone": "1234"}
bot2 userbot { "title": "User Bot","phone": "7890"}

How can this be achieved?

I tried performing

%sql
select data.members.* from vw_TestView

but this is not supported for 'data.members' column's data type and errors out with following message:

Can only star expand struct data types.  ..........

CodePudding user response:

The problem is that members is an array. In this case you need to do that via following operations:

  1. Select members field using select("members")
  2. Explode the members field using the explode function (doc)
  3. extract data from the underlying structs

Something like this:

select col.* from (select explode(data.members) as col from vw_TestView)

P.S. All of this could be done directly via PySpark as well.

  • Related