Convert string in dataframe pyspark to table, obtaining only the necessary from string

Time:09-26

{
    "schema": {
        "type": "struct",
        "fields": [
            {
                "type": "int32",
                "optional": true,
                "field": "c1"
            },
            {
                "type": "string",
                "optional": true,
                "field": "c2"
            },
            {
                "type": "int64",
                "optional": false,
                "name": "org.apache.kafka.connect.data.Timestamp",
                "version": 1,
                "field": "create_ts"
            },
            {
                "type": "int64",
                "optional": false,
                "name": "org.apache.kafka.connect.data.Timestamp",
                "version": 1,
                "field": "update_ts"
            }
        ],
        "optional": false,
        "name": "foobar"
    },
    "payload": {
        "c1": 67,
        "c2": "foo",
        "create_ts": 1663920002000,
        "update_ts": 1663920002000
    }
}

I have my JSON string in this format. I don't want the whole data in the table; I want the table in this format:

| c1 | c2  | create_ts           | update_ts           |
|----|-----|---------------------|---------------------|
| 1  | foo | 2022-09-21 10:47:54 | 2022-09-21 10:47:54 |
| 28 | foo | 2022-09-21 13:16:45 | 2022-09-21 13:16:45 |
| 29 | foo | 2022-09-21 14:19:10 | 2022-09-21 14:19:10 |
| 30 | foo | 2022-09-21 14:19:20 | 2022-09-21 14:19:20 |
| 31 | foo | 2022-09-21 14:29:19 | 2022-09-21 14:29:19 |

CodePudding user response:

Skip the other (nested) attributes by selecting only the one you want to see in the resulting output, here the fields under payload:

(
  spark
  .read
  .option("multiline","true")
  .json("/path/json-path")
  .select("payload.*")
  .show()
)