Pandas parse/expand nested string column within dataframe

I have an Excel file with a string column in a nested JSON-like format, and I would like to parse/expand it. This is what the dataframe looks like with df.head(2):

   json_str
0 {"id":"lni001","pub_date":"20220301","doc_id":"7098727","unique_id":"64WP-UI-POLI","content":[{"c_id":"002","p_id":"P02","type":"org","source":"internet"},{"c_id":"003","p_id":"P03","type":"org","source":"internet"},{"c_id":"005","p_id":"K01","type":"people","source":"news"}]}
1 {"id":"lni002","pub_date":"20220301","doc_id":"7097889","unique_id":"64WP-UI-CFGT","content":[{"c_id":"012","p_id":"K21","type":"location","source":"internet"},{"c_id":"034","p_id":"P17","type":"people","source":"news"},{"c_id":"098","p_id":"K54","type":"people","source":"news"}]}

The structure of each row looks like this:

{
   "id":"lni001",
   "pub_date":"20220301",
   "doc_id":"7098727",
   "unique_id":"64WP-UI-POLI",
   "content":[
      {
         "c_id":"002",
         "p_id":"P02",
         "type":"org",
         "source":"internet"  
      },
      {
         "c_id":"003",
         "p_id":"P03",
         "type":"org",
         "source":"internet" 
      },
      {
         "c_id":"005",
         "p_id":"K01",
         "type":"people",
         "source":"news" 
      }
   ]
}

The values in the column are of type str, as reported by type(df['json_str'].iloc[0]).

All rows have the same structure, but the number of entries in content varies. The example above has 3 nested entries, but other rows may have 1, 2, 4, 5, or more. The expected result looks like this:

id       pub_date   doc_id    unique_id      c_id   p_id   type       source
lni001   20220301   7098727   64WP-UI-POLI   002    P02    org        internet
lni001   20220301   7098727   64WP-UI-POLI   003    P03    org        internet
lni001   20220301   7098727   64WP-UI-POLI   005    K01    people     news
lni002   20220301   7097889   64WP-UI-CFGT   012    K21    location   internet
lni002   20220301   7097889   64WP-UI-CFGT   034    P17    people     news
lni002   20220301   7097889   64WP-UI-CFGT   098    K54    people     news

I have tried converting the column to dictionaries and extracting the information from them, but that did not work well. Is there a better way to do it?

CodePudding user response：

Building off of @enke's answer, you could first convert the strings to Python dictionaries with ast.literal_eval, and then use pd.json_normalize:

import ast
import pandas as pd
data = df['json_str'].apply(ast.literal_eval).tolist()  # parse each string into a dict
new_df = pd.json_normalize(data, ['content'], list(data[0].keys() - {'content'}))

If you care about the order of the columns, you can rearrange them:

new_df = new_df[['id', 'pub_date', 'doc_id', 'unique_id', 'c_id', 'p_id', 'type', 'source']]
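One caveat with ast.literal_eval is that it parses Python literals rather than JSON, so it should only be relied on when the strings contain nothing but strings, numbers, lists, and dicts, as in the question's sample. A minimal sketch of the difference follows; the two input strings are made up for illustration and are not taken from the question's data.

```python
import ast
import json

# Hypothetical strings for illustration only.
plain = '{"id": "lni001", "content": [{"c_id": "002"}]}'
with_json_tokens = '{"id": "lni001", "flag": true, "note": null}'

# Both parsers agree when the text holds only strings, lists, and dicts.
same = ast.literal_eval(plain) == json.loads(plain)

# json.loads maps the JSON tokens true/null to True/None.
parsed = json.loads(with_json_tokens)

# ast.literal_eval rejects them because they are not Python literals.
try:
    ast.literal_eval(with_json_tokens)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False
```

If the real column might contain true, false, or null, json.loads is probably the safer choice.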

CodePudding user response：

We could apply json.loads to each row and then use json_normalize:

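import json
import pandas as pd

# Parse every JSON string into a dict
data = df['json_str'].apply(json.loads).tolist()
# One row per "content" entry; the remaining keys are repeated as metadata columns
out = (pd.json_normalize(data, ['content'], list(data[0].keys()-{'content'}))
       [['id', 'pub_date', 'doc_id', 'unique_id', 'c_id', 'p_id', 'type', 'source']])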

Output:

       id  pub_date   doc_id     unique_id c_id p_id      type    source
0  lni001  20220301  7098727  64WP-UI-POLI  002  P02       org  internet
1  lni001  20220301  7098727  64WP-UI-POLI  003  P03       org  internet
2  lni001  20220301  7098727  64WP-UI-POLI  005  K01    people      news
3  lni002  20220301  7097889  64WP-UI-CFGT  012  K21  location  internet
4  lni002  20220301  7097889  64WP-UI-CFGT  034  P17    people      news
5  lni002  20220301  7097889  64WP-UI-CFGT  098  K54    people      news

Here, data[0].keys() - {'content'} is the set of all keys other than "content" in the first dictionary. Since every row has the same structure, the same keys apply to all of them.
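Because a set difference has no guaranteed order, the column list has to be spelled out by hand afterwards. A possible variation, sketched here under the assumption that every row shares the keys of the first one, builds the metadata list with a comprehension so the original key order is kept. The two sample rows are shortened from the question, and the second has a single content entry to show that rows of different lengths are handled.

```python
import json
import pandas as pd

# Sample rows shaped like the question's data (shortened for illustration).
df = pd.DataFrame({"json_str": [
    '{"id":"lni001","pub_date":"20220301","doc_id":"7098727","unique_id":"64WP-UI-POLI",'
    '"content":[{"c_id":"002","p_id":"P02","type":"org","source":"internet"},'
    '{"c_id":"005","p_id":"K01","type":"people","source":"news"}]}',
    '{"id":"lni002","pub_date":"20220301","doc_id":"7097889","unique_id":"64WP-UI-CFGT",'
    '"content":[{"c_id":"012","p_id":"K21","type":"location","source":"internet"}]}',
]})

data = df["json_str"].apply(json.loads).tolist()

# A list comprehension keeps the keys in their original order,
# unlike the set difference data[0].keys() - {'content'}.
meta = [key for key in data[0] if key != "content"]

out = pd.json_normalize(data, record_path="content", meta=meta)

# json_normalize places the record columns first, so move the metadata to the front.
record_cols = [col for col in out.columns if col not in meta]
out = out[meta + record_cols]
```

With this approach the column order follows the JSON itself, so nothing needs to be hard-coded if the keys change.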