I have an Excel file with a string column in a nested JSON-like format. I would like to parse/expand it into separate columns.
The dataframe looks like this when I run df.head(2):
json_str
0 {"id":"lni001","pub_date":"20220301","doc_id":"7098727","unique_id":"64WP-UI-POLI","content":[{"c_id":"002","p_id":"P02","type":"org","source":"internet"},{"c_id":"003","p_id":"P03","type":"org","source":"internet"},{"c_id":"005","p_id":"K01","type":"people","source":"news"}]}
1 {"id":"lni002","pub_date":"20220301","doc_id":"7097889","unique_id":"64WP-UI-CFGT","content":[{"c_id":"012","p_id":"K21","type":"location","source":"internet"},{"c_id":"034","p_id":"P17","type":"people","source":"news"},{"c_id":"098","p_id":"K54","type":"people","source":"news"}]}
The structure of each row looks like this:
{
"id":"lni001",
"pub_date":"20220301",
"doc_id":"7098727",
"unique_id":"64WP-UI-POLI",
"content":[
{
"c_id":"002",
"p_id":"P02",
"type":"org",
"source":"internet"
},
{
"c_id":"003",
"p_id":"P03",
"type":"org",
"source":"internet"
},
{
"c_id":"005",
"p_id":"K01",
"type":"people",
"source":"news"
}
]
}
The type/class of each value in the column is str, as reported by type(df['json_str'].iloc[0]).
All the rows have the same structure/format, but some of them may have more information in content. In the example above, content holds 3 nested entries, but other rows may have 1, 2, 4, 5, or more.
The expected result looks like this:
id      pub_date  doc_id   unique_id     c_id  p_id  type      source
lni001  20220301  7098727  64WP-UI-POLI  002   P02   org       internet
lni001  20220301  7098727  64WP-UI-POLI  003   P03   org       internet
lni001  20220301  7098727  64WP-UI-POLI  005   K01   people    news
lni002  20220301  7097889  64WP-UI-CFGT  012   K21   location  internet
lni002  20220301  7097889  64WP-UI-CFGT  034   P17   people    news
lni002  20220301  7097889  64WP-UI-CFGT  098   K54   people    news
I have tried converting the column into a dictionary and extracting the information, but it doesn't work that well. Is there a better way to do it?
CodePudding user response:
Building off of @enke's answer, you could first convert the strings to Python dictionaries with ast.literal_eval, and then use pd.json_normalize:
import ast
import pandas as pd
data = df['json_str'].apply(ast.literal_eval).tolist()  # one dict per row
new_df = pd.json_normalize(data, ['content'], list(data[0].keys() - {'content'}))
If you care about the order of the columns, you can rearrange them:
new_df = new_df[['id', 'pub_date', 'doc_id', 'unique_id', 'c_id', 'p_id', 'type', 'source']]
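One caveat with ast.literal_eval is worth illustrating: it parses Python literals, not JSON, so it only works here because the sample strings happen to be valid in both syntaxes. The sketch below uses a hypothetical record (the "active" and "note" fields are not in the question's data) to show what may happen if the real file contains JSON-only tokens such as true or null:

```python
import ast
import json

# Hypothetical record: valid JSON, but not a valid Python literal,
# because of the lowercase `true` and `null`.
raw = '{"id": "lni003", "active": true, "note": null}'

# json.loads maps JSON true/null to Python True/None.
parsed = json.loads(raw)

# ast.literal_eval rejects the same string.
try:
    ast.literal_eval(raw)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False

print(parsed)
print(literal_eval_ok)
```

If the column is known to contain real JSON, json.loads is probably the safer default.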
CodePudding user response:
We could apply json.loads to each row and then use json_normalize:
import json
import pandas as pd
data = df['json_str'].apply(json.loads).tolist()  # list of dicts, one per row
out = (pd.json_normalize(data, ['content'], list(data[0].keys() - {'content'}))
       [['id', 'pub_date', 'doc_id', 'unique_id', 'c_id', 'p_id', 'type', 'source']])
Output:
   id      pub_date  doc_id   unique_id     c_id  p_id  type      source
0  lni001  20220301  7098727  64WP-UI-POLI  002   P02   org       internet
1  lni001  20220301  7098727  64WP-UI-POLI  003   P03   org       internet
2  lni001  20220301  7098727  64WP-UI-POLI  005   K01   people    news
3  lni002  20220301  7097889  64WP-UI-CFGT  012   K21   location  internet
4  lni002  20220301  7097889  64WP-UI-CFGT  034   P17   people    news
5  lni002  20220301  7097889  64WP-UI-CFGT  098   K54   people    news
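Here, data[0].keys() - {'content'} yields all keys other than "content" in the first dictionary; since every row has the same structure, these serve as the meta columns for every row. The set difference does not guarantee any particular order, which is why the columns are selected explicitly at the end.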
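For anyone who wants to try this without the Excel file, here is a minimal self-contained sketch that rebuilds the two sample rows from the question. It is a slight variation rather than the exact code above: the meta list is built with a list comprehension (my own substitution for the set difference) so the original key order is kept, and the column name json_str is taken from the question.

```python
import json
import pandas as pd

# The two sample rows from the question, as raw JSON strings.
rows = [
    ('{"id":"lni001","pub_date":"20220301","doc_id":"7098727",'
     '"unique_id":"64WP-UI-POLI","content":['
     '{"c_id":"002","p_id":"P02","type":"org","source":"internet"},'
     '{"c_id":"003","p_id":"P03","type":"org","source":"internet"},'
     '{"c_id":"005","p_id":"K01","type":"people","source":"news"}]}'),
    ('{"id":"lni002","pub_date":"20220301","doc_id":"7097889",'
     '"unique_id":"64WP-UI-CFGT","content":['
     '{"c_id":"012","p_id":"K21","type":"location","source":"internet"},'
     '{"c_id":"034","p_id":"P17","type":"people","source":"news"},'
     '{"c_id":"098","p_id":"K54","type":"people","source":"news"}]}'),
]
df = pd.DataFrame({"json_str": rows})

# Parse each string into a dict.
data = df["json_str"].apply(json.loads).tolist()

# Top-level keys other than "content", in their original order.
meta = [key for key in data[0] if key != "content"]

# One output row per element of "content", with the meta keys repeated.
out = pd.json_normalize(data, record_path="content", meta=meta)

# json_normalize puts the record columns first; move the meta columns to the front.
out = out[meta + [col for col in out.columns if col not in meta]]
print(out)
```

Because record_path expands however many entries each content list holds, rows with 1, 2, or 5 entries should need no special handling.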