Converting JSON file to CSV but all the attributes values separted by comman(,).
phone is multivalues attributes so 2 or more phone should be delimited by pipe(|) address is complex multivalues attribute so each value in address shoule be delimited by semicollon(;).
When I am converting json to csv I have only delimiter comma but not able to delimited multivalue and complex multivalue attributes.
code tried
df = pd.read_json("file")
df1 = df.to_csv("file", sep=",",index=False)
Input file in json
[
{
"parsed_address":[
{
"address_type":"primary",
"address_line_1":"abc",
"city":"jersey",
"state":"nj",
"postal_code":"073024588",
"country":"usa"
},
{
"address_type":"work",
"address_line_1":"xyz",
"city":"ny",
"state":"ns",
"postal_code":"073024533",
"country":"london"
}
],
"phone":[
{
"phone":" 12177218280",
"phone_type":"Mobile"
},
{
"phone":" 1217721340",
"phone_type":"Work"
}
],
"first_name":"saman",
"last_name":"zonouz"
},
]
Output file in CSV
first_name,last_name,phone,parsed_address
samon,zonouz, 12177218280| 1217721340,abc;jersey;nj;073024588;usa|xyz;ny;ns;073024533;london
CodePudding user response:
I think the easiest way is to build a new list of dictionaries with the right keys:
import pandas as pd
addresses = [
{
"parsed_address":[
{
"address_type":"primary",
"address_line_1":"abc",
"city":"jersey",
"state":"nj",
"postal_code":"073024588",
"country":"usa"
},
{
"address_type":"work",
"address_line_1":"xyz",
"city":"ny",
"state":"ns",
"postal_code":"073024533",
"country":"london"
}
],
"phone":[
{
"phone":" 12177218280",
"phone_type":"Mobile"
},
{
"phone":" 1217721340",
"phone_type":"Work"
}
],
"first_name":"saman",
"last_name":"zonouz"
}
]
formatted_addr = []
for addr in addresses:
new_dic={}
new_dic['first_name'] = addr['first_name']
new_dic['last_name'] = addr['last_name']
new_dic['phone'] = '|'.join([dic_phone['phone'] for dic_phone in addr['phone']])
new_dic['parsed_address'] = '|'.join(
[';'.join([dic_addr[key]
for key in dic_addr.keys() if key != 'address_type'])
for dic_addr in addr['parsed_address']])
formatted_addr.append(new_dic)
df = pd.DataFrame(formatted_addr)
df1 = df.to_csv('example.csv', sep=",",index=False)
Output:
first_name,last_name,phone,parsed_address
saman,zonouz, 12177218280| 1217721340,abc;jersey;nj;073024588;usa|xyz;ny;ns;073024533;london
CodePudding user response:
You can first load the json and then use pd.json_normalize
to get your data in the right format. Using record_path
and meta_var
allows to fetch data in nested levels of the json while still knowing to which entry they belong:
>>> with open('f.json') as f:
... data = json.load(f)
...
>>> phones = pd.json_normalize(data, record_path='phone', meta=['first_name', 'last_name'])
>>> phones
phone phone_type first_name last_name
0 12177218280 Mobile saman zonouz
1 1217721340 Work saman zonouz
>>> phones.groupby(['first_name', 'last_name']).agg('|'.join)
phone phone_type
first_name last_name
saman zonouz 12177218280| 1217721340 Mobile|Work
It is only slightly more complicated with the address as there is an additional join:
>>> addr = pd.json_normalize(data, record_path='parsed_address', meta=['first_name', 'last_name'])
>>> addr = addr[['first_name', 'last_name']].join(
... addr.drop(columns=['first_name', 'last_name']).agg(';'.join, axis='columns').rename('address')
... )
>>> addr
first_name last_name address
0 saman zonouz primary;abc;jersey;nj;073024588;usa
1 saman zonouz work;xyz;ny;ns;073024533;london
>>> addr.groupby(['first_name', 'last_name']).agg('|'.join)
address
first_name last_name
saman zonouz primary;abc;jersey;nj;073024588;usa|work;xyz;n...
So in the end you can just join these dataframes together:
>>> df = pd.merge(
... addr.groupby(['first_name', 'last_name']).agg('|'.join),
... phones.groupby(['first_name', 'last_name'])['phone'].agg('|'.join),
... on=['first_name', 'last_name']
... )
>>> df
address phone
first_name last_name
saman zonouz primary;abc;jersey;nj;073024588;usa|work;xyz;n... 12177218280| 1217721340
>>> df.to_csv('test.csv')
Which results in the following file:
first_name,last_name,address,phone
saman,zonouz,primary;abc;jersey;nj;073024588;usa|work;xyz;ny;ns;073024533;london, 12177218280| 1217721340
A completely generic version of this code, but it keeps the phone type, is as follows:
# Load data naively
df = pd.json_normalize(data)
# Remove all columns that have nesting
nested_cols = df.columns[df.applymap(type).eq(list).any()]
df = df.drop(columns=nested_cols)
# Remaining columns identify csv rows
id_cols = df.columns.to_list()
# Columns we are not interested in, key is first-level nesting column
drop_cols = {
'phone': ['phone_type'],
}
# Now for each nested column, join columns with ; then join rows with |
for col in nested_cols:
nested = pd.json_normalize(data, record_path=col, meta=id_cols)
nested = nested[id_cols].join(
nested.drop(columns=id_cols drop_cols.get(col, [])).agg(';'.join, axis='columns').rename(col)
).groupby(id_cols).agg('|'.join)
df = df.merge(nested, on=id_cols)
This gives you the following dataframe:
first_name last_name parsed_address phone
0 saman zonouz primary;abc;jersey;nj;073024588;usa|work;xyz;n... 12177218280| 1217721340