Converting Json to CSV file but Separator are different based on attributes-CodePudding

Converting JSON file to CSV but all the attributes values separted by comman(,).

phone is multivalues attributes so 2 or more phone should be delimited by pipe(|) address is complex multivalues attribute so each value in address shoule be delimited by semicollon(;).

When I am converting json to csv I have only delimiter comma but not able to delimited multivalue and complex multivalue attributes.

code tried

df = pd.read_json("file")
df1 = df.to_csv("file", sep=",",index=False)

Input file in json

[
   {
      "parsed_address":[
         {
            "address_type":"primary",
            "address_line_1":"abc",
            "city":"jersey",
            "state":"nj",
            "postal_code":"073024588",
            "country":"usa"
         },
         {
            "address_type":"work",
            "address_line_1":"xyz",
            "city":"ny",
            "state":"ns",
            "postal_code":"073024533",
            "country":"london"
         }
      ],
      "phone":[
         {
            "phone":" 12177218280",
            "phone_type":"Mobile"
         },
         {
            "phone":" 1217721340",
            "phone_type":"Work"
         }
      ],
      "first_name":"saman",
      "last_name":"zonouz"
   },
]

Output file in CSV

first_name,last_name,phone,parsed_address
samon,zonouz, 12177218280| 1217721340,abc;jersey;nj;073024588;usa|xyz;ny;ns;073024533;london

CodePudding user response：

I think the easiest way is to build a new list of dictionaries with the right keys:

import pandas as pd

addresses = [
  {
      "parsed_address":[
         {
            "address_type":"primary",
            "address_line_1":"abc",
            "city":"jersey",
            "state":"nj",
            "postal_code":"073024588",
            "country":"usa"
         },
         {
            "address_type":"work",
            "address_line_1":"xyz",
            "city":"ny",
            "state":"ns",
            "postal_code":"073024533",
            "country":"london"
         }
      ],
      "phone":[
         {
            "phone":" 12177218280",
            "phone_type":"Mobile"
         },
         {
            "phone":" 1217721340",
            "phone_type":"Work"
         }
      ],
      "first_name":"saman",
      "last_name":"zonouz"
   }
]

formatted_addr = []
for addr in addresses:
    new_dic={}
    new_dic['first_name'] = addr['first_name']
    new_dic['last_name'] = addr['last_name']
    new_dic['phone'] = '|'.join([dic_phone['phone'] for dic_phone in addr['phone']])
    new_dic['parsed_address'] = '|'.join(
                                    [';'.join([dic_addr[key] 
                                    for key in dic_addr.keys() if key != 'address_type'])
                                    for dic_addr in addr['parsed_address']])
    formatted_addr.append(new_dic)

df = pd.DataFrame(formatted_addr)
df1 = df.to_csv('example.csv', sep=",",index=False)

Output:

first_name,last_name,phone,parsed_address
saman,zonouz, 12177218280| 1217721340,abc;jersey;nj;073024588;usa|xyz;ny;ns;073024533;london

CodePudding user response：

You can first load the json and then use pd.json_normalize to get your data in the right format. Using record_path and meta_var allows to fetch data in nested levels of the json while still knowing to which entry they belong:

>>> with open('f.json') as f:
...   data = json.load(f)
... 
>>> phones = pd.json_normalize(data, record_path='phone', meta=['first_name', 'last_name'])
>>> phones
          phone phone_type first_name last_name
0   12177218280     Mobile      saman    zonouz
1    1217721340       Work      saman    zonouz
>>> phones.groupby(['first_name', 'last_name']).agg('|'.join)
                                         phone   phone_type
first_name last_name                                       
saman      zonouz      12177218280| 1217721340  Mobile|Work

It is only slightly more complicated with the address as there is an additional join:

>>> addr = pd.json_normalize(data, record_path='parsed_address', meta=['first_name', 'last_name'])
>>> addr = addr[['first_name', 'last_name']].join(
...     addr.drop(columns=['first_name', 'last_name']).agg(';'.join, axis='columns').rename('address')
... )
>>> addr
  first_name last_name                              address
0      saman    zonouz  primary;abc;jersey;nj;073024588;usa
1      saman    zonouz      work;xyz;ny;ns;073024533;london
>>> addr.groupby(['first_name', 'last_name']).agg('|'.join)
                                                                address
first_name last_name                                                   
saman      zonouz     primary;abc;jersey;nj;073024588;usa|work;xyz;n...

So in the end you can just join these dataframes together:

>>> df = pd.merge(
...     addr.groupby(['first_name', 'last_name']).agg('|'.join), 
...     phones.groupby(['first_name', 'last_name'])['phone'].agg('|'.join), 
...     on=['first_name', 'last_name']
... )
>>> df
                                                                address                     phone
first_name last_name                                                                             
saman      zonouz     primary;abc;jersey;nj;073024588;usa|work;xyz;n...   12177218280| 1217721340
>>> df.to_csv('test.csv')

Which results in the following file:

first_name,last_name,address,phone
saman,zonouz,primary;abc;jersey;nj;073024588;usa|work;xyz;ny;ns;073024533;london, 12177218280| 1217721340

A completely generic version of this code, but it keeps the phone type, is as follows:

# Load data naively
df = pd.json_normalize(data)

# Remove all columns that have nesting
nested_cols = df.columns[df.applymap(type).eq(list).any()]
df = df.drop(columns=nested_cols)

# Remaining columns identify csv rows
id_cols = df.columns.to_list()

# Columns we are not interested in, key is first-level nesting column
drop_cols = {
   'phone': ['phone_type'],
}

# Now for each nested column, join columns with ; then join rows with |
for col in nested_cols:
   nested = pd.json_normalize(data, record_path=col, meta=id_cols)
   nested = nested[id_cols].join(
      nested.drop(columns=id_cols   drop_cols.get(col, [])).agg(';'.join, axis='columns').rename(col)
   ).groupby(id_cols).agg('|'.join)
   df = df.merge(nested, on=id_cols)

This gives you the following dataframe:

  first_name last_name                                     parsed_address                     phone
0      saman    zonouz  primary;abc;jersey;nj;073024588;usa|work;xyz;n...   12177218280| 1217721340