Converting complex nested json to csv via pandas-CodePudding

I have the following json file

{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}

Note - Above json is fragment of original json. The actual file contains lot more attributes after 'margins', some of them nested and others not so. I just put some for brevity and to give an idea of expectations.

My goal is to flatten the data and load it into CSV. Here is the code I have written so far -

import json
import pandas as pd

path = r"/Users/samt/Downloads/test_data.json"

with open(path) as f:
    t_data = {}
    data = json.load(f)
    for team in data['matches']:
        if team['margins']:
            for idx, margin in enumerate(team['margins']):
                t_data['team'] = team['team']
                t_data['overallResult'] = team['overallResult']
                t_data['totalMatches'] = team['totalMatches']
                t_data['margin'] = margin.get('bar')
        else:
            t_data['team'] = team['team']
            t_data['overallResult'] = team['overallResult']
            t_data['totalMatches'] = team['totalMatches']
            t_data['margin'] = margin.get('bar')

    df = pd.DataFrame.from_dict(t_data, orient='index')
    print(df)

I know that data is getting over-written and loop is not properly structured.I am bit new to dealing with JSON objects using Python and I am not able to understand how to concate the results.

My goal is once, all the results are appended, use to_csv and convert them into rows. For each margin, the entire data is to be replicated as a seperate row. Here is what I am expecting the output to be. Can someone please help how to translate this?

From whatever I find on the net, it is about first gathering the dictionary items but how to transpose it to rows is something I am not able to understand. Also, is there a better way to parse the json than doing the loop twice for one attribute i.e. margins?

I can't use json_normalize as that library is not supported in our environment.

[output data]

CodePudding user response：

You can use pd.DataFrame to create DataFrame and explode the margins column

import json
import pandas as pd

with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())

df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)

print(df)

                  team overallResult  totalMatches       margins
0  Sunrisers Hyderabad           Won             3  {'bar': 290}
1  Sunrisers Hyderabad           Won             3   {'bar': 90}
2        Pune Warriors          None             0          None

Then fill the None value in margins column to dictionary and convert it to column

bar = df['margins'].apply(lambda x: x if x else {'bar': pd.NA}).apply(pd.Series)

print(bar)

    bar
0   290
1    90
2  <NA>

At last, join the Series to original dataframe

df = df.join(bar).drop(columns='margins')

print(df)

                  team overallResult  totalMatches   bar
0  Sunrisers Hyderabad           Won             3   290
1  Sunrisers Hyderabad           Won             3    90
2        Pune Warriors          None             0  <NA>

CodePudding user response：

Using the json and csv modules: create a dictionary for each team, for each margin if there is one.

import json, csv

s = '''{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}'''

j = json.loads(s)

matches = j['matches']
rows = []
for thing in matches:
    # print(thing)
    if not thing['margins']:
        rows.append(thing)
    else:
        for bar in (b['bar'] for b in thing['margins']):
            d = dict((k,thing[k]) for k in ('team','overallResult','totalMatches'))
            d['margins'] = bar
            rows.append(d)

# for row in rows: print(row)            

# using an in-memory stream for this example instead of an actual file
import io
f = io.StringIO(newline='')

fieldnames=('team','overallResult','totalMatches','margins')
writer = csv.DictWriter(f,fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
f.seek(0)
print(f.read())

team,overallResult,totalMatches,margins
Sunrisers Hyderabad,Won,3,290
Sunrisers Hyderabad,Won,3,90
Pune Warriors,None,0,

Getting multiple item values from a dictionary can be aided by using operator.itemgetter()

>>> import operator
>>> items = operator.itemgetter(*('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter('team','overallResult','totalMatches')
>>> #stuff = ('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter(*stuff)
>>> d = {'margins': 90,
...   'overallResult': 'Won',
...   'team': 'Sunrisers Hyderabad',
...   'totalMatches': 3}
>>> items(d)
('Sunrisers Hyderabad', 'Won', 3)
>>>

I like to use use it and give the callable a descriptive name but I don't see it used much here on SO.