How to convert nested json in csv with pandas-CodePudding

I have a nested json file (100k rows), which looks like this:

{"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}

I am trying to create a csv, so that it can easily be loaded in a rdbms. I am trying to use json_normalize() in pandas but even before I get there I am getting below error.

with open('transactions.json') as data_file:    
    data = json.load(data_file)

JSONDecodeError: Extra data: line 2 column 1 (char 466)

CodePudding user response：

If your problem originates in reading the json file itself, then i would just use:

json.loads()

and then use

pd.read_csv()

If your problem originates in the conversion from your json dict to dataframe you can use this:

test = {"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}

import json
import pandas

# convert json to string and read
df = pd.read_json(json.dumps(test), convert_axes=True)

# 'unpack' the dict as series and merge them with original df
df = pd.concat([df, df.Segment.apply(pd.Series)], axis=1)

# remove dict
df.drop('Segment', axis=1, inplace=True)

That would be my approach but there might be more convenient approaches.

CodePudding user response：

@wolfstter has provided a suggestion on how to handle one record. Now, you need to loop over all the records in your file, which you can do like this:

with open('transactions.json', encoding="utf8") as data_file:
    for line in data_file:
        df = pd.read_json(line, convert_axes=True)
        # or: data = json.loads(line) 
        ...