I have a nested json file (100k rows), which looks like this:
{"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}
I am trying to create a csv, so that it can easily be loaded in a rdbms. I am trying to use json_normalize() in pandas but even before I get there I am getting below error.
with open('transactions.json') as data_file:
data = json.load(data_file)
JSONDecodeError: Extra data: line 2 column 1 (char 466)
CodePudding user response:
If your problem originates in reading the json file itself, then i would just use:
json.loads()
and then use
pd.read_csv()
If your problem originates in the conversion from your json dict to dataframe you can use this:
test = {"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}
import json
import pandas
# convert json to string and read
df = pd.read_json(json.dumps(test), convert_axes=True)
# 'unpack' the dict as series and merge them with original df
df = pd.concat([df, df.Segment.apply(pd.Series)], axis=1)
# remove dict
df.drop('Segment', axis=1, inplace=True)
That would be my approach but there might be more convenient approaches.
CodePudding user response:
@wolfstter has provided a suggestion on how to handle one record. Now, you need to loop over all the records in your file, which you can do like this:
with open('transactions.json', encoding="utf8") as data_file:
for line in data_file:
df = pd.read_json(line, convert_axes=True)
# or: data = json.loads(line)
...