How do I convert multiple JSON files with unidentical structure to a single pandas dataframe?-CodePudding

The input is many JSON files differing in structure, and the desired output is a single dataframe.

Input Description:

Each JSON file may have 1 or many attackers and exactly 1 victim. The attackers key points to a list of dictionaries. Each dictionary is 1 attacker with keys such as character_id, corporation_id, alliance_id, etc. The victim key points to dictionary with similar keys. Important thing to note here is that the keys might differ between the same JSON. For example, a JSON file may have attackers key which looks like this:

{
    "attackers": [
        {
            "alliance_id": 99005678,
            "character_id": 94336577,
            "corporation_id": 98224639,
            "damage_done": 3141,
            "faction_id": 500003,
            "final_blow": true,
            "security_status": -9.4,
            "ship_type_id": 73796,
            "weapon_type_id": 3178
        },
        {
            "damage_done": 1614,
            "faction_id": 500003,
            "final_blow": false,
            "security_status": 0,
            "ship_type_id": 32963
        }
    ],
...

Here the JSON file has 2 attackers. But only the first attacker has the afore-mentioned keys. Similarly, the victim may look like this:

...
"victim": {
        "character_id": 2119076173,
        "corporation_id": 98725195,
        "damage_taken": 4755,
        "faction_id": 500002,
        "items": [...
...

Output Description:

As an output I want to create a dataframe from many (about 400,000) such JSON files stored in the same directory. Each row of the resulting dataframe should have 1 attacker and 1 victim. JSONs with multiple attackers should be split into equal number of rows, where the attackers' properties are different, but the victim properties are the same. For e.g., 3 rows if there are 3 attackers and NaN values where a certain attacker doesn't have a key-value pair. So, the character_id for the second attacker in the dataframe of the above example should be NaN.

Current Method:

To achieve this, I first create an empty list. Then iterate through all the files, open them, load them as JSON objects, convert to dataframe then append dataframe to the list. Please note that pd.DataFrame([json.load(fi)]) has the same output as pd.json_normalize(json.load(fi)).

mainframe = []

for file in tqdm(os.listdir("D:/Master/killmails_jul"), ncols=100, ascii='          >'):
    with open("%s/%s" % ("D:/Master/killmails_jul", file),'r') as fi:
        mainframe.append(pd.DataFrame([json.load(fi)]))

After this loop, I am left with a list of dataframes which I concatenate using pd.concat().

mainframe = pd.concat(mainframe)

As of yet, the dataframe only has 1 row per JSON irrespective of the number of attackers. To fix this, I use pd.explode() in the next step.

mainframe = mainframe.explode('attackers')
mainframe.reset_index(drop=True, inplace=True)

Now I have separate rows for each attacker, however the attackers & victim keys are still hidden in their respective column. To fix this I 'explode' the the two columns horizontally by pd.apply(pd.Series) and apply prefix for easy recognition as follows:

intframe = mainframe["attackers"].apply(pd.Series).add_prefix("attackers_").join(mainframe["victim"].apply(pd.Series).add_prefix("victim_"))

In the next step I join this intermediate frame with the mainframe to retain the killmail_id and killmail_hash columns. Then remove the attackers & victim columns as I have now expanded them.

mainframe = intframe.join(mainframe)
mainframe.fillna(0, inplace=True)
mainframe.drop(['attackers','victim'], axis=1, inplace=True)

This gives me the desired output with the following 24 columns:

['attackers_character_id', 'attackers_corporation_id', 'attackers_damage_done', 'attackers_final_blow', 'attackers_security_status', 'attackers_ship_type_id', 'attackers_weapon_type_id', 'attackers_faction_id', 'attackers_alliance_id', 'victim_character_id', 'victim_corporation_id', 'victim_damage_taken', 'victim_items', 'victim_position', 'victim_ship_type_id', 'victim_alliance_id', 'victim_faction_id', 'killmail_id', 'killmail_time', 'solar_system_id', 'killmail_hash', 'http_last_modified', 'war_id', 'moon_id']

Question:

Is there a better way to do this than I am doing right now? I tried to use generators but couldn't get them to work. I get an AttributeError: 'str' object has no attribute 'read'

all_files_paths = glob(os.path.join('D:\\Master\\kmrest', '*.json'))

def gen_df(files):
    for file in files:
        with open(file, 'r'):
            data = json.load(file)
        data = pd.DataFrame([data])
        yield data

mainframe = pd.concat(gen_df(all_files_paths), ignore_index=True)

Will using the pd.concat() function with generators lead to quadratic copying? Also, I am worried opening and closing many files is slowing down computation. Maybe it would be better to create a JSONL file from all the JSONs first and then creating a dataframe for each line.

If you'd like to get your hands on the files, I am trying to work with you can click here. Let me know if further information is needed.

CodePudding user response：

You could use pd.json_normalize() to help with the heavy lifting:

First, load your data:

import json
import requests
import tarfile
from tqdm.notebook import tqdm

url = 'https://data.everef.net/killmails/2022/killmails-2022-11-22.tar.bz2'
with requests.get(url, stream=True) as r:
    fobj = io.BytesIO(r.raw.read())
    with tarfile.open(fileobj=fobj, mode='r:bz2') as tar:
        json_files = [it for it in tar if it.name.endswith('.json')]
        data = [json.load(tar.extractfile(it)) for it in tqdm(json_files)]

To do the same with your files:

import json
from glob import glob

def json_load(filename):
    with open(filename) as f:
        return json.load(f)

topdir = '...'  # the dir containing all your json files
data = [json_load(fn) for fn in tqdm(glob(f'{topdir}/*.json'))]

Once you have a list of dicts in data:

others = ['killmail_id', 'killmail_hash']
a = pd.json_normalize(data, 'attackers', others, record_prefix='attackers.')
v = pd.json_normalize(data).drop('attackers', axis=1)
df = a.merge(v, on=others)

Some quick inspection:

>>> df.shape
(44903, 26)

# check:
>>> sum([len(d['attackers']) for d in data])
44903

>>> df.columns
Index(['attackers.alliance_id', 'attackers.character_id',
       'attackers.corporation_id', 'attackers.damage_done',
       'attackers.final_blow', 'attackers.security_status',
       'attackers.ship_type_id', 'attackers.weapon_type_id',
       'attackers.faction_id', 'killmail_id', 'killmail_hash', 'killmail_time',
       'solar_system_id', 'http_last_modified', 'victim.alliance_id',
       'victim.character_id', 'victim.corporation_id', 'victim.damage_taken',
       'victim.items', 'victim.position.x', 'victim.position.y',
       'victim.position.z', 'victim.ship_type_id', 'victim.faction_id',
       'war_id', 'moon_id'],
      dtype='object')

>>> df.iloc[:5, :5]
   attackers.alliance_id  attackers.character_id  attackers.corporation_id  attackers.damage_done  attackers.final_blow
0  99007887.0             1.450608e 09            2.932806e 08              1426                   False               
1  99010931.0             1.628193e 09            5.668252e 08              1053                   False               
2  99007887.0             1.841341e 09            1.552312e 09              1048                   False               
3  99007887.0             2.118406e 09            9.872458e 07               662                   False               
4  99005839.0             9.573650e 07            9.947834e 08               630                   False               

>>> df.iloc[-5:, -5:]
       victim.position.z  victim.ship_type_id  victim.faction_id  war_id  moon_id
44898  1.558110e 11       670                 NaN                NaN     NaN     
44899 -7.678686e 10       670                 NaN                NaN     NaN     
44900 -7.678686e 10       670                 NaN                NaN     NaN     
44901 -7.678686e 10       670                 NaN                NaN     NaN     
44902 -7.678686e 10       670                 NaN                NaN     NaN

Note also that, as desired, missing keys for attackers are NaN:

>>> df.iloc[15:20, :2]
    attackers.alliance_id  attackers.character_id
15  99007887.0             2.117497e 09          
16  99011893.0             1.593514e 09          
17         NaN             9.175132e 07          
18         NaN             2.119191e 09          
19  99011258.0             1.258332e 09