The input is many JSON files differing in structure, and the desired output is a single dataframe.
Input Description:
Each JSON file may have 1 or many attackers and exactly 1 victim. The attackers
key points to a list of dictionaries. Each dictionary is 1 attacker with keys such as character_id
, corporation_id
, alliance_id
, etc. The victim
key points to dictionary with similar keys. Important thing to note here is that the keys might differ between the same JSON. For example, a JSON file may have attackers
key which looks like this:
{
"attackers": [
{
"alliance_id": 99005678,
"character_id": 94336577,
"corporation_id": 98224639,
"damage_done": 3141,
"faction_id": 500003,
"final_blow": true,
"security_status": -9.4,
"ship_type_id": 73796,
"weapon_type_id": 3178
},
{
"damage_done": 1614,
"faction_id": 500003,
"final_blow": false,
"security_status": 0,
"ship_type_id": 32963
}
],
...
Here the JSON file has 2 attackers. But only the first attacker has the afore-mentioned keys. Similarly, the victim
may look like this:
...
"victim": {
"character_id": 2119076173,
"corporation_id": 98725195,
"damage_taken": 4755,
"faction_id": 500002,
"items": [...
...
Output Description:
As an output I want to create a dataframe from many (about 400,000) such JSON files stored in the same directory. Each row of the resulting dataframe should have 1 attacker and 1 victim. JSONs with multiple attackers should be split into equal number of rows, where the attackers' properties are different, but the victim properties are the same. For e.g., 3 rows if there are 3 attackers and NaN
values where a certain attacker doesn't have a key-value pair. So, the character_id
for the second attacker in the dataframe of the above example should be NaN
.
Current Method:
To achieve this, I first create an empty list. Then iterate through all the files, open them, load them as JSON objects, convert to dataframe then append dataframe to the list. Please note that pd.DataFrame([json.load(fi)])
has the same output as pd.json_normalize(json.load(fi))
.
mainframe = []
for file in tqdm(os.listdir("D:/Master/killmails_jul"), ncols=100, ascii=' >'):
with open("%s/%s" % ("D:/Master/killmails_jul", file),'r') as fi:
mainframe.append(pd.DataFrame([json.load(fi)]))
After this loop, I am left with a list of dataframes which I concatenate using pd.concat()
.
mainframe = pd.concat(mainframe)
As of yet, the dataframe only has 1 row per JSON irrespective of the number of attackers
. To fix this, I use pd.explode()
in the next step.
mainframe = mainframe.explode('attackers')
mainframe.reset_index(drop=True, inplace=True)
Now I have separate rows for each attacker, however the attackers
& victim
keys are still hidden in their respective column. To fix this I 'explode' the the two columns horizontally by pd.apply(pd.Series)
and apply prefix for easy recognition as follows:
intframe = mainframe["attackers"].apply(pd.Series).add_prefix("attackers_").join(mainframe["victim"].apply(pd.Series).add_prefix("victim_"))
In the next step I join this intermediate frame with the mainframe to retain the killmail_id
and killmail_hash
columns. Then remove the attackers
& victim
columns as I have now expanded them.
mainframe = intframe.join(mainframe)
mainframe.fillna(0, inplace=True)
mainframe.drop(['attackers','victim'], axis=1, inplace=True)
This gives me the desired output with the following 24 columns:
['attackers_character_id', 'attackers_corporation_id', 'attackers_damage_done', 'attackers_final_blow', 'attackers_security_status', 'attackers_ship_type_id', 'attackers_weapon_type_id', 'attackers_faction_id', 'attackers_alliance_id', 'victim_character_id', 'victim_corporation_id', 'victim_damage_taken', 'victim_items', 'victim_position', 'victim_ship_type_id', 'victim_alliance_id', 'victim_faction_id', 'killmail_id', 'killmail_time', 'solar_system_id', 'killmail_hash', 'http_last_modified', 'war_id', 'moon_id']
Question:
Is there a better way to do this than I am doing right now? I tried to use generators but couldn't get them to work. I get an AttributeError: 'str' object has no attribute 'read'
all_files_paths = glob(os.path.join('D:\\Master\\kmrest', '*.json'))
def gen_df(files):
for file in files:
with open(file, 'r'):
data = json.load(file)
data = pd.DataFrame([data])
yield data
mainframe = pd.concat(gen_df(all_files_paths), ignore_index=True)
Will using the pd.concat()
function with generators lead to quadratic copying?
Also, I am worried opening and closing many files is slowing down computation. Maybe it would be better to create a JSONL file from all the JSONs first and then creating a dataframe for each line.
If you'd like to get your hands on the files, I am trying to work with you can click here. Let me know if further information is needed.
CodePudding user response:
You could use pd.json_normalize()
to help with the heavy lifting:
First, load your data:
import json
import requests
import tarfile
from tqdm.notebook import tqdm
url = 'https://data.everef.net/killmails/2022/killmails-2022-11-22.tar.bz2'
with requests.get(url, stream=True) as r:
fobj = io.BytesIO(r.raw.read())
with tarfile.open(fileobj=fobj, mode='r:bz2') as tar:
json_files = [it for it in tar if it.name.endswith('.json')]
data = [json.load(tar.extractfile(it)) for it in tqdm(json_files)]
To do the same with your files:
import json
from glob import glob
def json_load(filename):
with open(filename) as f:
return json.load(f)
topdir = '...' # the dir containing all your json files
data = [json_load(fn) for fn in tqdm(glob(f'{topdir}/*.json'))]
Once you have a list of dicts in data
:
others = ['killmail_id', 'killmail_hash']
a = pd.json_normalize(data, 'attackers', others, record_prefix='attackers.')
v = pd.json_normalize(data).drop('attackers', axis=1)
df = a.merge(v, on=others)
Some quick inspection:
>>> df.shape
(44903, 26)
# check:
>>> sum([len(d['attackers']) for d in data])
44903
>>> df.columns
Index(['attackers.alliance_id', 'attackers.character_id',
'attackers.corporation_id', 'attackers.damage_done',
'attackers.final_blow', 'attackers.security_status',
'attackers.ship_type_id', 'attackers.weapon_type_id',
'attackers.faction_id', 'killmail_id', 'killmail_hash', 'killmail_time',
'solar_system_id', 'http_last_modified', 'victim.alliance_id',
'victim.character_id', 'victim.corporation_id', 'victim.damage_taken',
'victim.items', 'victim.position.x', 'victim.position.y',
'victim.position.z', 'victim.ship_type_id', 'victim.faction_id',
'war_id', 'moon_id'],
dtype='object')
>>> df.iloc[:5, :5]
attackers.alliance_id attackers.character_id attackers.corporation_id attackers.damage_done attackers.final_blow
0 99007887.0 1.450608e 09 2.932806e 08 1426 False
1 99010931.0 1.628193e 09 5.668252e 08 1053 False
2 99007887.0 1.841341e 09 1.552312e 09 1048 False
3 99007887.0 2.118406e 09 9.872458e 07 662 False
4 99005839.0 9.573650e 07 9.947834e 08 630 False
>>> df.iloc[-5:, -5:]
victim.position.z victim.ship_type_id victim.faction_id war_id moon_id
44898 1.558110e 11 670 NaN NaN NaN
44899 -7.678686e 10 670 NaN NaN NaN
44900 -7.678686e 10 670 NaN NaN NaN
44901 -7.678686e 10 670 NaN NaN NaN
44902 -7.678686e 10 670 NaN NaN NaN
Note also that, as desired, missing keys for attackers are NaN
:
>>> df.iloc[15:20, :2]
attackers.alliance_id attackers.character_id
15 99007887.0 2.117497e 09
16 99011893.0 1.593514e 09
17 NaN 9.175132e 07
18 NaN 2.119191e 09
19 99011258.0 1.258332e 09