I'm trying to flatten a json database into pandas dataframe and as it's the first time I'm dealing with json format, I can't do what I want. The database is located here https://mtgjson.com/downloads/all-files/#allprices and according to the model, the structure is like this :
{
"0120a941-9cfb-50b5-b5e4-4e0c7bd32410": {
"mtgo": {
"cardhoarder": {
"currency": "USD",
"retail": {
"foil": {
..., // more rows
"2020-04-21": 0.02
},
"normal": {
..., // more rows
"2020-04-21": 0.02
}
}
},
},
"paper": {
"cardkingdom" : {
"buylist": {
"foil": {
..., // more rows
"2020-04-21": 0.6
},
"normal": {
..., // more rows
"2020-04-21": 0.01
}
},
"currency": "USD",
"retail": {
"foil": {
..., // more rows
"2020-04-21": 0.12
},
"normal": {
..., // more rows
"2020-04-21": 0.02
}
}
},
"cardmarket": {
"currency": "EUR",
"retail": {
"foil": {
..., // more rows
"2020-04-21": 0.12
},
"normal": {
..., // more rows
"2020-04-21": 0.02
}
}
},
"tcgplayer": {
"currency": "USD",
"retail": {
"foil": {
..., // more rows
"2020-04-21": 0.12
},
"normal": {
..., // more rows
"2020-04-21": 0.02
}
}
}
}
}
}
When I look into the json file I have this :
{"meta": {"date": "2021-11-07", "version": "5.1.0 20211107"}, "data": {"00010d56-fe38-5e35-8aed-518019aa36a5": {"paper": {"cardkingdom": {"buylist": {"foil.....
When I do the basic pd.read_json('AllPrices.json')
I got this
meta | data | |
---|---|---|
date | 2021-11-07 | NaN |
version | 5.1.0 20211107 | NaN |
00010d56-fe38-5e35-8aed-518019aa36a5 | NaN | {'paper': {'cardkingdom': {'buylist': {'foil':... |
0001e0d0-2dcd-5640-aadc-a84765cf5fc9 | NaN | {'paper': {'cardkingdom': {'buylist': {'normal... |
So I did some research and found the json_normalize
and wrote this piece of code :
with open('AllPrices.json','r') as f:
data = json.loads(f.read())
pd.json_normalize(data, errors='ignore')
This did the job by flattening the json database but I ended with one row and 31 millions columns. What I want is only one information in this database that is the uuid and the cardmarket price of a normal paper card on the date I want like this :
uuid | paper.cardmarket.retail.normal.2021-11-07 |
---|---|
00010d56-fe38-5e35-8aed-518019aa36a5 | 0.5 |
0001e0d0-2dcd-5640-aadc-a84765cf5fc9 | 0.25 |
I played with the record_path =
parameter and the meta =
parameter but the best I did was not my expected table.
I tried record_path = ['data']
that give me only the uuid in one column.
Thanks for your help
CodePudding user response:
In your case, use json_normalize
for each uuid
record then extract the desired information 'paper.cardmarket.retail.normal':
with open('AllPrices.json') as fp:
prices = json.load(fp)
data = []
for uuid in prices['data']:
df = pd.json_normalize(prices['data'][uuid]) \
.filter(like='paper.cardmarket.retail.normal')
if df.empty:
continue
df.columns = df.columns.str.rsplit('.', 1).str[-1]
df.index = [uuid]
data.append(df)
df = pd.concat(data)
Output: (tested on AllPrices.json
file)
2021-08-09 2021-08-11 2021-08-12 2021-08-13 ... 2021-10-27 2021-10-28 2021-11-02 2021-11-09
00010d56-fe38-5e35-8aed-518019aa36a5 4.35 4.35 4.35 4.35 ... 4.35 4.35 4.35 4.35
0001e0d0-2dcd-5640-aadc-a84765cf5fc9 4.95 4.95 3.45 3.45 ... 7.99 6.81 5.42 6.77
0003caab-9ff5-5d1a-bc06-976dd0457f19 0.24 0.27 0.07 0.38 ... 0.25 0.04 0.13 0.36
0003d249-25d9-5223-af1e-1130f09622a7 0.30 0.30 0.75 0.75 ... 0.15 0.20 0.04 0.25
0004a4fb-92c6-59b2-bdbe-ceb584a9e401 0.27 0.14 0.19 0.10 ... 0.10 0.05 0.19 0.13
... ... ... ... ... ... ... ... ... ...
fffa4ccf-733e-513a-98f9-181b9549de62 0.23 0.15 0.21 0.21 ... 0.10 0.21 0.15 0.20
fffb659e-b3fa-5cd8-9423-fe5ac74248b5 0.49 0.49 0.49 0.49 ... 0.35 0.35 0.35 0.20
fffbc95a-c4d1-56aa-8653-8a7c71fe19ce 6.95 6.95 6.95 6.95 ... 10.26 10.26 10.26 6.43
fffc1305-a118-559b-9504-3d7b56ca0bde 0.18 0.18 0.18 0.18 ... 0.04 0.04 0.04 0.04
fffdd333-3789-5104-a8be-37be199a2cb1 0.87 0.73 0.99 0.99 ... 0.49 0.45 0.45 0.15
[50568 rows x 75 columns]
CodePudding user response:
You can define a function that will get the value in the dictionary and apply this function to df['data']
:
df = pd.read_json('AllPrices.json')
def extract_from_json(cell):
if isinstance(cell, dict) and "paper" in cell.keys():
return list(cell["paper"]["cardmarket"]["retail"]["normal"].values())[0]
return cell
df["data"]=df["data"].apply(extract_from_json)
print(df)`
This will return the price for the first date in the list (index 0). If you only want the price at a specific date your function should return the following:
return cell["paper"]["cardmarket"]["retail"]["normal"].get("2021-11-07", None)