I recovered a list of ISO 3166-2 countries and regions in this Github repository. I managed to have a first look of the regions using the following code:
import pandas as pd
import json
data = "/content/data.json"
df = pd.read_json(data)
df = df.T
Which gives the following output:
name | divisions |
---|---|
Afghanistan | {'AF-BDS': 'Badakhshān', 'AF-BDG': 'Bādghīs', 'AF-BGL': 'Baghlān', 'AF-BAL': 'Balkh', 'AF-BAM': 'Bāmīān', 'AF-FRA': 'Farāh', 'AF-FYB': 'Fāryāb', 'AF-GHA': 'Ghaznī', 'AF-GHO': 'Ghowr', 'AF-HEL': 'Helmand', 'AF-HER': 'Herāt', 'AF-JOW': 'Jowzjān', 'AF-KAB': 'Kabul (Kābol)', 'AF-KAN': 'Kandahār', 'AF-KAP': 'Kāpīsā', 'AF-KNR': 'Konar (Kunar)', 'AF-KDZ': 'Kondoz (Kunduz)', 'AF-LAG': 'Laghmān', 'AF-LOW': 'Lowgar', 'AF-NAN': 'Nangrahār (Nangarhār)', 'AF-NIM': 'Nīmrūz', 'AF-ORU': 'Orūzgān (Urūzgā', 'AF-PIA': 'Paktīā', 'AF-PKA': 'Paktīkā', 'AF-PAR': 'Parwān', 'AF-SAM': 'Samangān', 'AF-SAR': 'Sar-e Pol', 'AF-TAK': 'Takhār', 'AF-WAR': 'Wardak (Wardag)', 'AF-ZAB': 'Zābol (Zābul)'} |
Albania | {'AL-BR': 'Berat', 'AL-BU': 'Bulqizë', 'AL-DL': 'Delvinë', 'AL-DV': 'Devoll', 'AL-DI': 'Dibër', 'AL-DR': 'Durrës', 'AL-EL': 'Elbasan', 'AL-FR': 'Fier', 'AL-GR': 'Gramsh', 'AL-GJ': 'Gjirokastër', 'AL-HA': 'Has', 'AL-KA': 'Kavajë', 'AL-ER': 'Kolonjë', 'AL-KO': 'Korcë', 'AL-KR': 'Krujë', 'AL-KC': 'Kucovë', 'AL-KU': 'Kukës', 'AL-LA': 'Laç', 'AL-LE': 'Lezhë', 'AL-LB': 'Librazhd', 'AL-LU': 'Lushnjë', 'AL-MM': 'Malësia e Madhe', 'AL-MK': 'Mallakastër', 'AL-MT': 'Mat', 'AL-MR': 'Mirditë', 'AL-PQ': 'Peqin', 'AL-PR': 'Përmet', 'AL-PG': 'Pogradec', 'AL-PU': 'Pukë', 'AL-SR': 'Sarandë', 'AL-SK': 'Skrapar', 'AL-SH': 'Shkodër', 'AL-TE': 'Tepelenë', 'AL-TR': 'Tiranë', 'AL-TP': 'Tropojë', 'AL-VL': 'Vlorë'} |
But I can't manage to achieve the following output because of the nested JSON.
country code | country name | region code | region name |
---|---|---|---|
AF | Afghanistan | AF-BDS | Badakhshān |
AF | Afghanistan | AF-BDG | Bādghīs |
I tried to loop inside the DataFrame with :
df = json_normalize(df['divisions']).unstack().apply(pd.Series)
But I'm not getting any satisfying result.
CodePudding user response:
This should work:
df1 = (
pd.DataFrame(data)
.transpose()
.reset_index(names="country code")
.rename(columns={"name": "country name"})
)
divisions = [(k1, v1) for k, v in df1["divisions"].to_dict().items() for k1, v1 in v.items()]
df2 = pd.DataFrame(divisions, columns=["region code", "region name"])
final_df = (
pd
.merge(df1.explode("divisions"), df2, left_on="divisions", right_on="region code")
.drop(columns="divisions")
)
print(final_df.head(10))
country code country name region code region name
0 AF Afghanistan AF-BDS Badakhshān
1 AF Afghanistan AF-BDG Bādghīs
2 AF Afghanistan AF-BGL Baghlān
3 AF Afghanistan AF-BAL Balkh
4 AF Afghanistan AF-BAM Bāmīān
5 AF Afghanistan AF-FRA Farāh
6 AF Afghanistan AF-FYB Fāryāb
7 AF Afghanistan AF-GHA Ghaznī
8 AF Afghanistan AF-GHO Ghowr
9 AF Afghanistan AF-HEL Helmand
CodePudding user response:
you can simply read in the data one country at a time
J = json.load(open("iso-3166-2.json","r"))
dfs = []
for country_code in J:
df = pd.DataFrame(J[country_code])
df.index.name="region_code"
df['country_code'] = country_code
dfs.append(df)
df = pd.concat(dfs).reset_index()
# region_code name divisions country_code
#0 AF-BAL Afghanistan Balkh AF
#1 AF-BAM Afghanistan Bāmīān AF
#2 AF-BDG Afghanistan Bādghīs AF
#3 AF-BDS Afghanistan Badakhshān AF
#4 AF-BGL Afghanistan Baghlān AF
#... ... ... ... ...
#3802 ZW-MI Zimbabwe Midlands ZW
#3803 ZW-MN Zimbabwe Matabeleland North ZW
#3804 ZW-MS Zimbabwe Matabeleland South ZW
#3805 ZW-MV Zimbabwe Masvingo ZW
#3806 ZW-MW Zimbabwe Mashonaland West ZW