I want to read a .xlsx file, do some things with the data and convert it to a dict to save it in a .json file. To do that I use Python 3 and pandas.
This is the code:
import pandas as pd
import json
xls = pd.read_excel(
    io = "20codmun.xlsx",
    # read the code columns as strings so leading zeros are kept
    converters = {
        "CODAUTO" : str,
        "CPRO" : str,
        "CMUN" : str,
        "DC" : str
    }
)
print(xls)
#print(xls.columns.values)
outDict = {}
print(len(xls["NOMBRE"])) # 8131 rows
for i in range(len(xls.index)):
    codauto = xls["CODAUTO"][i]
    cpro = xls["CPRO"][i]
    cmun = xls["CMUN"][i]
    dc = xls["DC"][i]
    aemetId = cpro + cmun
    outDict[xls["NOMBRE"][i]] = {
        "CODAUTO" : codauto,
        "CPRO" : cpro,
        "CMUN" : cmun,
        "DC" : dc,
        "AEMET_ID" : aemetId
    }
print(i) # 8130
print(len(outDict)) # 8114 entries, SOME ENTRIES ARE LOST!!!!!
#print(outDict["Petrer"])
with open("data.json", "w") as outFile:
    json.dump(outDict, outFile)
I add here the source of the .xlsx file (Spanish government). Select "Fichero con todas las provincias". You have to delete the first row.
As you can see, the pandas.DataFrame has 8131 rows and the for index ends at 8130, but the length of the final dict is 8114, so some data is lost!
You can check that "Aljucén" is in the .xlsx file, but not in the .json one.
CodePudding user response:
I have analyzed the file and it seems that some "NOMBRE" values are duplicated. Try executing xls["NOMBRE"].value_counts() and you will see that, for example, "Sada" appears twice. You will also see that there are exactly 8114 unique values.
Because you are using the city name as the dictionary key, every time a key is repeated you overwrite the value previously stored in the dict under that key.
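To see which rows collide, something along these lines might help. It is only a sketch: the small DataFrame is a hypothetical stand-in for the spreadsheet, and the code values in it are made up; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet; the code values are invented.
xls = pd.DataFrame({
    "NOMBRE": ["Sada", "Petrer", "Sada"],
    "CPRO": ["01", "02", "03"],
    "CMUN": ["001", "002", "003"],
})

# keep=False marks every row whose NOMBRE occurs more than once,
# not only the second and later occurrences.
duplicated_rows = xls[xls["NOMBRE"].duplicated(keep=False)]
print(duplicated_rows)

# The gap between these two numbers is the number of entries a
# name-keyed dict would lose.
n_rows = len(xls)
n_unique = xls["NOMBRE"].nunique()
print(n_rows, n_unique)
```

On the real file, the same two numbers should correspond to the 8131 rows and 8114 unique names reported above.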
CodePudding user response:
I agree with gontxomde: if the "NOMBRE" column contains duplicate values, later rows overwrite the entries already stored under the same key in the new dictionary.
As a proof of concept, I made a minimal example based on your approach:
import pandas as pd
feature_str = ['a', 'b', 'c']
# the value 1 appears twice in "NOMBRE" on purpose
df = pd.DataFrame({"NOMBRE": [1, 1, 3],
                   "CODAUTO": feature_str,
                   "CPRO" : feature_str,
                   "CMUN" : feature_str,
                   "DC" : feature_str
                   })
outDict = {}
print(len(df["NOMBRE"])) # 3 rows
for i in range(len(df.index)):
    codauto = df["CODAUTO"][i]
    cpro = df["CPRO"][i]
    cmun = df["CMUN"][i]
    dc = df["DC"][i]
    aemetId = cpro + cmun
    outDict[df["NOMBRE"][i]] = {
        "CODAUTO" : codauto,
        "CPRO" : cpro,
        "CMUN" : cmun,
        "DC" : dc,
        "AEMET_ID" : aemetId
    }
print(outDict)
Which yields:
{1: {'CODAUTO': 'b', 'CPRO': 'b', 'CMUN': 'b', 'DC': 'b', 'AEMET_ID': 'bb'},
3: {'CODAUTO': 'c', 'CPRO': 'c', 'CMUN': 'c', 'DC': 'c', 'AEMET_ID': 'cc'}}
If I may suggest, instead of iterating over the index of the DataFrame, it would be better to use DataFrame methods:
df.set_index("NOMBRE").to_dict(orient='index')
On a dataset where the values in NOMBRE are unique, this yields the same result as your loop, except for the AEMET_ID field, which you would first have to add as a column. Additionally, if there are duplicates it raises a ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [15], in <module>
----> 1 df.set_index("NOMBRE").to_dict(orient='index')
File ~/.pyenv/versions/3.8.7/envs/jupyter/lib/python3.8/site-packages/pandas/core/frame.py:1607, in DataFrame.to_dict(self, orient, into)
1605 elif orient == "index":
1606 if not self.index.is_unique:
-> 1607 raise ValueError("DataFrame index must be unique for orient='index'.")
1608 return into_c(
1609 (t[0], dict(zip(self.columns, t[1:])))
1610 for t in self.itertuples(name=None)
1611 )
1613 else:
ValueError: DataFrame index must be unique for orient='index'.
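A possible way to combine this with the AEMET_ID field from the question is sketched below. The sample values are made up, and the explicit uniqueness check is my own addition so that the error message names the offending values; treat it as one option rather than the required approach.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "NOMBRE": ["A", "B", "C"],
    "CODAUTO": ["01", "01", "02"],
    "CPRO": ["04", "04", "22"],
    "CMUN": ["001", "002", "001"],
    "DC": ["9", "4", "5"],
})

# Build the derived field as a column: string columns concatenate element-wise.
df["AEMET_ID"] = df["CPRO"] + df["CMUN"]

if df["NOMBRE"].is_unique:
    out_dict = df.set_index("NOMBRE").to_dict(orient="index")
else:
    # Report the repeated names instead of silently losing rows.
    dupes = sorted(df.loc[df["NOMBRE"].duplicated(), "NOMBRE"].unique())
    raise ValueError(f"duplicated NOMBRE values: {dupes}")

print(out_dict["A"])
```

Because the codes are read as strings, the concatenation keeps leading zeros, which is the reason for the converters argument in the original read_excel call.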
CodePudding user response:
If you have duplicated values in xls["NOMBRE"], each new duplicate will overwrite the previous one. So you need to choose a strategy to deal with duplicates: do you want different entries, like Sada and Sada(2)? Or do you want a single key Sada with the data from all the duplicates?
For the first option:
for i in range(len(xls.index)):
    nombre = xls["NOMBRE"][i]
    cpro = xls["CPRO"][i]
    cmun = xls["CMUN"][i]
    entry = {
        "CODAUTO" : xls["CODAUTO"][i],
        "CPRO" : cpro,
        "CMUN" : cmun,
        "DC" : xls["DC"][i],
        "AEMET_ID" : cpro + cmun
    }
    # if it's the first time the value appears, just do the "normal" thing
    if nombre not in outDict:
        outDict[nombre] = entry
    # if the value was read before, add number of duplicate after the name
    else:
        # n is a separate counter so the row index i is not overwritten
        for n in range(2, xls["NOMBRE"].value_counts()[nombre] + 1):
            key = nombre + '(' + str(n) + ')'
            if key not in outDict:
                outDict[key] = entry
                break  # stop at the first free suffix
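For the second option, a single key holding the data from all the duplicates, a minimal sketch could look like the following. It works on plain row dicts with made-up values, so it is an illustration of the idea rather than a drop-in replacement; with pandas, xls.to_dict(orient="records") should give rows in this shape.

```python
from collections import defaultdict

# Hypothetical rows with invented code values.
rows = [
    {"NOMBRE": "Sada", "CPRO": "01", "CMUN": "001"},
    {"NOMBRE": "Petrer", "CPRO": "02", "CMUN": "002"},
    {"NOMBRE": "Sada", "CPRO": "03", "CMUN": "003"},
]

# Every name maps to a list, so repeated names append instead of overwriting.
grouped = defaultdict(list)
for row in rows:
    entry = {key: value for key, value in row.items() if key != "NOMBRE"}
    entry["AEMET_ID"] = row["CPRO"] + row["CMUN"]
    grouped[row["NOMBRE"]].append(entry)

# Convert back to a plain dict before handing it to json.dump.
out_dict = dict(grouped)
print(len(out_dict["Sada"]))
```

The trade-off is that every value becomes a list, even for names that appear only once, so the code reading the .json file has to expect that shape.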