I have sparsed dataframe that I needed to convert it to list which I already did it. Now I want to transform this list to dictionary, so I can do key-value comparison in my actual use case. To do so, I attempted to convert list to dictionary but I have value error instead. How can I do this correctly in python? Does anyone knows what is correct way to do this in python?
reproducible data and my attempt
mydict={'code0': {0: 'nan', 1: " '40'", 2: " '98'", 3: " '98'", 4: " '52'", 5: " '52'", 6: " '52'", 7: " '52'", 8: " '40'", 9: " '58'"}, 'code1': {0: " ('VA','HC','NIH','SAP','AUS','HOL','ATT','COL','UCL')", 1: 'nan', 2: " ('ATT','NC')", 3: " ('ATT','VA','NC')", 4: " 'NC'", 5: " 'NC'", 6: " 'NC'", 7: " 'NC'", 8: " 'VA'", 9: " 'CE'"}, 'code2': {0: 'nan', 1: 'nan', 2: " ('103','104','105','106','31')", 3: " ('104','105','106','31')", 4: " '109'", 5: " '109'", 6: " '109'", 7: " '109'", 8: " '11'", 9: " ('109')"}, 'code3': {0: 'nan', 1: " '518'", 2: " '810'", 3: 'nan', 4: " ('610','620','682','642','621','611')", 5: " ('396','340','394','393','240')", 6: " ('612','790','110')", 7: " ('730','320','350','379','812','374')", 8: " ('113','174','131','115')", 9: " ('423','114')"}, 'code4': {0: 'nan', 1: 'nan', 2: " 'computer science'", 3: " 'computer science'", 4: " 'biology'", 5: " 'biology'", 6: "biology'", 7: "biology'", 8: 'nan', 9: 'nan'}, 'code5': {0: 'nan', 1: 'nan', 2: 'nan', 3: 'nan', 4: 'nan', 5: " ('12','18')", 6: " ('12','16','18','19')", 7: " ('12','18','19')", 8: " ('11','19','31')", 9: " '31'"}, 'code6': {0: 'nan', 1: " '594'", 2: 'nan', 3: 'nan', 4: " ('712','479','297','639','452','172')", 5: 'nan', 6: " ('285','295','236','239','269','284','237')", 7: 'nan', 8: " ('164','157','388','158')", 9: " ('372','238')"}, 'rules_desc': {0: 'rules1', 1: 'rules2', 2: 'rules2', 3: 'rules2', 4: 'rules2', 5: 'rules2', 6: 'rules2', 7: 'rules2', 8: 'rules2', 9: 'rules2'}}
mydf=pd.DataFrame.from_dict(my_dict)
cols = mydf.columns.values
res=[",".join("{}:{}".format(*t) for t in zip(cols, row)) for _, row in mydf[cols].iterrows()]
res=[list(s.split('""')) for s in res]
intconv = lambda x: (x[0], int(x[1]))
for s in res:
b = dict([i.split(':') for i in s])
final = dict((k, int(v)) for k, v in b.items())
print(final)
but I have value error as follow:
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Input In [60], in <cell line: 9>() 8 intconv = lambda x: (x[0], int(x[1])) 9 for s in res: ---> 10 b = dict([i.split(':') for i in s]) 11 final = dict((k, int(v)) for k, v in b.items()) 12 print(final)
ValueError: dictionary update sequence element #0 has length 9; 2 is required
not sure how to debug this, seems value can't accept list of string. the reason why I am doing this because I want to do key-value comparison. Does anyone know any workaround on this?
desired output
I don't want to try hard here. basically I want to create list of dictionary for each row of dataframe, where column_name is key and column value is value. I know in some column, its value is list of string, that's why I don't know how to create dictionary from it. here is the sample dictionary that I want:
output_dict={{'code0':'nan', 'code1': ('VA','HC','NIH','SAP','AUS','HOL','ATT','COL','UCL'), 'code2':'nan', 'code3':'nan', 'code4':'nan', 'code5':'nan', 'code6':'nan','rules_desc': rules1},
{'code0': 40, 'code1':'nan','code2':'nan','code3':518, 'code4':'nan', 'code5':'nan','code6':594, 'rules_desc': rules2},
...
}
can we get dictionary something like this? because I want to create such list or dictionary to filter out another big dataframe. Any help would be appreciated. thanks
CodePudding user response:
The data you have in mydf
need cleaning. I'm using ast.literal_eval
to try to convert the strings to tuples, and then int()
the strings to integers:
from ast import literal_eval
out = []
for _, row in mydf.iterrows():
tmp = {}
for k, v in zip(row.index, row):
try:
tmp[k] = literal_eval(v)
if isinstance(tmp[k], str) and tmp[k].isnumeric():
tmp[k] = int(tmp[k])
except (ValueError, SyntaxError):
tmp[k] = v
out.append(tmp)
print(out)
Prints:
[
{
"code0": "nan",
"code1": ("VA", "HC", "NIH", "SAP", "AUS", "HOL", "ATT", "COL", "UCL"),
"code2": "nan",
"code3": "nan",
"code4": "nan",
"code5": "nan",
"code6": "nan",
"rules_desc": "rules1",
},
{
"code0": 40,
"code1": "nan",
"code2": "nan",
"code3": 518,
"code4": "nan",
"code5": "nan",
"code6": 594,
"rules_desc": "rules2",
},
{
"code0": 98,
"code1": ("ATT", "NC"),
"code2": ("103", "104", "105", "106", "31"),
"code3": 810,
"code4": "computer science",
"code5": "nan",
"code6": "nan",
"rules_desc": "rules2",
},
{
"code0": 98,
"code1": ("ATT", "VA", "NC"),
"code2": ("104", "105", "106", "31"),
"code3": "nan",
"code4": "computer science",
"code5": "nan",
"code6": "nan",
"rules_desc": "rules2",
},
{
"code0": 52,
"code1": "NC",
"code2": 109,
"code3": ("610", "620", "682", "642", "621", "611"),
"code4": "biology",
"code5": "nan",
"code6": ("712", "479", "297", "639", "452", "172"),
"rules_desc": "rules2",
},
{
"code0": 52,
"code1": "NC",
"code2": 109,
"code3": ("396", "340", "394", "393", "240"),
"code4": "biology",
"code5": ("12", "18"),
"code6": "nan",
"rules_desc": "rules2",
},
{
"code0": 52,
"code1": "NC",
"code2": 109,
"code3": ("612", "790", "110"),
"code4": "biology'",
"code5": ("12", "16", "18", "19"),
"code6": ("285", "295", "236", "239", "269", "284", "237"),
"rules_desc": "rules2",
},
{
"code0": 52,
"code1": "NC",
"code2": 109,
"code3": ("730", "320", "350", "379", "812", "374"),
"code4": "biology'",
"code5": ("12", "18", "19"),
"code6": "nan",
"rules_desc": "rules2",
},
{
"code0": 40,
"code1": "VA",
"code2": 11,
"code3": ("113", "174", "131", "115"),
"code4": "nan",
"code5": ("11", "19", "31"),
"code6": ("164", "157", "388", "158"),
"rules_desc": "rules2",
},
{
"code0": 58,
"code1": "CE",
"code2": 109,
"code3": ("423", "114"),
"code4": "nan",
"code5": 31,
"code6": ("372", "238"),
"rules_desc": "rules2",
},
]
CodePudding user response:
Not sure I understood your question, but I think this will give you what you want - a list of dictionaries with key-value based on cols.
mydf.to_dict(orient='records')
Let me know if I missed something