I have the following data frame:
| Category | idlist |
| -------- | -----------------------------------------------------------------|
| new | {100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']} |
| old | ['99A', '99B'] |
I want it to be converted into this:
| Category | id | list |
| -------- | --- | -------------------- |
| new | 100 | ['1A', '2B'] |
| new | 200 | ['2A', '3B'] |
| new | 300 | ['3A', '3B', '4B'] |
| old | -1 | ['99A', '99B'] |
The column 'idlist' is a json for category new with id as key and value a list of items. Then there is category that is old and it just has a list of items and since it doesn't have any id we can default it to -1 or 0.
I was trying some examples on internet but I was not able to flatten the json for category new because the key is a number. Can this be achieved? I can handle the old separately as a subset by putting id as '-1' because I really don't need to flatten the cell corresponding to old. But is there a way to handle all this together?
As requested, I am sharing the original JSON format before I converted it into a df:
{'new' : {100: ['1-A', '2-B'], 200: ['2-A', '3-B'], 300: ['3-A', '3-B', '4-B']}, 'old' : ['99-A', '99-B']}
CodePudding user response:
For your original JSON of this:
jj = '''
{
"new": {"100": ["1A", "2B"], "200": ["2A", "3B"], "300": ["3A", "3B", "4B"]},
"old": ["99A", "99B"]
}
'''
I think it's easier to pre-process the JSON rather than post-process the dataframe:
import json
import pandas as pd
dd = json.loads(jj)
ml = []
for k, v in dd.items():
if isinstance(v, dict):
ml = [{ 'Category' : k, 'id' : k2, 'list' : v2 } for k2, v2 in v.items()]
else:
ml = [{ 'Category' : k, 'id' : -1, 'list' : v }]
df = pd.DataFrame(ml)
Output:
Category id list
0 new 100 [1A, 2B]
1 new 200 [2A, 3B]
2 new 300 [3A, 3B, 4B]
3 old -1 [99A, 99B]
CodePudding user response:
def function1(s:pd.Series):
global df2
list1 = []
var1=eval(s.idlist)
if isinstance(var1,dict):
for i in var1:
list1.append({'Category':s.Category,'id':i,'list':var1[i]})
else:
list1.append({'Category':s.Category,'id':-1,'list':var1})
df2=pd.concat([df2,pd.DataFrame(list1)])
df2=pd.DataFrame()
df1.apply(function1,axis=1)
df2.set_index('Category')
CodePudding user response:
- First isolate new and old in separate dataframes.
- Create new columns corresponding to each key in the idlist dictionary for df_new.
- Use pd.melt to "unpivot" df_new
- Modify df_old to match column for column with df_new,
- Use pd.concat to rejoin the two dataframes on axis 0 (rows).
See below:
# Recreated before you json file was attached
import pandas as pd
df = pd.DataFrame({'Category': ['new', 'old'],
'idlist': [{100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']}, ['99A', '99B']]})
# isolate new and old in separate dataframes
df_new = df[df['Category'] == "new"]
df_old = df[df['Category'] == "old"]
# Create new columns corresponding to each key in the idlist dictionary for df_new
my_dict = df_new['idlist'][0]
keys = list(my_dict.keys())
values = list(my_dict.values())
for k,v in zip(keys, values):
df_new[k] = f"{v}" # F-string to format the values in output as "string lists"
# Drop idlist because you don't need it anymore
df_new = df_new.drop('idlist', 1)
# Use pd.melt to "unpivot" df_new
df_new = pd.melt(df_new,
id_vars='Category',
value_vars=list(df_new.columns[1:]),
var_name='id',
value_name='list')
# Modify df_old to match column for column with df_new
df_old['list'] = df_old['idlist']
df_old = df_old.drop('idlist', axis = 1)
df_old['id'] = -1
# Use pd.concat to rejoin the two dataframes on axis 0 (rows)
clean_df = pd.concat([df_new, df_old], axis = 0).reset_index(drop = True)