Flatten JSON with columns having values as a list in dataframe-CodePudding

I have the following data frame:

| Category | idlist                                                           |
| -------- | -----------------------------------------------------------------|
| new      | {100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']}  |
| old      |  ['99A', '99B']                                                  |

I want it to be converted into this:

| Category | id  | list                 |
| -------- | --- | -------------------- |
| new      | 100 | ['1A', '2B']         |     
| new      | 200 | ['2A', '3B']         |
| new      | 300 | ['3A', '3B', '4B']   |
| old      | -1  | ['99A', '99B']       |

The column 'idlist' is a json for category new with id as key and value a list of items. Then there is category that is old and it just has a list of items and since it doesn't have any id we can default it to -1 or 0.

I was trying some examples on internet but I was not able to flatten the json for category new because the key is a number. Can this be achieved? I can handle the old separately as a subset by putting id as '-1' because I really don't need to flatten the cell corresponding to old. But is there a way to handle all this together?

As requested, I am sharing the original JSON format before I converted it into a df:

{'new' : {100: ['1-A', '2-B'], 200: ['2-A', '3-B'], 300: ['3-A', '3-B', '4-B']}, 'old' :  ['99-A', '99-B']}

CodePudding user response：

For your original JSON of this:

jj = '''
{
 "new": {"100": ["1A", "2B"], "200": ["2A", "3B"], "300": ["3A", "3B", "4B"]},
 "old": ["99A", "99B"]
}
'''

I think it's easier to pre-process the JSON rather than post-process the dataframe:

import json
import pandas as pd

dd = json.loads(jj)
ml = []
for k, v in dd.items():
    if isinstance(v, dict):
        ml  = [{ 'Category' : k, 'id' : k2, 'list' : v2 } for k2, v2 in v.items()]
    else:
        ml  = [{ 'Category' : k, 'id' : -1, 'list' : v }]

df = pd.DataFrame(ml)

Output:

  Category   id          list
0      new  100      [1A, 2B]
1      new  200      [2A, 3B]
2      new  300  [3A, 3B, 4B]
3      old   -1    [99A, 99B]

CodePudding user response：

click here

   def function1(s:pd.Series):
    global df2
    list1 = []
    var1=eval(s.idlist)
    if isinstance(var1,dict):
        for i in var1:
            list1.append({'Category':s.Category,'id':i,'list':var1[i]})
    else:
        list1.append({'Category':s.Category,'id':-1,'list':var1})
    df2=pd.concat([df2,pd.DataFrame(list1)])

df2=pd.DataFrame()
df1.apply(function1,axis=1)
df2.set_index('Category')

CodePudding user response：

First isolate new and old in separate dataframes.
Create new columns corresponding to each key in the idlist dictionary for df_new.
Use pd.melt to "unpivot" df_new
Modify df_old to match column for column with df_new,
Use pd.concat to rejoin the two dataframes on axis 0 (rows).

See below:

# Recreated before you json file was attached    
import pandas as pd 
df = pd.DataFrame({'Category': ['new', 'old'], 
               'idlist': [{100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']}, ['99A', '99B']]})

# isolate new and old in separate dataframes
df_new = df[df['Category'] == "new"]
df_old = df[df['Category'] == "old"]

# Create new columns corresponding to each key in the idlist dictionary for df_new
my_dict = df_new['idlist'][0]
keys = list(my_dict.keys())
values = list(my_dict.values())

for k,v in zip(keys, values):
    df_new[k] = f"{v}" # F-string to format the values in output as "string lists"

# Drop idlist because you don't need it anymore
df_new = df_new.drop('idlist', 1)


# Use pd.melt to "unpivot" df_new
df_new = pd.melt(df_new, 
        id_vars='Category', 
        value_vars=list(df_new.columns[1:]),
        var_name='id', 
        value_name='list')

# Modify df_old to match column for column with df_new
df_old['list'] = df_old['idlist']
df_old = df_old.drop('idlist', axis = 1)
df_old['id'] = -1

# Use pd.concat to rejoin the two dataframes on axis 0 (rows)
clean_df = pd.concat([df_new, df_old], axis = 0).reset_index(drop = True)