I am working on converting an Excel sheet to nested JSON, where the grouping has to apply to the header as well as to the items. This is what I have tried so far.
I am able to apply the transformation rules using pandas:
df['Header'] = df[['A','B']].to_dict('records')
df['Item'] = df[['A', 'C', 'D']].to_dict('records')
This lets me separate the records into the header and item structures. Then I apply the following:
data_groupedby = data.groupby(['A', 'B']).agg(list).reset_index()
result = data_groupedby[['A', 'B', 'Item']].to_json(orient='records')
This gives me the required JSON with the header and a nested drill-down of the items. With groupby I am able to group the header fields, but I cannot apply the grouping to the respective items, so they are not grouped correctly. Any idea how this can be achieved?
Example data (Excel):
A B C D
100 Test1 XX10 L
100 Test1 XX10 L
100 Test1 XX20 L
101 Test2 XX10 L
101 Test2 XX20 L
101 Test2 XX20 L
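For reproducibility, the sample above can be built directly as a DataFrame (in my actual flow the frame comes from pd.read_excel; the file name below is only a placeholder):
import pandas as pd

# Stand-in for the Excel sheet shown above;
# in practice: data = pd.read_excel("input.xlsx")  # placeholder file name
data = pd.DataFrame({
    "A": [100, 100, 100, 101, 101, 101],
    "B": ["Test1", "Test1", "Test1", "Test2", "Test2", "Test2"],
    "C": ["XX10", "XX10", "XX20", "XX10", "XX20", "XX20"],
    "D": ["L", "L", "L", "L", "L", "L"],
})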
Current output:
[
    {
        "A": 100,
        "B": "Test1",
        "Item": [
            {
                "A": 100,
                "C": "XX10",
                "D": "L"
            },
            {
                "A": 100,
                "C": "XX10",
                "D": "L"
            },
            {
                "A": 100,
                "C": "XX20",
                "D": "L"
            }
        ]
    },
    {
        "A": 101,
        "B": "Test2",
        "Item": [
            {
                "A": 101,
                "C": "XX10",
                "D": "L"
            },
            {
                "A": 101,
                "C": "XX20",
                "D": "L"
            },
            {
                "A": 101,
                "C": "XX20",
                "D": "L"
            }
        ]
    }
]
If you look at the Item arrays, the identical entries are not grouped and are repeated.
Thanks TC
CodePudding user response:
You can drop_duplicates and then groupby, then apply the to_dict transformation for columns C and D, and finally clean up with reset_index and rename.
(data.drop_duplicates()                                  # remove exact duplicate rows
     .groupby(["A", "B"])                                # one group per header
     .apply(lambda x: x[["C", "D"]].to_dict("records"))  # items of each group as a list of dicts
     .to_frame()
     .reset_index()                                      # bring A and B back as columns
     .rename(columns={0: "Item"})
     .to_dict("records"))
Output:
[{'A': 100,
'B': 'Test1',
'Item': [{'C': 'XX10', 'D': 'L'}, {'C': 'XX20', 'D': 'L'}]},
{'A': 101,
'B': 'Test2',
'Item': [{'C': 'XX10', 'D': 'L'}, {'C': 'XX20', 'D': 'L'}]}]