How to convert nested dictionaries to dataframes in the below format in Python

I have a dictionary like this:

dict = {Student_ID_1:{Course_1:[12,45,378], Course_2: [33,78,345]},
        Student_ID_2:{Course_6:[15,25,48], Course_24: [31,38,342]},
        ....<truncated>.....}

I have thousands of Student_IDs & 50 course_IDs. Now I would like to create a dataframe from this dictionary in this format:

Student_ID   Course_1_a  Course_1_b  Course_1_c ... Course_50_a  Course_50_b  Course_50_c
   12855         12          35         234            21            55           342

How can I convert my dictionary into a dataframe in this format? I tried several approaches, but only the first value of each course list ended up in the dataframe columns. Can anyone help me with this?

CodePudding user response：

I hope I've understood your question right. You can preprocess the dictionary before creating the dataframe:

import pandas as pd

dct = {
    "Student_ID_1": {"Course_1": [12, 45, 378], "Course_2": [33, 78, 345]},
    "Student_ID_2": {"Course_6": [15, 25, 48], "Course_24": [31, 38, 342]},
}

# Flatten each student's courses into {"Course_1_a": 12, "Course_1_b": 45, ...}
dct = {
    k: {f"{kk}_{b}": a for kk, vv in v.items() for a, b in zip(vv, "abc")}
    for k, v in dct.items()
}
# One row per student; the student IDs move from the index into a column
df = (
    pd.DataFrame.from_dict(dct, orient="index")
    .reset_index()
    .rename(columns={"index": "Student ID"})
)
print(df)

Prints:

     Student ID  Course_1_a  Course_1_b  Course_1_c  Course_2_a  Course_2_b  Course_2_c  Course_6_a  Course_6_b  Course_6_c  Course_24_a  Course_24_b  Course_24_c
0  Student_ID_1        12.0        45.0       378.0        33.0        78.0       345.0         NaN         NaN         NaN          NaN          NaN          NaN
1  Student_ID_2         NaN         NaN         NaN         NaN         NaN         NaN        15.0        25.0        48.0         31.0         38.0        342.0
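
If all 50 courses should appear as columns even when a student (or the whole sample) has no values for them, and the numbers should stay integers instead of turning into floats, something like the sketch below might help. It assumes the course IDs run from Course_1 to Course_50 (the question only says there are 50 of them) and uses pandas' nullable Int64 dtype; the names flat and all_columns and the Student_ID column label are my own choices for illustration.

```python
import pandas as pd

dct = {
    "Student_ID_1": {"Course_1": [12, 45, 378], "Course_2": [33, 78, 345]},
    "Student_ID_2": {"Course_6": [15, 25, 48], "Course_24": [31, 38, 342]},
}

# Same flattening as above: {"Course_1_a": 12, "Course_1_b": 45, ...} per student
flat = {
    student: {
        f"{course}_{suffix}": value
        for course, values in courses.items()
        for suffix, value in zip("abc", values)
    }
    for student, courses in dct.items()
}

# Assumption: courses are named Course_1 .. Course_50
all_columns = [f"Course_{n}_{s}" for n in range(1, 51) for s in "abc"]

df = (
    pd.DataFrame.from_dict(flat, orient="index")
    .reindex(columns=all_columns)   # fixed column order, missing courses added as NaN
    .astype("Int64")                # nullable integers, so NaN becomes <NA>
    .rename_axis("Student_ID")
    .reset_index()
)
print(df.iloc[:, :7])
```

With the nullable dtype, missing courses show up as <NA> and the existing values print as 12, 45, 378 rather than 12.0, 45.0, 378.0.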

CodePudding user response：

You could try df.explode() to split the lists in the dataframe into separate rows:

import pandas as pd

# named dct so the built-in dict is not shadowed
dct = {'Student_ID_1':{'Course_1':[12,45,378], 'Course_2': [33,78,345]},
       'Student_ID_2':{'Course_6':[15,25,48], 'Course_24': [31,38,342]}}
df = pd.DataFrame(dct)

df1 = df.explode('Student_ID_1').explode('Student_ID_2')
print(df1)

          Student_ID_1 Student_ID_2
Course_1            12          NaN
Course_1            45          NaN
Course_1           378          NaN
Course_2            33          NaN
Course_2            78          NaN
Course_2           345          NaN
Course_6           NaN           15
Course_6           NaN           25
Course_6           NaN           48
Course_24          NaN           31
Course_24          NaN           38
Course_24          NaN          342

After that, transpose the dataframe and rename the columns

df1 = df1.T
df1.columns = [col + '_' + s for col in df.T.columns for s in ['a', 'b', 'c']]
print(df1)

Output:

             Course_1_a Course_1_b Course_1_c Course_2_a Course_2_b Course_2_c Course_6_a Course_6_b Course_6_c Course_24_a Course_24_b Course_24_c
Student_ID_1         12         45        378         33         78        345        NaN        NaN        NaN         NaN         NaN         NaN
Student_ID_2        NaN        NaN        NaN        NaN        NaN        NaN         15         25         48          31          38         342
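
One thing to watch: as far as I can tell, chaining explode() one student column at a time only works because the two students in the sample share no courses. If a course appeared under both students, the second explode() would repeat the rows already produced by the first. A possible alternative, sketched below, is to build a long table first and pivot it, which also avoids writing one explode() per student. The names records, long_df, wide and the helper columns column and value are made up for this illustration.

```python
import pandas as pd

dct = {
    "Student_ID_1": {"Course_1": [12, 45, 378], "Course_2": [33, 78, 345]},
    "Student_ID_2": {"Course_6": [15, 25, 48], "Course_24": [31, 38, 342]},
}

# Long format: one record per (student, course suffix, value)
records = [
    {"Student_ID": student, "column": f"{course}_{suffix}", "value": value}
    for student, courses in dct.items()
    for course, values in courses.items()
    for suffix, value in zip("abc", values)
]
long_df = pd.DataFrame(records)

# Wide format: one row per student, one column per course suffix
wide = long_df.pivot(index="Student_ID", columns="column", values="value")
print(wide)
```

The pivoted columns come out ordered by label rather than by course number, so a reindex(columns=...) with the desired order may be needed afterwards.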

CodePudding user response：

Here's a way to do what your question asks:

import numpy as np
import pandas as pd

dct = {12855:{'Course_1':[12,45,378], 'Course_2': [33,78,345]},
       12856:{'Course_6':[15,25,48], 'Course_24': [31,38,342]}}

# Replace missing cells (NaN) with [None, None, None] so every cell holds a 3-item list
df = pd.DataFrame(dct).apply(lambda x : [[None]*3 if y is np.nan else y for y in x])
# Exploding several columns at once needs pandas 1.3 or later
df = ( df
    .assign(course=[[f'{c}_{letter}' for letter in 'abc'] for c in df.index])
    .explode(['course'] + list(df.columns))
    .rename_axis('Student_ID', axis=1)
    .set_index('course').rename_axis(None).T.reset_index() )
print(df)

Output:

   Student_ID Course_1_a Course_1_b Course_1_c Course_2_a Course_2_b Course_2_c Course_6_a Course_6_b Course_6_c Course_24_a Course_24_b Course_24_c
0       12855         12         45        378         33         78        345       None       None       None        None        None        None
1       12856       None       None       None       None       None       None         15         25         48          31          38         342

Explanation:

Use pd.DataFrame(dct) to create a dataframe like this:

                   12855          12856
Course_1   [12, 45, 378]            NaN
Course_2   [33, 78, 345]            NaN
Course_6             NaN   [15, 25, 48]
Course_24            NaN  [31, 38, 342]

Use apply() to convert NaN values to a list with 3 None values like this:

                        12855               12856
Course_1        [12, 45, 378]  [None, None, None]
Course_2        [33, 78, 345]  [None, None, None]
Course_6   [None, None, None]        [15, 25, 48]
Course_24  [None, None, None]       [31, 38, 342]

Use assign() to add a column course with a list whose items are the original course name with _a, _b, and _c appended like this:

                        12855               12856                                   course
Course_1        [12, 45, 378]  [None, None, None]     [Course_1_a, Course_1_b, Course_1_c]
Course_2        [33, 78, 345]  [None, None, None]     [Course_2_a, Course_2_b, Course_2_c]
Course_6   [None, None, None]        [15, 25, 48]     [Course_6_a, Course_6_b, Course_6_c]
Course_24  [None, None, None]       [31, 38, 342]  [Course_24_a, Course_24_b, Course_24_c]

Use explode() to turn each row into 3 rows, one for each successive list item in the row's respective columns like this:

          12855 12856       course
Course_1     12  None   Course_1_a
Course_1     45  None   Course_1_b
Course_1    378  None   Course_1_c
Course_2     33  None   Course_2_a
Course_2     78  None   Course_2_b
Course_2    345  None   Course_2_c
Course_6   None    15   Course_6_a
Course_6   None    25   Course_6_b
Course_6   None    48   Course_6_c
Course_24  None    31  Course_24_a
Course_24  None    38  Course_24_b
Course_24  None   342  Course_24_c

Use rename_axis() to name the column index Student_ID.
Use set_index() to replace the index with column course and use rename_axis() to change the index name to None.
Use .T to transpose and use .reset_index() to change the Student_ID index to a column, getting the Output shown above.
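
The last three steps have no intermediate output shown, so here is a small sketch that runs them one at a time on just the Course_1 rows of the exploded frame. The variable names exploded, step1, step2, step3 and result are only for illustration.

```python
import pandas as pd

# The Course_1 rows of the exploded frame from the explanation above
exploded = pd.DataFrame(
    {
        12855: [12, 45, 378],
        12856: [None, None, None],
        "course": ["Course_1_a", "Course_1_b", "Course_1_c"],
    },
    index=["Course_1"] * 3,
)

step1 = exploded.rename_axis("Student_ID", axis=1)    # name the column axis
step2 = step1.set_index("course").rename_axis(None)   # course labels become the (unnamed) index
step3 = step2.T                                        # students become rows
result = step3.reset_index()                           # Student_ID becomes a regular column

print(result)
```

Naming the column axis before transposing is what lets reset_index() produce a column called Student_ID without a separate rename.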