How to convert nested dictionaries to dataframes in the below format in python

Time:08-26

I have a dictionary like this:

dict = {Student_ID_1:{Course_1:[12,45,378], Course_2: [33,78,345]},
        Student_ID_2:{Course_6:[15,25,48], Course_24: [31,38,342]},
        ....<truncated>.....}

I have thousands of Student_IDs & 50 course_IDs. Now I would like to create a dataframe from this dictionary in this format:

Student_ID   Course_1_a  Course_1_b  Course_1_c ... Course_50_a  Course_50_b  Course_50_c
   12855         12          35         234            21            55           342

How can I convert my dictionary into a dataframe in this format? I tried different ways but I could get only the first value in the course list into my dataframe columns. Can anyone help me with this?

CodePudding user response:

I hope I've understood your question right. You can preprocess the dictionary before creating the dataframe:

import pandas as pd

dct = {
    "Student_ID_1": {"Course_1": [12, 45, 378], "Course_2": [33, 78, 345]},
    "Student_ID_2": {"Course_6": [15, 25, 48], "Course_24": [31, 38, 342]},
}

dct = {
    k: {f"{kk}_{b}": a for kk, vv in v.items() for a, b in zip(vv, "abc")}
    for k, v in dct.items()
}
df = (
    pd.DataFrame.from_dict(dct, orient="index")
    .reset_index()
    .rename(columns={"index": "Student ID"})
)
print(df)

Prints:

     Student ID  Course_1_a  Course_1_b  Course_1_c  Course_2_a  Course_2_b  Course_2_c  Course_6_a  Course_6_b  Course_6_c  Course_24_a  Course_24_b  Course_24_c
0  Student_ID_1        12.0        45.0       378.0        33.0        78.0       345.0         NaN         NaN         NaN          NaN          NaN          NaN
1  Student_ID_2         NaN         NaN         NaN         NaN         NaN         NaN        15.0        25.0        48.0         31.0         38.0        342.0
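
The question asks for columns covering all 50 courses in a fixed order, while the frame above only has columns for courses that appear in the data. A minimal sketch of one way to get the full layout is to reindex the columns. It assumes the courses are named Course_1 through Course_50; the all_courses list is hypothetical and should be replaced with the real course names.

```python
import pandas as pd

dct = {
    "Student_ID_1": {"Course_1": [12, 45, 378], "Course_2": [33, 78, 345]},
    "Student_ID_2": {"Course_6": [15, 25, 48], "Course_24": [31, 38, 342]},
}

# Assumption: the full course list is Course_1 .. Course_50 (hypothetical names).
all_courses = [f"Course_{i}" for i in range(1, 51)]
# Every expected column, in the order shown in the question.
all_columns = [f"{course}_{suffix}" for course in all_courses for suffix in "abc"]

# Same flattening idea as above: one "<course>_<suffix>" key per list item.
flat = {
    student: {
        f"{course}_{suffix}": value
        for course, values in courses.items()
        for value, suffix in zip(values, "abc")
    }
    for student, courses in dct.items()
}

df = (
    pd.DataFrame.from_dict(flat, orient="index")
    .reindex(columns=all_columns)  # adds missing courses as NaN and fixes the order
    .rename_axis("Student_ID")
    .reset_index()
)
print(df.shape)  # (2, 151)
```

Courses a student did not take stay NaN, which is why the values show up as floats. If integer columns are preferred, converting the value columns to the nullable "Int64" dtype is one option.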

CodePudding user response:

You could try df.explode() to split the lists in the dataframe:

import pandas as pd

# named dct rather than dict to avoid shadowing the built-in
dct = {'Student_ID_1':{'Course_1':[12,45,378], 'Course_2': [33,78,345]},
       'Student_ID_2':{'Course_6':[15,25,48], 'Course_24': [31,38,342]}}
df = pd.DataFrame(dct)

df1 = df.explode('Student_ID_1').explode('Student_ID_2')
print(df1)

          Student_ID_1 Student_ID_2
Course_1            12          NaN
Course_1            45          NaN
Course_1           378          NaN
Course_2            33          NaN
Course_2            78          NaN
Course_2           345          NaN
Course_6           NaN           15
Course_6           NaN           25
Course_6           NaN           48
Course_24          NaN           31
Course_24          NaN           38
Course_24          NaN          342

After that, transpose the dataframe and rename the columns:

df1 = df1.T
df1.columns = [col + '_' + s for col in df.T.columns for s in ['a', 'b', 'c']]
print(df1)

Output:

             Course_1_a Course_1_b Course_1_c Course_2_a Course_2_b Course_2_c Course_6_a Course_6_b Course_6_c Course_24_a Course_24_b Course_24_c
Student_ID_1         12         45        378         33         78        345        NaN        NaN        NaN         NaN         NaN         NaN
Student_ID_2        NaN        NaN        NaN        NaN        NaN        NaN         15         25         48          31          38         342
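
One caveat, shown here as a small sketch with hypothetical data: chaining explode() one column at a time works above because the two students share no course. If two students took the same course, the second explode() runs on every row the first one produced, so the row count multiplies and the positional renaming no longer lines up. Exploding both columns in a single call (supported since pandas 1.3) keeps the list items aligned.

```python
import pandas as pd

# Hypothetical data in which both students took the same course.
shared = {
    "Student_ID_1": {"Course_1": [12, 45, 378]},
    "Student_ID_2": {"Course_1": [15, 25, 48]},
}
df = pd.DataFrame(shared)

# Chained calls: each of the 3 rows from the first explode is exploded again.
chained = df.explode("Student_ID_1").explode("Student_ID_2")
print(len(chained))  # 9

# One call over both columns keeps the list items aligned (pandas 1.3+).
together = df.explode(["Student_ID_1", "Student_ID_2"])
print(len(together))  # 3

# Transpose and rename by position, as in the answer above.
wide = together.T
wide.columns = [f"{course}_{s}" for course in df.index for s in "abc"]
print(wide)
```

The single-call form needs every exploded cell in a row to have the same length, so a course that one student lacks would first have to be padded to a list of the same length.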

CodePudding user response:

Here's a way to do what your question asks:

import pandas as pd

dct = {12855:{'Course_1':[12,45,378], 'Course_2': [33,78,345]},
       12856:{'Course_6':[15,25,48], 'Course_24': [31,38,342]}}

# Replace every missing (non-list) cell with a list of three None values
df = pd.DataFrame(dct).apply(lambda x : [y if isinstance(y, list) else [None]*3 for y in x])
df = ( df
    .assign(course=[[f'{c}_{letter}' for letter in 'abc'] for c in df.index])
    .explode(['course'] + list(df.columns))  # multi-column explode needs pandas 1.3+
    .rename_axis('Student_ID', axis=1)
    .set_index('course').rename_axis(None).T.reset_index() )

Output:

   Student_ID Course_1_a Course_1_b Course_1_c Course_2_a Course_2_b Course_2_c Course_6_a Course_6_b Course_6_c Course_24_a Course_24_b Course_24_c
0       12855         12         45        378         33         78        345       None       None       None        None        None        None
1       12856       None       None       None       None       None       None         15         25         48          31          38         342

Explanation:

  • Use pd.DataFrame(dct) to create a dataframe like this:
                   12855          12856
Course_1   [12, 45, 378]            NaN
Course_2   [33, 78, 345]            NaN
Course_6             NaN   [15, 25, 48]
Course_24            NaN  [31, 38, 342]
  • Use apply() to convert NaN values to a list with 3 None values like this:
                        12855               12856
Course_1        [12, 45, 378]  [None, None, None]
Course_2        [33, 78, 345]  [None, None, None]
Course_6   [None, None, None]        [15, 25, 48]
Course_24  [None, None, None]       [31, 38, 342]
  • Use assign() to add a column course with a list whose items are the original course name with _a, _b, and _c appended like this:
                        12855               12856                                   course
Course_1        [12, 45, 378]  [None, None, None]     [Course_1_a, Course_1_b, Course_1_c]
Course_2        [33, 78, 345]  [None, None, None]     [Course_2_a, Course_2_b, Course_2_c]
Course_6   [None, None, None]        [15, 25, 48]     [Course_6_a, Course_6_b, Course_6_c]
Course_24  [None, None, None]       [31, 38, 342]  [Course_24_a, Course_24_b, Course_24_c]
  • Use explode() to turn each row into 3 rows, one for each successive list item in the row's respective columns like this:
          12855 12856       course
Course_1     12  None   Course_1_a
Course_1     45  None   Course_1_b
Course_1    378  None   Course_1_c
Course_2     33  None   Course_2_a
Course_2     78  None   Course_2_b
Course_2    345  None   Course_2_c
Course_6   None    15   Course_6_a
Course_6   None    25   Course_6_b
Course_6   None    48   Course_6_c
Course_24  None    31  Course_24_a
Course_24  None    38  Course_24_b
Course_24  None   342  Course_24_c
  • Use rename_axis() to name the column index Student_ID
  • Use set_index() to replace the index with column course and use rename_axis() to change the index name to None
  • Use .T to transpose and use .reset_index() to change the Student_ID index to a column, getting the Output shown above.
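
The steps above can also be run one at a time, which makes it easier to print each intermediate frame. This is only a sketch of the same pipeline: the variable names (raw, padded, names, with_names, exploded, result) are made up for illustration, and the padding step uses an isinstance check on each cell as an assumption about how to spot missing values.

```python
import pandas as pd

dct = {
    12855: {"Course_1": [12, 45, 378], "Course_2": [33, 78, 345]},
    12856: {"Course_6": [15, 25, 48], "Course_24": [31, 38, 342]},
}

# Step 1: courses become rows, students become columns; missing cells are NaN.
raw = pd.DataFrame(dct)

# Step 2: pad every non-list cell with three None values so that all cells
# in a row have the same length (needed to explode several columns at once).
padded = raw.apply(
    lambda col: col.map(lambda cell: cell if isinstance(cell, list) else [None] * 3)
)

# Step 3: one list of suffixed names per course row.
names = pd.Series(
    [[f"{course}_{letter}" for letter in "abc"] for course in padded.index],
    index=padded.index,
)
with_names = padded.assign(course=names)

# Step 4: explode the name column and every student column together.
exploded = with_names.explode(["course"] + list(padded.columns))

# Remaining steps: name the column axis, index by course name, transpose,
# and turn the Student_ID index into a regular column.
result = (
    exploded
    .rename_axis("Student_ID", axis=1)
    .set_index("course")
    .rename_axis(None)
    .T
    .reset_index()
)
print(result)
```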