Home > Net >  How do I combine rows with empty cells into one full row in python?
How do I combine rows with empty cells into one full row in python?

Time:06-30

I have student test data for three types of tests that I would like to condense into one row per subject. Here is my current table:

StudentTestID   TestDate    SubjectName 1Label      1Value  2Label 2Value   
1000077         1-May       Math        Perf Level  3-NI                        
1000077         1-Apr       ELA         Perf Level  2-W                     
1000077         1-Jun       Science     Perf Level                              
1000077         1-May       Math                            Scaled 232          
1000077         1-Apr       ELA                             Scaled 214          
1000077         1-Jun       Science                         Scaled              
1000078         1-May       Math        Perf Level  2-W                     
1000078         1-Apr       ELA         Perf Level  4-P                         
1000078         1-Jun       Science     Perf Level                              
1000078         1-May       Math                            Scaled 221          
1000078         1-Apr       ELA                             Scaled 215          
1000078         1-Jun       Science                         Scaled          

And this is the desired output:

StudentTestID   TestDate    SubjectName 1Label      1Value  2Label 2Value   
1000077         1-May       Math        Perf Level  3-NI    Scaled 232                  
1000077         1-Apr       ELA         Perf Level  2-W     Scaled 214              
1000077         1-Jun       Science     Perf Level          Scaled                  
1000078         1-May       Math        Perf Level  2-W     Scaled 221                  
1000078         1-Apr       ELA         Perf Level  4-P     Scaled 215              
1000078         1-Jun       Science     Perf Level          Scaled  

I have tried various joins and groupby but I'm not getting the desired output. Examples of what I've tried:

df.groupby(['SubjectName'])
df.groupby(['StudentTestID'])
edf.join(edf.set_index('SubjectName'), on='SubjectName')

I am still relatively new to python so these were attempts after some googling but they did not achieve the desired output.

CodePudding user response:

Hope you enjoy it, Hope you are using pandas to work, I use simple and direct approach to solve this.



#df = df.fillna('', inplace=True)  #uncomment it, if you are getting empty space

# create a new dataframe from old dataframe 
new_df = df[['StudentTestID','2Label','2Value']]
#  delete unwanted row 
new_df = new_df[new_df['2Label'].notna()]
#create another df to store 1label and 1 value 
new_df2 = df[['1Label','1Value']]
new_df2 = new_df2[new_df2['1Label'].notna()]
# use inner merge to dataframe
df = pd.merge(df, new_df, on= ['StudentTestID','2Label','2Value'], how= 'inner')
df['1Label'] = new_df2['1Label'].values
df['1Value'] = new_df2['1Value'].values
print(df)

here is the output,

    StudentTestID TestDate SubjectName    1Label   1Value 2Label  2Value
0        1000077    1-May        Math  Perf-Level   3-NI  Scaled   232.0
1        1000077    1-Apr         ELA  Perf-Level    2-W  Scaled   214.0
2        1000077    1-Jun     Science  Perf-Level    NaN  Scaled     NaN
3        1000078    1-May        Math  Perf-Level    2-W  Scaled   221.0
4        1000078    1-Apr         ELA  Perf-Level   4-P   Scaled   215.0
5        1000078    1-Jun     Science  Perf-Level    NaN  Scaled     NaN

CodePudding user response:

Guessing that you are using pandas to do above things here is a solution.

While typing the sol got a little lengthy but it's simply because I printed the output for each step, the answer itself is hardly five lines of code. So don't get overwhelmed by seeing the length. Happy coding!

You can split the dataframe into different frames for different labels and use merge command (more on it later). First reproducing the given example

# Recreating the given example
df = pd.DataFrame({'StudentTestID' : ['1000077','1000077','1000077','1000077','1000077','1000077','1000078','1000078','1000078','1000078','1000078','1000078'],
                   'TestDate' : ['1-May','1-Apr','1-Jun','1-May','1-Apr','1-Jun','1-May','1-Apr','1-Jun','1-May','1-Apr','1-Jun'],
                   'SubjectName' : ['Math', 'ELA', 'Science', 'Math', 'ELA', 'Science', 'Math', 'ELA', 'Science', 'Math', 'ELA', 'Science'],
                   '1Label' : ['Perf Level', 'Perf Level', 'Perf Level', '', '', '', 'Perf Level', 'Perf Level', 'Perf Level', '', '', ''],
                   '1Value' : ['3-NI', '2-W', '', '', '', '', '2-W', '4-P ', '', '', '', ''],
                   '2Label' : ['', '', '', 'Scaled', 'Scaled', 'Scaled', '', '', '', 'Scaled', 'Scaled', 'Scaled'],
                   '2Value' : ['', '', '', '232', '214', '', '', '', '', '221', '215', '']})

df

StudentTestID   TestDate    SubjectName 1Label       1Value   2Label    2Value
1000077            1-May    Math        Perf Level   3-NI       
1000077            1-Apr    ELA         Perf Level   2-W        
1000077            1-Jun    Science     Perf Level          
1000077            1-May    Math                               Scaled   232
1000077            1-Apr    ELA                                Scaled   214
1000077            1-Jun    Science                            Scaled   
1000078            1-May    Math        Perf Level    2-W       
1000078            1-Apr    ELA         Perf Level    4-P       
1000078            1-Jun    Science     Perf Level          
1000078            1-May    Math                               Scaled   221
1000078            1-Apr    ELA                                Scaled   215
1000078            1-Jun    Science                            Scaled   

We could split the dataframe based on labels

# Getting only 1Label, 1Value and not 2Label, 2Value
df1 = df[['StudentTestID', 'TestDate', 'SubjectName', '1Label', '1Value']]

df1

StudentTestID   TestDate    SubjectName 1Label      1Value
1000077         1-May       Math        Perf Level  3-NI
1000077         1-Apr       ELA         Perf Level  2-W
1000077         1-Jun       Science     Perf Level  
1000077         1-May       Math        
1000077         1-Apr       ELA     
1000077         1-Jun       Science     
1000078         1-May       Math        Perf Level  2-W
1000078         1-Apr       ELA         Perf Level  4-P 
1000078         1-Jun       Science     Perf Level  
1000078         1-May       Math        
1000078         1-Apr       ELA     
1000078         1-Jun       Science     

Now dropping the empty values based on 1Label column

df1 = df1[df1['1Label'] != '']    #Getting non-empty values
df1

StudentTestID   TestDate    SubjectName 1Label      1Value
1000077         1-May       Math        Perf Level  3-NI
1000077         1-Apr       ELA         Perf Level  2-W
1000077         1-Jun       Science     Perf Level  
1000078         1-May       Math        Perf Level  2-W
1000078         1-Apr       ELA         Perf Level  4-P 
1000078         1-Jun       Science     Perf Level  

Similarly going for 2label, 2value

df2 = df[['StudentTestID', 'TestDate', 'SubjectName', '2Label', '2Value']]
df2 = df2[df2['2Label'] != '']

df2

StudentTestID   TestDate    SubjectName 2Label  2Value
1000077         1-May       Math        Scaled  232
1000077         1-Apr       ELA         Scaled  214
1000077         1-Jun       Science     Scaled  
1000078         1-May       Math        Scaled  221
1000078         1-Apr       ELA         Scaled  215
1000078         1-Jun       Science     Scaled  

Now we can merge them using merge command like below, oversimplifying it will put all the columns from both the dataframes into a single frame and return it (the common columns are not repeated)

final_df = pd.merge(df1, df2, on = ['StudentTestID', 'TestDate', 'SubjectName'])

final_df

StudentTestID   TestDate    SubjectName 1Label      1Value  2Label  2Value
1000077         1-May       Math        Perf Level  3-NI    Scaled  232
1000077         1-Apr       ELA         Perf Level  2-W     Scaled  214
1000077         1-Jun       Science     Perf Level          Scaled  
1000078         1-May       Math        Perf Level  2-W     Scaled  221
1000078         1-Apr       ELA         Perf Level  4-P     Scaled  215
1000078         1-Jun       Science     Perf Level          Scaled  

So the final good-to-go code is

df1 = df[['StudentTestID', 'TestDate', 'SubjectName', '1Label', '1Value']]
df1 = df1[df1['1Label'] != '']
df2 = df[['StudentTestID', 'TestDate', 'SubjectName', '2Label', '2Value']]
df2 = df2[df2['2Label'] != '']
pd.merge(df1, df2, on = ['StudentTestID', 'TestDate', 'SubjectName'])
  • Related