I have student test data for three types of tests that I would like to condense into one row per subject. Here is my current table:
StudentTestID TestDate SubjectName 1Label 1Value 2Label 2Value
1000077 1-May Math Perf Level 3-NI
1000077 1-Apr ELA Perf Level 2-W
1000077 1-Jun Science Perf Level
1000077 1-May Math Scaled 232
1000077 1-Apr ELA Scaled 214
1000077 1-Jun Science Scaled
1000078 1-May Math Perf Level 2-W
1000078 1-Apr ELA Perf Level 4-P
1000078 1-Jun Science Perf Level
1000078 1-May Math Scaled 221
1000078 1-Apr ELA Scaled 215
1000078 1-Jun Science Scaled
And this is the desired output:
StudentTestID TestDate SubjectName 1Label 1Value 2Label 2Value
1000077 1-May Math Perf Level 3-NI Scaled 232
1000077 1-Apr ELA Perf Level 2-W Scaled 214
1000077 1-Jun Science Perf Level Scaled
1000078 1-May Math Perf Level 2-W Scaled 221
1000078 1-Apr ELA Perf Level 4-P Scaled 215
1000078 1-Jun Science Perf Level Scaled
I have tried various joins and groupby but I'm not getting the desired output. Examples of what I've tried:
df.groupby(['SubjectName'])
df.groupby(['StudentTestID'])
edf.join(edf.set_index('SubjectName'), on='SubjectName')
I am still relatively new to python so these were attempts after some googling but they did not achieve the desired output.
CodePudding user response:
Hope you enjoy it, Hope you are using pandas to work, I use simple and direct approach to solve this.
#df = df.fillna('', inplace=True) #uncomment it, if you are getting empty space
# create a new dataframe from old dataframe
new_df = df[['StudentTestID','2Label','2Value']]
# delete unwanted row
new_df = new_df[new_df['2Label'].notna()]
#create another df to store 1label and 1 value
new_df2 = df[['1Label','1Value']]
new_df2 = new_df2[new_df2['1Label'].notna()]
# use inner merge to dataframe
df = pd.merge(df, new_df, on= ['StudentTestID','2Label','2Value'], how= 'inner')
df['1Label'] = new_df2['1Label'].values
df['1Value'] = new_df2['1Value'].values
print(df)
here is the output,
StudentTestID TestDate SubjectName 1Label 1Value 2Label 2Value
0 1000077 1-May Math Perf-Level 3-NI Scaled 232.0
1 1000077 1-Apr ELA Perf-Level 2-W Scaled 214.0
2 1000077 1-Jun Science Perf-Level NaN Scaled NaN
3 1000078 1-May Math Perf-Level 2-W Scaled 221.0
4 1000078 1-Apr ELA Perf-Level 4-P Scaled 215.0
5 1000078 1-Jun Science Perf-Level NaN Scaled NaN
CodePudding user response:
Guessing that you are using pandas to do above things here is a solution.
While typing the sol got a little lengthy but it's simply because I printed the output for each step, the answer itself is hardly five lines of code. So don't get overwhelmed by seeing the length. Happy coding!
You can split the dataframe into different frames for different labels and use merge
command (more on it later). First reproducing the given example
# Recreating the given example
df = pd.DataFrame({'StudentTestID' : ['1000077','1000077','1000077','1000077','1000077','1000077','1000078','1000078','1000078','1000078','1000078','1000078'],
'TestDate' : ['1-May','1-Apr','1-Jun','1-May','1-Apr','1-Jun','1-May','1-Apr','1-Jun','1-May','1-Apr','1-Jun'],
'SubjectName' : ['Math', 'ELA', 'Science', 'Math', 'ELA', 'Science', 'Math', 'ELA', 'Science', 'Math', 'ELA', 'Science'],
'1Label' : ['Perf Level', 'Perf Level', 'Perf Level', '', '', '', 'Perf Level', 'Perf Level', 'Perf Level', '', '', ''],
'1Value' : ['3-NI', '2-W', '', '', '', '', '2-W', '4-P ', '', '', '', ''],
'2Label' : ['', '', '', 'Scaled', 'Scaled', 'Scaled', '', '', '', 'Scaled', 'Scaled', 'Scaled'],
'2Value' : ['', '', '', '232', '214', '', '', '', '', '221', '215', '']})
df
StudentTestID TestDate SubjectName 1Label 1Value 2Label 2Value
1000077 1-May Math Perf Level 3-NI
1000077 1-Apr ELA Perf Level 2-W
1000077 1-Jun Science Perf Level
1000077 1-May Math Scaled 232
1000077 1-Apr ELA Scaled 214
1000077 1-Jun Science Scaled
1000078 1-May Math Perf Level 2-W
1000078 1-Apr ELA Perf Level 4-P
1000078 1-Jun Science Perf Level
1000078 1-May Math Scaled 221
1000078 1-Apr ELA Scaled 215
1000078 1-Jun Science Scaled
We could split the dataframe based on labels
# Getting only 1Label, 1Value and not 2Label, 2Value
df1 = df[['StudentTestID', 'TestDate', 'SubjectName', '1Label', '1Value']]
df1
StudentTestID TestDate SubjectName 1Label 1Value
1000077 1-May Math Perf Level 3-NI
1000077 1-Apr ELA Perf Level 2-W
1000077 1-Jun Science Perf Level
1000077 1-May Math
1000077 1-Apr ELA
1000077 1-Jun Science
1000078 1-May Math Perf Level 2-W
1000078 1-Apr ELA Perf Level 4-P
1000078 1-Jun Science Perf Level
1000078 1-May Math
1000078 1-Apr ELA
1000078 1-Jun Science
Now dropping the empty values based on 1Label
column
df1 = df1[df1['1Label'] != ''] #Getting non-empty values
df1
StudentTestID TestDate SubjectName 1Label 1Value
1000077 1-May Math Perf Level 3-NI
1000077 1-Apr ELA Perf Level 2-W
1000077 1-Jun Science Perf Level
1000078 1-May Math Perf Level 2-W
1000078 1-Apr ELA Perf Level 4-P
1000078 1-Jun Science Perf Level
Similarly going for 2label, 2value
df2 = df[['StudentTestID', 'TestDate', 'SubjectName', '2Label', '2Value']]
df2 = df2[df2['2Label'] != '']
df2
StudentTestID TestDate SubjectName 2Label 2Value
1000077 1-May Math Scaled 232
1000077 1-Apr ELA Scaled 214
1000077 1-Jun Science Scaled
1000078 1-May Math Scaled 221
1000078 1-Apr ELA Scaled 215
1000078 1-Jun Science Scaled
Now we can merge them using merge
command like below, oversimplifying it will put all the columns from both the dataframes into a single frame and return it (the common columns are not repeated)
final_df = pd.merge(df1, df2, on = ['StudentTestID', 'TestDate', 'SubjectName'])
final_df
StudentTestID TestDate SubjectName 1Label 1Value 2Label 2Value
1000077 1-May Math Perf Level 3-NI Scaled 232
1000077 1-Apr ELA Perf Level 2-W Scaled 214
1000077 1-Jun Science Perf Level Scaled
1000078 1-May Math Perf Level 2-W Scaled 221
1000078 1-Apr ELA Perf Level 4-P Scaled 215
1000078 1-Jun Science Perf Level Scaled
So the final good-to-go code is
df1 = df[['StudentTestID', 'TestDate', 'SubjectName', '1Label', '1Value']]
df1 = df1[df1['1Label'] != '']
df2 = df[['StudentTestID', 'TestDate', 'SubjectName', '2Label', '2Value']]
df2 = df2[df2['2Label'] != '']
pd.merge(df1, df2, on = ['StudentTestID', 'TestDate', 'SubjectName'])