Convert a Pandas Dataframe to a specific format. Unique ID appears in two subsequent rows-CodePudding

Hi im trying to convert this this dataframe to a specific format. I have simplified the data to the following.

import pandas as pd 
import numpy as np

student_id = [1, 2, 2, 4, 5, 5]
student_names = ["Bob", "Alex", "Alex", "Alice", "Sharon", "Sharon"]
student_status = ["Inactive", "Full Time", "Full Time", "Inactive", "Inactive", "Inactive"]
course_description = [np.nan, "Physics", "History", np.nan, "Physics", "History"]
course_paid = [np.nan, "Yes", "No", np.nan, "No", "Yes"]
enrollement = [np.nan, "Enrolled", "Not Enrolled", np.nan, "Not Enrolled", "Enrolled"]

df = pd.DataFrame(data = student_id, columns=["Student ID"])
df["Student Name"] = student_names
df["Student Status"] = student_status
df["Course Description"] = course_description
df["Course paid"] = course_paid
df["Enrollment"] = enrollement

print(df)

student_id = [1, 2, 4, 5]
student_names = ["Bob", "Alex", "Alice", "Sharon"]
student_status = ["Inactive", "Full Time", "Inactive", "Inactive"]
physics = [np.nan, "Enrolled", np.nan, "Not Enrolled"]
physics_paid = [np.nan, "Yes", np.nan, "No"]
history = [np.nan, "Not Enrolled", np.nan, "Enrolled"]
history_paid = [np.nan, "No", np.nan, "Yes"]

df2 = pd.DataFrame(data = student_id, columns=["Student ID"])
df2["Student Name"] = student_names
df2["Student Status"] = student_status
df2["Physics"] = physics
df2["Physics Paid"] = physics_paid
df2["History"] = history
df2["History Paid"] = history_paid

print(df2)

Any tips? Sorry, new to stackoverflow. I don't know how to show the output. The issue I have is working with the student ids in two rows.

CodePudding user response：

Your description is no so clear. if you want to remove duplicates in your data based off of a column (e.g., 'Student ID'), you can use the following method.

df.drop_duplicates(subset=['Student ID'])

CodePudding user response：

One way using pandas.DataFrame.pivot:

new_df = df.pivot(index=["Student ID", "Student Name", "Student Status"], 
         columns=["Course Description"], 
         values=["Course paid", "Enrollment"]).dropna(axis=1, how="all")

new_df.columns = [j   " paid" if i.endswith("paid") else j for i, j in new_df.columns]
new_df.reset_index()

Output:

   Student ID Student Name Student Status History paid Physics paid  \
0           1          Bob       Inactive          NaN          NaN   
1           2         Alex      Full Time           No          Yes   
2           4        Alice       Inactive          NaN          NaN   
3           5       Sharon       Inactive          Yes           No   

        History       Physics  
0           NaN           NaN  
1  Not Enrolled      Enrolled  
2           NaN           NaN  
3      Enrolled  Not Enrolled

CodePudding user response：

You should use a duplicates dropping formula df.drop_duplicates

For more please take a look

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html