Home > Software design >  Convert a Pandas Dataframe to a specific format. Unique ID appears in two subsequent rows
Convert a Pandas Dataframe to a specific format. Unique ID appears in two subsequent rows

Time:09-22

Hi im trying to convert this this dataframe to a specific format. I have simplified the data to the following.

import pandas as pd 
import numpy as np

student_id = [1, 2, 2, 4, 5, 5]
student_names = ["Bob", "Alex", "Alex", "Alice", "Sharon", "Sharon"]
student_status = ["Inactive", "Full Time", "Full Time", "Inactive", "Inactive", "Inactive"]
course_description = [np.nan, "Physics", "History", np.nan, "Physics", "History"]
course_paid = [np.nan, "Yes", "No", np.nan, "No", "Yes"]
enrollement = [np.nan, "Enrolled", "Not Enrolled", np.nan, "Not Enrolled", "Enrolled"]

df = pd.DataFrame(data = student_id, columns=["Student ID"])
df["Student Name"] = student_names
df["Student Status"] = student_status
df["Course Description"] = course_description
df["Course paid"] = course_paid
df["Enrollment"] = enrollement

print(df)

To

student_id = [1, 2, 4, 5]
student_names = ["Bob", "Alex", "Alice", "Sharon"]
student_status = ["Inactive", "Full Time", "Inactive", "Inactive"]
physics = [np.nan, "Enrolled", np.nan, "Not Enrolled"]
physics_paid = [np.nan, "Yes", np.nan, "No"]
history = [np.nan, "Not Enrolled", np.nan, "Enrolled"]
history_paid = [np.nan, "No", np.nan, "Yes"]

df2 = pd.DataFrame(data = student_id, columns=["Student ID"])
df2["Student Name"] = student_names
df2["Student Status"] = student_status
df2["Physics"] = physics
df2["Physics Paid"] = physics_paid
df2["History"] = history
df2["History Paid"] = history_paid

print(df2)

Any tips? Sorry, new to stackoverflow. I don't know how to show the output. The issue I have is working with the student ids in two rows.

CodePudding user response:

Your description is no so clear. if you want to remove duplicates in your data based off of a column (e.g., 'Student ID'), you can use the following method.

df.drop_duplicates(subset=['Student ID'])

CodePudding user response:

One way using pandas.DataFrame.pivot:

new_df = df.pivot(index=["Student ID", "Student Name", "Student Status"], 
         columns=["Course Description"], 
         values=["Course paid", "Enrollment"]).dropna(axis=1, how="all")

new_df.columns = [j   " paid" if i.endswith("paid") else j for i, j in new_df.columns]
new_df.reset_index()

Output:

   Student ID Student Name Student Status History paid Physics paid  \
0           1          Bob       Inactive          NaN          NaN   
1           2         Alex      Full Time           No          Yes   
2           4        Alice       Inactive          NaN          NaN   
3           5       Sharon       Inactive          Yes           No   

        History       Physics  
0           NaN           NaN  
1  Not Enrolled      Enrolled  
2           NaN           NaN  
3      Enrolled  Not Enrolled  

CodePudding user response:

You should use a duplicates dropping formula df.drop_duplicates

For more please take a look

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

  • Related