Hi im trying to convert this this dataframe to a specific format. I have simplified the data to the following.
import pandas as pd
import numpy as np
student_id = [1, 2, 2, 4, 5, 5]
student_names = ["Bob", "Alex", "Alex", "Alice", "Sharon", "Sharon"]
student_status = ["Inactive", "Full Time", "Full Time", "Inactive", "Inactive", "Inactive"]
course_description = [np.nan, "Physics", "History", np.nan, "Physics", "History"]
course_paid = [np.nan, "Yes", "No", np.nan, "No", "Yes"]
enrollement = [np.nan, "Enrolled", "Not Enrolled", np.nan, "Not Enrolled", "Enrolled"]
df = pd.DataFrame(data = student_id, columns=["Student ID"])
df["Student Name"] = student_names
df["Student Status"] = student_status
df["Course Description"] = course_description
df["Course paid"] = course_paid
df["Enrollment"] = enrollement
print(df)
To
student_id = [1, 2, 4, 5]
student_names = ["Bob", "Alex", "Alice", "Sharon"]
student_status = ["Inactive", "Full Time", "Inactive", "Inactive"]
physics = [np.nan, "Enrolled", np.nan, "Not Enrolled"]
physics_paid = [np.nan, "Yes", np.nan, "No"]
history = [np.nan, "Not Enrolled", np.nan, "Enrolled"]
history_paid = [np.nan, "No", np.nan, "Yes"]
df2 = pd.DataFrame(data = student_id, columns=["Student ID"])
df2["Student Name"] = student_names
df2["Student Status"] = student_status
df2["Physics"] = physics
df2["Physics Paid"] = physics_paid
df2["History"] = history
df2["History Paid"] = history_paid
print(df2)
Any tips? Sorry, new to stackoverflow. I don't know how to show the output. The issue I have is working with the student ids in two rows.
CodePudding user response:
Your description is no so clear. if you want to remove duplicates in your data based off of a column (e.g., 'Student ID'
), you can use the following method.
df.drop_duplicates(subset=['Student ID'])
CodePudding user response:
One way using pandas.DataFrame.pivot
:
new_df = df.pivot(index=["Student ID", "Student Name", "Student Status"],
columns=["Course Description"],
values=["Course paid", "Enrollment"]).dropna(axis=1, how="all")
new_df.columns = [j " paid" if i.endswith("paid") else j for i, j in new_df.columns]
new_df.reset_index()
Output:
Student ID Student Name Student Status History paid Physics paid \
0 1 Bob Inactive NaN NaN
1 2 Alex Full Time No Yes
2 4 Alice Inactive NaN NaN
3 5 Sharon Inactive Yes No
History Physics
0 NaN NaN
1 Not Enrolled Enrolled
2 NaN NaN
3 Enrolled Not Enrolled
CodePudding user response:
You should use a duplicates dropping formula
df.drop_duplicates
For more please take a look
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html