I have a large CSV file in which each row is a separate school course, and each course is tagged with one or more topics, like so:
school name | department | course name | topics |
---|---|---|---|
A | A1 | X | 1; 2 |
A | A1 | Y | 1; 3 |
B | B1 | Z | 1; 2; 4 |
C | C1 | XX | 1; 5 |
I need to calculate the presence of each topic for each course. Each topic needs to be appended as its own column, with each row coded as either 0 or 1 depending on whether the topic is present in the course. There are 49 topics in total, so I need to add 49 columns to the table.
There are 4000 rows, many of which repeat the same course name, so those need to be grouped as well. For example, if one instance of course X has topics 1 and 2 but another instance of the same course has topics 1 and 3, then the columns for topic 1, topic 2, and topic 3 should each be 1, while the rest of the topic columns are coded as 0.
The output should look something like this:
school name | department | course name | topic 1 | topic 2 | topic 3 | topic 4 | topic 5 |
---|---|---|---|---|---|---|---|
A | A1 | X | 1 | 1 | 0 | 0 | 0 |
A | A1 | Y | 1 | 0 | 1 | 0 | 0 |
B | B1 | Z | 1 | 1 | 0 | 1 | 0 |
C | C1 | XX | 1 | 0 | 0 | 0 | 1 |
I have a nested loop structure that splits the topics by semicolon and then does some cleaning to standardize the topic names, but I'm stuck on how to append a new column for each topic coded as the correct binary value.
import pandas as pd
import io
df = pd.read_csv(io.BytesIO(uploaded['courses.csv']))
df_dict = {}
result = []
topic = []
count = []
for item in df['Course ID']:
    for item in df['Final Code']:
        if type(item) == str:
            item_list = item.split(';')
            for each in item_list:
                each = each.strip().lower()
                if each == "[too ambigious]" or each == "too ambiguous":
                    each = "[too ambiguous]"
                if each == "equity and equality\\":
                    each = "equity and equality"
                if each == "[NA]" or each == "N/A":
                    each = "NA"
                #create new column for each unique topic
                for item in df['Final Code']:
                    result.append(item)
                #1 if that topic is present in each course, else 0
                if each in df_dict:
                    df_dict[each] = 1
                else:
                    df_dict[each] = 0
print(df_dict)
print(result)
print(df)
I haven't used Python for a long time, so I've forgotten how to go about this.
CodePudding user response:
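You can try str.get_dummies. Note that the separator has to include the space after the semicolon ('; '); with ';' alone the column names keep a leading space (' 2', ' 3', ...) and sort ahead of '1':
df.set_index(['school name', 'department', 'course name'])['topics'].str.get_dummies('; ')\
    .add_prefix('topic ')\
    .reset_index()
Output:
  school name department course name  topic 1  topic 2  topic 3  topic 4  topic 5
0           A         A1           X        1        1        0        0        0
1           A         A1           Y        1        0        1        0        0
2           B         B1           Z        1        1        0        1        0
3           C         C1          XX        1        0        0        0        1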
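The question also asks for repeated courses to be merged so that a topic counts as present if any instance of the course has it. One possible sketch, assuming the column names from the example table and a small made-up sample in which course X appears twice: build the indicator columns with str.get_dummies, then take the per-group maximum over the three identifying columns.

```python
import pandas as pd

# Hypothetical sample: course X appears twice with different topics.
df = pd.DataFrame({
    "school name": ["A", "A", "A", "B", "C"],
    "department": ["A1", "A1", "A1", "B1", "C1"],
    "course name": ["X", "X", "Y", "Z", "XX"],
    "topics": ["1; 2", "1; 3", "1; 3", "1; 2; 4", "1; 5"],
})

keys = ["school name", "department", "course name"]

dummies = (
    df.set_index(keys)["topics"]
      .str.strip()
      # Normalize "1; 2" / "1 ;2" to "1;2" so no column name keeps stray spaces.
      .str.replace(r"\s*;\s*", ";", regex=True)
      .str.get_dummies(";")
      .add_prefix("topic ")
)

# max() over each group gives 1 if any instance of the course has the topic.
grouped = dummies.groupby(level=keys).max().reset_index()
print(grouped)
```

Here the two X rows collapse into one row with topic 1, topic 2, and topic 3 all set to 1. Whether courses should be grouped by all three columns or by course name alone depends on the data, so the `keys` list is an assumption to adjust.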
CodePudding user response:
Try using the Series.str.split method with the expand=True keyword argument.
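On its own, expand=True only spreads the topics across positional columns (0, 1, 2, ...), so a further step is needed to reach 0/1 indicator columns. A rough sketch of one way to finish the job, assuming the sample table from the question: stack the split columns into one long Series, one-hot encode it, and collapse back to one row per original row.

```python
import pandas as pd

# Sample data copied from the question's example table.
df = pd.DataFrame({
    "school name": ["A", "A", "B", "C"],
    "department": ["A1", "A1", "B1", "C1"],
    "course name": ["X", "Y", "Z", "XX"],
    "topics": ["1; 2", "1; 3", "1; 2; 4", "1; 5"],
})

# One column per topic position; rows with fewer topics are padded with None.
parts = df["topics"].str.split(";", expand=True)

# Long format: one entry per (row, position), padding dropped, whitespace stripped.
stacked = parts.stack().dropna().str.strip()

indicators = (
    pd.get_dummies(stacked, dtype=int)
      .groupby(level=0).max()   # level 0 is the original row label
      .add_prefix("topic ")
)

out = df.drop(columns="topics").join(indicators)
print(out)
```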
CodePudding user response:
Try using np.where(condition, result_if_true, result_if_false), as shown in the tutorial linked below:
https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
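As a rough illustration of how np.where could apply here (an assumption about the intent, since the answer gives no code): turn each row's topic string into a set, then add one column per topic with np.where over a membership test. The helper names `topic_sets` and `all_topics` are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's example table.
df = pd.DataFrame({
    "school name": ["A", "A", "B", "C"],
    "department": ["A1", "A1", "B1", "C1"],
    "course name": ["X", "Y", "Z", "XX"],
    "topics": ["1; 2", "1; 3", "1; 2; 4", "1; 5"],
})

# Set of cleaned topic names for each row.
topic_sets = df["topics"].str.split(";").apply(lambda parts: {p.strip() for p in parts})

# Every distinct topic in the data, sorted as strings (so "10" sorts before "2").
all_topics = sorted(set().union(*topic_sets))

for t in all_topics:
    has_topic = topic_sets.apply(lambda s: t in s)
    df[f"topic {t}"] = np.where(has_topic, 1, 0)

print(df.drop(columns="topics"))
```

This loops once per topic, which is fine for 49 topics and 4000 rows, though the vectorized approaches in the other answers are shorter.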
CodePudding user response:
Try .str.split followed by pd.crosstab, then concatenate the two DataFrames:
x = df.pop("topics").str.split(r"\s*;\s*").explode()
x = pd.crosstab(x.index, x).add_prefix("topic ")
df = pd.concat([df, x], axis=1)
print(df)
Prints:
school name department course name topic 1 topic 2 topic 3 topic 4 topic 5
0 A A1 X 1 1 0 0 0
1 A A1 Y 1 0 1 0 0
2 B B1 Z 1 1 0 1 0
3 C C1 XX 1 0 0 0 1
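The question's loop also standardizes topic names before coding them. A sketch of how that cleaning might be folded into the explode/crosstab approach, using made-up sample rows and a `fixes` mapping transcribed from the question's if-statements (keys are lowercase because the names are lowercased first). Passing the course name as the crosstab index also merges repeated courses.

```python
import pandas as pd

# Hypothetical messy sample; course X appears twice.
df = pd.DataFrame({
    "course name": ["X", "X", "Y"],
    "topics": [
        "Equity and Equality\\; too ambiguous",
        "[too ambigious]",
        "N/A; equity and equality",
    ],
})

# Replacement rules taken from the question's cleaning code.
fixes = {
    "[too ambigious]": "[too ambiguous]",
    "too ambiguous": "[too ambiguous]",
    "equity and equality\\": "equity and equality",
    "[na]": "NA",
    "n/a": "NA",
}

# One row per (course instance, topic), with names stripped, lowercased, and fixed.
topics = df["topics"].str.split(";").explode().str.strip().str.lower().replace(fixes)
long = df[["course name"]].join(topics)

# crosstab counts occurrences per course; clip turns counts into 0/1 flags.
indicators = pd.crosstab(long["course name"], long["topics"]).clip(upper=1)
print(indicators)
```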