Python group by row and topic and create new columns with binary value

I have a large CSV where each row is a separate school course, and each course is tagged with one or more topics, like so:

school name	department	course name	topics
A	A1	X	1; 2
A	A1	Y	1; 3
B	B1	Z	1; 2; 4
C	C1	XX	1; 5

I need to calculate the presence of each topic for each course. Each topic needs to be appended as its own column, with each row coded as either 0 or 1 depending on whether the topic is present in the course. There are 49 topics in total, so I need to add 49 columns to the table.

There are 4000 rows, many of which are repeats of the same course name, so those need to be grouped as well. For example, if one instance of course name X has topics 1, 2 but another instance of the same course has topics 1, 3, then topic 1, topic 2, and topic 3 should all be coded as 1, while the rest of the topic columns are coded as 0.

The output should look something like this:

school name	department	course name	topic 1	topic 2	topic 3	topic 4	topic 5
A	A1	X	1	1	0	0	0
A	A1	Y	1	0	1	0	0
B	B1	Z	1	1	0	1	0
C	C1	XX	1	0	0	0	1

I have a nested loop structure that splits the topics by semicolon and then does some cleaning to standardize the topic names, but I'm stuck on how to append a new column for each topic coded as the correct binary value.

import pandas as pd
import io
 
df = pd.read_csv(io.BytesIO(uploaded['courses.csv']))

df_dict = {}
result = []
topic = []
count = []

for item in df['Course ID']:
    for item in df['Final Code']:
        if type(item) == str:
            item_list = item.split(';')
            for each in item_list:
                each = each.strip().lower()
                if each == "[too ambigious]" or each == "too ambiguous":
                    each = "[too ambiguous]"
                if each == "equity and equality\\":
                    each = "equity and equality"
                if each == "[NA]" or each == "N/A":
                    each = "NA"
                    #create new column for each unique topic
                    for item in df['Final Code']:
                        result.append(item)


                        #1 if that topic is present in each course, else 0
                        if each in df_dict:
                            df_dict[each] = 1
                        else:
                            df_dict[each] = 0

print(df_dict)
print(result)
print(df)

I haven't used Python for a long time, so I've forgotten how to go about this.

CodePudding user response：

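You can try this. Note that the separator passed to get_dummies is '; ' (semicolon plus space), matching the sample data; with a bare ';' the column names pick up a leading space.

df.set_index(['school name', 'department', 'course name'])['topics'].str.get_dummies('; ')\
  .add_prefix('topic ')\
  .reset_index()

Output:

  school name department course name  topic 1  topic 2  topic 3  topic 4  topic 5
0           A         A1           X        1        1        0        0        0
1           A         A1           Y        1        0        1        0        0
2           B         B1           Z        1        1        0        1        0
3           C         C1          XX        1        0        0        0        1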
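
The question also asks for repeated courses to be merged. A minimal sketch of one way to do that on top of get_dummies, assuming the three identifying columns together define a course and that spacing around the semicolons may be inconsistent. The second "X" row in the sample is made up for illustration:

```python
import io
import pandas as pd

# Sample data from the question, plus one hypothetical repeat of course X
raw = """school name\tdepartment\tcourse name\ttopics
A\tA1\tX\t1; 2
A\tA1\tX\t1;3
A\tA1\tY\t1; 3
B\tB1\tZ\t1; 2; 4
C\tC1\tXX\t1; 5
"""
df = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)

keys = ["school name", "department", "course name"]

dummies = (
    df.set_index(keys)["topics"]
      .str.replace(r"\s*;\s*", ";", regex=True)  # drop spaces around the separator
      .str.strip()
      .str.get_dummies(";")
      .add_prefix("topic ")
)

# max() keeps a 1 if any instance of the course carries the topic
result = dummies.groupby(level=keys).max().reset_index()
print(result)
```

Taking the maximum per group works because the columns only hold 0 and 1, so it acts as a logical "or" across the repeated rows.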

CodePudding user response：

Try using the Series.str.split method with the expand=True keyword argument.
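
For illustration, here is a rough sketch of how that suggestion might be carried through. expand=True only spreads the pieces over positional columns (0, 1, 2, ...), so a second step is still needed to turn them into one 0/1 column per topic. The variable names parts, topics, and out are placeholders, not anything from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "course name": ["X", "Y", "Z", "XX"],
    "topics": ["1; 2", "1; 3", "1; 2; 4", "1; 5"],
})

# expand=True gives one column per position; short rows are padded with missing values
parts = df["topics"].str.split(r"\s*;\s*", expand=True)

# Every distinct topic label that occurs anywhere in the table
topics = sorted({t for t in parts.to_numpy().ravel() if pd.notna(t)})

out = df.drop(columns="topics")
for t in topics:
    # 1 if the topic appears in any position of the row, else 0
    out[f"topic {t}"] = parts.eq(t).any(axis=1).astype(int)

print(out)
```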

CodePudding user response：

Try using np.where(condition, result_if_true, result_if_false), as shown in the example at this link: https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
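
A small sketch of what the np.where route could look like here, assuming the list of topics is known up front (the list below is a stand-in for the 49 real topics). The condition tests membership in the split list rather than using a substring match, since a substring test for "1" would also match a topic such as "10":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "course name": ["X", "Y", "Z", "XX"],
    "topics": ["1; 2", "1; 3", "1; 2; 4", "1; 5"],
})

# One list of stripped topic labels per row
topic_lists = df["topics"].str.split(";").apply(lambda parts: [p.strip() for p in parts])

for t in ["1", "2", "3", "4", "5"]:  # hypothetical fixed list of topics
    has_topic = topic_lists.apply(lambda labels: t in labels)
    # np.where(condition, value_if_true, value_if_false)
    df[f"topic {t}"] = np.where(has_topic, 1, 0)

print(df.drop(columns="topics"))
```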

CodePudding user response：

Try .str.split followed by pd.crosstab, then concatenate the two dataframes:

x = df.pop("topics").str.split(r"\s*;\s*").explode()
x = pd.crosstab(x.index, x).add_prefix("topic ")
df = pd.concat([df, x], axis=1)

print(df)

Prints:

  school name department course name  topic 1  topic 2  topic 3  topic 4  topic 5
0           A         A1           X        1        1        0        0        0
1           A         A1           Y        1        0        1        0        0
2           B         B1           Z        1        1        0        1        0
3           C         C1          XX        1        0        0        0        1
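
The topic-name cleanup described in the question can be folded into the same pipeline before the crosstab. A sketch under some assumptions: the sample labels below are invented to exercise the cleanup rules, and the keys of the fixes mapping are lower-cased because the lookup happens after .str.lower():

```python
import pandas as pd

# Hypothetical rows with messy text labels
df = pd.DataFrame({
    "course name": ["X", "Y"],
    "topics": ["Equity and Equality\\; [too ambigious]", "too ambiguous; N/A"],
})

# Variant spellings mapped to one canonical label each
fixes = {
    "[too ambigious]": "[too ambiguous]",
    "too ambiguous": "[too ambiguous]",
    "equity and equality\\": "equity and equality",
    "[na]": "NA",
    "n/a": "NA",
}

x = (
    df.pop("topics")
      .str.split(";")
      .explode()
      .str.strip()
      .str.lower()
      .replace(fixes)  # exact-match replacement of whole labels
)

# clip(upper=1) keeps the result binary if a label repeats within one row
flags = pd.crosstab(x.index, x).clip(upper=1).add_prefix("topic ")
df = pd.concat([df, flags], axis=1)
print(df)
```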