Home > Back-end >  How to get substring more efficiently in pandas
How to get substring more efficiently in pandas

Time:03-16

I been seeing this, this, and this threads, but still cannot understand how to make the following problem more efficient:

I have a DataFrame with Course Names and the university who offers it:

df_courses:

course name university
0 name of course one university one
1 name of course two university one
2 "name of course three, with comma" university two

And I have another DataFrame that contains students enrollments:

df_enrollments:

enrollments student email
0 name of course one, name of course two [email protected]
1 name of course two, name of course three [email protected]
2 name of course three, with comma, name of course one, name of course two [email protected]

What I want to do is to get a new dataframe with each enrollment by student:

df_all_enrollments:

course name student email
0 name of course one [email protected]
1 name of course two [email protected]
2 name of course two [email protected]
3 "name of course three, with comma" [email protected]
4 "name of course three, with comma" [email protected]
5 name of course one [email protected]
6 name of course two [email protected]

The main issue is with the courses names with commas.

What I'm doing now to get this result is to make a list of df_courses['course name'] and then iterate over df_enrollments['enrollments'] searching contains and adding a new column with the course name:

courses = df_courses['course name'].to_list()   
df_all_enrollments = pd.DataFrame()
for i in courses:
    df_all_enroll = df_enrollments.loc[df_enrollments['enrollments'].str.contains(i, na=False, regex=False, case=True)]
    df_all_enroll.insert(1, 'Course Name', i)
    df_all_enrollments = pd.concat([df_all_enrollments, df_all_enroll ])

Until now this approach has worked, but I'm wondering if there's a more efficient way to perform this task.

Any help will be greatly appreciated.

CodePudding user response:

Is it what you expect:

courses = df2['enrollments'].str.split(', ')

df_all_enrollments = df_enrollments.assign(**{'course name': courses}) \
                                   .explode('course name', ignore_index=True)
print(df_all_enrollments)

Output

>>> df_all_enrollments
                                         enrollments          student email           course name
0             name of course one, name of course two   [email protected]    name of course one
1             name of course one, name of course two   [email protected]    name of course two
2           name of course two, name of course three  [email protected]    name of course two
3           name of course two, name of course three  [email protected]  name of course three
4  name of course three, name of course one, name...   [email protected]  name of course three
5  name of course three, name of course one, name...   [email protected]    name of course one
6  name of course three, name of course one, name...   [email protected]    name of course two

CodePudding user response:

Well, I have a possible answer.

The original approach took 6.6 minutes to process 41541 rows and output a DataFrame and CSV with 240603 rows.

Corralien's proposed answer is absolutely correct, but my data has some names with commas (eg 'Introduction to TensorFlow for AI, machine learning and deep learning')

The first problem (checking for matched course names in df_enrollments) was solved with a list comprehension:

df_enrollments['Course Name'] = [[y for y in cursos if y in x] for x in df_enrollments['Enrollments']]

The result, which was a column with a list of courses, is exploded up and gives me the expected result:

df_enrollments = df_enrollments.explode('Course Name')

Now it only takes 98.73 seconds to do it all :D

  • Related