I have a large pandas dataframe (15million lines) and I want to replace any value that starts with 'College' and replace it with a blank. I know I could do this with a for loop or 'np.where', but this takes way too long on my large dataframe. I also want to create a 'combined_id' column where I take the student name and the college. I want to skip the ones that don't have a proper college name. What is the fastest way to do this?
original:
id1 id2 college_name student combined id
0 01 01 Stanford haley id/haley_Stanford
1 01 02 College12 josh id/josh_College12
2 01 03 Harvard jake id/jake_Harvard
2 01 05 UPenn emily id/emily_UPenn
2 01 00 College10 sarah id/sarah_College10
desired:
id1 id2 college_name student combined id
0 01 01 Stanford haley id/haley_Stanford
1 01 02 josh
2 01 03 Harvard jake id/jake_Harvard
2 01 05 UPenn emily id/emily_UPenn
2 01 00 sarah
CodePudding user response:
Use boolean indexing:
m = df['college_name'].str.startswith('College')
df.loc[m, 'college_name'] = ''
df.loc[m, 'combined id'] = ''
Or if "combined id" does not exist, you have to use numpy.where
:
df['combined id'] = np.where(m, '', 'id/' df['student'] '_' df['college_name'])
CodePudding user response:
Here's a way to get from original
to desired
in your question:
df.loc[df.college_name.str.startswith("College"), ['college_name', 'combined_id']] = ''
Output:
id1 id2 college_name student combined_id
0 1 1 Stanford haley id/haley_Stanford
1 1 2 josh
2 1 3 Harvard jake id/jake_Harvard
2 1 5 UPenn emily id/emily_UPenn
2 1 0 sarah