I'm trying to remove single word strings from a dataframe (ou
) and move it to another dataframe (removedSet
and allowedSet
), and then to a csv (names.csv
and removed.csv
). I am able to filter out specific strings, but I am having trouble removing single words from the dataframe I just made allowedSet
.
So I need to use the two dataframes I just made and check them for the single word strings. I want to append the single words to the dataframe with removed strings removedSet
and remove the single words from the other dataframe with only full names allowedSet
.
This is my desired output COMPnames.csv
:
COMP:Jenny Pepper:
COMP:Harry Perlinkle
COMPremoved.csv
COMP:LINK:
COMP:Printer90
COMP:faxeast test
But my COMPnames.csv
file has:
COMP:Jenny Pepper:
and COMPremoved.csv
COMP:Harry Perlinkle
COMP:LINK:
COMP:Printer90
COMP:faxeast test
Then do the same for NEWS department, etc.
The code below shows what I have attempted so far.
This is a smaller scale example of my code that you can run:
import pandas as pd
ou = {'OU': ['COMP:Jenny Pepper:', 'COMP:Harry Perlinkle', 'COMP:LINK:', 'NEWS:Peter Parker:', 'NEWS:PARK:', 'NEWS:Clark Kent:', 'NEWS:Store Merch', 'COMP:Printer90', 'NEWS:store123', 'COMP:faxeast test']}
df = pd.DataFrame(ou)
#my ou list
oulist = pd.DataFrame({'OU':['COMP', 'NEWS']})
#the strings I want to remove
removeStr = ['printer', 'store', 'fax', 'link', 'park']
for dept in oulist ['OU']:
#I want only the rows that contains strings in the OU column
df_dept = df[df['OU'].str.startswith(f'{dept}:')]
#put all of them in one csv file
df_dept['OU'].to_csv(f'{dept}all.csv', index=False, header=False)
#these two lines look in the removeStr list and make sure to check between the ':' so it doesn't grab the department name.
removedStr = [any(x in row[row.find(':') 1:].lower() for x in removeStr ) for row in df_dept['OU']]
allowedStr = [all(not x in row[row.find(':') 1:].lower() for x in removeStr ) for row in df_dept['OU']]
#remake the dataframe now
removedSet = df_dept[removedStr]
csvRemove = f'{dept}removed.csv'
allowedSet = df_dept[allowedStr ]
csvAllowed = f'{dept}names.csv'
#move both of them to csv
removedSet.to_csv(csvRemove, sep=',', encoding='utf-8', index=False, mode='a', header=False)
allowedSet.to_csv(csvAllowed , sep=',', encoding='utf-8', index=False, mode='a', header=False)
Below shows one of my attempts. I thought manually removing the single word strings through a simple str.split()
, and then move them to new dataframes, df1
and df2
. But this also didn't wors.
m=removedSet['OU'].str.split('^[\w] \:|\s').str.len()==2
m=list(m) #as this is how it's outputted for removedStr and allowedStr
df1=removedSet[m]
df2=allowedSet[~m]
#move both of them to csv
df1.to_csv(csvRemove, sep=',', encoding='utf-8', index=False, mode='a', header=False)
df2.to_csv(csvAllowed , sep=',', encoding='utf-8', index=False, mode='a', header=False)
I got this error:
df2=allowedSet[~m]
TypeError: bad operand type for unary ~: 'list'
CodePudding user response:
IIUC, use a regex with word boundaries and groupby
to save your files:
import re
regex = '|'.join(re.escape(w) for w in removeStr)
# 'printer|store|fax|link|park'
group1 = df['OU'].str.extract('([^:] )', expand=False)
group2 = (df['OU'].str.contains(fr'\b({regex})\d*\b', case=False)
.map({True: 'removed',
False: 'names'}))
for (g1, g2), g in df.groupby([group1, group2]):
filename = f'{g1}_{g2}.csv'
print(f'saving "{filename}"')
print(g)
#g.to_csv(filename) # uncomment to save
output:
saving "COMP_names.csv"
OU
0 COMP:Jenny Pepper:
1 COMP:Harry Perlinkle
9 COMP:faxeast test
saving "COMP_removed.csv"
OU
2 COMP:LINK:
7 COMP:Printer90
saving "NEWS_names.csv"
OU
3 NEWS:Peter Parker:
5 NEWS:Clark Kent:
saving "NEWS_removed.csv"
OU
4 NEWS:PARK:
6 NEWS:Store Merch
8 NEWS:store123
NB. the produced regex '\\b(printer|store|fax|link|park)\\d*\\b'
matches the blacklisted words as whole words, optionally allowing digits at the end.