Removing single word strings from a Dataframe and moving them to a csv-CodePudding

I'm trying to remove single word strings from a dataframe (ou) and move it to another dataframe (removedSet and allowedSet), and then to a csv (names.csv and removed.csv). I am able to filter out specific strings, but I am having trouble removing single words from the dataframe I just made allowedSet.

So I need to use the two dataframes I just made and check them for the single word strings. I want to append the single words to the dataframe with removed strings removedSet and remove the single words from the other dataframe with only full names allowedSet.

This is my desired output COMPnames.csv:

COMP:Jenny Pepper:
COMP:Harry Perlinkle

COMPremoved.csv

COMP:LINK:
COMP:Printer90
COMP:faxeast test

But my COMPnames.csv file has:

COMP:Jenny Pepper:

and COMPremoved.csv

COMP:Harry Perlinkle
COMP:LINK:
COMP:Printer90
COMP:faxeast test

Then do the same for NEWS department, etc.

The code below shows what I have attempted so far.

This is a smaller scale example of my code that you can run:

import pandas as pd

ou = {'OU':  ['COMP:Jenny Pepper:', 'COMP:Harry Perlinkle', 'COMP:LINK:', 'NEWS:Peter Parker:', 'NEWS:PARK:', 'NEWS:Clark Kent:', 'NEWS:Store Merch', 'COMP:Printer90', 'NEWS:store123', 'COMP:faxeast test']}

df = pd.DataFrame(ou)

#my ou list
oulist = pd.DataFrame({'OU':['COMP', 'NEWS']})

#the strings I want to remove
removeStr = ['printer', 'store', 'fax', 'link', 'park']

for dept in oulist ['OU']:
        #I want only the rows that contains strings in the OU column
        df_dept = df[df['OU'].str.startswith(f'{dept}:')]
        
        #put all of them in one csv file
        df_dept['OU'].to_csv(f'{dept}all.csv', index=False, header=False)
        
        #these two lines look in the removeStr list and make sure to check between the ':' so it doesn't grab the department name.         
        removedStr = [any(x in row[row.find(':') 1:].lower() for x in removeStr ) for row in df_dept['OU']]
        allowedStr = [all(not x in row[row.find(':') 1:].lower() for x in removeStr ) for row in df_dept['OU']]
        
        #remake the dataframe now
        removedSet = df_dept[removedStr]
        csvRemove = f'{dept}removed.csv'
        
        allowedSet = df_dept[allowedStr ]
        csvAllowed = f'{dept}names.csv'

        #move both of them to csv
        removedSet.to_csv(csvRemove, sep=',', encoding='utf-8', index=False,  mode='a', header=False)
        allowedSet.to_csv(csvAllowed , sep=',', encoding='utf-8', index=False, mode='a', header=False)

Below shows one of my attempts. I thought manually removing the single word strings through a simple str.split(), and then move them to new dataframes, df1 and df2. But this also didn't wors.

m=removedSet['OU'].str.split('^[\w] \:|\s').str.len()==2
m=list(m) #as this is how it's outputted for removedStr and allowedStr
df1=removedSet[m]
df2=allowedSet[~m]

#move both of them to csv
df1.to_csv(csvRemove, sep=',', encoding='utf-8', index=False,  mode='a', header=False)
df2.to_csv(csvAllowed , sep=',', encoding='utf-8', index=False, mode='a', header=False)

I got this error:

df2=allowedSet[~m]
TypeError: bad operand type for unary ~: 'list'

CodePudding user response：

IIUC, use a regex with word boundaries and groupby to save your files:

import re
regex = '|'.join(re.escape(w) for w in removeStr)
# 'printer|store|fax|link|park'

group1 = df['OU'].str.extract('([^:] )', expand=False)
group2 = (df['OU'].str.contains(fr'\b({regex})\d*\b', case=False)
                  .map({True: 'removed',
                        False: 'names'}))
for (g1, g2), g in df.groupby([group1, group2]):
    filename = f'{g1}_{g2}.csv'
    print(f'saving "{filename}"')
    print(g)
    #g.to_csv(filename) # uncomment to save

output:

saving "COMP_names.csv"
                     OU
0    COMP:Jenny Pepper:
1  COMP:Harry Perlinkle
9     COMP:faxeast test
saving "COMP_removed.csv"
               OU
2      COMP:LINK:
7  COMP:Printer90
saving "NEWS_names.csv"
                   OU
3  NEWS:Peter Parker:
5    NEWS:Clark Kent:
saving "NEWS_removed.csv"
                 OU
4        NEWS:PARK:
6  NEWS:Store Merch
8     NEWS:store123

NB. the produced regex '\\b(printer|store|fax|link|park)\\d*\\b' matches the blacklisted words as whole words, optionally allowing digits at the end.