I have a CSV file with 10k lines. I want to first search for the rows with the required info, and then edit those rows.
Example below:
jobs = [
'X01_TEST1_C',
'P01_TEST3_B'
]
headers = ['job', 'name', 'date', 'extrainfo']
data = [
['X01_TEST1_C', 'NAME', 'DATE', 'EXTRADATA'],
['P01_TEST3_C', 'NAME', 'DATE', 'EXTRADATA'],
['X01_TEST1002_C', 'NAME', 'DATE', 'EXTRADATA'],
['X01_TEST4231_C', 'NAME', 'DATE', 'EXTRAP01_TEST3_BDATA']
]
I can load this into pandas and then search for a single item as below:
import pandas as pd

df = pd.read_csv("filename", sep=",", encoding='cp1252')
df1 = df[(df['job'].str.contains("X01_TEST1_C", na=False))]
print(df1)
which would print
['X01_TEST1_C', 'NAME', 'DATE', 'EXTRADATA']
How can I search for multiple values at once via pandas? I want something like:
df1 = df[(df['job'].str.contains(jobs, na=False))]
But I get the error: TypeError: first argument must be string or compiled pattern
Once I get past this part, I want to update some jobs from X01_TEST1_C to X01_NEW_TEST1_C. I'm adding this in case it's easier to do the whole thing at once.
Is pandas good for this, or do I need to try a different method, like the built-in csv module?
Thanks for any help.
CodePudding user response:
Try:
jobs = [
'X01_TEST1_C',
'P01_TEST3_B'
]
df1 = df[df['job'].str.contains('|'.join(jobs), na=False)]  # the default is regex=True, so no need to pass it

# this is equivalent to:
df1 = df[df['job'].str.contains('X01_TEST1_C|P01_TEST3_B', na=False)]
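For the second part of your question (renaming X01_TEST1_C to X01_NEW_TEST1_C), here is a minimal sketch, assuming you want exact whole-value renames in the job column; the renames dict is a hypothetical mapping you would fill in yourself:

# map old job names to new ones, then replace exact matches in the 'job' column
renames = {'X01_TEST1_C': 'X01_NEW_TEST1_C'}
df['job'] = df['job'].replace(renames)

# write the edited table back out, keeping the original encoding
df.to_csv("filename", sep=",", encoding='cp1252', index=False)

Series.replace with a dict only touches cells that equal a key exactly, so X01_TEST1002_C is left alone.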
CodePudding user response:
The simplest might be to concatenate your jobs into a single regexp and use that:
import re

jobs_re = "|".join(re.escape(job) for job in jobs)
df1 = df[df['job'].str.contains(jobs_re, regex=True, na=False)]
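For example, against the sample data above (a quick check; the DataFrame here is just rebuilt from the question's lists rather than read from the CSV):

import re
import pandas as pd

headers = ['job', 'name', 'date', 'extrainfo']
data = [
    ['X01_TEST1_C', 'NAME', 'DATE', 'EXTRADATA'],
    ['P01_TEST3_C', 'NAME', 'DATE', 'EXTRADATA'],
    ['X01_TEST1002_C', 'NAME', 'DATE', 'EXTRADATA'],
    ['X01_TEST4231_C', 'NAME', 'DATE', 'EXTRAP01_TEST3_BDATA'],
]
df = pd.DataFrame(data, columns=headers)

jobs = ['X01_TEST1_C', 'P01_TEST3_B']
jobs_re = "|".join(re.escape(job) for job in jobs)  # escape any regex metacharacters in the names
df1 = df[df['job'].str.contains(jobs_re, regex=True, na=False)]
print(df1)
# only the X01_TEST1_C row matches: P01_TEST3_B never appears in the 'job'
# column, and the search ignores the EXTRAP01_TEST3_BDATA in 'extrainfo'

re.escape matters once job names can contain characters like '.' or '+', which would otherwise be treated as regex operators.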