How do I match a list of substrings held in an array against a dataframe column and use the matching array values to create a new column? For example, I started off using str.contains
and typing out each string value (see below).
import pandas as pd
import numpy as np
#Filepath directory
csv_report = filepath
#Creates dataframe of CSV report
csv_df = pd.read_csv(csv_report)
# The innermost np.where needs an "else" value too; "" marks rows with no match
csv_df['animal'] = np.where(csv_df.item_name.str.contains('Condor'), "Condor",
                   np.where(csv_df.item_name.str.contains('Marmot'), "Marmot",
                   np.where(csv_df.item_name.str.contains('Bear'), "Bear",
                   np.where(csv_df.item_name.str.contains('Pika'), "Pika",
                   np.where(csv_df.item_name.str.contains('Rat'), "Rat",
                   np.where(csv_df.item_name.str.contains('Racoon'), "Racoon",
                   np.where(csv_df.item_name.str.contains('Opossum'), "Opossum", "")))))))
How would I go about achieving the above code if the string values are in an array instead? Sample below:
import pandas as pd
import numpy as np
#Filepath directory
csv_report = filepath
#Creates dataframe of CSV report
csv_df = pd.read_csv(csv_report)
animal_list = np.array(['Condor', 'Marmot','Bear','Pika','Rat','Racoon','Opossum'])
CodePudding user response:
I think there's a cleaner way to write this, but it does what you want. If you need case-insensitive or whole-word matching, you'll have to modify this to suit your needs. Also, you don't need an np.array; a plain list is enough.
import io
import pandas as pd
data = '''item_name
Condor
Marmot
Bear
Condor a
Marmotb
Bearxyz
'''
# Raw string avoids an invalid-escape warning for \s
df = pd.read_csv(io.StringIO(data), sep=r' \s ', engine='python')
df
animal_list = ['Condor', 'Marmot','Bear','Pika','Rat','Racoon','Opossum']
def find_matches(x):
    # Return the first animal whose name appears in the row's item_name
    for animal in animal_list:
        if animal in x['item_name']:
            return animal
df.apply(lambda x: find_matches(x), axis=1)
0 Condor
1 Marmot
2 Bear
3 Condor
4 Marmot
5 Bear
dtype: object
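The case-insensitive, whole-word variation mentioned above could be sketched roughly as follows. This is only one possible approach, not part of the original answer: it swaps the loop for str.extract with a single regex alternation, and the sample rows, the pattern variable, and the canonical lookup dict are made up for illustration.

```python
import re
import pandas as pd

animal_list = ['Condor', 'Marmot', 'Bear', 'Pika', 'Rat', 'Racoon', 'Opossum']

# Hypothetical sample rows chosen to show the edge cases
df = pd.DataFrame({'item_name': ['Condor', 'marmot b', 'Bearxyz', 'big BEAR', 'Pirate']})

# One alternation pattern; \b on both sides requires a whole-word match
pattern = r'\b(' + '|'.join(re.escape(a) for a in animal_list) + r')\b'

# expand=False returns a Series; rows without a match become NaN
found = df['item_name'].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Map whatever casing was matched back to the spelling used in animal_list
canonical = {a.lower(): a for a in animal_list}
df['animal'] = found.str.lower().map(canonical)

print(df)
```

With these rows, 'Bearxyz' and 'Pirate' are left unmatched because 'Bear' and 'rat' only occur inside longer words, while 'big BEAR' maps to 'Bear'.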
CodePudding user response:
There is a better way than using apply or several nested np.where calls. Have a look at np.select.
As in the other answer, we assume that each row can have only one match. If a row matches several animals, np.select uses the first condition that is true.
Data
Stolen from @Jonathan Leon
import pandas as pd
import numpy as np
data = ['Condor',
        'Marmot',
        'Bear',
        'Condor a',
        'Marmotb',
        'Bearxyz']
df = pd.DataFrame(data, columns=["item_name"])
animal_list = ['Condor',
               'Marmot',
               'Bear',
               'Pika',
               'Rat',
               'Racoon',
               'Opossum']
Define conditions for numpy select
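# One boolean Series per animal, in the same order as animal_list
cond_list = [df["item_name"].str.contains(animal)
             for animal in animal_list]
# default="" labels rows with no match; the implicit default of 0
# does not mix well with string choices
df["animal"] = np.select(cond_list, animal_list, default="")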
output
item_name animal
0 Condor Condor
1 Marmot Marmot
2 Bear Bear
3 Condor a Condor
4 Marmotb Marmot
5 Bearxyz Bear
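To see what happens when the one-match-per-row assumption does not hold, here is a small sketch with made-up rows (they are not from the question). It assumes plain substring matching, hence regex=False, and an empty string as the no-match label.

```python
import numpy as np
import pandas as pd

animal_list = ['Condor', 'Marmot', 'Bear', 'Pika', 'Rat', 'Racoon', 'Opossum']

# Hypothetical rows: two animals in one string, a false-positive substring, no match
df = pd.DataFrame({'item_name': ['Bear and Condor', 'Rattlesnake', 'Zebra']})

# regex=False treats each animal name as a literal substring
cond_list = [df['item_name'].str.contains(animal, regex=False)
             for animal in animal_list]

# np.select takes the first true condition in list order,
# so 'Condor' wins over 'Bear' for the first row
df['animal'] = np.select(cond_list, animal_list, default='')

print(df)
```

The order of animal_list therefore acts as a priority order, and 'Rattlesnake' is labelled 'Rat' because the match is on substrings rather than whole words.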