Finding a Substring in a Dataframe from a Numpy Array?

Time:10-05

How do I check a dataframe column for a list of substrings stored in an array, and use the matching array value to fill a new column? For example, I started off using str.contains and typing out each string value (see below).

import pandas as pd
import numpy as np

#Filepath directory
csv_report = filepath

#Creates dataframe of CSV report
csv_df = pd.read_csv(csv_report)
  
csv_df['animal'] = np.where(csv_df.item_name.str.contains('Condor'), "Condor",
                   np.where(csv_df.item_name.str.contains('Marmot'), "Marmot",
                   np.where(csv_df.item_name.str.contains('Bear'),"Bear",
                   np.where(csv_df.item_name.str.contains('Pika'),"Pika",
                   np.where(csv_df.item_name.str.contains('Rat'),"Rat",
                   np.where(csv_df.item_name.str.contains('Racoon'),"Racoon",
                   np.where(csv_df.item_name.str.contains('Opossum'), "Opossum", "")))))))  # "" when nothing matches

How would I go about achieving the above code if the string values are in an array instead? Sample below:

import pandas as pd
import numpy as np

#Filepath directory
csv_report = filepath

#Creates dataframe of CSV report
csv_df = pd.read_csv(csv_report)

animal_list = np.array(['Condor', 'Marmot','Bear','Pika','Rat','Racoon','Opossum'])

CodePudding user response:

I think there's a cleaner way to write this, but it does what you want. If you need case-insensitive or whole-word matching, you'll have to modify this to suit your needs. Also, you don't need an np.array; a plain list is enough.

import io
import pandas as pd

data = '''item_name
Condor
Marmot
Bear
Condor a
Marmotb
Bearxyz
'''
# Split only on runs of two or more whitespace characters, so 'Condor a' stays in one field
df = pd.read_csv(io.StringIO(data), sep=r'\s\s+', engine='python')
df

animal_list = ['Condor', 'Marmot','Bear','Pika','Rat','Racoon','Opossum']

def find_matches(x):
    for animal in animal_list:
        if animal in x['item_name']:
            return animal

df.apply(lambda x: find_matches(x), axis=1)

0    Condor
1    Marmot
2      Bear
3    Condor
4    Marmot
5      Bear
dtype: object
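
For the case-insensitive or whole-word matching mentioned above, one possible variation is sketched below. It is only an illustration: the helper name find_animal and its ignore_case and whole_word parameters are made up for this example and are not part of the original answer. It uses the standard re module and returns None when nothing matches.

```python
import re

# Same list as in the answer above.
animal_list = ['Condor', 'Marmot', 'Bear', 'Pika', 'Rat', 'Racoon', 'Opossum']

def find_animal(text, animals=animal_list, ignore_case=True, whole_word=True):
    """Return the first entry of `animals` found in `text`, or None."""
    flags = re.IGNORECASE if ignore_case else 0
    for animal in animals:
        # re.escape keeps any special characters in the name literal.
        pattern = re.escape(animal)
        if whole_word:
            # \b only matches at a word boundary, so 'Marmotb' is rejected.
            pattern = r'\b' + pattern + r'\b'
        if re.search(pattern, text, flags):
            return animal
    return None

print(find_animal('condor a'))                   # Condor
print(find_animal('Marmotb'))                    # None
print(find_animal('Bearxyz', whole_word=False))  # Bear
```

With a dataframe like the one above, this could be applied per value with df['item_name'].apply(find_animal). Note that short names such as 'Rat' will match inside longer words when whole_word is False.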

CodePudding user response:

There is a better way than using apply or several nested np.where calls: have a look at np.select. As in the other answer, we assume that each row can match only one animal.

Data

Stolen from @Jonathan Leon

import pandas as pd
import numpy as np
data = ['Condor', 
        'Marmot',
        'Bear',
        'Condor a',
        'Marmotb',
        'Bearxyz']

df = pd.DataFrame(data, columns=["item_name"])

animal_list = ['Condor', 
               'Marmot',
               'Bear',
               'Pika',
               'Rat',
               'Racoon',
               'Opossum']

Define conditions for numpy select

cond_list = [df["item_name"].str.contains(animal) 
             for animal in animal_list]

# default="" is used for rows where no condition matches
df["animal"] = np.select(cond_list, animal_list, default="")

output


  item_name  animal
0    Condor  Condor
1    Marmot  Marmot
2      Bear    Bear
3  Condor a  Condor
4   Marmotb  Marmot
5   Bearxyz    Bear
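
To see what happens when the one-match-per-row assumption does not hold, here is a small sketch with made-up rows (they are not from the question). np.select takes the first condition that is True, so the order of animal_list decides which name wins, and an explicit string default covers rows that match nothing.

```python
import numpy as np
import pandas as pd

animal_list = ['Condor', 'Marmot', 'Bear', 'Pika', 'Rat', 'Racoon', 'Opossum']

# Hypothetical rows: one names two animals, one names none.
df = pd.DataFrame({"item_name": ["Bear and Condor", "Pika", "Otter"]})

# regex=False treats each name as a plain substring, not a pattern.
cond_list = [df["item_name"].str.contains(animal, regex=False)
             for animal in animal_list]

# The first True condition wins; default="" fills rows with no match.
df["animal"] = np.select(cond_list, animal_list, default="")
print(df)
```

Here "Bear and Condor" is labelled Condor because Condor comes before Bear in animal_list, and "Otter" gets an empty string. Reordering the list changes which match is kept.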