Home > Enterprise >  Find a substring in cells across multiple columns in a Pandas dataframe
Find a substring in cells across multiple columns in a Pandas dataframe

Time:02-12

I have a large DataFrame with 50 columns which I'm simplifying here below:

students = [('Samurai', 34, '777.0', 'usa--->jp', 'usd--->yen') ,
            ('Jack', 31, '555.5','usa','usd') ,
            ('Mojo', 16,'488.1','n/a','n/a') ,
            ('Jojo', 32,'119.11','uk--->usa','pound--->usd')]

# Create a DataFrame object
df = pd.DataFrame(students, columns=['Name', 'Age', 'Balance', 'Country','Currency'])

enter image description here

I'm trying to find

a) whether there are any instances of '--->' in any of the cells across the DataFrame?

b) if so where? (optional)

So far I've tried 2 approaches

boolDf = df.isin(['--->']).any().any()

this only works for strings not substrings

columns = list(df)
for col in columns:    
    df[col].str.find('--->', 0).any()

I get:

AttributeError: Can only use .str accessor with string values!

(I believe this may only work for columns with string types)

Would appreciate any help. Open to other approaches as well.

CodePudding user response:

You could do:

df.apply(lambda x: x.str.contains('--->').any(), axis=1)

0     True
1    False
2    False
3     True 

CodePudding user response:

You could apply str.contains for each column to identify which cell contains a particular string:

out = df.astype(str).apply(lambda col: col.str.contains('--->'))

Output:

    Name    Age  Balance  Country  Currency
0  False  False    False     True      True
1  False  False    False    False     False
2  False  False    False    False     False
3  False  False    False     True      True

Then out.any().any() will produce True.

As a one-liner:

df.astype(str).apply(lambda col: col.str.contains('--->')).any().any()

CodePudding user response:

Try this:

boolDf = df['Country'].where(df['Country'].str.contains('--->'))
df_filter=boolDf.isnull()
df[~df_filter]

CodePudding user response:

import pandas as pd

students = [('Samurai', 34, '777.0', 'usa--->jp', 'usd--->yen') ,
            ('Jack', 31, '555.5','usa','usd') ,
            ('Mojo', 16,'488.1','n/a','n/a') ,
            ('Jojo', 32,'119.11','uk--->usa','pound--->usd')]

# Create a DataFrame object
df = pd.DataFrame(students, columns=['Name', 'Age', 'Balance', 'Country','Currency'])

def find_substr(df:pd.DataFrame,substr:str) -> pd.DataFrame:
    """
    df: pd.DataFrame we want to search
    substr: the string that we are looking for in all string columns
    returns df of boolean values
    """
    # create new df to house bool values (if substring is found)
    new_df = pd.DataFrame()
    # loop through the columns
    for col in df.columns:
        # if the series is a string we will search for it
        if pd.api.types.is_string_dtype(df[col]):
            # the series of our new df represent if we found the substring
            new_df[col] = df[col].str.contains(substr)

    return new_df

found_str = find_substr(df,'--->')

found_str
# -- OUTPUT -- 
    Name  Balance  Country  Currency
0  False    False     True      True
1  False    False    False     False
2  False    False    False     False
3  False    False     True      True

This function should work it gives you a new df of boolean values for each cell where you found the substring.

edit: This uses pd.Series.str.contains() to find each string that contains your substring. Read more here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

  • Related