Python pandas cross check subset relations between columns-CodePudding

A	B	C
catastrop	animal	True
ani	count	True
rows	cata	False
Second	rownumbers	False
counter	counters	True
column	inner	False
	time
	strong
	kind
	membrane

A and B contain both string and substring, I want to check if column A has a value 'counter', and that value is either having a substring in column B 'count' or is a substring of column B 'counters'. If any of this satisfied, return True. (Something like combination of .isin and .str.contains). C is the output, which state whether the value of A is substring of B or A contains B. Like catastrop contain cata, so True, and ani is substring of animal so True

The initial code came into my mind is

list1=a1['A'].tolist()
output1=[]
for i in list1:
    output1.append(any(a1['B'].str.contains(i,regex=False)))

and do an or operation with column B

But if I doing it reversely, like

list2=a1['B'].tolist()
output2=[]
for i in list2:
    output2.append(any(a1['A'].str.contains(i,regex=False)))

list 2 would contain verify result about column B, rather than column A.

How could I write this code?

CodePudding user response：

One way to achieve this is as follows:

For a_contains_b, use Series.str.contains with a string consisting of a list of values from column B, joined with | delimiter to create alternatives (so: 'animal|count|cata|rownumbers|counters|inner|time|strong|kind|membrane').
For b_contains_a, we want to use Series.str.extractall instead, since we need to feed the result back to Series.isin to know which rows from column A have been matched. I.e. the result of extractall ( [0].tolist()) will be ['ani', 'counter'], which we use as input for df.A.isin.
Finally, we use both boolean series as input for the new column. Use |, since we need True if either series contains True. Wrap the result in brackets and apply Series.where combined with Series.notna to overwrite all False values in rows that have NaN values in column A.

import pandas as pd
import numpy as np

data = {'A': {0: 'catastrop', 1: 'ani', 2: 'rows', 3: 'Second', 4: 'counter', 
              5: 'column', 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan}, 
        'B': {0: 'animal', 1: 'count', 2: 'cata', 3: 'rownumbers', 
              4: 'counters', 5: 'inner', 6: 'time', 7: 'strong', 8: 'kind', 
              9: 'membrane'}, 
        'C': {0: True, 1: True, 2: False, 3: False, 4: True, 5: False, 
              6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan}}

df = pd.DataFrame(data)

a_contains_b = df.A.str.contains('|'.join(df.B.dropna().tolist()))

b_contains_a = df.A.isin(df.B.str.extractall(
    '(' '|'.join(df.A.dropna().tolist())   ')')[0].tolist())

df['D'] = (b_contains_a | a_contains_b).where(df.A.notna())

print(df)

           A           B      C      D
0  catastrop      animal   True   True
1        ani       count   True   True
2       rows        cata  False  False
3     Second  rownumbers  False  False
4    counter    counters   True   True
5     column       inner  False  False
6        NaN        time    NaN    NaN
7        NaN      strong    NaN    NaN
8        NaN        kind    NaN    NaN
9        NaN    membrane    NaN    NaN

N.B. If you want the matching to be case-insensitive, consider using Series.str.lower. In the case of str.contains you can also make use of the parameter case, and set it to False.