A | B | C |
---|---|---|
catastrop | animal | True |
ani | count | True |
rows | cata | False |
Second | rownumbers | False |
counter | counters | True |
column | inner | False |
time | ||
strong | ||
kind | ||
membrane |
A and B contain both string and substring, I want to check if column A has a value 'counter', and that value is either having a substring in column B 'count' or is a substring of column B 'counters'. If any of this satisfied, return True. (Something like combination of .isin and .str.contains). C is the output, which state whether the value of A is substring of B or A contains B. Like catastrop contain cata, so True, and ani is substring of animal so True
The initial code came into my mind is
list1=a1['A'].tolist()
output1=[]
for i in list1:
output1.append(any(a1['B'].str.contains(i,regex=False)))
and do an or operation with column B
But if I doing it reversely, like
list2=a1['B'].tolist()
output2=[]
for i in list2:
output2.append(any(a1['A'].str.contains(i,regex=False)))
list 2 would contain verify result about column B, rather than column A.
How could I write this code?
CodePudding user response:
One way to achieve this is as follows:
- For
a_contains_b
, useSeries.str.contains
with a string consisting of a list of values from columnB
,joined
with|
delimiter to create alternatives (so:'animal|count|cata|rownumbers|counters|inner|time|strong|kind|membrane'
). - For
b_contains_a
, we want to useSeries.str.extractall
instead, since we need to feed the result back toSeries.isin
to know which rows from columnA
have been matched. I.e. the result ofextractall
([0].tolist()
) will be['ani', 'counter']
, which we use as input fordf.A.isin
. - Finally, we use both boolean series as input for the new column. Use
|
, since we needTrue
if either series containsTrue
. Wrap the result in brackets and applySeries.where
combined withSeries.notna
to overwrite allFalse
values in rows that haveNaN
values in columnA
.
import pandas as pd
import numpy as np
data = {'A': {0: 'catastrop', 1: 'ani', 2: 'rows', 3: 'Second', 4: 'counter',
5: 'column', 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan},
'B': {0: 'animal', 1: 'count', 2: 'cata', 3: 'rownumbers',
4: 'counters', 5: 'inner', 6: 'time', 7: 'strong', 8: 'kind',
9: 'membrane'},
'C': {0: True, 1: True, 2: False, 3: False, 4: True, 5: False,
6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan}}
df = pd.DataFrame(data)
a_contains_b = df.A.str.contains('|'.join(df.B.dropna().tolist()))
b_contains_a = df.A.isin(df.B.str.extractall(
'(' '|'.join(df.A.dropna().tolist()) ')')[0].tolist())
df['D'] = (b_contains_a | a_contains_b).where(df.A.notna())
print(df)
A B C D
0 catastrop animal True True
1 ani count True True
2 rows cata False False
3 Second rownumbers False False
4 counter counters True True
5 column inner False False
6 NaN time NaN NaN
7 NaN strong NaN NaN
8 NaN kind NaN NaN
9 NaN membrane NaN NaN
N.B. If you want the matching to be case-insensitive, consider using Series.str.lower
. In the case of str.contains
you can also make use of the parameter case
, and set it to False
.