import re
import pandas as pd
df = pd.DataFrame({"Name": ['match', '1234match_sypid_1234_34_7', 'matchsypid_1234_56_7', 'Hellow', 'hello', 'oaoaooo', 'ciao', 'salut','sypid_09_2_3match']})
print(df.shape)
# => (9, 1)
mask = [re.search(p,s) for p,s in zip(r"match", df['Name'])]
print(len(mask))
# => 5
print(mask)
# => [<re.Match object; span=(0, 1), match='m'>, <re.Match object; span=(5, 6), match='a'>, <re.Match object; span=(2, 3), match='t'>, None, <re.Match object; span=(0, 1), match='h'>]
mask = [x is not None for x in mask]
print(mask)
# => [True, True, True, False, True]
Nothing changes if I pass a list instead of a DataFrame column. I would expect 9 results. In addition, the fifth and last result reports a match for "match" against the fifth string, "hello".
CodePudding user response:
In
mask = [re.search(p,s) for p,s in zip(r"match", df['Name'])]
zip iterates over the characters of "match", i.e. the string is treated as a sequence of characters:
for a, b in zip(r"match", df['Name']):
    print(a, b)
m match
a 1234match_sypid_1234_34_7
t matchsypid_1234_56_7
c Hellow
h hello
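The iteration stops as soon as the shortest sequence is exhausted, here after the last character of "match" (if you ever need different behavior, there is itertools.zip_longest). Since "match" has 5 characters, your output has 5 elements. This also explains the unexpected hit on "hello": the pattern used for that string is the single character "h", not "match".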
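If you want to keep the list comprehension, a minimal sketch of a fix is to hold the pattern fixed and iterate over the names only. The plain list `names` and the variables `truncated`, `padded` and `pattern` are illustrative names, not part of the original code; the first two only demonstrate the difference between zip and zip_longest.

```python
import re
from itertools import zip_longest

names = ['match', '1234match_sypid_1234_34_7', 'matchsypid_1234_56_7',
         'Hellow', 'hello', 'oaoaooo', 'ciao', 'salut', 'sypid_09_2_3match']

# zip stops at the shorter input: "match" has 5 characters, so only 5 pairs.
truncated = list(zip("match", names))
print(len(truncated))  # => 5

# zip_longest pads the shorter input with None instead of stopping early.
padded = list(zip_longest("match", names))
print(len(padded))     # => 9

# Keep the pattern fixed and loop over the names only.
pattern = re.compile(r"match")
mask = [pattern.search(name) is not None for name in names]
print(mask)
# => [True, True, True, False, False, False, False, False, True]
```

Note that zip_longest would not fix the original code either: it only makes the truncation visible, since the padded pattern values are None.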
A simpler way to obtain the mask could be
mask = df["Name"].str.contains("match")
df[mask]
Name
0 match
1 1234match_sypid_1234_34_7
2 matchsypid_1234_56_7
8 sypid_09_2_3match
df[~mask]
Name
3 Hellow
4 hello
5 oaoaooo
6 ciao
7 salut
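One caveat, sketched here under the assumption that the column may contain missing values: str.contains interprets the pattern as a regular expression by default and returns a missing value for missing entries, which cannot be used directly as a boolean mask. Passing regex=False treats the pattern as a literal substring, and na=False maps missing entries to False. The Series `names` below is a made-up example, not the data from the question.

```python
import pandas as pd

# Hypothetical data: the last entry is missing.
names = pd.Series(["match", "sypid_09_2_3match", "hello", None])

# regex=False: plain substring test; na=False: missing values count as no match.
mask = names.str.contains("match", regex=False, na=False)
print(mask.tolist())
# => [True, True, False, False]
```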