Re Search with regex (plain string match) applied to Pandas Dataframe column returns less results th

import re
import pandas as pd 

df = pd.DataFrame({"Name": ['match', '1234match_sypid_1234_34_7', 'matchsypid_1234_56_7', 'Hellow', 'hello', 'oaoaooo', 'ciao', 'salut','sypid_09_2_3match']})

print(df.shape)

# => (9, 1)

mask = [re.search(p,s) for p,s in zip(r"match", df['Name'])]
print(len(mask))

# => 5

print(mask)

# => [<re.Match object; span=(0, 1), match='m'>, <re.Match object; span=(5, 6), match='a'>, <re.Match object; span=(2, 3), match='t'>, None, <re.Match object; span=(0, 1), match='h'>]

mask = [x is not None for x in mask]
print(mask)

# => [True, True, True, False, True]

Nothing changes if I pass a list instead of a df column. I would expect 9 results, not 5. On top of that, the fifth and last result reports a match between "match" and the fifth string, "hello".

CodePudding user response:

mask = [re.search(p,s) for p,s in zip(r"match", df['Name'])]

zip iterates over the characters of "match", i.e. the string is treated as a sequence, and each character is paired with one row of the column:

for a, b in zip(r"match", df['Name']):
    print(a, b)
    
m match
a 1234match_sypid_1234_34_7
t matchsypid_1234_56_7
c Hellow
h hello

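The iteration stops as soon as the shortest sequence is exhausted, here after the last character of "match" (in case you ever need something else: there is itertools.zip_longest). Since "match" has 5 characters, your output has 5 elements. This also explains the apparent match on "hello": the pattern used for that row is the single character "h", not the whole word "match".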
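If you want to stay with re.search, a minimal sketch of the fix is to apply the whole pattern to every string instead of zipping over its characters. The list `names` below is a stand-in for df['Name'] (the question notes that a plain list behaves the same way), and the variable names are only illustrative:

```python
import re

# Stand-in for df['Name']; a plain list shows the same behaviour.
names = ['match', '1234match_sypid_1234_34_7', 'matchsypid_1234_56_7',
         'Hellow', 'hello', 'oaoaooo', 'ciao', 'salut', 'sypid_09_2_3match']

# zip pairs ONE character of "match" with one name and stops after 5 pairs.
pairs = list(zip("match", names))
print(len(pairs))   # 5
print(pairs[-1])    # ('h', 'hello')

# Fix: search for the complete pattern in every name.
pattern = re.compile("match")
mask = [pattern.search(s) is not None for s in names]
print(len(mask))    # 9
print(mask)
# [True, True, True, False, False, False, False, False, True]
```

The resulting list has one boolean per input string, so it can be used directly as a row mask of the same length as the column.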

A simpler way to obtain the mask is to let pandas test every row of the column:

mask = df["Name"].str.contains("match")

df[mask]
                        Name
0                      match
1  1234match_sypid_1234_34_7
2       matchsypid_1234_56_7
8          sypid_09_2_3match

df[~mask]
      Name
3   Hellow
4    hello
5  oaoaooo
6     ciao
7    salut
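One caveat, since the title mentions a plain string match: str.contains interprets its argument as a regular expression by default. For "match" that makes no difference, but a pattern containing metacharacters such as "." would behave differently. The sketch below uses a small hypothetical column (not the data from the question) to show the regex and na parameters of str.contains:

```python
import pandas as pd

# Hypothetical data, chosen only to make the difference visible.
demo = pd.DataFrame({"Name": ["match", "a.b", "axb", None]})

# Default regex=True: "." matches any single character.
regex_mask = demo["Name"].str.contains("a.b", na=False)

# regex=False: "a.b" is treated as a literal substring.
literal_mask = demo["Name"].str.contains("a.b", regex=False, na=False)

print(regex_mask.tolist())    # [False, True, True, False]
print(literal_mask.tolist())  # [False, True, False, False]
```

Passing na=False makes missing values count as non-matches, so the mask stays purely boolean and is safe to use for indexing.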