Re Search with regex (plain string match) applied to Pandas Dataframe column returns less results th

Time:09-01

import re
import pandas as pd 

df = pd.DataFrame({"Name": ['match', '1234match_sypid_1234_34_7', 'matchsypid_1234_56_7', 'Hellow', 'hello', 'oaoaooo', 'ciao', 'salut','sypid_09_2_3match']})

print(df.shape)

# => (9, 1)

mask = [re.search(p,s) for p,s in zip(r"match", df['Name'])]
print(len(mask))

# => 5

print(mask)

# => [<re.Match object; span=(0, 1), match='m'>, <re.Match object; span=(5, 6), match='a'>, <re.Match object; span=(2, 3), match='t'>, None, <re.Match object; span=(0, 1), match='h'>]

mask = [x is not None for x in mask]
print(mask)

# => [True, True, True, False, True]

Nothing changes if I pass a list instead of a df column. I would expect 9 results, not 5. On top of that, the fifth and last result reports a match for the fifth string, "hello", which does not contain "match".

Answer:

In

mask = [re.search(p,s) for p,s in zip(r"match", df['Name'])]

zip iterates over the individual characters of "match", because a string is itself a sequence. Each character is paired with one row of the column and then used as its own one-character pattern:

for a, b in zip(r"match", df['Name']):
    print(a, b)
    
m match
a 1234match_sypid_1234_34_7
t matchsypid_1234_56_7
c Hellow
h hello

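The iteration stops as soon as the shortest sequence is exhausted, i.e. here after the last character of "match" (in case you ever need the opposite behaviour, there is itertools.zip_longest). Since "match" has 5 characters, your output has 5 elements. This also explains the apparent match on "hello": the pattern tested against it was the single character "h", not the whole word "match".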
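
If you want to keep the re.search approach from the question, a minimal sketch is to hold the pattern fixed and loop over the strings only. The names `pattern` and `names` are illustrative, and a plain list stands in for the DataFrame column (iterating over df['Name'] works the same way):

```python
import re

names = ['match', '1234match_sypid_1234_34_7', 'matchsypid_1234_56_7',
         'Hellow', 'hello', 'oaoaooo', 'ciao', 'salut', 'sypid_09_2_3match']

# Compile the pattern once and test it against every string,
# instead of zipping the pattern's characters with the strings.
pattern = re.compile(r"match")
mask = [pattern.search(s) is not None for s in names]

print(len(mask))
# => 9

print(mask)
# => [True, True, True, False, False, False, False, False, True]
```

This gives one boolean per row, so the list can be used directly to index the DataFrame.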

A simpler way to obtain the mask is the vectorized string method str.contains, which returns one boolean per row:

mask = df["Name"].str.contains("match")

df[mask]
                        Name
0                      match
1  1234match_sypid_1234_34_7
2       matchsypid_1234_56_7
8          sypid_09_2_3match

df[~mask]
      Name
3   Hellow
4    hello
5  oaoaooo
6     ciao
7    salut
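
Since the question is about a plain string match, keep in mind that str.contains interprets its argument as a regular expression by default. For "match" this makes no difference, but it does once the search term contains regex metacharacters. A minimal sketch, with made-up example strings and a made-up search term "a.b" that are not part of the original data:

```python
import pandas as pd

s = pd.Series(["a.b", "axb", "hello"])

# Default (regex=True): "." is a wildcard, so "axb" matches as well.
regex_mask = s.str.contains("a.b")

# regex=False performs a literal substring test.
literal_mask = s.str.contains("a.b", regex=False)

print(regex_mask.tolist())
# => [True, True, False]

print(literal_mask.tolist())
# => [True, False, False]
```

If the regex behaviour is needed elsewhere in the pattern, escaping the literal part with re.escape is an alternative to regex=False.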