Home > Software design >  Match standalone words in string in DataFrame column
Match standalone words in string in DataFrame column

Time:06-17

I have a dataframe such as:

import pandas as pd
import re

df = pd.DataFrame({"Name": ["D1", "D2", "D3", "D4", "M1", "M2", "M3"], 
                  "Requirements": ["3 meters|2/3 meters|3.5 meters",
                                   "3 meters",
                                   "3/5 meters|3 meters",
                                   "2/3 meters",
                                   "steel|g1_steel",
                                   "steel",
                                   "g1_steel"]})

dataframe df

  Name                    Requirements
0   D1  3 meters|2/3 meters|3.5 meters
1   D2                        3 meters
2   D3             3/5 meters|3 meters
3   D4                      2/3 meters
4   M1                  steel|g1_steel
5   M2                           steel
6   M3                        g1_steel

I have a list of words req_list = ['3 meters', 'steel'] and I am trying to extract rows from df where the strings in column Requirements contain standalone words that are from req_list. This is what I have done:

This one prints just D2 and M2

df[df.Requirements.apply(lambda x: any(len(x.replace(y, '')) == 0 for y in req_list))]

This one prints all rows

df[df['Requirements'].str.contains(fr"\b(?:{'|'.join(req_list)})\b")]

My desired result is as follows:

  Name                    Requirements
0   D1  3 meters|2/3 meters|3.5 meters
1   D2                        3 meters
2   D3             3/5 meters|3 meters
4   M1                  steel|g1_steel
5   M2                           steel

In this desired output, D4 and M3 are eliminated because they do not have words from req_list as standalone strings. Is there any way to achieve this preferably in an one-liner without using custom functions?

EDIT

The strings in the column Requirements can come in any pattern such as:

    Name                    Requirements
0   D1    3 meters|2/3 meters|3.5 meters
1   D2                          3 meters
2   D3               3/5 meters|3 meters
3   D4                        2/3 meters
4   D5                       3::3 meters # New pattern which needs to be eliminated
5   D6                        3.3 meters # New pattern which needs to be eliminated
6   D7                        3?3 meters # New pattern which needs to be eliminated
7   M1                    steel|g1_steel
8   M2                             steel
9   M3                          g1_steel

CodePudding user response:

Since you want to make sure you do not match 3 meters that is preceded with a digit /, you may add a (?<!\d/) negative lookbehind after the intial word boundary:

df[df['Requirements'].str.contains(fr"\b(?<!\d/)(?:{'|'.join(req_list)})\b")]

Output:

  Name                    Requirements
0   D1  3 meters|2/3 meters|3.5 meters
1   D2                        3 meters
2   D3             3/5 meters|3 meters
4   M1                  steel|g1_steel
5   M2                           steel

See the regex demo.

Notes

  • Since req_list contains phrases (mutiword strings) you might have to sort the items by length in the descending order before joining with the | OR operator, so you'd better use fr"\b(?<!\d/)(?:{'|'.join(sorted(req_list, key=len, reverse=True))})\b" as regex
  • If the req_list ever contains items with special chars you should also use adaptive dynamic word boundaries, i.e. fr"(?!\B\w)(?<!\d/)(?:{'|'.join(sorted(map(re.escape, req_list), key=len, reverse=True))})(?<!\w\B)".

CodePudding user response:

Here is one more way to do it

def chk(row):
    for r in row:
        if r.strip() in req_list:
            return True
    return False

df[df.assign(lst = df['Requirements'].str.split('|'))['lst'].apply(chk) == True]

    Name    Requirements
0   D1      3 meters|2/3 meters|3.5 meters
1   D2      3 meters
2   D3      3/5 meters|3 meters
4   M1      steel|g1_steel
5   M2      steel
  • Related