Index a pandas dataframe based on row string value conditionally containing some row specific regex


I would like to conditionally select rows in a pandas dataframe if a string value matches a regex built from some other string values. The patterns to check against change per row and are currently stored in a series, in the formats shown below:

import pandas as pd

df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["Value"])

df:
| Index | Value |
|-------|-------|
| 0     | "a"   |
| 1     | "a"   |
| 2     | "b"   |
| 3     | "c"   |
| 4     | "de"  |
| 5     | "de"  |

series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])

Series with the regex to check per row:
| Index | Value   |
|-------|---------|
| 0     | "a|b|c" |
| 1     | "a"     |
| 2     | "d|e"   |
| 3     | "c"     |
| 4     | "c|a"   |
| 5     | "f|e"   |

The expected output would be a mask that I can use to index the dataframe down to only the rows whose value matches their regex:

mask = [True, True, False, True, False, True] 

df[mask]: 
| Index | Value |
|-------|-------|
| 0     | "a"   |
| 1     | "a"   |
| 3     | "c"   |
| 5     | "de"  |

I would like to avoid lambdas and apply as much as possible, since I am processing a big dataset and need execution to be as performant as possible.

Thanks a lot,

CodePudding user response:

If the regexes are all different, as in the example, there is no way around it: each pattern has to be applied to its own Value, which means a minimum complexity of O(n).
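
If some patterns repeat across many rows, however, the work can be grouped: run the vectorized str.contains once per unique pattern instead of once per row. A rough sketch of that idea, assuming the df and series defined in the question:

import pandas as pd

mask = pd.Series(False, index=df.index)
# run each unique pattern once, against the subset of rows that use it
for pattern, idx in series.groupby(series).groups.items():
    mask.loc[idx] = df.loc[idx, "Value"].str.contains(pattern, regex=True)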

Perhaps the fastest solution, even though it uses apply, is:

import re

df['re'] = series
# re.search checks whether the row's Value matches the row's pattern
mask = df.apply(lambda x: bool(re.search(x['re'], x['Value'])), axis=1)

output: 
0     True
1     True
2    False
3     True
4    False
5     True
dtype: bool
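
If you want to avoid apply entirely, as the question asks, a plain list comprehension over the two columns does the same per-row work with less overhead. A minimal sketch, assuming the same df and series:

import re

mask = pd.Series(
    [bool(re.search(pat, val)) for pat, val in zip(series, df['Value'])],
    index=df.index,
)
df[mask]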

CodePudding user response:

    import pandas as pd


    def common_char(string1, string2):
        # The patterns here are plain alternations of single characters,
        # so "value matches the row's regex" reduces to "the two strings
        # share at least one character".
        for ch in string1:
            if ch in string2:
                return True
        return False


    df = pd.DataFrame(["a", "a", "b", "c", "de", "de"])
    series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])

    true_false_list = []
    matched_rows = []

    for i in range(df.shape[0]):
        if common_char(df[0][i], series[i]):
            true_false_list.append(True)
            matched_rows.append({"index": i, "value": df[0][i]})
        else:
            true_false_list.append(False)

    # Build the result frame in one go instead of appending row by row
    # (DataFrame.append was removed in pandas 2.0); this also keeps the
    # "index" column as integers without a dummy-row workaround.
    resulted_data = pd.DataFrame(matched_rows, columns=["index", "value"])

output:

   index value
0      0     a
1      1     a
2      3     c
3      5    de
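
The true_false_list collected in the loop is also the boolean mask the question asks for, so the original frame can be indexed with it directly:

df[true_false_list]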