I would like to conditionally select rows in a pandas dataframe if a string value contains some other string values, defined as a regex. The string values to check for change per row, and right now are stored in a series, with the formats displayed below:
df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["Value"])
df:
| Index | Value |
| 0 | "a" |
| 1 | "a" |
| 2 | "b" |
| 3 | "c" |
| 4 | "de" |
| 5 | "de" |
series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])
Series with contains regex per row:
| Index | Value |
| 0 | "a|b|c" |
| 1 | "a" |
| 2 | "d|e" |
| 3 | "c" |
| 4 | "c|a" |
| 5 | "f|e" |
The expected output I want would be a mask that I can use to index the dataframe only to the rows that match the regex:
mask = [True, True, False, True, False, True]
df[mask]:
| Index | Value |
| 0 | "a" |
| 1 | "a" |
| 3 | "c" |
| 5 | "de" |
I would like to avoid lambdas and apply as much as possible, since I am processing a big dataset and I need execution to be as performant as possible
Thanks a lot,
CodePudding user response:
If the regex are all different, like in the example, no way, each RE must be applied to all Value, this means a minimum complexity of O(n).
Maybe the fastest solution is, (using apply):
df['re'] = series
df[['Value','re']].T.apply(lambda x: bool(re.match(x.Value,x.re)))
output:
0 True
1 True
2 False
3 True
4 False
5 False
dtype: bool
CodePudding user response:
def common_char(string1 , string2):
string1_list = list(string1)
for i in string1_list:
if i in string2:
return True
return False
df = pd.DataFrame(["a", "a", "b", "c", "de", "de"])
series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])
true_false_list = []
resulted_data = pd.DataFrame({"index":[0],
"value":["0"]})
#I added Initial values just in order to make the data type of ( index ) Column integer,
#so I'm removing it here.
resulted_data = resulted_data.drop(0)
for i in range(df.shape[0]):
if common_char(df[0][i] , series[i]):
true_false_list.append(True)
resulted_data = resulted_data.append(pd.DataFrame({ "index":[i],
"value":[df[0][i]]}))
else:
true_false_list.append(False)
output:
index value
0 0 a
0 1 a
0 3 c
0 5 de