I have the following dataframe:
Path Latency Noise SNR
1 A->B 0.001769 3.535534e-07 34.515450
2 A->C->D->B 0.006240 1.247207e-06 29.040613
3 A->C->D->E->F->B 0.011315 2.261351e-06 26.456319
4 A->C->D->F->B 0.008269 1.652609e-06 27.818298
5 A->C->E->D->B 0.008111 1.620994e-06 27.902185
.. ... ... ... ...
346 F->D->A->C->E 0.008002 1.599196e-06 27.960983
347 F->D->B->A->C->E 0.009928 1.984271e-06 27.023989
348 F->D->C->E 0.005527 1.104621e-06 29.567867
349 F->D->E 0.003859 7.713011e-07 31.127760
350 F->E 0.003094 6.184658e-07 32.086843
I have a method that takes an input string and returns a bool (this is a simplified version, not the actual method):
def foo(str: str) -> bool:
    if str[0] == 'A': return False
    return True
I want to filter the dataframe, keeping only the rows whose Path returns True when passed to foo(). I cannot modify foo(). How can I do it?
Expected Output
Path Latency Noise SNR
.. ... ... ... ...
346 F->D->A->C->E 0.008002 1.599196e-06 27.960983
347 F->D->B->A->C->E 0.009928 1.984271e-06 27.023989
348 F->D->C->E 0.005527 1.104621e-06 29.567867
349 F->D->E 0.003859 7.713011e-07 31.127760
350 F->E 0.003094 6.184658e-07 32.086843
CodePudding user response:
TLDR
mask = df.apply(lambda row: foo(row['Path']), axis=1)
res: pd.DataFrame = df[mask]
Solution
To filter the rows of a DataFrame according to the return value of foo(str: str) -> bool applied to the value in the Path column of each row, generate a boolean mask with pandas.DataFrame.apply().
How does a mask work?
A mask works as follows: given a DataFrame df and a boolean pd.Series mask that shares its index, indexing with square brackets, df[mask], returns a new DataFrame containing only the rows where the mask is True.
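To make the masking step concrete, here is a minimal sketch with a hand-written mask. The three rows are copied from the question's sample; the mask values are chosen by hand purely for illustration.

```python
import pandas as pd

# Small illustrative frame; values copied from the question's sample.
df = pd.DataFrame({
    "Path": ["A->B", "F->D->E", "F->E"],
    "Latency": [0.001769, 0.003859, 0.003094],
})

# A hand-written boolean mask, aligned with df's index (0, 1, 2).
mask = pd.Series([False, True, True], index=df.index)

# Square-bracket indexing keeps only the rows where the mask is True.
kept = df[mask]
print(kept)
```

The original index labels (1 and 2 here) are preserved in the result, which is why the expected output in the question still starts at 346.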
How to get the mask
Since df.apply(function, axis, ...) takes a function as input, it is tempting to pass foo directly as the argument of apply(), but that is wrong here.
With axis=1, the function passed to apply() receives each row as a pd.Series, not a string. The correct way to get the mask is therefore the following, where axis=1 means the lambda is applied to every row of the DataFrame rather than to every column:
mask = df.apply(lambda row: foo(row['Path']), axis=1)
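The following end-to-end sketch may help show why the lambda is needed. It assumes a stand-in foo() that mirrors the simplified one from the question (with the parameter renamed to path), and a four-row frame copied from the sample data.

```python
import pandas as pd

def foo(path: str) -> bool:
    # Stand-in for the question's simplified foo(); the real one is unknown.
    if path[0] == 'A':
        return False
    return True

df = pd.DataFrame({
    "Path": ["A->B", "A->C->D->B", "F->D->E", "F->E"],
    "Latency": [0.001769, 0.006240, 0.003859, 0.003094],
})

# With axis=1 the callable is handed each row as a pd.Series, not a string.
row_types = df.apply(lambda row: type(row).__name__, axis=1)

# So the lambda extracts the Path string before calling foo().
mask = df.apply(lambda row: foo(row["Path"]), axis=1)
res = df[mask]
print(res)
```

Here row_types contains 'Series' for every row, which is the reason foo() cannot be passed to df.apply(..., axis=1) unchanged.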
CodePudding user response:
mask = df.apply(filter_fn, axis=1)
df = df[mask]
should work fine. With axis=1, filter_fn receives each row as a pd.Series and should return a bool.
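As a sketch of what such a filter_fn could look like when foo() itself cannot be changed: the adapter below is hypothetical (the answer does not define it), and foo() is again a stand-in for the question's simplified version.

```python
import pandas as pd

def foo(path: str) -> bool:
    # Stand-in for the question's simplified foo().
    return path[0] != 'A'

def filter_fn(row: pd.Series) -> bool:
    # Hypothetical adapter: receives a whole row, forwards only the Path string.
    return foo(row["Path"])

df = pd.DataFrame({
    "Path": ["A->B", "F->D->C->E"],
    "SNR": [34.515450, 29.567867],
})

mask = df.apply(filter_fn, axis=1)
df = df[mask]
print(df)
```

A named adapter does the same job as a lambda; it is mainly easier to document and reuse.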
CodePudding user response:
Use str.startswith() and keep the rows whose Path does NOT start with 'A'.
df
###
Path Latency Noise SNR
0 A->B 0.001769 3.535534e-07 34.515450
1 A->C->D->B 0.006240 1.247207e-06 29.040613
2 A->C->D->E->F->B 0.011315 2.261351e-06 26.456319
3 A->C->D->F->B 0.008269 1.652609e-06 27.818298
4 A->C->E->D->B 0.008111 1.620994e-06 27.902185
5 F->D->A->C->E 0.008002 1.599196e-06 27.960983
6 F->D->B->A->C->E 0.009928 1.984271e-06 27.023989
7 F->D->C->E 0.005527 1.104621e-06 29.567867
8 F->D->E 0.003859 7.713011e-07 31.127760
9 F->E 0.003094 6.184658e-07 32.086843
df[~df['Path'].str.startswith('A')]
###
Path Latency Noise SNR
5 F->D->A->C->E 0.008002 1.599196e-06 27.960983
6 F->D->B->A->C->E 0.009928 1.984271e-06 27.023989
7 F->D->C->E 0.005527 1.104621e-06 29.567867
8 F->D->E 0.003859 7.713011e-07 31.127760
9 F->E 0.003094 6.184658e-07 32.086843
As a function:
def foo(path: str) -> bool:
    return not path.startswith('A')
df[df['Path'].apply(foo)]
###
Path Latency Noise SNR
5 F->D->A->C->E 0.008002 1.599196e-06 27.960983
6 F->D->B->A->C->E 0.009928 1.984271e-06 27.023989
7 F->D->C->E 0.005527 1.104621e-06 29.567867
8 F->D->E 0.003859 7.713011e-07 31.127760
9 F->E 0.003094 6.184658e-07 32.086843
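Since the question says foo() cannot be modified, it may be worth checking that the column-wise apply also works with foo() treated as a black box. The sketch below assumes a stand-in that mirrors the question's simplified foo(), and compares it with the vectorized mask on three rows from the sample.

```python
import pandas as pd

def foo(path: str) -> bool:
    # Stand-in for the question's foo(), treated here as a black box.
    if path[0] == 'A':
        return False
    return True

df = pd.DataFrame({
    "Path": ["A->B", "F->D->A->C->E", "F->E"],
    "SNR": [34.515450, 27.960983, 32.086843],
})

# Series.apply passes each Path string straight to foo(), so no lambda is needed.
mask_apply = df["Path"].apply(foo)

# The vectorized form is only available because this stand-in is a prefix check.
mask_vectorized = ~df["Path"].str.startswith("A")

masks_agree = bool((mask_apply == mask_vectorized).all())
filtered = df[mask_apply]
print(filtered)
```

If the real foo() does something that has no vectorized string method, df['Path'].apply(foo) is still usable as is.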
Reference: pandas.Series.str.startswith