Filter dataframe rows based on return value of foo() applied to first column

Time:08-12

I have the following dataframe

                 Path   Latency         Noise        SNR
1                A->B  0.001769  3.535534e-07  34.515450
2          A->C->D->B  0.006240  1.247207e-06  29.040613
3    A->C->D->E->F->B  0.011315  2.261351e-06  26.456319
4       A->C->D->F->B  0.008269  1.652609e-06  27.818298
5       A->C->E->D->B  0.008111  1.620994e-06  27.902185
..                ...       ...           ...        ...
346     F->D->A->C->E  0.008002  1.599196e-06  27.960983
347  F->D->B->A->C->E  0.009928  1.984271e-06  27.023989
348        F->D->C->E  0.005527  1.104621e-06  29.567867
349           F->D->E  0.003859  7.713011e-07  31.127760
350              F->E  0.003094  6.184658e-07  32.086843

I have a method that takes a string and returns a bool (this is a simplified version, not the actual method):

def foo(str: str) -> bool:
    if str[0] == 'A': return False
    return True

I want to filter the dataframe, keeping only the rows whose Path value makes foo() return True. I cannot modify foo(). How can I do it?

Expected Output

                 Path   Latency         Noise        SNR
..                ...       ...           ...        ...
346     F->D->A->C->E  0.008002  1.599196e-06  27.960983
347  F->D->B->A->C->E  0.009928  1.984271e-06  27.023989
348        F->D->C->E  0.005527  1.104621e-06  29.567867
349           F->D->E  0.003859  7.713011e-07  31.127760
350              F->E  0.003094  6.184658e-07  32.086843

CodePudding user response:

TLDR

mask = df.apply(lambda row: foo(row['Path']), axis=1)
res: pd.DataFrame = df[mask]

Solution

To filter the rows of a DataFrame by the return value of foo(str: str) -> bool applied to each row's Path value, build a boolean mask with pandas.DataFrame.apply() and index the DataFrame with it.

How does a mask work?

The mask works as follows: given a DataFrame df and a boolean pd.Series mask with the same index, the square-bracket access df[mask] returns a new DataFrame containing only the rows where the mask is True.
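As a minimal sketch of boolean-mask indexing, the snippet below builds a three-row frame from values in the question and a hand-written mask. The names kept and the hard-coded mask values are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Small illustrative frame; the values are copied from the question's table.
df = pd.DataFrame({
    "Path": ["A->B", "F->D->E", "F->E"],
    "Latency": [0.001769, 0.003859, 0.003094],
})

# A boolean Series aligned with df's index: True marks the rows to keep.
mask = pd.Series([False, True, True], index=df.index)

# Indexing with the mask returns a new DataFrame with only the True rows.
kept = df[mask]
print(kept["Path"].tolist())
```

The rows that survive keep their original index labels, here 1 and 2.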

How to get the mask

Since df.apply(function, axis, ...) takes a function as input, it is tempting to pass foo() directly, but this is wrong. With axis=1, apply() calls the function once per row and passes that row as a pd.Series, not as a string. The correct way to get the mask is therefore to wrap foo() in a lambda that extracts the Path value from each row. Here axis=1 means the lambda is applied to every row of the DataFrame rather than to every column.

mask = df.apply(lambda row: foo(row['Path']), axis=1)
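For reference, here is a self-contained sketch of the row-wise approach. The foo() below is a stand-in for the simplified function in the question (with the parameter renamed to path), and the four-row frame is an assumed subset of the question's data.

```python
import pandas as pd

def foo(path: str) -> bool:
    # Stand-in for the question's simplified foo(): False when the path starts with 'A'.
    if path[0] == "A":
        return False
    return True

# Assumed subset of the question's data.
df = pd.DataFrame({
    "Path": ["A->B", "A->C->D->B", "F->D->E", "F->E"],
    "Latency": [0.001769, 0.006240, 0.003859, 0.003094],
})

# axis=1: the lambda receives each row as a pd.Series and hands only the Path string to foo().
mask = df.apply(lambda row: foo(row["Path"]), axis=1)
res = df[mask]
print(res)
```

Because the lambda only reads row['Path'], foo() itself stays unmodified, as the question requires.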

CodePudding user response:

mask = df.apply(filter_fn, axis=1)
df = df[mask]

should work fine. filter_fn must accept a row (a pd.Series) and return a bool.
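One possible shape for filter_fn is sketched below, assuming it simply delegates to the question's foo(). Both the wrapper and the stand-in foo() are hypothetical, and the final reset_index call is an optional extra that the answer does not mention.

```python
import pandas as pd

def foo(path: str) -> bool:
    # Stand-in for the question's simplified foo().
    return path[0] != "A"

def filter_fn(row: pd.Series) -> bool:
    # Hypothetical wrapper: pull the Path value out of the row and delegate to foo().
    return foo(row["Path"])

# Assumed subset of the question's data, interleaved so the effect on the index is visible.
df = pd.DataFrame({
    "Path": ["A->B", "F->D->C->E", "A->C->D->B", "F->E"],
    "SNR": [34.515450, 29.567867, 29.040613, 32.086843],
})

mask = df.apply(filter_fn, axis=1)
filtered = df[mask]                           # keeps the original index labels
renumbered = filtered.reset_index(drop=True)  # optional: renumber the rows from 0
print(renumbered)
```

Filtering with a mask leaves gaps in the index, so reset_index(drop=True) is only needed if consecutive row numbers matter downstream.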

CodePudding user response:

Use str.startswith() and negate the result to keep the paths that do NOT start with 'A':

df
###
               Path   Latency         Noise        SNR
0              A->B  0.001769  3.535534e-07  34.515450
1        A->C->D->B  0.006240  1.247207e-06  29.040613
2  A->C->D->E->F->B  0.011315  2.261351e-06  26.456319
3     A->C->D->F->B  0.008269  1.652609e-06  27.818298
4     A->C->E->D->B  0.008111  1.620994e-06  27.902185
5     F->D->A->C->E  0.008002  1.599196e-06  27.960983
6  F->D->B->A->C->E  0.009928  1.984271e-06  27.023989
7        F->D->C->E  0.005527  1.104621e-06  29.567867
8           F->D->E  0.003859  7.713011e-07  31.127760
9              F->E  0.003094  6.184658e-07  32.086843
df[~df['Path'].str.startswith('A')]
###
               Path   Latency         Noise        SNR
5     F->D->A->C->E  0.008002  1.599196e-06  27.960983
6  F->D->B->A->C->E  0.009928  1.984271e-06  27.023989
7        F->D->C->E  0.005527  1.104621e-06  29.567867
8           F->D->E  0.003859  7.713011e-07  31.127760
9              F->E  0.003094  6.184658e-07  32.086843


As a function:

def foo(path: str) -> bool:
    return not path.startswith('A')

df[df['Path'].apply(foo)]
###
               Path   Latency         Noise        SNR
5     F->D->A->C->E  0.008002  1.599196e-06  27.960983
6  F->D->B->A->C->E  0.009928  1.984271e-06  27.023989
7        F->D->C->E  0.005527  1.104621e-06  29.567867
8           F->D->E  0.003859  7.713011e-07  31.127760
9              F->E  0.003094  6.184658e-07  32.086843

Reference: pandas.Series.str.startswith
