Filtering a dataframe based on one column with a different type than another column


I have the following problem:

import pandas as pd

data = {
    "ID": [420, 380, 390, 540, 520, 50, 22],
    "duration": [50, 40, 45, 33, 19, 1, 3],
    "next": ["390;50", "880;222", "520;50", "380;111", "810;111", "22;888", "11"],
}

# load data into a DataFrame object
df = pd.DataFrame(data)

print(df)

As you can see, I have:

    ID  duration     next
0  420        50   390;50
1  380        40  880;222
2  390        45   520;50
3  540        33  380;111
4  520        19  810;111
5   50         1   22;888
6   22         3       11

Things to notice:

  • ID is of type int
  • next is a string containing one or more numbers, separated by ; when there is more than one

I would like to keep only the rows where none of the next values appears in the ID column.

For example, in this case:

  • 420 has a follow-up in both 390 and 50, so that row is dropped
  • 380 has 880 and 222 as next, neither of which is in ID, so that row is kept
  • 540 has 380 and 111 as next, and while 111 is not in ID, 380 is, so that row is dropped
  • the same goes for 50

In the end I want to get

1  380        40  880;222
4  520        19  810;111
6   22         3       11

When next held only a single value I used print(df[~df.next.astype(int).isin(df.ID)]), but in this case isin cannot be applied directly.
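For context, here is a minimal sketch of why that single-value trick does not carry over; the small single DataFrame below is hypothetical and only illustrates the point:

import pandas as pd

# Hypothetical single-value case: `next` holds one number per row,
# so casting to int and using isin works directly.
single = pd.DataFrame({"ID": [420, 380], "next": ["380", "999"]})
print(single[~single["next"].astype(int).isin(single["ID"])])

# With the real data, `next` can hold several ;-separated numbers,
# so df["next"].astype(int) fails with a ValueError on strings like "390;50",
# and isin then has nothing sensible to compare against.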

How can I do this?

CodePudding user response:

Let us try split, then explode, followed by an isin check:

s = df.next.str.split(';').explode().astype(int)
out = df[~s.isin(df['ID']).groupby(level=0).any()]
Out[420]: 
    ID  duration     next
1  380        40  880;222
4  520        19  810;111
6   22         3       11
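To make the intermediate steps visible, here is a short walkthrough of what split/explode/groupby produce; the name mask is introduced here only for illustration:

# Each ;-separated value gets its own row after split + explode,
# but keeps the index of the row it came from.
s = df["next"].str.split(";").explode().astype(int)
print(s)
# 0    390
# 0     50
# 1    880
# 1    222
# ...

# isin marks which next values exist in ID; groupby(level=0).any()
# collapses that back to one True/False per original row.
mask = s.isin(df["ID"]).groupby(level=0).any()

# Rows where no next value was found in ID are the ones to keep.
print(df[~mask])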

CodePudding user response:

Use a regex with word boundaries for efficiency:

pattern = '|'.join(df['ID'].astype(str))

out = df[~df['next'].str.contains(fr'\b(?:{pattern})\b')]

Output:

    ID  duration     next
1  380        40  880;222
4  520        19  810;111
6   22         3       11
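One precaution worth noting: the IDs here are plain integers, but if they could ever contain regex metacharacters the pattern should be escaped first. A minimal sketch, assuming the same df as above:

import re

# Escape each ID before joining, in case an ID ever contains regex metacharacters.
pattern = "|".join(re.escape(str(i)) for i in df["ID"])

# The \b word boundaries ensure that e.g. 38 would not match inside 380.
out = df[~df["next"].str.contains(rf"\b(?:{pattern})\b")]
print(out)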