Filtering out rows based on other rows using pandas

Time: 08-17

I have a dataframe that looks like this:

import pandas as pd

data = {'companyId': {0: 198236, 1: 198236, 2: 900814, 3: 153421, 4: 153421, 5: 337815},
        'region': {0: 'Europe', 1: 'Europe', 2: 'Asia-Pacific', 3: 'North America', 4: 'North America', 5: 'Africa'},
        'value': {0: 560, 1: 771, 2: 964, 3: 217, 4: 433, 5: 680},
        'type': {0: 'actual', 1: 'forecast', 2: 'actual', 3: 'forecast', 4: 'actual', 5: 'forecast'}}

df = pd.DataFrame(data)

    companyId     region          value  type
0   198236        Europe          560    actual
1   198236        Europe          771    forecast
2   900814        Asia-Pacific    964    actual
3   153421        North America   217    forecast
4   153421        North America   433    actual
5   337815        Africa          680    forecast

I can't seem to figure out a way to filter out certain rows based on the following conditions:

If there are two entries under the same companyId, as is the case for 198236 and 153421, I want to keep only the entry where type is actual.

If there is only one entry under a companyId, as is the case for 337815 and 900814, I want to keep that row, irrespective of the value in column type.

Does anyone have an idea how to go about this?

CodePudding user response:

You can sort the rows with argsort so that the 'actual' entries come first, then use drop_duplicates, which keeps the first row per companyId:

out = df.iloc[df.type.ne('actual').argsort()].drop_duplicates('companyId')
out
   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
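For reference, here is a step-by-step sketch of the same idea. Passing `kind='stable'` to `argsort` explicitly guarantees the tie-breaking order within the 'actual' and 'forecast' groups that the one-liner relies on:

```python
import pandas as pd

df = pd.DataFrame({
    'companyId': [198236, 198236, 900814, 153421, 153421, 337815],
    'region': ['Europe', 'Europe', 'Asia-Pacific', 'North America',
               'North America', 'Africa'],
    'value': [560, 771, 964, 217, 433, 680],
    'type': ['actual', 'forecast', 'actual', 'forecast', 'actual', 'forecast'],
})

# ne('actual') marks the 'forecast' rows True; a stable argsort then
# moves every 'actual' row (False) ahead of the 'forecast' rows (True)
# while preserving the original order within each group
order = df['type'].ne('actual').argsort(kind='stable')

# drop_duplicates keeps the first row per companyId, which is now the
# 'actual' row whenever one exists
out = df.iloc[order].drop_duplicates('companyId')
```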

CodePudding user response:

You can use a groupby and transform to create boolean indexing:

# Your condition, i.e. retain the rows which are not duplicated, and of
# those which are duplicated keep only the ones where type == 'actual'.
# Let's express that as a lambda.
to_filter = lambda x: (len(x) == 1) | ((len(x) > 1) & (x == 'actual'))

# Then create a boolean indexing mask as below
m = df.groupby('companyId')['type'].transform(to_filter)

# Then filter your df with that mask:
df[m]

   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
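A trimmed-down sketch (only the two relevant columns) showing what the mask actually contains. For a single-row group `len(x) == 1` is a plain `True` that `|` broadcasts over the group, and for multi-row groups only the `'actual'` rows come out `True`:

```python
import pandas as pd

df = pd.DataFrame({
    'companyId': [198236, 198236, 900814, 153421, 153421, 337815],
    'type': ['actual', 'forecast', 'actual', 'forecast', 'actual', 'forecast'],
})

to_filter = lambda x: (len(x) == 1) | ((len(x) > 1) & (x == 'actual'))

# m is row-aligned with df: True for every row we want to keep
m = df.groupby('companyId')['type'].transform(to_filter)
```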

CodePudding user response:

Create a column for counts, then drop the rows where the count is > 1 and the type is not 'actual':

df['count'] = df.groupby(['companyId'])['value'].transform('count')
df.drop(df.loc[(df['count'] > 1) & (df['type'] != 'actual')].index, inplace=True)
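One thing to note: this leaves the helper `count` column behind. A sketch of the same approach that removes it afterwards, so the result matches the other answers:

```python
import pandas as pd

df = pd.DataFrame({
    'companyId': [198236, 198236, 900814, 153421, 153421, 337815],
    'region': ['Europe', 'Europe', 'Asia-Pacific', 'North America',
               'North America', 'Africa'],
    'value': [560, 771, 964, 217, 433, 680],
    'type': ['actual', 'forecast', 'actual', 'forecast', 'actual', 'forecast'],
})

# How many rows each companyId has, broadcast back to every row
df['count'] = df.groupby(['companyId'])['value'].transform('count')

# Drop the duplicated-company rows whose type is not 'actual'
df.drop(df.loc[(df['count'] > 1) & (df['type'] != 'actual')].index, inplace=True)

# Remove the helper column once filtering is done
df = df.drop(columns='count')
```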

CodePudding user response:

You can create a filter based on the count and type, then drop the matching rows:

counts = df.groupby("companyId").companyId.transform("count")
row_filter = (counts > 1) & (df['type'] == "forecast")
df_filtered = df[~row_filter]

Output:

   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
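One caveat: `df['type'] == "forecast"` only works here because `type` takes exactly two values. Comparing with `!= 'actual'` instead keeps the same behaviour even if other labels ever appear in that column. A sketch of that variant:

```python
import pandas as pd

df = pd.DataFrame({
    'companyId': [198236, 198236, 900814, 153421, 153421, 337815],
    'region': ['Europe', 'Europe', 'Asia-Pacific', 'North America',
               'North America', 'Africa'],
    'value': [560, 771, 964, 217, 433, 680],
    'type': ['actual', 'forecast', 'actual', 'forecast', 'actual', 'forecast'],
})

counts = df.groupby('companyId')['companyId'].transform('count')

# Keep single-entry companies, plus the 'actual' rows of multi-entry ones
df_filtered = df[~((counts > 1) & (df['type'] != 'actual'))]
```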