Filtering out rows based on other rows using pandas

I have a dataframe that looks like this:

import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the built-in.
data = {'companyId': {0: 198236, 1: 198236, 2: 900814, 3: 153421, 4: 153421, 5: 337815},
 'region': {0: 'Europe', 1: 'Europe', 2: 'Asia-Pacific', 3: 'North America', 4: 'North America', 5: 'Africa'},
 'value': {0: 560, 1: 771, 2: 964, 3: 217, 4: 433, 5: 680},
 'type': {0: 'actual', 1: 'forecast', 2: 'actual', 3: 'forecast', 4: 'actual', 5: 'forecast'}}

df = pd.DataFrame(data)

    companyId     region          value  type
0   198236        Europe          560    actual
1   198236        Europe          771    forecast
2   900814        Asia-Pacific    964    actual
3   153421        North America   217    forecast
4   153421        North America   433    actual
5   337815        Africa          680    forecast

I can't figure out how to filter out certain rows based on the following conditions:

If there are two entries under the same companyId, as is the case for 198236 and 153421, I want to keep only the entry where type is actual.

If there is only one entry under a companyId, as is the case for 337815 and 900814, I want to keep that row, irrespective of the value in column type.

Does anyone have an idea how to go about this?

CodePudding user response：

You can use argsort to put the rows with type actual first, then use drop_duplicates to keep only the first row for each companyId:

out = df.iloc[df['type'].ne('actual').argsort()].drop_duplicates('companyId')
print(out)
   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
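
The same idea can also be written with sort_values and an explicitly stable sort, which keeps the original row order within each group and restores the original index order at the end. This is only a sketch, not part of the answer above: the helper name prefer_actual is made up for illustration, and it assumes pandas 1.1 or later for the key argument of sort_values.

```python
import pandas as pd

df = pd.DataFrame({
    'companyId': [198236, 198236, 900814, 153421, 153421, 337815],
    'region': ['Europe', 'Europe', 'Asia-Pacific',
               'North America', 'North America', 'Africa'],
    'value': [560, 771, 964, 217, 433, 680],
    'type': ['actual', 'forecast', 'actual', 'forecast', 'actual', 'forecast'],
})


def prefer_actual(frame):
    """Hypothetical helper: keep one row per companyId, preferring 'actual'."""
    # The key maps 'actual' to False and everything else to True, so
    # 'actual' rows sort first. mergesort is stable, so ties keep their order.
    ordered = frame.sort_values('type', key=lambda s: s.ne('actual'),
                                kind='mergesort')
    # Keep the first row per company, then restore the original row order.
    return ordered.drop_duplicates('companyId').sort_index()


out = prefer_actual(df)
print(out)
```

Because drop_duplicates always keeps one row per companyId, this variant never removes a company entirely, even if none of its rows is of type actual.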

CodePudding user response：

You can use a groupby and transform to create boolean indexing:

# Your condition: keep the rows whose companyId is not duplicated, and for
# duplicated companyIds keep only the rows with type == 'actual'. Let's express that as a lambda.
to_filter = lambda x: (len(x) == 1) | ((len(x) > 1) & (x == 'actual'))

# Then build a boolean mask from it:
m = df.groupby('companyId')['type'].transform(to_filter)

# Then filter your df with that mask:
df[m]

   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
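
If the per-group lambda turns out to be slow on a large frame, the same mask can probably be built without a Python-level function by broadcasting the group size back onto the rows. A minimal sketch, assuming the same df as in the question; the names group_size, mask and result are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'companyId': [198236, 198236, 900814, 153421, 153421, 337815],
    'region': ['Europe', 'Europe', 'Asia-Pacific',
               'North America', 'North America', 'Africa'],
    'value': [560, 771, 964, 217, 433, 680],
    'type': ['actual', 'forecast', 'actual', 'forecast', 'actual', 'forecast'],
})

# Number of rows in each row's companyId group, aligned with df's index.
group_size = df.groupby('companyId')['type'].transform('size')

# Keep a row if it is alone in its group or if it is an 'actual' row.
mask = (group_size == 1) | (df['type'] == 'actual')

result = df[mask]
print(result)
```

The (len(x) > 1) part of the lambda is not needed here: a row that is not alone in its group is kept only when its type is actual, which is exactly what the second operand expresses.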

CodePudding user response：

Create a column for counts, then drop the rows where the count is > 1 and the type is not actual. Note that this modifies df in place and leaves the helper column count in the result:

df['count'] = df.groupby(['companyId'])['value'].transform('count')
df.drop(df.loc[(df['count'] > 1) & (df['type'] != 'actual')].index, inplace=True)

CodePudding user response：

You can create a filter based on the count and type, then drop the matching rows:

counts = df.groupby("companyId").companyId.transform("count")
row_filter = (counts > 1) & (df['type'] == "forecast")
df_filtered = df[~row_filter]

Output:

   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
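
The approaches in this thread give the same result on the sample data, but they can differ on a case the question does not cover: a companyId with several rows, none of which is actual. The sketch below uses made-up data (the frame edge is hypothetical, not from the question) to compare the count-based filter with the sort-and-deduplicate approach:

```python
import pandas as pd

# Hypothetical data: company 1 has two forecast rows and no actual row.
edge = pd.DataFrame({
    'companyId': [1, 1, 2],
    'value': [10, 20, 30],
    'type': ['forecast', 'forecast', 'actual'],
})

# Count-based filter: drops every forecast row of a multi-row company.
counts = edge.groupby('companyId')['companyId'].transform('count')
by_mask = edge[~((counts > 1) & (edge['type'] == 'forecast'))]

# Sort-and-deduplicate: always keeps exactly one row per company.
order = edge['type'].ne('actual').to_numpy().argsort(kind='stable')
by_sort = edge.iloc[order].drop_duplicates('companyId')

print(by_mask['companyId'].tolist())  # [2]
print(by_sort['companyId'].tolist())  # [2, 1]
```

Which behaviour is right depends on what should happen to such a company; the question does not say, so this is worth checking against the real data before picking one approach.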