I have a dataframe that looks like this:
import pandas as pd

data = {'companyId': {0: 198236, 1: 198236, 2: 900814, 3: 153421, 4: 153421, 5: 337815},
        'region': {0: 'Europe', 1: 'Europe', 2: 'Asia-Pacific', 3: 'North America', 4: 'North America', 5: 'Africa'},
        'value': {0: 560, 1: 771, 2: 964, 3: 217, 4: 433, 5: 680},
        'type': {0: 'actual', 1: 'forecast', 2: 'actual', 3: 'forecast', 4: 'actual', 5: 'forecast'}}
df = pd.DataFrame(data)
   companyId         region  value      type
0     198236         Europe    560    actual
1     198236         Europe    771  forecast
2     900814   Asia-Pacific    964    actual
3     153421  North America    217  forecast
4     153421  North America    433    actual
5     337815         Africa    680  forecast
I can't seem to figure out a way to filter out certain rows based on the following condition:
If there are two entries under the same companyId, as is the case for 198236 and 153421, I want to keep only the entry where type is actual.
If there is only one entry under a companyId, as is the case for 337815 and 900814, I want to keep that row, irrespective of the value in column type.
Does anyone have an idea how to go about this?
CodePudding user response:
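You can use argsort to order the rows so that the actual rows come first, then use drop_duplicates to keep the first row per companyId. A stable sort keeps the original order among ties:
out = df.iloc[df.type.ne('actual').argsort(kind='stable')].drop_duplicates('companyId')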
Out[925]:
   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
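For comparison, here is a minimal sketch of the same sort-then-deduplicate idea written with sort_values instead of argsort. It assumes pandas 1.1 or later (for the key argument); the names ordered and out are only illustrative, and the final sort_index is an optional step to restore the original row order.

```python
import pandas as pd

df = pd.DataFrame({
    "companyId": [198236, 198236, 900814, 153421, 153421, 337815],
    "region": ["Europe", "Europe", "Asia-Pacific", "North America", "North America", "Africa"],
    "value": [560, 771, 964, 217, 433, 680],
    "type": ["actual", "forecast", "actual", "forecast", "actual", "forecast"],
})

# False sorts before True, so the 'actual' rows move to the front.
# kind="stable" keeps the original order among rows that compare equal.
ordered = df.sort_values("type", key=lambda s: s.ne("actual"), kind="stable")

# Keep the first row seen for each companyId, then restore the original order.
out = ordered.drop_duplicates("companyId").sort_index()
print(out)
```

Because the actual rows are sorted to the front, drop_duplicates only falls back to a forecast row when a companyId has no actual row at all.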
CodePudding user response:
You can use groupby and transform to build a boolean mask:
# Your condition: keep the rows whose companyId is not duplicated and, for
# duplicated companyIds, keep only the rows where type == 'actual'. Let's express that as a lambda.
to_filter = lambda x: (len(x) == 1) | ((len(x) > 1) & (x == 'actual'))
# Then create a boolean mask as below
m = df.groupby('companyId')['type'].transform(to_filter)
# Then filter your df with that mask:
df[m]
   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
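The same mask can probably be built without a Python-level lambda by asking transform for the group size directly. This is only a sketch of that variant; group_size, mask and filtered are illustrative names, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "companyId": [198236, 198236, 900814, 153421, 153421, 337815],
    "region": ["Europe", "Europe", "Asia-Pacific", "North America", "North America", "Africa"],
    "value": [560, 771, 964, 217, 433, 680],
    "type": ["actual", "forecast", "actual", "forecast", "actual", "forecast"],
})

# Number of rows in each companyId group, broadcast back onto every row.
group_size = df.groupby("companyId")["type"].transform("size")

# Keep rows that are alone in their group, plus every 'actual' row.
mask = group_size.eq(1) | df["type"].eq("actual")
filtered = df[mask]
print(filtered)
```

Since a row with group size 1 is kept regardless of type, the len(x) > 1 check from the lambda is not needed here.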
CodePudding user response:
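Create a column with the number of rows per companyId, then drop the rows where the count is greater than 1 and the type is not actual. Note that this modifies df in place and leaves the helper column count in the result: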
df['count'] = df.groupby(['companyId'])['value'].transform('count')
df.drop(df.loc[(df['count'] > 1) & (df['type'] != 'actual')].index, inplace=True)
CodePudding user response:
You can create a filter based on the group count and the type, then drop the matching rows:
counts = df.groupby("companyId").companyId.transform("count")
row_filter = (counts > 1) & (df['type'] == "forecast")
df_filtered = df[~row_filter]
Output:
   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
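One edge case the question does not cover: if a companyId has several rows and none of them is actual, a count-based filter like the ones above removes that company entirely. If you would rather keep such rows, a sketch along these lines might work. The data here is made up for illustration, and keeping every row of a company without an actual entry is an assumption, not something the question asks for.

```python
import pandas as pd

# Hypothetical data: company 2 has two rows, neither of them 'actual'.
df = pd.DataFrame({
    "companyId": [1, 1, 2, 2, 3],
    "type": ["actual", "forecast", "forecast", "forecast", "forecast"],
    "value": [10, 20, 30, 40, 50],
})

is_actual = df["type"].eq("actual")

# True on every row whose company has at least one 'actual' row.
has_actual = is_actual.groupby(df["companyId"]).transform("any")

# Keep 'actual' rows, and keep all rows of companies with no 'actual' row.
df_filtered = df[is_actual | ~has_actual]
print(df_filtered)
```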