Fellow contributors,
I would like to check whether a set of specific keywords exists in a grouped pandas DataFrame. The words I would like to check for are start, pending, and either finished or almost_finished. I would like to define a custom function for this and apply it to a pandas groupby, since defining a function to apply to columns is not as clear to me as row-wise operations, where every row is addressed with row[colname].
In this example, if the sequence of desired words exists, I would like the last value in the column number for each ID to be copied into a new column; it doesn't matter if other values before it are empty strings. Here is a reproducible example:
import pandas as pd
df = pd.DataFrame({'ID' : [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
'number' : ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
'status' : ['start', 'pending', 'finished', 'start', 'pending', 'partially_finished', 'start', 'pending']})
In this case the last group, ID == 1300, has no return value.
Basically, I am asking this question to learn the best approach for this kind of problem, where you need to check for certain values in a column. Since I am coming from R, I need to familiarize myself with how to do the same thing in Python. I would also appreciate any better solution you may suggest.
Thank you very much in advance.
CodePudding user response:
df.groupby(...).apply is probably what you're looking for. You can apply a function to each group and return a single value, a Series, or a DataFrame. The per-group results are then combined into one object.
For example, the following function:
def return_last_num(df):
    # True only if every required status appears somewhere in the group
    if df.status.str.contains('start').any() & df.status.str.contains('pending').any() \
            & (df.status.str.contains('finished').any() | df.status.str.contains('partially_finished').any()):
        df['last_number'] = df.number.values[-1]
    else:
        df['last_number'] = str()
    return df
Applying it with df.groupby('ID').apply(return_last_num) returns:
     ID number              status last_number
0  1100    Yes               start          No
1  1100     No             pending          No
2  1100     No            finished          No
3  1200    Yes               start          No
4  1200     No             pending          No
5  1200     No  partially_finished          No
6  1300    Yes               start
7  1300     No             pending
Alternatively, return a single value per group:
def return_last_num(df):
    # True only if every required status appears somewhere in the group
    if df.status.str.contains('start').any() & df.status.str.contains('pending').any() \
            & (df.status.str.contains('finished').any() | df.status.str.contains('partially_finished').any()):
        return df.number.values[-1]
    else:
        return str()
Applying it with df.groupby('ID').apply(return_last_num) returns just the last 'number' value per ID if the sequence condition was matched:
ID
1100    No
1200    No
1300
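One caveat about the functions above: str.contains does substring matching, so 'finished' also matches 'partially_finished' (and would match a value such as 'unfinished'). Below is a minimal sketch of an exact-match variant. The names FINAL_STATUSES and last_number_if_complete are made up for illustration, and the choice of which statuses count as "final" is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

# Assumption: any one of these counts as the closing status.
FINAL_STATUSES = {'finished', 'partially_finished', 'almost_finished'}


def last_number_if_complete(group):
    """Return the group's last 'number' if all required statuses occur."""
    seen = set(group['status'])                    # exact values, no substrings
    has_start_and_pending = {'start', 'pending'} <= seen
    has_final = bool(seen & FINAL_STATUSES)
    if has_start_and_pending and has_final:
        return group['number'].iloc[-1]
    return ''


# Selecting the columns explicitly keeps the grouping key out of each group.
result = df.groupby('ID')[['number', 'status']].apply(last_number_if_complete)
print(result)
```

Working with a set of the exact values keeps the check independent of how the status strings happen to overlap.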
CodePudding user response:
I was thinking of a vectorized approach.
First, if finished and almost finished (partially_finished in the sample data) have the same effect, I would "merge" them and map each status to a unique number that is easy to check:
>>> df['status2'] = df['status'].map({'finished':1,'partially_finished':1,'pending':10,'start':100})
>>> df
     ID number              status  status2
0  1100    Yes               start      100
1  1100     No             pending       10
2  1100     No            finished        1
3  1200    Yes               start      100
4  1200     No             pending       10
5  1200     No  partially_finished        1
6  1300    Yes               start      100
7  1300     No             pending       10
That allows me to "extract" the groups with the desired statuses (100 + 10 + 1 = 111), assuming each status occurs at most once per ID:
idstatus = df.groupby('ID', sort=False)[['status2']].sum() == 111
      status2
ID
1100     True
1200     True
1300    False
And the actual number values:
valuenumber=df.query('status2==1').set_index('ID')
     number              status  status2
ID
1100     No            finished        1
1200     No  partially_finished        1
And finally merge:
>>> idstatus.merge(valuenumber, left_index=True, right_index=True, how='left')
      status2_x number              status  status2_y
ID
1100       True     No            finished        1.0
1200       True     No  partially_finished        1.0
1300      False    NaN                 NaN        NaN
It is possible to merge fewer columns:
>>> idstatus.merge(valuenumber[['number','status']], left_index=True, right_index=True, how='left')
      status2 number              status
ID
1100     True     No            finished
1200     True     No  partially_finished
1300    False    NaN                 NaN
If you only want the output for the finished ones, I would merge the other way around.
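One way to read "merge the other way around" is to use valuenumber as the left table, so that only IDs with a finished row survive. This is an interpretation, not necessarily what the answer's author had in mind. A minimal, self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

# Encode each status as a distinct power of ten; both "finished" variants share 1.
df['status2'] = df['status'].map(
    {'finished': 1, 'partially_finished': 1, 'pending': 10, 'start': 100}
)

# A group sums to 111 only if it has exactly one of each encoded status.
idstatus = df.groupby('ID', sort=False)[['status2']].sum() == 111

# Rows carrying the finished status, indexed by ID.
valuenumber = df.query('status2 == 1').set_index('ID')

# Left table is valuenumber, so IDs without a finished row (1300) drop out.
result = valuenumber[['number', 'status']].merge(
    idstatus, left_index=True, right_index=True, how='left'
)
print(result)
```

Here the status2 column of the result is the boolean check from idstatus, which you can still use to filter if a group has a finished row but is missing start or pending.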
CodePudding user response:
You can aggregate with set and use intersection to check.
But first, I would map partially_finished or almost_finished to finished, if these should be treated equally.
df['status'] = df.status.replace('partially_finished|almost_finished', 'finished', regex=True)
Next, aggregate number to its last value and status to a set, then use intersection to check whether all required values exist in status.
checkcriteria = {'start', 'pending', 'finished'}
df = df.groupby('ID').agg({'number': 'last', 'status': set})
df['check'] = df.status.transform(lambda x: len(x.intersection(checkcriteria)) == 3)
This should give the following result:
     number                      status  check
ID
1100     No  {start, pending, finished}   True
1200     No  {start, pending, finished}   True
1300     No            {start, pending}  False
You can then either filter by check, or mask the rows where check is False and remove their number value.
# This will only return ID == 1100, 1200
df[df.check]
# OR mask to remove the number value for when check == False
df.loc[~df.check, 'number'] = None
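The aggregation above leaves one row per ID, while the question asked for a new column on the original rows. A possible sketch of mapping the per-ID result back onto every row follows; the variable names (required, per_id, complete) are illustrative, and treating both partially_finished and almost_finished as finished is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

required = {'start', 'pending', 'finished'}

# Assumption: both variants are equivalent to 'finished' (exact-value replace).
normalized = df['status'].replace(
    {'partially_finished': 'finished', 'almost_finished': 'finished'}
)

# One set of statuses per ID, then a boolean flag per ID.
per_id = normalized.groupby(df['ID']).apply(set)
complete = per_id.map(lambda statuses: required <= statuses)

# Last 'number' per ID, blanked out where the group is incomplete.
last_number = df.groupby('ID')['number'].last().where(complete, '')

# Broadcast the per-ID value back to the original rows without mutating 'status'.
df['last_number'] = df['ID'].map(last_number)
print(df)
```

Unlike the snippet above, this keeps the original df intact apart from the added last_number column.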