I have a dataframe with a series of months and each row contains a value of either 1 or 0. How do I find the index of the first occurrence of 1 and the last occurrence of 1 ?
I thought about grouping all of the column together to create 1 row that just has a list of all the 0s and 1s and then enumerating over it to get the min and max values but I was getting stuck on grouping everything and putting it into a column
Here's a sample of my dataset:
Jan 2020 Feb2020 March 2020 April 2020 May 2020
User1 1 0 0 0 0
User2 0 1 1 0 1
User 3 1 1 1 1 1
Id like my output to look like this:
Jan 2020 Feb2020 March 2020 April 2020 May 2020 First_occurance Last Occurance
User1 1 0 0 0 0 1 1
User2 0 1 1 0 1 2 5
User 3 1 1 1 1 1 1 5
CodePudding user response:
For a solution without loop/apply, you can use:
import numpy as np
out = df.join(
(df*(np.arange(df.shape[1]) 1))
.where(df.eq(1))
.agg(['min', 'max'], axis=1)
.rename(columns={'min': 'first occurrence',
'max': 'last occurrence',
})
.convert_dtypes()
)
Or, to modify the DataFrame in place:
df[['first occurrence', 'last occurrence']] = (
(df*(np.arange(df.shape[1]) 1))
.where(df.eq(1))
.agg(['min', 'max'], axis=1)
.convert_dtypes()
)
output:
Jan 2020 Feb2020 March 2020 April 2020 May 2020 first occurrence last occurrence
User1 1 0 0 0 0 1 1
User2 0 1 1 0 1 2 5
User 3 1 1 1 1 1 1 5
CodePudding user response:
You can use numpy.where
for this. You have not mentioned; but following also handles non-occurrence as -1
, else it will result in array index error.
def get_first_last_occurrence(row):
arr = np.where(row.to_numpy() == 1)
if len(arr):
return pd.Series({"First Occurrence":arr[0][0], "Last Occurrence":arr[0][-1]})
else:
return -1
df[["First Occurrence", "Last Occurrence"]] = df.apply(lambda row: get_first_last_occurrence(row), axis=1)
>> User Jan2020 Feb2020 March2020 April2020 May2020 First Occurrence Last Occurrence
>> 0 User1 1 0 0 0 0 1 1
>> 1 User2 0 1 1 0 1 2 5
>> 2 User3 1 1 1 1 1 1 5
CodePudding user response:
Here's an alternative solution which returns the column label of each occurrence instead of an integer.
Starting with this dataframe:
In [3]: df
Out[3]:
Jan_2020 Feb_2020 March_2020 April_2020 May_2020
User1 1 0 0 0 0
User2 0 1 1 0 1
User3 1 1 1 1 1
We can get the name of the months we're interested in with cumsum()
and a row-wise apply()
:
In [4]: df.cumsum(axis=1).apply(lambda row: (row[row == 1].idxmin(), row.idxmax()), axis=1)
Out[4]:
User1 (Jan_2020, Jan_2020)
User2 (Feb_2020, May_2020)
User3 (Jan_2020, May_2020)
dtype: object
For each row, return the tuple consisting of the lowest index where the cumsum == 1
, and the first index at which the cumulative sum is at its max value for the row (idxmax()
returns the first occurrence of a duplicate max).
If you want a dataframe returned (i.e., for assigning back as new columns to your original dataframe):
df[["first", "last"]] = df.cumsum(axis=1).apply(lambda row: pd.Series([row[row == 1].idxmin(), row.idxmax()]), axis=1)
Maybe clean it up:
def get_first_last(row):
csum = row.cumsum()
first = csum[csum == 1].idxmin()
last = csum.idxmax()
return pd.Series({"first": first, "last": last})
df.apply(get_first_last, axis=1)
Output:
first last
User1 Jan_2020 Jan_2020
User2 Feb_2020 May_2020
User3 Jan_2020 April_2020