Home > database >  find first and last occurrences in pandas df
find first and last occurrences in pandas df

Time:10-19

I have a dataframe with a series of months and each row contains a value of either 1 or 0. How do I find the index of the first occurrence of 1 and the last occurrence of 1 ?

I thought about grouping all of the column together to create 1 row that just has a list of all the 0s and 1s and then enumerating over it to get the min and max values but I was getting stuck on grouping everything and putting it into a column

Here's a sample of my dataset:

          Jan 2020      Feb2020    March 2020     April 2020    May 2020
User1         1            0           0               0            0
User2         0            1           1               0            1
User 3        1            1           1               1            1

Id like my output to look like this:

          Jan 2020      Feb2020    March 2020     April 2020    May 2020   First_occurance    Last Occurance
User1         1            0           0               0            0            1                    1
User2         0            1           1               0            1            2                    5
User 3        1            1           1               1            1            1                    5       
 

CodePudding user response:

For a solution without loop/apply, you can use:

import numpy as np

out = df.join(
    (df*(np.arange(df.shape[1]) 1))
    .where(df.eq(1))
    .agg(['min', 'max'], axis=1)
    .rename(columns={'min': 'first occurrence',
                     'max': 'last occurrence',
                    })
    .convert_dtypes()
)

Or, to modify the DataFrame in place:

df[['first occurrence', 'last occurrence']] = (
    (df*(np.arange(df.shape[1]) 1))
    .where(df.eq(1))
    .agg(['min', 'max'], axis=1)
    .convert_dtypes()
)

output:

        Jan 2020  Feb2020  March 2020  April 2020  May 2020  first occurrence  last occurrence
User1          1        0           0           0         0                 1                1
User2          0        1           1           0         1                 2                5
User 3         1        1           1           1         1                 1                5

CodePudding user response:

You can use numpy.where for this. You have not mentioned; but following also handles non-occurrence as -1, else it will result in array index error.

def get_first_last_occurrence(row):
  arr = np.where(row.to_numpy() == 1)
  if len(arr):
    return pd.Series({"First Occurrence":arr[0][0], "Last Occurrence":arr[0][-1]})
  else:
    return -1

df[["First Occurrence", "Last Occurrence"]] = df.apply(lambda row: get_first_last_occurrence(row), axis=1)

>>     User  Jan2020  Feb2020  March2020  April2020  May2020  First Occurrence  Last Occurrence
>> 0  User1        1        0          0          0        0                 1                1
>> 1  User2        0        1          1          0        1                 2                5
>> 2  User3        1        1          1          1        1                 1                5

CodePudding user response:

Here's an alternative solution which returns the column label of each occurrence instead of an integer.

Starting with this dataframe:

In [3]: df
Out[3]:
       Jan_2020  Feb_2020  March_2020  April_2020  May_2020
User1         1         0           0           0         0
User2         0         1           1           0         1
User3         1         1           1           1         1

We can get the name of the months we're interested in with cumsum() and a row-wise apply():

In [4]: df.cumsum(axis=1).apply(lambda row: (row[row == 1].idxmin(), row.idxmax()), axis=1)
Out[4]:
User1    (Jan_2020, Jan_2020)
User2    (Feb_2020, May_2020)
User3    (Jan_2020, May_2020)
dtype: object

For each row, return the tuple consisting of the lowest index where the cumsum == 1, and the first index at which the cumulative sum is at its max value for the row (idxmax() returns the first occurrence of a duplicate max).

If you want a dataframe returned (i.e., for assigning back as new columns to your original dataframe):

df[["first", "last"]] = df.cumsum(axis=1).apply(lambda row: pd.Series([row[row == 1].idxmin(), row.idxmax()]), axis=1)

Maybe clean it up:

def get_first_last(row):
    csum = row.cumsum()
    first = csum[csum == 1].idxmin()
    last = csum.idxmax()
    return pd.Series({"first": first, "last": last})


df.apply(get_first_last, axis=1)

Output:

          first        last
User1  Jan_2020    Jan_2020
User2  Feb_2020    May_2020
User3  Jan_2020  April_2020
  • Related