Count first and last occurrences in a dataset-CodePudding

I ultimately want to count the number of months in a given range per each user. For example, see below, user 1 has 1 range of data from April 2021-June 2021. Where I'm struggling is counting users that multiple ranges (see users 3 & 4).

I have a pandas df w/ columns that looks like these:

username Jan_2021 Feb_2021 March_2021 April_2021 May_2021 June_2021 July_2021   Sum_of_Months
user 1      0         0        0          1         1          1        0            3
user 2      0         0        0          0         0          0        1            1
user 3      1         1        1          0         1          1        0            5
user 4      0         1        1          1         0          1        1            5

Id like to be able to get a summary column that says the number of groups and their count. For example: When I say num of groups I mean the amount of grouped 1's together. and when I say length of group I mean the amount of months in 1 group, like if I were to draw a circle around the 1s. For example, user 1 is 3 because there's a 1 in columns April-June 2021

username Num_of_groups Lenth_of_group
user 1          1           3
user 2          1           1
user 3          2           3,2
user 4          2           3,2

CodePudding user response：

You can try with groupby function from itertools

from itertools import groupby
df1 = df[[col for col in df.columns if "2021" in col]]
df["Lenth_of_group"] = df1.apply(lambda x: [sum(g) for i, g in groupby(x) if i == 1],axis=1)
df["Num_of_groups"] = df["Lenth_of_group"].apply(lambda x: len(x))

Hope this Helps...

CodePudding user response：

This solution uses staircase, and works by treating each users data as a step function (of 1s and 0s)

setup

import pandas as pd

df = pd.DataFrame(
    {
        "username":["user 1", "user 2", "user 3", "user 4"],
        "Jan_2021":[0,0,1,0],
        "Feb_2021":[0,0,1,0],
        "Mar_2021":[0,0,1,1],
        "April_2021":[1,0,0,1],
        "May_2021":[1,0,1,0],
        "June_2021":[1,0,1,1],
        "July_2021":[0,1,1,0],
        "Sum_of_Months":[3,1,5,5],
    }
)

solution

import staircase as sc

# trim down to month columns only, and transpose to make users correspond to columns, and months correspond to rows
data = df[["Jan_2021", "Feb_2021", "Mar_2021", "April_2021", "May_2021", "June_2021", "July_2021"]].transpose().reset_index(drop=True)

def extract_groups(series):
    return (
        sc.Stairs.from_values(initial_value=0, values=series)  # create step function for each user
        .clip(0, len(series))  # clip step function to region of interest
        .to_frame()  # represent data as start/stop intervals in a dataframe
        .query("value==1")  # filter for groups of 1s
        .eval("dist=end-start")  # calculate the length of each "group"
        ["dist"].to_list()  # convert the result from Series to list
    )

sfs = data.columns.to_series().apply(lambda c: extract_groups(data[c]))

sfs is a pandas.Series where the values are lists representing number of groups and the lengths of each. It looks like this:

0       [3]
1       [1]
2    [3, 3]
3    [2, 1]
dtype: object

You can use it to create the data you need, eg

df["Num_of_groups"] = sfs.apply(list.__len__)

adds the Num_of_groups column to your original dataframe

Disclaimer: I am author of staircase