Create pandas summary table (but not groupby)-CodePudding

I got the following table in pandas:

x	y
1	1
2	3
2	5
2	4
1	4
1	5

I would like to see the pattern of variable x, so I would like to see the pattern goes.

In the table you would see x=1 then x=2 for three times and then go back to x=1 twice.

x	# occurance	first y value	last y value
1	1	1	1
2	3	3	4
1	2	4	5

I initially tried groupby but using groupby would group all x together, and that is not the desired way.

For convenience, I should provide the content of the dataframe.

data = {'x': [1, 2, 2, 2, 1, 1],
        'y': [1, 3, 5, 4, 4, 5]}
df = pd.DataFrame(data)

Thanks in advance.

CodePudding user response：

This is an island problem. We need to first group the x values into islands by taking the index and subtracting the cumcount of that x group:

group = df.index - df.groupby('x').cumcount()

Output:

0    0
1    1
2    1
3    1
4    3
5    3
dtype: int64

We can now group by that variable, adding the appropriate list of functions for the desired outputs:

df.groupby(group).agg({'x':[('x', 'first'), ('# occurrence', 'size')], 'y':[('first y', 'first'), ('last y', 'last')]}).reset_index(drop=True)

Output:

   x                    y
   x # occurrence first y last y
0  1            1       1      1
1  2            3       3      4
2  1            2       4      5

Dependent on your version of pandas (>=0.25), you could also use a dict to specify the aggregations:

df.groupby(group).agg(**{ 'x' :('x', 'first'), '# occurrence':('x', 'count'), 'first y':('y', 'first'), 'last y':('y', 'last')})

Output:

   x  # occurrence  first y  last y
0  1             1        1       1
1  2             3        3       4
3  1             2        4       5

CodePudding user response：

Use:

g = (df['x'].shift(1, fill_value=df['x'].iloc[0])!=df['x']).cumsum()
from collections import Counter
df.groupby(g).agg({'x': [('x', lambda x: x.iloc[0]), ('# occurance', lambda x: list(Counter(x).values())[0])], 'y': [('first y value', lambda x: x.iloc[0]), ('last y value', lambda x: x.iloc[-1])]}).droplevel(level=0, axis=1)

Output:

   x    # occurance first y value   last y value
x               
0   1   1   1   1
1   2   3   3   4
2   1   2   4   5

CodePudding user response：

Create consecutive groups by compare shifted values with not equal and cumulative sum and then use named aggregation:

df = (df.groupby(df['x'].ne(df['x'].shift()).cumsum())
        .agg(**{'x':('x','first'),
                '# occurance':('x','size'),
                'first y value':('y','first'),
                'last y value':('y','last')})
        .reset_index(drop=True))
print (df)
   x  # occurance  first y value  last y value
0  1            1              1             1
1  2            3              3             4
2  1            2              4             5