Pandas pivot_table-like output, vertically concatenating multiple column values from group


I have a table as follows.

id y val1 val2 val3 ...
1 100 3 1 2
1 150 1 2 4
1 250 4 2 6
2 200 3 1 4
2 250 2 2 2
2 350 4 2 4
3 200 3 3 4
3 300 3 2 4
3 400 6 3 3

I want to aggregate by the ID and concatenate all the values in order, and then take only the last y of each id. For example, the table would look like this:

id y val1 val2 val3 ... val1 val2 val3 ... val1 val2 val3 ...
1 250 3 1 2 ... 1 2 4 ... 4 2 6 ...
2 350 3 1 4 ... 2 2 2 ... 4 2 4 ...
3 400 3 3 4 ... 3 2 4 ... 6 3 3 ...

Notice the y: 250 is from the last row of id=1 (in the real table there is another id column I can use to specify which y I want to keep), 350 is from the last row of id=2, and 400 is from the last row of id=3. The values are simply concatenated into one row, in order.

I looked into pivot_table and know I can use new_ds = pd.pivot_table(dataset, index='id') to aggregate by the id, but I want to be selective (not including y for all rows, and in reality there's other garbage data I do not want). It is important that the values stay in order, so the values from the second row come after the first, the third after the second, etc. There are more than 100 value columns.

I've looked into pivot and groupby but can't figure out exactly how to apply to this.

CodePudding user response:

This isn't a pivot table, but a funky groupby, or maybe two separate groupby's.

(I wouldn't try to use .agg() either because you want to concatenate the other columns in-order, all together, but .agg() is really pedantic about forcing you to define an individual aggregate function for each column, which here would be a pain.)

Taking the last 'y' value in a group is easy:

df.groupby('id').agg({'y': lambda s: s.iloc[-1]})

# we avoid .tail() here because of a current bug where using it inside
# .agg on a series throws "ValueError: Must produce aggregated value"
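If nulls aren't a concern, GroupBy.last() gets the same result without a lambda — a minimal runnable sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'y': [100, 250, 350]})

# GroupBy.last() returns the last non-null value per group,
# which is equivalent to s.iloc[-1] when there are no NaNs.
last_y = df.groupby('id')['y'].last()
print(last_y)
```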

Now to vertically concatenate the rows in the group consecutively, for all the other columns:

  • we actually don't even need pd.concat([...], axis=1) like I thought we would

  • we can do the whole thing inside a single apply: df.groupby('id').apply(lambda g: g.drop(columns=['id','y']).values.flatten())

  • first, explicitly select which columns you do want included (selecting them after the groupby keeps the id column itself out of the flattened values):

    df.groupby('id')[['val1', 'val2', 'val3']].apply(lambda g: g.values.flatten())

    id
    1    [3, 1, 2, 1, 2, 4, 4, 2, 6]
    2    [3, 1, 4, 2, 2, 2, 4, 2, 4]
    3    [3, 3, 4, 3, 2, 4, 6, 3, 3]

or if you prefer, you can move the .drop(columns='y') to the front (note that g still carries the id column inside apply, so drop it there too):

df.drop(columns='y').groupby('id').apply(lambda g: g.drop(columns='id').values.flatten())

We can't legally end up with duplicate column names in the output — as @DocZero pointed out, your example layout is illegal. You need to figure out how you want to mangle the repeated column names with a prefix/suffix/other scheme.
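Putting the pieces together, here's a minimal sketch of the whole pipeline on the question's sample data; the val1_0, val2_0, ..., val3_2 suffixes are just one possible name-mangling scheme:

```python
import pandas as pd

# Sample data from the question (with just three val columns).
df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'y':    [100, 150, 250, 200, 250, 350, 200, 300, 400],
    'val1': [3, 1, 4, 3, 2, 4, 3, 3, 6],
    'val2': [1, 2, 2, 1, 2, 2, 3, 2, 3],
    'val3': [2, 4, 6, 4, 2, 4, 4, 3, 3],
})

val_cols = ['val1', 'val2', 'val3']

# Last 'y' per id.
last_y = df.groupby('id')['y'].agg(lambda s: s.iloc[-1])

# Flatten the val columns row-by-row; row-major order keeps the
# second row's values after the first's, the third's after the second's.
flat = df.groupby('id')[val_cols].apply(
    lambda g: pd.Series(g.to_numpy().flatten())
)

# Mangle the duplicate names with a per-row suffix: val1_0, ..., val3_2.
n_rows = flat.shape[1] // len(val_cols)
flat.columns = [f'{c}_{i}' for i in range(n_rows) for c in val_cols]

out = pd.concat([last_y, flat], axis=1).reset_index()
print(out)
```

Because each per-group Series has the same length, apply stitches them back into one wide DataFrame; the concat then bolts the last-y column on by the shared id index.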

Minor note: the pandas .values accessor is discouraged and slated for eventual deprecation; we're supposed to start using .to_numpy() or .array instead.
