I have collected data from a piece of software has separated the contents of a message between two messages. I am relatively new to using Pandas as a whole. Lets say I have a Pandas DataFrame in the following format:
Type | Message |
---|---|
A | Start |
A | End |
A | Start2 |
A | End2 |
I need to combine message pairs that share the same type. Multiple packet types can occur, however I only care about merging one type. Most of the time the packets are separated in pairs so I was able to combine them using the following:
group = ("A" != df['Type'].shift()).cumsum().rename('group')
df = df.groupby(['Type', group], sort=False)['Message'].agg(''.join).reset_index().drop('group', axis=1)
Example of packet pairs:
Type | Message |
---|---|
B | StartEnd |
C | Start2End2 |
A | Start |
A | End |
B | Start4End4 |
C | Start5End5 |
A | Start |
A | End |
The output using the current code gives:
Type | Message |
---|---|
B | StartEnd |
C | Start2End2 |
A | StartEnd |
B | Start4End4 |
C | Start5End5 |
A | StartEnd |
However this only solves the problem of consecutive messages with the same type.
The desired output is as follows:
Type | Message |
---|---|
A | StartEnd |
A | Start2End2 |
I am unsure of a way to limit these groupings to only two items. I need to be able to preserve the type of each message as it is important for later processing.
CodePudding user response:
You can create consecutive groups g
by compare by shifted values with pairs groups by GroupBy.cumcount
with integer division by 2
and pass to final groupby
:
print (df)
Type Message
0 C Start
1 C End
2 B Start4End4
3 A Start
4 A End
5 A Start2
6 A End2
7 B StartEnd
8 C Start2End2
9 A Start
10 A End
11 B Start4End4
12 C Start5End5
13 A Start
14 A End
m = df['Type'].eq('A')
g = m.ne(m.shift()).where(m).cumsum().fillna(0)
g1 = df.groupby(g).cumcount()
g1 = g1.mask(m, g1 // 2)
df1 = (df.groupby(['Type', g, g1], sort=False)['Message']
.agg(''.join)
.droplevel([1,2])
.reset_index())
print (df1)
Type Message
0 C Start
1 C End
2 B Start4End4
3 A StartEnd
4 A Start2End2
5 B StartEnd
6 C Start2End2
7 A StartEnd
8 B Start4End4
9 C Start5End5
10 A StartEnd
CodePudding user response:
If you only care about group A, you can simplify to a single groupby.agg
:
# filter Type A
m = df['Type'].eq('A')
# aggregate Type A by pairs
out = (df[m].groupby(np.arange(m.sum())//2)
.agg({'Type': 'first', 'Message': ''.join})
)
output:
Type Message
0 A StartEnd
1 A Start2End2
used input:
Type Message
0 B StartEnd
1 C Start2End2
2 A Start
3 A End
4 B Start4End4
5 C Start5End5
6 A Start2
7 A End2