Home > Software design >  Combine consecutive row pairs of a dataframe based on a condition
Combine consecutive row pairs of a dataframe based on a condition

Time:11-10

I have collected data from a piece of software has separated the contents of a message between two messages. I am relatively new to using Pandas as a whole. Lets say I have a Pandas DataFrame in the following format:

Type Message
A Start
A End
A Start2
A End2

I need to combine message pairs that share the same type. Multiple packet types can occur, however I only care about merging one type. Most of the time the packets are separated in pairs so I was able to combine them using the following:

group = ("A" != df['Type'].shift()).cumsum().rename('group')
df = df.groupby(['Type', group], sort=False)['Message'].agg(''.join).reset_index().drop('group', axis=1)

Example of packet pairs:

Type Message
B StartEnd
C Start2End2
A Start
A End
B Start4End4
C Start5End5
A Start
A End

The output using the current code gives:

Type Message
B StartEnd
C Start2End2
A StartEnd
B Start4End4
C Start5End5
A StartEnd

However this only solves the problem of consecutive messages with the same type.

The desired output is as follows:

Type Message
A StartEnd
A Start2End2

I am unsure of a way to limit these groupings to only two items. I need to be able to preserve the type of each message as it is important for later processing.

CodePudding user response:

You can create consecutive groups g by compare by shifted values with pairs groups by GroupBy.cumcount with integer division by 2 and pass to final groupby:

print (df)
   Type     Message
0     C       Start
1     C         End
2     B  Start4End4
3     A       Start
4     A         End
5     A      Start2
6     A        End2
7     B    StartEnd
8     C  Start2End2
9     A       Start
10    A         End
11    B  Start4End4
12    C  Start5End5
13    A       Start
14    A         End

m = df['Type'].eq('A')
g = m.ne(m.shift()).where(m).cumsum().fillna(0)

g1 = df.groupby(g).cumcount()
g1 = g1.mask(m, g1 // 2)


df1 = (df.groupby(['Type', g, g1], sort=False)['Message']
        .agg(''.join)
        .droplevel([1,2])
        .reset_index())

print (df1)
   Type     Message
0     C       Start
1     C         End
2     B  Start4End4
3     A    StartEnd
4     A  Start2End2
5     B    StartEnd
6     C  Start2End2
7     A    StartEnd
8     B  Start4End4
9     C  Start5End5
10    A    StartEnd

CodePudding user response:

If you only care about group A, you can simplify to a single groupby.agg:

# filter Type A
m = df['Type'].eq('A')

# aggregate Type A by pairs
out = (df[m].groupby(np.arange(m.sum())//2)
        .agg({'Type': 'first', 'Message': ''.join})
      )

output:

  Type     Message
0    A    StartEnd
1    A  Start2End2

used input:

  Type     Message
0    B    StartEnd
1    C  Start2End2
2    A       Start
3    A         End
4    B  Start4End4
5    C  Start5End5
6    A      Start2
7    A        End2
  • Related