Pandas: combining information from rows with consecutive duplicate values in a column


The title might be a bit confusing here, but in essence: We have a dataframe, let’s say like this:

   SKUCODE PROCCODE   RUNCODE  RSNCODE           BEGTIME           ENDTIME                 RSNNAME
0   218032        A  21183528     1010  2020-11-19 04:00  2020-11-19 04:15           Bank holidays
1   218032        A  21183959  300.318  2021-01-27 10:50  2021-01-27 10:53  Base paper roll change
2   218032        A  21183959  300.302  2021-01-27 11:50  2021-01-27 11:53    Product/grade change
3   218032        A  21183959  300.302  2021-01-27 12:02  2021-01-27 13:12    Product/grade change
4   200402        A  21184021  300.302  2021-01-27 13:16  2021-01-27 13:19    Product/grade change

Each row is a break event on a production line. As can be seen, some single break events (sharing a common RSNNAME) are spread over multiple consecutive rows (for one data-gathering reason or another), and we would like to compress each of these into just one row. In the example dataframe, rows 2 through 4 would be combined, resulting in something like this:

   SKUCODE PROCCODE   RUNCODE  RSNCODE           BEGTIME           ENDTIME                 RSNNAME
0   218032        A  21183528     1010  2020-11-19 04:00  2020-11-19 04:15           Bank holidays
1   218032        A  21183959  300.318  2021-01-27 10:50  2021-01-27 10:53  Base paper roll change
2   218032        A  21183959  300.302  2021-01-27 11:50  2021-01-27 13:19    Product/grade change

The resulting row would have the BEGTIME of the first combined row (the start of the break) and the ENDTIME of the last combined row (the end of the break), so that we capture the correct timestamps for the entire break event.

To make the problem harder still, we might want to add a time threshold for combining rows: if more than 15 minutes pass between two rows seemingly belonging to the same break event (between the ENDTIME of the former and the BEGTIME of the latter), we would treat them as separate events instead.

This is accomplished easily enough with iterrows: compare each row to the next, check whether they hold a duplicate value in the RSNNAME column, and if so pull the latter row's ENDTIME onto the former. The latter row can then be dropped as redundant. The same loop can also apply the threshold logic, deciding whether seemingly connected rows actually describe two separate events of the same nature that merely happened some time apart.
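For reference, a minimal sketch of that iterrows approach (the function name and the 15-minute default are just for illustration; BEGTIME and ENDTIME are assumed to already be parsed as datetimes):

import pandas as pd

def combine_consecutive(df, threshold=pd.Timedelta(minutes=15)):
    df = df.copy()
    drop = []    # rows that were merged into an earlier one
    prev = None  # index of the row currently being extended
    for i, row in df.iterrows():
        if (
            prev is not None
            and row["RSNNAME"] == df.at[prev, "RSNNAME"]
            and row["BEGTIME"] - df.at[prev, "ENDTIME"] <= threshold
        ):
            # same break event continues: stretch the kept row's ENDTIME
            df.at[prev, "ENDTIME"] = row["ENDTIME"]
            drop.append(i)
        else:
            prev = i
    return df.drop(index=drop)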

However, using iterrows for this purpose gets quite slow. Is there a way to solve the problem with vectorized functions or other more efficient means? I've played around with shifting rows and comparing them: shifting and comparing two adjacent rows is quite simple and lets us grab the ENDTIME of the latter row when a duplicate is detected, but we run into trouble in the case of n consecutive duplicate causes.

Another idea would be a boolean mask that checks whether the row below the current one is a duplicate. With multiple consecutive duplicate rows we then get multiple consecutive "True" labels; the last "True" before a "False" marks the last row whose ENDTIME we want to pull back to the first row of that particular run of "Trues". I have yet to find a way to implement this with vectorization, however.

CodePudding user response:

For the basic problem, drop_duplicates can be used. Essentially, you

  1. drop the duplicates on the RSNNAME column, keeping the first occurrence
  2. replace the ENDTIME column with the end times by dropping duplicates again, this time keeping the last occurrence.
(
    df.drop_duplicates("RSNNAME", keep="first").assign(
        # ENDTIME taken from the last occurrence of each cause
        ENDTIME=df.drop_duplicates("RSNNAME", keep="last").ENDTIME.values
    )
)

(By using .values we ignore the index in the assignment.)

To give you an idea for the more complex scenario: you are on the right track with your last idea. You want to .shift the column in question by one row and compare that to the original column. That gives you flags where new consecutive events start:

>>> df.RSNNAME != df.shift().RSNNAME
0     True
1     True
2     True
3    False
4    False
Name: RSNNAME, dtype: bool

To turn that into something .groupby-able, you compute the cumulative sum:

>>> (df.RSNNAME != df.shift().RSNNAME).cumsum()
0    1
1    2
2    3
3    3
4    3
Name: RSNNAME, dtype: int64
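With those labels, one way to collapse each run, for example, is to group on them and aggregate, keeping each column's first value except for ENDTIME, which takes the last (a sketch, using the column names from your example):

groups = (df.RSNNAME != df.shift().RSNNAME).cumsum()

# keep the first row's values, but take the last row's ENDTIME
out = df.groupby(groups, as_index=False).agg(
    {
        "SKUCODE": "first",
        "PROCCODE": "first",
        "RUNCODE": "first",
        "RSNCODE": "first",
        "BEGTIME": "first",
        "ENDTIME": "last",
        "RSNNAME": "first",
    }
)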

For your case, one option could be to extend the df.RSNNAME != df.shift().RSNNAME flags with a time-difference condition to get the proper group boundaries, as sketched below, but I suggest you play a bit with this shift/cumsum approach.
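A minimal sketch of such an extended flag, assuming BEGTIME and ENDTIME are datetime columns and using the 15-minute threshold from the question:

import pandas as pd

# a new event starts when the cause changes OR when the gap since the
# previous row's ENDTIME exceeds the threshold
gap = df.BEGTIME - df.ENDTIME.shift()
new_event = (df.RSNNAME != df.RSNNAME.shift()) | (gap > pd.Timedelta(minutes=15))
groups = new_event.cumsum()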

CodePudding user response:

df1 is the example dataframe from the question (the original post showed it as a screenshot).

import pandas as pd

new_df = pd.DataFrame(columns=df1.columns)

# group df1 into runs of consecutive rows sharing the same RSNNAME
for _, df in df1.groupby((df1.RSNNAME != df1.RSNNAME.shift()).cumsum()):
    if df.shape[0] == 1:
        new_df = pd.concat([new_df, df])
    else:
        # keep the run's first row, but give it the last row's ENDTIME
        df.iloc[0, df.columns.get_loc("ENDTIME")] = df.iloc[-1]["ENDTIME"]
        new_df = pd.concat([new_df, df.head(1)])

new_df then matches the desired output from the question (again shown as a screenshot in the original post).

If you really think that I was using ChatGPT, then how do you explain all this? (The original post attached screenshots here.)

Be grateful that someone is trying to help you.
