Home > Enterprise >  How to identify non-contiguous ranges in Pandas groupby?
How to identify non-contiguous ranges in Pandas groupby?

Time:12-07

Consider the following example, where I attempt to automatically extract contiguous ranges of data samples, based on numrun and test columns:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

# to demonstrate problem, replace lines 8,9,10 below with:
# 63182.8423443,0,ZZ,661,615,767
# 63183.1940174,0,ZZ,661,615,767
# 63183.6097631,0,ZZ,661,615,767
TESTDATA = StringIO("""tstamp,numrun,test,valA,valB,valC
63180.6208982,0,AA,661,615,767
63180.9724180,0,AA,661,615,767
63181.3240476,0,AA,661,615,767
63181.7246253,0,ZZ,661,615,767
63182.0752904,0,BB,661,615,767
63182.4280107,0,BB,661,615,767
63182.8423443,0,CC,661,615,767
63183.1940174,0,CC,661,615,767
63183.6097631,0,CC,661,615,767
63183.9612234,1,AA,661,615,767
63184.3441680,1,AA,661,615,767
63184.6971598,1,AA,661,615,767
63185.0964634,1,BB,661,615,767
63185.4480095,1,BB,660,615,767
63185.8035320,1,BB,661,615,767
63186.1987603,1,BB,660,615,767
63186.5500834,1,BB,661,615,767
63186.9661803,1,BB,661,615,767
63187.3186681,2,AA,660,615,767
63187.7182817,2,AA,660,615,767
63188.0696054,2,AA,660,615,767
63188.4689450,2,ZZ,660,615,767
63188.8204257,2,ZZ,660,615,767
63189.1719920,2,ZZ,660,615,767
63189.5240004,2,BB,660,615,767
63189.9121130,2,BB,660,615,767
63190.2593203,2,BB,660,615,767
""")

df_data = pd.read_csv(TESTDATA)
#print(df_data.head())

# use sort=False, else tstamp order gets mesed up!
grouped = df_data.groupby(['numrun', 'test'], sort=False)
for igx, (group_name, df_grouped) in enumerate(grouped):
  firstlastrow = df_grouped.iloc[[0, -1]]
  ttime_start = firstlastrow.iloc[0]["tstamp"]
  ttime_end = firstlastrow.iloc[1]["tstamp"]
  deltasec = ttime_end - ttime_start
  with pd.option_context('float_format', '{:f}'.format):
    print("{:02d}: {}: {:.2f} sec ({} - {})".format(igx 1, group_name, deltasec, ttime_start, ttime_end ))

The above example, in fact, works great - when I run it, I get the expected printout:

$ python3 /tmp/test.py
01: (0, 'AA'): 0.70 sec (63180.6208982 - 63181.3240476)
02: (0, 'ZZ'): 0.00 sec (63181.7246253 - 63181.7246253)
03: (0, 'BB'): 0.35 sec (63182.0752904 - 63182.4280107)
04: (0, 'CC'): 0.77 sec (63182.8423443 - 63183.6097631)
05: (1, 'AA'): 0.74 sec (63183.9612234 - 63184.6971598)
06: (1, 'BB'): 1.87 sec (63185.0964634 - 63186.9661803)
07: (2, 'AA'): 0.75 sec (63187.3186681 - 63188.0696054)
08: (2, 'ZZ'): 0.70 sec (63188.468945 - 63189.171992)
09: (2, 'BB'): 0.74 sec (63189.5240004 - 63190.2593203)

However, my actual data, has entries labeled (say) numrun = 0 and test='ZZ', which can be seen as non-contiguous: that is, you might get a range of 0/ZZ labeled data samples, then another label might follow, and then again a range of 0/ZZ labeled data samples appear.

To simulate that, replace the lines in TESTDATA as noted in the to demonstrate problem ... comment in the code, and run the script again. I get:

$ python3 /tmp/test.py
01: (0, 'AA'): 0.70 sec (63180.6208982 - 63181.3240476)
02: (0, 'ZZ'): 1.89 sec (63181.7246253 - 63183.6097631)
03: (0, 'BB'): 0.35 sec (63182.0752904 - 63182.4280107)
04: (1, 'AA'): 0.74 sec (63183.9612234 - 63184.6971598)
05: (1, 'BB'): 1.87 sec (63185.0964634 - 63186.9661803)
06: (2, 'AA'): 0.75 sec (63187.3186681 - 63188.0696054)
07: (2, 'ZZ'): 0.70 sec (63188.468945 - 63189.171992)
08: (2, 'BB'): 0.74 sec (63189.5240004 - 63190.2593203)

Note that only one (0, 'ZZ') range is identified, whereas the change in the data introduced one more (so two in total).

Is it somehow possible to instruct Pandas, to extract all the ranges in the above data - and where there is a repeated combination of labels, maybe add a random or counter string to the name, so that all contiguous ranges are identified regardless?

Basically the desired output of the changed data would be something like this:

01: (0, 'AA'): 0.70 sec (63180.6208982 - 63181.3240476)
02: (0, 'ZZ'): 0.00 sec (63181.7246253 - 63181.7246253)
03: (0, 'BB'): 0.35 sec (63182.0752904 - 63182.4280107)
04: (0, 'ZZ 1'): 0.77 sec (63182.8423443 - 63183.6097631)
05: (1, 'AA'): 0.74 sec (63183.9612234 - 63184.6971598)
06: (1, 'BB'): 1.87 sec (63185.0964634 - 63186.9661803)
07: (2, 'AA'): 0.75 sec (63187.3186681 - 63188.0696054)
08: (2, 'ZZ'): 0.70 sec (63188.468945 - 63189.171992)
09: (2, 'BB'): 0.74 sec (63189.5240004 - 63190.2593203)

CodePudding user response:

You can create consecutive groups by both columns and add to groupby:

g = df_data[['numrun', 'test']].ne(df_data[['numrun', 'test']].shift()).any(axis=1).cumsum()

grouped = df_data.groupby(['numrun', 'test', g], sort=False)
for igx, (group_name, df_grouped) in enumerate(grouped):
    #your code
  • Related