Home > Mobile >  Collapse identical rows / columns in pandas DataFrame to intervals
Collapse identical rows / columns in pandas DataFrame to intervals

Time:12-12

Let's imagine a DataFrame with some patterns that do intersect.
A sample code for a minimal reproducible example:

import pandas as pd
from random import randint as ri
from random import choice

get_pattern = lambda: [i for ii in [[choice([True,False])]*ri(2,6) for _ in range(0,5)] for i in ii]
patterns = [get_pattern()[:15] for _ in range(0,3)]

df = pd.DataFrame({
    'A': patterns[0],
    'B': patterns[1],
    'C': patterns[2]
}, index = pd.interval_range(start=0, end=15, closed='both')).T.replace(False,'')
Output:
[0, 1] [1, 2] [2, 3] [3, 4] [4, 5] [5, 6] [6, 7] [7, 8] [8, 9] [9, 10] [10, 11] [11, 12] [12, 13] [13, 14] [14, 15]
A True True True True
B True True True True True True True True True True True
C True True True True True True True True

I did random to produce different results for the question on purpose. I also did a transpose for the table to fit better and use less vertical space, Falses are removed for a better visual representation.
In this particular case one can observe identical columns:

  • from 0 to 4
  • from 4 to 8
  • from 8 to 12
  • from 12 to 15

The question is how to collapse those identical columns (or rows without transpose) and change the corresponding intervals.

After some searching the only solution I've come up with was iterating through rows / columns and searching for duplicates. Than assigning left and right interval for a newly created DataFrame.

I suppose there might be a more elegant solution to get this:

intervals = [
    pd.Interval(left=0, right=4, closed='both'),
    pd.Interval(left=4, right=8, closed='both'),
    pd.Interval(left=8, right=12, closed='both'),
    pd.Interval(left=12, right=15, closed='both')
]
pd.DataFrame({
    'A': [False,False,True,False],
    'B': [True,True,False,True],
    'C': [False,True,True,False]
}, pd.IntervalIndex(intervals)).T.replace(False,'')
Output
[0, 4] [4, 8] [8, 12] [12, 15]
A True
B True True True
C True True

CodePudding user response:

One way using shift-groupby trick:

m = df.shift().ne(df).any(axis=1).cumsum()

res = []
for _, d in df.groupby(m):
    s = d.iloc[0]
    s.name = pd.Interval(d.index.left[0], d.index.right[-1], "both")
    res.append(s)
    
new_df = pd.concat(res, axis=1)

Output:

  [0, 4] [4, 8] [8, 12] [12, 15]
A                  True         
B   True   True             True
C          True    True         
  • Related