I have a df:
        pageid
sid vid
1   ABC    dog
    ABC    dog
    ABC    dog
    ABC    dog
2   DEF    cat
    DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
    GHI    dog
Constructor:
import pandas as pd
i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC', 'ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI',
      'GHI', 'GHI']],
    names=('sid', 'vid')
)
df = pd.DataFrame({
    'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
               'pig', 'cat', 'dog', 'dog']
}, index=i)
I want to remove duplicates from the pageid column when they occur within a session (sid), and only when they are sequential, up to n repeats. The only examples I have found use .shift(), which would work well if I didn't have to worry about more than one duplicate in a row. Unfortunately, I have some instances with something like n = 30 sequential duplicates.
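For reference, this is the kind of .shift() comparison I mean; it only flags a row that repeats its immediate predecessor, with no way to cap a run at n (my sketch, using the df below):

# naive approach: keep a row only when it differs from the previous row;
# removes every sequential repeat (and ignores session boundaries), so it
# cannot keep "up to n" occurrences
df[df['pageid'].ne(df['pageid'].shift())]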
Before:
        pageid
sid vid
1   ABC    dog
    ABC    dog
    ABC    dog
    ABC    dog
2   DEF    cat
    DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
    GHI    dog
After:
        pageid
sid vid
1   ABC    dog
2   DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
CodePudding user response:
global duplicates
You can reset_index and compute duplicated:
df[~df.reset_index().duplicated().values]
Output:
        pageid
sid vid
1   ABC    dog
2   DEF    cat
    DEF    pig
3   GHI    pig
    GHI    cat
    GHI    dog
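Equivalently (my restatement, not part of the original answer), you can round-trip through drop_duplicates, which also keeps the first occurrence of each (sid, vid, pageid) combination:

df.reset_index().drop_duplicates().set_index(['sid', 'vid'])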
sequential duplicates
# flatten the index so sid and vid take part in the row comparison
df2 = df[['pageid']].reset_index()
# drop rows whose full (sid, vid, pageid) tuple repeats the previous row
df[~df2.eq(df2.shift()).all(1).values]
Output:
        pageid
sid vid
1   ABC    dog
2   DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
sequential duplicates with threshold
thresh = 3
# flatten the index so sid and vid take part in the row comparison
df2 = df[['pageid']].reset_index()
# True where the full row repeats the previous one; the cumsum, grouped per
# (sid, vid, pageid) key, counts how many times each value has repeated so far
m = df2.eq(df2.shift()).all(1).groupby(df.set_index('pageid', append=True).index).cumsum()
# keep rows that have repeated fewer than thresh times
df.loc[m.lt(thresh).values]
Output (example threshold: 3):
        pageid
sid vid
1   ABC    dog
    ABC    dog
    ABC    dog
2   DEF    cat
    DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
    GHI    dog
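One caveat (my note, not part of the original answer): the cumsum accumulates per (sid, vid, pageid) key, so two separate runs of the same value within one session share a counter. If you need the count to reset for each run, a run id built from the same comparison works (a sketch under that assumption, reusing thresh from above):

df2 = df[['pageid']].reset_index()
# run_id increments whenever the (sid, vid, pageid) row changes, so the
# per-group cumcount restarts at 0 for every new run of repeats
run_id = df2.ne(df2.shift()).any(axis=1).cumsum()
df.loc[(df2.groupby(run_id).cumcount() < thresh).values]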
CodePudding user response:
I think you can use shift on a groupby, then rolling().sum(), again on a groupby:
# lazy groupby
groups = df.groupby(level=[0,1])
# if this is equal to the previous data
df['shifted'] = groups['pageid'].shift() == df['pageid']
# threshold
thresh = 2
mask = groups['shifted'].rolling(thresh).sum().fillna(0) < thresh
df.loc[mask.values]
Output:
        pageid  shifted
sid vid
1   ABC    dog    False
    ABC    dog     True
2   DEF    cat    False
    DEF    cat     True
    DEF    pig    False
    DEF    cat    False
3   GHI    pig    False
    GHI    cat    False
    GHI    dog    False
    GHI    dog     True
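If you don't want the helper column in the result, you can select pageid back out (a small cleanup step, my addition):

df.loc[mask.values, ['pageid']]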
CodePudding user response:
If you can count on the order of pageid within each multi-index, one option would be to step through each element and preserve the state of whether it's the same as the element before it. For example:
class Duplicated():
    def __init__(self):
        self.last = None
    def is_duplicate(self, x):
        if x == self.last:
            return True
        else:
            self.last = x
            return False
i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC', 'ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI',
      'GHI', 'GHI']],
    names=('sid', 'vid')
)
df = pd.DataFrame({
    'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
               'pig', 'cat', 'dog', 'dog']
}, index=i)
dupe_checker = Duplicated()
# iterate (index, value) pairs so a repeat only counts as a duplicate when
# both the (sid, vid) index and the pageid match the previous row
df['duped'] = [dupe_checker.is_duplicate(x) for x in df['pageid'].items()]
df
Then you can simply drop rows that are duplicated.
df = df[~df['duped']]
df.drop(columns='duped', inplace=True)
Gives
        pageid
sid vid
1   ABC    dog
2   DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
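This removes every sequential duplicate. If you need the question's "up to n times" behaviour, the same stateful idea extends with a counter (a sketch, my extension of the answer above; DuplicatedUpToN is a hypothetical name):

class DuplicatedUpToN():
    def __init__(self, n):
        self.n = n          # keep at most n sequential occurrences
        self.last = None
        self.count = 0
    def is_duplicate(self, x):
        if x == self.last:
            self.count += 1
            return self.count >= self.n   # True from the (n+1)-th occurrence on
        self.last = x
        self.count = 0      # a new value starts a fresh run
        return False

checker = DuplicatedUpToN(3)
df['duped'] = [checker.is_duplicate(x) for x in df['pageid'].items()]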