I have a df:
        pageid
sid vid
1   ABC    dog
    ABC    dog
    ABC    dog
    ABC    dog
2   DEF    cat
    DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
    GHI    dog
Constructor:
import pandas as pd
i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC', 'ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI',
      'GHI', 'GHI']],
    names=('sid', 'vid')
)
df = pd.DataFrame({
    'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
               'pig', 'cat', 'dog', 'dog']
}, index=i)
I want to remove duplicates from the pageid column when they occur within a session (sid), and only when they are sequential, up to n repeats. The only examples I have found use .shift(), which would work well if I didn't have to worry about more than one duplicate in a row. Unfortunately, I have some instances with something like n = 30 sequential duplicates.
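For reference, this is the kind of .shift() comparison I mean; it only flags a row that repeats its immediate predecessor, with no way to cap a run at n (my sketch, using the df below):

# naive approach: keep a row only when it differs from the previous row;
# removes every sequential repeat (and ignores session boundaries), so it
# cannot keep "up to n" occurrences
df[df['pageid'].ne(df['pageid'].shift())]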
Before:
        pageid
sid vid
1   ABC    dog
    ABC    dog
    ABC    dog
    ABC    dog
2   DEF    cat
    DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
    GHI    dog
After:
        pageid
sid vid
1   ABC    dog
2   DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
CodePudding user response:
global duplicates
You can reset_index and compute duplicated:
df[~df.reset_index().duplicated().values]
Output:
        pageid
sid vid
1   ABC    dog
2   DEF    cat
    DEF    pig
3   GHI    pig
    GHI    cat
    GHI    dog
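Equivalently (my restatement, not part of the original answer), you can round-trip through drop_duplicates, which also keeps the first occurrence of each (sid, vid, pageid) combination:

df.reset_index().drop_duplicates().set_index(['sid', 'vid'])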
sequential duplicates
# flatten the index so sid and vid take part in the row comparison
df2 = df[['pageid']].reset_index()
# drop rows whose full (sid, vid, pageid) tuple repeats the previous row
df[~df2.eq(df2.shift()).all(1).values]
Output:
        pageid
sid vid
1   ABC    dog
2   DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
sequential duplicates with threshold
thresh = 3
# flatten the index so sid and vid take part in the row comparison
df2 = df[['pageid']].reset_index()
# True where the full row repeats the previous one; the cumsum, grouped per
# (sid, vid, pageid) key, counts how many times each value has repeated so far
m = df2.eq(df2.shift()).all(1).groupby(df.set_index('pageid', append=True).index).cumsum()
# keep rows that have repeated fewer than thresh times
df.loc[m.lt(thresh).values]
Output (example threshold: 3):
        pageid
sid vid
1   ABC    dog
    ABC    dog
    ABC    dog
2   DEF    cat
    DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
    GHI    dog
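One caveat (my note, not part of the original answer): the cumsum accumulates per (sid, vid, pageid) key, so two separate runs of the same value within one session share a counter. If you need the count to reset for each run, a run id built from the same comparison works (a sketch under that assumption, reusing thresh from above):

df2 = df[['pageid']].reset_index()
# run_id increments whenever the (sid, vid, pageid) row changes, so the
# per-group cumcount restarts at 0 for every new run of repeats
run_id = df2.ne(df2.shift()).any(axis=1).cumsum()
df.loc[(df2.groupby(run_id).cumcount() < thresh).values]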
CodePudding user response:
I think you can use shift on a groupby, then rolling().sum(), again on a groupby:
# lazy groupby
groups = df.groupby(level=[0,1])
# if this is equal to the previous data
df['shifted'] = groups['pageid'].shift() == df['pageid']
# threshold
thresh = 2
mask = groups['shifted'].rolling(thresh).sum().fillna(0) < thresh
df.loc[mask.values]
Output:
        pageid  shifted
sid vid
1   ABC    dog    False
    ABC    dog     True
2   DEF    cat    False
    DEF    cat     True
    DEF    pig    False
    DEF    cat    False
3   GHI    pig    False
    GHI    cat    False
    GHI    dog    False
    GHI    dog     True
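If you don't want the helper column in the result, you can select pageid back out (a small cleanup step, my addition):

df.loc[mask.values, ['pageid']]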
CodePudding user response:
If you can count on the order of pageid within each multi-index, one option would be to step through each element and preserve the state of whether it's the same as the element before it. For example:
class Duplicated():
    def __init__(self):
        self.last = None
    def is_duplicate(self, x):
        if x == self.last:
            return True
        else:
            self.last = x
            return False
i = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
     ['ABC', 'ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI',
      'GHI', 'GHI']],
    names=('sid', 'vid')
)
df = pd.DataFrame({
    'pageid': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'pig', 'cat',
               'pig', 'cat', 'dog', 'dog']
}, index=i)
dupe_checker = Duplicated()
# iterate (index, value) pairs so a repeat only counts as a duplicate when
# both the (sid, vid) index and the pageid match the previous row
df['duped'] = [dupe_checker.is_duplicate(x) for x in df['pageid'].items()]
df
Then you can simply drop rows that are duplicated.
df = df[~df['duped']]
df.drop(columns='duped', inplace=True)
Gives
        pageid
sid vid
1   ABC    dog
2   DEF    cat
    DEF    pig
    DEF    cat
3   GHI    pig
    GHI    cat
    GHI    dog
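This removes every sequential duplicate. If you need the question's "up to n times" behaviour, the same stateful idea extends with a counter (a sketch, my extension of the answer above; DuplicatedUpToN is a hypothetical name):

class DuplicatedUpToN():
    def __init__(self, n):
        self.n = n          # keep at most n sequential occurrences
        self.last = None
        self.count = 0
    def is_duplicate(self, x):
        if x == self.last:
            self.count += 1
            return self.count >= self.n   # True from the (n+1)-th occurrence on
        self.last = x
        self.count = 0      # a new value starts a fresh run
        return False

checker = DuplicatedUpToN(3)
df['duped'] = [checker.is_duplicate(x) for x in df['pageid'].items()]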