I'm trying to delete repeating rows in a data frame based on the following condition: If the value of the column pagePath is the same as in the previous row and the SessionId is the same, I need this row deleted. If the SessionId is different, then the repeating pagePath shouldn't be deleted. This is what I tried:
data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)

for i, _ in data.iterrows():  # i = index, _ = row
    if i != 0:
        try:
            while data.SessionId[i] == data.SessionId[i - 1] and data.pagePath[i] == data.pagePath[i - 1]:
                data = data.drop(i - 1)
                data = data.reset_index(drop=True)
        except KeyError:
            continue
As you can see, I'm getting the KeyError exception, though I don't think that's a big problem, as the code does what it should on a DataFrame with 1,000 rows. The real problem is that it doesn't work on a larger dataset with 6.5 million rows: it either never finishes, or I get a SIGKILL. I am well aware that I shouldn't iterate over a DataFrame with a for-loop, but I couldn't find a better solution and would be thankful if you could help me improve my code.
CodePudding user response:
Anyway, as per usual I had to solve this one myself; it wouldn't have been possible without @np8's comment though. For anybody who might be interested:
locations = []
data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)

for i, _ in data.iterrows():  # i = index, _ = row
    if i != 0:
        try:
            if data.SessionId[i] == data.SessionId[i - 1] and data.pagePath[i] == data.pagePath[i - 1]:
                locations.append(i)
        except KeyError as e:
            print(e)
            continue

data_cleaned = data.drop(index=locations)
This took 470 seconds on the 6.5-million-row DataFrame, which is okay considering the code wasn't finishing at all before.
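The loop that collects indices can also be replaced entirely by a vectorized comparison: shift() aligns each row with the previous one, so the same-session consecutive-duplicate test becomes a single boolean mask. A sketch on made-up sample data (column names as in the question; the Datum values are just placeholders):

```python
import pandas as pd

data = pd.DataFrame({
    'SessionId': [1, 1, 1, 2, 2],
    'pagePath': ['/a', '/a', '/b', '/b', '/b'],
    'Datum': [1, 2, 3, 4, 5],
})
data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)

# True where both SessionId and pagePath match the previous row
dup = (data['SessionId'].eq(data['SessionId'].shift())
       & data['pagePath'].eq(data['pagePath'].shift()))
data_cleaned = data[~dup]
```

Here the second '/a' in session 1 and the second '/b' in session 2 are dropped, while the '/b' that opens session 2 is kept because the SessionId differs.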
CodePudding user response:
groupby on SessionId and pagePath and find the cumulative count of each pair's occurrences; then take the difference of consecutive elements using np.ediff1d and assign it to df['cumcount']. Since we want to filter out consecutive duplicates, we keep only the rows where df['cumcount'] != 1:
import numpy as np

cols = df.columns
# within each (SessionId, pagePath) group, cumcount runs 0, 1, 2, ...,
# so a row repeating the previous row's pair has a diff of exactly 1
df['cumcount'] = np.concatenate(([0], np.ediff1d(df.groupby(['SessionId', 'pagePath']).cumcount())))
out = df.loc[df['cumcount'] != 1, cols]
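On a small frame (column names from the question) the snippet behaves like this. One caveat worth knowing: because the diff can span two different groups, a page that reappears later in the same session after other pages (e.g. /a, /b, /a) may be mis-dropped; the shift()-based mask above does not have that problem.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SessionId': [1, 1, 1, 2],
    'pagePath': ['/a', '/a', '/b', '/a'],
})

cols = df.columns
# cumcounts are [0, 1, 0, 0]; diffs (with a leading 0) are [0, 1, -1, 0]
df['cumcount'] = np.concatenate(([0], np.ediff1d(df.groupby(['SessionId', 'pagePath']).cumcount())))
out = df.loc[df['cumcount'] != 1, cols]
```

Only the second '/a' of session 1 has a diff of 1, so it is the only row removed.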