I'm trying to delete repeating rows in a data frame based on the following condition: If the value of the column pagePath is the same as in the previous row and the SessionId is the same, I need this row deleted. If the SessionId is different, then the repeating pagePath shouldn't be deleted. This is what I tried:
data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)

for i, _ in data.iterrows():  # i = index, _ = row
    if i != 0:
        try:
            while data.SessionId[i] == data.SessionId[i - 1] and data.pagePath[i] == data.pagePath[i - 1]:
                data = data.drop(i - 1)
                data = data.reset_index(drop=True)
        except KeyError:
            continue
As you can see, I'm getting the KeyError exception, though I don't think that's a big problem, as the code does what it should on a DataFrame with 1,000 rows. The real problem is that it doesn't work on a larger dataset with 6.5 million rows: it either never finishes, or I get a SIGKILL. I am well aware that I shouldn't iterate over a DataFrame with a for-loop, but I couldn't find a better solution and would be thankful if you could help me improve my code.
CodePudding user response:
Anyway, as per usual I had to solve this one myself; it wouldn't have been possible without @np8's comment though. For anybody who might be interested:
locations = []
data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)

for i, _ in data.iterrows():  # i = index, _ = row
    if i != 0:
        try:
            if data.SessionId[i] == data.SessionId[i - 1] and data.pagePath[i] == data.pagePath[i - 1]:
                locations.append(i)
        except KeyError as e:
            print(e)
            continue

data_cleaned = data.drop(index=locations)
This took 470 seconds on the 6.5-million-row DataFrame, which is okay considering the code wasn't finishing at all before.
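The loop that collects indices can also be replaced entirely by a vectorized comparison: shift() aligns each row with the previous one, so the same-session consecutive-duplicate test becomes a single boolean mask. A sketch on made-up sample data (column names as in the question; the Datum values are just placeholders):

```python
import pandas as pd

data = pd.DataFrame({
    'SessionId': [1, 1, 1, 2, 2],
    'pagePath': ['/a', '/a', '/b', '/b', '/b'],
    'Datum': [1, 2, 3, 4, 5],
})
data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)

# True where both SessionId and pagePath match the previous row
dup = (data['SessionId'].eq(data['SessionId'].shift())
       & data['pagePath'].eq(data['pagePath'].shift()))
data_cleaned = data[~dup]
```

Here the second '/a' in session 1 and the second '/b' in session 2 are dropped, while the '/b' that opens session 2 is kept because the SessionId differs.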
CodePudding user response:
groupby on SessionId and pagePath and find the cumulative count of each pair's occurrences; then take the difference of consecutive elements using np.ediff1d and assign it to df['cumcount']. Since we want to filter out consecutive duplicates, we keep only the rows where df['cumcount'] != 1:
import numpy as np

cols = df.columns
# within each (SessionId, pagePath) group, cumcount runs 0, 1, 2, ...,
# so a row repeating the previous row's pair has a diff of exactly 1
df['cumcount'] = np.concatenate(([0], np.ediff1d(df.groupby(['SessionId', 'pagePath']).cumcount())))
out = df.loc[df['cumcount'] != 1, cols]
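On a small frame (column names from the question) the snippet behaves like this. One caveat worth knowing: because the diff can span two different groups, a page that reappears later in the same session after other pages (e.g. /a, /b, /a) may be mis-dropped; the shift()-based mask above does not have that problem.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SessionId': [1, 1, 1, 2],
    'pagePath': ['/a', '/a', '/b', '/a'],
})

cols = df.columns
# cumcounts are [0, 1, 0, 0]; diffs (with a leading 0) are [0, 1, -1, 0]
df['cumcount'] = np.concatenate(([0], np.ediff1d(df.groupby(['SessionId', 'pagePath']).cumcount())))
out = df.loc[df['cumcount'] != 1, cols]
```

Only the second '/a' of session 1 has a diff of 1, so it is the only row removed.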