Delete repeating rows in a DataFrame based on a condition pandas


I'm trying to delete repeating rows in a data frame based on the following condition: If the value of the column pagePath is the same as in the previous row and the SessionId is the same, I need this row deleted. If the SessionId is different, then the repeating pagePath shouldn't be deleted. This is what I tried:

data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)
i = 0
for i, _ in data.iterrows():  # i = index, _ = row
    if i != 0:
        try:
            while data.SessionId[i] == data.SessionId[i - 1] and data.pagePath[i] == data.pagePath[i - 1]:
                data = data.drop(i - 1)
                data = data.reset_index(drop=True)
        except KeyError:
            continue

As you can see, I'm getting a KeyError exception, though I don't think that's the real problem, since the code does what it should on a DataFrame with 1,000 rows. The only problem is that it doesn't work on a larger dataset with 6.5 million rows: it either never finishes, or I get a SIGKILL. I'm well aware that I shouldn't use a for loop over a dataset this size, but I couldn't find a better solution and would be thankful if you could help me improve my code.
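To illustrate the intended behaviour, here is a tiny frame with made-up column values: the second '/home' repeats within session 1 and should be dropped, while the final '/cart' repeats the pagePath in a different session and should stay. One way to express "same pagePath as the previous row of the same session" is a per-session shift:

```python
import pandas as pd

# Hypothetical sample data illustrating the rule.
data = pd.DataFrame({
    "SessionId": [1, 1, 1, 2],
    "Datum":     [1, 2, 3, 1],
    "pagePath":  ["/home", "/home", "/cart", "/cart"],
})
data = data.sort_values(["SessionId", "Datum"], ascending=True, ignore_index=True)

# Previous pagePath within the same session (NaN for each session's first row).
prev_in_session = data.groupby("SessionId")["pagePath"].shift()
cleaned = data[data["pagePath"].ne(prev_in_session)]

print(cleaned["pagePath"].tolist())  # ['/home', '/cart', '/cart']
```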

CodePudding user response:

Anyway, as usual I ended up solving this one myself; it wouldn't have been possible without @np8's comment, though. For anybody who might be interested:

locations = []
data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)
for i, _ in data.iterrows():  # i = index, _ = row
    if i != 0:
        try:
            if data.SessionId[i] == data.SessionId[i - 1] and data.pagePath[i] == data.pagePath[i - 1]:
                locations.append(i)
        except KeyError as e:
            print(e)
            continue

data_cleaned = data.drop(index=locations)

This took 470 seconds on a 6.5-million-row DataFrame, which is acceptable considering the previous code never finished executing at all.
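The same "collect repeats, then drop" logic can be written as a vectorized mask with shift, avoiding iterrows entirely; on millions of rows this typically runs in seconds. A sketch assuming the same column names, with made-up sample values:

```python
import pandas as pd

data = pd.DataFrame({
    "SessionId": [1, 1, 1, 2, 2],
    "Datum":     [1, 2, 3, 1, 2],
    "pagePath":  ["/a", "/a", "/b", "/b", "/b"],
})
data = data.sort_values(["SessionId", "Datum"], ascending=True, ignore_index=True)

# Same condition as the loop: compare every row with the row directly above it.
is_repeat = data["SessionId"].eq(data["SessionId"].shift()) \
    & data["pagePath"].eq(data["pagePath"].shift())
data_cleaned = data[~is_repeat]

print(data_cleaned["pagePath"].tolist())  # ['/a', '/b', '/b']
```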

CodePudding user response:

groupby on SessionId and pagePath and take the cumulative count of each pair's occurrences; then take the difference of consecutive elements with np.ediff1d and assign it to df['cumcount']. Since we want to filter out consecutive duplicates, we keep only the rows where df['cumcount'] != 1:

import numpy as np

cols = df.columns
df['cumcount'] = np.concatenate(([0], np.ediff1d(df.groupby(['SessionId', 'pagePath']).cumcount())))
out = df.loc[df['cumcount'] != 1, cols]
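A worked example on a toy frame (column values are made up). Note that the diff-of-cumcount trick assumes repeats of a (SessionId, pagePath) pair sit next to each other after sorting; if pages alternate within one session (e.g. /a, /b, /a, /b), the counter can also step by 1 across different pages and drop a row it shouldn't, so check your data against the shift-based comparison if in doubt:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SessionId": [1, 1, 1, 2, 2],
    "Datum":     [1, 2, 3, 1, 2],
    "pagePath":  ["/a", "/a", "/b", "/b", "/b"],
})

cols = df.columns  # captured before 'cumcount' is added, so it excludes it
# Occurrence counter per (SessionId, pagePath) pair: 0, 1, 2, ...
# A consecutive repeat makes the counter step by exactly 1.
df['cumcount'] = np.concatenate(([0], np.ediff1d(df.groupby(['SessionId', 'pagePath']).cumcount())))
out = df.loc[df['cumcount'] != 1, cols]

print(out["pagePath"].tolist())  # ['/a', '/b', '/b']
```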