Delete rows in apply() function or depending on apply() result-CodePudding

Here I have a working solution but my question focus on how to do this the Pandas way. I assume Pandas over better solutions for this.

I use groupby() and then apply(axis=1) to compare the values in the rows of the groups. And while doing this I made the decision which row to delete.

The rule doesn't matter! In this example here the rule is that when values in column A differ only by 1 (the values are "near") then delete the second one. How the decision is made is not part of the question. There could also be a list of color names and I would say that darkblue and marineblue are "near" and one if should be deleted.

The initial data frame is that.

   X   A  B
0  A   9  0  <--- DELETE
1  A  14  1
2  A   8  2
3  A   1  3
4  A  18  4
5  B  10  5
6  B  20  6
7  B  11  7  <--- DELETE
8  B  30  8

Row index 0 should be deleted because it's value 9 is near the value 8 in row index 2. The same with row index 7: It's value 11 is "near" 10 in row index 5.

That is the code

#!/usr/bin/env python3
import pandas as pd

df = pd.DataFrame(
    {
        'X': list('AAAAABBBB'),
        'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
        'B': range(9)
    }
)
print(df)

def mark_near_neighbors(group):
    # I snip the decission process here.
    # Delete 9 because it is "near" 8.

    default_result = pd.Series(
        data=[False] * len(group),
        index=['Delete'] * len(group)
    )

    if group.X.iloc[0] is 'A':
        # the 9
        default_result.iloc[0] = True
    else:
        # the 11
        default_result.iloc[2] = True

    return default_result

result = df.groupby('X').apply(mark_near_neighbors)
result = result.reset_index(drop=True)
print(result)

df = df.loc[~result]
print(df)

So in the end I use a "boolean indexing thing" to solve this

0     True
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8    False
dtype: bool

But is there a better way to do this?

CodePudding user response：

Initialize the dataframe

df = pd.DataFrame([

    ['A', 9, 0],
    ['A', 14, 1],
    ['A', 8, 2],
    ['A', 1, 3],
    ['B', 18, 4],
    ['B', 10, 5],
    ['B', 20, 6],
    ['B', 11, 7],
    ['B', 30, 8],

], columns=['X', 'A', 'B'])

Sort the dataframe based on A column

df = df.sort_values('A')

Find the difference between values

df["diff" ] =df.groupby('X')['A'].diff()

Select the rows where the difference is not 1

result = df[df["diff"] != 1.0]

Drop the extra column and sort by index to get the initial dataframe

result.drop("diff", axis=1, inplace=True)
result = result.sort_index()

Sample output

    X   A   B
1   A   14  1
2   A   8   2
3   A   1   3
4   B   18  4
5   B   10  5
6   B   20  6
8   B   30  8

CodePudding user response：

IIUC, you can use numpy broadcasting to compare all values within a group. Keeping everything with apply here as it seems wanted:

def mark_near_neighbors(group, thresh=1):
    a = group.to_numpy().astype(float)
    idx = np.argsort(a)
    b = a[idx]
    d = abs(b-b[:,None])
    d[np.triu_indices(d.shape[0])] = thresh 1
    return pd.Series((d>thresh).all(1)[np.argsort(idx)], index=group.index)

out = df[df.groupby('X')['A'].apply(mark_near_neighbors)]

output:

   X   A  B
1  A  14  1
2  A   8  2
3  A   1  3
4  A  18  4
5  B  10  5
6  B  20  6
8  B  30  8