Home > database >  How to delete the consecutive duplicate elements from list also reflect that in another list in same
How to delete the consecutive duplicate elements from list also reflect that in another list in same

Time:12-14

I have a data frame like this

A B C
32 [1,2,2,3,4] [a,b,c,d,e]
35 [5,5,7,7,7,8] [q,w,e,r,t,y]

so there are the repeating elements in the B column that need to remove the second, third and so on occurrences and similarly, the elements in the C columns also get deleted with the same positions as B

the final result will be:

A B C
32 [1,2,3,4] [a,b,d,e]
35 [5,7,8] [q,e,y]

it's in a pandas data frame.

CodePudding user response:

Use DataFrame.explode by both columns, then convert df.index to column df['index'] and remove duplicates per index and B column by DataFrame.drop_duplicates and last aggregate by index with A column with list:

import pandas as pd
data = {'A': [32, 35],
        'B': [[1, 2, 2, 3, 4], [5, 5, 7, 7, 7, 8]],
        'C': [['a', 'b', 'c', 'd', 'e'], ['q', 'w', 'e', 'r', 't', 'y']]} 
df = pd.DataFrame(data)



df1 = (df.explode(['B','C'])
         .reset_index()
         .drop_duplicates(['index','B'])
         .groupby(['index', 'A'])
         .agg(list)
         .reset_index(level=1)
         .rename_axis(None))

print (df1)
    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]

If values in A column are always unique:

df2 = (df.explode(['B','C'])
         .drop_duplicates(['A','B'])
         .groupby(['A'])
         .agg(list)
         .reset_index())

print (df2)

EDIT: Solution for oldier pandas version:

df1 = (df[['A']].join(df[['B','C']].apply(lambda x: x.explode()))
         .reset_index()
         .drop_duplicates(['index','B'])
         .groupby(['index', 'A'])
         .agg(list)
         .reset_index(level=1)
         .rename_axis(None))

print (df1)

CodePudding user response:

Not very pandish, but trying to optimize memory used:

import numpy as np

# For each row, compute the indices of the first occurrence of each value in column B
df['_indices_to_keep'] = df.B.map(
        lambda x: np.unique(x, return_index=True)[1] if x else None,
)

# Keep only values in columns B and C corresponding to the previously found positions
for c in ('B', 'C'):
    df[c] = df.apply(lambda r: [r[c][i] for i in r['_indices_to_keep']] if r[c] else None, axis=1)

# Drop the temporary column with the indices
df.drop(columns=['_indices_to_keep'], inplace=True)

The result is:

    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]

CodePudding user response:

Short solution basing on np.unique and np.take routines:

def drop_dups(r):
    vals, idx_start = np.unique(r['B'], return_index=True)
    r['B'], r['C'] = vals, np.take(r['C'], idx_start)
    return r

df = pd.DataFrame(data)
df[['B', 'C']] = df[['B', 'C']].apply(drop_dups, axis=1)
print(df)

The output:

   A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]
  • Related