How to delete the consecutive duplicate elements from list also reflect that in another list in same-CodePudding

I have a data frame like this

A	B	C
32	[1,2,2,3,4]	[a,b,c,d,e]
35	[5,5,7,7,7,8]	[q,w,e,r,t,y]

so there are the repeating elements in the B column that need to remove the second, third and so on occurrences and similarly, the elements in the C columns also get deleted with the same positions as B

the final result will be:

A	B	C
32	[1,2,3,4]	[a,b,d,e]
35	[5,7,8]	[q,e,y]

it's in a pandas data frame.

CodePudding user response：

Use DataFrame.explode by both columns, then convert df.index to column df['index'] and remove duplicates per index and B column by DataFrame.drop_duplicates and last aggregate by index with A column with list:

import pandas as pd
data = {'A': [32, 35],
        'B': [[1, 2, 2, 3, 4], [5, 5, 7, 7, 7, 8]],
        'C': [['a', 'b', 'c', 'd', 'e'], ['q', 'w', 'e', 'r', 't', 'y']]} 
df = pd.DataFrame(data)



df1 = (df.explode(['B','C'])
         .reset_index()
         .drop_duplicates(['index','B'])
         .groupby(['index', 'A'])
         .agg(list)
         .reset_index(level=1)
         .rename_axis(None))

print (df1)
    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]

If values in A column are always unique:

df2 = (df.explode(['B','C'])
         .drop_duplicates(['A','B'])
         .groupby(['A'])
         .agg(list)
         .reset_index())

print (df2)

EDIT: Solution for oldier pandas version:

df1 = (df[['A']].join(df[['B','C']].apply(lambda x: x.explode()))
         .reset_index()
         .drop_duplicates(['index','B'])
         .groupby(['index', 'A'])
         .agg(list)
         .reset_index(level=1)
         .rename_axis(None))

print (df1)

CodePudding user response：

Not very pandish, but trying to optimize memory used:

import numpy as np

# For each row, compute the indices of the first occurrence of each value in column B
df['_indices_to_keep'] = df.B.map(
        lambda x: np.unique(x, return_index=True)[1] if x else None,
)

# Keep only values in columns B and C corresponding to the previously found positions
for c in ('B', 'C'):
    df[c] = df.apply(lambda r: [r[c][i] for i in r['_indices_to_keep']] if r[c] else None, axis=1)

# Drop the temporary column with the indices
df.drop(columns=['_indices_to_keep'], inplace=True)

The result is:

    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]

CodePudding user response：

Short solution basing on np.unique and np.take routines:

def drop_dups(r):
    vals, idx_start = np.unique(r['B'], return_index=True)
    r['B'], r['C'] = vals, np.take(r['C'], idx_start)
    return r

df = pd.DataFrame(data)
df[['B', 'C']] = df[['B', 'C']].apply(drop_dups, axis=1)
print(df)

The output:

   A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]