I have a data frame like this
A | B | C |
---|---|---|
32 | [1,2,2,3,4] | [a,b,c,d,e] |
35 | [5,5,7,7,7,8] | [q,w,e,r,t,y] |
so there are the repeating elements in the B column that need to remove the second, third and so on occurrences and similarly, the elements in the C columns also get deleted with the same positions as B
the final result will be:
A | B | C |
---|---|---|
32 | [1,2,3,4] | [a,b,d,e] |
35 | [5,7,8] | [q,e,y] |
it's in a pandas data frame.
CodePudding user response:
Use DataFrame.explode
by both columns, then convert df.index
to column df['index']
and remove duplicates per index
and B
column by DataFrame.drop_duplicates
and last aggregate by index
with A
column with list
:
import pandas as pd
data = {'A': [32, 35],
'B': [[1, 2, 2, 3, 4], [5, 5, 7, 7, 7, 8]],
'C': [['a', 'b', 'c', 'd', 'e'], ['q', 'w', 'e', 'r', 't', 'y']]}
df = pd.DataFrame(data)
df1 = (df.explode(['B','C'])
.reset_index()
.drop_duplicates(['index','B'])
.groupby(['index', 'A'])
.agg(list)
.reset_index(level=1)
.rename_axis(None))
print (df1)
A B C
0 32 [1, 2, 3, 4] [a, b, d, e]
1 35 [5, 7, 8] [q, e, y]
If values in A
column are always unique:
df2 = (df.explode(['B','C'])
.drop_duplicates(['A','B'])
.groupby(['A'])
.agg(list)
.reset_index())
print (df2)
EDIT: Solution for oldier pandas version:
df1 = (df[['A']].join(df[['B','C']].apply(lambda x: x.explode()))
.reset_index()
.drop_duplicates(['index','B'])
.groupby(['index', 'A'])
.agg(list)
.reset_index(level=1)
.rename_axis(None))
print (df1)
CodePudding user response:
Not very pandish, but trying to optimize memory used:
import numpy as np
# For each row, compute the indices of the first occurrence of each value in column B
df['_indices_to_keep'] = df.B.map(
lambda x: np.unique(x, return_index=True)[1] if x else None,
)
# Keep only values in columns B and C corresponding to the previously found positions
for c in ('B', 'C'):
df[c] = df.apply(lambda r: [r[c][i] for i in r['_indices_to_keep']] if r[c] else None, axis=1)
# Drop the temporary column with the indices
df.drop(columns=['_indices_to_keep'], inplace=True)
The result is:
A B C
0 32 [1, 2, 3, 4] [a, b, d, e]
1 35 [5, 7, 8] [q, e, y]
CodePudding user response:
Short solution basing on np.unique
and np.take
routines:
def drop_dups(r):
vals, idx_start = np.unique(r['B'], return_index=True)
r['B'], r['C'] = vals, np.take(r['C'], idx_start)
return r
df = pd.DataFrame(data)
df[['B', 'C']] = df[['B', 'C']].apply(drop_dups, axis=1)
print(df)
The output:
A B C
0 32 [1, 2, 3, 4] [a, b, d, e]
1 35 [5, 7, 8] [q, e, y]