Home > Software engineering >  Pandas force reindex duplicate axis
Pandas force reindex duplicate axis

Time:12-15

I want reindex dataframe's rows by specific column, let's say that dataframe contain collumn colX. That's mean I need call method df.set_index(colX)

Consider df.head() is

colX colY
2.71 foo1
3.14 foo2
6.9 foo3
6.9 foo4
9.6 foo5

When I reindex it

df=df.set_index('colX')
df2=df.reindex(index=[9.6,6.9,6.9,3.14,2.71])

I expect the result df2.head() is

colX colY
9.6 foo5
6.9 foo3 # duplicate index in colX
6.9 foo4 # duplicate index in colX
3.14 foo2
2.71 foo1

Or this

colX colY
9.6 foo5
6.9 foo4 # duplicate index in colX
6.9 foo3 # duplicate index in colX
3.14 foo2
2.71 foo1

Reindex won't work if there's duplicate axis.

ValueError: cannot reindex from a duplicate axis

Note: df was created by df=pd.read_csv('foobar.csv')

CodePudding user response:

Error means there is at least one duplicated values, so reindex raise error, because working only with unique values in list passed to reindex.


Solution with helper column created by GroupBy.cumcount and DataFrame.merge with default inner join:

df['g'] = df.groupby('colX').cumcount()

L = [9.6,6.9,6.9,3.14,2.71]

df11 = pd.DataFrame({'colX': L})
df11['g'] = df11.groupby('colX').cumcount()
print (df11)
   colX  g
0  9.60  0
1  6.90  0
2  6.90  1
3  3.14  0
4  2.71  0

df = df.merge(df11, on=['colX','g'])
print (df)
   colX  colY  g
0  2.71  foo1  0
1  3.14  foo2  0
2  6.90  foo3  0
3  6.90  foo4  1
4  9.60  foo5  0

Or with reindex:

df['g'] = df.groupby('colX').cumcount()

L = [9.6,6.9,6.9,3.14,2.71]

df11 = pd.DataFrame({'colX': L})
df11['g'] = df11.groupby('colX').cumcount()
print (df11)

df=df.set_index(['colX', 'g']).reindex(df11.set_index(['colX','g']).index)
print (df)
        colY
colX g      
9.60 0  foo5
6.90 0  foo3
     1  foo4
3.14 0  foo2
2.71 0  foo1

Wrong first solution:

df=df.set_index('colX')
df2=df.loc[[9.6,6.9,6.9,3.14,2.71]]
print (df2)
      colY
colX      
9.60  foo5
6.90  foo3 > combinations of 6.9 (2 times in original * 2 times in list = 4)
6.90  foo4
6.90  foo3
6.90  foo4
3.14  foo2
2.71  foo1
  • Related