I want reindex dataframe's rows by specific column, let's say that dataframe contain collumn colX
. That's mean I need call method df.set_index(colX)
Consider df.head() is
colX colY
2.71 foo1
3.14 foo2
6.9 foo3
6.9 foo4
9.6 foo5
When I reindex it
df=df.set_index('colX')
df2=df.reindex(index=[9.6,6.9,6.9,3.14,2.71])
I expect the result df2.head()
is
colX colY
9.6 foo5
6.9 foo3 # duplicate index in colX
6.9 foo4 # duplicate index in colX
3.14 foo2
2.71 foo1
Or this
colX colY
9.6 foo5
6.9 foo4 # duplicate index in colX
6.9 foo3 # duplicate index in colX
3.14 foo2
2.71 foo1
Reindex won't work if there's duplicate axis.
ValueError: cannot reindex from a duplicate axis
Note: df was created by df=pd.read_csv('foobar.csv')
CodePudding user response:
Error means there is at least one duplicated values, so reindex
raise error, because working only with unique values in list passed to reindex
.
Solution with helper column created by GroupBy.cumcount
and DataFrame.merge
with default inner join:
df['g'] = df.groupby('colX').cumcount()
L = [9.6,6.9,6.9,3.14,2.71]
df11 = pd.DataFrame({'colX': L})
df11['g'] = df11.groupby('colX').cumcount()
print (df11)
colX g
0 9.60 0
1 6.90 0
2 6.90 1
3 3.14 0
4 2.71 0
df = df.merge(df11, on=['colX','g'])
print (df)
colX colY g
0 2.71 foo1 0
1 3.14 foo2 0
2 6.90 foo3 0
3 6.90 foo4 1
4 9.60 foo5 0
Or with reindex
:
df['g'] = df.groupby('colX').cumcount()
L = [9.6,6.9,6.9,3.14,2.71]
df11 = pd.DataFrame({'colX': L})
df11['g'] = df11.groupby('colX').cumcount()
print (df11)
df=df.set_index(['colX', 'g']).reindex(df11.set_index(['colX','g']).index)
print (df)
colY
colX g
9.60 0 foo5
6.90 0 foo3
1 foo4
3.14 0 foo2
2.71 0 foo1
Wrong first solution:
df=df.set_index('colX')
df2=df.loc[[9.6,6.9,6.9,3.14,2.71]]
print (df2)
colY
colX
9.60 foo5
6.90 foo3 > combinations of 6.9 (2 times in original * 2 times in list = 4)
6.90 foo4
6.90 foo3
6.90 foo4
3.14 foo2
2.71 foo1