Pandas concat - InvalidIndexError: Reindexing only valid with uniquely valued Index objects

I am trying to concat two pandas DataFrames and am running into an InvalidIndexError. Here's some mock data:

import pandas as pd

df1 = pd.DataFrame({'col1': [1,2,3],
                    'col2': [4,5,6] 
                  })

df2 = pd.DataFrame({'col1': [7,8,9],
                    'col2': ['10','11','12'],
                    'col3': ['13','14','15'] 
                  })

# Concat and keep only cols from df1

df3 = pd.concat([df1, df2], ignore_index=True).reindex(df1.columns, axis='columns')

Expected output:

Full Traceback:

    /Applications/Anaconda/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   3440 
   3441         if not self._index_as_unique:
-> 3442             raise InvalidIndexError(self._requires_unique_msg)
   3443 
   3444         if not self._should_compare(target) and not is_interval_dtype(self.dtype):

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

CodePudding user response:

With the sample data, this works correctly for me.

I changed the data to reproduce the error. The cause is duplicated column names:

df1 = pd.DataFrame({'col1': [1,2,3],
                    'col2': [4,5,6] 
                  }).rename(columns={'col2':'col1'})
print (df1)
   col1  col1 <- col1 is duplicated
0     1     4
1     2     5
2     3     6

df2 = pd.DataFrame({'col1': [7,8,9],
                    'col2': ['10','11','12'],
                    'col3': ['13','14','15'] 
                  })

# Concat and keep only cols from df1

df3 = pd.concat([df1, df2], ignore_index=True).reindex(df1.columns, axis='columns')
print (df3)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

You can find the duplicated column names with:

print (df1.columns[df1.columns.duplicated(keep=False)])
Index(['col1', 'col1'], dtype='object')

print (df2.columns[df2.columns.duplicated(keep=False)])
Index([], dtype='object')
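
If there are several frames to check before concatenating, the same test can be wrapped in a small helper. This is only a sketch: duplicated_columns is a hypothetical name, not a pandas function, and the frames are the mock data from this thread.

```python
import pandas as pd


def duplicated_columns(df):
    """Return each column label that occurs more than once in df (hypothetical helper)."""
    cols = df.columns
    # keep=False marks every occurrence of a repeated label, unique() collapses them
    return cols[cols.duplicated(keep=False)].unique().tolist()


# df1 is built with a repeated label on purpose to mimic the failing case
df1 = pd.DataFrame([[1, 4], [2, 5], [3, 6]], columns=["col1", "col1"])
df2 = pd.DataFrame({"col1": [7, 8, 9],
                    "col2": ["10", "11", "12"],
                    "col3": ["13", "14", "15"]})

for name, frame in {"df1": df1, "df2": df2}.items():
    # Index.is_unique is a cheap check to run before calling pd.concat
    if not frame.columns.is_unique:
        print(name, "has duplicated columns:", duplicated_columns(frame))
```

Running the check on every input frame before the concat shows which one triggers the error.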

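The solution is to deduplicate them (note that _maybe_dedup_names is a private pandas helper, so it may change or be missing in other pandas versions):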

print (pd.io.parsers.ParserBase({'names':df1.columns})._maybe_dedup_names(df1.columns))
['col1', 'col1.1']
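
Because the helper above is private, a version-independent alternative is to rename the duplicates yourself. The following is a minimal sketch, assuming the same '.1', '.2' suffix style as the output above; dedup_names is a hypothetical function written for this example, not part of pandas.

```python
import pandas as pd


def dedup_names(names):
    """Make labels unique by appending .1, .2, ... to repeats (hypothetical helper)."""
    used = set()
    counts = {}
    result = []
    for name in names:
        new = name
        # keep bumping the suffix until the label is not already taken
        while new in used:
            counts[name] = counts.get(name, 0) + 1
            new = f"{name}.{counts[name]}"
        used.add(new)
        result.append(new)
    return result


df1 = pd.DataFrame([[1, 4], [2, 5], [3, 6]], columns=["col1", "col1"])
df2 = pd.DataFrame({"col1": [7, 8, 9],
                    "col2": ["10", "11", "12"],
                    "col3": ["13", "14", "15"]})

# Rename the duplicated labels first, then concat and keep only df1's columns
df1.columns = dedup_names(df1.columns)
df3 = pd.concat([df1, df2], ignore_index=True).reindex(df1.columns, axis="columns")
print(df3.columns.tolist())
```

After renaming, the second col1 becomes col1.1, so the reindex no longer sees duplicate labels. Rows coming from df2 have no col1.1 column and are filled with NaN there.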