Python - Pandas mapping error: Reindexing only valid with uniquely valued Index objects-CodePudding

I am working on a function which is meant to map an ID from one dataframe to another based on one of two fields (Ticker or CUSIP). The main df is a custodian file which has the fields below:

	Ticker	CUSIP
0	AAA	NaN
1	ABC	NaN
2	NaN	123456789
3	UNK	NaN

The map_df is a secondary file downloaded internally which maps the custodian file (df) to an arbitrary "Owned_ID" generated the first time a security is entered into our system. It looks something like:

	Owned_ID	Ticker	CUSIP
0	0	AAA
1	11	ABC	111111111
2	22	NaN	123456789
3	33
4	44

My goal is to create a file which maps the CUSIP to the map file to pull the Owned ID. If a CUSIP is not available, then it should take the Ticker and pull an Owned ID. If neither the CUSIP nor Ticker can be found within the mapping file, it can return NaN. Therefore the output dataframe should look like:

	Ticker	CUSIP	Owned_ID
0	AAA	NaN	0
1	ABC	NaN	11
2	NaN	123456789	22
3	UNK	NaN	NaN

I have come up with the code below which essentially maps the dataframes over by reindexing but keep running into the error:

  File "C:\Users\JeffNi\PycharmProjects\KP21 - Addepar Upload Generator\Archives\OwnedID Match Sample.py", line 41, in <module>
    df['Owned_ID'] = np.where(mask_a, df['CUSIP'].map(a)

  File "C:\Users\JeffNi\anaconda3\lib\site-packages\pandas\core\series.py", line 3909, in map
    new_values = super()._map_values(arg, na_action=na_action)

  File "C:\Users\JeffNi\anaconda3\lib\site-packages\pandas\core\base.py", line 907, in _map_values
    indexer = mapper.index.get_indexer(values)

  File "C:\Users\JeffNi\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3171, in get_indexer
    raise InvalidIndexError(

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Here is the code snippet I am using to replicate this error.


import pandas as pd
df = pd.DataFrame({'Ticker': {0: 'AAA', 1: 'ABC', 2: pd.NA, 3:'UNK'},  
                   'CUSIP': {0: pd.NA, 1: pd.NA, 2: '123456789', 3:pd.NA}})


df_map = pd.DataFrame(
{'Owned_ID' : {0: '11', 1: '22', 2: '33', 3:'44', 4:'55'},
'Ticker': {0: 'AAA', 1: 'ABC', 2: pd.NA, 3:pd.NA, 4:pd.NA}, 
'CUSIP': {0: pd.NA, 1: '111111111', 2: '123456789', 3:pd.NA, 4:pd.NA}})

a = df_map.set_index('CUSIP')['Owned_ID']

b = df_map.set_index('Ticker')['Owned_ID']

mask_a = df['CUSIP'].isnull()

df['Owned_ID'] = np.where(mask_a, df['CUSIP'].map(a), df['Ticker'].map(b))


print(df)

I think it might have something to do with the mapping file having duplicates in the form of NA, but I have not been able to figure out what I need to do to make the indexing work (or if there is a better way to approach this problem). I imagine it being similar to a Vlookup function in Excel, where I lookup CUSIP in map_df, it if ails, lookup Ticker in map_df and if that fails just return an error.

Thanks for any input!

CodePudding user response：

You should drop the na values first for the mapping dataframes:

a = df_map.dropna(subset=["CUSIP"]).set_index('CUSIP')['Owned_ID']

b = df_map.dropna(subset=["Ticker"]).set_index('Ticker')['Owned_ID']

mask_a = df['CUSIP'].isnull()

df['Owned_ID'] = np.where(mask_a, df['Ticker'].map(b), df['CUSIP'].map(a))

Note that you also needed to swap the np.where arguments, because they were the wrong way around.

To show that this is in the order you want:

df = pd.DataFrame({'Ticker': {0: 'AAA', 1: 'ABC', 2: "test", 3:'UNK'},  
                   'CUSIP': {0: pd.NA, 1: pd.NA, 2: '123456789', 3:pd.NA}})


df_map = pd.DataFrame(
{'Owned_ID' : {0: '11', 1: '22', 2: '33', 3:'44', 4:'55', 5: "66"},
'Ticker': {0: 'AAA', 1: 'ABC', 2: pd.NA, 3:pd.NA, 4:pd.NA, 5: "test"}, 
'CUSIP': {0: pd.NA, 1: '111111111', 2: '123456789', 3:pd.NA, 4:pd.NA, 5:pd.NA}})

a = df_map.dropna(subset=["CUSIP"]).set_index('CUSIP')['Owned_ID']

b = df_map.dropna(subset=["Ticker"]).set_index('Ticker')['Owned_ID']

mask_a = df['CUSIP'].isnull()

df['Owned_ID'] = np.where(mask_a, df['Ticker'].map(b), df['CUSIP'].map(a))
print(df)
#  Ticker      CUSIP Owned_ID
#0    AAA       <NA>       11
#1    ABC       <NA>       22
#2   test  123456789       33
#3    UNK       <NA>      NaN

Index "2" is 33, not 66 here, so took the CUSIP as this was available.