I am working on a function which is meant to map an ID from one dataframe to another based on one of two fields (Ticker or CUSIP). The main df is a custodian file which has the fields below:
Ticker | CUSIP | |
---|---|---|
0 | AAA | NaN |
1 | ABC | NaN |
2 | NaN | 123456789 |
3 | UNK | NaN |
The map_df is a secondary file downloaded internally which maps the custodian file (df) to an arbitrary "Owned_ID" generated the first time a security is entered into our system. It looks something like:
Owned_ID | Ticker | CUSIP | |
---|---|---|---|
0 | 0 | AAA | |
1 | 11 | ABC | 111111111 |
2 | 22 | NaN | 123456789 |
3 | 33 | ||
4 | 44 |
My goal is to create a file which maps the CUSIP to the map file to pull the Owned ID. If a CUSIP is not available, then it should take the Ticker and pull an Owned ID. If neither the CUSIP nor Ticker can be found within the mapping file, it can return NaN. Therefore the output dataframe should look like:
Ticker | CUSIP | Owned_ID | |
---|---|---|---|
0 | AAA | NaN | 0 |
1 | ABC | NaN | 11 |
2 | NaN | 123456789 | 22 |
3 | UNK | NaN | NaN |
I have come up with the code below which essentially maps the dataframes over by reindexing but keep running into the error:
File "C:\Users\JeffNi\PycharmProjects\KP21 - Addepar Upload Generator\Archives\OwnedID Match Sample.py", line 41, in <module>
df['Owned_ID'] = np.where(mask_a, df['CUSIP'].map(a)
File "C:\Users\JeffNi\anaconda3\lib\site-packages\pandas\core\series.py", line 3909, in map
new_values = super()._map_values(arg, na_action=na_action)
File "C:\Users\JeffNi\anaconda3\lib\site-packages\pandas\core\base.py", line 907, in _map_values
indexer = mapper.index.get_indexer(values)
File "C:\Users\JeffNi\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3171, in get_indexer
raise InvalidIndexError(
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Here is the code snippet I am using to replicate this error.
import pandas as pd
df = pd.DataFrame({'Ticker': {0: 'AAA', 1: 'ABC', 2: pd.NA, 3:'UNK'},
'CUSIP': {0: pd.NA, 1: pd.NA, 2: '123456789', 3:pd.NA}})
df_map = pd.DataFrame(
{'Owned_ID' : {0: '11', 1: '22', 2: '33', 3:'44', 4:'55'},
'Ticker': {0: 'AAA', 1: 'ABC', 2: pd.NA, 3:pd.NA, 4:pd.NA},
'CUSIP': {0: pd.NA, 1: '111111111', 2: '123456789', 3:pd.NA, 4:pd.NA}})
a = df_map.set_index('CUSIP')['Owned_ID']
b = df_map.set_index('Ticker')['Owned_ID']
mask_a = df['CUSIP'].isnull()
df['Owned_ID'] = np.where(mask_a, df['CUSIP'].map(a), df['Ticker'].map(b))
print(df)
I think it might have something to do with the mapping file having duplicates in the form of NA, but I have not been able to figure out what I need to do to make the indexing work (or if there is a better way to approach this problem). I imagine it being similar to a Vlookup function in Excel, where I lookup CUSIP in map_df, it if ails, lookup Ticker in map_df and if that fails just return an error.
Thanks for any input!
CodePudding user response:
You should drop the na values first for the mapping dataframes:
a = df_map.dropna(subset=["CUSIP"]).set_index('CUSIP')['Owned_ID']
b = df_map.dropna(subset=["Ticker"]).set_index('Ticker')['Owned_ID']
mask_a = df['CUSIP'].isnull()
df['Owned_ID'] = np.where(mask_a, df['Ticker'].map(b), df['CUSIP'].map(a))
Note that you also needed to swap the np.where
arguments, because they were the wrong way around.
To show that this is in the order you want:
df = pd.DataFrame({'Ticker': {0: 'AAA', 1: 'ABC', 2: "test", 3:'UNK'},
'CUSIP': {0: pd.NA, 1: pd.NA, 2: '123456789', 3:pd.NA}})
df_map = pd.DataFrame(
{'Owned_ID' : {0: '11', 1: '22', 2: '33', 3:'44', 4:'55', 5: "66"},
'Ticker': {0: 'AAA', 1: 'ABC', 2: pd.NA, 3:pd.NA, 4:pd.NA, 5: "test"},
'CUSIP': {0: pd.NA, 1: '111111111', 2: '123456789', 3:pd.NA, 4:pd.NA, 5:pd.NA}})
a = df_map.dropna(subset=["CUSIP"]).set_index('CUSIP')['Owned_ID']
b = df_map.dropna(subset=["Ticker"]).set_index('Ticker')['Owned_ID']
mask_a = df['CUSIP'].isnull()
df['Owned_ID'] = np.where(mask_a, df['Ticker'].map(b), df['CUSIP'].map(a))
print(df)
# Ticker CUSIP Owned_ID
#0 AAA <NA> 11
#1 ABC <NA> 22
#2 test 123456789 33
#3 UNK <NA> NaN
Index "2" is 33, not 66 here, so took the CUSIP as this was available.