I have two tables where I want to find the outer join based on a Ticker variable.
In Table I, I have only one Ticker for each entity (fund), but in table II, I may have multiple records (multiple Ticker) for each "FundID". The goal is to count the unique funds.
I want to have table III, which is the combination of both tables I and II at the fund level. If a Ticker exists in both tables, that is considered as one record in table III. If a Ticker exists in table II but not in table I, I want it in table III.
However, I don't want another Ticker from the same fund ("FundID"). In that case, we may pick one representative Ticker at random from each fund group (again, they will have the same "FundID"), and thus we won't have multiple Ticker for one specific fund.
Table 1:
Ticker |
---|
A |
B |
C |
D |
Table II:
Ticker | FundID |
---|---|
A | 1 |
AA | 1 |
AB | 1 |
B | 2 |
BB | 2 |
E | 3 |
EB | 3 |
EC | 3 |
Table III(combined):
Ticker |
---|
A |
B |
C |
D |
E |
CodePudding user response:
You can try with merge
then pass the filter with duplicated
out = t1.merge(t2,how='outer')
out = out[~out['FundID'].duplicated() | out['FundID'].isna()]
Out[87]:
Ticker FundID
0 A 1.0
1 B 2.0
2 C NaN
3 D NaN
7 E 3.0
CodePudding user response:
You can first filter df2
for rows where for each FundID, none of their corresponding "Ticker" is in df1['Ticket']
. Then among these FundIDs, sample one Ticker for each FundID and concatenate this to df1
:
sub_df2 = df2[~df2['Ticker'].isin(df1['Ticket']).groupby(df2['FundID']).cummax()]
out = pd.concat((df1, sub_df2.groupby('FundID')['Ticker'].sample(n=1).to_frame().rename(columns={'Ticker':'Ticket'})))
Output:
Ticket
0 A
1 B
2 C
3 D
5 E