I have a dataframe with stock ticker names and dates as two columns, and i would like to update this dataframe with price value from another larger dataframe matching those 2 columns
Eg: df1:
ticker Date
AAPL 2022-01-03
GE 2022-04-18
df2:
ticker Date Close
AAPL 2022-01-02 120
AAPL 2022-01-03 122
AAPL 2022-01-04 125
AAPL 2022-01-05 121
.
.
.
GE 2022-04-16 20
GE 2022-04-17 22
GE 2022-04-18 25
GE 2022-04-19 21
The output should be:
ticker Date Close
AAPL 2022-01-03 122
GE 2022-04-18 25
I can do a loop and update row by row, but i would like to check if there is a pythonic way using the whole series/vectors...
CodePudding user response:
TL;DR: If you can index your dataframes in advance, you can do about 1000 times better than merge()
, otherwise merge() is probably your best bet.
Your question says:
I have a dataframe with stock ticker names and dates as two columns, and i would like to update this dataframe with price value from another larger dataframe matching those 2 columns
... and you ask:
i would like to check if there is a pythonic way using the whole series/vectors
If your question is asked in the context of just this single dataframe query, then you probably can't get better performance than merge()
.
However, if you have the option of initializing your dataframes to use ticker, Date
as their index, or if you can at least set their indexes to ticker, Date
before needing to run multiple queries of the kind described in your question, then you can beat merge()
.
Here is a benchmark of 6 different strategies for df1 with 500 rows and df2 with 25 million rows:
Timeit results:
foo_1 (merge) ran in 23.043055499998445 seconds
foo_2 (indexed join) ran in 51.69773360000181 seconds
foo_3 (pre-indexed join) ran in 0.0027679000013449695 seconds
foo_4 (pre-indexed df1 join) ran in 24.431038499998976 seconds
foo_5 (merge right) ran in 22.99117219999971 seconds
foo_6 (pre-indexed assign) ran in 0.007970200000272598 seconds
Note that pre-indexed join
is about 1000x faster than merge
(and pre-indexed assign
is also quick at about 300x faster), primarily because pre-indexed dataframes have hash table access by index which have key search time of O(1) vs worst case O(n) time for non-indexed keys. However, indexed join
is more than twice as slow as merge
because indexed join
includes the initial indexing effort (which can be done just once for multiple queries such as the one in your question, and which is excluded from pre-indexed join
).
Explanation of the various strategies:
- The
merge
strategy uses no indexing. - The
indexed join
strategy includes the time to index both dataframes. - The
pre-indexed join
strategy excludes the initial overhead of indexing both dataframes. - The
pre-indexed df1 join
strategy excludes the initial overhead of indexing df1 but works with an unindexed df2. - The
merge right
strategy swaps df1 and df2 as object and argument ofmerge()
. - The
pre-indexed assign
strategy doesn't usemerge()
orjoin()
, instead doing an index-aligned assignment from df2 to a new column in df1.
Here's the code for each strategy:
df1_orig = pd.DataFrame([('A' str(i) , f'{2020 i//365}-{(i//28) 1}-{i( 1}') for i in range(500)], columns=['ticker', 'Date'])
print(df1_orig)
df2_orig = pd.DataFrame([('A' str(i) , f'{2020 (iP0)//365}-{((iP0)//28) 1}-{(iP0)( 1}', (10 * i 1) % 300) for i in range(25_000_000)], columns=['ticker', 'Date', 'Close'])
print(df2_orig)
df1_indexed_orig = df1_orig.set_index(['ticker', 'Date'])
df2_indexed_orig = df2_orig.set_index(['ticker', 'Date'])
# merge
def foo_1(df1, df2):
df1 = df1.merge(df2, on = ['ticker', 'Date'], how = 'left')
return df1
# indexed join
def foo_2(df1, df2):
df1.set_index(['ticker', 'Date'], inplace=True)
df2.set_index(['ticker', 'Date'], inplace=True)
df1 = df1.join(df2)
return df1
# pre-indexed join
def foo_3(df1, df2):
# called with df1_indexed_orig and df2_indexed_orig
df1 = df1.join(df2)
return df1
# pre-indexed df1 join
def foo_4(df1, df2):
# called with df1_indexed_orig
df1 = df2.join(df1, on = ['ticker', 'Date'], how = 'right')
return df1
# merge right
def foo_5(df1, df2):
df1 = df2.merge(df1, on = ['ticker', 'Date'], how = 'right')
return df1
# pre-indexed assign
def foo_6(df1, df2):
# called with df1_indexed_orig and df2_indexed_orig
df1 = df1.assign(Close=df2.Close)
return df1
CodePudding user response:
Try merging on those two columns:
df1.merge(df2, on = ['ticker', 'Date'], how = 'left')