Home > Software engineering >  pandas update a dataframe with values from another dataframe on the match of column values
pandas update a dataframe with values from another dataframe on the match of column values

Time:07-11

I have a dataframe with stock ticker names and dates as two columns, and i would like to update this dataframe with price value from another larger dataframe matching those 2 columns

Eg: df1:

ticker  Date
AAPL    2022-01-03
GE      2022-04-18

df2:

ticker   Date             Close
AAPL     2022-01-02       120
AAPL     2022-01-03       122
AAPL     2022-01-04       125
AAPL     2022-01-05       121
.
.
.
GE     2022-04-16       20
GE     2022-04-17       22
GE     2022-04-18       25
GE     2022-04-19       21

The output should be:

ticker  Date         Close
AAPL    2022-01-03   122
GE      2022-04-18   25

I can do a loop and update row by row, but i would like to check if there is a pythonic way using the whole series/vectors...

CodePudding user response:

TL;DR: If you can index your dataframes in advance, you can do about 1000 times better than merge(), otherwise merge() is probably your best bet.


Your question says:

I have a dataframe with stock ticker names and dates as two columns, and i would like to update this dataframe with price value from another larger dataframe matching those 2 columns

... and you ask:

i would like to check if there is a pythonic way using the whole series/vectors

If your question is asked in the context of just this single dataframe query, then you probably can't get better performance than merge().

However, if you have the option of initializing your dataframes to use ticker, Date as their index, or if you can at least set their indexes to ticker, Date before needing to run multiple queries of the kind described in your question, then you can beat merge().

Here is a benchmark of 6 different strategies for df1 with 500 rows and df2 with 25 million rows:

Timeit results:
foo_1 (merge) ran in 23.043055499998445 seconds
foo_2 (indexed join) ran in 51.69773360000181 seconds
foo_3 (pre-indexed join) ran in 0.0027679000013449695 seconds
foo_4 (pre-indexed df1 join) ran in 24.431038499998976 seconds
foo_5 (merge right) ran in 22.99117219999971 seconds
foo_6 (pre-indexed assign) ran in 0.007970200000272598 seconds

Note that pre-indexed join is about 1000x faster than merge (and pre-indexed assign is also quick at about 300x faster), primarily because pre-indexed dataframes have hash table access by index which have key search time of O(1) vs worst case O(n) time for non-indexed keys. However, indexed join is more than twice as slow as merge because indexed join includes the initial indexing effort (which can be done just once for multiple queries such as the one in your question, and which is excluded from pre-indexed join).

Explanation of the various strategies:

  • The merge strategy uses no indexing.
  • The indexed join strategy includes the time to index both dataframes.
  • The pre-indexed join strategy excludes the initial overhead of indexing both dataframes.
  • The pre-indexed df1 join strategy excludes the initial overhead of indexing df1 but works with an unindexed df2.
  • The merge right strategy swaps df1 and df2 as object and argument of merge().
  • The pre-indexed assign strategy doesn't use merge() or join(), instead doing an index-aligned assignment from df2 to a new column in df1.

Here's the code for each strategy:

df1_orig = pd.DataFrame([('A' str(i) , f'{2020 i//365}-{(i//28)   1}-{i(   1}') for i in range(500)], columns=['ticker', 'Date'])
print(df1_orig)

df2_orig = pd.DataFrame([('A' str(i) , f'{2020 (iP0)//365}-{((iP0)//28)   1}-{(iP0)(   1}', (10 * i   1) % 300) for i in range(25_000_000)], columns=['ticker', 'Date', 'Close'])
print(df2_orig)

df1_indexed_orig = df1_orig.set_index(['ticker', 'Date'])
df2_indexed_orig = df2_orig.set_index(['ticker', 'Date'])

# merge
def foo_1(df1, df2):
    df1 = df1.merge(df2, on = ['ticker', 'Date'], how = 'left')
    return df1

# indexed join
def foo_2(df1, df2):
    df1.set_index(['ticker', 'Date'], inplace=True)
    df2.set_index(['ticker', 'Date'], inplace=True)
    df1 = df1.join(df2)
    return df1

# pre-indexed join
def foo_3(df1, df2):
    # called with df1_indexed_orig and df2_indexed_orig
    df1 = df1.join(df2)
    return df1

# pre-indexed df1 join
def foo_4(df1, df2):
    # called with df1_indexed_orig
    df1 = df2.join(df1, on = ['ticker', 'Date'], how = 'right')
    return df1

# merge right
def foo_5(df1, df2):
    df1 = df2.merge(df1, on = ['ticker', 'Date'], how = 'right')
    return df1

# pre-indexed assign
def foo_6(df1, df2):
    # called with df1_indexed_orig and df2_indexed_orig
    df1 = df1.assign(Close=df2.Close)
    return df1

CodePudding user response:

Try merging on those two columns:

df1.merge(df2, on = ['ticker', 'Date'], how = 'left')
  • Related