How to speed up this pandas dataframes combination that runs too slow?

I want to speed up this kind of combination of dataframes that I wrote.
More details: the scores are not important (they are just some random numbers), so ignore them. The Index in df1 is a time series with step 5, the Index in df2 is a time series with step 15, and the Index in df3 is a time series with step 30.

Thanks.

import pandas as pd

#initialize dataframes and fill some data

df1 = pd.DataFrame([[6,20],[11,19],[16,18],[21,17],[26,16],[31,15],[36,14]],columns=['Index','Score'])
df1.set_index('Index', inplace=True)
print(df1)

df2 = pd.DataFrame([[6,20],[21,19],[36,18]],columns=['Index','Score'])
df2.set_index('Index', inplace=True)
print(df2)

df3 = pd.DataFrame([[6,20],[36,19]],columns=['Index','Score'])
df3.set_index('Index', inplace=True)
print(df3)

# This code block runs slow and I want to speed it up.
# -----------------------------------------------------
# For each row of df1 it ends up keeping the Score of the last df2 row whose
# index is at most index1 + 10 (and likewise for df3 with 25), because later
# matches overwrite earlier ones.
for index1 in df1.index:
    for index2 in df2.index:
        if index2 - index1 <= 10:
            df1.at[index1, 'Score2'] = df2.at[index2, 'Score']
for index1 in df1.index:
    for index2 in df3.index:
        if index2 - index1 <= 25:
            df1.at[index1, 'Score3'] = df3.at[index2, 'Score']
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print(df1)
       Score
Index       
6         20
11        19
16        18
21        17
26        16
31        15
36        14
       Score
Index       
6         20
21        19
36        18
       Score
Index       
6         20
36        19
       Score  Score2  Score3
Index                       
6         20    20.0    20.0
11        19    19.0    19.0
16        18    19.0    19.0
21        17    19.0    19.0
26        16    18.0    19.0
31        15    18.0    19.0
36        14    18.0    19.0

CodePudding user response:

If the values do not matter, you just need to do a merge:

df1 = df1.merge(
    right=df2.merge(right=df3, how='left', left_index=True, right_index=True),
    how='left', left_index=True, right_index=True,
)
df1.columns = ['Score', 'Score 2', 'Score 3']
df1
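
If exact index matches are not enough and you actually want the windowed "last match wins" behaviour of your loops, a vectorized sketch with pd.merge_asof might also be worth trying (rough sketch, not benchmarked; window_join is just a helper name I picked, and it assumes df1, df2, df3 as built in the question):

import pandas as pd

def window_join(left, right, window, new_col):
    # For each left index i, take the Score of the last right index <= i + window,
    # which is what the nested loops end up keeping (later matches overwrite earlier ones).
    left_keys = left.reset_index()
    left_keys['key'] = left_keys['Index'] + window
    right_keys = right.reset_index().rename(columns={'Index': 'key', 'Score': new_col})
    merged = pd.merge_asof(left_keys, right_keys, on='key', direction='backward')
    return merged.set_index('Index')[new_col]

df1['Score2'] = window_join(df1, df2, 10, 'Score2')
df1['Score3'] = window_join(df1, df3, 25, 'Score3')
print(df1)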

CodePudding user response:

convtools lets you define data transforms like this (I must confess: I've developed it). Once a conversion is defined and you call gen_converter, it generates and compiles ad hoc Python code under the hood, so you end up with a plain Python function to use. github | docs

from convtools import conversion as c
from convtools.contrib.tables import Table

items_1 = [[6, 20], [11, 19], [16, 18], [21, 17], [26, 16], [31, 15], [36, 14]]
items_2 = [[6, 20], [21, 19], [36, 18]]
items_3 = [[6, 20], [36, 19]]

rows_iter = (
    Table.from_rows(items_1, header=["Index", "Score"])
    .join(
        Table.from_rows(items_2, header=["Index", "Score2"]),
        on=c.RIGHT.col("Index") - c.LEFT.col("Index") <= 10,
        how="left",
    )
    .drop("Index_RIGHT")
    .rename({"Index_LEFT": "Index"})
    .join(
        Table.from_rows(items_3, header=["Index", "Score3"]),
        on=c.RIGHT.col("Index") - c.LEFT.col("Index") <= 25,
        how="left",
    )
    .drop("Index_RIGHT")
    .rename({"Index_LEFT": "Index"})
    .into_iter_rows(dict)
)

# rows_iter contains duplicate rows after joins, so we need to take only last
# ones per index;
deduplicate_converter = (
    c.chunk_by(c.item("Index"))
    .aggregate(c.ReduceFuncs.Last(c.this))
    .gen_converter()
)

# the deduplicate_converter also returns an iterable
data = list(deduplicate_converter(rows_iter))

"""
In [13]: data
Out[13]:
[{'Index': 6, 'Score': 20, 'Score2': 20, 'Score3': 20},
 {'Index': 11, 'Score': 19, 'Score2': 19, 'Score3': 19},
 {'Index': 16, 'Score': 18, 'Score2': 19, 'Score3': 19},
 {'Index': 21, 'Score': 17, 'Score2': 19, 'Score3': 19},
 {'Index': 26, 'Score': 16, 'Score2': 18, 'Score3': 19},
 {'Index': 31, 'Score': 15, 'Score2': 18, 'Score3': 19},
 {'Index': 36, 'Score': 14, 'Score2': 18, 'Score3': 19}]
"""

Should you have any questions, please comment below and I'll adjust the code. If it works for you, please comment on whether there's any speed-up on large data sets.
