I want to speed up this kind of combination of dataframes that I wrote.
More details:
Scores are not important and they are some kind of random numbers so skip that. Index in df1 is time series with step 5 and Index in df2 is time series with step 15 and Index in df3 is time series with step 30
Thanks.
import pandas as pd
#initialize dataframes and fill some data
df1 = pd.DataFrame([[6,20],[11,19],[16,18],[21,17],[26,16],[31,15],[36,14]],columns=['Index','Score'])
df1.set_index('Index', inplace=True)
print(df1)
df2 = pd.DataFrame([[6,20],[21,19],[36,18]],columns=['Index','Score'])
df2.set_index('Index', inplace=True)
print(df2)
df3 = pd.DataFrame([[6,20],[36,19]],columns=['Index','Score'])
df3.set_index('Index', inplace=True)
print(df3)
#This code block runs slow and I want to speed up here.
#-----------------------------------------------------
for index1 in df1.index:
for index2 in df2.index:
if (index2-index1<=10):
df1.at[index1,'Score2'] =df2.at[index2,'Score']
for index1 in df1.index:
for index2 in df3.index:
if (index2-index1<=25):
df1.at[index1,'Score3'] =df3.at[index2,'Score']
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print(df1)
Score
Index
6 20
11 19
16 18
21 17
26 16
31 15
36 14
Score
Index
6 20
21 19
36 18
Score
Index
6 20
36 19
Score Score2 Score3
Index
6 20 20.0 20.0
11 19 19.0 19.0
16 18 19.0 19.0
21 17 19.0 19.0
26 16 18.0 19.0
31 15 18.0 19.0
36 14 18.0 19.0
CodePudding user response:
If the values does not matter, you just need to do a merge:
df1 = df1.merge(right=df2.merge(right=df3,how='left',left_index=True, right_index=True) , how = 'left', left_index=True, right_index=True)
df1.columns = ['Score', 'Score 2', 'Score 3']
df1
CodePudding user response:
convtools allows to define data transforms like this (I must confess -- I've developed it). Once a conversion is defined and you run gen_converter
it generates & compiles ad-hoc python code under the hood, so you have a simple python function to use. github | docs
from convtools import conversion as c
from convtools.contrib.tables import Table
items_1 = [[6, 20], [11, 19], [16, 18], [21, 17], [26, 16], [31, 15], [36, 14]]
items_2 = [[6, 20], [21, 19], [36, 18]]
items_3 = [[6, 20], [36, 19]]
rows_iter = (
Table.from_rows(items_1, header=["Index", "Score"])
.join(
Table.from_rows(items_2, header=["Index", "Score2"]),
on=c.RIGHT.col("Index") - c.LEFT.col("Index") <= 10,
how="left",
)
.drop("Index_RIGHT")
.rename({"Index_LEFT": "Index"})
.join(
Table.from_rows(items_3, header=["Index", "Score3"]),
on=c.RIGHT.col("Index") - c.LEFT.col("Index") <= 25,
how="left",
)
.drop("Index_RIGHT")
.rename({"Index_LEFT": "Index"})
.into_iter_rows(dict)
)
# rows_iter contains duplicate rows after joins, so we need to take only last
# ones per index;
deduplicate_converter = (
c.chunk_by(c.item("Index"))
.aggregate(c.ReduceFuncs.Last(c.this))
.gen_converter()
)
# the deduplicate_converter also returns an iterable
data = list(deduplicate_converter(rows_iter))
"""
In [13]: data
Out[13]:
[{'Index': 6, 'Score': 20, 'Score2': 20, 'Score3': 20},
{'Index': 11, 'Score': 19, 'Score2': 19, 'Score3': 19},
{'Index': 16, 'Score': 18, 'Score2': 19, 'Score3': 19},
{'Index': 21, 'Score': 17, 'Score2': 19, 'Score3': 19},
{'Index': 26, 'Score': 16, 'Score2': 18, 'Score3': 19},
{'Index': 31, 'Score': 15, 'Score2': 18, 'Score3': 19},
{'Index': 36, 'Score': 14, 'Score2': 18, 'Score3': 19}]
"""
Should you have any questions, please comment below, I'll adjust the code. If it works for you, please, comment on whether there's any speed-up on large data sets.