I am having a hard time coming up with the correct query to get the data I need, and I wonder whether it's something that can realistically be done using SQL or if I should deal with it in Python.
My first table (api_results) looks like this:
ticker | entry | date | change |
---|---|---|---|
AAPL | 2 | 2022-06-09 | 1.05 |
TSLA | 7 | 2022-09-09 | -0.34 |
And the second one (historical_data) looks like this, excluding irrelevant columns.
Ticker | Date | Change |
---|---|---|
AAPL | 2022-06-09 | 1.05 |
AAPL | 2022-07-09 | -0.34 |
AAPL | 2022-08-09 | 2.5 |
AAPL | 2022-09-09 | 1.12 |
AAPL | 2022-10-09 | ... |
AAPL | 2022-11-09 | ... |
MSFT | 2022-06-09 | ... |
MSFT | 2022-07-09 | ... |
MSFT | 2022-08-09 | ... |
MSFT | 2022-09-09 | ... |
MSFT | 2022-10-09 | ... |
MSFT | 2022-11-09 | ... |
For every ticker in api_results, I am trying to take the date in api_results.date and compute the sum of changes over a 10-row span starting at that date. The non-dynamic query, for a single hard-coded ticker and date, is the following:
SELECT ticker, Date, SUM(change) AS change FROM
(
    SELECT change, ticker, Date FROM historical_data
    WHERE ticker = 'T' AND date >= '2013-12-13 00:00:00'
    ORDER BY date
    LIMIT 10
)
Which returns
Ticker | Date | Change |
---|---|---|
T | 2013-12-13 | 3.76 |
The result is correct, but how can I do the same thing for every date-ticker pair in api_results? The idea is to apply a function to every row of the table. What I would do in Python is this:
import sqlite3
import pandas as pd
from config import db_path

connection = sqlite3.connect(db_path)
cursor = connection.cursor()
historical_data = pd.read_sql("SELECT Date, Ticker, Change FROM historical_data", connection)
api_results = cursor.execute("SELECT ticker, date FROM api_results").fetchall()
data = []
for ticker, date in api_results:
    # Locate the row for this ticker/date pair; read_sql yields a default
    # RangeIndex, so the label doubles as the positional index
    mask = (historical_data['Ticker'] == ticker) & (historical_data['Date'] == date)
    index = historical_data.index[mask][0]
    # Sum Change over the next 10 rows, starting at the matched row
    data.append(historical_data.iloc[index:index + 10]['Change'].sum())
This approach appears to work, but it takes way too long: after two hours, the loop was still going. Mind you, the historical_data table has 1M rows and 30 columns.
Should I just find a better way of doing it in Python, like vectorizing a function over historical_data, or building a data frame and shifting the changes appropriately, or can this be done in SQL? Thank you for your time and help.
CodePudding user response:
The query I came up with is the following:
SELECT
    t1.ticker, t1.date, t2.change_10
FROM
    api_results AS t1
INNER JOIN (
    SELECT
        ticker,
        date,
        SUM(change) OVER (
            PARTITION BY ticker
            ORDER BY date
            ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING
        ) AS change_10
    FROM
        historical_data
) AS t2
    ON t1.ticker = t2.ticker AND t1.date = t2.date
The sub-query builds the forward-looking rolling sum over the changes in historical_data, which I called change_10: for each row, the window covers the current row plus the 9 following rows within the same ticker, i.e. the same 10 rows as your index:index + 10 slice. (ROWS is used rather than RANGE because the span is defined by row count, and SQLite only allows RANGE offsets when the ORDER BY term is numeric.) The inner join against api_results then keeps only the ticker and date combinations you need.
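Since you are running this from Python anyway, the query can be fed straight into pandas. Below is a minimal sketch reusing the connection setup from your question; the index name is my own hypothetical choice, and note that window functions require SQLite 3.25 or later. A composite index on (ticker, date) generally helps both the window's partitioning and the final join on a 1M-row table.

import sqlite3
import pandas as pd
from config import db_path

connection = sqlite3.connect(db_path)

# Hypothetical index; a composite index on (ticker, date) usually speeds up
# both the window's PARTITION BY/ORDER BY and the join. No-op if it exists.
connection.execute(
    "CREATE INDEX IF NOT EXISTS idx_hist_ticker_date "
    "ON historical_data (ticker, date)"
)

query = """
SELECT t1.ticker, t1.date, t2.change_10
FROM api_results AS t1
INNER JOIN (
    SELECT ticker, date,
           SUM(change) OVER (
               PARTITION BY ticker
               ORDER BY date
               ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING
           ) AS change_10
    FROM historical_data
) AS t2
    ON t1.ticker = t2.ticker AND t1.date = t2.date
"""
result = pd.read_sql(query, connection)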
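If you would rather stay in pandas, the vectorized approach you describe can also be done without a row-by-row loop, using a reversed rolling window per ticker. This is a rough sketch under the assumption that Date is stored in a lexicographically sortable format such as YYYY-MM-DD; the column names are reused from your question:

import sqlite3
import pandas as pd
from config import db_path

connection = sqlite3.connect(db_path)

# Sort so that row positions within each ticker follow date order
hist = pd.read_sql(
    "SELECT Ticker, Date, Change FROM historical_data ORDER BY Ticker, Date",
    connection,
)

# Forward-looking 10-row sum: reverse each ticker's series, take an ordinary
# trailing rolling sum, then reverse back. min_periods=1 lets windows near
# the end of a ticker's history sum whatever rows remain, as your loop does.
hist["change_10"] = hist.groupby("Ticker")["Change"].transform(
    lambda s: s[::-1].rolling(10, min_periods=1).sum()[::-1]
)

api = pd.read_sql("SELECT ticker, date FROM api_results", connection)

# Keep only the ticker/date pairs requested in api_results
result = api.merge(
    hist.rename(columns={"Ticker": "ticker", "Date": "date"}),
    on=["ticker", "date"],
    how="inner",
)[["ticker", "date", "change_10"]]

Because the rolling sum and the merge are both vectorized, this should finish in seconds rather than hours on 1M rows.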