Home > front end >  Pandas/SQL only return results where groupby has at least n rows
Pandas/SQL only return results where groupby has at least n rows

Time:11-13

I have a stock database of the following form where data is given by the minute.

enter image description here

I want to fetch all results which have at least N rows for a particular ticker, for a particular date. (when grouped by ticker and date)

The purpose is to ignore missing data for a given ticker on a given date. There should be at least 629 rows for any given ticker,date combination. Anything less than this, I want to ignore that combination.

I tried something like the following but it wasn't working

start_date = '2021-11-08'
SQL = "SELECT datetime, datetime_utc, ticker,open,close,session_type, high,low,volume FROM candles WHERE id in (SELECT id from candles WHERE session_type='Live' GROUP BY datetime, ticker HAVING COUNT(ID) >= 629) AND datetime >= '{} 00:00:00'::timestamp ORDER BY ticker, datetime_utc ASC".format(start_date)

Could anyone point me at the right SQL incantation? I am using postgres > 9.6.

sample data (apologies i tried to set up an SQL fiddle but it wasn't allowing me to due to character limitations, then data type limitations. The data is below in json, set to be read by pandas)

df = pd.io.json.read_json('{"datetime":{"12372":1636572720000,"12351":1636571460000,"12493":1636590240000,"15210":1636633380000,"16212":1636642500000,"16009":1636560060000,"12213":1636554180000,"16386":1636657800000,"13580":1636572660000,"14733":1636659000000,"12086":1636537200000,"14802":1636667880000,"14407":1636585980000,"14356":1636577640000,"16086":1636574280000,"15437":1636661040000,"14115":1636577400000,"14331":1636576140000,"12457":1636582800000,"14871":1636678080000},"datetime_utc":{"12372":1636572720000,"12351":1636571460000,"12493":1636590240000,"15210":1636633380000,"16212":1636642500000,"16009":1636560060000,"12213":1636554180000,"16386":1636657800000,"13580":1636572660000,"14733":1636659000000,"12086":1636537200000,"14802":1636667880000,"14407":1636585980000,"14356":1636577640000,"16086":1636574280000,"15437":1636661040000,"14115":1636577400000,"14331":1636576140000,"12457":1636582800000,"14871":1636678080000},"ticker":{"12372":"AAPL","12351":"AAPL","12493":"AAPL","15210":"AAPL","16212":"NET","16009":"NET","12213":"AAPL","16386":"NET","13580":"NET","14733":"AAPL","12086":"AAPL","14802":"AAPL","14407":"AAPL","14356":"AAPL","16086":"NET","15437":"AAPL","14115":"NET","14331":"AAPL","12457":"AAPL","14871":"AAPL"},"open":{"12372":148.29,"12351":148.44,"12493":148.1,"15210":148.68,"16212":199.72,"16009":202.27,"12213":150.21,"16386":198.65,"13580":194.9,"14733":147.94,"12086":150.2,"14802":147.87,"14407":148.1,"14356":148.09,"16086":193.82,"15437":148.01,"14115":194.64,"14331":148.07,"12457":148.12,"14871":148.2},"close":{"12372":148.32,"12351":148.44,"12493":148.15,"15210":148.69,"16212":199.32,"16009":202.52,"12213":150.25,"16386":198.57,"13580":194.96,"14733":147.99,"12086":150.17,"14802":147.9,"14407":148.1,"14356":147.99,"16086":194.43,"15437":148.01,"14115":194.78,"14331":148.05,"12457":148.11,"14871":148.28},"session_type":{"12372":"Live","12351":"Live","12493":"Post","15210":"Pre","16212":"Live","16009":"Live","12213":"Pre","16386":"Live","13580":"Live","14733":"Live","12086":"Pre","14802":"Post","14407":"Post","14356":"Live","16086":"Live","15437":"Live","14115":"Live","14331":"Live","12457":"Post","14871":"Post"},"high":{"12372":148.3600006104,"12351":148.4700012207,"12493":148.15,"15210":148.69,"16212":199.8399963379,"16009":202.5249938965,"12213":150.25,"16386":198.6499938965,"13580":195.0299987793,"14733":147.9900054932,"12086":150.2,"14802":147.9,"14407":148.1,"14356":148.1049957275,"16086":194.4400024414,"15437":148.0399932861,"14115":195.1699981689,"14331":148.0800018311,"12457":148.12,"14871":148.28},"low":{"12372":148.26,"12351":148.38,"12493":148.06,"15210":148.68,"16212":199.15,"16009":202.27,"12213":150.2,"16386":198.49,"13580":194.79,"14733":147.93,"12086":150.16,"14802":147.85,"14407":148.1,"14356":147.98,"16086":193.82,"15437":148.0,"14115":194.64,"14331":148.01,"12457":148.07,"14871":148.2},"volume":{"12372":99551.0,"12351":68985.0,"12493":0.0,"15210":0.0,"16212":9016.0,"16009":2974.0,"12213":0.0,"16386":1395.0,"13580":5943.0,"14733":59854.0,"12086":0.0,"14802":0.0,"14407":0.0,"14356":341196.0,"16086":8715.0,"15437":45495.0,"14115":16535.0,"14331":173785.0,"12457":0.0,"14871":0.0},"date":{"12372":1636502400000,"12351":1636502400000,"12493":1636502400000,"15210":1636588800000,"16212":1636588800000,"16009":1636502400000,"12213":1636502400000,"16386":1636588800000,"13580":1636502400000,"14733":1636588800000,"12086":1636502400000,"14802":1636588800000,"14407":1636502400000,"14356":1636502400000,"16086":1636502400000,"15437":1636588800000,"14115":1636502400000,"14331":1636502400000,"12457":1636502400000,"14871":1636588800000}}')

Expected result

if N = 4, then the results will exclude NET on 2021-11-11. (count of all shown here for illustration) enter image description here

Like below (sample only) enter image description here

CodePudding user response:

You can use analytical function (window function) to count rows for each date and ticker combination.

select datetime, datetime_utc, ticker,open,close,session_type, high,low,volume 
from (
    SELECT datetime, datetime_utc, ticker,open,close,session_type, high,low,volume,
           COUNT(*) OVER(PARTITION BY CAST(datetime AS DATE), ticker) as C
    FROM candles 
)
WHERE C >= 5

I am not familiar with postgres, so it may contain syntax problem.

CodePudding user response:

You can use the enter image description here

  • Related