I have a sql operation:
SELECT
date,
type,
SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type) value
FROM my_df
I know that SUM can be done with groupby().sum()
, but I don't understand what should be for SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type)
How to do that with Python?
CodePudding user response:
Assume you have dataframe as df
. SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type)
is SQL window function that's roughly equivalent to cumsum
in pandas:
df.value = df.value.fillna(0) # == COALESCE(value, 0)
df['value'] = df.sort_values('type').groupby('date').value.cumsum() # == SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type)
df[['date', 'type', 'value']] # == SELECT ... FROM df
CodePudding user response:
Note: this answer is not correct, as it assumes COALESCE works as a window function, which it doesn't. However, I'm leaving it here in case someone is looking for exactly that answer anyway. OP likely needs what @psidom posted.
Since the Transact-SQL selects the first non-null value for each date, when ordering the rows for that date by type (and selects 0 in case there are no non-null values), this does the same using pandas
(which I'm assuming you're using as you refer to DataFrame):
from pandas import DataFrame
data = [
# should result in 20 for this date, first that's not None once sorted
{'date':'2021-01-01', 'type':'a', 'value': None},
{'date':'2021-01-01', 'type':'c', 'value': 10},
{'date':'2021-01-01', 'type':'b', 'value': 20},
# should result in 10 for this date, only value
{'date':'2021-01-02', 'type':'a', 'value': 10},
# should result in 0 for this date, as there are no values that are not None
{'date':'2021-01-03', 'type':'a', 'value': None},
]
df = DataFrame(data)
print(df.iloc[
df.sort_values('type')
.groupby('date')
.apply(
lambda x: x['value'].first_valid_index()
).fillna(0).astype(int)
]['value'])
I wrote this naively chaining the operations needed together - it may be possible to optimise it a bit.