How to do this SQL operation with Python?-CodePudding

I have a sql operation:

    SELECT
        date,
        type,
        SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type) value
    FROM my_df

I know that SUM can be done with groupby().sum(), but I don't understand what should be for SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type) How to do that with Python?

CodePudding user response：

Assume you have dataframe as df. SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type) is SQL window function that's roughly equivalent to cumsum in pandas:

df.value = df.value.fillna(0)    # == COALESCE(value, 0)
df['value'] = df.sort_values('type').groupby('date').value.cumsum()   # == SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type)
df[['date', 'type', 'value']]    # == SELECT ... FROM df

CodePudding user response：

Note: this answer is not correct, as it assumes COALESCE works as a window function, which it doesn't. However, I'm leaving it here in case someone is looking for exactly that answer anyway. OP likely needs what @psidom posted.

Since the Transact-SQL selects the first non-null value for each date, when ordering the rows for that date by type (and selects 0 in case there are no non-null values), this does the same using pandas (which I'm assuming you're using as you refer to DataFrame):

from pandas import DataFrame

data = [
    # should result in 20 for this date, first that's not None once sorted
    {'date':'2021-01-01', 'type':'a', 'value': None},
    {'date':'2021-01-01', 'type':'c', 'value': 10},
    {'date':'2021-01-01', 'type':'b', 'value': 20},
    # should result in 10 for this date, only value
    {'date':'2021-01-02', 'type':'a', 'value': 10},
    # should result in 0 for this date, as there are no values that are not None
    {'date':'2021-01-03', 'type':'a', 'value': None},
]

df = DataFrame(data)
print(df.iloc[
    df.sort_values('type')
        .groupby('date')
        .apply(
            lambda x: x['value'].first_valid_index()
        ).fillna(0).astype(int)
]['value'])

I wrote this naively chaining the operations needed together - it may be possible to optimise it a bit.