Home > Mobile >  How to do this SQL operation with Python?
How to do this SQL operation with Python?

Time:10-01

I have a sql operation:

    SELECT
        date,
        type,
        SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type) value
    FROM my_df

I know that SUM can be done with groupby().sum(), but I don't understand what should be for SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type) How to do that with Python?

CodePudding user response:

Assume you have dataframe as df. SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type) is SQL window function that's roughly equivalent to cumsum in pandas:

df.value = df.value.fillna(0)    # == COALESCE(value, 0)
df['value'] = df.sort_values('type').groupby('date').value.cumsum()   # == SUM(COALESCE(value, 0)) OVER (PARTITION BY date ORDER BY type)
df[['date', 'type', 'value']]    # == SELECT ... FROM df

CodePudding user response:

Note: this answer is not correct, as it assumes COALESCE works as a window function, which it doesn't. However, I'm leaving it here in case someone is looking for exactly that answer anyway. OP likely needs what @psidom posted.

Since the Transact-SQL selects the first non-null value for each date, when ordering the rows for that date by type (and selects 0 in case there are no non-null values), this does the same using pandas (which I'm assuming you're using as you refer to DataFrame):

from pandas import DataFrame

data = [
    # should result in 20 for this date, first that's not None once sorted
    {'date':'2021-01-01', 'type':'a', 'value': None},
    {'date':'2021-01-01', 'type':'c', 'value': 10},
    {'date':'2021-01-01', 'type':'b', 'value': 20},
    # should result in 10 for this date, only value
    {'date':'2021-01-02', 'type':'a', 'value': 10},
    # should result in 0 for this date, as there are no values that are not None
    {'date':'2021-01-03', 'type':'a', 'value': None},
]

df = DataFrame(data)
print(df.iloc[
    df.sort_values('type')
        .groupby('date')
        .apply(
            lambda x: x['value'].first_valid_index()
        ).fillna(0).astype(int)
]['value'])

I wrote this naively chaining the operations needed together - it may be possible to optimise it a bit.

  • Related