How to get average values from dictionary looking values using SQL?-CodePudding

my dataframe looks like this:

id            value
a       0:3,1:0,2:0,3:4
a       0:0,1:0,2:2,3:0
a       0:0,1:5,2:4,3:0

I want to write a query to get average values of keys in column value?

So for example for 0:3,1:0,2:0,3:4 it must be (0 0 0 3 3 3 3)/7 = 1.71.

For 0:0,1:0,2:2,3:0 it must be (2 2)/2=2.

For 0:0,1:5,2:4,3:0 it must be (1 1 1 1 1 2 2 2 2)/9 = 1.44.

So desired result is:

id            value
a              1.71
a              2.00
a              1.44

How to do that? Are there sql functions to get this result?

CodePudding user response：

See this DBFIDDLE

code:

CREATE  PROCEDURE `avg_dict`(s varchar(100))
BEGIN
  SET @result = CONCAT('SELECT (', replace(replace(s, ":","*"),","," "), ')/(',regexp_replace(s,",?[0-9]:"," "),')');
  PREPARE stmt FROM @result;
  EXECUTE stmt  ;
  DEALLOCATE PREPARE stmt;
END

results:

stmt	output
`CALL avg_dict("0:3,1:0,2:0,3:4");`	1.1743
`CALL avg_dict("0:0,1:0,2:2,3:0");`	2.0000
`CALL avg_dict("0:0,1:5,2:4,3:0");`	1.4444

CodePudding user response：

With some combination of split's, transforms and repeat you can achieve your goal:

WITH dataset(id, value) AS (
    values ('a', '0:3,1:0,2:0,3:4'),
        ('a', '0:0,1:0,2:2,3:0'),
        ('a', '0:0,1:5,2:4,3:0')
)

SELECT id,
    reduce(arr, 0.0, (s, x)->s   x, s->s) / cardinality(arr)
FROM(
        SELECT *,
            flatten(
                transform(
                    transform(
                        split(value, ','),
                        s->split(s, ':')
                    ),
                    arr->repeat(
                        cast(arr [ 1 ] as INTEGER),
                        cast(arr [ 2 ] as INTEGER)
                    )
                )
            ) as arr
        FROM dataset
    )

Output:

id	_col1
a	1.7142857142857142
a	2.0
a	1.4444444444444444

Note:

Outer select can be substituted with array_average but I used he select cause Athena's version of Presto does not support it.

UPD

Another version which can be more performant:

SELECT id,
    reduce(
        arr,
        CAST(ROW(0.0, 0) AS ROW(sum DOUBLE, count INTEGER)),
        (s, r)->CAST(
            ROW(r.num * r.count   s.sum, s.count   r.count) AS ROW(sum DOUBLE, count INTEGER)
        ),
        s->IF(s.count = 0, NULL, s.sum / s.count)
    )
FROM(
        SELECT *,
            transform(
                split(value, ','),
                s->CAST(
                    ROW(
                        CAST(split(s, ':') [ 1 ] AS DOUBLE),
                        (CAST(split(s, ':') [ 2 ] AS INTEGER))
                    ) AS ROW(num DOUBLE, count INTEGER)
                )
            ) as arr
        FROM dataset
    )