my dataframe looks like this:
id value
a 0:3,1:0,2:0,3:4
a 0:0,1:0,2:2,3:0
a 0:0,1:5,2:4,3:0
I want to write a query to get average values of keys in column value?
So for example for 0:3,1:0,2:0,3:4
it must be (0 0 0 3 3 3 3)/7 = 1.71
.
For 0:0,1:0,2:2,3:0
it must be (2 2)/2=2
.
For 0:0,1:5,2:4,3:0
it must be (1 1 1 1 1 2 2 2 2)/9 = 1.44
.
So desired result is:
id value
a 1.71
a 2.00
a 1.44
How to do that? Are there sql functions to get this result?
CodePudding user response:
See this DBFIDDLE
code:
CREATE PROCEDURE `avg_dict`(s varchar(100))
BEGIN
SET @result = CONCAT('SELECT (', replace(replace(s, ":","*"),","," "), ')/(',regexp_replace(s,",?[0-9]:"," "),')');
PREPARE stmt FROM @result;
EXECUTE stmt ;
DEALLOCATE PREPARE stmt;
END
results:
stmt | output |
---|---|
CALL avg_dict("0:3,1:0,2:0,3:4"); |
1.1743 |
CALL avg_dict("0:0,1:0,2:2,3:0"); |
2.0000 |
CALL avg_dict("0:0,1:5,2:4,3:0"); |
1.4444 |
CodePudding user response:
With some combination of split
's, transform
s and repeat
you can achieve your goal:
WITH dataset(id, value) AS (
values ('a', '0:3,1:0,2:0,3:4'),
('a', '0:0,1:0,2:2,3:0'),
('a', '0:0,1:5,2:4,3:0')
)
SELECT id,
reduce(arr, 0.0, (s, x)->s x, s->s) / cardinality(arr)
FROM(
SELECT *,
flatten(
transform(
transform(
split(value, ','),
s->split(s, ':')
),
arr->repeat(
cast(arr [ 1 ] as INTEGER),
cast(arr [ 2 ] as INTEGER)
)
)
) as arr
FROM dataset
)
Output:
id | _col1 |
---|---|
a | 1.7142857142857142 |
a | 2.0 |
a | 1.4444444444444444 |
Note:
Outer select can be substituted with array_average
but I used he select cause Athena's version of Presto does not support it.
UPD
Another version which can be more performant:
SELECT id,
reduce(
arr,
CAST(ROW(0.0, 0) AS ROW(sum DOUBLE, count INTEGER)),
(s, r)->CAST(
ROW(r.num * r.count s.sum, s.count r.count) AS ROW(sum DOUBLE, count INTEGER)
),
s->IF(s.count = 0, NULL, s.sum / s.count)
)
FROM(
SELECT *,
transform(
split(value, ','),
s->CAST(
ROW(
CAST(split(s, ':') [ 1 ] AS DOUBLE),
(CAST(split(s, ':') [ 2 ] AS INTEGER))
) AS ROW(num DOUBLE, count INTEGER)
)
) as arr
FROM dataset
)