I have around 10,000 sets of data arriving each minute containing the number of people who are observed in each of 4624 grid references (0-4623)
An example of the data:
[2022-07-11 19:45:00]
[646] => 1
[647] => 1
[648] => 1
[776] => 1
[777] => 1
[2465] => 2
[2466] => 1
[2467] => 2
...
...
I'm planning to store these in a table like this, although open to suggestions (500 stores each with ~20 IP addresses), data per minute per IP
CREATE TABLE grid_counters
(
`storeID` UInt16,
`camIP` IPv4,
`timestamp` DateTime,
`grid_counts` Map(UInt16, UInt8)
)
ENGINE = MergeTree
PARTITION BY (toMonday(timestamp), storeID)
ORDER BY (timestamp, storeID)
Example Row
storeID | camIP | timestamp | grid_counts |
---|---|---|---|
404 | 192.168.2.156 | 2022-07-11 19:47:00 | {646:1,647:1,648:1,776:1,777:1,2465:2,2466:1,2467:2} |
My question is querying these records. I need to find out how many people were observed within an arbitrary time period (upto a day) who were present in any of a an arbitrary number of grid references.
I've tried SumMap, however that doesn't seem to be what I thought it was.
SELECT sumMap(grid_counts, [1, 2, 2535, 646, 647]) as ct FROM grid_counters WHERE
storeID = 404 AND
camIP = IPv4StringToNum('192.168.2.156') AND
timestamp >= '2022-07-11 19:30:00' AND
timestamp <= '2022-07-11 20:59:59' AND
hasAny(mapKeys(grid_counts), [1, 2, 2535, 646, 647])
While I can just select and parse the grid_counts after the query I would much prefer to return a total number of people directly from the query
SELECT grid_counts FROM grid_counters WHERE
storeID = 404 AND
camIP = IPv4StringToNum('192.168.2.156') AND
timestamp >= '2022-07-11 19:30:00' AND
timestamp <= '2022-07-11 20:59:59' AND
hasAny(mapKeys(grid_counts), [1, 2, 2535, 646, 647])
...namely the sum of all values matched across all rows - from the above query this would be the sum of values from matched keys: [2465, 2535, 2399, 646, 647]
For this row {646:1,647:1,648:1,776:1,777:1,2465:2,2466:1,2467:2}
it would sum the total for matched keys [646, 647] - total (2)
For this row {2535:3,647:1,99:10}
it would sum the total for matched keys [2535, 647]: (4)
I'm trying to find the sum of all totals - for all rows: (2 4) = 6
What should I use in place of sumMap to calculate the sum of all values with matching keys?
Help is much appreciated.
Edit Thanks to Denny Crane's two array advice and solution I settled upon:
SELECT arraySum(toSum) AS total
FROM
(
SELECT
(sumMapFiltered([2057, 2058, 2124, 2125, 2126, 9])(grid_count_keys, grid_count_values) AS r).1,
r.2 AS toSum
FROM grid_counters
)
┌─total─┐
│ 113 │
└───────┘
CodePudding user response:
select map(646,1,647,1) union all select map(646,3);
┌─map(646, 1, 647, 1)─┐
│ {646:3} │
│ {646:1,647:1} │
└─────────────────────┘
select sumMap(x)
from (
select map(646,1,647,1) x union all select map(646,3)
);
┌─sumMap(x)─────┐
│ {646:4,647:1} │
└───────────────┘
select sumMapFiltered([646])((mapKeys(x), mapValues(x)))
from (
select map(646,1,647,1) x union all select map(646,3)
);
┌─sumMapFiltered([646])(tuple(mapKeys(x), mapValues(x)))─┐
│ ([646],[4]) │
└────────────────────────────────────────────────────────┘
I think you should use 2 Arrays instead of Map (Map is unstable and slower).
`grid_counts_keys` Array(UInt16),
`grid_counts_values` Array(UInt8)
select sumMapFiltered([646])(grid_counts_keys, grid_counts_values) from (
select [646,647] grid_counts_keys,[1,1] grid_counts_values
union all select [646],[3]
);
┌─sumMapFiltered([646])(grid_counts_keys, grid_counts_values)─┐
│ ([646],[4]) │
└─────────────────────────────────────────────────────────────┘
select sumMap((arrayFilter(i->i.1=646, arrayZip(grid_counts_keys, grid_counts_values) ) as x).1, x.2) r
from (
select [646,647] grid_counts_keys,[1,1] grid_counts_values
union all select [646],[3]
);
┌─r───────────┐
│ ([646],[4]) │
└─────────────┘