Home > Software engineering >  ClickHouse querying the sum of MAP values, matching list of keys
ClickHouse querying the sum of MAP values, matching list of keys

Time:07-13

I have around 10,000 sets of data arriving each minute containing the number of people who are observed in each of 4624 grid references (0-4623)

An example of the data:

[2022-07-11 19:45:00]
  [646] => 1
  [647] => 1
  [648] => 1
  [776] => 1
  [777] => 1
  [2465] => 2
  [2466] => 1
  [2467] => 2
  ...
  ...

I'm planning to store these in a table like this, although open to suggestions (500 stores each with ~20 IP addresses), data per minute per IP

CREATE TABLE grid_counters
(
    `storeID` UInt16,
    `camIP` IPv4,
    `timestamp` DateTime,
    `grid_counts` Map(UInt16, UInt8)
)
ENGINE = MergeTree
PARTITION BY (toMonday(timestamp), storeID)
ORDER BY (timestamp, storeID)

Example Row

storeID camIP timestamp grid_counts
404 192.168.2.156 2022-07-11 19:47:00 {646:1,647:1,648:1,776:1,777:1,2465:2,2466:1,2467:2}

My question is querying these records. I need to find out how many people were observed within an arbitrary time period (upto a day) who were present in any of a an arbitrary number of grid references.

I've tried SumMap, however that doesn't seem to be what I thought it was.

SELECT sumMap(grid_counts, [1, 2, 2535, 646, 647]) as ct FROM grid_counters WHERE 
storeID = 404 AND 
camIP = IPv4StringToNum('192.168.2.156') AND
timestamp >= '2022-07-11 19:30:00' AND 
timestamp <= '2022-07-11 20:59:59' AND 
hasAny(mapKeys(grid_counts), [1, 2, 2535, 646, 647])

While I can just select and parse the grid_counts after the query I would much prefer to return a total number of people directly from the query

SELECT grid_counts FROM grid_counters WHERE
storeID = 404 AND
camIP = IPv4StringToNum('192.168.2.156') AND
timestamp >= '2022-07-11 19:30:00' AND
timestamp <= '2022-07-11 20:59:59' AND
hasAny(mapKeys(grid_counts), [1, 2, 2535, 646, 647])

...namely the sum of all values matched across all rows - from the above query this would be the sum of values from matched keys: [2465, 2535, 2399, 646, 647]

For this row {646:1,647:1,648:1,776:1,777:1,2465:2,2466:1,2467:2} it would sum the total for matched keys [646, 647] - total (2)

For this row {2535:3,647:1,99:10} it would sum the total for matched keys [2535, 647]: (4)

I'm trying to find the sum of all totals - for all rows: (2 4) = 6

What should I use in place of sumMap to calculate the sum of all values with matching keys?

Help is much appreciated.

Edit Thanks to Denny Crane's two array advice and solution I settled upon:

SELECT arraySum(toSum) AS total
FROM
(
    SELECT
        (sumMapFiltered([2057, 2058, 2124, 2125, 2126, 9])(grid_count_keys, grid_count_values) AS r).1,
        r.2 AS toSum
    FROM grid_counters
)

┌─total─┐
│   113 │
└───────┘

CodePudding user response:

select map(646,1,647,1) union all select map(646,3);
┌─map(646, 1, 647, 1)─┐
│ {646:3}             │
│ {646:1,647:1}       │
└─────────────────────┘

select sumMap(x) 
from (
   select map(646,1,647,1) x union all select map(646,3)
);
┌─sumMap(x)─────┐
│ {646:4,647:1} │
└───────────────┘

select sumMapFiltered([646])((mapKeys(x), mapValues(x))) 
from (
    select map(646,1,647,1) x union all select map(646,3)
);
┌─sumMapFiltered([646])(tuple(mapKeys(x), mapValues(x)))─┐
│ ([646],[4])                                            │
└────────────────────────────────────────────────────────┘

I think you should use 2 Arrays instead of Map (Map is unstable and slower).

`grid_counts_keys` Array(UInt16),
`grid_counts_values`  Array(UInt8)
select sumMapFiltered([646])(grid_counts_keys, grid_counts_values) from (
   select [646,647] grid_counts_keys,[1,1] grid_counts_values 
   union all select [646],[3]
);
┌─sumMapFiltered([646])(grid_counts_keys, grid_counts_values)─┐
│ ([646],[4])                                                 │
└─────────────────────────────────────────────────────────────┘

select sumMap((arrayFilter(i->i.1=646, arrayZip(grid_counts_keys, grid_counts_values) ) as x).1, x.2) r 
from (
  select [646,647] grid_counts_keys,[1,1] grid_counts_values 
  union all select [646],[3]
);
┌─r───────────┐
│ ([646],[4]) │
└─────────────┘
  • Related