How to Config ScyllaDB for million rows: run simple query count(*) return error : timeout during rea-CodePudding

I have table with simple table :

create table if not exists keyspace_test.table_test
(
    id          int,
    date          text,
    val float,
    primary key (id, date)
)
    with caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
     and compaction = {'class': 'SizeTieredCompactionStrategy'}
     and compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
     and dclocal_read_repair_chance = 0
     and speculative_retry = '99.0PERCENTILE'
     and read_repair_chance = 1;

After that i import 12 million rows. Than i want to run simple calculation count rows & sum column val. With this query :

SELECT COUNT(*), SUM(val)
FROM keyspace_test.table_test

but show error :

Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)

I am already add USING TIMEOUT 180s; but show error :

 Timed out waiting for server response

Configuration server that i use are in 2 location datacenter. Each datacenter has 4 server.

# docker exec -it scylla-120 nodetool status
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns    Host ID                               Rack
UN  10.3.192.25  79.04 GB   256          ?       5975a143fec6  Rack1
UN  10.3.192.24  74.2 GB    256          ?       61dc1cfd3e92  Rack1
UN  10.3.192.22  88.21 GB   256          ?       0d24d52d6b0a  Rack1
UN  10.3.192.23  63.41 GB   256          ?       962s266518ee  Rack1
Datacenter: dc3
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns    Host ID                               Rack
UN  34.77.78.21    83.5 GB    256          ?       5112f248dd38  Rack1
UN  34.77.78.20    59.87 GB   256          ?       e8db897ca33b  Rack1
UN  34.77.78.48    81.32 GB   256          ?       cb88bd9326db  Rack1
UN  34.77.78.47    79.8 GB    256          ?       562a721d4b77  Rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

And i create keyspace with :

CREATE KEYSPACE keyspace_test WITH replication = { 'class' : 'NetworkTopologyStrategy', 'dc2' : 3, 'dc3' : 3};

How realay config this scylla with million rows data ?

CodePudding user response：

Not sure about SUM, but you could use DSBulk to count the rows in a table.

dsbulk count \
    -k keyspace_test \
    -t table_test \
    -u username \
    -p password \
    -h 10.3.192.25

DSBulk takes token range ownership into account, so it's not as stressful on the cluster.

CodePudding user response：

A explained in the Scylla documentation (https://docs.scylladb.com/stable/kb/count-all-rows.html) a COUNT requires scanning the entire database, which can take a long time, so using USING TIMEOUT like you did is indeed the right thing.

I don't know whether or not 180 seconds for scanning 12 million rows on your table is a long enough timeout - to be sure you can try increasing this to 3600 seconds and see if it ever finishes, or try a full-table scan (not just a count) so see how fast it progresses to be able to estimate how long a count() might take (a count() should take less time than an actual scan returning data, but not much less - it still does all the same IO).

Also, it is important to note that until very recently, COUNT was implemented inefficiently - it proceeded sequentially instead of utilizing all the shards in the system. This was fixed in https://github.com/scylladb/scylladb/commit/fe65122ccd40a2a3577121aebdb9a5b50deb4a90 - but it only reached Scylla 5.1 (or the master branch), are you using an older version of Scylla? The example in that commit suggests that the new implementation may be as much as 30 times faster as the old one!

So hopefully, on Scylla 5.1 a much lower timeout would be enough for your COUNT operation to finish. On older versions, you can emulate what Scylla 5.1 does manually: Divide the token range into parts, and invoke a partial COUNT on each of these token ranges, all in parallel, and then sum up the results from all the different ranges.