Home > Software engineering >  How to count the number of instances of a substring in sqlite or duckdb across all rows?
How to count the number of instances of a substring in sqlite or duckdb across all rows?

Time:07-26

I have a simple table. It's format is:

Column 1: Integer
Column 2: String which consists of a bunch of words separated by spaces

An example of column 2 might be: "foo bar bax bux qux"

Given 2 rows:

  • (1, "foo bar baz bux qux")
  • (2, "foo baz bux qux bax")

I would like to generate the following data-structure:

{
    foo: 2,
    baz: 2,
    bux: 1,
    qux: 2,
    bax: 1
}

There will be thousands of tags.

Essentially, I want to calculate the # of occurrences of each word across all rows. What might be a good way to do this?

I can either use sqlite or duckdb.

CodePudding user response:

Here's a great answer from the DuckDB people:

Two functions will do the trick for you! String_split, and unnest. Unnest is a special function that takes lists and creates a separate row for each element.

With lists_split_into_rows as (
select 
  col1,
  unnest(string_split(col2, ' ')) as new_column
from my_table
)
Select 
  new_column,
  count(*) as my_count
from lists_split_into_rows
Group by
  New_column

Source: https://github.com/duckdb/duckdb/issues/4169

  • Related