Home > Enterprise >  like query count does not match
like query count does not match

Time:07-22

If the total number of records is x and the count of "like" query is y then the count of "not like" query should be x - y

I am getting "x" as the count of "not like" query as shown below:

SELECT COUNT(DISTINCT(b.word)) 
FROM "hunspell"."oscar2_sorted" AS b

Total Count: 9597651

SELECT COUNT(distinct(b.word))
FROM "hunspell"."oscar2_sorted" as b 
INNER JOIN invalidswar AS a 
ON b.word LIKE (CONCAT('%', a.word,'%'))

Like count: 73116

SELECT COUNT(distinct(b.word)) 
FROM "hunspell"."oscar2_sorted" AS b 
INNER JOIN invalidswar AS a
ON b.word NOT LIKE (CONCAT('%', a.word,'%'))

Not Like count: 9597651

Expected: 9524535

I am not sure what am I missing.


Update:

Left join count is close to expected, but still does not look correct.

SELECT COUNT(DISTINCT(b.word))
FROM "hunspell"."oscar2_sorted" AS b 
LEFT JOIN (SELECT DISTINCT(b.word) AS dword 
           FROM "hunspell"."oscar2_sorted" AS b 
           INNER JOIN invalidswar AS a 
           ON b.word LIKE (CONCAT('%', a.word,'%'))) AS d 
ON d.dword = b.word 
WHERE d.dword IS NULL

left join count: 9536539


update 2:

a difference of 12004 is traced back to the difference of how like and regexp_like is executed.

SELECT count(distinct(b.word)) 
          FROM "hunspell"."oscar2_sorted" as b
         INNER JOIN invalidswar AS a 
            ON regexp_like(b.word, a.word)

regex like count: 61112

CodePudding user response:

SELECT COUNT(word)
  FROM (SELECT word 
          FROM "hunspell"."oscar2_sorted"
        EXCEPT DISTINCT
       (SELECT b.word
          FROM "hunspell"."oscar2_sorted" as b
         INNER JOIN invalidswar AS a 
            ON regexp_like(b.word, a.word)))

See: EXCEPT clause and regexp_like

CodePudding user response:

WITH
  invalid_check AS
(
  SELECT
    o.word,
    CASE WHEN
      EXISTS (
        SELECT *
          FROM invalidswar AS i
         WHERE o.word LIKE CONCAT('%', i.word,'%')
      )
    THEN
      1
    ELSE
      0
    END
      AS is_invalid
  FROM
    "hunspell"."oscar2_sorted"   AS o
  GROUP BY
    o.word
)
SELECT
  COUNT(*)            AS all_words,
  SUM(is_invalid)     AS invalid_words,
  SUM(1-is_invalid)   AS valid_words
FROM
  valid_check

Assuming the above matches your expectations, just counting valid words could then be...

SELECT
  COUNT(DISTINCT o.word)
FROM
  "hunspell"."oscar2_sorted"   AS o
WHERE
  NOT EXISTS (
    SELECT *
      FROM invalidswar AS i
     WHERE o.word LIKE CONCAT('%', i.word,'%')
  )
  • Related