If the total number of records is x and the count of "like" query is y then the count of "not like" query should be x - y
I am getting "x" as the count of "not like" query as shown below:
SELECT COUNT(DISTINCT(b.word))
FROM "hunspell"."oscar2_sorted" AS b
Total Count: 9597651
SELECT COUNT(distinct(b.word))
FROM "hunspell"."oscar2_sorted" as b
INNER JOIN invalidswar AS a
ON b.word LIKE (CONCAT('%', a.word,'%'))
Like count: 73116
SELECT COUNT(distinct(b.word))
FROM "hunspell"."oscar2_sorted" AS b
INNER JOIN invalidswar AS a
ON b.word NOT LIKE (CONCAT('%', a.word,'%'))
Not Like count: 9597651
Expected: 9524535
I am not sure what am I missing.
Update:
Left join count is close to expected, but still does not look correct.
SELECT COUNT(DISTINCT(b.word))
FROM "hunspell"."oscar2_sorted" AS b
LEFT JOIN (SELECT DISTINCT(b.word) AS dword
FROM "hunspell"."oscar2_sorted" AS b
INNER JOIN invalidswar AS a
ON b.word LIKE (CONCAT('%', a.word,'%'))) AS d
ON d.dword = b.word
WHERE d.dword IS NULL
left join count: 9536539
update 2:
a difference of 12004 is traced back to the difference of how like and regexp_like is executed.
SELECT count(distinct(b.word))
FROM "hunspell"."oscar2_sorted" as b
INNER JOIN invalidswar AS a
ON regexp_like(b.word, a.word)
regex like count: 61112
CodePudding user response:
SELECT COUNT(word)
FROM (SELECT word
FROM "hunspell"."oscar2_sorted"
EXCEPT DISTINCT
(SELECT b.word
FROM "hunspell"."oscar2_sorted" as b
INNER JOIN invalidswar AS a
ON regexp_like(b.word, a.word)))
See: EXCEPT clause and regexp_like
CodePudding user response:
WITH
invalid_check AS
(
SELECT
o.word,
CASE WHEN
EXISTS (
SELECT *
FROM invalidswar AS i
WHERE o.word LIKE CONCAT('%', i.word,'%')
)
THEN
1
ELSE
0
END
AS is_invalid
FROM
"hunspell"."oscar2_sorted" AS o
GROUP BY
o.word
)
SELECT
COUNT(*) AS all_words,
SUM(is_invalid) AS invalid_words,
SUM(1-is_invalid) AS valid_words
FROM
valid_check
Assuming the above matches your expectations, just counting valid words could then be...
SELECT
COUNT(DISTINCT o.word)
FROM
"hunspell"."oscar2_sorted" AS o
WHERE
NOT EXISTS (
SELECT *
FROM invalidswar AS i
WHERE o.word LIKE CONCAT('%', i.word,'%')
)