Home > Back-end >  Trouble filtering out plural words
Trouble filtering out plural words

Time:07-19

I have a table with the most frequent words in the English language which looks like this:

word count
cat 43534889
dog 34584357
hat 4343878
...
hats 44747

I'd like to exclude all the plural words like 'hats' if they already exist in singular form.

So I wrote this query

SELECT
  word,
  CASE WHEN CONCAT(word,'s') IN (
    SELECT freq.word from `words.freq` as freq
    WHERE freq.word LIKE '%s' AND LENGTH(freq.word) > 4
    )
    THEN 'plural'
    ELSE 'sing'
    END AS plural
FROM `words.freq` LIMIT 1000

My logic is: if the word 'hat' 's' is found among words ending in 's' (subquery), it means it's just the plural form of that noun. Somehow the function CONCAT doesn't seem just to add 's' to each word, but it changes it so for example when I run this query, words like 'that' are somehow displayed as 'plural' as if they were longer than 4 characters and contained 's' at the end. I am really confused. Can anyone help?

CodePudding user response:

I thinks you can sort words in alphabetical descending order and compare a word with next word to check if it's singular form of it.

WITH sample_table AS (
  SELECT 'cat' word, 43534889 count UNION ALL
  SELECT 'dog', 34584357 UNION ALL
  SELECT 'hat', 4343878 UNION ALL
  SELECT 'dogs', 38738 UNION ALL
  SELECT 'hats', 44747
)
SELECT *,
       IF(CONCAT(LEAD(word) OVER (ORDER BY word DESC), 's') = word, 'plural', 'singular') is_plural
  FROM sample_table;

enter image description here

CodePudding user response:

This (in MySQL syntax) should do what you're looking for: as you say, this doesn't capture all the ways that English can make plurals, and it will also get some false positives ("hiss" would be considered as plural because "his" exists).

The idea is to look for words of >=4 characters ending in 's', and check whether the corresponding word with the final 's' removed exists:

SELECT word,
    CASE
    WHEN CHAR_LENGTH(word) >= 4 AND word LIKE '%s' AND LEFT(word, CHAR_LENGTH(word)-1) IN (SELECT word FROM words) THEN 'plural'
    ELSE 'singular'
    END AS plurality
FROM words;
  • Related