German keyword search - look for every possible combination

I'm working on a project where I define some nouns like Haus, Boot, Kampf, ... and want to detect every form (singular/plural) and every combination of these words in sentences. For example, the algorithm should return true if a sentence contains one of: Häuser, Hausboot, Häuserkampf, Kampfboot, Hausbau, Bootsanleger, ....

Are you familiar with an algorithm that can do such a thing (preferably in R)? Of course I could implement this manually, but I'm pretty sure that something like this already exists.

Thanks!

CodePudding user response：

You can use base R's grepl function to test for a match and str_count from the stringr library to count occurrences, as in this example:

> # Toy example text
> text1 <- "This is an example where Hausbau appears twice (Hausbau)"
> text2 <- "Here the name does not appear"
> # Load stringr (needed only for str_count; grepl is base R)
> library(stringr)
> # Does "Hausbau" appear?
> grepl("Hausbau", text1)
[1] TRUE
> grepl("Hausbau", text2)
[1] FALSE
> # Number of occurrences of "Hausbau" in the text
> str_count(text1, "Hausbau")
[1] 2
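
The same idea extends to several keywords at once. Here is a minimal sketch using only base R; the example sentences and the variable names `pattern` and `found` are made up for illustration, and the keywords are the ones from the question:

```r
# Keywords from the question
base <- c("Haus", "Boot", "Kampf")

# Join the keywords into one alternation pattern: "Haus|Boot|Kampf"
pattern <- paste(base, collapse = "|")

# Illustrative sentences (not from the original post)
sentences <- c("Das Kampfboot liegt im Hafen", "Hier steht nichts davon")

# ignore.case = TRUE lets "Boot" match inside "Kampfboot"
found <- grepl(pattern, sentences, ignore.case = TRUE)
found
```

Note that this pattern on its own does not catch plurals with an umlaut: "Häuser" does not contain the letter sequence "haus".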

CodePudding user response：

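library(stringr)

check <- c("Der Häuser", "Das Hausboot ist", "Häuserkampf", "Kampfboot im Wasser", "NotMe", "Hausbau", "Bootsanleger", "Schauspiel")
base <- c("Haus", "Boot", "Kampf")

# Replace umlauts with plain ASCII letters, then lowercase the strings
normalized <- str_to_lower(stringi::stri_trans_general(check, "Latin-ASCII"))
# TRUE for every string that contains at least one of the base words
vapply(normalized, function(x) any(str_detect(x, str_to_lower(base))), logical(1), USE.NAMES = FALSE)

# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE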

Breaking it down

As Roland notes in the comments, this approach also produces false positives: a word like "Schauspiel" returns TRUE because it contains the letter sequence "haus".
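
To see which base word is responsible for a given hit (and so spot false positives like this one), one option is to build a small match table. This is only a diagnostic sketch in base R; the `words` vector and the name `hits` are chosen for illustration, and since the sample words contain no umlauts the transliteration step is skipped here:

```r
words <- c("Hausboot", "Schauspiel", "NotMe")
base  <- c("Haus", "Boot", "Kampf")

words_lc <- tolower(words)

# One column per base word, one row per input word;
# fixed = TRUE treats each base word as a literal string, not a regex
hits <- sapply(tolower(base), function(b) grepl(b, words_lc, fixed = TRUE))
rownames(hits) <- words
hits
```

The row for "Schauspiel" shows TRUE only in the "haus" column, which makes the accidental substring match easy to spot.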

First, you need to get rid of the special characters: stri_trans_general with the "Latin-ASCII" transform replaces umlauts with plain letters, so Häuser becomes Hauser.

Next, you need to convert both the strings and the base words to lowercase (e.g. so that Boot matches in Kampfboot).

Finally, apply over the strings to test and check each one against the base list. If any of the base words is detected, you have a match.
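
The steps above can also be written out one at a time, which makes the intermediate values easier to inspect. This is a sketch, assuming the stringi package is installed; the shortened `check` vector and the names `ascii`, `text_lc`, `base_lc` and `matched` are chosen for illustration:

```r
library(stringi)

check <- c("Der Häuser", "Kampfboot im Wasser", "NotMe")
base  <- c("Haus", "Boot", "Kampf")

# Step 1: transliterate umlauts to plain ASCII ("Der Häuser" -> "Der Hauser")
ascii <- stri_trans_general(check, "Latin-ASCII")

# Step 2: lowercase both the text and the base words
text_lc <- tolower(ascii)
base_lc <- tolower(base)

# Step 3: a string matches if it contains at least one base word
matched <- vapply(text_lc, function(s) any(stri_detect_fixed(s, base_lc)),
                  logical(1), USE.NAMES = FALSE)
matched
```

stri_detect_fixed is used here instead of str_detect because the base words are literal strings, so no regular expression matching is needed.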