Extract all two-character combinations from a string

To distinguish nonsense text (e.g. djsarejslslasdfhsl) from real (German) words, I would like to analyze letter frequencies.

My idea is to calculate the relative frequencies of two-letter combinations ("te", "ex", "xt", "is" etc.) from a long text. Based on that information I would like to calculate the probability that a given word (or sentence) is real German.

My first problem is how to extract all the two-letter combinations and count them. I fear that calling substring(string, start, stop) in a loop while incrementing start and stop might not be very efficient. Do you have any ideas?

# A short sample text
text <- 'Es ist ein Freudentag – ohne Zweifel. Gesundheitsminister Alain Berset und der Bundesrat gehen weiter, als man annehmen durfte. Die Zertifikatspflicht wird aufgehoben, die Maskenpflicht gilt nur noch im ÖV und in Gesundheitseinrichtungen.
Die beste Meldung des Tages aber ist: Die Covid-19-Task-Force, inzwischen als «Task-Farce» verballhornt, wird auf Ende März aufgehoben – zwei Monaten früher als geplant. Die Dauerkritik war wohl mit ein Grund, dass dieses Gremium sich jetzt rasch auflösen will.
Keine Rosen ohne Dornen: Einzelne Punkte von Bersets Ausführungen geben zu denken.
Die «Isolationshaft» für positiv Getestete bleibt zwingend. Das ist Unsinn und steht in einem scharfen Kontrast zu den übrigen Öffnungsschritten. Die Grundimmunität der Bevölkerung beträgt über 90 Prozent, das Virus ist nicht mehr gefährlich, warum will man weiter Leute zu Hause einsperren? Wer schwer krank ist, geht von sich aus nicht zur Arbeit. Die krankheitsbedingte Bettruhe muss man den Menschen nicht vorschreiben.
Gesundheitsminister Berset findet, das Modell Task-Force habe eine interessante Möglichkeit aufgezeigt für die Zusammenarbeit zwischen Regierung und Wissenschaft. Unter Umständen eigne sich dieses Modell auch für andere Bereiche.
Nein danke, Herr Berset.
Die Task-Force war mit ihrem öffentlichen Dauer-Alarmismus und ihren haarsträubenden Falsch-Prognosen vor allem eine Manipulationsmaschine.
Und dann noch dies: Irgendwann während der heutigen Pressekonferenz gab Alain Berset zu verstehen, man habe mit all diesen Massnahmen die Bevölkerung schützen wollen. Vielleicht hatte man diese hehre Absicht einmal im Hinterkopf. Alle Massnahmen ab der zweiten Welle erfolgten nicht zum Schutz der Bevölkerung, sondern, um einen Zusammenbruch des Spital-Systems zu verhindern.
Doch jetzt stossen wir erst einmal auf das Ende der Apartheit an.'

# Some cleaning:

library(stringr)
text <- str_replace_all(text, "[^[:alnum:]]", " ")
text <- tolower(text)
words <- strsplit(text, "\\s+")[[1]]
words

for(word in words){
  ??? 
}

CodePudding user response：

Clean the text, replacing any sequence of non-alphanumeric characters with a single space

text = tolower(gsub("[^[:alnum:]]+", " ", text))

Find all pairs of sequential letters

twos = substring(text, 1:(nchar(text) - 1), 2:nchar(text))

but only keep those that did not overlap a space

twos[nchar(trimws(twos)) == 2L]

Here's the result

> twos[nchar(trimws(twos)) == 2L] |> table()

19 90 aa ab af ag äg ah äh ai al am an än ap ar är as at ät au äu ba be bl br
 1  1  1  6  2  2  1  2  2  2 14  2 16  1  1 10  1 15  6  1 12  1  1 24  1  2
bs bt bu ce ch co da de dh di do du dw eb ed ef eg eh ei ek el em en ep er es
 1  1  1  4 34  1  9 23  3 18  2  2  1  1  1  1  1  9 32  1  7  5 54  1 42 19
et eu ev ez fa fä fe ff fg fi fl fn fo fr ft fü ga ge gi gl gn gr gs gt ha he
12  3  3  1  2  1  4  2  3  2  3  1  4  2  3  4  1 19  2  1  2  3  1  4  8 17
hi hk hl hm hn ho hr ht hu hü hw ib ic id ie if ig ih ik il im in io ip ir is
 3  1  1  3  2  3  9 11  1  1  1  2 16  1 18  2  4  2  2  3  3 28  2  1  5 12
it iu iv je ka ke kh ko kr kt la ld le lg lh li lk ll ln lö ls lt ma mä me mi
19  1  1  2  1  8  1  3  3  1  6  1  7  1  1  5  3 11  1  1  4  1 12  1  8  7
mm mo mö ms mu na nb nd ne nf ng ni nk nm nn no np nr ns nt nu nz ob oc od öf
 3  3  1  2  3  4  1 23 13  1 10  8  5  2  4  3  1  1  6 10  2  3  2  3  2  2
og ög oh ol öl on op or os ös ov öv oz pa pe pf pi pl po pr pu ra rä rb rc rd
 1  1  3  3  3  8  1  7  4  1  1  1  1  1  1  3  1  1  1  3  2  5  2  3  4  2
re rf rg rh ri rk rl rm rn ro rr rs rt ru rü rz sa sb sc se sf sh si sk sm sn
14  3  1  1  4  2  1  1  4  3  2  9  2 11  1  1  3  1 13 17  1  1  6  5  4  2
so sp sr ss st su sy ta tä te th ti tl to tr ts tt tu tz ub üb uc ud ue uf uh
 2  3  1  9 17  3  1  7  2 24  1  6  1  1  4  6  3  1  4  1  2  2  1  2  6  1
üh ul um un ur ür us ut üt ve vi vo vö wa wä we wi wo ys ze zt zu zw
 2  1  5 24  3  3  8  3  1  3  3  4  3  4  1  8  9  2  1  5  2  9  6
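
To make the two steps above concrete, here is a minimal sketch on a toy string. The variable names `toy` and `pairs` are only for illustration and do not appear in the answer.

```r
# Toy input: two words separated by a single space
toy <- "es ist"

# Every window of two consecutive characters (start positions 1..n-1)
twos <- substring(toy, 1:(nchar(toy) - 1), 2:nchar(toy))
print(twos)   # "es" "s " " i" "is" "st"

# A window that touches a space is shorter than 2 after trimming, so it is dropped
pairs <- twos[nchar(trimws(twos)) == 2L]
print(pairs)  # "es" "is" "st"
```

Because `substring()` is vectorized over its start and stop arguments, no explicit loop is needed.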

The algorithm seems to generalize to sequences of any number of letters by separating words with n - 1 spaces

chartuples <-
    function(text, n = 2)
{
    n0 <- n - 1
    text <- tolower(gsub(
        "[^[:alnum:]]+", paste(rep(" ", n0), collapse = ""), text
    ))
    tuples <- substring(text, 1:(nchar(text) - n0), n:nchar(text))
    tuples[nchar(trimws(tuples)) == n]
}

This is also easy to use for looking up the values of any 'word'

counts <- table(chartuples(text))
counts[chartuples("djsarejslslasdfhsl")] |> as.vector()

(the NAs in the resulting vector mark two-letter combinations that do not occur in your original corpus).
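
One possible way to turn such a lookup into a single score per word is sketched below. The scoring rule (mean log relative frequency with add-one smoothing for unseen pairs), the function name `score_word`, and the tiny `corpus` string are all assumptions made for illustration; they are not part of the answer, and a real analysis would use a much longer text.

```r
# Helper from the answer, repeated so the sketch is self-contained
chartuples <- function(text, n = 2) {
    n0 <- n - 1
    text <- tolower(gsub(
        "[^[:alnum:]]+", paste(rep(" ", n0), collapse = ""), text
    ))
    tuples <- substring(text, 1:(nchar(text) - n0), n:nchar(text))
    tuples[nchar(trimws(tuples)) == n]
}

# Hypothetical scoring rule (an assumption, not from the answer):
# mean log relative frequency, with add-one smoothing for unseen pairs
score_word <- function(word, counts) {
    freq <- setNames(as.vector(counts), names(counts))
    seen <- unname(freq[chartuples(word)])  # NA for pairs absent from the corpus
    seen[is.na(seen)] <- 0
    mean(log((seen + 1) / (sum(freq) + length(freq))))
}

# Tiny illustrative corpus; it contains no x, q or z
corpus <- "die beste meldung des tages aber ist die task force wird aufgehoben"
counts <- table(chartuples(corpus))

real     <- score_word("meldung", counts)  # every pair occurs in the corpus
nonsense <- score_word("xqzxqz", counts)   # no pair occurs in the corpus
real > nonsense
```

Higher (less negative) scores mean the word is built from pairs that are common in the corpus; where to put a cut-off between "real" and "nonsense" would have to be tuned on actual data.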

CodePudding user response：

words <- unlist(strsplit(text, '[^[:alnum:]]+'))
cmbs2 <- sapply(words, function(x) substring(x, len <- seq(nchar(x) - 1), len + 1), USE.NAMES = TRUE)
head(cmbs2) ## Just to show a few words.
$Es
[1] "Es"

$ist
[1] "is" "st"

$ein
[1] "ei" "in"

$Freudentag
[1] "Fr" "re" "eu" "ud" "de" "en" "nt" "ta" "ag"

$ohne
[1] "oh" "hn" "ne"

$Zweifel
[1] "Zw" "we" "ei" "if" "fe" "el"
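
To get from this per-word list to counts, the pairs can be flattened and tabulated. The following is a minimal sketch under a few assumptions of mine: a short ASCII-only sample string, lowercasing before counting, and dropping words shorter than two characters (for which `seq(nchar(x) - 1)` would not yield valid start positions).

```r
sample_text <- "Es ist ein Freudentag, ohne Zweifel. A"
words <- unlist(strsplit(sample_text, "[^[:alnum:]]+"))

# Keep only words long enough to contain at least one pair
words <- words[nchar(words) >= 2]

# Two-letter combinations per word
cmbs2 <- lapply(words, function(x) {
    start <- seq_len(nchar(x) - 1)
    substring(x, start, start + 1)
})

# Flatten the list and count each lowercased pair
counts <- table(tolower(unlist(cmbs2)))
as.vector(counts["ei"])  # 2: once in "ein", once in "Zweifel"
```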