I'm looking for a method to generate sequences from every-grams up to length n
that match an input sentence:
Given a sentence: "Break this into sequences"
and n = 3
I want to create the sequences:
("Break", "this", "into", "sequences")
("Break", "this", "into sequences")
("Break", "this into", "sequences")
("Break this", "into", "sequences")
("Break this", "into sequences")
("Break", "this into sequences")
("Break this into", "sequences")
nltk
has the everygram
package, but I'm not quite sure how I'd use it toward my goal.
I've tried adapting the problem to focus on characters for simplicity, i.e.,
It may be helpful to consider these as character-grams (and, as rici suggested, spacing out characters [with and without spacing shown for clarity]):
abcd
goes to:
(a, b, c, d) (a, b, c, d)
(a, b, c d) (a, b, cd)
(a, b c, d) (a, bc, d)
(a b, c, d) (ab, c, d)
(a b, c d) (ab, cd)
(a, b c d) (a, bcd)
(a b c, d) (abc, d)
For clarity, this should generalize for any length, given a n
as the maximum-sized n-gram; so, for abcde
with n=3
we'd have:
(a, b, c, d, e) (a, b, c, d, e)
(a, b, c, d e) (a, b, c, de)
(a, b, c d, e) (a, b, cd, e)
(a, b c, d e) (a, bc, d, e)
(a b, c, d, e) (ab, c, d, e)
(a, b c, d e) (a, bc, de)
(a b, c, d e) (ab, c, de)
(a b, c d, e) (ab, cd, e)
(a, b, c d e) (a, b, cde)
(a, b c d, e) (a, bcd, e)
(a b c, d, e) (abc, d, e)
(a b, c d e) (ab, cde)
(a b c, d e) (abc, de)
I'm thinking I may need to generate a grammar, something like:
exp ::= ABC, d | a, BCD
ABC ::= AB, c | A, BC
BCD ::= BC, d | b, CD
AB ::= A, b | a, B
BC ::= B, c | b, C
CD ::= C, d | c, D
A ::= a
B ::= b
C ::= c
D ::= d
and find all parses of the sentence, but certainly there must be a procedural way to go about this?
CodePudding user response:
Maybe it would be helpful to space your example out a bit:
(a , b , c , d)
(a , b , c d)
(a , b c , d)
(a b , c , d)
(a b , c d)
(a , b c d)
(a b c , d)
(a b c d) # added for completeness
Looking at that, it's evident that what differentiates the rows is the presence or absence of commas, a typical binary choice. There are three places a comma could go, so there are eight possibilities, corresponding to the eight binary numbers of three digits.
The easiest way to list these possibilities is to count from 0 0 0
to 1 1 1
.
For your modified question, in which there is a maximum length of a part, one simple recursive solution in Python is:
def kgram(k, v):
'Generate all partitions of v with parts no larger than k'
def helper(sfx, m):
if m == 0: yield sfx
else:
for i in range(1, min(k, m) 1):
yield from helper([v[m-i:m]] sfx, m-i)
yield from helper([], len(v))
Here's a quick test:
>>> for p in gram(3, 'one two three four five'.split()): print(p)
...
[['one'], ['two'], ['three'], ['four'], ['five']]
[['one', 'two'], ['three'], ['four'], ['five']]
[['one'], ['two', 'three'], ['four'], ['five']]
[['one', 'two', 'three'], ['four'], ['five']]
[['one'], ['two'], ['three', 'four'], ['five']]
[['one', 'two'], ['three', 'four'], ['five']]
[['one'], ['two', 'three', 'four'], ['five']]
[['one'], ['two'], ['three'], ['four', 'five']]
[['one', 'two'], ['three'], ['four', 'five']]
[['one'], ['two', 'three'], ['four', 'five']]
[['one', 'two', 'three'], ['four', 'five']]
[['one'], ['two'], ['three', 'four', 'five']]
[['one', 'two'], ['three', 'four', 'five']]