Building a biword index from a document

I am trying to build a biword index from a document, i.e., read a document and split it into a list of overlapping two-word entries, as below:

doc:

There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.

words = ['There have', 'have been', 'been biographies', 'biographies of', ...]

How can I do this in Python?

CodePudding user response:

From Python 3.10 onward, you can use itertools.pairwise:

import itertools

text = 'There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.'

output = itertools.pairwise(text.split())
for x in output:
    print(x)

If you don't have Python 3.10 (like me), you can copy the pairwise recipe given in earlier versions of the itertools documentation:

def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

output = pairwise(text.split())

If you are reluctant to use itertools for some reason, you can use zip:

text_split = text.split()
# Pair each word with the one immediately after it.
output = zip(text_split, text_split[1:])
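
All of the approaches above yield tuples such as ('There', 'have'), whereas the question asks for strings such as 'There have'. A minimal sketch of that last step, assuming a single space is the desired separator and using a shortened sample sentence:

```python
text = "There have been biographies of Dewey"

words = text.split()
# Pair each word with its successor, then join each pair with a space.
biwords = [" ".join(pair) for pair in zip(words, words[1:])]
print(biwords)
# ['There have', 'have been', 'been biographies', 'biographies of', 'of Dewey']
```

The same join works on the output of itertools.pairwise, since it also yields two-item tuples.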

CodePudding user response:

You can use zip and str.split:

def biword(string):
    s = string.split()
    return zip(s, s[1:])

>>> result = biword("There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.")
>>> list(result)
[('There', 'have'), ('have', 'been'), ('been', 'biographies'), ('biographies', 'of'), ('of', 'Dewey'), ('Dewey', 'that'), ('that', 'briefly'), ('briefly', 'describe'), ('describe', 'his'), ('his', 'system,'), ('system,', 'but'), ('but', 'this'), ('this', 'is'), ('is', 'the'), ('the', 'first'), ('first', 'attempt'), ('attempt', 'to'), ('to', 'provide'), ('provide', 'a'), ('a', 'detailed'), ('detailed', 'history'), ('history', 'of'), ('of', 'the'), ('the', 'work'), ('work', 'that'), ('that', 'more'), ('more', 'than'), ('than', 'any'), ('any', 'other'), ('other', 'has'), ('has', 'spurred'), ('spurred', 'the'), ('the', 'growth'), ('growth', 'of'), ('of', 'librarianship'), ('librarianship', 'in'), ('in', 'this'), ('this', 'country'), ('country', 'and'), ('and', 'abroad.')]

FYI, if you want to reduce the function body to a single line, you can use an assignment expression (available from Python 3.8), though that is arguably overdoing it:

zip(s := string.split(), s[1:])
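
Note that the output above keeps punctuation attached to words ('system,' and 'abroad.'). If the goal is a lookup structure rather than a flat list, one possible sketch is below. The function name build_biword_index and the normalization (lowercasing, stripping leading and trailing punctuation) are assumptions for illustration, not something the question specifies:

```python
import string
from collections import defaultdict


def build_biword_index(text):
    """Map each normalized biword to the positions where it starts.

    Hypothetical helper: the normalization rules are an assumption.
    """
    # Lowercase each token and strip punctuation from both ends.
    tokens = [w.strip(string.punctuation).lower() for w in text.split()]
    # Drop tokens that consisted only of punctuation.
    tokens = [t for t in tokens if t]

    index = defaultdict(list)
    # Position i refers to the pair (tokens[i], tokens[i + 1]).
    for position, pair in enumerate(zip(tokens, tokens[1:])):
        index[" ".join(pair)].append(position)
    return dict(index)


index = build_biword_index("of the work, of the growth")
print(index)
# {'of the': [0, 3], 'the work': [1], 'work of': [2], 'the growth': [4]}
```

Storing positions makes repeated biwords such as 'of the' visible; if only membership matters, a set of the joined strings would do.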