I am trying to build a biword index from a document, i.e. read the document and split it into a list of overlapping two-word entries, as below:
doc:
There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.
words = ['There have', 'have been', 'been biographies', 'biographies of', ...]
How can I do this in Python?
CodePudding user response:
- From Python 3.10, you can use itertools.pairwise:
import itertools
text = 'There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.'
output = itertools.pairwise(text.split())
for x in output:
    print(x)
- If you don't have Python 3.10 (like me), you can copy the pairwise recipe given in the itertools documentation of earlier Python versions:
import itertools
def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
    a, b = itertools.tee(iterable)
    next(b, None)  # advance the second iterator by one item
    return zip(a, b)
output = pairwise(text.split())
- If you are reluctant to use itertools for some reason, you can use zip:
text_split = text.split()
# Pair each word with the one right after it (offset 1, not 2).
output = zip(text_split, text_split[1:])
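All three options above yield tuples such as ('There', 'have'), whereas the question asks for strings such as 'There have'. A minimal sketch of that last step, assuming the two words should simply be joined with a single space; `biword_strings` is a hypothetical helper name, not something from the standard library:

```python
def biword_strings(text):
    """Return adjacent word pairs, each joined by a single space."""
    words = text.split()
    # zip stops at the shorter input, so the last word is never paired alone.
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


print(biword_strings("There have been biographies of Dewey"))
# ['There have', 'have been', 'been biographies', 'biographies of', 'of Dewey']
```

A text with fewer than two words gives an empty list.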
CodePudding user response:
You can use zip and str.split:
def biword(string):
    s = string.split()
    return zip(s, s[1:])
>>> result = biword("There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.")
>>> list(result)
[('There', 'have'), ('have', 'been'), ('been', 'biographies'), ('biographies', 'of'), ('of', 'Dewey'), ('Dewey', 'that'), ('that', 'briefly'), ('briefly', 'describe'), ('describe', 'his'), ('his', 'system,'), ('system,', 'but'), ('but', 'this'), ('this', 'is'), ('is', 'the'), ('the', 'first'), ('first', 'attempt'), ('attempt', 'to'), ('to', 'provide'), ('provide', 'a'), ('a', 'detailed'), ('detailed', 'history'), ('history', 'of'), ('of', 'the'), ('the', 'work'), ('work', 'that'), ('that', 'more'), ('more', 'than'), ('than', 'any'), ('any', 'other'), ('other', 'has'), ('has', 'spurred'), ('spurred', 'the'), ('the', 'growth'), ('growth', 'of'), ('of', 'librarianship'), ('librarianship', 'in'), ('in', 'this'), ('this', 'country'), ('country', 'and'), ('and', 'abroad.')]
FYI, if you want the function body on a single line, you can use an assignment expression (Python 3.8+), though that is arguably overdoing it:
zip(s := string.split(), s[1:])
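Note that the output above keeps punctuation attached to words ('system,', 'abroad.'), which may not be what you want in an index. Here is a rough sketch of one way to go from the pairs to a lookup structure, assuming that stripping leading and trailing punctuation is acceptable and that the index should map each biword to the word positions where it starts. `build_biword_index` is a hypothetical name, and both choices are assumptions, not requirements from the question:

```python
import string
from collections import defaultdict


def build_biword_index(text):
    """Map each biword to the list of word positions where it starts."""
    # Strip leading/trailing punctuation so 'system,' becomes 'system'.
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    index = defaultdict(list)
    for position, pair in enumerate(zip(words, words[1:])):
        index[" ".join(pair)].append(position)
    return dict(index)


index = build_biword_index("the work of the work, abroad.")
print(index["the work"])  # [0, 3]
```

Whether to lowercase the words as well depends on how you plan to query the index.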