I am trying to undo an inverted index to generate a plain text format. I rarely use Python, so I am just using what I remember from a few years ago to generate the algorithm. Here is what I want to be printed out:
Despite growing interest in Open Access (OA) to scholarly literature, there is an unmet need for large-scale, up-to-date, and reproducible studies assessing the prevalence and characteristics of OA. We address this need using oaDOI, an open online service that determines OA status for 67 million articles. We use three samples, each of 100,000 articles, to investigate OA in three populations: (1) all journal articles assigned a Crossref DOI, (2) recent journal articles indexed in Web of Science, and (3) articles viewed by users of Unpaywall, an open-source browser extension that lets users find OA articles using oaDOI. We estimate that at least 28% of the scholarly literature is OA (19M in total) and that this proportion is growing, driven particularly by growth in Gold and Hybrid. The most recent year analyzed (2015) also has the highest percentage of OA (45%). Because of this growth, and the fact that readers disproportionately access newer articles, we find that Unpaywall users encounter OA quite frequently: 47% of articles they view are OA. Notably, the most common mechanism for OA is not Gold, Green, or Hybrid OA, but rather an under-discussed category we dub Bronze: articles made free-to-read on the publisher website, without an explicit Open license. We also examine the citation impact of OA articles, corroborating the so-called open-access citation advantage: accounting for age and discipline, OA articles receive 18% more citations than average, an effect driven primarily by Green and Hybrid OA. We encourage further research using the free oaDOI service, as a way to inform OA policy and practice.
Here is the data in the inverted index (can be found here under "abstract_inverted_index" -> https://api.openalex.org/W2741809807):
"abstract_inverted_index":{"Despite":[0],"growing":[1],"interest":[2],"in":[3,57,73,110,122],"Open":[4,201],"Access":[5],"(OA)":[6],"to":[7,54,252],"scholarly":[8,105],"literature,":[9],"there":[10],"is":[11,107,116,176],"an":[12,34,85,185,199,231],"unmet":[13],"need":[14,31],"for":[15,42,174,219],"large-scale,":[16],"up-to-date,":[17],"and":[18,24,77,112,124,144,221,237,256],"reproducible":[19],"studies":[20],"assessing":[21],"the":[22,104,134,145,170,195,206,213,245],"prevalence":[23],"characteristics":[25],"of":[26,51,75,83,103,137,141,163,209],"OA.":[27,168,239],"We":[28,46,97,203,240],"address":[29],"this":[30,114,142],"using":[32,95,244],"oaDOI,":[33],"open":[35],"online":[36],"service":[37],"that":[38,89,99,113,147,155],"determines":[39],"OA":[40,56,93,108,138,159,175,210,223,254],"status":[41],"67":[43],"million":[44],"articles.":[45],"use":[47],"three":[48,58],"samples,":[49],"each":[50],"100,000":[52],"articles,":[53,152,211],"investigate":[55],"populations:":[59],"(1)":[60],"all":[61],"journal":[62,70],"articles":[63,71,79,94,164,191,224],"assigned":[64],"a":[65,250],"Crossref":[66],"DOI,":[67],"(2)":[68],"recent":[69,128],"indexed":[72],"Web":[74],"Science,":[76],"(3)":[78],"viewed":[80],"by":[81,120,235],"users":[82,91,157],"Unpaywall,":[84],"open-source":[86],"browser":[87],"extension":[88],"lets":[90],"find":[92,154],"oaDOI.":[96],"estimate":[98],"at":[100],"least":[101],"28%":[102],"literature":[106],"(19M":[109],"total)":[111],"proportion":[115],"growing,":[117],"driven":[118,233],"particularly":[119],"growth":[121],"Gold":[123],"Hybrid.":[125],"The":[126],"most":[127,171],"year":[129],"analyzed":[130],"(2015)":[131],"also":[132,204],"has":[133],"highest":[135],"percentage":[136],"(45%).":[139],"Because":[140],"growth,":[143],"fact":[146],"readers":[148],"disproportionately":[149],"access":[150],"newer":[151],"we":[153,188],"Unpaywall":[156],"encounter":[158],"quite":[160],"frequently:":[161],"47%":[162],"they":[165],"view":[166],"are":[167],"Notably,":[169],"common":[172],"mechanism":[173],"not":[177],"Gold,":[178],"Green,":[179],"or":[180],"Hybrid":[181,238],"OA,":[182],"but":[183],"rather":[184],"under-discussed":[186],"category":[187],"dub":[189],"Bronze:":[190],"made":[192],"free-to-read":[193],"on":[194],"publisher":[196],"website,":[197],"without":[198],"explicit":[200],"license.":[202],"examine":[205],"citation":[207,216],"impact":[208],"corroborating":[212],"so-called":[214],"open-access":[215],"advantage:":[217],"accounting":[218],"age":[220],"discipline,":[222],"receive":[225],"18%":[226],"more":[227],"citations":[228],"than":[229],"average,":[230],"effect":[232],"primarily":[234],"Green":[236],"encourage":[241],"further":[242],"research":[243],"free":[246],"oaDOI":[247],"service,":[248],"as":[249],"way":[251],"inform":[253],"policy":[255],"practice.":[257]}
Here is my current code to decode the invert, however it returns just
import requests
abstractInvertedIndex = requests.get(
'https://api.openalex.org/W2741809807'
).json()['abstract_inverted_index']
arrayAbstractIndex = [[k, abstractInvertedIndex[k]] for k in abstractInvertedIndex]
# Position of the word in the abstract
wordPos = 0
# The number position of the key value
wordNum = 0
abstract = ""
for x in arrayAbstractIndex:
if wordPos in arrayAbstractIndex[wordNum][1]:
abstract = abstract str(arrayAbstractIndex[wordNum][0] ' ')
wordPos = wordPos 1
wordNum = wordNum 1
print(abstract)
Despite growing interest in Open Access (OA) to scholarly literature, there is an unmet need for large-scale, up-to-date, and reproducible studies assessing the prevalence
I know this is due to the fact that the word 'and' has multiple positions in the index, however, I don't know how to configure Python for loops to go through each dictionary value and all the array items in the key to make sure the entire plain text gets printed?
Any suggestions?
CodePudding user response:
abstractInvertedIndex
is a dictionary of word:[indices]. From this dictionary, first get a list of (word,index) pairsword_index = [] for k,v in abstractInvertedIndex.items(): for index in v: word_index.append([k,index])
Now sort this list
word_index
to retain index orderword_index = sorted(word_index,key = lambda x : x[1])
And finally join only the words from
word_index
list with a space
Despite growing interest in Open Access (OA) ...... as a way to inform OA policy and practice.