Extracting the place and publisher from a sentence string

Time:05-16

I have a list of data stating the place and publisher of a journal.

Each entry is given as a single sentence in a list:

['Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003' ,
 'Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003' , 
 'Publisher: SAGE Publications Ltd',
 'Place: London'] 

As you can see, some strings give the publisher but no place, and in others it is the other way around.

I want the output as two lists:

Places = ['Amsterdam','Hanoi','London']
Publishers = ['Elsevier Science',
              'Vietnam Acad Science & Technology- Vast',
              'SAGE Publications Ltd']

I am using Python for this data analysis.

I was thinking of using the split() function to find where 'Place' is written and take the string next to it, but it does not seem to work.

My code so far:

places = []
for i in extrainfo:  # extrainfo is the name of the initial list
    if 'Place' in i:
        z = i
        i = i.split()
        counter = 0
        for q in i:
            if q == 'Place':
                break
            counter = counter + 1
    places = places + z[counter + 1]
print(places)

CodePudding user response:

  • split on colons ':' using s.split(':');
  • discard leading and trailing whitespace using s.strip();
  • if one of the split substrings ends with 'Publisher' or 'Place', add the next substring to the relevant list;
  • some of the substrings added to the lists will end with 'Place' or 'Publisher': take care of that using s.removesuffix('Place').removesuffix('Publisher').
from itertools import pairwise # python>=3.10
# from itertools import tee
# def pairwise(iterable):
#     "s -> (s0,s1), (s1,s2), (s2, s3), ..."
#     a, b = tee(iterable)
#     next(b, None)
#     return zip(a, b)

data = ['Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003' , 'Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003' , 'Publisher: SAGE Publications Ltd','Place: London']

things = {'Place': [], 'Publisher': [], 'WOS': []}

for sentence in data:
    for k, v in pairwise(map(str.strip, sentence.split(':'))):
        for cat in things:
            if k.endswith(cat):
                for suffix in things:
                    v = v.removesuffix(suffix).strip()
                things[cat].append(v)
                break

print(things)
# {'Place': ['Amsterdam', 'Hanoi', 'London'],
#  'Publisher': ['Elsevier Science Bv', 'Vietnam Acad Science & Technology-Vast', 'SAGE Publications Ltd'],
#  'WOS': ['000179813800003', '000530921100003']}
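
For comparison, the "take the text next to the label" idea from the question can also be sketched with str.partition, which needs neither pairwise nor removesuffix. This is only a sketch: the helper name text_after and the LABELS tuple are made up for illustration, and it assumes the only labels that ever occur are 'Place:', 'Publisher:' and 'WOS:'.

```python
# Hypothetical helper, not part of the answer above.
# Assumption: the only labels in the data are these three.
LABELS = ("Place:", "Publisher:", "WOS:")

def text_after(label, sentence):
    """Return the text that follows `label`, cut at the next label, or None."""
    _, found, rest = sentence.partition(label)
    if not found:
        return None  # the label does not occur in this sentence
    for stop in LABELS:
        # Keep only what comes before any later label.
        rest = rest.partition(stop)[0]
    return rest.strip()

data = [
    'Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003',
    'Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003',
    'Publisher: SAGE Publications Ltd',
    'Place: London',
]

places = [text_after("Place:", s) for s in data if "Place:" in s]
publishers = [text_after("Publisher:", s) for s in data if "Publisher:" in s]

print(places)      # ['Amsterdam', 'Hanoi', 'London']
print(publishers)  # ['Elsevier Science Bv', 'Vietnam Acad Science & Technology-Vast', 'SAGE Publications Ltd']
```

Note that calling split() with no argument, as in the question, yields the token 'Place:' with the colon attached, so a comparison against 'Place' never matches.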

CodePudding user response:

Solution with re module:

import re

lst = [
    "Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003",
    "Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003",
    "Publisher: SAGE Publications Ltd",
    "Place: London",
]

places = [
    m.group(1)
    for i in lst
    if (m := re.search(r"Place: (.*?)\s*(?:Publisher|$)", i))
]

publishers = [
    m.group(1)
    for i in lst
    if (m := re.search(r"Publisher: (.*?)\s*(?:WOS|$)", i))
]

print(places)
print(publishers)

Prints:

['Amsterdam', 'Hanoi', 'London']
['Elsevier Science Bv', 'Vietnam Acad Science & Technology-Vast', 'SAGE Publications Ltd']
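
The two patterns above assume that Publisher always follows Place and that WOS always follows Publisher. If the order might vary, one possible variation is a single pattern that stops at whichever label comes next. The sketch below is an assumption-laden illustration rather than part of the answer: the names FIELD and extract_fields are hypothetical, and the label set is assumed to be exactly Place, Publisher and WOS.

```python
import re

# Assumption: these are the only labels that appear in the data.
LABELS = ("Place", "Publisher", "WOS")
_ALTERNATION = "|".join(LABELS)

# A label and a colon, then (lazily) everything up to the next
# "label:" or the end of the string.
FIELD = re.compile(
    rf"({_ALTERNATION}):\s*(.*?)\s*(?=(?:{_ALTERNATION}):|$)"
)

def extract_fields(records):
    """Collect the value of every labelled field, grouped by label."""
    out = {label: [] for label in LABELS}
    for record in records:
        for label, value in FIELD.findall(record):
            out[label].append(value)
    return out

lst = [
    "Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003",
    "Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003",
    "Publisher: SAGE Publications Ltd",
    "Place: London",
]

fields = extract_fields(lst)
print(fields["Place"])      # ['Amsterdam', 'Hanoi', 'London']
print(fields["Publisher"])  # ['Elsevier Science Bv', 'Vietnam Acad Science & Technology-Vast', 'SAGE Publications Ltd']
```

Because the lookahead requires a colon after the label, a publisher name that merely contains the word "Place" would not be cut short.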