how to extract only POS out of its lemma

Time:10-24

I have Tibetan words and their POS tags, as below:

སྩོལ་ VERB
སྐབས་ NOUN
ཆོས་ NOUN
ཞུ་བ་ VERB
ཚོས་ NOUN
དེབ་ NOUN

How can I extract only the POS, as shown:

VERB
NOUN
NOUN
VERB
NOUN
NOUN

Code I tried:

file = # given input file containing word and pos
for line in file:
    word = line.split(' ')[0]
    pos = line.split(' ')[2]

The code above does not produce the desired result. If you have any ideas, they would help a lot!

CodePudding user response:

Try using a list comprehension with string.ascii_uppercase to keep only the uppercase ASCII letters on each line:

import string

text = """སྩོལ་ VERB
སྐབས་ NOUN
ཆོས་ NOUN
ཞུ་བ་ VERB
ཚོས་ NOUN
དེབ་ NOUN"""


for line in text.split('\n'):
    # Keep only uppercase ASCII letters; the Tibetan word and the space are dropped
    POS = ''.join([i for i in line if i in string.ascii_uppercase])
    print(POS)

or from a file:

# filename is the path to the input file
with open(filename, encoding='utf8') as fd:
    for line in fd:
        POS = ''.join([i for i in line if i in string.ascii_uppercase])
        print(POS)

output:

VERB
NOUN
NOUN
VERB
NOUN
NOUN
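As a side note on why the attempt in the question fails, here is a minimal sketch, assuming each line holds one word and one tag separated by a single space. Splitting on ' ' gives only two fields, so index 2 does not exist, and index 1 still carries the newline read from the file. The sample line is taken from the question; the variable names are only for illustration.

```python
line = "སྩོལ་ VERB\n"  # one line as read from a file, newline included

fields = line.split(' ')
print(fields)  # two fields; the POS field still ends with the newline
# fields[2] would raise IndexError because only indices 0 and 1 exist

pos = line.split()[-1]  # split on any whitespace and take the last field
print(pos)
```

Taking the last field with [-1] avoids depending on how many separators the line contains.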

CodePudding user response:

You may have several spaces or tabs between the word and its POS. I'd use sep=None (the default), which splits on runs of whitespace, much like the regex '\s+'; see help(str.split). It also seems reasonable to split from the right side, and only once:

data = '''སྩོལ་ VERB
སྐབས་ NOUN
ཆོས་ NOUN
ཞུ་བ་ VERB
ཚོས་ NOUN
དེབ་ NOUN'''

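records = data.split('\n')
# Split each line once from the right: [word, pos]
records = [rec.rstrip().rsplit(maxsplit=1) for rec in records]
POS = [r[-1] for r in records]

# As an option, build a word -> POS mapping (a repeated word keeps only its last tag):
word_to_pos = {word: pos for word, pos in records}
POS = [*word_to_pos.values()]   # same values as dict(records).values()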
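If the input comes from a file and may contain blank lines, the same right-split idea could be wrapped in a small helper. This is only a sketch: extract_pos is a hypothetical name, and it assumes the tag is always the last whitespace-separated field on a line.

```python
def extract_pos(lines):
    """Return the last whitespace-separated field of each line that has a word and a tag."""
    tags = []
    for line in lines:
        # sep=None splits on runs of whitespace and ignores the trailing newline
        parts = line.rsplit(maxsplit=1)
        if len(parts) == 2:  # skip blank lines and lines with a single field
            tags.append(parts[1])
    return tags


# Sample lines with a space, a tab, a blank line, and several spaces
sample = ["སྩོལ་ VERB\n", "ཞུ་བ་\tVERB\n", "\n", "དེབ་   NOUN\n"]
print(extract_pos(sample))
```

Since a file object yields its lines when iterated, it can be passed straight in, for example inside a with open(filename, encoding='utf8') as fd: block as extract_pos(fd).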