I have a list of "tags" and want the output string to contain only the words that appear in this list, along with the parentheses:
tags = ['S', 'WHAVP', 'POS', 'RBR', 'TO', 'JJR', 'WDT', 'INTJ', 'PP', 'SINV', 'VBZ', 'NX', 'WP', 'WHADJP', 'RP', 'IN', 'VBN', 'RB', 'UH', 'PRP', 'SBAR', 'LST', 'SBARQ', 'FRAG', 'EX', 'NP', 'NN', 'VP', 'NNPS', 'PRT', 'PDT', 'QP', 'VBG', 'ADJP', 'CONJP', 'VB', 'CD', 'WHPP', 'JJ', 'SYM', 'JJS', 'NNP', 'WHNP', 'WRB', 'FW', 'NNS', 'RBS', 'MD', 'PRN', 'DT', 'LS', 'X', 'ADVP', 'VBD', 'SQ', 'NAC', 'CC', 'UCP', 'RRC', 'VBP', 'WP$', '(',')']
input = "(SBARQ (WHNP (WP What)) (SQ (VBP do) (NP (PRP you)) (VP (VB want)))"
This is the expected output:
(SBARQ(WHNP(WP))(SQ(VBP)(NP(PRP))(VP(VB)))
How do I get this to work?
CodePudding user response:
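A brute-force method using a generator expression (my_string is the input string from the question):
my_string = "(SBARQ (WHNP (WP What)) (SQ (VBP do) (NP (PRP you)) (VP (VB want)))"
# Keep a token whole if it is a tag; otherwise strip its letters so only the parentheses remain.
out = ''.join(s if s.strip('()') in tags else s.lower().strip('abcdefghijklmnopqrstuvwxyz') for s in my_string.split())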
Output:
'(SBARQ(WHNP(WP))(SQ(VBP)(NP(PRP))(VP(VB)))'
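The same idea can be written out step by step, which may be easier to adapt. This is only a sketch: keep_tags is a hypothetical helper name, and the tag list is shortened for illustration. Unlike the one-liner, it keeps only the parentheses of a non-tag word, so a word such as "don't" leaves no stray apostrophe behind.

```python
def keep_tags(text, tags):
    """Keep parentheses and known tags; drop every other word."""
    tag_set = set(tags)  # set membership is cheaper than scanning a list
    pieces = []
    for token in text.split():
        core = token.strip('()')
        if core in tag_set:
            # The token is a tag, possibly wrapped in parentheses: keep it whole.
            pieces.append(token)
        else:
            # Not a tag: keep only the parentheses attached to the word.
            pieces.append(''.join(ch for ch in token if ch in '()'))
    return ''.join(pieces)


# Shortened tag list, for illustration only.
tags = ['SBARQ', 'WHNP', 'WP', 'SQ', 'VBP', 'NP', 'PRP', 'VP', 'VB']
text = "(SBARQ (WHNP (WP What)) (SQ (VBP do) (NP (PRP you)) (VP (VB want)))"
print(keep_tags(text, tags))
```

Like the one-liner, this treats any word that happens to equal a tag (for example an uppercase "TO" in the sentence itself) as a tag.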
CodePudding user response:
Using re:
import re
tags = ['S', 'WHAVP', 'POS', 'RBR', 'TO', 'JJR', 'WDT', 'INTJ', 'PP', 'SINV', 'VBZ', 'NX', 'WP', 'WHADJP', 'RP', 'IN',
'VBN', 'RB', 'UH', 'PRP', 'SBAR', 'LST', 'SBARQ', 'FRAG', 'EX', 'NP', 'NN', 'VP', 'NNPS', 'PRT', 'PDT', 'QP',
'VBG', 'ADJP', 'CONJP', 'VB', 'CD', 'WHPP', 'JJ', 'SYM', 'JJS', 'NNP', 'WHNP', 'WRB', 'FW', 'NNS', 'RBS', 'MD',
'PRN', 'DT', 'LS', 'X', 'ADVP', 'VBD', 'SQ', 'NAC', 'CC', 'UCP', 'RRC', 'VBP', 'WP$',
# '(', ')' are not needed here: the pattern matches parentheses directly
]
str_input = "(SBARQ (WHNP (WP What)) (SQ (VBP do) (NP (PRP you)) (VP (VB want)))"
# Match either a parenthesis or a tag as a whole word, then glue the matches together.
out = ''.join(re.findall(r'[()]|' + '|'.join(fr'\b{re.escape(tag)}\b' for tag in tags), str_input))
Output:
(SBARQ(WHNP(WP))(SQ(VBP)(NP(PRP))(VP(VB)))
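One caveat with the \b anchors in the pattern above: \b needs a word character on one side, so a tag ending in "$" such as WP$ is never matched in full; the plain WP alternative matches instead and the "$" is dropped. A possible workaround, sketched below under the assumption that tags are always delimited by non-word characters, is to use lookarounds and try longer tags first. build_tag_pattern is a hypothetical helper name and the tag list is a shortened subset.

```python
import re


def build_tag_pattern(tags):
    # Longest tags first, so 'WP$' is tried before 'WP' at the same position.
    ordered = sorted(tags, key=len, reverse=True)
    alternatives = '|'.join(re.escape(tag) for tag in ordered)
    # (?<!\w) and (?!\w) behave like \b for ordinary tags,
    # but also work when the tag starts or ends with a symbol such as '$'.
    return re.compile(rf'[()]|(?<!\w)(?:{alternatives})(?!\w)')


# Shortened tag list, for illustration only.
tags = ['WP', 'WP$', 'NP', 'NN', 'VP']
pattern = build_tag_pattern(tags)
sample = "(NP (WP$ whose) (NN book))"
print(''.join(pattern.findall(sample)))  # (NP(WP$)(NN))
```

The group is non-capturing, so re.findall still returns the full matches rather than group contents.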