English word segmentation that does not just split on spaces

Time:11-13

What tools can segment English text into meaningful units (the way Chinese word segmentation groups characters into words and phrases), rather than just splitting on spaces?

For example, the following uses Stanford Stanza for Chinese word segmentation:

 
# install stanza
!pip3 install stanza

# import the package
import stanza

# download the English model
print("Downloading English model...")
stanza.download('en')

# download the Chinese model
print("Downloading Chinese model...")
stanza.download('zh', verbose=False)

# Chinese text to be processed
text = """英国首相约翰逊6日晚因病情恶化,被转入重症监护室治疗。英国首相府发言人说,目前约翰逊意识清晰,将他转入重症监护室只是预防性措施。发言人说,约翰逊在被转入重症监护室前已安排英国外交大臣拉布代表他处理有关事务。"""

zh_nlp = stanza.Pipeline('zh')
doc = zh_nlp(text)

# loop over the sentences and print the segmentation
for sent in doc.sentences:
    print("Sentence: " + sent.text)                                       # sentence text
    print("Tokenize: " + "/".join(token.text for token in sent.tokens))   # Chinese word segmentation

The output is:

 
Sentence: 英国首相约翰逊6日晚因病情恶化,被转入重症监护室治疗。
Tokenize: 英国/首相/约翰逊/6/日/晚/因/病情/恶化/,/被/转入/重症/监护/室/治疗/。

Sentence: 英国首相府发言人说,目前约翰逊意识清晰,将他转入重症监护室只是预防性措施。
Tokenize: 英国/首相/府/发言/人/说/,/目前/约翰逊/意识/清晰/,/将/他/转入/重症/监护/室/只/是/预防/性/措施/。

Sentence: 发言人说,约翰逊在被转入重症监护室前已安排英国外交大臣拉布代表他处理有关事务。
Tokenize: 发言/人/说/,/约翰逊/在/被/转入/重症/监护/室/前/已/安排/英国/外交/大臣/拉布/代表/他/处理/有关/事务/。

If the input is English, the tokens simply end up separated by spaces:

 
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is United States.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')



Output:

 
====== Sentence 1 tokens =======
id: (1,)    text: This
id: (2,)    text: is
id: (3,)    text: a
id: (4,)    text: test
id: (5,)    text: sentence
id: (6,)    text: for
id: (7,)    text: stanza
id: (8,)    text: .
====== Sentence 2 tokens =======
id: (1,)    text: This
id: (2,)    text: is
id: (3,)    text: United
id: (4,)    text: States
id: (5,)    text: .
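
Note that even this output is not a pure space split: the sentence-final period comes out as its own token. As a minimal sketch of that behaviour (the sample sentence below is an illustration, not taken from the question), punctuation and contractions are typically separated as well:

import stanza

# Minimal sketch: Stanza's English tokenizer is model-based rather than a plain
# str.split(' '); punctuation and clitics such as "n't" usually become separate tokens.
# The sample sentence is an illustrative assumption, not from the original post.
nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp("Mr. Johnson doesn't feel well, so he was moved to the ICU.")
for sentence in doc.sentences:
    print(" / ".join(token.text for token in sentence.tokens))

Still, the resulting tokens are single English words; nothing groups them into larger phrase-like units the way the Chinese pipeline groups characters into words.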

CodePudding user response:

https://lmbtfy.cn/s/57tcLjzAW3
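
If the goal is to group English words into larger, phrase-like units (closer to what the Chinese example does with characters), one possible direction, sketched here as an assumption rather than as part of the linked answer, is noun-phrase chunking, for example with spaCy:

import spacy

# Minimal sketch, assuming spaCy and its small English model are installed
# (pip install spacy; python -m spacy download en_core_web_sm).
# noun_chunks groups words into base noun phrases instead of splitting on spaces.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The British prime minister was moved to the intensive care unit on Monday night.")
for chunk in doc.noun_chunks:
    print(chunk.text)

Stanza itself also offers part-of-speech tagging and parsing that such groupings could be built on, so the right tool depends on what kind of "unit" is needed.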