English word segmentation that does not just split on spaces

Time:11-13

What tools can segment English text into meaningful units (the way Chinese word segmentation groups characters into words and phrases), rather than just splitting on spaces?

For example, the following uses Stanford Stanza for Chinese word segmentation:

 
# install stanza
!pip3 install stanza

# import the package
import stanza

# download the English model
print("Downloading English model...")
stanza.download('en')

# download the Chinese model
print("Downloading Chinese model...")
stanza.download('zh', verbose=False)

# Chinese text to be processed
text = """英国首相约翰逊6日晚因病情恶化,被转入重症监护室治疗。英国首相府发言人说,目前约翰逊意识清晰,将他转入重症监护室只是预防性措施。发言人说,约翰逊在被转入重症监护室前已安排英国外交大臣拉布代表他处理有关事务。"""

zh_nlp = stanza.Pipeline('zh')
doc = zh_nlp(text)

# loop over the sentences and print the segmentation
for sent in doc.sentences:
    print("Sentence: " + sent.text)                                       # sentence text
    print("Tokenize: " + "/".join(token.text for token in sent.tokens))   # Chinese word segmentation

The output is:

 
Sentence: 英国首相约翰逊6日晚因病情恶化,被转入重症监护室治疗。
Tokenize: 英国/首相/约翰逊/6/日/晚/因/病情/恶化/,/被/转入/重症/监护/室/治疗/。

Sentence: 英国首相府发言人说,目前约翰逊意识清晰,将他转入重症监护室只是预防性措施。
Tokenize: 英国/首相/府/发言/人/说/,/目前/约翰逊/意识/清晰/,/将/他/转入/重症/监护/室/只/是/预防/性/措施/。

Sentence: 发言人说,约翰逊在被转入重症监护室前已安排英国外交大臣拉布代表他处理有关事务。
Tokenize: 发言/人/说/,/约翰逊/在/被/转入/重症/监护/室/前/已/安排/英国/外交/大臣/拉布/代表/他/处理/有关/事务/。

If the input is English, the tokens simply end up separated by spaces:

 
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is United States.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')



Output:

 
====== Sentence 1 tokens =======
id: (1,)    text: This
id: (2,)    text: is
id: (3,)    text: a
id: (4,)    text: test
id: (5,)    text: sentence
id: (6,)    text: for
id: (7,)    text: stanza
id: (8,)    text: .
====== Sentence 2 tokens =======
id: (1,)    text: This
id: (2,)    text: is
id: (3,)    text: United
id: (4,)    text: States
id: (5,)    text: .
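
Note that even this output is not a pure space split: the sentence-final period comes out as its own token. As a minimal sketch of that behaviour (the sample sentence below is an illustration, not taken from the question), punctuation and contractions are typically separated as well:

import stanza

# Minimal sketch: Stanza's English tokenizer is model-based rather than a plain
# str.split(' '); punctuation and clitics such as "n't" usually become separate tokens.
# The sample sentence is an illustrative assumption, not from the original post.
nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp("Mr. Johnson doesn't feel well, so he was moved to the ICU.")
for sentence in doc.sentences:
    print(" / ".join(token.text for token in sentence.tokens))

Still, the resulting tokens are single English words; nothing groups them into larger phrase-like units the way the Chinese pipeline groups characters into words.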

CodePudding user response:

https://lmbtfy.cn/s/57tcLjzAW3
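
If the goal is to group English words into larger, phrase-like units (closer to what the Chinese example does with characters), one possible direction, sketched here as an assumption rather than as part of the linked answer, is noun-phrase chunking, for example with spaCy:

import spacy

# Minimal sketch, assuming spaCy and its small English model are installed
# (pip install spacy; python -m spacy download en_core_web_sm).
# noun_chunks groups words into base noun phrases instead of splitting on spaces.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The British prime minister was moved to the intensive care unit on Monday night.")
for chunk in doc.noun_chunks:
    print(chunk.text)

Stanza itself also offers part-of-speech tagging and parsing that such groupings could be built on, so the right tool depends on what kind of "unit" is needed.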