I am trying to extract the capitalized terms from a text.
Input text: "Thai, or Central Thai, is a Tai language of the Kra–Dai language family spoken by the Central Thai people and a vast majority of Thai Chinese. It is the sole official language of Thailand. I want to... "
The expected output: ["Thai", "Central Thai", "Kra–Dai", "Thai Chinese", "Thailand"]
Then, by using the Wikipedia API, I will get the definitions of the words above. I am using this regular expression:
[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?
However, when I try it, the result is:
["Thai", "Central Thai", "Kra", "Dai", "Thai Chinese", "Thailand", "I", "It"]
It splits words that contain a dash (as in "Kra–Dai"), and it also includes "I" and "It", which are capitalized only because they follow a period.
How can I match all capitalized words except the capitalized word that directly follows a "."?
CodePudding user response:
We can use word boundaries (\b):
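// \b anchors each match at a word start; \w{2,} requires the first word to have
// at least three characters, which skips "I" and "It". [-–]? allows a hyphen or en dash.
const str = 'Thai, or Central Thai, is a Tai language of the Kra–Dai language family spoken by the Central Thai people and a vast majority of Thai Chinese. It is the sole official language of Thailand. I want to...';
const arr = [...str.matchAll(/\b[A-Z]\w{2,}[-–]?(\s?\b[A-Z]\w*)?/g)].map(e => e[0]);
console.log(arr);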
Regarding "How could I get all uppercase words except the uppercase word after "."": keep in mind that a term you care about can also appear at the start of a sentence, for example:
Periplectic group consists of the group's last common ancestor and all its descendants
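If sentence-initial terms should be kept while pronouns such as "It" and "I" are dropped, one option is to match every capitalized run and then filter the matches against a small stop list. This is only a minimal sketch: the helper name extractTerms and the STOP_WORDS set are hypothetical (neither comes from the question), and the stop list would need tuning for real text.

```javascript
// Hypothetical stop list of capitalized words that are not terms; extend as needed.
const STOP_WORDS = new Set(['I', 'It', 'The', 'This', 'We', 'However']);

// One capitalized word, optionally joined to further capitalized words
// by a hyphen, an en dash (\u2013), or whitespace.
const TERM = /\b[A-Z][a-zA-Z]*(?:(?:[-\u2013]|\s+)[A-Z][a-zA-Z]*)*/g;

// Hypothetical helper: returns the unique capitalized terms in first-seen order.
function extractTerms(text) {
  const matches = text.match(TERM) ?? [];
  // Drop stop words, then remove duplicates via a Set.
  return [...new Set(matches.filter((m) => !STOP_WORDS.has(m)))];
}

const input = 'Thai, or Central Thai, is a Tai language of the Kra–Dai language family spoken by the Central Thai people and a vast majority of Thai Chinese. It is the sole official language of Thailand. I want to...';
console.log(extractTerms(input));
```

Note that "Tai" is returned as well, since it is capitalized like the other terms. A term at the start of a sentence is kept unless it appears in the stop list.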
CodePudding user response:
This worked for me:
(?!.\s)[A-Z][a-z]+(?:\s[A-Z][a-z]+|[–]+[A-Z][a-z]*|[a-z]+)
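A quick way to try a pattern like this on the text from the question is matchAll with the g flag. The sketch below assumes the quantifiers are `+` (an assumed reading of the intended pattern, spelled out in the code); the variable names are illustrative only.

```javascript
// Assumed reading of the pattern, with `+` quantifiers.
// The final [a-z]+ alternative forces at least two lowercase letters after the
// capital, so "I" and "It" cannot match.
const pattern = /(?!.\s)[A-Z][a-z]+(?:\s[A-Z][a-z]+|[–]+[A-Z][a-z]*|[a-z]+)/g;

const text = 'Thai, or Central Thai, is a Tai language of the Kra–Dai language family spoken by the Central Thai people and a vast majority of Thai Chinese. It is the sole official language of Thailand. I want to...';

// matchAll yields one match object per hit; element 0 is the matched text.
const found = [...text.matchAll(pattern)].map((m) => m[0]);
console.log(found);
```

The result still contains repeated terms and "Tai", so it may need deduplication (for example with a Set) before the terms are looked up.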