How to split a group of sentences into JSON format?

Time:11-13

I'm having a hard time splitting the sentences. This is an example of the data:

  1. Go to the dining room. Click on the cabinet to take the whisky bottle.
  2. Go to the kitchen. Click on the fridge. Jason gets a lemonade.

I want the result to be in JSON format:

{id: "1", text: "Go to the dining room. Click on the cabinet to take the whisky bottle."},
{id: "2", text: "Go to the kitchen. Click on the fridge. Jason gets a lemonade."}

The problem is that I don't know how to detect the items and split them correctly, with regex or otherwise. Some of the data has a number such as 50. in the middle of a sentence, which makes the split land in the wrong place. The item number can be up to 4 digits long and is followed by ".", which makes it harder to detect and split.

Any help would be appreciated.

CodePudding user response:

Assuming the provided text is one large string.

const text = `\
1. Go to the dining room. Click on the cabinet to take the whisky bottle.
2. Go to the kitchen. Click on the fridge. Jason gets a lemonade.
`;

You can split the string into items by using the regex /^(?=\d+\. )/m. Here the ^ in combination with the m flag targets the beginning of a line, and (?=) is a lookahead, which keeps the matched characters in the result. Inside the lookahead you'll find \d+ (one or more digits) followed by \. (a literal dot) followed by a space. The split therefore happens at the beginning of every line that is immediately followed by a number, a dot and a space.

const sentences = text.split(/^(?=\d+\. )/m);
//=> [
//   "1. Go to the dining room. Click on the cabinet to take the whisky bottle.\n",
//   "2. Go to the kitchen. Click on the fridge. Jason gets a lemonade.\n"
// ]
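
To see why a number in the middle of a sentence does not trigger a split, here is a minimal sketch. The sample string is made up for illustration and is not from the question's data. Note the assumption that such a number never ends up at the very start of a line; if it did, this regex would split there too.

```javascript
// Hypothetical sample: "50. " appears in the middle of the first item.
const sample = [
  "1. Pay 50. Then go to the dining room.",
  "2. Go to the kitchen. Click on the fridge.",
  "",
].join("\n");

// ^ with the m flag only matches at the start of a line,
// so the "50. " inside the first line is not a split point.
const parts = sample.split(/^(?=\d+\. )/m);

console.log(parts);
```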

The next step is separating the number from the rest of the sentence and creating an object with the two parts. For this we could use the regex /(\d+)\. (.*)/s. This places the starting digits in capture group 1. It then matches a dot and a space, and places everything after it in capture group 2. .* normally doesn't match newline characters, but with the s flag it does. The s flag is relatively new; if you can't use it yet you can replace .* with [^]*. [\S\s]* is also often used as a replacement.

const items = sentences.map((sentence) => {
  const [_match, id, text] = sentence.match(/(\d+)\. (.*)/s);
  return { id, text };
});
//=> [{
//   id: "1",
//   text: "Go to the dining room. Click on the cabinet to take the whisky bottle.\n"
// }, {
//   id: "2",
//   text: "Go to the kitchen. Click on the fridge. Jason gets a lemonade.\n"
// }]
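
If the s flag is not available in your environment, the same capture can be sketched with [\s\S]* in place of .* as mentioned above. The sample string and variable names below are only for illustration.

```javascript
// Hypothetical item whose text spans two lines.
const sentence = "12. First line\nsecond line\n";

// [\s\S] matches any character, including newlines, so no s flag is needed.
const [, itemId, itemText] = sentence.match(/(\d+)\. ([\s\S]*)/);

console.log({ id: itemId, text: itemText });
```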

Note that text may include multiple lines. If you want to compact the whitespace, use the following return statement instead.

return { id, text: text.replace(/\s+/g, " ").trim() }

This matches one or more whitespace characters (spaces, tabs, newlines, vertical tabs, etc.) and replaces them with a single space. trim() then removes any whitespace at the start and end of the string.


const text = `\
1. Go to the dining room. Click on the cabinet to take the whisky bottle.
2. Go to the kitchen. Click on the fridge. Jason gets a lemonade.
`;

const sentences = text.split(/^(?=\d+\. )/m);

const items = sentences.map((sentence) => {
  const [_match, id, text] = sentence.match(/(\d+)\. (.*)/s);
  return { id, text: text.replace(/\s+/g, " ").trim() };
});

console.log(items);
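
The result above is an array of plain objects. If you need an actual JSON string, as the question asks for, one option is to pass the array to JSON.stringify. This is a small sketch; the array is hard-coded here so the example runs on its own.

```javascript
// Same shape as the items produced by the snippet above.
const parsedItems = [
  { id: "1", text: "Go to the dining room. Click on the cabinet to take the whisky bottle." },
  { id: "2", text: "Go to the kitchen. Click on the fridge. Jason gets a lemonade." },
];

// The third argument pretty-prints with two spaces of indentation.
const json = JSON.stringify(parsedItems, null, 2);

console.log(json);
```

Unlike the object literals shown in the question, the JSON output has quoted keys ("id", "text").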

This answer assumes that text starts with a number (one or more digits), a dot and a space. If the start of text doesn't conform to this pattern, sentence.match(/(\d+)\. (.*)/s) will, for the first sentence, either return null or start matching halfway into the sentence.

You could drop the first sentence in this scenario by doing the following after splitting the text.

if (!text.match(/^\d+\. /)) sentences.shift();

Alternatively you could replace split() with match() to search for sentences that do match the pattern, skipping the first bit of text that doesn't.

const sentences = text.match(/^\d+\. (\D|(?!^)\d|\d(?!\d*\. ))*/gm);

This matches a number (one or more digits), a dot and a space at the start of a line, followed by zero or more characters that each meet one of these criteria: a non-digit, a digit that is not at the start of a line, or a digit that does not begin a number followed by a dot and a space.
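
As a rough check of the match() approach, the sketch below runs it on a made-up string that starts with an unnumbered line and contains a number mid-sentence. The third alternative is written as \d(?!\d*\. ) so that a multi-digit id such as 12. at the start of a line also ends the previous item.

```javascript
// Hypothetical input: leading text without a number, a mid-sentence "50. ",
// and a two-digit item id.
const input = "Intro.\n1. Pay 50. Then go.\n12. Go to the kitchen.\n";

// match() with the g flag returns every item that starts with "<digits>. "
// at the beginning of a line, skipping the unnumbered intro line.
const found = input.match(/^\d+\. (\D|(?!^)\d|\d(?!\d*\. ))*/gm);

console.log(found);
```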

CodePudding user response:

You can give it a try.

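const fun = (str) => {
   let i = 0;
   let index = '';
   // Collect the leading digits (char codes 48-57 are '0'-'9').
   while (true) {
    if (str.charCodeAt(i) >= 48 && str.charCodeAt(i) <= 57) {
        index = `${index}${str.charAt(i)}`;
        i += 1;
    } else {
        break;
    }
   }
   // Skip the dot after the number and trim the remaining text.
   return {id: index, text: str.slice(i + 1).trim()}
}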

const ans1 = fun(`109. Go to the dining room. Click on the 23. cabinet to take the whisky bottle`);
const ans2 = fun(`19. Go to the kitchen. cabinet to take the whisky bottle45.`);
console.log(ans1, ans2)

For every input string you can run the above function to get the result.
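
If the data arrives as one large string, a possible way to apply this idea is to split it into lines first and parse each line. The sketch below assumes every item sits on its own line; parseLine is a hypothetical helper that mirrors the function above, and the sample string is made up.

```javascript
// Hypothetical helper: read the leading digits, skip the dot, trim the rest.
const parseLine = (line) => {
  let i = 0;
  while (i < line.length && line[i] >= "0" && line[i] <= "9") i += 1;
  return { id: line.slice(0, i), text: line.slice(i + 1).trim() };
};

// Assumption: one item per line.
const raw = "109. Go to the dining room. Click on the 23. cabinet.\n19. Go to the kitchen.\n";

const results = raw
  .split("\n")
  .filter((line) => line.trim() !== "") // ignore empty lines
  .map(parseLine);

console.log(results);
```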
