Regex to get strings between content generated by CKEditor, for server-end

I am trying to write a regex that matches the strings between the tags in this output:

<p>save</p>
<p>11<br />\nabc<br />\nabc<br />\nhello</p>\n\n<p>dfcs dcsd</p>\n\n<p>sdcsd<br />\nsdcsdc<br />\nsdcd</p>\n
<p>1</p>\n\n<p>11<br />\n111</p>\n\n<p>1111<br />\n11111</p>\n\n<p>1</p>\n\n<p>&nbsp;</p>\n

expected output:

1) save
2) 11
3) abc
4) hello
5) dfcs dcsd
6) sdcsd
7) sdcsdc
8) 1
9) 11
10) 111
11) 1111
12) 11111
13) 1

CodePudding user response：

Your question needs more details, but I would take the rendered HTML and split it with this regex: /(?:\s*\r?\n\s*)+/

This gives you an array of lines, with the whitespace around each line break removed.

Then remove the empty lines and loop over the rest to build your numbered lines.

The code below logs the result to the console:

let body = document.querySelector('body');
let renderedHtml = body.innerText;
let lines = renderedHtml.split(/(?:\s*\r?\n\s*)+/);

// Get rid of empty lines.
lines = lines.filter((line) => {
  return !line.match(/^\s*$/);
});

console.log(lines);

let output = '';

lines.forEach((line, i) => {
  output += (i + 1) + ') ' + line + "\n";
});

console.log(output);

<p>save</p>
<p>11<br />
abc<br />
abc<br />
hello</p>

<p>dfcs dcsd</p>

<p>sdcsd<br />
sdcsdc<br />
sdcd</p>

<p>1</p>

<p>11<br />
111</p>

<p>1111<br />
11111</p>

<p>1</p>

<p>&nbsp;</p>

CodePudding user response：

Note: since the OP mentioned "for server-end", the OP most probably needs to find a package which comes close to a browser's native DOMParser Web API.
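Where no DOM implementation is available on the server, a regex-only fallback may be enough for markup this simple. The following is a minimal sketch, not a general HTML parser: it assumes the CKEditor output contains only <p> and <br /> tags plus the &nbsp; entity, and the names extractLines and sample are hypothetical.

```javascript
// Hypothetical helper: pulls text lines out of simple CKEditor-style markup
// without a DOM. Assumes only <p>, <br /> and &nbsp; occur in the input.
function extractLines(html) {
  return html
    // Treat literal backslash-n sequences (as in the question's dump) as newlines.
    .replace(/\\n/g, '\n')
    // Line-breaking tags become newlines.
    .replace(/<br\s*\/?>|<\/p>/gi, '\n')
    // Drop any remaining tags.
    .replace(/<[^>]*>/g, '')
    // Decode the non-breaking space entity to a plain space.
    .replace(/&nbsp;/g, ' ')
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Discard lines that are empty after trimming.
    .filter((line) => line !== '');
}

// A shortened piece of the question's markup, with literal "\n" sequences.
const sample = '<p>1</p>\\n\\n<p>11<br />\\n111</p>\\n\\n<p>&nbsp;</p>\\n';
console.log(extractLines(sample)); // [ '1', '11', '111' ]
```

Other entities (such as &amp; or &lt;) and nested or malformed tags are not handled here, so for anything beyond such simple markup a real HTML parser is the safer choice.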

One approach is to combine the innerText string value of the parsed markup (parsed via DOMParser.parseFromString) with a multiline capturing regex such as /^(?:\\n|\s)*(?<content>.*)/gm, together with matchAll and an additional map / filter task.

const htmlMarkup =
`<p>save</p>
<p>11<br />\nabc<br />\nabc<br />\nhello</p>\n\n<p>dfcs dcsd</p>\n\n<p>sdcsd<br />\nsdcsdc<br />\nsdcd</p>\n
<p>1</p>\n\n<p>11<br />\n111</p>\n\n<p>1111<br />\n11111</p>\n\n<p>1</p>\n\n<p>&nbsp;</p>\n`;

// see ... [https://regex101.com/r/4bzz4m/1]
const regXLineContent = /^(?:\\n|\s)*(?<content>.*)/gm;

const doc = (new DOMParser)
  .parseFromString(htmlMarkup, "text/html");

console.log(
  '... inner text ...',
  doc
    .body
    .innerText
);
console.log(
  '... list of pure content ...',
  Array
    .from(
      doc
        .body
        .innerText
        .matchAll(regXLineContent)
    )
    .map(match => match.groups.content)
    .filter(content => content !== '')
);


Another, preferred approach is to use a content-splitting regex such as /\n(?:\\n|\s)*/g to directly get an array of (valid) line contents.

Yet, for the provided markup sample, and as with the former approach, one still needs to run the sanitizing filter task.

const htmlMarkup =
`<p>save</p>
<p>11<br />\nabc<br />\nabc<br />\nhello</p>\n\n<p>dfcs dcsd</p>\n\n<p>sdcsd<br />\nsdcsdc<br />\nsdcd</p>\n
<p>1</p>\n\n<p>11<br />\n111</p>\n\n<p>1111<br />\n11111</p>\n\n<p>1</p>\n\n<p>&nbsp;</p>\n`;

// see ... [https://regex101.com/r/4bzz4m/2]
const regXLineSeparators = /\n(?:\\n|\s)*/g;

const doc = (new DOMParser)
  .parseFromString(htmlMarkup, "text/html");

console.log(
  '... inner text ...',
  doc
    .body
    .innerText
);
console.log(
  '... list of split content ...',
  doc
    .body
    .innerText
    .split(regXLineSeparators)
);
console.log(
  '... list of pure content ...',
  doc
    .body
    .innerText
    .split(regXLineSeparators)
    .filter(content => content !== '')
);

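To illustrate why the filter step is still needed after splitting, here is a small standalone sketch on a plain string. The value of text is a made-up stand-in for doc.body.innerText, and the variable names are hypothetical; the last two lines also number the result in the format the question asks for.

```javascript
// Same separator regex as in the splitting approach above.
const separators = /\n(?:\\n|\s)*/g;

// Hypothetical stand-in for doc.body.innerText, ending in blank lines.
const text = 'save\n11\nabc\n\n \n';

// Trailing separators leave an empty string at the end of the array.
const parts = text.split(separators);
console.log(parts); // [ 'save', '11', 'abc', '' ]

// The sanitizing filter removes it.
const lines = parts.filter((content) => content !== '');

// Number the lines as "1) ...", "2) ...", and so on.
const numbered = lines.map((line, i) => `${i + 1}) ${line}`).join('\n');
console.log(numbered);
```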