An example describes it better. Suppose you have a structure like this:
<h1>TITLE OF HEAD 1</h1>
<table>
<tbody>
<tr>
<td >ITEM 1, AFTER HEAD 1</td>
</tr>
<tr>
<td >ITEM 2, AFTER HEAD 1</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td >ITEM 3, AFTER HEAD 1</td>
</tr>
<tr>
<td >ITEM 4, AFTER HEAD 1</td>
</tr>
<tr>
<td >ITEM 5, AFTER HEAD 1</td>
</tr>
</tbody>
</table>
<h1>TITLE OF HEAD 2</h1>
<table>
<tbody>
<tr>
<td >ITEM 6, AFTER HEAD 2</td>
</tr>
</tbody>
</table>
<h1>TITLE OF HEAD 3</h1>
<table>
<tbody>
<tr>
<td >ITEM 7, AFTER HEAD 3</td>
</tr>
<tr>
<td >ITEM 8, AFTER HEAD 3</td>
</tr>
<tr>
<td >ITEM 9, AFTER HEAD 3</td>
</tr>
<tr>
<td >ITEM 10, AFTER HEAD 3</td>
</tr>
</tbody>
</table>
<h1>TITLE OF HEAD 4</h1>
<table>
<tbody>
<tr>
<td >ITEM 11, AFTER HEAD 4</td>
</tr>
<tr>
<td >ITEM 12, AFTER HEAD 4</td>
</tr>
</tbody>
</table>
And with regex, the outcome should be:
<table>
<tbody>
<tr>
<td >ITEM 1, AFTER HEAD 1</td>
<td >TITLE OF HEAD 1</td>
</tr>
<tr>
<td >ITEM 2, AFTER HEAD 1</td>
<td >TITLE OF HEAD 1</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td >ITEM 3, AFTER HEAD 1</td>
<td >TITLE OF HEAD 1</td>
</tr>
<tr>
<td >ITEM 4, AFTER HEAD 1</td>
<td >TITLE OF HEAD 1</td>
</tr>
<tr>
<td >ITEM 5, AFTER HEAD 1</td>
<td >TITLE OF HEAD 1</td>
</tr>
</tbody>
</table>
<h1>TITLE OF HEAD 2</h1>
<table>
<tbody>
<tr>
<td >ITEM 6, AFTER HEAD 2</td>
<td >TITLE OF HEAD 2</td>
</tr>
</tbody>
</table>
<h1>TITLE OF HEAD 3</h1>
<table>
<tbody>
<tr>
<td >ITEM 7, AFTER HEAD 3</td>
<td >TITLE OF HEAD 3</td>
</tr>
<tr>
<td >ITEM 8, AFTER HEAD 3</td>
<td >TITLE OF HEAD 3</td>
</tr>
<tr>
<td >ITEM 9, AFTER HEAD 3</td>
<td >TITLE OF HEAD 3</td>
</tr>
<tr>
<td >ITEM 10, AFTER HEAD 3</td>
<td >TITLE OF HEAD 3</td>
</tr>
</tbody>
</table>
<h1>TITLE OF HEAD 4</h1>
<table>
<tbody>
<tr>
<td >ITEM 11, AFTER HEAD 4</td>
<td >TITLE OF HEAD 4</td>
</tr>
<tr>
<td >ITEM 12, AFTER HEAD 4</td>
<td >TITLE OF HEAD 4</td>
</tr>
</tbody>
</table>
What I've tried so far:
Getting the strings inside the <h1> is easy:
find: (<h1>)(.*?)(</h1>)
replace: $2
Then I tried:
find: (<h1>)(.*?)(</h1>)(\n|.)*?(<td >.*?</td>)
replace: $5<td >$2</td>
which works, but the other tags are removed as well, so I modified it:
find: (<h1>)(.*?)(</h1>)((\n|.)*?)(<td >.*?</td>)
replace: $4$6<td >$2</td>
The text of each h1 should be used for the tds that occur after it, until a new h1 occurs, whose text is then used instead. The problem is that this only works for the first td after each h1, not for all tds.
Could somebody tell me what needs to be added to the regex for this to work?
Thank you!
CodePudding user response:
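Use the following pattern. It relies on a variable-length lookbehind, so it needs a regex engine that supports one (for example .NET or a recent JavaScript engine):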
<h1>([^<]*)<\/h1>\s*\n([\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)|(?<=<h1>([^<]*)<\/h1>[\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)
See regex proof.
Replace with: $2$3$4$7$8<td >$1$6</td>$5$9
EXPLANATION
NODE EXPLANATION
--------------------------------------------------------------------------------
<h1> '<h1>'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
</h1> '</h1>'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
[\w\W]*? any character of: word characters (a-z,
A-Z, 0-9, _), non-word characters (all
but a-z, A-Z, 0-9, _) (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
[^\n\S]* any character except: '\n' (newline),
non-whitespace (all but \n, \r, \t,
\f, and " ") (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of \4
--------------------------------------------------------------------------------
<td '<td'
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
</td> '</td>'
--------------------------------------------------------------------------------
( group and capture to \5:
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
) end of \5
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
</tr> '</tr>'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
<h1> '<h1>'
--------------------------------------------------------------------------------
( group and capture to \6:
--------------------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \6
--------------------------------------------------------------------------------
</h1> '</h1>'
--------------------------------------------------------------------------------
[\w\W]*? any character of: word characters (a-z,
A-Z, 0-9, _), non-word characters (all
but a-z, A-Z, 0-9, _) (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
( group and capture to \7:
--------------------------------------------------------------------------------
( group and capture to \8:
--------------------------------------------------------------------------------
[^\n\S]* any character except: '\n' (newline),
non-whitespace (all but \n, \r, \t,
\f, and " ") (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of \8
--------------------------------------------------------------------------------
<td '<td'
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
</td> '</td>'
--------------------------------------------------------------------------------
( group and capture to \9:
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
) end of \9
--------------------------------------------------------------------------------
) end of \7
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
</tr> '</tr>'
--------------------------------------------------------------------------------
) end of look-ahead
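If the replacement can be scripted rather than run in an editor's find/replace box, the same idea (remember the most recent h1 text and append it as an extra td after every td that follows) can be sketched without any lookbehind. The following is only a minimal sketch in Python, not part of the answer above: the function name add_heading_cells is made up for illustration, and it assumes every cell is written on a single line as <td ...>...</td> with no nested tags inside the h1. Unlike the expected output in the question, it leaves all h1 elements in place.

```python
import re

# One pass over the text: the first alternative matches a heading, the
# second matches a single-line <td ...>...</td> cell plus its indentation.
TOKEN = re.compile(
    r"<h1>(?P<title>[^<]*)</h1>"
    r"|(?P<indent>[^\S\n]*)<td\s[^>]*>.*?</td>"
)

def add_heading_cells(html: str) -> str:
    """Append a <td > holding the latest <h1> text after every <td>."""
    current = None  # text of the most recent <h1>, if any was seen

    def repl(match: re.Match) -> str:
        nonlocal current
        if match.group("title") is not None:
            # A heading: remember its text and leave it unchanged.
            current = match.group("title")
            return match.group(0)
        if current is None:
            # A cell before the first heading: nothing to append.
            return match.group(0)
        # A cell: keep it and add a sibling cell on the next line.
        return f'{match.group(0)}\n{match.group("indent")}<td >{current}</td>'

    return TOKEN.sub(repl, html)

sample = (
    "<h1>TITLE OF HEAD 1</h1>\n<table>\n<tr>\n"
    "<td >ITEM 1, AFTER HEAD 1</td>\n</tr>\n</table>"
)
print(add_heading_cells(sample))
```

Because the heading text is tracked in a variable instead of inside the pattern, this works for any number of tds and tables between two headings. For markup that is less regular than the sample, an HTML parser would be a safer choice than a regex.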