Advanced VS Regex Find/Replace: use string inside <h1> to add another <td> below each oc

An example describes it better. Suppose you have a structure like this:

<h1>TITLE OF HEAD 1</h1>
<table>
    <tbody>
        <tr>
            <td >ITEM 1, AFTER HEAD 1</td>
        </tr>
        <tr>
            <td >ITEM 2, AFTER HEAD 1</td>
        </tr>
    </tbody>
</table>
<table>
    <tbody>
        <tr>
            <td >ITEM 3, AFTER HEAD 1</td>
        </tr>
        <tr>
            <td >ITEM 4, AFTER HEAD 1</td>
        </tr>
        <tr>
            <td >ITEM 5, AFTER HEAD 1</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 2</h1>
<table>
    <tbody>
        <tr>
            <td >ITEM 6, AFTER HEAD 2</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 3</h1>
<table>
    <tbody>
        <tr>
            <td >ITEM 7, AFTER HEAD 3</td>
        </tr>
        <tr>
            <td >ITEM 8, AFTER HEAD 3</td>
        </tr>
        <tr>
            <td >ITEM 9, AFTER HEAD 3</td>
        </tr>
        <tr>
            <td >ITEM 10, AFTER HEAD 3</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 4</h1>
<table>
    <tbody>
        <tr>
            <td >ITEM 11, AFTER HEAD 4</td>
        </tr>
        <tr>
            <td >ITEM 12, AFTER HEAD 4</td>
        </tr>
    </tbody>
</table>

And with regex, the outcome should be:

<table>
    <tbody>
        <tr>
            <td >ITEM 1, AFTER HEAD 1</td>
            <td >TITLE OF HEAD 1</td>
        </tr>
        <tr>
            <td >ITEM 2, AFTER HEAD 1</td>
            <td >TITLE OF HEAD 1</td>
        </tr>
    </tbody>
</table>
<table>
    <tbody>
        <tr>
            <td >ITEM 3, AFTER HEAD 1</td>
            <td >TITLE OF HEAD 1</td>
        </tr>
        <tr>
            <td >ITEM 4, AFTER HEAD 1</td>
            <td >TITLE OF HEAD 1</td>
        </tr>
        <tr>
            <td >ITEM 5, AFTER HEAD 1</td>
            <td >TITLE OF HEAD 1</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 2</h1>
<table>
    <tbody>
        <tr>
            <td >ITEM 6, AFTER HEAD 2</td>
            <td >TITLE OF HEAD 2</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 3</h1>
<table>
    <tbody>
        <tr>
            <td >ITEM 7, AFTER HEAD 3</td>
            <td >TITLE OF HEAD 3</td>
        </tr>
        <tr>
            <td >ITEM 8, AFTER HEAD 3</td>
            <td >TITLE OF HEAD 3</td>
        </tr>
        <tr>
            <td >ITEM 9, AFTER HEAD 3</td>
            <td >TITLE OF HEAD 3</td>
        </tr>
        <tr>
            <td >ITEM 10, AFTER HEAD 3</td>
            <td >TITLE OF HEAD 3</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 4</h1>
<table>
    <tbody>
        <tr>
            <td >ITEM 11, AFTER HEAD 4</td>
            <td >TITLE OF HEAD 4</td>
        </tr>
        <tr>
            <td >ITEM 12, AFTER HEAD 4</td>
            <td >TITLE OF HEAD 4</td>
        </tr>
    </tbody>
</table>

What I've tried so far:

Now getting the strings inside the <h1> is easy:

find: (<h1>)(.*?)(</h1>) replace: $2
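The same capture-group idea can be tried outside Visual Studio. This is a minimal sketch in Python, not part of the original attempt; the `sample` string is made up for illustration. Note that Python's `re.sub` writes the backreference as `\2` where Visual Studio uses `$2`.

```python
import re

# Hypothetical one-line sample, not taken from a real file.
sample = "<h1>TITLE OF HEAD 1</h1>"

# Same pattern as above; group 2 is the text between the two tags.
title_only = re.sub(r"(<h1>)(.*?)(</h1>)", r"\2", sample)
print(title_only)
```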

Then I tried:

find: (<h1>)(.*?)(</h1>)(\n|.)*?(<td >.*?</td>) replace: $5<td >$2</td>

which works, but the other tags are removed as well, so I've modified it:

find: (<h1>)(.*?)(</h1>)((\n|.)*?)(<td >.*?</td>) replace: $4$6<td >$2</td>

The text of each h1 should be used for every td that follows it, until the next h1 occurs, whose text is then used instead. The problem is that this only works for the first td after each h1, not for all tds.
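The symptom can be reproduced on a tiny input. The sketch below is in Python rather than Visual Studio and uses a made-up `sample`. It suggests why only the first td is handled: every match has to start by consuming an `<h1>`, so there can be at most one match, and one added td, per h1.

```python
import re

# Hypothetical sample: one h1 followed by two rows.
sample = (
    "<h1>A</h1>\n"
    "<tr>\n<td >x</td>\n</tr>\n"
    "<tr>\n<td >y</td>\n</tr>\n"
)

# The last pattern from the attempt; Python writes $4$6...$2 as \4\6...\2.
pattern = r"(<h1>)(.*?)(</h1>)((\n|.)*?)(<td >.*?</td>)"
result, count = re.subn(pattern, r"\4\6<td >\2</td>", sample)

print(count)   # number of substitutions performed
print(result)  # only the first td gets a sibling; the h1 itself is gone
```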

Could somebody tell me what needs to be added to the regex for this to work?

Thank you!

CodePudding user response:

Use

<h1>([^<]*)<\/h1>\s*\n([\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)|(?<=<h1>([^<]*)<\/h1>[\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)

See regex proof.

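Replace with: $2$3$4$7$8<td >$1$6</td>$5$9

Only one alternative takes part in any given match, so the groups of the other alternative are empty in the replacement. Note that the first alternative consumes the <h1> line and the replacement does not put it back.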

EXPLANATION

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  <h1>                     '<h1>'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  </h1>                    '</h1>'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \n                       '\n' (newline)
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    [\w\W]*?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), non-word characters (all
                             but a-z, A-Z, 0-9, _) (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    (                        group and capture to \4:
--------------------------------------------------------------------------------
      [^\n\S]*                 any character except: '\n' (newline),
                               non-whitespace (all but \n, \r, \t,
                               \f, and " ") (0 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )                        end of \4
--------------------------------------------------------------------------------
    <td                      '<td'
--------------------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
    </td>                    '</td>'
--------------------------------------------------------------------------------
    (                        group and capture to \5:
--------------------------------------------------------------------------------
      \n                       '\n' (newline)
--------------------------------------------------------------------------------
    )                        end of \5
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    </tr>                    '</tr>'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    <h1>                     '<h1>'
--------------------------------------------------------------------------------
    (                        group and capture to \6:
--------------------------------------------------------------------------------
      [^<]*                    any character except: '<' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \6
--------------------------------------------------------------------------------
    </h1>                    '</h1>'
--------------------------------------------------------------------------------
    [\w\W]*?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), non-word characters (all
                             but a-z, A-Z, 0-9, _) (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  (                        group and capture to \7:
--------------------------------------------------------------------------------
    (                        group and capture to \8:
--------------------------------------------------------------------------------
      [^\n\S]*                 any character except: '\n' (newline),
                               non-whitespace (all but \n, \r, \t,
                               \f, and " ") (0 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )                        end of \8
--------------------------------------------------------------------------------
    <td                      '<td'
--------------------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
    </td>                    '</td>'
--------------------------------------------------------------------------------
    (                        group and capture to \9:
--------------------------------------------------------------------------------
      \n                       '\n' (newline)
--------------------------------------------------------------------------------
    )                        end of \9
--------------------------------------------------------------------------------
  )                        end of \7
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    </tr>                    '</tr>'
--------------------------------------------------------------------------------
  )                        end of look-ahead
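
The pattern above relies on a variable-length look-behind, which the .NET engine used by Visual Studio accepts but Python's standard `re` module does not. For anyone who needs the same result in a script, here is a minimal sketch that swaps the look-behind for a plain line-by-line pass that remembers the most recent h1 text. It is not the regex above, only an approximation of its effect. The names `add_title_cells`, `next_nonblank` and `keep_h1` are made up for illustration, and the sketch assumes every h1 and every td sits on its own line, as in the sample.

```python
import re

H1_RE = re.compile(r"<h1>([^<]*)</h1>")
# Leading horizontal whitespace is captured so the new cell keeps the indent.
TD_RE = re.compile(r"^([^\S\n]*)<td\s.*?</td>\s*$")


def next_nonblank(lines, i):
    """Return the first non-blank line after index i, stripped (or "")."""
    for later in lines[i + 1:]:
        if later.strip():
            return later.strip()
    return ""


def add_title_cells(html, keep_h1=False):
    """Append a <td > with the latest h1 text after the last td of each row."""
    title = None
    out = []
    lines = html.splitlines()
    for i, line in enumerate(lines):
        h1 = H1_RE.search(line)
        if h1:
            title = h1.group(1)      # remember the heading for later rows
            if keep_h1:
                out.append(line)     # the regex above drops the h1 line
            continue
        out.append(line)
        td = TD_RE.match(line)
        # Like the (?=\s*</tr>) look-ahead: only after the row's last td.
        if td and title is not None and next_nonblank(lines, i).startswith("</tr>"):
            out.append(f"{td.group(1)}<td >{title}</td>")
    text = "\n".join(out)
    return text + "\n" if html.endswith("\n") else text
```

Calling `add_title_cells(text)` drops the h1 lines, as the replacement above does; passing `keep_h1=True` keeps them in place.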