Locating tags in a string in PHP (with respect to the string with tags removed)

Time:12-04

I want to create a function that labels the location of certain HTML tags (e.g., italics tags) in a string with respect to the locations of characters in a tagless version of the string. (I intend to use this label data to train a neural network for tag recovery from data that has had the tags stripped out.) The magic function I want to create is label_italics() in the below code.

$string = 'Disney movies: <i>Aladdin</i>, <i>Beauty and the Beast</i>.';
$string_all_tags_stripped_but_italics = strip_tags($string, '<i>'); // same as $string in this example
$string_all_tags_stripped = strip_tags($string); // 'Disney movies: Aladdin, Beauty and the Beast.'
$featr_string = $string_all_tags_stripped.' '; // Add a single space at the end
$label_string = label_italics($string_all_tags_stripped_but_italics);
echo $featr_string; // 'Disney movies: Aladdin, Beauty and the Beast. '
echo $label_string; // '0000000000000001000000101000000000000000000010'

If a character is supposed to have an <i> or </i> tag immediately preceding it, it is labeled with a 1 in $label_string; otherwise, it is labeled with a 0 in $label_string. (I'm thinking I don't need to worry about the difference between <i> and </i> because the recoverer will simply alternate between <i> and </i> so as to maintain well-formed markup, but I'm open to reasons as to why I'm wrong about this.)
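To make the alternation idea concrete, here is a minimal sketch of the inverse step, assuming the label string has exactly one label per character of the feature string (trailing space included). The helper name restore_italics() is hypothetical and is not part of the question. It works on bytes, so it assumes single-byte text.

```php
<?php
// Hypothetical helper (not from the original post): re-insert italics tags
// from a label string, alternating between <i> and </i>.
function restore_italics($text, $labels) {
    $output = '';
    $open = false; // true while an <i> is waiting for its </i>
    $length = strlen($labels);
    for ($c = 0; $c < $length; $c++) {
        if ($labels[$c] === '1') {
            // Alternate: the first 1 opens, the next 1 closes, and so on
            $output .= $open ? '</i>' : '<i>';
            $open = !$open;
        }
        // Assumes $text and $labels have the same length
        $output .= $text[$c];
    }
    // Drop the single space that was appended to the feature string
    return substr($output, 0, -1);
}
```

One consequence of labeling with only 0 and 1 is that two adjacent tags, such as <i></i>, cannot be told apart from a single tag, so the alternation only round-trips when no two tags touch.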

I'm just not sure what the best way to create label_italics() is.

I wrote this function that seems to work in most cases, but it also seems a little clunky and I'm posting here in hopes that there is a better way. (If this turns out to be the best way, the below function would be easily generalizable to any HTML tag passed in as a second argument to the function, which could be renamed label_tag().)

function label_italics($stripped) {
  while ((stripos($stripped, '<i>') || stripos($stripped, '</i>')) !== FALSE) {
    $position = stripos($stripped, '<i>');
    if (is_numeric($position)) {
      for ($c = 0; $c < $position; $c++) {
        $output .= '0';
      }
      $output .= '1';
    }
    $stripped = substr($stripped, $position + 4, NULL);
    $position = stripos($stripped, '</i>');
    if (is_numeric($position)) {
      for ($c = 0; $c < $position; $c++) {
        $output .= '0';
      }
      $output .= '1';
    }
    $stripped = substr($stripped, $position + 5, NULL);
  }
  for ($c = 0; $c <= strlen($stripped); $c++) {
    $output .= '0';
  }
  return $output;
}

The function produces bad output if the input contains surplus tags or badly formed markup. For example, for the following input:

$string = 'Disney movies: <i><i>Aladdin</i>, <i>Beauty and the Beast</i>.';

The following misaligned output is given.

Disney movies: Aladdin, Beauty and the Beast.
0000000000000001000000000101000000000000000000010
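A quick length comparison is one way to catch this kind of misalignment before the data reaches training. The sketch below assumes the labeling convention described above (one label per character of the tagless string, plus one for the trailing space); the helper name labels_are_aligned() is hypothetical.

```php
<?php
// Hypothetical sanity check (not from the original post): the label string is
// aligned only if it has one label per character of the tagless string,
// plus one label for the appended trailing space.
function labels_are_aligned($string, $labels) {
    $featr_string = strip_tags($string) . ' ';
    return strlen($labels) === strlen($featr_string);
}

// Usage: compare a candidate label string against its source string
$string = 'Disney movies: <i><i>Aladdin</i>, <i>Beauty and the Beast</i>.';
$labels = '0000000000000001000000000101000000000000000000010';
var_dump(labels_are_aligned($string, $labels));
```

A matching length does not prove the 1s are in the right places, but a mismatch is a cheap signal that the markup confused the labeler.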

(I'm also open to reasons why I'm going about the creation of the label data all wrong.)

CodePudding user response:

I think I've got something. How about this:

function label_italics($string) {
    return preg_replace(['/<i>/', '/<\/i>/', '/[^#]/', '/##0/', '/#0/'], 
                        ['#', '#', '0', '2', '1'], $string);
}

see: https://3v4l.org/cKG46

Note that you need to supply the string with the tags in it.

How does it work?

I use preg_replace() because it accepts regular expressions, which one of the patterns needs. The function goes through the two arrays and executes each replacement in order. First it replaces all occurrences of <i> and </i> with #, and everything else with 0. Then it replaces ##0 with 2 and #0 with 1. The 2 is an extra that covers two adjacent tags, such as <i></i>. You can remove it, and simplify the function, if you don't need it.

The use of # is arbitrary. You can use any character that doesn't clash with the content of your string.
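To see the order of replacements in action, the same patterns can be applied one at a time. This is only an illustrative trace of the function above; the $steps and $trace variables are names introduced here for the example.

```php
<?php
// Apply the replacements from the answer one at a time and keep each
// intermediate string, so the effect of every step is visible.
$string = 'Disney movies: <i>Aladdin</i>, <i>Beauty and the Beast</i>.';
$steps = [
    '/<i>/'   => '#', // opening tags become the marker
    '/<\/i>/' => '#', // closing tags become the marker
    '/[^#]/'  => '0', // every other character becomes 0
    '/##0/'   => '2', // two adjacent tags before a character
    '/#0/'    => '1', // one tag before a character
];
$trace = [];
foreach ($steps as $pattern => $replacement) {
    $string = preg_replace($pattern, $replacement, $string);
    $trace[] = $string;
    echo str_pad($pattern, 10), $string, PHP_EOL;
}
```

Each marker absorbs the 0 that follows it, which is why the final string has the same length as the tagless text.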


Here's an updated version. It appends a space so that a tag at the end of the line still gets a label, and it ignores any # characters in the line.

function label_italics($string) {
    return preg_replace(['/[^<\/i\>]/', '/<i>/', '/<\/i>/', '/i/', '/##0/', '/#0/'], 
                        ['0', '#', '#', '0', '2', '1'], $string . ' ');
}

See: https://3v4l.org/BTnLc
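One caveat worth checking against your data: without the u modifier, these patterns match byte by byte, so a multibyte UTF-8 character would receive more than one label. Below is a minimal sketch of a UTF-8-aware variant, assuming the input is valid UTF-8 and never contains the control character \x01, which is used here as the marker. The function name label_italics_utf8() is hypothetical, and unlike the versions above it collapses adjacent tags into a single 1 instead of producing a 2.

```php
<?php
// Hypothetical UTF-8-aware variant (not from the original answer).
// Assumes valid UTF-8 input that never contains the byte \x01.
function label_italics_utf8($string) {
    return preg_replace(
        [
            '/<\/?i>/i',   // <i> and </i> become the marker
            '/[^\x01]/u',  // every other code point becomes 0
            '/\x01+0/',    // one or more markers absorb the following 0
        ],
        ["\x01", '0', '1'],
        $string . ' '      // trailing space so a final tag can be labeled
    );
}

// Usage: "\u{e9}" is the two-byte character e-acute
echo label_italics_utf8("Caf\u{e9}: <i>Am\u{e9}lie</i>."), PHP_EOL;
```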

CodePudding user response:

Here is an alternative approach to writing the label_italics function:

function label_italics($stripped) {
    $output = '';

    // Find the <i> and </i> tags, together with their offsets, in the input string
    preg_match_all('/<\/?i>/i', $stripped, $matches, PREG_OFFSET_CAPTURE);

    // Convert each offset into a position in the tagless string by
    // subtracting the combined length of the tags that precede it
    $tag_positions = [];
    $removed = 0;
    foreach ($matches[0] as [$tag, $offset]) {
        $tag_positions[$offset - $removed] = true;
        $removed += strlen($tag);
    }

    // Loop through each character of the tagless string, plus the trailing space
    $length = strlen($stripped) - $removed + 1;
    for ($i = 0; $i < $length; $i++) {
        // If the current character has a tag immediately preceding it, add a 1 to the output string
        $output .= isset($tag_positions[$i]) ? '1' : '0';
    }
    return $output;
}

This function uses preg_match_all() with the PREG_OFFSET_CAPTURE flag to find the positions of the <i> and </i> tags in the input string, converts them to positions in the tagless string, and then loops through each character to determine whether a tag immediately precedes it. This approach should be more robust than the stripos() approach because it does not assume that <i> and </i> strictly alternate: surplus or adjacent tags simply mark the same character once.
