How to group sentences by groups of 500 characters without breaking a sentence with PHP?

Time:03-27

I have been scratching my head over this but cannot work out a solution.

Let's say you have a text of 5000 characters. I would like to split it into blocks of at most 500 characters without breaking any sentence. For example, if a paragraph is 550 characters long and its last sentence starts at character 450 and ends at character 550, I would like that block to stop at 450 characters, so that no sentence is broken.

Any idea how to achieve this please?

My goal is to save each block into an array so I can work on them separately.

I was thinking about using preg_split, summing the lengths of the pieces, and dropping the last piece whenever the sum goes above 500 characters. But I find it difficult to separate the sentences without mistakes.

Any idea what preg_split pattern I should use to make sure that every sentence is separated correctly?

I tried to use this tool but cannot get it to give me the right output: https://www.phpliveregex.com/#tab-preg-split

Thanks

CodePudding user response:

I think you need this:

$string = "Hello world php is fun";
$array = explode(" ", $string); // splits on spaces, so this yields words, not sentences

The output is:

Array ( [0] => Hello [1] => world [2] => php [3] => is [4] => fun )

CodePudding user response:

First of all: Thank you for the nice question!

The solution is not really robust and you will have to adjust it in the future, but it shows one possible way to achieve this.

Split your text into individual sentences and save each sentence as an element in an array. This way you can check the length of each sentence while iterating over the array. As long as the length of the current sentence plus the length of the temporary variable does not exceed the maximum block length, append the sentence to the temporary variable. As soon as the combined length is greater than the maximum block length, store the temporary variable in a new array as a block and start a new temporary string with the current sentence.

<?php
$txt = "111. 222 222. 333 333 333. 444 444 444 444. 555 555 555 555 555. 333 333 333. 222 222. 111.";

$length = 30;
$arr = explode(". ", $txt);
$b = [];
$tmp = '';

foreach($arr as $k => $s) {
    if (strlen($s) + strlen($tmp) <= ($length) ) {
        $tmp = $tmp . $s .'. ';
    } else {
        $b[] = $tmp;
        $tmp = '';
        $tmp = $s . '.';
    }
    
    if((count($arr)-1) === $k) {
        $b[] = $tmp ;
        $l = end($b);        
    }
    
}

print_r($arr);
print_r($b);

Output

// Sentence Array
Array
(
    [0] => 111
    [1] => 222 222
    [2] => 333 333 333
    [3] => 444 444 444 444
    [4] => 555 555 555 555 555
    [5] => 333 333 333
    [6] => 222 222
    [7] => 111.
)

// Your new Block Array
Array
(
    [0] => 111. 222 222. 333 333 333. 
    [1] => 444 444 444 444.
    [2] => 555 555 555 555 555.
    [3] => 333 333 333.222 222. 111.. 
)
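The last block above shows the separator glitches of that loop (a missing space and a doubled period). As a rough sketch of the same greedy idea, the version below works on complete sentences and joins them with a single space. The helper name groupSentences is made up for illustration, lengths are counted in bytes with strlen, and a sentence longer than the limit is kept whole as its own oversized block, which is an assumption about the desired behaviour.

```php
<?php
// Hypothetical helper (not part of the answer above): greedily packs whole
// sentences into blocks of at most $maxLength bytes, joined by one space.
function groupSentences(array $sentences, int $maxLength): array
{
    $blocks = [];
    $current = '';

    foreach ($sentences as $sentence) {
        // What the block would look like with this sentence appended.
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;

        if ($current !== '' && strlen($candidate) > $maxLength) {
            $blocks[] = $current;  // block is full, store it
            $current = $sentence;  // start the next block with this sentence
        } else {
            $current = $candidate;
        }
    }

    if ($current !== '') {
        $blocks[] = $current;      // flush the last block
    }

    return $blocks;
}

$txt = "111. 222 222. 333 333 333. 444 444 444 444. 555 555 555 555 555. 333 333 333. 222 222. 111.";

// Rebuild complete sentences: drop the final period, split, then re-append one.
$sentences = array_map(
    fn(string $s): string => $s . '.',
    explode('. ', rtrim($txt, '.'))
);

$blocks = groupSentences($sentences, 30);
print_r($blocks);
```

With the sample text and a limit of 30 this should give four blocks, the last one being "333 333 333. 222 222. 111." with regular spacing.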

CodePudding user response:

It seems easier to split the text by sentence first. Then you can loop over the sentences and concatenate them until you reach your boundary.

$data = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Id cursus metus aliquam eleifend mi in nulla posuere. Hac habitasse platea dictumst vestibulum rhoncus. Elementum facilisis leo vel fringilla est. Sem et tortor consequat id. Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique. Dictumst vestibulum rhoncus est pellentesque elit. Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu. Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus.

Tortor consequat id porta nibh venenatis cras sed. Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam. Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec. Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue. In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed. Tristique senectus et netus et malesuada fames ac turpis.';

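$splited = preg_split('/([^.]+\.)/mU', $data, -1, PREG_SPLIT_DELIM_CAPTURE);
// Each match is a run of non-period characters up to and including the next `.`;
// PREG_SPLIT_DELIM_CAPTURE returns those matches (the sentences) in the result.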

$cleaned = array_filter(array_map('trim', $splited));

var_dump($cleaned);

This gives:

array(22) {
  [1]=>
  string(123) "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
  [3]=>
  string(53) "Id cursus metus aliquam eleifend mi in nulla posuere."
  [5]=>
  string(49) "Hac habitasse platea dictumst vestibulum rhoncus."
  [7]=>
  string(42) "Elementum facilisis leo vel fringilla est."
  [9]=>
  string(27) "Sem et tortor consequat id."
  [11]=>
  string(44) "Eleifend donec pretium vulputate sapien nec."
  [13]=>
  string(43) "Elit pellentesque habitant morbi tristique."
  [15]=>
  string(50) "Dictumst vestibulum rhoncus est pellentesque elit."
  [17]=>
  string(40) "Quis commodo odio aenean sed adipiscing."
  [19]=>
  string(53) "Id volutpat lacus laoreet non curabitur gravida arcu."
  [21]=>
  string(40) "Sit amet massa vitae tortor condimentum."
  [23]=>
  string(49) "Morbi blandit cursus risus at ultrices mi tempus."
  [25]=>
  string(50) "Tortor consequat id porta nibh venenatis cras sed."
  [27]=>
  string(38) "Urna et pharetra pharetra massa massa."
  [29]=>
  string(32) "Ut consequat semper viverra nam."
  [31]=>
  string(47) "Hac habitasse platea dictumst quisque sagittis."
  [33]=>
  string(46) "Commodo odio aenean sed adipiscing diam donec."
  [35]=>
  string(45) "Imperdiet proin fermentum leo vel orci porta."
  [37]=>
  string(40) "Quisque non tellus orci ac auctor augue."
  [39]=>
  string(37) "In cursus turpis massa tincidunt dui."
  [41]=>
  string(38) "Purus faucibus ornare suspendisse sed."
  [43]=>
  string(57) "Tristique senectus et netus et malesuada fames ac turpis."
}
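For the question about which preg_split pattern to use, one possible alternative (a sketch, not the pattern used above) is to split on the whitespace that follows a sentence terminator by using a lookbehind. That keeps the punctuation attached to each sentence, also covers `!` and `?`, and returns sequential keys without the extra trim/filter pass. The sample text is invented, and abbreviations such as "e.g." or "Dr." would still be split incorrectly.

```php
<?php
// Sample text made up for this sketch.
$text = "Hello world. Is PHP fun? Yes! It is.";

// Split on whitespace that is preceded by ., ! or ? (fixed-width lookbehind),
// so the terminator stays with its sentence.
$sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

This should print the four sentences "Hello world.", "Is PHP fun?", "Yes!" and "It is." under keys 0 to 3.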

Quick update for Maik ;)

$data = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Id cursus metus aliquam eleifend mi in nulla posuere. Hac habitasse platea dictumst vestibulum rhoncus. Elementum facilisis leo vel fringilla est. Sem et tortor consequat id. Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique. Dictumst vestibulum rhoncus est pellentesque elit. Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu. Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus.

Tortor consequat id porta nibh venenatis cras sed. Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam. Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec. Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue. In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed. Tristique senectus et netus et malesuada fames ac turpis.';

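$splited = preg_split('/([^.]+\.)/mU', $data, -1, PREG_SPLIT_DELIM_CAPTURE);
// Each match is a run of non-period characters up to and including the next `.`;
// PREG_SPLIT_DELIM_CAPTURE returns those matches (the sentences) in the result.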

$cleaned = array_filter(array_map('trim', $splited));

$lines = [];
$current = '';
$min = 50;

foreach ($cleaned as $sentence) {
  $current .= $sentence . ' '; // Trailing space separates this sentence from the next one
  $len_current = strlen($current);

  if ($len_current >= $min) {
    array_push($lines, trim($current)); // Remove the extra trailing space before storing the line

    $current = '';
  }
}

if ($current !== '') {
  $lines[] = trim($current); // Keep any leftover sentences shorter than $min
}

var_dump($lines);

The result looks like this:

array(14) {
  [0]=>
  string(123) "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
  [1]=>
  string(53) "Id cursus metus aliquam eleifend mi in nulla posuere."
  [2]=>
  string(49) "Hac habitasse platea dictumst vestibulum rhoncus."
  [3]=>
  string(70) "Elementum facilisis leo vel fringilla est. Sem et tortor consequat id."
  [4]=>
  string(88) "Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique."
  [5]=>
  string(50) "Dictumst vestibulum rhoncus est pellentesque elit."
  [6]=>
  string(94) "Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu."
  [7]=>
  string(90) "Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus."
  [8]=>
  string(50) "Tortor consequat id porta nibh venenatis cras sed."
  [9]=>
  string(71) "Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam."
  [10]=>
  string(94) "Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec."
  [11]=>
  string(86) "Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue."
  [12]=>
  string(76) "In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed."
  [13]=>
  string(57) "Tristique senectus et netus et malesuada fames ac turpis."
}
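One caveat if the limit is meant in characters: strlen counts bytes, so accented or other multibyte text reaches the limit sooner than expected. A small sketch of the difference follows, assuming UTF-8 input. The helper name charLength is hypothetical, and it falls back to a PCRE-based count in case the mbstring extension is not loaded.

```php
<?php
// Hypothetical helper: counts characters rather than bytes (assumes UTF-8).
function charLength(string $text): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($text, 'UTF-8');
    }
    // Fallback: count code points with a UTF-8 aware regex.
    return (int) preg_match_all('/./su', $text);
}

$sentence = "Tr\u{E8}s bien."; // "Très bien." written with an escape so the file encoding does not matter

$bytes = strlen($sentence);     // number of bytes
$chars = charLength($sentence); // number of characters

printf("%d bytes, %d characters\n", $bytes, $chars);
```

Swapping strlen for such a helper in the grouping loop would make the limit apply to characters instead of bytes.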

CodePudding user response:

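$longString = 'I like an apple. You like oranges. We like fruit. I like meat, also.';
$maxLength = 20;

// Match up to $maxLength characters, then split on the whitespace (or end of string) that follows.
// Note: this breaks at word boundaries, so a block may end in the middle of a sentence.
var_export(
    preg_split("/.{0,{$maxLength}}\K(?:\s+|$)/", $longString, 0, PREG_SPLIT_NO_EMPTY)
);

The output is: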

array (
  0 => 'I like an apple. You',
  1 => 'like oranges. We',
  2 => 'like fruit. I like',
  3 => 'meat, also.',
)