Find words of length 8 to 10 that start with S

I want to read a file and show the count of words having length 8 to 10 and starting with S.

I can get the total word count of the file, but I don't know how to apply the conditions for the length and for starting with S.

I am new to PHP, so if anyone has an idea, please let me know.

Below is my code:

<?php  
$count = 0;  
    
$file = fopen("data.txt", "r");  

while (($line = fgets($file)) !== false) {  
 
    $words = explode(" ", $line);  
    $count = $count + count($words);  
}  
   
echo "Number of words present in given file: " . $count;  
fclose($file);  
?>

I also need to know how to do this for a CSV file.

CodePudding user response：

Finding words is probably a bit more complicated than splitting on spaces, because words are not always separated by spaces and the text also contains punctuation.

I know that you are new to PHP, and I expect you don't know what regular expressions are, so my answer might be rather complicated, but I'll try to explain it. Regular expressions are used to search for or replace things in strings. They are a very powerful search tool, and learning to use them pays off in any programming language.

Counting the words

Splitting on spaces might not be sufficient. There might be tabs or other characters between the words. We could split the string using a regular expression, but this might also get complicated. Instead, we'll use a regular expression to match all the words inside the line. This can be done like this:

$nbr_words = preg_match_all('/[\p{L}-]+/u', $line, $matches, PREG_SET_ORDER, 0);

Here's the running example

The text could contain accents and punctuation, like this:

En-tête: Attention, en français les mots contiennent des caractères accentués.

This will return 10 matches. It would also work if you have tabs instead of spaces.

Now, what does this regular expression mean?

Let's see it in action on regex101

Explanation:

\p{L} matches any Unicode letter, such as a, b, ü or é, but only letters, in any language. So , or ´ won't be matched.
[] is used to define a list of possible chars. So [abc] would mean the letter “a”, “b” or “c”. You can also set ranges like [a-z]. If you want to say “a”, “b” or “-“ then you have to put the “-“ char at the beginning or the end, like this [ab-]. As words can have hyphens like week-end, self-service or après-midi, we have to match Unicode letters or hyphens, leading to [\p{L}-].
This Unicode letter or hyphen must occur one or more times. To do that, we’ll use the + operator. This leads us to [\p{L}-]+.
The regular expression has some flags to change some settings. I have set the u flag for Unicode. In PHP, you start your regular expression with a symbol (usually a /, but it could be ~ or whatever), then you put your pattern, you finish with the same symbol and you add the flags. So you could write ~[\p{L}-]+~u or #[\p{L}-]+#u and it would be the same.
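To try the pattern on its own, here is a minimal sketch that wraps it in a small function. The name count_words() is only a hypothetical helper for illustration, and the script is assumed to be saved as UTF-8 so the accented sample text is read correctly.

```php
<?php
// Hypothetical helper (not part of the original code): counts the words in a string.
function count_words(string $text): int
{
    // [\p{L}-]+ matches one or more Unicode letters or hyphens; the u flag enables UTF-8 mode.
    $count = preg_match_all('/[\p{L}-]+/u', $text);
    // preg_match_all() returns false on failure, so fall back to 0 in that case.
    return $count === false ? 0 : $count;
}

$sample = "En-tête: Attention, en français les mots contiennent des caractères accentués.";
echo count_words($sample), "\n"; // prints 10
```

Because the pattern matches the words themselves rather than the separators, it does not matter whether they are separated by spaces, tabs or punctuation.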

Counting words starting with `S` and 8-10 long

We'll use a regular expression again: /(?<=\P{L}|^)s[\p{L}-]{7,9}(?=\P{L}|$)/ui

A test case on regex101

This one is a bit more complicated:

we'll use the u flag for Unicode and the i flag for case-insensitive matching, as we want to match s and also S in uppercase.
then, searching for a word of 8 to 10 chars is like searching for an s followed by 7 to 9 Unicode letters. To say that you want something 7 to 9 times, you use {7,9} after the element you are searching for. So this becomes [\p{L}-]{7,9} to say we want any Unicode letter or hyphen 7 to 9 times. If we add the s in front, we get s[\p{L}-]{7,9}. This will match sex-appeal and SARS-CoV but not sos.
now, a bit more complicated. We only want to match if this word is preceded by a non-letter or the beginning of the string. This is done with a positive lookbehind (?<= something ) where something is \P{L} for a Unicode non-letter or (using the pipe | operator) the beginning of the string with the ^ operator. This leads to this positive lookbehind: (?<=\P{L}|^)
same thing for what comes after the word. It should be a non-letter or the end of the string. This is done with a positive lookahead (?= something ) where something is \P{L} to match a Unicode non-letter or $ to match the end of the string. This leads to this positive lookahead: (?=\P{L}|$)
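The pieces above can be checked together with a short sketch. The function name count_s_words() and the sample sentence are assumptions made up for this illustration; only the pattern comes from the explanation.

```php
<?php
// Hypothetical helper: counts words that start with s or S and are 8 to 10 characters long.
function count_s_words(string $text): int
{
    // Lookbehind: a non-letter or the start of the string must come before the word.
    // s[\p{L}-]{7,9}: an s followed by 7 to 9 letters or hyphens (8 to 10 characters in total).
    // Lookahead: a non-letter or the end of the string must come after the word.
    $pattern = '/(?<=\P{L}|^)s[\p{L}-]{7,9}(?=\P{L}|$)/ui';
    $count = preg_match_all($pattern, $text);
    // preg_match_all() returns false on failure, so fall back to 0 in that case.
    return $count === false ? 0 : $count;
}

$sample = "Strawberry shortcake, sos, SARS-CoV and sex-appeal.";
echo count_s_words($sample), "\n"; // prints 4
```

Here Strawberry, shortcake, SARS-CoV and sex-appeal are counted, while sos is too short. A longer word such as supercalifragilistic is rejected by the lookahead, because a letter follows the first 10 characters.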

Integrating in your code

<?php

$total_words = 0;
$total_s_words = 0;

$file = fopen("data.txt", "r");

while (($line = fgets($file)) !== false) {
    $nbr_words = preg_match_all('/[\p{L}-]+/u', $line, $matches, PREG_SET_ORDER, 0);
    if ($nbr_words) $total_words += $nbr_words;
    
    $nbr_s_words = preg_match_all('/(?<=\P{L}|^)s[\p{L}-]{7,9}(?=\P{L}|$)/ui', $line, $matches, PREG_SET_ORDER, 0);
    if ($nbr_s_words) $total_s_words += $nbr_s_words;
}

print "Number of words present in given file: $total_words\n";
print "Number of words starting with 's' and 8-10 chars long: $total_s_words\n";

fclose($file);

?>

A working online example

CodePudding user response：

As mentioned in the comments, strlen() gives the length of a string. If you are using PHP 8, you can use str_starts_with() to check whether the string starts with a given letter. In older versions, you can use strpos() or substr(), or read the character in the first position with [0] (e.g. $word[0]).

Since you have an array of words, you'll want to loop through it and check each one, something like:

foreach($words as $word) {
    if(strlen($word) >= 8 && strlen($word) <= 10) {
        //count words between 8 and 10
    }
    if(str_starts_with($word, 'S')) {
        //count words starting with S
    }
}

If you want words that are both between 8 and 10 characters and start with S at the same time, you can just combine the two above if statements.
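A minimal sketch of the combined check, assuming PHP 8 for str_starts_with(). The function name count_matching_words() and the sample words are hypothetical and only serve as an illustration.

```php
<?php
// Hypothetical helper: counts words that start with "S" and are 8 to 10 characters long.
// Requires PHP 8 because of str_starts_with().
function count_matching_words(array $words): int
{
    $count = 0;
    foreach ($words as $word) {
        $len = strlen($word);
        // Both conditions must hold at the same time, so they are joined with &&.
        if ($len >= 8 && $len <= 10 && str_starts_with($word, 'S')) {
            $count++;
        }
    }
    return $count;
}

$words = explode(" ", "Saturday Sun Strawberry sandwich Supermarkets");
echo count_matching_words($words), "\n"; // prints 2
```

Only Saturday and Strawberry are counted here: the check is case-sensitive, so sandwich is skipped, and the other words are too short or too long. Note that strlen() counts bytes, so for text with accented characters mb_strlen() (from the mbstring extension) may be the better fit.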

References for these functions: https://www.php.net/manual/en/function.strlen.php https://www.php.net/manual/en/function.str-starts-with.php

CodePudding user response：

You have to use strlen() and substr(). Example code below

<?php  
$count = 0;  
    
$file = fopen("data.txt", "r");  

while (($line = fgets($file)) !== false) {  
 
    // trim() removes the trailing newline so it is not counted in the last word's length
    $words = explode(" ", trim($line));
    foreach ($words as $word) {
        // strlen() will give the length of the string/word
        $len = strlen($word);
        if ($len >= 8 && $len <= 10) {
            // Check the first character, if S then increment the counter
            if (substr($word, 0, 1) == "S")
                $count++;
        }
    }
}

echo "Number of words starting with S and 8 to 10 characters long: " . $count;
fclose($file);  
?>
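For the CSV part of the question, the same check can be applied to every cell of every row. This is only a sketch under assumptions: the function name count_s_words_in_csv() and the file name data.csv are made up for illustration, the file is assumed to be comma-separated, and a cell may contain several words.

```php
<?php
// Hypothetical helper: counts matching words across all cells of an open CSV stream.
function count_s_words_in_csv($handle): int
{
    $count = 0;
    // fgetcsv() returns one row as an array of cell values, or false at the end of the file.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        foreach ($row as $cell) {
            // A cell may hold several words, so split it on whitespace.
            $words = preg_split('/\s+/', trim((string) $cell), -1, PREG_SPLIT_NO_EMPTY);
            foreach ($words as $word) {
                $len = strlen($word);
                if ($len >= 8 && $len <= 10 && substr($word, 0, 1) == "S") {
                    $count++;
                }
            }
        }
    }
    return $count;
}

// Usage, assuming a file named data.csv exists (hypothetical name):
// $file = fopen("data.csv", "r");
// echo count_s_words_in_csv($file);
// fclose($file);
```

Using fgetcsv() instead of explode(",", $line) means quoted cells that contain commas are still read as a single cell.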
