Split text with array of delimiters

Time:11-15

I want a function that splits text by an array of delimiters. I have a demo that works correctly, but it is really slow. Here is an example of the parameters.

text:

"pop-pap-bab bob"

vector of delimiters:

"-"," "

the result:

"pop", "-", "pap", "-", "bab", "bob"

The function loops through the string looking for delimiters. When it finds one, it pushes the text collected so far, followed by the delimiter that was found, onto the result array. Text that is empty or contains only spaces is not pushed, and neither is a delimiter that consists only of spaces.

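std::string replace(std::string str,std::string old,std::string new_str){
    size_t pos = 0;
    // Resume each search after the previous replacement, so the loop also
    // terminates when new_str contains old.
    while ((pos = str.find(old, pos)) != std::string::npos) {
        str.replace(pos, old.length(), new_str);
        pos += new_str.length();
    }
    return str;
}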


std::vector<std::string> split_with_delimeter(std::string str,std::vector<std::string> delimeters){
    std::vector<std::string> result;
    std::string token;
    int flag = 0;
    for(int i=0;i<(int)str.size();i++){
        for(int j=0;j<(int)delimeters.size();j++){
            if(str.substr(i,delimeters.at(j).size()) == delimeters.at(j)){
                if(token != ""){
                    result.push_back(token);
                    token = "";
                }
                if(replace(delimeters.at(j)," ","") != ""){
                    result.push_back(delimeters.at(j));
                }
                i += delimeters.at(j).size()-1;
                flag = 1;
                break;
            }
        }
        if(flag == 0){token += str.at(i);}
        flag = 0;
    }
    if(token != ""){
        result.push_back(token);
    }
    return result;
}

My issue is that the function is really slow, since it has three loops. Does anyone know how to make it faster? I am sorry if I wasn't clear enough; my English isn't the best.

CodePudding user response:

Maybe, as an alternative, you could use a regex? It may also be too slow for you, though.

With a regex, life would be very simple.

Please see the following example:

#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <iterator>

const std::regex re(R"((\w+|[\- ]))");

int main() {
    
    std::string s{"pop-pap-bab bob"};
    
    std::vector<std::string> part{std::sregex_token_iterator(s.begin(),s.end(),re),{}};
    
    for (const std::string& p : part)   std::cout << p << '\n';
}

We use the std::sregex_token_iterator in combination with std::vector's range constructor to extract everything that the regex matches and put all of it into the std::vector.

The regex itself is also simple. It specifies words or delimiters.

Maybe it's worth a try.
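
Since the question takes the delimiters as a vector rather than as a fixed pattern, the same idea can be sketched with the regex built at run time. This is only an illustration: escape_regex and regex_split are hypothetical helper names, not part of the standard library, and the sketch assumes that tokens consisting only of spaces should be dropped, as in the expected result from the question.

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: escape characters that are special in an ECMAScript
// regex, so that each delimiter is matched literally.
std::string escape_regex(const std::string& text) {
    static const std::string special = "\\^$.|?*+()[]{}";
    std::string escaped;
    for (char c : text) {
        if (special.find(c) != std::string::npos) escaped += '\\';
        escaped += c;
    }
    return escaped;
}

// Hypothetical helper: split `text` on any of `delimiters`, keeping the
// delimiters as tokens but dropping tokens that are empty or only spaces.
std::vector<std::string> regex_split(const std::string& text,
                                     std::vector<std::string> delimiters) {
    // Put longer delimiters first: regex alternation takes the first
    // alternative that matches, not the longest one.
    std::sort(delimiters.begin(), delimiters.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() > b.size();
              });

    // Build the pattern "d1|d2|..." from the escaped delimiters.
    std::string pattern;
    for (const std::string& d : delimiters) {
        if (d.empty()) continue;
        if (!pattern.empty()) pattern += '|';
        pattern += escape_regex(d);
    }

    std::vector<std::string> result;
    auto keep = [&result](const std::string& token) {
        if (token.find_first_not_of(' ') != std::string::npos) {
            result.push_back(token);
        }
    };

    if (pattern.empty()) {  // no usable delimiter: the text is one token
        keep(text);
        return result;
    }

    const std::regex re(pattern);
    const std::sregex_token_iterator end;
    // Submatch -1 yields the text between matches, 0 yields the match itself.
    for (std::sregex_token_iterator it(text.begin(), text.end(), re, {-1, 0});
         it != end; ++it) {
        keep(it->str());
    }
    return result;
}
```

Calling regex_split("pop-pap-bab bob", {"-", " "}) yields "pop", "-", "pap", "-", "bab", "bob". Whether this is faster than the original loop is something you would have to measure.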

CodePudding user response:

NOTE: You say that your code is slow, but keep in mind that most answers only offer options that may speed up the program. Even if the author of an option measured a speedup, that option may be slower on your machine, so do not forget to measure the execution speed yourself.
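
To make "measure it yourself" concrete, here is a minimal timing sketch based on std::chrono::steady_clock. The names Timing and measure are hypothetical, and the sketch assumes each candidate can be called as split(text, delimiters) and returns a container with a size() member.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for the sketch below.
struct Timing {
    double average_ms;         // mean wall-clock time per call, in milliseconds
    std::size_t total_tokens;  // sum of all result sizes, returned so the work is used
};

// Hypothetical helper: call `split(text, delimiters)` `repeats` times and
// report the average time per call.
template <typename SplitFn>
Timing measure(SplitFn split, const std::string& text,
               const std::vector<std::string>& delimiters, int repeats = 10) {
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals

    std::size_t total_tokens = 0;
    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        total_tokens += split(text, delimiters).size();
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return Timing{elapsed.count() / repeats, total_tokens};
}
```

For example, measure(split_with_delimeter, text, delimiters) and the same call with another candidate give two numbers you can compare. Build with optimizations enabled and use input sizes close to your real data, otherwise the comparison says little.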

If I were you, I would create a separate function that receives an array of strings and a single delimiter, and outputs an array of the strings split by that delimiter. The problem with this approach is that if one delimiter contains another, the result may not be what you expect, but it makes it easier to try different string-splitting options and find the best one. My solution would look like this (though it requires C++20):

#include <iomanip>
#include <iostream>
#include <ranges>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> split_elems_of_array(const std::vector<std::string>& array, const std::string& delim)
{
    std::vector<std::string> result;
    for(const auto& str: array)
    {
        for (const auto word : std::views::split(str, delim))
        {
            std::string chunk(word.begin(), word.end());
            if(!chunk.empty() && chunk != " ")
                result.push_back(chunk + delim);
        }
    }

    return result;
}

std::vector<std::string> split_string(std::string str, std::vector<std::string> delims)
{
    std::vector<std::string> result = {std::string(str)};
    for(const auto&delim: delims)
        result = split_elems_of_array(result, delim);
    return {result.begin(), result.end()};
}

On my machine, my approach is 56 times faster: 67 ms versus 5112 ms. The string length is 1000000, and there are 100 delimiters, each of length 100.
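
Regarding the caveat above about one delimiter containing another: a different way to sidestep it is a single pass over the string that tries the longest delimiter first at each position. The following is only a sketch of that alternative, not the code that was benchmarked above. The names is_blank and split_longest_first are hypothetical, and it assumes the output rules from the question (blank tokens and space-only delimiters are dropped).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: true if `s` is empty or consists only of spaces.
bool is_blank(const std::string& s) {
    return s.find_first_not_of(' ') == std::string::npos;
}

// Hypothetical helper: single-pass split that prefers the longest delimiter.
std::vector<std::string> split_longest_first(const std::string& text,
                                             std::vector<std::string> delimiters) {
    // Remove empty delimiters, which would match everywhere.
    delimiters.erase(std::remove_if(delimiters.begin(), delimiters.end(),
                                    [](const std::string& d) { return d.empty(); }),
                     delimiters.end());
    // Try longer delimiters first, so "--" wins over "-" at the same position.
    std::sort(delimiters.begin(), delimiters.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() > b.size();
              });

    std::vector<std::string> result;
    std::size_t token_start = 0;  // start of the text not yet emitted
    std::size_t i = 0;
    while (i < text.size()) {
        const std::string* hit = nullptr;
        for (const std::string& d : delimiters) {
            // compare() checks in place, without allocating a substring.
            if (text.compare(i, d.size(), d) == 0) {
                hit = &d;
                break;
            }
        }
        if (hit == nullptr) {
            ++i;
            continue;
        }
        // Emit the text before the delimiter, then the delimiter itself.
        std::string token = text.substr(token_start, i - token_start);
        if (!is_blank(token)) result.push_back(std::move(token));
        if (!is_blank(*hit)) result.push_back(*hit);
        i += hit->size();
        token_start = i;
    }
    // Emit whatever follows the last delimiter.
    std::string tail = text.substr(token_start);
    if (!is_blank(tail)) result.push_back(std::move(tail));
    return result;
}
```

With the delimiters "-" and "--", the input "a--b-c" gives "a", "--", "b", "-", "c". It still checks every delimiter at every position, so with many long delimiters it may or may not beat the other options; as always, measure.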
