How can I split adjacent numbers and letters in C++?

I have a large text document that contains adjacent numbers and letters, like this: JACK1940383DAVID30284HAROLD68372TROY4392 etc.

How can I split it as shown below in C++?

List: Jack / 1940383 , David/30284, ...

CodePudding user response：

You can use std::string::find_first_of() and std::string::find_first_not_of() in a loop, using std::string::substr() to extract each piece, eg:

std::string s = "JACK1940383DAVID30284HAROLD68372TROY4392";
std::string::size_type start = 0, end;

while ((end = s.find_first_of("0123456789", start)) != std::string::npos) {
    std::string name = s.substr(start, end-start);
    start = end;

    int number;
    if ((end = s.find_first_not_of("0123456789", start)) != std::string::npos) {
        number = std::stoi(s.substr(start, end-start));
    }
    else {
        number = std::stoi(s.substr(start));
    }
    start = end;

    // use name and number as needed...
}
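
For reference, the loop above can be wrapped in a function that collects the pieces. This is only a sketch: `Entry` and `split_pairs` are names invented for this example, and the number is kept as a string (an assumption about what is wanted) so that a very long digit run cannot overflow `std::stoi`.

```cpp
#include <string>
#include <vector>

// Hypothetical record type for this sketch (not from the original answer).
struct Entry {
    std::string name;
    std::string number;  // kept as text so long digit runs cannot overflow an int
};

// Split "JACK1940383DAVID30284..." into name/number pairs using
// find_first_of / find_first_not_of / substr.
std::vector<Entry> split_pairs(const std::string& s)
{
    static const char* const digits = "0123456789";
    std::vector<Entry> out;
    std::string::size_type start = 0;

    while (start < s.size()) {
        // The name runs up to the first digit at or after start.
        const auto digit_pos = s.find_first_of(digits, start);
        if (digit_pos == std::string::npos)
            break;  // trailing letters with no number: ignored in this sketch

        // The number runs up to the next non-digit (or the end of the string).
        const auto next = s.find_first_not_of(digits, digit_pos);
        // substr clamps the count, so npos - digit_pos safely means "to the end".
        out.push_back({ s.substr(start, digit_pos - start),
                        s.substr(digit_pos, next - digit_pos) });
        if (next == std::string::npos)
            break;
        start = next;
    }
    return out;
}
```

For the sample string this returns four entries, the first being JACK / 1940383.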


CodePudding user response：

You can use regex like this:

#include <iostream>
#include <string>
#include <regex>
#include <vector>

// create a struct to group your data
// this makes it easy to store it in a vector.
struct person_t
{
    std::string name;
    std::string number;
};


// overloaded output operator for printing one person's details
std::ostream& operator<<(std::ostream& os, const person_t& person)
{
    os << person.name << ": " << person.number << std::endl;
    return os;
}

// get a vector of person_t based on the input
auto get_persons(const std::string& input)
{
    // make a regex in this case a regex that will match one or more capital letters 
    // and groups them using the ()
    // then match one or more digits and group them too.
    static const std::regex rx{ "([A-Z]+)([0-9]+)" };
    std::smatch match;

    // a vector to hold all the persons
    std::vector<person_t> persons;

    // start at begin of string and look for first part of the string
    // that matches the regex.
    auto cbegin = input.cbegin();

    while (std::regex_search(cbegin, input.cend(), match, rx))
    {
        // match[0] will contain the whole match, 
        // match[1]-match[n] will contain the groups from the regular expressions
        // match[1] will contain the match with characters and thus the name
        // match[2] will contain the match with the numbers and thus the number.
        // create a person_t struct with this info
        person_t person{ match[1], match[2] };

        // and add it to the vector
        persons.push_back(person);
        cbegin = match.suffix().first;
    }

    return persons;
}

int main()
{
    // parse and split the string
    auto persons = get_persons("JACK1940383DAVID30284HAROLD68372TROY4392");

    // show the output
    for (const auto& person : persons)
    {
        std::cout << person;
    }
}
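
As a possible variation, `std::sregex_iterator` can walk all matches without updating the search position by hand. The sketch below assumes the same pattern as above (one or more capital letters followed by one or more digits); `regex_pairs` is a name made up for this example, and pairs of strings stand in for `person_t`.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collect (name, number) pairs with a regex iterator.
std::vector<std::pair<std::string, std::string>> regex_pairs(const std::string& input)
{
    // Group 1: capital letters (the name). Group 2: digits (the number).
    static const std::regex rx{ "([A-Z]+)([0-9]+)" };
    std::vector<std::pair<std::string, std::string>> out;

    // A default-constructed sregex_iterator acts as the end-of-matches marker.
    for (std::sregex_iterator it{ input.begin(), input.end(), rx }, end; it != end; ++it)
        out.emplace_back((*it)[1].str(), (*it)[2].str());

    return out;
}
```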

CodePudding user response：

As pointed out in the other good answers, you can use

find_first_of(), find_first_not_of() and substr() from std::string in a loop
regex

But that may be more than you need. Here are 3 more examples that you may find simpler.

The first 2 programs expect the file name on the command line for (my) convenience here, and the test file is in.txt. Its contents are the same as posted:

JACK1940383DAVID30284HAROLD68372TROY4392

The last example just parses the string data declared as a char[]

1. Using `fscanf()`

Since the goal is to consume formatted data, fscanf() is an option. As the data structure is very simple, the program is just a one-line loop:

    char  mask[] = "%49[^0-9]%49[0-9]";
    while ( 2 == fscanf(F, mask, tk_key, tk_value))
        std::cout << tk_key << "/" << tk_value << "\n";

program output

output is the same for all examples

JACK/1940383
DAVID/30284
HAROLD/68372
TROY/4392

code for ex. 1

#include <cstdio>
#include <iostream>
int main(int argc,char** argv)
{
    if (argc < 2)
    {   std::cerr << "Use: pgm FileName\n";
        return -1;
    }
    FILE* F = fopen(argv[1], "r");
    if (F == NULL)
    {
        perror("Could not open file");
        return -1;
    }
    std::cerr << "File: \"" << argv[1] << "\"\n";
    char  tk_key[50], tk_value[50];
    // up to 49 non-digits, then up to 49 digits (each buffer holds 50 chars)
    char  mask[] = "%49[^0-9]%49[0-9]";
    while ( 2 == fscanf(F, mask, tk_key, tk_value))
        std::cout << tk_key << "/" << tk_value << "\n";
    fclose(F);
    return 0;
}
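
The same scansets also work on a string that is already in memory. The sketch below uses `sscanf()` with `%n`, which stores how many characters have been consumed so far, to advance through the text; `scan_pairs` is a hypothetical name, and the 50-byte buffers mirror the program above.

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: parse name/number pairs from a C string with sscanf.
std::vector<std::pair<std::string, std::string>> scan_pairs(const char* text)
{
    std::vector<std::pair<std::string, std::string>> out;
    char key[50], value[50];
    int consumed = 0;

    // %49[^0-9] reads up to 49 non-digits, %49[0-9] up to 49 digits,
    // and %n records the number of characters read (it is not counted
    // in the return value, so a full pair still returns 2).
    while (std::sscanf(text, "%49[^0-9]%49[0-9]%n", key, value, &consumed) == 2) {
        out.emplace_back(key, value);
        text += consumed;  // continue right after the pair just read
    }
    return out;
}
```

The loop stops at the end of the string or at the first piece that is not a complete name/number pair.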

2. Using a state machine

There are just 2 states so it is not a fancy FSA ;) State machines are good for representing this kind of stuff, albeit here this seems to be overkill.

#define S_LETTER 0
#define S_DIGIT 1
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
using iich = std::istream_iterator<char>;

int main(int argc,char** argv)
{
    if (argc < 2) return -1;  // the file name is required
    std::ifstream in_file{argv[1]};
    if ( not in_file.good()) return -1;
    iich p {in_file}, eofile{};
    std::string token{}; // string to build values
    char        st = S_LETTER; // state value for FSA
    std::for_each(p, eofile,
        [&token,&st](char ch)
        {
            char temp = 0;
            switch (st)
            {
                case S_LETTER:
                    if ((ch >= '0') && (ch <= '9'))
                    {
                        std::cout << token << "/";
                        token = ch;
                        st    = S_DIGIT;  // now in number
                    }
                    else token += ch;  // concat in string
                    break;

                case S_DIGIT:
                default:

                    if ((ch < '0') || (ch > '9'))
                    {  // is a letter
                        std::cout << token << "\n";
                        token = ch;
                        st    = S_LETTER;  // now in name
                    }
                    else token += ch;  // concat in string
                    break;
            };  // switch()
        });
    std::cout << token << "\n";  // print last token
}

Here we have no explicit loop: for_each gets the data from an iterator and passes each character to a function that builds the name and the value as strings and prints them.

Output is the same
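
The same two-state machine can also be written as a function over a `std::string`, which makes it easy to try without a file. This is a sketch, not the program above: `State` and `fsm_split` are names invented here, it uses a plain range-for instead of for_each, and it returns the pairs instead of printing them.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

enum class State { Letter, Digit };  // illustrative names, not from the answer

// Hypothetical helper: run the two-state machine over a string.
std::vector<std::pair<std::string, std::string>> fsm_split(const std::string& input)
{
    std::vector<std::pair<std::string, std::string>> out;
    std::string name, number;
    State st = State::Letter;

    for (unsigned char ch : input) {
        const bool is_digit = std::isdigit(ch) != 0;
        if (st == State::Letter && is_digit) {
            st = State::Digit;               // name finished, number starts
        } else if (st == State::Digit && !is_digit) {
            out.emplace_back(name, number);  // pair complete
            name.clear();
            number.clear();
            st = State::Letter;              // next name starts
        }
        // Append the character to whichever token the current state is building.
        (st == State::Letter ? name : number) += static_cast<char>(ch);
    }
    if (st == State::Digit)
        out.emplace_back(name, number);      // flush the last pair
    return out;
}
```

A trailing name with no number after it is dropped in this sketch; that is a design choice made here, not something the original program specifies.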

3. A simple FSA to consume the data

#define     S_LETTER 0
#define     S_DIGIT  1
#include <iostream>

int main(void)
{
    char one[] = "JACK1940383DAVID30284HAROLD68372TROY4392";
    char*       p     = one;
    char*       token = p;
    char        st    = S_LETTER;
    char        temp  = 0;
    while (*p != 0)
    {
        switch (st)
        {
            case S_LETTER:
                if ((*p >= '0') && (*p <= '9'))
                {
                    temp = *p;
                    *p   = 0;
                    std::cout << token << "/";
                    *p    = temp;
                    token = p;
                    st    = S_DIGIT;  // now in number
                }
                break;

            case S_DIGIT:
            default:
                if ( (*p < '0') || (*p > '9'))
                {   // letter
                    temp = *p;
                    *p   = 0;
                    std::cout << token << "\n";
                    *p    = temp;
                    token = p;
                    st    = S_LETTER;  // now in name
                }
                break;
        };  // switch()
        p += 1; // next symbol
    };  // while()
    std::cout << token << "\n"; // print last token
}

This code just uses a C-style loop to parse the input data
