I've got a large text document that including adjacent numbers and letters. Just like that, JACK1940383DAVID30284HAROLD68372TROY4392 etc.
How can i split this like below in C
List: Jack / 1940383 , David/30284, ...
CodePudding user response:
You can use std::string::find_first_of()
and std::string::find_first_not_of()
in a loop, using std::string::substr()
to extract each piece, eg:
std::string s = "JACK1940383DAVID30284HAROLD68372TROY4392";
std::string::size_type start = 0, end;
while ((end = s.find_first_of("0123456789", start)) != std::string::npos) {
std::string name = s.substr(start, end-start);
start = end;
int number;
if ((end = s.find_first_not_of("0123456789", start)) != std::string::npos) {
number = std::stoi(s.substr(start, end-start));
}
else {
number = std::stoi(s.substr(start));
}
start = end;
// use name and number as needed...
}
CodePudding user response:
You can use regex like this:
#include <iostream>
#include <string>
#include <regex>
#include <vector>
// create a struct to group your data
// this makes it easy to store it in a vector.
struct person_t
{
std::string name;
std::string number;
};
// overloaded output operator for printing one person's details
std::ostream& operator<<(std::ostream& os, const person_t& person)
{
std::cout << person.name << ": " << person.number << std::endl;
return os;
}
// get a vector of person_t based on the input
auto get_persons(const std::string& input)
{
// make a regex in this case a regex that will match one or more capital letters
// and groups them using the ()
// then match one or more digits and group them too.
static const std::regex rx{ "([A-Z] )([0-9] )" };
std::smatch match;
// a vector to hold all the persons
std::vector<person_t> persons;
// start at begin of string and look for first part of the string
// that matches the regex.
auto cbegin = input.cbegin();
while (std::regex_search(cbegin, input.cend(), match, rx))
{
// match[0] will contain the whole match,
// match[1]-match[n] will contain the groups from the regular expressions
// match[1] will contain the match with characters and thus the name
// match[2] will contain the match with the numbers and thus the number.
// create a person_t struct with this info
person_t person{ match[1], match[2] };
// and add it to the vector
persons.push_back(person);
cbegin = match.suffix().first;
}
return persons;
}
int main()
{
// parse and split the string
auto persons = get_persons("JACK1940383DAVID30284HAROLD68372TROY4392");
// show the output
for (const auto& person : persons)
{
std::cout << person;
}
}
CodePudding user response:
As pointed in other good answers you can use
find_first_of()
,find_first_not_of()
andsubstr()
fromstd::string
in a loopregex
But it may be too much. I will add 3 more examples that you may find simpler.
The first 2 programs expects the file name on the command line for (my) convenience here, and the test file is in.txt
. Contents are the same as posted
JACK1940383DAVID30284HAROLD68372TROY4392
The last example just parses the string data declared as a char[]
1. Using fscanf()
Since the target is to consume formatted data, fscanf()
is an option. As the data structure is very simple, the program is just a one line loop:
char mask[] = "P[^0-9]P[0-9]";
while ( 2 == fscanf(F, mask, tk_key, tk_value))
std::cout << tk_key << "/" << tk_value << "\n";
program output
output is the same for all examples
JACK/1940383
DAVID/30284
HAROLD/68372
TROY/4392
code for ex. 1
#include <errno.h>
#include <iostream>
int main(int argc,char** argv)
{
if (argc < 2)
{ std::cerr << "Use: pgm FileName\n";
return -1;
}
FILE* F = fopen(argv[1], "r");
if (F == NULL)
{
perror("Could not open file");
return -1;
}
std::cerr << "File: \"" << argv[1] << "\"\n";
char tk_key[50], tk_value[50];
char mask[] = "P[^0-9]P[0-9]";
while ( 2 == fscanf(F, mask, tk_key, tk_value))
std::cout << tk_key << "/" << tk_value << "\n";
fclose(F);
return 0;
}
using a state machine
There are just 2 states so it is not a fancy FSA ;) State machines are good for representing this kind of stuff, albeit here this seems to be overkill.
#define S_LETTER 0
#define S_DIGIT 1
#include <algorithm>
#include <iostream>
#include <fstream>
using iich = std::istream_iterator<char>;
int main(int argc,char** argv)
{
std::ifstream in_file{argv[1]};
if ( not in_file.good()) return -1;
iich p {in_file}, eofile{};
std::string token{}; // string to build values
char st = S_LETTER; // state value for FSA
std::for_each(p, eofile,
[&token,&st](char ch)
{
char temp = 0;
switch (st)
{
case S_LETTER:
if ((ch >= '0') && (ch <= '9'))
{
std::cout << token << "/";
token = ch;
st = S_DIGIT; // now in number
}
else token = ch; // concat in string
break;
case S_DIGIT:
default:
if ((ch < '0') || (ch > '9'))
{ // is a letter
std::cout << token << "\n";
token = ch;
st = S_LETTER; // now in name
}
else token = ch; // concat in string
break;
}; // switch()
});
std::cout << token << "\n"; // print last token
}
Here we have no loop. for_each
gets the data from an iterator and passes it to a function that builds the name and the value as strings and cout
s them
Output is the same
3. a simple FSA to consume the data
#define S_LETTER 0
#define S_DIGIT 1
#include <iostream>
int main(void)
{
char one[] = "JACK1940383DAVID30284HAROLD68372TROY4392";
char* p = (char*)&one;
char* token = p;
char st = S_LETTER;
char temp = 0;
while (*p != 0)
{
switch (st)
{
case S_LETTER:
if ((*p >= '0') && (*p <= '9'))
{
temp = *p;
*p = 0;
std::cout << token << "/";
*p = temp;
token = p;
st = S_DIGIT; // now in number
}
break;
case S_DIGIT:
default:
if ( (*p < '0') || (*p > '9'))
{ // letter
temp = *p;
*p = 0;
std::cout << token << "\n";
*p = temp;
token = p;
st = S_LETTER; // now in name
}
break;
}; // switch()
p = 1; // next symbol
}; // while()
std::cout << token << "\n"; // print last token
}
This code just uses a C-style loop to parse the input data