I would like to read a big (3.5GB) file as fast as possible, so I think I should load it into RAM first instead of using ifstream and getline().
My goal is to find lines of data that contain the same string. Example:
textdata abc123 XD0AA
textdata abc123 XD0AB
textdata abc123 XD0AC
textdata abc123 XD0AA
So I would need to read the first line, then iterate through the whole file until I find the fourth line (in this example), which contains the same XD0AA string.
This is what I did so far:
string line;
ifstream f("../BIG_TEXT_FILE.txt");
stringstream buffer;
buffer << f.rdbuf();
string f_data = buffer.str();
for (int i = 0; i < f_data.length(); i++)
{
getline(buffer, line);//is this correct way to get the line (for iteration)?
line = line.substr(0, line.find("abc"));
cout << line << endl;
}
f.close();
return 0;
But it uses twice as much RAM as the size of the file (7GB).
Here is fixed code:
string line, token;
int a;
ifstream f("../BIG_TEXT_FILE.txt");
stringstream buffer;
buffer << f.rdbuf();
//string f_data = buffer.str();
f.close();
while (true)
{
getline(buffer, line);
if (line.length() == 0)
break;
//string delimiter = "15380022";
if (line.find("15380022") != std::string::npos)
cout << line << endl;
}
return 0;
But how do I make getline() read all over again?
CodePudding user response:
I have used compression in those situations. Decompressing has been faster than the disk I/O, and text compresses pretty well.
An example of reading gzipped file is here:
How to read a .gz file line-by-line in C++?
CodePudding user response:
I would like to read big (3.5GB) file as fast as possible - thus I think I should load it into RAM first
You will most likely not experience any significant performance benefit by loading the entire file into memory.
All modern common operating systems have a disk cache, which automatically keeps recent and frequently used disk reads in RAM.
Even if you do load the entire file into memory, in most common modern operating systems, this merely means that you are loading the file into virtual memory. It does not guarantee that the file is actually in physical memory, because virtual memory that is not used is often swapped to disk by the operating system. Therefore, it is generally best to simply let the operating system handle everything.
If you really want to ensure that the file is actually in physical memory (which I do not recommend), then you will have to use OS-specific functionality, such as the function mlock on Linux or VirtualLock on Microsoft Windows, which prevents the operating system from swapping the memory to disk. However, depending on the system configuration, locking such a large amount of memory will probably not be possible for a normal user with default privileges, because it could endanger system stability. Therefore, special user privileges may be required.
But how do I make getline() read all over again?
The problem is that reading from an object of type std::stringstream with std::getline (or operator >>) will consume the input. In that respect, it is no different than reading from a file using std::ifstream. However, when reading from a file, you can simply seek back to the beginning of the file, using the function std::istream::seekg. Therefore, the best solution would probably be to read directly from the file using std::ifstream.