I have a large binary file that is around 70MB in size. In my program, I have a method that looks up byte[] array patterns against the file, to see whether they exist within it or not. I have around 1 to 10 million patterns to run against the file. The options I see are the following:
- Read the file into memory with byte[] file = File.ReadAllBytes(path), then perform byte[] lookups of the byte[] pattern(s) against the file bytes. I have used multiple methods for doing that from different topics on SO, such as "byte[] array pattern search", "Find an array (byte[]) inside another array?", and "Best way to check if a byte[] is contained in another byte[]". However, byte[]-versus-byte[] lookups are extremely slow when the source is large. It would take weeks to run 1 million patterns on a normal computer.
- Convert both the file and the patterns into hex strings, then do the comparisons using Contains() to perform the lookup. This is faster than the byte[] lookups, but converting bytes to hex makes the file larger in memory, which means more processing time.
- Convert both the file and the patterns into strings using Encoding.GetEncoding(1252).GetString() and perform the lookups. Then compensate for the limitations of binary-to-string conversion (I know they are not fully compatible) by verifying each Contains() match with a method that performs byte[] lookups (the first option). This is the fastest option for me.
Using the third approach, which is the fastest, 1 million patterns would take two thirds of a day to a full day, depending on the CPU. I need information on how to speed up the lookups.
Thank you.
Edit: Thanks to @MySkullCaveIsADarkPlace, I now have a fourth approach that is faster than the three above. I was previously using limited byte[] lookup algorithms; I am now using MemoryExtensions.IndexOf() for the byte[] lookups, which is faster than all three approaches above. Even so, the lookups are still slow: it takes about 1 minute for 1,000 pattern lookups.
The patterns are 12-20 bytes each.
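For reference, a minimal sketch of what this fourth approach looks like (path, patterns and the hit counting are just illustrative placeholders):
byte[] file = File.ReadAllBytes(path);
int hits = 0;
foreach (byte[] pattern in patterns)
{
    // MemoryExtensions.IndexOf searches a span of bytes for a sub-span.
    if (file.AsSpan().IndexOf(pattern) >= 0)
    {
        hits++;
    }
}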
CodePudding user response:
I assume that you are looking up one pattern after the other. I.e., you are doing 1 to 10 million pattern searches in the file bytes!
Consider doing it the other way round. Loop once through your file bytes and determine if the current position is the start of a pattern.
To do this efficiently, I suggest organizing the patterns in an array of lists of patterns. Each pattern is stored in a list at array index 256 * byte[0] + byte[1].
With 10 million patterns you will have an average of about 152 patterns in the list at each array position. This allows a fast lookup.
You could also use the first 3 bytes (256 * (256 * byte[0] + byte[1]) + byte[2]), resulting in an array of length 256^3 ~ 16 million (I have worked with longer arrays; no problem for C#). Then you would have less than one pattern per array position on average. This results in a nearly linear search time, O(n) with respect to the file length. A huge improvement compared to the O(num_of_patterns * file_length) of a straightforward algorithm.
We can use a simple byte-by-byte comparison to compare the patterns, since we compare starting at a known position. (Boyer-Moore is of no use here.)
2-byte index (patterns must be at least 2 bytes long)
byte[] file = { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
byte[][] patterns = {
    new byte[]{ 12, 3, 5, 76, 8, 0, 6, 125, 11 },
    new byte[]{ 211, 122, 22, 4 },
    new byte[]{ 17, 211, 5, 8 },
    new byte[]{ 22, 4, 7, 89, 76, 64 },
};

var patternMatrix = new List<byte[]>[256 * 256];

// Add patterns to matrix.
// We assume pattern.Length >= 2.
foreach (byte[] pattern in patterns) {
    int index = 256 * pattern[0] + pattern[1];
    patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
    patternMatrix[index].Add(pattern);
}

// The search. Loop through the file.
for (int fileIndex = 0; fileIndex < file.Length - 1; fileIndex++) { // Length - 1 because we need 2 bytes.
    int patternIndex = 256 * file[fileIndex] + file[fileIndex + 1];
    List<byte[]> candidatePatterns = patternMatrix[patternIndex];
    if (candidatePatterns != null) {
        foreach (byte[] candidate in candidatePatterns) {
            if (fileIndex + candidate.Length <= file.Length) {
                bool found = true;
                for (int i = 0; i < candidate.Length; i++) {
                    if (candidate[i] != file[fileIndex + i]) {
                        found = false;
                        break;
                    }
                }
                if (found) {
                    Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]} ..}} found at file index {fileIndex}");
                }
            }
        }
    }
}
Same algorithm with 3 bytes (even faster!)
3-byte index (patterns must be at least 3 bytes long)
var patternMatrix = new List<byte[]>[256 * 256 * 256];

// Add patterns to matrix.
// We assume pattern.Length >= 3.
foreach (byte[] pattern in patterns) {
    int index = 256 * 256 * pattern[0] + 256 * pattern[1] + pattern[2];
    patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
    patternMatrix[index].Add(pattern);
}

// The search. Loop through the file.
for (int fileIndex = 0; fileIndex < file.Length - 2; fileIndex++) { // Length - 2 because we need 3 bytes.
    int patternIndex = 256 * 256 * file[fileIndex] + 256 * file[fileIndex + 1] + file[fileIndex + 2];
    List<byte[]> candidatePatterns = patternMatrix[patternIndex];
    if (candidatePatterns != null) {
        foreach (byte[] candidate in candidatePatterns) {
            if (fileIndex + candidate.Length <= file.Length) {
                bool found = true;
                for (int i = 0; i < candidate.Length; i++) {
                    if (candidate[i] != file[fileIndex + i]) {
                        found = false;
                        break;
                    }
                }
                if (found) {
                    Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]} ..}} found at file index {fileIndex}");
                }
            }
        }
    }
}
Why is it faster?
A simple nested-loops algorithm performs up to ~ 70 * 10^6 * 10^7 = 7 * 10^14 (700 trillion) pattern comparisons! 70 * 10^6 is the length of the file; 10^7 (10 million) is the number of patterns.
My algorithm with a 2-byte index makes ~ 70 * 10^6 * 152 ≈ 10^10 pattern comparisons. The number 152 comes from the fact that there are on average 152 patterns for a given 2-byte index (~ 10^7 / (256 * 256)). This is 65,536 times faster.
With 3 bytes you get fewer than about 70 * 10^6 pattern comparisons. This is more than 10 million times faster. This is the case because we store all the patterns in an array whose length (16 million) is greater than the number of patterns (10 million or less). Therefore, at any byte position within the file (together with the 2 following positions), we can pick up only the patterns starting with those same 3 bytes, and that is on average less than one pattern. Sometimes there may be 0 or 1, sometimes 2 or 3, but rarely more patterns at any array position.
Try it. The shift is from O(n^2) to nearly O(n). The initialization time is O(n). The assumption is that the first 2 or 3 bytes of the patterns are more or less randomly distributed. If this were not the case, my algorithm would degrade to O(n^2) in the worst case.
See: Big O notation - Wikipedia.
CodePudding user response:
One idea is to group the patterns by their length, put each group in a HashSet<byte[]>
for searching with O(1) complexity, and then scan the source byte[]
index by index for all groups. Since the number of groups in your case is small (only 9 groups), this optimization should yield significant performance improvements. Here is an implementation:
IEnumerable<byte[]> FindMatches(byte[] source, byte[][] patterns)
{
    Dictionary<int, HashSet<ArraySegment<byte>>> buckets = new();
    ArraySegmentComparer comparer = new();
    foreach (byte[] pattern in patterns)
    {
        HashSet<ArraySegment<byte>> bucket;
        if (!buckets.TryGetValue(pattern.Length, out bucket))
        {
            bucket = new(comparer);
            buckets.Add(pattern.Length, bucket);
        }
        bucket.Add(pattern); // Implicit cast byte[] => ArraySegment<byte>
    }
    for (int i = 0; i < source.Length; i++)
    {
        foreach (var (length, bucket) in buckets)
        {
            if (i + length > source.Length) continue;
            ArraySegment<byte> slice = new(source, i, length);
            if (bucket.TryGetValue(slice, out var pattern))
            {
                yield return pattern.Array;
                bucket.Remove(slice);
            }
        }
    }
}
Currently (.NET 6) there is no equality comparer for sequences available in the standard libraries, so you'll have to provide a custom one:
class ArraySegmentComparer : IEqualityComparer<ArraySegment<byte>>
{
    public bool Equals(ArraySegment<byte> x, ArraySegment<byte> y)
    {
        return x.AsSpan().SequenceEqual(y);
    }

    public int GetHashCode(ArraySegment<byte> obj)
    {
        HashCode hashcode = new();
        hashcode.AddBytes(obj);
        return hashcode.ToHashCode();
    }
}
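A possible way to call it, just as a sketch (path and the patterns array are assumed to already exist):
byte[] source = File.ReadAllBytes(path);
foreach (byte[] match in FindMatches(source, patterns))
{
    Console.WriteLine($"Found a pattern of length {match.Length}");
}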
This algorithm assumes that there are no duplicates in the patterns. If there are, only one of the duplicates will be emitted.
On my (not very speedy) PC this algorithm takes around 10 seconds to create the buckets dictionary (for 10,000,000 patterns of size 12-20), and then an additional 5-6 minutes to scan a source byte[] of size 70,000,000 (around 200,000 bytes scanned per second). The number of patterns does not affect the scanning phase (as long as the number of groups does not increase).
Parallelizing this algorithm is not trivial, because the buckets
are mutated during the scan.
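One possible way around this, shown below only as a sketch (not benchmarked): skip the bucket.Remove call so the buckets stay read-only after construction, scan the positions with Parallel.For, and collect the matches in a thread-safe collection. FindMatchesParallel is a hypothetical name; it needs using System.Collections.Concurrent, System.Linq and System.Threading.Tasks.
List<byte[]> FindMatchesParallel(byte[] source, byte[][] patterns)
{
    // Group the patterns by length, exactly as in FindMatches above.
    Dictionary<int, HashSet<ArraySegment<byte>>> buckets = new();
    ArraySegmentComparer comparer = new();
    foreach (byte[] pattern in patterns)
    {
        if (!buckets.TryGetValue(pattern.Length, out var bucket))
        {
            bucket = new(comparer);
            buckets.Add(pattern.Length, bucket);
        }
        bucket.Add(pattern);
    }
    // The buckets are only read from here on, so concurrent lookups are safe.
    ConcurrentBag<byte[]> matches = new();
    Parallel.For(0, source.Length, i =>
    {
        foreach (var (length, bucket) in buckets)
        {
            if (i + length > source.Length) continue;
            ArraySegment<byte> slice = new(source, i, length);
            if (bucket.TryGetValue(slice, out var pattern))
            {
                matches.Add(pattern.Array);
            }
        }
    });
    // A pattern may be found at several positions; deduplicate by reference.
    return matches.Distinct().ToList();
}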