What is the most efficient way to get the length of common prefix and extract out the uncommon suffixes.
int a = 866559999;
int b = 866560000;
Basically I am looking to extract out 60000
and 59999
in the above example possibly in O(1) time i.e. without traversing the number and taking mod by 10. Probably this can be done using bitwise operator.
CodePudding user response:
A not so efficient an approach, but a baseline to compare. Does not use %
. as requested by OP.
int count_common(unsigned a, unsigned b) {
// Find power of 10 for a, b
const unsigned pow10_max = 1000u*1000*1000;
unsigned apow10 = pow10_max;
while (a < apow10 && apow10 > 0) {
apow10 /= 10;
}
unsigned bpow10 = pow10_max;
while (b < bpow10 && bpow10 > 0) {
bpow10 /= 10;
}
int count = 0;
while (a > 0 && b > 0) {
unsigned aquot = a/apow10;
unsigned bquot = b/bpow10;
if (aquot != bquot) break;
count ;
a -= aquot*apow10;
apow10 /= 10;
b -= bquot*bpow10;
bpow10 /= 10;
}
return count;
}
CodePudding user response:
You can still vectorize the integer - to - string conversion to have speedup.
Here is auto-vectorized integer-to-string conversion with some benchmarking:
// https://github.com/tugrul512bit/VectorizedKernel/blob/main/VectorizedKernel.h
#include "VectorizedKernel.h"
#include <iostream>
#include <string>
int main()
{
constexpr int simd =16;
auto kernelVectorized = Vectorization::createKernel<simd>(
[&](auto & factory,
auto & idThread,
unsigned int * bufferIn,
char * bufferOut
)
{
const int currentSimdWidth = factory.width;
auto digitsReversed = factory.template generateArray<unsigned int,12>();
auto digitAsChar = factory.template generate<char>();
auto tmp = factory.template generate<unsigned int>();
auto tmp2 = factory.template generate<unsigned int>();
auto tmp3 = factory.template generate<unsigned int>();
auto value = factory.template generate<unsigned int>();
auto writeIndex = factory.template generate<int>();
for(int j=0;j<N;j =simd)
{
// null termination
digitsReversed[0].broadcast(0);
writeIndex.readFrom(idThread);
value.readFrom(bufferIn j,idThread);
// 11 digits!!??
for(int i=1;i<11;i )
{
// fast-modulo to re-use the division
// value % 10 --> tmp (digit)
value.div((unsigned int)10,tmp3);
tmp3.mul((unsigned int)10,tmp2);
value.sub(tmp2,tmp);
// value /= 10
value.readFrom(tmp3);
// write digit character
tmp.add((unsigned int)'0',digitsReversed[i]);
}
writeIndex.mul(12,writeIndex);
for(int i=0;i<11;i )
{
digitsReversed[11-i-1].template castAndCopyTo<char>(digitAsChar);
digitAsChar.writeTo(bufferOut (j*12),writeIndex);
writeIndex.add((int)1,writeIndex);
}
}
},
/* defining kernel parameter types */
Vectorization::KernelArgs<unsigned int*,char*>{});
std::vector<unsigned int> testInput(N);
std::vector<std::string> testOutput(N);
std::vector<char> testOutputChar(N*12);
for(int i=0;i<N;i )
{
testInput[i]=50000000 i;
}
size_t t;
for(int j=0;j<20;j )
{
{
Vectorization::Bench bench(&t);
for(int i=0;i<N;i )
{
testOutput[i]=std::to_string(testInput[i]);
}
}
std::cout<<t<<" nanoseconds (non-vectorized)"<<std::endl;
}
for(int i=0;i<20;i )
{
{
Vectorization::Bench bench(&t);
kernelVectorized.run(simd,testInput.data(),testOutputChar.data());
}
std::cout<<t<<" nanoseconds (vectorized !!!)"<<std::endl;
}
for(int i=0;i<200;i )
{
std::cout<<std::string(testOutputChar.data() (i*12))<<" =?= "<<testOutput[i]<<std::endl;
}
return 0;
}
For FX8150, conversion for 1 million integers is 50 milliseconds with std::to_string and 32 milliseconds with vectorized version. On Cascadelake CPU, it is 24 milliseconds and 7.1 milliseconds respectively. These include writing zero-padded string to target string array.
If you are going to work with zero-padded format, then you need to skip the first n zeroes.
When string characters are not written to target, it has higher speedup but then to fully compute your algorithm, you still need to stringify the second parameter which will take another 7.1 milliseconds per 1 million inputs (making 14.2 milliseconds total) then do the comparison which would take similar time ~28 milliseconds total for Cascadelake CPU.
Same kernel with two variable converted at a time, from two input arrays:
https://godbolt.org/z/vdPhGK194
25384216 nanoseconds (non-vectorized)
25418696 nanoseconds (non-vectorized)
25396651 nanoseconds (non-vectorized)
9843377 nanoseconds (vectorized !!!)
8936126 nanoseconds (vectorized !!!)
8776456 nanoseconds (vectorized !!!)
8585255 nanoseconds (vectorized !!!)
8558483 nanoseconds (vectorized !!!)
CodePudding user response:
We can test for
a == b
=> 10/10 common digits, elsea/10 == b/10
=> 9/10 common digits, elsea/100 == b/100
=> 8/10 common digit, elsea/1000 == b/1000
=> 7/10 common digits, elsea/10000 == b/10000
=> 6/10 common digits, elsea/100000 == b/100000
=> 5/10 common digit, elsea/1000000 == b/1000000
=> 4/10 common digits, elsea/10000000 == b/10000000
=> 3/10 common digits, elsea/100000000 == b/100000000
=> 2/10 common digit, elsea/1000000000 == b/1000000000
=> 1/10 common digit, else 0/10 common digits.
Those divisions by a constant can be optimized with inverse multiplication constants instead, and be parallelized with SIMD instructions for multiplication, rightshift and comparison.
By extracting the prime factor 2 of the powers of 10, part of the division can be done with boolean ANDing, reducing the number of needed bits for the multiplication.
In the end it would be branchless and only take few cycles.
For extracting out the deviating suffixes one would multiply one of the two inputs of the topmost successful comparison (they are the same, if the comparison is successful) with the divisor and subtract it from both a
and b
. If none is successful one would just return a
and b
themselves.