Finding the k largest and smallest values from multiple arrays-CodePudding

I have an array of data generated in the future, and I am interested in obtaining the k smallest and k largest values. k could be for example 10% of the data. Since my data is enormous, I can't fit everything in memory at once.

I was simulating my idea in MATLAB to find the largest and smallest values.

x=rand(1,100)*10; %Genearting Randam number

x_sorted= sort(x)'; %True sorting just for testing my code performance

 %Simulating the divided data arrays
divide=4;           
trim_persentage=10;   %Trim persentage to discard data
y = reshape(x, length(x)/divide, divide);
x_local_sorted = sort(y);

%Finding the minima's value array
x_local_trimed_high=x_local_sorted(1:round(size(x_local_sorted,1)*trim_persentage/100),:);
globalsort_lows= sort(x_local_trimed_high(:));

%Finding the maimas's value array
x_local_trimed_low=x_local_sorted(ceil(size(x_local_sorted,1)*trim_persentage/100):end,:);
globalsort_highs= sort(x_local_trimed_low(:));

%Comparing it with True sorting to check performance
sum(x_sorted(1:length(globalsort_lows))==globalsort_lows)/length(globalsort_lows)*100
sum(x_sorted(numel(x_sorted)- 
length(globalsort_highs) 1:end)==globalsort_highs)/length(globalsort_highs)*100

The problem with the algorithm is that I am not getting the true 10% largest and 10% smallest values from the array. Does there is any better way to approach this problem?

P.S: Simplifying the code and comparing two different methods to find k max and min values. The 1st method is proposed by @hadi.

clear all
x=rand(10e3,1)*10;

kvalues=10;
%Simulating the divided data arrays
divide=8;
y = reshape(x, length(x)/divide, divide);
globalMins=[];
globalMaxs=[];

%Method 1
tic
for q=1:size(y,2)
    
    mi=find_k_min(y(:,q),kvalues);
    ma=find_k_max(y(:,q),kvalues);

    globalMins=[globalMins mi];
    globalMaxs=[globalMaxs ma];
    
end
Min_1st=sort(globalMins);
Max_1st=sort(globalMaxs);
toc

globalMins=[];
globalMaxs=[];

%Method 2
tic
for q=1:size(y,2)
    z=sort(y(:,q));
    mi=z(1:kvalues);
    ma=z(end-kvalues 1:end);
    globalMins=[globalMins; mi];
    globalMaxs=[globalMaxs; ma];
end

Min2nd=sort(globalMins);
Max2nd=sort(globalMaxs);
toc

function out=find_k_max(in,kvalue)
ma=zeros(1,kvalue);

for i=1:kvalue
    [ma(i),I]=max(in);
    in(I)=[];
end
out=ma;
end

function out=find_k_min(in,kvalue)
mi=zeros(1,kvalue);

for i=1:kvalue
    [mi(i),I]=min(in);
    in(I)=[];
end
out=mi;
end

The output of the code for multiple runs is

(1)
Elapsed time is 0.008850 seconds.
Elapsed time is 0.004439 seconds.
(2)
Elapsed time is 0.006718 seconds.
Elapsed time is 0.004550 seconds.
(3)
Elapsed time is 0.007108 seconds.
Elapsed time is 0.004618 seconds.

The method of sorting and trimming works (Method 2) efficiently as compared to the min and max method.

This deal with the efficiency of the code running performance; which is important. However, I am looking for an efficient method to find the k smallest or largest values.

CodePudding user response：

Looking at your code in more detail, I realize you are not looking for a few of the smallest and largest values, but a substantial amount. Techniques to efficiently find the k smallest values are efficient only when k << n, the total number of values (AFAIK).

Your technique involves finding the 10% smallest values in each subarray, but there is no guarantee that the 10% smallest values overall are not all in the same subarray. The only way to make this work correctly would be to determine k, the total number of values to be found, then find the k smallest values in one subarray, add those values to the 2nd subarray, get the k smallest values of the resulting combination, and repeat this with the other subarrays. At the end, you will have the k smallest values overall. This is not efficient, of course, and it limits how small the subarrays can be.

To find the 10% smallest values in an array, I would first find the 10th percentile, which can be done much more efficiently than sorting the whole array, and then find all values smaller or equal to this percentile.

Unfortunately, it is not possible to determine a percentile value across many subarrays by computing percentiles in each array separately. You would end up with exactly the same problem you ran into and I described in the second paragraph.

But you can find an approximation using histograms. If you have some idea of the distribution of values in the data, then you can fix your histogram parameters. Otherwise you need to loop over the data and collect the min and max values. With these, you can again fix histogram parameters. Now compute the histogram for each subarray and add them all together.

From the histogram you can estimate the 10th percentile. Add a margin to it (make the value a bit larger), and then collect all values in the data set that are below this estimate. Finally, remove the largest values from this set until you have the right size.

Of course you can do the same for the 10% largest values.

CodePudding user response：

Try this:

x=rand(1,100)*10; %Genearting Randam number
x_max = x(1);
x_min = x(1);
%% Comparing each position of the array
for i = 2:length(x)
  if x(i) > x_max
    x_max = x(i); % Update maximum value
  end
  if x(i) < x_min
    x_min = x(i); % Update minimum value
  end
end
%% Print outputs
disp(['Maximum Value = ',num2str(x_max)]);
disp(['Minimum Value = ',num2str(x_min)]);

Please, let me know how it goes!

CodePudding user response：

based on the comments, you could find min and max using min(array), max(array) then remove the value from the array and do the same thing again till you find 6 min and 6 max values. sort function is very expensive.

x=rand(100,1)*10; 
for i=1:6 
    [mi(i),I]=min(x);
    x(I)=[];
    [ma(i),I]=max(x);
    x(I)=[];
end

if x is very large you could use tall arrays, or you could slice x :

x=rand(4e6,1);
mi=[];
ma=[];
for s=1:4
    mi=[mi;find_6_min(x((s-1)*1e6 1:s*1e6))];
    ma=[ma;find_6_max(x((s-1)*1e6 1:s*1e6))];
end
mi=find_6_min(mi);
ma=find_6_max(ma);

function out=find_6_min(in)
    for i=1:6 
        [mi(i),I]=min(x);
        x(I)=[];
    end
end
function out=find_6_max(in)
    for i=1:6 
        [ma(i),I]=max(x);
        x(I)=[];
    end
end

CodePudding user response：

You can expand the same concept to larger arrays

%% Genearting Random number
n = 1e5;
m = round(n/10);
x = rand(1,n)*10;
x_max = flip(sort(x(1:m)));
x_min = sort(x(1:m));
%% Comparing each position of the array
for i = m 1 : length(x)
  if any(x(i) > x_max)
    for j = 1 : m
        if x(i) > x_max(j)
            x_max(j 1:end) = x_max(j:end-1);
            x_max(j) = x(i);
            break
        end
    end
  end
  if any(x(i) < x_min)
    for h = 1 : m
       if x(i) < x_min(h)
          x_min(h 1:end) = x_min(h:end-1);
          x_min(h) = x(i);
          break
       end
     end
  end
end
x_max = flip(x_max);
%% See Graphical Result
figure
box on
hold on
plot(x_min,'bo')
plot(n-m 1:n,x_max,'ro')
plot(sort(x),'k.')