Minimize difference between indicator variables in Matlab-CodePudding

I'm new to Matlab and want to write a program that chooses the value of a parameter (P) to minimize the difference between two vectors, where each vector is a variable in a dataframe. The first vector (call it A) is a predetermined vector of 1s and 0s, and the second vector (call it B) has each of its entries determined as an indicator function that depends on the value of the parameter P and other variables in the dataframe. For instance, let C be a third variable in the dataset, so

A = [1, 0, 0, 1, 0]

B = [x, y, z, u, v]

where x = 1 if (C[1] 10)^0.5 - P > (C[1])^0.5 and otherwise x = 0, and similarly, y = 1 if (C[2] 10)^0.5 - P > (C[2])^0.5 and otherwise y = 0, and so on.

I'm not really sure where to start with the code, except that it might be useful to use the fminsearch command. Any suggestions?

Edit: I changed the above by raising to a power, which is closer to the actual example that I have. I'm also providing a complete example in response to a comment: Let A be as above, and let C = [10, 1, 100, 1000, 1]. Then my goal with the Matlab code would be to choose a value of P to minimize the differences between the coordinates of the vectors A and B, where B[1] = 1 if (10 10)^0.5 - P > (10)^0.5 and otherwise B[1] = 0, and similarly B[2] = 1 if (1 10)^0.5 - P > (1)^0.5 and otherwise B[2] = 0, etc. So I want to choose P to maximize the likelihood that A[1] = B[1], A[2] = B[2], etc.

I have the following setup in Matlab, where ds is the name of my dataset:

ds.B = zeros(size(ds,1),1);     % empty vector to fill
for i = 1:size(ds,1)
if ((ds.C(i)   10)^(0.5) - P > (ds.C(i))^(0.5))
    ds.B(i) = 1;
else 
    ds.B(i) = 0;
end
end

Now I want to choose the value of P to minimize the difference between A and B. How can I do this?

CodePudding user response：

EDIT: It seems fminbnd isn't optimal, likely due to the stairstep nature of the indicator function. Here's a version which tests the midpoints of all the regions between indicator function flips, plus endpoints:

ds.A = [0; 1; 0; 0; 1]; %Changed so the new method gives better results
ds.C = [10; 1; 100; 1000; 1000]; %Changed so the new method gives better results
B = @(P)double((ds.C 10).^0.5 - P > ds.C.^0.5);

costFcn = @(P)sqrt(sum((ds.A-B(P)).^2));

% Solving the equation (C 10)^0.5 - P = C^0.5 for P, and sorting the results
BCrossingPoints = sort((ds.C 10).^0.5-ds.C.^0.5);
% Taking the average of each crossing point with its neighbors
BMidpoints = (BCrossingPoints(1:end-1) BCrossingPoints(2:end))/2;
% Appending endpoints onto the midpoints
PsToTest = [BCrossingPoints(1)-0.1; BMidpoints; BCrossingPoints(end) 0.1];
% Calculate the distance from A to B at each P to test
costResult = arrayfun(costFcn,PsToTest);
% Find the minimum cost
[~,lowestCostIndex] = min(costResult);
% Find the optimum P
optimumP = PsToTest(lowestCostIndex);

ds.B = B(optimumP);

This gives optimumP = 1.81 and B = [0 1 0 0 0].

Old answer:

If you can package your distance calculation up in a single function based on P, you can use fminbnd:

ds.A = [1, 0, 0, 1, 0];
ds.C = [10, 1, 100, 1000, 1];
B = @(P)double((ds.C 10).^0.5 - P > ds.C.^0.5);

costFcn = @(P)sqrt(sum((ds.A-B(P)).^2));

optimumP = fminbnd(costFcn,-1000,1000);

ds.B = B(optimumP);

which gives optimumP of 236.068

Rather than using an if statement, I just take the logical output of your indicator function (true or false) and convert it to a double-precision number (true becomes 1 and false becomes 0). Also rather than looping through each element, I use MATLAB's array mathematics to process all of A, B, and C at once. These two simplifications turn your above setup into two one-line functions without the need for any control flow and with the ability to take P as input.

For discussion of fminbnd vs fminsearch, see

figure
plot(p,d2)
grid on
xlabel('p');title('d2')

The only degree of freedom to optimise seems to be the sign of P regardless of |P| value.

3.- f(p,x) has either no root, or just one root, depending upon p

The threshold funtion is

if f(x)>0 then B(k)==1 else B(k)==0

this is

f(p,x)=(x 10)^.5-p-x^.5

Now

(x 10).^.5-p>x.^.5 is same as (x 10).^.5-x.^.5>p

There's a range of p that keeps f(p,x)=0 without any (real) root.

For the particular case p=0 then (x 10).^.5 and x.^.5 do not intersect (until Inf reached = there's no intersection)

figure;plot(x,(x 10).^.5,x,x.^.5);grid on
[![enter image description here][3]][3]
    
y2=diff((x 10).^.5-x.^.5)
figure;plot(x(2:end),y2);
grid on;xlabel('x')
title('y2=diff((x 10).^.5-x.^.5)')
[![enter image description here][3]][3]
% 005

This means the condition f(x)>0 is always true holding all bits of B=1. With B=1 then d(A,B) turns into d(A,1), a constant.

However, for a certain value of p then there's one root and f(x)>0 is always false keeping all bits of B=0.

In this case d(A,B) the cost function turns into d(A,0) and this is A itself.

4.- P as a vector

The optimization gains in degrees of freedom if instead of P scalar, P is considered as vector.

For a given x there's a value of p that switches B(k) from 0 to 1.

Any value of p below such threshold keeps B(k)=0.

Equivalently, inverting f(x) :

g(p)=(10-p^2)^2/(4*p^2)>x

Values of x below this threshold bring B closer to A because for each element of B it's flipped to the element value of A.

Therefore, it's convenient to consider P as a vector, not a ascalar, and :

For all, or as many (as possible) elements of C to meet c(k)<(10-p^2)^2/(4*p^2) in order to get C=A or minimize d(A,C)

5.- roots of f(p,x)

syms t positive
p=[-1000:.1:1000];
zp=NaN*ones(1,numel(p));
sol=zeros(1,numel(p));
for k1=1:1:numel(p)
  p(k1)
    eq1=(t 10)^.5-p(k1)-t^.5-p(k1)==0;
    s1=solve(eq1,t);
    if ~isempty(s1) 
        zp(k1)=s1;
    end
end

nzp=~isnan(zp);
zp(nzp)

returns

 =
  
  620.0100  151.2900   64.5344   34.2225   20.2500   12.7211
 
    8.2451    5.4056    3.5260    2.2500    1.3753    0.7803
 
    0.3882    0.1488    0.0278