Let's assume I have 100 different kinds of items; each item has a name and a physical weight. I know the names of all 100 items but the weight of only 80 of them.
When I ship items, I pack them in groups of 10 and sum the weight of those items. Because some items are missing their weight, this gives an inaccurate sum when I'm about to ship.
I have different shipments with missing weights:
Shipment 1
Item Name | Item Weight |
---|---|
Item 2 | 10 |
Item 27 | 20 |
Item 42 | 20 |
Item 71 | - |
Item 77 | - |
Total weight: 75
Shipment 2
Item Name | Item Weight |
---|---|
Item 2 | 10 |
Item 27 | 20 |
Item 42 | 20 |
Item 71 | - |
Item 92 | - |
Total weight: 90
Shipment 3
Item Name | Item Weight |
---|---|
Item 2 | 10 |
Item 27 | 20 |
Item 42 | 20 |
Item 55 | 35 |
Item 77 | - |
Total weight: 100
Since some of the shipments share the same items with missing weights, and I have each shipment's total weight, is there a way with machine learning to determine the weight of these items without unpacking the entire shipment? Or would it just be, in this case, a 100x3 matrix with a lot of empty values?
At this point I'm not really sure if I should use some type of regression to solve this, or if it's just a matrix that would expand a lot if I had n more items to ship. I also wondered if this was some type of knapsack problem, but I hope someone can guide me in the right direction.
CodePudding user response:
Forget about machine learning. This is a simple system of linear equations. Each equation comes from subtracting the known item weights from a shipment's total (e.g. Shipment 1: 75 - 10 - 20 - 20 = 25):
w_71 + w_77 = 25
w_71 + w_92 = 40
w_77 = 15
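For more shipments you wouldn't want to write these equations by hand. A minimal sketch of deriving the matrix form programmatically, assuming (hypothetically) that each shipment is stored as a dict of its known weights, the names of the items with unknown weight, and the measured total:

```python
import numpy as np

# Hypothetical input format: per shipment, the known item weights,
# the names of the items with unknown weight, and the measured total.
shipments = [
    {"known": [10, 20, 20], "unknown": ["w71", "w77"], "total": 75},
    {"known": [10, 20, 20], "unknown": ["w71", "w92"], "total": 90},
    {"known": [10, 20, 20, 35], "unknown": ["w77"], "total": 100},
]

# Collect the unknown variables in a stable order.
unknowns = sorted({name for s in shipments for name in s["unknown"]})

# One equation per shipment: sum of unknowns = total - sum of knowns.
A = np.zeros((len(shipments), len(unknowns)))
b = np.zeros(len(shipments))
for i, s in enumerate(shipments):
    for name in s["unknown"]:
        A[i, unknowns.index(name)] = 1
    b[i] = s["total"] - sum(s["known"])

print(unknowns)  # ['w71', 'w77', 'w92']
print(b)         # [25. 40. 15.]
```

The resulting `A` and `b` can be fed to any of the solvers below.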
You can solve it with `sympy.solvers.solveset.linsolve`, `scipy.optimize.linprog`, `scipy.linalg.lstsq`, or `numpy.linalg.lstsq`.
- `sympy.linsolve` is maybe the easiest to understand if you are not familiar with matrices; however, if the system is underdetermined, then instead of returning a particular solution to the system, `sympy.linsolve` will return the general solution in parametric form.
- `scipy.lstsq` and `numpy.lstsq` expect the problem to be given in matrix form. If there is more than one possible solution, they will return the most "average" solution. However, they cannot take any positivity constraint into account: they might return a solution where one of the variables is negative. You can sometimes fix this behaviour by adding a new equation to the system to manually force a variable to be positive, then solving again.
- `scipy.linprog` expects the problem to be given in matrix form; it also expects you to specify a linear objective function, used to choose which particular solution is "best" in case there is more than one possible solution. `linprog` also considers all variables nonnegative by default, or allows you to specify explicit bounds for the variables yourself. It also allows you to add inequality constraints, in addition to the equations, if you wish.
Using sympy.solvers.solveset.linsolve
from sympy.solvers.solveset import linsolve
from sympy import symbols

w71, w77, w92 = symbols('w71 w77 w92')
eqs = [w71 + w77 - 25, w71 + w92 - 40, w77 - 15]
solution = linsolve(eqs, [w71, w77, w92])
# solution = {(10, 15, 30)}
In your example, there is only one possible solution, so `linsolve` returned that solution: w71 = 10, w77 = 15, w92 = 30.
However, in case there is more than one possible solution, `linsolve` will return a parametric form for the general solution:
x, y, z = symbols('x y z')
eqs = [x + y - 10, y + z - 20]
solution = linsolve(eqs, [x, y, z])
# solution = {(z - 10, 20 - z, z)}
Here there are infinitely many possible solutions. `linsolve` is telling us that we can pick any value for z, and then we get the corresponding x and y as x = z - 10 and y = 20 - z.
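If you later learn the value of one variable (say you weigh one of the items by hand), you can substitute it into the parametric solution to recover the others; a small sketch, assuming a hypothetical measured value z = 12:

```python
from sympy import symbols
from sympy.solvers.solveset import linsolve

x, y, z = symbols('x y z')
solution = linsolve([x + y - 10, y + z - 20], [x, y, z])
# solution = {(z - 10, 20 - z, z)}

# Suppose we weigh the item behind z by hand and find z = 12;
# substitute it into the parametric tuple to get x and y.
(particular,) = solution  # unpack the single parametric tuple
print(particular.subs(z, 12))  # (2, 8, 12)
```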
Using numpy.linalg.lstsq
`lstsq` expects the system of equations to be given in matrix form. If there is more than one possible solution, it will return the most "average" solution. For instance, if the system of equations is simply x + y = 10, then `lstsq` will return the particular solution x = 5, y = 5 and will ignore more "extreme" solutions such as x = 10, y = 0.
from numpy.linalg import lstsq

# w_71 + w_77 = 25
# w_71 + w_92 = 40
# w_77 = 15
A = [[1, 1, 0], [1, 0, 1], [0, 1, 0]]
b = [25, 40, 15]
solution = lstsq(A, b, rcond=None)
solution[0]
# array([10., 15., 30.])
Here `lstsq` found the unique solution, w71 = 10, w77 = 15, w92 = 30.
# x + y = 10
# y + z = 20
A = [[1, 1, 0], [0, 1, 1]]
b = [10, 20]
solution = lstsq(A, b, rcond=None)
solution[0]
# array([-3.55271368e-15, 1.00000000e+01, 1.00000000e+01])
Here `lstsq` had to choose a particular solution, and chose the one it considered most "average": x = 0, y = 10, z = 10. You might want to round the solution to integers.
One drawback of `lstsq` is that it doesn't take your non-negativity constraint into account. That is, it might return a solution where one of the variables is negative:
# x + y = 2
# y + z = 20
A = [[1, 1, 0], [0, 1, 1]]
b = [2, 20]
solution = lstsq(A, b, rcond=None)
solution[0]
# array([-5.33333333, 7.33333333, 12.66666667])
See how `lstsq` ignored the possible positive solution x = 1, y = 1, z = 18 and instead returned the solution it considered most "average": x = -5.33, y = 7.33, z = 12.67.
One way to fix this is to add an equation yourself to force the offending variable to be positive. For instance, here we noticed that `lstsq` wanted x to be negative, so we can manually force x to be equal to 1 instead, and solve again:
# x + y = 2
# y + z = 20
# x = 1
A = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
b = [2, 20, 1]
solution = lstsq(A, b, rcond=None)
solution[0]
# array([ 1.,  1., 19.])
Now that we have manually forced x to be 1, `lstsq` finds the solution x = 1, y = 1, z = 19, which we're happier with.
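As an alternative to forcing variables by hand, scipy also ships a dedicated nonnegative least-squares solver, `scipy.optimize.nnls` (not used elsewhere in this answer), which minimises ||Ax - b|| subject to x >= 0 directly; a sketch on the same system:

```python
import numpy as np
from scipy.optimize import nnls

# x + y = 2
# y + z = 20
A = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
b = np.array([2.0, 20.0])

# nnls enforces nonnegativity itself, so no extra equation is needed.
solution, residual = nnls(A, b)
print(solution)  # every entry is >= 0 by construction
```

The trade-off is that `nnls` returns one particular nonnegative solution; it doesn't tell you whether others exist.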
Using scipy.optimize.linprog
The particularity of `linprog` is that it expects you to specify an "objective" function, used to choose a particular solution in case there is more than one possible solution.
Also, `linprog` allows you to specify bounds for the variables. The default is that all variables are nonnegative, which is what you want.
from scipy.optimize import linprog

# w_71 + w_77 = 25
# w_71 + w_92 = 40
# w_77 = 15
A = [[1, 1, 0], [1, 0, 1], [0, 1, 0]]
b = [25, 40, 15]
c = [1, 1, 1]  # coefficients for the objective: minimise w71 + w77 + w92
solution = linprog(c, A_eq=A, b_eq=b)
solution.x
# array([10., 15., 30.])
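If you additionally have rough prior knowledge about each item (say, every packed item weighs at least 5 and at most 50), `linprog`'s `bounds` argument can encode it; a sketch with those hypothetical limits:

```python
from scipy.optimize import linprog

# w_71 + w_77 = 25
# w_71 + w_92 = 40
# w_77 = 15
A = [[1, 1, 0], [1, 0, 1], [0, 1, 0]]
b = [25, 40, 15]
c = [1, 1, 1]  # objective: minimise w71 + w77 + w92

# Hypothetical prior knowledge: each weight lies between 5 and 50.
solution = linprog(c, A_eq=A, b_eq=b, bounds=[(5, 50)] * 3)
print(solution.x)  # [10. 15. 30.]
```

Since the system's unique solution already lies inside those bounds, the answer is unchanged; with an underdetermined system, bounds like these would narrow down which solutions are admissible.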