The Gap Between Machine and Human Visual Ability (1)

Author: Wang Yin

Reprinted from: http://www.yinwang.org/blog-cn/2019/09/14/machine-vs-human

Many people believe artificial intelligence is about to be realized, because they often confuse "recognition" with "understanding". What is called "artificial intelligence" today is doing recognition: speech recognition, image recognition. True intelligence requires the ability to understand. How far are we from understanding? I'm afraid the real work has not even started.

For a long time I have been thinking about the difference between understanding and recognition. The two are very different, yet they are constantly confused. I know deeply how important understanding is, but I find that few other people know what "understanding" is. Because the AI field confuses recognition with understanding, it has long been in a state of chaos.

Recently, because image recognition and other fields have made considerable progress, people have developed a lot of science-fiction-style blind faith in AI, and the biggest "AI boom" since the 1980s has emerged. Many people think AI is really about to be achieved. They have been swayed by companies' "black technology" hype and do not see the huge gap between existing AI methods and human intelligence. So here I would like to describe the difference, as I understand it, between machine and human visual ability, in the hope that some readers can cool their heads again.

In an earlier article, "The Limitations of Artificial Intelligence", I laid out the misconceptions in the field of natural language processing. At that time I did not know much about computer vision, so that article contained nothing on vision. Now that I am familiar with the various methods of machine vision, I want to cover the visual side in detail here. Taken together, the two articles summarize my views on the two sides of AI: language and vision.

The difference between "recognition" and "visual understanding"

In vision, the AI field confuses "recognition" with "visual understanding". The so-called "AI" that is popular today does "recognition", while the visual systems of animals have a powerful capacity for visual understanding. There is a fundamental difference between image recognition and visual understanding.

A deep learning vision model (CNN) is a "pixels => name" function fitted from a large amount of data. It may be able to guess the "name" of an object from a pile of pixels, but it does not know "what" the object is and cannot operate on it. Note that I deliberately use the word "guess", because it really is guessing; it does not know for certain the way a person does.

"Image recognition" sits at the same level as "speech recognition": both stay at the syntactic (literal) level and never reach "semantics". Speech recognition is a "speech => text" transformation, and image recognition is an "image => text" transformation. Both output text, and "text" and "understanding" are on two different levels. Text is only a surface symbol; it becomes meaningful only once you understand it.

What does it mean to "understand an object"? At the very least, you have to know what shape it has, what parts it is made of, where each part is located and where its boundaries lie, roughly what material it is made of, and what its properties are. Only then can you act on it effectively and achieve the effect you need. Otherwise the object is just a box with a label on it, and you cannot judge it or operate on it accurately.
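
To make the "box with a label" point concrete, here is a minimal sketch of a recognizer as a plain "pixels => name" function. It is a toy linear scorer with a softmax, not a real CNN, and the label set, the weights and the `classify` helper are all hypothetical, chosen only for illustration. Note what comes out: a string and a confidence, with nothing about shape, parts or material.

```python
import math

# Hypothetical label set, for illustration only.
LABELS = ["knife", "cat", "car"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pixels, weights):
    """Map a flat list of pixel values to a (name, confidence) pair.

    `weights` holds one row of per-pixel weights for each label; in a
    real system these would be fitted from a large amount of data.
    """
    scores = [sum(w * p for w, p in zip(row, pixels)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    # The whole output is a name and a number: the model's best guess.
    return LABELS[best], probs[best]
```

Everything downstream of `classify` sees only that pair. Whatever structure the object has, none of it is represented in the result.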

When you face all kinds of everyday things, is it their names that come to mind? Suppose you pick up a knife to cut fruit, with nobody beside you to talk to. Does the word "knife" appear in your mind? Generally not. What comes to mind is not a name but "common sense". Common sense is not words; it is a kind of data that is abstract yet concrete.

You know it is a knife, but what your brain retrieves is not the word "knife"; it is "what a knife is". Your visual system tells you what its structure is like. You know it is made of metal. You see the blade, the edge and the handle. It may be a folding knife. Experience tells you that the edge is the sharp part that cuts, and that touching it might hurt you; the handle is the part you can hold. If the knife is folded, you have to open it. From which end do you start to open it, and where is its pivot?

You pick up the knife smoothly and begin to cut the fruit, but the word "knife" still does not appear in your head, nor do words such as "blade" or "handle". While cutting the fruit, the "language center" of your brain may be humming the lyrics of a song you have liked recently, which has nothing to do with the knife. Language is just a tool you need when communicating with other people; doing things by yourself does not require language. To complete the action of cutting fruit, all you need is the understanding of the object's structure produced by vision, not language.

You can use an object correctly without knowing its name. By the same token, merely knowing the name of an object does not help you use it. If the first thing that appears in your head when you see an object is its name, you must be a very dull person who cannot manage daily life. Today's "machine vision" is basically like that: the machine may be able to tell you the name of an object in a picture, but it does not know what the object is and cannot operate on it.

Imagine a robot that cannot understand the structure of objects and can only use image recognition to label the regions of your head one by one: "forehead", "hair", "ears"... Would you dare let it give you a haircut?

This is the difference between what I call "visual understanding" and "recognition". As you can see, the difference is huge.

Visual recognition cannot do without understanding

Even if we lower our standards and only require identifying the name of an object, pixel-based image recognition such as a convolutional neural network (CNN) still cannot identify objects the way people do. For a person, recognizing an object is not a two-beat rhythm of "take a photo, identify", as it is for a neural network, but a dynamic, continuous process: observe, understand, observe, understand, observe, understand...

The senses take in information, interleaved with understanding, and understanding in turn controls the direction and order of observation. In human recognition, "observation/understanding" forms an inseparable whole. A person sees one part of an object, understands what it is, then goes on to observe what is around it. This process repeats over and over until the person finally determines what the object is. The machine's recognition process contains no understanding component, which is why machines cannot match humans in image recognition.

This "observation/understanding" process happens so fast, in the blink of an eye, that many people are not aware of the "understanding component" in it. So let us slow the process down into a slow-motion close-up and see what happens. If you have never seen this thing, do you know what it is?

Even a person who has never seen this thing will know that it is a "vehicle". Why? Because it has wheels. How do you know those are wheels? Think carefully: because they are round and have an axle in the middle, so they can roll along the ground. How do you know that is an "axle"? I will not keep questioning you; think about it yourself. All of this analysis is produced by "visual understanding", and that understanding depends on the experience accumulated over your life, which is what I call "common sense".

In fact, to identify this thing you do not need that much analysis. You do the analysis only because someone else asks you "how do you know?". People identify objects by so-called "intuition". At the sight of this picture, a 3D model naturally forms in your mind. A moment later, you realize that this model conforms to the mechanical principles of a "vehicle", because you have seen cars, trains, tractors... Scenes of how this thing might move may appear in your mind, and you seem to see its wheels turning. You can even see one wheel running over a rock and being lifted, together with its connecting rod, while the whole vehicle keeps its balance and does not tip over, so this vehicle can probably cope with rugged outdoor terrain.

There is one point here that is easy to overlook: the wheel axles must be connected to the body. If the wheels are not connected to the body, or are in the wrong position so that they evidently cannot move together with it, anyone can tell. This relationship, in which the axle is connected to the body, belongs to a concept called "topology".

Topology is a rather difficult branch of mathematics, but people seem to be born understanding some simple topological concepts. In fact, higher animals all seem to understand some topological concepts to a greater or lesser degree. They know what is connected and what is separate. Hunting animals know that the tail of their prey is attached to its body, so they can catch it by biting the tail.

An important concept in topology is the "hole". Reasonably intelligent animals generally understand the concept of a hole. Cave-dwelling animals such as rats and rabbits obviously must understand what a hole is, and so do their predators, such as cats. If I give my cat a cardboard box to play with and cut one hole in it for him to crawl through, he will not go in. I have to cut two holes before he will go in. Why? Because he knows that if the box has only one hole and that hole is blocked after he goes in, he will not be able to get out!

How can a machine understand the concept of a hole? How can it understand "continuous"?
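
For contrast, here is roughly what a purely mechanical notion of "hole" looks like on a flat binary image: count the background regions that are fully enclosed by foreground. This is a sketch under stated assumptions (a 2D grid of 0s and 1s, 4-connected background, and a hypothetical `count_holes` helper). It only counts enclosed regions in a picture; it captures nothing of what the cat knows about a 3D box, such as being trapped when the only opening is blocked.

```python
from collections import deque


def count_holes(grid):
    """Count enclosed background regions (0s) in a grid of 0/1 rows.

    A background region counts as a hole only if none of its cells
    lies on the border of the grid.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 0 or seen[r][c]:
                continue
            # Flood-fill one background region, noting if it reaches the border.
            touches_border = False
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_border = True
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if not touches_border:
                holes += 1
    return holes
```

A ring of 1s around a single 0 gives one hole, while a shape that is open on one side gives none, because its background reaches the border.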

In general, when a person sees an object, what he sees is a 3D model, and he understands its topological relationships and geometric properties. So when a person encounters an object he has never seen before, he can still tell roughly what it is and infer how to use it. Understanding is what lets people identify objects very accurately; a machine without the ability to understand cannot do this.

The difference between machine and human visual systems

The human eye is fundamentally different from a camera. A very small region of the retina, called the "fovea", has a very high density of photoreceptor cells, while the rest of the retina has far fewer photoreceptors and sees only blurrily. But the eye can rotate. Under neural control, it nimbly tracks the parts of interest: lines, planes, solid structures... The human visual system can accurately understand the shape of an object and its topology, and all of this is in 3D. What the human brain sees is not pixels but a 3D topological model.

The eye does not observe in the order of a scan, recording every "pixel" line by line from top to bottom to form a 6000x4000-pixel image. Instead it focuses on what matters. It can follow a straight line, follow an arc, move in a circle, or jump from place to place. Through its own understanding, the brain controls the movement of the eye and directs it to the points that need attention. Because the center of the retina has extremely high resolution, the information the brain receives can be very precise. But because not every place is examined so carefully, the total amount of information the eye collects may be small, and the brain is spared a great deal of processing.
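
The difference in data volume can be sketched with a little arithmetic. A full scan of a 6000x4000 image reads 24,000,000 pixels, whereas a handful of small high-resolution patches reads far fewer. The helpers below (`fovea_patch`, `samples_read`) and the patch size are hypothetical, and the fixation points are supplied by hand. Choosing them through understanding, which is the part that matters here, is not modeled.

```python
def fovea_patch(image, cx, cy, half):
    """Return the square patch centered on (cx, cy), clipped to the image.

    `image` is a list of rows; `half` is the patch half-width, so an
    unclipped patch is (2 * half + 1) pixels on each side.
    """
    rows, cols = len(image), len(image[0])
    top, bottom = max(0, cy - half), min(rows, cy + half + 1)
    left, right = max(0, cx - half), min(cols, cx + half + 1)
    return [row[left:right] for row in image[top:bottom]]


def samples_read(image, fixations, half):
    """Total pixels read when looking only at the given (x, y) fixations.

    Overlapping patches are counted twice; this is only a rough tally.
    """
    total = 0
    for cx, cy in fixations:
        patch = fovea_patch(image, cx, cy, half)
        total += len(patch) * len(patch[0])
    return total


# A full line-by-line scan of a 6000x4000 image, for comparison.
FULL_SCAN_PIXELS = 6000 * 4000
```

With, say, 20 fixations and a 65x65 patch each, `samples_read` would tally at most 20 * 65 * 65 pixels, a small fraction of `FULL_SCAN_PIXELS`.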

The human visual system can understand the concepts of point, line and surface. It can tell whether an object's surface is continuous or has holes, whether it is concave or convex, and can distinguish inside from outside, far from near, up from down, left from right... A person can tell what material an object's surface is made of and how it would respond if touched by hand. He can imagine roughly what the back of the object looks like. He can rotate or deform a model of the object in his mind. If part of the object is missing, he can even guess what used to be there.

The human visual system is more interesting than a camera. Many people have seen "optical illusion" images, which reveal, from one angle, what the human visual system is doing behind the scenes. For example, the image below is a static picture, yet you will feel that there are many dark spots at the intersections of the white lines. If you look directly at a particular intersection, the dark spot there disappears. This is a classic illusion called the Hermann grid, which has been widely studied in neuroscience. I will come back to it later.

This one is also a static image, but you feel that it is rotating.

Here the two blocks are the same color, but the lower one looks lighter. If you cover the highlight in the middle with a finger, you will find that the upper and lower blocks are actually the same color.

Another similar illusion is the famous "Adelson checkerboard illusion". Squares A and B on the board in the figure are the same color, but you perceive A as black and B as white. If you do not believe it, you can use software to cut the two squares out of the picture and put them side by side to compare. If you are curious about why this happens, you can refer to this article.
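
That comparison can also be done numerically instead of by eye: average the pixels inside each square and compare the results. This is a minimal sketch assuming the image has already been loaded as rows of (R, G, B) tuples. The box coordinates, the tolerance and the helper names are hypothetical and would have to be chosen for the actual figure.

```python
def mean_color(image, box):
    """Average (R, G, B) over a box given as (left, top, right, bottom).

    `right` and `bottom` are exclusive, as in Python slices.
    """
    left, top, right, bottom = box
    pixels = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
    count = len(pixels)
    return tuple(sum(p[i] for p in pixels) / count for i in range(3))


def same_color(image, box_a, box_b, tolerance=2.0):
    """True if the two boxes have the same mean color within `tolerance`."""
    a = mean_color(image, box_a)
    b = mean_color(image, box_b)
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))
```

A camera-like comparison of this kind reports the raw pixel values only; it is unaffected by the surrounding context that drives the illusion in human vision.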

In the figure below, you will feel that you see a black triangle, but it does not actually exist.

The many optical illusions show that the human visual system is not a simple camera. It has special functions, and these functions and mechanisms are what lead to the illusions. This is what makes human vision different from that of machines: it lets people extract structural information about objects rather than just seeing pixels.

Extracting the topological features of objects is also why people can understand abstract paintings, cartoons and toys. No cat in the real world looks like this, yet a child who has never seen the cartoon "Tom and Jerry" still knows that this is a cat and a mouse with a house behind them. Try asking a deep learning model that was not trained on "Tom and Jerry" stills to recognize this picture.

With even more abstract toys, people can still tell which characters they are. The heads and limbs have all become blocks, yet they still feel very "lifelike". Don't you find that amazing?
