Visual Turing Test

The Visual Turing Test is “an operator-assisted device that produces a stochastic sequence of binary questions from a given test image”.[1] The query engine produces a sequence of questions that have unpredictable answers given the history of questions. The test is only about vision and does not require any natural language processing. The job of the human operator is to provide the correct answer to the question or reject it as ambiguous. The query generator produces questions such that they follow a “natural story line”, similar to what humans do when they look at a picture.
History
Research in computer vision dates back to the 1960s, when Seymour Papert first attempted to solve the problem in what became known as the Summer Vision Project. The attempt was unsuccessful because computer vision is more complicated than it first appears, a complexity in line with that of the human visual system. Roughly 50% of the human brain is devoted to processing vision, which indicates that it is a difficult problem.
Later, there were attempts to solve the problem with models inspired by the human brain. The perceptron, a form of neural network introduced by Frank Rosenblatt, was one of the first such approaches. These simple neural networks did not live up to expectations and had limitations that led to their being set aside in subsequent research.
Later, with the availability of hardware and some processing power, research shifted to image processing, which involves pixel-level operations such as finding edges, de-noising images or applying filters. There was considerable progress in this field, but the core problem of vision, making machines understand images, was still not being addressed. During this time neural networks also resurfaced, as it was shown that the limitations of perceptrons can be overcome by multi-layer perceptrons. In the early 1990s convolutional neural networks also emerged; they showed strong results on digit recognition but did not scale well to harder problems.
The late 1990s and early 2000s saw the birth of modern computer vision. One reason was the availability of key feature extraction and representation algorithms. Features, along with the machine learning algorithms already available, were used to detect, localise and segment objects in images.
While these advancements were being made, the community felt the need for standardised datasets and evaluation metrics so that performance could be compared. This led to the emergence of challenges such as the Pascal VOC challenge and the ImageNet challenge. The availability of standard evaluation metrics and open challenges gave direction to the research, and better algorithms were introduced for specific tasks such as object detection and classification.
The Visual Turing Test aims to give a new direction to computer vision research, leading to systems that are one step closer to understanding images the way humans do.
Current evaluation practices
A large number of datasets have been annotated and generalised to benchmark the performance of different classes of algorithms on different vision tasks (e.g., object detection/recognition) in some image domain (e.g., scene images).
One of the most famous datasets in computer vision is ImageNet, which is used to assess object-level image classification. ImageNet is one of the largest annotated datasets available and has over one million images. Another important vision task is object detection and localisation, which refers to detecting an object instance in the image and either providing the bounding box coordinates around it or segmenting it. The most popular dataset for this task is the Pascal dataset. Similarly, there are datasets for other specific tasks, such as the H3D[2] dataset for human pose detection and the Core dataset for evaluating the quality of detected object attributes such as colour, orientation, and activity.
Having these standard datasets has helped the vision community to develop extremely well-performing algorithms for all these tasks. The next logical step is to create a larger task encompassing these smaller subtasks. Such a task would lead to systems that understand images, since understanding images inherently involves detecting objects, localising them and segmenting them.
Details
Unlike the Turing test, the Visual Turing Test (VTT) has a query engine which interrogates a computer vision system in the presence of a human co-ordinator.
The query engine generates a random sequence of binary questions specific to the test image, such that the answer to any question k is unpredictable given the true answers to the previous k − 1 questions (also known as the history of questions).
The test takes place in the presence of a human operator who serves two main purposes: removing ambiguous questions and providing the correct answers to unambiguous ones. Given an image, infinitely many binary questions can be asked, and many of them are bound to be ambiguous. If the query engine generates such a question, the human operator removes it and the query engine generates another question whose answer is unpredictable given the history of questions.
The aim of the Visual Turing Test is to evaluate a computer system's image understanding, and an important part of image understanding is the story line of the image. When humans look at an image, they do not think that there is a car ‘x’ pixels from the left and ‘y’ pixels from the top; instead they see a story. For example, they might think that there is a car parked on the road and a person is exiting the car and heading towards a building. The most important elements of the story line are the objects, so to extract any story line from an image the first and most important task is to instantiate the objects in it, and that is what the query engine does.
Query engine
The query engine is the core of the Visual Turing Test and comprises two main parts: vocabulary and questions.
Vocabulary
The vocabulary is a set of words that represent the elements of the images. Used with an appropriate grammar, it yields a set of questions. The grammar, defined in the next section, is chosen so that it leads to a space of binary questions.
The vocabulary consists of three components:
- Types of Objects
- Type-dependent attributes of objects
- Type-dependent relationships between two objects
For images of urban street scenes, the types of objects include people, vehicles and buildings. Attributes refer to the properties of these objects: for example, female, child, wearing a hat or carrying something for people, and moving, parked, stopped, one tire visible or two tires visible for vehicles. Relationships between each pair of object classes can be either “ordered” or “unordered”. Unordered relationships include talking and walking together; ordered relationships include taller, closer to the camera, occluding and being occluded.
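The three vocabulary components can be pictured as a small data structure. The following is only a sketch: the `Vocabulary` class, its field names and the `STREET_VOCABULARY` instance are hypothetical, and relationships are stored as flat sets rather than per pair of object types to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Vocabulary:
    """Hypothetical container for the three vocabulary components."""
    object_types: frozenset
    attributes: dict  # object type -> frozenset of attribute names
    ordered_relations: frozenset
    unordered_relations: frozenset

    def attributes_for(self, object_type):
        # Attributes are type-dependent, so they are looked up per type.
        if object_type not in self.object_types:
            raise KeyError(f"unknown object type: {object_type}")
        return self.attributes.get(object_type, frozenset())


# Example values taken from the urban street scene description above.
STREET_VOCABULARY = Vocabulary(
    object_types=frozenset({"person", "vehicle", "building"}),
    attributes={
        "person": frozenset(
            {"female", "child", "wearing a hat", "carrying something"}
        ),
        "vehicle": frozenset(
            {"moving", "parked", "stopped",
             "one tire visible", "two tires visible"}
        ),
    },
    ordered_relations=frozenset(
        {"taller", "closer to the camera", "occluding", "being occluded"}
    ),
    unordered_relations=frozenset({"talking", "walking together"}),
)
```

Keying attributes by object type makes the type dependence explicit: asking for the attributes of an unknown type raises an error instead of silently returning nothing.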

Additionally, all of this vocabulary is used in the context of rectangular image regions $w \in W$, which allow for the localisation of objects in the image. An extremely large number of such regions is possible, which complicates the problem, so this test uses only regions at specific scales: 1/16 the size of the image, 1/4 the size of the image, 1/2 the size of the image, or larger.
Questions
The question space is composed of four types of questions:
- Existence questions: The aim of the existence questions is to find new objects in the image that have not been uniquely identified previously. They are of the form:
 $Q_{exist}$ = 'Is there an instance of an object of type t with attributes A partially visible in region w that was not previously instantiated?'
- Uniqueness questions: A uniqueness question tries to uniquely identify an object in order to instantiate it.
 $Q_{uniq}$ = 'Is there a unique instance of an object of type t with attributes A partially visible in region w that was not previously instantiated?'
 The uniqueness questions along with the existence questions form the instantiation questions. As mentioned earlier, instantiating objects leads to other interesting questions and eventually a story line. Uniqueness questions follow existence questions, and a positive answer to a uniqueness question leads to the instantiation of an object.
- Attribute questions: An attribute question tries to find out more about an object once it has been instantiated. Such questions can ask about a single attribute, the conjunction of two attributes or the disjunction of two attributes.
 $Q_{att}(o_t)$ = {'Does object $o_t$ have attribute $a$?', 'Does object $o_t$ have attribute $a_1$ or attribute $a_2$?', 'Does object $o_t$ have attribute $a_1$ and attribute $a_2$?'}
- Relationship questions: Once multiple objects have been instantiated, a relationship question explores the relationship between pairs of objects.
 $Q_{rel}(o_t, o_{t'})$ = 'Does object $o_t$ have relationship $r$ with object $o_{t'}$?'
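The four question types are fixed templates filled in with a type, attributes, a region or object identifiers. A minimal sketch of such templates follows; the function names and the way attributes and regions are rendered as text are assumptions made for illustration only.

```python
def _describe(obj_type, attributes, region):
    # Shared phrase used by both kinds of instantiation question.
    attrs = ", ".join(sorted(attributes)) if attributes else "none specified"
    return (f"an object of type {obj_type} with attributes [{attrs}] "
            f"partially visible in region {region}")


def existence_question(obj_type, attributes, region):
    return (f"Is there an instance of {_describe(obj_type, attributes, region)} "
            "that was not previously instantiated?")


def uniqueness_question(obj_type, attributes, region):
    return (f"Is there a unique instance of "
            f"{_describe(obj_type, attributes, region)} "
            "that was not previously instantiated?")


def attribute_questions(obj, a1, a2):
    # Single attribute, disjunction and conjunction, in that order.
    return [
        f"Does object {obj} have attribute {a1}?",
        f"Does object {obj} have attribute {a1} or attribute {a2}?",
        f"Does object {obj} have attribute {a1} and attribute {a2}?",
    ]


def relationship_question(obj, relation, other):
    return f"Does object {obj} have relationship {relation} with object {other}?"
```

For example, `relationship_question("person_1", "taller", "person_2")` produces a question about two already instantiated people; the object identifiers here are placeholders.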
Implementation details
As mentioned before, the core of the Visual Turing Test is the query generator, which generates a sequence of binary questions such that the answer to any question k is unpredictable given the correct answers to the previous k − 1 questions. This is a recursive process: given a history of questions and their correct answers, the query generator either stops because there are no more unpredictable questions, or randomly selects an unpredictable question and adds it to the history.
The question space defined earlier implicitly imposes a constraint on the flow of the questions: attribute and relationship questions cannot precede the instantiation questions. Only when objects have been instantiated can they be queried about their attributes and their relations to other previously instantiated objects. Thus, given a history, the questions that can follow it are restricted, and this set of questions is referred to as the candidate questions.
The task is to choose an unpredictable question from these candidate questions such that it conforms to the question flow described in the next section. To do this, the unpredictability of every candidate question is computed.
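Let $H$ be the history of questions and $I$ the test image. Let $X_H$ be a binary random variable, where $X_H = 1$ if the history $H$ is valid for the image $I$ and $X_H = 0$ otherwise. Let $q$ be the proposed question, and $X_q$ be the answer to the question $q$.
Then, find the conditional probability of getting the answer $X_q$ to the question $q$ given the history $H$:
$$P_H(X_q = x) = P(X_q = x \mid X_H = 1), \quad x \in \{0, 1\}$$
Given this probability, the measure of unpredictability is given by:
$$\epsilon_H(q) = \left| P_H(X_q = 1) - \tfrac{1}{2} \right|$$
The closer $\epsilon_H(q)$ is to 0, the more unpredictable the question is. $\epsilon_H(q)$ is calculated for every question. The questions for which $\epsilon_H(q)$ falls below a small fixed threshold form the set of almost unpredictable questions, and the next question is picked at random from these.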
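The selection step can be sketched in a few lines. This sketch assumes that unpredictability is scored as the distance of the conditional probability of a positive answer from 0.5, which is a natural choice for a binary answer; the function names, the `candidates` mapping and the `threshold` parameter are hypothetical, and no particular threshold value is implied.

```python
import random


def unpredictability(p_yes):
    """Distance of the probability of a 'yes' answer from 0.5.

    A value of 0 means the answer is maximally unpredictable.
    """
    return abs(p_yes - 0.5)


def pick_next_question(candidates, threshold, rng=random):
    """Pick a random almost-unpredictable question, or None to stop.

    candidates: dict mapping each candidate question to the estimated
                probability of a positive answer given the history.
    threshold:  assumed cut-off on the unpredictability score.
    """
    pool = sorted(
        q for q, p in candidates.items() if unpredictability(p) < threshold
    )
    if not pool:
        return None  # no unpredictable question left: the sequence ends
    return rng.choice(pool)
```

With `candidates = {"q1": 0.95, "q2": 0.52, "q3": 0.05}` and an illustrative threshold of 0.1, only `"q2"` is close enough to 0.5 to be eligible, so it is the question returned.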
Question flow
As discussed in the previous section there is an implicit ordering in the question space, according to which the attribute questions come after the instantiation questions and the relationship questions come after the attribute questions, once multiple objects have been instantiated.
Therefore, the query engine follows a loop structure where it first instantiates an object with the existence and uniqueness questions, then queries about its attributes, and then the relationship questions are asked for that object with all the previously instantiated objects.
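The loop structure can be illustrated with a short generator. It is a simplified sketch that shows only the ordering of question kinds: it assumes every object is successfully instantiated and ignores answers and unpredictability. The names `question_flow` and `attributes_of` are hypothetical.

```python
def question_flow(objects, attributes_of):
    """Yield (kind, payload) steps in the order described above.

    objects:       object identifiers in the order they are instantiated.
    attributes_of: function returning the attributes to query for an object.
    """
    instantiated = []
    for obj in objects:
        # Instantiation: existence first, then uniqueness.
        yield ("existence", obj)
        yield ("uniqueness", obj)
        # Attributes of the newly instantiated object.
        for attribute in attributes_of(obj):
            yield ("attribute", (obj, attribute))
        # Relationships with every previously instantiated object.
        for previous in instantiated:
            yield ("relationship", (obj, previous))
        instantiated.append(obj)
```

The first object never produces relationship steps, because there is nothing instantiated yet to relate it to.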
Look-ahead search
It is clear that the interesting questions about the attributes and the relations come after the instantiation questions, and so the query generator aims at instantiating as many objects as possible.
Instantiation questions are composed of both the existence and the uniqueness questions, but it is the uniqueness questions that actually instantiate an object if they get a positive response. So if the query generator has to randomly pick an instantiation question, it prefers to pick an unpredictable uniqueness question if present. If such a question is not present, the query generator picks an existence question such that it will lead to a uniqueness question with a high probability in the future. Thus the query generator performs a look-ahead search in this case.
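The preference described above can be sketched as follows. The function name and the `lookahead_score` callable are hypothetical; taking the existence question with the highest score is just one way to read "leads to a uniqueness question with a high probability".

```python
import random


def pick_instantiation_question(uniqueness, existence, lookahead_score,
                                rng=random):
    """Choose the next instantiation question.

    uniqueness:      unpredictable uniqueness questions (may be empty).
    existence:       unpredictable existence questions (may be empty).
    lookahead_score: assumed estimate of how likely an existence question
                     is to lead to a uniqueness question later.
    """
    if uniqueness:
        # A uniqueness question can instantiate an object directly.
        return rng.choice(uniqueness)
    if existence:
        # Otherwise look ahead and favour the most promising existence question.
        return max(existence, key=lookahead_score)
    return None
```

Any unpredictable uniqueness question takes priority here, so the look-ahead is consulted only when none is available.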
Story line
The story line is integral to the ultimate aim of building systems that understand images the way humans do. Humans try to figure out a story line in the image they see, and the query generator achieves this through continuity in the question sequences.
This means that once an object has been instantiated, the query generator tries to explore it in more detail. Apart from finding its attributes and its relations to other objects, localisation is also an important step. Thus, as a next step, the query generator tries to localise the object within the region where it was first identified, restricting the set of instantiation questions to regions within the original region.
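Restricting instantiation questions to sub-regions amounts to a rectangle containment filter. The sketch below assumes regions are axis-aligned rectangles given as `(left, top, right, bottom)` pixel tuples; that representation and the function names are assumptions made for illustration.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies entirely within rectangle `outer`.

    Rectangles are (left, top, right, bottom) tuples in pixel coordinates.
    """
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def restrict_to_region(candidate_regions, original):
    # Keep only proper sub-regions of the region where the object was found.
    return [r for r in candidate_regions
            if r != original and contains(original, r)]
```

For an object first identified in `(0, 0, 100, 100)`, a candidate such as `(0, 0, 50, 50)` is kept, while `(50, 50, 150, 150)` is discarded because it extends beyond the original region.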
Simplicity preference
Simplicity preference states that the query generator should pick simpler questions over the more complicated ones. Simpler questions are the ones that have fewer attributes in them. So this gives an ordering to the questions based on the number of attributes, and the query generator prefers the simpler ones.
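Ordering by simplicity is a sort on the number of attributes a question mentions. The sketch below assumes each question is represented as a `(text, attributes)` pair, which is a hypothetical representation.

```python
def order_by_simplicity(questions):
    """Sort questions so that those with fewer attributes come first.

    questions: list of (text, attributes) pairs.
    The sort is stable, so equally simple questions keep their order.
    """
    return sorted(questions, key=lambda question: len(question[1]))
```

A question with no attributes therefore precedes one with a single attribute, which in turn precedes one with two.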
Estimating predictability
To select the next question in the sequence, the VTT has to estimate the predictability of every proposed question. This is done using an annotated training set of images. Each image is annotated with bounding boxes around the objects, which are labelled with their attributes, and pairs of objects are labelled with their relations.
Consider each question type separately:
- Instantiation questions: The conditional probability estimator for instantiation questions can be represented as:
 $$\hat{P}_H(X_q = 1) = \frac{\#\{n : X_H^{(n)} = 1,\ X_q^{(n)} = 1\}}{\#\{n : X_H^{(n)} = 1\}}$$
 where $n$ indexes the training images. The question is only considered if the denominator is at least 80 images. The condition $X_H^{(n)} = 1$ is very strict and may not hold for a large number of images, as every question in the history eliminates approximately half of the candidates (images in this case). As a result, the history is pruned and the questions which may not alter the conditional probability are eliminated. Having a shorter history allows a larger number of images to be considered for the probability estimation.
 The history pruning is done in two stages:
 - In the first stage, all the attribute and relationship questions are removed, under the assumption that the presence and instantiation of objects depend only on other objects and not on their attributes or relations. Also, all the existence questions referring to regions disjoint from the region $w$ referred to in the proposed question are dropped, on the assumption that the probability of the presence of an object at a location does not change with the presence or absence of objects at locations other than $w$. Finally, all the uniqueness questions with a negative response referring to regions disjoint from the region referred to in the proposed question are dropped; uniqueness questions with a positive response are kept, because dropping them could alter the response to future instantiation questions. The history of questions obtained after this first stage of pruning can be referred to as $H'_q$.
 - In the second stage, an image-by-image pruning is performed. Let $q'$ be a uniqueness question in $H$ that has not been pruned and is preserved in $H'_q$. If this question refers to a region which is disjoint from the region referenced in the proposed question, then the expected answer to this question is yes, because of the constraints in the first stage. But if the actual answer to this question for the training image is no, then that training image is not considered for the probability estimation, and the question $q'$ is also dropped. The final history of questions after this is $H''_{q,n}$, and the probability is given by:
 $$\hat{P}_H(X_q = 1) = \frac{\#\{n : X_{H''_{q,n}}^{(n)} = 1,\ X_q^{(n)} = 1\}}{\#\{n : X_{H''_{q,n}}^{(n)} = 1\}}$$
- Attribute questions: Unlike the estimator for instantiation questions, the probability estimator for attribute questions depends on the number of labelled objects rather than the number of images.
 Consider an attribute question of the form ‘Does object $o_t$ have attribute $a$?’, where $o_t$ is an object of type $t$ and $a$ is an attribute applicable to type $t$. Let $A$ be the set of attributes already known to belong to $o_t$ because of the history. Let $O$ be the set of all annotated objects (ground truth) in the training set, and for each $o \in O$, let $T(o)$ be the type of the object and $A(o)$ be the set of attributes belonging to $o$. Then the estimator is given by:
 $$\hat{P}_H(X_q = 1) = \frac{\#\{o \in O : T(o) = t,\ A \cup \{a\} \subseteq A(o)\}}{\#\{o \in O : T(o) = t,\ A \subseteq A(o)\}}$$
 This is the ratio of the number of times an object of type $t$ with attributes $A \cup \{a\}$ occurs in the training data to the number of times an object of type $t$ with attributes $A$ occurs in the training data. A large number of attributes in $A$ leads to a sparsity problem similar to that of the instantiation questions. To deal with it, the attributes are partitioned into subsets that are approximately independent conditioned on belonging to the object $o_t$. For example, for a person, attributes like crossing a street and standing still are not independent, but both are fairly independent of the sex of the person, whether the person is a child or an adult, and whether they are carrying something or not. These conditional independencies reduce the size of the set $A$, and thereby overcome the problem of sparsity.
- Relationship questions: The approach for relationship questions is the same as for attribute questions, except that the number of pairs of objects is considered instead of the number of objects. For the independence assumption, relationships that are independent of the attributes of the related objects and relationships that are independent of each other are used.
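Both estimators are empirical ratios over the annotated training set, and a minimal sketch makes the counting explicit. The function names and the input representations (parallel lists of booleans per training image, and `(type, attribute_set)` pairs per annotated object) are assumptions; history pruning and the independence partitioning are not implemented here.

```python
def estimate_instantiation(history_valid, answer_yes, min_images=80):
    """Fraction of history-consistent training images with a 'yes' answer.

    history_valid[n]: True if the history is valid for training image n.
    answer_yes[n]:    True if the proposed question is answered 'yes' for n.
    Returns None when fewer than min_images images support the history.
    """
    support = [yes for valid, yes in zip(history_valid, answer_yes) if valid]
    if len(support) < min_images:
        return None  # too little data: the question is not considered
    return sum(support) / len(support)


def estimate_attribute(objects, obj_type, known, attribute):
    """Ratio of objects of a type having known + attribute to those having known.

    objects: list of (type, attribute_set) ground-truth annotations.
    known:   attributes already established for the object by the history.
    """
    known = set(known)
    matching = [attrs for t, attrs in objects
                if t == obj_type and known <= attrs]
    if not matching:
        return None  # no annotated object matches the known attributes
    with_attribute = [attrs for attrs in matching if attribute in attrs]
    return len(with_attribute) / len(matching)
```

As a usage example, if the history is valid for 100 training images and the proposed question is answered positively in 40 of them, `estimate_instantiation` returns 0.4; with only 10 supporting images it returns `None`, reflecting the minimum of 80 images.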
Example
Detailed example sequences are available on the Visual Turing Test page of the Brown University Division of Applied Mathematics.[3]
Dataset
The images considered in the work of Geman et al.[1] are from the ‘Urban street scenes’ dataset,[1] which has scenes of streets from different cities across the world. This is why the types of objects are constrained to people and vehicles for this experiment.

Another dataset, introduced by the Max Planck Institute for Informatics, is DAQUAR,[4][5] which has real-world images of indoor scenes. However, its authors[4] propose a different version of the visual Turing test, which takes a holistic approach and expects the participating system to exhibit human-like common sense.

Conclusion
The work was published on March 9, 2015, in the journal Proceedings of the National Academy of Sciences, by researchers from Brown University and Johns Hopkins University. It evaluates how computer vision systems understand images as compared to humans. Currently the test is written and the interrogator is a machine, because an oral evaluation by a human interrogator would give humans an undue advantage of being subjective and would also require real-time answers.
The Visual Turing Test is expected to give a new direction to computer vision research. Companies like Google and Facebook are investing millions of dollars in computer vision research and are trying to build systems that closely resemble the human visual system. In 2015 Facebook announced its new platform M, which looks at an image and provides a description of it to help the visually impaired.[6] Such systems might be able to perform well on the VTT.
References
1. Geman, Donald; Geman, Stuart; Hallonquist, Neil; Younes, Laurent (2015-03-24). "Visual Turing test for computer vision systems". Proceedings of the National Academy of Sciences. 112 (12): 3618–3623. Bibcode:2015PNAS..112.3618G. doi:10.1073/pnas.1422953112. ISSN 0027-8424. PMC 4378453. PMID 25755262.
2. "H3D". www.eecs.berkeley.edu. Retrieved 2015-11-19.
3. "Visual Turing Test | Division of Applied Mathematics". www.brown.edu. Retrieved 2015-11-19.
4. "Max-Planck-Institut für Informatik: Visual Turing Challenge". www.mpi-inf.mpg.de. Retrieved 2015-11-19.
5. Malinowski, Mateusz; Fritz, Mario (2014-10-29). "Towards a Visual Turing Challenge". arXiv:1410.8027 [cs.AI].
6. Metz, Cade (2015-10-27). "Facebook's AI Can Caption Photos for the Blind on Its Own". WIRED. Retrieved 2015-11-19.