VQA, and why it's asking the wrong question
The two most interesting things about recent methods for Visual Question Answering are how clever they are, and how daft they are. These methods have made amazing progress on a task that tests general, high-level image understanding, and have demonstrated results that would have been thought impossible only a few years ago. They seem to have done so without actually achieving general, high-level image understanding, however. This talk will discuss some possible steps towards truly general high-level image understanding, how difficult it is, and some of the resources we might leverage.
Anton van den Hengel, Director, The Australian Centre for Visual Technologies; Professor, The University of Adelaide
VQA is a fantastic challenge
- It has pushed the field in really interesting directions
- For a long time you couldn't ask AI questions in CV papers
- An overreaction to the AI winter
The methods really are good
- They are answering questions we thought would be years away
- Generality / diversity of questions
- Generality of expression
- Loose relationship between training data and test questions
- Generalisation is the key to human intelligence
- It's a step towards AI
- It's an excuse to think about AI in CV
Q: Is this a fish or a bicycle? A: No
Q: What game are they playing? A: Soccer
Counting is a complicated idea
Q: How many horses are in the image? A: 2
Counting is harder than it looks
Q: How many horses are in the image? A1: 2 A2: 3
Counting is a complicated idea
Q: How many unicorns are in the image? A: 2
Q: How many unicorns are in the image?
Did this player win the point?
[Image, with VQA answers: yes; tennis court]
Where was this photo taken?
[Image, with VQA answers: No; Yes; Lake]
Who's winning?
[Image, with VQA answers: Yes; No; Skiing]
NLP QA tackles harder questions
TREC questions:
- What was the monetary value of the Nobel Peace Prize in 1989?
- What does the Peugeot company manufacture?
- How much did Mercury spend on advertising in 1993?
- What is the name of the managing director of Apricot Computer?
- Why did David Koresh ask the FBI for a word processor?
- From a large set of documents
- Still closed world
- Average performance is about 70%
NLP QA tackles harder questions
Watson won Jeopardy!
Q: William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
A: Bram Stoker
Who wrote a book about this guy?
[Image, with VQA answers: michael; bill; queen; big baby]
NLP QA is complex
How much can an RNN remember?
We're not covering the training data
Covering all of the training data
Visual Question Answering with Memory-Augmented Networks. Chao Ma, Chunhua Shen, Anthony Dick, Anton van den Hengel
Reasoning (over graphs)
Visual info is not enough
For example, to answer the question 'How many mammals are in the image?' for a photo containing a dog, a cat, and a bird, we need the common-sense knowledge that dogs and cats are mammals but birds are not. Another example is shown on the lower right, where the question is 'Why are they wearing such bright colours?'; the external knowledge required here is that bright-coloured jackets are worn for safety reasons.
Q: How many mammals are there in this image?
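A toy sketch of the point (the detections and the fact table are invented for illustration): the detector alone cannot produce the answer; the count only falls out once external knowledge is consulted.

```python
# Toy illustration only: detections and facts are invented for this example.
detections = ["dog", "cat", "bird"]                     # object detector output
is_mammal = {"dog": True, "cat": True, "bird": False}   # external knowledge

# "How many mammals are in the image?" needs both sources.
answer = sum(is_mammal[obj] for obj in detections)
print(answer)  # 2
```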
Bringing in other information
- Critical if the process is to be general
- It's not feasible to train an RNN to usefully encode all of the knowledge required to answer general questions
- But once you have another source of information, what can you do with it?
- This is the really interesting question!
Google it? Can we go to the web to get relevant information?
- It's what humans do
- Take the top 5 attributes and search (a minimal retrieval sketch follows below)
- Actually we search Wikipedia, not the whole net
- That gives us text which relates to the image content
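As a rough sketch of that retrieval step, here is one way to pull article extracts for the top attributes via the public MediaWiki API. The attribute list is hypothetical, and the actual pipeline may fetch and encode the text differently:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def wiki_intro(term):
    """Fetch the plain-text introduction of the Wikipedia article for `term`."""
    params = {
        "action": "query", "prop": "extracts", "titles": term,
        "exintro": 1, "explaintext": 1, "format": "json", "redirects": 1,
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

# Hypothetical top-5 attributes predicted from the image.
attributes = ["umbrella", "beach", "sunny", "sand", "people"]
external_text = "\n".join(wiki_intro(a) for a in attributes)
```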
Wikipedia and Visual Question Answering
[Slide figure: Image → CNN image features → attribute/label/location predictions → Wikipedia → LSTM language modeling]
Q: What kind of glasses are they drinking out of? A: Wine
What Value Do Explicit High Level Concepts Have in Vision to Language Problems? Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel, CVPR'16
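In the spirit of that pipeline, a minimal sketch of conditioning an answer-predicting LSTM on CNN attribute scores. All sizes and names are invented; this shows the general idea, not the CVPR'16 architecture:

```python
import torch
import torch.nn as nn

VOCAB, ATTRS, EMBED, HIDDEN = 10000, 256, 256, 512   # invented sizes

class AttributeLSTM(nn.Module):
    """Toy model: attribute scores seed the LSTM state, the question is
    read through it, and the final state scores the answer vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.attr_proj = nn.Linear(ATTRS, HIDDEN)  # attributes -> initial state
        self.lstm = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, attr_scores, question_ids):
        h0 = torch.tanh(self.attr_proj(attr_scores)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        states, _ = self.lstm(self.embed(question_ids), (h0, c0))
        return self.out(states[:, -1])             # logits over answers

model = AttributeLSTM()
logits = model(torch.rand(1, ATTRS), torch.randint(0, VOCAB, (1, 8)))
```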
Merging image and web information
Example
Attributes: umbrella, beach, sunny, day, people, sand, laying, blue, ...
Caption: A group of people enjoying a sunny day at the beach with umbrellas in the sand.
External Knowledge: An umbrella is a canopy designed to protect against rain or sunlight. Larger umbrellas are often used as points of shade on a sunny beach. A beach is a landform along the coast of an ocean. It usually consists of loose particles, such as sand...
Question Answering: Q: Why do they have umbrellas? A: Shade.
This is a real example produced by our ACK system. Given such an image, multiple attributes are first detected and used to produce an internal textual representation. External knowledge is mined from the knowledge base using the top-5 detected attributes as queries. All of this information is encoded as vector representations. Given the question 'Why do they have umbrellas?', we give the word 'shade' as the answer. The key piece of common-sense knowledge behind the answer is that 'larger umbrellas are often used as points of shade on a sunny beach.'
Explicit reasoning: separate the language from the reasoning
- Answers are much easier to believe if there's a reason given
- Explicit storage means less to store implicitly
- It's not feasible to store common sense implicitly
- And why train a NN to do something it's not good at?
- cf. Neural Turing Machines
Use a Knowledge Base
- Scraped or hand-crafted RDF tuples
- In a DBMS (a small sketch follows below)
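For concreteness, a small sketch of what "RDF tuples in a DBMS" looks like, using rdflib with an invented namespace and toy facts:

```python
from rdflib import Graph, Namespace

KB = Namespace("http://example.org/kb/")   # hypothetical namespace

g = Graph()            # in practice this would be backed by a DBMS store
g.add((KB.Dog, KB.isA, KB.Mammal))
g.add((KB.Cat, KB.isA, KB.Mammal))
g.add((KB.Bird, KB.isA, KB.Animal))

# Retrieve every subject stored as a mammal.
for s in g.subjects(KB.isA, KB.Mammal):
    print(s)   # http://example.org/kb/Dog, http://example.org/kb/Cat
```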
Reasoning in VQA
[Slide figure: DCNN models turn the input image into visual concepts (objects with bounding boxes, attributes such as "zoo", scenes), which are linked to DBpedia concepts (KB:Giraffe, KB:Zebra, KB:Zoo, KB:Human; categories such as Herbivorous animals, Animals, Zoology, Megafauna of Africa) in a linked knowledge graph; QUEPY parses the input question into database queries over that graph.]
Detected visual concepts — Objects: person, giraffe. Attributes: glass, house, room, standing, walking, wall, zoo. Scenes: museum, indoor.
Input question: What are the common properties between the animal in this image and the zebra?
Query: ?x: ((KB:Giraffe, subject/?broader, ?x) AND (KB:Zebra, subject/?broader, ?x))
A: Herbivorous animals, Animals, Megafauna of Africa
R: Giraffe and Zebra share the categories Herbivorous animals and Megafauna of Africa
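That query maps naturally onto SPARQL against the live DBpedia endpoint. A hedged sketch: the prefixes and endpoint are standard DBpedia conventions, but this is not necessarily the exact query the system generates (in particular, it checks only directly shared categories, without following broader-category links):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?x WHERE {
        dbr:Giraffe dct:subject ?x .   # categories of the detected animal
        dbr:Zebra   dct:subject ?x .   # ... that the zebra also belongs to
    }
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["x"]["value"])   # shared Wikipedia categories
```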
Traversing the Knowledge Base
Q: List close relatives of the animal. A: Donkey, horse, mule, asinus, hinny
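One plausible way to read "close relatives" off the graph is to walk from the detected animal out to its categories and back to their other members. A sketch under that assumption (not necessarily how the system selects or ranks relatives):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?relative WHERE {
        dbr:Zebra dct:subject ?cat .   # categories of the detected animal
        ?relative dct:subject ?cat .   # other members of those categories
        FILTER (?relative != dbr:Zebra)
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["relative"]["value"])
```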
We're good at Vision already
- You can't train an RNN to do all of Vision from QA examples
- And we've already solved a lot of Vision problems
- So learn to use existing algorithms
- cf. the Neural Turing Machine
The VQA-Machine
VQA has been great
- It's a great excuse to take real steps towards hard AI
- VQA 2.0 is an important step
- It helps isolate the language prior
- But we need to move on to harder questions
- Part of the problem is that the current questions have no purpose
- The Turker doesn't care about the answer
- This underlies the evaluation problem
- The NLP community have some good ideas