Deep Learning Hongfei Yan March 4, 2016


Author: Luke Warren


1 Deep Learning Hongfei Yan March 4, 2016

2 Ian Goodfellow (http://research.google.com/pubs/IanGoodfellow.html). Ian Goodfellow is a Senior Research Scientist on the Google Brain team. He studies new methods for improving neural networks. Research areas: Machine Intelligence, Machine Perception.
Machine Intelligence: Research at Google is at the forefront of innovation in Machine Intelligence, with active research exploring virtually all aspects of machine learning, including deep learning and more classical algorithms. Exploring theory as well as application, much of our work on language, speech, translation, visual processing, ranking and prediction relies on Machine Intelligence. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, applying learning algorithms to understand and generalize. Machine Intelligence at Google raises deep scientific and engineering challenges, allowing us to contribute to the broader academic research community through technical talks and publications in major conferences and journals. Contrary to much of current theory and practice, the statistics of the data we observe shift rapidly, the features of interest change as well, and the volume of data often requires enormous computation capacity. When learning systems are placed at the core of interactive services in a fast-changing and sometimes adversarial environment, techniques including deep learning and statistical models need to be combined with ideas from control and game theory.
Machine Perception: Research in machine perception tackles the hard problems of understanding images, sounds, music and video. In recent years, our computers have become much better at such tasks, enabling a variety of new applications such as content-based search in Google Photos and Image Search, natural handwriting interfaces for Android, optical character recognition for Google Drive documents, and recommendation systems that understand music and YouTube videos. Our approach is driven by algorithms that benefit from processing very large, partially labeled datasets using parallel computing clusters. A good example is our recent work on object recognition using a novel deep convolutional neural network architecture known as Inception, which achieves state-of-the-art results on academic benchmarks and allows users to easily search through their large collection of Google Photos. The ability to mine meaningful information from multimedia is broadly applied throughout Google.

3 Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy/yoshua_en/index). Yoshua Bengio is a French-born Canadian computer scientist. Department of Computer Science and Operations Research; Canada Research Chair in Statistical Learning Algorithms. Born: March 5, 1964 (age 51), Paris, France.

4 Aaron Courville (https://aaroncourville.wordpress.com/). Aaron Courville is an Assistant Professor in the Department of Computer Science and Operations Research (DIRO) at the University of Montreal, and a member of the LISA lab (LISA: Laboratoire d'Informatique des Systèmes Adaptatifs).

5 Contents
1 Introduction
Part I: Applied Math and Machine Learning Basics
2 Linear Algebra
3 Probability and Information Theory
4 Numerical Computation
5 Machine Learning Basics
Part II: Deep Networks: Modern Practices
6 Deep Feedforward Networks
7 Regularization for Deep Learning
8 Optimization for Training Deep Models
9 Convolutional Networks
10 Sequence Modeling: Recurrent and Recursive Nets
11 Practical Methodology
12 Applications
Part III: Deep Learning Research
13 Linear Factor Models
14 Autoencoders
15 Representation Learning
16 Structured Probabilistic Models for Deep Learning
17 Monte Carlo Methods
18 Confronting the Partition Function
19 Approximate Inference
20 Deep Generative Models

6 1.1 Who should read this book? One of these target audiences is university students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and AI research. The other target audience is software engineers who do not have a machine learning or statistics background, but want to rapidly acquire one and begin using deep learning in their product or platform. Deep learning has already proven useful in many software disciplines including computer vision, speech and audio processing, natural language processing, robotics, bioinformatics and chemistry, video games, search engines, online advertising and finance.

7 This book is organized into three parts. Part I introduces basic mathematical tools and machine learning concepts. Part II describes the most established deep learning algorithms that are essentially solved technologies. Part III describes more speculative ideas that are widely believed to be important for future research in deep learning. Prerequisites: familiarity with programming, a basic understanding of computational performance issues, complexity theory, introductory-level calculus and some of the terminology of graph theory.

8 Figure 1.6: The high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter.

9 Chapter 1: Introduction. Inventors have long dreamed of creating machines that think. For example, ancient Greek myths tell of intelligent objects, such as animated statues of human beings and tables that arrive full of food and drink when called. We look to intelligent software to automate routine labor, understand speech or images, make diagnoses in medicine and support basic scientific research. In the early days of AI, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers: problems that can be described by a list of formal, mathematical rules. The true challenge to AI proved to be solving the tasks that are easy for people to perform but hard for people to describe formally: problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images. When programmable computers were first conceived, people wondered whether they might become intelligent, over a hundred years before one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics.

10 This book is about a solution to intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning.

11 Many of the early successes of AI took place in relatively sterile and formal environments and did not require computers to have much knowledge about the world. For example, IBM's Deep Blue chess-playing system defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed ways. Devising a successful chess strategy is a tremendous accomplishment, but the challenge is not due to the difficulty of describing the set of chess pieces and allowable moves to the computer. Chess can be completely described by a very brief list of completely formal rules, easily provided ahead of time by the programmer.

12 Abstract and formal tasks that are among the most difficult mental undertakings for a human being are among the easiest for a computer. Computers have long been able to defeat even the best human chess player, but are only recently matching some of the abilities of average human beings to recognize objects or speech. A person's everyday life requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. Computers need to capture this same knowledge in order to behave in an intelligent way. One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer.

13 Knowledge base approach to AI. Several AI projects have sought to hard-code knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. None of these projects has led to a major success. One of the most famous such projects is Cyc (Lenat and Guha, 1989). Cyc is an inference engine and a database of statements in a language called CycL. These statements are entered by a staff of human supervisors. It is an unwieldy process. People struggle to devise formal rules with enough complexity to accurately describe the world. For example, Cyc failed to understand a story about a person named Fred shaving in the morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity "FredWhileShaving" contained electrical parts. It therefore asked whether Fred was still a person while he was shaving.

14 Machine learning. The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The introduction of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. A simple machine learning algorithm called logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990). A simple machine learning algorithm called naive Bayes can separate legitimate e-mail from spam e-mail. Vocabulary: cesarean delivery = delivery by abdominal surgery (C-section).

15 The performance of these simple machine learning algorithms depends heavily on the representation of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a feature. Logistic regression learns how each of these features of the patient correlates with various outcomes. However, it cannot influence the way that the features are defined in any way. If logistic regression were given an MRI scan of the patient, rather than the doctor's formalized report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation with any complications that might occur during delivery. Vocabulary: uterine scar = scar on the uterus; MRI = magnetic resonance imaging; negligible = ignorable, insignificant, of no consequence.
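To make the idea of "learning how each feature correlates with an outcome" concrete, here is a minimal sketch of logistic regression trained by stochastic gradient descent. The data are hypothetical toy values with two made-up binary features (`feature_a`, `feature_b`); they are not medical data and are not taken from the cited study. The function names are illustrative assumptions as well.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Predicted probability of the positive outcome for feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Fit one weight per feature with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Hypothetical toy data: each row is [feature_a, feature_b].
# The outcome happens to follow feature_a; feature_b is uninformative.
TOY_FEATURES = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]]
TOY_LABELS = [1, 1, 0, 0, 1, 0]

weights, bias = train_logistic_regression(TOY_FEATURES, TOY_LABELS)
print("learned weights:", weights, "bias:", bias)
```

The model can only weight the features it is handed. It has no way to redefine them, which is the limitation the slide describes.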

16 The choice of representation has an enormous eﬀect on the performance of machine learning algorithms. This dependence on representations is a general phenomenon that appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but ﬁnd arithmetic on Roman numerals much more time-consuming.
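As a rough illustration of how structure changes the cost of search, the sketch below counts comparisons for a linear scan versus a binary search over the same sorted list. The function names and the list size are assumptions chosen for the example.

```python
def linear_search_steps(items, target):
    """Count the elements inspected by a front-to-back scan."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps

def binary_search_steps(sorted_items, target):
    """Count the probes made by binary search on a sorted list."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

data = list(range(1_000_000))  # already sorted, so it doubles as an index
print(linear_search_steps(data, 999_999))  # one step per element scanned
print(binary_search_steps(data, 999_999))  # roughly log2(n) steps
```

The data are identical in both cases. Only the assumption that they are sorted, and the algorithm that exploits it, differs.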

17 For a simple visual example
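One way to make the effect of representation concrete, offered as an illustrative sketch and not as a reproduction of the slide's figure: two classes of points arranged on concentric rings cannot be split by a threshold on the Cartesian x coordinate, but a single threshold on the radius separates them once the points are expressed in polar coordinates. The ring radii, point counts and helper names are assumptions.

```python
import math

def make_ring(radius, n_points):
    """Evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n_points),
             radius * math.sin(2 * math.pi * k / n_points))
            for k in range(n_points)]

def to_polar(point):
    """Convert (x, y) to (r, theta)."""
    x, y = point
    return (math.hypot(x, y), math.atan2(y, x))

def threshold_accuracy(values, labels, threshold):
    """Accuracy of the rule 'predict class 1 when value > threshold'."""
    correct = sum((v > threshold) == bool(label) for v, label in zip(values, labels))
    return correct / len(values)

points = make_ring(1.0, 8) + make_ring(3.0, 8)  # inner ring = class 0, outer ring = class 1
labels = [0] * 8 + [1] * 8

x_accuracy = threshold_accuracy([p[0] for p in points], labels, 2.0)
r_accuracy = threshold_accuracy([to_polar(p)[0] for p in points], labels, 2.0)
print("threshold on x:", x_accuracy)
print("threshold on r:", r_accuracy)
```

The classifier is the same trivial threshold rule in both cases. Only the representation of the input changes.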

18 Many AI tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker's vocal tract. It gives a strong clue as to whether the speaker is a man, woman, or child. Vocabulary: vocal tract (literally "sound passage").

19 For many tasks, it is diﬃcult to know what features should be extractedFor example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is diﬃcult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring oﬀ the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.

20 Representation learning. One solution to this problem is to use machine learning to discover not only the mapping from representation to output but also the representation itself. This approach is known as representation learning. Learned representations often result in much better performance than can be obtained with hand-designed representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of features for a simple task in minutes, or for a complex task in hours to months. Manually designing features for a complex task requires a great deal of human time and effort; it can take decades for an entire community of researchers.

21 The autoencoder. One example of a representation learning algorithm is the autoencoder. An autoencoder is the combination of an encoder function that converts the input data into a different representation, and a decoder function that converts the new representation back into the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties.
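The encoder/decoder pairing can be sketched with a deliberately tiny model. The example below is an assumption-laden toy: a linear autoencoder with a single tied weight vector `w`, compressing 2-D points to a 1-D code and back, trained by gradient descent on squared reconstruction error. The data, learning rate and function names are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def encode(w, x):
    """Encoder: project the 2-D input onto w to get a 1-D code."""
    return dot(w, x)

def decode(w, code):
    """Decoder: map the 1-D code back to 2-D using the same (tied) weights."""
    return [wi * code for wi in w]

def reconstruction_loss(w, data):
    """Sum of squared differences between inputs and their reconstructions."""
    total = 0.0
    for x in data:
        x_hat = decode(w, encode(w, x))
        total += sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return total

def train(w, data, lr=0.05, epochs=500):
    """Per-example gradient descent on the squared reconstruction error."""
    w = list(w)
    for _ in range(epochs):
        for x in data:
            code = encode(w, x)
            err = [xi - xh for xi, xh in zip(x, decode(w, code))]
            w_dot_err = dot(w, err)
            # Step along the negative gradient of ||x - w (w . x)||^2 with respect to w.
            w = [wi + 2 * lr * (code * ei + w_dot_err * xi)
                 for wi, ei, xi in zip(w, err, x)]
    return w

# Hypothetical data lying on a line through the origin with direction (0.6, 0.8).
DATA = [[0.6 * t, 0.8 * t] for t in (-1.0, -0.5, 0.5, 1.0)]

initial_w = [0.5, 0.1]
initial_loss = reconstruction_loss(initial_w, DATA)
trained_w = train(initial_w, DATA)
final_loss = reconstruction_loss(trained_w, DATA)
print("loss before:", initial_loss, "loss after:", final_loss)
```

Because the toy data vary along only one direction, a one-number code is enough to reconstruct them. Real autoencoders use nonlinear encoders and decoders and add further constraints on the code.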

22 Factors of variation (1/4). When designing features or algorithms for learning features, our goal is usually to separate the factors of variation that explain the observed data. In this context, we use the word "factors" simply to refer to separate sources of influence; the factors are usually not combined by multiplication. Such factors are often not quantities that are directly observed. Instead, they may exist either as unobserved objects or unobserved forces in the physical world that affect observable quantities. They may also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data. When analyzing a speech recording, the factors of variation include the speaker's age, their sex, their accent and the words that they are speaking. When analyzing an image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun.

23 Factors of variation (2/4). A major source of difficulty in many real-world AI applications is that many of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car's silhouette depends on the viewing angle. Most applications require us to disentangle the factors of variation and discard the ones that we do not care about. Vocabulary: disentangle = to untie a knot, straighten out, set free, extricate; silhouette = outline.

24 Factors of variation (3/4). Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker's accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.

25 Factors of variation (4/4). Deep learning (DL) solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. DL allows the computer to build complex concepts out of simpler concepts. Fig. 1.2 shows how a DL system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.

26 Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables that we are able to observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers are called "hidden" because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer's description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer's description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).

27 The quintessential example of a deep learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input. Vocabulary: perceptron = a pattern-recognition machine that simulates the human optic-nerve control system.
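The "function formed by composing simpler functions" view can be sketched directly. In the example below, the layer sizes, the hand-picked weights and the helper names (`affine`, `relu`, `compose`) are illustrative assumptions. A trained network would learn its weights from data instead of having them written in by hand.

```python
def affine(weights, biases):
    """Return a layer computing W x + b for a list-of-rows weight matrix."""
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, biases)]
    return layer

def relu(x):
    """Elementwise rectified linear unit."""
    return [max(0.0, v) for v in x]

def compose(*functions):
    """Chain functions left to right: compose(f, g)(x) == g(f(x))."""
    def composed(x):
        for f in functions:
            x = f(x)
        return x
    return composed

# Hand-picked weights: the hidden layer computes (x1 - x2) and (x2 - x1),
# and the output layer adds the rectified results, giving |x1 - x2|.
mlp = compose(
    affine([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden representation
    relu,
    affine([[1.0, 1.0]], [0.0]),                     # output
)

print(mlp([3.0, 1.0]))
print(mlp([1.0, 3.0]))
```

Each intermediate value of `x` inside `composed` is a new representation of the original input, which is the reading the slide suggests.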

28 The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer’s memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions oﬀer great power because later instructions can refer back to the results of earlier instructions. According to this view of deep learning, not all of the information in a layer’s activations necessarily encodes factors of variation that explain the input. The representation also stores state information that helps to execute a program that can make sense of the input. This state information could be analogous to a counter or pointer in a traditional computer program. It has nothing to do with the content of the input speciﬁcally, but it helps the model to organize its processing.

29 There are two main ways of measuring the depth of a model (1/3). The first view is based on the number of sequential instructions that must be executed to evaluate the architecture. We can think of this as the length of the longest path through a flowchart that describes how to compute each of the model's outputs given its inputs. Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart. Fig. 1.3 illustrates how this choice of language can give two different measurements for the same architecture.

30 Figure 1.3: Illustration of computational graphs mapping an input to an output where each node performs an operation. Depth is the length of the longest path from input to output but depends on the definition of what constitutes a possible computational step. The computation depicted in these graphs is the output of a logistic regression model, σ(w^T x), where σ is the logistic sigmoid function. If we use addition, multiplication and logistic sigmoids as the elements of our computer language, then this model has depth three. If we view logistic regression as an element itself, then this model has depth one.
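The two depth measurements in this caption can be checked with a small longest-path computation. The graph encodings below are assumptions made for illustration: a two-input logistic regression drawn once with multiplication, addition and sigmoid as separate nodes, and once as a single node.

```python
from functools import lru_cache

def graph_depth(graph, output):
    """Length of the longest path from any input to `output`.

    `graph` maps each operation node to the nodes it reads from.
    Nodes that do not appear as keys are treated as inputs with depth 0.
    """
    @lru_cache(maxsize=None)
    def depth(node):
        parents = graph.get(node, [])
        if not parents:
            return 0
        return 1 + max(depth(parent) for parent in parents)
    return depth(output)

# Language 1: multiplication, addition and sigmoid are the elementary steps.
fine_grained = {
    "w1*x1": ["w1", "x1"],
    "w2*x2": ["w2", "x2"],
    "sum": ["w1*x1", "w2*x2"],
    "sigmoid": ["sum"],
}

# Language 2: logistic regression itself is one elementary step.
coarse = {"logistic_regression": ["w", "x"]}

print(graph_depth(fine_grained, "sigmoid"))
print(graph_depth(coarse, "logistic_regression"))
```

The function being computed is the same in both encodings. Only the vocabulary of allowed steps differs, which is why depth has no single canonical value.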

31 There are two main ways of measuring the depth of a model (2/3). Another approach, used by deep probabilistic models, regards the depth of a model as being not the depth of the computational graph but the depth of the graph describing how concepts are related to each other. In this case, the depth of the flowchart of the computations needed to compute the representation of each concept may be much deeper than the graph of the concepts themselves. This is because the system's understanding of the simpler concepts can be refined given information about the more complex concepts. For example, an AI system observing an image of a face with one eye in shadow may initially only see one eye. After detecting that a face is present, it can then infer that a second eye is probably present as well. In this case, the graph of concepts only includes two layers (a layer for eyes and a layer for faces), but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times.

32 There are two main ways of measuring the depth of a model (3/3). Because it is not always clear which of these two views (the depth of the computational graph, or the depth of the probabilistic modeling graph) is most relevant, and because different people choose different sets of smallest elements from which to construct their graphs, there is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as "deep." DL can safely be regarded as the study of models that involve a greater amount of composition of either learned functions or learned concepts than traditional machine learning does.

33 To summarize, deep learning, the subject of this book, is an approach to AI. Specifically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. In the authors' view, machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. DL is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. Vocabulary: viable = feasible; capable of surviving; able to grow and develop on its own; likely to be achieved.

34 Fig. 1.4 illustrates the relationship between these different AI disciplines. A Venn diagram showing how DL is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology.

35 Fig. 1.5 gives a high-level schematic of how each works. Flowcharts showing how the different parts of an AI system relate to each other within different AI disciplines. Shaded boxes indicate components that are able to learn from data.

36 1.2 Historical trends in deep learning. It is easiest to understand deep learning with some historical context. Rather than providing a detailed history of deep learning, we identify a few key trends: Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity. Deep learning has become more useful as the amount of available training data has increased. Deep learning models have grown in size over time as computer hardware and software infrastructure for deep learning has improved. Deep learning has solved increasingly complicated applications with increasing accuracy over time. Vocabulary: waxed and waned = grew and shrank by turns.

37 1.2.1 The Many Names and Changing Fortunes of Neural Networks. Deep learning dates back to the 1940s. DL only appears to be new because it was relatively unpopular for several years preceding its current popularity, and because it has gone through many different names and has only recently become called "deep learning." The field has been rebranded many times, reflecting the influence of different researchers and different perspectives. A comprehensive history of deep learning is beyond the scope of this textbook. However, some basic context is useful for understanding deep learning. Broadly speaking, there have been three waves of development of DL: DL known as cybernetics in the 1940s–1960s, DL known as connectionism in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006.

38 The figure shows two of the three historical waves of artificial neural nets research. The first wave started with cybernetics in the 1940s–1960s, with the development of theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the training of a single neuron. The second wave started with the connectionist approach of the 1980–1995 period, with back-propagation (Rumelhart et al., 1986a) to train a neural network with one or two hidden layers. The current and third wave, deep learning, which started around 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a) and allowed us to train very deep networks, is not shown for lack of books on the subject. The figure shows on the vertical axis the Google Books n-gram relative counts for "cybernetics" (solid curve) and "connectionism" or "neural networks" (dashed curve), for years (horizontal axis) between 1940 and 2004. Keep in mind that scientific advances tend to happen years before they are widely discussed in books, hence these curves incorporate a lag of several years compared to the scientific activity itself.

39 1.2.2 Increasing Dataset Size. The age of "Big Data" has made machine learning much easier because the key burden of statistical estimation (generalizing well to new data after observing only a small amount of data) has been considerably lightened. As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples. Working successfully with datasets smaller than this is an important research area, focusing in particular on how we can take advantage of large quantities of unlabeled examples, with unsupervised or semi-supervised learning.

40 Figure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statisticians studied datasets using hundreds or thousands of manually compiled measurements (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the pioneers of biologically inspired machine learning often worked with small, synthetic datasets, such as low-resolution bitmaps of letters, that were designed to incur low computational cost and demonstrate that neural networks were able to learn specific kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s, machine learning became more statistical in nature and began to leverage larger datasets containing tens of thousands of examples such as the MNIST dataset (shown in Fig. 1.9) of scans of handwritten numbers (LeCun et al., 1998c). In the first decade of the 2000s, more sophisticated datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009) continued to be produced. Toward the end of that decade and throughout the first half of the 2010s, significantly larger datasets, containing hundreds of thousands to tens of millions of examples, completely changed what was possible with deep learning. These datasets included the public Street View House Numbers dataset (Netzer et al., 2011), various versions of the ImageNet dataset (Deng et al., 2009, 2010a; Russakovsky et al., 2014a), and the Sports-1M dataset (Karpathy et al., 2014). At the top of the graph, we see that datasets of translated sentences, such as IBM's dataset constructed from the Canadian Hansard (Brown et al., 1990) and the WMT 2014 English to French dataset (Schwenk, 2014) are typically far ahead of other dataset sizes.

41 Figure 1.9: Example inputs from the MNIST dataset. The "NIST" stands for National Institute of Standards and Technology, the agency that originally collected this data. The "M" stands for "modified," since the data has been preprocessed for easier use with machine learning algorithms. The MNIST dataset consists of scans of handwritten digits and associated labels describing which digit 0-9 is contained in each image. This simple classification problem is one of the simplest and most widely used tests in deep learning research. It remains popular despite being quite easy for modern techniques to solve. Geoffrey Hinton has described it as "the drosophila of machine learning," meaning that it allows machine learning researchers to study their algorithms in controlled laboratory conditions, much as biologists often study fruit flies. Vocabulary: drosophila = fruit fly (used in genetics experiments because of its short life span and high fertility).

42 1.2.3 Increasing Model Sizes. One of the main insights of connectionism is that animals become intelligent when many of their neurons work together. An individual neuron or small collection of neurons is not particularly useful. Larger networks are able to achieve higher accuracy on more complex tasks. This trend looks set to continue for decades. Unless new technologies allow faster scaling, artificial neural networks will not have the same number of neurons as the human brain until at least the 2050s. The increase in model size over time, due to the availability of faster CPUs, the advent of general purpose GPUs, faster network connectivity and better software infrastructure for distributed computing, is one of the most important trends in the history of deep learning.

43 Figure 1.10: Initially, the number of connections between neurons in artificial neural networks was limited by hardware capabilities. Today, the number of connections between neurons is mostly a design consideration. Some artificial neural networks have nearly as many connections per neuron as a cat, and it is quite common for other neural networks to have as many connections per neuron as smaller mammals like mice. Even the human brain does not have an exorbitant amount of connections per neuron. Biological neural network sizes from Wikipedia (2015). Vocabulary: exorbitant = excessive, extremely high.

44 Figure 1.11: Since the introduction of hidden units, artiﬁcial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015).
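The doubling claim in this caption corresponds to simple exponential growth, size(t) = size(0) · 2^(t / 2.4). The sketch below is only a worked-arithmetic aid. The starting sizes are hypothetical placeholders, not figures from the book, and the function names are assumptions.

```python
import math

DOUBLING_PERIOD_YEARS = 2.4  # the rough doubling period quoted in the caption

def projected_size(initial_size, years_elapsed, doubling_period=DOUBLING_PERIOD_YEARS):
    """Network size after `years_elapsed` years of steady doubling."""
    return initial_size * 2 ** (years_elapsed / doubling_period)

def years_to_reach(initial_size, target_size, doubling_period=DOUBLING_PERIOD_YEARS):
    """Years of steady doubling needed to grow from initial_size to target_size."""
    return doubling_period * math.log2(target_size / initial_size)

# Hypothetical placeholder numbers, chosen only to exercise the formula.
print(projected_size(1_000, 2.4))  # one doubling period
print(years_to_reach(1, 1024))     # ten doublings
```

A projection like this assumes the historical rate simply continues, which is the same caveat the slides attach to their own extrapolation ("unless new technologies allow faster scaling").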

45 1.2.4 Increasing Accuracy, Complexity and Real-World Impact. Since the 1980s, deep learning has consistently improved in its ability to provide accurate recognition or prediction. Moreover, deep learning has consistently been applied with success to broader and broader sets of applications.

46 Figure 1.12: Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).

47 In summary, deep learning is an approach to machine learning that has drawn heavily on our knowledge of the human brain, statistics and applied math as it developed over the past several decades. In recent years, it has seen tremendous growth in its popularity and usefulness, due in large part to more powerful computers, larger datasets and techniques to train deeper networks. The years ahead are full of challenges and opportunities to improve deep learning even further and bring it to new frontiers.

48 Notation. The book describes most of the mathematical ideas behind its notation in chapters 2-4.

