Transcription

Fundamentals of Deep Learning
Designing Next-Generation Machine Intelligence Algorithms
Nikhil Buduma
with contributions by Nicholas Locascio
Beijing   Boston   Farnham   Sebastopol   Tokyo

Fundamentals of Deep Learning
by Nikhil Buduma and Nicholas Locascio

Copyright © 2017 Nikhil Buduma. All rights reserved.
Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editors: Mike Loukides and Shannon Cutt
Production Editor: Shiny Kalapurakkel
Copyeditor: Sonia Saruba
Proofreader: Amanda Kersey
Indexer: Wendy Catalano
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

June 2017: First Edition

Revision History for the First Edition
2017-05-25: First Release

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Fundamentals of Deep Learning, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-92561-4
[TI]

Table of Contents

Preface

1. The Neural Network
   Building Intelligent Machines
   The Limits of Traditional Computer Programs
   The Mechanics of Machine Learning
   The Neuron
   Expressing Linear Perceptrons as Neurons
   Feed-Forward Neural Networks
   Linear Neurons and Their Limitations
   Sigmoid, Tanh, and ReLU Neurons
   Softmax Output Layers
   Looking Forward

2. Training Feed-Forward Neural Networks
   The Fast-Food Problem
   Gradient Descent
   The Delta Rule and Learning Rates
   Gradient Descent with Sigmoidal Neurons
   The Backpropagation Algorithm
   Stochastic and Minibatch Gradient Descent
   Test Sets, Validation Sets, and Overfitting
   Preventing Overfitting in Deep Neural Networks
   Summary

3. Implementing Neural Networks in TensorFlow
   What Is TensorFlow?
   How Does TensorFlow Compare to Alternatives?

   Installing TensorFlow
   Creating and Manipulating TensorFlow Variables
   TensorFlow Operations
   Placeholder Tensors
   Sessions in TensorFlow
   Navigating Variable Scopes and Sharing Variables
   Managing Models over the CPU and GPU
   Specifying the Logistic Regression Model in TensorFlow
   Logging and Training the Logistic Regression Model
   Leveraging TensorBoard to Visualize Computation Graphs and Learning
   Building a Multilayer Model for MNIST in TensorFlow
   Summary

4. Beyond Gradient Descent
   The Challenges with Gradient Descent
   Local Minima in the Error Surfaces of Deep Networks
   Model Identifiability
   How Pesky Are Spurious Local Minima in Deep Networks?
   Flat Regions in the Error Surface
   When the Gradient Points in the Wrong Direction
   Momentum-Based Optimization
   A Brief View of Second-Order Methods
   Learning Rate Adaptation
   AdaGrad—Accumulating Historical Gradients
   RMSProp—Exponentially Weighted Moving Average of Gradients
   Adam—Combining Momentum and RMSProp
   The Philosophy Behind Optimizer Selection
   Summary

5. Convolutional Neural Networks
   Neurons in Human Vision
   The Shortcomings of Feature Selection
   Vanilla Deep Neural Networks Don't Scale
   Filters and Feature Maps
   Full Description of the Convolutional Layer
   Max Pooling
   Full Architectural Description of Convolution Networks
   Closing the Loop on MNIST with Convolutional Networks
   Image Preprocessing Pipelines Enable More Robust Models
   Accelerating Training with Batch Normalization
   Building a Convolutional Network for CIFAR-10
   Visualizing Learning in Convolutional Networks

   Leveraging Convolutional Filters to Replicate Artistic Styles
   Learning Convolutional Filters for Other Problem Domains
   Summary

6. Embedding and Representation Learning
   Learning Lower-Dimensional Representations
   Principal Component Analysis
   Motivating the Autoencoder Architecture
   Implementing an Autoencoder in TensorFlow
   Denoising to Force Robust Representations
   Sparsity in Autoencoders
   When Context Is More Informative than the Input Vector
   The Word2Vec Framework
   Implementing the Skip-Gram […]

7. Models for Sequence Analysis
   Analyzing Variable-Length Inputs
   Tackling seq2seq with Neural N-Grams
   Implementing a Part-of-Speech Tagger
   Dependency Parsing and SyntaxNet
   Beam Search and Global Normalization
   A Case for Stateful Deep Learning Models
   Recurrent Neural Networks
   The Challenges with Vanishing Gradients
   Long Short-Term Memory (LSTM) Units
   TensorFlow Primitives for RNN Models
   Implementing a Sentiment Analysis Model
   Solving seq2seq Tasks with Recurrent Neural Networks
   Augmenting Recurrent Networks with Attention
   Dissecting a Neural Translation […]

8. Memory Augmented Neural Networks
   Neural Turing Machines
   Attention-Based Memory Access
   NTM Memory Addressing Mechanisms
   Differentiable Neural Computers
   Interference-Free Writing in DNCs
   DNC Memory Reuse
   Temporal Linking of DNC Writes
   Understanding the DNC Read Head

   The DNC Controller Network
   Visualizing the DNC in Action
   Implementing the DNC in TensorFlow
   Teaching a DNC to Read and Comprehend
   Summary

9. Deep Reinforcement Learning
   Deep Reinforcement Learning Masters Atari Games
   What Is Reinforcement Learning?
   Markov Decision Processes (MDP)
   Policy
   Future Return
   Discounted Future Return
   Explore Versus Exploit
   Policy Versus Value Learning
   Policy Learning via Policy Gradients
   Pole-Cart with Policy Gradients
   OpenAI Gym
   Creating an Agent
   Building the Model and Optimizer
   Sampling Actions
   Keeping Track of History
   Policy Gradient Main Function
   PGAgent Performance on Pole-Cart
   Q-Learning and Deep Q-Networks
   The Bellman Equation
   Issues with Value Iteration
   Approximating the Q-Function
   Deep Q-Network (DQN)
   Training DQN
   Learning Stability
   Target Q-Network
   Experience Replay
   From Q-Function to Policy
   DQN and the Markov Assumption
   DQN's Solution to the Markov Assumption
   Playing Breakout with DQN
   Building Our Architecture
   Stacking Frames
   Setting Up Training Operations
   Updating Our Target Q-Network
   Implementing Experience Replay

   DQN Main Loop
   DQNAgent Results on Breakout
   Improving and Moving Beyond DQN
   Deep Recurrent Q-Networks (DRQN)
   Asynchronous Advantage Actor-Critic Agent (A3C)
   UNsupervised REinforcement and Auxiliary Learning (UNREAL)
   Summary

Index
