Transcription

Fundamentals of Deep Learning
Raymond Ptucha, Rochester Institute of Technology
NVIDIA, Deep Learning Institute
Electronic Imaging 2020: SC20
January 28, 2020, 8:30am-12:45pm

Fair Use Agreement
This agreement covers the use of all slides in this document, please read carefully. You may freely use these slides, if:
– You send me an email telling me the conference/venue/company name in advance, and which slides you wish to use.
– You receive a positive confirmation email back from me.
– My name (R. Ptucha) appears on each slide you use.
(c) Raymond Ptucha, [email protected]

Agenda

Part I - Intuition and Theory
– 8:35-9:15am: Introduction
– 9:15-10:00am: Convolutional Neural Networks
– 10:00-10:40am: Recurrent Neural Networks
– 10:40-11:00am: Break

Part II - Hands on
– 11:00am-12:45pm: Hands-on exercises

Two Most Important Deep Learning Fields
– Convolutional Neural Networks (CNN): examine high dimensional input, learn features and classifier simultaneously
– Recurrent Neural Networks (RNN): learn temporal signals, remember both short and long sequences

Fully Connected Layers?
– 200×200 pixel image → 40K inputs.
– 40K inputs fully connected to a 40K hidden (or output) layer → 1.6 billion weights!
– We generally don't have enough training samples to learn that many weights.
(Ranzato CVPR'14)

Convolution Filter
Convolution filters apply a transform to an image. The above filter detects vertical edges. (Ranzato CVPR'14)

Locally Connected Layer
– 200×200 pixel image → 40K inputs.
– Four 10×10 filters, each fully connected: 40K × 10×10 × 4 = 16M weights ...getting better!
(Ranzato CVPR'14)

Locally Connected Layer
– 200×200 pixel image → 40K inputs.
– Four 10×10 filters, each fully connected: 40K × 10×10 × 4 = 16M weights ...getting better!
– Can we formulate so each filter has similar statistics across all locations?
(Ranzato CVPR'14)

Convolution Layer
– 200×200 pixel image → 40K inputs.
– Require each filter to have the same statistics across all locations. Learn the filters.
(Ranzato CVPR'14)

Convolution Layer
– 200×200 pixel image → 40K inputs.
– Require each filter to have the same statistics across all locations. Learn the filters.
– To learn four 10×10 filters we have 4 × 10×10 = 400 parameters- great! (The full weight-count progression is tallied in the sketch after the list below.)
(Ranzato CVPR'14)

Many Flavors of CNNs
– LeNet-5, LeCun 1989
– AlexNet, Krizhevsky 2012
– VGGNet, Simonyan 2014
– GoogLeNet (Inception), Szegedy 2014
– ResNet, He 2015
– DenseNet, Huang 2017
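The weight counts from the last few slides (fully connected → locally connected → shared convolutional filters) are easy to verify; a minimal sketch of the arithmetic, not course code, using the running 200×200 example:

```python
# Weight counts for the 200x200 running example: fully connected vs.
# locally connected vs. shared convolutional filters.
n_pixels = 200 * 200            # 40K inputs
n_filters = 4
filter_size = 10 * 10

fully_connected = n_pixels * n_pixels                    # 40K x 40K
locally_connected = n_pixels * filter_size * n_filters   # 40K x 100 x 4
convolutional = filter_size * n_filters                  # weights shared across locations

print(f"fully connected:   {fully_connected:,}")    # 1,600,000,000
print(f"locally connected: {locally_connected:,}")  # 16,000,000
print(f"convolutional:     {convolutional:,}")      # 400
```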

Image Convolution
By padding (filterWidth-1)/2, the output image size matches the input image size (see the sketch below). [Figure: a 3×3 filter sliding over the input image, with vertical and horizontal padding.]
(https://github.com/vdumoulin/conv_arithmetic)

Max Pooling- Reducing the Size of an Image
(cs231n, Karpathy, Li)
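A minimal sketch of the "same" padding rule, assuming NumPy and SciPy are available: pad by (filterWidth-1)/2 and a valid convolution returns the input size.

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(200, 200)
filt = np.random.rand(3, 3)       # 3x3 filter, as in the figure

pad = (filt.shape[0] - 1) // 2    # (filterWidth-1)/2 = 1 for a 3x3 filter
padded = np.pad(img, pad)         # zero pad top/bottom/left/right
out = convolve2d(padded, filt, mode='valid')

print(out.shape)                  # (200, 200): matches the input size
```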

Convolution Neural Network (CNN) Building Block
[Figure: image → convolution → pooling.] (Deng ICML '14)

Putting it All Together
[Figure: convolution and pooling layers assembled into a whole system.]

Learning Filters
32 learned filters, each 5×5 → 32 filtered images, each 28×28. Input image is 28×28. Use zero padding.

Filters
[Figure: a 3×3 filter, a 3×3 filter, and a 3×3×4 filter.]

Learning Filters
32 learned filters, each 5×5×3 → 32 filtered images, each 28×28×1. Input image is 28×28×3. Use zero padding.

CNN Architecture (Not so) Toy Example
Input RGB image: 64×64×3 pixels. Output: prediction of 1 of 10 categories.
– 32 filters, each 5×5×3, with a 2 pixel pad added to top/bot/left/right so the filtered image is the same dimension as the input image → 64×64×32; max pooling ×2 → 32×32×32.
– 16 filters, each 5×5×32, 2 pixel pad → 32×32×16; max pooling ×2 → 16×16×16.
– 32 filters, each 5×5×16, 2 pixel pad → 16×16×32; max pooling ×2 → 8×8×32.
– 64 filters, each 5×5×32, 2 pixel pad → 8×8×64; max pooling ×2 → 4×4×64.
– One 1×1×64 filter, 0 pixel pad → 4×4, converted to a 16-element vector, fully connected to 10 classes.
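A sketch of this toy architecture in PyTorch; the layer ordering is one reading of the flattened slide diagram, and nonlinearities are omitted because the diagram doesn't show them.

```python
import torch
import torch.nn as nn

# The "(not so) toy" 64x64x3 -> 10-class architecture, as reconstructed
# from the slide diagram.
toy_cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2),   # 64x64x3  -> 64x64x32
    nn.MaxPool2d(2),                              # -> 32x32x32
    nn.Conv2d(32, 16, kernel_size=5, padding=2),  # -> 32x32x16
    nn.MaxPool2d(2),                              # -> 16x16x16
    nn.Conv2d(16, 32, kernel_size=5, padding=2),  # -> 16x16x32
    nn.MaxPool2d(2),                              # -> 8x8x32
    nn.Conv2d(32, 64, kernel_size=5, padding=2),  # -> 8x8x64
    nn.MaxPool2d(2),                              # -> 4x4x64
    nn.Conv2d(64, 1, kernel_size=1),              # one 1x1x64 filter -> 4x4x1
    nn.Flatten(),                                 # 4x4 -> 16-element vector
    nn.Linear(16, 10),                            # fully connected to 10 classes
)

x = torch.randn(1, 3, 64, 64)
print(toy_cnn(x).shape)                           # torch.Size([1, 10])
```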

CNN Example
CIFAR10: 32×32×3 input images, 10 output classes. Note: two convs between each pool. (cs231n, Li, Karpathy, Johnson)

CNN Example
Input [32×32×3]
– CONV with ten 3×3 filters, stride 1, pad 1: Parameters: (3*3*3)*10 + 10 = 280. Memory: 32*32*10.
– CONV with ten 3×3 filters, stride 1, pad 1: Parameters: (3*3*10)*10 + 10 = 910. Memory: 32*32*10.
– Pool with 2×2 filters, stride 2: Parameters: 0. Memory: 16*16*10.

CNN Example
Input: 16×16×10
– CONV with five 3×3 filters, stride 1, pad 1: Parameters: (3*3*10)*5 + 5 = 455. Memory: 16*16*5.
– CONV with twenty 3×3 filters, stride 1, pad 1: Parameters: (3*3*5)*20 + 20 = 920. Memory: 16*16*20.
– Pool with 2×2 filters, stride 2: Parameters: 0. Memory: 8*8*20.

CNN Example
Input: 8×8×20
– CONV with ten 3×3 filters, stride 1, pad 1: Parameters: (3*3*20)*10 + 10 = 1810. Memory: 8*8*10.
– CONV with twenty 5×5 filters, stride 1, pad 2: Parameters: (5*5*10)*20 + 20 = 5020. Memory: 8*8*20.
– Pool with 2×2 filters, stride 2: Parameters: 0. Memory: 4*4*20.

CNN Example
– Fully connect (FC) 4×4×20 to 10 output classes: Parameters: (4*4*20)*10 + 10 = 3210. Memory: 10.
– Done! (The bookkeeping is checked in the sketch below.)

Case Study
(cs231n, Karpathy, Li)
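The bookkeeping in the worked example above can be checked with a small (hypothetical) helper:

```python
# Parameters and output-activation memory for a "same"-padded conv layer.
def conv_stats(h, w, c_in, k, n_filters):
    params = (k * k * c_in) * n_filters + n_filters  # weights + biases
    memory = h * w * n_filters                       # output activations
    return params, memory

print(conv_stats(32, 32, 3, 3, 10))   # (280, 10240)
print(conv_stats(32, 32, 10, 3, 10))  # (910, 10240)
print(conv_stats(16, 16, 10, 3, 5))   # (455, 1280)
print(conv_stats(16, 16, 5, 3, 20))   # (920, 5120)
print(conv_stats(8, 8, 20, 3, 10))    # (1810, 640)
print(conv_stats(8, 8, 10, 5, 20))    # (5020, 1280)

fc_params = (4 * 4 * 20) * 10 + 10    # final fully connected layer
print(fc_params)                      # 3210
```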

Case Study
Note: most memory is in the early layers. Note: most parameters are in the FC layers.

CNN Visualization
(Zeiler, Fergus, 2014)

CNN Visualization
(Zeiler, Fergus, 2014)

CNN Visualization
– We can compute the partial derivative of input pixels with respect to a cost.
– Start with a random noise image and a pretrained network, then iteratively tweak the input as we minimize our cost.
– The cost here is to activate a particular neuron (in this slide): Google Inception v1, layer mixed4a, unit 11. [Figure: initial noise, 4 iterations, 48 iterations.]
(Olah, et al., "Feature Visualization", Distill, 2017)
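A minimal PyTorch sketch of this procedure (illustrative, not the authors' code; torchvision's googlenet and its inception4a module stand in for Inception v1's mixed4a, and the step count matches the slide):

```python
import torch
from torchvision.models import googlenet  # stand-in for Inception v1

model = googlenet(weights="DEFAULT").eval()

# Capture the activation of one intermediate layer with a forward hook.
acts = {}
model.inception4a.register_forward_hook(lambda m, i, o: acts.update(out=o))

img = torch.randn(1, 3, 224, 224, requires_grad=True)  # random noise start
opt = torch.optim.Adam([img], lr=0.05)                 # optimize pixels only

for _ in range(48):                    # "48 iterations" as on the slide
    opt.zero_grad()
    model(img)
    loss = -acts["out"][0, 11].mean()  # maximize channel (unit) 11
    loss.backward()
    opt.step()
```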

CNN Visualization
Can modify the objective to get different types of insight into what the CNN is responding to. (n = layer index; x, y = spatial position; z = channel index; k = class index.)
(Olah, et al., "Feature Visualization", Distill, 2017)

Interrogate a Network with Different Gradients
Start with any image, then when doing backprop, use a GT label of, say, a dog; then instead of updating the model weights, update the pixels in the image.
(Inceptionism: Going Deeper into Neural Networks, A. Mordvintsev, C. Olah and M. Tyka, 2015)

Interrogate a Network with Different Gradients
The delta is the difference between the logits and the one-hot encoding of "dog". After backprop, the next time you pass in the image, it will be closer to being classified as a dog! Then iteratively keep updating pixels with the dog GT label ...eventually, the image will start to "look" like a dog.
(Inceptionism: Going Deeper into Neural Networks, A. Mordvintsev, C. Olah and M. Tyka, 2015; inceptionism-going-deeper-into-neural.html)
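A hedged sketch of that pixel-update loop in PyTorch; the backbone, class index, step size, and iteration count are illustrative choices, not the authors' settings:

```python
import torch
import torch.nn.functional as F
from torchvision.models import googlenet

model = googlenet(weights="DEFAULT").eval()
for p in model.parameters():
    p.requires_grad_(False)                 # freeze weights; only pixels move

img = torch.rand(1, 3, 224, 224, requires_grad=True)  # "any" starting image
target = torch.tensor([207])                # an ImageNet dog class, for example

for _ in range(100):
    loss = F.cross_entropy(model(img), target)  # logits vs. one-hot "dog"
    loss.backward()
    with torch.no_grad():
        img -= 0.1 * img.grad               # update the pixels, not the weights
        img.grad = None
```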

CNN as Vector Representation
[Figure: typical CNN architecture; an input image; its fc8 feature vector shown as an image and as a 2D plot.]

CNN as Vector Representation
– As it turns out, these fully connected layers are excellent descriptors of the input image!
– For example, you can pass images through a pre-trained CNN, then take the output from a FC layer as input to an SVM classifier. (image2vec; see the sketch after the next slide.)
– Images in this vector space generally have the property that similar images are close in this latent representation.

Vision Tasks
Classification (single object); Classification + Localization (single object); Object Detection (multiple objects); Instance Segmentation (multiple objects).
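A minimal image2vec sketch of the FC-features-into-SVM idea above, with torchvision's resnet18 standing in for the slide's fc8-style backbone and random tensors standing in for real images and labels:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from sklearn.svm import SVC

backbone = resnet18(weights="DEFAULT")
backbone.fc = nn.Identity()          # expose the 512-d penultimate features
backbone.eval()

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # stand-in for real images
    feats = backbone(images).numpy()        # (8, 512) image descriptors

labels = [0, 0, 0, 0, 1, 1, 1, 1]           # stand-in ground truth
svm = SVC(kernel="linear").fit(feats, labels)
print(svm.predict(feats[:2]))
```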

Classification vs. Classification + Localization
– Classification: Input: image. Output: class label ("CAT"). Evaluation metric: accuracy.
– Classification + Localization: Input: image. Output: class label plus box coordinates (CAT, x, y, w, h). Evaluation metric: Intersection over Union (IoU).

Classification with Localization
Let's allow a few classes: 1. Car 2. Truck 3. Pedestrian 4. Motorcycle
– For now, let's assume one object per image.
– Each object has {x, y, w, h}, with the image spanning (0,0) at upper left to (1,1) at lower right.
– For this image, the object location is {x, y, w, h} = {0.3, 0.6, 0.4, 0.3}.
(Image from: deeplearning.ai, C4W3L01)
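IoU, the metric just named, in a few lines (assuming the slide's {x, y, w, h} convention with (x, y) as the box center):

```python
# Intersection over Union for two axis-aligned boxes (x, y, w, h),
# where (x, y) is the box center.
def iou(a, b):
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union

print(iou((0.3, 0.6, 0.4, 0.3), (0.35, 0.6, 0.4, 0.3)))  # ~0.78
```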

Classification with Localization
Four classes: 1. Car 2. Truck 3. Pedestrian 4. Motorcycle. Localization: {x, y, w, h}.
Define the y label:
y = [P_c, b_x, b_y, b_w, b_h, C_1, C_2, C_3, C_4]^T
where P_c is the probability of an object, (b_x, b_y, b_w, b_h) is the bounding box location, and C_1..C_4 are 0/1 for each class.
(Image from: deeplearning.ai, C4W3L01)

Classification with Localization
Cost function (squared error):
If y_1 = 1: Loss(\hat{y}, y) = \sum_i (\hat{y}_i - y_i)^2
If y_1 = 0: Loss(\hat{y}, y) = (\hat{y}_1 - y_1)^2
For the example image, y = [1, 0.3, 0.6, 0.4, 0.3, 0, 1, 0, 0]^T; with no object, y = [0, ?, ?, ?, ?, ?, ?, ?, ?]^T (don't care).
(Image from: deeplearning.ai, C4W3L01)

Classification with Localization
Four classes: 1. Car 2. Truck 3. Pedestrian 4. Motorcycle. Localization: {x, y, w, h}.
Alternate cost function (sketched in code below):
– y_1 can use logistic loss
– y_2 .. y_5 can use squared error
– y_6 .. y_9 can use softmax cross entropy
Example: y = [1, 0.3, 0.6, 0.4, 0.3, 0, 1, 0, 0]^T; with no object, y = [0, ?, ..., ?]^T (don't care).
(Image from: deeplearning.ai, C4W3L01)

Snapchat Facewarp?
Traditional approach: Viola Jones face detection; average eye and 82 facial feature points; search for actual point locations using Mahalanobis distance; restrict based on PCA statistics; repeat steps 3-5.
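A sketch of the alternate cost above in PyTorch; the function name and masking details are my own, not from the slides:

```python
import torch
import torch.nn.functional as F

# Composite loss on a y = [Pc, bx, by, bw, bh, C1..C4] label:
# logistic loss on Pc, squared error on the box, cross entropy on the class.
# Box and class terms only count when an object is present (y_1 = 1).
def detection_loss(y_hat, y):
    obj = y[:, 0]                                           # Pc ground truth
    loss_pc = F.binary_cross_entropy_with_logits(y_hat[:, 0], obj)
    loss_box = ((y_hat[:, 1:5] - y[:, 1:5]) ** 2).sum(dim=1)
    loss_cls = F.cross_entropy(y_hat[:, 5:9], y[:, 5:9].argmax(dim=1),
                               reduction="none")
    return loss_pc + (obj * (loss_box + loss_cls)).mean()

y = torch.tensor([[1, 0.3, 0.6, 0.4, 0.3, 0, 1, 0, 0]])    # the slide's example
y_hat = torch.randn(1, 9)                                   # random predictions
print(detection_loss(y_hat, y))
```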

Snapchat Facewarp?
Deep learning approach: candidate faces → refine face selection → facial feature points. (MT-CNN [Zhang et al. 2016])

Localization
Each face has 68 points, so the CNN would output: Face?, pt1X, pt1Y, pt2X, pt2Y, ..., pt68X, pt68Y → 137 outputs. Of course, you need GT for thousands of faces to train the model.

Can Do the Same with Body Pose
(Pishchulin et al. CVPR'16)

Object Detection
(Image credit: Redmon, Joseph, et al [4])

More Than One Object per Image?
Training set: images x with labels y. Car detection example: y = 1, 1, 1, 0, 0.
(Images from: deeplearning.ai, C4W3L03)

Sliding Window Detection
(Images from: deeplearning.ai, C4W3L03)

Computing FC Layers with Convolution
Training network: 14×14×3 image → CONV, sixteen 5×5×3 filters, no pad, stride 1 → 10×10×16 → MAX POOL 2×2 → 5×5×16 → FC 400 → FC 400 → softmax over 4 classes.
The FC layers can be computed as convolutions: 5×5×16 → four hundred 5×5×16 filters → 1×1×400 → four hundred 1×1×400 filters → 1×1×400 → four 1×1×400 filters → 1×1×4.
(Images from: deeplearning.ai, C4W3L04)

Replacing Sliding Windows w/ Fully Convolutional CNNs
Training (14×14×3): 14×14×3 → 10×10×16 → 5×5×16 → 1×1×400 → 1×1×400 → 1×1×4.
Testing (16×16×3): 16×16×3 → 12×12×16 → 6×6×16 → 2×2×400 → 2×2×400 → 2×2×4.
Testing (28×28×3): 28×28×3 → 24×24×16 → 12×12×16 → 8×8×400 → 8×8×400 → 8×8×4.
(Images from: deeplearning.ai, C4W3L04)
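A PyTorch sketch of this conversion (untrained weights, shown only to reproduce the slide's shapes):

```python
import torch
import torch.nn as nn

# The 14x14 training network with its FC head rewritten as convolutions,
# so the same weights can slide over larger test images.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),    # 14x14x3 -> 10x10x16 (no pad, stride 1)
    nn.MaxPool2d(2),                    # -> 5x5x16
)
head = nn.Sequential(
    nn.Conv2d(16, 400, kernel_size=5),  # "FC 400" as 400 filters of 5x5x16
    nn.Conv2d(400, 400, kernel_size=1), # second FC 400 as 1x1 convolutions
    nn.Conv2d(400, 4, kernel_size=1),   # softmax layer over 4 classes
)
net = nn.Sequential(features, head)

print(net(torch.randn(1, 3, 14, 14)).shape)  # [1, 4, 1, 1]: one prediction
print(net(torch.randn(1, 3, 28, 28)).shape)  # [1, 4, 8, 8]: 64 predictions at once
```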

Replacing Sliding Windows w/ Fully Convolutional CNNs
Sliding window approach: sequentially evaluate one window at a time. Fully convolutional approach: evaluate all 64 regions at once (28×28×3 → 24×24×16 → 12×12×16 → 8×8×400 → 8×8×400 → 8×8×4).
(Images from: deeplearning.ai, C4W3L04)

More Accurate Bounding Boxes
– Whether using the sliding window or fully convolutional approach, bounding boxes often don't align well or may not match the actual object shape.
– At each location, solve for the probability of an object as well as the actual coordinates, which don't have to be square.
(Images from: deeplearning.ai, C4W3L05)

Replacing Sliding Windows w/ Fully Convolutional CNNs
Can think of this as evaluating an 8×8 grid, where each of the 64 cells is independently checked for an object. Each cell has a y label:
y = [P_c, b_x, b_y, b_w, b_h, C_1, C_2, C_3, C_4]^T
(P_c = probability of an object; b = object location; C_1..C_4 = 0/1 for each class; four classes in this example.)
Network: 28×28×3 → 24×24×16 → 12×12×16 → 8×8×400 → 8×8×400 → 8×8×9.

Replacing Sliding Windows w/ Fully Convolutional CNNs
Overlay the GT of each object; the cell where its centroid lies is responsible for that object. Each cell has a y label as above. Network output: 8×8×9.

Replacing Sliding Windows w/ Fully Convolutional CNNs
For the cell containing an object: y = [1, 0.8, 0.9, 2.0, 1.8, 0, 1, 0, 0]^T. For cells with no object: y = [0, ?, ?, ?, ?, ?, ?, ?, ?]^T (don't care). See the sketch below.
Note 1: within a cell, the upper left is (0,0) and the lower right is (1,1).
Note 2: b_w and b_h can be > 1.0.

YOLO V3
(https://arxiv.org/abs/1804.02767, 2018; …e-yolov3-object-detection-network-is-fast/)
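A short sketch of building such a grid target tensor (the cell indices and tensor layout are illustrative, not from the slides):

```python
import torch

# 8x8 target grid, 9 values per cell: [Pc, bx, by, bw, bh, C1..C4].
# Pc = 0 everywhere by default; the box and class entries of empty cells
# are "don't care" and never enter the loss.
S, DEPTH = 8, 9
target = torch.zeros(S, S, DEPTH)

# One object whose centroid falls in grid cell (row 3, col 5),
# using the slide's example label:
row, col = 3, 5
target[row, col] = torch.tensor([1, 0.8, 0.9, 2.0, 1.8, 0, 1, 0, 0])
# bx, by are relative to the cell (0,0 upper left, 1,1 lower right);
# bw, bh are in cell units, so they can exceed 1.0.
```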

Mask R-CNN
(https://arxiv.org/pdf/1703.06870.pdf; FAIR Mask R-CNN, COCO + Places Workshop, ICCV 2017)

CNN Results
– Handwritten characters: MNIST 0.17% error, Ciresan et al '11; Arabic & Chinese, Ciresan '12.
– CIFAR-10 (60K images of 10 classes; Krizhevsky '09): 9.3% error, Wan et al. '13.
– Traffic sign recognition: 0.56% error vs 1.16% for humans, Ciresan '11.

ImageNet
– Amazon Mechanical Turk did the bulk of the labeling: 14M labeled images, 20K classes. (Russakovsky et al., 2015)
– The challenge uses 1.2M images over 1000 categories: image classification, object localization, video detection.

ImageNet: Examples of Hammer

Deep Learning - Surpassing the Visual Cortex's Object Detection and Recognition Capability
[Chart: top-5 error on ImageNet by year (error axis 0-30%). Traditional computer vision and machine learning vs. deep convolution neural networks (CNNs) after the introduction of deep learning (15.4%; ZFNet; GoogleLeNet; SENet; 2014, 2015, 2017), eventually surpassing a trained human (genius intellect); in 2018 the challenge moved to Kaggle. A similar effect has been demonstrated on voice and pattern recognition.]

Deep Learning Specialization, Andrew Ng, 2017
Five courses:
1. Neural Networks and Deep Learning
2. Improving Deep Neural Networks
3. Structured Machine Learning Projects
4. Convolutional Neural Networks
5. Sequence Models
https://www.deeplearning.ai/

Li, Johnson, Yeung 2017
http://cs231n.stanford.edu/

For More Information: http://www.rit.edu/mil
Raymond W. Ptucha
Assistant Professor, Computer Engineering
Director, Machine Intelligence Laboratory
Rochester Institute of Technology
Email: [email protected]
Phone: +1 (585) 797-5561
Office: GLE (09) 3441
