1 M3 - Machine Learning for Computer Vision. Traffic Sign Detection and Recognition. Adrià Ciurana, Guim Perarnau, Pau Riba. [ADRIA] Good afternoon. We are Guim Perarnau, Pau Riba and myself, Adrià Ciurana, and we are going to present the work done during this module.
2 Index. Dataset generation: correctly crop dataset, bootstrap. Data pre-processing: extract features, normalization, dimensionality reduction. Sign detection and recognition: sliding window, detection, recognition. Evaluation: get metrics (F1-score, AUC), visualize data. [ADRIA] This is the structure of the presentation, which is divided into four main blocks: dataset generation, data pre-processing, detection and recognition, and finally the evaluation of all the modules.
3 Introduction [ADRIA] As you all know, this project consists of detecting and recognising signs using a machine learning approach.
4 Motivation: Module 1 project segmentation. Per-window results (669 images): Precision 47.88%, Accuracy 38.25%, Recall 65.55%, F1-score 55.34%, Time/frame 0.73 s. [ADRIA] For us, the main motivation for this project is to improve the results we obtained in Module 1. In that module we detected (but did not recognize) traffic signs by means of computer vision techniques only. Here you can see an example of the segmentation we achieved. As you can see, it is not perfect: the segmentation is based on color information, which can sometimes be misleading. Notice how the blue track is also detected as a sign. You can also see the results we obtained. So, let's see if we have succeeded!
5 Pipeline. Image → Initial Detector → Round sign? Triangular sign? Square sign? If none, the crop goes to the new background dataset (= false positive). Stages: Bootstrap, Sliding Window Framework, Segmentation, Detection, Recognition, Evaluation. [ADRIA] Bootstrap; Sliding Window Framework; Segmentation (integral image + the color segmentation from M1); Detection (a set of binary classifiers, one per shape type); Recognition (classifying the sign type); Detection again (recognition helps us discard background); Evaluation (NMS; the Pedro, TUD and PASCAL criteria, with PASCAL always used for evaluation).
6 Pipeline: Bootstrap → Sliding Window Framework → Segmentation → Detection → Recognition → Evaluation. [ADRIA]
11 Dataset. Dataset used: reduced BelgiumTS Dataset¹ (62 classes). Problems found: traffic signs in (supposedly) background-only images; traffic signs not labeled but correctly detected. Assumption: "do not care" objects: types of signs that we will ignore (no penalization, no gain). 1. http://btsd.ethz.ch/shareddata/ [PAU] Like all of you, we have used the reduced Belgium dataset, which is composed of 62 types of traffic signs. However, some things in the dataset are not as they should be. More concretely, we found that there are traffic signs in images that the dataset authors claimed to be background-only. Besides, there are unlabeled traffic signs that our algorithm correctly detects, as you can see in the images. Manually labeling these signs would have been a very tedious and unproductive task, so we decided to leave them as they are. Also, out of the 62 classes, we only want to recognize 14. That is why there are certain types of signs that we will simply ignore.
12 Crop training dataset. Problem: the BelgiumTS Dataset provides already-cropped images, but cropped images need to have a canonical size, and all signs must have the same height (vertical padding). [PAU] Now let's dig into the dataset. We found that we could download the traffic sign images already cropped. This, however, was a bad idea for two reasons. The first is that images need to have a canonical size, and if you resize images with different aspect ratios, you deform them. The second is that, in order to have correct per-window detection, you need the signs to have the same height so they look centered.
13 Crop training dataset. Solution: make our own 32x32 crops with 4 vertical padding pixels. Steps: original bounding box → expand BB (4 pixels) → resize to 32x32. Special case: if the sign is at the image boundary, add boundary padding. [PAU] So, how did we solve this problem? By making our own crops. We chose the size to be 32x32 and added 4 pixels of vertical padding. The steps are the following. First, we take the initial bounding box, which has a non-canonical aspect ratio. Next, we expand this BB so it has the target aspect ratio and the specified padding. Finally, we resize the image to the canonical size. One last consideration: there is a special case where the crop falls at the boundary of the image. If this happens, we add boundary padding by replicating the last value, as you can see in the bottom image.
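The cropping steps above can be sketched in numpy. This is our own illustrative stand-in, not the authors' code: the function name is hypothetical, the box is expanded to a square around its center, off-image pixels are replicated with edge padding, and a nearest-neighbour resize stands in for a proper interpolating resize.

```python
import numpy as np

def crop_sign(img, bbox, out_size=32, v_pad=4):
    """Crop a sign to a canonical out_size x out_size patch.

    bbox is (x1, y1, x2, y2). The box is expanded so the sign keeps
    v_pad pixels of vertical padding after resizing, and boundary
    pixels are replicated when the expanded box falls off the image.
    """
    x1, y1, x2, y2 = bbox
    h = y2 - y1
    # scale so that sign height + 2*v_pad maps to out_size
    scale = h / (out_size - 2 * v_pad)
    half = int(round(out_size * scale / 2))
    cy, cx = (y1 + y2) // 2, (x1 + x2) // 2
    top, bot = cy - half, cy + half
    left, right = cx - half, cx + half
    # replicate-pad ("boundary padding") if the box leaves the image
    pad_t, pad_l = max(0, -top), max(0, -left)
    pad_b = max(0, bot - img.shape[0])
    pad_r = max(0, right - img.shape[1])
    pad = ((pad_t, pad_b), (pad_l, pad_r)) + ((0, 0),) * (img.ndim - 2)
    padded = np.pad(img, pad, mode="edge")
    patch = padded[top + pad_t:bot + pad_t, left + pad_l:right + pad_l]
    # nearest-neighbour resize to the canonical size
    ys = (np.arange(out_size) * patch.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * patch.shape[1] / out_size).astype(int)
    return patch[ys][:, xs]
```

Any crop, including one touching the image border, comes out as a 32x32 patch with the sign vertically padded.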
14 Bootstrap: train a new model adding false positives. Background dataset: Initial 9863 + Hard negatives 11647 = Total 21510. [GUIM] Until now, we have focused on the positive samples. Let's see how we define the negative ones. The first step consists of selecting random crops from background images and training a detector. Afterwards, we apply our sign detector to the background images. Some background crops will be classified as signs; these are hard negatives. Once we have a set of hard negatives, we train a new model and repeat the process with the new detector. After several iterations, no more hard negatives are added to our set. Then we generate our background dataset by joining the hard negatives with the background samples we had at the beginning. The table shows the number of elements in our final negative set.
15 Segmentation using the YCbCr color space. Original image → segmentation → morphology → possible sign. Advantages: speeds up the sliding window, reduces false positives. [GUIM] We use segmentation to optimize the sliding window process and focus only on those regions that are most likely to contain a traffic sign. This also decreases the chances that our detector classifies a background window as a sign, which increases precision.
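A chroma-based mask of this kind can be sketched in numpy. This is an illustrative sketch, not the project's exact rule: it uses the BT.601 RGB-to-YCbCr conversion and keeps strongly red (high Cr) or strongly blue (high Cb) pixels; the threshold values are our own assumptions.

```python
import numpy as np

def sign_color_mask(rgb, cr_thr=150, cb_thr=150):
    """Boolean mask of pixels whose chroma matches typical sign colours.

    Converts RGB to YCbCr (BT.601) and thresholds the two chroma
    channels; red signs push Cr up, blue signs push Cb up.
    """
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (cr > cr_thr) | (cb > cb_thr)
```

Working in YCbCr separates luminance from chroma, so the mask is less sensitive to brightness changes than thresholding RGB directly (though, as the next slide shows, heavy shadows still break it).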
16 Segmentation. However... we miss some signs! [GUIM] However, the segmentation is not perfect, and it fails when the sign is heavily shadowed. This means that no matter how good our detector is, we won't achieve perfect recall.
17 Sliding window. For each level of the Gaussian pyramid: slide a window over the input image and over the integral image of the segmentation mask; keep only possible sign regions. [GUIM] As you know, sliding window is a very slow technique, so some optimizations have been proposed. First, each image is processed in a different thread. Moreover, the segmentation mask is summarized with an integral image, so windows that do not overlap a candidate region can be discarded cheaply.
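The integral-image shortcut can be sketched as follows (our own sketch; the window size, stride and minimum-coverage fraction are illustrative parameters, not the project's values). The integral image lets each window's segmentation coverage be computed with four lookups, so non-candidate windows are rejected in constant time.

```python
import numpy as np

def integral_image(mask):
    """Summed-area table with a zero row/column prepended."""
    ii = np.cumsum(np.cumsum(mask.astype(np.int64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def window_sum(ii, y, x, h, w):
    """Sum of mask[y:y+h, x:x+w] from four integral-image lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def sliding_windows(mask, win=32, stride=8, min_frac=0.9):
    """Yield (y, x) positions whose window covers enough segmented pixels."""
    ii = integral_image(mask)
    H, W = mask.shape
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            if window_sum(ii, y, x, win, win) >= min_frac * win * win:
                yield y, x
```

In the full pipeline this filter would run at every level of the Gaussian pyramid, with the mask downscaled alongside the image.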
18 Data augmentation. Idea: generate more positive samples for each class. Flip samples (flip is not desired in some cases). Blur samples: smooths sudden changes, emphasizes the shape. Kernel sizes: original, (3,3), (5,5), (7,7), (9,9). [PAU] Initially, our dataset had only a few positive samples with respect to the negative ones. To train our detection and recognition models, we performed some data augmentation to balance the number of examples a little. First, and only for detection, we applied a horizontal flip to the classes that are symmetric or very similar. Clearly, this is not good practice for recognition purposes. Another technique we tried is applying some blurring to the images. Then the overall shape takes more importance than the small details, and detection improves.
19 Dataset division for detection. First idea: background vs. signs. Problem: very different kinds of signs; separation is not easy. Solution: divide signs according to their shape: up-triangle, down-triangle, horizontal rectangle, vertical rectangle, parking, round, stop, diamond (three of these classes are not flipped). [PAU] Apart from the data augmentation, the dataset has been divided into smaller subsets. Classifying background vs. signs directly is a hard task due to the variability of the positive samples. The proposed division is the one on the screen, with 8 different classes, three of which are not flipped.
20 Detection: simple binary classifiers with customized thresholds. Window candidate → feature extraction → (△ vs BKGD > th△) OR (◯ vs BKGD > th◯) OR (▽ > th▽) OR (ロ > thロ) OR (♢ > th♢) OR ... → is it a traffic sign? YES / NO. [PAU] Let's start with the detection framework. Assume we get one window from the sliding window. Features are extracted and we classify it. We propose simple binary classifiers with thresholds that have been set independently. For instance, a window is classified as a traffic sign if any one of the binary classifiers says so.
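The OR-of-binary-classifiers rule is compact enough to sketch directly (our own sketch; the classifier names and scoring functions are hypothetical stand-ins for the per-shape SVM decision functions):

```python
def is_traffic_sign(feat, classifiers, thresholds):
    """A window is accepted if ANY shape-vs-background classifier fires.

    `classifiers` maps a shape name to a scoring function (e.g. an SVM
    decision function); each shape has its own independently tuned
    threshold in `thresholds`.
    """
    return any(score(feat) > thresholds[name]
               for name, score in classifiers.items())
```

Tuning each threshold independently lets easy shapes (e.g. round) be strict while harder shapes stay permissive, instead of forcing one global operating point.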
22 Recognition. Detection boxes → feature extraction → is it a traffic sign? Yes → multiclass classification over 14 classes; No → delete it from the detections (background refinement step). [GUIM]
23 Evaluation. Train and test on cropped images (per window): signs are centered, same scale. Test with the sliding window (per image): translation, different scales, multiple detections. Train set → train model → evaluate per window and per image. [GUIM]
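For the per-image setting, a detection is typically matched to the ground truth with the PASCAL overlap criterion (intersection-over-union above 0.5), and the matches are summarized with the F1-score. A minimal sketch of both, under the assumption of axis-aligned `(x1, y1, x2, y2)` boxes:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Under PASCAL, a detection with `iou > 0.5` against an unmatched ground-truth box counts as a true positive; extra detections are false positives and unmatched ground truths are false negatives.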
24 Evaluation - Detection. Per-window results (classifier | solver | feature descriptor | dimension reduction | data normalization | F1-score):
SVM | Linear | HOG (4x4 px/cell) | No | Yes | 98.63% (faster!)
SVM | Linear | HOG (8x8 px/cell) | No | Yes | 97.95%
SVM | Linear | HOG | Yes (PCA) | Yes | 97.30%
SVM | Linear | HOG + color histogram | No | Yes | 97.26% (color is not important)
SVM | RBF | HOG | No | Yes | 97.66% (slower)
SVM | Linear | HOG + LBP | No | Yes | 97.31%
SVM | Linear | HOG color multichannel | No | Yes | 96.98%
LDA | SVD | LBP | No | Yes | 96.26%
[GUIM] So, here are the best configurations we found. These results are per window, meaning it is very easy to obtain high scores; however, this is a computationally cheap way to validate which configuration is better. There is a lot of data here, but don't be scared: this table is meant to show that we tested many things. In fact, we can summarize it by saying that the linear SVM gave the best results. The RBF SVM is also good, but far slower. Adding color information did not improve the results, and other classifiers like LDA did not outperform the SVM. Applying PCA speeds up the execution at the cost of slightly lower results.
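The best-performing descriptor is HOG over small cells. A simplified numpy sketch of the idea (our own illustration, not the exact descriptor used: it omits block normalization and uses unsigned gradients in 9 orientation bins):

```python
import numpy as np

def hog_features(img, cell=4, bins=9):
    """Simplified HOG: magnitude-weighted histograms of unsigned
    gradient orientation, one per cell, concatenated. No block
    normalization, so this is only a sketch of the real descriptor."""
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientation
    H, W = img.shape
    feats = []
    for y in range(0, H - cell + 1, cell):
        for x in range(0, W - cell + 1, cell):
            a = ang[y:y + cell, x:x + cell].ravel()
            m = mag[y:y + cell, x:x + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    return np.concatenate(feats)
```

On a 32x32 crop with 4x4 cells this yields 8 x 8 cells x 9 bins = 576 dimensions, which is why coarser 8x8 cells (shorter vectors, slightly lower F1) trade accuracy for speed.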
25 Evaluation - Detection. Per-image results (classifier | solver | feature descriptor | dimension reduction | segmentation | blur images | F1-score):
SVM | Linear | HOG | Yes (LDA) | Yes | Yes | 55.17%
SVM | Linear | HOG | Yes (LDA) | Yes | No | 44.89%
SVM | Linear | HOG | (other listed configurations) | | | 24.86%, 21.49%
Cascade of boosted classifiers | - | Haar + AdaBoost | | | | 27.61%
Blurring the images is key; LDA and segmentation improve both results and speed. [GUIM] So, we took the best per-window model and optimized it through per-image evaluation. The first thing we did was apply LDA for dimension reduction: features that originally had 1700 dimensions are now represented with only 6. With this, we increased our F1-score by 3% and also considerably improved the speed. The next step was to apply the M1 segmentation, which also increased both the results and the speed. Finally, we added the blurred images, which improved the results further. On the right you can see the miss-rate vs. FPPI curve.
26 Evaluation - Recognition (model | solver | feature descriptor | dimension reduction | F1-score per window | F1-score per image):
ECOC + SVM (one vs. rest) | Linear | HOG | Yes (LDA) | 82.50% | 64.68%
Neural network | - | - | No | 75.68% | -
Neural network breakdown: mean F1-score 56.22%, precision 52.52%, recall 76.15%; weighted F1-score 75.68%, precision 81.05%, recall 73.02%. [GUIM]
27 Evaluation - Whole Pipeline: Detection → Recognition → Detection (improved). [GUIM]
28 Video. Legend: ground truth, estimated sign, do-not-care object. Note: this video shows the final output of the recognition given the detection, not the detection by itself. [PAU] Here we can see the detections and predictions of our system. The green boxes are our final detections, the red ones correspond to the ground truth, and the blue boxes are the do-not-care objects: they are labeled in the ground truth, but their labels do not belong to the classes used in training. The video does not correspond to our best detector due to time limitations. Still, the results are quite good despite the decrease introduced by the recognition step.
29 Conclusions. Color segmentation and parallelization saved us time. LDA improves performance (both speed and results). Tricks learned: correctly cropping the dataset, bootstrap, data augmentation. Low results: M1 F1-score 55.34% vs. M3 55.17%. [PAU] Let's draw the final conclusions. First, this project has shown how slow the sliding window approach is, which forced us to find ways to speed up the whole process; color segmentation and parallelization saved us a lot of time. Moreover, dimensionality reduction techniques are useful in two ways: they increase speed and improve results by removing noise from the data. Furthermore, we learned lots of tricks, such as preparing a dataset, bootstrapping, cropping the data correctly, and data augmentation by flipping and blurring. The only drawback is that we expected to improve considerably on the results from the previous module, but that was not possible due to the many parameters to tune.