# Tue Nov 13 2018 Computer Vision and Pattern Recognition Papers on arXiv.org

## Deep Learning versus Classical Regression for Brain Tumor Patient Survival Prediction

Deep learning for regression tasks on medical imaging data has shown promising results. However, compared to other approaches, their power is strongly linked to the dataset size. In this study, we evaluate 3D-convolutional neural networks (CNNs) and classical regression methods with hand-crafted features for survival time regression of patients with high grade brain tumors. The tested CNNs for regression showed promising but unstable results. The best performing deep learning approach reached an accuracy of 51.5% on held-out samples of the training set. All tested deep learning experiments were outperformed by a Support Vector Classifier (SVC) using 30 radiomic features. The investigated features included intensity, shape, location and deep features. The submitted method to the BraTS 2018 survival prediction challenge is an ensemble of SVCs, which reached a cross-validated accuracy of 72.2% on the BraTS 2018 training set, 57.1% on the validation set, and 42.9% on the testing set. The results suggest that more training data is necessary for a stable performance of a CNN model for direct regression from magnetic resonance images, and that non-imaging clinical patient information is crucial along with imaging information.

Paper Link

## Focusing on the Big Picture: Insights into a Systems Approach to Deep Learning for Satellite Imagery

Deep learning tasks are often complicated and require a variety of components working together efficiently to perform well. Due to the often large scale of these tasks, there is a necessity to iterate quickly in order to attempt a variety of methods and to find and fix bugs. While participating in IARPA’s Functional Map of the World challenge, we identified challenges along the entire deep learning pipeline and found various solutions to these challenges. In this paper, we present the performance, engineering, and deep learning considerations with processing and modeling data, as well as underlying infrastructure considerations that support large-scale deep learning tasks. We also discuss insights and observations with regard to satellite imagery and deep learning for image classification.

Paper Link

## A Perceptual Prediction Framework for Self Supervised Event Segmentation

Temporal segmentation of long videos is an important problem, that has largely been tackled through supervised learning, often requiring large amounts of annotated training data. In this paper, we tackle the problem of self-supervised temporal segmentation of long videos that alleviate the need for any supervision. We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visually complex videos into individual, stable segments that share the same semantics. We also introduce a new adaptive learning paradigm that helps reduce the effect of catastrophic forgetting in recurrent neural networks. Extensive experiments on three publicly available datasets – Breakfast Actions, 50 Salads, and INRIA Instructional Videos datasets show the efficacy of the proposed approach. We show that the proposed approach is able to outperform weakly-supervised and other unsupervised learning approaches by up to 24% and have competitive performance compared to fully supervised approaches. We also show that the proposed approach is able to learn highly discriminative features that help improve action recognition when used in a representation learning paradigm.

Paper Link

## A Framework of Transfer Learning in Object Detection for Embedded Systems

Transfer learning is one of the subjects undergoing intense study in the area of machine learning. In object recognition and object detection there are known experiments for the transferability of parameters, but not for neural networks which are suitable for object-detection in real time embedded applications, such as the SqueezeDet neural network. We use transfer learning to accelerate the training of SqueezeDet to a new group of classes. Also, experiments are conducted to study the transferability and co-adaptation phenomena introduced by the transfer learning process. To accelerate training, we propose a new implementation of the SqueezeDet training which provides a faster pipeline for data processing and achieves $1.8$ times speedup compared to the initial implementation. Finally, we created a mechanism for automatic hyperparamer optimization using an empirical method.

Paper Link

## Generative Dual Adversarial Network for Generalized Zero-shot Learning

This paper studies the problem of generalized zero-shot learning which requires the model to train on image-label pairs from some seen classes and test on the task of classifying new images from both seen and unseen classes. Most previous models try to learn a fixed one-directional mapping between visual and semantic space, while some recently proposed generative methods try to generate image features for unseen classes so that the zero-shot learning problem becomes a traditional fully-supervised classification problem. In this paper, we propose a novel model that provides a unified framework for three different approaches: visual-> semantic mapping, semantic->visual mapping, and metric learning. Specifically, our proposed model consists of a feature generator that can generate various visual features given class embeddings as input, a regressor that maps each visual feature back to its corresponding class embedding, and a discriminator that learns to evaluate the closeness of an image feature and a class embedding. All three components are trained under the combination of cyclic consistency loss and dual adversarial loss. Experimental results show that our model not only preserves higher accuracy in classifying images from seen classes, but also performs better than existing state-of-the-art models in in classifying images from unseen classes.

Paper Link

## Subsequent Boundary Distance Regression and Pixelwise Classification Networks for Automatic Kidney Segmentation in Ultrasound Images

It remains challenging to automatically segment kidneys in clinical ultrasound (US) images due to the kidneys’ varied shapes and image intensity distributions, although semi-automatic methods have achieved promising performance. In this study, we propose subsequent boundary distance regression and pixel classification networks to segment the kidneys, informed by the fact that the kidney boundaries have relatively homogenous texture patterns across images. Particularly, we first use deep neural networks pre-trained for classification of natural images to extract high-level image features from US images, then these features are used as input to learn kidney boundary distance maps using a boundary distance regression network, and finally the predicted boundary distance maps are classified as kidney pixels or non-kidney pixels using a pixel classification network in an end-to-end learning fashion. We also proposed a novel data-augmentation method based on kidney shape registration to generate enriched training data from a small number of US images with manually segmented kidney labels. Experimental results have demonstrated that our method could effectively improve the performance of automatic kidney segmentation, significantly better than deep learning based pixel classification networks.

Paper Link

## Scene Parsing via Dense Recurrent Neural Networks with Attentional Selection

Recurrent neural networks (RNNs) have shown the ability to improve scene parsing through capturing long-range dependencies among image units. In this paper, we propose dense RNNs for scene labeling by exploring various long-range semantic dependencies among image units. Different from existing RNN based approaches, our dense RNNs are able to capture richer contextual dependencies for each image unit by enabling immediate connections between each pair of image units, which significantly enhances their discriminative power. Besides, to select relevant dependencies and meanwhile to restrain irrelevant ones for each unit from dense connections, we introduce an attention model into dense RNNs. The attention model allows automatically assigning more importance to helpful dependencies while less weight to unconcerned dependencies. Integrating with convolutional neural networks (CNNs), we develop an end-to-end scene labeling system. Extensive experiments on three large-scale benchmarks demonstrate that the proposed approach can improve the baselines by large margins and outperform other state-of-the-art algorithms.

Paper Link

## Adaptive Target Recognition: A Case Study Involving Airport Baggage Screening

This work addresses the question whether it is possible to design a computer-vision based automatic threat recognition (ATR) system so that it can adapt to changing specifications of a threat without having to create a new ATR each time. The changes in threat specifications, which may be warranted by intelligence reports and world events, are typically regarding the physical characteristics of what constitutes a threat: its material composition, its shape, its method of concealment, etc. Here we present our design of an AATR system (Adaptive ATR) that can adapt to changing specifications in materials characterization (meaning density, as measured by its x-ray attenuation coefficient), its mass, and its thickness. Our design uses a two-stage cascaded approach, in which the first stage is characterized by a high recall rate over the entire range of possibilities for the threat parameters that are allowed to change. The purpose of the second stage is to then fine-tune the performance of the overall system for the current threat specifications. The computational effort for this fine-tuning for achieving a desired PD/PFA rate is far less than what it would take to create a new classifier with the same overall performance for the new set of threat specifications.

Paper Link

## Learning Segmentation Masks with the Independence Prior

An instance with a bad mask might make a composite image that uses it look fake. This encourages us to learn segmentation by generating realistic composite images. To achieve this, we propose a novel framework that exploits a new proposed prior called the independence prior based on Generative Adversarial Networks (GANs). The generator produces an image with multiple category-specific instance providers, a layout module and a composition module. Firstly, each provider independently outputs a category-specific instance image with a soft mask. Then the provided instances’ poses are corrected by the layout module. Lastly, the composition module combines these instances into a final image. Training with adversarial loss and penalty for mask area, each provider learns a mask that is as small as possible but enough to cover a complete category-specific instance. Weakly supervised semantic segmentation methods widely use grouping cues modeling the association between image parts, which are either artificially designed or learned with costly segmentation labels or only modeled on local pairs. Unlike them, our method automatically models the dependence between any parts and learns instance segmentation. We apply our framework in two cases: (1) Foreground segmentation on category-specific images with box-level annotation. (2) Unsupervised learning of instance appearances and masks with only one image of homogeneous object cluster (HOC). We get appealing results in both tasks, which shows the independence prior is useful for instance segmentation and it is possible to unsupervisedly learn instance masks with only one image.

Paper Link

## Towards Adversarial Denoising of Radar Micro-Doppler Signatures

Generative Adversarial Networks (GANs) are considered the state-of-the-art in the field of image generation. They learn the joint distribution of the training data and attempt to generate new data samples in high dimensional space following the same distribution as the input. Recent improvements in GANs opened the field to many other computer vision applications based on improving and changing the characteristics of the input image to follow some given training requirements. In this paper, we propose a novel technique for the denoising and reconstruction of the micro-Doppler ($\\boldsymbol\u03bc$-D) spectra of walking humans based on GANs. Two sets of experiments were collected on 22 subjects walking on a treadmill at an intermediate velocity using a \\unit[25]{GHz} CW radar. In one set, a clean $\\boldsymbol\u03bc$-D spectrum is collected for each subject by placing the radar at a close distance to the subject. In the other set, variations are introduced in the experiment setup to introduce different noise and clutter effects on the spectrum by changing the distance and placing reflective objects between the radar and the target. Synthetic paired noisy and noise-free spectra were used for training, while validation was carried out on the real noisy measured data. Finally, qualitative and quantitative comparison with other classical radar denoising approaches in the literature demonstrated the proposed GANs framework is better and more robust to different noise levels.

Paper Link

## Proprties of biclustering algorithms and a novel biclustering technique based on relative density

Biclustering is found to be useful in areas like data mining and bioinformatics. The term biclustering involves searching subsets of observations and features forming coherent structure. This can be interpreted in different ways like spatial closeness, relation between features for selected observations etc. This paper discusses different properties, objectives and approaches of biclustering algorithms. We also present an algorithm which detects feature relation based biclusters using density based techniques. Here we use relative density of regions to identify biclusters embedded in the data. Properties of this algorithm are discussed and demonstrated using artificial datasets. Proposed method is seen to give better results on these datasets using paired right tailed t test. Usefulness of proposed method is also demonstrated using real life datasets.

Paper Link

## Hallucinating very low-resolution and obscured face images

Most of the face hallucination methods are often designed for complete inputs. They cannot work well if the inputs are very tiny and contaminated by large occlusion. Inspired by this fact, we propose a novel obscured face hallucination network(OFHNet) using deep learning. Our OFHNet consists of four parts: an inpainting network, an upsampling network, a discriminative network, and a fixed facial landmark detection network. Firstly, we use the inpainting network restores the low-resolution(LR) obscured face images. Secondly, we employ the upsampling network to upsample outputs of the inpainting network. To ensure the generated high-resolution(HR) face images more photo-realistic, we use the discriminative network and the facial landmark detection network to help upsampling network. In addition, we present a new semantic structure loss, which can make the generated HR face images more pleasing. Extensive experiments show that our framework can restore the appealing HR face images from 1/4 missing area LR face images with a challenging scaling factor of 8x.

Paper Link

## Parameterized Synthetic Image Data Set for Fisheye Lens

Based on different projection geometry, a fisheye image can be presented as a parameterized non-rectilinear image. Deep neural networks(DNN) is one of the solutions to extract parameters for fisheye image feature description. However, a large number of images are required for training a reasonable prediction model for DNN. In this paper, we propose to extend the scale of the training dataset using parameterized synthetic images. It effectively boosts the diversity of images and avoids the data scale limitation. To simulate different viewing angles and distances, we adopt controllable parameterized projection processes on transformation. The reliability of the proposed method is proved by testing images captured by our fisheye camera. The synthetic dataset is the first dataset that is able to extend to a big scale labeled fisheye image dataset. It is accessible via: this http URL.

Paper Link

## Depth Image Upsampling based on Guided Filter with Low Gradient Minimization

In this paper, we present a novel upsampling framework to enhance the spatial resolution of the depth image. In our framework, the upscaling of a low-resolution depth image is guided by a corresponding intensity images, we formulate it as a cost aggregation problem with the guided filter. However, the guided filter does not make full use of the properties of the depth image. Since depth images have quite sparse gradients, it inspires us to regularize the gradients for improving depth upscaling results. Statistics show a special property of depth images, that is, there is a non-ignorable part of pixels whose horizontal or vertical derivatives are equal to $\\pm 1$. Considering this special property, we propose a low gradient regularization method which reduces the penalty for horizontal or vertical derivative $\\pm1$. The proposed low gradient regularization is integrated with the guided filter into the depth image upsampling method. Experimental results demonstrate the effectiveness of our proposed approach both qualitatively and quantitatively compared with the state-of-the-art methods.

Paper Link

## Matrix Product Operator Restricted Boltzmann Machines

A restricted Boltzmann machine (RBM) learns a probability distribution over its input samples and has numerous uses like dimensionality reduction, classification and generative modeling. Conventional RBMs accept vectorized data that dismisses potentially important structural information in the original tensor (multi-way) input. Matrix-variate and tensor-variate RBMs, named MvRBM and TvRBM, have been proposed but are all restrictive by model construction, which leads to a weak model expression power. This work presents the matrix product operator RBM (MPORBM) that utilizes a tensor network generalization of Mv/TvRBM, preserves input formats in both the visible and hidden layers, and results in higher expressive power. A novel training algorithm integrating contrastive divergence and an alternating optimization procedure is also developed. Numerical experiments compare the MPORBM with the traditional RBM and MvRBM for data classification and image completion and denoising tasks. The expressive power of the MPORBM as a function of the MPO-rank is also investigated.

Paper Link

## Learning The Invisible: A Hybrid Deep Learning-Shearlet Framework for Limited Angle Computed Tomography

The high complexity of various inverse problems poses a significant challenge to model-based reconstruction schemes, which in such situations often reach their limits. At the same time, we witness an exceptional success of data-based methodologies such as deep learning. However, in the context of inverse problems, deep neural networks mostly act as black box routines, used for instance for a somewhat unspecified removal of artifacts in classical image reconstructions. In this paper, we will focus on the severely ill-posed inverse problem of limited angle computed tomography, in which entire boundary sections are not captured in the measurements. We will develop a hybrid reconstruction framework that fuses model-based sparse regularization with data-driven deep learning. Our method is reliable in the sense that we only learn the part that can provably not be handled by model-based methods, while applying the theoretically controllable sparse regularization technique to the remaining parts. Such a decomposition into visible and invisible segments is achieved by means of the shearlet transform that allows to resolve wavefront sets in the phase space. Furthermore, this split enables us to assign the clear task of inferring unknown shearlet coefficients to the neural network and thereby offering an interpretation of its performance in the context of limited angle computed tomography. Our numerical experiments show that our algorithm significantly surpasses both pure model- and more data-based reconstruction methods.

Paper Link

## Holistic Multi-modal Memory Network for Movie Question Answering

Answering questions according to multi-modal context is a challenging problem as it requires a deep integration of different data sources. Existing approaches only employ partial interactions among data sources in one attention hop. In this paper, we present the Holistic Multi-modal Memory Network (HMMN) framework which fully considers the interactions between different input sources (multi-modal context, question) in each hop. In addition, it takes answer choices into consideration during the context retrieval stage. Therefore, the proposed framework effectively integrates multi-modal context, question, and answer information, which leads to more informative context retrieved for question answering. Our HMMN framework achieves state-of-the-art accuracy on MovieQA dataset. Extensive ablation studies show the importance of holistic reasoning and contributions of different attention strategies.

Paper Link

## Visual Saliency Maps Can Apply to Facial Expression Recognition

Human eyes concentrate different facial regions during distinct cognitive activities. We study utilising facial visual saliency maps to classify different facial expressions into different emotions. Our results show that our novel method of merely using facial saliency maps can achieve a descent accuracy of 65\\%, much higher than the chance level of $1/7$. Furthermore, our approach is of semi-supervision, i.e., our facial saliency maps are generated from a general saliency prediction algorithm that is not explicitly designed for face images. We also discovered that the classification accuracies of each emotional class using saliency maps demonstrate a strong positive correlation with the accuracies produced by face images. Our work implies that humans may look at different facial areas in order to perceive different emotions.

Paper Link

## Variational Community Partition with Novel Network Structure Centrality Prior

In this paper, we proposed a novel two-stage optimization method for network community partition, which is based on inherent network structure information. The introduced optimization approach utilizes the new network centrality measure of both links and vertices to construct the key affinity description of the given network, where the direct similarities between graph nodes or nodal features are not available to obtain the classical affinity matrix. Indeed, such calculated network centrality information presents the essential structure of network, hence, the proper measure for detecting network communities, which also introduces a `confidence’ criterion for referencing new labeled benchmark nodes. For the resulted challenging combinatorial optimization problem of graph clustering, the proposed optimization method iteratively employs an efficient convex optimization algorithm which is developed based under a new variational perspective of primal and dual. Experiments over both artificial and real-world network datasets demonstrate that the proposed optimization strategy of community detection significantly improves result accuracy and outperforms the state-of-the-art algorithms in terms of accuracy and reliability.

Paper Link

## Identification of Internal Faults in Indirect Symmetrical Phase Shift Transformers Using Ensemble Learning

This paper proposes methods to identify 40 different types of internal faults in an Indirect Symmetrical Phase Shift Transformer (ISPST). The ISPST was modeled using Power System Computer Aided Design (PSCAD)/ Electromagnetic Transients including DC (EMTDC). The internal faults were simulated by varying the transformer tapping, backward and forward phase shifts, loading, and percentage of winding faulted. Data for 960 cases of each type of fault was recorded. A series of features were extracted for a, b, and c phases from time, frequency, time-frequency, and information theory domains. The importance of the extracted features was evaluated through univariate tests which helped to reduce the number of features. The selected features were then used for training five state-of-the-art machine learning classifiers. Extremely Random Trees and Random Forest, the ensemble-based learners, achieved the accuracy of 98.76% and 97.54% respectively outperforming Multilayer Perceptron (96.13%), Logistic Regression (93.54%), and Support Vector Machines (92.60%)

Paper Link

## Road Damage Detection And Classification In Smartphone Captured Images Using Mask R-CNN

This paper summarizes the design, experiments and results of our solution to the Road Damage Detection and Classification Challenge held as part of the 2018 IEEE International Conference On Big Data Cup. Automatic detection and classification of damage in roads is an essential problem for multiple applications like maintenance and autonomous driving. We demonstrate that convolutional neural net based instance detection and classfication approaches can be used to solve this problem. In particular we show that Mask-RCNN, one of the state-of-the-art algorithms for object detection, localization and instance segmentation of natural images, can be used to perform this task in a fast manner with effective results. We achieve a mean F1 score of 0.528 at an IoU of 50% on the task of detection and classification of different types of damages in real-world road images acquired using a smartphone camera and our average inference time for each image is 0.105 seconds on an NVIDIA GeForce 1080Ti graphic card. The code and saved models for our approach can be found here : this https URL Submission

Paper Link

## M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network

Feature pyramids are widely exploited by both the state-of-the-art one-stage object detectors (\\textit{e.g.}, DSSD, RetinaNet, RefineDet) and the two-stage object detectors (\\textit{e.g.}, Mask R-CNN, DetNet) to alleviate the problem arising from scale variation across object instances. Although these object detectors with feature pyramids achieve encouraging results, they have some limitations due to that they only simply construct the feature pyramid according to the inherent multi-scale, pyramidal architecture of the backbones which are actually designed for object classification task. Newly, in this work, we present a method called Multi-Level Feature Pyramid Network (MLFPN) to construct more effective feature pyramids for detecting objects of different scales. First, we fuse multi-level features (\\textit{i.e.} multiple layers) extracted by backbone as the base feature. Second, we feed the base feature into a block of alternating joint Thinned U-shape Modules and Feature Fusion Modules and exploit the decoder layers of each u-shape module as the features for detecting objects. Finally, we gather up the decoder layers with equivalent scales (sizes) to develop a feature pyramid for object detection, in which every feature map consists of the layers (features) from multiple levels. To evaluate the effectiveness of the proposed MLFPN, we design and train a powerful end-to-end one-stage object detector we call M2Det by integrating it into the architecture of SSD, which gets better detection performance than state-of-the-art one-stage detectors. Specifically, on MS-COCO benchmark, M2Det achieves AP of 41.0 at speed of 11.8 FPS with single-scale inference strategy and AP of 44.2 with multi-scale inference strategy, which is the new state-of-the-art results among one-stage detectors. The code will be made available on \\url{this https URL.

Paper Link

## An Interpretable Generative Model for Handwritten Digit Image Synthesis

An interpretable generative model for handwritten digits synthesis is proposed in this work. Modern image generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), are trained by backpropagation (BP). The training process is complex and the underlying mechanism is difficult to explain. We propose an interpretable multi-stage PCA method to achieve the same goal and use handwritten digit images synthesis as an illustrative example. First, we derive principal-component-analysis-based (PCA-based) transform kernels at each stage based on the covariance of its inputs. This results in a sequence of transforms that convert input images of correlated pixels to spectral vectors of uncorrelated components. In other words, it is a whitening process. Then, we can synthesize an image based on random vectors and multi-stage transform kernels through a coloring process. The generative model is a feedforward (FF) design since no BP is used in model parameter determination. Its design complexity is significantly lower, and the whole design process is explainable. Finally, we design an FF generative model using the MNIST dataset, compare synthesis results with those obtained by state-of-the-art GAN and VAE methods, and show that the proposed generative model achieves comparable performance.

Paper Link

## Multiple Subspace Alignment Improves Domain Adaptation

We present a novel unsupervised domain adaptation (DA) method for cross-domain visual recognition. Though subspace methods have found success in DA, their performance is often limited due to the assumption of approximating an entire dataset using a single low-dimensional subspace. Instead, we develop a method to effectively represent the source and target datasets via a collection of low-dimensional subspaces, and subsequently align them by exploiting the natural geometry of the space of subspaces, on the Grassmann manifold. We demonstrate the effectiveness of this approach, using empirical studies on two widely used benchmarks, with state of the art domain adaptation performance

Paper Link

## Deep Learning Framework for Pedestrian Collision Avoidance System (PeCAS)

Drowsy driving is a major cause of on-road accidents in the US, which sometimes is fatal to unsuspecting pedestrians. This framework based on Deep Learning proposes an approach to detect the onset of drowsiness in a vehicle operator especially alerting the driver when in the proximity of a pedestrian. Using Convolutional Neural Network (CNN), an approach is proposed to detect drowsiness based on the Viola-Jones algorithm. The pedestrian detector is also based on a deep CNN architecture and is capable to detect multiple pedestrians. In the end, an integration of the output from the two architectures is fed into an Arduino hardware kit to generate warnings for the vehicle operator.

Paper Link

## A Progressively-trained Scale-invariant and Boundary-aware Deep Neural Network for the Automatic 3D Segmentation of Lung Lesions

Volumetric segmentation of lesions on CT scans is important for many types of analysis, including lesion growth kinetic modeling in clinical trials and machine learning of radiomic features. Manual segmentation is laborious, and impractical for large-scale use. For routine clinical use, and in clinical trials that apply the Response Evaluation Criteria In Solid Tumors (RECIST), clinicians typically outline the boundaries of a lesion on a single slice to extract diameter measurements. In this work, we have collected a large-scale database, named LesionVis, with pixel-wise manual 2D lesion delineations on the RECIST-slices. To extend the 2D segmentations to 3D, we propose a volumetric progressive lesion segmentation (PLS) algorithm to automatically segment the 3D lesion volume from 2D delineations using a scale-invariant and boundary-aware deep convolutional network (SIBA-Net). The SIBA-Net copes with the size transition of a lesion when the PLS progresses from the RECIST-slice to the edge-slices, as well as when performing longitudinal assessment of lesions whose size change over multiple time points. The proposed PLS-SiBA-Net (P-SiBA) approach is assessed on the lung lesion cases from LesionVis. Our experimental results demonstrate that the P-SiBA approach achieves mean Dice similarity coefficients (DSC) of 0.81, which significantly improves 3D segmentation accuracy compared with the approaches proposed previously (highest mean DSC at 0.78 on LesionVis). In summary, by leveraging the limited 2D delineations on the RECIST-slices, P-SiBA is an effective semi-supervised approach to produce accurate lesion segmentations in 3D.

Paper Link

## HSD-CNN: Hierarchically self decomposing CNN architecture using class specific filter sensitivity analysis

Conventional Convolutional neural networks (CNN) are trained on large domain datasets, and are hence typically over-represented and inefficient in limited class applications. An efficient way to convert such large many-class pre-trained networks into small few-class networks is through a hierarchical decomposition of its feature maps. To alleviate this issue, we propose an automated framework for such decomposition in Hierarchically Self Decomposing CNNs (HSD-CNN), in four steps. HSD-CNNs are derived automatically using a class specific filter sensitivity analysis that quantifies the impact of specific features on a class prediction. The decomposed and hierarchical network can be utilized and deployed directly to obtain sub-networks for subset of classes, and it is shown to perform better without the requirement of retraining these sub-networks. Experimental results show that HSD-CNNs generally do not degrade accuracy if the full set of classes are used. However, when operating on known subsets of classes, HSD-CNNs lead to an increased accuracy using a much smaller model size, requiring much less operations. HSD-CNN flow is verified on the CIFAR10, CIFAR100 and CALTECH101 data sets. We report accuracies up to $85.6\\%$ ( $94.75\\%$ ) on scenarios with 13 ( 4 ) classes of CIFAR100, using a VGG-16 network pretrained on the full data set. In this case, the used HSD-CNN requires $3.97 \\times$ fewer parameters and $3.56 \\times$ fewer operations than the VGG-16 baseline containing features for all 100 classes.

Paper Link

## Integrating Multiple Receptive Fields through Grouped Active Convolution

Convolutional networks have achieved great success in various vision tasks. This is mainly due to a considerable amount of research on network structure. In this study, instead of focusing on architectures, we focused on the convolution unit itself. The existing convolution unit has a fixed shape, and is limited to observing restricted receptive fields. In an earlier work, we proposed the active convolution unit (ACU), which can freely define its shape and learn by itself. In this paper, we propose a detailed analysis of the proposed unit and show that it is an efficient representation of a sparse weight convolution. Furthermore, we expand the unit to a grouped ACU, which can observe multiple receptive fields in one layer. We found that the performance of a naive grouped convolution is degraded by increasing the number of groups; however, the proposed unit retains the accuracy even though the number of parameters reduces. Based on this result, we suggest a depthwise ACU, and various experiments have shown that our unit is efficient and can replace the existing convolutions.

Paper Link

## Fashion and Apparel Classification using Convolutional Neural Networks

We present an empirical study of applying deep Convolutional Neural Networks (CNN) to the task of fashion and apparel image classification to improve meta-data enrichment of e-commerce applications. Five different CNN architectures were analyzed using clean and pre-trained models. The models were evaluated in three different tasks person detection, product and gender classification, on two small and large scale datasets.

Paper Link

## Improved Visual Relocalization by Discovering Anchor Points

We address the visual relocalization problem of predicting the location and camera orientation or pose (6DOF) of the given input scene. We propose a method based on how humans determine their location using the visible landmarks. We define anchor points uniformly across the route map and propose a deep learning architecture which predicts the most relevant anchor point present in the scene as well as the relative offsets with respect to it. The relevant anchor point need not be the nearest anchor point to the ground truth location, as it might not be visible due to the pose. Hence we propose a multi task loss function, which discovers the relevant anchor point, without needing the ground truth for it. We validate the effectiveness of our approach by experimenting on CambridgeLandmarks (large scale outdoor scenes) as well as 7 Scenes (indoor scenes) using variousCNN feature extractors. Our method improves the median error in indoor as well as outdoor localization datasets compared to the previous best deep learning model known as PoseNet (with geometric re-projection loss) using the same feature extractor. We improve the median error in localization in the specific case of Street scene, by over 8m.

Paper Link

## Neural Generative Models for 3D Faces with Application in 3D Texture Free Face Recognition

Using heterogeneous depth cameras and 3D scanners in 3D face verification causes variations in the resolution of the 3D point clouds. To solve this issue, previous studies use 3D registration techniques. Out of these proposed techniques, detecting points of correspondence is proven to be an efficient method given that the data belongs to the same individual. However, if the data belongs to different persons, the registration algorithms can convert the 3D point cloud of one person to another, destroying the distinguishing features between the two point clouds. Another issue regarding the storage size of the point clouds. That is, if the captured depth image contains around 50 thousand points in the cloud for a single pose for one individual, then the storage size of the entire dataset will be in order of giga if not tera bytes. With these motivations, this work introduces a new technique for 3D point clouds generation using a neural modeling system to handle the differences caused by heterogeneous depth cameras, and to generate a new face canonical compact representation. The proposed system reduces the stored 3D dataset size, and if required, provides an accurate dataset regeneration. Furthermore, the system generates neural models for all gallery point clouds and stores these models to represent the faces in the recognition or verification processes. For the probe cloud to be verified, a new model is generated specifically for that particular cloud and is matched against pre-stored gallery model presentations to identify the query cloud. This work also introduces the utilization of Siamese deep neural network in 3D face verification using generated model representations as raw data for the deep network, and shows that the accuracy of the trained network is comparable all published results on Bosphorus dataset.

Paper Link

## Deep Face Quality Assessment

Face image quality is an important factor in facial recognition systems as its verification and recognition accuracy is highly dependent on the quality of image presented. Rejecting low quality images can significantly increase the accuracy of any facial recognition system. In this project, a simple approach is presented to train a deep convolutional neural network to perform end-to-end face image quality assessment. The work is done in 2 stages : First, generation of quality score label and secondly, training a deep convolutional neural network in a supervised manner to predict quality score between 0 and 1. The generation of quality labels is done by comparing the face image with a template of best quality images and then evaluating the normalized score based on the similarity.

Paper Link

## Automatic Brain Structures Segmentation Using Deep Residual Dilated U-Net

Brain image segmentation is used for visualizing and quantifying anatomical structures of the brain. We present an automated ap-proach using 2D deep residual dilated networks which captures rich context information of different tissues for the segmentation of eight brain structures. The proposed system was evaluated in the MICCAI Brain Segmentation Challenge and ranked 9th out of 22 teams. We further compared the method with traditional U-Net using leave-one-subject-out cross-validation setting on the public dataset. Experimental results shows that the proposed method outperforms traditional U-Net (i.e. 80.9% vs 78.3% in averaged Dice score, 4.35mm vs 11.59mm in averaged robust Hausdorff distance) and is computationally efficient.

Paper Link

## Multi-label Object Attribute Classification using a Convolutional Neural Network

Objects of different classes can be described using a limited number of attributes such as color, shape, pattern, and texture. Learning to detect object attributes instead of only detecting objects can be helpful in dealing with a priori unknown objects. With this inspiration, a deep convolutional neural network for low-level object attribute classification, called the Deep Attribute Network (DAN), is proposed. Since object features are implicitly learned by object recognition networks, one such existing network is modified and fine-tuned for developing DAN. The performance of DAN is evaluated on the ImageNet Attribute and a-Pascal datasets. Experiments show that in comparison with state-of-the-art methods, the proposed model achieves better results.

Paper Link

## Coronary Calcium Detection using 3D Attention Identical Dual Deep Network Based on Weakly Supervised Learning

Coronary artery calcium (CAC) is biomarker of advanced subclinical coronary artery disease and predicts myocardial infarction and death prior to age 60 years. The slice-wise manual delineation has been regarded as the gold standard of coronary calcium detection. However, manual efforts are time and resource consuming and even impracticable to be applied on large-scale cohorts. In this paper, we propose the attention identical dual network (AID-Net) to perform CAC detection using scan-rescan longitudinal non-contrast CT scans with weakly supervised attention by only using per scan level labels. To leverage the performance, 3D attention mechanisms were integrated into the AID-Net to provide complementary information for classification tasks. Moreover, the 3D Gradient-weighted Class Activation Mapping (Grad-CAM) was also proposed at the testing stage to interpret the behaviors of the deep neural network. 5075 non-contrast chest CT scans were used as training, validation and testing datasets. Baseline performance was assessed on the same cohort. From the results, the proposed AID-Net achieved the superior performance on classification accuracy (0.9272) and AUC (0.9627).

Paper Link

## The method of multimodal MRI brain image segmentation based on differential geometric features

Accurate segmentation of brain tissue in magnetic resonance images (MRI) is a diffcult task due to different types of brain abnormalities. Using information and features from multimodal MRI including T1, T1-weighted inversion recovery (T1-IR) and T2-FLAIR and differential geometric features including the Jacobian determinant(JD) and the curl vector(CV) derived from T1 modality can result in a more accurate analysis of brain images. In this paper, we use the differential geometric information including JD and CV as image characteristics to measure the differences between different MRI images, which represent local size changes and local rotations of the brain image, and we can use them as one CNN channel with other three modalities (T1-weighted, T1-IR and T2-FLAIR) to get more accurate results of brain segmentation. We test this method on two datasets including IBSR dataset and MRBrainS datasets based on the deep voxelwise residual network, namely VoxResNet, and obtain excellent improvement over single modality or three modalities and increases average DSC(Cerebrospinal Fluid (CSF), Gray Matter (GM) and White Matter (WM)) by about 1.5% on the well-known MRBrainS18 dataset and about 2.5% on the IBSR dataset. Moreover, we discuss that one modality combined with its JD or CV information can replace the segmentation effect of three modalities, which can provide medical conveniences for doctor to diagnose because only to extract T1-modality MRI image of patients. Finally, we also compare the segmentation performance of our method in two networks, VoxResNet and U-Net network. The results show VoxResNet has a better performance than U-Net network with our method in brain MRI segmentation. We believe the proposed method can advance the performance in brain segmentation and clinical diagnosis.

Paper Link

## Scene Text Detection and Recognition: The Deep Learning Era

With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has been inescapably influenced by this wave of revolution, consequentially entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, approach and performance. This survey is aimed at summarizing and analyzing the major changes and significant progresses of scene text detection and recognition in the deep learning era. Through this article, we devote to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends. Specifically, we will emphasize the dramatic differences brought by deep learning and the grand challenges still remained. We expect that this review paper would serve as a reference book for researchers in this field. Related resources are also collected and compiled in our Github repository: this https URL.

Paper Link

## Detecting Work Zones in SHRP 2 NDS Videos Using Deep Learning Based Computer Vision

Naturalistic driving studies seek to perform the observations of human driver behavior in the variety of environmental conditions necessary to analyze, understand and predict that behavior using statistical and physical models. The second Strategic Highway Research Program (SHRP 2) funds a number of transportation safety-related projects including its primary effort, the Naturalistic Driving Study (NDS), and an effort supplementary to the NDS, the Roadway Information Database (RID). This work seeks to expand the range of answerable research questions that researchers might pose to the NDS and RID databases. Specifically, we present the SHRP 2 NDS Video Analytics (SNVA) software application, which extracts information from NDS-instrumented vehicles’ forward-facing camera footage and efficiently integrates that information into the RID, tying the video content to geolocations and other trip attributes. Of particular interest to researchers and other stakeholders is the integration of work zone, traffic signal state and weather information. The version of SNVA introduced in this paper focuses on work zone detection, the highest priority. The ability to automate the discovery and cataloging of this information, and to do so quickly, is especially important given the two petabyte (2PB) size of the NDS video data set.

Paper Link

## Deep Learning Approach for Building Detection in Satellite Multispectral Imagery

Building detection from satellite multispectral imagery data is being a fundamental but a challenging problem mainly because it requires correct recovery of building footprints from high-resolution images. In this work, we propose a deep learning approach for building detection by applying numerous enhancements throughout the process. Initial dataset is preprocessed by 2-sigma percentile normalization. Then data preparation includes ensemble modelling where 3 models were created while incorporating OpenStreetMap data. Binary Distance Transformation (BDT) is used for improving data labeling process and the U-Net (Convolutional Networks for Biomedical Image Segmentation) is modified by adding batch normalization wrappers. Afterwards, it is explained how each component of our approach is correlated with the final detection accuracy. Finally, we compare our results with winning solutions of SpaceNet 2 competition for real satellite multispectral images of Vegas, Paris, Shanghai and Khartoum, demonstrating the importance of our solution for achieving higher building detection accuracy.

Paper Link

## Breast Cancer Classification from Histopathological Images with Inception Recurrent Residual Convolutional Neural Network

The Deep Convolutional Neural Network (DCNN) is one of the most powerful and successful deep learning approaches. DCNNs have already provided superior performance in different modalities of medical imaging including breast cancer classification, segmentation, and detection. Breast cancer is one of the most common and dangerous cancers impacting women worldwide. In this paper, we have proposed a method for breast cancer classification with the Inception Recurrent Residual Convolutional Neural Network (IRRCNN) model. The IRRCNN is a powerful DCNN model that combines the strength of the Inception Network (Inception-v4), the Residual Network (ResNet), and the Recurrent Convolutional Neural Network (RCNN). The IRRCNN shows superior performance against equivalent Inception Networks, Residual Networks, and RCNNs for object recognition tasks. In this paper, the IRRCNN approach is applied for breast cancer classification on two publicly available datasets including BreakHis and Breast Cancer Classification Challenge 2015. The experimental results are compared against the existing machine learning and deep learning-based approaches with respect to image-based, patch-based, image-level, and patient-level classification. The IRRCNN model provides superior classification performance in terms of sensitivity, Area Under the Curve (AUC), the ROC curve, and global accuracy compared to existing approaches for both datasets.

Paper Link

## Near Real-Time Data Labeling Using a Depth Sensor for EMG Based Prosthetic Arms

Recognizing sEMG (Surface Electromyography) signals belonging to a particular action (e.g., lateral arm raise) automatically is a challenging task as EMG signals themselves have a lot of variation even for the same action due to several factors. To overcome this issue, there should be a proper separation which indicates similar patterns repetitively for a particular action in raw signals. A repetitive pattern is not always matched because the same action can be carried out with different time duration. Thus, a depth sensor (Kinect) was used for pattern identification where three joint angles were recording continuously which is clearly separable for a particular action while recording sEMG signals. To Segment out a repetitive pattern in angle data, MDTW (Moving Dynamic Time Warping) approach is introduced. This technique is allowed to retrieve suspected motion of interest from raw signals. MDTW based on DTW algorithm, but it will be moving through the whole dataset in a pre-defined manner which is capable of picking up almost all the suspected segments inside a given dataset an optimal way. Elevated bicep curl and lateral arm raise movements are taken as motions of interest to show how the proposed technique can be employed to achieve auto identification and labelling. The full implementation is available at this https URL

Paper Link

## Skeleton-Based Action Recognition with Synchronous Local and Non-local Spatio-temporal Learning and Frequency Attention

Benefiting from its succinctness and robustness, skeleton-based human action recognition has recently attracted much attention. Most existing methods utilize local networks, such as recurrent networks, convolutional neural networks, and graph convolutional networks, to extract spatio-temporal dynamics hierarchically. As a consequence, the local and non-local dependencies, which respectively contain more details and semantics, are asynchronously captured in different level of layers. Moreover, limited to the spatio-temporal domain, these methods ignored patterns in the frequency domain. To better extract information from multi-domains, we propose a residual frequency attention (rFA) to focus on discriminative patterns in the frequency domain, and a synchronous local and non-local (SLnL) block to simultaneously capture the details and semantics in the spatio-temporal domain. To optimize the whole process, we also propose a soft-margin focal loss (SMFL), which can automatically conducts adaptive data selection and encourages intrinsic margins in classifiers. Extensive experiments are performed on several large-scale action recognition datasets and our approach significantly outperforms other state-of-the-art methods.

Paper Link

## Innovative 3D Depth Map Generation From A Holoscopic 3D Image Based on Graph Cut Technique

Holoscopic 3D imaging is a promising technique for capturing full colour spatial 3D images using a single aperture holoscopic 3D camera. It mimics fly’s eye technique with a microlens array, which views the scene at a slightly different angle to its adjacent lens that records three dimensional information onto a two dimensional surface. This paper proposes a method of depth map generation from a holoscopic 3D image based on graph cut technique. The principal objective of this study is to estimate the depth information presented in a holoscopic 3D image with high precision. As such, depth map extraction is measured from a single still holoscopic 3D image which consists of multiple viewpoint images. The viewpoints are extracted and utilised for disparity calculation via disparity space image technique and pixels displacement is measured with sub pixel accuracy to overcome the issue of the narrow baseline between the viewpoint images for stereo matching. In addition, cost aggregation is used to correlate the matching costs within a particular neighbouring region using sum of absolute difference SAD combined with gradient-based metric and winner takes all algorithm is employed to select the minimum elements in the array as optimal disparity value. Finally, the optimal depth map is obtained using graph cut technique. The proposed method extends the utilisation of holoscopic 3D imaging system and enables the expansion of the technology for various applications of autonomous robotics, medical, inspection, AR VR, security and entertainment where 3D depth sensing and measurement are a concern.

Paper Link

## Image Cartoon-Texture Decomposition Using Isotropic Patch Recurrence

Aiming at separating the cartoon and texture layers from an image, cartoon-texture decomposition approaches resort to image priors to model cartoon and texture respectively. In recent years, patch recurrence has emerged as a powerful prior for image recovery. However, the existing strategies of using patch recurrence are ineffective to cartoon-texture decomposition, as both cartoon contours and texture patterns exhibit strong patch recurrence in images. To address this issue, we introduce the isotropy prior of patch recurrence, that the spatial configuration of similar patches in texture exhibits the isotropic structure which is different from that in cartoon, to model the texture component. Based on the isotropic patch recurrence, we construct a nonlocal sparsification system which can effectively distinguish well-patterned features from contour edges. Incorporating the constructed nonlocal system into morphology component analysis, we develop an effective method to both noiseless and noisy cartoon-texture decomposition. The experimental results have demonstrated the superior performance of the proposed method to the existing ones, as well as the effectiveness of the isotropic patch recurrence prior.

Paper Link

## Fast On-the-fly Retraining-free Sparsification of Convolutional Neural Networks

Modern Convolutional Neural Networks (CNNs) are complex, encompassing millions of parameters. Their deployment exerts computational, storage and energy demands, particularly on embedded platforms. Existing approaches to prune or sparsify CNNs require retraining to maintain inference accuracy. Such retraining is not feasible in some contexts. In this paper, we explore the sparsification of CNNs by proposing three model-independent methods. Our methods are applied on-the-fly and require no retraining. We show that the state-of-the-art models’ weights can be reduced by up to 73% (compression factor of 3.7x) without incurring more than 5% loss in Top-5 accuracy. Additional fine-tuning gains only 8% in sparsity, which indicates that our fast on-the-fly methods are effective.

Paper Link

## Power Normalizing Second-order Similarity Network for Few-shot Learning

Second- and higher-order statistics of data points have played an important role in advancing the state of the art on several computer vision problems such as the fine-grained image and scene recognition. However, these statistics need to be passed via an appropriate pooling scheme to obtain the best performance. Power Normalizations are non-linear activation units which enjoy probability-inspired derivations and can be applied in CNNs. In this paper, we propose a similarity learning network leveraging second-order information and Power Normalizations. To this end, we propose several formulations capturing second-order statistics and derive a sigmoid-like Power Normalizing function to demonstrate its interpretability. Our model is trained end-to-end to learn the similarity between the support set and query images for the problem of one- and few-shot learning. The evaluations on Omniglot, miniImagenet and Open MIC datasets demonstrate that this network obtains state-of-the-art results on several few-shot learning protocols.

Paper Link

## STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification

In this work, we propose a novel Spatial-Temporal Attention (STA) approach to tackle the large-scale person re-identification task in videos. Different from the most existing methods, which simply compute representations of video clips using frame-level aggregation (e.g. average pooling), the proposed STA adopts a more effective way for producing robust clip-level feature representation. Concretely, our STA fully exploits those discriminative parts of one target person in both spatial and temporal dimensions, which results in a 2-D attention score matrix via inter-frame regularization to measure the importances of spatial parts across different frames. Thus, a more robust clip-level feature representation can be generated according to a weighted sum operation guided by the mined 2-D attention score matrix. In this way, the challenging cases for video-based person re-identification such as pose variation and partial occlusion can be well tackled by the STA. We conduct extensive experiments on two large-scale benchmarks, i.e. MARS and DukeMTMC-VideoReID. In particular, the mAP reaches 87.7% on MARS, which significantly outperforms the state-of-the-arts with a large margin of more than 11.6%.

Paper Link

## Reducing Network Agnostophobia

Agnostophobia, the fear of the unknown, can be experienced by deep learning engineers while applying their networks to real-world applications. Unfortunately, network behavior is not well defined for inputs far from a networks training set. In an uncontrolled environment, networks face many instances that are not of interest to them and have to be rejected in order to avoid a false positive. This problem has previously been tackled by researchers by either a) thresholding softmax, which by construction cannot return “none of the known classes”, or b) using an additional background or garbage class. In this paper, we show that both of these approaches help, but are generally insufficient when previously unseen classes are encountered. We also introduce a new evaluation metric that focuses on comparing the performance of multiple approaches in scenarios where such unseen classes or unknowns are encountered. Our major contributions are simple yet effective Entropic Open-Set and Objectosphere losses that train networks using negative samples from some classes. These novel losses are designed to maximize entropy for unknown inputs while increasing separation in deep feature space by modifying magnitudes of known and unknown samples. Experiments on networks trained to classify classes from MNIST and CIFAR-10 show that our novel loss functions are significantly better at dealing with unknown inputs from datasets such as Devanagari, NotMNIST, CIFAR-100, and SVHN.

Paper Link

Please follow and like us: