## Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training

Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam Sung Kim, Alexander Schwing

Distributed training of deep nets is an important technique to address some of the present day computing challenges like memory consumption and computational demands. Classical distributed approaches, synchronous or asynchronous, are based on the parameter server architecture, i.e., worker nodes compute gradients which are communicated to the parameter server while updated parameters are returned. Recently, distributed training with AllReduce operations gained popularity as well. While many of those operations seem appealing, little is reported about wall-clock training time improvements. In this paper, we carefully analyze the AllReduce based setup, propose timing models which include network latency, bandwidth, cluster size and compute time, and demonstrate that a pipelined training with a width of two combines the best of both synchronous and asynchronous training. Specifically, for a setup consisting of a four-node GPU cluster we show wall-clock time training improvements of up to 5.4x compared to conventional approaches.

## GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training

Mingchao Yu, Zhifeng Lin, Krishna Narra, Songze Li, Youjie Li, Nam Sung Kim, Alexander Schwing, Murali Annavaram, Salman Avestimehr

Data parallelism can boost the training speed of convolutional neural networks (CNN), but could suffer from significant communication costs caused by gradient aggregation. To alleviate this problem, several scalar quantization techniques have been developed to compress the gradients. But these techniques could perform poorly when used together with decentralized aggregation protocols like ring all-reduce (RAR), mainly due to their inability to directly aggregate compressed gradients. In this paper, we empirically demonstrate the strong linear correlations between CNN gradients, and propose a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction. GradiVeQ enables direct aggregation of compressed gradients, hence allows us to build a distributed learning system that parallelizes GradiVeQ gradient compression and RAR communications. Extensive experiments on popular CNNs demonstrate that applying GradiVeQ slashes the wall-clock gradient aggregation time of the original RAR by more than 5X without noticeable accuracy loss, and reduces the end-to-end training time by almost 50%. The results also show that GradiVeQ is compatible with scalar quantization techniques such as QSGD (Quantized SGD), and achieves a much higher speed-up gain under the same compression ratio.

## Measuring the Effects of Data Parallelism on Neural Network Training

Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl

Recent hardware developments have made unprecedented amounts of data parallelism available for accelerating neural network training. Among the simplest ways to harness next-generation accelerators is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured in the number of steps necessary to reach a goal out-of-sample error. Eventually, increasing the batch size will no longer reduce the number of training steps required, but the exact relationship between the batch size and how many training steps are necessary is of critical importance to practitioners, researchers, and hardware designers alike. We study how this relationship varies with the training algorithm, model, and dataset and find extremely large variation between workloads. Along the way, we reconcile disagreements in the literature on whether batch size affects model quality. Finally, we discuss the implications of our results for efforts to train neural networks much faster in the future.

## Intrinsic Geometric Vulnerability of High-Dimensional Artificial Intelligence

Luca Bortolussi, Guido Sanguinetti

The success of modern Artificial Intelligence (AI) technologies depends critically on the ability to learn non-linear functional dependencies from large, high dimensional data sets. Despite recent high-profile successes, empirical evidence indicates that the high predictive performance is often paired with low robustness, making AI systems potentially vulnerable to adversarial attacks. In this report, we provide a simple intuitive argument suggesting that high performance and vulnerability are intrinsically coupled, and largely dependent on the geometry of typical, high-dimensional data sets. Our work highlights a major potential pitfall of modern AI systems, and suggests practical research directions to ameliorate the problem.

## A Geometric Approach of Gradient Descent Algorithms in Neural Networks

Yacine Chitour, Zhenyu Liao, Romain Couillet

In this article we present a geometric framework to analyze convergence of gradient descent trajectories in the context of neural networks. In the case of linear networks of an arbitrary number of hidden layers, we characterize appropriate quantities which are conserved along the gradient descent system (GDS). We use them to prove boundedness of every trajectory of the GDS, which implies convergence to a critical point. We further focus on the local behavior in the neighborhood of each critical points and perform a study on the associated basin of attractions so as to measure the “possibility” of converging to saddle points and local minima.

## Real-time Traffic Data Prediction with Basic Safety Messages using Kalman-Filter based Noise Reduction Model and Long Short-term Memory Neural Network

Mizanur Rahman, Mashrur Chowdhury, Jerome McClendon

With the development of Connected Vehicle (CV) technology, temporal variation of roadway traffic can be captured by sharing Basic Safety Messages (BSMs) from each vehicle using the communication between vehicles as well as with transportation roadside infrastructures (e.g., traffic signal) and traffic management centers. However, the penetration of connected vehicles in the near future will be limited. BSMs from limited CVs could provide an inaccurate estimation of current speed or space headway. This inaccuracy in the estimated current average speed and average space headway data is termed as noise. This noise in the traffic data significantly reduces the prediction accuracy of a machine learning model, such as the accuracy of long short term memory (LSTM) model in predicting traffic condition. To improve the real time prediction accuracy with low penetration of CVs, we developed a traffic data prediction model that combines the LSTM with a noise reduction model (the standard Kalman filter or Kalman filter based Rauch Tung Striebel (RTS)). The average speed and space headway used in this study were generated from the Enhanced Next Generation Simulation (NGSIM) dataset, which contains vehicle trajectory data for every one tenth of a second. Compared to a baseline LSTM model without any noise reduction, for 5 percent penetration of CVs, the analyses revealed that combined LSTM\\RTS model reduced the mean absolute percentage error (MAPE) from 19 percent to 5 percent for speed prediction and from 27 percent to 9 percent for space headway prediction. The overall reduction of MAPE value ranged from 1 percent to 14 percent for speed and 2 percent to 18 percent for space headway prediction compared to the baseline model.

## Iterative Classroom Teaching

Teresa Yeo, Parameswaran Kamalaruban, Adish Singla, Arpit Merchant, Thibault Asselborn, Louis Faucon, Pierre Dillenbourg, Volkan Cevher

We consider the machine teaching problem in a classroom-like setting wherein the teacher has to deliver the same examples to a diverse group of students. Their diversity stems from differences in their initial internal states as well as their learning rates. We prove that a teacher with full knowledge about the learning dynamics of the students can teach a target concept to the entire classroom using O(min{d,N} log(1/eps)) examples, where d is the ambient dimension of the problem, N is the number of learners, and eps is the accuracy parameter. We show the robustness of our teaching strategy when the teacher has limited knowledge of the learners’ internal dynamics as provided by a noisy oracle. Further, we study the trade-off between the learners’ workload and the teacher’s cost in teaching the target concept. Our experiments validate our theoretical results and suggest that appropriately partitioning the classroom into homogenous groups provides a balance between these two objectives.

## A Geometric Perspective on the Transferability of Adversarial Directions

Zachary Charles, Harrison Rosenberg, Dimitris Papailiopoulos

State-of-the-art machine learning models frequently misclassify inputs that have been perturbed in an adversarial manner. Adversarial perturbations generated for a given input and a specific classifier often seem to be effective on other inputs and even different classifiers. In other words, adversarial perturbations seem to transfer between different inputs, models, and even different neural network architectures. In this work, we show that in the context of linear classifiers and two-layer ReLU networks, there provably exist directions that give rise to adversarial perturbations for many classifiers and data points simultaneously. We show that these “transferable adversarial directions” are guaranteed to exist for linear separators of a given set, and will exist with high probability for linear classifiers trained on independent sets drawn from the same distribution. We extend our results to large classes of two-layer ReLU networks. We further show that adversarial directions for ReLU networks transfer to linear classifiers while the reverse need not hold, suggesting that adversarial perturbations for more complex models are more likely to transfer to other classifiers. We validate our findings empirically, even for deeper ReLU networks.

## Learning from Demonstration in the Wild

Feryal Behbahani, Kyriacos Shiarlis, Xi Chen, Vitaly Kurin, Sudhanshu Kasewa, Ciprian Stirbu, Jo\xe3o Gomes, Supratik Paul, Frans A. Oliehoek, Jo\xe3o Messias, Shimon Whiteson

Learning from demonstration (LfD) is useful in settings where hand-coding behaviour or a reward function is impractical. It has succeeded in a wide range of problems but typically relies on artificially generated demonstrations or specially deployed sensors and has not generally been able to leverage the copious demonstrations available in the wild: those that capture behaviour that was occurring anyway using sensors that were already deployed for another purpose, e.g., traffic camera footage capturing demonstrations of natural behaviour of vehicles, cyclists, and pedestrians. We propose video to behaviour (ViBe), a new approach to learning models of road user behaviour that requires as input only unlabelled raw video data of a traffic scene collected from a single, monocular, uncalibrated camera with ordinary resolution. Our approach calibrates the camera, detects relevant objects, tracks them through time, and uses the resulting trajectories to perform LfD, yielding models of naturalistic behaviour. We apply ViBe to raw videos of a traffic intersection and show that it can learn purely from videos, without additional expert knowledge.

## A simple yet effective baseline for non-attribute graph classification

Graphs are complex objects that do not lend themselves easily to typical learning tasks. Recently, a range of approaches based on graph kernels or graph neural networks have been developed for graph classification and for representation learning on graphs in general. As the developed methodologies become more sophisticated, it is important to understand which components of the increasingly complex methods are necessary or most effective.

As a first step, we develop a simple yet meaningful graph representation, and explore its effectiveness in graph classification. We test our baseline representation for the graph classification task on a range of graph datasets. Interestingly, this simple representation achieves similar performance as the state-of-the-art graph kernels and graph neural networks for non-attributed graph classification. Its performance on classifying attributed graphs is slightly weaker as it does not incorporate attributes. However, given its simplicity and efficiency, we believe that it still serves as an effective baseline for attributed graph classification. Our graph representation is efficient (linear-time) to compute. We also provide a simple connection with the graph neural networks.

Note that these observations are only for the task of graph classification while existing methods are often designed for a broader scope including node embedding and link prediction. The results are also likely biased due to the limited amount of benchmark datasets available. Nevertheless, the good performance of our simple baseline calls for the development of new, more comprehensive benchmark datasets so as to better evaluate and analyze different graph learning methods. Furthermore, given the computational efficiency of our graph summary, we believe that it is a good candidate as a baseline method for future graph classification (or even other graph learning) studies.

## Degree-$d$ Chow Parameters Robustly Determine Degree-$d$ PTFs (and Algorithmic Applications)

Ilias Diakonikolas, Daniel M. Kane

The degree-$d$ Chow parameters of a Boolean function $f: \\{-1,1\\}^n \\to \\mathbb{R}$ are its degree at most $d$ Fourier coefficients. It is well-known that degree-$d$ Chow parameters uniquely characterize degree-$d$ polynomial threshold functions (PTFs) within the space of all bounded functions. In this paper, we prove a robust version of this theorem: For $f$ any Boolean degree-$d$ PTF and $g$ any bounded function, if the degree-$d$ Chow parameters of $f$ are close to the degree-$d$ Chow parameters of $g$ in $\\ell_2$-norm, then $f$ is close to $g$ in $\\ell_1$-distance. Notably, our bound relating the two distances is completely independent of the dimension $n$. That is, we show that Boolean degree-$d$ PTFs are {\\em robustly identifiable} from their degree-$d$ Chow parameters. Results of this form had been shown for the $d=1$ case~\\cite{OS11:chow, DeDFS14}, but no non-trivial bound was previously known for $d >1$.

Our robust identifiability result gives the following algorithmic applications: First, we show that Boolean degree-$d$ PTFs can be efficiently approximately reconstructed from approximations to their degree-$d$ Chow parameters. This immediately implies that degree-$d$ PTFs are efficiently learnable in the uniform distribution $d$-RFA model~\\cite{BenDavidDichterman:98}. As a byproduct of our approach, we also obtain the first low integer-weight approximations of degree-$d$ PTFs, for $d>1$. As our second application, our robust identifiability result gives the first efficient algorithm, with dimension-independent error guarantees, for malicious learning of Boolean degree-$d$ PTFs under the uniform distribution.

## Disentangling Latent Factors with Whitening

After the success of deep generative models in image generation tasks, learning disentangled latent variable of data has become a major part of deep learning research. Many models have been proposed to learn an interpretable and factorized representation of latent variable by modifying their objective function or model architecture. While disentangling the latent variable, some models show lower quality of reconstructed images and others increase the model complexity which is hard to train. In this paper, we propose a simple disentangling method with traditional principle component analysis (PCA) which is applied to the latent variables of variational auto-encoder (VAE). Our method can be applied to any generative models. In experiment, we apply our proposed method to simple VAE models and experimental results confirm that our method finds more interpretable factors from the latent space while keeping the reconstruction error the same.

## Alpha-Pooling for Convolutional Neural Networks

Convolutional neural networks (CNNs) have achieved remarkable performance in many applications, especially image recognition. As a crucial component of CNNs, sub-sampling plays an important role, and max pooling and arithmetic average pooling are commonly used sub-sampling methods. In addition to the two pooling methods, however, there could be many other pooling types, such as geometric average, harmonic average, and so on. Since it is not easy for algorithms to find the best pooling method, human experts choose types of pooling, which might not be optimal for different tasks. Following deep learning philosophy, the type of pooling can be driven by data for a given task. In this paper, we propose {\\em alpha-pooling}, which has a trainable parameter $\u03b1$ to decide the type of pooling. Alpha-pooling is a general pooling method including max pooling and arithmetic average pooling as a special case, depending on the parameter $\u03b1$. In experiments, alpha-pooling improves the accuracy of image recognition tasks, and we found that max pooling is not the optimal pooling scheme. Moreover each layer has different optimal pooling types.

## Explaining Deep Learning Models – A Bayesian Non-parametric Approach

Wenbo Guo, Sui Huang, Yunzhe Tao, Xinyu Xing, Lin Lin

Understanding and interpreting how machine learning (ML) models make decisions have been a big challenge. While recent research has proposed various technical approaches to provide some clues as to how an ML model makes individual predictions, they cannot provide users with an ability to inspect a model as a complete entity. In this work, we propose a novel technical approach that augments a Bayesian non-parametric regression mixture model with multiple elastic nets. Using the enhanced mixture model, we can extract generalizable insights for a target model through a global approximation. To demonstrate the utility of our approach, we evaluate it on different ML models in the context of image recognition. The empirical results indicate that our proposed approach not only outperforms the state-of-the-art techniques in explaining individual decisions but also provides users with an ability to discover the vulnerabilities of the target ML models.

## A Factor Graph Approach to Automated Design of Bayesian Signal Processing Algorithms

Marco Cox, Thijs van de Laar, Bert de Vries

The benefits of automating design cycles for Bayesian inference-based algorithms are becoming increasingly recognized by the machine learning community. As a result, interest in probabilistic programming frameworks has much increased over the past few years. This paper explores a specific probabilistic programming paradigm, namely message passing in Forney-style factor graphs (FFGs), in the context of automated design of efficient Bayesian signal processing algorithms. To this end, we developed “ForneyLab” (this https URL) as a Julia toolbox for message passing-based inference in FFGs. We show by example how ForneyLab enables automatic derivation of Bayesian signal processing algorithms, including algorithms for parameter estimation and model comparison. Crucially, due to the modular makeup of the FFG framework, both the model specification and inference methods are readily extensible in ForneyLab. In order to test this framework, we compared variational message passing as implemented by ForneyLab with automatic differentiation variational inference (ADVI) and Monte Carlo methods as implemented by state-of-the-art tools “Edward” and “Stan”. In terms of performance, extensibility and stability issues, ForneyLab appears to enjoy an edge relative to its competitors for automated inference in state-space models.

## A Survey on Data Collection for Machine Learning: a Big Data – AI Integration Perspective

Yuji Roh, Geon Heo, Steven Euijong Whang

Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning where feature engineering is the bottleneck, deep learning techniques automatically generate features, but instead require large amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research.

## Transformative Machine Learning

Ivan Olier, Oghenejokpeme I. Orhobor, Joaquin Vanschoren, Ross D. King

The key to success in machine learning (ML) is the use of effective data representations. Traditionally, data representations were hand-crafted. Recently it has been demonstrated that, given sufficient data, deep neural networks can learn effective implicit representations from simple input representations. However, for most scientific problems, the use of deep learning is not appropriate as the amount of available data is limited, and/or the output models must be explainable. Nevertheless, many scientific problems do have significant amounts of data available on related tasks, which makes them amenable to multi-task learning, i.e. learning many related problems simultaneously. Here we propose a novel and general representation learning approach for multi-task learning that works successfully with small amounts of data. The fundamental new idea is to transform an input intrinsic data representation (i.e., handcrafted features), to an extrinsic representation based on what a pre-trained set of models predict about the examples. This transformation has the dual advantages of producing significantly more accurate predictions, and providing explainable models. To demonstrate the utility of this transformative learning approach, we have applied it to three real-world scientific problems: drug-design (quantitative structure activity relationship learning), predicting human gene expression (across different tissue types and drug treatments), and meta-learning for machine learning (predicting which machine learning methods work best for a given problem). In all three problems, transformative machine learning significantly outperforms the best intrinsic representation.

## Linear Memory Networks

Davide Bacciu, Antonio Carta, Alessandro Sperduti

Recurrent neural networks can learn complex transduction problems that require maintaining and actively exploiting a memory of their inputs. Such models traditionally consider memory and input-output functionalities indissolubly entangled. We introduce a novel recurrent architecture based on the conceptual separation between the functional input-output transformation and the memory mechanism, showing how they can be implemented through different neural components. By building on such conceptualization, we introduce the Linear Memory Network, a recurrent model comprising a feedforward neural network, realizing the non-linear functional transformation, and a linear autoencoder for sequences, implementing the memory component. The resulting architecture can be efficiently trained by building on closed-form solutions to linear optimization problems. Further, by exploiting equivalence results between feedforward and recurrent neural networks we devise a pretraining schema for the proposed architecture. Experiments on polyphonic music datasets show competitive results against gated recurrent networks and other state of the art models.

## Using Known Information to Accelerate HyperParameters Optimization Based on SMBO

Cheng Daning, Zhang Hanping, Xia Fen, Li Shigang, Zhang Yunquan

Automl is the key technology for machine learning problem. Current state of art hyperparameter optimization methods are based on traditional black-box optimization methods like SMBO (SMAC, TPE). The objective function of black-box optimization is non-smooth, or time-consuming to evaluate, or in some way noisy. Recent years, many researchers offered the work about the properties of hyperparameters. However, traditional hyperparameter optimization methods do not take those information into consideration. In this paper, we use gradient information and machine learning model analysis information to accelerate traditional hyperparameter optimization methods SMBO. In our L2 norm experiments, our method yielded state-of-the-art performance, and in many cases outperformed the previous best configuration approach.

## Bias and Generalization in Deep Generative Models: An Empirical Study

Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, Stefano Ermon

In high dimensional settings, density estimation algorithms rely crucially on their inductive bias. Despite recent empirical success, the inductive bias of deep generative models is not well understood. In this paper we propose a framework to systematically investigate bias and generalization in deep generative models of images. Inspired by experimental methods from cognitive psychology, we probe each learning algorithm with carefully designed training datasets to characterize when and how existing models generate novel attributes and their combinations. We identify similarities to human psychology and verify that these patterns are consistent across commonly used models and architectures.

## Efficient Identification of Approximate Best Configuration of Training in Large Datasets

Silu Huang, Chi Wang, Bolin Ding, Surajit Chaudhuri

A configuration of training refers to the combinations of feature engineering, learner, and its associated hyperparameters. Given a set of configurations and a large dataset randomly split into training and testing set, we study how to efficiently identify the best configuration with approximately the highest testing accuracy when trained from the training set. To guarantee small accuracy loss, we develop a solution using confidence interval (CI)-based progressive sampling and pruning strategy. Compared to using full data to find the exact best configuration, our solution achieves more than two orders of magnitude speedup, while the returned top configuration has identical or close test accuracy.

## SRP: Efficient class-aware embedding learning for large-scale data via supervised random projections

Amir-Hossein Karimi, Alexander Wong, Ali Ghodsi

Supervised dimensionality reduction strategies have been of great interest. However, current supervised dimensionality reduction approaches are difficult to scale for situations characterized by large datasets given the high computational complexities associated with such methods. While stochastic approximation strategies have been explored for unsupervised dimensionality reduction to tackle this challenge, such approaches are not well-suited for accelerating computational speed for supervised dimensionality reduction. Motivated to tackle this challenge, in this study we explore a novel direction of directly learning optimal class-aware embeddings in a supervised manner via the notion of supervised random projections (SRP). The key idea behind SRP is that, rather than performing spectral decomposition (or approximations thereof) which are computationally prohibitive for large-scale data, we instead perform a direct decomposition by leveraging kernel approximation theory and the symmetry of the Hilbert-Schmidt Independence Criterion (HSIC) measure of dependence between the embedded data and the labels. Experimental results on five different synthetic and real-world datasets demonstrate that the proposed SRP strategy for class-aware embedding learning can be very promising in producing embeddings that are highly competitive with existing supervised dimensionality reduction methods (e.g., SPCA and KSPCA) while achieving 1-2 orders of magnitude better computational performance. As such, such an efficient approach to learning embeddings for dimensionality reduction can be a powerful tool for large-scale data analysis and visualization.

## Time Series Classification to Improve Poultry Welfare

Alireza Abdoli, Amy C. Murillo, Chin-Chia M. Yeh, Alec C. Gerry, Eamonn J. Keogh

Poultry farms are an important contributor to the human food chain. Worldwide, humankind keeps an enormous number of domesticated birds (e.g. chickens) for their eggs and their meat, providing rich sources of low-fat protein. However, around the world, there have been growing concerns about the quality of life for the livestock in poultry farms; and increasingly vocal demands for improved standards of animal welfare. Recent advances in sensing technologies and machine learning allow the possibility of automatically assessing the health of some individual birds, and employing the lessons learned to improve the welfare for all birds. This task superficially appears to be easy, given the dramatic progress in recent years in classifying human behaviors, and given that human behaviors are presumably more complex. However, as we shall demonstrate, classifying chicken behaviors poses several unique challenges, chief among which is creating a generalizable dictionary of behaviors from sparse and noisy data. In this work we introduce a novel time series dictionary learning algorithm that can robustly learn from weakly labeled data sources.

## Blockwise Parallel Decoding for Deep Autoregressive Models

Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make different trade-offs between the amount of computation needed per layer and the length of the critical path at training time, generation still remains an inherently sequential process. To overcome this limitation, we propose a novel blockwise parallel decoding scheme in which we make predictions for multiple time steps in parallel then back off to the longest prefix validated by a scoring model. This allows for substantial theoretical improvements in generation speed when applied to architectures that can process output sequences in parallel. We verify our approach empirically through a series of experiments using state-of-the-art self-attention models for machine translation and image super-resolution, achieving iteration reductions of up to 2x over a baseline greedy decoder with no loss in quality, or up to 7x in exchange for a slight decrease in performance. In terms of wall-clock time, our fastest models exhibit real-time speedups of up to 4x over standard greedy decoding.

## Nonlinear Dimension Reduction via Outer Bi-Lipschitz Extensions

Sepideh Mahabadi, Konstantin Makarychev, Yury Makarychev, Ilya Razenshteyn

We introduce and study the notion of an outer bi-Lipschitz extension of a map between Euclidean spaces. The notion is a natural analogue of the notion of a Lipschitz extension of a Lipschitz map. We show that for every map $f$ there exists an outer bi-Lipschitz extension $f’$ whose distortion is greater than that of $f$ by at most a constant factor. This result can be seen as a counterpart of the classic Kirszbraun theorem for outer bi-Lipschitz extensions. We also study outer bi-Lipschitz extensions of near-isometric maps and show upper and lower bounds for them. Then, we present applications of our results to prioritized and terminal dimension reduction problems.

* We prove a prioritized variant of the Johnson-Lindenstrauss lemma: given a set of points $X\\subset \\mathbb{R}^d$ of size $N$ and a permutation (“priority ranking”) of $X$, there exists an embedding $f$ of $X$ into $\\mathbb{R}^{O(\\log N)}$ with distortion $O(\\log \\log N)$ such that the point of rank $j$ has only $O(\\log^{3 + \\varepsilon} j)$ non-zero coordinates – more specifically, all but the first $O(\\log^{3+\\varepsilon} j)$ coordinates are equal to $0$; the distortion of $f$ restricted to the first $j$ points (according to the ranking) is at most $O(\\log\\log j)$. The result makes a progress towards answering an open question by Elkin, Filtser, and Neiman about prioritized dimension reductions.

* We prove that given a set $X$ of $N$ points in $\\mathbb{R}^d$, there exists a terminal dimension reduction embedding of $\\mathbb{R}^d$ into $\\mathbb{R}^{d’}$, where $d’ = O\\left(\\frac{\\log N}{\\varepsilon^4}\\right)$, which preserves distances $\\|x-y\\|$ between points $x\\in X$ and $y \\in \\mathbb{R}^{d}$, up to a multiplicative factor of $1 \\pm \\varepsilon$. This improves a recent result by Elkin, Filtser, and Neiman.

The dimension reductions that we obtain are nonlinear, and this nonlinearity is necessary.

## Analysis of Multilingual Sequence-to-Sequence speech recognition systems

Martin Karafi\xe1t, Murali Karthick Baskar, Shinji Watanabe, Takaaki Hori, Matthew Wiesner, Jan “Honza” \u010cernock\xfd

This paper investigates the applications of various multilingual approaches developed in conventional hidden Markov model (HMM) systems to sequence-to-sequence (seq2seq) automatic speech recognition (ASR). On a set composed of Babel data, we first show the effectiveness of multi-lingual training with stacked bottle-neck (SBN) features. Then we explore various architectures and training strategies of multi-lingual seq2seq models based on CTC-attention networks including combinations of output layer, CTC and/or attention component re-training. We also investigate the effectiveness of language-transfer learning in a very low resource scenario when the target language is not included in the original multi-lingual training data. Interestingly, we found multilingual features superior to multilingual models, and this finding suggests that we can efficiently combine the benefits of the HMM system with the seq2seq system through these multilingual feature techniques.

## Optimized Hidden Markov Model based on Constrained Particle Swarm Optimization

L. Chang, Yacine Ouzrout (DISP), Antoine Nongaillard (DISP), Abdelaziz Bouras (DISP)

As one of Bayesian analysis tools, Hidden Markov Model (HMM) has been used to in extensive applications. Most HMMs are solved by Baum-Welch algorithm (BWHMM) to predict the model parameters, which is difficult to find global optimal solutions. This paper proposes an optimized Hidden Markov Model with Particle Swarm Optimization (PSO) algorithm and so is called PSOHMM. In order to overcome the statistical constraints in HMM, the paper develops re-normalization and re-mapping mechanisms to ensure the constraints in HMM. The experiments have shown that PSOHMM can search better solution than BWHMM, and has faster convergence speed.

## Integrating Project Spatial Coordinates into Pavement Management Prioritization

Omar Elbagalati, Mustafa Hajij

To date, pavement management software products and studies on optimizing the prioritization of pavement maintenance and rehabilitation (M&R) have been mainly focused on three parameters; the pre-treatment pavement condition, the rehabilitation cost, and the available budget. Yet, the role of the candidate projects’ spatial characteristics in the decision-making process has not been deeply considered. Such a limitation, predominately, allows the recommended M&R projects’ schedule to involve simultaneously running but spatially scattered construction sites, which are very challenging to monitor and manage. This study introduces a novel approach to integrate pavement segments’ spatial coordinates into the M&R prioritization analysis. The introduced approach aims at combining the pavement segments with converged spatial coordinates to be repaired in the same timeframe without compromising the allocated budget levels or the overall target Pavement Condition Index (PCI). Such a combination would result in minimizing the routing of crews, materials and other equipment among the construction sites and would provide better collaborations and communications between the pavement maintenance teams. Proposed herein is a novel spatial clustering algorithm that automatically finds the projects within a certain budget and spatial constrains. The developed algorithm was successfully validated using 1,800 pavement maintenance projects from two real-life examples of the City of Milton, GA and the City of Tyler, TX.

## Data Science as Political Action: Grounding Data Science in a Politics of Justice

In response to recent controversies, the field of data science has rushed to adopt codes of ethics. Such professional codes, however, are ill-equipped to address broad matters of social justice. Instead of ethics codes, I argue, the field must embrace politics. Data scientists must recognize themselves as political actors engaged in normative constructions of society and, as befits political work, evaluate their work according to its downstream material impacts on people’s lives.

I justify this notion in two parts: first, by articulating why data scientists must recognize themselves as political actors, and second, by describing how the field can evolve toward a deliberative and rigorous grounding in a politics of social justice.

Part 1 responds to three common arguments that have been invoked by data scientists when they are challenged to take political positions regarding their work. In confronting these arguments, I will demonstrate why attempting to remain apolitical is itself a political stance–a fundamentally conservative one–and why the field’s current attempts to promote “social good” dangerously rely on vague and unarticulated political assumptions.

Part 2 proposes a framework for what a politically-engaged data science could look like and how to achieve it, recognizing the challenge of reforming the field in this manner. I conceptualize the process of incorporating politics into data science in four stages: becoming interested in directly addressing social issues, recognizing the politics underlying these issues, redirecting existing methods toward new applications, and, finally, developing new practices and methods that orient data science around a mission of social justice. The path ahead does not require data scientists to abandon their technical expertise, but it does entail expanding their notions of what problems to work on and how to engage with society.

## Phenotyping Endometriosis through Mixed Membership Models of Self-Tracking Data

I\xf1igo Urteaga, Mollie McKillop, Sharon Lipsky-Gorman, No\xe9mie Elhadad

We investigate the use of self-tracking data and unsupervised mixed-membership models to phenotype endometriosis. Endometriosis is a systemic, chronic condition of women in reproductive age and, at the same time, a highly enigmatic condition with no known biomarkers to monitor its progression and no established staging. We leverage data collected through a self-tracking app in an observational research study of over 2,800 women with endometriosis tracking their condition over a year and a half (456,900 observations overall). We extend a classical mixed-membership model to accommodate the idiosyncrasies of the data at hand (i.e., the multimodality of the tracked variables). Our experiments show that our approach identifies potential subtypes that are robust in terms of biases of self-tracked data (e.g., wide variations in tracking frequency amongst participants), as well as to variations in hyperparameters of the model. Jointly modeling a wide range of observations about participants (symptoms, quality of life, treatments) yields clinically meaningful subtypes that both validate what is already known about endometriosis and suggest new findings.

## dAIrector: Automatic Story Beat Generation through Knowledge Synthesis

Markus Eger, Kory W. Mathewson

dAIrector is an automated director which collaborates with humans storytellers for live improvisational performances and writing assistance. dAIrector can be used to create short narrative arcs through contextual plot generation. In this work, we present the system architecture, a quantitative evaluation of design choices, and a case-study usage of the system which provides qualitative feedback from a professional improvisational performer. We present relevant metrics for the understudied domain of human-machine creative generation, specifically long-form narrative creation. We include, alongside publication, open-source code so that others may test, evaluate, and run the dAIrector.

## Bayesian Deep Learning for Exoplanet Atmospheric Retrieval

Frank Soboczenski, Michael D. Himes, Molly D. O’Beirne, Simone Zorzan, Atilim Gunes Baydin, Adam D. Cobb, Daniel Angerhausen, Giada N. Arney, Shawn D. Domagal-Goldman

Over the past decade, the study of exoplanets has shifted from their detection to the characterization of their atmospheres. Atmospheric retrieval, the inverse modeling technique used to determine an atmosphere’s temperature and composition from an observed spectrum, is both time-consuming and compute-intensive, requiring complex algorithms that compare thousands to millions of atmospheric models to the observational data to find the most probable values and associated uncertainties for each model parameter. For rocky, terrestrial planets, the retrieved atmospheric composition can give insight into the surface fluxes of gaseous species necessary to maintain the stability of that atmosphere, which may in turn provide insight into the geological and/or biological processes active on the planet. These atmospheres contain many molecules, some of which are biosignatures, or molecules indicative of biological activity. Runtimes of traditional retrieval models scale with the number of model parameters, so as more molecular species are considered, runtimes can become prohibitively long. Recent advances in machine learning (ML) and computer vision offer new ways to reduce the time to perform a retrieval by orders of magnitude, given a sufficient data set to train with. Here we present an ML-based retrieval framework called Intelligent exoplaNet Atmospheric RetrievAl (INARA) that consists of a Bayesian deep learning model for retrieval and a data set of 3,000,000 spectra of synthetic rocky exoplanets generated using the NASA Planetary Spectrum Generator (PSG). Our work represents the first ML model for rocky, terrestrial exoplanets and the first synthetic data set of spectra generated at this scale.

## Spectral Simplicial Theory for Feature Selection and Applications to Genomics

Kiya W. Govek, Venkata S. Yamajala, Pablo G. Camara

The scale and complexity of modern data sets and the limitations associated with testing large numbers of hypotheses underline the need for feature selection methods. Spectral techniques rank features according to their degree of consistency with an underlying metric structure, but their current graph-based formulation restricts their applicability to point features. We extend spectral methods for feature selection to abstract simplicial complexes and present a general framework which can be applied to 2-point and higher-order features. Combinatorial Laplacian scores take into account the topology spanned by the data and reduce to the ordinary Laplacian score in the case of point features. We demonstrate the utility of spectral simplicial methods for feature selection with several examples of application to the analysis of gene expression and multi-modal genomic data. Our results provide a unifying perspective on topological data analysis and manifold learning approaches.

## Speaker-adaptive neural vocoders for statistical parametric speech synthesis systems

Eunwoo Song, Jinseob Kim, Kyungguen Byun, Hong-Goo Kang

This paper proposes speaker-adaptive neural vocoders for statistical parametric speech synthesis (SPSS) systems. Recently proposed WaveNet-based neural vocoding systems successfully generate a time sequence of speech signal with an autoregressive framework. However, building high-quality speech synthesis systems with limited training data for a target speaker remains a challenge. To generate more natural speech signals with the constraint of limited training data, we employ a speaker adaptation task with an effective variation of neural vocoding models. In the proposed method, a speaker-independent training method is applied to capture universal attributes embedded in multiple speakers, and the trained model is then fine-tuned to represent the specific characteristics of the target speaker. Experimental results verify that the proposed SPSS systems with speaker-adaptive neural vocoders outperform those with traditional source-filter model-based vocoders and those with WaveNet vocoders, trained either speaker-dependently or speaker-independently.

## Memory-based Deep Reinforcement Learning for Obstacle Avoidance in UAV with Limited Environment Knowledge

Abhik Singla, Sindhu Padakandla, Shalabh Bhatnagar

This paper presents our method for enabling a UAV quadrotor, equipped with a monocular camera, to autonomously avoid collisions with obstacles in unstructured and unknown indoor environments. When compared to obstacle avoidance in ground vehicular robots, UAV navigation brings in additional challenges because the UAV motion is no more constrained to a well-defined indoor ground or street environment. Horizontal structures in indoor and outdoor environments like decorative items, furnishings, ceiling fans, sign-boards, tree branches etc., also become relevant obstacles unlike those for ground vehicular robots. Thus, methods of obstacle avoidance developed for ground robots are clearly inadequate for UAV navigation. Current control methods using monocular images for UAV obstacle avoidance are heavily dependent on environment information. These controllers do not fully retain and utilize the extensively available information about the ambient environment for decision making. We propose a deep reinforcement learning based method for UAV obstacle avoidance (OA) and autonomous exploration which is capable of doing exactly the same. The crucial idea in our method is the concept of partial observability and how UAVs can retain relevant information about the environment structure to make better future navigation decisions. Our OA technique uses recurrent neural networks with temporal attention and provides better results compared to prior works in terms of distance covered during navigation without collisions. In addition, our technique has a high inference rate (a key factor in robotic applications) and is energy-efficient as it minimizes oscillatory motion of UAV and reduces power wastage.

## BAR: Bayesian Activity Recognition using variational inference

Ranganath Krishnan, Mahesh Subedar, Omesh Tickoo

Uncertainty estimation in deep neural networks is essential for designing reliable and robust AI systems. Applications such as video surveillance for identifying suspicious activities are designed with deep neural networks (DNNs), but DNNs do not provide uncertainty estimates. Capturing reliable uncertainty estimates in safety and security critical applications will help to establish trust in the AI system. Our contribution is to apply Bayesian deep learning framework to visual activity recognition application and quantify model uncertainty along with principled confidence. We utilize the variational inference technique while training the Bayesian DNNs to infer the approximate posterior distribution around model parameters and perform Monte Carlo sampling on the posterior of model parameters to obtain the predictive distribution. We show that the Bayesian inference applied to DNNs provides reliable confidence measures for visual activity recognition task as compared to the conventional DNNs. We also show that our method improves the visual activity recognition precision-recall score by 6% compared to non-Bayesian baseline. We evaluate our models on Moments-In-Time (MiT) activity recognition dataset by selecting a subset of in- and out-of-distribution video samples.

## An Optimal Transport View on Generalization

Jingwei Zhang, Tongliang Liu, Dacheng Tao

We derive upper bounds on the generalization error of learning algorithms based on their \\emph{algorithmic transport cost}: the expected Wasserstein distance between the output hypothesis and the output hypothesis conditioned on an input example. The bounds provide a novel approach to study the generalization of learning algorithms from an optimal transport view and impose less constraints on the loss function, such as sub-gaussian or bounded. We further provide several upper bounds on the algorithmic transport cost in terms of total variation distance, relative entropy (or KL-divergence), and VC dimension, thus further bridging optimal transport theory and information theory with statistical learning theory. Moreover, we also study different conditions for loss functions under which the generalization error of a learning algorithm can be upper bounded by different probability metrics between distributions relating to the output hypothesis and/or the input data. Finally, under our established framework, we analyze the generalization in deep learning and conclude that the generalization error in deep neural networks (DNNs) decreases exponentially to zero as the number of layers increases. Our analyses of generalization error in deep learning mainly exploit the hierarchical structure in DNNs and the contraction property of $f$-divergence, which may be of independent interest in analyzing other learning models with hierarchical structure.