My Data Science Blogs

November 17, 2018

What's new on arXiv

A causal inference framework for cancer cluster investigations using publicly available data

Often, a community becomes alarmed when high rates of cancer are noticed, and residents suspect that the cancer cases could be caused by a known source of hazard. In response, the CDC recommends that departments of health perform a standardized incidence ratio (SIR) analysis to determine whether the observed cancer incidence is higher than expected. This approach has several limitations that are well documented in the literature. In this paper we propose a novel causal inference approach to cancer cluster investigations, rooted in the potential outcomes framework. Assuming that a source of hazard representing a potential cause of increased cancer rates in the community is identified a priori, we introduce a new estimand called the causal SIR (cSIR). The cSIR is a ratio defined as the expected cancer incidence in the exposed population divided by the expected cancer incidence under the (counterfactual) scenario of no exposure. To estimate the cSIR we need to overcome two main challenges: 1) identify unexposed populations that are as similar as possible to the exposed one to inform estimation under the counterfactual scenario of no exposure, and 2) make inference on cancer incidence in these unexposed populations using publicly available data that are often available at a much higher level of spatial aggregation than what is desired. We overcome the first challenge by relying on matching. We overcome the second challenge by developing a Bayesian hierarchical model that borrows information from other sources to impute cancer incidence at the desired finer level of spatial aggregation. We apply our proposed approach to determine whether trichloroethylene vapor exposure has caused increased cancer incidence in Endicott, NY.
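
The two estimands can be contrasted in a few lines of Python. This is a toy sketch only: all counts, populations, and matched areas below are invented for illustration, and the paper's actual estimator involves matching plus a Bayesian hierarchical imputation model.

```python
# Toy contrast of the classical SIR with a plug-in causal SIR (cSIR).
# All counts, populations, and matched areas are invented for illustration.

def sir(observed, expected):
    """Standardized incidence ratio: observed cases / expected cases."""
    return observed / expected

# Exposed community: observed cases vs. cases expected under reference rates.
observed_exposed = 52
expected_from_reference_rates = 40.0
print("SIR :", sir(observed_exposed, expected_from_reference_rates))

# Plug-in cSIR: the denominator instead comes from the cancer incidence of
# matched *unexposed* populations, standing in for the counterfactual.
matched_unexposed_cases = [9, 11, 8, 12]                 # hypothetical areas
matched_unexposed_pop = [10_000, 11_500, 9_000, 12_500]
rate_unexposed = sum(matched_unexposed_cases) / sum(matched_unexposed_pop)
exposed_population = 43_000
expected_counterfactual = rate_unexposed * exposed_population
print("cSIR:", observed_exposed / expected_counterfactual)
```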

Human-like machine learning: limitations and suggestions

This paper attempts to address the issues of machine learning in its current implementation. It is known that machine learning algorithms require a significant amount of data for training purposes, and recent developments in deep learning have increased this requirement dramatically. The performance of an algorithm depends on the quality of data, and hence algorithms are only as good as the data they are trained on. Supervised learning is developed based on human learning processes by analysing named (i.e. annotated) objects, scenes and actions. Whether training on large quantities of data (i.e. big data) is the right or the wrong approach is debatable. The fact is that training algorithms the same way we learn ourselves comes with limitations. This paper discusses the issues around applying a human-like approach to train algorithms and the implications of this approach when using limited data. Several current studies involving non-data-driven algorithms and natural examples are also discussed, and certain alternative approaches are suggested.

On the Robustness of Information-Theoretic Privacy Measures and Mechanisms

Consider a data publishing setting for a dataset composed of non-private features correlated with a set of private features not necessarily present in the dataset. The objective of the publisher is to maximize the amount of information about the non-private features in a revealed dataset (utility), while keeping the information leaked about the private attributes bounded (privacy). Here, both privacy and utility are measured using information leakage measures that arise in adversarial settings. We consider a practical setting wherein the publisher uses an estimate of the joint distribution of the features to design the privacy mechanism. For any privacy mechanism, we provide probabilistic upper bounds for the discrepancy between the privacy guarantees for the empirical and true distributions, and similarly for utility. These bounds follow from our main technical results regarding the Lipschitz continuity of the considered information leakage measures. We also establish the statistical consistency of the notion of optimal privacy mechanism. Finally, we introduce and study the family of uniform privacy mechanisms, mechanisms designed upon an estimate of the joint distribution which are capable of providing privacy to a whole neighborhood of the estimated distribution, and thereby, guaranteeing privacy for the true distribution with high probability.

Constraint-based Sequential Pattern Mining with Decision Diagrams

Constrained sequential pattern mining aims at identifying frequent patterns on a sequential database of items while observing constraints defined over the item attributes. We introduce novel techniques for constraint-based sequential pattern mining that rely on a multi-valued decision diagram representation of the database. Specifically, our representation can accommodate multiple item attributes and various constraint types, including a number of non-monotone constraints. To evaluate the applicability of our approach, we develop an MDD-based prefix-projection algorithm and compare its performance against a typical generate-and-check variant, as well as a state-of-the-art constraint-based sequential pattern mining algorithm. Results show that our approach is competitive with or superior to these other methods in terms of scalability and efficiency.
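
As background for the comparison, a minimal prefix-projection miner (in the spirit of the generate-and-check baselines) can be sketched as follows; the MDD representation and attribute constraints that are the paper's contribution are not modeled here.

```python
from collections import Counter

# Minimal prefix-projection sequential pattern miner (unconstrained).
# Each item is counted at most once per sequence; a pattern is frequent if
# it occurs as a subsequence of at least `min_support` sequences.
db = [list("abcab"), list("abbca"), list("cabc")]

def frequent_patterns(db, min_support=2, prefix=()):
    counts = Counter()
    for seq in db:
        counts.update(set(seq))
    out = {}
    for item, c in counts.items():
        if c < min_support:
            continue
        pattern = prefix + (item,)
        out[pattern] = c
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        out.update(frequent_patterns(projected, min_support, pattern))
    return out

patterns = frequent_patterns(db)
print(patterns[('a',)], patterns[('a', 'b')], patterns[('a', 'a')])  # 3 3 2
```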

Unsupervised learning with contrastive latent variable models

In unsupervised learning, dimensionality reduction is an important tool for data exploration and visualization. Because these aims are typically open-ended, it can be useful to frame the problem as looking for patterns that are enriched in one dataset relative to another. Such pairs of datasets occur commonly, for instance a population of interest vs. a control, or signal vs. signal-free recordings. However, there are few methods that work on sets of data as opposed to data points or sequences. Here, we present a probabilistic model for dimensionality reduction to discover signal that is enriched in the target dataset relative to the background dataset. The data in these sets do not need to be paired or grouped beyond set membership. By using a probabilistic model in which some structure is shared between the two datasets and some is unique to the target dataset, we are able to recover interesting structure in the latent space of the target dataset. The method also has the advantages of a probabilistic model, namely that it allows for the incorporation of prior information, handles missing data, and can be generalized to different distributional assumptions. We describe several possible variations of the model and demonstrate its application to de-noising, feature selection, and subgroup discovery settings.
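
A minimal non-probabilistic analogue of this contrastive idea is contrastive PCA, which looks for directions whose variance is enriched in the target relative to the background; the sketch below (synthetic data, our own choice of alpha) illustrates the intuition, not the paper's probabilistic model.

```python
import numpy as np

# Contrastive-PCA-style sketch: directions of variance enriched in the target
# relative to the background are eigenvectors of C_target - alpha * C_bg.
rng = np.random.default_rng(0)

background = rng.normal(size=(500, 2)) @ np.diag([3.0, 1.0])
# The target shares the axis-0 structure but adds extra variance on axis 1.
target = rng.normal(size=(500, 2)) @ np.diag([3.0, 2.5])

def contrastive_direction(target, background, alpha=1.0):
    c_t = np.cov(target, rowvar=False)
    c_b = np.cov(background, rowvar=False)
    vals, vecs = np.linalg.eigh(c_t - alpha * c_b)
    return vecs[:, np.argmax(vals)]   # top contrastive direction

d = contrastive_direction(target, background)
print(d)  # dominated by axis 1, where the target-only structure lives
```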

Newton Methods for Convolutional Neural Networks

Deep learning involves a difficult non-convex optimization problem, which is often solved by stochastic gradient (SG) methods. While SG is usually effective, it may not be robust in some situations. Recently, Newton methods have been investigated as an alternative optimization technique, but nearly all existing studies consider only fully-connected feedforward neural networks. They do not investigate other types of networks, such as Convolutional Neural Networks (CNNs), which are more commonly used in deep-learning applications. One reason is that Newton methods for CNNs involve complicated operations, and so far no work has conducted a thorough investigation. In this work, we give details of all building blocks, including function, gradient, and Jacobian evaluation, and Gauss-Newton matrix-vector products. These basic components are very important because with them, further developments of Newton methods for CNNs become possible. We show that an efficient MATLAB implementation can be done in just several hundred lines of code and demonstrate that the Newton method gives competitive test accuracy.
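
The Gauss-Newton matrix-vector product mentioned above is easiest to see on a linear least-squares model, where the Jacobian of the residuals is just the design matrix; this sketch is ours and sidesteps all the CNN-specific machinery the paper works out.

```python
import numpy as np

# Gauss-Newton matrix-vector product for linear least squares r(w) = Xw - y,
# where the Jacobian of the residuals is simply X. Computing (J^T J) v as
# J^T (J v) uses two matrix-vector products and never materializes J^T J.
rng = np.random.default_rng(0)

X = rng.normal(size=(50, 10))   # design matrix = Jacobian of the residuals
v = rng.normal(size=10)

def gauss_newton_vec(J, v):
    return J.T @ (J @ v)

assert np.allclose(gauss_newton_vec(X, v), (X.T @ X) @ v)
print("matches the explicit J^T J product")
```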

Multivariate Time-series Similarity Assessment via Unsupervised Representation Learning and Stratified Locality Sensitive Hashing: Application to Early Acute Hypotensive Episode Detection

Timely prediction of clinically critical events in the Intensive Care Unit (ICU) is important for improving care and survival rates. Most existing approaches apply classification methods to statistical features explicitly extracted from vital signals. In this work, we propose to eliminate the high cost of engineering hand-crafted features from multivariate time series of physiologic signals by learning their representation with a sequence-to-sequence auto-encoder. We then propose to hash the learned representations to enable signal similarity assessment for the prediction of critical events. We apply this methodological framework to predict Acute Hypotensive Episodes (AHE) on a large and diverse dataset of vital signal recordings. Experiments demonstrate the ability of the presented framework to accurately predict an upcoming AHE.
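
The hashing step can be illustrated with plain random-hyperplane LSH; the vectors below merely stand in for the learned auto-encoder representations, and the paper's stratified variant is not reproduced here.

```python
import numpy as np

# Random-hyperplane LSH over embedding vectors: the sign pattern of a few
# random projections forms a hash, so nearby vectors usually share a bucket.
rng = np.random.default_rng(42)

dim, n_bits = 16, 8
planes = rng.normal(size=(n_bits, dim))

def lsh(v):
    return ''.join('1' if p @ v > 0 else '0' for p in planes)

base = rng.normal(size=dim)                     # stands in for an embedding
similar = base + 0.001 * rng.normal(size=dim)   # near-duplicate embedding
print(lsh(base), lsh(similar))   # almost always identical hash strings
```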

Predictive Modeling with Delayed Information: a Case Study in E-commerce Transaction Fraud Control

In business intelligence, accurate predictive modeling is key to providing adaptive decisions. We studied predictive modeling problems motivated by real-world cases that Microsoft data scientists encountered while making e-commerce transaction fraud control decisions using streaming transaction data in an uncertain probabilistic decision environment. The values of most online transaction-related features are available instantly, while the true fraud labels only return after a stochastic delay. Using partially mature data directly for predictive modeling in such an environment leads to significant inaccuracy in risk decision-making. To improve estimation of the probabilistic decision environment, which in turn leads to more accurate predictive modeling, two frameworks, Current Environment Inference (CEI) and Future Environment Inference (FEI), are proposed. These frameworks generate decision-environment-related features using long-term fully mature and short-term partially mature data, and the values of those features are estimated using a variety of learning methods, including linear regression, random forests, gradient boosted trees, artificial neural networks, and recurrent neural networks. Performance tests were conducted using e-commerce transaction data from Microsoft. The results suggest that the proposed frameworks significantly improve the accuracy of decision environment estimation.
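
The core difficulty (partially mature labels) can be reproduced in a toy simulation. Everything below is our own construction, not the CEI/FEI frameworks: fraud labels arrive with a geometric delay, so the naive rate on recent data is biased low, and a delay model fitted on mature data can correct it.

```python
import random

# Toy reproduction of the delayed-label bias (our construction, not CEI/FEI).
random.seed(0)

true_rate, horizon, delay_p = 0.05, 5, 0.5

def label_arrived_prob(days):
    return 1 - (1 - delay_p) ** days   # geometric delay, observed so far

# (is_fraud, days since transaction) for a stream of recent transactions.
txns = [(random.random() < true_rate, random.randint(1, horizon))
        for _ in range(200_000)]
labeled = sum(f and random.random() < label_arrived_prob(d) for f, d in txns)

naive = labeled / len(txns)            # biased low: some labels not yet seen
maturity = sum(label_arrived_prob(d) for d in range(1, horizon + 1)) / horizon
corrected = naive / maturity
print(round(naive, 4), round(corrected, 4))   # corrected is close to 0.05
```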

Deep Learning in the Wavelet Domain

This paper examines the possibility of, and the possible advantages to, learning the filters of convolutional neural networks (CNNs) for image analysis in the wavelet domain. We are inspired both by Mallat's scattering transform and by the idea of filtering in the Fourier domain. It is important to explore new spaces in which to learn, as these may provide inherent advantages not available in pixel space. However, the scattering transform is limited by its inability to learn in between scattering orders, and Fourier-domain filtering is limited by the large number of filter parameters needed to obtain localized filters. Instead, we consider filtering in the wavelet domain with learnable filters. The wavelet space allows us to have local, smooth filters with far fewer parameters, and learnability gives us flexibility. We present a novel layer that takes CNN activations into the wavelet space, learns parameters there, and returns to pixel space. This allows it to be easily dropped into any neural network without affecting the structure. As part of this work, we show how to pass gradients through a multirate system and give preliminary results.

Concept Learning through Deep Reinforcement Learning with Memory-Augmented Neural Networks

Deep neural networks have shown superior performance in many regimes at remembering familiar patterns given large amounts of data. However, the standard supervised deep learning paradigm is still limited when new concepts must be learned efficiently from scarce data. In this paper, we present a memory-augmented neural network motivated by the process of human concept learning. The training procedure, imitating the human concept-formation process, learns how to distinguish samples from different classes and to aggregate samples of the same kind. To better exploit the advantages originating from this human-like behavior, we propose a sequential process during which the network decides how to remember each sample at every step. In this sequential process, a stable and interactive memory serves as an important module. We validate our model on typical one-shot learning tasks and on an exploratory outlier detection problem. In all experiments, our model is highly competitive, matching or outperforming strong baselines.

Theoretical Perspective of Deep Domain Adaptation

Deep domain adaptation has recently achieved great success. Compared with shallow domain adaptation, it has shown higher predictive performance and a stronger capacity to tackle structured data (e.g., images and sequential data). The underlying idea of deep domain adaptation is to bridge the gap between source and target domains in a joint feature space so that a supervised classifier trained on labeled source data can be transferred to the target domain. This idea is certainly appealing and motivating, but from a theoretical perspective little theory has been developed to support it. In this paper, we develop a rigorous theory explaining why the relevant gap can be bridged in an intermediate joint space. In light of our proposed theory, it turns out that there is a strong connection between deep domain adaptation and the Wasserstein (WS) distance. More specifically, our theory revolves around the following points: i) first, we propose a context in which we can perfectly perform transfer learning, and ii) second, we prove that by bridging the relevant gap and minimizing some reconstruction errors we are minimizing a WS distance between the push-forward source distribution and the target distribution via a transport map from the source to the target domain.
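
The Wasserstein distance at the heart of the result is easy to estimate empirically in one dimension, where it reduces to the mean absolute difference of sorted samples; a quick sketch (our own, with synthetic Gaussians):

```python
import numpy as np

# Empirical 1-Wasserstein distance in one dimension: with equal sample sizes
# it is just the mean absolute difference between the sorted samples.
rng = np.random.default_rng(0)

source = rng.normal(0.0, 1.0, size=2000)
target = rng.normal(0.5, 1.0, size=2000)

w1 = np.mean(np.abs(np.sort(source) - np.sort(target)))
print(round(float(w1), 2))   # close to the 0.5 mean shift between the domains
```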

Short-Term Wind-Speed Forecasting Using Kernel Spectral Hidden Markov Models

In machine learning, a nonparametric forecasting algorithm for time-series data called the kernel spectral hidden Markov model (KSHMM) has been proposed. In this paper, we propose a technique for short-term wind-speed prediction based on KSHMM. We numerically compared the performance of our KSHMM-based forecasting technique to other machine-learning techniques, using wind-speed data provided by the National Renewable Energy Laboratory. Our results demonstrate that, compared to these methods, the proposed technique offers comparable or better performance.

Model-based Approximate Query Processing

Interactive visualizations are arguably the most important tool to explore, understand and convey facts about data. In the past years, the database community has been working on different techniques for Approximate Query Processing (AQP) that aim to deliver an approximate query result within a fixed time bound to better support interactive visualizations. However, classical AQP approaches suffer from problems that limit their applicability for ad-hoc exploration of a new data set: (1) approaches that perform online sampling can support ad-hoc exploration queries but yield low quality if executed over rare subpopulations; (2) approaches that rely on offline sampling can use some form of biased sampling to mitigate these problems but require a priori knowledge of the workload, which is often unrealistic when users want to explore a new database. In this paper, we present a new approach to AQP, called Model-based Approximate Query Processing, that leverages generative models learned over the complete database to answer SQL queries at interactive speeds. Unlike classical AQP approaches, generative models allow us to answer ad-hoc queries while delivering high-quality estimates even over rare subpopulations. In our experiments with real and synthetic data sets, we show that model-based AQP can in many scenarios return more accurate results in a shorter runtime. Furthermore, we believe the techniques of using generative models presented in this paper can be applied not only to AQP in databases but also to other database problems, including query optimization and data cleaning.
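
The core idea can be caricatured in a few lines: learn a tiny generative model of a column offline, then answer an aggregate query from model samples instead of scanning the table. The single-Gaussian "model" below is purely illustrative; the paper uses far richer generative models.

```python
import random
import statistics

# Caricature of model-based AQP: fit a tiny generative model to one column
# offline, then answer SELECT AVG(total) from model samples instead of a scan.
random.seed(0)

table = [random.gauss(100.0, 15.0) for _ in range(100_000)]  # "orders.total"

mu = statistics.fmean(table)       # "learn" a single-Gaussian model offline
sigma = statistics.stdev(table)

approx = statistics.fmean(random.gauss(mu, sigma) for _ in range(1_000))
exact = statistics.fmean(table)
print(round(exact, 1), round(approx, 1))  # approximate answer from 1,000 model samples
```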

Selective Feature Connection Mechanism: Concatenating Multi-layer CNN Features with a Feature Selector

Different layers of deep convolutional neural networks (CNNs) encode different levels of information. High-layer features contain more semantic information, while low-layer features contain more detail. However, low-layer features suffer from background clutter and semantic ambiguity. During visual recognition, combining low-layer and high-layer features plays an important role in context modulation, but directly combining them may introduce background clutter and semantic ambiguity along with the detailed information. In this paper, we propose a general network architecture that concatenates CNN features from different layers in a simple and effective way, called the Selective Feature Connection Mechanism (SFCM). Low-level features are selectively linked to high-level features through a feature selector that is generated from the high-level features. The proposed connection mechanism effectively overcomes the above-mentioned drawbacks. We demonstrate the effectiveness, superiority, and broad applicability of this method on many challenging computer vision tasks, such as image classification, scene text detection, and image-to-image translation.
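
A minimal numpy sketch of the gating idea (variable names and shapes are ours, not the paper's):

```python
import numpy as np

# Gating sketch: a selector generated from high-level features decides which
# low-level feature channels pass through before concatenation.
rng = np.random.default_rng(0)

low = rng.normal(size=(8, 32))       # low-level features: 8 positions, 32 ch.
high = rng.normal(size=(8, 64))      # high-level features at same positions
w = rng.normal(size=(64, 32)) * 0.1  # maps high-level features to a gate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

selector = sigmoid(high @ w)       # in (0, 1): per-channel feature selector
selected_low = selector * low      # cluttered low-level responses suppressed
fused = np.concatenate([selected_low, high], axis=-1)
print(fused.shape)  # (8, 96)
```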

Contextual Care Protocol using Neural Networks and Decision Trees

A contextual care protocol is used by a medical practitioner for patient healthcare, given the context or situation that the specified patient is in. This paper proposes a method to build an automated, self-adapting protocol which can help make relevant, early decisions for effective healthcare delivery. The hybrid model leverages neural networks and decision trees: the neural network estimates the probability of each disease, and each decision tree represents the care protocol for a disease. These trees are subject to change when aberrations are found by the diagnosticians. These corrections, or prediction errors, are clustered into similar groups for scalability and review by the experts, and the corrections suggested by the experts are incorporated into the model.
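
The routing between the two model families can be caricatured as follows; the disease names, scores, thresholds, and "trees" below are all invented for illustration:

```python
# Toy routing between the two model families: classifier scores pick the
# disease, and that disease's protocol "tree" (here, one rule) is applied.
disease_scores = {"flu": 0.7, "pneumonia": 0.2, "covid": 0.1}  # e.g. from a NN

protocols = {  # each decision tree collapsed to a single rule for brevity
    "flu": lambda p: "antivirals" if p["fever"] > 39.0 else "rest and fluids",
    "pneumonia": lambda p: "admit" if p["spo2"] < 92 else "oral antibiotics",
    "covid": lambda p: "isolate",
}

patient = {"fever": 39.5, "spo2": 97}
top = max(disease_scores, key=disease_scores.get)
print(top, "->", protocols[top](patient))  # flu -> antivirals
```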

Minimizing the Age of Information from Sensors with Correlated Observations

The Age of Information (AoI) metric has recently received much attention as a measure of the freshness of information in a network. In this paper, we study the average AoI in a generic setting with correlated sensor information, which can be related to multiple Internet of Things (IoT) scenarios. The system consists of physical sources with discrete-time updates that are observed by a set of sensors. Each source update may be observed by multiple sensors, and hence the sensor observations are correlated. We devise a model that is simple but still capable of capturing the main tradeoffs. We propose two sensor scheduling policies that minimize the AoI of the sources: one that requires the system parameters to be known a priori, and one based on contextual bandits in which the parameters are unknown and need to be learned. We show that both policies are able to exploit the sensor correlation to reduce the AoI, resulting in a large reduction in AoI compared to schedules that are random or round-robin.
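
A toy discrete-time simulation makes the correlation structure concrete. The coverage sets and the round-robin schedule below are our own construction, not the paper's model:

```python
import itertools

# Two sensors, three sources; sensor 0 observes sources {0, 1} and sensor 1
# observes {1, 2}, so their observations are correlated through source 1.
coverage = {0: {0, 1}, 1: {1, 2}}

def avg_aoi(schedule, steps=1000):
    age, total = [0, 0, 0], 0
    for _, sensor in zip(range(steps), schedule):
        age = [0 if s in coverage[sensor] else a + 1 for s, a in enumerate(age)]
        total += sum(age)
    return total / (steps * len(age))

# Round-robin refreshes source 1 every slot and sources 0/2 every other slot.
print(avg_aoi(itertools.cycle([0, 1])))  # 1/3
```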

Towards Explainable Deep Learning for Credit Lending: A Case Study

Deep learning adoption in the financial services industry has been limited due to a lack of model interpretability. However, several techniques have been proposed to explain predictions made by a neural network. We provide an initial investigation into these techniques for the assessment of credit risk with neural networks.

Multi-cell LSTM Based Neural Language Model

Language models, being at the heart of many NLP problems, are of great interest to researchers. Neural language models come with the advantage of distributed representations and long-range contexts. With dynamics that allow information to cycle within the network, the recurrent neural network (RNN) is an ideal paradigm for neural language modeling. The Long Short-Term Memory (LSTM) architecture addresses the inadequacies of the standard RNN in modeling long-range contexts. Yet in spite of the plethora of RNN variants, the possibility of adding multiple memory cells to LSTM nodes has seldom been explored. Here we propose a multi-cell node architecture for LSTMs and study its applicability to neural language modeling. The proposed multi-cell LSTM language models outperform state-of-the-art results on the well-known Penn Treebank (PTB) setup.
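
One plausible reading of the multi-cell idea can be sketched in numpy: run K memory cells per node and combine their outputs. This is our own simplification, and the paper's exact cell equations may differ:

```python
import numpy as np

# One step of an LSTM node with K memory cells instead of one (a plain-numpy
# simplification of the multi-cell idea; gate layout follows the usual LSTM).
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden, K = 4, 3
x_and_h = rng.normal(size=2 * hidden)        # concatenated [x_t, h_{t-1}]
W = rng.normal(size=(K, 4 * hidden, 2 * hidden)) * 0.1  # per-cell gate weights
c_prev = np.zeros((K, hidden))               # K separate memory cells

outputs = []
for k in range(K):
    i, f, o, g = np.split(W[k] @ x_and_h, 4)   # input/forget/output/candidate
    c = sigmoid(f) * c_prev[k] + sigmoid(i) * np.tanh(g)
    outputs.append(sigmoid(o) * np.tanh(c))
h = np.mean(outputs, axis=0)                 # combine the cells' outputs
print(h.shape)  # (4,)
```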

Exploiting Class Learnability in Noisy Data

In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real-world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set and guide supervised training to spend time on a class in proportion to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with the classifier and training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show that our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.
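
The "time proportional to learnability" idea can be sketched as weighted class sampling, using per-class validation accuracy as a stand-in for the paper's online generalization estimate; all numbers below are invented:

```python
import random

# Learnability-weighted sampling: draw training examples from a class in
# proportion to how well the model currently generalizes on it.
random.seed(0)

val_accuracy = {"cat": 0.92, "dog": 0.88, "noisy_tag": 0.35}  # invented

classes = list(val_accuracy)
weights = [val_accuracy[c] for c in classes]
batch = random.choices(classes, weights=weights, k=1000)
print({c: batch.count(c) for c in classes})  # the noisy class is drawn least
```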

DeepCSO: Forecasting of Combined Sewer Overflow at a Citywide Level using Multi-task Deep Learning
Enhancing Operation of a Sewage Pumping Station for Inter Catchment Wastewater Transfer by Using Deep Learning and Hydraulic Model
Modelling student online behaviour in a virtual learning environment
Cluster analysis of homicide rates in the Brazilian state of Goias from 2002 to 2014
Baire categorical aspects of first passage percolation II
Modeling car-following behavior on urban expressways in Shanghai: A naturalistic driving study
Prediction of Preliminary Maximum Wing Bending Moments under Discrete Gust
Transportation inequalities for stochastic wave equation
The Amortized Analysis of a Non-blocking Chromatic Tree
On the Further Structure of the Finite Free Convolutions
Fault-Tolerant Metric Dimension of $P(n,2)$ with Prism Graph
Randomisation Algorithms for Large Sparse Matrices
How to get meaningful and correct results from your finite element model
Machine Learning Analysis of Heterogeneity in the Effect of Student Mindset Interventions
On the use of FHT, its modification for practical applications and the structure of Hough image
Describing many-body localized systems in thermal environments
Catch and Prolong: recurrent neural network for seeking track-candidates
Estimation of Multivariate Wrapped Models for Data in Torus
Faster manipulation of large quantum circuits using wire label reference diagrams
Performance Estimation of Synthesis Flows cross Technologies using LSTMs and Transfer Learning
Distinguishing number of Urysohn metric spaces
Incentivizing Exploration with Unbiased Histories
Verification of Recurrent Neural Networks Through Rule Extraction
A Fast Method for Array Response Adjustment with Phase-Only Constraint
A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
Natural Environment Benchmarks for Reinforcement Learning
Limit distributions of the upper order statistics for the Lévy-frailty Marshall-Olkin distribution
Dynamic Behavior Control of Induction Motor with STATCOM
Focus Quality Assessment of High-Throughput Whole Slide Imaging in Digital Pathology
Arterial Blood Pressure Feature Estimation Using Photoplethysmography
Unsupervised domain adaptation for medical imaging segmentation with self-ensembling
A Performance Vocabulary for Affine Loop Transformations
Looking at the Driver/Rider in Autonomous Vehicles to Predict Take-Over Readiness
A multi-level convolutional LSTM model for the segmentation of left ventricle myocardium in infarcted porcine cine MR images
Minimax Rates in Network Analysis: Graphon Estimation, Community Detection and Hypothesis Testing
Lattice bijections for string modules, snake graphs and the weak Bruhat order
Spatial Logics and Model Checking for Medical Imaging (Extended Version)
Interpretable deep learning for guided structure-property explorations in photovoltaics
Derivation of an aggregated band pseudo phasor for single phase pulse width modulation voltage waveforms
Communication-Optimal Distributed Dynamic Graph Clustering
Learning Optimal Personalized Treatment Rules Using Robust Regression Informed K-NN
ReSIFT: Reliability-Weighted SIFT-based Image Quality Assessment
Automatic Grammar Augmentation for Robust Voice Command Recognition
On the specification and verification of atomic swap smart contracts
Deep Neural Networks based Modrec: Some Results with Inter-Symbol Interference and Adversarial Examples
Operator-Theoretical Treatment of Ergodic Theorem and Its Application to Dynamic Models in Economics
Layered Belief Propagation for Low-complexity Large MIMO Detection Based on Statistical Beams
Learning from Past Bids to Participate Strategically in Day-Ahead Electricity Markets
Prophet Inequalities for Independent Random Variables from an Unknown Distribution
The case for shifting the Renyi Entropy
Some Moderate Deviations for Ewens-Pitman Sampling Model
Cooperation Enforcement and Collusion Resistance in Repeated Public Goods Games
Vectorized Character Counting for Faster Pattern Matching
Machine Learning for Combinatorial Optimization: a Methodological Tour d’Horizon
Gallai-Ramsey numbers for a class of graphs with five vertices
Forbidden rainbow subgraphs that force large monochromatic or multicolored connected subgraphs
It Does Not Follow. Response to ‘Yes They Can! …’
Empirical Effects of Dynamic Human-Body Blockage in 60 GHz Communications
Adaptive Full-Duplex Jamming Receiver for Secure D2D Links in Random Networks
Real-time Power System State Estimation and Forecasting via Deep Neural Networks
Boosting Search Performance Using Query Variations
Pure-Exploration for Infinite-Armed Bandits with General Arm Reservoirs
Effect Handling for Composable Program Transformations in Edward2
Orthogonal Policy Gradient and Autonomous Driving Application
Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos
Cops and robbers on oriented graphs
Exploiting Sentence Embedding for Medical Question Answering
Intersecting Families of Perfect Matchings
Recurrent approach to effective material properties with application to anisotropic binarized random fields
Improving Skin Condition Classification with a Question Answering Model
The autoregression bootstrap for kernel estimates of smooth nonlinear functional time series
Leveraging Financial News for Stock Trend Prediction with Attention-Based Recurrent Neural Network
Monochromatic Schur triples in randomly perturbed dense sets of integers
Implementing a Portable Clinical NLP System with a Common Data Model – a Lisp Perspective
A q-analogue and a symmetric function analogue of a result by Carlitz, Scoville and Vaughan
Phase retrieval for Bragg coherent diffraction imaging at high X-ray energies
Characterizing Design Patterns of EHR-Driven Phenotype Extraction Algorithms
Electric Vehicle Valet
GaitSet: Regarding Gait as a Set for Cross-View Gait Recognition
Intervention Aided Reinforcement Learning for Safe and Practical Policy Optimization in Navigation
Equivariant Perturbation in Gomory and Johnson’s Infinite Group Problem. VII. Inverse semigroup theory, closures, decomposition of perturbations
From Videos to URLs: A Multi-Browser Guide To Extract User’s Behavior with Optical Character Recognition
Face Verification and Forgery Detection for Ophthalmic Surgery Images
Distributed Obstacle and Multi-Robot Collision Avoidance in Uncertain Environments
Minimax Posterior Convergence Rates and Model Selection Consistency in High-dimensional DAG Models based on Sparse Cholesky Factors
Exotic Steiner Chains in Miquelian Möbius Planes
Combining Axiom Injection and Knowledge Base Completion for Efficient Natural Language Inference
Time-Varying Formation Control of a Collaborative Multi-Agent System Using Negative-Imaginary Systems Theory
Fano generalized Bott manifolds
Quantile Regression Modeling of Recurrent Event Risk
Predicting thermoelectric properties from crystal graphs and material descriptors – first application for functional materials
A Schur transform for spatial stochastic processes
Reward-estimation variance elimination in sequential decision processes
Graph Convolutional Neural Networks for Polymers Property Prediction
On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement
SGR: Self-Supervised Spectral Graph Representation Learning
Computing Quartet Distance is Equivalent to Counting 4-Cycles
Fundamental Limits of Caching in Heterogeneous Networks with Uncoded Prefetching
Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems
Network capacity enhancement with various link rewiring strategies
In-silico Feedback Control of a MIMO Synthetic Toggle Switch via Pulse-Width Modulation
On Graphs with equal Dominating and C-dominating energy
Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search
On Infinite Prefix Normal Words
Effect of correlations on routing and modeling of Time Varying Communication Networks
Image declipping with deep networks
Survey of Computational Approaches to Diachronic Conceptual Change
Decentralized Data Storages: Technologies of Construction
Guiding the One-to-one Mapping in CycleGAN via Optimal Transport
Sketch based Reduced Memory Hough Transform
On a new class of score functions to estimate tail probabilities of some stochastic processes with Adaptive Multilevel Splitting
Unique ergodicity for stochastic hyperbolic equations with additive space-time white noise
Local theorems for arithmetic compound renewal processes, when Cramer’s condition holds
End-to-End Learning for Answering Structured Queries Directly over Text
A Neurodynamical model of Saliency prediction in V1
Redundancy scheduling with scaled Bernoulli service requirements
Sum rules and large deviations for spectral matrix measures in the Jacobi ensemble
Effect of data reduction on sequence-to-sequence neural TTS
ShuffleDet: Real-Time Vehicle Detection Network in On-board Embedded UAV Imagery
Multivariate Spatiotemporal Hawkes Processes and Network Reconstruction
Impedance Network of Interconnected Power Electronics Systems: Impedance Operator and Stability Criterion
State Complexity Characterizations of Parameterized Degree-Bounded Graph Connectivity, Sub-Linear Space Computation, and the Linear Space Hypothesis
Hörmander’s theorem for semilinear SPDEs
Sparsely Observed Functional Time Series: Estimation and Prediction
Spatio-temporal Stacked LSTM for Temperature Prediction in Weather Forecasting
Deep Template Matching for Offline Handwritten Chinese Character Recognition
Physical Signal Classification Via Deep Neural Networks
Temporal viability regulation for control affine systems with applications to mobile vehicle coordination under time-varying motion constraints
State-dependent jump activity estimation for Markovian semimartingales
Cost of selfishness in the allocation of cities in the Multiple Travelling Salesmen Problem
On approximations of Value at Risk and Expected Shortfall involving kurtosis
Semi-perfect 1-Factorizations of the Hypercube
Leaf-induced subtrees of leaf-Fibonacci trees
Asynchronous Stochastic Composition Optimization with Variance Reduction
Offline Biases in Online Platforms: a Study of Diversity and Homophily in Airbnb
Pairwise Relational Networks using Local Appearance Features for Face Recognition
Neural Predictive Belief Representations
Chordal circulant graphs and induced matching number
LinkNet: Relational Embedding for Scene Graph
Characterising $k$-connected sets in infinite graphs
Effects of Beamforming and Antenna Configurations on Mobility in 5G NR
The Sliding Frank-Wolfe Algorithm and its Application to Super-Resolution Microscopy
Adversarial Examples from Cryptographic Pseudo-Random Generators
Learning to Bound the Multi-class Bayes Error
Latecomers Help to Meet: Deterministic Anonymous Gathering in the Plane
Introducing Multiobjective Complex Systems
On a minimum distance procedure for threshold selection in tail analysis
HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty
Energy Efficient Precoder in Multi-User MIMO Systems with Imperfect Channel State Information
Secretary Ranking with Minimal Inversions
Preliminary Studies on a Large Face Database
Adversarial Resilience Learning – Towards Systemic Vulnerability Analysis for Large and Complex Systems
A regularised Dean-Kawasaki model for weakly interacting particles
MIMO Beampattern and Waveform Design with Low Resolution DACs
Review of isolation enhancement with the help of Theory of characteristic modes
Psychophysical evaluation of individual low-level feature influences on visual attention
Oversampled Adaptive Sensing with Random Projections: Analysis and Algorithmic Approaches
The q-Hahn PushTASEP
A Note On Universal Point Sets for Planar Graphs
Fourier decay, Renewal theorem and Spectral gaps for random walks on split semisimple Lie groups
Exploring the Deep Feature Space of a Cell Classification Neural Network
Mathematical Analysis of Adversarial Attacks
Large-Scale Distributed Algorithms for Facility Location with Outliers
Development and Validation of a Deep Learning Algorithm for Improving Gleason Scoring of Prostate Cancer
Adjusting for Confounding in Unsupervised Latent Representations of Images
A Bayesian optimization approach to compute the Nash equilibria of potential games using bandit feedback
Anti-concentration in most directions
Tight Bayesian Ambiguity Sets for Robust MDPs
Kernel Smoothing of the Treatment Effect CDF
Reward learning from human preferences and demonstrations in Atari
On transfer learning using a MAC model variant
Learning to Predict the Cosmological Structure Formation
Predicting enterprise cyber incidents using social network analysis on the darkweb hacker forums


Convert Data Frame to Dictionary List in R

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In R, there are a couple of ways to convert a column-oriented data frame into a row-oriented dictionary list or the like, e.g. a list of lists.

In the code snippets below, I show each approach and how to extract keys and values from the dictionary. As the benchmark shows, the generic R data structure remains the most efficient.

x1 <- as.list(iris[1, ])
names(x1)
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
x1[["Sepal.Length"]]
# [1] 5.1

envn_dict <- function(x) {
  e <- new.env(hash = TRUE)
  for (name in names(x)) assign(name, x[, name], e)
  return(e)
}

x2 <- envn_dict(iris[1, ])
ls(x2)
# [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"
get("Sepal.Length", x2)
# [1] 5.1

coll_dict <- function(x) {
  d <- collections::Dict$new()
  for (name in names(x)) d$set(name, x[, name])
  return(d)
}

x3 <- coll_dict(iris[1, ])
x3$keys()
# [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"
x3$get("Sepal.Length")
# [1] 5.1

hash_dict <- function(x) {
  d <- hash::hash()
  for (name in names(x)) d[[name]] <- x[, name]
  return(d)
}

x4 <- hash_dict(iris[1, ])
hash::keys(x4)
# [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"
hash::values(x4, "Sepal.Length")
# Sepal.Length
#          5.1

data_dict <- function(x) {
  d <- datastructures::hashmap()
  for (name in names(x)) d[name] <- x[, name]
  return(d)
}

x5 <- data_dict(iris[1, ])
datastructures::keys(x5)
# [1] "Species"      "Sepal.Width"  "Petal.Length" "Sepal.Length" "Petal.Width"
datastructures::get(x5, "Sepal.Length")
# [1] 5.1

py2r_dict <- function(x) {
  return(reticulate::py_dict(names(x), x, TRUE))
}

x6 <- py2r_dict(iris[1, ])
x6$keys()
# [1] "Petal.Length" "Sepal.Length" "Petal.Width"  "Sepal.Width"  "Species"
x6$get("Sepal.Length")
# [1] 5.1

to_list <- function(df, fn) {
  l <- list()
  for (i in seq(nrow(df))) l[[i]] <- fn(df[i, ])
  return(l)
}

rbenchmark::benchmark(replications = 100, order = "elapsed", relative = "elapsed",
                      columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"),
  "BASE::LIST"              = to_list(iris, as.list),
  "BASE::ENVIRONMENT"       = to_list(iris, envn_dict),
  "COLLECTIONS::DICT"       = to_list(iris, coll_dict),
  "HASH::HASH"              = to_list(iris, hash_dict),
  "DATASTRUCTURES::HASHMAP" = to_list(iris, data_dict),
  "RETICULATE::PY_DICT"     = to_list(iris, py2r_dict)
)
#                     test replications elapsed relative user.self sys.self
#1              BASE::LIST          100   0.857    1.000     0.857    0.000
#2       BASE::ENVIRONMENT          100   1.607    1.875     1.607    0.000
#4              HASH::HASH          100   2.600    3.034     2.600    0.000
#3       COLLECTIONS::DICT          100   2.956    3.449     2.956    0.000
#5 DATASTRUCTURES::HASHMAP          100  16.070   18.751    16.071    0.000
#6     RETICULATE::PY_DICT          100  18.030   21.039    18.023    0.008

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.


If you did not already know

Halide
Halide is a computer programming language designed for writing digital image processing code that takes advantage of memory locality, vectorized computation and multi-core CPUs and GPUs. Halide is implemented as an internal domain-specific language (DSL) in C++. The main innovation Halide brings is the separation of the algorithm being implemented from its execution schedule, i.e. code specifying the loop nesting, parallelization, loop unrolling and vector instructions. These two are usually interleaved, and experimenting with changing the schedule requires the programmer to rewrite large portions of the algorithm with every change. With Halide, changing the schedule does not require any changes to the algorithm, and this allows the programmer to experiment with scheduling and find the most efficient schedule.
DNN Dataflow Choice Is Overrated

Learning Active Learning (LAL)
In this paper, we suggest a novel data-driven approach to active learning: Learning Active Learning (LAL). The key idea behind LAL is to train a regressor that predicts the expected error reduction for a potential sample in a particular learning state. By treating the query selection procedure as a regression problem we are not restricted to dealing with existing AL heuristics; instead, we learn strategies based on experience from previous active learning experiments. We show that LAL can be learnt from a simple artificial 2D dataset and yields strategies that work well on real data from a wide range of domains. Moreover, if some domain-specific samples are available to bootstrap active learning, the LAL strategy can be tailored for a particular problem. …

Dimensional Clustering
This paper introduces a new clustering technique, called {\em dimensional clustering}, which clusters each data point by its latent {\em pointwise dimension}, which is a measure of the dimensionality of the data set local to that point. Pointwise dimension is invariant under a broad class of transformations. As a result, dimensional clustering can be usefully applied to a wide range of datasets. Concretely, we present a statistical model which estimates the pointwise dimension of a dataset around the points in that dataset using the distance of each point from its $n^{\text{th}}$ nearest neighbor. We demonstrate the applicability of our technique to the analysis of dynamical systems, images, and complex human movements. …


Simple table size estimates and 128-bit numbers (Java Edition)

Suppose that you are given a table. You know the number of rows, as well as how many distinct values each column has. For example, you know that there are two genders (in this particular table). Maybe there are 73 distinct age values. For a concrete example, take the standard Adult data set, which is made of 48842 rows.

How many distinct entries do you expect the table to have? That is, if you remove all duplicate rows, what is the number of rows left?

There is a standard formula for this problem: Cardenas’ formula. It uses the simplistic model in which there is no relationship between the distinct columns. In practice, it will tend to overestimate the number of rows. However, despite its simplicity, it often works really well.

Let p be the product of all column cardinalities, and let n be the number of rows; then the Cardenas estimate is p * (1 - (1 - 1/p)^n). Simple, right?

You can implement it in Java easily enough…

double cardenas64(long[] cards, int n) {
    double product = 1;
    for(int k = 0;  k < cards.length; k++) {
      product *= cards[k];
    }
    return product
         * (1 - Math.pow(1 - 1.0/product, n));
}

So let us put in the numbers… my column cardinalities are 16,16,15,5,2,94,21648,92,42,7,9,2,6,73,119; and I have 48842 rows. So what is Cardenas’ prediction?

Zero.

At least, that’s what the Java function returns.

Why is that? The first problem is that 1 – 1/p is 1 when p is that large. And even if you could compute 1 – 1/p accurately enough, taking it to the power of 48842 is a problem.
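The collapse is easy to reproduce. The sketch below (my own, not from the post) multiplies the cardinalities above and shows that 1 - 1/p rounds to exactly 1 in double precision, so the naive formula returns zero:

```java
// Sketch (not from the original post): reproduces the double-precision
// collapse of the naive Cardenas formula for the cardinalities above.
public class CardenasRounding {
    public static void main(String[] args) {
        long[] cards = {16, 16, 15, 5, 2, 94, 21648, 92, 42, 7, 9, 2, 6, 73, 119};
        double p = 1;
        for (long c : cards) p *= c;   // p is roughly 2e21
        // 1/p is far below one ulp of 1.0, so the subtraction loses it entirely
        System.out.println(1.0 - 1.0 / p == 1.0);                    // true
        // hence the naive estimate is p * (1 - 1^n) = 0
        System.out.println(p * (1 - Math.pow(1 - 1.0 / p, 48842)));  // 0.0
    }
}
```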

So what do you do?

You can switch to something more accurate than double precision, namely quadruple precision (also called binary128). There are no native 128-bit floats in Java, but you can emulate them using the BigDecimal class. The code gets much uglier. Elegance aside, I assumed it would be a walk in the park, but I found that the implementation of the power function was numerically unstable, so I had to roll my own (from multiplications).

The core function looks like this…

double cardenas128(long[] cards, int n) {
    BigDecimal product = product(cards);
    BigDecimal oneover = BigDecimal.ONE.divide(product,
        MathContext.DECIMAL128);
    BigDecimal proba = BigDecimal.ONE.subtract(oneover,
        MathContext.DECIMAL128);
    proba = lemirepower(proba, n);
    return product.subtract(
       product.multiply(proba, MathContext.DECIMAL128),
       MathContext.DECIMAL128).doubleValue();
}
It scales up to billions of rows and up to products of cardinalities that do not fit in any of Java’s native type. Though the computation involves fancy data types, it is probably more than fast enough for most applications.
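There is also a well-known trick, not used in the post, for staying in double precision: rewrite p * (1 - (1 - 1/p)^n) as -p * expm1(n * log1p(-1/p)). Math.log1p and Math.expm1 are accurate for arguments near zero, so the tiny 1/p is never lost. A hedged sketch:

```java
// Alternative sketch (my own, not the post's BigDecimal approach):
// compute the Cardenas estimate entirely in double precision.
// p * (1 - (1 - 1/p)^n)  ==  -p * expm1(n * log1p(-1/p))
public class CardenasStable {
    static double cardenasStable(long[] cards, int n) {
        double p = 1;
        for (long c : cards) p *= c;
        return -p * Math.expm1(n * Math.log1p(-1.0 / p));
    }

    public static void main(String[] args) {
        long[] cards = {16, 16, 15, 5, 2, 94, 21648, 92, 42, 7, 9, 2, 6, 73, 119};
        // with p around 2e21 and n = 48842, nearly every row is expected
        // to be distinct, so the estimate is very close to 48842
        System.out.println(cardenasStable(cards, 48842));
    }
}
```

This trades the BigDecimal machinery for two library calls, at the cost of being limited to cardinality products that fit in a double.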

My source code is available.


Document worth reading: “Multi-Agent Reinforcement Learning: A Report on Challenges and Approaches”

Reinforcement Learning (RL) is a learning paradigm concerned with learning to control a system so as to maximize an objective over the long term. This approach to learning has received immense interest in recent times and success manifests itself in the form of human-level performance on games like \textit{Go}. While RL is emerging as a practical component in real-life systems, most successes have been in Single Agent domains. This report will instead specifically focus on challenges that are unique to Multi-Agent Systems interacting in mixed cooperative and competitive environments. The report concludes with advances in the paradigm of training Multi-Agent Systems called \textit{Decentralized Actor, Centralized Critic}, based on an extension of MDPs called \textit{Decentralized Partially Observable MDP}s, which has seen a renewed interest lately. Multi-Agent Reinforcement Learning: A Report on Challenges and Approaches


MongoDB Atlas: Connector for Apache Spark now Officially Certified for Azure Databricks

This is a guest blog from our partners at MongoDB
Bryan Reinero and Dana Groce

We are happy to announce that the MongoDB Connector for Apache Spark is now officially certified for Azure Databricks. MongoDB Atlas users can integrate Spark and MongoDB in the cloud for advanced analytics and machine learning workloads by using the MongoDB Connector for Apache Spark which is fully supported and maintained by MongoDB.

The MongoDB Connector for Apache Spark exposes all of Spark’s libraries, including Scala, Java, Python, and R. MongoDB data is materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming, and SQL APIs. The MongoDB Connector for Apache Spark can take advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data it needs – for example, analyzing all customers located in a specific geography. This is very different from simple NoSQL data stores that do not offer secondary indexes or in-database aggregations and require the extraction of all data based on a simple primary key, even if only a subset of that data is needed for the Spark process. This results in more processing overhead, more hardware, and longer time-to-insight for data scientists and engineers.

Additionally, MongoDB’s workload isolation makes it easy for users to efficiently process data drawn from multiple sources into a single database with zero impact on other business-critical database operations. Running Spark on MongoDB reduces operational overhead as well by greatly simplifying your architecture and increasing the speed at which analytics can be executed.

MongoDB Atlas, our on-demand, fully-managed cloud database service for MongoDB, makes it even easier to run sophisticated analytics processing by eliminating the operational overhead of managing database clusters directly. By combining Azure Databricks and MongoDB, Atlas users can benefit from a fully managed analytics platform, freeing engineering resources to focus on their core business domain and deliver actionable insights quickly.



The post MongoDB Atlas: Connector for Apache Spark now Officially Certified for Azure Databricks appeared first on Databricks.


RcppGetconf 0.0.3

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A second and minor update for the RcppGetconf package for reading system configuration — not unlike getconf from the libc library — is now on CRAN.

Changes are minor. We avoid an error on a long-dead operating system cherished in one particular corner of the CRAN world. In doing so some files were updated so that dynamically loaded routines are now registered too.

The short list of changes in this release follows:

Changes in RcppGetconf version 0.0.3 (2018-11-16)

  • Examples no longer run on Solaris where they appear to fail.

Courtesy of CRANberries, there is a diffstat report. More about the package is at the local RcppGetconf page and the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.


Book Memo: “Reinforcement Learning for Optimal Feedback Control”

A Lyapunov-Based Approach
Reinforcement Learning for Optimal Feedback Control develops model-based and data-driven reinforcement learning methods for solving optimal control problems in nonlinear deterministic dynamical systems. In order to achieve learning under uncertainty, data-driven methods for identifying system models in real-time are also developed. The book illustrates the advantages gained from the use of a model and the use of previous experience in the form of recorded data through simulations and experiments. The book’s focus on deterministic systems allows for an in-depth Lyapunov-based analysis of the performance of the methods described during the learning phase and during execution. To yield an approximate optimal controller, the authors focus on theories and methods that fall under the umbrella of actor-critic methods for machine learning. They concentrate on establishing stability during the learning phase and the execution phase, and adaptive model-based and data-driven reinforcement learning, to assist readers in the learning process, which typically relies on instantaneous input-output measurements.


Despite California’s inferno, global wildfires are fizzling out

Climate change makes fires worse, but agricultural development limits them


November 16, 2018

Announcing the Winners of the Facebook Crisis Informatics Research Awards

We are pleased to announce the winners of the Facebook Crisis Informatics Research Awards. As social media and messaging apps continue to become important communication tools in people’s everyday lives, they have also come to play important roles in how people prepare, respond to, and recover from disasters. With these research awards, Facebook is supporting new and innovative research in the area of crisis informatics to make social media tools more useful to people responding to or impacted by a disaster. This research will unlock new tools and products to improve disaster response and recovery. No Facebook data will be provided to award recipients.

The winning proposals represent a diverse range of problems, methodologies and academic disciplines. One proposal will use social media data to enhance available information on flood conditions in order to predict the impact and extent of floods. Another takes crisis informatics from a typical focus on natural-hazard induced disasters to one dealing with conflict zones, including the use of mixed methods to understand displacement and information flows among refugee populations. A third winning proposal will build statistical models that predict how populations will move in response to natural hazards, including sheltering in place, evacuations and returning to affected areas. This diverse slate of research stands to advance our understanding of and development of tools for dealing with crises, and will push the field of crisis informatics in new and exciting directions.

The three award winners and their topic areas are:

Capturing the Big Picture from Social Media for Global Flood Awareness
Carlos Castillo (Universitat Pompeu Fabra) and Milan Kalaš

Data & Design Methods for Supporting Crisis Informatics in Conflict Zones
Karen E. Fisher (University of Washington), Eaid Yafi (University Kuala Lumpur), Jevin West (University of Washington), Kate Starbird (University of Washington) and Reem Talhouk (University of Newcastle)

Real-time Prediction of Movement Patterns during Disaster Times
Cynthia Chen (University of Washington), Feilong Wang (University of Washington) and Xiangyang Guan (University of Washington)

Thank you to all the researchers that submitted proposals, and congratulations to the winners. To view our currently open research awards and to subscribe to our Research Awards email list, visit our Research Awards page.


Example of Overfitting

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

I occasionally see queries on various social media as to overfitting — what is it?, etc. I’ll post an example here. (I mentioned it at my talk the other night on our novel approach to missing values, but had a bug in the code. Here is the correct account.)

The dataset is prgeng, on wages of programmers and engineers in Silicon Valley as of the 2000 Census. It’s included in our polyreg package, which we developed as an alternative to neural networks. But it is quite useful in its own right, as it makes it very convenient to fit multivariate polynomial models. (Not as easy as it sounds; e.g. one must avoid cross products of orthogonal dummy variables, powers of those variables, etc.)

First I split the data into training and test sets:


> set.seed(88888)
> getPE()
> pe1 <- pe[,c(1,2,4,6,7,12:16,3)]
> testidxs <- sample(1:nrow(pe1),1000)
> testset <- pe1[testidxs,]
> trainset <- pe1[-testidxs,]

As a base, I fit an ordinary degree-1 model and found the mean absolute prediction error:

> lmout <- lm(wageinc ~ .,data=trainset)
> predvals <- predict(lmout,testset[,-11])
> mean(abs(predvals - testset[,11]))
[1] 25456.98

Next, I tried a quadratic model:

> pfout <- polyFit(trainset,deg=2)
> mean(abs(predict(pfout,testset[,-11]) 
   - testset[,11]))
[1] 24249.83

Note that, though originally there were 10 predictors, now with polynomial terms we have 46.

I kept going:

deg   MAPE       # terms
  3   23951.69       118
  4   23974.76       226
  5   24340.85       371
  6   24554.34       551
  7   36463.61       767
  8   74296.09      1019

One must keep in mind the effect of sampling variation, and repeated trials would be useful here, but it seems that the data can stand at least a cubic fit and possibly as much as degree 5 or even 6. To be conservative, it would seem wise to stop at degree 3. That’s also consistent with the old Tukey rule of thumb that we should have p < sqrt(n), n being about 20,000 here.

In any event, the effects of overfitting here are dramatic, starting at degree 7.

It should be noted that I made no attempt to clean the data, nor to truncate predicted values at 0, etc.

It should also be noted that, starting at degree 4, R emitted warnings, “prediction from a rank-deficient fit may be misleading.” It is well known that at high degrees, polynomial terms can have multicollinearity issues.

Indeed, this is a major point we make in our arXiv paper cited above. There we argue that neural networks are polynomial models in disguise, with the effective degree of the polynomial increasing a lot at each successive layer, and thus multicollinearity increasing from layer to layer. We’ve confirmed this empirically. We surmise that this is a major cause of convergence problems in NNs.

Finally, whenever I talk about polynomials and NNs, I hear the standard (and correct) concern that polynomials grow rapidly at the edges of the data. True, but I would point out that if you accept NNs = polynomials, then the same is true for NNs.

We’re still working on the polynomials/NNs project. More developments to be announced soon. But for those who are wondering about overfitting, the data here should make the point.

To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.


How to Choose an AI Vendor

This report explores why it is so challenging to choose an AI vendor and what you should consider as you seek a partner in AI. Download now.


UnitedHealth Group: Senior Principal Data Scientist [Telecommute, Central or Eastern Time Zones]

An exceptional Data Scientist with a passion for working with healthcare data, hands-on skills using various big-data and ML platform tools, and proven experience delivering business impact.


Distilled News

Prescriptive Maintenance for Manufacturing Industry

Today’s trend of Artificial Intelligence (AI) and the increased level of automation in manufacturing allow firms to flexibly connect assets and improve productivity through data-driven insights that have not been possible before. As more automation is used in manufacturing, the speed of response required in dealing with maintenance issues is going to get faster, and automated decisions as to the best option from an economic standpoint are getting more complex.

Best books on Artificial Intelligence and Deep Learning for October 2018

Machine Learning with C++ – Faster R-CNN with MXNet C++ Frontend

I published an implementation of Faster R-CNN with the MXNet C++ Frontend. You can use this implementation as a comprehensive example of using the MXNet C++ Frontend: it has a custom data loader for the MS Coco dataset, implements a custom target proposal layer as part of the project without modifying the MXNet library, contains code for error checking (missing in the current C++ API), and has Eigen and NDArray integration samples. Feel free to leave comments and proposals.

A deep dive into glmnet: standardize

I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

Make Beautiful Tables with the Formattable Package

I love the formattable package, but I always struggle to remember its syntax. A quick Google search reveals that I’m not alone in this struggle. This post is intended as a reminder for myself of how the package works – and hopefully you’ll find it useful too!

What is RQDA and what are its features?

RQDA is an R package for Qualitative Data Analysis, a free (free as in freedom) qualitative analysis software application (BSD license). It works on Windows, Linux/FreeBSD and Mac OSX platforms. RQDA is an easy-to-use tool to assist in the analysis of textual data. At the moment it only supports plain-text formatted data. All the information is stored in a SQLite database via the R package RSQLite. The GUI is based on RGtk2, via the aid of gWidgetsRGtk2. It includes a number of standard Computer-Aided Qualitative Data Analysis features. In addition it seamlessly integrates with R, which means that a) statistical analysis on the coding is possible, and b) functions for data manipulation and analysis can be easily extended by writing R functions. To some extent, RQDA and R make an integrated platform for both quantitative and qualitative data analysis.

Discourse Network Analysis: Undertaking Literature Reviews in R

Literature reviews are the cornerstone of science. Keeping abreast of developments within any given field of enquiry has become increasingly difficult given the enormous amounts of new research. Databases and search technology have made finding relevant literature easy but, keeping a coherent overview of the discourse within a field of enquiry is an ever more encompassing task. Scholars have proposed many approaches to analysing literature, which can be placed along a continuum from traditional narrative methods to systematic analytic syntheses of text using machine learning. Traditional reviews are biased because they rely entirely on the interpretation of the researcher. Analytical approaches follow a process that is more like scientific experimentation. These systematic methods are reproducible in the way literature is searched and collated but still rely on subjective interpretation. Machine learning provides new methods to analyse large swaths of text. Although these methods sound exciting, they are incapable of providing insight. Machine learning cannot interpret a text; it can only summarise and structure a corpus, and it still requires human interpretation to make sense of the information. This article introduces a mixed-method technique for reviewing literature, combining qualitative and quantitative methods. I used this method to analyse literature published by the International Water Association as part of my dissertation into water utility marketing. You can read the code below, or download it from GitHub. Detailed information about the methodology is available through FigShare.

Is the Food Here Yet?

Delivery time prediction has long been a part of city logistics, but refining accuracy has recently become very important for services such as Deliveroo, Foodpanda and Uber Eats which deliver food on-demand. These services and similar ones must receive an order and have it delivered within ~30 minutes to appease their users. In these situations +/- 5 minutes can make a big difference so it’s very important for customer satisfaction that the initial prediction is highly accurate and that any delays are communicated effectively.

How does Scaling change Principle Components? – Part 1

Assume that we have a some data that looks like above, and red arrow shows the first principle component, while the blue arrow shows the second principle component. How would normalization as well as standardization change those two? Also can we perform some kind of normalization respect to the variance we have? (I will make more post as I go along. )

Algorithmic Complexity [101]

Computers are fast: they store and manipulate data using electronic signals that travel across their silicon internals at hundreds of thousands of miles per hour. For comparison, the fastest signals in the human nervous system travel at about 250mph, which is about 3 million times slower, and those speeds are only possible for unconscious signals – signal speeds observed for conscious thought and calculation are typically orders of magnitude slower still. Basically, we’re never going to be able to out calculate a computer.

Speed up your deep learning language model up to 1000% with the adaptive softmax, Part 1

How would you like to speed up your language modeling (LM) tasks by 1000%, with nearly no drop in accuracy? A recent paper from Grave et al. (2017), called ‘Efficient softmax approximation for GPUs’, shows how you can gain a massive speedup in one of the most time-consuming aspects of language-modeling, the computation-heavy softmax step, through their ‘adaptive softmax’. The giant speedup from using the adaptive softmax comes with only minimal costs in accuracy, so anyone who is doing language modeling should definitely consider using it. Here in Part 1 of this blog post, I’ll fully explain the adaptive softmax, then in Part 2 I’ll walk you step by step through a Pytorch implementation (with an accompanying Jupyter notebook), which uses Pytorch’s built-in AdaptiveLogSoftmaxWithLoss function.

Speed up your deep learning language model up to 1000% with the adaptive softmax, Part 2: Pytorch implementation

In Part 1 of this blog post, I explained how the adaptive softmax works, and how it can speed up your language model by up to 1000%. Here in Part 2, I’ll walk you step by step through a Pytorch implementation (with an accompanying Jupyter notebook), which uses Pytorch’s built-in AdaptiveLogSoftmaxWithLoss function. For preprocessing you will need fastai (see ), a deep learning library that runs on top of Pytorch that simplifies training neural networks. [For those who want to learn state-of-the-art deep learning techniques, I highly recommend Jeremy Howard’s course, which is available online for free: https://…/]. I decided to use the Wikitext-2 dataset, which is a relatively small dataset that contains ~2 million tokens and ~33 thousand words in the vocabulary. Once the data has been downloaded and properly formatted in csv files, fastai makes it easy to quickly tokenize, numericalize, and create a data-loader that will be used for training. I also downloaded GloVe pre-trained word vectors that are used for the models’ word embeddings. Finally, I created a modeler class that handles training.

Continue Reading…


Read More

Because it's Friday: The physics of The Expanse

For a science fiction show set hundreds of years in the future, The Expanse is unusual in that it takes very few liberties with science as we understand it today. The solar system is made up of the familiar planets we know (plus the colonists and space stations spread throughout the system), communication is limited by the speed of light, and spaceships operate under standard Newtonian principles. You can get a good sense from this highlight reel of the visual effects of the show:

The series was recently saved from cancellation by Amazon: the first three seasons are streaming on Amazon Prime now (in the US at least) and I can't wait for the fourth season and the continuation of the excellent story, which hews closely to the source novels.

That's all from the blog for this week. Next week will be a short week for us with US Thanksgiving coming up: if you're celebrating, have a great holiday!


Continue Reading…


Read More

Magister Dixit

“1. Think carefully about which projects you take on.
2. Use as much data as you can from as many places as possible.
3. Don’t just use internal customer data.
4. Have a clear sampling strategy.
5. Always use a holdout sample.
6. Spend time on ‘throwaway’ modelling.
7. Refresh your model regularly.
8. Make sure your insights are meaningful to other people.
9. Use your model in the real world.”
Rachel Clinton (January 7, 2015)

Continue Reading…


Read More

UnitedHealth Group: Data Analytics and Reporting Lead [Minnetonka, MN or Telecommute]

This would be a hands-on technical lead role that will require end-to-end ownership of our UnitedHealthcare Fin360 applications.

Continue Reading…


Read More

UnitedHealth Group: Clinical Data Statistical Analyst – SQL SAS (Clinician Required) [Telecommute]

Leverage your data analytic and project management skills to lead programs that focus on improving HEDIS rates and impacting the quality of care for our members.

Continue Reading…


Read More

R Packages worth a look

Header-Only C++ Mathematical Optimization Library for ‘Armadillo’ (RcppEnsmallen)
‘Ensmallen’ is a templated C++ mathematical optimization library (by the ‘MLPACK’ team) that provides a simple set of abstractions for writing an object …

Network Tools for Memory Research (memnet)
Efficient implementations of network science tools to facilitate research into human (semantic) memory. In its current version, the package contains se …

Sequential Quadratic Programming for Fast Maximum-Likelihood Estimation of Mixture Proportions (mixsqp)
Provides optimization algorithms based on sequential quadratic programming (SQP) for maximum likelihood estimation of the mixture proportions in a fini …

Continue Reading…


Read More

The tidy caret interface in R

(This article was first published on poissonisfish, and kindly contributed to R-bloggers)

Can human experience and chemoinformatics unravel the chemistry of smell? We are about to find out in the new world of caret [source]

Among the most popular off-the-shelf machine learning packages available for R, caret ought to stand out for its consistency. It reaches out to a wide range of dependencies that deploy and support model building using a uniform, simple syntax. I have been using caret extensively for the past three years, with a precious partial least squares (PLS) tutorial in my track record.

A couple of years ago, the creator and maintainer of caret, Max Kuhn, joined RStudio, where he has been contributing new packages to the ongoing tidy-paranoia – the supporting recipes, yardstick, rsample and many other packages that are part of the tidyverse paradigm and that I knew little about. As it happens, caret is now best used with some of these. As an aspiring data scientist with fashionable hex-stickers on my laptop cover and a tendency to start any sentence with ‘big data’, I set out to learn the tidyverse and went Super Mario using pipes (%>%, Ctrl + Shift + M).

Overall, I found the ‘tidy’ approach quite enjoyable and efficient. Besides streamlining model design in a sequential manner, recipes and the accompanying ‘tidy’ packages fix a common problem in the statistical learning realm that curtails predictive accuracy: data leakage. You can now generate a ‘recipe’ that lays out the plan for all the feature engineering (or data pre-processing, depending on how many hex-stickers you have) and execute it separately for validation and resampling splits (i.e. split first, process later), thus providing a more robust validation scheme. It also enables recycling any previous computations afterwards.
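
To make the leakage point concrete, here is a minimal sketch (in Python for neutrality; the numbers and helper names are mine) of the ‘split first, process later’ discipline that recipes enforces: standardisation parameters are estimated on the training split only, then applied, frozen, to the held-out split.

```python
def fit_scaler(train_column):
    # Estimate centering and scaling parameters from the training data only.
    n = len(train_column)
    mean = sum(train_column) / n
    var = sum((x - mean) ** 2 for x in train_column) / (n - 1)
    return mean, var ** 0.5

def apply_scaler(column, mean, sd):
    # Apply previously fitted parameters without re-estimating anything.
    return [(x - mean) / sd for x in column]

train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 0.0]

mean, sd = fit_scaler(train)           # parameters come from train only
train_z = apply_scaler(train, mean, sd)
test_z = apply_scaler(test, mean, sd)  # test is transformed, never fitted
```

Fitting the scaler on the full dataset instead would let the test values shift the mean and spread, quietly leaking hold-out information into preprocessing.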

I drew a lot of inspiration from a webinar and a presentation from Max himself to come up with this tutorial. Max is now hands-on writing a new book, entitled Feature Engineering and Selection: A Practical Approach for Predictive Models, that will include applications of caret supported by recipes. I very much look forward to reading it, considering he did a fantastic job with Applied Predictive Modeling.

The aim here is to test this new caret framework on the DREAM Olfaction Prediction Challenge, a crowd-sourced initiative launched in January 2015. This challenge demonstrated that predictive models can identify perceptual attributes (e.g. coffee-like, spicy and sweet notes) in a collection of fragrant compounds from physicochemical features (e.g. atom types). Results and conclusions were condensed into a research article [1] that reports an impressive predictive accuracy for a large number of models and teams, over various perceptive attributes. Perhaps more impressive yet is that this was achieved despite the very scarce knowledge of human odour receptors and their molecular activation mechanisms. Our task, more concretely, will be to predict population-level pleasantness of hundreds of compounds from physicochemical features, using the training set of the competition.

Finally, a short announcement. Many people have been raising reproducibility issues with previous entries, due to either package archiving (e.g. GenABEL, for genetic mapping) or cross-incompatibility (e.g. caret, for PLS). There were easy fixes for the most part, but I have nevertheless decided to start packing scripts together with sessionInfo() reports into their own GitHub repositories; the bundle for this tutorial lives in one such repository. I hope this will help track down the exact package versions and system configurations I use, so that anyone can fully reproduce these results at any time. No hex-stickers required!

NOTE that most code chunks containing pipes are corrupted. PLEASE refer to the materials from the repo.

The DREAM Olfaction Prediction Challenge

The DREAM Olfaction Prediction Challenge training set consists of 338 compounds characterised by 4,884 physicochemical features (the design matrix), and profiled by 49 subjects with respect to 19 semantic descriptors, such as ‘flower’, ‘sweet’ and ‘garlic’, together with intensity and pleasantness (all perceptual attributes). Two different dilutions were used per compound. The perceptual attributes were given scores in the 0-100 range. The two major goals of the competition were to use the physicochemical features in modelling i) the perceptual attributes at the subject-level and ii) both the mean and standard deviation of the perceptual attributes at the population-level. How ingenious it was to model the standard deviation, the variability in how something tastes to different people! In the end, the organisers garnered important insights about the biochemical basis of flavour, from teams all over the world. The resulting models were additionally used to reverse-engineer nonexistent compounds from target sensory profiles – an economically exciting prospect.

According to the publication, the top-five best predicted perceptual attributes at the population-level (i.e. mean values across all 49 subjects) were ‘garlic’, ‘fish’, ‘intensity’, pleasantness and ‘sweet’ (cf. Fig. 3A in [1]), all exhibiting average prediction correlations of r ≥ 0.5, far greater than the average subject-level predictions (cf. Fig. 2D in [1]). This should come as no surprise, since averaging over subjects is more likely to flesh out universal mechanisms. This averaging also helps stabilise the subject-level variance – no subject will necessarily utilise the full 0-100 scoring range; some will tend to use narrow intervals (low variance) while others will use the whole range instead (high variance).

Since pleasantness is one of the most general olfactory properties, I propose modelling the population-level median pleasantness of all compounds, from all physicochemical features. The median, as opposed to the mean, is less sensitive to outliers and an interesting departure from the original approach we can later compare against.
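
A two-line illustration of that robustness, with made-up scores for a single hypothetical compound:

```python
# One subject used the extreme of the scale; the mean is dragged upwards
# while the median stays with the bulk of the ratings.
scores = [40, 42, 44, 45, 100]

mean = sum(scores) / len(scores)
median = sorted(scores)[len(scores) // 2]
print(mean, median)  # 54.2 44
```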

Let’s get started with R

We can start off by loading all the necessary packages and pulling the data directly from the DREAM Olfaction Prediction GitHub repository. More specifically, we will create a directory called data and import both the profiled perceptual attributes (TrainSet.txt) and the physicochemical features that comprise our design matrix (molecular_descriptors_data.txt). Because we are tidy people, we will use readr to parse these as tab-separated values (TSV). I also suggest re-writing the column name Compound Identifier in the sensory profiles into CID, to match with that from the design matrix molFeats.

# Mon Oct 29 13:17:55 2018 ------------------------------
## DREAM olfaction prediction challenge

# Create directory and download files
ghurl <- ""
download.file(paste0(ghurl, "TrainSet.txt"),
destfile = "data/trainSet.txt")
download.file(paste0(ghurl, "molecular_descriptors_data.txt"),
destfile = "data/molFeats.txt")

# Read files with readr, select least diluted compounds
responses <- read_tsv("data/trainSet.txt") %>%
      rename(CID = `Compound Identifier`)

molFeats <- read_tsv("data/molFeats.txt") # molecular features

Next, we filter compounds that are common to both datasets and reorder them accordingly. The profiled perceptual attributes span a total of 49 subjects, two different dilutions and some replications (cf. Fig. 1A in [1]). Determine the median pleasantness (i.e. VALENCE/PLEASANTNESS) per compound, across all subjects and dilutions while ignoring missingness with na.rm = T. The last line will ensure the order of the compounds is identical between this new dataset and the design matrix, so that we can subsequently introduce population-level pleasantness as a new column termed Y in the new design matrix X. We no longer need CID so we can discard it.

Regarding missingness, I found maxima of 0.006% and 23% NA content over all columns and rows, respectively. A couple of compounds could be excluded but I would move on.

# Determine intersection of compounds in features and responses
commonMols <- intersect(responses$CID, molFeats$CID)
responses <- filter(responses, CID %in% commonMols)
molFeats <- filter(molFeats, CID %in% commonMols)

# Median pleasantness per compound, across all subjects and dilutions
medianPlsnt <- responses %>%
      group_by(CID) %>%
      dplyr::summarise(pleasantness = median(`VALENCE/PLEASANTNESS`, na.rm = T)) %>%
      arrange(match(CID, molFeats$CID))
all(medianPlsnt$CID == molFeats$CID) # TRUE - rownames match

# Concatenate predictors (molFeats) and population pleasantness
X <- molFeats %>%
      mutate(Y = medianPlsnt$pleasantness) %>%
      select(-CID)

In examining the structure of the design matrix, you will see there are many skewed binary variables. In the most extreme case, if a binary predictor is all zeros or all ones throughout, it is said to be a zero-variance predictor that, if anything, will harm the model training process. There is still the risk of having near-zero-variance predictors, which can be controlled based on various criteria (e.g. number of samples, size of splits). Of course, this can also affect quantitative, continuous variables. Here we will use nearZeroVar from caret to identify faulty predictors that are either zero-variance or display values whose frequency exceeds a predefined threshold. Having a 5x repeated 5-fold cross-validation in mind, I suggest setting freqCut = 4, which will rule out predictors – binary or continuous – whose most common value occurs at least four times as often as their second most common value (for a binary variable, an 80/20 split or worse). See ?nearZeroVar for more information. From 4,870 features we are left with 2,554 – a massive reduction by almost a half.
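
For intuition, this Python sketch mirrors only the freqCut half of the rule caret::nearZeroVar applies (the real function also considers a uniqueCut criterion on the percentage of distinct values):

```python
from collections import Counter

def freq_ratio(column):
    # Ratio of the most common value's count to the second most common's.
    counts = Counter(column).most_common(2)
    if len(counts) < 2:
        return float("inf")  # zero-variance: a single constant value
    return counts[0][1] / counts[1][1]

def near_zero_var(column, freq_cut=4):
    # Flag the column when the frequency ratio meets or exceeds freq_cut.
    return freq_ratio(column) >= freq_cut

print(near_zero_var([0] * 85 + [1] * 15))  # True  (85/15 > 4)
print(near_zero_var([0] * 60 + [1] * 40))  # False (60/40 < 4)
```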

# Filter nzv
X <- X[,-nearZeroVar(X, freqCut = 4)] # == 80/20

Now we will use rsample to define the train-test and cross-validation splits. Please use set.seed as it is, to ensure you generate the same partitions and arrive at the same results. Here I have allocated 10% of the data to the test set; by additionally requesting stratification based on the target Y, we are sure to have a representative subset. We can next pass the new objects to training and testing to effectively split the design matrix into train and test sets, respectively. The following steps consist of setting up a 5x repeated 5-fold cross-validation for the training set. Use vfold_cv and convert the output to a caret-like object via rsample2caret. Next, initialise a trainControl object and overwrite index and indexOut using the corresponding elements in the newly converted vfold_cv output.

# Split train/test with rsample
initSplit <- initial_split(X, prop = .9,
strata = "Y")
trainSet <- training(initSplit)
testSet <- testing(initSplit)

# Create 5-fold cross-validation, convert to caret class
myFolds <- vfold_cv(trainSet, v = 5, repeats = 5) %>%
      rsample2caret()
ctrl <- trainControl(method = "cv",
selectionFunction = "oneSE")
ctrl$index <- myFolds$index
ctrl$indexOut <- myFolds$indexOut

Prior to modelling, I will create two reference variables – binVars, to identify binary variables, and missingVars, to identify any variables containing missing values. These will help with i) excluding binary variables from mean-centering and unit-variance scaling, and ii) restricting mean-based imputation to variables that do contain missing values, respectively.

# binary vars
binVars <- which(sapply(X, function(x){all(x %in% 0:1)}))
missingVars <- which(apply(X, 2, function(k){any(is.na(k))}))

Recipes are very simple in design. The call to recipe is first given the data it is supposed to be applied on, as well as a lm-style formula. The recipes package contains a family of functions with a step_... prefix, involved in encodings, date conversions, filters, imputation, normalisation, dimension reduction and a lot more. These can be linked using pipes, forming a logical sequence of operations that starts with the recipe call. Supporting functions such as all_predictors, all_outcomes and all_nominal delineate specific groups of variables at any stage in the sequence. One can also use the names of the variables, akin to my usage of binVars and missingVars.

I propose writing a basic recipe myRep that does the following:

  • Yeo-Johnson [2] transformation of all quantitative predictors;
  • Mean-center all quantitative predictors;
  • Unit-variance scale all quantitative predictors;
  • Impute missing values with the mean of the incident variables.
# Design recipe
myRec <- recipe(Y ~ ., data = trainSet) %>%
      step_YeoJohnson(all_predictors(), -binVars) %>%
      step_center(all_predictors(), -binVars) %>%
      step_scale(all_predictors(), -binVars) %>%
      step_meanimpute(missingVars)

In brief, the Yeo-Johnson procedure transforms variables to be distributed as close to Gaussian as possible. Before delving into the models, let us tweak the recipe and assign it to pcaRec, conduct a principal component analysis and examine how different compounds distribute along the first two principal components. Colour them based on their population-level pleasantness – blue for very pleasant, red for very unpleasant and the gradient in between, via colGrad.
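
For the curious, the transform itself is a simple piecewise power function; the following is a direct transcription of the Yeo & Johnson (2000) definition for a single value x and parameter lambda (in R, step_YeoJohnson estimates lambda for each column for you):

```python
import math

def yeo_johnson(x, lmbda):
    # Piecewise power transform from Yeo & Johnson (2000).
    if x >= 0:
        if abs(lmbda) > 1e-12:
            return ((x + 1) ** lmbda - 1) / lmbda
        return math.log(x + 1)          # lambda == 0
    if abs(lmbda - 2) > 1e-12:
        return -(((-x + 1) ** (2 - lmbda) - 1) / (2 - lmbda))
    return -math.log(-x + 1)            # lambda == 2

print(yeo_johnson(3.0, 1.0))   # 3.0: lambda = 1 leaves values unchanged
print(yeo_johnson(-2.0, 2.0))  # -log(3), the negative branch
```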

# simple PCA, plot
pcaRec <- myRec %>%
      step_pca(all_predictors(), num = 2)

myPCA <- pcaRec %>%
      prep() %>%
      juice()
colGrad <- trainSet$Y/100 # add color

plot(myPCA$PC1, myPCA$PC2,
col = rgb(1 - colGrad, 0, colGrad,.5),
pch = 16, xlab = "PC1", ylab = "PC2")
legend("topleft", pch = 16,
col = rgb(c(0,.5,1), 0, c(1,.5,0), alpha = .5),
legend = c("Pleasant", "Neutral", "Unpleasant"), bty = "n")

The compounds do not seem to cluster into groups, nor do they clearly separate with respect to pleasantness.


Note that pcaRec itself is still just a recipe, waiting to be applied. Except for recipes passed to caret::train, to process and extract the data as instructed you need to either ‘bake’ or ‘juice’ the recipe. The difference between the two is that ‘juicing’ outputs the data associated to the recipe (e.g. the training set), whereas ‘baking’ can be applied to process any other dataset. ‘Baking’ is done with bake, provided the recipe was trained using prep. I hope this explains why I used juice above!

Next we have the model training. I propose training six regression models – k-nearest neighbours (KNN), elastic net (ENET), support vector machine with a radial basis function kernel (SVM), random forests (RF), extreme gradient boosting (XGB) and Quinlan’s Cubist (CUB) – aiming at minimising the root mean squared error (RMSE). Note I am using tuneGrid and tuneLength interchangeably: I would rather supply predefined hyper-parameters to simple models I am more familiar with, and run the rest with a number of defaults via tuneLength.

Parallelise the computations if possible. With macOS, I can use registerDoMC from the doMC package to harness multi-threading of the training procedure. If you are running a different OS, please use a different library if necessary. To the best of my knowledge, doMC will also work in Linux but Windows users might need to use the doParallel package instead.

# Train
knnMod <- train(myRec, data = trainSet,
method = "knn",
tuneGrid = data.frame(k = seq(5, 25, by = 4)),
trControl = ctrl)

enetMod <- train(myRec, data = trainSet,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, length.out = 5),
lambda = seq(.5, 2, length.out = 5)),
trControl = ctrl)

svmMod <- train(myRec, data = trainSet,
method = "svmRadial",
tuneLength = 8,
trControl = ctrl)

rfMod <- train(myRec, data = trainSet,
method = "ranger",
tuneLength = 8,
num.trees = 5000,
trControl = ctrl)

xgbMod <- train(myRec, data = trainSet,
method = "xgbTree",
tuneLength = 8,
trControl = ctrl)

cubMod <- train(myRec, data = trainSet,
method = "cubist",
tuneLength = 8,
trControl = ctrl)

modelList <- list("KNN" = knnMod,
"ENET" = enetMod,
"SVM" = svmMod,
"RF" = rfMod,
"XGB" = xgbMod,
"CUB" = cubMod)

Once the training is over, we can compare the optimal six cross-validated models based on their RMSE across all 5 × 5 = 25 resamples, using some bwplot magic sponsored by lattice. In my case the runtime was unusually long (>2 hours) compared to previous releases.

bwplot(resamples(modelList),
      metric = "RMSE")

The cross-validated RMSE does not vary considerably across the six optimal models. To conclude, I propose creating a model ensemble by simply taking the average predictions of all six optimal models on the test set, to compare observed and predicted population-level pleasantness in this hold-out subset of compounds.

# Validate on test set with ensemble
allPreds <- sapply(modelList, predict, newdata = testSet)
ensemblePred <- rowSums(allPreds) / length(modelList)

# Plot predicted vs. observed; create PNG
plot(ensemblePred, testSet$Y,
xlim = c(0,100), ylim = c(0,100),
xlab = "Predicted", ylab = "Observed",
pch = 16, col = rgb(0, 0, 0, .25))
abline(a=0, b=1)

writeLines(capture.output(sessionInfo()), "sessionInfo")

The ensemble fit on the test set is not terrible – slightly underfit, predicting the poorest in the two extremes of the pleasantness range. All in all, it attained a prediction correlation of r ≈ 0.73, which is slightly larger than the mean reported (cf. Fig. 3A in [1]). However, note that there are only 31 compounds in the test set. A model like this could easily be moved into production or repurposed to solve a molecular structure from a sensory profile of interest, as mentioned earlier.


The revamped caret framework is still under development, but it seems to offer

  • A substantially expanded, unified syntax for models and utils. Keep an eye on textrecipes, an upcoming complementary package for processing textual data;
  • A sequential streamlining of extraction, processing and modelling, enabling recycling of previous computations;
  • Executing-after-splitting, thereby precluding leaky validation strategies.

On the other hand, I hope to see improvements in

  • Decoupling from proprietary frameworks in the likes of tidyverse, although there are alternatives;
  • The runtime. I suspect the recipes themselves are the culprit, not the models. Future updates will certainly fix this.

Regarding the DREAM Olfaction Prediction Challenge, we were able to predict population-level perceived pleasantness from physicochemical features with an accuracy as good as that achieved, on average, by the competing teams. Our model ensemble seems underfit, judging from the narrow range of predicted values. Maybe we could be less restrictive about the near-zero-variance predictors and use a repeated 3-fold cross-validation.

I hope you had as much fun as I did in dissecting a fascinating problem while unveiling a bit of the new caret. You can use this code as a template for exploring different recipes and models. If you encounter any problems or have any questions do not hesitate to post a comment underneath – we all learn from each other!


[1] Keller, Andreas et al. (2017). Predicting human olfactory perception from chemical features of odor molecules. Science, 355(6327), 820-826.

[2] Yeo, In-Kwon and Johnson, Richard (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87(4), 954-959.

To leave a comment for the author, please follow the link and comment on their blog: poissonisfish.

Continue Reading…


Read More

Top 10 Python Data Science Libraries

The third part of our series investigating the top Python Libraries across Machine Learning, AI, Deep Learning and Data Science.

Continue Reading…


Read More

Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4

Try this notebook in Databricks

Apache Spark 2.4 introduces 29 new built-in functions for manipulating complex types (for example, array type), including higher-order functions.

Before Spark 2.4, for manipulating the complex types directly, there were two typical solutions: 1) exploding the nested structure into individual rows, applying some functions, and then creating the structure again; 2) building a User Defined Function (UDF).

In contrast, the new built-in functions can directly manipulate complex types, and the higher-order functions can manipulate complex values with an anonymous lambda function similar to UDFs but with much better performance.

In this blog, through some examples, we’ll show some of these new built-in functions and how to use them to manipulate complex data types.

Typical solutions

Let’s review the typical solutions with the following examples first.

Option 1 – Explode and Collect

We use explode to break the array into individual rows and evaluate val + 1, and then use collect_list to restructure the array as follows:

SELECT id,
       collect_list(val + 1) AS vals
FROM   (SELECT id,
               explode(vals) AS val
        FROM input_tbl) x
GROUP BY id

This is error-prone and inefficient for three reasons. First, we have to be diligent to ensure that the re-collected arrays are made exactly from the original arrays, by grouping them by the unique key. Second, we need a group-by, which means a shuffle operation – and a shuffle operation is not guaranteed to keep the element order of the re-collected array the same as in the original array. And finally, it is expensive.

Option 2 – User Defined Function

Next, we use a Scala UDF which takes a Seq[Int] and adds 1 to each element in it:

def addOne(values: Seq[Int]): Seq[Int] = { => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_: Seq[Int]): Seq[Int])

or we can also use Python UDF, and then:

SELECT id, plusOneInt(vals) as vals FROM input_tbl

This is simpler and faster and does not suffer from correctness pitfalls, but it might still be inefficient because the data serialization into Scala or Python can be expensive.

You can see these examples in a notebook from a blog we published earlier, and try them yourself.

New Built-in Functions

Let’s see the new built-in functions for manipulating complex types directly. The notebook lists the examples for each function. The signatures and arguments for each function are annotated with types: T and U denote array element types, and K and V denote map key and value types.

Higher-Order Functions

For further manipulation of array and map types, Spark uses the familiar SQL syntax for anonymous lambda functions, and the higher-order functions take these lambda functions as arguments.

The syntax for the lambda function is as follows:

 argument -> function body
  (argument1, argument2, ...) -> function body

The left side of the symbol -> defines the argument list, and the right side defines the function body which can use the arguments and other variables in it to calculate the new value.

Transform with Anonymous Lambda Function

Let’s see the example with transform function that employs an anonymous lambda function.

Here we have a table of data that contains three columns: key, an integer; values, an array of integers; and nested_values, an array of arrays of integers.

key  values     nested_values
1    [1, 2, 3]  [[1, 2, 3], [], [4, 5]]

When we execute the following SQL:

SELECT TRANSFORM(values, element -> element + 1) FROM data;

the transform function iterates over the array and applies the lambda function, adding 1 to each element, and creates a new array.

We can also use variables other than the arguments in the lambda function, for example key, which comes from the outer context – a column of the table:

SELECT TRANSFORM(values, element -> element + key) FROM data;

If you want to manipulate a deeply nested column, like nested_values in this case, you can use the nested lambda functions:

SELECT TRANSFORM(nested_values,
    arr -> TRANSFORM(arr,
      element -> element + key + SIZE(arr)))
  FROM data;

You can use key and arr in the inner lambda function; they come from the outer context – a column of the table and the argument of the outer lambda function, respectively.
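
If it helps, the nested TRANSFORM above is analogous to a nested comprehension in ordinary Python, with key and arr captured from the enclosing scopes:

```python
key = 1
nested_values = [[1, 2, 3], [], [4, 5]]

result = [
    [element + key + len(arr) for element in arr]  # inner lambda: uses key and arr
    for arr in nested_values                       # outer lambda: one array at a time
]
print(result)  # [[5, 6, 7], [], [7, 8]]
```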

Note that the notebook also shows the same examples implemented with the typical solutions, and examples of the other higher-order functions are included in the notebook for the built-in functions.


Spark 2.4 introduced 24 new built-in functions, such as array_union, array_max/min, etc., and 5 higher-order functions, such as transform, filter, etc. for manipulating complex types. The whole list and their examples are in this notebook. If you have any complex values, consider using them and let us know of any issues.

We would like to thank contributors from the Apache Spark community Alex Vayda, Bruce Robbins, Dylan Guedes, Florent Pepin, H Lu, Huaxin Gao, Kazuaki Ishizaki, Marco Gaido, Marek Novotny, Neha Patil, Sandeep Singh, and many others.

Read More

To find out more about higher-order and built-in functions, see the following resources:

  1. Try the accompanying notebook
  2. Read the previous blog on higher-order functions
  3. Watch the Spark + AI Summit Europe Talk on Higher-Order Functions



The post Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4 appeared first on Databricks.

Continue Reading…


Read More

Book of hand-painted ski maps

When you go skiing or snowboarding, you get a map of the mountain that shows the terrain and where you can go. James Niehues is the man behind many of these hand-painted ski maps around the world, and he has a Kickstarter to catalog his life’s work.

This is kind of amazing. I went skiing a lot as a kid, and I have distinct memories of these maps. I would stand at the top of the mountain, rip off one of my gloves with my teeth, and then pull out a folded map from a zipped pocket. I never knew they were by the same man, but in retrospect, it makes sense.

Tags: ,

Continue Reading…


Read More

Using Uncertainty to Interpret your Model

We outline why you should care about uncertainty and discuss the different types, including model, data and measurement uncertainty and what different purposes these all serve.

Continue Reading…


Read More

Report from the Enterprise Applications of the R Language conference

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Mark Niemann-Ross 

Mango Solutions presented the EARL conference in Seattle this November and I was fortunate enough to have time to attend. During the conference I took notes on the presentations, which I’ll pass along to you.

Thoughts on the future of R in industry

The EARL conference started with a panel discussion on R. Moderated by Doug Ashton from Mango Solutions, the panel included Julia Silge with Stack Overflow, David Smith with Microsoft, and Joe Cheng with Rstudio.

Topics ranged from the tidyverse, R certification, R as a general purpose language, Shiny, and the future of R. I captured some interesting quotes from the panel members:

Regarding R certification, David Smith points out that certifications are important for big companies trying to filter employment applications. He mentions that certification is a minimum bar for some HR departments. Julia Silge mentions that Stack Overflow has found certifications to be less influential in the hiring process.

R as a general purpose language: Joe Cheng feels that R is useful for more than just statistics, but that Rstudio isn’t interested in developing general purpose tools. There was discussion around Python as the “second best” language for a lot of applications, and an agreement that R should remain focused on data science.

Most interesting was the discussion regarding the future of R. Julia Silge points out that Stack Overflow data shows R growing fast year over year — at about the same rate as Python. There are a lot of new users, and packages need to take that into account.

I learned more about Natural Language Processing

Tim Oldfield introduces this conference as NOT a code-based conference. However, Julia Silge doesn't hesitate to illustrate her discussion with appropriate code. And seriously, how would you discuss natural language processing without using the language of the trade?

I won't get into the terms (TF-IDF) and technology (tidytext) of Ms Silge's presentation. I will mention she does a great job of summarizing how and why to perform text mining. Like all good tech, you can easily scratch the surface of text mining in fifteen minutes; a thorough understanding requires years of hard research. If you'd like an introduction to her work, take a look at her paper She Giggles, He Gallops – analyzing gender tropes in film.


I gained an understanding of machine-learning

David Smith with Microsoft presented a session on neural nets, machine learning and transfer learning. More than just a theoretical chat, David illustrated the concepts with working tools. I’ve read quite a bit about machine learning – but this illustration really brings it home. Oh — and it’s pretty damn funny. (David posted this in a blog entry.)


I learned to (sort of) love Excel

Eina Ooka found herself forced to incorporate Excel into her data workflow. Now — we all have opinions about Excel in data science — but Eina points out that for multidisciplinary data science teams, it’s good for data storage, modeling, and reports. There are issues with reproducibility and source control, and for those, R is a good solution. But Eina summarizes that Excel is still useful; not all projects can move away from it.

Data science teams without structured, intentional collaboration leak knowledge and waste resources

Stephanie Kirmer provided real-life experience and lessons learned on data science teams. Her themes included collaboration, version control, reproducibility, institutional knowledge, and other concerns. She has accomplished this with the consistent use of R packages.

One of her most interesting concepts was using packages to capture institutional knowledge. Documenting procedures with a function, contained in a standardized package provides stability and a common tool. This package then becomes an on-boarding tool for new hires and a knowledge repository for departing staff.

I learned risk can be studied and quantified

Risk is the chance that something bad will happen. David Severski with Starbucks revealed some of the tools used by the coffee giant, specifically OpenFAIR (Open Factor Analysis of Information Risk) and EvaluatoR (an R package for risk evaluation). Dave points out that R is an elegant language for data tasks. It has an ecosystem of tools for simulations and statistics, making risk evaluation a plug-and-play process.

Personally, I don’t have call for risk evaluation. But it’s interesting to get a quick peek into the language and concerns of this specialty.

I was reminded of the Science in Data Science

Mark Gallivan reminds us about the science in data science. He researched the effect of Zika on travel by searching Twitter for #babymoon. With that data, he cross-referenced the geolocation of each tweet to draw conclusions about the impact of Zika on the travel plans of pregnant women. This is one of those useful presentations on the nuts and bolts of research.

I gained knowledge for non-R conversations

On November 7th I attended the EARL (Enterprise Applications of the R Language) conference in Seattle. Two days later I attended Orycon, the annual science fiction convention in Portland, Oregon. After every conference I attend, I do a private post-mortem. I ask myself if the conference was worthwhile, if I’d attend again, and if my time was well-spent.

EARL is a deep dive into the application of the R language. Attendees are assumed to have a deep understanding of R, statistics, and domain knowledge: the quintessential definition of data science.

Orycon is a gathering of science fiction fans. It includes a crowd of cosplayers imitating the latest Star Wars characters — but I’ll ignore them for this discussion. To be clear, the people who appreciate science fiction are deeply involved in science fact.

As a result of attending EARL, I’m better prepared to appreciate the talent and experience flying slightly under the radar at Orycon. I recognize the methods the experts use to perform and document their research. Thanks to David Smith’s “not hotdog” demo, I understand machine learning at an operational level, so I can skip over that discussion — or correct bad science from a misinformed espouser of pseudo-facts.

Mark is an author, educator, and writer and teaches about R and Raspberry Pi at LinkedIn Learning.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


R-bloggers weekly – top R posts from last week (2018-11-04 till 2018-11-10)

Most liked R posts from last week, sorted based on the number of likes they got on twitter, enjoy:


Sorry I didn’t get that! How to understand what your users want

Creating a chatbot is difficult: it involves many AI-hard tasks, such as Natural Language Understanding, Machine Comprehension, Inference, and Automatic Language Generation (in fact, solving these tasks is close to solving AI), and it requires a large human effort.


2018: How did people actually vote? (The real story, not the exit polls.)

Following up on the post that we linked to last week, here’s Yair’s analysis, using Mister P, of how everyone voted.

Like Yair, I think these results are much better than what you’ll see from exit polls, partly because the analysis is more sophisticated (MRP gives you state-by-state estimates in each demographic group), partly because he’s using more data (tons of pre-election polls), and partly because I think his analysis does a better job of correcting for bias (systematic differences between the sample and population).

As Yair puts it:

We spent the last week combining all of the information available — pre-election projections, early voting, county and congressional district election results, precinct results where we have them available, and polling data — to come up with our estimates.

In future election years, maybe Yair’s results, or others constructed using similar methods, will become the standard, and we’ll be able to forget exit polls, or relegate them to a more minor part of our discourse.

Anyway, here’s what Yair found:

The breakdown by age. Wow:

Changes since the previous midterm election, who voted and how they voted:

Ethnicity and education:



Yair’s got more at the link.

And here’s our summary of what happened in 2018, that we posted a few days after the election.

The post 2018: How did people actually vote? (The real story, not the exit polls.) appeared first on Statistical Modeling, Causal Inference, and Social Science.


Is AI Just a Buzzword? 4 Experts Weigh In on the Future of AI

Perhaps even more so than big data or blockchain, AI is fast becoming the buzzword on everyone’s lips. Machine learning has been a promising field for years, but with the astonishing success of deep learning techniques, we’re rapidly being propelled into an automated future. But can AI withstand the hype?

The post Is AI Just a Buzzword? 4 Experts Weigh In on the Future of AI appeared first on Dataconomy.


Four short links: 16 November 2018

Illuminated Paper, Software Forge, Leak Checklist, and PC on ESP

  1. IllumiPaper -- illuminated elements built into regular paper, with implementation.
  2. sr.ht -- (pronounced "sir hat") a software forge like GitHub or GitLab, but with interesting strengths (e.g., very lightweight pages, and the CI system).
  3. Leak Mitigation Checklist -- If you just leaked sensitive information in public source code, read this document as part of your emergency procedure.
  4. Emulating an IBM PC on an ESP8266 -- an 8086 PC-XT emulation with 640K RAM, 80×25 CGA composite video, and a 1.44MB MS-DOS disk on an ESP12E without additional components. (via Alasdair Allen)


IBM Watson IoT’s Chief Data Scientist Advocates For Ethical Deep Learning and Building The AI Humans Really Need

In terms of pioneering data-based technology, IBM is the gold standard. Indeed, IBM has held the record for receiving the most patents every year for the past 25 years and has developed countless revolutionary technologies, from SQL to the world’s fastest supercomputer. It should come as no surprise that IBM…

The post IBM Watson IoT’s Chief Data Scientist Advocates For Ethical Deep Learning and Building The AI Humans Really Need appeared first on Dataconomy.


Basics of Entity Resolution

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

Naming Your Problem

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

  1. Deduplication: eliminating duplicate (exact) copies of repeated data.
  2. Record linkage: identifying records that reference the same entity across different sources.
  3. Canonicalization: converting data with more than one possible representation into a standard form.
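As a toy illustration of the three tasks (the records, names, and phone numbers below are invented for the sketch):

```python
# Three references to one customer: an exact duplicate within the "crm"
# source, plus a differently formatted record from the "web" source.
records = [
    {"source": "crm", "name": "Jane Doe", "phone": "202-555-0114"},
    {"source": "crm", "name": "Jane Doe", "phone": "202-555-0114"},  # exact copy
    {"source": "web", "name": "JANE DOE", "phone": "(202) 555-0114"},
]

# 1. Deduplication: drop exact copies.
deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in records}]

# 3. Canonicalization: map each field to a standard form.
def canonicalize(record):
    return {
        "source": record["source"],
        "name": record["name"].title(),
        "phone": "".join(ch for ch in record["phone"] if ch.isdigit()),
    }

# 2. Record linkage: after canonicalization, the crm and web records
# collapse to a single (name, phone) key, i.e., one real-world entity.
entities = {(c["name"], c["phone"]) for c in map(canonicalize, deduped)}
```

Deduplication leaves two records, but linkage on the canonical forms reveals they reference a single entity.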

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library. In this post, we will explore the basic functionalities of Dedupe, walk through how the library works under the hood, and perform a demonstration on two different datasets.

About Dedupe

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as its primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well — in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the csv_example from dedupe-examples. To get Dedupe running, we'll need to install unidecode, future, and dedupe.

In your terminal (we recommend doing so inside a virtual environment):

git clone
cd dedupe-examples

pip install unidecode
pip install future
pip install dedupe

Then we'll run the csv_example.py file to see what dedupe can do:

python csv_example.py

Blocking and Affine Gap Distance

Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction.

But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below?

Silvrback blog image

How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
    {'field' : 'Name', 'type': 'String'},
    {'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Address', 'type': 'String', 'has missing' : True},
    {'field' : 'Purchases', 'type': 'String'},
    ]

Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking, affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

  • the first three digits of the phone number
  • the full name
  • the first five characters of the name
  • a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, a variation on string edit distance that makes subsequent consecutive deletions or insertions cheaper.
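A minimal sketch of an affine-gap distance (a Gotoh-style dynamic program; the cost values are arbitrary illustrative choices, not Dedupe's actual parameters): a gap of length k costs gap_open + k * gap_extend, so one long gap is cheaper than the same number of scattered single-character gaps.

```python
from math import inf

def affine_gap_distance(s, t, mismatch=2, gap_open=2, gap_extend=1):
    """Minimum-cost alignment of s and t where a gap of length k
    costs gap_open + k * gap_extend (Gotoh-style three-matrix DP)."""
    n, m = len(s), len(t)
    # M: last column is a (mis)match; X: gap in t; Y: gap in s
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = min(M[i - 1][j] + gap_open + gap_extend,
                          X[i - 1][j] + gap_extend,          # extending is cheap
                          Y[i - 1][j] + gap_open + gap_extend)
            Y[i][j] = min(M[i][j - 1] + gap_open + gap_extend,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])

# Deleting the contiguous "XY" (one gap: 2 + 2*1 = 4) beats deleting
# the scattered "X" and "Y" (two gaps: 2 * (2 + 1) = 6):
# affine_gap_distance("aXYbc", "abc") -> 4
# affine_gap_distance("aXbYc", "abc") -> 6
```

This is exactly the "consecutive deletions are cheaper" property the blocking step relies on.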


Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).
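That blocking method can be sketched as follows (the record IDs and area codes come from the example above; the full phone numbers are made up for the sketch):

```python
from collections import defaultdict

def area_code_predicate(record):
    """Block key: first three digits of the phone number, or None if missing."""
    phone = record.get("phone")
    return phone[:3] if phone else None

records = [
    {"id": 452, "phone": "2025550114"},
    {"id": 821, "phone": "2025550179"},
    {"id": 233, "phone": "3345550120"},
    {"id": 699, "phone": "3345550188"},
    {"id": 720, "phone": None},        # missing phone -> NULL block
]

blocks = defaultdict(list)
for record in records:
    blocks[area_code_predicate(record)].append(record["id"])

# Only records that share a block key are compared to each other:
# {"202": [452, 821], "334": [233, 699], None: [720]}
```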


The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.
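The thresholded clustering step can be sketched like this (a simplified, brute-force stand-in for Dedupe's actual implementation: points stand in for records, Euclidean distance for the learned pairwise scores, and merging stops once the closest centroids exceed the threshold):

```python
def centroid(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def threshold_cluster(points, threshold):
    """Greedily merge the two clusters with the closest centroids
    until every pair of centroids is farther apart than threshold."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        if dist(centroid(clusters[i]), centroid(clusters[j])) > threshold:
            break  # all remaining clusters are distinct entities
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Two tight groups, well separated relative to the threshold:
groups = threshold_cluster([(0, 0), (0, 1), (10, 10), (10, 11)], threshold=2.0)
# -> two clusters: [(0, 0), (0, 1)] and [(10, 10), (10, 11)]
```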

Active Learning

You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large. You are presented with four options:


You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.

  • (y)es: confirms that the two references are to the same entity
  • (n)o: labels the two references as not the same entity
  • (u)nsure: does not label the two references as the same entity or as different entities
  • (f)inished: ends the active learning session and triggers the supervised learning phase

Silvrback blog image

As you can see in the example above, some comparison decisions are very easy. The first pair scores zero hits on all four attributes being examined, so the verdict is almost certainly a non-match. The second is a 3/4 exact match, with the fourth attribute a fuzzy match in which one entity contains a piece of the other: Ryerson vs. Chicago Public Schools Ryerson. A human can discern these as two references to the same entity, and we can label them as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

fields = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Phone', 'type': 'String', 'has missing' : True},
    ]

A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

Something a Bit More Challenging

In order to try Dedupe on a more challenging project, we decided to deduplicate the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.

Silvrback blog image

The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

Database Field Field Description
NAMELAST Last name of entity
NAMEFIRST First name of entity
NAMEMID Middle name of entity
UIN Unique Identification Number
BDGNBR Badge Number
Type of Access Access type to White House
TOA Time of arrival
POA Post on arrival
TOD Time of departure
POD Post on departure
APPT_MADE_DATE When the appointment date was made
APPT_START_DATE When the appointment date is scheduled to start
APPT_END_DATE When the appointment date is scheduled to end
APPT_CANCEL_DATE When the appointment date was canceled
Total_People Total number of people scheduled to attend
LAST_UPDATEDBY Who was the last person to update this event
POST Classified as 'WIN'
LastEntryDate When the last update to this instance was made
TERMINAL_SUFFIX ID for terminal used to process visitor
visitee_namelast The visitee's last name
visitee_namefirst The visitee's first name
MEETING_LOC The location of the meeting
MEETING_ROOM The room number of the meeting
CALLER_NAME_LAST Last name of the person authorizing the visitor
CALLER_NAME_FIRST First name of the person authorizing the visitor
CALLER_ROOM Room of the person authorizing the visitor
Description Description of the event or visit
RELEASE_DATE The date this set of logs was released to the public

Loading the Data

Using the API, the White House Visitor Log Requests can be exported in a variety of formats, including .json, .csv, .xlsx, .pdf, .xml, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

import requests

def getData(url, fname):
    """Download the dataset from the webpage."""
    response = requests.get(url)
    with open(fname, 'w') as f:
        f.write(response.text)

ORIGFILE = "fixtures/whitehouse-visitors.csv"


Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

Tailoring the Code

Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

import csv
import psycopg2
from dateutil import parser
from datetime import datetime

conn = None

# Placeholder credentials -- substitute your own
DATABASE = "your_db_name"
USER = "your_user_name"
HOST = "your_hostname"
PASSWORD = "your_password"

try:
    conn = psycopg2.connect(database=DATABASE, user=USER, host=HOST, password=PASSWORD)
    print ("I've connected")
except psycopg2.Error:
    print ("I am unable to connect to the database")
cur = conn.cursor()

The other challenge with our dataset is the numerous missing values and datetime formatting irregularities. We wanted to be able to use the datetime strings to help with entity resolution, so we wanted to get the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module and PostgreSQL's fairly forgiving 'varchar' type.

This function takes the csv data as input, parses the datetime fields we're interested in ('apptmade', 'apptstart', and 'apptend'), and outputs a database table that retains the desired columns ('lastname', 'firstname', 'uin', 'apptmade', 'apptstart', 'apptend', 'meeting_loc'). Keep in mind this will take a while to run.

def dateParseSQL(nfile):
    """Parse the date fields and load the desired columns into PostgreSQL."""
    # Column indices of apptmade, apptstart, apptend (matching the INSERT below)
    DATEFIELDS = [10, 11, 12]
    cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
                  (visitor_id SERIAL PRIMARY KEY,
                  lastname    varchar,
                  firstname   varchar,
                  uin         varchar,
                  apptmade    varchar,
                  apptstart   varchar,
                  apptend     varchar,
                  meeting_loc varchar);''')
    with open(nfile, 'r') as infile:
        reader = csv.reader(infile, delimiter=',')
        next(reader, None)  # skip the header row
        for row in reader:
            for field in DATEFIELDS:
                if row[field] != '':
                    try:
                        dt = parser.parse(row[field])
                        row[field] = dt.toordinal()  # We also tried dt.isoformat()
                    except ValueError:
                        pass  # leave unparseable strings as-is
            sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
                   VALUES (%s,%s,%s,%s,%s,%s,%s)"
            cur.execute(sql, (row[0],row[1],row[3],row[10],row[11],row[12],row[21],))
    conn.commit()
    print ("All done!")


About 60 of our rows had non-ASCII characters, which we dropped using this SQL command:

delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';

For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.

import tempfile
import argparse
import csv
import os

import dedupe
import psycopg2
from psycopg2.extras import DictCursor

Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'

FIELDS = [{'field': 'firstname', 'variable name': 'firstname',
           'type': 'String', 'has missing': True},
          {'field': 'lastname', 'variable name': 'lastname',
           'type': 'String', 'has missing': True},
          {'field': 'uin', 'variable name': 'uin',
           'type': 'String', 'has missing': True},
          {'field': 'meeting_loc', 'variable name': 'meeting_loc',
           'type': 'String', 'has missing': True}
          ]

We modified a function Dan wrote to generate the predicate blocks:

def candidates_gen(result_set):
    lset = set
    block_id = None
    records = []
    i = 0
    for row in result_set:
        if row['block_id'] != block_id:
            if records:
                yield records

            block_id = row['block_id']
            records = []
            i += 1

            if i % 10000 == 0:
                print ('{} blocks'.format(i))

        smaller_ids = row['smaller_ids']
        if smaller_ids:
            smaller_ids = lset(smaller_ids.split(','))
        else:
            smaller_ids = lset([])

        records.append((row[KEY_FIELD], row, smaller_ids))

    if records:
        yield records

And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

def find_dupes(args):
    deduper = dedupe.Dedupe(FIELDS)

    with psycopg2.connect(database=args.dbname,
                          cursor_factory=DictCursor) as con:
        with con.cursor() as c:
            c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
            row = c.fetchone()
            count = row['count']
            sample_size = int(count * args.sample)

            print ('Generating sample of {} records'.format(sample_size))
            with con.cursor('deduper') as c_deduper:
                c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
                temp_d = dict((i, row) for i, row in enumerate(c_deduper))
                deduper.sample(temp_d, sample_size)

            if os.path.exists(args.training):
                print ('Loading training file from {}'.format(args.training))
                with open(args.training) as tf:
                    deduper.readTraining(tf)

            print ('Starting active learning')
            dedupe.consoleLabel(deduper)

            print ('Starting training')
            deduper.train(ppc=0.001, uncovered_dupes=5)

            print ('Saving new training file to {}'.format(args.training))
            with open(args.training, 'w') as training_file:
                deduper.writeTraining(training_file)


            print ('Creating blocking_map table')
            c.execute("""
                DROP TABLE IF EXISTS blocking_map
                """)
            c.execute("""
                CREATE TABLE blocking_map
                (block_key VARCHAR(200), %s INTEGER)
                """ % KEY_FIELD)

            for field in deduper.blocker.index_fields:
                print ('Selecting distinct values for "{}"'.format(field))
                c_index = con.cursor('index')
                c_index.execute("""
                    SELECT DISTINCT %s FROM %s
                    """ % (field, SOURCE_TABLE))
                field_data = (row[field] for row in c_index)
                deduper.blocker.index(field_data, field)

            print ('Generating blocking map')
            c_block = con.cursor('block')
            c_block.execute("""
                SELECT * FROM %s
                """ % SOURCE_TABLE)
            full_data = ((row[KEY_FIELD], row) for row in c_block)
            b_data = deduper.blocker(full_data)

            print ('Inserting blocks into blocking_map')
            csv_file = tempfile.NamedTemporaryFile(prefix='blocks_', delete=False, mode='w')
            csv_writer = csv.writer(csv_file)
            csv_writer.writerows(b_data)
            csv_file.close()

            f = open(, 'r')
            c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
            f.close()



            print ('Indexing blocks')
            c.execute("""
                CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
                """)
            c.execute("DROP TABLE IF EXISTS plural_key")
            c.execute("DROP TABLE IF EXISTS plural_block")
            c.execute("DROP TABLE IF EXISTS covered_blocks")
            c.execute("DROP TABLE IF EXISTS smaller_coverage")

            print ('Calculating plural_key')
            c.execute("""
                CREATE TABLE plural_key
                (block_key VARCHAR(200),
                block_id SERIAL PRIMARY KEY)
                """)
            c.execute("""
                INSERT INTO plural_key (block_key)
                SELECT block_key FROM blocking_map
                GROUP BY block_key HAVING COUNT(*) > 1
                """)

            print ('Indexing block_key')
            c.execute("""
                CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
                """)

            print ('Calculating plural_block')
            c.execute("""
                CREATE TABLE plural_block
                AS (SELECT block_id, %s
                FROM blocking_map INNER JOIN plural_key
                USING (block_key))
                """ % KEY_FIELD)

            print ('Adding {} index'.format(KEY_FIELD))
            c.execute("""
                CREATE INDEX plural_block_%s_idx
                    ON plural_block (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            c.execute("""
                CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
                ON plural_block (block_id, %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print ('Creating covered_blocks')
            c.execute("""
                CREATE TABLE covered_blocks AS
                    (SELECT %s,
                            string_agg(CAST(block_id AS TEXT), ','
                            ORDER BY block_id) AS sorted_ids
                     FROM plural_block
                     GROUP BY %s)
                 """ % (KEY_FIELD, KEY_FIELD))

            print ('Indexing covered_blocks')
            c.execute("""
                CREATE UNIQUE INDEX covered_blocks_%s_idx
                    ON covered_blocks (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            print ('Committing')
            con.commit()

            print ('Creating smaller_coverage')
            c.execute("""
                CREATE TABLE smaller_coverage AS
                    (SELECT %s, block_id,
                        TRIM(',' FROM split_part(sorted_ids,
                        CAST(block_id AS TEXT), 1))
                        AS smaller_ids
                     FROM plural_block
                     INNER JOIN covered_blocks
                     USING (%s))
                """ % (KEY_FIELD, KEY_FIELD))

            print ('Clustering...')
            c_cluster = con.cursor('cluster')
            c_cluster.execute("""
                SELECT *
                FROM smaller_coverage
                INNER JOIN %s
                    USING (%s)
                ORDER BY (block_id)
                """ % (SOURCE_TABLE, KEY_FIELD))
            clustered_dupes = deduper.matchBlocks(
                    candidates_gen(c_cluster), threshold=0.5)

            print ('Creating entity_map table')
            c.execute("DROP TABLE IF EXISTS entity_map")
                CREATE TABLE entity_map (
                    %s INTEGER,
                    canon_id INTEGER,
                    cluster_score FLOAT,
                    PRIMARY KEY(%s)
                )""" % (KEY_FIELD, KEY_FIELD))

            print ('Inserting entities into entity_map')
            for cluster, scores in clustered_dupes:
                cluster_id = cluster[0]
                for key_field, score in zip(cluster, scores):
                        INSERT INTO entity_map
                            (%s, canon_id, cluster_score)
                        VALUES (%s, %s, %s)
                        """ % (KEY_FIELD, key_field, cluster_id, score))

            print ('Indexing head_index')
            c.execute("CREATE INDEX head_index ON entity_map (canon_id)")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dbname', dest='dbname', default='whitehouse', help='database name')
    parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
    parser.add_argument('-t', '--training', default='training.json', help='name of training file')
    args = parser.parse_args()

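As a quick illustration of how these arguments behave, `parse_args` can be fed an explicit argument list; unspecified options fall back to their defaults and `-s` is coerced to a float. This is a hypothetical invocation for demonstration, not part of the original script:

```python
import argparse

# Rebuild the same parser as above and parse a hypothetical argument list.
parser = argparse.ArgumentParser()
parser.add_argument('--dbname', dest='dbname', default='whitehouse', help='database name')
parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage)')
parser.add_argument('-t', '--training', default='training.json', help='name of training file')

args = parser.parse_args(['-s', '0.05'])
print(args.dbname)    # whitehouse (default)
print(args.sample)    # 0.05 (coerced to float)
print(args.training)  # training.json (default)
```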
Active Learning Observations

We ran multiple experiments:

  • Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
  • Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) - one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed common to us or sounded more domestic, since those occur much more frequently in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed that there were a lot of variations in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed what seemed to be many nicknames, though these could have been references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).

One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates for Dedupe to use has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase, and thus on the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Some were hard:

lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one that was generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number and to have the same meeting location and the same uin (which, confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier — people who seem to spend a lot of time at the White House, like staffers and other political appointees.

During the dedupe process, we computed that there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect figures of this magnitude, knowing that people visit repeatedly for business or social functions.

Within-Visit Duplicates

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

Across-Visit Duplicates (Frequent Fliers)

lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692

lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085
lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121

lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331

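The within-visit style of duplicate shown above can be flagged with a simple exact-match check on the identifying fields. This is a hypothetical sketch (column names assumed to match the records shown above), not the dedupe library's method — frequent fliers differ on uin and meeting_loc, which is exactly why they need probabilistic matching instead:

```python
import pandas as pd

# Hypothetical sketch: flag within-visit duplicates, i.e. rows that agree
# exactly on name fields, meeting location, and uin. Column names assumed.
visits = pd.DataFrame([
    {'lastname': 'Ryan', 'firstname': 'Patrick', 'meeting_loc': 'OEOB', 'uin': 'U62671'},
    {'lastname': 'Ryan', 'firstname': 'Patrick', 'meeting_loc': 'OEOB', 'uin': 'U62671'},
    {'lastname': 'TANGHERLINI', 'firstname': 'DANIEL', 'meeting_loc': 'OEOB', 'uin': 'U02692'},
])

key = ['lastname', 'firstname', 'meeting_loc', 'uin']
# keep=False marks every member of a duplicate group, not just the repeats
visits['within_visit_dup'] = visits.duplicated(subset=key, keep=False)
print(visits['within_visit_dup'].tolist())  # [True, True, False]
```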


In this beginner's guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately, we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

Please return to the District Data Labs blog for upcoming posts on entity resolution and a number of other topics important to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!


Data Exploration with Python, Part 2

This is the second post in our Data Exploration with Python series. Before reading this post, make sure to check out Data Exploration with Python, Part 1!

Mise en place (noun): In a professional kitchen, the disciplined organization and preparation of equipment and food before service begins.

When performing exploratory data analysis (EDA), it is important to not only prepare yourself (the analyst) but to prepare your data as well. As we discussed in the previous post, a small amount of preparation will often save you a significant amount of time later on. So let's review where we should be at this point and then continue our exploration process with data preparation.

In Part 1 of this series, we were introduced to the data exploration framework we will be using. As a reminder, here is what that framework looks like.

Exploratory Data Analysis Framework

We also introduced the example data set we are going to be using to illustrate the different phases and stages of the framework. Here is what that looks like.

EPA Vehicle Fuel Economy Data

We then familiarized ourselves with our data set by identifying the types of information and entities encoded within it. We also reviewed several data transformation and visualization methods that we will use later to explore and analyze it. Now we are at the last stage of the framework's Prep Phase, the Create stage, where our goal will be to create additional categorical fields that will make our data easier to explore and allow us to view it from new perspectives.

Renaming Columns to be More Intuitive

Before we dive in and start creating categories, however, we have an opportunity to improve our categorization efforts by examining the columns in our data and making sure their labels intuitively convey what they represent. Just as with the other aspects of preparation, changing them now will save us from having to remember what displ or co2TailpipeGpm mean when they show up on a chart later. In my experience, these small, detail-oriented enhancements to the beginning of your process usually compound and preserve cognitive cycles that you can later apply to extracting insights.

We can use the code below to rename the columns in our vehicles data frame.

vehicles.columns = ['Make','Model','Year','Engine Displacement','Cylinders',
                    'Transmission','Drivetrain','Vehicle Class','Fuel Type',
                    'Fuel Barrels/Year','City MPG','Highway MPG','Combined MPG',
                    'CO2 Emission Grams/Mile','Fuel Cost/Year']

Thinking About Categorization

Now that we have changed our column names to be more intuitive, let's take a moment to think about what categorization is and examine the categories that currently exist in our data set. At the most basic level, categorization is just a way that humans structure information — how we hierarchically create order out of complexity. Categories are formed based on attributes that entities have in common, and they present us with different perspectives from which we can view and think about our data.

Our primary objective in this stage is to create additional categories that will help us further organize our data. This will prove beneficial not only for the exploratory analysis we will conduct but also for any supervised machine learning or modeling that may happen further down the data science pipeline. Seasoned data scientists know that the better your data is organized, the better downstream analyses you will be able to perform and the more informative features you will have to feed into your machine learning models.

In this stage of the framework, we are going to create additional categories in 3 distinct ways:

  • Category Aggregations
  • Binning Continuous Variables
  • Clustering

Now that we have a better idea of what we are doing and why, let's get started.

Aggregating to Higher-Level Categories

The first way we are going to create additional categories is by identifying opportunities to create higher-level categories out of the variables we already have in our data set. In order to do this, we need to get a sense of what categories currently exist in the data. We can do this by iterating through our columns and printing out the name, the number of unique values, and the data type for each.

def unique_col_values(df):
    for column in df:
        print("{} | {} | {}".format(
            df[column].name, len(df[column].unique()), df[column].dtype))

Make | 126 | object
Model | 3491 | object
Year | 33 | int64
Engine Displacement | 65 | float64
Cylinders | 9 | float64
Transmission | 43 | object
Drivetrain | 7 | object
Vehicle Class | 34 | object
Fuel Type | 13 | object
Fuel Barrels/Year | 116 | float64
City MPG | 48 | int64
Highway MPG | 49 | int64
Combined MPG | 45 | int64
CO2 Emission Grams/Mile | 550 | float64
Fuel Cost/Year | 58 | int64

From looking at the output, it is clear that we have some numeric columns (int64 and float64) and some categorical columns (object). For now, let's focus on the six categorical columns in our data set.

  • Make: 126 unique values
  • Model: 3,491 unique values
  • Transmission: 43 unique values
  • Drivetrain: 7 unique values
  • Vehicle Class: 34 unique values
  • Fuel Type: 13 unique values

When aggregating and summarizing data, having too many categories can be problematic. The average human is said to have the ability to hold 7 objects at a time in their short-term working memory. Accordingly, I have noticed that once you exceed 8-10 discrete values in a category, it becomes increasingly difficult to get a holistic picture of how the entire data set is divided up.

What we want to do is examine the values in each of our categorical variables to determine where opportunities exist to aggregate them into higher-level categories. The way this is typically done is by using a combination of clues from the current categories and any domain knowledge you may have (or be able to acquire).

For example, imagine aggregating by Transmission, which has 43 discrete values in our data set. It is going to be difficult to derive insights due to the fact that any aggregated metrics are going to be distributed across more categories than you can hold in short-term memory. However, if we examine the different transmission categories with the goal of finding common features that we can group on, we would find that all 43 values fall into one of two transmission types, Automatic or Manual.

Category Aggregations - Transmission

Let's create a new Transmission Type column in our data frame and, with the help of the loc method in pandas, assign it a value of Automatic where the first character of Transmission is the letter A and a value of Manual where the first character is the letter M.

AUTOMATIC = "Automatic"
MANUAL = "Manual"

vehicles.loc[vehicles['Transmission'].str.startswith('A'),
             'Transmission Type'] = AUTOMATIC

vehicles.loc[vehicles['Transmission'].str.startswith('M'),
             'Transmission Type'] = MANUAL

We can apply the same logic to the Vehicle Class field. We originally have 34 vehicle classes, but we can distill those down into 8 vehicle categories, which are much easier to remember.

Category Aggregations - Vehicle Class

small = ['Compact Cars','Subcompact Cars','Two Seaters','Minicompact Cars']
midsize = ['Midsize Cars']
large = ['Large Cars']

vehicles.loc[vehicles['Vehicle Class'].isin(small), 
             'Vehicle Category'] = 'Small Cars'

vehicles.loc[vehicles['Vehicle Class'].isin(midsize), 
             'Vehicle Category'] = 'Midsize Cars'

vehicles.loc[vehicles['Vehicle Class'].isin(large), 
             'Vehicle Category'] = 'Large Cars'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Station'), 
             'Vehicle Category'] = 'Station Wagons'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Truck'), 
             'Vehicle Category'] = 'Pickup Trucks'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Special Purpose'), 
             'Vehicle Category'] = 'Special Purpose'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Sport Utility'), 
             'Vehicle Category'] = 'Sport Utility'

vehicles.loc[(vehicles['Vehicle Class'].str.lower().str.contains('van')),
             'Vehicle Category'] = 'Vans & Minivans'

Next, let's look at the Make and Model fields, which have 126 and 3,491 unique values respectively. While I can't think of a way to get either of those down to 8-10 categories, we can create another potentially informative field by concatenating Make and the first word of the Model field together into a new Model Type field. This would allow us to, for example, categorize all Chevrolet Suburban C1500 2WD vehicles and all Chevrolet Suburban K1500 4WD vehicles as simply Chevrolet Suburbans.

vehicles['Model Type'] = (vehicles['Make'] + " " +
                          vehicles['Model'].str.split().str.get(0))

Finally, let's look at the Fuel Type field, which has 13 unique values. On the surface, that doesn't seem too bad, but upon further inspection, you'll notice some complexity embedded in the categories that could probably be organized more intuitively.

vehicles['Fuel Type'].unique()
array(['Regular', 'Premium', 'Diesel', 'Premium and Electricity',
       'Premium or E85', 'Premium Gas or Electricity', 'Gasoline or E85',
       'Gasoline or natural gas', 'CNG', 'Regular Gas or Electricity',
       'Midgrade', 'Regular Gas and Electricity', 'Gasoline or propane'],
      dtype=object)

This is interesting and a little tricky because there are some categories that contain a single fuel type and others that contain multiple fuel types. In order to organize this better, we will create two sets of categories from these fuel types. The first will be a set of columns that will be able to represent the different combinations, while still preserving the individual fuel types.

vehicles['Gas'] = 0
vehicles['Ethanol'] = 0
vehicles['Electric'] = 0
vehicles['Propane'] = 0
vehicles['Natural Gas'] = 0

vehicles.loc[vehicles['Fuel Type'].str.contains(
        'Regular|Gasoline|Midgrade|Premium|Diesel'),'Gas'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('E85'),'Ethanol'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('Electricity'),'Electric'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('propane'),'Propane'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),'Natural Gas'] = 1

As it turns out, 99% of the vehicles in our database have gas as a fuel type, either by itself or combined with another fuel type. Since that is the case, let's create a second set of categories - specifically, a new Gas Type field that extracts the type of gas (Regular, Midgrade, Premium, Diesel, or Natural) each vehicle accepts.
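A figure like that 99% share can be sanity-checked directly from the indicator column created above. Here is a sketch on a toy frame (the real vehicles data frame is assumed to have the Gas column):

```python
import pandas as pd

# Toy stand-in for the vehicles data frame; 'Gas' is the 0/1 indicator
# column created in the previous step.
vehicles_toy = pd.DataFrame({'Gas': [1, 1, 1, 0]})

# Mean of a 0/1 column is the share of rows where the flag is set.
pct_gas = vehicles_toy['Gas'].mean() * 100
print(round(pct_gas, 1))  # 75.0 on this toy data; ~99 on the real data
```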

vehicles.loc[vehicles['Fuel Type'].str.contains(
        'Regular|Gasoline'),'Gas Type'] = 'Regular'

vehicles.loc[vehicles['Fuel Type'] == 'Midgrade',
             'Gas Type'] = 'Midgrade'

vehicles.loc[vehicles['Fuel Type'].str.contains('Premium'),
             'Gas Type'] = 'Premium'

vehicles.loc[vehicles['Fuel Type'] == 'Diesel',
             'Gas Type'] = 'Diesel'

vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),
             'Gas Type'] = 'Natural'

An important thing to note about what we have done with all of the categorical fields in this section is that, while we created new categories, we did not overwrite the original ones. We created additional fields that will allow us to view the information contained within the data set at different (often higher) levels. If you need to drill down to the more granular original categories, you can always do that. However, now we have a choice, whereas before we performed these category aggregations, we did not.

Creating Categories from Continuous Variables

The next way we can create additional categories in our data is by binning some of our continuous variables - breaking them up into different categories based on a threshold or distribution. There are multiple ways you can do this, but I like to use quintiles because it gives me one middle category, two categories outside of that which are moderately higher and lower, and then two extreme categories at the ends. I find that this is a very intuitive way to break things up and provides some consistency across categories. In our data set, I've identified 4 fields that we can bin this way.

Continuous Fields to Bin into Quintiles

Binning essentially looks at how the data is distributed, creates the necessary number of bins by splitting up the range of values (either equally or based on explicit boundaries), and then categorizes records into the appropriate bin that their continuous value falls into. Pandas has a qcut method that makes binning extremely easy, so let's use that to create our quintiles for each of the continuous variables we identified.

efficiency_categories = ['Very Low Efficiency', 'Low Efficiency',
                         'Moderate Efficiency','High Efficiency',
                         'Very High Efficiency']

vehicles['Fuel Efficiency'] = pd.qcut(vehicles['Combined MPG'],
                                      5, efficiency_categories)
engine_categories = ['Very Small Engine', 'Small Engine','Moderate Engine',
                     'Large Engine', 'Very Large Engine']

vehicles['Engine Size'] = pd.qcut(vehicles['Engine Displacement'],
                                  5, engine_categories)
emission_categories = ['Very Low Emissions', 'Low Emissions',
                        'Moderate Emissions','High Emissions',
                        'Very High Emissions']

vehicles['Emissions'] = pd.qcut(vehicles['CO2 Emission Grams/Mile'],
                                 5, emission_categories)
fuelcost_categories = ['Very Low Fuel Cost', 'Low Fuel Cost',
                       'Moderate Fuel Cost','High Fuel Cost',
                       'Very High Fuel Cost']

vehicles['Fuel Cost'] = pd.qcut(vehicles['Fuel Cost/Year'],
                                5, fuelcost_categories)

Clustering to Create Additional Categories

The final way we are going to prepare our data is by clustering to create additional categories. There are a few reasons why I like to use clustering for this. First, it takes multiple fields into consideration together at the same time, whereas the other categorization methods only consider one field at a time. This will allow you to categorize together entities that are similar across a variety of attributes, but might not be close enough in each individual attribute to get grouped together.

Clustering also creates new categories for you automatically, which takes much less time than having to comb through the data yourself identifying patterns across attributes that you can form categories on. It will automatically group similar items together for you.

The third reason I like to use clustering is because it will sometimes group things in ways that you, as a human, may not have thought of. I'm a big fan of humans and machines working together to optimize analytical processes, and this is a good example of value that machines bring to the table that can be helpful to humans. I'll write more about my thoughts on that in future posts, but for now, let's move on to clustering our data.

The first thing we are going to do is isolate the columns we want to use for clustering. These are going to be columns with numeric values, as the clustering algorithm will need to compute distances in order to group similar vehicles together.

cluster_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
                   'City MPG','Highway MPG','Combined MPG',
                   'CO2 Emission Grams/Mile', 'Fuel Cost/Year']

Next, we want to scale the features we are going to cluster on. There are a variety of ways to normalize and scale variables, but I'm going to keep things relatively simple and just use Scikit-Learn's MaxAbsScaler, which will divide each value by the max absolute value for that feature. This will preserve the distributions in the data and convert the values in each field to a number between 0 and 1 (technically -1 and 1, but we don't have any negatives).

from sklearn import preprocessing
scaler = preprocessing.MaxAbsScaler()

vehicle_clusters = scaler.fit_transform(vehicles[cluster_columns])
vehicle_clusters = pd.DataFrame(vehicle_clusters, columns=cluster_columns)

Now that our features are scaled, let's write a couple of functions. The first function we are going to write is a kmeans_cluster function that will k-means cluster a given data frame into a specified number of clusters. It will then return a copy of the original data frame with those clusters appended in a column named Cluster.

from sklearn.cluster import KMeans

def kmeans_cluster(df, n_clusters=2):
    model = KMeans(n_clusters=n_clusters, random_state=1)
    clusters = model.fit_predict(df)
    cluster_results = df.copy()
    cluster_results['Cluster'] = clusters
    return cluster_results

Our second function, called summarize_clustering is going to count the number of vehicles that fall into each cluster and calculate the cluster means for each feature. It is going to merge the counts and means into a single data frame and then return that summary to us.

def summarize_clustering(results):
    cluster_size = results.groupby(['Cluster']).size().reset_index()
    cluster_size.columns = ['Cluster', 'Count']
    cluster_means = results.groupby(['Cluster'], as_index=False).mean()
    cluster_summary = pd.merge(cluster_size, cluster_means, on='Cluster')
    return cluster_summary

We now have functions for what we need to do, so the next step is to actually cluster our data. But wait, our kmeans_cluster function is supposed to accept a number of clusters. How do we determine how many clusters we want?

There are a number of approaches for figuring this out, but for the sake of simplicity, we are just going to plug in a couple of numbers and visualize the results to arrive at a good enough estimate. Remember earlier in this post where we were trying to aggregate our categorical variables to less than 8-10 discrete values? We are going to apply the same logic here. Let's start out with 8 clusters and see what kind of results we get.
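One common approach, not used in this post, is the elbow method: run k-means for a range of k values and look for the point where the within-cluster sum of squares (the model's inertia) stops dropping sharply. A minimal sketch on toy data (standing in for our scaled vehicle features) might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated blobs, so the "elbow" should appear
# around k=3. This stands in for the scaled vehicle features.
rng = np.random.RandomState(1)
data = np.vstack([rng.normal(loc, 0.1, size=(50, 2)) for loc in (0, 1, 2)])

inertias = []
for k in range(1, 8):
    model = KMeans(n_clusters=k, random_state=1, n_init=10)
    model.fit(data)
    inertias.append(model.inertia_)

# Inertia always decreases as k grows; the useful signal is where the
# rate of decrease levels off (here, after k=3).
print([round(i, 1) for i in inertias])
```

Plotting inertia against k and eyeballing the bend gives a principled starting point; the visual comparison we do below with heatmaps serves the same purpose.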

cluster_results = kmeans_cluster(vehicle_clusters, 8)
cluster_summary = summarize_clustering(cluster_results)

After running the couple of lines of code above, your cluster_summary should look similar to the following.

Cluster Summary (8 Clusters)

By looking at the Count column, you can tell that some clusters have significantly more records in them (e.g. Cluster 7) and others have significantly fewer (e.g. Cluster 3). Other than that, though, it is difficult to notice anything informative about the summary. I don't know about you, but to me, the rest of the summary just looks like a bunch of decimals in a table.

This is a prime opportunity to use a visualization to discover insights faster. With just a couple import statements and a single line of code, we can light this summary up in a heatmap so that we can see the contrast between all those decimals and between the different clusters.

import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)

Cluster Summary Heatmap (8 Clusters)

In this heatmap, the rows represent the features and the columns represent the clusters, so we can compare how similar or different the clusters look. Our goal for clustering these features is to ultimately create meaningful categories out of the clusters, so we want to get to the point where we can clearly distinguish one from the others. This heatmap allows us to do this quickly and visually.

With this goal in mind, it is apparent that we probably have too many clusters because:

  • Clusters 3, 4, and 7 look pretty similar
  • Clusters 2 and 5 look similar as well
  • Clusters 0 and 6 are also a little close for comfort

From the way our heatmap currently looks, I'm willing to bet that we can cut the number of clusters in half and get clearer boundaries. Let's re-run the clustering, summary, and heatmap code for 4 clusters and see what kind of results we get.

cluster_results = kmeans_cluster(vehicle_clusters, 4)
cluster_summary = summarize_clustering(cluster_results)

sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)

Cluster Summary Heatmap (4 Clusters)

These clusters look more distinct, don't they? Clusters 1 and 3 look like polar opposites of each other, Cluster 0 looks pretty well balanced across all the features, and Cluster 2 looks like it's about halfway between Clusters 0 and 1.

We now have a good number of clusters, but we still have a problem. It is difficult to remember what clusters 0, 1, 2, and 3 mean, so as a next step, I like to assign descriptive names to the clusters based on their properties. In order to do this, we need to look at the levels of each feature for each cluster and come up with intuitive natural language descriptions for them. You can have some fun and can get as creative as you want here, but just keep in mind that the objective is for you to be able to remember the characteristics of whatever label you assign to the clusters.

  • Cluster 1 vehicles seem to have large engines that consume a lot of fuel, process it inefficiently, produce a lot of emissions, and cost a lot to fill up. I'm going to label them Large Inefficient.
  • Cluster 3 vehicles have small, fuel efficient engines that don't produce a lot of emissions and are relatively inexpensive to fill up. I'm going to label them Small Very Efficient.
  • Cluster 0 vehicles are fairly balanced across every category, so I'm going to label them Midsized Balanced.
  • Cluster 2 vehicles have large engines but are more moderately efficient than the vehicles in Cluster 1, so I'm going to label them Large Moderately Efficient.

Now that we have come up with these descriptive names for our clusters, let's add a Cluster Name column to our cluster_results data frame, and then copy the cluster names over to our original vehicles data frame.

cluster_results['Cluster Name'] = ''
cluster_results.loc[cluster_results['Cluster']==0, 'Cluster Name'] = 'Midsized Balanced'
cluster_results.loc[cluster_results['Cluster']==1, 'Cluster Name'] = 'Large Inefficient'
cluster_results.loc[cluster_results['Cluster']==2, 'Cluster Name'] = 'Large Moderately Efficient'
cluster_results.loc[cluster_results['Cluster']==3, 'Cluster Name'] = 'Small Very Efficient'

vehicles = vehicles.reset_index().drop('index', axis=1)
vehicles['Cluster Name'] = cluster_results['Cluster Name']


In this post, we examined several ways to prepare a data set for exploratory analysis. First, we looked at the categorical variables we had and attempted to find opportunities to roll them up into higher-level categories. After that, we converted some of our continuous variables into categorical ones by binning them into quintiles based on how relatively high or low their values were. Finally, we used clustering to efficiently create categories that automatically take multiple fields into consideration. The result of all this preparation is that we now have several columns containing meaningful categories that will provide different perspectives of our data and allow us to acquire as many insights as possible.

Now that we have these meaningful categories, our data set is in really good shape, which means that we can move on to the next phase of our data exploration framework. In the next post, we will cover the first two stages of the Explore Phase and demonstrate various ways to visually aggregate, pivot, and identify relationships between fields in our data. Make sure to subscribe to the DDL blog so that you get notified when we publish it!


Data Exploration with Python, Part 3

This is the third post in our Data Exploration with Python series. Before reading this post, make sure to check out Part 1 and Part 2!

Preparing yourself and your data like we have done thus far in this series is essential to analyzing your data well. However, the most exciting part of Exploratory Data Analysis (EDA) is actually getting in there, exploring the data, and discovering insights. That's exactly what we are going to start doing in this post.

We will begin with the cleaned and prepped vehicle fuel economy data set that we ended up with at the end of the last post. This version of the data set contains:

  • The higher-level categories we created via category aggregations.
  • The quintiles we created by binning our continuous variables.
  • The clusters we generated via k-means clustering based on numeric variables.

Now, without further ado, let's embark on our insight-finding mission!

Making Our Data Smaller: Filter + Aggregate

One of the fundamental ways to extract insights from a data set is to reduce the size of the data so that you can look at just a piece of it at a time. There are two ways to do this: filtering and aggregating. With filtering, you are essentially removing either rows or columns (or both rows and columns) in order to focus on a subset of the data that interests you. With aggregation, the objective is to group records in your data set that have similar categorical attributes and then perform some calculation (count, sum, mean, etc.) on one or more numerical fields so that you can observe and identify differences between records that fall into each group.

To begin filtering and aggregating our data set, we could write a function like the one below to aggregate based on a group_field that we provide, counting the number of rows in each group. To make things more intuitive and easier to interpret, we will also sort the data from most frequent to least and format it in a pandas data frame with appropriate column names.

def agg_count(df, group_field):
    grouped = df.groupby(group_field).size()
    grouped = grouped.sort_values(ascending=False)

    grouped = pd.DataFrame(grouped).reset_index()
    grouped.columns = [group_field, 'Count']
    return grouped

Now that we have this function in our toolkit, let's use it. Suppose we were looking at the Vehicle Category field in our data set and were curious about the number of vehicles in each category that were manufactured last year (2016). Here is how we would filter the data and use the agg_count function to transform it to show what we wanted to know.

vehicles_2016 = vehicles[vehicles['Year']==2016]
category_counts = agg_count(vehicles_2016, 'Vehicle Category')

Filter Aggregate Count

This gives us what we want in tabular form, but we could take it a step further and visualize it with a horizontal bar chart.

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Category (2016) \n')

Vehicles Manufactured 2016 Bar Chart

Now that we know how to do this, we can filter, aggregate, and plot just about anything in our data set with just a few lines of code. For example, here is the same metric but filtered for a different year (1985).

vehicles_1985 = vehicles[vehicles['Year']==1985]
category_counts = agg_count(vehicles_1985, 'Vehicle Category')

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Category (1985) \n')

Vehicles Manufactured 1985 Bar Chart

If we wanted to stick with the year 2016 but drill down to the more granular Vehicle Class, we could do that as well.

class_counts = agg_count(vehicles_2016, 'Vehicle Class')

ax = sns.barplot(data=class_counts, x='Count', y='Vehicle Class')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Class (2016) \n')

Vehicles Manufactured by Class (2016) Bar Chart

We could also look at vehicle counts by manufacturer.

make_counts = agg_count(vehicles_2016, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Make (2016) \n')

Vehicles Manufactured by Make (2016) Bar Chart

What if we wanted to filter by something other than the year? We could do that by simply creating a different filtered data frame and passing that to our agg_count function. Below, instead of filtering by Year, I've filtered on the Fuel Efficiency field, which contains the fuel efficiency quintiles we generated in the last post. Let's choose the Very High Efficiency value so that we can see how many very efficient vehicles each manufacturer has made.

very_efficient = vehicles[vehicles['Fuel Efficiency']=='Very High Efficiency']
make_counts = agg_count(very_efficient, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Very Fuel Efficient Vehicles by Make \n')

Very Efficient Vehicles by Make

What if we wanted to perform some other calculation, such as averaging, instead of counting the number of records that fall into each group? We can just create a new function called agg_avg that calculates the mean of a designated numerical field.

def agg_avg(df, group_field, calc_field):
    grouped = df.groupby(group_field, as_index=False)[calc_field].mean()
    grouped = grouped.sort_values(by=calc_field, ascending=False)
    grouped.columns = [group_field, 'Avg ' + str(calc_field)]
    return grouped

We can then simply swap out the agg_count function with our new agg_avg function and indicate what field we would like to use for our calculation. Below is an example showing the average fuel efficiency, represented by the Combined MPG field, by vehicle category.

category_avg_mpg = agg_avg(vehicles_2016, 'Vehicle Category', 'Combined MPG')

ax = sns.barplot(data=category_avg_mpg, x='Avg Combined MPG', y='Vehicle Category')
ax.set(xlabel='\n Average Combined MPG')
plt.title('Average Combined MPG by Category (2016) \n')

Average Fuel Efficiency by Vehicle Category

Pivoting the Data for More Detail

Up until this point, we've been looking at our data at a pretty high level, aggregating up by a single variable. Sure, we were able to drill down from Vehicle Category to Vehicle Class to get a more granular view, but we only looked at the data one hierarchical level at a time. Next, we're going to go into further detail by taking a look at two or three variables at a time. The way we are going to do this is via pivot tables and their visual equivalents, pivot heatmaps.

First, we will create a pivot_count function, similar to the agg_count function we created earlier, that will transform whatever data frame we feed it into a pivot table with the rows, columns, and calculated field we specify.

def pivot_count(df, rows, columns, calc_field):
    df_pivot = df.pivot_table(values=calc_field,
                              index=rows,
                              columns=columns,
                              aggfunc='count').dropna(axis=0, how='all')
    return df_pivot

We will then use this function on our vehicles_2016 data frame and pivot it out with the Fuel Efficiency quintiles we created in the last post representing the rows, the Engine Size quintiles representing the columns, and then counting the number of vehicles that had a Combined MPG value.

effic_size_pivot = pivot_count(vehicles_2016,'Fuel Efficiency',
                               'Engine Size','Combined MPG')

Filter Pivot Count Fuel Efficiency vs. Engine Size

This is OK, but it would be faster to analyze visually. Let's create a heatmap that will color the magnitude of the counts and present us with a more intuitive view.

fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
plt.title('Fuel Efficiency vs. Engine Size (2016) \n')

Fuel Efficiency by Engine Size Heatmap 2016

Just like we did earlier with our horizontal bar charts, we can easily filter by a different year and get a different perspective. For example, here's what this heatmap looks like for 1985.

effic_size_pivot = pivot_count(vehicles_1985,'Fuel Efficiency',
                               'Engine Size','Combined MPG')

fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
plt.title('Fuel Efficiency vs. Engine Size (1985) \n')

Fuel Efficiency by Engine Size Heatmap 1985

With these pivot heatmaps, we are not limited to just two variables. We can pass a list of variables for any of the axes (rows or columns), and it will display all the different combinations of values for those variables.

effic_size_category = pivot_count(vehicles_2016,
                                  ['Engine Size','Fuel Efficiency'],
                                  'Vehicle Category','Combined MPG')

fig, ax = plt.subplots(figsize=(20,10))
sns.heatmap(effic_size_category, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
plt.title('Fuel Efficiency + Engine Size vs. Vehicle Category (2016) \n')

Engine Size and Fuel Efficiency by Vehicle Category Heatmap

In this heatmap, we have Engine Size and Fuel Efficiency combinations represented by the rows, and we've added a third variable (the Vehicle Category) across the columns. So now we can see a finer level of detail about what types of cars had what size engines and what level of fuel efficiency last year.

As a final example for this section, let's create a pivot heatmap that plots Make against Vehicle Category for 2016. We saw earlier, in the bar chart that counted vehicles by manufacturer, that BMW made the largest number of specific models last year. This pivot heatmap will let us see how those counts are distributed across vehicle categories, giving us a better sense of each auto company's current offerings in terms of the breadth vs. depth of vehicle types they make.

effic_size_pivot = pivot_count(vehicles_2016, 'Make',
                               'Vehicle Category','Combined MPG')

fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
plt.title('Make vs. Vehicle Category (2016) \n')

Make vs. Vehicle Category 2016

Visualizing Changes Over Time

So far in this post, we've been looking at the data at given points in time. The next step is to take a look at how the data has changed over time. We can do this relatively easily by creating a multi_line function that accepts a data frame and x/y fields and then plots them on a multiline chart.

def multi_line(df, x, y):
    ax = df.groupby([x, y]).size().unstack(y).plot(figsize=(15,8), cmap="Set2")
    return ax

Let's use this function to visualize our vehicle categories over time. The resulting chart shows the number of vehicles in each category that were manufactured each year.

ax = multi_line(vehicles, 'Year', 'Vehicle Category')
ax.set(xlabel='\n Year')
plt.title('Vehicle Categories Over Time \n')

Vehicle Categories Over Time

We can see from the chart that Small Cars have generally dominated across the board, though their numbers dipped in the late 90s before picking back up in the early 2000s. We can also see the introduction and rise in popularity of SUVs starting in the late 90s, and the decline in popularity of trucks in recent years.

If we wanted to, we could zoom in and filter for specific manufacturers to see how their offerings have changed over the years. Since BMW made the largest number of vehicles last year and we saw in the pivot heatmap that those were mostly small cars, let's filter for just their vehicles to see whether they have always made a lot of small cars or whether this is a more recent phenomenon.

bmw = vehicles[vehicles['Make'] == 'BMW']

ax = multi_line(bmw, 'Year', 'Vehicle Category')
ax.set(xlabel='\n Year')
plt.title('BMW Vehicle Categories Over Time \n')

BMW Vehicle Categories Over Time

We can see in the chart above that they started off making a reasonable number of small cars, and then seemed to ramp up production of those types of vehicles in the late 90s. We can contrast this with a company like Toyota, which started out making a lot of small cars back in the 1980s and then seemingly made a decision to gradually manufacture fewer of them over the years, focusing instead on SUVs, pickup trucks, and midsize cars.

toyota = vehicles[vehicles['Make'] == 'Toyota']

ax = multi_line(toyota, 'Year', 'Vehicle Category')
ax.set(xlabel='\n Year')
plt.title('Toyota Vehicle Categories Over Time \n')

Toyota Vehicle Categories Over Time

Examining Relationships Between Variables

The final way we are going to explore our data in this post is by examining the relationships between numerical variables in our data. Doing this will provide us with better insight into which fields are highly correlated, what the nature of those correlations are, what typical combinations of numerical values exist in our data, and which combinations are anomalies.
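Before reaching for plots, one quick way to sanity-check pairwise correlations numerically is pandas' DataFrame.corr(). The frame below is a hypothetical stand-in, not the actual vehicles data; with the real set you would pass its numeric columns:

```python
import pandas as pd

# Toy numeric frame standing in for the vehicles data
df = pd.DataFrame({
    'Combined MPG': [30, 25, 20, 15, 10],
    'Fuel Cost/Year': [800, 1000, 1300, 1700, 2400],
})

# Pearson correlation matrix; MPG and fuel cost should be
# strongly negatively correlated in data like this
corr = df.corr()
print(corr.loc['Combined MPG', 'Fuel Cost/Year'])
```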

For looking at relationships between variables, I often like to start with a scatter matrix because it gives me a bird's eye view of the relationships between all the numerical fields in my data set. With just a couple of lines of code, we can not only create a scatter matrix but also factor in a layer of color that can represent, for example, the clusters we generated at the end of the last post.

select_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
                   'City MPG','Highway MPG','Combined MPG',
                   'CO2 Emission Grams/Mile', 'Fuel Cost/Year', 'Cluster Name']

sns.pairplot(vehicles[select_columns], hue='Cluster Name', height=3)

Scatter Matrix with Cluster Hue

From here, we can see that there are some strong positive linear relationships in our data, such as the correlations between the MPG fields, and also among the fuel cost, barrels, and CO2 emissions fields. There are also some hyperbolic relationships in there as well, particularly between the MPG fields and engine displacement, fuel cost, barrels, and emissions. Additionally, we can also get a sense of the size of our clusters, how they are distributed, and the level of overlap we have between them.

Once we have this high-level overview, we can zoom in on anything that we think looks interesting. For example, let's take a closer look at Engine Displacement plotted against Combined MPG.

sns.lmplot(x='Engine Displacement', y='Combined MPG', data=vehicles,
           hue='Cluster Name', height=8, fit_reg=False)

Displacement MPG Scatter Plot

In addition to being able to see that there is a hyperbolic correlation between these two variables, we can see that our Small Very Efficient cluster resides in the upper left, followed by our Midsized Balanced cluster that looks smaller and more compact than the others. After that, we have our Large Moderately Efficient cluster and finally our Large Inefficient cluster on the bottom right.

We can also see that there are a few red points at the very top left and a few purple points at the very bottom right that we may want to investigate further to get a sense of what types of vehicles we are likely to see at the extremes. Try identifying some of those on your own by filtering the data set like we did earlier in the post. While you're at it, try creating additional scatter plots that zoom in on other numerical field combinations from the scatter matrix above. There are a bunch of other insights to be found in this data set, and all it takes is a little exploration!
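As a starting point for that kind of extreme-point inspection, the same boolean-mask filtering used earlier in the post can isolate, say, the top-left corner of the scatter plot. The frame and thresholds below are illustrative stand-ins, not the actual data:

```python
import pandas as pd

# Hypothetical stand-in for the vehicles frame; column names follow the post
vehicles = pd.DataFrame({
    'Make': ['A', 'B', 'C', 'D'],
    'Combined MPG': [50, 22, 18, 12],
    'Engine Displacement': [1.0, 2.5, 4.0, 6.2],
})

# Combine two conditions to isolate high-MPG, small-engine outliers
extremes = vehicles[(vehicles['Combined MPG'] > 40) &
                    (vehicles['Engine Displacement'] < 2.0)]
print(extremes['Make'].tolist())  # ['A']
```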


We have covered quite a bit in this post, and I hope I've provided you with some good examples of how, with just a few tools in your arsenal, you can embark on a robust insight-finding expedition and discover truly interesting things about your data. Now that you have some structure in your process and some tools for exploring data, you can let your creativity run wild a little and come up with filter, aggregate, pivot, and scatter combinations that are most interesting to you. Feel free to experiment and post any interesting insights you're able to find in the comments!

Also, make sure to stay tuned because in the next (and final) post of this series, I'm going to cover how to identify and think about the different networks that are present in your data and how to explore them using graph analytics. Click the subscribe button below and enter your email so that you receive a notification as soon as it's published!



Inflated counts for cleared rape cases

Newsy, Reveal and ProPublica look into rape cases in the U.S. and law enforcement’s use of exceptional clearance.

The designation allows police to clear cases when they have enough evidence to make an arrest and know who and where the suspect is, but can’t make an arrest for reasons outside their control. Experts say it’s supposed to be used sparingly.

Culled data from various police departments shows the designation is used more often than one would expect.



Where Camp fire destroyed homes

The Camp fire death toll rose to 63, with 631 people still missing, as of yesterday. The Los Angeles Times provides some graphics showing scale and the buildings that burned.

Ugh. I live a few hundred miles away and the smoke is bad enough that my son’s school is closed today. It has not been a good year for California in terms of wildfires.



Document worth reading: “Saliency Prediction in the Deep Learning Era: An Empirical Investigation”

Visual saliency models have enjoyed a big leap in performance in recent years, thanks to advances in deep learning and large scale annotated data. Despite enormous effort and huge breakthroughs, however, models still fall short in reaching human-level accuracy. In this work, I explore the landscape of the field emphasizing on new deep saliency models, benchmarks, and datasets. A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large scale video datasets. Further, I identify factors that contribute to the gap between models and humans and discuss remaining issues that need to be addressed to build the next generation of more powerful saliency models. Some specific questions that are addressed include: in what ways current models fail, how to remedy them, what can be learned from cognitive studies of attention, how explicit saliency judgments relate to fixations, how to conduct fair model comparison, and what are the emerging applications of saliency models. Saliency Prediction in the Deep Learning Era: An Empirical Investigation


Hey, check this out: Columbia’s Data Science Institute is hiring research scientists and postdocs!

Here’s the official announcement:

The Institute’s Postdoctoral and Research Scientists will help anchor Columbia’s presence as a leader in data-science research and applications and serve as resident experts in fostering collaborations with the world-class faculty across all schools at Columbia University. They will also help guide, plan and execute data-science research, applications and technological innovations that address societal challenges and related University-wide initiatives.

Postdoc Fellows

Requirements: PhD degree


Research Scientist (Open Rank)

Requirements: PhD degree + (see position description for more information)

Research Scientist who will conduct independent cutting-edge research in the foundations or application of data science or related fields, or be involved in interdisciplinary research through a collaboration between the Data Science Institute and the various schools at the University.


Research Scientist (Open Rank)

Requirements: PhD degree + (see position description for more information)

Research Scientist who will serve as Columbia’s resident experts to foster collaborations with faculty across all the schools at Columbia University.


Candidates for all Research Scientists positions must apply using the links above that direct to the Columbia HR portal for each position, whereas the Postdoc Fellows submit materials via:

I’m part of the Data Science Institute so if you want to work with me or others here at Columbia, you should apply.

The post Hey, check this out: Columbia’s Data Science Institute is hiring research scientists and postdocs! appeared first on Statistical Modeling, Causal Inference, and Social Science.


What's new on arXiv

An Introductory Survey on Attention Mechanisms in NLP Problems

First derived from human intuition, later adapted to machine translation for automatic token alignment, attention mechanism, a simple method that can be used for encoding sequence data based on the importance score each element is assigned, has been widely applied to and attained significant improvement in various tasks in natural language processing, including sentiment classification, text summarization, question answering, dependency parsing, etc. In this paper, we survey through recent works and conduct an introductory summary of the attention mechanism in different NLP problems, aiming to provide our readers with basic knowledge on this widely used method, discuss its different variants for different tasks, explore its association with other techniques in machine learning, and examine methods for evaluating its performance.

Interactive dimensionality reduction using similarity projections

Recent advances in machine learning allow us to analyze and describe the content of high-dimensional data like text, audio, images or other signals. In order to visualize that data in 2D or 3D, usually Dimensionality Reduction (DR) techniques are employed. Most of these techniques, e.g., PCA or t-SNE, produce static projections without taking into account corrections from humans or other data exploration scenarios. In this work, we propose the interactive Similarity Projection (iSP), a novel interactive DR framework based on similarity embeddings, where we form a differentiable objective based on the user interactions and perform learning using gradient descent, with an end-to-end trainable architecture. Two interaction scenarios are evaluated. First, a common methodology in multidimensional projection is to project a subset of data, arrange them in classes or clusters, and project the rest unseen dataset based on that manipulation, in a kind of semi-supervised interpolation. We report results that outperform competitive baselines in a wide range of metrics and datasets. Second, we explore the scenario of manipulating some classes, while enriching the optimization with high-dimensional neighbor information. Apart from improving classification precision and clustering on images and text documents, the new emerging structure of the projection unveils semantic manifolds. For example, on the Head Pose dataset, by just dragging the faces looking far left to the left and those looking far right to the right, all faces are re-arranged on a continuum even on the vertical axis (face up and down). This end-to-end framework can be used for fast, visual semi-supervised learning, manifold exploration, interactive domain adaptation of neural embeddings and transfer learning.

Cross-lingual Short-text Matching with Deep Learning

The problem of short text matching is formulated as follows: given a pair of sentences or questions, a matching model determines whether the input pair mean the same or not. Models that can automatically identify questions with the same meaning have a wide range of applications in question answering sites and modern chatbots. In this article, we describe the approach by team hahu to solve this problem in the context of the ‘CIKM AnalytiCup 2018 – Cross-lingual Short-text Matching of Question Pairs’ that is sponsored by Alibaba. Our solution is an end-to-end system based on current advances in deep learning which avoids heavy feature-engineering and achieves improved performance over traditional machine-learning approaches. The log-loss scores for the first and second rounds of the contest are 0.35 and 0.39 respectively. The team was ranked 7th from 1027 teams in the overall ranking scheme by the organizers that consisted of the two contest scores as well as: innovation and system integrity, understanding data as well as practicality of the solution for business.

Aequitas: A Bias and Fairness Audit Toolkit

Recent work has raised concerns on the risk of unintended bias in algorithmic decision making systems being used nowadays that can affect individuals unfairly based on race, gender or religion, among other possible characteristics. While a lot of bias metrics and fairness definitions have been proposed in recent years, there is no consensus on which metric/definition should be used and there are very few available resources to operationalize them. Therefore, despite recent awareness, auditing for bias and fairness when developing and deploying algorithmic decision making systems is not yet a standard practice. We present Aequitas, an open source bias and fairness audit toolkit that is an intuitive and easy to use addition to the machine learning workflow, enabling users to seamlessly test models for several bias and fairness metrics in relation to multiple population sub-groups. We believe Aequitas will facilitate informed and equitable decisions around developing and deploying algorithmic decision making systems for both data scientists, machine learning researchers and policymakers.

Emergence of Addictive Behaviors in Reinforcement Learning Agents

This paper presents a novel approach to the technical analysis of wireheading in intelligent agents. Inspired by the natural analogues of wireheading and their prevalent manifestations, we propose the modeling of such phenomenon in Reinforcement Learning (RL) agents as psychological disorders. In a preliminary step towards evaluating this proposal, we study the feasibility and dynamics of emergent addictive policies in Q-learning agents in the tractable environment of the game of Snake. We consider a slightly modified settings for this game, in which the environment provides a ‘drug’ seed alongside the original ‘healthy’ seed for the consumption of the snake. We adopt and extend an RL-based model of natural addiction to Q-learning agents in this settings, and derive sufficient parametric conditions for the emergence of addictive behaviors in such agents. Furthermore, we evaluate our theoretical analysis with three sets of simulation-based experiments. The results demonstrate the feasibility of addictive wireheading in RL agents, and provide promising venues of further research on the psychopathological modeling of complex AI safety problems.

Controllability, Multiplexing, and Transfer Learning in Networks using Evolutionary Learning

Networks are fundamental building blocks for representing data, and computations. Remarkable progress in learning in structurally defined (shallow or deep) networks has recently been achieved. Here we introduce evolutionary exploratory search and learning method of topologically flexible networks under the constraint of producing elementary computational steady-state input-output operations. Our results include; (1) the identification of networks, over four orders of magnitude, implementing computation of steady-state input-output functions, such as a band-pass filter, a threshold function, and an inverse band-pass function. Next, (2) the learned networks are technically controllable as only a small number of driver nodes are required to move the system to a new state. Furthermore, we find that the fraction of required driver nodes is constant during evolutionary learning, suggesting a stable system design. (3), our framework allows multiplexing of different computations using the same network. For example, using a binary representation of the inputs, the network can readily compute three different input-output functions. Finally, (4) the proposed evolutionary learning demonstrates transfer learning. If the system learns one function A, then learning B requires on average less number of steps as compared to learning B from tabula rasa. We conclude that the constrained evolutionary learning produces large robust controllable circuits, capable of multiplexing and transfer learning. Our study suggests that network-based computations of steady-state functions, representing either cellular modules of cell-to-cell communication networks or internal molecular circuits communicating within a cell, could be a powerful model for biologically inspired computing. This complements conceptualizations such as attractor based models, or reservoir computing.

An Introduction to Fuzzy & Annotated Semantic Web Languages

We present the state of the art in representing and reasoning with fuzzy knowledge in Semantic Web Languages such as triple languages RDF/RDFS, conceptual languages of the OWL 2 family and rule languages. We further show how one may generalise them to so-called annotation domains, that cover also e.g. temporal and provenance extensions.

Deep Bayesian Inversion

Characterizing statistical properties of solutions of inverse problems is essential for decision making. Bayesian inversion offers a tractable framework for this purpose, but current approaches are computationally unfeasible for most realistic imaging applications in the clinic. We introduce two novel deep learning based methods for solving large-scale inverse problems using Bayesian inversion: a sampling based method using a WGAN with a novel mini-discriminator and a direct approach that trains a neural network using a novel loss function. The performance of both methods is demonstrated on image reconstruction in ultra low dose 3D helical CT. We compute the posterior mean and standard deviation of the 3D images followed by a hypothesis test to assess whether a ‘dark spot’ in the liver of a cancer stricken patient is present. Both methods are computationally efficient and our evaluation shows very promising performance that clearly supports the claim that Bayesian inversion is usable for 3D imaging in time critical applications.

Multiscale change point detection for dependent data

In this paper we study the theoretical properties of the simultaneous multiscale change point estimator (SMUCE) proposed by Frick et al. (2014) in regression models with dependent error processes. Empirical studies show that in this case the change point estimate is inconsistent, but it is not known if alternatives suggested in the literature for correlated data are consistent. We propose a modification of SMUCE scaling the basic statistic by the long run variance of the error process, which is estimated by a difference-type variance estimator calculated from local means from different blocks. For this modification we prove model consistency for physical dependent error processes and illustrate the finite sample performance by means of a simulation study.

Age of Information Scaling in Large Networks

We study age of information in a multiple source-multiple destination setting with a focus on its scaling in large wireless networks. There are n nodes that are randomly paired with each other on a fixed area to form n source-destination (S-D) pairs. We propose a three-phase transmission scheme which utilizes local cooperation between the nodes by forming what we call mega update packets to serve multiple S-D pairs at once. We show that under the proposed scheme average age of an S-D pair scales as O(n^{\frac{1}{4}}) as the number of users, n, in the network grows. To the best of our knowledge, this is the best age scaling result for a multiple source-multiple destination setting.

Composing Modeling and Inference Operations with Probabilistic Program Combinators

Probabilistic programs with dynamic computation graphs can define measures over sample spaces with unbounded dimensionality, and thereby constitute programmatic analogues to Bayesian nonparametrics. Owing to the generality of this model class, inference relies on ‘black-box’ Monte Carlo methods that are generally not able to take advantage of conditional independence and exchangeability, which have historically been the cornerstones of efficient inference. We here seek to develop a ‘middle ground’ between probabilistic models with fully dynamic and fully static computation graphs. To this end, we introduce a combinator library for the Probabilistic Torch framework. Combinators are functions that accept models and return transformed models. We assume that models are dynamic, but that model composition is static, in the sense that combinator application takes place prior to evaluating the model on data. Combinators provide primitives for both model and inference composition. Model combinators take the form of classic functional programming constructs such as map and reduce. These constructs define a computation graph at a coarsened level of representation, in which nodes correspond to models, rather than individual variables. Inference combinators – such as enumeration, importance resampling, and Markov Chain Monte Carlo operators – assume a sampling semantics for model evaluation, in which application of combinators preserves proper weighting. Owing to this property, models defined using combinators can be trained using stochastic methods that optimize either variational or wake-sleep style objectives. As a validation of this principle, we use combinators to implement black box inference for hidden Markov models.

Char2char Generation with Reranking for the E2E NLG Challenge
Parser Extraction of Triples in Unstructured Text
Construction of an algebra corresponding to a statistical model of the square ladder (square lattice with two lines)
Internal Wiring of Cartesian Verbs and Prepositions
Native Language Identification using i-vector
Non-intrusive model reduction of static parametric non-linear systems and application to global optimization and uncertainty quantification
Conformal Bootstrap Analysis for Localization: Symplectic Case
Towards Neural Machine Translation for African Languages
Jointly identifying opinion mining elements and fuzzy measurement of opinion intensity to analyze product features
Learning to Compensate Photovoltaic Power Fluctuations from Images of the Sky by Imitating an Optimal Policy
Two-stream convolutional networks for end-to-end learning of self-driving cars
Few-shot Learning for Named Entity Recognition in Medical Text
Optimal Scalar Linear Index Codes for Symmetric and Neighboring Side-information Problems
ML-Net: multi-label classification of biomedical texts with deep neural networks
On the number of sets with a given doubling constant
ROMAN: Reduced-Order Modeling with Artificial Neurons
Sampling from manifold-restricted distributions using tangent bundle projections
Phase transition for the frog model on biregular trees
Novel Inter-file Coded Placement and D2D Delivery for a Cache-aided Fog-RAN Architecture
A combinatorial $\mathfrak{sl}_2$-action and the Sperner property for the weak order
Random periodic solutions and ergodicity for stochastic differential equations
Many cusped hyperbolic 3-manifolds do not bound geometrically
A geometric study of Strassen’s asymptotic rank conjecture and its variants
Evaluating GANs via Duality
A survey on graphs with convex quadratic stability number
Wavelet Based Dictionaries for Dimensionality Reduction of ECG Signals
Deep Q learning for fooling neural networks
Semi-dual Regularized Optimal Transport
Heuristic Voting as Ordinal Dominance Strategies
Towards Characterising Bayesian Network Models under Selection
Robust Dynamic CPU Resource Provisioning in Virtualized Servers
Staging Human-computer Dialogs: An Application of the Futamura Projections
Data Driven Governing Equations Approximation Using Deep Neural Networks
Discourse in Multimedia: A Case Study in Information Extraction
An Analysis of the Semantic Annotation Task on the Linked Data Cloud
Neural Wavetable: a playable wavetable synthesizer using neural networks
Corpus Phonetics Tutorial
Identification of semiparametric discrete outcome models with bounded covariates
What is really needed to justify ignoring the response mechanism for modelling purposes?
A New SVDD-Based Multivariate Non-parametric Process Capability Index
Text Assisted Insight Ranking Using Context-Aware Memory Network
Region-Referenced Spectral Power Dynamics of EEG Signals: A Hierarchical Modeling Approach
Estimation of High-Dimensional Seemingly Unrelated Regression Models
Consensus and Sectioning-based ADMM with Norm-1 Regularization for Imaging with a Compressive Reflector Antenna
An Overview of Semiparametric Extensions of Finite Mixture Models
Improving constant in end-point Poincaré inequality on Hamming cube
YOLO-LITE: A Real-Time Object Detection Algorithm Optimized for Non-GPU Computers
TrolleyMod v1.0: An Open-Source Simulation and Data-Collection Platform for Ethical Decision Making in Autonomous Vehicles
Transform Methods for Heavy-Traffic Analysis
Extractive Summary as Discrete Latent Variables
Boundary Braids
Fokker-Planck equations for nonlinear dynamical systems driven by multiplicative $α$-stable Lévy motions
Central limit theorem and moderate deviations for a class of semilinear SPDES
Bayesian Reinforcement Learning in Factored POMDPs
SepNE: Bringing Separability to Network Embedding
Improving Distantly Supervised Relation Extraction with Neural Noise Converter and Conditional Optimal Selector
Style and Content Disentanglement in Generative Adversarial Networks
A Game Theoretic Approach for Dynamic Information Flow Tracking to Detect Multi-Stage Advanced Persistent Threats
Multi-Winner Contests for Strategic Diffusion in Social Networks
How Drones Look: Crowdsourced Knowledge Transfer for Aerial Video Saliency Prediction
A framework for covert and secret key expansion over quantum channels
Memory-Efficient Quantum Circuit Simulation by Using Lossy Data Compression
Translating a Math Word Problem to an Expression Tree
On the Capacity of MISO Channels with One-Bit ADCs and DACs
Gaussian Reciprocal Sequences from the Viewpoint of Conditionally Markov Sequences
Dropping Symmetry for Fast Symmetric Nonnegative Matrix Factorization
Fast Distribution Grid Line Outage Identification with $μ$PMU
Analysis of Gaussian Spatial Models with Covariate Measurement Error
Submodular Optimization Over Streams with Inhomogeneous Decays
Sample complexity of partition identification using multi-armed bandits
MT-CGCNN: Integrating Crystal Graph Convolutional Neural Network with Multitask Learning for Material Property Prediction
Cutting resilient networks — complete binary trees
Saddlepoint adjusted inversion of characteristic functions
Off-grid Variational Bayesian Inference of Line Spectral Estimation from One-bit Samples
Modeling Coherence for Discourse Neural Machine Translation
Layout Design for Intelligent Warehouse by Evolution with Fitness Approximation
Melodic Phrase Segmentation By Deep Neural Networks
Leveraging Aspect Phrase Embeddings for Cross-Domain Review Rating Prediction
Efficient and Scalable Multi-task Regression on Massive Number of Tasks
Generating Multiple Diverse Responses for Short-Text Conversation
Preventive Equipment Repair Planning Model
A Radiomics Approach to Traumatic Brain Injury Prediction in CT Scans
Plan-And-Write: Towards Better Automatic Storytelling
AMGCL: an Efficient, Flexible, and Extensible Algebraic Multigrid Implementation
An analysis of a fair division protocol for drawing legislative districts
Plateau Polycubes and Lateral Area
From Free Text to Clusters of Content in Health Records: An Unsupervised Graph Partitioning Approach
A Deterministic Algorithm for Bridging Anaphora Resolution
Universal Polarization for Processes with Memory
Acyclic subgraphs with high chromatic number
Lattice paths and submonoids of $\mathbb Z^2$
Optimal stopping of Brownian motion with broken drift
Neural Based Statement Classification for Biased Language
Stochastic Algorithmic Differentiation of (Expectations of) Discontinuous Functions (Indicator Functions)
Measuring Road Network Topology Vulnerability by Ricci Curvature
Rice-Marlin Codes: Tiny and Efficient Variable-to-Fixed Codes
Space-time localisation for the dynamic $Φ^4_3$ model
SLIM: Simultaneous Logic-in-Memory Computing Exploiting Bilayer Analog OxRAM Devices
LoANs: Weakly Supervised Object Detection with Localizer Assessor Networks
Creatures great and SMAL: Recovering the shape and motion of animals from video
Robustness of spectral methods for community detection
Groups with few maximal sum-free sets
ProstateGAN: Mitigating Data Bias via Prostate Diffusion Imaging Synthesis with Generative Adversarial Networks
Distortion Robust Image Classification with Deep Convolutional Neural Network based on Discrete Cosine Transform
Statistical post-processing of dual-resolution ensemble forecasts
Revisiting Projection-Free Optimization for Strongly Convex Constraint Sets
Time-Varying Isotropic Vector Random Fields on Compact Two-Point Homogeneous Spaces
A Learning-Based Framework for Line-Spectra Super-resolution
A combinatorial classification of 2-regular simple modules for Nakayama algebras
A structural characterization of tree-based phylogenetic networks
Drop-Activation: Implicit Parameter Reduction and Harmonic Regularization
Predicting the time-evolution of multi-physics systems with sequence-to-sequence models
Robust low-rank multilinear tensor approximation for a joint estimation of the multilinear rank and the loading matrices
Pitfalls of Graph Neural Network Evaluation
Large-scale Interactive Recommendation with Tree-structured Policy Gradient
Design of Spectrally Shaped Binary Sequences via Randomized Convex Relaxation
Experimental 3D Coherent Diffractive Imaging from photon-sparse random projections
Dependency Grammar Induction with a Neural Variational Transition-based Parser
Data-Enabled Predictive Control: In the Shallows of the DeePC
Reduced Order Controller Design for Robust Output Regulation of Parabolic Systems
Development of Real-time ADAS Object Detector for Deployment on CPU
QUENN: QUantization Engine for low-power Neural Networks
A Simulated Cyberattack on Twitter: Assessing Partisan Vulnerability to Spear Phishing and Disinformation ahead of the 2018 U.S. Midterm Elections
The ADAPT System Description for the IWSLT 2018 Basque to English Translation Task
The Greedy Dirichlet Process Filter – An Online Clustering Multi-Target Tracker
Structural and temporal heterogeneities on networks
Applications of mesoscopic CLTs in random matrix theory
A Duality Based 2-Approximation Algorithm for Maximum Agreement Forest
Matrix rigidity and the ill-posedness of Robust PCA and matrix completion
Bandana: Using Non-volatile Memory for Storing Deep Learning Models
SCORE+ for Network Community Detection
Evolving intrinsic motivations for altruistic behavior
Streaming Network Embedding through Local Actions
Deep Nonlinear Non-Gaussian Filtering for Dynamical Systems
Opinion dynamics with Lotka-Volterra type interactions
Domain Randomization for Scene-Specific Car Detection and Pose Estimation
Virtual Net: a Decentralized Architecture for Interaction in Mobile Virtual Worlds
Mayall: A Framework for Desktop JavaScript Auditing and Post-Exploitation Analysis
EdgeBench: Benchmarking Edge Computing Platforms
Jointly Learning to Label Sentences and Tokens
The exchange-driven growth model: basic properties and longtime behavior
Pulse radar with FPGA range compression for real time displacement and vibration monitoring
Strong Feller property for SDEs driven by multiplicative cylindrical stable noise
Tower Cranes and Supply Points Locating Problem Using CBO, ECBO, and VPS
No-Frills Human-Object Interaction Detection: Factorization, Appearance and Layout Encodings, and Training Techniques
Geometry of Gaussian free field sign clusters and random interlacements


Distilled News

Does Synthetic Data Hold The Secret To Artificial Intelligence?

Could synthetic data be the solution to rapidly training artificial intelligence (AI) algorithms? There are advantages and disadvantages to synthetic data; however, many technology experts believe that synthetic data is the key to democratizing machine learning and to accelerating the testing and adoption of artificial intelligence algorithms in our daily lives.

Best Deals in Deep Learning Cloud Providers

I wanted to figure out where I should train my deep learning models online for the lowest cost and least hassle. I wasn’t able to find a good comparison of GPU cloud service providers, so I decided to make my own. Feel free to skip to the pretty charts if you know all about GPUs and TPUs and just want the results. I’m not looking at serving models in this article, but I might in the future. Follow me to make sure you don’t miss out.

Forecasting for data-driven decision making

Accurate forecasts are key components of successful data-driven businesses. We may forecast the need for internal resources (e.g. call center staffing), key metrics that drive other business decisions (e.g. electricity demand to decide on constructing a new power plant), or customer demand for products we distribute (retail demand forecasting). Time horizons of our forecasts may differ widely, from years in advance to only a few minutes ahead.

Analyzing Experiment Outcomes: Beyond Average Treatment Effects

At Uber, we test most new features and products with the help of experiments in order to understand and quantify their impact on our marketplace. The analysis of experimental results traditionally focuses on calculating average treatment effects (ATEs). Since averages reduce an entire distribution to a single number, however, any heterogeneity in treatment effects will go unnoticed. Instead, we have found that calculating quantile treatment effects (QTEs) allows us to effectively and efficiently characterize the full distribution of treatment effects and thus capture the inherent heterogeneity in treatment effects when thousands of riders and drivers interact within Uber’s marketplace. Besides providing a more nuanced picture of the effect of a new algorithm, this analysis is relevant to our business because people remember negative experiences more strongly than positive ones (see Baumeister et al. (2001)). In this article, we describe what QTEs are, how exactly they provide additional insights beyond ATEs, why they are relevant for a business such as Uber’s, and how we calculate them.
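
As a toy illustration (not Uber's internal implementation), unconditional QTEs can be estimated by differencing the empirical outcome quantiles of the treatment and control groups:

```python
import numpy as np

def quantile_treatment_effects(treated, control, quantiles=(0.1, 0.5, 0.9)):
    """QTE at quantile q = q-th quantile of treated outcomes minus
    q-th quantile of control outcomes."""
    qs = np.asarray(quantiles, dtype=float)
    treated_q = np.quantile(np.asarray(treated, dtype=float), qs)
    control_q = np.quantile(np.asarray(control, dtype=float), qs)
    return treated_q - control_q
```

Unlike the ATE, this returns one effect per quantile, so a treatment that helps the median rider but hurts the worst-off 10% shows up directly.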

Model evaluation, model selection, and algorithm selection in machine learning

This final article in the series Model evaluation, model selection, and algorithm selection in machine learning presents overviews of several statistical hypothesis testing approaches, with applications to machine learning model and algorithm comparisons. This includes statistical tests based on target predictions for independent test sets (the downsides of using a single test set for model comparisons were discussed in previous articles) as well as methods for algorithm comparisons by fitting and evaluating models via cross-validation. This article also introduces nested cross-validation, which has become a common and recommended method of choice for algorithm comparisons on small to moderately-sized datasets. At the end, I provide a list of my personal suggestions concerning model evaluation, selection, and algorithm selection, summarizing the several techniques covered in this series of articles.
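
The nested cross-validation procedure mentioned above can be sketched in plain numpy; `fit` and `score` are caller-supplied toy hooks, not any particular library's API:

```python
import numpy as np

def nested_cv(X, y, fit, score, param_grid, outer_k=5, inner_k=3, seed=0):
    """Nested CV: inner folds select a hyperparameter, outer folds estimate
    the generalization performance of the whole selection procedure."""
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(len(y)), outer_k)
    outer_scores = []
    for i in range(outer_k):
        test_idx = outer_folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
        inner_folds = np.array_split(train_idx, inner_k)

        def inner_score(param):
            # average validation score of `param` over the inner folds
            scores = []
            for j in range(inner_k):
                val = inner_folds[j]
                tr = np.concatenate([f for k, f in enumerate(inner_folds) if k != j])
                scores.append(score(fit(X[tr], y[tr], param), X[val], y[val]))
            return np.mean(scores)

        best_param = max(param_grid, key=inner_score)        # inner model selection
        model = fit(X[train_idx], y[train_idx], best_param)  # refit on outer train
        outer_scores.append(score(model, X[test_idx], y[test_idx]))
    return float(np.mean(outer_scores))
```

The point of the nesting is that hyperparameter selection happens entirely inside each outer training fold, so the outer score is an honest estimate of the full procedure.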

Getting to Know Keras for New Data Scientists

For many new data scientists transitioning into AI and deep learning, the Keras framework is an efficient tool. Keras is a powerful and easy-to-use Python library for developing and evaluating deep learning models. In this article, we’ll lay out the welcome mat to the framework. You should walk away with a handful of useful features to keep in mind as you get up to speed. In the words of the developers, ‘Keras is a high-level neural networks API, written in Python and developed with a focus on enabling fast experimentation.’ It has been open sourced since its initial release in March 2015. Its documentation can be found online, with source code on GitHub.

Differences Between Machine Learning & Deep Learning

Starting off, you’ll learn about artificial intelligence and then move on to machine learning and deep learning. You will further learn how machine learning differs from deep learning, and the various kinds of algorithms that fall under these two domains of learning. Finally, you will be introduced to some real-life applications where machine learning and deep learning are being applied.

Simplifying Sentiment Analysis in Python

The promise of machine learning has shown many stunning results in a wide variety of fields. Natural language processing is no exception, and it is one of those fields where machine learning has been able to show general artificial intelligence (not entirely, but at least partially), achieving some brilliant results for genuinely complicated tasks. Now, NLP (natural language processing) is not a new field and neither is machine learning. But the fusion of the two fields is quite contemporary and only promises further progress. This is one of those hybrid applications which everyone (with a budget smart-phone) comes across daily. For example, take ‘keyboard word suggestion’ into account, or intelligent auto-completion; these are all byproducts of the amalgamation of NLP and machine learning, and quite naturally they have become inseparable parts of our lives.

Machine Learning in Excel With Python

Machine learning is an important topic in lots of industries right now. It’s a fast moving field with lots of active research and receives huge amounts of media attention. This post isn’t intended to be an introduction to machine learning, or a comprehensive overview of the state of the art. Instead it will show how models built using machine learning can be leveraged from within Excel.

More on Bias Corrected Standard Deviation Estimates

This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments. For normal deviates there is, of course, a well-known scaling correction that returns an unbiased estimate for observed standard deviations.
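
The scaling correction alluded to is presumably the classic c4 factor: for n i.i.d. normal observations, E[s] = c4(n)·σ, so dividing the sample standard deviation by c4(n) removes the bias. A minimal sketch:

```python
import math

def c4(n):
    """Bias factor for the sample standard deviation of n i.i.d. normal
    observations: E[s] = c4(n) * sigma, so s / c4(n) is unbiased."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2.0) / math.gamma((n - 1) / 2.0)
```

c4(n) approaches 1 as n grows, so the correction matters mostly for small samples.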

Searching for the optimal hyper-parameters of an ARIMA model in parallel: the tidy gridsearch approach

In this blog post, I’ll use the data that I cleaned in a previous blog post, which you can download here. If you want to follow along, download the monthly data. In the previous blog post, I used the auto.arima() function to very quickly get a ‘good-enough’ model to predict future monthly total passengers flying from LuxAirport. ‘Good-enough’ models can be all you need in a lot of situations, but perhaps you’d like to have a better model. I will show here how you can get a better model by searching through a grid of hyper-parameters.

A Mathematician’s Perspective on Topological Data Analysis and R

A few years ago, when I first became aware of Topological Data Analysis (TDA), I was really excited by the possibility that the elegant theorems of Algebraic Topology could provide some new insights into the practical problems of data analysis. But time has passed, and the sober assessment of Larry Wasserman seems to describe where things stand.

How to teach an AI to play Games: Deep Reinforcement Learning

If you are excited about Machine Learning, and you’re interested in how it can be applied to Gaming or Optimization, this article is for you. We’ll see the basics of Reinforcement Learning, and more specifically Deep Reinforcement Learning (Neural Networks + Q-Learning) applied to the game Snake. Let’s dive into it!
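
The tabular Q-learning update underlying the article's deep variant can be written in a few lines (the deep version replaces the table with a neural network; this sketch is for intuition, not the article's code):

```python
from collections import defaultdict

def make_q():
    """Q-table: Q[state][action] -> estimated value, defaulting to 0."""
    return defaultdict(lambda: defaultdict(float))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Bellman backup: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```

In the Snake setting, `s` would encode the snake's surroundings, `a` a movement direction, and `r` the reward for eating food or dying.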

Neural Net from scratch (using Numpy)

This post is about building a shallow neural network (NN) from scratch for a classification problem, using the numpy library in Python, and comparing its performance against logistic regression (using scikit-learn). Building an NN from scratch helps in understanding how NNs work under the hood, which is essential for building effective models. Without further delay, let’s dive into building our simple shallow NN model from scratch.
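
As a rough sketch of what such a from-scratch model looks like (the layer sizes, learning rate and backprop details here are illustrative, not the post's actual code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=4, lr=1.0, epochs=2000, seed=0):
    """One-hidden-layer network trained by gradient descent on cross-entropy."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        p = sigmoid(h @ W2 + b2)
        # backward pass: gradients of binary cross-entropy w.r.t. each layer
        dp = (p - y) / len(y)
        dh = (dp @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ dp; b2 -= lr * dp.sum(0)
        W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(0)
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    return (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
```

Everything the usual frameworks hide (the forward pass, the chain rule, the parameter updates) is spelled out in the training loop.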

Feel discouraged by sparse data in your hand? Give Factorization Machine a shot (1)

If you’re a data scientist in industry, have you ever had to face your customers and tell them that a project might not live up to their expectations because of data sparsity?

A Comprehensive Hands-on Guide to Transfer Learning with Real-World Applications in Deep Learning

Conventional machine learning and deep learning algorithms have traditionally been designed to work in isolation. These algorithms are trained to solve specific tasks, and the models have to be rebuilt from scratch once the feature-space distribution changes. Transfer learning is the idea of overcoming the isolated learning paradigm and utilizing knowledge acquired for one task to solve related ones. In this article, we will provide comprehensive coverage of the concepts, scope and real-world applications of transfer learning, and even showcase some hands-on examples. To be more specific, we will be covering the following.

The ABC of Machine Learning

There has been a renewed interest in machine learning in the last few years. This revival seems to be driven by strong fundamentals: loads of data being emitted by sensors across the globe, with cheap storage and the lowest-ever computational costs! The buzz around machine learning has developed keen interest among techies to get their hands on ML. However, before diving into the ocean of ML, here are a few basic concepts that you should be familiar with. Keep this handy, as you will come across these terms frequently while learning ML.

Implementing Facebook Prophet efficiently

If you have ever worked with time series predictions, I am quite sure you are well aware of the strains and pains that come with them. One moment you think you have cracked the stock market, the next moment you are lying in the bath crying and cursing your inaccurate models (I really don’t recommend that you try to predict the stock market; you will most likely not reap the benefits you think you will). What I am trying to say is that time series predictions are difficult and always require a very specialized data scientist to implement them.

Dash: A Beginner’s Guide

As a data scientist, one of the most integral aspects of our job is to relay and display data to ‘non-data scientists’, in formats that provide visually actionable data. In my opinion, one of the coolest parts of our job is interacting with data, especially when there is visual interaction. At times when we might want to build an interactive application, one of the options available to us is a framework called Dash. Dash is an open source Python framework for building web applications, created and maintained by the people at Plotly. Dash’s web graphics are completely interactive because the framework is built on top of Plotly.js, a JavaScript library written and maintained by Plotly. This means that after importing the Dash framework into a Python file, you can build a web application writing strictly in Python, with no other languages necessary.


If you did not already know

Information Extraction Technology google
With the rise of the digital age, there is an explosion of information in the form of news, articles, social media, and so on. Much of this data lies in unstructured form, and manually managing and effectively making use of it is tedious, boring and labor intensive. This explosion of information and the need for more sophisticated and efficient information handling tools gives rise to Information Extraction (IE) and Information Retrieval (IR) technology. Information Extraction systems take natural language text as input and produce structured information, specified by certain criteria, that is relevant to a particular application. Various sub-tasks of IE such as Named Entity Recognition, Coreference Resolution, Named Entity Linking, Relation Extraction, and Knowledge Base reasoning form the building blocks of various high-end Natural Language Processing (NLP) tasks such as Machine Translation, Question-Answering Systems, Natural Language Understanding, Text Summarization and Digital Assistants like Siri, Cortana and Google Now. This paper introduces Information Extraction technology and its various sub-tasks, highlights state-of-the-art research in various IE sub-tasks, and discusses current challenges and future research directions. …

Implicit Maximum Likelihood Estimation google
Implicit probabilistic models are models defined naturally in terms of a sampling procedure, which often induces a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results. …

Lovasz Convolutional Network (LCN) google
Semi-supervised learning on graph structured data has received significant attention with the recent introduction of graph convolution networks (GCN). While traditional methods have focused on optimizing a loss augmented with Laplacian regularization framework, GCNs perform an implicit Laplacian type regularization to capture local graph structure. In this work, we propose Lovasz convolutional networks (LCNs), which are capable of incorporating global graph properties. LCNs achieve this by utilizing Lovasz’s orthonormal embeddings of the nodes. We analyse local and global properties of graphs and demonstrate settings where LCNs tend to work better than GCNs. We validate the proposed method on standard random graph models such as stochastic block models (SBM) and certain community structure based graphs where LCNs outperform GCNs and learn more intuitive embeddings. We also perform extensive binary and multi-class classification experiments on real world datasets to demonstrate LCNs’ effectiveness. In addition to simple graphs, we also demonstrate the use of LCNs on hypergraphs by identifying settings where they are expected to work better than GCNs. …


Obama-Trump voters turn back to Democrats

Senate Democrats did especially well where Donald Trump had gained the most ground


Using a genetic algorithm for the hyperparameter optimization of a SARIMA model

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


In this blog post, I’ll use the data that I cleaned in a previous
blog post, which you can download
here. If you want to follow along,
download the monthly data. In my last blog post
I showed how to perform a grid search the “tidy” way. As an example, I looked for the right
hyperparameters of a SARIMA model. However, the goal of the post was not hyperparameter optimization
per se, so I did not bother with tuning the hyperparameters on a validation set, and used the test
set for both validation of the hyperparameters and testing the forecast. Of course, this is not great
because doing this might lead to overfitting the hyperparameters to the test set. So in this blog post
I split my data into training, validation and testing sets and use a genetic algorithm to look
for the hyperparameters. Again, this is not the optimal way to go about this problem, since
the {forecast} package contains the very useful auto.arima() function. I just wanted to see
what kind of solution a genetic algorithm would return, and also try different cost functions.
If you’re interested, read on!


Let’s first load some libraries and define some helper functions (the helper functions were explained
in the previous blog posts):


ihs <- function(x){
    log(x + sqrt(x**2 + 1))
}

to_tibble <- function(forecast_object){
    point_estimate <- forecast_object$mean %>%
        as_tsibble() %>%
        rename(point_estimate = value,
               date = index)

    upper <- forecast_object$upper %>%
        as_tsibble() %>%
        spread(key, value) %>%
        rename(date = index,
               upper80 = `80%`,
               upper95 = `95%`)

    lower <- forecast_object$lower %>%
        as_tsibble() %>%
        spread(key, value) %>%
        rename(date = index,
               lower80 = `80%`,
               lower95 = `95%`)

    reduce(list(point_estimate, upper, lower), full_join)
}

Now, let’s load the data:

avia_clean_monthly <- read_csv("")
## Parsed with column specification:
## cols(
##   destination = col_character(),
##   date = col_date(format = ""),
##   passengers = col_integer()
## )

Let’s split the data into a train set, a validation set and a test set:

avia_clean_train <- avia_clean_monthly %>%
    select(date, passengers) %>%
    filter(year(date) < 2013) %>%
    group_by(date) %>%
    summarise(total_passengers = sum(passengers)) %>%
    pull(total_passengers) %>%
    ts(., frequency = 12, start = c(2005, 1))

avia_clean_validation <- avia_clean_monthly %>%
    select(date, passengers) %>%
    filter(between(year(date), 2013, 2016)) %>%
    group_by(date) %>%
    summarise(total_passengers = sum(passengers)) %>%
    pull(total_passengers) %>%
    ts(., frequency = 12, start = c(2013, 1))

avia_clean_test <- avia_clean_monthly %>%
    select(date, passengers) %>%
    filter(year(date) >= 2016) %>%
    group_by(date) %>%
    summarise(total_passengers = sum(passengers)) %>%
    pull(total_passengers) %>%
    ts(., frequency = 12, start = c(2016, 1))

logged_test_data <- ihs(avia_clean_test)

logged_validation_data <- ihs(avia_clean_validation)

logged_train_data <- ihs(avia_clean_train)

I will train the models on data from 2005 to 2012, look for the hyperparameters on data from 2013
to 2016 and test the accuracy on data from 2016 to March 2018. For this kind of exercise, the ideal
situation would be to perform cross-validation. Doing this with time-series data is not obvious
because of the autocorrelation between observations, which would be broken by the independent
sampling that CV requires. Also, if you do leave-one-out CV, for example,
you would end up trying to predict a point in, say, 2017, with data
from 2018, which does not make sense. So you should be careful about that. The {forecast} package
is able to perform CV for time series, and scikit-learn, the
Python package, is able to perform
cross-validation of time series data
too. I will not do it in this blog post and simply focus on the genetic algorithm part.
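
For intuition, the constraint those tools enforce (each validation fold comes strictly after its training window) can be sketched as an expanding-window splitter in a few lines of Python; this is an illustration, not scikit-learn's actual TimeSeriesSplit:

```python
def rolling_origin_splits(n, n_splits):
    """Yield (train_indices, validation_indices) pairs where each validation
    window starts strictly after its (expanding) training window ends."""
    fold = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, i * fold))
        valid_idx = list(range(i * fold, min((i + 1) * fold, n)))
        yield train_idx, valid_idx
```

Because training indices always precede validation indices, the autocorrelation structure of the series is never broken by the split (any leftover points when n is not divisible are simply dropped in this sketch).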

Let’s start by defining the cost function to minimize. I’ll try several, in the first one I will
minimize the RMSE:

cost_function_rmse <- function(param, train_data, validation_data, forecast_periods){
    order <- param[1:3]
    season <- c(param[4:6], 12)
    model <- purrr::possibly(arima, otherwise = NULL)(x = train_data, order = order, 
                                                      seasonal = season,
                                                      method = "ML")
    if(is.null(model)){
        return(9999999)
    } else {
        forecast_model <- forecast::forecast(model, h = forecast_periods)
        point_forecast <- forecast_model$mean
        sqrt(mean((point_forecast - validation_data) ** 2))
    }
}

If arima() is not able to estimate a model for the given parameters, I force it to return NULL,
and in that case force the cost function to return a very high cost. If a model was successfully estimated,
then I compute the RMSE.

Let’s also take a look at what auto.arima() says:

starting_model <- auto.arima(logged_train_data)
## Series: logged_train_data 
## ARIMA(1,0,2)(2,1,0)[12] 
## Coefficients:
##          ar1      ma1     ma2     sar1     sar2
##       0.9754  -0.7872  0.2091  -0.7285  -0.4413
## s.e.  0.0261   0.1228  0.1213   0.1063   0.1150
## sigma^2 estimated as 0.004514:  log likelihood=105.61
## AIC=-199.22   AICc=-198.13   BIC=-184.64
## Training set error measures:
##                       ME       RMSE        MAE        MPE      MAPE
## Training set 0.008398036 0.06095102 0.03882593 0.07009285 0.3339574
##                   MASE       ACF1
## Training set 0.4425794 0.02073886

Let’s compute the cost at this vector of parameters:

cost_function_rmse(c(1, 0, 2, 2, 1, 0),
                   train_data = logged_train_data,
                   validation_data = logged_validation_data,
                   forecast_periods = 65)
## [1] 0.1731473

Ok, now let’s start with optimizing the hyperparameters. Let’s help the genetic algorithm a little
bit by defining where it should perform the search:

domains <- matrix(c(0, 3, 0, 2, 0, 3, 0, 3, 0, 2, 0, 3), byrow = TRUE, ncol = 2)

This matrix constrains the first parameter to lie between 0 and 3, the second one between 0 and 2,
and so on.

Let’s call the genoud() function from the {rgenoud} package, and use 8 cores:

cl <- makePSOCKcluster(8)
clusterExport(cl, c('logged_train_data', 'logged_validation_data'))

tic <- Sys.time()

auto_arima_rmse <- genoud(cost_function_rmse,
                     nvars = 6,
                     data.type.int = TRUE,
                     starting.values = c(1, 0, 2, 2, 1, 0), # <- from auto.arima
                     Domains = domains,
                     cluster = cl,
                     train_data = logged_train_data,
                     validation_data = logged_validation_data,
                     forecast_periods = length(logged_validation_data),
                     hard.generation.limit = TRUE)
toc_rmse <- Sys.time() - tic

makePSOCKcluster() is a function from the {parallel} package. I must also export the global
variables logged_train_data and logged_validation_data. If I don’t do that, the workers called
by genoud() will not know about these variables and an error will be returned. The option
data.type.int = TRUE forces the algorithm to look only for integers, and hard.generation.limit = TRUE
forces the algorithm to stop after 100 generations.

The process took 7 minutes, which is faster than doing the grid search.
What was the solution found?

## $value
## [1] 0.0001863039
## $par
## [1] 3 2 1 1 2 1
## $gradients
## [1] NA NA NA NA NA NA
## $generations
## [1] 11
## $peakgeneration
## [1] 1
## $popsize
## [1] 1000
## $operators
## [1] 122 125 125 125 125 126 125 126   0

Let’s train the model using the arima() function at these parameters:

best_model_rmse <- arima(logged_train_data, order = auto_arima_rmse$par[1:3], 
                         seasonal = list(order = auto_arima_rmse$par[4:6], period = 12),
                         method = "ML")

## Call:
## arima(x = logged_train_data, order = auto_arima_rmse$par[1:3], seasonal = list(order = auto_arima_rmse$par[4:6], 
##     period = 12), method = "ML")
## Coefficients:
##           ar1      ar2      ar3      ma1     sar1     sma1
##       -0.6999  -0.4541  -0.0476  -0.9454  -0.4996  -0.9846
## s.e.   0.1421   0.1612   0.1405   0.1554   0.1140   0.2193
## sigma^2 estimated as 0.006247:  log likelihood = 57.34,  aic = -100.67
## Training set error measures:
##                         ME       RMSE        MAE          MPE      MAPE
## Training set -0.0006142355 0.06759545 0.04198561 -0.005408262 0.3600483
##                   MASE         ACF1
## Training set 0.4386693 -0.008298546

Let’s extract the forecasts:

best_model_rmse_forecast <- forecast::forecast(best_model_rmse, h = 65)

best_model_rmse_forecast <- to_tibble(best_model_rmse_forecast)
## Joining, by = "date"
## Joining, by = "date"
starting_model_forecast <- forecast(starting_model, h = 65)

starting_model_forecast <- to_tibble(starting_model_forecast)
## Joining, by = "date"
## Joining, by = "date"

and plot the forecast to see how it looks:

avia_clean_monthly %>%
    group_by(date) %>%
    summarise(total = sum(passengers)) %>%
    mutate(total_ihs = ihs(total)) %>%
    ggplot() +
    ggtitle("Minimization of RMSE") +
    geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
    scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
    geom_ribbon(data = best_model_rmse_forecast, aes(x = date, ymin = lower95, ymax = upper95),
                fill = "#666018", alpha = 0.2) +
    geom_line(data = best_model_rmse_forecast, aes(x = date, y = point_estimate), 
              linetype = 2, colour = "#8e9d98") +
    geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95),
                fill = "#98431e", alpha = 0.2) +
    geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate), 
              linetype = 2, colour = "#a53031")

The yellowish line and confidence intervals come from the genetic algorithm minimizing the RMSE, and the
reddish ones from auto.arima(). Interesting: the point estimate is very precise, but the confidence
intervals are very wide. Low bias, high variance.

Now, let’s try with another cost function, where I minimize the BIC, similar to the auto.arima() function:

cost_function_bic <- function(param, train_data, validation_data, forecast_periods){
    order <- param[1:3]
    season <- c(param[4:6], 12)
    model <- purrr::possibly(arima, otherwise = NULL)(x = train_data, order = order, 
                                                      seasonal = season,
                                                      method = "ML")
    if(is.null(model)){
        return(9999999) # very high cost when no model could be estimated
    } else {
        BIC(model)
    }
}

Let’s take a look at the cost at the parameter values returned by auto.arima():

cost_function_bic(c(1, 0, 2, 2, 1, 0),
              train_data = logged_train_data,
              validation_data = logged_validation_data,
              forecast_periods = 65)
## [1] -184.6397

Let the genetic algorithm run again:

cl <- makePSOCKcluster(8)
clusterExport(cl, c('logged_train_data', 'logged_validation_data'))

tic <- Sys.time()

auto_arima_bic <- genoud(cost_function_bic,
                     nvars = 6,
                     data.type.int = TRUE,
                     starting.values = c(1, 0, 2, 2, 1, 0), # <- from auto.arima
                     Domains = domains,
                     cluster = cl,
                     train_data = logged_train_data,
                     validation_data = logged_validation_data,
                     forecast_periods = length(logged_validation_data),
                     hard.generation.limit = TRUE)
toc_bic <- Sys.time() - tic

This time, it took 6 minutes, a bit faster than before. Let’s take a look at the solution:

## $value
## [1] -201.0656
## $par
## [1] 0 1 1 1 0 1
## $gradients
## [1] NA NA NA NA NA NA
## $generations
## [1] 12
## $peakgeneration
## [1] 1
## $popsize
## [1] 1000
## $operators
## [1] 122 125 125 125 125 126 125 126   0

Let’s train the model at these parameters:

best_model_bic <- arima(logged_train_data, order = auto_arima_bic$par[1:3], 
                        seasonal = list(order = auto_arima_bic$par[4:6], period = 12),
                        method = "ML")

## Call:
## arima(x = logged_train_data, order = auto_arima_bic$par[1:3], seasonal = list(order = auto_arima_bic$par[4:6], 
##     period = 12), method = "ML")
## Coefficients:
##           ma1    sar1    sma1
##       -0.6225  0.9968  -0.832
## s.e.   0.0835  0.0075   0.187
## sigma^2 estimated as 0.004145:  log likelihood = 109.64,  aic = -211.28
## Training set error measures:
##                       ME       RMSE        MAE        MPE      MAPE
## Training set 0.003710982 0.06405303 0.04358164 0.02873561 0.3753513
##                   MASE        ACF1
## Training set 0.4553447 -0.03450603

And let’s plot the results:

best_model_bic_forecast <- forecast::forecast(best_model_bic, h = 65)

best_model_bic_forecast <- to_tibble(best_model_bic_forecast)
## Joining, by = "date"
## Joining, by = "date"
avia_clean_monthly %>%
    group_by(date) %>%
    summarise(total = sum(passengers)) %>%
    mutate(total_ihs = ihs(total)) %>%
    ggplot() +
    ggtitle("Minimization of BIC") +
    geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
    scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
    geom_ribbon(data = best_model_bic_forecast, aes(x = date, ymin = lower95, ymax = upper95),
                fill = "#5160a0", alpha = 0.2) +
    geom_line(data = best_model_bic_forecast, aes(x = date, y = point_estimate), 
              linetype = 2, colour = "#208480") +
    geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95),
                fill = "#98431e", alpha = 0.2) +
    geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate), 
              linetype = 2, colour = "#a53031")

The solutions are very close, both in terms of point estimates and confidence intervals. Bias
increased, but variance decreased… This gives me an idea! What if I minimize the RMSE while
keeping the number of parameters low, as a kind of regularization? This is somewhat what minimising
the BIC does, but let’s try a more “naive” approach:

cost_function_rmse_low_k <- function(param, train_data, validation_data, forecast_periods, max.order){
    order <- param[1:3]
    season <- c(param[4:6], 12)
    if(param[1] + param[3] + param[4] + param[6] > max.order){
        return(9999999) # penalize models with too many parameters
    }
    model <- purrr::possibly(arima, otherwise = NULL)(x = train_data, 
                                                      order = order, 
                                                      seasonal = season,
                                                      method = "ML")
    if(is.null(model)){
        return(9999999) # very high cost when no model could be estimated
    } else {
        forecast_model <- forecast::forecast(model, h = forecast_periods)
        point_forecast <- forecast_model$mean
        sqrt(mean((point_forecast - validation_data)^2))
    }
}

This is also similar to what auto.arima() does; by default, the max.order argument in auto.arima()
is set to 5, and is the sum of p + q + P + Q. So I’ll try something similar.

Let’s take a look at the cost at the parameter values returned by auto.arima():

cost_function_rmse_low_k(c(1, 0, 2, 2, 1, 0),
              train_data = logged_train_data,
              validation_data = logged_validation_data,
              forecast_periods = 65,
              max.order = 5)
## [1] 0.1731473

Let’s see what will happen:

cl <- makePSOCKcluster(8)
clusterExport(cl, c('logged_train_data', 'logged_validation_data'))

tic <- Sys.time()

auto_arima_rmse_low_k <- genoud(cost_function_rmse_low_k,
                         nvars = 6,
                         data.type.int = TRUE,
                         starting.values = c(1, 0, 2, 2, 1, 0), # <- from auto.arima
                         max.order = 5,
                         Domains = domains,
                         cluster = cl,
                         train_data = logged_train_data,
                         validation_data = logged_validation_data,
                         forecast_periods = length(logged_validation_data),
                         hard.generation.limit = TRUE)
toc_rmse_low_k <- Sys.time() - tic

It took 1 minute to train this one, quite fast! Let’s take a look:

## $value
## [1] 0.002503478
## $par
## [1] 1 2 0 3 1 0
## $gradients
## [1] NA NA NA NA NA NA
## $generations
## [1] 11
## $peakgeneration
## [1] 1
## $popsize
## [1] 1000
## $operators
## [1] 122 125 125 125 125 126 125 126   0

And let’s plot it:

best_model_rmse_low_k <- arima(logged_train_data, order = auto_arima_rmse_low_k$par[1:3], 
                               seasonal = list(order = auto_arima_rmse_low_k$par[4:6], period = 12),
                               method = "ML")

## Call:
## arima(x = logged_train_data, order = auto_arima_rmse_low_k$par[1:3], seasonal = list(order = auto_arima_rmse_low_k$par[4:6], 
##     period = 12), method = "ML")
## Coefficients:
##           ar1     sar1     sar2     sar3
##       -0.6468  -0.7478  -0.5263  -0.1143
## s.e.   0.0846   0.1171   0.1473   0.1446
## sigma^2 estimated as 0.01186:  log likelihood = 57.88,  aic = -105.76
## Training set error measures:
##                        ME      RMSE        MAE         MPE      MAPE
## Training set 0.0005953302 0.1006917 0.06165919 0.003720452 0.5291736
##                   MASE       ACF1
## Training set 0.6442205 -0.3706693
best_model_rmse_low_k_forecast <- forecast::forecast(best_model_rmse_low_k, h = 65)

best_model_rmse_low_k_forecast <- to_tibble(best_model_rmse_low_k_forecast)
## Joining, by = "date"
## Joining, by = "date"
avia_clean_monthly %>%
    group_by(date) %>%
    summarise(total = sum(passengers)) %>%
    mutate(total_ihs = ihs(total)) %>%
    ggplot() +
    ggtitle("Minimization of RMSE + low k") +
    geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
    scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
    geom_ribbon(data = best_model_rmse_low_k_forecast, aes(x = date, ymin = lower95, ymax = upper95),
                fill = "#5160a0", alpha = 0.2) +
    geom_line(data = best_model_rmse_low_k_forecast, aes(x = date, y = point_estimate), 
              linetype = 2, colour = "#208480") +
    geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95),
                fill = "#98431e", alpha = 0.2) +
    geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate), 
              linetype = 2, colour = "#a53031")

It looks like this was not the right strategy. There might be a better cost function than the ones I
tried, but it seems that minimizing the BIC is the way to go.

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates or
buy me an espresso.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.


November 15, 2018

Introducing Drexel new online MS in Data Science

Drexel’s new online MS in Data Science is the degree that launched a thousand opportunities. Complete your courses on your own schedule, while still building meaningful relationships with your professors and classmates.


R Packages worth a look

Generates Multivariate Nonnormal Data and Determines How Many Factors to Retain (RGenData)
The GenDataSample() and GenDataPopulation() functions create, respectively, a sample or population of multivariate nonnormal data using methods describ …

Statistical Learning Functions (LearningRlab)
Aids in learning statistical functions incorporating the result of calculus done with each function and how they are obtained, that is, which equations …

Smooth Time-Dependent ROC Curve Estimation (smoothROCtime)
Computes smooth estimations for the Cumulative/Dynamic and Incident/Dynamic ROC curves, in presence of right censorship, based on the bivariate kernel …


URI: Director, Data Analytics/DataSpark [Kingston, RI]

Provide leadership and administrative oversight for all data and analytic responsibilities of DataSpark. Apply by Dec 14, 2018.



In case you missed it: October 2018 roundup

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from October of particular interest to R users.

Peter Provost ports some 80's-era BASIC programs for kids to R.

In a podcast for Fringe FM, I discuss the ethics of AI, Microsoft and Open Source, and the R Community.

Roundup of AI, Machine Learning and Data Science news from October 2018.

In this episode of "Guy in a Cube", R is used to visualize Anscombe's Quartet via Power BI.

Di Cook suggests using computer vision to automate statistical model assessment for machine learning in the 2018 Belz Lecture.

R provides the analysis behind a front-page story on bridge safety in the Baltimore Sun.

Tomas Kalibera describes the big impacts of a small tweak to the logical comparison operators in R.

The Economist is now using R to calculate its famous "Big Mac Index".

Behind-the-scenes details of how R gets built on Windows, from a presentation by Jeroen Ooms.

The R Consortium has accepted another round of grant applications for R community projects.

A list of upcoming R conferences.

A recap of AI, Machine Learning and Data Science announcements from the Microsoft Ignite conference.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Magister Dixit

“There is a dizzying array of algorithms from which to choose, and just making the choice between them presupposes that you have sufficiently advanced mathematical background to understand the alternatives and make a rational choice. The options are also changing, evolving constantly as a result of the work of some very bright, very dedicated researchers who are continually refining existing algorithms and coming up with new ones.” Ted Dunning, Ellen Friedman (2014)


Quoting in R

Many R users appear to be big fans of "code capturing" or "non standard evaluation" (NSE) interfaces. In this note we will discuss quoting and non-quoting interfaces in R.

The above terms are simply talking about interfaces where a name to be used is captured from the source code the user typed, and thus does not need quote marks. For example:

d <- data.frame(x = 1)

d$x
## [1] 1

Notice both during data.frame creation and column access: the column name is given without quotes and also accessed without quotes.

This differs from using a standard value-oriented interface, where the column name is passed as a quoted string, as in the following:

d[["x"]]
## [1] 1

A natural reason for R users to look for automatic quoting is: it helps make working with columns in data.frames (R’s primary data analysis structure) look much like working with variables in the environment. Without the quotes a column name looks very much like a variable name. And thinking of columns as variables is a useful mindset.

Another place implicit quoting shows up is with R’s "combine" operator, where one can write either of the following.

c(a = "b")
##   a 
## "b"
c("a" = "b")
##   a 
## "b"

The wrapr package brings in a new function: qc() or "quoting c()" that gives a very powerful and convenient way to elide quotes.


qc(a = b)
##   a 
## "b"

Notice quotes are not required on either side of the name assignment. Again, eliding quotes is not that big a deal, and not to everyone’s taste. For example I have never seen a Python user feel they are missing anything because they write {"a": "b"} to construct their own named dictionary structure.

That being said, qc() is a very convenient and consistent notation if you do want to work in an NSE style.

For example, if it ever bothered you that dplyr join takes the join column names as a character vector you can use qc() to instead write:

dplyr::inner_join(
  iris, iris, 
  by = qc(Sepal.Length, Sepal.Width, 
          Petal.Length, Petal.Width, 
          Species))

(Actually I very much like that the join takes the columns as a vector, as it is much easier to program over.) I feel the qc() grouping of the columns makes it easier for a reader to see which arguments are the column set than a use of ... would. Please take, as an example, the following dplyr::group_by():


starwars %>%
  group_by(homeworld, species, add = FALSE) %>%
  summarize(mass = mean(mass, na.rm = TRUE))
## # A tibble: 58 x 3
## # Groups:   homeworld [?]
##    homeworld      species    mass
##    <chr>          <chr>     <dbl>
##  1 Alderaan       Human      64  
##  2 Aleen Minor    Aleena     15  
##  3 Bespin         Human      79  
##  4 Bestine IV     Human     110  
##  5 Cato Neimoidia Neimodian  90  
##  6 Cerea          Cerean     82  
##  7 Champala       Chagrian  NaN  
##  8 Chandrila      Human     NaN  
##  9 Concord Dawn   Human      79  
## 10 Corellia       Human      78.5
## # ... with 48 more rows

When coming back to such code later, I find the following notation to be easier to read:


starwars %>%
  group_by_se(qc(homeworld, species), add = FALSE) %>%
  summarize(mass = mean(mass, na.rm = TRUE))
## # A tibble: 58 x 3
## # Groups:   homeworld [?]
##    homeworld      species    mass
##    <chr>          <chr>     <dbl>
##  1 Alderaan       Human      64  
##  2 Aleen Minor    Aleena     15  
##  3 Bespin         Human      79  
##  4 Bestine IV     Human     110  
##  5 Cato Neimoidia Neimodian  90  
##  6 Cerea          Cerean     82  
##  7 Champala       Chagrian  NaN  
##  8 Chandrila      Human     NaN  
##  9 Concord Dawn   Human      79  
## 10 Corellia       Human      78.5
## # ... with 48 more rows

In the above we can clearly see which arguments to the grouping command are intended to be column names, and which are not.

qc() is a powerful NSE tool that annotates and contains where we are expecting quoting behavior. Some possible applications include examples such as the following.

# install many packages
install.packages(qc(testthat, knitr, rmarkdown, R.rsp))

# select columns
iris[, qc(Petal.Length, Petal.Width, Species)]

# control a for-loop
for(col in qc(Petal.Length, Petal.Width)) {
  iris[[col]] <- sqrt(iris[[col]])
}

# control a vapply
vapply(qc(Petal.Length, Petal.Width), 
       function(col) {
         mean(iris[[col]]) # illustrative body: summarise each selected column
       }, numeric(1))

The idea is: with qc() the user can switch name-capturing notation at will, with no prior arrangement needed in the functions or packages used. Also the parentheses in qc() make for more legible code: a reader can see which arguments are being quoted and taken as a group.

As of wrapr 1.7.0 qc() incorporates bquote() functionality. bquote() is R’s built-in quasi-quotation facility. It was added to R in August of 2003 by Thomas Lumley, and doesn’t get as much attention as it deserves.

A quoting tool such as qc() becomes a quasi-quoting tool if we add a notation that signals we do not wish to quote. In R the standard notation for this is ".()" (Lisp uses a comma, the data.table package uses "..", and the rlang package uses "!!"). The bquote()-enabled version of qc() lets us write code such as the following.


extra_column = "Species"

qc(Petal.Length, Petal.Width, extra_column)
## [1] "Petal.Length" "Petal.Width"  "extra_column"
qc(Petal.Length, Petal.Width, .(extra_column))
## [1] "Petal.Length" "Petal.Width"  "Species"

Notice it is unambiguous what is going on above. The first qc() quotes all of its arguments into strings. The second works much the same, with the exception of names marked with .(). This ability to "break out" or turn off quoting is convenient if we are working with a combination of values we wish to type in directly and others we wish to take from variables.

qc() allows substitution on the left-hand sides of assignments, if we use the alternate := notation for assignment (a convention put forward by data.table, and later adopted by dplyr).


left_name = "a"
right_value = "b"

qc(.(left_name) := .(right_value))
##   a 
## "b"

The wrapr package also exports an implementation for :=. So one could also write:

left_name := right_value
##   a 
## "b"

The hope is that the qc() and := operators are well behaved enough to commute, in the sense that the following two statements should return the same value.


qc(a := b, c := d)
##   a   c 
## "b" "d"
qc(a, c) := qc(b, d)
##   a   c 
## "b" "d"

The idea is: when there is a symmetry it is often evidence you are using the right concepts.

In conclusion: the goal of wrapr::qc() is to put a very regular and controllable quoting facility directly into the hands of the R user. This allows the R user to treat just about any R function or package as if the function or package itself implemented argument quoting and quasi-quotation capabilities.


A deep dive into glmnet: standardize

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, we will focus on the standardize option.

For reference, here is the full signature of the glmnet function:

glmnet(x, y, family=c("gaussian","binomial","poisson","multinomial","cox","mgaussian"),
    weights, offset=NULL, alpha = 1, nlambda = 100,
    lambda.min.ratio = ifelse(nobs<nvars,0.01,0.0001), lambda=NULL,
    standardize = TRUE, intercept = TRUE, thresh = 1e-07, dfmax = nvars + 1,
    pmax = min(dfmax * 2 + 20, nvars), exclude, penalty.factor = rep(1, nvars),
    lower.limits = -Inf, upper.limits = Inf, maxit = 100000,
    type.gaussian = ifelse(nvars < 500, "covariance", "naive"),
    type.logistic = c("Newton", "modified.Newton"),
    standardize.response=FALSE, type.multinomial=c("ungrouped","grouped"))

Unless otherwise stated, n will denote the number of observations, p will denote the number of features, and fit will denote the output/result of the glmnet call. The data matrix is denoted by X \in \mathbb{R}^{n \times p} and the response is denoted by y \in \mathbb{R}^n.


When standardize = TRUE (default), columns of the data matrix x are standardized, i.e. each column of x has mean 0 and standard deviation 1. More specifically, we have that for each j = 1, \dots, p,

\displaystyle\sum_{i=1}^n X_{ij} = 0, and \sqrt{\displaystyle\sum_{i=1}^n \frac{X_{ij}^2}{n}} = 1.

Why might we want to do this? Standardizing our features before model fitting is common practice in statistical learning. This is because if our features are on vastly different scales, the features with larger scales will tend to dominate the action. (One instance where we might not want to standardize our features is if they are already all measured along the same scale, e.g. meters or kilograms.)
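To make this concrete: for family = "gaussian", the objective glmnet minimizes (per the package documentation) is the penalized least squares criterion

\displaystyle \min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^n \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \left[ \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right].

The penalty weighs every \beta_j identically, so without standardization the effective amount of shrinkage a feature receives depends on its units: a feature recorded on a larger scale gets a proportionally smaller coefficient and is therefore penalized less. Standardizing puts all features on an equal footing before the penalty is applied.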

Notice that the standardization here is slightly different from that offered by the scale function: scale(x, center = TRUE, scale = TRUE) gives the standardization

\displaystyle\sum_{i=1}^n X_{ij} = 0, and \sqrt{\displaystyle\sum_{i=1}^n \frac{X_{ij}^2}{n-1}} = 1.

We verify this with a small data example. Generate data according to the following code:

n <- 100; p <- 5; true_p <- 2
X <- matrix(rnorm(n * p), nrow = n)
beta <- matrix(c(rep(1, true_p), rep(0, p - true_p)), ncol = 1)
y <- X %*% beta + 3 * rnorm(n)

Create a version of the data matrix which has standardized columns:

X_centered <- apply(X, 2, function(x) x - mean(x))
Xs <- apply(X_centered, 2, function(x) x / sqrt(sum(x^2) / n))

Next, we run glmnet on Xs and y with both possible options for standardize:

fit <- glmnet(Xs, y, standardize = TRUE)
fit2 <- glmnet(Xs, y, standardize = FALSE)

We can check that we get the same fit in both cases (modulo numerical precision):

sum(fit$lambda != fit2$lambda)
# 0
max(abs(fit$beta - fit2$beta))
# 6.661338e-16

The documentation notes that the coefficients returned are on the original scale. Let’s confirm that with our small data set. Run glmnet with the original data matrix and standardize = TRUE:

fit3 <- glmnet(X, y, standardize = TRUE)

For each column j, our standardized variables are Z_j = \dfrac{X_j - \mu_j}{s_j}, where \mu_j and s_j are the mean and standard deviation of column j respectively. If \beta_j and \gamma_j represent the model coefficients of fit2 and fit3 respectively, then we should have

\begin{aligned} \beta_0 + \sum_{j=1}^p \beta_j Z_j &= \gamma_0 + \sum_{j=1}^p \gamma_j X_j, \\  \beta_0 + \sum_{j=1}^p \beta_j \frac{X_j - \mu_j}{s_j} &= \gamma_0 + \sum_{j=1}^p \gamma_j X_j, \\  \left( \beta_0 - \sum_{j=1}^p \frac{\mu_j \beta_j}{s_j} \right) + \sum_{j=1}^p \frac{\beta_j}{s_j} X_j &= \gamma_0 + \sum_{j=1}^p \gamma_j X_j,  \end{aligned}

i.e. we should have \gamma_0 = \beta_0 - \sum_{j=1}^p \frac{\mu_j \beta_j}{s_j} and \gamma_j = \frac{\beta_j}{s_j} for j = 1, \dots, p. The code below checks that this is indeed the case (modulo numerical precision):

# get column means and SDs
X_mean <- colMeans(X)
X_sd <- apply(X_centered, 2, function(x) sqrt(sum(x^2) / n))

# check difference for intercepts
fit2_int <- coefficients(fit2)[1,]
fit3_int <- coefficients(fit3)[1,]
temp <- fit2_int - colSums(diag(X_mean / X_sd) %*% fit2$beta)
max(abs(temp - fit3_int))
# 1.110223e-16

# check difference for feature coefficients
temp <- diag(1 / X_sd) %*% fit2$beta
max(abs(temp - fit3$beta))
# 1.110223e-15

The discussion above has been for the standardization of x. What about standardization for y? The documentation notes that when family = "gaussian", y is automatically standardized, and the coefficients are unstandardized at the end of the procedure.

More concretely, let the mean and standard deviation of y be denoted by \mu_y and s_y respectively. If running glmnet on standardized y gives intercept \beta_0 and coefficients \beta_1, \dots, \beta_p, then glmnet on unstandardized y will give intercept \mu_y + s_y\beta_0 and coefficients s_y\beta_1, \dots, s_y\beta_p.

Again, this can be verified empirically:

# get mean and SD of y
y_mean <- mean(y)
y_sd <- sqrt(sum((y - y_mean)^2) / n)

# fit model with standardized y
fit4 <- glmnet(X, (y - y_mean) / y_sd, standardize = TRUE)

# check difference for intercepts
fit4_int <- coefficients(fit4)[1,]
temp <- fit4_int * y_sd + y_mean
max(abs(temp - fit3_int))
# 1.110223e-16

# check difference for feature coefficients
max(abs(y_sd * fit4$beta - fit3$beta))
# 8.881784e-16

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.


Book Memo: “Fog Computing”

Concepts, Frameworks and Technologies
This authoritative text/reference describes the state of the art of fog computing, presenting insights from an international selection of renowned experts. A particular focus is provided on development approaches, architectural mechanisms, and measurement metrics for building smart adaptable environments. The coverage also includes important related topics such as device connectivity, security and interoperability, and communication methods. Topics and features: introduces the core concepts and principles of fog computing, and reviews the latest research and best practice relating to fog/edge environments; discusses the vision for an Internet of Things (IoT) in terms of fog computing and other related distributed computing paradigms, such as cloud computing; presents a survey of the key issues and broader aspects of the fog paradigm, as well as the factors that affect adoption of fog computing; examines frameworks and methodologies for fog-based architecture design, improving performance, and measuring quality of experience; proposes tools and methodologies for analyzing large amounts of sensor data from smart city initiatives; describes approaches for designing robust services, management of data-intensive applications, context-aware data analysis, and vehicular networking; identifies potential future research directions and technological innovations in relation to distributed computing environments such as the IoT.


(Webinar) Farmers and Chubb on Humanizing Claims with AI

This Webinar from Insurance Nexus discusses how AI is transforming claims from a necessary back-office function into a source of competitive advantage.


✚ Uses for Animation in Charts and Animating Your Own Data

Important question: Is animation in visualization even worthwhile? Well, it depends. Surprise, surprise. In this issue, I look at animation in data visualization, its uses, and how I like to think about it when I implement moving data.


Data Notes: Impact of Game of Thrones on US Baby Names

Welcome to Kaggle Data Notes!

Are Game of Thrones fans naming their babies Sansa and Tyrion? Enjoy these new, intriguing, and overlooked datasets and kernels.


1. 👶 Impact of Game of Thrones on US Baby Names (Link)

2. 🏥 Skin Lesion Analyzer + Tensorflow.js Web App (Link)

3. 🎓 Where do People Learn ML / DS? (Link)

4. 💰 What Makes a Kaggler Valuable? (Link)

5. 💾 What Software do Kagglers Use to Code? (Link)

6. 🐍 A Deep Dive into the Usage of R, Python, or Both (Link)

7. 👽 UFO Report Analysis - MapPlots (Link)

8. 📝 Dataset: 2018 Kaggle ML & DS Survey (Link)

9. 🚑 Dataset: CMS Estimated Uninsured People (Link)

10. 📈 Dataset: AMEX, NYSE, NASDAQ Stock Histories (Link)


Technique of the week

Working with image data? Try using scikit-image to process it all!

Copyright © 2018 Kaggle, All rights reserved.


What's new on arXiv

Correction of AI systems by linear discriminants: Probabilistic foundations

Artificial Intelligence (AI) systems sometimes make errors and will continue to make errors in the future. These errors are usually unexpected, and can lead to dramatic consequences. Intensive development of AI and its practical applications makes the problem of errors more important. Total re-engineering of the systems can create new errors and is not always possible due to the resources involved. The important challenge is to develop fast methods to correct errors without damaging existing skills. We formulate the technical requirements for the ‘ideal’ correctors. Such correctors include binary classifiers, which separate the situations with high risk of error from the situations where the AI system works properly. Surprisingly, for essentially high-dimensional data such methods are possible: a simple linear Fisher discriminant can separate the situations with errors from correctly solved tasks even for exponentially large samples. The paper presents the probabilistic basis for fast non-destructive correction of AI systems. A series of new stochastic separation theorems is proven. These theorems provide new instruments for fast non-iterative correction of errors of legacy AI systems. The new approaches become efficient in high dimensions, for correction of high-dimensional systems in a high-dimensional world (i.e. for processing of essentially high-dimensional data by large systems).
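The corrector idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's method: the data, the mean shift of the "error" class, and the regularization constant are all hypothetical, and we use the textbook Fisher discriminant w = S_w⁻¹(μ_err − μ_ok) with a midpoint threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an AI system's internal representations:
# 'correct' cases and 'error' cases in a d-dimensional feature space.
d = 50
X_ok = rng.normal(0.0, 1.0, size=(200, d))
X_err = rng.normal(0.6, 1.0, size=(40, d))   # errors shifted in feature space

# Fisher linear discriminant: w = Sw^{-1} (mu_err - mu_ok)
mu_ok, mu_err = X_ok.mean(axis=0), X_err.mean(axis=0)
Sw = np.cov(X_ok, rowvar=False) + np.cov(X_err, rowvar=False)
w = np.linalg.solve(Sw + 1e-6 * np.eye(d), mu_err - mu_ok)

# Threshold halfway between the projected class means.
threshold = 0.5 * ((X_ok @ w).mean() + (X_err @ w).mean())

def flag_error(X):
    """True = high error risk: route this input to the corrector."""
    return X @ w > threshold

acc_ok = np.mean(~flag_error(X_ok))    # correctly-solved cases kept
acc_err = np.mean(flag_error(X_err))   # error cases intercepted
```

In high dimensions the two projected clouds separate cleanly even with few error examples, which is the phenomenon the stochastic separation theorems formalize.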

Learning From Positive and Unlabeled Data: A Survey

Learning from positive and unlabeled data, or PU learning, is the setting where a learner only has access to positive examples and unlabeled data. The assumption is that the unlabeled data can contain both positive and negative examples. This setting has attracted increasing interest within the machine learning literature as this type of data naturally arises in applications such as medical diagnosis and knowledge base completion. This article provides a survey of the current state of the art in PU learning. It proposes seven key research questions that commonly arise in this field and provides a broad overview of how the field has tried to address them.

Assessing biological models using topological data analysis

We use topological data analysis as a tool to analyze the fit of mathematical models to experimental data. This study is built on data obtained from motion tracking groups of aphids in [Nilsen et al., PLOS One, 2013] and two random walk models that were proposed to describe the data. One model incorporates social interactions between the insects, and the second model is a control model that excludes these interactions. We compare data from each model to data from experiment by performing statistical tests based on three different sets of measures. First, we use time series of order parameters commonly used in collective motion studies. These order parameters measure the overall polarization and angular momentum of the group, and do not rely on a priori knowledge of the models that produced the data. Second, we use order parameter time series that do rely on a priori knowledge, namely average distance to nearest neighbor and percentage of aphids moving. Third, we use computational persistent homology to calculate topological signatures of the data. Analysis of the a priori order parameters indicates that the interactive model better describes the experimental data than the control model does. The topological approach performs as well as these a priori order parameters and better than the other order parameters, suggesting the utility of the topological approach in the absence of specific knowledge of mechanisms underlying the data.
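The first, model-agnostic order parameters mentioned above are simple to compute; here is a sketch for one time step of a group of 2-D agents, on hypothetical position/velocity data (the formulas are the standard collective-motion definitions, not code from the paper).

```python
import numpy as np

rng = np.random.default_rng(2)
pos = rng.normal(size=(30, 2))                        # agent positions
vel = np.tile([1.0, 0.0], (30, 1)) + 0.1 * rng.normal(size=(30, 2))

def polarization(vel):
    """Norm of the mean heading: 1 = perfectly aligned, ~0 = disordered."""
    headings = vel / np.linalg.norm(vel, axis=1, keepdims=True)
    return np.linalg.norm(headings.mean(axis=0))

def angular_momentum(pos, vel):
    """|mean normalized angular momentum| about the group centroid."""
    r = pos - pos.mean(axis=0)
    headings = vel / np.linalg.norm(vel, axis=1, keepdims=True)
    cross = r[:, 0] * headings[:, 1] - r[:, 1] * headings[:, 0]
    return abs(np.mean(cross / np.linalg.norm(r, axis=1)))
```

Time series of these two scalars are what get compared statistically between model runs and experiment; the persistent-homology signatures play the same role but retain more geometric information.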

A Framework of Transfer Learning in Object Detection for Embedded Systems

Transfer learning is one of the subjects undergoing intense study in the area of machine learning. In object recognition and object detection there are known experiments for the transferability of parameters, but not for neural networks which are suitable for object detection in real-time embedded applications, such as the SqueezeDet neural network. We use transfer learning to accelerate the training of SqueezeDet to a new group of classes. Also, experiments are conducted to study the transferability and co-adaptation phenomena introduced by the transfer learning process. To accelerate training, we propose a new implementation of the SqueezeDet training which provides a faster pipeline for data processing and achieves 1.8 times speedup compared to the initial implementation. Finally, we created a mechanism for automatic hyperparameter optimization using an empirical method.

A Perceptual Prediction Framework for Self Supervised Event Segmentation

Temporal segmentation of long videos is an important problem that has largely been tackled through supervised learning, often requiring large amounts of annotated training data. In this paper, we tackle the problem of self-supervised temporal segmentation of long videos, which alleviates the need for any supervision. We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visually complex videos into individual, stable segments that share the same semantics. We also introduce a new adaptive learning paradigm that helps reduce the effect of catastrophic forgetting in recurrent neural networks. Extensive experiments on three publicly available datasets – Breakfast Actions, 50 Salads, and INRIA Instructional Videos – show the efficacy of the proposed approach. We show that the proposed approach is able to outperform weakly-supervised and other unsupervised learning approaches by up to 24% and achieves competitive performance compared to fully supervised approaches. We also show that the proposed approach is able to learn highly discriminative features that help improve action recognition when used in a representation learning paradigm.

Characterizing machine learning process: A maturity framework

Academic literature on machine learning modeling fails to address how to make machine learning models work for enterprises. For example, existing machine learning processes cannot address how to define business use cases for an AI application, how to convert business requirements from offering managers into data requirements for data scientists, how to continuously improve AI applications in terms of accuracy and fairness, or how to customize general-purpose machine learning models with industry-, domain-, and use-case-specific data to make them more accurate for specific situations. Making AI work for enterprises requires special considerations, tools, methods and processes. In this paper we present a maturity framework for machine learning model lifecycle management for enterprises. Our framework is a re-interpretation of the software Capability Maturity Model (CMM) for the machine learning model development process. We present a set of best practices from our personal experience of building large-scale real-world machine learning models to help organizations achieve higher levels of maturity independent of their starting point.

The doctrinal paradox: ROC analysis in a probabilistic framework

The doctrinal paradox is analysed from a probabilistic point of view assuming a simple parametric model for the committee’s behaviour. The well-known issue-by-issue and case-by-case majority rules are compared in this model, by means of the concepts of false positive rate (FPR), false negative rate (FNR) and Receiver Operating Characteristic (ROC) space. We also introduce a new rule, called path-by-path, which is somewhere halfway between the other two. Under our model assumptions, the issue-by-issue rule is shown to be the best of the three according to an optimality criterion based on ROC maps, for all values of the model parameters (committee size and competence of its members), when equal weight is given to FPR and FNR. For unequal weights, the relative goodness of the rules depends on the values of the competence and the weights, in a way which is precisely described. The results are illustrated with some numerical examples.
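The two rules being compared can be sketched concretely. Below is the classic three-judge instance of the paradox, where a case is decided by the conjunction of two premises; the ballot data is the textbook example, not from the paper.

```python
# Two aggregation rules for a committee voting on two premises (P, Q),
# where the case's verdict is the conjunction P and Q.
def majority(votes):
    return sum(votes) > len(votes) / 2

def issue_by_issue(ballots):
    """Aggregate each premise by majority first, then take the conjunction."""
    p = majority([a for a, _ in ballots])
    q = majority([b for _, b in ballots])
    return p and q

def case_by_case(ballots):
    """Each judge decides the case (conjunction), then majority on verdicts."""
    return majority([a and b for a, b in ballots])

# The classic paradox instance: three judges, ballots on (P, Q).
ballots = [(True, True), (True, False), (False, True)]
# P passes 2-1 and Q passes 2-1, yet only one judge affirms both premises,
# so the two rules reach opposite verdicts on the same ballots.
```

The ROC analysis in the paper asks which of these disagreeing rules makes the better trade-off between false convictions and false acquittals under a probabilistic model of judge competence.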

TED: Teaching AI to Explain its Decisions

Artificial intelligence systems are being increasingly deployed due to their potential to increase the efficiency, scale, consistency, fairness, and accuracy of decisions. However, as many of these systems are opaque in their operation, there is a growing demand for such systems to provide explanations for their decisions. Conventional approaches to this problem attempt to expose or discover the inner workings of a machine learning model with the hope that the resulting explanations will be meaningful to the consumer. In contrast, this paper suggests a new approach to this problem. It introduces a simple, practical framework, called Teaching Explanations for Decisions (TED), that provides meaningful explanations that match the mental model of the consumer. We illustrate the generality and effectiveness of this approach with two different examples, resulting in highly accurate explanations with no loss of prediction accuracy for these two examples.

Learning Temporal Point Processes via Reinforcement Learning

Social goods, such as healthcare, smart cities, and information networks, often produce ordered event data in continuous time. The generative processes of these event data can be very complex, requiring flexible models to capture their dynamics. Temporal point processes offer an elegant framework for modeling event data without discretizing the time. However, the existing maximum-likelihood-estimation (MLE) learning paradigm requires hand-crafting the intensity function beforehand and cannot directly monitor the goodness-of-fit of the estimated model in the process of training. To alleviate the risk of model misspecification in MLE, we propose to generate samples from the generative model and monitor the quality of the samples in the process of training until the samples and the real data are indistinguishable. We take inspiration from reinforcement learning (RL) and treat the generation of each event as the action taken by a stochastic policy. We parameterize the policy as a flexible recurrent neural network and gradually improve the policy to mimic the observed event distribution. Since the reward function is unknown in this setting, we uncover an analytic and nonparametric form of the reward function using an inverse reinforcement learning formulation. This new RL framework allows us to derive an efficient policy gradient algorithm for learning flexible point process models, and we show that it performs well on both synthetic and real data.

Dynamic Feature Scaling for K-Nearest Neighbor Algorithm

The Nearest Neighbors algorithm is a lazy learning algorithm, in which the algorithm tries to approximate the predictions with the help of similar existing vectors in the training dataset. The predictions made by the K-Nearest Neighbors algorithm are based on averaging the target values of the spatial neighbors. The selection process for neighbors in the Hermitian space is done with the help of distance metrics such as Euclidean distance, Minkowski distance, Mahalanobis distance, etc. A majority of these metrics, such as Euclidean distance, are scale-variant, meaning that the results can vary for different ranges of values used for the features. Standard techniques used for normalization are feature scaling methods such as Z-score normalization, Min-Max scaling, etc. Scaling methods uniformly assign equal weights to all the features, which might result in a non-ideal situation. This paper proposes a novel method to assign weights to individual features with the help of out-of-bag errors obtained from constructing multiple decision tree models.
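The role of per-feature weights in the distance metric can be sketched as follows. This is a minimal illustration only: the weights are hand-picked placeholders, whereas the paper derives them from out-of-bag errors of decision trees, which is not reproduced here.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, w=None):
    """k-NN regression with z-score normalization and weighted Euclidean
    distance; w=None (uniform weights) reduces to plain scaled k-NN."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    Z, z = (X_train - mu) / sd, (x - mu) / sd       # z-score normalization
    if w is None:
        w = np.ones(X_train.shape[1])
    d = np.sqrt((((Z - z) ** 2) * w).sum(axis=1))   # weighted Euclidean
    nearest = np.argsort(d)[:k]
    return y_train[nearest].mean()                  # average of neighbors

# Toy data: the target depends on feature 0; feature 1 is noise on a
# larger raw scale, which z-scoring alone would still weight equally.
X = np.array([[0.0, 100.0], [1.0, 110.0], [10.0, 105.0], [11.0, 95.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
pred = knn_predict(X, y, np.array([0.5, 104.0]), k=2, w=np.array([1.0, 0.1]))
```

Down-weighting the noisy feature lets the informative one dominate neighbor selection, which is exactly the degree of freedom the proposed out-of-bag weighting exploits.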

PanJoin: A Partition-based Adaptive Stream Join

In stream processing, stream join is one of the critical sources of performance bottlenecks. The sliding-window-based stream join provides a precise result but consumes considerable computational resources. The current solutions lack support for the join predicates on large windows. These algorithms and their hardware accelerators are either limited to equi-join or use a nested loop join to process all the requests. In this paper, we present a new algorithm called PanJoin which has high throughput on large windows and supports both equi-join and non-equi-join. PanJoin implements three new data structures to reduce computations during the probing phase of stream join. We also implement the most hardware-friendly data structure, called BI-Sort, on FPGA. Our evaluation shows that PanJoin outperforms several recently proposed stream join methods by more than 1000x, and it also adapts well to highly skewed data.
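For context, here is the baseline that hash-based stream joins improve on: a sliding-window equi-join with a per-stream hash index, so each arrival probes only matching keys rather than scanning the whole opposite window. This is a sketch of the general technique, not PanJoin itself, whose contribution is further partitioning of large windows and non-equi-join support.

```python
from collections import defaultdict, deque

class WindowedEquiJoin:
    """Time-based sliding-window equi-join over two streams (0 and 1),
    assuming roughly monotone timestamps across both streams."""

    def __init__(self, window):
        self.window = window
        self.buf = [deque(), deque()]                     # (ts, key, value)
        self.idx = [defaultdict(list), defaultdict(list)] # key -> [(ts, value)]

    def insert(self, side, ts, key, value):
        # Expire tuples that fell out of the time window, on both sides.
        for s in (0, 1):
            while self.buf[s] and self.buf[s][0][0] <= ts - self.window:
                old_ts, old_key, old_val = self.buf[s].popleft()
                self.idx[s][old_key].remove((old_ts, old_val))
        self.buf[side].append((ts, key, value))
        self.idx[side][key].append((ts, value))
        # Probe only the matching bucket: O(matches), not O(window size).
        other = 1 - side
        return [(value, v) for _, v in self.idx[other][key]]

j = WindowedEquiJoin(window=10)
j.insert(0, ts=1, key="a", value="L1")
matches = j.insert(1, ts=3, key="a", value="R1")   # joins with L1
late = j.insert(1, ts=20, key="a", value="R2")     # L1 has expired by now
```

A nested-loop join would instead compare every arrival against every buffered tuple, which is why it generalizes to arbitrary predicates but scales poorly on large windows.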

Theoretical Analysis of Adversarial Learning: A Minimax Approach

We propose a general theoretical method for analyzing the risk bound in the presence of adversaries. In particular, we try to fit the adversarial learning problem into the minimax framework. We first show that the original adversarial learning problem could be reduced to a minimax statistical learning problem by introducing a transport map between distributions. Then we prove a risk bound for this minimax problem in terms of covering numbers. In contrast to previous minimax bounds in \cite{lee,far}, our bound is informative when the radius of the ambiguity set is small. Our method could be applied to multi-class classification problems and commonly-used loss functions such as hinge loss and ramp loss. As two illustrative examples, we derive the adversarial risk bounds for kernel-SVM and deep neural networks. Our results indicate that a stronger adversary might have a negative impact on the complexity of the hypothesis class and the existence of margin could serve as a defense mechanism to counter adversarial attacks.
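The reduction described in the abstract can be stated compactly. In notation that is ours rather than necessarily the paper's, the adversarial risk under pointwise perturbations of size ε is bounded by a minimax problem over an ambiguity set of distributions within transport cost ε of the data distribution P:

```latex
% Adversarial risk minimization (pointwise perturbations of size \epsilon):
\min_{f \in \mathcal{F}} \; \mathbb{E}_{(x,y) \sim P}
  \Big[ \sup_{\|x' - x\| \le \epsilon} \ell\big(f(x'), y\big) \Big]

% ...is controlled by minimax statistical learning over an ambiguity set
% of distributions within transport cost \epsilon of P:
\min_{f \in \mathcal{F}} \; \sup_{P' \,:\, W_c(P', P) \le \epsilon}
  \mathbb{E}_{(x,y) \sim P'} \big[ \ell(f(x), y) \big]
```

The "radius of the ambiguity set" in the abstract is this ε, and the paper's covering-number bounds apply to the outer minimax problem.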

A Multi-layer LSTM-based Approach for Robot Command Interaction Modeling

As the first robotic platforms slowly approach our everyday life, we can imagine a near future where service robots will be easily accessible to non-expert users through vocal interfaces. The capability of managing natural language would indeed speed up the process of integrating such platforms into ordinary life. Semantic parsing is a fundamental task of the Natural Language Understanding process, as it allows extracting the meaning of a user utterance to be used by a machine. In this paper, we present a preliminary study to semantically parse user vocal commands for a House Service robot, using a multi-layer Long Short-Term Memory neural network with an attention mechanism. The system is trained on the Human Robot Interaction Corpus, and it is preliminarily compared with previous approaches.

Anomaly Detection using Autoencoders in High Performance Computing Systems

Anomaly detection in supercomputers is a very difficult problem due to the large scale of the systems and the high number of components. The current state of the art for automated anomaly detection employs Machine Learning methods or statistical regression models in a supervised fashion, meaning that the detection tool is trained to distinguish among a fixed set of behaviour classes (healthy and unhealthy states). We propose a novel approach for anomaly detection in High Performance Computing systems based on a Machine (Deep) Learning technique, namely a type of neural network called an autoencoder. The key idea is to train a set of autoencoders to learn the normal (healthy) behaviour of the supercomputer nodes and, after training, use them to identify abnormal conditions. This is different from previous approaches which were based on learning the abnormal conditions, for which there are much smaller datasets (since it is very hard to identify them to begin with). We test our approach on a real supercomputer equipped with a fine-grained, scalable monitoring infrastructure that can provide a large amount of data to characterize the system behaviour. The results are extremely promising: after the training phase to learn the normal system behaviour, our method is capable of detecting anomalies that have never been seen before with very good accuracy (values ranging between 88% and 96%).
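The detection step above (fit on healthy data only, then threshold the reconstruction error) can be sketched with a linear autoencoder, i.e. a truncated SVD, standing in for the trained neural network; the data, rank, and 99th-percentile threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "healthy" telemetry living near a rank-2 subspace of R^3.
healthy = rng.normal(size=(500, 2)) @ np.array([[1.0, 2.0, 0.5],
                                                [0.3, 0.1, 1.0]])
healthy += 0.01 * rng.normal(size=healthy.shape)
center = healthy.mean(axis=0)

# "Encoder/decoder": project onto the top-2 principal directions.
_, _, Vt = np.linalg.svd(healthy - center, full_matrices=False)
V = Vt[:2].T

def recon_error(X):
    Xc = X - center
    return np.linalg.norm(Xc - Xc @ V @ V.T, axis=1)

# Threshold chosen from the healthy data's own reconstruction errors.
threshold = np.percentile(recon_error(healthy), 99)

anomaly = rng.normal(size=(10, 3)) * 3.0   # points off the healthy manifold
flags = recon_error(anomaly) > threshold
```

Because only healthy behaviour is modeled, anomalies never seen during training are still flagged, which is the key advantage over supervised classifiers trained on known failure modes.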

Incentivising Participation in Liquid Democracy with Breadth First Delegation
Robustness of the Closest Unstable Equilibrium Point Along a P-V Curve
ADNet: A Deep Network for Detecting Adverts
The largest graphs with given order and diameter: A simple proof
On Stability Condition of Wireless Networked Control Systems under Joint Design of Control Policy and Network Scheduling Policy
Weak convergence of particle swarm optimization
Adaptive model selection method for a conditionally Gaussian semimartingale regression in continuous time
Temporal Graph Convolutional Network for Urban Traffic Flow Prediction Method
On Asymptotic Covariances of A Few Unrotated Factor Solutions
MMALFM: Explainable Recommendation by Leveraging Reviews and Images
Edge directionality properties in complex spherical networks
Distributionally Robust Semi-Supervised Learning for People-Centric Sensing
Clifford-like parallelisms
Learning data augmentation policies using augmented random search
Adaptive Target Recognition: A Case Study Involving Airport Baggage Screening
Syntax Helps ELMo Understand Semantics: Is Syntax Still Relevant in a Deep Neural Architecture for SRL?
Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations
Localisation, chiral symmetry and confinement in QCD and related theories
Stationary Harmonic Measure as the Scaling Limit of Truncated Harmonic Measure
On the absolute continuity of random nodal volumes
Strong Equivalence for Epistemic Logic Programs Made Easy (Extended Version)
3s-Unification for Vehicular Headway Modeling
Subsequent Boundary Distance Regression and Pixelwise Classification Networks for Automatic Kidney Segmentation in Ultrasound Images
Large-deviation properties of the largest biconnected component for random graphs
A test case for application of convolutional neural networks to spatio-temporal climate data: Re-identifying clustered weather patterns
Understanding the boosted decision tree methods with the weak-learner approximation
Compliance in Real Time Multiset Rewriting Models
Circuit Depth Reductions
Scattering-free pulse propagation through invisible non-Hermitian disorder
Regularity results of the speed of biased random walks on Galton-Watson trees
Potential Game-Based Non-Myopic Sensor Network Planning for Multi-Target Tracking
Quantum-inspired sublinear classical algorithms for solving low-rank linear systems
On the practice of classification learning for clinical diagnosis and therapy advice in oncology
Generative Dual Adversarial Network for Generalized Zero-shot Learning
Bio-YODIE: A Named Entity Linking System for Biomedical Text
Measures of goodness of fit obtained by canonical transformations on Riemannian manifolds
Pareto-Optimal Allocation of Indivisible Goods with Connectivity Constraints
Comparing Spark vs MPI/OpenMP On Word Count MapReduce
CQASUMM: Building References for Community Question Answering Summarization Corpora
Triangular Ladders $P_{d,2}$ are $e$-positive
Focusing on the Big Picture: Insights into a Systems Approach to Deep Learning for Satellite Imagery
Segue: Overviewing Evolution Patterns of Egocentric Networks by Interactive Construction of Spatial Layouts
Multi-encoder multi-resolution framework for end-to-end speech recognition
Stream attention-based multi-array end-to-end speech recognition
Algorithmic models of human behavior and stochastic optimization
Deep Learning versus Classical Regression for Brain Tumor Patient Survival Prediction
Nonexistence of Bigeodesics in Integrable Models of Last Passage Percolation
Quantum-inspired low-rank stochastic regression with logarithmic dependence on the dimension
Boosting Model Performance through Differentially Private Model Aggregation
Online Timely Status Updates with Erasures for Energy Harvesting Sensors
Analytical Formulation of the Block-Constrained Configuration Model
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
Algebraic Many-Body Localization and its implications on information propagation
Modeling and Performance of Uplink Cache-Enabled Massive MIMO Heterogeneous Networks
Molecular computers
A simplified static frequency converter model for electromechanical transient stability studies of 16$\frac{2}{3}$ Hz railways
The Impact of Timestamp Granularity in Optimistic Concurrency Control
PennyLane: Automatic differentiation of hybrid quantum-classical computations
Eliminating Latent Discrimination: Train Then Mask
p-regularity theory. Applications and developments
Unseen Word Representation by Aligning Heterogeneous Lexical Semantic Spaces
Generalized Ternary Connect: End-to-End Learning and Compression of Multiplication-Free Deep Neural Networks
Prediction of Alzheimer’s disease-associated genes by integration of GWAS summary data and expression data
A Generalized Framework for Approximate Control Variates
OriNet: A Fully Convolutional Network for 3D Human Pose Estimation
A new approach for pedestrian density estimation using moving sensors and computer vision
You Only Live Multiple Times: A Blackbox Solution for Reusing Crash-Stop Algorithms In Realistic Crash-Recovery Settings
Choosing to grow a graph: Modeling network formation as discrete choice
Coordinating Disaster Emergency Response with Heuristic Reinforcement Learning
Blindfold Baselines for Embodied QA
NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification
A Team-Formation Algorithm for Faultline Minimization
Improved Dynamic Memory Network for Dialogue Act Classification with Adversarial Training
Approximation Algorithms for Minimum Norm and Ordered Optimization Problems
Generating faces for affect analysis
LookinGood: Enhancing Performance Capture with Real-time Neural Re-Rendering
A Review of automatic differentiation and its efficient implementation
Shortcut Graphs and Groups
Finding All Bayesian Network Structures within a Factor of Optimal
Exploiting Local Feature Patterns for Unsupervised Domain Adaptation
Distributed Cooperative Spectrum Sharing in UAV Networks Using Multi-Agent Reinforcement Learning
Electrophysiological indicators of gesture perception
A unified algorithm for the non-convex penalized estimation: The ncpen package
SMERC: Social media event response clustering using textual and temporal information
Multiple-paths $SLE_κ$ in multiply connected domains
Shall I Compare Thee to a Machine-Written Sonnet? An Approach to Algorithmic Sonnet Generation
The first passage time density of Brownian motion and the heat equation with Dirichlet boundary condition in time dependent domains
Private Model Compression via Knowledge Distillation
Regularised Zero-Variance Control Variates
Learning from Binary Multiway Data: Probabilistic Tensor Decomposition and its Statistical Optimality
Task Graph Transformations for Latency Tolerance
A Unified Model for Opinion Target Extraction and Target Sentiment Prediction
Domain Agnostic Real-Valued Specificity Prediction
Nonsingular Gaussian Conditionally Markov Sequences
Parallel Stochastic Asynchronous Coordinate Descent: Tight Bounds on the Possible Parallelism
A General Method for Amortizing Variational Filtering
A SAT+CAS Approach to Finding Good Matrices: New Examples and Counterexamples
A Local Regret in Nonconvex Online Learning
Exploring RNN-Transducer for Chinese Speech Recognition
Balancing Relevance and Diversity in Online Bipartite Matching via Submodularity
Neuroimaging Modality Fusion in Alzheimer’s Classification Using Convolutional Neural Networks
Interpreting Models by Allowing to Ask
Towards the topological recursion for double Hurwitz numbers
A Variational Inference based Detection Method for Repetition Coded Generalized Spatial Modulation
Parametric Shortest Paths in Planar Graphs
Exploiting temporal and depth information for multi-frame face anti-spoofing
Modeling Local Dependence in Natural Language with Multi-channel Recurrent Neural Networks
Fundamental Limits of Exact Support Recovery in High Dimensions
Multi-unit Bilateral Trade
Sensitivity Analysis of a Stationary Point Set Map under Total Perturbations. Part 2: Robinson Stability
Community Exploration: From Offline Optimization to Online Learning
Multiscale Information Storage of Linear Long-Range Correlated Stochastic Processes
M Equilibrium: A dual theory of beliefs and choices in games
Amplitude-Aware Lossy Compression for Quantum Circuit Simulation
Co-Representation Learning For Classification and Novel Class Detection via Deep Networks
Spectral Efficiency Analysis in Presence of Correlated Gamma-Lognormal Desired and Interfering Signals
Sensitivity Analysis of a Stationary Point Set Map under Total Perturbations. Part 1: Lipschitzian Stability
Hate Speech Detection from Code-mixed Hindi-English Tweets Using Deep Learning Models
Application of Faster R-CNN model on Human Running Pattern Recognition
Fast HARQ over Finite Blocklength Codes: A Technique for Low-Latency Reliable Communication
User Demand Based Precoding for DSL Systems
Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits
On the Throughput of Large-but-Finite MIMO Networks using Schedulers
Protection Placement for State Estimation Measurement Data Integrity
Recurrent Multi-Graph Neural Networks for Travel Cost Prediction
Approximating minimum representations of key Horn functions
Vehicle Re-identification Using Quadruple Directional Deep Learning Features
On Lipschitz-like property for polyhedral moving sets
Nonparametric geometric outlier detection
Optimal extension to Sobolev rough paths
Child Gender Determination with Convolutional Neural Networks on Hand Radio-Graphs
Gradient Harmonized Single-stage Detector
Equilibrium measures on trees
Polynomial Schur’s theorem
On the Polarization Levels of Automorphic-Symmetric Channels
Relating local structures, energies, and occurrence probabilities in a two-dimensional silica network
FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs
Classical Access Structures of Ramp Secret Sharing Based on Quantum Stabilizer Codes
ImageNet/ResNet-50 Training in 224 Seconds
Applications of Littlewood-Richardson tableaux to computing generic extension of semisimple invariant subspaces of nilpotent linear operators
BAN: Focusing on Boundary Context for Object Detection
Interpretable Credit Application Predictions With Counterfactual Explanations
An Online Attention-based Model for Speech Recognition
Probing interacting two-level systems with rare-earth ions
Modular Networks: Learning to Decompose Neural Computation
Modality Attention for End-to-End Audio-visual Speech Recognition
SVM-Based Sea-Surface Small Target Detection: A False-Alarm-Rate-Controllable Approach
Image Captioning Based on a Hierarchical Attention Mechanism and Policy Gradient Optimization
Deep Neural Network Concepts for Background Subtraction: A Systematic Review and Comparative Evaluation
How Secure are Deep Learning Algorithms from Side-Channel based Reverse Engineering?
A conjugate prior for the Dirichlet distribution
Predicting Distresses using Deep Learning of Text Segments in Annual Reports
Towards the Design of Aerostat Wind Turbine Arrays through AI
Intelligent Drone Swarm for Search and Rescue Operations at Sea
Pose Invariant 3D Face Reconstruction
SAFE: Self-Attentive Function Embeddings for Binary Similarity
Genetic algorithm for optimal distribution in cities
Translating Natural Language to SQL using Pointer-Generator Networks and How Decoding Order Matters
Self-Supervised Learning of Depth and Camera Motion from 360° Videos
Improved Fourier Mellin Invariant for Robust Rotation Estimation with Omni-cameras
Detect or Track: Towards Cost-Effective Video Object Detection/Tracking
Spectral Deconfounding and Perturbed Sparse Linear Models
Iteratively Training Look-Up Tables for Network Quantization
Highly Efficient Stepped Wedge Designs for Clusters of Unequal Size
Personal Names Popularity Estimation and its Application to Record Linkage
Unsupervised Transfer Learning for Spoken Language Understanding in Intelligent Agents
Benchmarking datasets for Anomaly-based Network Intrusion Detection: KDD CUP 99 alternatives
Operator-Valued Matrices with Free or Exchangeable Entries
Comparison of Feature Extraction Methods and Predictors for Income Inference
Quantile regression approach to conditional mode estimation
Sorting out Lipschitz function approximation
On Finding Quantum Multi-collisions
Hallucinating Point Cloud into 3D Sculptural Object
Remarks on a fractional-time stochastic equation
Strong Approximation of Monotone Stochastic Partial Different Equations Driven by Multiplicative Noise
Estimation of urban traffic state with probe vehicles
Advances in sequential measurement and control of open quantum systems
Algorithms for Optimal AC Power Flow in the Presence of Renewable Sources
Embedding Electronic Health Records for Clinical Information Retrieval
Autonomic Intrusion Response in Distributed Computing using Big Data
Multi-task learning for Joint Language Understanding and Dialogue State Tracking
Estimating the Impact of Cyber-Attack Strategies for Stochastic Control Systems
Home Activity Monitoring using Low Resolution Infrared Sensor
Fast Human Pose Estimation
ABox Abduction via Forgetting in ALC (Long Version)
Quickest Detection of Time-Varying False Data Injection Attacks in Dynamic Linear Regression Models
On the Mean Order of Connected Induced Subgraphs of Block Graphs
Deep Object Centric Policies for Autonomous Driving
Robust H-infinity kinematic control of manipulator robots using dual quaternion algebra
Argumentation for Explainable Scheduling (Full Paper with Proofs)
Very Hard Electoral Control Problems
A survey of semidefinite programming approaches to the generalized problem of moments and their error analysis
Cyclic quasi-symmetric functions
Co-regularized Alignment for Unsupervised Domain Adaptation
Higher-Order Cone Programming
New fat-tail normality test based on conditional second moments with applications to finance


Best Deals in Deep Learning Cloud Providers: From CPU to GPU to TPU

A detailed comparison of the best places to train your deep learning model for the lowest cost and hassle, including AWS, Google, Paperspace, and more.


Monday after Thanksgiving, see you at the Data Visualization Meetup

My little analytics training startup, Principal Analytics Prep, is proud to sponsor the next meeting of the Data Visualization New York Meetup, organized by the indefatigable Naomi Robbins. The Meetup, to be held on November 26, 2018 (Monday), headlines NYU professor Kristen Sosulski, who will discuss the “business case for data visualization”. Click here to register.

Naomi is a long-time friend. I reviewed and recommended her book on Creating More Effective Data Graphics in 2008. This is still a useful reference for some key concepts, presented in a clean, easily digestible format.


The keynote speaker, Kristen, has just published Data Visualization Made Simple, described as a “top book for computer science students.” I am very much looking forward to her talk because the subject speaks to the core of the mission of Principal Analytics Prep: placing data science & analytics in the context of the entire enterprise.

That’s why my bootcamp and training programs emphasize the Three Pillars: computing, statistics, and business. Our instructors are practitioners with 10 to 30 years of learning from real-world implementation of models and systems used by forward-thinking, data-driven organizations.

The Spring 2019 cohort of our Certified Data Specialist bootcamp is open for applications. If you're looking to transition your career into data science and advanced analytics, check us out. Here are some of the great things our alums have said about the program. Take advantage of the early admit deadline by December 10, 2018. Click here for more information.


I'm excited to hear what Kristen has to say about using data visualization in the business world. Kristen's website is here. She teaches at NYU's Stern School of Business, and describes herself as a computer scientist. Here's the link to the Amazon page. Unfortunately the page is not very informative about the book's content. The table of contents is minimalist ("The Design", "The Audience", etc.). I will report back on what she spoke about after the Meetup.




Gold-Mining Week 11 (2018)

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 11 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining Week 11 (2018) appeared first on Fantasy Football Analytics.

To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...


Online Bayesian Deep Learning in Production at Tencent

Bayesian deep learning methods often look like a theoretical curiosity rather than a practically useful tool, and I'm personally a bit skeptical about the practical usefulness of some of the work. However, there are some situations where a decent method of handling and representing residual uncertainty about model parameters might prove crucial. These applications include active learning, reinforcement learning and online/continual learning.

So when I recently read a paper by Tencent, I was surprised to learn that an online Bayesian deep learning algorithm is apparently deployed in production to power click-through-rate prediction in their ad system. I thought, therefore, that it is worth a mention. Here is the paper:

Though the paper has a few typos and sometimes inconsistent notation (like going back and forth between $\omega$ and $w$), which can make it tricky to read, I like the paper's layout: starting from desiderata (what the system has to be able to do) and arriving at an elegant solution.

Assumed Density Filtering

The method relies on the approximate Bayesian online-learning technique often referred to as assumed density filtering (ADF).

ADF has been independently discovered by the statistics, machine learning and control communities (for citations, see Minka, 2001). It is perhaps most elegantly described by Opper (1998), and by Minka (2001), who also extended the idea to develop expectation propagation, a highly influential method for approximate Bayesian inference beyond online settings.

ADF can be explained as a recursive algorithm repeating the following steps:

  • The starting point for the algorithm at time $t$ is a "temporal prior" distribution $q_{t-1}(w)$ over parameters $w$. This $q_{t-1}(w)$ incorporates all evidence from datapoints observed in previous timesteps, and it is assumed to approximate the posterior $p(w\vert x_{1:t-1})$ conditioned on all data observed so far. $q_{t-1}(w)$ is assumed to take some simple form, like a Gaussian distribution factorized over the elements of $w$.
  • Then, some new observations $x_t$ are observed. The Bayesian way to learn from these new observations would be to update the posterior: $p_t(w|x_t) \propto p(x_t\vert w)q_{t-1}(w)$
  • However, this posterior might be complicated, and may not be available in a simple form. Therefore, at this point we approximate $p_t(w|x_t)$ by a new simple distribution $q_{t}$, which is in the same family as $q_{t-1}$. This step can involve a KL divergence-based approximation, e.g. (Opper, 1998, Minka, 2001), or Laplace approximation e.g. (Kirkpatrick et al, 2017, Huszár, 2017), or a more involved inner loop such as EP in MatchBox (Stern et al, 2009) or probabilistic backpropagation (Hernandez-Lobato and Adams, 2015) as in this paper.
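The steps above can be sketched in a few lines for the simplest non-trivial case: one-dimensional Bayesian probit regression, where the moment-matching projection back to a Gaussian is available in closed form (the standard ADF/EP probit update; this toy setup and all names are illustrative, not the model from the paper).

```python
import math

def phi(z):   # standard normal pdf
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):   # standard normal cdf, via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def adf_probit_update(m, v, x, y):
    """One ADF step: approximate Phi(y*x*w) * N(w; m, v) by a Gaussian.

    (m, v) is the current Gaussian "temporal prior" q_{t-1}(w);
    (x, y) with y in {-1, +1} is the new observation.
    Returns the moment-matched Gaussian q_t(w)."""
    a = y * x
    s = math.sqrt(1.0 + a * a * v)
    z = a * m / s
    r = phi(z) / Phi(z)                               # pdf/cdf ratio
    m_new = m + a * v * r / s                         # matched first moment
    v_new = v - (a * v) ** 2 * r * (z + r) / (s * s)  # matched variance
    return m_new, v_new

# Online loop: each observation is incorporated once, then discarded.
m, v = 0.0, 1.0                                       # prior q_0(w) = N(0, 1)
for x, y in [(1.0, 1), (0.5, 1), (2.0, 1), (1.0, 1)]:
    m, v = adf_probit_update(m, v, x, y)
```

After a run of consistent observations the mean moves toward the data and the variance shrinks, mirroring how the posterior concentrates as evidence accumulates.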

Probabilistic Backpropagation

The Tencent paper is based on probabilistic backpropagation (Hernandez-Lobato and Adams, 2015), which uses the ADF idea multiple times to perform the two non-trivial tasks required in a supervised Bayesian deep network: inference and learning.

  • forward propagation: In Bayesian deep learning, we maintain a distribution $q(w)$ over neural network weights, and each value $w$ defines a conditional probability $p(y\vert x, w)$. To predict the label $y$ from the input $x$, we have to average over $q(w)$, that is calculate $p(y\vert x) = \int q(w)p(y\vert x,w)dw$, which is difficult due to the nonlinearities. Probabilistic backprop uses an ADF-like algorithm to approximate this predictive distribution. Starting from the bottom of the network, it approximates the distribution of the first hidden layer's activations given the input $x$ with a Gaussian. (The first hidden layer has a distribution of activations because we have a distribution over the weights.) It then propagates that distribution to the second layer, and approximates the result with a Gaussian. This process is repeated until the distribution of $y$ given $x$ is calculated in the last step, which can be done easily if one uses a probit activation at the last layer.
  • backward propagation: Forward propagation allows us to make a prediction given an input. The second task we have to be able to perform is to incorporate evidence from a new datapoint $(x_t, y_t)$ by updating the distribution over weights $q_t(w)$ to $p_t(w|x_t, y_t) \propto p(y_t\vert x_t, w) q_{t-1}(w)$. We approximate this $p_t(w|x_t, y_t)$ in an inner loop, by first running probabilistic forward propagation, then a similar ADF-like sweep backwards in the network.
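To make the forward pass concrete, here is a moment-propagation sketch for a single Bayesian linear unit followed by a ReLU, using the standard Gaussian moment-matching formulas for the rectifier. This is a one-unit simplification of full probabilistic backprop; all function names are illustrative.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def linear_moments(x, w_mean, w_var):
    """Pre-activation mean and variance for a deterministic input x
    and a Gaussian weight w ~ N(w_mean, w_var)."""
    return w_mean * x, w_var * x * x

def relu_moments(mu, var):
    """Mean and variance of max(0, z) for z ~ N(mu, var), i.e. the
    Gaussian projection of the post-ReLU distribution."""
    sigma = math.sqrt(var)
    alpha = mu / sigma
    mean = mu * norm_cdf(alpha) + sigma * norm_pdf(alpha)
    second = (mu * mu + var) * norm_cdf(alpha) + mu * sigma * norm_pdf(alpha)
    return mean, second - mean * mean

# Propagate N(0, 1) weight uncertainty through one unit with input x = 1.
mu, var = linear_moments(1.0, 0.0, 1.0)
h_mean, h_var = relu_moments(mu, var)
```

A deep network repeats this pair of steps layer by layer; with a probit output the final predictive probability is then available in closed form.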

I find the similarity of this algorithm to gradient calculations in neural network training beautiful: forward propagation to perform inference, forward and backward propagation for learning and parameter updates.

Crucially though, there is no gradient descent, nor indeed any gradients or iterative optimization, taking place here. After a single forward-backward cycle, information about the new datapoint (or minibatch of data) is, approximately, incorporated into the updated distribution $q_{t}(w)$ of network weights. Once this is done, the minibatch can be discarded and never revisited. This makes the method a great candidate for online learning in large-scale streaming data situations.

Parallel operation

Another advantage of this method is that it can be parallelized in a data-parallel fashion: multiple workers can update the posterior simultaneously on different minibatches of data. Expectation propagation provides a principled way of combining the parameter updates computed by parallel workers. Indeed, this is what Tencent's PBODL algorithm does.
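A minimal sketch of that combination step, assuming each worker reports its local posterior for a shared Gaussian parameter and the server sums the natural-parameter deltas each worker contributed on top of the common prior. This mirrors how EP combines site approximations; the exact PBODL protocol is in the paper, and all names here are illustrative.

```python
def to_natural(mean, var):
    """Gaussian (mean, var) -> natural parameters (precision, precision*mean)."""
    prec = 1.0 / var
    return prec, prec * mean

def from_natural(prec, pm):
    var = 1.0 / prec
    return pm * var, var

def combine(prior_mean, prior_var, worker_posteriors):
    """Merge per-worker posteriors that share a common prior by adding up
    the natural-parameter contribution (posterior minus prior) of each."""
    p0, pm0 = to_natural(prior_mean, prior_var)
    prec, pm = p0, pm0
    for m, v in worker_posteriors:
        pw, pmw = to_natural(m, v)
        prec += pw - p0          # each worker's likelihood contribution
        pm += pmw - pm0
    return from_natural(prec, pm)

# Two workers each halve the prior variance; combined, the variance
# drops further and the mean is a precision-weighted average.
mean, var = combine(0.0, 1.0, [(0.4, 0.5), (0.6, 0.5)])
```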


Bayesian online learning comes with great advantages. In recommender systems and ads marketplaces, the Bayesian online learning approach handles the user and item cold-start problem gracefully. If there is a new ad in the system that has to be recommended, initially our network might say "I don't really know", and then gradually hone in on a confident prediction as more data is observed.

One can also use the uncertainty estimates in a Bayesian network to perform active learning: actively proposing which items to label in order to increase predictive performance in the future. In the context of ads, this may be a useful feature if implemented with care.

Finally, there is the useful principle that Bayes never forgets: if we perform exact Bayesian updates on global parameters of an exchangeable model, the posterior will store information indefinitely about all datapoints, including very early ones. This advantage is clearly demonstrated in DeepMind's work on catastrophic forgetting (Kirkpatrick et al, 2017). This capacity to keep a long memory of course diminishes the more approximations we have to make, which leads me to drawbacks.


The ADF method is only approximate, and over time the approximations in each step may accumulate, causing the network to essentially forget what it learned from earlier datapoints. It is also worth pointing out that in stationary situations the posterior over parameters is expected to shrink, especially in the last hidden layers of the network, where there are often fewer parameters. This posterior compression is often countered by introducing explicit forgetting: without evidence to the contrary, the variance of each parameter is marginally increased in each step. Forgetting and the lack of infinite memory may turn out to be a feature, not a bug, in non-stationary situations where more recent datapoints are more representative of future test cases.
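Explicit forgetting can be as simple as mixing each parameter's variance slightly back toward the prior before every update. The discount rule below is just one illustrative choice (the `gamma` and `prior_var` names are mine, not the paper's): without new evidence, uncertainty grows back instead of collapsing to zero.

```python
def forget(mean, var, prior_var=1.0, gamma=0.99):
    """One forgetting step: shrink the posterior variance back toward
    the prior variance, so stale certainty gradually decays."""
    new_var = gamma * var + (1.0 - gamma) * prior_var
    return mean, new_var

# A nearly collapsed posterior recovers most of the prior variance
# after a few hundred steps without evidence.
m, v = 0.8, 1e-4
for _ in range(200):
    m, v = forget(m, v)
```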

A practical drawback of a method like this in production environments is that probabilistic forward- and backpropagation require non-trivial custom implementations. Probabilistic backprop does not use reverse-mode automatic differentiation (i.e. vanilla backprop) as a subroutine. As a consequence, one cannot rely on the extensively tested and widely used autodiff packages in TensorFlow or PyTorch, and performing probabilistic backprop in new layer types might require significant effort and custom implementations. It is possible that high-quality message passing packages, such as the one recently open-sourced, will see a renaissance after all.


Thanks for reading!