My Data Science Blogs

April 22, 2019

Four short links: 22 April 2019

GANs via Spreadsheet, Open Source Chat, Sandboxing Libraries, and Flat Robot Sales

  1. Spacesheet -- Interactive Latent Space Exploration through a Spreadsheet Interface. (via Flowing Data)
  2. Tchap -- the French government's open source secure encrypted chat tool, built off the open source Riot. (via ZDNet)
  3. Sandboxed API -- Google open-sourced their tool for automatically generating sandboxes for C/C++ libraries. (via Google Blog)
  4. Industrial Robot Sales Flat (Robohub) -- Sales were up only 1% over 2017. Important note: no information was given about service and field robotics (which may well be booming).


R Packages worth a look

Cognitive Mapping Tools Based on Coding of Textual Sources (cogmapr)
Functions for building cognitive maps based on qualitative data. Inputs are textual sources (articles, transcription of qualitative interviews of agent …

Download ‘Qualtrics’ Survey Data (qualtRics)
Provides functions to access survey results directly into R using the ‘Qualtrics’ API. …

Interactively Explore Dimension-Reduced Embeddings (sleepwalk)
A tool to interactively explore the embeddings created by dimension reduction methods such as Principal Components Analysis (PCA), Multidimensional Sca …

Density Estimation with a Penalized Mixture Approach (pendensity)
Estimation of univariate (conditional) densities using penalized B-splines with automatic selection of optimal smoothing parameter.

Interface to ‘Virtuoso’ using ‘ODBC’ (virtuoso)
Provides users with a simple and convenient mechanism to manage and query a ‘Virtuoso’ database using the ‘DBI’ (DataBase Interface) compatible ‘ODBC’ …

Calculate and Scale Moran’s I (Irescale)
Provides a scaling method to obtain a standardized Moran’s I measure. Moran’s I is a measure of the spatial autocorrelation of a data set; it gives a …
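For intuition, the unscaled Moran's I that this package standardizes can be computed in a few lines. A minimal sketch (in Python rather than R, using a hypothetical 4-point line lattice with binary contiguity weights):

```python
import numpy as np

def morans_i(x, w):
    """Moran's I: (n / sum(w)) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()                      # deviations from the mean
    n = len(x)
    num = (w * np.outer(z, z)).sum()      # cross-products weighted by adjacency
    den = (z ** 2).sum()
    return (n / w.sum()) * num / den

# 4 points on a line; neighbours share an edge (binary contiguity weights)
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = [1.0, 2.0, 3.0, 4.0]                  # monotone values -> positive autocorrelation
print(round(morans_i(x, w), 3))           # -> 0.333
```

Alternating values such as `[1, -1, 1, -1]` give I = -1, the perfectly dispersed case.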


Videos: New Deep Learning Techniques, February 5 - 9, 2018, IPAM Workshop



In recent years, artificial neural networks, a.k.a. deep learning, have significantly improved the fields of computer vision, speech recognition, and natural language processing. This success relies on the availability of large-scale datasets, the development of affordable high computational power, and basic deep learning operations that are sound and fast because they assume that data lie on Euclidean grids. However, not all data live on regular lattices. 3D shapes in computer graphics represent Riemannian manifolds. In neuroscience, brain activity (fMRI) is encoded on the structural connectivity network (sMRI). In genomics, the human body functionality is expressed through DNA, RNA, and proteins that form the gene regulatory network (GRN). In social sciences, people interact through networks. Finally, data in communication networks are structured by graphs, such as the Internet or road traffic networks.

Deep learning, which was originally developed for computer vision, cannot be directly applied to these highly irregular domains, and new classes of deep learning techniques must be designed. This is highly challenging, as most standard data analysis tools cannot be used on heterogeneous data domains. The workshop will bring together experts in mathematics (statistics, harmonic analysis, optimization, graph theory, sparsity, topology), machine learning (deep learning, supervised & unsupervised learning, metric learning) and specific application domains (neuroscience, genetics, social science, computer vision) to establish the current state of these emerging techniques and discuss the next directions.
This workshop will include a poster session; a request for posters will be sent to registered participants in advance of the workshop.
Here are the videos (and some slides):
Samuel Bowman (New York University)
Toward natural language semantics in learned representations

Emily Fox (University of Washington)
Interpretable and Sparse Neural Network Time Series Models for Granger Causality Discovery

Ellie Pavlick (University of Pennsylvania), Should we care about linguistics?

Leonidas Guibas (Stanford University), Knowledge Transport Over Visual Data

Yann LeCun (New York University), Public Lecture: Deep Learning and the Future of Artificial Intelligence

Alán Aspuru-Guzik (Harvard University), Generative models for the inverse design of molecules and materials

Daniel Rueckert (Imperial College), Deep learning in medical imaging: Techniques for image reconstruction, super-resolution and segmentation

Kyle Cranmer (New York University), Deep Learning in the Physical Sciences

Stéphane Mallat (École Normale Supérieure), Deep Generative Networks as Inverse Problems

Michael Elad (Technion - Israel Institute of Technology), Sparse Modeling in Image Processing and Deep Learning

Yann LeCun (New York University), Public Lecture: AI Breakthroughs & Obstacles to Progress, Mathematical and Otherwise

Xavier Bresson (Nanyang Technological University, Singapore), Convolutional Neural Networks on Graphs

Federico Monti (Universita della Svizzera Italiana), Deep Geometric Matrix Completion: a Geometric Deep Learning approach to Recommender Systems

Joan Bruna (New York University), On Computational Hardness with Graph Neural Networks

Jure Leskovec (Stanford University), Large-scale Graph Representation Learning

Arthur Szlam (Facebook), Composable planning with attributes

Yann LeCun (New York University), A Few (More) Approaches to Unsupervised Learning

Sanja Fidler (University of Toronto), Teaching Machines with Humans in the Loop
Raquel Urtasun (University of Toronto), Deep Learning for Self-Driving Cars

Pratik Chaudhari (University of California, Los Angeles (UCLA)), Unraveling the mysteries of stochastic gradient descent on deep networks

Stefano Soatto (University of California, Los Angeles (UCLA)), Emergence Theory of Deep Learning

Tom Goldstein (University of Maryland), What do neural net loss functions look like?

Stanley Osher (University of California, Los Angeles (UCLA)), New Techniques in Optimization and Their Applications to Deep Learning and Related Inverse Problems

Michael Bronstein (USI Lugano, Switzerland), Deep functional maps: intrinsic structured prediction for dense shape correspondence

Sainbayar Sukhbaatar (New York University), Deep Architecture for Sets and Its Application to Multi-agent Communication

Zuowei Shen (National University of Singapore), Deep Learning: Approximation of functions by composition

Wei Zhu (Duke University), LDMnet: low dimensional manifold regularized neural networks



What’s new on arXiv

OCKELM+: Kernel Extreme Learning Machine based One-class Classification using Privileged Information (or KOC+: Kernel Ridge Regression or Least Square SVM with zero bias based One-class Classification using Privileged Information)

Kernel method-based one-class classifiers are mainly used for outlier or novelty detection. In this letter, the kernel ridge regression (KRR) based one-class classifier (KOC) is extended for learning using privileged information (LUPI). The LUPI-based KOC method is referred to as KOC+. Privileged information is available as a feature with the dataset but only for training (not for testing). KOC+ utilizes the privileged information differently from normal feature information by using a so-called correction function. Privileged information helps KOC+ achieve better generalization performance, which is exhibited in this letter by testing the classifiers with and without privileged information. Existing and proposed classifiers are evaluated on datasets from the UCI Machine Learning Repository and on the MNIST dataset. Moreover, the experimental results evince the advantage of KOC+ over KOC and support vector machine (SVM) based one-class classifiers.

PL-NMF: Parallel Locality-Optimized Non-negative Matrix Factorization

Non-negative Matrix Factorization (NMF) is a key kernel for unsupervised dimension reduction used in a wide range of applications, including topic modeling, recommender systems and bioinformatics. Due to the compute-intensive nature of applications that must perform repeated NMF, several parallel implementations have been developed in the past. However, existing parallel NMF algorithms have not addressed data locality optimizations, which are critical for high performance since data movement costs greatly exceed the cost of arithmetic/logic operations on current computer systems. In this paper, we devise a parallel NMF algorithm based on the HALS (Hierarchical Alternating Least Squares) scheme that incorporates algorithmic transformations to enhance data locality. Efficient realizations of the algorithm on multi-core CPUs and GPUs are developed, demonstrating significant performance improvement over existing state-of-the-art parallel NMF algorithms.
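For reference, the kernel being parallelized looks like this in its simplest form. The sketch below uses the classic Lee-Seung multiplicative updates rather than the paper's locality-optimized HALS scheme, and the function name is illustrative:

```python
import numpy as np

def nmf_mu(X, rank, iters=200, eps=1e-9, seed=0):
    """Plain multiplicative-update NMF: find W, H >= 0 with X ~ W @ H."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # updates are ratios of non-negative
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # terms, so factors stay non-negative
    return W, H

X = np.random.default_rng(1).random((20, 12))  # toy non-negative data matrix
W, H = nmf_mu(X, rank=4)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)   # relative reconstruction error
```

HALS differs in that it updates one factor column at a time via closed-form non-negative least squares, which is what makes the data-locality transformations in the paper possible.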

Cross-Lingual Sentiment Quantification

We discuss \emph{Cross-Lingual Text Quantification} (CLTQ), the task of performing text quantification (i.e., estimating the relative frequency p_{c}(D) of all classes c\in\mathcal{C} in a set D of unlabelled documents) when training documents are available for a source language \mathcal{S} but not for the target language \mathcal{T} for which quantification needs to be performed. CLTQ has never been discussed before in the literature; we establish baseline results for the binary case by combining state-of-the-art quantification methods with methods capable of generating cross-lingual vectorial representations of the source and target documents involved. We present experimental results obtained on publicly available datasets for cross-lingual sentiment classification; the results show that the presented methods can perform CLTQ with a surprising level of accuracy.
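The simplest quantification baseline, classify and count, illustrates the task definition: estimate p_c(D) by classifying every unlabelled document and counting predicted labels. A Python sketch with a toy classifier (hypothetical names; real quantification methods additionally correct for classifier bias):

```python
from collections import Counter

def classify_and_count(classifier, docs):
    """Estimate class prevalences p_c(D) by classifying each unlabelled
    document and returning the relative frequency of each predicted label."""
    preds = [classifier(d) for d in docs]
    counts = Counter(preds)
    return {c: counts[c] / len(preds) for c in counts}

# toy classifier: a document is 'positive' iff it contains the word 'good'
clf = lambda d: 'positive' if 'good' in d else 'negative'
docs = ["good movie", "bad plot", "good acting, good pace", "terrible"]
prevalences = classify_and_count(clf, docs)   # {'positive': 0.5, 'negative': 0.5}
```

In the cross-lingual setting of the paper, the classifier would operate on shared vectorial representations of source- and target-language documents.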

A Systematic Study of Leveraging Subword Information for Learning Word Representations

The use of subword-level information (e.g., characters, character n-grams, morphemes) has become ubiquitous in modern word representation learning. Its importance is attested especially for morphologically rich languages which generate a large number of rare words. Despite a steadily increasing interest in such subword-informed word representations, their systematic comparative analysis across typologically diverse languages and different tasks is still missing. In this work, we deliver such a study focusing on the variation of two crucial components required for subword-level integration into word representation models: 1) segmentation of words into subword units, and 2) subword composition functions to obtain final word representations. We propose a general framework for learning subword-informed word representations that allows for easy experimentation with different segmentation and composition components, also including more advanced techniques based on position embeddings and self-attention. Using the unified framework, we run experiments over a large number of subword-informed word representation configurations (60 in total) on 3 tasks (general and rare word similarity, dependency parsing, fine-grained entity typing) for 5 languages representing 3 language types. Our main results clearly indicate that there is no ‘one-size-fits-all’ configuration, as performance is both language- and task-dependent. We also show that configurations based on unsupervised segmentation (e.g., BPE, Morfessor) are sometimes comparable to or even outperform the ones based on supervised word segmentation.

SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula

Synthetic population generation is the process of combining multiple socioeconomic and demographic datasets from various sources and at different levels of granularity, and downscaling them to an individual level. Although it is a fundamental step for many data science tasks, an efficient and standard framework is absent. In this study, we propose a multi-stage framework called SynC (Synthetic Population via Gaussian Copula) to fill the gap. SynC first removes potential outliers in the data and then fits the filtered data with a Gaussian copula model to correctly capture dependencies and marginal distributions of sampled survey data. Finally, SynC leverages neural networks to merge datasets into one and then scales them accordingly to match the marginal constraints. We make four key contributions in this work: 1) propose a novel framework for generating individual level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) design a metric for validating the accuracy of generated data when the ground truth is hard to obtain, 3) demonstrate its effectiveness with the Canada National Census data and present two real-world use cases where datasets of this nature can be leveraged by businesses, and 4) release an easy-to-use framework implementation for reproducibility.

Inductive Graph Representation Learning with Recurrent Graph Neural Networks

In this paper, we study the problem of node representation learning with graph neural networks. We present a graph neural network class named recurrent graph neural network (RGNN) that addresses the shortcomings of prior methods. By using recurrent units to capture long-term dependencies across layers, our method can successfully identify important information during recursive neighborhood expansion. In our experiments, we show that our model class achieves state-of-the-art results on three benchmarks: the Pubmed, Reddit, and PPI network datasets. Our in-depth analyses also demonstrate that incorporating recurrent units is a simple yet effective way to suppress noisy information in graphs, which enables deeper graph neural networks.

A Multi-Task Learning Framework for Overcoming the Catastrophic Forgetting in Automatic Speech Recognition

Recently, data-driven Automatic Speech Recognition (ASR) systems have achieved state-of-the-art results, and transfer learning is often used when those existing systems are adapted to a target domain, e.g., by fine-tuning or retraining. However, in these processes the system parameters may deviate too far from the previously learned parameters. Thus, it is difficult for the training process to learn knowledge from target domains while not forgetting knowledge from the previous learning process, a problem called catastrophic forgetting (CF). In this paper, we attempt to solve the CF problem with lifelong learning and propose a novel multi-task learning (MTL) training framework for ASR. It treats reserving original knowledge and learning new knowledge as two independent tasks. On the one hand, we constrain the new parameters not to deviate too far from the original parameters and penalize the new system when it forgets original knowledge. On the other hand, we force the new system to learn new knowledge quickly. Then, an MTL mechanism is employed to balance the two tasks. We applied our method to an End2End ASR task and obtained the best performance on both the target and original datasets.

Sparseout: Controlling Sparsity in Deep Networks

Dropout is commonly used to help reduce overfitting in deep neural networks. Sparsity is a potentially important property of neural networks, but is not explicitly controlled by Dropout-based regularization. In this work, we propose Sparseout, a simple and efficient variant of Dropout that can be used to control the sparsity of the activations in a neural network. We theoretically prove that Sparseout is equivalent to an L_q penalty on the features of a generalized linear model and that Dropout is a special case of Sparseout for neural networks. We empirically demonstrate that Sparseout is computationally inexpensive and is able to control the desired level of sparsity in the activations. We evaluated Sparseout on image classification and language modelling tasks to see the effect of sparsity on these tasks. We found that sparsity of the activations is favorable for language modelling performance while image classification benefits from denser activations. Sparseout provides a way to investigate sparsity in state-of-the-art deep learning models. Source code for Sparseout can be found at \url{https://…/sparseout}.

Forecasting with time series imaging

Feature-based time series representation has attracted substantial attention in a wide range of time series analysis methods. Recently, the use of time series features for forecast model selection and model averaging has been an emerging research focus in the forecasting community. Nonetheless, most of the existing approaches depend on the manual choice of an appropriate set of features. Exploiting machine learning methods to automatically extract features from time series has become crucially important in state-of-the-art time series analysis. In this paper, we introduce an automated approach to extract time series features based on images. Time series are first transformed into recurrence images, from which local features can be extracted using computer vision algorithms. The extracted features are used for forecast model selection and model averaging. Our experiments show that forecasting based on automatically extracted features, with less human intervention and a more comprehensive view of the raw time series data, yields performance comparable with the best methods proposed in the largest forecasting competition, M4.
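The image transform at the heart of this approach, the recurrence plot, is easy to sketch: threshold pairwise distances between observations. A minimal Python version (the paper then feeds such images to computer-vision feature extractors; the threshold `eps` is an illustrative choice):

```python
import numpy as np

def recurrence_matrix(x, eps):
    """Binary recurrence plot of a univariate series: R[i, j] = 1 iff |x_i - x_j| <= eps."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])   # all pairwise distances
    return (d <= eps).astype(int)

x = [0.0, 0.1, 1.0, 0.05]
R = recurrence_matrix(x, eps=0.2)
# R is symmetric with ones on the diagonal; the lone excursion to 1.0
# shows up as the isolated third row/column.
```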

Text Classification Algorithms: A Survey

In recent years, there has been exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extraction methods, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their applications to real-world problems are discussed.

Self-Attention Graph Pooling

Advanced methods of applying deep learning to structured data such as graphs have been proposed in recent years. In particular, studies have focused on generalizing convolutional neural networks to graph data, which includes redefining the convolution and the downsampling (pooling) operations for graphs. The method of generalizing the convolution operation to graphs has been proven to improve performance and is widely used. However, the method of applying downsampling to graphs is still difficult to perform and has room for improvement. In this paper, we propose a graph pooling method based on self-attention. Self-attention using graph convolution allows our pooling method to consider both node features and graph topology. To ensure a fair comparison, the same training procedures and model architectures were used for the existing pooling methods and our method. The experimental results demonstrate that our method achieves superior graph classification performance on the benchmark datasets using a reasonable number of parameters.

General Purpose (GenP) Bioimage Ensemble of Handcrafted and Learned Features with Data Augmentation

Bioimage classification plays a crucial role in many biological problems. Here we present a new General Purpose (GenP) ensemble that boosts performance by combining local features, dense sampling features, and deep learning approaches. We propose an ensemble of deep learning methods built using different criteria (different batch sizes, learning rates, topologies, and data augmentation methods). One of the contributions of this paper is the proposal of new methods of data augmentation based on feature transforms (principal component analysis/discrete cosine transform) that boost performance of Convolutional Neural Networks (CNNs). Each handcrafted descriptor is used to train a different Support Vector Machine (SVM), and the different SVMs are combined with the ensemble of CNNs. Our method is evaluated on a diverse set of bioimage classification problems. Results demonstrate that the proposed GenP bioimage ensemble obtains state-of-the-art performance without any ad-hoc dataset tuning of parameters (avoiding the risk of overfitting/overtraining).

Patent Analytics Based on Feature Vector Space Model: A Case of IoT

The number of approved patents worldwide increases rapidly each year, which requires new patent analytics to efficiently mine the valuable information attached to these patents. The vector space model (VSM) represents documents as high-dimensional vectors, where each dimension corresponds to a unique term. While originally proposed for information retrieval systems, VSM has also seen wide applications in patent analytics, and is used as a fundamental tool to map patent documents to structured data. However, the VSM method suffers from several limitations when applied to patent analysis tasks, such as loss of sentence-level semantics and curse-of-dimensionality problems. In order to address the above limitations, we propose patent analytics based on a feature vector space model (FVSM), where the FVSM is constructed by mapping patent documents to feature vectors extracted by convolutional neural networks (CNNs). The applications of FVSM to three typical patent analysis tasks, i.e., patent similarity comparison, patent clustering, and patent map generation, are discussed. A case study using patents related to Internet of Things (IoT) technology demonstrates the performance and effectiveness of FVSM. The proposed FVSM can be adopted by other patent analysis studies to replace VSM, based on which various big data learning tasks can be performed.

Explainability in Human-Agent Systems

This paper presents a taxonomy of explainability in Human-Agent Systems. We consider fundamental questions about the Why, Who, What, When and How of explainability. First, we define explainability, and its relationship to the related terms of interpretability, transparency, explicitness, and faithfulness. These definitions allow us to answer why explainability is needed in the system, whom it is geared to and what explanations can be generated to meet this need. We then consider when the user should be presented with this information. Last, we consider how objective and subjective measures can be used to evaluate the entire system. This last question is the most encompassing as it will need to evaluate all other issues regarding explainability.

Audio-Text Sentiment Analysis using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

Sentiment analysis research has been developing rapidly in the last decade and has attracted widespread attention from academia and industry, most of it based on text. However, information in the real world usually comes in different modalities. In this paper, we consider the task of Multimodal Sentiment Analysis using audio and text modalities, and propose a novel fusion strategy, including Multi-Feature Fusion and Multi-Modality Fusion, to improve the accuracy of audio-text sentiment analysis. We call this the Deep Feature Fusion-Audio and Text Modal Fusion (DFF-ATMF) model; the features learned from it are complementary to each other and robust. Experiments with the CMU-MOSI corpus and the recently released CMU-MOSEI corpus for YouTube video sentiment analysis show the very competitive results of our proposed model. Surprisingly, our method also achieves state-of-the-art results on the IEMOCAP dataset, indicating that our proposed fusion strategy also generalizes extremely well to Multimodal Emotion Recognition.

Understanding the Signature of Controversial Wikipedia Articles through Motifs in Editor Revision Networks

Wikipedia serves as a good example of how editors collaborate to form and maintain an article. The relationship between editors, derived from their sequence of editing activity, results in a directed network structure called the revision network, which potentially holds valuable insights into editing activity. In this paper, we create revision networks to assess differences between controversial and non-controversial articles, as labelled by Wikipedia. Originating from complex networks, we apply motif analysis, which determines the under- or over-representation of induced sub-structures, in this case triads of editors. We analyse 21,631 Wikipedia articles in this way, and use principal component analysis to consider the relationship between their motif subgraph ratio profiles. Results show that a small number of induced triads play an important role in characterising relationships between editors, with controversial articles having a tendency to cluster. This provides useful insight into editing behaviour and interaction, capturing counter-narratives, without recourse to semantic analysis. It also provides a potentially useful feature for future prediction of controversial Wikipedia articles.

Compositional Network Embedding

Network embedding has proved extremely useful in a variety of network analysis tasks such as node classification, link prediction, and network visualization. Almost all existing network embedding methods learn to map node IDs to their corresponding node embeddings. This design principle, however, hinders the existing methods from being applied in real cases. Node IDs are not generalizable, and thus the existing methods have to expend great effort on the cold-start problem. Heterogeneous networks usually require extra work to encode node types, as node types cannot be identified from node IDs. Node IDs carry little information, resulting in the criticism that the existing methods are not robust to noise. To address these issues, we introduce Compositional Network Embedding, a general inductive network representation learning framework that generates node embeddings by combining node features based on the principle of compositionality. Instead of directly optimizing an embedding lookup based on arbitrary node IDs, we learn a composition function that infers node embeddings by combining the corresponding node attribute embeddings through a graph-based loss. For evaluation, we conduct experiments on link prediction under four different settings. The results verify the effectiveness and generalization ability of compositional network embeddings, especially on unseen nodes.

Analysing Neural Network Topologies: a Game Theoretic Approach

Artificial Neural Networks have shown impressive success in very different application cases. Choosing a proper network architecture is a critical decision for a network’s success, usually done in a manual manner. As a straightforward strategy, large, mostly fully connected architectures are selected, thereby relying on a good optimization strategy to find proper weights while at the same time avoiding overfitting. However, large parts of the final network are redundant. In the best case, large parts of the network become simply irrelevant for later inferencing. In the worst case, highly parameterized architectures hinder proper optimization and allow the easy creation of adversarial examples fooling the network. A first step in removing irrelevant architectural parts lies in identifying those parts, which requires measuring the contribution of individual components such as neurons. In previous work, heuristics based on using the weight distribution of a neuron as a contribution measure have shown some success, but do not provide a proper theoretical understanding. Therefore, in our work we investigate game theoretic measures, namely the Shapley value (SV), in order to separate relevant from irrelevant parts of an artificial neural network. We begin by designing a coalitional game for an artificial neural network, where neurons form coalitions and the average contributions of neurons to coalitions yield the Shapley value. In order to measure how well the Shapley value measures the contribution of individual neurons, we remove low-contributing neurons and measure the impact on the network performance. In our experiments we show that the Shapley value outperforms other heuristics for measuring the contribution of neurons.
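The Shapley value itself averages a player's marginal contribution over all orderings in which the coalition can be assembled. An exact brute-force Python sketch for a toy three-"neuron" game (at network scale one would sample permutations instead of enumerating them, and the characteristic function would be the network's performance with the coalition's neurons active):

```python
import itertools

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all orderings of the players (feasible only for small games)."""
    phi = {p: 0.0 for p in players}
    perms = list(itertools.permutations(players))
    for perm in perms:
        coalition = set()
        for p in perm:
            before = value(frozenset(coalition))
            coalition.add(p)
            phi[p] += value(frozenset(coalition)) - before   # marginal contribution
    return {p: phi[p] / len(perms) for p in players}

# toy game: the coalition is worth 1 as soon as it contains neuron 'a';
# 'b' and 'c' contribute nothing, so all value should be attributed to 'a'
v = lambda S: 1.0 if 'a' in S else 0.0
phi = shapley_values(['a', 'b', 'c'], v)
```

A useful sanity check is the efficiency property: the values sum to the grand coalition's worth.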

CaseNet: Content-Adaptive Scale Interaction Networks for Scene Parsing

Objects in an image exhibit diverse scales. Adaptive receptive fields are expected to capture a suitable range of context for accurate pixel-level semantic prediction when handling objects of diverse sizes. Recently, atrous convolution with different dilation rates has been used to generate features of multiple scales through several branches, and these features are fused for prediction. However, there is a lack of explicit interaction among the branches to adaptively make full use of the contexts. In this paper, we propose a Content-Adaptive Scale Interaction Network (CaseNet) to exploit the multi-scale features for scene parsing. We build the CaseNet based on the classic Atrous Spatial Pyramid Pooling (ASPP) module, followed by the proposed contextual scale interaction (CSI) module, and the scale adaptation (SA) module. Specifically, first, for each spatial position, we enable context interaction among different scales through scale-aware non-local operations across the scales, i.e., the CSI module, which facilitates the generation of flexible mixed receptive fields, instead of a traditional flat one. Second, the scale adaptation (SA) module explicitly and softly selects the suitable scale for each spatial position and each channel. Ablation studies demonstrate the effectiveness of the proposed modules. We achieve state-of-the-art performance on three scene parsing benchmarks: Cityscapes, ADE20K, and LIP.

Guided Anisotropic Diffusion and Iterative Learning for Weakly Supervised Change Detection

Large scale datasets created from user labels or openly available data have become crucial to provide training data for large scale learning algorithms. While these datasets are easier to acquire, the data are frequently noisy and unreliable, which is motivating research on weakly supervised learning techniques. In this paper we propose an iterative learning method that extracts the useful information from a large scale change detection dataset generated from open vector data to train a fully convolutional network which surpasses the performance obtained by naive supervised learning. We also propose the guided anisotropic diffusion algorithm, which improves semantic segmentation results using the input images as guides to perform edge preserving filtering, and is used in conjunction with the iterative training method to improve results.

Indirect Inference for Time Series Using the Empirical Characteristic Function and Control Variates

We estimate the parameter of a time series process by minimizing the integrated weighted mean squared error between the empirical and simulated characteristic function, when the true characteristic functions cannot be explicitly computed. Motivated by Indirect Inference, we use a Monte Carlo approximation of the characteristic function based on iid simulated blocks. As a classical variance reduction technique, we propose the use of control variates for reducing the variance of this Monte Carlo approximation. These two approximations yield two new estimators that are applicable to a large class of time series processes. We show consistency and asymptotic normality of the parameter estimators under strong mixing, moment conditions, and smoothness of the simulated blocks with respect to the parameter. In a simulation study we show the good performance of these new simulation-based estimators, and the superiority of the control variates based estimator for Poisson driven time series of counts.
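The objective being minimized is the weighted squared distance between empirical characteristic functions. A Python sketch of those two ingredients (the grid, Gaussian weight function, and helper names are illustrative choices, not the paper's):

```python
import numpy as np

def ecf(x, t):
    """Empirical characteristic function: (1/n) * sum_j exp(i t x_j),
    evaluated at each point of the grid t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.exp(1j * np.outer(t, x)).mean(axis=1)

def cf_distance(x_obs, x_sim, t, weights):
    """Weighted mean squared error between the empirical CFs of observed and
    simulated samples -- the objective minimized over the model parameter."""
    diff = ecf(x_obs, t) - ecf(x_sim, t)
    return float(np.sum(weights * np.abs(diff) ** 2))

t = np.linspace(-2, 2, 41)
w = np.exp(-t ** 2)                       # Gaussian weight concentrates mass near t = 0
x = np.random.default_rng(0).normal(size=500)
```

In the estimator, `x_sim` would be regenerated from simulated blocks at each candidate parameter value, with the control variates correcting the Monte Carlo error of that simulation.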

Interpreting Adversarial Examples with Attributes

Deep computer vision systems being vulnerable to imperceptible and carefully crafted noise have raised questions regarding the robustness of their decisions. We take a step back and approach this problem from an orthogonal direction. We propose to enable black-box neural networks to justify their reasoning both for clean and for adversarial examples by leveraging attributes, i.e. visually discriminative properties of objects. We rank attributes based on their class relevance, i.e. how the classification decision changes when the input is visually slightly perturbed, as well as image relevance, i.e. how well the attributes can be localized on both clean and perturbed images. We present comprehensive experiments for attribute prediction, adversarial example generation, adversarially robust learning, and their qualitative and quantitative analysis using predicted attributes on three benchmark datasets.

Continue Reading…


Read More

India has 100k records on iNaturalist

(This article was first published on Vijay Barve, and kindly contributed to R-bloggers)

Biodiversity citizen scientists use iNaturalist to post their observations with photographs. The observations are then curated there by crowd-sourcing the identifications along with other trait-related aspects. Once the data reach “research grade”, they are passed on to GBIF as occurrence records.

The exciting news from India in the third week of April 2019 was that the country crossed 100,000 records on iNaturalist.

Being interested in biodiversity data visualizations and completeness, I was waiting for 100k records to explore the data. Here is what I did and found out.

Step 1: Download the data from the iNaturalist website, which can be done very easily by visiting the site and choosing the right options.

I downloaded the file as .zip and extracted the observations-xxxx.csv. [In my case it was observations-51008.csv].

Step 2: Read the data file in R

library(readr)  # provides read_csv()

observations_51008 <- read_csv("input/observations-51008.csv")

Step 3: Clean up the data and set it up to be used in package bdvis.


library(bdvis)

# Map iNaturalist column names to the fields bdvis expects
# (the column names on the right are assumed from the iNaturalist CSV export)
inatc <- list(
  Latitude = "latitude",
  Longitude = "longitude",
  Date_collected = "observed_on",
  Scientific_name = "scientific_name"
)

inat <- format_bdvis(observations_51008, config = inatc)

Step 4: We still need to rename some more columns for ease of visualization; for example, rather than ‘taxon_family_name’ it is easier to have a field called ‘Family’.

rename_column <- function(dat, old, new){
  if(old %in% colnames(dat)){
    colnames(dat)[which(names(dat) == old)] <- new
  } else {
    print(paste("Error: Fieldname not found...", old))
  }
  return(dat)
}

inat <- rename_column(inat,'taxon_kingdom_name','Kingdom')
inat <- rename_column(inat,'taxon_phylum_name','Phylum')
inat <- rename_column(inat,'taxon_class_name','Class')
inat <- rename_column(inat,'taxon_order_name','Order_')
inat <- rename_column(inat,'taxon_family_name','Family')
inat <- rename_column(inat,'taxon_genus_name','Genus')

# Remove records in excess of 100k
inat <- inat[1:100000,]

Step 5: Make sure the data is loaded properly, using the summary function from bdvis:

bdsummary(inat)

This will produce output like this:

Total no of records = 100000 

 Temporal coverage...
 Date range of the records from  1898-01-01  to  2019-04-19 

 Taxonomic coverage...
 No of Families :  1345
 No of Genus :  5638
 No of Species :  13377 

 Spatial coverage ...
 Bounding box of records  6.806092 , 68.532  -  35.0614769085 , 97.050133
 Degree celles covered :  336
 % degree cells covered :  39.9524375743163

The data look good, but we have a small issue: some records date from the year 1898, which might cause trouble with some of our visualizations. So let us drop the records before the year 2000 for the time being.

inat = inat[which(inat$Date_collected > "2000-01-01"),]

Now we are ready to explore the data. The first thing I always like to see is the geographical coverage of the data. Let us first try it at 1-degree (~100 km) grid cells. Note that here I have an Admin2.shp file with a map of India's states.

# plot the number of records per 1-degree cell (ptype assumed)
mapgrid(inat, ptype = "records",
        shp = "Admin2.shp")

This shows a fairly good geographical coverage of the data at this scale. We have very few degree cells with no data. How about a finer scale, though? Say a 0.1-degree (~10 km) grid. Let us generate that.

# same map at a 0.1-degree grid (gridscale argument assumed)
mapgrid(inat, ptype = "records",
        shp = "Admin2.shp",
        gridscale = 0.1)

Now the pattern of where the data is coming from is clear.

To be continued…


To leave a comment for the author, please follow the link and comment on their blog: Vijay Barve. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...


Distilled News

Neural Network Quantization Introduction

I had been planning to write a series of articles on neural network quantization. This Neural Network Quantization Introduction draws the skeleton. Given the rapid evolution of deep learning in recent years, there has been plenty of material on quantization. However, in most of these documents the authors rush into the details of their work so fast that newcomers can hardly understand even the baseline. This is of course not a good state of affairs in such a fast-growing field.
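As a taste of what such an introduction covers, here is a minimal stdlib sketch of uniform (affine) quantization, the baseline scheme most introductions start from; the bit width and data below are illustrative, not from the article.

```python
def quantize(xs, n_bits=8):
    # uniform affine quantization: map floats in [lo, hi] onto
    # integers in [0, 2**n_bits - 1] via a scale and an offset
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (2**n_bits - 1) or 1.0  # guard constant inputs
    q = [round((x - lo) / scale) for x in xs]
    return q, scale, lo

def dequantize(q, scale, lo):
    # recover approximate floats; error is at most scale / 2 per value
    return [v * scale + lo for v in q]

xs = [-1.5, 0.0, 0.3, 2.5]
q, scale, lo = quantize(xs)
xr = dequantize(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(xs, xr))
```

The whole game in practice is picking `lo`, `hi`, and the bit width so that `scale` (and hence the round-trip error) stays small for the tensors that matter.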

Generative Adversarial Networks Demystified

Since Ian Goodfellow and co-authors published their first paper about GANs in 2014, GANs have been continuously improved and applied in many ways to generate new data. For example, GANs have been successfully used for generating realistic faces, characters or filling images. GANs represent a relatively new idea with wide applications, and this concept differs from common deep learning frameworks as it is a form of unsupervised learning. The inputs are unlabeled and the adversarial network learns what the data looks like (i.e., density estimation), which enables it to generate new examples.

News Feature: What are the limits of deep learning?

There’s no mistaking the image: It’s a banana – a big, ripe, bright-yellow banana. Yet the artificial intelligence (AI) identifies it as a toaster, even though it was trained with the same powerful and oft-publicized deep-learning techniques that have produced a white-hot revolution in driverless cars, speech understanding, and a multitude of other AI applications. That means the AI was shown several thousand photos of bananas, slugs, snails, and similar-looking objects, like so many flash cards, and then drilled on the answers until it had the classification down cold. And yet this advanced system was quite easily confused – all it took was a little day-glow sticker, digitally pasted in one corner of the image.

What Microsoft and Google Are Not Telling You About Their A.I.

It’s natural for companies to portray their A.I. technology as much more sophisticated than it really is. Doing so attracts new investors and raises consumer confidence. But as A.I. becomes more prevalent, tech companies run into trouble when the humans that support it behind the scenes decide to go public about their work – or when a human translator at a conference notices that things aren’t quite what they seem. Most A.I. companies reveal very little about their algorithms in the name of intellectual property protection. As investors flood into this new field, it’s worth examining any tech company’s claims critically. More than ever, we must use all available information to verify just how artificial and intelligent A.I. truly is.

A graphical introduction to dynamic programming

I’ve been helping a friend understand dynamic programming (DP for short), so I’ve been on the lookout for good resources. The topic is covered all across the web, but I found many of them focused on the code, and not enough on the beautiful visualizations that shed light into how DP works. In this post, I present a highly visual introduction to dynamic programming, then walk through three separate problems utilizing dynamic programming.
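A tiny concrete illustration of the technique (my own toy example, not taken from the post): the Fibonacci numbers computed top-down with memoization and bottom-up with a table, the two standard flavours of DP.

```python
def fib_memo(n, memo=None):
    # top-down DP: solve each subproblem once and cache the result
    if memo is None:
        memo = {}
    if n < 2:
        return n
    if n not in memo:
        memo[n] = fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
    return memo[n]

def fib_table(n):
    # bottom-up DP: fill the table of subproblems left to right
    table = [0, 1] + [0] * (n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_memo(30), fib_table(30))  # 832040 832040
```

Either form turns the exponential naive recursion into linear time, which is exactly the overlap-of-subproblems picture the visualizations in the post illustrate.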

The Rise of Generative Adversarial Networks

Five years back, Generative Adversarial Networks (GANs) started a revolution in deep learning. This revolution has produced some major technological breakthroughs. Generative Adversarial Networks were introduced by Ian Goodfellow and others in the paper titled ‘Generative Adversarial Networks’ – https://…/1406.2661. Academia accepted GANs with open arms and industry welcomed GANs with much fanfare too. The rise of GANs was inevitable. First, the best thing about GANs is their unsupervised nature of learning: GANs don’t need labeled data, which makes them powerful, as the boring work of data labeling is not required. Second, the potential use-cases of GANs have put them at the center of conversations. They can generate high-quality images, enhance photos, generate images from text, convert images from one domain to another, change the appearance of a face as it ages, and many more. The list is endless. We will cover some of the widely popular GAN architectures in this article. Third, the endless research around GANs is so mesmerizing that it grabs the attention of every other industry. We will be talking about major technological breakthroughs in a later section of this article.

Principled GraphQL

GraphQL, despite the name, isn’t simply a query language. It’s a comprehensive solution to the problem of connecting modern apps to services in the cloud. As such, it forms the basis for a new and important layer in the modern application development stack: the data graph. This new layer brings all of a company’s app data and services together in one place, with one consistent, secure, and easy-to-use interface, so that anyone can draw upon it with minimal friction.

How to Debug a Neural Network With Gradient Checking

When implementing a neural network from scratch, backpropagation is arguably the step most prone to mistakes. Therefore, a method to debug this step could potentially save a lot of time and headaches when debugging a neural network. Here, the method of gradient checking will be introduced. Briefly, this method consists of approximating the gradient using a numerical approach. If this approximation is close to the calculated gradients, then backpropagation was implemented correctly! Let's dive into more details and see how it can be implemented in a project.
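The recipe is short enough to sketch in full; here is a toy Python version using central differences, with a made-up two-parameter function standing in for the network loss (the function and values are illustrative only).

```python
def numerical_grad(f, x, eps=1e-6):
    # central-difference approximation of the gradient of f at x
    grad = []
    for i in range(len(x)):
        x_plus = list(x); x_plus[i] += eps
        x_minus = list(x); x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

# toy "loss": f(w) = w0**2 + 3*w1, whose analytic gradient is (2*w0, 3)
f = lambda w: w[0]**2 + 3 * w[1]
num = numerical_grad(f, [2.0, -1.0])
ana = [4.0, 3.0]  # what a correct backprop implementation should return
rel_err = max(abs(n - a) / max(abs(n) + abs(a), 1e-8)
              for n, a in zip(num, ana))
print(rel_err < 1e-5)  # True: the two gradients agree
```

In a real network the same comparison is run over the flattened parameter vector, and a relative error much larger than roughly 1e-5 usually signals a backpropagation bug.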

Symbolic vs Connectionist A.I.

It seems that wherever there are two categories of some sort, people are very quick to take one side or the other, to then pit both against each other. Artificial Intelligence techniques have traditionally been divided into two categories; Symbolic A.I. and Connectionist A.I. The latter kind have gained significant popularity with recent success stories and media hype, and no one could be blamed for thinking that they are what A.I. is all about. There have even been cases of people spreading false information to diverge attention and funding from more classic A.I. research and development. The truth of the matter is that each set of techniques has its place. There is no silver bullet A.I. algorithm yet, and trying to use the same algorithm for all problems is just plain stupid. Each has its own strengths and weaknesses, and choosing the right tools for the job is key.

Reinforcement Learning 101

Reinforcement Learning (RL) is one of the hottest research topics in the field of modern Artificial Intelligence, and its popularity is only growing. Let’s look at 5 useful things one needs to know to get started with RL.

Some Popular Metrics in Machine Learning

In this article I provide a brief overview of several metrics used to evaluate the performance of models that simulate some behavior. These metrics compare the simulated output to some ground truth.

Tidy correlation tests in R

When we try to estimate the correlation coefficients between multiple variables, it becomes complicated to obtain a simple and tidy result. A simple solution is to use the tidy() function from the {broom} package. In this post we are going to estimate the correlation coefficients between the annual precipitation of several Spanish cities and climate teleconnection indices. The teleconnection data are preprocessed, but can also be downloaded directly from the original sources. The daily precipitation data come from ECA&D.

The Complete Guide to Decision Trees

In the beginning, learning Machine Learning (ML) can be intimidating. Terms like ‘Gradient Descent’, ‘Latent Dirichlet Allocation’ or ‘Convolutional Layer’ can scare lots of people. But there are friendly ways of getting into the discipline, and I think starting with Decision Trees is a wise decision. Decision Trees (DTs) are probably one of the most useful supervised learning algorithms out there. As opposed to unsupervised learning (where there is no output variable to guide the learning process and data is explored by algorithms to find patterns), in supervised learning your existing data is already labelled and you know which behaviour you want to predict in the new data you obtain. This is the type of algorithm that autonomous cars use to recognize pedestrians and objects, or that organizations exploit to estimate customers’ lifetime value and churn rates.
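To make the supervised flavour concrete, here is a stdlib sketch of the core operation a decision tree repeats at every node: choosing the split that minimizes Gini impurity (the data below are toy values, not from the article).

```python
def gini(labels):
    # Gini impurity of a set of binary labels
    n = len(labels)
    p1 = sum(labels) / n
    return 1 - p1**2 - (1 - p1)**2

def best_split(xs, ys):
    # exhaustively choose the threshold minimizing weighted Gini impurity
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]   # a single feature
ys = [0, 0, 0, 1, 1, 1]      # the labelled behaviour we want to predict
print(best_split(xs, ys))    # (3, 0.0): a perfect split at x <= 3
```

A full tree simply applies this search recursively to each resulting partition until a stopping rule (depth, purity, minimum samples) is met.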

34 Great Articles and Tutorials on Clustering

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more.

How do Graph Neural Networks Work?

Graph neural networks (GNNs) have emerged as an interesting application to a variety of problems. The most pronounced is in the field of chemistry and molecular biology. An example of the impact in this field is DeepChem, a pythonic library that makes use of GNNs. But how exactly do they work?


Magister Dixit

“With the current focus on technology and the automated techniques, rather than on the actual processes of exploration and analysis, many people perceive data mining as a product rather than as a discipline that must be mastered.” Joyce Jackson


April 21, 2019

Book Memo: “Emerging Security Algorithms and Techniques”

Cyber security is the protection of information systems, hardware, software, and information as well from theft, damages, interruption or misdirection to any of these resources. In other words, cyber security focuses on protecting computers, networks, programs and data (in use, in rest, in motion) from unauthorized or unintended access, change or destruction. Therefore, strengthening the security and resilience of cyberspace has become a vital homeland security mission. Cyber security attacks are growing exponentially. Expert observers are optimistic that a new method called honey encryption will deter hackers by serving up fake data for every incorrect speculation of the key code. Then, there are various emerging algorithms and techniques viz. DES, AES, IDEA, WAKE, CAST5, Serpent Algorithm, Chaos-Based Cryptography McEliece, Niederreiter, NTRU, etc.


survivalists [a Riddler’s riddle]

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A neat question from The Riddler on a multi-probability survival rate:

Nine processes are running in a loop with fixed survival rates .99, .98, …, .91. What is the probability that the first process is the last one to die? Same question with probabilities .91, …, .99 and the probability that the last process is the last one to die.

The first question means that the realisation of a Geometric G(.99) has to be strictly larger than the largest of eight Geometric G(.98),…,G(.91). Given that the cdf of a Geometric G(a) is [when counting the number of attempts till failure, included, i.e. the Geometric with support the positive integers]

F(x)=\Bbb P(X\le x)=1-a^{x}

the probability that this happens has the nice (?!) representation

\sum_{x=2}^\infty a_1^{x-1}(1-a_1)\prod_{j\ge 2}(1-a_j^{x-1})=(1-a_1)G(a_1,\ldots,a_9)

which leads to an easy resolution by recursion since

G(a_1,\ldots,a_9)=G(a_1,a_3,\ldots,a_9)-G(a_1a_2,a_3,\ldots,a_9)

and G(a)=a/(1-a)

and a value of 0.5207 returned by R (Monte Carlo evaluation of 0.5207 based on 10⁷ replications). The second question is quite similar, with solution

\sum_{x=2}^\infty a_1^{x-1}(1-a_1)\prod_{j\ge 1}(1-a_j^{x})=a_1^{-1}(1-a_1)G(a_1,\ldots,a_9)

and value 0.52596 (Monte Carlo evaluation of 0.52581 based on 10⁷ replications).
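The Monte Carlo evaluations above are easy to reproduce; here is a minimal sketch (in Python rather than R, with the geometric lifetimes sampled by inversion and a smaller number of replications):

```python
import math
import random

def p_first_dies_last(rates, n_sims=100_000, seed=2019):
    # estimate P(the first process strictly outlives all the others),
    # where process j dies on attempt T_j with P(T_j > t) = rates[j]**t
    rng = random.Random(seed)
    log_a = [math.log(a) for a in rates]
    wins = 0
    for _ in range(n_sims):
        # inversion: T = floor(log(U) / log(a)) + 1 with U uniform on (0, 1]
        lifetimes = [math.floor(math.log(1.0 - rng.random()) / la) + 1
                     for la in log_a]
        if all(lifetimes[0] > t for t in lifetimes[1:]):
            wins += 1
    return wins / n_sims

rates = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.91]
p_hat = p_first_dies_last(rates)
print(p_hat)  # should land close to the 0.5207 reported above
```

With 10⁵ replications the standard error is about 0.0016, so the estimate sits comfortably within a few thousandths of the exact 0.5207.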


Binning with Weights

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

After working on the MOB package, I received requests from multiple users asking if I could write a binning function that takes a weighting scheme into consideration. It is a legitimate request from the practical standpoint. For instance, in the development of fraud detection models, we often sample down non-fraud cases given the extremely low frequency of fraud instances. After the down-sampling, a weight value > 1 should be assigned to all non-fraud cases to reflect the fraud rate in the pre-sample data.

While accommodating the request for weighting cases is trivial, I’d like to do a simple experiment showing what the impact of weighting might be.

– First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.94, IV = 0.21, and WoE values range from -0.38 to 0.64.

– In the first trial, a weight value = 5 is assigned to cases with Y = 0 and a weight value = 1 assigned to cases with Y = 1. As expected, frequency, distribution, bad_frequency, and bad_rate changed. However, KS, IV, and WoE remain identical.

– In the second trial, a weight value = 1 is assigned to cases with Y = 0 and a weight value = 5 assigned to cases with Y = 1. Once again, KS, IV, and WoE are still the same as the unweighted output.

The conclusion from this demonstration is very clear. In cases of two-value weights assigned to the binary Y, the variable importance reflected by IV / KS and the WoE values should remain identical with or without weights. However, if you are concerned about the binning distribution and the bad rate in each bin, the function wts_bin() should do the correction and is available in the project repository.
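That invariance is easy to verify with a toy example (a Python sketch with made-up bin counts, not the MOB package itself): scaling all the Y = 0 counts by a constant weight leaves every bin's WoE, and therefore the IV, unchanged.

```python
import math

def woe_iv(goods, bads):
    # goods/bads: per-bin (possibly weighted) counts of Y = 0 and Y = 1
    g_tot, b_tot = sum(goods), sum(bads)
    woe = [math.log((b / b_tot) / (g / g_tot)) for g, b in zip(goods, bads)]
    iv = sum((b / b_tot - g / g_tot) * w
             for g, b, w in zip(goods, bads, woe))
    return woe, iv

goods = [100, 80, 60, 40]   # Y = 0 counts per bin (made up)
bads  = [5, 10, 15, 30]     # Y = 1 counts per bin (made up)

woe_u, iv_u = woe_iv(goods, bads)                   # unweighted
woe_w, iv_w = woe_iv([5 * g for g in goods], bads)  # weight 5 on Y = 0
# woe_w and iv_w match woe_u and iv_u: the scaling cancels in g / g_tot
```

Only the within-bin distributions and bad rates move under such weights, which is exactly what the wts_bin() correction mentioned above addresses.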


Document worth reading: “Expectation propagation: a probabilistic view of Deep Feed Forward Networks”

We present a statistical mechanics model of deep feed forward neural networks (FFN). Our energy-based approach naturally explains several known results and heuristics, providing a solid theoretical framework and new instruments for a systematic development of FFN. We infer that FFN can be understood as performing three basic steps: encoding, representation validation and propagation. We obtain a set of natural activations — such as sigmoid, $\tanh$ and ReLu — together with a state-of-the-art one, recently obtained by Ramachandran et al.(arXiv:1710.05941) using an extensive search algorithm. We term this activation ESP (Expected Signal Propagation), explain its probabilistic meaning, and study the eigenvalue spectrum of the associated Hessian on classification tasks. We find that ESP allows for faster training and more consistent performances over a wide range of network architectures. Expectation propagation: a probabilistic view of Deep Feed Forward Networks


If you did not already know

Deep Learning Optimizer Benchmark Suite (DeepOBS) google
Because the choice and tuning of the optimizer affects the speed, and ultimately the performance of deep learning, there is significant past and recent research in this area. Yet, perhaps surprisingly, there is no generally agreed-upon protocol for the quantitative and reproducible evaluation of optimization strategies for deep learning. We suggest routines and benchmarks for stochastic optimization, with special focus on the unique aspects of deep learning, such as stochasticity, tunability and generalization. As the primary contribution, we present DeepOBS, a Python package of deep learning optimization benchmarks. The package addresses key challenges in the quantitative assessment of stochastic optimizers, and automates most steps of benchmarking. The library includes a wide and extensible set of ready-to-use realistic optimization problems, such as training Residual Networks for image classification on ImageNet or character-level language prediction models, as well as popular classics like MNIST and CIFAR-10. The package also provides realistic baseline results for the most popular optimizers on these test problems, ensuring a fair comparison to the competition when benchmarking new optimizers, and without having to run costly experiments. It comes with output back-ends that directly produce LaTeX code for inclusion in academic publications. It supports TensorFlow and is available open source. …

Temporal Bibliographic Network google
We present two ways (instantaneous and cumulative) to transform bibliographic networks, using the works’ publication year, into corresponding temporal networks based on temporal quantities. We also show how to use the addition of temporal quantities to define interesting temporal properties of nodes, links and their groups thus providing an insight into evolution of bibliographic networks. Using the multiplication of temporal networks we obtain different derived temporal networks providing us with new views on studied networks. The proposed approach is illustrated with examples from the collection of bibliographic networks on peer review. …

Knowledge Graph Completion (KGC) google
Knowledge Graphs (KGs) have been applied to many tasks including Web search, link prediction, recommendation, natural language processing, and entity linking. However, most KGs are far from complete and are growing at a rapid pace. To address these problems, Knowledge Graph Completion (KGC) has been proposed to improve KGs by filling in its missing connections. Unlike existing methods which hold a closed-world assumption, i.e., where KGs are fixed and new entities cannot be easily added, in the present work we relax this assumption and propose a new open-world KGC task. As a first attempt to solve this task we introduce an open-world KGC model called ConMask. This model learns embeddings of the entity’s name and parts of its text-description to connect unseen entities to the KG. To mitigate the presence of noisy text descriptions, ConMask uses a relationship-dependent content masking to extract relevant snippets and then trains a fully convolutional neural network to fuse the extracted snippets with entities in the KG. Experiments on large data sets, both old and new, show that ConMask performs well in the open-world KGC task and even outperforms existing KGC models on the standard closed-world KGC task. …


Turn Up the Signal; Turn Off the Noise

To thoroughly, accurately, and clearly inform, we must identify the intended signal and then boost it while eliminating as much noise as possible. This certainly applies to data visualization, which unfortunately lends itself to a great deal of noise if we’re not careful and skilled. The signal in a stream of content is the intended message, the information we want people to understand. Noise is everything that isn’t signal, with one exception: non-signal content that somehow manages to boost the signal without compromising it in any way is not noise. For example, if we add nonessential elements or attributes to a data visualization to draw the reader’s attention to the message, thus boosting it, without reducing or altering the message in any way, we haven’t introduced noise. No accurate item of data, in and of itself, always qualifies either as a signal or noise. It always depends on the circumstances.

In physics, the signal-to-noise ratio, which is where the concept originated, is an expression of odds: the ratio of one possible outcome to another. When comparing signal to noise, we want the odds to dramatically favor the signal. Which odds qualify as favorable varies, depending on the situation. When communicating information to someone, a signal-to-noise ratio of 99 to 1 would usually be considered favorable. When hoping to get into a particular college, however, 3-to-1 odds might be considered favorable, but those odds would be dreadful in communication, for it would mean that 25% of the content was noise. Another ratio that is common in data communication, a probability ratio, is related to an odds ratio. Rather than comparing one outcome to another as we do with odds, however, a probability ratio compares a particular outcome to the total of all outcomes. For example, a probability ratio of 85 out of 100 (i.e., the outcome of interest will occur 85% of the time on average), is the mathematical equivalent of 85-to-15 odds. When Edward Tufte introduced the concept of the data-ink ratio back in the 1980s, he proposed a probability ratio rather than an odds ratio. He argued that the percentage of ink in a chart that displays data, when compared to the total ink, should be as close to 100% as possible.
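The conversions described above are one-liners; a small sketch (Python, with the article's own illustrative numbers):

```python
def odds_to_prob(a, b):
    # "a-to-b odds" expressed as a probability ratio: a out of (a + b)
    return a / (a + b)

def prob_to_odds(p):
    # probability p expressed as "x-to-1" odds
    return p / (1 - p)

print(odds_to_prob(85, 15))          # 0.85: 85-to-15 odds = 85 out of 100
print(round(prob_to_odds(0.99), 1))  # 99.0: a 99-to-1 signal-to-noise ratio
```

The two forms carry the same information; what matters, as the article argues, is how strongly either one favors the signal.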

Every choice that we make when creating a data visualization seeks to optimize the signal-to-noise ratio. We could argue that the signal-to-noise ratio is the most essential consideration in data visualization—the fundamental guide for all design decisions while creating a data visualization and the fundamental measure of success once it’s out there in the world.

It’s worth noting that particular content doesn’t qualify as noise simply because it’s inconvenient. Earlier, I said that a signal is the intended message, but let me qualify this further by pointing out that this assumes the message is truthful. In fact, the message itself is noise to the degree that it communicates misinformation, even if that misinformation is intentional. I’ve seen many examples of data visualizations that left out or misrepresented vital information because a clear understanding of the truth wasn’t the designer’s objective. I’ve also witnessed occasions when highly manipulated data replaced the actual data because it told a more convenient story—one that better supported an agenda. For example, a research paper that claims a strong relationship between two variables might refrain from revealing the actual data on which those claims were supposedly based in favor of a statistical model that replaced a great deal of volatility and uncertainty in the relationship, which could be seen in the actual data, with a perfectly smooth and seemingly certain portrayal of that relationship. On occasions when I’ve questioned researchers about this, I’ve been told that the volatility in the actual data was “just noise,” so they removed it. While they might argue that their smooth model illustrates the relationship in a simpler manner, I would argue that it over-simplifies the relationship if they only report the model without also revealing the actual data on which it was based. Seeing the actual data as well helps us keep in mind that statistical models are estimates, built on assumptions, which are never entirely true.

So, to recap, noise in communication, including data visualization, is content that isn’t part of and doesn’t support the intended message or content that isn’t truthful. Turn up the signal; turn off the noise.


Familiarisation with the Australian Election Study by @ellis2013nz

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

The Australian Election Study is an impressive long term research project that has collected the attitudes and behaviours of a sample of individual voters after each Australian federal election since 1987. All the datasets and documentation are freely available. An individual survey of this sort is a vital complement to the necessarily aggregate results provided by the Australian Electoral Commission, and is essential for coming closer to understanding political and social change. Many countries have a comparable data source.

Many thanks to the original researchers:

McAllister, Ian; Makkai, Toni; Bean, Clive; Gibson, Rachel Kay, 2017, “Australian Election Study, 2016”, doi:10.4225/87/7OZCZA, ADA Dataverse, V2, UNF:6:TNnUHDn0ZNSlIM94TQphWw==; 1. Australian Election Study December 2016 Technical Report.pdf

In this blog post I just scratch the surface of the latest available data, from the time of the 2016 federal election. In later posts I’ll get into modelling it a bit deeper, and hopefully get a chance to look at some changes over time. Some of the code I use in this post for preparing the data might find its way into the ozfedelect R package, but not the data itself.

Getting the data

The data need to be downloaded by hand from the Australian Data Archive via the Dataverse (an international open source research data repository) in order to agree to the terms and conditions of use. The following code requires the SPSS version of the data and the Excel version of the data dictionary to have been saved in an ./aes/ subfolder of the working directory.

The SPSS data has column names truncated at eight characters long, which means that some of the variables in the data don’t match those in the data dictionary. The code below makes the necessary manual adjustments to variable names from the dictionary to deal with this, and re-classes the categorical responses as factors with the correct level labels in the correct order. I do this by importing and joining to the data dictionary rather than extracting the attributes information from the SPSS import, which might have also worked; I prefer my manual approach so I can deal with some anomalies explicitly (for example, many of the values listed as 999 “skipped” in the data dictionary are NA in the SPSS data).

Post continues after code extract



#---------------------Import and tidy data-----------------
# Download by hand from

library(rio)        # import() reads the SPSS .sav file
library(readxl)     # read_excel() reads the codebook
library(tidyverse)

aes2016_orig <- import("aes/2. Australian Election Study, 2016.sav")
aes2016_code_orig <- read_excel("aes/1. Australian Election Study, 2016 - Codebook.xlsx",
                                sheet = "Data Dictionary")

aes2016_code <- aes2016_code_orig %>%
  rename(question_label = Label...3,
         answer_label = Label...5) %>%
  rename_all(tolower) %>%
  fill(variable, position, question_label) %>%
  mutate(var_abb = substring(variable, 1, 8)) %>%
  mutate(var_abb = case_when(
    var_abb == "H19_STAT" ~ "H19STATE",
    var_abb == "H19_PCOD" ~ "H19pcoRV",
    var_abb == "H20_othe" ~ "H20_oth",
    var_abb == "H25_othe" ~ "H25_oth",
    var_abb == "Final_ST" ~ "finSTATE",
    var_abb == "Samp_STA" ~ "SamSTATE",
    var_abb == "detailed" ~ "doutcome",
    var_abb == "Responde" ~ "responID",
    var_abb == "Start_Ti" ~ "Sta_time",
    var_abb == "Samp_PCO" ~ "SampcoRV",
    var_abb == "Final_PC" ~ "finpcoRV",
    var_abb == "Total_Du" ~ "totaldur",
    TRUE ~ var_abb
  )) %>%
  # add a state variable present in the data but not the dictionary
  rbind(tibble(
    variable = "StateMap", position = NA, question_label = "Unidentified state variable",
    value = 1:8,
    answer_label = c("New South Wales", "Victoria", "Queensland", "South Australia", "Western Australia",
                     "Tasmania", "Northern Territory", "Australian Capital Territory"),
    var_abb = "StateMap"
  ))
aes2016_questions <- distinct(aes2016_code, var_abb, position, question_label)

aes2016_answers <- aes2016_code %>%
  select(var_abb, variable, value, answer_label) %>%
  filter(!

# Check all the names in the data now match those in the data dictionary:  
names(aes2016_orig)[!names(aes2016_orig) %in% aes2016_questions$var_abb]

# ... and vice versa:
unique(aes2016_questions$var_abb)[!unique(aes2016_questions$var_abb) %in% names(aes2016_orig)]

aes2016 <- aes2016_orig

attributes(aes2016)$question_labels <- names(aes2016)

for(i in 1:ncol(aes2016)){
  this_q <- filter(aes2016_questions, var_abb == names(aes2016)[i])
  # Sometimes a code like 999 'Skipped' is not present in the data but has already
  # been replaced with an NA, so we don't want it in our factor levels. So we need
  # to find out what answers are actually used in the i'th column
  used_a <- unique(pull(aes2016, i))
  # the answer labels for this particular question, sorted so they line up
  # with the sorted values that factor() will use:
  these_a <- aes2016_code %>%
    filter(var_abb == names(aes2016)[i]) %>%
    filter(value %in% used_a) %>%
    arrange(value)
  attributes(aes2016)$question_labels[i] <- pull(this_q, question_label)
  # half a dozen open text questions don't match to the data dictionary so we exclude those:
  if(nrow(these_a) > 0 & 
     !pull(this_q, question_label) %in% c("What kind of work do you do? Full job title",
                                          "What are (or were) the main tasks that you usually perform(ed)?",
                                          "What kind of business or industry is (or was) that in?",
                                          "What was your partner's main activity last week? (other specify)",
                                          "What kind of work does (or did) your partner do? Provide job title",
                                          "Generally speaking, does your partner usually think of himself or herself as Liberal, Labor, National or what? (other specify)")){
    # for the rest, we turn the response into a factor with the correct labels
    aes2016[ , i] <- factor(pull(aes2016, i), labels = pull(these_a, answer_label))  
  }
}

aes2016 <- aes2016 %>%
  mutate(age_2016 = 2016 - H2,
         agegrp_2016 = cut(age_2016, 
                           breaks = c(0, 17.5, 24.5, 34.5, 44.5, 54.5, 64.5, Inf),
                           labels = c("0-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+")),
         agegrp_2016 = as.factor(replace_na(as.character(agegrp_2016), "Age unknown")),
         sex = replace_na(as.character(H1), "Sex unknown"),
         first_pref_hr_grp = case_when(
           B9_1 %in% c("Liberal Party", "National (Country) Party") ~ "Coalition",
           B9_1 == "Labor Party (ALP)"                              ~ "Labor",
           B9_1 == "Greens"                                         ~ "Greens",
           # a missing response is treated as not having voted:
           B9_1 == "Voted informal" |                      ~ "Did not vote/voted informal",
           TRUE                                                     ~ "Other"
         ),
         first_pref_sen_grp = case_when(
           B9_2 %in% c("Liberal Party", "National (Country) Party") ~ "Coalition",
           B9_2 == "Labor Party (ALP)"                              ~ "Labor",
           B9_2 == "Greens"                                         ~ "Greens",
           B9_2 == "Voted informal" |                      ~ "Did not vote/voted informal",
           TRUE                                                     ~ "Other"
         ),
         first_pref_sen_grp2 = case_when(
           B9_2 == "Liberal Party"                                  ~ "Liberal",
           B9_2 == "National (Country) Party"                       ~ "National",
           B9_2 == "One Nation"                                     ~ "One Nation",
           B9_2 == "Labor Party (ALP)"                              ~ "Labor",
           B9_2 == "Greens"                                         ~ "Greens",
           TRUE                                                     ~ "Other (including did not vote)"
         )) %>%
  mutate(first_pref_sen_grp2 = toupper(first_pref_sen_grp2),
         first_pref_sen_grp2 = ifelse(grepl("^OTHER", first_pref_sen_grp2),
                                      "OTHER (incl. did not vote)",
                                      first_pref_sen_grp2)) %>%
  mutate(first_pref_sen_grp2 = fct_relevel(first_pref_sen_grp2,
                                           toupper(c("Greens", "Labor", "Liberal",
                                                     "National", "One Nation")))) %>%
  mutate(tpp_alp = B11_1  %in% "Labor Party (ALP)" | B9_1 %in% "Labor Party (ALP)")

The last few lines of that code also define some summarised variables that lump various parties or non-votes together; these will be useful in later analysis.

Survey design and weights

The AES in 2016 used two separate sampling frames: one provided by the Australian Electoral Commission, and one from the Geo-coded National Address File or GNAF, a large, authoritative open data source of every address in Australia. For inference about Australian adults on the election roll (enrolment is required of all resident adult citizens by legislation), the AES team recommend using a combination of the two samples, and weights are provided in the wt_enrol column to do so.

The technical report describes data collection as mixed-mode (hard copy and online versions), with a solid system of managing contacts, reminders and non-returns.

The AEC sample is stratified by state, and the GNAF sample by Greater Capital City (GCC) statistical areas (which divide each state into two, eg “Greater Sydney” and “Rest of New South Wales”). As far as I can see, no GCC variable is provided for end-use analysts such as myself. Nor are the various postcode variables populated, and I can’t see a variable for electorate/division (presumably for statistical disclosure control purposes), which is a shame as it would be an important variable for a variety of multi-level analyses. So we have to use state as our best regional variable.

Post-stratification weighting (raking) was used by the AES to match the population proportions by sex, age group, state, and first preference vote in the House of Representatives. This latter calibration is particularly important to ensure that analysis gives appropriate weight to the supporters of the various parties; it is highly likely that non-response is related to political views (or non-views).

The weights have been scaled to add up to the sample size, rather than to population numbers (as would be the approach of eg a national statistical office). This is presumably out of deference to SPSS’ behaviour of treating survey weights as frequencies, which leads to appalling results if they add up to anything other than the sample size (and still sub-optimal results even then).
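The choice of scale matters only for software that misreads weights as frequencies: rescaling all the weights by a constant leaves weighted means and proportions unchanged, and only estimated population totals differ. A quick sketch (with made-up weights and a placeholder population total, not the actual number of enrolled Australians):

```r
set.seed(123)
wt <- runif(10)
wt_sample <- wt / sum(wt) * 10      # scaled to sum to the sample size
wt_pop    <- wt / sum(wt) * 16e6    # scaled to a (placeholder) population total

x <- rnorm(10)
# the weighted mean is invariant to the overall scaling of the weights:
isTRUE(all.equal(weighted.mean(x, wt_sample), weighted.mean(x, wt_pop)))
```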

Table 16 of the technical report is meant to provide the weighting benchmarks for the post-stratification weighting, but the state, age and sex totals given are only for the GNAF sample; and the benchmarks given for which party voted for appear to be simply incorrect in a way I cannot determine. They add up to much more than either the AEC or GNAF sample, but much less than their combined total; and I can’t match their results with any of the plausible combinations of filters I’ve tried. I’d be happy for anyone to point out if I’m doing something wrong, but my working hypothesis is that Table 16 simply contains some mistakes, whereas the weights in the actual data are correct.

The code below explores the three sets of weights provided, and defines a survey design object (using Thomas Lumley’s survey package) that treats state as the strata for sampling, and sex, state, age and House of Representatives vote as post-stratification variables. Although I won’t make much use of the survey package today, I’ll almost certainly want it for more sophisticated analysis in a later post.

Post continues after code extract

# The sample has two parts - addresses supplied by the AEC, and addresses from 
# the GNAF (Geocoded National Address File). There are separate weights for each,
# so they can be analysed as two separate surveys or you can use the combined set of weights:
table(aes2016$wt_aec > 0, useNA = 'always')    # for comparison to previous waves of AES
table(aes2016$wt_gnaf > 0, useNA = 'always')   # inference about Australian adults
table(aes2016$wt_enrol > 0, useNA = 'always')  # inference about enrolled Australian adults
# There are 107 cases from the GNAF sample that have zero weight in the final combined sample:
filter(aes2016, wt_enrol <= 0) %>% select(wt_aec, wt_gnaf, wt_enrol)

# wt_enrol had been raked for age group, sex, state and first preference vote. See pages 32-33
# of the technical guide.

# These three state variables seem identical, and are all different from those in Table 16
# of the technical report (the three candidates - finSTATE, SamSTATE and StateMap -
# are an inference on my part):
table(aes2016$finSTATE)
table(aes2016$SamSTATE)
table(aes2016$StateMap)
# In fact, Table 16 has the state, age and sex only of the GNAF sample, not the whole sample:
table(filter(aes2016, wt_gnaf >0)$finSTATE)
table(filter(aes2016, wt_gnaf >0)$agegrp_2016)
table(filter(aes2016, wt_gnaf >0)$H1)

# The first pref party vote in Table 16 looks to be just wrong. The results below match those
# in the code book, but not in Table 16
table(aes2016$first_pref_hr_grp, useNA = 'always')
table(aes2016$first_pref_sen_grp, useNA = 'always')

# Well, we'll assume that wt_enrol has actually been weighted as described, and ignore
# the problems in Table 16.

# Set up survey design:
aes2016_svy <- svydesign(~1, strata = ~SamSTATE, weights = ~wt_enrol, data = aes2016)

# Deduce population totals from the sums of weights (if weights were raked to match pop totals,
# then the sum of the weights should reveal what they were)
age_pop <- aes2016 %>%
  group_by(agegrp_2016) %>%
  summarise(Freq = sum(wt_enrol))

sex_pop <- aes2016 %>%
  group_by(sex) %>%
  summarise(Freq = sum(wt_enrol))

state_pop <- aes2016 %>%
  group_by(SamSTATE) %>%
  summarise(Freq = sum(wt_enrol))

aes2016_svy <- rake(aes2016_svy, sample.margins = list(~sex, ~agegrp_2016, ~SamSTATE),
                    population.margins = list(sex_pop, age_pop, state_pop))

# Compare the weighted and unweighted response to a survey question
filter(aes2016_questions, var_abb == "H14_2")
svytable(~H14_2, aes2016_svy)
# Many more "yes" answers to posting on the internet once weighted. So people
# active on the internet needed extra weight (probably younger respondents)

Analysis – vote and one question

Now we’re ready for some exploratory analysis. In Australia, the Senate is elected by proportional representation, a wider range of small parties stand candidates, and the first preference vote for that house is less subject to gaming by voters than is the single transferable vote system in the House of Representatives. So I prefer to use Senate vote when examining the range of political opinion, particularly of supporters of the smaller parties. Here is one of the graphics that I love during exploration but which I despair of using for communication purposes – a mosaic plot, which beautifully visualises a contingency table of counts (or, in this case, weighted counts). This particular graphic shows the views of people who voted for different parties in the Senate on the trade-off between reducing taxes and spending more on social services:

As any casual observer of Australian politics would have predicted, disproportionately large numbers (coloured blue to show the positive discrepancy from the no-interaction null hypothesis) of Greens voters (and to a lesser extent Labor) are mildly or strongly in favour of spending more on social services. Correspondingly, these parties’ supporters have relatively few people counted in the top left corner (coloured red to show a negative discrepancy from the null hypothesis) supporting reducing tax as the priority.

In contrast, there are relatively few Liberal voters in the social services camp, and disproportionately high numbers who favour reducing taxes.

The “other” voters – which includes those who did not vote, could not remember, or did not vote for one of the five most prominent parties – are disproportionately likely to sit on the fence or not have a view on the trade-off between social services and tax cuts, which again is consistent with expectations.

What about a different angle – what influences people when they decide their vote?

We see that Greens voters decided their vote disproportionately on the basis of “the policy issues”. Liberal voters have fewer people in the “policy” box than would be expected under the no-interaction null hypothesis and were more likely to decide based on either “the party leaders” or “the parties taken as a whole”. National voters stand out as the only party with a noticeably disproportionate count (high, in their case) in the “candidates in your electorate” box, emphasising the known spatially-specific nature of the Nationals’ support.

Here’s the code for drawing those mosaic plots, using the vcd R package that implements some useful ways of visualising categorical data.

Post continues after code extract

#======================selected bivariate============

#------------------tax/social services-------------
x <- svytable(~E1 + first_pref_sen_grp2, aes2016_svy)
x <- round(x, 2)
colnames(x)[grepl("OTHER", colnames(x))] <- "OTHER"
rownames(x) <- paste0(str_wrap(rownames(x), 20))

mosaic(x, shade = TRUE, border = "grey70", direction = "v", 
       legend = legend_resbased(10, fontfamily = "Roboto", pvalue = FALSE),
       xlab = "x", ylab = "y", 
       labeling = labeling_values(
         suppress = c(0, 1000),
         gp_labels = gpar(fontfamily = "Roboto", cex = 0.8),
         gp_varnames = gpar(col = "transparent"),
         just_labels = c("center", "right"),
         alternate_labels = c(TRUE, FALSE),
         rot_labels = c(0,90,0,0),
         offset_labels = c(1,0,1.5,0),
         digits = 0
       ))

#--------------how decide vote------------
x <- svytable(~B5 + first_pref_sen_grp2, aes2016_svy)
x <- round(x, 2)
colnames(x)[grepl("OTHER", colnames(x))] <- "OTHER"
rownames(x) <- paste0(str_wrap(rownames(x), 20))

mosaic(x, shade = TRUE, border = "grey70", direction = "v", 
       legend = legend_resbased(10, fontfamily = "Roboto", pvalue = FALSE),
       xlab = "x", ylab = "y", 
       labeling = labeling_values(
         suppress = c(0, 1000),
         gp_labels = gpar(fontfamily = "Roboto", cex = 0.8),
         gp_varnames = gpar(col = "transparent"),
         just_labels = c("center", "right"),
         alternate_labels = c(TRUE, FALSE),
         rot_labels = c(0,90,0,0),
         offset_labels = c(1,0,1.5,0),
         digits = 0
       ))

Analysis – vote and a battery of questions

Like many social science surveys, the AES contains a number of batteries of questions with similar response sets. An advantage of this, both for an analyst and for the consumer of their results, is that we can set up a common approach to comparing the responses to those questions. Here is a selection of diverging Likert plots depicting interesting results from these questions:

Above, we see that Greens voters disproportionately think that Australia hasn’t gone far enough in allowing more migrants into the country, and One Nation voters unsurprisingly tend to think the opposite. Views on change in Australia relating to Aboriginal assistance and land rights follow a similar pattern. Interestingly, there is less partisan difference on some other issues such as “equal opportunities for women”. Overall, there is a clear pattern of increasing tendency to think that “things have gone too far” as we move across the spectrum (Greens-Labor-Liberal-National-One Nation).

When it comes to attitudes on the left-right economic issues, we see subtle but expected party differences (chart above). Liberal supporters are particularly inclined to think trade unions have too much power and should be regulated; but to disagree with the statement that income and wealth should be distributed towards ordinary working people.

In the social issues chart above, we see the Greens’ voters strongly disagreeing with turning back boats of asylum seekers, and reintroducing the death penalty – issues where One Nation voters tend to agree with the statement. Again, the clear sequencing of parties’ supporters (Greens-Labor-Liberal-National-One Nation) is most obvious on these social/cultural touchpoint issues, in contrast to the left-right economic debate.

In the above, we see high levels of confidence in the armed forces, police and universities across all parties’ supporters, and distrust in the press, television (what does it even mean to trust the television?) and political parties. Differences by party in trust in institutions vary along predictable lines, eg voters for the Liberal party tend to have less confidence in unions. One Nation voters have particularly low trust in the Australian political system, consistent with the mainstream interpretation in political science of the rise of socially conservative populist parties of this sort.

Attitudes to public spend are fairly consistent across parties, with subtle but unsurprising differences in a few value-based touch points such as defence (Greens favour spending less) and unemployment benefits (Greens favour spending more but the three conservative parties favour spending less; Labor voters are in the middle).

Finally, when it comes to factual knowledge of election law and constitutional issues, there is little to say from a party perspective. Many people don’t know the maximum period between federal elections for the lower house (the correct answer is three years); and doubtless neither know nor care about the need to pay a deposit to stand for election. But there is no obvious difference by party voted for.

In summary, on many of the key social variables of the day, we see the expected greatest contrast between supporters of the Greens on one hand and One Nation on the other, with the ALP, Liberal and National voters strung out between. Economic issues indicate another but closely related dimension, with One Nation closer to the Greens than they are to the other socially conservative parties on statements such as “big business in this country has too much power”, “the government should take measures to reduce differences in income levels” and “trade unions in this country have too much power”.

Here’s the code for drawing those diverging Likert plots. I’m not particularly proud of it; it’s a bit repetitive (I hadn’t realised I’d be doing six of these), but it gets the job done. There are some specialist R packages for drawing these charts, but I prefer (at this point) to build them with the standard grammar-of-graphics tools in ggplot2 rather than learn a new, chart-specific API.

Post continues after code extract

#=================Batteries of similar questions========================

svytable(~ F7 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE)   # Immigration
svytable(~ F6_4 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE) # war on terror
svytable(~ F2 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE)   # Republic
svytable(~ E9_13 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE)   # trust universities
svytable(~ E9_14 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE)   # trust political system

#-----------Confidence in institutions----------------------
d1 <- aes2016 %>%
  select(E9_1:E9_14, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(institution = gsub("How much .+\\? ", "", question_label)) %>%
  filter(! %>%
  group_by(variable, first_pref_sen_grp2, institution) %>%
  mutate(negative = value %in% c("Not very much confidence", "None at all"),
         prop = freq/sum(freq) * ifelse(negative, -1, 1)) %>%
  mutate(value = factor(value, levels = c("None at all", 
                                          "Not very much confidence",
                                          "A great deal of confidence",
                                          "Quite a lot of confidence"))) %>%
  ungroup() %>%
  mutate(institution = fct_reorder(institution, prop, .fun = sum))

pal <- brewer.pal(4, "RdYlGn")
names(pal) <- levels(d1$value)[c(1,2,4,3)]

d1 %>%
  ggplot(aes(x = institution, fill = value, weight = prop)) +
  geom_bar(data = filter(d1, negative), position = "stack") +
  geom_bar(data = filter(d1, !negative), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("",
                    values = pal,
                    guide = guide_legend(reverse = FALSE),
                    breaks = c("None at all", 
                                  "Not very much confidence",
                                  "Quite a lot of confidence",
                                  "A great deal of confidence")) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1), labels = c("100%", "50%", "0", "50%", "100%")) + 
  expand_limits(y = c(-1, 1)) +
  theme(panel.spacing = unit(1.5, "lines"))+
  ggtitle("Attitudes to institutions, by first preference Senate vote in 2016",
          "How much confidence do you have in the following organisation? ...") +
  labs(x = "", y = "", caption = "Source: Australian Election Study 2016; analysis by")

#-----------more or less expenditure than now----------------------
d2 <- aes2016 %>%
  select(D8_1:D8_10, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Should there be .+\\? ", "", question_label)) %>%
  filter(! %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("Much less than now", "Somewhat less than now"),
         positive = value %in% c("Much more than now", "Somewhat more than now"),
         same = !negative & !positive,
         prop = freq/sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("Much less than now", 
                                          "Somewhat less than now",
                                          "Much more than now",
                                          "Somewhat more than now",
                                          "The same as now"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(5, "RdYlGn")
names(pal) <- levels(d2$value)[c(1,2,5,4,3)]

d2a <- d2 %>% 
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d2 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d2a, position = "stack") +
  geom_bar(data = filter(d2,positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("",
                    values = pal,
                    guide = guide_legend(reverse = FALSE),
                    breaks = levels(d2$value)[c(1,2,5,4,3)]) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1), labels = c("100%", "50%", "0", "50%", "100%")) + 
  expand_limits(y = c(-1, 1)) +
  theme(panel.spacing = unit(1.5, "lines"))+
  ggtitle("Attitudes to public spend, by first preference Senate vote in 2016",
          "Should there be more or less public expenditure in the following area? ...") +
  labs(x = "", y = "", caption = "Source: Australian Election Study 2016; analysis by")

#----------change gone far enough----------
d3 <- aes2016 %>%
  select(E2_1:E2_7, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Do you think .+\\? ", "", question_label)) %>%
  filter(! %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("Gone much too far", "Gone too far"),
         positive = value %in% c("Not gone nearly far enough", "Not gone far enough"),
         same = !negative & !positive,
         prop = freq/sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("Gone much too far", 
                                          "Gone too far",
                                          "Not gone nearly far enough",
                                          "Not gone far enough",
                                          "About right"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(5, "RdYlGn")
names(pal) <- levels(d3$value)[c(1,2,5,4,3)]

d3a <- d3 %>% 
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d3 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d3a, position = "stack") +
  geom_bar(data = filter(d3, positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("",
                    values = pal,
                    guide = guide_legend(reverse = FALSE),
                    breaks = levels(d3$value)[c(1,2,5,4,3)]) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1), labels = c("100%", "50%", "0", "50%", "100%")) + 
  theme(panel.spacing = unit(1.5, "lines")) +
  expand_limits(y = c(-1, 1)) +
  ggtitle("Attitudes to change, by first preference Senate vote in 2016",
          "Do you think the following change that has been happening in Australia over the years has gone...?") +
  labs(x = "", y = "", caption = "Source: Australian Election Study 2016; analysis by")

#----------agree with various economic statements----------
d4 <- aes2016 %>%
  select(D13_1:D13_6, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Do you strongly .+\\? ", "", question_label)) %>%
  filter(! %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("Disagree", "Strongly disagree"),
         positive = value %in% c("Agree", "Strongly agree"),
         same = !negative & !positive,
         prop = freq/sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("Strongly disagree", 
                                          "Disagree",
                                          "Strongly agree",
                                          "Agree",
                                          "Neither agree nor disagree"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(5, "RdYlGn")
names(pal) <- levels(d4$value)[c(1,2,5,4,3)]

d4a <- d4 %>% 
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d4 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d4a, position = "stack") +
  geom_bar(data = filter(d4, positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("",
                    values = pal,
                    guide = guide_legend(reverse = FALSE),
                    breaks = levels(d4$value)[c(1,2,5,4,3)]) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1), labels = c("100%", "50%", "0", "50%", "100%")) + 
  theme(panel.spacing = unit(1.5, "lines")) +
  expand_limits(y = c(-1, 1)) +
  ggtitle("Attitudes to left-right economic issues, by first preference Senate vote in 2016",
          "Do you strongly agree ... or strongly disagree with the following statement?") +
  labs(x = "", y = "", caption = "Source: Australian Election Study 2016; analysis by")

#----------agree with various social statements----------
d5 <- aes2016 %>%
  select(E6_1:E6_7, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Do you strongly .+\\? ", "", question_label)) %>%
  filter(! %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("Disagree", "Strongly disagree"),
         positive = value %in% c("Agree", "Strongly agree"),
         same = !negative & !positive,
         prop = freq/sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("Strongly disagree", 
                                          "Disagree",
                                          "Strongly agree",
                                          "Agree",
                                          "Neither agree nor disagree"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(5, "RdYlGn")
names(pal) <- levels(d5$value)[c(1,2,5,4,3)]

d5a <- d5 %>% 
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d5 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d5a, position = "stack") +
  geom_bar(data = filter(d5, positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("",
                    values = pal,
                    guide = guide_legend(reverse = FALSE),
                    breaks = levels(d5$value)[c(1,2,5,4,3)]) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1), labels = c("100%", "50%", "0", "50%", "100%")) + 
  theme(panel.spacing = unit(1.5, "lines")) +
  expand_limits(y = c(-1, 1)) +
  ggtitle("Attitudes to liberal-conservative social issues, by first preference Senate vote in 2016",
          "Do you strongly agree ... or strongly disagree with the following statement?") +
  labs(x = "", y = "", caption = "Source: Australian Election Study 2016; analysis by")

#----------constitutional knowledge----------
d6 <- aes2016 %>%
  select(F10_1:F10_6, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Do you think .+\\? ", "", question_label),
         item = str_wrap(item, 50)) %>%
  filter(! %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("False"),
         positive = value %in% c("True"),
         same = !negative & !positive,
         prop = freq/sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("False", 
                                          "True",
                                          "Don't know"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(3, "RdYlGn")
names(pal) <- levels(d6$value)[c(1,3,2)]

d6a <- d6 %>% 
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d6 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d6a, position = "stack") +
  geom_bar(data = filter(d6, positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("",
                    values = pal,
                    guide = guide_legend(reverse = FALSE),
                    breaks = names(pal)) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1), labels = c("100%", "50%", "0", "50%", "100%")) + 
  theme(panel.spacing = unit(1.5, "lines")) +
  expand_limits(y = c(-1, 1)) +
  ggtitle("Knowledge of constitutional issues",
          "Do you think the following statement is true or false?
(correct answers in order from 'federation in 1901' to '75 members' are True, True, False, False, True, False)") +
  labs(x = "", y = "", caption = "Source: Australian Election Study 2016; analysis by")

Final word

OK, watch this space for more analysis using the AES (2016 and other years), if and when I get time.

Note – I have no affiliation or contact with the Australian Election Study, and the AES researchers bear no responsibility for my analysis or interpretation of their data. Use of the AES data by me is solely at my risk and I indemnify the Australian Data Archive and its host institution, The Australian National University. The Australian Data Archive and its host institution, The Australian National University, are not responsible for the accuracy and completeness of the material supplied. Similarly, if you use my analysis, it is at your risk. Don’t blame me for anything that goes wrong, even if I made a mistake. But it would be nice if you let me know.

thankr::shoulders() %>% 
  mutate(maintainer = str_squish(gsub("<.+>", "", maintainer)),
         maintainer = ifelse(maintainer == "R-core", "R Core Team", maintainer)) %>%
  group_by(maintainer) %>%
  summarise(`Number packages` = sum(no_packages),
            packages = paste(packages, collapse = ", ")) %>%
  arrange(desc(`Number packages`)) %>%
  knitr::kable()
maintainer Number packages packages
Hadley Wickham 17 assertthat, dplyr, ellipsis, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, testthat, tidyr, tidyverse
R Core Team 13 base, compiler, datasets, graphics, grDevices, grid, methods, splines, stats, tools, utils, nlme, foreign
Gábor Csárdi 4 cli, crayon, pkgconfig, zip
Kirill Müller 4 DBI, hms, pillar, tibble
Lionel Henry 4 purrr, rlang, svglite, tidyselect
Winston Chang 4 extrafont, extrafontdb, R6, Rttf2pt1
Yihui Xie 4 evaluate, knitr, rmarkdown, xfun
Achim Zeileis 3 colorspace, lmtest, zoo
Jim Hester 3 glue, withr, readr
Yixuan Qiu 3 showtext, showtextdb, sysfonts
Dirk Eddelbuettel 2 digest, Rcpp
Jennifer Bryan 2 readxl, cellranger
Jeroen Ooms 2 curl, jsonlite
Simon Urbanek 2 Cairo, audio
Thomas Lumley 1 survey
Alex Hayes 1 broom
Alexander Walker 1 openxlsx
Brian Ripley 1 MASS
Brodie Gaslam 1 fansi
Charlotte Wickham 1 munsell
David Gohel 1 gdtools
David Meyer 1 vcd
Deepayan Sarkar 1 lattice
Erich Neuwirth 1 RColorBrewer
James Hester 1 xml2
Jeremy Stephens 1 yaml
Joe Cheng 1 htmltools
Justin Talbot 1 labeling
Kamil Slowikowski 1 ggrepel
Kevin Ushey 1 rstudioapi
Marek Gagolewski 1 stringi
Martin Maechler 1 Matrix
Matt Dowle 1 data.table
Max Kuhn 1 generics
Michel Lang 1 backports
Patrick O. Perry 1 utf8
Peter Ellis 1 frs
Rasmus Bååth 1 beepr
Simon Garnier 1 viridisLite
Stefan Milton Bache 1 magrittr
Terry M Therneau 1 survival
Thomas J. Leeper 1 rio
Vitalie Spinu 1 lubridate

To leave a comment for the author, please follow the link and comment on their blog: free range statistics.


No, it's not correct to say that you can be 95% sure that the true value will be in the confidence interval

Hans van Maanen writes:

May I put another statistical question to you?

If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me. My visualisation is that she filled a bowl with 100 intervals, 95 of which do contain the true value and 5 do not, and she picked one at random.
Now, if she gives me two independent 95%-CI’s (e.g., two primary endpoints in a clinical trial), I can only be 90% sure (0.95^2 = 0.9025) that they both contain the true value. If I have a table with four measurements and 95%-CI’s, there’s only an 81% chance they all contain the true value.

Also, if we have two results and we want to be 95% sure both intervals contain the true values, we should construct two 97.5%-CI’s (0.95^(1/2) = 0.9747), and if we want to have 95% confidence in four results, we need 98.7%-CI’s (0.95^(1/4) = 0.9873).

I’ve read quite a few texts trying to get my head around confidence intervals, but I don’t remember seeing this discussed anywhere. So am I completely off, is this a well-known issue, or have I just invented the Van Maanen Correction for Multiple Confidence Intervals? ;-))

I hope you have time for an answer. It puzzles me!

My reply:

Sure, I can help you, but in English:

1. “If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me.” Not quite true. Yes, true on average, but not necessarily true in any individual case. Some intervals are clearly wrong. Here’s the point: even if you picked an interval at random from the bowl, once you see the interval you have additional information. Sometimes the entire interval is implausible, suggesting that it’s likely that you happened to have picked one of the bad intervals in the bowl. Other times, the interval contains the entire range of plausible values, suggesting that you’re almost completely sure that you have picked one of the good intervals in the bowl. This can especially happen if your study is noisy and the sample size is small. For example, suppose you’re trying to estimate the difference in proportion of girl births, comparing two different groups of parents (for example, beautiful parents and ugly parents). You decide to conduct a study of N=400 births, with 200 in each group. Your estimate will be p2 – p1, with standard error sqrt(0.5^2/200 + 0.5^2/200) = 0.05, so your 95% conf interval will be p2 – p1 +/- 0.10. We happen to be pretty sure that any true population difference will be less than 0.01 (see here), hence if p2 – p1 is between -0.09 and +0.09, we can be pretty sure that our 95% interval does contain the true value. Conversely, if p2 – p1 is less than -0.11 or more than +0.11, then we can be pretty sure that our interval does not contain the true value. Thus, once we see the interval, it’s no longer generally a correct statement to say that you can be 95% sure the interval contains the true value.

2. Regarding your question: I don’t really think it makes sense to want 95% confidence in four results. It makes more sense to accept that our inferences are uncertain; we should not demand, or act as if, they will all be correct.
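
The arithmetic in the exchange above is easy to check. Here is a quick Python sanity check (my addition, not part of the original correspondence):

```python
import math

# Joint coverage of k independent 95% intervals, and the per-interval
# level needed so all k cover simultaneously with probability 0.95.
for k in (2, 4):
    print(k, round(0.95**k, 4), round(0.95**(1 / k), 4))
# 2 0.9025 0.9747
# 4 0.8145 0.9873

# Standard error of p2 - p1 with 200 births per group, using the
# conservative 0.5 * 0.5 binomial variance from the reply.
se = math.sqrt(0.5**2 / 200 + 0.5**2 / 200)
print(round(se, 2), round(2 * se, 2))   # 0.05 0.1
```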


FizzBuzz in R and Python

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In this post, we will solve a simple problem (called “FizzBuzz”) that some employers ask in data scientist job interviews. The question seeks to ascertain the applicant’s familiarity with basic programming concepts. We will see two different ways to solve the problem in two programming languages: R and Python.

The FizzBuzz Question

I came across the FizzBuzz question in this excellent blog post on conducting data scientist interviews and have seen it referenced elsewhere on the web. The intent of the question is to probe the job applicant’s knowledge of basic programming concepts.

The prompt of the question is as follows:

In pseudo-code or whatever language you would like: write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.

The Solution in R

We will first solve the problem in R. We will make use of control structures: statements “which result in a choice being made as to which of two or more paths to follow.” This is necessary for us to A) iterate over a number sequence and B) print the correct output for each number in the sequence. We will solve this problem in two slightly different ways.

Solution 1: Using a “for loop”

The classical way of solving this problem is via a “for loop“: we use the loop to explicitly iterate through a sequence of numbers. For each number, we use if-else statements to evaluate which output to print to the console.

The code* looks like this:

for (i in 1:100){
  if(i %% 3 == 0 & i %% 5 == 0) {
    print("FizzBuzz")
  } else if(i %% 3 == 0) {
    print("Fizz")
  } else if (i %% 5 == 0){
    print("Buzz")
  } else {
    print(i)
  }
}


Our code first explicitly defines the number sequence from 1 to 100. For each number in this sequence, we evaluate a series of if-else statements, and based on the results of these evaluations, we print the appropriate output (as defined in the question prompt).

The double percent sign (%%) is the modulo operator, which allows us to determine the remainder following a division. If the remainder is zero (which we test via the double equals sign), the first number is divisible by the second number.

We start with the double evaluation – any other ordering of our if-else statements would cause the program to fail (because numbers which pass the double evaluation by definition also pass the single evaluations). If a given number in the sequence from 1 to 100 is divisible both by 3 and by 5, we print “FizzBuzz” to the console.

We then continue our testing, using “else if” statements. The program checks each one of these statements in turn. If the number fails the double evaluation, we test to see if it is divisible by 3. If this is the case, we print “Fizz” to the console. If the number fails this evaluation, we check to see if it is divisible by 5 (in which case we print “Buzz” to the console). If the number fails each of these evaluations (meaning it is not divisible by 3 or 5), we simply print the number to the console.

Solution 2: Using a function

In general, for loops are discouraged in R. In earlier days, they were much slower than functions, although this seems to be no longer true. The wonderful R programmer Hadley Wickham advocates for a functional approach based on the fact that functions are generally clearer about what they’re trying to accomplish. As he writes: “A for loop conveys that it’s iterating over something, but doesn’t clearly convey a high level goal… [Functions are] tailored for a specific task, so when you recognise the functional you know immediately why it’s being used.” (There are many interesting and strong opinions about using loops vs. functions in R; I won’t get into the details here, but it’s informative to see the various opinions out there.)

The FizzBuzz solution using a function looks like this:

# define the function
fizz_buzz <- function(number_sequence_f){
  if(number_sequence_f %% 3 == 0 & number_sequence_f %% 5 == 0) {
    print("FizzBuzz")
  } else if(number_sequence_f %% 3 == 0) {
    print("Fizz")
  } else if (number_sequence_f %% 5 == 0){
    print("Buzz")
  } else {
    print(number_sequence_f)
  }
}


# apply it to the numbers 1 to 100
sapply(seq(from = 1, to = 100, by = 1), fizz_buzz)

The code above defines a function called “fizz_buzz”. This function takes an input (which I have defined in the function as number_sequence_f), and then passes it through the logical sequence we defined using the “if” statements above.

We then use sapply to apply the FizzBuzz function to a sequence of numbers, which we define directly in the apply statement itself. (The seq command allows us to define a sequence of numbers, and we explicitly state that we want to start at 1 and end at 100, advancing in increments of 1).

This approach is arguably more elegant. We define a function, which is quite clear as to what its goal is. And then we use a single-line command to apply that function to the sequence of numbers we generate with the seq function.

However, note that both solutions produce the output that was required by the question prompt!

The Solution in Python

We will now see how to write these two types of programs in Python. The logic will be the same as that used above, so I will only comment on the slight differences between how the languages implement the central ideas.

Solution 1: Using a “for loop”

The code to solve the FizzBuzz problem looks like this:

for i in range(1, 101):
    if (i % 3 == 0) & (i % 5 == 0):
        print('FizzBuzz')
    elif (i % 3 == 0):
        print('Fizz')
    elif (i % 5 == 0):
        print('Buzz')
    else:
        print(i)

Note that the structure of definition of the “for loop” is different. Whereas R requires {} symbols around all the code in the loop, and for the code inside of each if-else statement, Python uses the colon and indentations to indicate the different parts of the control structure.

I personally find the Python implementation cleaner and easier to read. The indentations are simpler and make the code less cluttered. I get the sense that Python is a true programming language, whereas R’s origins in the statistical community have left it with comparatively more complicated ways to accomplish the same programming goal.

Note also that, if we want to stop at 100, we have to specify that the range in our for loop stops at 101. Python has a different way of defining ranges. The first element in the range indicates where to start, but the last number in the range should be 1 greater than the number you actually want to stop at.
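
A quick demonstration of this half-open convention (my example):

```python
# range(1, 101) yields 1 through 100: the stop value is excluded.
nums = list(range(1, 101))
print(nums[0], nums[-1], len(nums))  # 1 100 100
```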

Solution 2: Using a function

Note that we can also use the functional approach to solve this problem in Python. As we did above, we wrap our control flow in a function (called fizz_buzz), and apply the function to the sequence of numbers from 1 to 100.

There are a number of different ways to do this in Python. Below I use pandas, a tremendously useful Python library for data manipulation and munging, to apply our function to the numbers from 1 to 100. Because the pandas apply function cannot be used on arrays (the element returned by the numpy np.arange), I convert the array to a Series and call the apply function on that.

# import the pandas library
import pandas as pd

# define the function
def fizz_buzz(number_sequence_f):
    if (number_sequence_f % 3 == 0) & (number_sequence_f % 5 == 0):
        print('FizzBuzz')
    elif (number_sequence_f % 3 == 0):
        print('Fizz')
    elif (number_sequence_f % 5 == 0):
        print('Buzz')
    else:
        print(number_sequence_f)

# apply it to the number series
import numpy as np
pd.Series(np.arange(1, 101)).apply(fizz_buzz)

One final bit of programming detail. The for loop defined above calls range, which is actually an iterator, not a series of numbers. Long story short, this is appropriate for the for loop, but as we need to apply our function to an actual series of numbers, we can’t use the range command here. Instead, in order to apply our function to a series of numbers, we use the numpy arange command (which returns an array of numbers). As described above, we convert the array to a pandas Series, which is then passed to our function.
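
Those conversion steps can be sketched as follows (the variable names are mine):

```python
import numpy as np
import pandas as pd

r = range(1, 101)        # an iterator-like range object: no .apply() method
arr = np.arange(1, 101)  # a real array holding the numbers 1..100
s = pd.Series(arr)       # a pandas Series, which does have .apply()

print(type(r).__name__, type(arr).__name__, type(s).__name__)  # range ndarray Series
```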

Does this make sense as an interview question?

As part of a larger interview, I suppose, this question makes sense. A large part of applied analytical work involves manipulating data programmatically, and so getting some sense of how the candidate can use control structures could be informative.

However, this doesn’t really address the biggest challenges I face as a data scientist, which involve case definition and scoping: e.g. how can a data scientist partner with stakeholders to develop an analytical solution that answers a business problem while avoiding unnecessary complexity? The key issues involve A) analytical maturity to know what numerical solutions are likely to give a reasonable solution with the available data, B) business sense to understand the stakeholder problem and translate it into a form that allows a data science solution, and C) a combination of A and B (analytical wisdom and business sense) to identify the use cases for which analytics will have the largest business impact.**

So data science is a complicated affair, and I’m not sure the FizzBuzz question gives an interviewer much sense of the things I see as being important in a job candidate.*** For another interesting opinion on what’s important in applied analytics, I can recommend this excellent blog post which details a number of other important but under-discussed aspects of the data science profession.


In this post, we saw how to solve a data science interview question called “FizzBuzz.” The question requires the applicant to think through (and potentially code) a solution which involves cycling through a series of numbers, and based on certain conditions, printing a specific outcome.

We solved this problem in two different ways in both R and Python. The solutions allowed us to see different programming approaches (for loops and functions) for implementing control flow logic. The for loops are perhaps the most intuitive approach, but potentially more difficult for others to interpret. The functional approach benefits from clarity of purpose, but requires a slightly higher level of abstract thinking to code. The examples here serve as a good illustration of the differences between R and Python code. R has its origins in the statistical community, and its implementation of control structures seems less straightforward than the comparable implementations in Python.

Finally, we examined the value of the FizzBuzz interview question when assessing qualities in a prospective data science candidate. While the FizzBuzz question definitely provides some information, it misses a lot of what (in my view) comprises the critical skill-set that is needed to deliver data science projects in a business setting. 

Coming Up Next

In the next post, we will see how to use R to download data from Fitbit devices, employing a combination of existing R packages and custom API calls.

Stay tuned!

* My go-to html code formatter for R has disappeared! If anyone knows of such a resource, it would be hugely helpful in making the R code on this blog more readable. Please let me know in the comments! (The Python code is converted with tohtml).

** And then, of course, once a proper solution has been developed, it has to be deployed. Can you partner with data engineers and/or other IT stakeholders to integrate your results into business processes? (Notice the importance of collaboration – a critical but seldom-discussed aspect of the data science enterprise).

*** In defense of the above-referenced blog post, the original author is quite nuanced and does not only rely on the FizzBuzz question when interviewing candidates!

To leave a comment for the author, please follow the link and comment on their blog: Method Matters.


Book Memo: “Data Management in Machine Learning Systems”

Large-scale data analytics using machine learning (ML) underpins many modern data-driven applications. ML systems provide means of specifying and executing these ML workloads in an efficient and scalable manner. Data management is at the heart of many ML systems due to data-driven application characteristics, data-centric workload characteristics, and system architectures inspired by classical data management techniques. In this book, we follow this data-centric view of ML systems and aim to provide a comprehensive overview of data management in ML systems for the end-to-end data science or ML lifecycle. We review multiple interconnected lines of work: (1) ML support in database (DB) systems, (2) DB-inspired ML systems, and (3) ML lifecycle systems. Covered topics include: in-database analytics via query generation and user-defined functions, factorized and statistical-relational learning; optimizing compilers for ML workloads; execution strategies and hardware accelerators; data access methods such as compression, partitioning and indexing; resource elasticity and cloud markets; as well as systems for data preparation for ML, model selection, model management, model debugging, and model serving. Given the rapidly evolving field, we strive for a balance between an up-to-date survey of ML systems, an overview of the underlying concepts and techniques, as well as pointers to open research questions. Hence, this book might serve as a starting point for both systems researchers and developers.


Document worth reading: “A Survey of Hierarchy Identification in Social Networks”

Humans are social by nature. Throughout history, people have formed communities and built relationships. Most relationships with coworkers, friends, and family are developed during face-to-face interactions. These relationships are established through explicit means of communications such as words and implicit such as intonation, body language, etc. By analyzing human interactions we can derive information about the relationships and influence among conversation participants. However, with the development of the Internet, people started to communicate through text in online social networks. Interestingly, they brought their communicational habits to the Internet. Many social network users form relationships with each other and establish communities with leaders and followers. Recognizing these hierarchical relationships is an important task because it will help to understand social networks and predict future trends, improve recommendations, better target advertisement, and improve national security by identifying leaders of anonymous terror groups. In this work, I provide an overview of current research in this area and present the state-of-the-art approaches to deal with the problem of identifying hierarchical relationships in social networks.


A Working Mathematician’s Guide to Parsing

Our hero, a mathematician, is writing notes in LaTeX and needs to convert it to a format that her blog platform accepts. She’s used to using dollar sign delimiters for math mode, but her blog requires \( \) and \[ \]. Find-and-replace fails because it doesn’t know about which dollar sign is the start and which is the end. She knows there’s some computer stuff out there that could help, but she doesn’t have the damn time to sort through it all. Paper deadlines, argh!

If you want to skip ahead to a working solution, and you can run basic python scripts, see the Github repository for this post with details on how to run it and what the output looks like. It’s about 30 lines of code, and maybe 10 of those lines are not obvious. Alternatively, copy/paste the code posted inline in the section “First attempt: regular expressions.”

In this post I’ll guide our hero through the world of text manipulation, explain the options for solving a problem like this, and finally explain how to build the program in the repository from scratch. This article assumes you have access to a machine that has basic programming tools pre-installed on it, such as python and perl.

The problem with LaTeX

LaTeX is great, don’t get me wrong, but people who don’t have experience writing computer programs that operate on human input tend to write sloppy LaTeX. They don’t anticipate that they might need to programmatically modify the file because the need was never there before. The fact that many LaTeX compilers are relatively forgiving with syntax errors exacerbates the issue.

The most common way to enter math mode is with dollar signs, as in

Now let $\varepsilon > 0$ and set $\delta = 1/\varepsilon$.

For math equations that must be offset, one often first learns to use double-dollar-signs, as in

First we claim that $$0 \to a \to b \to c \to 0$$ is a short exact sequence

The specific details that make it hard to find and convert from this delimiter type to another are:

  1. Math mode can be broken across lines, but need not be.
  2. A simple search and replace for $ would conflict with $$.
  3. The fact that the start and end are symmetric means a simple search and replace for $$ fails: you can’t tell whether to replace it with \[ or \] without knowing the context of where it occurs in the document.
  4. You can insert a dollar sign in LaTeX using \$ and it will not enter math mode. (I do not solve this problem in my program, but leave it as an exercise to the reader to modify each solution to support this)
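
Points (2) and (3) are easy to see with a one-line experiment. Here is a small Python illustration of the failure (my example, not from the post):

```python
# A blind find-and-replace can't alternate between \( and \),
# because the "$" token itself carries no open/close information.
line = "Now let $x > 0$ and set $y = 1/x$."
print(line.replace("$", "\\("))
# Now let \(x > 0\( and set \(y = 1/x\(.
```

Every dollar sign becomes an opening delimiter, so the closers come out wrong.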

First attempt: regular expressions

The first thing most programmers will tell you when you have a text manipulation problem is to use regular expressions (or regex). Regular expressions are text patterns that a program called a regular expression engine uses to find subsets of text in a document. This can often be with the goal of modifying the matched text somehow, but also just to find places where the text occurs to generate a report.

In their basic form, regular expressions are based on a very clean theory called regular languages, which is a kind of grammar equivalent to “structure that can be recognized by a finite state machine.”

[Aside: some folks prefer to distinguish between regular expressions as implemented by software systems (regex) and regular expressions as a representation of a regular language; as it turns out, features added to regex engines make them strictly stronger than what can be represented by the theory of regular languages. In this post I will use “regex” and “regular expressions” both for practical implementations, because programmers and software don’t talk about the theory, and use “regular languages” for the CS theory concept]

The problem is that practical regular expressions are difficult and nit-picky, especially when there are exceptional cases to consider. Even matching something like a date can require a finicky expression that’s hard for humans to read and debug when they are incorrect. A regular expression for a line in a file that contains a date by itself:

^\s*(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])\s*$
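
For instance, here is that same pattern dropped into Python's re module (my transcription):

```python
import re

# The date-line regex from the text, verbatim.
date_line = re.compile(
    r'^\s*(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])\s*$')

print(bool(date_line.match('2019-04-22')))   # True
print(bool(date_line.match('1999/12/31')))   # True
print(bool(date_line.match('2019-13-01')))   # False: there is no month 13
```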

Worse, regular expressions are rife with “gotchas” having to do with escape characters. For example, parentheses are used for something called “capturing”, so you have to use \( to insert a literal parenthesis. If having to use “\$” in LaTeX bothers you, you’ll hate regular expressions.

Another issue comes from history. There was a time when computers only allowed you to edit a file one line at a time. Many programming tools and frameworks were invented during this time that continue to be used today (you may have heard of sed which is a popular regular expression find/replace program—one I use almost daily). These tools struggle to operate on problems that span many lines of a file, because they simply weren’t designed for that. Problem (1) above suggests this might be a problem.

Yet another issue is in slight discrepancies between regex engines. Perl, python, sed, etc., all have slight variations and “nonstandard” features. As all programmers know, every visible behavior of a system will eventually be depended on by some other system.

But the real core problem is that regular expressions weren’t really designed for knowing about the context where a match occurs. Regular expressions are designed for character-at-a-time pattern matching. [edit: removed an incorrect example] Regular expression engines have added features to do such things over the years (which makes them more powerful than the original, formal definition of a regular language, and even more powerful than what parsers can handle!), but the more complicated you make a regular expression, the more likely it’s going to misbehave on odd inputs, and the less likely others can use it without bugs or modification for their particular use case. Software engineers care very much about such things, though mathematicians needing a one-off solution may not.

One redeeming feature of regular expressions is that—by virtue of being so widely used in industry—there are many tools to work with them. Every major programming language has a regular expression engine built in. And many websites help explain how regexes work. is one I like to use. Here is an example of using that website to replace offset mathmode delimiters. Note the “Explain” button, which traces the regex engine as it looks for matches.

[Screenshot: regex tester with the “Explain” feature tracing the offset math-mode replacement]

So applying this to our problem: we can use two regular expressions to solve the problem. I’m using the perl programming language because its regex engine supports multiline matches. All MacOS and linux systems come with perl pre-installed.

 perl -0777 -pe 's/\$\$(.*?)\$\$/\\[\1\\]/gs' < test.tex | perl -0777 -pe 's/\$(.*?)\$/\\(\1\\)/gs' > output.tex

Now let’s explain, starting with the core expressions being matched and replaced.

s/X/Y/ tells a regex engine to “substitute regex matches of X with Y”. In the first regex X is \$\$(.*?)\$\$, which breaks down as

  1. \$\$ match two literal dollar signs
  2. ( capture a group of characters signified by the regex between here and the matching closing parenthesis
    1. .* zero or more of any character
    2. ? looking at the “zero or more” in the previous step, try to match as few possible characters as you can while still making this pattern successfully match something
  3. ) stop the capture group, and save it as group 1
  4. \$\$ match two more literal dollar signs

Then Y is the chosen replacement. We’re processing offset mathmode, so we want \[ \]. Y is \\[\1\\], which means

  1. \\ a literal backslash
  2. [ a literal open bracket
  3. \1 the first capture group from the matched expression
  4. \\ a literal backslash
  5. ] a literal close bracket

All together we have s/\$\$(.*?)\$\$/\\[\1\\]/, but then we add a final s and g characters, which act as configuration. The “s” tells the regex engine to allow the dot . to match newlines (so a pattern can span multiple lines) and the “g” tells the regex to apply the substitution globally to every match it sees—as opposed to just the first. 

Finally, the full first command is

perl -0777 -pe 's/\$\$(.*?)\$\$/\\[\1\\]/gs' < test.tex 

This tells perl to read in the entire test.tex file and apply the regex to it. Broken down

  1. perl run perl
  2. -0777 read the entire file into one string. If you omit it, perl will apply the regex to each line separately.
  3. -p will make perl automatically “read input and print output” without having to tell it to with a “print” statement
  4. e tells perl to run the following command line argument as a program.
  5. < test.tex tells perl to use the file test.tex as input to the program (as input to the regex engine, in this case).

Then we pipe the output of this first perl command to a second one that does a very similar replacement for inline math mode.

<first_perl_command> | perl -0777 -pe 's/\$(.*?)\$/\\(\1\\)/gs'

The vertical bar | tells the shell executing the commands to take the output of the first program and feed it as input to the second program, allowing us to chain together sequences of programs. The second command does the same thing as the first, but replacing $ with \( and \). Note, it was crucial we had this second program occur after the offset mathmode regex, since $ would match $$.
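
If you would rather stay in one language, the same two-pass substitution can be sketched with Python's re.sub (my rough equivalent of the perl pipeline, not from the post; re.DOTALL plays the role of perl's s flag):

```python
import re

text = "inline $a+b$ and offset $$c = d$$ math"

# Pass 1: offset math mode first, so $$ isn't mistaken for two $ pairs.
text = re.sub(r'\$\$(.*?)\$\$', r'\\[\1\\]', text, flags=re.DOTALL)
# Pass 2: inline math mode.
text = re.sub(r'\$(.*?)\$', r'\\(\1\\)', text, flags=re.DOTALL)

print(text)  # inline \(a+b\) and offset \[c = d\] math
```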

Exercise: Adapt this solution to support Problem (4), support for literal \$ dollar signs. Hint: you can either try to upgrade the regular expression to not be tricked into thinking \$ is a delimiter, or you can add extra programs before that prevent \$ from being a problem. Warning: this exercise may cause a fit.

It can feel like a herculean accomplishment to successfully apply regular expressions to a problem. You can appreciate the now-classic programmer joke from the webcomic xkcd:

Regular Expressions

However, as you can tell, getting regular expressions right is hard and takes practice. It’s great when someone else solves your problem exactly, and you can copy/paste for a one-time fix. But debugging regular expressions that don’t quite work can be excruciating. There is another way!

Second attempt: using a parser generator

While regular expressions have a clean theory of “character by character stateless processing”, they are limited. It’s not possible to express the concept of “memory” in a regular expression, and the simplest example of this is the problem of counting. Suppose you want to find strings that constitute valid, balanced parentheticals. E.g., this is balanced:

(hello (()there)() wat)

But this is not

(hello ((there )() wat)

This is impossible for regexes to handle because counting the opening parentheses is required to match the closing parens, and regexes can’t count arbitrarily high. If you want to parse and manipulate structures like this, that have balance and nesting, regex will only bring you heartache.
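The “memory” a regex lacks here is nothing more than a counter. A few lines of Python with an explicit depth counter—the degenerate case of a stack, since there is only one bracket type—decide balance directly:

```python
def balanced(s: str) -> bool:
    # depth acts as a stack holding a single symbol kind,
    # so a counter suffices: '(' pushes (+1), ')' pops (-1).
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:   # a closing paren with no matching opener
                return False
    return depth == 0       # every opener must have been closed

print(balanced('(hello (()there)() wat)'))   # True
print(balanced('(hello ((there )() wat)'))   # False
```

That unbounded counter is precisely the capability a finite state machine cannot have, which is why the next tool up the hierarchy adds a stack.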

The next level up from regular expressions and regular languages are the two equivalent theories of context-free grammars and pushdown automata. A pushdown automaton is literally a regular expression (a finite state machine) equipped with a simple kind of memory called a stack. Rather than dwell on the mechanics, we’ll see how context-free grammars work, since if you can express your document as a context free grammar, a tool called a parser generator will give you a parsing program for free. Then a few simple lines of code allow you to manipulate the parsed representation, and produce the output document.

The standard (abstract) notation of a context-free grammar is called Extended Backus-Naur Form (EBNF). It’s a “metasyntax”, i.e., a syntax for describing syntax. In EBNF, you describe rules and terminals. Terminals are sequences of constant patterns, like

"\[" or "$$"

A rule is an “or” of sequences of other rules or terminals. It’s much easier to show an example:

mathmode = "\[" char "\]"
char = "a" | "b" | "c"


The above describes the structure of any string that looks like offset math mode, but with a single “a” or a single “b” or a single “c” inside, e.g., “\[b\]”. You can see some more complete examples on Wikipedia, though they use a slightly different notation.

With some help from a practical library’s built-in identifiers for things like “arbitrary text” we can build a grammar that covers all of the ways to do latex math mode.

latex = content
content = content mathmode content | TEXT | EMPTY
mathmode = "$" TEXT "$" | "$$" TEXT "$$" | "\(" TEXT "\)" | "\[" TEXT "\]"


Here we’re taking advantage of the fact that we can’t nest mathmode inside of mathmode in LaTeX (you probably can, but I’ve never seen it), by defining the mathmode rule to contain only text, and not other instances of the “content” rule. This rules out some ambiguities, such as whether “$x$ y $z$” is a nested mathmode or not.

We may not need the counting powers of context-free grammars here, but EBNF is easier to manage than regular expressions. You can apply context-sensitive rules to matches, whereas with regexes that would require coordination between separate passes. Order of operations is also less of a concern: because the parser generator knows all the patterns you want to match in advance, it will match longer terminals before shorter, more ambiguous ones. And if we wanted to do operations on all four kinds of math mode, this lets us do so without complicated chains of regular expressions.

The history of parsers is long and storied, and the theory of generating parsing programs from specifications like EBNF is basically considered a solved problem. However, there are a lot of parser generators out there. And, like regular expression engines, they each have their own flavor of EBNF—or, as is more popular nowadays, they have you write your EBNF using the features of the language the parser generator is written in. And finally, a downside of using a parser generator is that you have to then write a program to operate on the parsed representation (which also differs by implementation).

We’ll demonstrate this process by using a Python library that, in my opinion, stays pretty faithful to the EBNF heritage. It’s called lark and you can pip-install it as

pip install lark-parser

Note: the hard-core industry standard parser generators are antlr, lex, and yacc. I would not recommend them for small parsing jobs, but if you’re going to do this as part of a company, they are weighty, weathered, well-documented—unlike lark.

Lark is used entirely inside python, and you specify the EBNF-like grammar as a string. For example, ours is

tex: content+

?content: mathmode | text+

mathmode: OFFSETDOLLAR text+ OFFSETDOLLAR
        | OFFSETOPEN text+ OFFSETCLOSE
        | INLINEOPEN text+ INLINECLOSE
        | INLINE text+ INLINE

INLINE: "$"
INLINEOPEN: "\\("
INLINECLOSE: "\\)"
OFFSETDOLLAR: "$$"
OFFSETOPEN: "\\["
OFFSETCLOSE: "\\]"

?text: /./s

You can see the similarities with our “raw” EBNF. The main differences here are the use of + for matching “one or more” of a rule, and the use of a regular expression to define the “text” rule as any character (here again the trailing “s” means: allow the dot character to match newline characters). The backslashes are needed because backslash is an escape character in Python. And finally, the question mark tells lark to try to compress the tree if a rule matches only one item (you can see the difference by playing with our script that shows the parsed representation of the input document). You can read more in lark’s documentation about the structure of the parsed tree as python objects (Tree for rule matches and Token for individual terminal matches).

For the input “Let $x=0$”, the parsed tree is as follows (note that the ? makes lark collapse the many “text” matches):

Tree(tex,
  [Tree(content,
    [Token(__ANON_0, 'L'),
     Token(__ANON_0, 'e'),
     Token(__ANON_0, 't'),
     Token(__ANON_0, ' ')]),
   Tree(mathmode,
    [Token(INLINE, '$'),
     Token(__ANON_0, 'x'),
     Token(__ANON_0, '='),
     Token(__ANON_0, '0'),
     Token(INLINE, '$')]),
   Token(__ANON_0, '\n')])

So now we can write a simple python program that traverses this tree and converts the delimiters. The entire program is on Github, but the core is

def join_tokens(tokens):
    return ''.join(x.value for x in tokens)

def handle_mathmode(tree_node):
    '''Switch on the different types of math mode, and convert
       the delimiters to the desired output, and then concatenate the
       text between.'''
    starting_delimiter = tree_node.children[0].type

    if starting_delimiter in ['INLINE', 'INLINEOPEN']:
        return '\\(' + join_tokens(tree_node.children[1:-1]) + '\\)'
    elif starting_delimiter in ['OFFSETDOLLAR', 'OFFSETOPEN']:
        return '\\[' + join_tokens(tree_node.children[1:-1]) + '\\]'
    else:
        raise Exception("Unsupported mathmode type %s" % starting_delimiter)

def handle_content(tree_node):
    '''Each child is a Token node whose text we'd like to concatenate.'''
    return join_tokens(tree_node.children)
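To exercise the conversion logic without installing lark, here is a sketch with tiny stand-in Tree and Token classes; the stand-ins are assumptions that mimic only the attributes the handlers actually touch (type, value, data, children), not lark’s real API:

```python
class Token:
    # stand-in for lark's Token: a terminal name plus the matched text
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

class Tree:
    # stand-in for lark's Tree: a rule name plus its matched children
    def __init__(self, data, children):
        self.data = data
        self.children = children

def join_tokens(tokens):
    return ''.join(x.value for x in tokens)

def handle_mathmode(tree_node):
    # dispatch on the terminal name of the opening delimiter
    start = tree_node.children[0].type
    if start in ('INLINE', 'INLINEOPEN'):
        return '\\(' + join_tokens(tree_node.children[1:-1]) + '\\)'
    if start in ('OFFSETDOLLAR', 'OFFSETOPEN'):
        return '\\[' + join_tokens(tree_node.children[1:-1]) + '\\]'
    raise ValueError('Unsupported mathmode type %s' % start)

# the parse tree for "$x=0$", built by hand
node = Tree('mathmode', [Token('INLINE', '$'),
                         Token('TEXT', 'x'),
                         Token('TEXT', '='),
                         Token('TEXT', '0'),
                         Token('INLINE', '$')])
print(handle_mathmode(node))   # \(x=0\)
```

Slicing children[1:-1] is what drops the two delimiter tokens, so only the math content survives into the new delimiters.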

The rest of the program uses lark to create the parser, reads the file from standard input, processes the parsed representation, and outputs the converted document to standard output. You can use the program like this:

python < input.tex > output.tex

Exercise: extend this grammar to support literal dollar signs using \$, and pass them through to the output document unchanged.

What’s better?

I personally prefer regular expressions when the job is quick. If my text manipulation rule fits on one line, or can be expressed without requiring “look ahead” or “look behind” rules, regex is a winner. It’s also a winner when I only expect it to fail in a few exceptional cases that can easily be detected and fixed by hand. It’s faster to write a scrappy regex, and then open the output in a text editor and manually fix one or two mishaps, than it is to write a parser.

However, the longer I spend on a regular expression problem—and the more frustrated I get wrestling with it—the more I start to think I should have used a parser all along. This is especially true for massive jobs, such as converting delimiters in hundreds of blog articles, each thousands of words long, or making changes across all chapter files of a book.

When I need something in between rigid structure and quick-and-dirty, I actually turn to vim. Vim has this fantastic philosophy of “act, repeat, rewind” wherein you find an edit that applies to the thing you want to change, then you search for the next occurrence of the start, try to apply the change again, visually confirm it does the right thing, and if not go back and correct it manually. Learning vim is a major endeavor (for me it feels lifelong, as I’m always learning new things), but since I spend most of my working hours editing structured text, the investment and philosophy have paid off.

Until next time!

Continue Reading…


Read More

What’s new on arXiv

Topological based classification of paper domains using graph convolutional networks

The main approaches for node classification in graphs are information propagation and the association of the class of the node with external information. State of the art methods merge these approaches through Graph Convolutional Networks. We here use the association of topological features of the nodes with their class to predict this class. Moreover, combining topological information with information propagation improves classification accuracy on the standard CiteSeer and Cora paper classification task. Topological features and information propagation produce results almost as good as text-based classification, with no textual or content information. We propose to represent the topology and information propagation through a GCN with the neighboring training node classification as an input and the current node classification as output. Such a formalism outperforms state of the art methods.

Advanced Customer Activity Prediction based on Deep Hierarchic Encoder-Decoders

Product recommender systems and customer profiling techniques have always been a priority in online retail. Recent machine learning research advances and also wide availability of massive parallel numerical computing has enabled various approaches and directions of recommender systems advancement. Worth mentioning is the fact that in past years multiple traditional ‘offline’ retail businesses are gearing more and more towards employing inferential and even predictive analytics both to stock-related problems such as predictive replenishment but also to enrich customer interaction experience. One of the most important areas of recommender systems research and development is that of Deep Learning based models which employ representational learning to model consumer behavioral patterns. Current state of the art in Deep Learning based recommender systems uses multiple approaches ranging from already classical methods such as the ones based on learning product representation vector, to recurrent analysis of customer transactional time-series and up to generative models based on adversarial training. Each of these methods has multiple advantages and inherent weaknesses such as inability of understanding the actual user-journey, ability to propose only single product recommendation or top-k product recommendations without prediction of actual next-best-offer. In our work we will present a new and innovative architectural approach of applying state-of-the-art hierarchical multi-module encoder-decoder architecture in order to solve several of current state-of-the-art recommender systems issues. Our approach will also produce by-products such as product need-based segmentation and customer behavioral segmentation – all in an end-to-end trainable approach.

Graph Wavelet Neural Network

We present graph wavelet neural network (GWNN), a novel graph convolutional neural network (CNN), leveraging graph wavelet transform to address the shortcomings of previous spectral graph CNN methods that depend on graph Fourier transform. Different from graph Fourier transform, graph wavelet transform can be obtained via a fast algorithm without requiring matrix eigendecomposition with high computational cost. Moreover, graph wavelets are sparse and localized in vertex domain, offering high efficiency and good interpretability for graph convolution. The proposed GWNN significantly outperforms previous spectral graph CNNs in the task of graph-based semi-supervised classification on three benchmark datasets: Cora, Citeseer and Pubmed.

Low-Rank Deep Convolutional Neural Network for Multi-Task Learning

In this paper, we propose a novel multi-task learning method based on the deep convolutional network. The proposed deep network has four convolutional layers, three max-pooling layers, and two parallel fully connected layers. To adjust the deep network to multi-task learning problem, we propose to learn a low-rank deep network so that the relation among different tasks can be explored. We proposed to minimize the number of independent parameter rows of one fully connected layer to explore the relations among different tasks, which is measured by the nuclear norm of the parameter of one fully connected layer, and seek a low-rank parameter matrix. Meanwhile, we also propose to regularize another fully connected layer by sparsity penalty, so that the useful features learned by the lower layers can be selected. The learning problem is solved by an iterative algorithm based on gradient descent and back-propagation algorithms. The proposed algorithm is evaluated over benchmark data sets of multiple face attribute prediction, multi-task natural language processing, and joint economics index predictions. The evaluation results show the advantage of the low-rank deep CNN model over multi-task problems.

Short Text Topic Modeling Techniques, Applications, and Performance: A Survey

Analyzing short texts to infer discriminative and coherent latent topics is a critical and fundamental task, since many real-world applications require semantic understanding of short texts. Traditional long text topic modeling algorithms (e.g., PLSA and LDA) based on word co-occurrences cannot solve this problem very well, since only very limited word co-occurrence information is available in short texts. Therefore, short text topic modeling has already attracted much attention from the machine learning research community in recent years, which aims at overcoming the problem of sparseness in short texts. In this survey, we conduct a comprehensive review of various short text topic modeling techniques proposed in the literature. We present three categories of methods based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and analysis of their performance on various tasks. We develop the first comprehensive open-source library, called STTM, for use in Java, which integrates all surveyed algorithms within a unified interface along with benchmark datasets, to facilitate the expansion of new methods in this research field. Finally, we evaluate these state-of-the-art methods on many real-world datasets and compare their performance against one another and versus long text topic modeling algorithms.

Bounded and Approximate Strong Satisfiability in Workflows

There has been a considerable amount of interest in recent years in the problem of workflow satisfiability, which asks whether the existence of constraints in a workflow specification makes it impossible to allocate authorized users to each step in the workflow. Recent developments have seen the workflow satisfiability problem (WSP) studied in the context of workflow specifications in which the set of steps may vary from one instance of the workflow to another. This, in turn, means that some constraints may only apply to certain workflow instances. Inevitably, WSP becomes more complex for such workflow specifications. Other approaches have considered the possibility of associating costs with the violation of `soft’ constraints and authorizations. Workflow satisfiability in this context becomes a question of minimizing the cost of allocating users to steps in the workflow. In this paper, we introduce new problems, which we believe to be of practical relevance, that combine these approaches. In particular, we consider the question of whether, given a workflow specification with costs and a `budget’, all possible workflow instances have an allocation of users to steps that does not exceed the budget. We design a fixed-parameter tractable algorithm to solve this problem parameterized by the total number of steps, release points and xor branchings.

Three scenarios for continual learning

Standard artificial neural networks suffer from the well-known issue of catastrophic forgetting, making continual or lifelong learning difficult for machine learning. In recent years, numerous methods have been proposed for continual learning, but due to differences in evaluation protocols it is difficult to directly compare their performance. To enable more structured comparisons, we describe three continual learning scenarios based on whether at test time task identity is provided and–in case it is not–whether it must be inferred. Any sequence of well-defined tasks can be performed according to each scenario. Using the split and permuted MNIST task protocols, for each scenario we carry out an extensive comparison of recently proposed continual learning methods. We demonstrate substantial differences between the three scenarios in terms of difficulty and in terms of how efficient different methods are. In particular, when task identity must be inferred (i.e., class incremental learning), we find that regularization-based approaches (e.g., elastic weight consolidation) fail and that replaying representations of previous experiences seems required for solving this scenario.

Introduction to Multi-Armed Bandits

Multi-armed bandits are a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a review of the more advanced results. The chapters are as follows: Stochastic bandits; Lower bounds; Bayesian Bandits and Thompson Sampling; Lipschitz Bandits; Full Feedback and Adversarial Costs; Adversarial Bandits; Linear Costs and Semi-bandits; Contextual Bandits; Bandits and Zero-Sum Games; Bandits with Knapsacks; Incentivized Exploration and Connections to Mechanism Design. Status of the manuscript: essentially complete (modulo some polishing), except for last chapter, which the author plans to add over the next few months.

Latent Code and Text-based Generative Adversarial Networks for Soft-text Generation

Text generation with generative adversarial networks (GANs) can be divided into the text-based and code-based categories according to the type of signals used for discrimination. In this work, we introduce a novel text-based approach called Soft-GAN to effectively exploit GAN setup for text generation. We demonstrate how autoencoders (AEs) can be used for providing a continuous representation of sentences, which we will refer to as soft-text. This soft representation will be used in GAN discrimination to synthesize similar soft-texts. We also propose hybrid latent code and text-based GAN (LATEXT-GAN) approaches with one or more discriminators, in which a combination of the latent code and the soft-text is used for GAN discriminations. We perform a number of subjective and objective experiments on two well-known datasets (SNLI and Image COCO) to validate our techniques. We discuss the results using several evaluation metrics and show that the proposed techniques outperform the traditional GAN-based text-generation methods.

CryptoNN: Training Neural Networks over Encrypted Data

Emerging neural networks based machine learning techniques such as deep learning and its variants have shown tremendous potential in many application domains. However, they raise serious privacy concerns due to the risk of leakage of highly privacy-sensitive data when data collected from users is used to train neural network models to support predictive tasks. To tackle such serious privacy concerns, several privacy-preserving approaches have been proposed in the literature that use either secure multi-party computation (SMC) or homomorphic encryption (HE) as the underlying mechanisms. However, neither of these cryptographic approaches provides an efficient solution towards constructing a privacy-preserving machine learning model, as well as supporting both the training and inference phases. To tackle the above issue, we propose a CryptoNN framework that supports training a neural network model over encrypted data by using the emerging functional encryption scheme instead of SMC or HE. We also construct a functional encryption scheme for basic arithmetic computation to support the requirement of the proposed CryptoNN framework. We present performance evaluation and security analysis of the underlying crypto scheme and show through our experiments that CryptoNN achieves accuracy that is similar to those of the baseline neural network models on the MNIST dataset.

Fast Inference in Capsule Networks Using Accumulated Routing Coefficients

We present a method for fast inference in Capsule Networks (CapsNets) by taking advantage of a key insight regarding the routing coefficients that link capsules between adjacent network layers. Since the routing coefficients are responsible for assigning object parts to wholes, and an object whole generally contains similar intra-class and dissimilar inter-class parts, the routing coefficients tend to form a unique signature for each object class. For fast inference, a network is first trained in the usual manner using examples from the training dataset. Afterward, the routing coefficients associated with the training examples are accumulated offline and used to create a set of ‘master’ routing coefficients. During inference, these master routing coefficients are used in place of the dynamically calculated routing coefficients. Our method effectively replaces the for-loop iterations in the dynamic routing procedure with a single matrix multiply operation, providing a significant boost in inference speed. Compared with the dynamic routing procedure, fast inference decreases the test accuracy for the MNIST, Background MNIST, Fashion MNIST, and Rotated MNIST datasets by less than 0.5% and by approximately 5% for CIFAR10.

Incentivized Blockchain-based Social Media Platforms: A Case Study of Steemit

This paper presents an empirical analysis of Steemit, a key representative of the emerging incentivized social media platforms over Blockchains, to understand and evaluate the actual level of decentralization and the practical effects of cryptocurrency-driven reward system in these modern social media platforms. Similar to Bitcoin, Steemit is operated by a decentralized community, where 21 members are periodically elected to cooperatively operate the platform through the Delegated Proof-of-Stake (DPoS) consensus protocol. Our study performed on 539 million operations performed by 1.12 million Steemit users during the period 2016/03 to 2018/08 reveals that the actual level of decentralization in Steemit is far lower than the ideal level, indicating that the DPoS consensus protocol may not be a desirable approach for establishing a highly decentralized social media platform. In Steemit, users create contents as posts which get curated based on votes from other users. The platform periodically issues cryptocurrency as rewards to creators and curators of popular posts. Although such a reward system is originally driven by the desire to incentivize users to contribute to high-quality contents, our analysis of the underlying cryptocurrency transfer network on the blockchain reveals that more than 16% transfers of cryptocurrency in Steemit are sent to curators suspected to be bots and also finds the existence of an underlying supply network for the bots, both suggesting a significant misuse of the current reward system in Steemit. Our study is designed to provide insights on the current state of this emerging blockchain-based social media platform including the effectiveness of its design and the operation of the consensus protocols and the reward system.

Helping Effects Against Curse of Dimensionality in Threshold Factor Models for Matrix Time Series

As is known, factor analysis is a popular method to reduce dimension for high-dimensional data. For matrix data, the dimension reduction can be more effectively achieved through both row and column directions. In this paper, we introduce threshold factor models to analyze matrix-valued high-dimensional time series data. The factor loadings are allowed to switch between regimes, controlled by a threshold variable. The estimation methods for loading spaces, threshold value, and the number of factors are proposed. The asymptotic properties of these estimators are investigated. Not only the strengths of thresholding and factors, but also their interactions from different directions and different regimes play an important role on the estimation performance. When the thresholding and factors are all strong across regimes, the estimation is immune to the impact that the increase of dimension brings, which breaks the curse of dimensionality. When the thresholding in two directions and factors across regimes have different levels of strength, we show that estimators for loadings and threshold value experience ‘helping’ effects against the curse of dimensionality. We also discover that even when the numbers of factors are overestimated, the estimators are still consistent. The proposed methods are illustrated with both simulated and real examples.

Persistent Homology of Complex Networks for Dynamic State Detection

In this paper we develop a novel Topological Data Analysis (TDA) approach for studying graph representations of time series of dynamical systems. Specifically, we show how persistent homology, a tool from TDA, can be used to yield a compressed, multi-scale representation of the graph that can distinguish between dynamic states such as periodic and chaotic behavior. We show the approach for two graph constructions obtained from the time series. In the first approach the time series is embedded into a point cloud which is then used to construct an undirected k-nearest neighbor graph. The second construct relies on the recently developed ordinal partition framework. In either case, a pairwise distance matrix is then calculated using the shortest path between the graph’s nodes, and this matrix is utilized to define a filtration of a simplicial complex that enables tracking the changes in homology classes over the course of the filtration. These changes are summarized in a persistence diagram—a two-dimensional summary of changes in the topological features. We then extract existing as well as new geometric and entropy point summaries from the persistence diagram and compare to other commonly used network characteristics. Our results show that persistence-based point summaries yield a clearer distinction of the dynamic behavior and are more robust to noise than existing graph-based scores, especially when combined with ordinal graphs.

Metrics for Graph Comparison: A Practitioner’s Guide

Comparison of graph structure is a ubiquitous task in data analysis and machine learning, with diverse applications in fields such as neuroscience, cyber security, social network analysis, and bioinformatics, among others. Discovery and comparison of structures such as modular communities, rich clubs, hubs, and trees in data in these fields yields insight into the generative mechanisms and functional properties of the graph. Often, two graphs are compared via a pairwise distance measure, with a small distance indicating structural similarity and vice versa. Common choices include spectral distances (also known as \lambda distances) and distances based on node affinities. However, there has as of yet been no comparative study of the efficacy of these distance measures in discerning between common graph topologies and different structural scales. In this work, we compare commonly used graph metrics and distance measures, and demonstrate their ability to discern between common topological features found in both random graph models and empirical datasets. We put forward a multi-scale picture of graph structure, in which the effect of global and local structure upon the distance measures is considered. We make recommendations on the applicability of different distance measures to empirical graph data problems based on this multi-scale view. Finally, we introduce the Python library NetComp which implements the graph distances used in this work.

What I See Is What You See: Joint Attention Learning for First and Third Person Video Co-analysis

In recent years, more and more videos are captured from the first-person viewpoint by wearable cameras. Such first-person video provides additional information besides the traditional third-person video, and thus has a wide range of applications. However, techniques for analyzing the first-person video can be fundamentally different from those for the third-person video, and it is even more difficult to explore the shared information from both viewpoints. In this paper, we propose a novel method for first- and third-person video co-analysis. At the core of our method is the notion of ‘joint attention’, indicating the learnable representation that corresponds to the shared attention regions in different viewpoints and thus links the two viewpoints. To this end, we develop a multi-branch deep network with a triplet loss to extract the joint attention from the first- and third-person videos via self-supervised learning. We evaluate our method on the public dataset with cross-viewpoint video matching tasks. Our method outperforms the state-of-the-art both qualitatively and quantitatively. We also demonstrate how the learned joint attention can benefit various applications through a set of additional experiments.

Counterfactual Visual Explanations

A counterfactual query is typically of the form ‘For situation X, why was the outcome Y and not Z?’. A counterfactual explanation (or response to such a query) is of the form ‘If X was X*, then the outcome would have been Z rather than Y.’ In this work, we develop a technique to produce counterfactual visual explanations. Given a ‘query’ image I for which a vision system predicts class c, a counterfactual visual explanation identifies how I could change such that the system would output a different specified class c'. To do this, we select a ‘distractor’ image I' that the system predicts as class c' and identify spatial regions in I and I' such that replacing the identified region in I with the identified region in I' would push the system towards classifying I as c'. We apply our approach to multiple image classification datasets generating qualitative results showcasing the interpretability and discriminativeness of our counterfactual explanations. To explore the effectiveness of our explanations in teaching humans, we present machine teaching experiments for the task of fine-grained bird classification. We find that users trained to distinguish bird species fare better when given access to counterfactual explanations in addition to training examples.

DSTP-RNN: a dual-stage two-phase attention-based recurrent neural networks for long-term and multivariate time series prediction

Long-term prediction of multivariate time series is still an important but challenging problem. The key to solving this problem is to capture the spatial correlations at the same time, the spatio-temporal relationships at different times and the long-term dependence of the temporal relationships between different series. Attention-based recurrent neural networks (RNN) can effectively represent the dynamic spatio-temporal relationships between exogenous series and target series, but it only performs well in one-step time prediction and short-term time prediction. In this paper, inspired by human attention mechanism including the dual-stage two-phase (DSTP) model and the influence mechanism of target information and non-target information, we propose dual-stage two-phase-based recurrent neural network (DSTP-RNN) and DSTP-RNN-2 respectively for long-term time series prediction. Specifically, we first propose the DSTP-based structure to enhance the spatial correlations between exogenous series. The first phase produces violent but decentralized response weight, while the second phase leads to stationary and concentrated response weight. Secondly, we employ multiple attentions on target series to boost the long-term dependence. Finally, we study the performance of deep spatial attention mechanism and provide experiment and interpretation. Our methods outperform nine baseline methods on four datasets in the fields of energy, finance, environment and medicine, respectively.

Discriminative Regression Machine: A Classifier for High-Dimensional Data or Imbalanced Data

We introduce a discriminative regression approach to supervised classification in this paper. It estimates a representation model while accounting for discriminativeness between classes, thereby enabling accurate derivation of categorical information. This new type of regression model extends existing models such as ridge, lasso, and group lasso by explicitly incorporating discriminative information. As a special case, we focus on a quadratic model that admits a closed-form analytical solution. The corresponding classifier is called the discriminative regression machine (DRM). Three iterative algorithms are further established for the DRM to enhance efficiency and scalability in real applications. Our approach and algorithms are applicable to general types of data, including images, high-dimensional data, and imbalanced data. We compare the DRM with current state-of-the-art classifiers. Our extensive experimental results show the superior performance of the DRM and confirm the effectiveness of the proposed approach.

RES-PCA: A Scalable Approach to Recovering Low-rank Matrices

Robust principal component analysis (RPCA) has drawn significant attention due to its powerful capability in recovering low-rank matrices and its successful applications to various real-world problems. Current state-of-the-art algorithms usually need to compute the singular value decomposition of large matrices, which generally has at least quadratic or even cubic complexity. This drawback has limited the application of RPCA to real-world problems. To combat it, in this paper we propose a new type of RPCA method, RES-PCA, which is linearly efficient and scalable in both data size and dimension. For comparison, AltProj, an existing scalable approach to RPCA, requires precise knowledge of the true rank; otherwise, it may fail to recover low-rank matrices. By contrast, our method works with or without knowing the true rank; even when both methods work, ours is faster. Extensive experiments have been performed and verify the effectiveness of the proposed method both quantitatively and in visual quality, suggesting that it is suitable as a light-weight, scalable component for RPCA in any application pipeline.

Selection Bias in News Coverage: Learning it, Fighting it

News entities must select and filter the coverage they broadcast through their respective channels, since the set of world events is too large to be treated exhaustively. The subjective nature of this filtering induces biases due to, among other things, resource constraints, editorial guidelines, ideological affinities, or even the fragmented nature of the information at a journalist’s disposal. The magnitude and direction of these biases are, however, widely unknown. The absence of ground truth, the sheer size of the event space, and the lack of an exhaustive set of absolute features to measure make it difficult to observe the bias directly, to characterize the leaning’s nature, and to factor it out to ensure a neutral coverage of the news. In this work, we introduce a methodology to capture the latent structure of media’s decision process on a large scale. Our contribution is threefold. First, we show media coverage to be predictable using personalization techniques, and evaluate our approach on a large set of events collected from the GDELT database. We then show that a personalized and parametrized approach not only exhibits higher accuracy in coverage prediction, but also provides an interpretable representation of the selection bias. Last, we propose a method to select a set of sources by leveraging the latent representation. These selected sources provide a more diverse and egalitarian coverage, all while retaining the most actively covered events.

On the Mathematical Understanding of ResNet with Feynman Path Integral

In this paper, we aim to understand Residual Networks (ResNets) in a scientifically sound way by providing a bridge between ResNet and the Feynman path integral. In particular, we prove that the effect of a residual block is equivalent to a partial differential equation, and that the ResNet transformation process can be equivalently converted to a Feynman path integral. These conclusions greatly help us mathematically understand the advantage of ResNet in addressing the gradient-vanishing issue. More importantly, our analyses offer a path-integral view of ResNet and demonstrate that the output of a network can be obtained by adding the contributions of all paths. Moreover, the contribution of each path is proportional to e^{-S}, where S is the action given by the time integral of the Lagrangian L. This lays a solid foundation for the understanding of ResNet and provides insights for the future design of convolutional neural network architectures. Based on these results, we have designed a network using partial differential operators, which further validates our theoretical analyses.
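
The "sum over paths" view can be checked on a toy network. With scalar residual blocks x → x + a·x, composing n blocks expands into a sum of contributions over all 2^n subsets of residual branches (the "paths"). This sketch verifies the identity numerically; it is a scalar illustration of the path decomposition only, not the paper's path-integral formalism.

```python
from itertools import combinations

def resnet_forward(x, coeffs):
    # Compose residual blocks: x -> x + a*x = (1 + a) * x for each block.
    for a in coeffs:
        x = x + a * x
    return x

def path_sum(x, coeffs):
    # Sum the contributions of all 2^n paths: each path passes through the
    # residual branch of a chosen subset of blocks and the identity elsewhere.
    total = 0.0
    for k in range(len(coeffs) + 1):
        for subset in combinations(coeffs, k):
            contrib = x
            for a in subset:
                contrib *= a
            total += contrib
    return total

coeffs = [0.5, -0.2, 0.1]
x = 2.0
# resnet_forward(x, coeffs) and path_sum(x, coeffs) agree: 2*(1.5)(0.8)(1.1) = 2.64
```

The same expansion holds for matrix-valued blocks, which is the sense in which a ResNet's output is a sum over exponentially many paths of differing depth.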

Go Wide, Go Deep: Quantifying the Impact of Scientific Papers through Influence Dispersion Trees

Despite a long history of using citation count to assess the impact or influence of a scientific paper, the evolution of follow-up work inspired by a paper, and the interactions of that work through citation links, have rarely been explored to quantify how the paper enriches the depth and breadth of a research field. We propose a novel data structure, called the Influence Dispersion Tree (IDT), to model the organization of follow-up papers and their dependencies through citations. We also propose the notion of an ideal IDT for every paper and show that an ideal (highly influential) paper should increase the knowledge of a field both vertically and horizontally. By exploring the structural properties of the IDT, we derive a suite of metrics, namely the Influence Dispersion Index (IDI) and the Normalized Influence Divergence (NID), to quantify the influence of a paper. Our theoretical analysis shows that an ideal IDT configuration should have equal depth and breadth (and thus minimize the NID value). We establish the superiority of NID as an influence measure in two experimental settings. First, on a large real-world bibliographic dataset, we show that NID outperforms raw citation count as an early predictor of the number of new citations a paper will receive within a certain period after publication. Second, we show that NID is superior to raw citation count at identifying the papers recognized as highly influential through Test of Time Awards among their contemporary papers (published in the same venue). We conclude that to quantify the influence of a paper, along with the total citation count, one should also consider how the citing papers are organized among themselves. For reproducibility, the code and datasets used in this study are being made available to the community.

Continue Reading…


Read More

Before you take my DataCamp course please consider this info

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Today, I am finally getting around to writing this very sad blog post: Before you take my DataCamp course please consider the following information about the sexual harassment scandal surrounding DataCamp!

As many of my fellow instructors and community members have already written excellent articles about what happened, I am providing a collection of tweets and links below for you to judge the situation for yourself. All I have to add is that while I have no personal knowledge of ALL the details surrounding the incident, many prominent members of the community do – and since I highly value their opinion and trust their judgement, I am supporting what has been said!

It makes me very sad that my course, which I put a lot of hard work into and was very proud of, and, by association, my name are now linked to this unfortunate situation. That’s why I have not been addressing any feedback messages on my course and will not do so for the foreseeable future. At the end of this post, I have collected some efforts to find alternatives to DataCamp courses. The contents of my own course will probably be converted into a series of blog posts.

What happened

In essence, after a broader group of instructors, including myself, were made aware of an incident of sexual harassment at DataCamp (which was followed by a lengthy struggle to get them to properly handle and own up to what happened), DataCamp made a “public” announcement:

The link to this community post can be found here, but…

And if that wasn’t enough:

Here are two excellent blogposts about this:

RStudio, R-Ladies and Women in Data Science and Machine Learning have published public statements, as well as many others:

Where to go from here

If you decide that you want to move away from DataCamp, here are some alternatives:

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Process Mining (Part 2/3): More on bupaR package

(This article was first published on R on notast, and kindly contributed to R-bloggers)


In the last post, event logs and the discipline of process mining were defined, and the bupaR package was introduced as a tool for process mining in R.

Objectives for This Post

  1. Visualize workflow
  2. Understand the concept of activity reoccurrences

We will use the pre-loaded sepsis dataset from the bupaR package. This event log is based on the real-life management of sepsis cases from the point of admission to discharge.

The dataset has 15214 activity instances and 1050 cases.

## Event log consisting of:
## 15214 events
## 846 traces
## 1050 cases
## 16 activities
## 15214 activity instances
## # A tibble: 15,214 x 34
##    case_id activity lifecycle resource timestamp             age   crp
##  1 A       ER Regi~ complete  A        2014-10-22 11:15:41    85    NA
##  2 A       Leucocy~ complete  B        2014-10-22 11:27:00    NA    NA
##  3 A       CRP      complete  B        2014-10-22 11:27:00    NA   210
##  4 A       LacticA~ complete  B        2014-10-22 11:27:00    NA    NA
##  5 A       ER Tria~ complete  C        2014-10-22 11:33:37    NA    NA
##  6 A       ER Seps~ complete  A        2014-10-22 11:34:00    NA    NA
##  7 A       IV Liqu~ complete  A        2014-10-22 14:03:47    NA    NA
##  8 A       IV Anti~ complete  A        2014-10-22 14:03:47    NA    NA
##  9 A       Admissi~ complete  D        2014-10-22 14:13:19    NA    NA
## 10 A       CRP      complete  B        2014-10-24 09:00:00    NA  1090
## # ... with 15,204 more rows, and 27 more variables: diagnose ,
## #   diagnosticartastrup , diagnosticblood , diagnosticecg ,
## #   diagnosticic , diagnosticlacticacid ,
## #   diagnosticliquor , diagnosticother , diagnosticsputum ,
## #   diagnosticurinaryculture , diagnosticurinarysediment ,
## #   diagnosticxthorax , disfuncorg , hypotensie ,
## #   hypoxie , infectionsuspected , infusion ,
## #   lacticacid , leucocytes , oligurie ,
## #   sirscritheartrate , sirscritleucos ,
## #   sirscrittachypnea , sirscrittemperature ,
## #   sirscriteria2ormore , activity_instance_id , .order 

We will subset a smaller and more granular event log to make illustrations more comprehensible.

activity_frequency(sepsis, level = "activity") %>% arrange(relative)# least common activity 
## # A tibble: 16 x 3
##    activity         absolute relative
##  1 Release E               6 0.000394
##  2 Release D              24 0.00158 
##  3 Release C              25 0.00164 
##  4 Release B              56 0.00368 
##  5 Admission IC          117 0.00769 
##  6 Return ER             294 0.0193  
##  7 Release A             671 0.0441  
##  8 IV Liquid             753 0.0495  
##  9 IV Antibiotics        823 0.0541  
## 10 ER Sepsis Triage     1049 0.0689  
## 11 ER Registration      1050 0.0690  
## 12 ER Triage            1053 0.0692  
## 13 Admission NC         1182 0.0777  
## 14 LacticAcid           1466 0.0964  
## 15 CRP                  3262 0.214   
## 16 Leucocytes           3383 0.222
sepsis_subset<-filter_activity_presence(sepsis, "Release E") # cases with least common activity to achieve smaller eventlog

Visualizing Workflow

In an event log, the sequence of activities is captured either with timestamps or with an activity instance identifier. The sequential order of activities allows us to reconstruct the workflow of how a case was managed. This workflow can be compared against theoretical workflow models to identify deviations. It can also reveal the reoccurrence of activities, which will be covered later. The most intuitive way to examine a workflow is with a process map, created with the process_map function from bupaR.

The process map is created based either on the frequency of activities (i.e. process_map(type = frequency())) or on the duration of activities (i.e. process_map(type = performance())). I have focused on the frequency aspect, which is the default argument.

There are four arguments you can supply to process_map(type = frequency()) that determine the value displayed. The value can reflect either the number of activity instances or the number of cases, expressed as an absolute or a relative frequency.

Process mapping based on absolute activity instances

sepsis_subset %>% process_map(type = frequency("absolute"))

Process mapping based on absolute number of cases for the activity

sepsis_subset %>% process_map(type = frequency("absolute_case"))

The darker the colour of the boxes and the darker and thicker the arrows, the higher the value. From the process map, activities leading up to “Release E” were “CRP”, “Leucocytes” and “Lactic Acid”. “CRP” contributed to half the activity instances and cases leading up to “Release E”.

Reoccurence of Activities

In the above process map, the tops of the boxes for “CRP”, “Leucocytes”, “Admission NC” and “LacticAcid” each have an arrow whose head and tail belong to the same activity. These arrows indicate reoccurrence of the activity within the same case. Reoccurrence of activities can suggest inefficiencies, interruptions or disruptions, which may warrant further investigation to optimise the workflow.

bupaR has two constructs to define reoccurrence of activities within the same case.

Construct 1 (resource repeating the activity)

If the same resource performs the reoccurring activity for a particular case, it is known as a “repeat” in bupaR’s terminology. If a different resource performs it, it is known as a “redo”.

Construct 2 (activity instances when the activity reoccurred)

When the activity for a specific case reoccurs as consecutive activity instances, it is termed a “self loop”. This is an example of “1 self loop”:

Activity Self Loop
ER Triage
Release A

When the activity reoccurs as non-consecutive activity instances, it is known as “repetition”. In other words, other activities occur before that specific activity is repeated. This is an example of “1 repetition”:

Activity Repetition
ER Triage

The combinations of these constructs result in 4 types of activity reoccurrence:

  1. Redo self-loop
  2. Redo repetition
  3. Repeat self-loop
  4. Repeat repetition

Let’s look at some examples using bupaR

Which activities reoccurred consecutively in a case?

number_of_repetitions(sepsis_subset, level="activity", type="all")
## # Description: activity_metric [12 x 3]
##    activity         absolute relative
##  1 Admission IC            0    0    
##  2 Admission NC            2    0.167
##  3 CRP                     6    0.188
##  4 ER Registration         0    0    
##  5 ER Sepsis Triage        0    0    
##  6 ER Triage               0    0    
##  7 IV Antibiotics          0    0    
##  8 IV Liquid               0    0    
##  9 LacticAcid              3    0.158
## 10 Leucocytes              6    0.136
## 11 Release E               0    0    
## 12 Return ER               0    0

Which activities reoccurred consecutively in a case where the same resource repeated the activities?

number_of_repetitions(sepsis_subset, level="activity", type="repeat")
## # Description: activity_metric [12 x 3]
##    activity         absolute relative
##  1 Admission IC            0    0    
##  2 Admission NC            0    0    
##  3 CRP                     6    0.188
##  4 ER Registration         0    0    
##  5 ER Sepsis Triage        0    0    
##  6 ER Triage               0    0    
##  7 IV Antibiotics          0    0    
##  8 IV Liquid               0    0    
##  9 LacticAcid              3    0.158
## 10 Leucocytes              6    0.136
## 11 Release E               0    0    
## 12 Return ER               0    0

Which cases had activities that reoccurred as consecutive activity instances, where the reoccurrence was performed by a different resource (a “redo” self loop)?

number_of_selfloops(sepsis_subset, level="case", type = "redo")
## # Description: case_metric [6 x 3]
##   case_id absolute relative
## 1 BCA            0   0     
## 2 CY             1   0.0303
## 3 JAA            0   0     
## 4 JM             1   0.0769
## 5 LG             1   0.0244
## 6 SAA            0   0

Summing Up

In this post, we learnt how to visualize the workflow registered in event logs with a process map. We also learnt bupaR’s constructs for activity reoccurrence and the resulting types of activity reoccurrences. If you want to learn more about bupaR, head over to DataCamp for more comprehensive and interactive lessons. P.S. I’m not sponsored by nor affiliated with DataCamp.

In the next post, we’ll look at more visualizations for process analysis.

To leave a comment for the author, please follow the link and comment on their blog: R on notast.

Continue Reading…


Read More

April 20, 2019

Magister Dixit

“Efficiency is becoming as important a differentiator as speed and scale. As a result, firms are delving deeper into predictive analytics to realize faster time to value and improve operational performance and decision outcomes.” Gabriel Lowy ( January 22, 2015 )

Continue Reading…


Read More

Batch Deployment of WoE Transformations

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

After wrapping up the batch_woe() function today, whose purpose is to let users apply WoE transformations to many independent variables simultaneously, I have completed the development of the major functions in the MOB package needed for model development in a production setting.

The function batch_woe() is basically a wrapper around cal_woe() and has two input parameters. The “data” parameter is the data frame to which we deploy the binning outcomes, and the “slst” parameter is a list of binning specification tables, either the direct output of the batch_bin function or created manually by combining outputs from multiple binning functions.

There are also two components in the output of batch_woe(): a list of PSI tables for the transformed variables and a data frame with a row index and all transformed variables. The default printout is a PSI summary for all input variables to be transformed. As shown below, all PSI values are below 0.1 and therefore none is concerning.

binout <- batch_bin(df, 1)

woeout <- batch_woe(df[sample(seq(nrow(df)), 2000, replace = T), ], binout$BinLst)

#     tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt ...
# psi    0.0027 0.0044        0.0144      0.0011      3e-04       0.0013 ...

str(woeout, max.level = 1)
# List of 2
#  $ psi:List of 11
#  $ df :'data.frame':	2000 obs. of  12 variables:
#  - attr(*, "class")= chr "psiSummary"

head(woeout$df, 1)
#  idx_ woe.tot_derog woe.tot_tr woe.age_oldest_tr woe.tot_open_tr woe.tot_rev_tr ...
#     1       -0.3811    -0.0215           -0.5356         -0.0722        -0.1012 ...

All source code of the MOB package is available online and free (as in free beer) to download and distribute.
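
For readers outside R, the two quantities involved follow textbook definitions, sketched here generically in Python (this is not code from the MOB package):

```python
import math

def woe(goods, bads):
    """Weight of Evidence per bin: ln(distribution of goods / distribution of bads)."""
    total_good, total_bad = sum(goods), sum(bads)
    return [math.log((g / total_good) / (b / total_bad)) for g, b in zip(goods, bads)]

def psi(expected, actual):
    """Population Stability Index between two binned distributions (as proportions)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# A PSI below 0.1 is conventionally read as no concerning shift between the
# development sample and the deployment sample.
shift = psi([0.25, 0.25, 0.25, 0.25], [0.24, 0.26, 0.25, 0.25])
```

In the post's output above, the per-variable `psi` row plays exactly this role: it compares the binned distribution of the deployment sample against the binning baseline.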

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

Continue Reading…


Read More

R Packages worth a look

Generator of Experiments (gexp)
Generates experiments – simulating structured or experimental data as: completely randomized design, randomized block design, latin square design, fact …

Joint Change Point Detection (jcp)
Procedures for joint detection of changes in both expectation and variance in univariate sequences. Performs a statistical test of the null hypothesis …

Statistically Validated Networks (SVN)
Determines networks of significant synchronization between the discrete states of nodes; see Tumminello et al <doi:10.1371/journal.pone.0017994>.

Chaotic Time Series Analysis (DChaos)
Provides several algorithms for the purpose of detecting chaotic signals inside univariate time series. We focus on methods derived from chaos theory w …

Open Trade Statistics API Wrapper and Utility Program (tradestatistics)
Access Open Trade Statistics API from R to download international trade data.

Modelling Adoption Process in Marketing (adoption)
The classical Bass (1969) <doi:10.1287/mnsc.15.5.215> model and the agent based models, such as that by Goldenberg, Libai and Muller (2010) <d …

Continue Reading…


Read More

If you did not already know

Dlib google
Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments. Dlib’s open source licensing allows you to use it in any application, free of charge. …

Variation Network (VarNet) google
This paper presents the Variation Network (VarNet), a generative model providing means to manipulate the high-level attributes of a given input. The originality of our approach is that VarNet is not only capable of handling pre-defined attributes but can also learn the relevant attributes of the dataset by itself. These two settings can be easily combined, which makes VarNet applicable to a wide variety of tasks. Further, VarNet has a sound probabilistic interpretation, which gives us a novel way to navigate the latent spaces as well as means to control how the attributes are learned. We demonstrate experimentally that this model is capable of performing interesting input manipulations and that the learned attributes are relevant and interpretable. …

Federated Reinforcement Learning (FRL) google
In reinforcement learning, building high-quality policies is challenging when the feature space of states is small and the training data is limited. Directly transferring data or knowledge from one agent to another will not work due to the privacy requirements on data and models. In this paper, we propose a novel reinforcement learning approach that considers the privacy requirement and builds a Q-network for each agent with the help of other agents, namely federated reinforcement learning (FRL). To protect the privacy of data and models, we exploit Gaussian differential privacy on the information agents share with each other when updating their local models. In the experiments, we evaluate our FRL framework in two diverse domains, Grid-world and Text2Action, by comparing against various baselines. …
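
The privacy mechanism described, perturbing shared updates with Gaussian noise before aggregation, can be sketched as a generic federated-averaging toy. The noise scale and weight vectors below are illustrative assumptions, not the paper's Q-network code.

```python
import random

def noisy_update(local_weights, sigma=0.1, rng=random):
    """Add Gaussian noise to a weight vector before sharing it, so the raw
    local model is never revealed (the Gaussian-mechanism idea)."""
    return [w + rng.gauss(0.0, sigma) for w in local_weights]

def federated_average(all_updates):
    """Server-side averaging of the (noisy) client updates."""
    n = len(all_updates)
    return [sum(ws) / n for ws in zip(*all_updates)]

random.seed(0)
clients = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]  # three agents' local weights
global_model = federated_average([noisy_update(c, sigma=0.05) for c in clients])
```

Averaging partially cancels the per-client noise, so the global model stays close to the mean of the true local weights while no single client's exact weights are ever transmitted.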

Continue Reading…


Read More

Science and Technology links (April 20th 2019)

  1. Early-career setbacks cause a performance improvement among those who persevere. This is related to the observation that immigrants are four times more likely to become millionaires. In biology, that is called hormesis: by challenging your muscles, you get stronger; by exposing yourself to some radiations or starving a little, you live longer. So you should seek out challenges and expose your kids to difficulties.
  2. We have a significant bias in favor of tall men. Tall men get promoted more often and earn more money; they are also much more successful with women. However, heightism, like ageism, is considered an acceptable form of discrimination. It is fine to mock a man because he is small or old. It is not fine to mock a man for being gay, transgender or black.
  3. Canadians who finish high school, get a full time job and only have children within marriage have less than a one percent chance of being poor.
  4. Currently, a sizeable fraction of men go bald with age, and relatively little can be done about it. There is some good surgery, but it is expensive. Products like minoxidil work, but only to a degree. A new product (clascoterone) has passed a Phase II clinical trial with good results. It seems quite safe and probably more effective than current drugs.
  5. A stem-cell therapy for knee arthritis got solid results during a clinical trial.
  6. It seems that you can bring back some brain function hours after death (in pigs).
  7. We are making progress against the “bubble boy” syndrome.
  8. For every 100 women who earn a bachelor’s degree from US colleges and universities there are 74 men.

Continue Reading…


Read More

Claims about excess road deaths on “4/20” don’t add up

Sam Harper writes:

Since you’ve written about similar papers (that recent NRA study in NEJM, the birthday analysis) before and we linked to a few of your posts, I thought you might be interested in this recent blog post we wrote about a similar kind of study claiming that fatal motor vehicle crashes increase by 12% after 4:20pm on April 20th (an annual cannabis celebration…google it).

The post is by Harper and Adam Palayew, and it’s excellent. Here’s what they say:

A few weeks ago a short paper was published in a leading medical journal, JAMA Internal Medicine, suggesting that, over the 25 years from 1992-2016, excess cannabis consumption after 4:20pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after. Here is the key result from the paper:

In total, 1369 drivers were involved in fatal crashes after 4:20 PM on April 20 whereas 2453 drivers were in fatal crashes on control days during the same time intervals (corresponding to 7.1 and 6.4 drivers in fatal crashes per hour, respectively). The risk of a fatal crash was significantly higher on April 20 (relative risk, 1.12; 95% CI, 1.05-1.19; P = .001).
— Staples JA, Redelmeier DA. The April 20 Cannabis Celebration and Fatal Traffic Crashes in the United States JAMA Int Med, Feb 18, 2018, p.E2
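
The headline relative risk can be recovered from the quoted counts with quick arithmetic (the ~7.67-hour window from 4:20 PM to midnight is implied by the quoted per-hour figures):

```python
# Counts quoted above: 1369 drivers across 25 April 20ths, 2453 across 50 control days.
hours = 24 - (16 + 20 / 60)           # 4:20 PM to midnight: about 7.67 hours
rate_420 = 1369 / (25 * hours)        # ~7.1 drivers in fatal crashes per hour
rate_control = 2453 / (50 * hours)    # ~6.4 per hour
relative_risk = rate_420 / rate_control
```

This reproduces the paper's 7.1 and 6.4 drivers per hour and its relative risk of 1.12 (note the window length cancels in the ratio).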

Naturally, this sparked (heh) considerable media interest, not only because p<.05 and the finding is “surprising”, but also because cannabis is a hot topic these days (and, of course, April 20th happens every year).

But how seriously should we take these findings? Harper and Palayew crunch the numbers:

If we try and back out some estimates of what might have to happen on 4/20 to generate a 12% increase in the national rate of fatal car crashes, it seems less and less plausible that the 4/20 effect is reliable or valid. Let’s give it a shot. . . .

Over the 25 year period [the authors of the linked paper] tally 1369 deaths on 4/20 and 2453 deaths on control days, which works out to average deaths on those days each year of 1369/25 ~ 55 on 4/20 and 2453/25/2 ~ 49 on control days, an average excess of about 6 deaths each year. If we use our estimates of post-1620h VMT above, that works out to around 55/2.5 = 22 fatal crashes per billion VMT on 4/20 vs. 49/2.5 = 19.6 on control days. . . .

If we don’t assume the relative risk changes on 4/20, just more people smoking, what proportion of the population would need to be driving while high to generate a rate of 22 per billion VMT? A little algebra tells us that to get to 22 we’d need to see something like . . . 15%! That’s nearly one-sixth of the population driving while high on 4/20 from 4:20pm to midnight, which doesn’t, absent any other evidence, seem very likely. . . . Alternatively, one could also raise the relative risk among cannabis drivers to 6x the base rate and get something close. Or some combination of the two. This means either the nationwide prevalence of driving while using cannabis increases massively on 4/20, or the RR of a fatal crash with the kind of cannabis use happening on 4/20 is absurdly high. Neither of these scenarios seem particularly likely based on what we currently know about cannabis use and driving risks.
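
The back-of-envelope algebra quoted above can be reproduced directly. Writing the 4/20 rate as a mixture of unimpaired and impaired driving, 22 = 19.6 × ((1 − p) + p × RR), and solving for the prevalence p at a given relative risk RR gives the following (an illustrative check; the relative-risk values plugged in below are assumptions for the sketch, not figures from the post):

```python
def required_prevalence(rr, rate_420=22.0, rate_control=19.6):
    """Fraction of drivers who must be impaired (at relative risk rr) to lift
    the control-day fatal-crash rate (per billion VMT) to the 4/20 rate."""
    excess = rate_420 / rate_control - 1.0   # ~0.12, the 12% excess
    return excess / (rr - 1.0)
```

With RR ≈ 1.8 this gives p ≈ 0.15, i.e. roughly 15% of all drivers impaired; alternatively, at RR = 6 the required prevalence drops to about 2.4%, illustrating the prevalence-versus-risk trade-off the post describes.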

They also look at the big picture:

Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part. It’s similar to that NRA study (see link at beginning of this post) in that the numbers just don’t add up.

Harper sent me this email last year. I wrote the above post and scheduled it for 4/20. In the meantime, he had more to report:

We published a replication paper with some additional analysis. The original paper in question (in JAMA Internal Med no less) used a design (comparing an index ‘window’ on a given day to the same ‘window’ +/- 1 week) similar to some others that you have blogged about (the NRA study, for example), and I think it merits similar skepticism (a sizeable fraction of the population would need to be driving while drugged/intoxicated on this day to raise the national rate by such a margin).

As I said, my co-author Adam Palayew and I replicated that paper’s findings but also showed that their results seem much more consistent with daily variations in traffic crashes throughout the year (lots of noise) and we used a few other well known “risky” days (July 4th is quite reliable for excess deaths from traffic crashes) as a comparison. We also used Stan to fit some partial pooling models to look at how these “effects” may vary over longer time windows.

I wrote an updated blog post about it here.

And the gated version of the paper is now posted on Injury Prevention’s website, but we have made a preprint and all of the raw data and code to reproduce our work available at my Open Science page.


Continue Reading…


Read More

If you did not already know

MobiRNN google
In this paper, we explore optimizations to run Recurrent Neural Network (RNN) models locally on mobile devices. RNN models are widely used for Natural Language Processing, Machine Translation, and other tasks. However, existing mobile applications that use RNN models do so on the cloud. To address privacy and efficiency concerns, we show how RNN models can be run locally on mobile devices. Existing work on porting deep learning models to mobile devices focuses on Convolutional Neural Networks (CNNs) and cannot be applied directly to RNN models. In response, we present MobiRNN, a mobile-specific optimization framework that implements GPU offloading specifically for mobile GPUs. Evaluations using an RNN model for activity recognition show that MobiRNN significantly decreases the latency of running RNN models on phones. …

Graph Structured Recurrent Neural Network (GSRNN) google
We present a generic framework for spatio-temporal (ST) data modeling, analysis, and forecasting, with a special focus on data that is sparse in both space and time. Our multi-scaled framework is a seamless coupling of two major components: a self-exciting point process that models the macroscale statistical behaviors of the ST data and a graph structured recurrent neural network (GSRNN) to discover the microscale patterns of the ST data on the inferred graph. This novel deep neural network (DNN) incorporates the real time interactions of the graph nodes to enable more accurate real time forecasting. The effectiveness of our method is demonstrated on both crime and traffic forecasting. …

Progressive Sampling-Based Bayesian Optimization google
Purpose: Machine learning is broadly used for clinical data analysis. Before training a model, a machine learning algorithm must be selected. Also, the values of one or more model parameters termed hyper-parameters must be set. Selecting algorithms and hyper-parameter values requires advanced machine learning knowledge and many labor-intensive manual iterations. To lower the bar to machine learning, miscellaneous automatic selection methods for algorithms and/or hyper-parameter values have been proposed. Existing automatic selection methods are inefficient on large data sets. This poses a challenge for using machine learning in the clinical big data era. Methods: To address the challenge, this paper presents progressive sampling-based Bayesian optimization, an efficient and automatic selection method for both algorithms and hyper-parameter values. Results: We report an implementation of the method. We show that compared to a state of the art automatic selection method, our method can significantly reduce search time, classification error rate, and standard deviation of error rate due to randomization. Conclusions: This is major progress towards enabling fast turnaround in identifying high-quality solutions required by many machine learning-based clinical data analysis tasks. …

Continue Reading…


Read More

Videos: Sublinear Algorithms and Nearest-Neighbor Search workshop, Nov. 27 – Nov. 30, 2018 (Simons Institute and Kavli Foundation)

The Sublinear Algorithms and Nearest-Neighbor Search workshop is part of the Program on Foundations of Data Science sponsored by the Simons Institute for the Theory of Computing at Berkeley and the Kavli Foundation. It took place Nov. 27 – Nov. 30, 2018 in Berkeley. Thanks to the organizers who made this workshop a reality: Robert Krauthgamer, Artur Czumaj, Aarti Singh, and Rachel Ward. The introduction for the workshop goes as follows:

Many applications require extreme computational efficiency, where the usage of resources, like runtime, storage, or data samples, is sublinear in the size of the input, and the workshop will cover different areas where this topic is studied, and explore connections between them. Specifically, this topic received a lot of attention within Theoretical Computer Science, often motivated by advances in various application areas, and the workshop will review recent developments, such as algorithms for data streams, sketching, dimensionality reduction, graph sparsification, and property testing. In addition, the workshop will examine connections with linear and sublinear methods in Machine Learning, Statistics, Signal Processing and related areas, such as the well-known connections with compressed sensing, sparse recovery, and nearest-neighbor search. Some more recent connections are to online bandit problems and to distributed optimization, where constraints on algorithmic resources are just starting to be considered; such problems may be amenable to techniques from data-stream algorithms.

Here are the videos:

Learning-Augmented Sketches for Frequency Estimation, Piotr Indyk (Massachusetts Institute of Technology)
Classical streaming algorithms typically do not leverage data properties or patterns in their input. We propose to augment such algorithms with a learning model that enables them to exploit data properties without being specific to a particular pattern or property. We focus on the problem of estimating the frequency of elements in a data stream, a fundamental problem with applications in network measurements, natural language processing, and security. We propose a new framework for designing frequency estimation streaming algorithms that automatically learn to leverage the properties of the input data. We present a theoretical analysis of the proposed algorithms and prove that, under natural assumptions, they have lower space complexity than prior algorithms. We also evaluate our algorithms on two problems, monitoring Internet traffic and tracking the popularity of search queries, and demonstrate their performance gains. Joint work with Chen-Yu Hsu, Dina Katabi and Ali Vakilian.
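A minimal sketch of the learned-sketch idea: a plain Count-Min sketch plus a stand-in oracle (here just a hard-coded set, `predicted_heavy`, an invented name) that routes predicted heavy hitters to exact counters while everything else goes into the sketch. This illustrates the structure of the framework, not the authors' implementation:

```python
import numpy as np
from collections import Counter

class CountMin:
    """A basic Count-Min sketch: estimates never undercount."""
    def __init__(self, width, depth, seed=0):
        rng = np.random.default_rng(seed)
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.salts = rng.integers(1, 2**31 - 1, size=depth)  # one hash per row

    def _bucket(self, row, item):
        return hash((int(self.salts[row]), item)) % self.width

    def add(self, item):
        for r in range(self.depth):
            self.table[r, self._bucket(r, item)] += 1

    def estimate(self, item):
        return min(self.table[r, self._bucket(r, item)] for r in range(self.depth))

def learned_sketch_counts(stream, predicted_heavy, width=64, depth=4):
    """Items the (hypothetical) learned oracle flags as heavy get exact
    counters; the rest share the sketch, so collisions only hurt light items."""
    exact, cm = Counter(), CountMin(width, depth)
    for item in stream:
        if item in predicted_heavy:
            exact[item] += 1
        else:
            cm.add(item)
    return exact, cm

stream = ["a"] * 100 + ["b"] * 50 + list("cdefg") * 3
exact, cm = learned_sketch_counts(stream, predicted_heavy={"a", "b"})
# exact["a"] is 100 exactly; cm.estimate("c") is at least 3 (overestimate-only)
```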

In sparse recovery/compressed sensing, one can estimate a k-sparse vector in n dimensions with only Theta(k log n) nonadaptive linear measurements. With adaptivity -- if each measurement can be based on the previous ones -- this reduces to O(k log log n). But what happens if the measurement matrices can only be chosen in a few rounds, as seen (for example) in constant-pass streaming algorithms? This talk will give upper and lower bounds, showing (up to a log^* k factor) that R rounds of adaptivity require Theta(k log^{1/R} n) measurements.

Universal Sketches, Vladimir Braverman (Johns Hopkins University)
Streaming and sketching algorithms have found many applications in computer science and other areas. A typical sketching algorithm approximates one function. Given a class of functions F, it is natural to ask if it is possible to compute a single sketch S that will approximate every function f from F. We call S a “universal sketch” for F. In this talk we will discuss results on universal sketches for several classes of functions. For example, we will describe a sketch that approximates a sub-class of symmetric norms (a norm is symmetric if it is invariant under sign-flips and coordinate-permutations) and outline a connection between universal sketches and concentration of measure and Milman’s theorem. Also, we will describe a recent result for subset (i.e. 0-1 weighted) l0 and l1 norms. For these problems we obtain nearly optimal upper and lower bounds for streaming space complexity. We will discuss the applicability of universal sketches for Software Defined Networks (SDN). For SDN, we will present the UnivMon (short for Universal Monitoring) framework that can simultaneously achieve both generality and high fidelity across a broad spectrum of monitoring tasks.
This talk is based on joint works with Jaroslaw Blasiok, Stephen R. Chestnut, Robert Krauthgamer and Lin F. Yang (STOC 2017), with Robert Krauthgamer and Lin F. Yang (submitted) and with Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, and Vyas Sekar (HotNets 2015, SIGCOMM 2016).

Let (X,d) be an n-point metric space. We assume that (X,d) is given in the distance oracle model, i.e., X={1,...,n} and for every pair of points x,y from X we can query their distance d(x,y) in constant time. A k-nearest neighbor (k-NN) graph for (X,d) is a directed graph G=(V,E) that, for every vertex v, has an edge to each of v's k nearest neighbors. We use cost(G) to denote the sum of edge weights of G.
In this paper, we study the problem of approximating cost(G) in sublinear time, when we are given oracle access to the metric space (X,d) that defines G. Our goal is to develop an algorithm that solves this problem faster than the time required to compute G. To this end, we develop an algorithm that in time Õ(min(n k^{3/2} / eps^6, n^2 / (eps^2 k))) computes an estimate K for cost(G) that satisfies, with probability at least 2/3,
|cost(G) − K| ≤ eps · (cost(G) + mst(X)),
where mst(X) denotes the cost of the minimum spanning tree of (X,d).
Joint work with Artur Czumaj. Work was done as part of the speaker's affiliation with Google Switzerland.
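For intuition, cost(G) itself is straightforward to compute exactly in quadratic time; the point of the result above is to beat that baseline. A small sketch of the exact (non-sublinear) computation on a toy point set:

```python
import numpy as np

def knn_graph_cost(points, k):
    """Exact cost(G): sum of edge weights of the directed k-NN graph,
    computed with a quadratic-time pairwise-distance pass (the baseline
    the sublinear estimator is compared against)."""
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)          # exclude self-distances
    nearest = np.sort(d, axis=1)[:, :k]  # k smallest distances per vertex
    return nearest.sum()

# Three mutual nearest neighbors at distance 1 plus one far-away point.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
cost = knn_graph_cost(pts, k=1)  # 1 + 1 + 1 + sqrt(41)
```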

Independent samples from an unknown probability distribution p on a domain of size k are distributed across n players, with each player holding one sample. Each player can send a message to a central referee in a simultaneous message passing (SMP) model of communication, whose goal is to solve a pre-specified inference problem. The catch, however, is that each player cannot simply send their own sample to the referee; instead, the message they send must obey some (local) information constraint. For instance, each player may be limited to communicating only L bits, where L << log k; or they may seek to reveal as little information as possible, and preserve local differential privacy. We propose a general formulation for inference problems in this distributed setting, and instantiate it to two fundamental inference questions, learning and uniformity testing. We study the role of randomness for those questions, and obtain striking separations between public- and private-coin protocols for the latter, while showing the two settings are equally powerful for the former. (Put differently, sharing with your neighbors does help a lot for the test, but not really for the learning.) Based on joint works with Jayadev Acharya (Cornell University), Cody Freitag (Cornell University), and Himanshu Tyagi (IISc Bangalore).

We initiate the study of the role of erasures in local decoding and use our understanding to prove a separation between erasure-resilient and tolerant property testing. Local decoding in the presence of errors has been extensively studied, but has not been considered explicitly in the presence of erasures. Motivated by applications in property testing, we begin our investigation with local list decoding in the presence of erasures. We prove an analog of a famous result of Goldreich and Levin on local list decodability of the Hadamard code. Specifically, we show that the Hadamard code is locally list decodable in the presence of a constant fraction of erasures, arbitrarily close to 1, with list sizes and query complexity better than in the Goldreich-Levin theorem. We use this result to exhibit a property which is testable with a number of queries independent of the length of the input in the presence of erasures, but requires a number of queries that depends on the input length, n, for tolerant testing. We further study approximate locally list decodable codes that work against erasures and use them to strengthen our separation by constructing a property which is testable with a constant number of queries in the presence of erasures, but requires n^Ω(1) queries for tolerant testing. Next, we study the general relationship between local decoding in the presence of errors and in the presence of erasures. We observe that every locally (uniquely or list) decodable code that works in the presence of errors also works in the presence of twice as many erasures (with the same parameters up to constant factors). We show that there is also an implication in the other direction for locally decodable codes (with unique decoding): specifically, that the existence of a locally decodable code that works in the presence of erasures implies the existence of a locally decodable code that works in the presence of errors and has related parameters.
However, it remains open whether there is an implication in the other direction for locally list decodable codes. (Our Hadamard result shows that there has to be some difference in parameters for some settings.) We relate this question to other open questions in local decoding. Based on joint work with Noga Ron-Zewi and Nithin Varma.

Sublinear Time Local-Access Random Generators, Ronitt Rubinfeld (Massachusetts Institute of Technology)

No abstract available.

Locality sensitive hashing (LSH) is a popular technique for nearest neighbor search in high dimensional data sets. Recently, a new view at LSH as a biased sampling technique has been fruitful for density estimation problems in high dimensions. Given a set of points and a query point, the goal (roughly) is to estimate the density of the data set around the query. One way to formalize this is by kernel density estimation: Given a function that decays with distance and represents the "influence" of a data point at the query, sum up this influence function over the data set. Yet another way to formalize this problem is by counting the number of data points within a certain radius of the query. While these problems can easily be solved by making a linear pass over the data, this can be prohibitive for large data sets and multiple queries. Can we preprocess the data so as to answer queries efficiently? This talk will survey several recent papers that use locality sensitive hashing to design unbiased estimators for such density estimation problems and their extensions. This talk will survey joint works with Arturs Backurs, Piotr Indyk, Vishnu Natchu, Paris Syminelakis and Xian (Carrie) Wu.
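To make the problem concrete, here is the linear-pass kernel density sum the talk refers to, together with the simplest unbiased sampling estimator. Note the hedge: this uses uniform sampling, whereas the LSH-based estimators in the talk sample non-uniformly (hash-biased) precisely to reduce variance on hard queries:

```python
import numpy as np

def kde_exact(data, query, bandwidth=1.0):
    """Linear-pass kernel density: average a Gaussian influence function
    over every data point."""
    return np.mean(np.exp(-((data - query) ** 2) / (2 * bandwidth**2)))

def kde_sampled(data, query, m, bandwidth=1.0, seed=0):
    """Unbiased estimate from m uniform samples; avoids the linear pass
    at the cost of variance (which LSH-biased sampling would reduce)."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(data, size=m, replace=True)
    return np.mean(np.exp(-((sample - query) ** 2) / (2 * bandwidth**2)))

# A two-cluster dataset: density near 0 is dominated by the first cluster.
data = np.concatenate([np.random.default_rng(1).normal(0, 1, 5000),
                       np.random.default_rng(2).normal(6, 1, 5000)])
exact = kde_exact(data, query=0.0)
approx = kde_sampled(data, query=0.0, m=500)
```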

We consider algorithms that take an unlabeled data set and label it in its entirety, given the ability to interact with a human expert. The goal is to minimize the amount of interaction while producing a labeling that satisfies an (epsilon, delta) guarantee: with probability at least 1-delta over the randomness in the algorithm, at most an epsilon fraction of the labels are incorrect. Scenario 1: The algorithm asks the expert for labels of specific points. This is the standard problem of active learning, except that the final product is a labeled data set rather than a classifier. Scenario 2: The expert also provides "weak rules" or helpful features. We will summarize the state of the art on these problems, in terms of promising algorithms and statistical guarantees, and identify key challenges and open problems.

We consider the problem of estimating the value of MAX-CUT in a graph in the streaming model of computation. At one extreme, there is a trivial 2-approximation for this problem that uses only O(log n) space, namely, count the number of edges and output half of this value as the estimate for the size of the MAX-CUT. On the other extreme, for any fixed eps > 0, if one allows Õ(n) space, a (1+eps)-approximate solution to the MAX-CUT value can be obtained by storing an Õ(n)-size sparsifier that essentially preserves the MAX-CUT value.
Our main result is that any (randomized) single pass streaming algorithm that breaks the 2-approximation barrier requires Ω(n) space, thus resolving the space complexity of any non-trivial approximations of the MAX-CUT value to within polylogarithmic factors in the single pass streaming model. We achieve the result by presenting a tight analysis of the Implicit Hidden Partition Problem introduced by Kapralov et al. [SODA'17] for an arbitrarily large number k of players. In this problem a number of players receive random matchings of Ω(n) size together with random bits on the edges, and their task is to determine whether the bits correspond to parities of some hidden bipartition, or are just uniformly random.
Unlike all previous Fourier analytic communication lower bounds, our analysis does not directly use bounds on the l2 norm of Fourier coefficients of a typical message at any given weight level that follow from hypercontractivity. Instead, we use the fact that graphs received by players are sparse (matchings) to obtain strong upper bounds on the l1 norm of the Fourier coefficients of the messages of individual players using their special structure, and then argue, using the convolution theorem, that similar strong bounds on the l1 norm are essentially preserved (up to an exponential loss in the number of players) once messages of different players are combined. We feel that our main technique is likely of independent interest.
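The trivial 2-approximation mentioned at the start is a one-liner; a tiny sketch comparing it with brute force on a toy graph makes the guarantee concrete (m/2 ≤ MAX-CUT ≤ m is exactly what makes edge counting a 2-approximation):

```python
def stream_maxcut_estimate(edge_stream):
    """The O(log n)-space baseline: count edges, report half."""
    m = sum(1 for _ in edge_stream)
    return m / 2

def exact_maxcut(n, edges):
    """Brute force over all 2^n bipartitions (tiny graphs only)."""
    best = 0
    for mask in range(1 << n):
        cut = sum(1 for u, v in edges if ((mask >> u) & 1) != ((mask >> v) & 1))
        best = max(best, cut)
    return best

edges = [(0, 1), (1, 2), (2, 0), (0, 3)]  # a triangle plus a pendant edge
est = stream_maxcut_estimate(iter(edges))
opt = exact_maxcut(4, edges)
# est <= opt <= 2 * est, as the counting argument guarantees
```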

Any graph with maximum degree Delta admits a proper vertex coloring with Delta+1 colors that can be found via a simple sequential greedy algorithm in linear time and space. But can one find such a coloring via a sublinear algorithm? In this talk, I present new algorithms that answer this question in the affirmative for several canonical classes of sublinear algorithms including graph streaming, sublinear time, and massively parallel computation (MPC) algorithms. At the core of these algorithms is a remarkably simple meta-algorithm for the (Delta+1) coloring problem: Sample O(log n) colors for each vertex uniformly at random from the Delta+1 colors and then find a proper coloring of the graph using the sampled colors; our main structural result states that the sampled set of colors with high probability contains a proper coloring of the input graph.
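The meta-algorithm above can be sketched directly. The constant in the O(log n) sample size below is arbitrary, and the retry loop papers over the low-probability failure event that the structural result bounds; this is an illustration, not the paper's streaming/MPC implementation:

```python
import random

def palette_coloring(adj, delta, seed=0):
    """Sample ~O(log n) colors per vertex from {0,...,delta}, then greedily
    list-color using only the sampled palettes; retry on the rare failure."""
    rng = random.Random(seed)
    n = len(adj)
    sample_size = max(1, 2 * n.bit_length())  # ~O(log n); constant is arbitrary
    while True:
        palettes = [rng.sample(range(delta + 1), min(sample_size, delta + 1))
                    for _ in range(n)]
        coloring, ok = {}, True
        for v in range(n):
            used = {coloring[u] for u in adj[v] if u in coloring}
            free = [c for c in palettes[v] if c not in used]
            if not free:
                ok = False  # sampled palette exhausted: resample and retry
                break
            coloring[v] = free[0]
        if ok:
            return coloring

# A 5-cycle: delta = 2, so delta + 1 = 3 colors suffice.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
coloring = palette_coloring(adj, delta=2)
assert all(coloring[u] != coloring[v] for u in adj for v in adj[u])
```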

Efficient algorithms for k-means clustering frequently converge to suboptimal partitions, and given a partition, it is difficult to detect k-means optimality. We discuss an a posteriori certifier of approximate optimality for k-means clustering based on Peng and Wei's semidefinite relaxation of k-means.

Efficient algorithms for k-means clustering frequently converge to suboptimal partitions, and given a partition, it is difficult to detect k-means optimality. In this paper, we develop an a posteriori certifier of approximate optimality for k-means clustering. The certifier is a sub-linear Monte Carlo algorithm based on Peng and Wei's semidefinite relaxation of k-means. In particular, solving the relaxation for small random samples of the dataset produces a high-confidence lower bound on the k-means objective, and being sub-linear, our algorithm is faster than k-means++ when the number of data points is large. If the data points are drawn independently from any mixture of two Gaussians over R^m with identity covariance, then with probability 1 − O(1/m), our poly(m)-time algorithm produces a 3-approximation certificate with 99% confidence (no separation required). We also introduce a linear-time Monte Carlo algorithm that produces an O(k) additive approximation lower bound.

We provide a novel, and to the best of our knowledge the first, algorithm for high dimensional sparse regression with corruptions in explanatory and/or response variables. Our algorithm recovers the true sparse parameters in the presence of a constant fraction of arbitrary corruptions. Our main contribution is a robust variant of Iterative Hard Thresholding. Using this, we provide accurate estimators with sub-linear sample complexity. Our algorithm consists of a novel randomized outlier removal technique for robust sparse mean estimation that may be of interest in its own right: it is orderwise more efficient computationally than existing algorithms, and succeeds with high probability, thus making it suitable for general use in iterative algorithms.
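A hedged sketch of the overall shape: plain Iterative Hard Thresholding with naive residual trimming as the outlier-removal step. The trimming here is a simplified stand-in for the paper's randomized removal technique, not their algorithm, and all parameters are illustrative:

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def robust_iht(A, y, k, trim_frac=0.1, iters=200):
    """IHT where each gradient step ignores the samples with the largest
    residuals (a naive proxy for the paper's outlier removal)."""
    n, d = A.shape
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # conservative step size
    keep = n - int(trim_frac * n)
    x = np.zeros(d)
    for _ in range(iters):
        r = y - A @ x
        good = np.argsort(np.abs(r))[:keep]   # drop suspected outliers
        x = hard_threshold(x + step * (A[good].T @ r[good]), k)
    return x

rng = np.random.default_rng(0)
n, d, k = 200, 50, 3
A = rng.normal(size=(n, d))
x_true = np.zeros(d); x_true[:k] = [3.0, -2.0, 1.5]
y = A @ x_true
y[:10] += 50.0                                # corrupt 5% of the responses
x_hat = robust_iht(A, y, k)
```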

We study the PCA and column subset selection problems in matrices in an online setting, where the columns arrive one after the other. In the context of column subset selection, the goal is to decide whether to include or discard a column, as it arrives. We design a simple algorithm that includes at most O(k polylog n) columns overall and achieves a multiplicative (1+eps) error compared to the best rank-k approximation of the full matrix. This result may be viewed as an analog of the classic result of Myerson on online clustering.

The need for fast computation typically requires tradeoffs with statistical accuracy; here we are interested in whether computation can be significantly improved without trading-off accuracy.
In particular, for best possible accuracy in NN prediction, the number of neighbors generally needs to grow as a root of n (sample size), consequently limiting NN-search (any technique) to order of root of n complexity; in other words, expensive prediction seems unavoidable, even while using fast search methods, if accuracy is to be optimal. Unfortunately, the usual alternative is to tradeoff accuracy.
Interestingly, we show that it is possible to maintain accuracy, while reducing computation (at prediction time) to just O(log n), through simple bias and/or variance correction tricks applied after data quantization or subsampling, together with (black box) fast search techniques. Furthermore, our analysis yields clear insights into how much quantization or subsampling is tolerable if optimal accuracy is to be achieved.
Our theoretical insights are validated through extensive experiments with large datasets from various domains.
The talk is based on a series of works with N. Verma, and with L. Xue.

We establish a generic reduction from nonlinear spectral gaps of metric spaces to space partitions, in the form of data-dependent Locality-Sensitive Hashing. This yields a new approach to the high-dimensional Approximate Near Neighbor Search problem (ANN). Using this reduction, we obtain a new ANN data structure under an arbitrary d-dimensional norm, where the query algorithm makes only a sublinear number of probes into the data structure. Most importantly, the new data structure achieves a O(log d) approximation for an arbitrary norm. The only other such generic approach, via John's ellipsoid, would achieve square-root-d approximation only. Joint work with Assaf Naor, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten.

I will show the first approximate nearest neighbor search data structure for a general d-dimensional normed space with sub-polynomial in d approximation.
The main tool is a finite-dimensional quantitative version of a theorem of Daher, which yields a Hölder homeomorphism between small perturbations of a normed space of interest and a Euclidean space. To make Daher's theorem algorithmic, we employ convex programming to compute the norm of a vector in a space, which is the result of complex interpolation between two given normed spaces.
Based on a joint work (FOCS 2018) with Alex Andoni, Assaf Naor, Sasho Nikolov and Erik Waingarten.

Theoretical work on high-dimensional nearest neighbor search has focused on the setting where a single point is sought within a known search radius, and an acceptable approximation ratio c is given. Locality Sensitive Hashing is a powerful framework for addressing this problem. In practice one usually seeks the (exact) k nearest points, the search radius is unknown, and the parameter c must be chosen in a way that depends on the data distribution. Though reductions of the latter problem to the former exist, they incur polylogarithmic overhead in time and/or space, which in turn make them unattractive in many practical settings. We address this discrepancy between theory and practice by suggesting new, simple, more efficient reductions for solving the k-Nearest Neighbor search problem using Locality Sensitive Hashing. Joint work with Tobias Christiani and Mikkel Thorup.

The main story of this talk is how theoretical ideas on randomized sampling algorithms became part of production code at Twitter. The specific context is finding pairs of similar items: a classic algorithmic problem that is an integral part of recommendation systems. In most incarnations, it boils down to finding high inner products among a large collection of vectors, or alternately high entries in a matrix product. Despite a rich literature on this topic (and despite Twitter's significant compute resources), none of the existing methods scaled to "industrial sized" inputs, which exceed hundreds of billions of non-zeros. I will talk about a distributed algorithm for this problem, that combines low-dimension projections (hashes) with path-sampling techniques (wedges). There is some cute math behind the algorithm, and we were able to run it in production on Twitter's recommendation system. Joint work with Aneesh Sharma (Twitter) and Ashish Goel (Stanford).
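The wedge-sampling idea can be illustrated on binary data: pick a coordinate with probability proportional to its squared degree, then a random pair of vectors containing it; pairs with high overlap (inner product) show up often among the samples. A toy, single-machine version under those simplifications (the production algorithm additionally combines this with hashing and runs distributed):

```python
import random
from collections import Counter

def wedge_sample_pairs(sets, num_samples, seed=0):
    """Sample wedges (a, j, b): an item j with probability proportional to
    its degree squared, then two random sets containing j. A pair's sample
    frequency is proportional to its overlap, so frequent pairs in the
    sample are candidate similar pairs."""
    rng = random.Random(seed)
    inverted = {}                          # item -> ids of sets containing it
    for sid, s in enumerate(sets):
        for j in s:
            inverted.setdefault(j, []).append(sid)
    items = list(inverted)
    weights = [len(inverted[j]) ** 2 for j in items]
    counts = Counter()
    for _ in range(num_samples):
        j = rng.choices(items, weights=weights)[0]
        a, b = rng.choice(inverted[j]), rng.choice(inverted[j])
        if a != b:
            counts[tuple(sorted((a, b)))] += 1
    return counts

# Sets 0 and 1 overlap in 3 items; every other pair overlaps in at most 1.
sets = [{1, 2, 3, 4}, {1, 2, 3, 5}, {6, 7}, {7, 8}]
counts = wedge_sample_pairs(sets, num_samples=2000)
top_pair = counts.most_common(1)[0][0]
```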

In this talk I will discuss how to recover spectral approximations to broad classes of structured matrices using only a polylogarithmic number of adaptive linear measurements to either the matrix or its inverse. Leveraging this result I will discuss how to achieve faster algorithms for solving a variety of linear algebraic problems including solving linear systems in the inverse of symmetric M-matrices (a generalization of Laplacian systems), solving linear systems that are constant spectral approximations of Laplacians (or more generally, SDD matrices), and recovering a spectral sparsifier of a graph using only a polylogarithmic number of matrix vector multiplies. More broadly this talk will show how to leverage a number of recent approaches to spectral sparsification towards expanding the robustness and scope of recent nearly linear time linear system solving research, and providing general matrix recovery machinery that may serve as a stepping stone for faster algorithms. This talk reflects joint work with Arun Jambulapati and Kiran Shiragur.

Spatial Scan Statistics measure and detect anomalous spatial behavior, specifically they identify geometric regions where significantly more of a measured characteristic is found than would be expected from the background distribution. These techniques have been used widely in geographic information science, such as to pinpoint disease outbreaks. However, until recently, available algorithms and software only scaled to at most a thousand or so spatial records. In this work I will describe how using coresets, efficient constructions, and scanning algorithms, we have developed new algorithms and software that easily scales to millions or more data points. Along the way we provide new efficient algorithms and constructions for eps-samples and eps-nets for various geometric range spaces. This is a case where subtle theoretical improvements of old structures from discrete geometry actually result in substantial empirical improvements.


Continue Reading…


Read More

What's new on arXiv

Compressed Indexes for Fast Search of Semantic Data

The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is devising a compressed data structure to compactly represent RDF triples by guaranteeing, at the same time, fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. In this work, we propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis conducted over a wide range of publicly available real-world datasets, reveals that our best space/time trade-off configuration substantially outperforms existing solutions at the state-of-the-art, by taking 30% to 60% less space and speeding up query execution by a factor of 2x to 81x.
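The trie layout idea can be illustrated with a toy nested-dictionary index over (subject, predicate, object) triples: one level per component, so shared prefixes are stored once and triple patterns resolve by walking only matching branches. This shows only the prefix-sharing structure, not the compression techniques the paper contributes (all names below are invented):

```python
def build_triple_index(triples):
    """A toy trie over (subject, predicate, object): subjects share one
    node, their predicates the next level, objects the leaves."""
    root = {}
    for s, p, o in triples:
        root.setdefault(s, {}).setdefault(p, set()).add(o)
    return root

def match(index, s=None, p=None, o=None):
    """Resolve a triple pattern; None acts as a wildcard."""
    for s2, preds in index.items():
        if s is not None and s2 != s:
            continue
        for p2, objs in preds.items():
            if p is not None and p2 != p:
                continue
            for o2 in objs:
                if o is None or o2 == o:
                    yield (s2, p2, o2)

triples = [("alice", "knows", "bob"), ("alice", "knows", "carol"),
           ("bob", "age", "42")]
idx = build_triple_index(triples)
results = sorted(match(idx, s="alice", p="knows"))
# → [("alice", "knows", "bob"), ("alice", "knows", "carol")]
```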

Why Are the ARIMA and SARIMA not Sufficient

The autoregressive moving average (ARMA) model and its variants, such as the autoregressive integrated moving average (ARIMA) and seasonal ARIMA (SARIMA), hold a significant position in the time series analysis community. The ARMA model can describe a rational-spectra wide-sense stationary stochastic process, using past information to approximate the underlying dynamics of the process of interest so that we can make predictions of the future. As its supplements, the ARIMA and SARIMA, collectively referred to as S-ARIMA for brevity, aim to transform a non-stationary or irrational-spectra wide-sense stationary stochastic process into a stationary, rational one by applying difference and seasonal difference operators of proper order, so that the result can follow the philosophy of the ARMA model. However, such difference operators rest on empirical experience, without the exact rationale behind them being explored. Therefore, we never know why they work well in some settings and poorly in others. In this paper, we investigate the power of the (seasonal) difference operator from the perspectives of spectral analysis, linear system theory, and digital filtering, and point out the characteristics and limitations of the (seasonal) difference operator and S-ARIMA. In addition, we present a general operator that transforms a non-stationary and/or irrational-spectra wide-sense stationary stochastic process into a stationary and/or rational one. In the end, we show the overall methodology, ARMA-SIN, for predicting a non-stationary and/or wide-sense stationary stochastic process. As a closing note, we also discuss the nature, philosophy, effectiveness, insufficiency, and improvement of canonical time series decomposition (like STL and X11) and time series smoothing (like exponential smoothing, Holt's, and moving average) methods, and demonstrate that they are special cases of ARMA-SIN.
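The role of the difference operator (the "I" in ARIMA) that the abstract scrutinizes is easy to demonstrate: applied to a random walk, the first difference exactly recovers the stationary increments, which is what makes an ARMA fit possible afterwards. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random walk is non-stationary: its variance grows with time.
noise = rng.normal(0, 1, 10_000)     # stationary white-noise increments
walk = np.cumsum(noise)

# The difference operator (1 - B) recovers the increments exactly.
diffed = np.diff(walk)
assert np.allclose(diffed, noise[1:])

# The differenced series has roughly unit variance throughout,
# while the walk's spread keeps growing.
```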

HARK Side of Deep Learning — From Grad Student Descent to Automated Machine Learning

Recent advancements in machine learning research, i.e., deep learning, introduced methods that surpass conventional algorithms as well as humans in several complex tasks, ranging from detection of objects in images and speech recognition to playing difficult strategic games. However, the current methodology of machine learning research and, consequently, implementations of real-world applications of such algorithms seem to have a recurring HARKing (Hypothesizing After the Results are Known) issue. In this work, we elaborate on the algorithmic, economic and social reasons and consequences of this phenomenon. We present examples from current common practices of conducting machine learning research (e.g. avoidance of reporting negative results) and failures of the generalization ability of the proposed algorithms and datasets in actual real-life usage. Furthermore, a potential future trajectory of machine learning research and development from the perspective of accountable, unbiased, ethical and privacy-aware algorithmic decision making is discussed. We would like to emphasize that with this discussion we neither claim to provide an exhaustive argumentation nor blame any specific institution or individual for the raised issues. This is simply a discussion put forth by us, insiders of the machine learning field, reflecting on our own practices.

Semantically Aligned Bias Reducing Zero Shot Learning

Zero shot learning (ZSL) aims to recognize unseen classes by exploiting semantic relationships between seen and unseen classes. Two major problems faced by ZSL algorithms are the hubness problem and the bias towards the seen classes. Existing ZSL methods focus on only one of these problems in the conventional and generalized ZSL setting. In this work, we propose a novel approach, Semantically Aligned Bias Reducing (SABR) ZSL, which focuses on solving both the problems. It overcomes the hubness problem by learning a latent space that preserves the semantic relationship between the labels while encoding the discriminating information about the classes. Further, we also propose ways to reduce the bias of the seen classes through a simple cross-validation process in the inductive setting and a novel weak transfer constraint in the transductive setting. Extensive experiments on three benchmark datasets suggest that the proposed model significantly outperforms existing state-of-the-art algorithms by ~1.5-9% in the conventional ZSL setting and by ~2-14% in the generalized ZSL for both the inductive and transductive settings.

Predicting Time-to-Failure of Plasma Etching Equipment using Machine Learning

Predicting unscheduled breakdowns of plasma etching equipment can reduce maintenance costs and production losses in the semiconductor industry. However, plasma etching is a complex procedure and it is hard to capture all relevant equipment properties and behaviors in a single physical model. Machine learning offers an alternative for predicting upcoming machine failures based on relevant data points. In this paper, we describe three different machine learning tasks that can be used for that purpose: (i) predicting Time-To-Failure (TTF), (ii) predicting health state, and (iii) predicting TTF intervals of an equipment. Our results show that trained machine learning models can outperform benchmarks resembling human judgments in all three tasks. This suggests that machine learning offers a viable alternative to currently deployed plasma etching equipment maintenance strategies and decision making processes.

Most Frequent Itemset Optimization

In this paper we deal with frequent itemset mining. We concentrate on the special case of identifying only the most frequent itemset of length N. To do so, we present a way to cast this search as an optimization problem. First, we extract the frequency of all possible 2-item-sets. The optimization problem is then to find the N objects for which the minimal frequency of all contained 2-item-sets is maximal. This combinatorial optimization problem can be solved by any optimization algorithm; we solve it with Quantum Annealing and QUBO via QbSolv by D-Wave. The advantages of Most Frequent Itemset Optimization (MFIO) over the state-of-the-art approach are an enormous reduction in run time and memory requirements and the omission of a threshold. The disadvantage is that there is no guarantee of the accuracy of the result. The evaluation indicates good results.

Multimodal Subspace Support Vector Data Description

In this paper, we propose a novel method for projecting data from multiple modalities to a new subspace optimized for one-class classification. The proposed method iteratively transforms the data from the original feature space of each modality to a new common feature space along with finding a joint compact description of data coming from all the modalities. For data in each modality, we define a separate transformation to map the data from the corresponding feature space to the new optimized subspace by exploiting the available information from the class of interest only. The data description in the new subspace is obtained by Support Vector Data Description. We also propose different regularization strategies for the proposed method and provide both linear and non-linear formulations. We conduct experiments on two multimodal datasets and compare the proposed approach with baseline and recently proposed one-class classification methods combined with early fusion, and also consider each modality separately. We show that the proposed Multimodal Subspace Support Vector Data Description outperforms all the methods using data from a single modality and performs as well as or better than the methods fusing data from all modalities.

Sameness Attracts, Novelty Disturbs, but Outliers Flourish in Fanfiction Online

The nature of what people enjoy is not just a central question for the creative industry, it is a driving force of cultural evolution. It is widely believed that successful cultural products balance novelty and conventionality: they provide something familiar but at least somewhat divergent from what has come before, and occupy a satisfying middle ground between ‘more of the same’ and ‘too strange’. We test this belief using a large dataset of over half a million works of fanfiction from the website Archive of Our Own (AO3), looking at how the recognition a work receives varies with its novelty. We quantify novelty through a term-based language model and a topic model, in the context of existing works within the same fandom. Contrary to the balance theory, we find that the lowest-novelty works are the most popular and that popularity declines monotonically with novelty. A few exceptions can be found: extremely popular works that are among the highest-novelty within the fandom. Taken together, our findings not only challenge the traditional theory of the hedonic value of novelty, they invert it: people prefer the least novel things, are repelled by the middle ground, and have an occasional enthusiasm for extreme outliers. This suggests that cultural evolution must work against inertia — the appetite people have to continually reconsume the familiar, and may resemble a punctuated equilibrium rather than a smooth evolution.

Kernel canonical correlation analysis approximates operators for the detection of coherent structures in dynamical data

We illustrate relationships between classical kernel-based dimensionality reduction techniques and eigendecompositions of empirical estimates of reproducing kernel Hilbert space (RKHS) operators associated with dynamical systems. In particular, we show that kernel canonical correlation analysis (CCA) can be interpreted in terms of kernel transfer operators and that coherent sets of particle trajectories can be computed by applying kernel CCA to Lagrangian data. We demonstrate the efficiency of this approach with several examples, namely the well-known Bickley jet, ocean drifter data, and a molecular dynamics problem with a time-dependent potential. Furthermore, we propose a straightforward generalization of dynamic mode decomposition (DMD) called coherent mode decomposition (CMD).

ProUM: Projection-based Utility Mining on Sequence Data

In the past decade, utility mining has attracted considerable attention, but most existing studies are developed to deal with itemset-based data. Different from itemset-based data, time-ordered sequence data is more commonly seen in real-world situations. Current utility mining algorithms are limited when dealing with sequence data since they are time-consuming and require a large amount of memory. In this paper, we propose an efficient Projection-based Utility Mining (ProUM) approach to discover high-utility sequential patterns from sequence data. The utility-array structure is designed to store the necessary information about sequence order and utility. By utilizing the projection technique in generating the utility-array, ProUM can significantly improve mining efficiency and effectively reduce memory consumption. Besides, we propose a new upper bound named sequence extension utility. Several pruning strategies are further applied to improve the efficiency of ProUM. Experimental results show that the proposed ProUM algorithm significantly outperforms the state-of-the-art algorithms.

An Evaluation Framework for Interactive Recommender System

Traditional recommender systems present a relatively static list of recommendations to a user where the feedback is typically limited to an accept/reject or a rating model. However, these simple modes of feedback may only provide limited insights as to why a user likes or dislikes an item and what aspects of the item the user has considered. Interactive recommender systems present an opportunity to engage the user in the process by allowing them to interact with the recommendations, provide feedback and impact the results in real-time. Evaluation of the impact of the user interaction typically requires an extensive user study, which is time consuming and gives researchers limited opportunities to tune their solutions without having to conduct multiple rounds of user feedback. Additionally, user experience and design aspects can have a significant impact on the user feedback, which may result in not necessarily assessing the quality of some of the underlying algorithmic decisions in the overall solution. As a result, we present an evaluation framework which aims to simulate the users interacting with the recommender. We formulate metrics to evaluate the quality of the interactive recommenders, which the framework outputs once the simulation is completed. While simulation alone is not sufficient to evaluate a complete solution, the results can be useful to help researchers tune their solution before moving to the user study stage.

Persistence Curves: A canonical framework for summarizing persistence diagrams

Persistence diagrams are a main tool in the field of Topological Data Analysis (TDA). They contain fruitful information about the shape of data. The use of machine learning algorithms on the space of persistence diagrams proves to be challenging as the space is complicated. For that reason, summarizing and vectorizing these diagrams is an important topic currently researched in TDA. In this work, we provide a general framework for summarizing diagrams that we call Persistence Curves (PC). The main idea is the so-called Fundamental Lemma of Persistent Homology, which is derived from the classic elder rule. Under this framework, certain well-known summaries, such as persistent Betti numbers and the persistence landscape, are special cases of the PC. Moreover, we prove a rigorous bound for general families of PCs. In particular, certain families of PCs admit the stability property under an additional assumption. Finally, we apply PCs to texture classification on four well-known texture datasets. The result outperforms several existing TDA methods.

AT-GAN: A Generative Attack Model for Adversarial Transferring on Generative Adversarial Nets

Recent studies have discovered the vulnerability of Deep Neural Networks (DNNs) to adversarial examples, which are imperceptible to humans but can easily fool DNNs. Existing methods for crafting adversarial examples are mainly based on adding small-magnitude perturbations to the original images so that the generated adversarial examples are constrained by the benign examples within a small matrix norm. In this work, we propose a new attack method called AT-GAN that directly generates the adversarial examples from random noise using generative adversarial nets (GANs). The key idea is to transfer a pre-trained GAN to generate adversarial examples for the target classifier to be attacked. Once the model is transferred for attack, AT-GAN can generate diverse adversarial examples efficiently, making it helpful to potentially accelerate the adversarial training on defenses. We evaluate AT-GAN in both semi-whitebox and black-box settings under typical defense methods on the MNIST handwritten digit database. Empirical comparisons with existing attack baselines demonstrate that AT-GAN can achieve a higher attack success rate.

Learning 3D Navigation Protocols on Touch Interfaces with Cooperative Multi-Agent Reinforcement Learning

Using touch devices to navigate in virtual 3D environments such as computer assisted design (CAD) models or geographical information systems (GIS) is inherently difficult for humans, as the 3D operations have to be performed by the user on a 2D touch surface. This ill-posed problem is classically solved with a fixed and handcrafted interaction protocol, which must be learned by the user. We propose to automatically learn a new interaction protocol allowing to map a 2D user input to 3D actions in virtual environments using reinforcement learning (RL). A fundamental problem of RL methods is the vast amount of interactions often required, which are difficult to come by when humans are involved. To overcome this limitation, we make use of two collaborative agents. The first agent models the human by learning to perform the 2D finger trajectories. The second agent acts as the interaction protocol, interpreting and translating to 3D operations the 2D finger trajectories from the first agent. We restrict the learned 2D trajectories to be similar to a training set of collected human gestures by first performing state representation learning, prior to reinforcement learning. This state representation learning is addressed by projecting the gestures into a latent space learned by a variational auto encoder (VAE).

Simion Zoo: A Workbench for Distributed Experimentation with Reinforcement Learning for Continuous Control Tasks

We present Simion Zoo, a Reinforcement Learning (RL) workbench that provides a complete set of tools to design, run, and analyze the results, both statistically and visually, of RL control applications. The main features that set apart Simion Zoo from similar software packages are its easy-to-use GUI, its support for distributed execution including deployment over graphics processing units (GPUs), and the possibility to explore concurrently the RL metaparameter space, which is key to successful RL experimentation.

Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents

Images and text co-occur everywhere on the web, but explicit links between images and sentences (or other intra-document textual units) are often not annotated by users. We present algorithms that successfully discover image-sentence relationships without relying on any explicit multimodal annotation. We explore several variants of our approach on seven datasets of varying difficulty, ranging from images that were captioned post hoc by crowd-workers to naturally-occurring user-generated multimodal documents, wherein correspondences between illustrations and individual textual units may not be one-to-one. We find that a structured training objective based on identifying whether sets of images and sentences co-occur in documents can be sufficient to predict links between specific sentences and specific images within the same document at test time.

Scalable and Efficient Hypothesis Testing with Random Forests

Throughout the last decade, random forests have established themselves as among the most accurate and popular supervised learning methods. While their black-box nature has made their mathematical analysis difficult, recent work has established important statistical properties like consistency and asymptotic normality by considering subsampling in lieu of bootstrapping. Though such results open the door to traditional inference procedures, all formal methods suggested thus far place severe restrictions on the testing framework and their computational overhead precludes their practical scientific use. Here we propose a permutation-style testing approach to formally assess feature significance. We establish asymptotic validity of the test via exchangeability arguments and show that the test maintains high power with orders of magnitude fewer computations. As importantly, the procedure scales easily to big data settings where large training and testing sets may be employed without the need to construct additional models. Simulations and applications to ecological data where random forests have recently shown promise are provided.

Active Adversarial Domain Adaptation

We propose an active learning approach for transferring representations across domains. Our approach, active adversarial domain adaptation (AADA), explores a duality between two related problems: adversarial domain alignment and importance sampling for adapting models across domains. The former uses a domain discriminative model to align domains, while the latter utilizes it to weigh samples to account for distribution shifts. Specifically, our importance weight promotes samples with large uncertainty in classification and diversity from labeled examples, and thus serves as a sample selection scheme for active learning. We show that these two views can be unified in one framework for domain adaptation and transfer learning when the source domain has many labeled examples while the target domain does not. AADA provides significant improvements over fine-tuning based approaches and other sampling methods when the two domains are closely related. Results on challenging domain adaptation tasks, e.g., object detection, demonstrate that the advantage over baseline approaches is retained even after hundreds of examples have been actively annotated.

End-to-End Robotic Reinforcement Learning without Reward Engineering

The combination of deep neural network models and reinforcement learning algorithms can make it possible to learn policies for robotic behaviors that directly read in raw sensory inputs, such as camera images, effectively subsuming both estimation and control into one model. However, real-world applications of reinforcement learning must specify the goal of the task by means of a manually programmed reward function, which in practice requires either designing the very same perception pipeline that end-to-end reinforcement learning promises to avoid, or else instrumenting the environment with additional sensors to determine if the task has been performed successfully. In this paper, we propose an approach for removing the need for manual engineering of reward specifications by enabling a robot to learn from a modest number of examples of successful outcomes, followed by actively solicited queries, where the robot shows the user a state and asks for a label to determine whether that state represents successful completion of the task. While requesting labels for every single state would amount to asking the user to manually provide the reward signal, our method requires labels for only a tiny fraction of the states seen during training, making it an efficient and practical approach for learning skills without manually engineered rewards. We evaluate our method on real-world robotic manipulation tasks where the observations consist of images viewed by the robot’s camera. In our experiments, our method effectively learns to arrange objects, place books, and drape cloth, directly from images and without any manually specified reward functions, and with only 1-4 hours of interaction with the real world.

JGraphT — A Java library for graph data structures and algorithms

Mathematical software and graph-theoretical algorithmic packages to efficiently model, analyze and query graphs are crucial in an era where large-scale spatial, societal and economic network data are abundantly available. One such package is JGraphT, a programming library which contains very efficient and generic graph data-structures along with a large collection of state-of-the-art algorithms. The library is written in Java with stability, interoperability and performance in mind. A distinctive feature of this library is the ability to model vertices and edges as arbitrary objects, thereby permitting natural representations of many common networks including transportation, social and biological networks. Besides classic graph algorithms such as shortest-paths and spanning-tree algorithms, the library contains numerous advanced algorithms: graph and subgraph isomorphism; matching and flow problems; approximation algorithms for NP-hard problems such as independent set and TSP; and several more exotic algorithms such as Berge graph detection. Due to its versatility and generic design, JGraphT is currently used in large-scale commercial, non-commercial and academic research projects. In this work we describe in detail the design and underlying structure of the library, and discuss its most important features and algorithms. A computational study is conducted to evaluate the performance of JGraphT versus a number of similar libraries. Experiments on a large number of graphs over a variety of popular algorithms show that JGraphT is highly competitive with other established libraries such as NetworkX or the BGL.

Scalable Bayesian Inference for Population Markov Jump Processes

Bayesian inference for Markov jump processes (MJPs) where available observations relate to either system states or jumps typically relies on data-augmentation Markov Chain Monte Carlo. State-of-the-art developments involve representing MJP paths with auxiliary candidate jump times that are later thinned. However, these algorithms are (i) infeasible in situations involving large or infinite capacity systems and (ii) not amenable to all observation types. In this paper we establish and present a general data-augmentation framework for population MJPs based on uniformized representations of the underlying non-stationary jump processes. This leads to multiple novel MCMC samplers which enable exact (in the Monte Carlo sense) inference tasks for model parameters. We show that the proposed samplers outperform existing popular approaches, and offer substantial efficiency gains in applications to partially observed stochastic epidemics, immigration processes and predator-prey dynamical systems.

Aggregation Cross-Entropy for Sequence Recognition

In this paper, we propose a novel method, aggregation cross-entropy (ACE), for sequence recognition from a brand new perspective. The ACE loss function exhibits competitive performance to CTC and the attention mechanism, with much quicker implementation (as it involves only four fundamental formulas), faster inference/back-propagation (approximately O(1) in parallel), less storage requirement (no parameters and negligible runtime memory), and convenient employment (by replacing CTC with ACE). Furthermore, the proposed ACE loss function exhibits two noteworthy properties: (1) it can be directly applied to 2D prediction by flattening the 2D prediction into 1D prediction as the input, and (2) it requires only characters and their numbers in the sequence annotation for supervision, which allows it to advance beyond sequence recognition, e.g., to counting problems. The code is publicly available at https://…/Aggregation-Cross-Entropy.

Relay: A High-Level IR for Deep Learning

Frameworks for writing, compiling, and optimizing deep learning (DL) models have recently enabled progress in areas like computer vision and natural language processing. Extending these frameworks to accommodate the rapidly diversifying landscape of DL models and hardware platforms presents challenging tradeoffs between expressiveness, composability, and portability. We present Relay, a new intermediate representation (IR) and compiler framework for DL models. The functional, statically-typed Relay IR unifies and generalizes existing DL IRs and can express state-of-the-art models. Relay’s expressive IR required careful design of the type system, automatic differentiation, and optimizations. Relay’s extensible compiler can eliminate abstraction overhead and target new hardware platforms. The design insights from Relay can be applied to existing frameworks to develop IRs that support extension without compromising on expressivity, composability, and portability. Our evaluation demonstrates that the Relay prototype can already provide competitive performance for a broad class of models running on CPUs, GPUs, and FPGAs.

DocBERT: BERT for Document Classification

Pre-trained language representation models achieve remarkable state-of-the-art results across a wide range of tasks in natural language processing. One of the latest advancements is BERT, a deep pre-trained transformer that yields much better results than its predecessors do. Despite its burgeoning popularity, however, BERT has not yet been applied to document classification. This task deserves attention, since it contains a few nuances: first, modeling syntactic structure matters less for document classification than for other problems, such as natural language inference and sentiment classification. Second, documents often have multiple labels across dozens of classes, which is uncharacteristic of the tasks that BERT explores. In this paper, we describe fine-tuning BERT for document classification. We are the first to demonstrate the success of BERT on this task, achieving state-of-the-art results across four popular datasets.

Continue Reading…


Read More

Distilled News

Turning Data into Sound

Inspired by Simon Rogers’s post introducing TwoTone, a tool to represent data as sound, I created my first data ‘sonification’ (aka musical representation of data). This was particularly interesting to me as I had only dreamt about creating jaw-dropping visualizations but never imagined that I could ever turn data into another format, especially sound. This post covers basics about TwoTone and how to use it.

Embedding Graphs with Deep Learning

Sparse representations are the natural killer of classifiers. Current graph data structures such as adjacency matrices and lists are plagued with this sparsity. This article will discuss techniques such as Matrix decomposition, DeepWalk, and Node2Vec which convert sparse graph data into low-dimensional continuous vector spaces.
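The matrix-decomposition flavor of these methods can be illustrated with a toy sketch (this is not DeepWalk or Node2Vec themselves, which train on random-walk co-occurrences): factor the adjacency matrix with a truncated SVD and keep the top components as dense node vectors.

```r
# Toy graph embedding via truncated SVD of the adjacency matrix.
# Illustrates only the matrix-decomposition idea; DeepWalk/Node2Vec
# instead learn from random-walk co-occurrence statistics.
A <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

s <- svd(A)
embeddings <- s$u[, 1:2] %*% diag(s$d[1:2])  # one dense 2-d vector per node
dim(embeddings)  # 4 nodes x 2 dimensions
```

The sparse 4 x 4 adjacency structure is replaced by a dense 4 x 2 matrix, which downstream classifiers handle much more gracefully.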

Scalable Jaccard similarity using MinHash and Spark

It occurred to me a little while ago that the Jaccard similarity coefficient has probably cropped up in my work more than any other statistic except for the arithmetic mean. If you have two sets of things (words, parts of words, attributes, categories, or whatever), you can take the number of things in the intersection of the sets and divide by the number of things in the union of the sets. The resulting metric is a meaningful measure of similarity that has the added virtue of being pretty easily explainable to non-technical folks. Jaccard similarity gets a little difficult to calculate directly at scale. If you have a really large list of entity-attribute pairs, and you want an entity-by-entity similarity matrix, you basically have to do an inner join, group by entity and count, then do an outer join, group by entity and count, and then join the results of the two joins together. If your workflow uses Spark, as mine does, that’s a whole lot of shuffling. It’s expensive.
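The plain set computation is a one-liner in R; this sketch shows only the exact Jaccard coefficient, not the MinHash approximation or the Spark join strategy discussed here.

```r
# Jaccard similarity: |intersection(A, B)| / |union(A, B)|
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

jaccard(c("red", "green", "blue"), c("green", "blue", "yellow"))  # 0.5
```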

X-AI, Black Boxes and Crystal Balls

On our road to trusted AI, I discussed in my previous blog the question of bias, how it travels from humans to machines, how it is amplified by AI applications, the impacts in the real world, for individuals and for businesses, and the importance to proactively tackle this problem. Today, I’ll address the issue of explainability and transparency of the so-called ‘black box’ models.

Data-Driven Scenario Stories

So how do we tell meaningful, persuasive stories with data? How do we send messages that are relevant and relatable so that decision-makers can receive them in stride and accelerate down the field? Perhaps data scientists can borrow a page from scenario planning, which relies on informed narratives, to effectively bridge the gap between the digital world and the physical world.

Why are you still doing batch processing? ‘ETL is dead’

It was about a year ago that a few colleagues suggested that I research Apache Kafka for an application that I was designing. I watched the rerun video from QCon 2016 titled ‘ETL is Dead; Long Live Streams’. In that video, Neha Narkhede (CTO of Confluent) describes the concept of replacing ETL batch data processing with messaging and microservices. It took some time for the paradigm to really sink in, but after designing and writing a data streaming system, I can say that I am a believer. I will describe the difference between ETL batch processing and a data streaming process. Every company is still doing batch processing; it’s just a fact of life. A file of data is received and must be processed: it needs to be parsed, validated, cleansed, calculated, organized, aggregated, then eventually delivered to some downstream system. Most companies are using some sort of workflow tool such as SSIS or Informatica that can intuitively wrap these tasks into a rigid ‘package’ contained on a single server and execute it on schedule.

‘Please, explain.’ Interpretability of black-box machine learning models

In February 2019 the Polish government added an amendment to a banking law that gives a customer the right to receive an explanation in case of a negative credit decision. It’s one of the direct consequences of implementing GDPR in the EU. This means that a bank needs to be able to explain why a loan wasn’t granted if the decision process was automatic. In October 2018 world headlines reported on an Amazon AI recruiting tool that favored men. Amazon’s model was trained on biased data that was skewed towards male candidates. It had built rules that penalized résumés that included the word ‘women’s’.

A Detailed Guide to Plotting Line Graphs in R using ggplot geom_line

When it comes to data visualization, it can be fun to think of all the flashy and exciting ways to display a dataset. But if you’re trying to convey information, flashy isn’t always the way to go. In fact, one of the most powerful ways to communicate the relationship between two variables is the simple line graph. A line graph is a type of graph that displays information as a series of data points connected by straight line segments.
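A minimal geom_line() example with made-up data (assumes ggplot2 is installed):

```r
library(ggplot2)

# made-up yearly series: one data point per year, connected by segments
df <- data.frame(year  = 2010:2019,
                 value = c(3, 5, 4, 6, 8, 7, 9, 12, 11, 14))

ggplot(df, aes(x = year, y = value)) +
  geom_line() +
  labs(title = "A simple line graph", x = "Year", y = "Value")
```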

Bling Fire

Hi, we are a team at Microsoft called Bling (Beyond Language Understanding), we help Bing be smarter. Here we wanted to share with all of you our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing such as Tokenization, Multi-word expression matching, Unknown word-guessing, Stemming / Lemmatization just to mention a few.

A Comparative Review of the JASP Statistical Software

JASP is a free and open source statistics package that targets beginners looking to point-and-click their way through analyses. This article is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) for R, which best meets their needs. Most of these reviews also include cursory descriptions of the programming support that each GUI offers. JASP stands for Jeffreys’ Amazing Statistics Program, a nod to the Bayesian statistician, Sir Harold Jeffreys. It is available for Windows, Mac, Linux, and there is even a cloud version. One of JASP’s key features is its emphasis on Bayesian analysis. Most statistics software emphasizes a more traditional frequentist approach; JASP offers both. However, while JASP uses R to do some of its calculations, it does not currently show you the R code it uses, nor does it allow you to execute your own. The developers hope to add that to a future version. Some of JASP’s calculations are done in C++, so getting that converted to R will be a necessary first step on that path.


Magister Dixit

“Data have no value or meaning in isolation; they exist within a knowledge infrastructure – an ecology of people, practices, technologies, institutions, material objects, and relationships.” Christine L. Borgman ( 2015 )


Quick Example of Latent Profile Analysis in R

(This article was first published on R on Will Hipson, and kindly contributed to R-bloggers)

Latent Profile Analysis (LPA) tries to identify clusters of individuals (i.e., latent profiles) based on responses to a series of continuous variables (i.e., indicators). LPA assumes that there are unobserved latent profiles that generate patterns of responses on indicator items.

Here, I will go through a quick example of LPA to identify groups of people based on their interests/hobbies. The data comes from the Young People Survey, available freely on Kaggle.

Here’s a sneak peek at what we’re going for:

Terminology note: People use the terms clusters, profiles, classes, and groups interchangeably, but there are subtle differences. I’ll mostly stick to profile to refer to a grouping of cases, in keeping with LPA terminology. We should note that LPA is a branch of Gaussian Finite Mixture Modeling, which includes Latent Class Analysis (LCA). The difference between LPA and LCA is conceptual, not computational: LPA uses continuous indicators and LCA uses binary indicators. LPA is a probabilistic model, which means that it models the probability of a case belonging to a profile. This is superior to an approach like K-means that uses distance algorithms.

With that aside, let’s load in the data.

library(tidyverse)

survey <- read_csv("")

The data is on 32 interests/hobbies. Each item is ranked 1 (not interested) to 5 (very interested).

The description on Kaggle suggests there may be careless responding (e.g., participants who selected the same value over and over). We can use the careless package to identify “string responding”. Let’s also look for multivariate outliers with Mahalanobis Distance (see my previous post on Mahalanobis for identifying outliers).


library(careless)  # longstring(): longest run of identical responses per row
library(psych)     # outlier(): Mahalanobis distances

interests <- survey %>%
  mutate(string = longstring(.)) %>%
  mutate(md = outlier(., plot = FALSE))

We’ll cap string responding to a maximum of 10 and use a Mahalanobis D cutoff of alpha = .001.

cutoff <- (qchisq(p = 1 - .001, df = ncol(interests)))

interests_clean <- interests %>%
  filter(string <= 10,
         md < cutoff) %>%
  select(-string, -md)

The package mclust performs various types of model-based clustering and dimension reduction. Plus, it’s really intuitive to use. It requires complete data (no missing), so for this example we’ll remove cases with NAs. This is not the preferred approach; we’d be better off imputing. But for illustrative purposes, this works fine. I’m also going to standardize all of the indicators so when we plot the profiles it’s clearer to see the differences between clusters. Running this code will take a few minutes.


library(mclust)

interests_clustering <- interests_clean %>%
  na.omit() %>%
  mutate_all(list(scale))

BIC <- mclustBIC(interests_clustering)

We’ll start by plotting the Bayesian Information Criterion for all the models, with profiles ranging from 1 to 9.

plot(BIC)
It’s not immediately clear which model is best since the y-axis spans such a large range and many of the models score close together. summary(BIC) shows the top three models based on BIC.

## Best BIC values:
##             VVE,3       VEE,3      EVE,3
## BIC      -75042.7 -75165.1484 -75179.165
## BIC diff      0.0   -122.4442   -136.461

The highest BIC comes from VVE, 3. This says there are 3 clusters with variable volume, variable shape, equal orientation, and ellipsoidal distribution (see Figure 2 from this paper for a visual). However, VEE, 3 is not far behind and may actually be a more theoretically useful model since it constrains the shape of the distribution to be equal. For this reason, we’ll go with VEE, 3.

If we want to look at this model more closely, we save it as an object and inspect it with summary().

mod1 <- Mclust(interests_clustering, modelNames = "VEE", G = 3, x = BIC)
summary(mod1)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## Mclust VEE (ellipsoidal, equal shape and orientation) model with 3
## components: 
##  log.likelihood   n  df       BIC       ICL
##       -35455.83 874 628 -75165.15 -75216.14
## Clustering table:
##   1   2   3 
## 137 527 210

The output describes the geometric characteristics of the profiles and the number of cases classified into each of the three clusters.

BIC is one of the best fit indices, but it’s always recommended to look for more evidence that the solution we’ve chosen is the correct one. We can also compare values of the Integrated Completed Likelihood (ICL) criterion. See this paper for more details. ICL isn’t much different from BIC, except that it adds a penalty on solutions with greater entropy or classification uncertainty.
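The entropy penalty is easy to compute by hand. Roughly, ICL is BIC minus twice the entropy of the soft classification (in mclust’s larger-is-better convention); a sketch with a toy posterior-probability matrix:

```r
# z: posterior profile probabilities, one row per case (toy values)
z <- rbind(c(0.90, 0.05, 0.05),   # confidently classified
           c(0.40, 0.30, 0.30),   # ambiguous -- contributes most of the entropy
           c(0.05, 0.90, 0.05))
entropy <- -sum(z * log(z))
round(entropy, 3)  # larger when cases sit between profiles
```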

ICL <- mclustICL(interests_clustering)

summary(ICL)
## Best ICL values:
##              VVE,3        VEE,3      EVE,3
## ICL      -75134.69 -75216.13551 -75272.891
## ICL diff      0.00    -81.44795   -138.203

We see similar results. ICL suggests that model VEE, 3 fits quite well. Finally, we’ll perform the Bootstrap Likelihood Ratio Test (BLRT), which compares model fit between k-1 and k cluster models. In other words, it tests whether an increase in profiles improves fit. Based on simulations by Nylund, Asparouhov, and Muthén (2007), BIC and BLRT are the best indicators of how many profiles there are. This line of code will take a long time to run, so if you’re just following along I suggest skipping it unless you want to step out for a coffee break.

mclustBootstrapLRT(interests_clustering, modelName = "VEE")
## ------------------------------------------------------------- 
## Bootstrap sequential LRT for the number of mixture components 
## ------------------------------------------------------------- 
## Model        = VEE 
## Replications = 999 
##               LRTS bootstrap p-value
## 1 vs 2    197.0384             0.001
## 2 vs 3    684.8743             0.001
## 3 vs 4   -124.1935             1.000

BLRT also suggests that a 3-profile solution is ideal.

Visualizing LPA

Now that we’re confident in our choice of a 3-profile solution, let’s plot the results. Specifically, we want to see how the profiles differ on the indicators, that is, the items that made up the profiles. If the solution is theoretically meaningful, we should see differences that make sense.

First, we’ll extract the means for each profile (remember, we chose these to be standardized). Then, we melt this into long form. Note that I’m trimming values exceeding +1 SD, otherwise we run into plotting issues.


means <- data.frame(mod1$parameters$mean, stringsAsFactors = FALSE) %>%
  rownames_to_column() %>%
  rename(Interest = rowname) %>%
  melt(id.vars = "Interest", variable.name = "Profile", value.name = "Mean") %>%
  mutate(Mean = round(Mean, 2),
         Mean = ifelse(Mean > 1, 1, Mean))

Here’s the code for the plot. I’m reordering the indicators so that similar activities are close together.

means %>%
  ggplot(aes(Interest, Mean, group = Profile, color = Profile)) +
  geom_point(size = 2.25) +
  geom_line(size = 1.25) +
  scale_x_discrete(limits = c("Active sport", "Adrenaline sports", "Passive sport",
                              "Countryside, outdoors", "Gardening", "Cars",
                              "Art exhibitions", "Dancing", "Musical instruments", "Theatre", "Writing", "Reading",
                              "Geography", "History", "Law", "Politics", "Psychology", "Religion", "Foreign languages",
                              "Biology", "Chemistry", "Mathematics", "Medicine", "Physics", "Science and technology",
                              "Internet", "PC",
                              "Celebrities", "Economy Management", "Fun with friends", "Shopping", "Pets")) +
  labs(x = NULL, y = "Standardized mean interest") +
  theme_bw(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top")

We have a lot of indicators (more than typical for LPA), but we see some interesting differences. Clearly the red group is interested in science and the blue group shows greater interest in arts and humanities. The green group seems disinterested in both science and art, but moderately interested in other things.

We can make this plot more informative by plugging in profile names and proportions. I’m also going to save this plot as an object so that we can do something really cool with it!

p <- means %>%
  mutate(Profile = recode(Profile, 
                          X1 = "Science: 16%",
                          X2 = "Disinterest: 60%",
                          X3 = "Arts & Humanities: 24%")) %>%
  ggplot(aes(Interest, Mean, group = Profile, color = Profile)) +
  geom_point(size = 2.25) +
  geom_line(size = 1.25) +
  scale_x_discrete(limits = c("Active sport", "Adrenaline sports", "Passive sport",
                              "Countryside, outdoors", "Gardening", "Cars",
                              "Art exhibitions", "Dancing", "Musical instruments", "Theatre", "Writing", "Reading",
                              "Geography", "History", "Law", "Politics", "Psychology", "Religion", "Foreign languages",
                              "Biology", "Chemistry", "Mathematics", "Medicine", "Physics", "Science and technology",
                              "Internet", "PC",
                              "Celebrities", "Economy Management", "Fun with friends", "Shopping", "Pets")) +
  labs(x = NULL, y = "Standardized mean interest") +
  theme_bw(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top")


The something really cool that I want to do is make an interactive plot. Why would I want to do this? Well, one of the problems with the static plot is that with so many indicators it’s tough to read the values for each indicator. An interactive plot lets the reader narrow in on specific indicators or profiles of interest. We’ll use plotly to turn our static plot into an interactive one.


ggplotly(p, tooltip = c("Interest", "Mean")) %>%
  layout(legend = list(orientation = "h", y = 1.2))

There’s a quick example of LPA. Overall, I think LPA is a great tool for exploratory analysis, although I question its reproducibility. What’s important is that the statistician considers both fit indices and theory when deciding on the number of profiles.

References & Resources

Bertoletti, M., Friel, N., & Rastelli, R. (2015). Choosing the number of clusters in a finite mixture model using an exact Integrated Completed Likelihood criterion.

Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the Number of Classes in Latent Class Analysis and Growth Mixture Modeling: A Monte Carlo Simulation Study. Structural Equation Modeling, 14, 535-569.

Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. The R Journal, 8, 289-317.

To leave a comment for the author, please follow the link and comment on their blog: R on Will Hipson. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Styling DataTables

(This article was first published on Stencilled, and kindly contributed to R-bloggers)

Most Shiny apps have tables as a primary component. Now let’s say you want to prettify your app and style the tables. All you need is to understand how tables are built using HTML. This is how the default datatable looks in the app.

In order to build the HTML table I have used a function table_frame, which can be used as a container in DT::renderDataTable.
This function basically uses htmltools. For more on the basics of HTML tables, please refer here

table_frame <-
  function() {
    htmltools::withTags(table(
      class = 'display',
      thead(
        tr(
          th(rowspan = 2, 'Latitude'),
          th(rowspan = 2, 'Longitude'),
          th(rowspan = 2, 'Month'),
          th(rowspan = 2, 'Year'),
          th(class = 'dt-center', colspan = 3, 'Cloud'),
          th(rowspan = 2, 'Ozone'),
          th(rowspan = 2, 'Pressure'),
          th(rowspan = 2, 'Surface Temperature'),
          th(rowspan = 2, 'Temperature')
        ),
        # second header row: the three sub-columns grouped under 'Cloud'
        tr(lapply(c('High', 'Low', 'Mid'), th))
      )
    ))
  }

A table might have any number of records, and it’s not feasible to display them all at once on a dashboard. Still, someone might need to see them all, so in the table options we can add two buttons: Show All and Show Less. Show Less uses the default page length of 10 records, while Show All displays every record.

table_options <- function() {
  list(
    dom = 'Bfrtip',
    pageLength = 10,
    buttons = list(
      c('copy', 'csv', 'excel', 'pdf', 'print'),
      # 'Show All' sets the page length to -1 (all rows); the JS bodies below
      # are reconstructed from that described behaviour
      list(
        extend = "collection",
        text = 'Show All',
        action = DT::JS(
          "function ( e, dt, node, config ) {
             dt.page.len(-1);
             dt.ajax.reload();
           }"
        )
      ),
      list(
        extend = "collection",
        text = 'Show Less',
        action = DT::JS(
          "function ( e, dt, node, config ) {
             dt.page.len(10);
             dt.ajax.reload();
           }"
        )
      )
    ),
    deferRender = TRUE,
    lengthMenu = list(c(10, 20, -1), c('10', '20', 'All')),
    searching = FALSE,
    editable = TRUE,
    scroller = TRUE,
    lengthChange = FALSE,
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'background-color': '#517fb9', 'color': '#fff'});",
      "}"
    )
  )
}

Below is the output showing how the datatable looks once the HTML container and table options are used. So with styling, not only can we change the column names but we can also group them. In the default table there are three columns with the prefix Cloud; these can be grouped under one parent column named Cloud.
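To see the grouping idea end-to-end, here is a minimal, self-contained sketch using DT’s container argument. It uses the built-in iris data rather than the app’s weather data, and the grouped header name is illustrative:

```r
library(DT)

# two-row header: one parent column spanning the four measurement columns
cont <- htmltools::withTags(table(
  class = 'display',
  thead(
    tr(
      th(class = 'dt-center', colspan = 4, 'Measurements'),
      th(rowspan = 2, 'Species')
    ),
    tr(lapply(c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width'), th))
  )
))

# column order in the data must match the header layout
datatable(iris, container = cont, rownames = FALSE)
```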


You can find code for the app here.

To leave a comment for the author, please follow the link and comment on their blog: Stencilled.

Continue Reading…


Read More

China is using Facebook to build a huge audience around the world

But its methods look fishy

Continue Reading…


Read More

Control Charts Another Package

(This article was first published on Analysis of AFL, and kindly contributed to R-bloggers)

I got an email from Alex Zanidean, who runs the xmrr package

“You might enjoy my package xmrr for similar charts – but mine recalculate the bounds automatically”, and if we go to the vignette: “XMRs combine X-Bar control charts and Moving Range control charts. These functions also will recalculate the reference lines when significant change has occurred.” This seems like a pretty handy thing. So let’s do it.

First, let’s redo our graphic from the previous post using ggQC.

## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# `afl_data` (mean totals per season/round, built with fitzRoy) is assumed
# here; the data-prep lines were elided in the original post
afl_data %>%
  filter(Season > 1989 & Round == "R1") %>%
  ggplot(aes(x = Season, y = meantotal)) +
  geom_point() +
  geom_line() +
  stat_QC(method = "XmR") +
  ylab("Mean Round 1 Total for Each Game") +
  ggtitle("Stop Freaking OUT over ONE ROUND")

When using a package for the first time, one of the best things about the R community is that the examples are usually fully reproducible, and this helps.

From the github

Year <- seq(2001, 2009, 1)
Measure <-  runif(length(Year))

df <- data.frame(Year, Measure)
head(df)
##   Year   Measure
## 1 2001 0.6146880
## 2 2002 0.2854914
## 3 2003 0.6081190
## 4 2004 0.4357665
## 5 2005 0.1509844
## 6 2006 0.5935707
xmr(df, "Measure", recalc = T)
##   Year   Measure Order Central Line Moving Range Average Moving Range
## 1 2001 0.6146880     1        0.419           NA                   NA
## 2 2002 0.2854914     2        0.419        0.329                0.277
## 3 2003 0.6081190     3        0.419        0.323                0.277
## 4 2004 0.4357665     4        0.419        0.172                0.277
## 5 2005 0.1509844     5        0.419        0.285                0.277
## 6 2006 0.5935707     6        0.419        0.443                0.277
## 7 2007 0.5739720     7        0.419        0.020                0.277
## 8 2008 0.9961513     8        0.419        0.422                0.277
## 9 2009 0.9091553     9        0.419        0.087                0.277
##   Lower Natural Process Limit Upper Natural Process Limit
## 1                          NA                          NA
## 2                           0                       1.156
## 3                           0                       1.156
## 4                           0                       1.156
## 5                           0                       1.156
## 6                           0                       1.156
## 7                           0                       1.156
## 8                           0                       1.156
## 9                           0                       1.156
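What xmr() computes can be sketched in base R with the classic XmR formulas: the central line is the mean, the moving range is |x[t] - x[t-1]|, and the natural process limits are mean +/- 2.66 times the average moving range. Note that xmr() itself also recalculates the limits after sustained shifts, so its numbers above differ; the toy values below are rounded from the example:

```r
x <- c(0.61, 0.29, 0.61, 0.44, 0.15, 0.59, 0.57, 1.00, 0.91)
mr     <- abs(diff(x))                 # moving ranges
centre <- mean(x)                      # central line
amr    <- mean(mr)                     # average moving range
upper  <- centre + 2.66 * amr          # upper natural process limit
lower  <- max(0, centre - 2.66 * amr)  # lower limit, floored at zero
round(c(centre = centre, avg_mr = amr, lower = lower, upper = upper), 3)
```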

Let’s create a dataframe similar to df, but using data from fitzRoy.

# `afl_data` (mean Round 1 totals per season, built with fitzRoy) is assumed
# here; the data-prep step was elided in the original post
df <- afl_data %>%
  filter(Season > 1989 & Round == "R1") %>%
  select(Season, meantotal)

xmr_data <- xmr(df, "meantotal", recalc = T)

xmr_chart(df = xmr_data, 
          time = "Season", 
          measure = "meantotal",
          line_width = 0.75, text_size = 12, point_size = 2.5)

Does this tell a different story or a very similar one to earlier?

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

Continue Reading…


Read More

April 19, 2019

Document worth reading: “Algorithms for the Greater Good! On Mental Modeling and Acceptable Symbiosis in Human-AI Collaboration”

Effective collaboration between humans and AI-based systems requires effective modeling of the human in the loop, both in terms of the mental state as well as the physical capabilities of the latter. However, these models can also open up pathways for manipulating and exploiting the human in the hopes of achieving some greater good, especially when the intent or values of the AI and the human are not aligned or when they have an asymmetrical relationship with respect to knowledge or computation power. In fact, such behavior does not necessarily require any malicious intent but can rather be borne out of cooperative scenarios. It is also beyond simple misinterpretation of intents, as in the case of value alignment problems, and thus can be effectively engineered if desired. Such techniques already exist and pose several unresolved ethical and moral questions with regards to the design of autonomy. In this paper, we illustrate some of these issues in a teaming scenario and investigate how they are perceived by participants in a thought experiment. Algorithms for the Greater Good! On Mental Modeling and Acceptable Symbiosis in Human-AI Collaboration

Continue Reading…


Read More

Happy EasteR! Let’s find some eggs…

(This article was first published on Modelplot[R/py] introduction, and kindly contributed to R-bloggers)

It’s Easter Time! Let’s find some eggs…

plot of chunk bunniesneggs

Hi there! Yes, it’s the most Easterful time of the year again. For some of us a sacred time, for others mainly an egg-eating period, and some just enjoy the extra day of spare time. In case you have some time available for some good egg-searching business, but no-one seems willing to hide them for you this year, here’s an alteRnative. Hide them yourself and have a nice easteR holiday! As you go, you also get to know the very nice image processing package: magick.

Get some easter scenes, collect some eggs…

Before we can start hiding eggs, we need some scenes…

easterScenes <- c(


library(magick)

# read images and put them in a vector
images = c(image_read(easterScenes[1]), image_read(easterScenes[2]), image_read(easterScenes[3]))
# show all images, next to each other
image_append(image_scale(images, "x200"))

plot of chunk scenes

…and – obviously – we need some eggs to hide!

easterEggs <- c(

# read images, trim the whitespace, and put them in a vector
# (assuming three egg images, as with the scenes)
images = c(image_trim(image_read(easterEggs[1])), image_trim(image_read(easterEggs[2])),
           image_trim(image_read(easterEggs[3])))
# show all images, next to each other
image_append(image_scale(images, "x200"))

plot of chunk eggs

Hiding Eggs function

Excellent, we have some scenes and some eggs, let’s create a function to make hiding our eggs easy. We’ll take advantage of some other functions from the great image processing package magick.


hideEggs <- function(sceneUrl, nEggs = 5, eggPositions, eggSize = 0.04){
    # read scene
    easterScene <- image_read(sceneUrl)
    # resize picture to 800 width (keep aspect ratio)
    easterScene <- image_scale(easterScene, "800x")
    sceneWidth <- as.numeric(image_info(easterScene)['width'])
    sceneHeight <- as.numeric(image_info(easterScene)['height'])
    # collect some eggs (sample with replacement in the basket ;))
    eggUrls = sample(easterEggs, nEggs, replace = TRUE)

    easterSceneEggs <- easterScene

    for (eggn in 1:nEggs){
        eggUrl = eggUrls[eggn]
        # get egg, resize, rotate and put on canvas!
        egg <- image_read(eggUrl)
        # resize egg to 5% of canvas height (keep aspect ratio)
        egg <- image_scale(egg,paste0("x",round(eggSize*sceneWidth)))
        # remove background
        egg <- image_background(egg,"none")
        # rotate egg between -90 and 90 degrees
        eggRotation <- runif(1,-90,90)
        egg <- image_rotate(egg,eggRotation)
        #set x and y position (as specified in list or else random)
        if (!missing(eggPositions)){
            xpos <- eggPositions[[eggn]][1]*sceneWidth
        } else {
            xpos <- runif(1, 0, sceneWidth)
        }
        if (!missing(eggPositions)){
            ypos <- eggPositions[[eggn]][2]*sceneHeight
        } else {
            ypos <- runif(1, 0, sceneHeight)
        }
        #add egg to canvas
        easterSceneEggs <- image_composite(easterSceneEggs, egg, offset = paste0("+",xpos,"+",ypos))
    }
    return(easterSceneEggs)
}

Yeah, we’ve hidden 5 eggs near these happy easter bunnies. Can you spot them in a few seconds??

# let's hide 5 eggs among our easter bunnies
hideEggs(sceneUrl = easterScenes[2],nEggs = 5)

plot of chunk hideEggs

These guys also seem to enjoy our little game. Finally, let’s make a somewhat more challenging version that you can share with your relatives to wish them a nice Easter holiday.

# think where to hide...
eggPositions = list(c(0.1,0.95),c(0.03,0.6),c(0.4,0.65),c(0.5,0.67),c(0.465,0.31),

#... and hide! we'll use smaller eggs to make it a bit more challenging.
easterCard <- hideEggs(sceneUrl = easterScenes[1],nEggs = length(eggPositions),
         eggPositions = eggPositions,
         eggSize = 0.02)

# let's add some wishes to our easter card...

easterCard %>% 
    image_annotate("Happy Easter!!!", size = 36, color = "yellow",  location = "+270+16") %>%
    image_annotate("Can you spot the 10 hidden eggs?", size = 18, color = "white",  location = "+250+60")

plot of chunk hideEggs2

In case your audience can’t figure out where you hid the eggs, magick lets you animate your pictures, so:

# specify easterScene with and without the eggs and specify number of frames
frames <- image_morph(c(easterCard, image_scale(image_read(easterScenes[1]), "800x")), frames = 10)
# play the morphed frames as an animation
image_animate(frames)

plot of chunk showEggs

For much more you can do with magick, see the vignette.

That’s it, have yourself a merry little easter egg!

To leave a comment for the author, please follow the link and comment on their blog: Modelplot[R/py] introduction.

Continue Reading…


Read More

R Packages worth a look

Regular Expressions for Prices and Currencies (priceR)
Functions to aid in the analysis of price and currency data by expediting data preprocessing. This includes extraction of relevant data (e.g. from text …

Topologically Correct Isosurface Extraction (tcie)
Isosurfaces extraction algorithms are a powerful tool in the interpretation of volumetric data. The isosurfaces play an important role in several scien …

Uniformly Most Powerful Invariant Tests of Equivalence (equivUMP)
Implementation of uniformly most powerful invariant equivalence tests for one- and two-sample problems (paired and unpaired) as described in Wellek (20 …

Calculate Textures from Grey-Level Co-Occurrence Matrices (GLCMs) (glcm)
Enables calculation of image textures (Haralick 1973) <doi:10.1109/TSMC.1973.4309314> from grey-level co-occurrence matrices (GLCMs). Supports pr …

Hierarchical Ordered Probit Models with Application to Reporting Heterogeneity (hopit)
Self-reported health, happiness, attitudes, and other statuses or perceptions are often the subject of biases that may come from different sources. For …

Spatial Regression Analysis (spatialreg)
A collection of all the estimation functions for spatial cross-sectional models (on lattice/areal data using spatial weights matrices) contained up to …

Continue Reading…


Read More

Book Memo: “Deep Learning for Natural Language Processing”

Solve your natural language processing problems with smart deep neural networks
Gain knowledge of various deep neural network architectures and their application areas to conquer your NLP issues.

Continue Reading…


Read More

Data Driven Government Workshops Announced!

The workshops have been announced for Data Driven Government (formerly known as Predictive Analytics World for Government), Sep 25 in Washington, DC. Use the code KDNUGGETS for a 15% discount on your Deep Learning World ticket.

Continue Reading…


Read More

The Rise of Generative Adversarial Networks

A comprehensive overview of Generative Adversarial Networks, covering its birth, different architectures including DCGAN, StyleGAN and BigGAN, as well as some real-world examples.

Continue Reading…


Read More

ODSC East 2019 Talks to Expand and Apply R Skills

(This article was first published on, and kindly contributed to R-bloggers)

R programmers are not necessarily data scientists, but rather software engineers. We have an entirely new multitrack focus area that helps engineers learn AI skills – AI for Engineers. This focus area is designed specifically to help programmers get familiar with AI-driven software that utilizes deep learning and machine learning models to enable conversational AI, autonomous machines, machine vision, and other AI technologies that require serious engineering.

For example, some use R in reinforcement learning – a popular topic and arguably one of the best techniques for sequential decision making and control policies. At ODSC East, Leonardo De Marchi of Badoo will present “Introduction to Reinforcement Learning,” where he will go over the fundamentals of RL all the way to creating unique algorithms, and everything in between.

Well, what if you’re missing data? That will disrupt your entire workflow – R or otherwise. In “Good, Fast, Cheap: How to do Data Science with Missing Data,” led by Matt Brems of General Assembly, you will start by visualizing missing data and identifying the three different types of missing data, which will allow you to see how they affect whether we should avoid, ignore, or account for the missing data. Matt will give you practical tips for working with missing data and recommendations for integrating it with your workflow.

How about getting some context? In “Intro to Technical Financial Evaluation with R” with Ted Kwartler of Liberty Mutual and Harvard Extension School, you will learn how to download and evaluate equities with the TTR (technical trading rules) package. You will evaluate an equity according to three basic indicators and be introduced to backtesting for more sophisticated analyses on your own.

There is a widespread belief that the twin modeling goals of prediction and explanation are in conflict. That is, if one desires superior predictive power, then by definition one must pay a price of having little insight into how the model made its predictions. In “Explaining XGBoost Models – Tools and Methods” with Brian Lucena, PhD, Consulting Data Scientist at Agentero, you will work hands-on using XGBoost with real-world data sets to demonstrate how to approach data sets with the twin goals of prediction and understanding in a manner such that improvements in one area yield improvements in the other.

What about securing your deep learning frameworks? In “Adversarial Attacks on Deep Neural Networks” with Sihem Romdhani, Software Engineer at Veeva Systems, you will answer questions such as how do adversarial attacks pose a real-world security threat? How can these attacks be performed? What are the different types of attacks? What are the different defense techniques so far and how to make a system more robust against adversarial attacks? Get these answers and more here.

Data science is an art. In “Data Science + Design Thinking: A Perfect Blend to Achieve the Best User Experience,” Michael Radwin, VP of Data Science at Intuit, will offer a recipe for how to apply design thinking to the development of AI/ML products. You will learn to go broad to go narrow, focusing on what matters most to customers, and how to get customers involved in the development process by running rapid experiments and quick prototypes.

Lastly, your data and hard work mean nothing if you don’t do anything with them – that’s why the term “data storytelling” is more important than ever. In “Data Storytelling: The Essential Data Science Skill,” Isaac Reyes, TEDx speaker and founder of DataSeer, will discuss some of the most important aspects of data storytelling and visualization, such as which data to highlight, how to use color to bring attention to the right numbers, the best charts for each situation, and more.

Ready to apply all of your R skills to the above situations? Learn more techniques, applications, and use cases at ODSC East 2019 in Boston this April 30 to May 3!

Save 10% off the public ticket price when you use the code RBLOG10 today.

Register Here

More on ODSC:

ODSC East 2019  is one of the largest applied data science conferences in the world. Speakers include some of the core contributors to many open source tools, libraries, and languages. Attend ODSC East in Boston this April 30 to May 3 and learn the latest AI & data science topics, tools, and languages from some of the best and brightest minds in the field.

To leave a comment for the author, please follow the link and comment on their blog.

Continue Reading…


Read More

tint 0.1.2: Some cleanups

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version 0.1.2 of the tint package is arriving at CRAN as I write this. It follows the recent 0.1.1 release which included two fabulous new vignettes featuring new font choices. The package name expands from “tint is not tufte”, as the package offers a fresher take on the Tufte style for html and pdf presentations.

However, with the new vignettes in 0.1.1 we now had four full-length vignettes which made the package somewhat bulky. So for this release I reorganized things a little, added two new shorter vignettes with links to the full-length vignettes but keeping the size more constrained.

Two screenshots for the first pages of the Lato and Garamond vignettes follow (and are links to a higher-resolution full-length pdf versions):

The new vignettes can also be browsed from CRAN: html variant and pdf variant. They also show the new theme borrowed with thanks and credits from ggtufte.

The full list of changes is below.

Changes in tint version 0.1.2 (2019-04-19)

  • Two new shorter html and pdf vignettes have been added (with references to the longer vignettes) reducing package size.

  • New helper function ‘theme_tint’ based on a similar function in package ‘ggtufte’.

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Continue Reading…


Read More

NASDAQ 100: Leader of Data Science [East Coast]

The Leader of Data Science should be well versed in the creation, critique, and deployment of complex analysis and of predictive and prescriptive modeling that require the use of both structured and unstructured data from multiple sources, such as SQL, Oracle, and Access databases, and text.

Continue Reading…


Read More

Animating the US Treasury yield curve rates by @ellis2013nz

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

My eye was caught by this tweet by Robin Wigglesworth of the Financial Times with an Alan Smith animation of the US Treasury yield curve from 2005 to 2009. It’s a nice graphic and it made me wonder what it would look like over a longer period.

The “yield curve” is the name given to the graphic showing the different annual rates paid to investors who in effect lend money to the US Treasury. In normal circumstances, investors who are prepared to make the loan on a long term basis demand higher effective interest rates as compensation for risk associated with inflation and future interest rates. When the situation goes in the other direction, and higher effective returns come from shorter term investments, it’s commonly seen as a sign of a coming recession. This isn’t because the “inverted yield curve” causes the recession, but because (amongst other things) it reflects investors’ fears that interest rates are about to be cut (in response to lower economic growth), hence best to lock in the return you can get now.
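The inversion idea above boils down to the sign of a long-minus-short spread. A minimal illustration in Python (with made-up rates, not real Treasury data):

```python
# Toy illustration: an "inverted" yield curve is one where a short-term
# rate exceeds a long-term rate, i.e. the long-minus-short spread is negative.

def term_spread(rates, short_term="3 mo", long_term="30 yr"):
    """Return the long-minus-short spread; negative means the curve is inverted."""
    return rates[long_term] - rates[short_term]

# A "normal" curve: longer terms pay more.
normal = {"3 mo": 2.4, "2 yr": 2.6, "10 yr": 3.0, "30 yr": 3.2}
# An "inverted" curve, of the kind seen before some recessions.
inverted = {"3 mo": 2.4, "2 yr": 2.2, "10 yr": 2.1, "30 yr": 2.3}

print(term_spread(normal))    # positive: normal upward-sloping curve
print(term_spread(inverted))  # negative: inversion
```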

Animation options

Anyway, here is my version of the animation, now with 29 years of daily data, streaming from YouTube (because it was too large to embed directly in my web page):

Alternatively, here’s a streamlined social-media-friendly version (as a GIF) that has monthly averages and scoots through a bit quicker.

I’m not sure these are great graphics. For one thing, with three frames for every day with data reported, the daily version is over 20,000 frames long and even at 25 frames per second that’s a pretty tedious thing to watch – 14.5 minutes in fact. More generally, it suffers from the common problems of animated statistical graphics:

  • you have to be looking at just the right time to see the interesting things – like the yield inversion and other volatility in that 2006 to 2008 period
  • some changes over time are actually less obvious to the viewer than if they could be presented on a single page.

In the original FT article, there’s a more focused animation by Smith, this one showing just the last 15 months, emphasising the yield inversion at the end of that period. It’s quite effective in telling a very pointed story, rather than trying to show a long history in a single animation.

Static alternatives

Animations aren’t always the best. In fact, it’s surely significant that I loathe video tutorials because I can’t flick my eye directly to the part I want but have to sit through the whole thing – even at high speed this is frustrating. Animations can have a similar problem.

For showing the longer history at a glance, consider this alternative presentation, still related to the animation’s visual structure. This has all 8,000 or so yield curves from those 29 years on a single static chart, each one drawn with 10% opacity to avoid colouring the page solid.

What I like about this version is that because it’s all there at once, you can make the comparisons across a broader period of time than is possible with the animation. You can see:

  • The secular decrease in rates over this time period
  • The period around 2005 when there were no 30 year rates for some reason
  • The different yield inversions in context – showing as upwards swooshes in blue, green and yellow (for different periods) at the left of the graph

Plus it’s very pretty.

Finally, how about a much more traditional time series line chart? I actually find this the easiest to interpret of them all.

  • The secular decrease is obvious again.
  • The carefully chosen colours let you compare the different terms quite easily.
  • This is the most dramatic depiction of the way the shorter term yields hit the floor after 2008.
  • Easiest chart of them all to interpret because of the familiar convention of the horizontal axis meaning time.
  • Still very pretty.


So what do I think?

Sometimes static graphs are better than animations. Sometimes an old-fashioned time series plot is better than less standard innovations. To make an animation tell a good story you often need more polishing, annotation and framing, not less.

More information on yield curves is easy to find on the Internet, and the piece in the Financial Times that Wigglesworth was referring to is worth a read. Here’s one of their very professional graphics:

One of the things that makes a difference to this, and to the animations in the same story, is the careful use of annotations to tell a story. That’s definitely what it takes to get a graphic to the next level, but it’s not something I’ve got time for right now.

Here’s the code for this (apart from that last one from the FT) – all pretty straightforward. Grabbing the data is easy with Wickham’s rvest R package, and animations are so much easier than a few years ago thanks to Thomas Pedersen’s gganimate. And let’s not forget the Viridis colours, originally developed for Python by Nathaniel Smith and Stefan van der Walt, and ported into an R package by Simon Garnier. In these graphics I use three variants of the Viridis colour schemes to represent different variables – the difference between long and short term yields at any one point in time, year, and the term of the yield. In each case, the variable I am trying to show is fundamentally ordered, and Viridis is at its best in showing that ordered nature in its colour.

#------------------------------load up functionality and import data-------------
library(tidyverse)
library(rvest)
library(gganimate)
library(lubridate)

the_caption <- "Data from, analysis by"

# we read in the data a year at a time because although there is a page with all 29 years of data,
# it was too difficult to read in all at once ie it crashed my humble laptop. So we define a function
# to get one year's data:
read_a_year <- function(y){
  url_stub <- ""
  url <- gsub("XXXX", y, url_stub)
  yield_page <- read_html(url)
  yield_data <- as_tibble(html_table(yield_page, fill = TRUE)[[2]])
  return(yield_data)
}

# bring in the data one year at a time.
yields_l <- lapply(1990:2019, read_a_year)

# clean and reshape the data with some handy variables for charting
periods <- tibble(
  period = paste(c(1,2,3,6,1,2,3,5,7,10,20,30), rep(c("mo", "yr"), c(4,8))),
  period_n = c(30, 60, 90, 180, 365.25 * c(1,2,3,5,7,10,20,30))
)

yields <-"rbind", yields_l) %>%
  mutate(Date = as.Date(Date, format = c("%m/%d/%y"))) %>%
  gather(period, value, -Date) %>%
  mutate(value = suppressWarnings(as.numeric(value))) %>% 
  left_join(periods, by = "period") %>%
  group_by(Date) %>%
  mutate(ratio_5_30 = value[period == "30 yr"] / value[period == "5 yr"],
         yield_av = mean(value, na.rm = TRUE, trim = 0.2),
         yield_30yr = value[period == "30 yr"],
         yield_3mo = value[period == "3 mo"],
         diff303 = yield_30yr - yield_3mo) %>%
  ungroup()

#----------------All dates in one chart---------------------  
col_br <- tibble(
  lab = c(1990, 2000, 2010),
  date = as.Date(paste0(c(1990, 2000, 2010), "-03-01"))
) %>%
  mutate(date_n = as.numeric(date))

p2 <- yields %>% 
#  filter(Date < as.Date("1990-01-31")) %>%
  ggplot(aes(x = period_n, y = value, group = Date, colour = as.numeric(Date))) +
  geom_path(alpha = 0.1) +
  scale_colour_viridis_c("",
                         breaks = pull(col_br, date_n),
                         labels = pull(col_br, lab)) +
  scale_y_continuous("Treasury yield curve rate") +
  scale_x_continuous("", breaks = periods[c(10:12), ]$period_n,
                     labels = periods[c(10:12), ]$period) +
  labs(caption = the_caption)


#--------------animated MP4 version----------
d <- yields  # %>%
#   filter(Date < as.Date("1991-02-28"))

a <- d %>% 
  ggplot(aes(x = period_n, y = value)) +
  geom_segment(data = distinct(d, Date, yield_3mo, yield_30yr),
               x = 90, xend = 10958, aes(y = yield_3mo, yend = yield_30yr),
               colour = "grey50") +
  geom_line(size = 1.5, aes(colour = diff303)) +
  scale_y_continuous("Treasury yield curve rate") +
  scale_x_continuous("", breaks = periods[c(10:12), ]$period_n,
                     labels = periods[c(10:12), ]$period) +
  scale_colour_viridis_c("Premium for long term lending:\n30 year yield minus 3 month yield", 
                         option= "A", direction = -1) +
  labs(title = "US Treasury Yield Curve Rates, 1990 to 2019",
       subtitle = 'Date: {frame_time}',
       caption = the_caption) +
  transition_time(Date)

# Save the frames in the file system and then manually knit into an animation, because
# there are so many and so large that I like to keep control of the two steps:
res <- 150
animate(a, nframes = length(unique(d$Date)) * 3, dev = "png", fps = 30,
        type = "cairo-png", antialias = "subpixel", 
        width = 6 * res, height =  4.3 * res, res = res,
        renderer = file_renderer(dir = "tmp", overwrite = TRUE))

od <- setwd("tmp")
system("ffmpeg -i gganim_plot%04d.png  -pix_fmt yuv420p -s 900x646 -c:v libx264 -r 30 movie.mp4")
setwd(od)

#----------------traditional time series version-----------
p3 <- yields %>%
  mutate(period = fct_reorder(period, period_n)) %>%
  ggplot(aes(x = Date, y = value, colour = period)) +
  geom_line() +
  scale_colour_viridis_d(option = "C") +
  labs(x = "", colour = "Term", y = "Treasury yield curve rate",
       caption = the_caption) +
  ggtitle("US Treasury Yield Curve Rates, 1990 to 2019")


#-----------------Gif version for Twitter------------
# shorter version with just one main observation per month
d2 <- yields %>%
  mutate(mon = month(Date),
         yr = year(Date)) %>%
  group_by(mon, yr, period_n) %>%
  summarise(value = mean(value),
            yield_3mo = mean(yield_3mo),
            yield_30yr = mean(yield_30yr),
            diff303 = mean(diff303)) %>%
  ungroup() %>%
  mutate(Date = as.Date(paste(yr, mon, 15, sep = "-"), format = "%Y-%m-%d"))

a2 <- d2 %>% 
  ggplot(aes(x = period_n, y = value)) +
  geom_segment(data = distinct(d2, Date, yield_3mo, yield_30yr),
               x = 90, xend = 10958, aes(y = yield_3mo, yend = yield_30yr),
               colour = "grey50") +
  geom_line(size = 1.5, aes(colour = diff303)) +
  scale_y_continuous("Treasury yield curve rate") +
  scale_x_continuous("", breaks = periods[c(10:12), ]$period_n,
                     labels = periods[c(10:12), ]$period) +
  scale_colour_viridis_c("Premium for long term lending:\n30 year yield minus 3 month yield", 
                         option= "A", direction = -1) +
  labs(title = "US Treasury Yield Curve Rates, 1990 to 2019",
       subtitle = 'Date: {frame_time}',
       caption = the_caption) +
  transition_time(Date)

res <- 150
animate(a2, nframes = length(unique(d2$Date)) * 6, dev = "png", fps = 15,
        type = "cairo-png", antialias = "subpixel", 
        width = 6 * res, height =  4.3 * res, res = res)

thankr::shoulders() %>% 
  mutate(maintainer = str_squish(gsub("<.+>", "", maintainer)),
         maintainer = ifelse(maintainer == "R-core", "R Core Team", maintainer)) %>%
  group_by(maintainer) %>%
  summarise(`Number packages` = sum(no_packages),
            packages = paste(packages, collapse = ", ")) %>%
  arrange(desc(`Number packages`)) %>%
  knitr::kable()
| maintainer | Number packages | packages |
|:---|---:|:---|
| Hadley Wickham | 15 | assertthat, dplyr, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, tidyr, tidyverse |
| R Core Team | 11 | base, compiler, datasets, graphics, grDevices, grid, methods, stats, tools, utils, nlme |
| Gábor Csárdi | 4 | cli, crayon, pkgconfig, progress |
| Kirill Müller | 4 | DBI, hms, pillar, tibble |
| Lionel Henry | 4 | purrr, rlang, svglite, tidyselect |
| Winston Chang | 4 | extrafont, extrafontdb, R6, Rttf2pt1 |
| Yihui Xie | 4 | evaluate, knitr, rmarkdown, xfun |
| Jim Hester | 3 | glue, withr, readr |
| Thomas Lin Pedersen | 3 | farver, gganimate, tweenr |
| Yixuan Qiu | 3 | showtext, showtextdb, sysfonts |
| Dirk Eddelbuettel | 2 | digest, Rcpp |
| Jennifer Bryan | 2 | cellranger, readxl |
| Simon Urbanek | 2 | audio, Cairo |
| Achim Zeileis | 1 | colorspace |
| Alex Hayes | 1 | broom |
| Charlotte Wickham | 1 | munsell |
| David Gohel | 1 | gdtools |
| Deepayan Sarkar | 1 | lattice |
| Gabor Csardi | 1 | prettyunits |
| James Hester | 1 | xml2 |
| Jeremy Stephens | 1 | yaml |
| Jeroen Ooms | 1 | jsonlite |
| Joe Cheng | 1 | htmltools |
| Kamil Slowikowski | 1 | ggrepel |
| Kevin Ushey | 1 | rstudioapi |
| Marek Gagolewski | 1 | stringi |
| Matthew Lincoln | 1 | clipr |
| Max Kuhn | 1 | generics |
| Michel Lang | 1 | backports |
| Peter Ellis | 1 | frs |
| Rasmus Bååth | 1 | beepr |
| Stefan Milton Bache | 1 | magrittr |
| Vitalie Spinu | 1 | lubridate |

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

Continue Reading…


Read More

Data Visualization in Python: Matplotlib vs Seaborn

Seaborn and Matplotlib are two of Python's most powerful visualization libraries. Seaborn requires less syntax and has stunning default themes, while Matplotlib is more easily customizable through direct access to its classes.

Continue Reading…


Read More

A question about the piranha problem as it applies to A/B testing

Wicaksono Wijono writes:

While listening to your seminar about the piranha problem a couple weeks back, I kept thinking about a similar work situation but in the opposite direction. I’d be extremely grateful if you share your thoughts.

So the piranha problem is stated as “There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data.” The task, then, is to find out which large effects are real and which are spurious.

At work, sometimes people bring up the opposite argument. When experiments (A/B tests) are pre-registered, a lot of times the results are not statistically significant. And a few months down the line people would ask if we can re-run the experiment, because the app or website has changed, and so the treatment might interact differently with the current version. So instead of arguing that large effects can be explained by an interaction of previously established large effects, some people argue that large effects are hidden by yet unknown interaction effects.

My gut reaction is a resounding no, because otherwise people would re-test things every time they don’t get the results they want, and the number of false positives would go up like crazy. But it feels like there is some ring of truth to the concerns they raise.

For instance, if the old website had a green layout, and we changed the button to green, then it might have a bad impact. However, if the current layout is red, making the button green might make it stand out more, and the treatment will have positive effect. In that regard, it will be difficult to see consistent treatment effects over time when the website itself keeps evolving and the interaction terms keep changing. Even for previously established significant effects, how do we know that the effect size estimated a year ago still holds true with the current version?

What do you think? Is there a good framework to evaluate just when we need to re-run an experiment, if that is even a good idea? I can’t find a satisfying resolution to this.

My reply:

I suspect that large effects are out there, but, as you say, the effects can be strongly dependent on context. So, even if an intervention works in a test, it might not work in the future because in the future the conditions will change in some way. Given all that, I think the right way to study this is to explicitly model effects as varying. For example, instead of doing a single A/B test of an intervention, you could try testing it in many different settings, and then analyze the results with a hierarchical model so that you’re estimating varying effects. Then when it comes to decision-making, you can keep that variation in mind.
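The partial-pooling idea in the reply can be sketched numerically. A minimal, illustrative Python example (invented numbers; a real analysis would fit a proper hierarchical model, e.g. in Stan): shrink each noisy per-setting estimate toward the overall mean, with less shrinkage for better-measured settings.

```python
# Empirical-Bayes-style shrinkage of per-setting A/B effect estimates
# toward the grand mean, as a hierarchical model would do.

def shrink(estimates, std_errors, tau):
    """Partial pooling: shrink each estimate toward the grand mean.

    tau is an assumed between-setting standard deviation of true effects.
    """
    grand = sum(estimates) / len(estimates)
    shrunk = []
    for est, se in zip(estimates, std_errors):
        weight = tau**2 / (tau**2 + se**2)  # how much to trust this setting
        shrunk.append(grand + weight * (est - grand))
    return shrunk

# The same intervention measured in four contexts:
effects = [0.9, 0.1, -0.2, 0.4]
ses = [0.5, 0.5, 0.5, 0.5]
pooled = shrink(effects, ses, tau=0.3)
# Each shrunk estimate lies between the raw estimate and the grand mean,
# so extreme results from noisy settings are tempered.
```

The decision-maker then works with the distribution of `pooled` effects, keeping the between-setting variation in mind rather than trusting any single test.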

Continue Reading…


Read More

Paychex: Leader of Data Science [East Coast]

The Leader of Data Science should be well versed in the creation, critique, and deployment of complex analysis and predictive and prescriptive modeling that requires the use of both structured and unstructured data from multiple sources, such as SQL, Oracle, Access databases, and text.

Continue Reading…


Read More

Explore generative models and latent space with a simple spreadsheet interface

Generative models can seem like a magic box where you plug in observed data, turn some dials, and see what the computer spits out. SpaceSheet is a simple spreadsheet interface to explore and experiment for a clearer view of the spaces between. Even if you’re not into this research area, it’s fun to click and drag things around to see what happens.

Tags: , ,

Continue Reading…


Read More

Four short links: 19 April 2019

AI Music, Mind-Controlled Robot Hands, Uber's Repo Tools, and Career Resilience

  1. AI and Music (The Verge) -- total legal clusterf*ck.
  2. A Robot Hand Controlled with the Mind -- student uses open source hand and trains brain-machine interface, and holy crap we live in an age when these kinds of things are relatively easy to do rather than requiring massive resources.
  3. Keeping Master Green -- This paper presents the design and implementation of SubmitQueue. It guarantees an always green master branch at scale: all build steps (e.g., compilation, unit tests, UI tests) successfully execute for every commit point. SubmitQueue has been in production for over a year and can scale to thousands of daily commits to giant monolithic repositories. Uber's tech. (via Adrian Colyer)
  4. Early Career Setback and Future Career Impact -- Our analyses reveal that an early-career near miss has powerful, opposing effects. On one hand, it significantly increases attrition, with one near miss predicting more than a 10% chance of disappearing permanently from the NIH system. Yet, despite an early setback, individuals with near misses systematically outperformed those with near wins in the longer run, as their publications in the next 10 years garnered substantially higher impact. We further find that this performance advantage seems to go beyond a screening mechanism, whereby a more selected fraction of near-miss applicants remained than the near winners, suggesting that early-career setback appears to cause a performance improvement among those who persevere. Overall, the findings are consistent with the concept that "what doesn't kill me makes me stronger."

Continue reading Four short links: 19 April 2019.

Continue Reading…


Read More

If you did not already know

Survival-CRPS google
Personalized probabilistic forecasts of time to event (such as mortality) can be crucial in decision making, especially in the clinical setting. Inspired by ideas from the meteorology literature, we approach this problem through the paradigm of maximizing sharpness of prediction distributions, subject to calibration. In regression problems, it has been shown that optimizing the continuous ranked probability score (CRPS) instead of maximum likelihood leads to sharper prediction distributions while maintaining calibration. We introduce the Survival-CRPS, a generalization of the CRPS to the time to event setting, and present right-censored and interval-censored variants. To holistically evaluate the quality of predicted distributions over time to event, we present the Survival-AUPRC evaluation metric, an analog to area under the precision-recall curve. We apply these ideas by building a recurrent neural network for mortality prediction, using an Electronic Health Record dataset covering millions of patients. We demonstrate significant benefits in models trained by the Survival-CRPS objective instead of maximum likelihood. …
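The CRPS that the paper generalises has a well-known closed form for a Gaussian forecast. A minimal sketch of that standard formula (not the paper's right- or interval-censored survival variants):

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2) and outcome y.

    Lower is better; the score rewards forecasts that are both sharp
    (narrow) and calibrated (centred on the truth).
    """
    z = (y - mu) / sigma
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

# A sharper forecast that is still centred on the truth scores better:
print(crps_gaussian(0, 1.0, 0.0))  # wide forecast
print(crps_gaussian(0, 0.5, 0.0))  # sharp forecast: lower (better) score
print(crps_gaussian(0, 1.0, 2.0))  # badly missed outcome: higher score
```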

Temporal Logic google
In logic, temporal logic is any system of rules and symbolism for representing, and reasoning about, propositions qualified in terms of time (for example, ‘I am always hungry’, ‘I will eventually be hungry’, or ‘I will be hungry until I eat something’). It is sometimes also used to refer to tense logic, a modal logic-based system of temporal logic introduced by Arthur Prior in the late 1950s, with important contributions by Hans Kamp. It has been further developed by computer scientists, notably Amir Pnueli, and logicians. Temporal logic has found an important application in formal verification, where it is used to state requirements of hardware or software systems. For instance, one may wish to say that whenever a request is made, access to a resource is eventually granted, but it is never granted to two requestors simultaneously. Such a statement can conveniently be expressed in a temporal logic. …

Mask-ShadowGAN google
This paper presents a new method for shadow removal using unpaired data, enabling us to avoid tedious annotations and obtain more diverse training samples. However, directly employing adversarial learning and cycle-consistency constraints is insufficient to learn the underlying relationship between the shadow and shadow-free domains, since the mapping between shadow and shadow-free images is not simply one-to-one. To address the problem, we formulate Mask-ShadowGAN, a new deep framework that automatically learns to produce a shadow mask from the input shadow image and then takes the mask to guide the shadow generation via re-formulated cycle-consistency constraints. Particularly, the framework simultaneously learns to produce shadow masks and learns to remove shadows, to maximize the overall performance. Also, we prepared an unpaired dataset for shadow removal and demonstrated the effectiveness of Mask-ShadowGAN on various experiments, even it was trained on unpaired data. …

Continue Reading…


Read More

Shedding Light on the “Grand Débat”

This blog post was initially featured on Medium. At LightOn, we decided to start a blog and explain what can be done with the random projections performed by our Optical Processing Unit (OPU). Since most people think an optics based system can do only images, we decided to illustrate our technology on an NLP task. Here is the article:

This article is the first contribution from LightOn’s AI Research (LAIR), focused on applying LightOn’s Optical Processing Unit (OPU) hardware to generic large-scale Machine Learning tasks. Today we tackle the recently released public dataset of a nation-wide public debate, in our first trial of using LightOn’s OPU for a Natural Language Processing task.

As a response to the “Gilets Jaunes” social unrest in France, President Emmanuel Macron declared the opening of a nation-wide citizen debate (“Grand Débat National”), as a way to collect the opinion of the French people on a number of subjects. Aside from public meetings taking place throughout the country, the “Grand Débat” took the form of an online platform where people could answer questions around four major themes: the environment, taxes, state organization, and democracy / citizenship. For every theme, the survey was divided into two parts: the first part featured multiple-choice surveys, while the second part included open-ended questions. A subset of the answers was made openly available by Etalab, the public organization responsible for the Open Data policy of the French government. The data can be found here.
The deadline for survey submissions to the “Grand Débat” was March 18 2019. The French government provided a first analysis on Monday, four days ago. At LightOn, we have been looking for applications of our Optical Processing Unit (OPU) to various Machine Learning problems including Natural Language Processing (NLP). We therefore took this opportunity to feature some exploratory tests on this new dataset. It should be emphasized that no one at LightOn is currently an expert in NLP: the goal of this article is only to demonstrate the kind of data processing that can be done using an OPU. We want to show how this can be done, in only a few days of work, and with no specific knowledge of NLP but a generic Machine Learning background.
The first NLP application of the OPU technology was to form sentence embeddings from word embeddings. Word embeddings are vector representation of words introduced in their modern form by Mikolov et al. (2013)¹. They have been extremely successful as a first step of virtually all NLP models: we replace vanilla one-hot-encoded vectors of dimension the size of the vocabulary with these typically 300-dimensional dense real-valued vectors. The space of word embeddings even has a sensible arithmetic, the classical example being KING — MAN + WOMAN = QUEEN. To obtain these embeddings, a linear neural network is trained to predict the surrounding words given the current word in a sliding window over a huge text corpus (this model is called skip-gram, predicting the current word from the context is also possible). At the end of the training, the projection columns of the weight matrix are the word embeddings. You can see it as a kind of shallow transfer learning, where only the first layer is pre-trained.
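The skip-gram data generation described above (sliding a window over the corpus and predicting surrounding words from the current word) can be sketched in a few lines of Python; the pairs below are what the shallow network would be trained on:

```python
# Emit (center word, context word) training pairs from a token sequence,
# as in the skip-gram formulation of word2vec.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the king and the queen".split()
print(skipgram_pairs(sentence, window=1))
```

After training a linear network on millions of such pairs, the rows of the projection matrix become the word embeddings.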
Sentence embeddings target similar representations, but for whole sentences. It is a much harder task than word embeddings because, while words are fixed entities and in finite and reasonable number (at least for a single language), sentences are compositional and sequential in nature. They have several degrees of freedom stemming from meaning, word choice, syntax, tone, length, etc. This diversity not only means that there is more information to encode, but also that it’s virtually impossible to save an embedding for every possible sentence. Sentence embeddings instead have to be formed on-the-fly from their constituent words.
At the time of this writing, some of the leading models for sentence embeddings are SkipThought² and InferSent³. In particular, the paper that sparked our interest in sentence embeddings was the recent work of Wieting and Kiela (2019)⁴. This work shows that it is possible to obtain results on SentEval⁵ comparable to these advanced models with no training at all. One of the models used in this paper, Bag Of Random Embedding Projections (BOREP), uses random projections, which happens to be the operation an OPU excels at. As a result, we decided to give it a try. We followed BOREP but replaced the linear random projection with an OPU transform: a complex-valued random projection followed by an element-wise non-linearity. The model is very simple: we get the embedding of each word in a sentence, we project it to a (much) higher dimension and then we apply a pooling function (the mean in practice) to obtain the embedding of the sentence.
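A small numpy sketch of the BOREP recipe just described, with a simulated transform standing in for the OPU (random "word vectors" instead of real fastText embeddings, and a ReLU as a stand-in non-linearity; the dimensions are the ones used later in the post):

```python
# BOREP: fixed (never trained) random projection of each word vector to a
# high dimension, a non-linearity, then mean-pooling over the sentence.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 300, 15_000  # word-embedding dim -> high-dim projection
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # fixed random matrix

def sentence_embedding(word_vectors):
    projected = np.maximum(W @ word_vectors.T, 0)  # stand-in for the OPU transform
    return projected.mean(axis=1)                  # mean-pool over the words

words = rng.standard_normal((7, d_in))  # a 7-word "sentence"
emb = sentence_embedding(words)
print(emb.shape)  # one 15,000-d vector for the whole sentence
```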

The input to the current OPU prototypes (with remote access available to selected users) is a binary vector, meaning in this case that we need to use binary word embeddings. Since there is no such native embedding, we need to binarize real-valued word embeddings. For the latter, we chose fastText⁶, as it is among the best-performing word embedding methods to date. To binarize the embeddings, we could use any standard binary encoding scheme. However, it is hard to know a priori how destructive it would be for this kind of data. A possibly smarter thing to do is to train an autoencoder for the task. Luckily, some people did just that: binarize word embeddings using an autoencoder⁷. The associated github repo is in C, does not make use of a GPU and cannot handle large embeddings due to memory limitations, so we re-implemented it in PyTorch. It is hard to tell anything from the loss, so we evaluated our binary word embeddings directly on SentEval and obtained satisfying results for most tasks, except Semantic Textual Similarity (STS 12–16), for which there seems to be an unavoidable information loss. On a number of tasks, they performed even better than the original ones, probably because of their higher dimension. Indeed, the paper focuses on memory and computation savings, so the authors use rather low-dimensional binary embeddings (up to 512 bits for an original dimension of 300 real numbers). Our goal is instead to obtain binary embeddings with the smallest possible information loss, so we went to much higher dimensions. Keep in mind that producing random projections on the OPU is essentially independent of the size, currently up to dimensions of 1 million! In the end we used 3000 bits for each vector, so 10 bits per real number instead of 32, which seems reasonable.
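For intuition, here is the naive "standard binary encoding" baseline that the autoencoder improves on: random hyperplane rounding, which maps a 300-d real vector to 3000 bits while roughly preserving cosine similarity (toy data; the post's actual pipeline uses the trained autoencoder instead):

```python
# Random-hyperplane binarization: each bit records which side of a random
# hyperplane the vector falls on. Hamming distance then tracks the angle
# between the original real-valued vectors.
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 300, 3000  # ~10 bits per real number, as in the post
H = rng.standard_normal((n_bits, d))

def binarize(v):
    return (H @ v > 0).astype(np.uint8)

a = rng.standard_normal(d)
b = a + 0.1 * rng.standard_normal(d)  # a small perturbation of a
c = rng.standard_normal(d)            # an unrelated vector

print(int(np.sum(binarize(a) != binarize(b))))  # few bits flip
print(int(np.sum(binarize(a) != binarize(c))))  # ~half the bits differ
```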
Once this is done, we can actually use the OPU to form sentence embeddings using the aforementioned BOREP method. Evaluating it on SentEval gives satisfying results, in many tasks superior to that of the paper (even the Echo State Network). It should be noted that we didn’t find it useful to go above 15,000 dimensions, meaning that a high-end GPU could also have done the job in this case.
There are several reasons why increasing the dimensionality could increase performance on downstream tasks. The first is Cover’s theorem: projecting data non-linearly to a higher dimension increases the chance of them being linearly separable in the new high-dimensional space. This, of course, plays a role but the fact that a linear projection also improves performance in the Wieting and Kiela paper proves that simply adding parameters to the model already helps a lot. More interestingly, we can draw a parallel with work on hyperdimensional computing⁸. In very high-dimensional spaces (several thousand dimensions), the sheer volume available makes it possible to represent an entire dataset with no two points close to each other. The distances between two random points follow a very narrow Gaussian distribution: all points are more or less equidistant. In this setting, the mean of a few vectors is more similar to each of these vectors than any other point. The mean vector is then not merely a lossy summary of the original vectors but actually represents the set that contains them. We can thus expect the mean of the high-dimensional projections of the word embeddings to be truly more representative of the sentence than a mean computed in a lower-dimensional space, irrespective of the fact that the model used in downstream applications will have more parameters to accommodate this higher dimensionality.
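The hyperdimensional-computing claim above is easy to check numerically on toy data: in several thousand dimensions, the mean of a few random vectors is far more similar to each of its members than to any other random point.

```python
# In high dimension, random points are nearly equidistant, so the mean of a
# small set is distinctly more similar to its members than to anything else.
import numpy as np

rng = np.random.default_rng(0)
dim, n_sentence, n_other = 10_000, 8, 100
words = rng.standard_normal((n_sentence, dim))  # "word" projections
others = rng.standard_normal((n_other, dim))    # unrelated points
mean_vec = words.mean(axis=0)

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

member_sims = [cos(mean_vec, w) for w in words]
other_sims = [cos(mean_vec, o) for o in others]
print(min(member_sims), max(other_sims))  # members clearly dominate
```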
For our task of dealing with answers to open-ended questions, which is notoriously difficult, we decided that an interactive visualisation would be a nice way to explore the answers. The goal was to see clusters of similar answers as a summary of what people had to say on the matter. So we chose a specific question and formed an embedding for every answer as if it were a single sentence. From visual observations, if an answer contains several sentences, they are most of the time related, so grouping them seems a reasonable first-attempt assumption. We end up with almost a hundred thousand points in dimension 15,000 (which is also why going above 15,000 would have been problematic) and want to visualise them. A PCA to 2 or 3 dimensions doesn’t give a satisfying result, so we first keep the first 50 principal components and then use t-SNE⁹ to obtain a 2D representation of the data. The result exhibits some interesting patterns but upon inspection, we realize that the global structure doesn’t make much sense. Indeed, t-SNE only preserves local neighborhoods and loses any sense of global structure. Clusters of loosely similar answers are thus not close to each other and can even be on opposite sides of the graph. Increasing the perplexity of t-SNE is supposed to favor global structure over local structure, but this had little effect, so we used another technique. t-SNE is an iterative, stochastic method requiring an initial projection. This initialization is usually random, but it doesn’t have to be. The first two principal components are the best 2D representation of the global structure of the data, so we use them (in practice taking the first two columns of the 50-component PCA we already computed) as the initial state of t-SNE. This allows us to preserve global structure while benefiting from the superior visualizing power of t-SNE. Follow this link to see an interactive version (in French) of this visualization.
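The PCA-then-t-SNE trick just described can be sketched with numpy (toy data in place of the real sentence embeddings; with scikit-learn, the two columns computed below would be passed as the `init` array of `TSNE` instead of the default random initialisation):

```python
# Keep the first 50 principal components, then reuse the first two as a
# global-structure-preserving initial state for t-SNE.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 300))  # stand-in for the answer embeddings

Xc = X - X.mean(axis=0)               # centre before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca50 = Xc @ Vt[:50].T                # first 50 principal components
tsne_init = pca50[:, :2]              # 2-D initialisation for t-SNE
print(pca50.shape, tsne_init.shape)
```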
It only contains 10,000 answers, sampled uniformly without replacement, so that the interactivity remains smooth. Some annotated screenshots are shown just below. What you see are the answers to the first question of the “ecology” (environment) questionnaire. The question was:
What is, to you, the most important concrete problem regarding ecology?
  1. Air pollution
  2. Biodiversity and species extinction
  3. Climate change
  4. Shore erosion
  5. Other: answer in plain text

It was a semi-open question with 4 suggested answers but also the possibility to write another answer in plain text if one was not satisfied with the suggestions. Very few people answered 4, so it does not stand out on the graph, but you can clearly see answers 1, 2 and 3 circled in red on the first graph. In green are answers that mentioned two of the suggested answers and in yellow answers that gave variations of one of the suggested answers, e.g. global warming instead of climate change or water pollution instead of air pollution. A significant portion of people refused to choose and said that all suggestions were equally important.

Our visualization technique mostly achieves our objective for short sentences, less so for longer ones. These are instead gathered in the middle, where more groups can be found, as you can see here:

Long answers are hard to classify as belonging to one group or another, and are likely to be far from any other answer, as mentioned before. We believe the big cloud in the middle to be an artifact of the projection gathering such isolated answers, more than an artifact of the embeddings.
Overall, the result is rather satisfying given the time spent. There is, of course, much room for improvement, with at least three directions to explore:
  • text preprocessing: we did tokenization, stop-word removal and lemmatization, which are standard practice, but more thought could be put into it. For instance, a spelling corrector should help;
  • multi-sentence answers: can we separate semantically independent parts of long answers to give them tags? Can we combine the sentence embeddings of a multi-sentence answer better than with a simple average?
  • better sentence embeddings.

On the last point, one could, of course, use a more complex model than ours, e.g. SkipThought or InferSent. More recent models such as QuickThought¹⁰ or multi-task learning approaches¹¹ ¹² have also been proposed. However, all these models are much harder to use than ours. Trained encoders are available in English but not in French, so one would have to retrain them. For supervised approaches, the training data doesn’t even exist in French. SkipThought is unsupervised but takes several weeks to train. The best candidate would be QuickThought, but it still requires a book corpus in French and a day of training on a powerful GPU (a Titan X in the paper).
It is instead very easy to obtain satisfying results quickly with our model and OPU hardware. There is no training, any language works as long as word embeddings are available (fastText supports 157 languages), and all that is needed is the projection time (less than a minute in our case) and enough RAM to store the results. The disadvantage is that the pooling function in our model loses word-ordering information. The next step for us is therefore to try a time-aware model such as an Echo State Network, which was used with success in Wieting and Kiela (2019) and which can also be implemented very efficiently at large scale on an OPU¹³.
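For concreteness, the model just described (roughly: mean-pool the word vectors of an answer, then apply a random projection, which the OPU performs optically) can be sketched with made-up, demo-sized dimensions; the real projection goes to ~15,000 dimensions.

```r
# Rough sketch of the sentence-embedding model: mean-pool fastText-style
# word vectors, then apply a random projection (done optically on an OPU;
# here an ordinary random matrix, with made-up demo dimensions).
sentence_embedding <- function(word_vecs, R) {
  pooled <- colMeans(word_vecs)  # mean pooling: word order is lost here
  as.vector(R %*% pooled)        # random projection to the sentence space
}

set.seed(42)
R     <- matrix(rnorm(50 * 10), nrow = 50)  # 10-dim words -> 50-dim sentences
words <- matrix(rnorm(4 * 10), nrow = 4)    # 4 word vectors for one answer
emb   <- sentence_embedding(words, R)
length(emb)  # 50
```

The `colMeans` line is exactly where ordering information disappears, which is what a time-aware model like an Echo State Network would address.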
More info on LightOn’s OPU technology can be found on the LightOn website. Remote access to an OPU through the LightOn Cloud can be granted by invitation.

[1] Tomas Mikolov et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).
[2] Ryan Kiros et al. “Skip-thought vectors.” Advances in Neural Information Processing Systems. 2015.
[3] Alexis Conneau et al. “Supervised learning of universal sentence representations from natural language inference data.” arXiv preprint arXiv:1705.02364 (2017).
[4] John Wieting and Douwe Kiela. “No training required: Exploring random encoders for sentence classification.” arXiv preprint arXiv:1901.10444 (2019).
[5] Alexis Conneau and Douwe Kiela. “Senteval: An evaluation toolkit for universal sentence representations.” arXiv preprint arXiv:1803.05449 (2018).
[6] Piotr Bojanowski et al. “Enriching word vectors with subword information.” CoRR abs/1607.04606 (2016).
[7] Julien Tissier, Guillaume Gravier and Amaury Habrard. “Near-lossless Binarization of Word Embeddings.” arXiv preprint arXiv:1803.09065 (2018).
[8] Pentti Kanerva. “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors.” Cognitive Computation 1.2 (2009): 139–159.
[9] Laurens van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE.” Journal of Machine Learning Research 9.Nov (2008): 2579–2605.
[10] Lajanugen Logeswaran and Honglak Lee. “An efficient framework for learning sentence representations.” arXiv preprint arXiv:1803.02893 (2018).
[11] Sandeep Subramanian et al. “Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning.” CoRR abs/1804.00079 (2018).
[12] Daniel Cer et al. “Universal sentence encoder.” arXiv preprint arXiv:1803.11175 (2018).
[13] Jonathan Dong et al. “Scaling up Echo-State Networks with multiple light scattering.” 2018 IEEE Statistical Signal Processing Workshop (2018), arXiv:1609.05204

Author: François Boniface, Machine Learning R&D engineer at LightOn AI Research.

Join the CompressiveSensing subreddit or the Facebook page and post there!

Continue Reading…


Read More

Document worth reading: “A review of swarmalators and their potential in bio-inspired computing”

From fireflies to heart cells, many systems in Nature show the remarkable ability to spontaneously fall into synchrony. By imitating Nature’s success at self-synchronizing, scientists have designed cost-effective methods to achieve synchrony in the lab, with applications ranging from wireless sensor networks to radio transmission. A similar story has occurred in the study of swarms, where inspiration from the behavior of flocks of birds and schools of fish has led to ‘low-footprint’ algorithms for multi-robot systems. Here, we continue this ‘bio-inspired’ tradition, by speculating on the technological benefit of fusing swarming with synchronization. The subject of recent theoretical work, minimal models of so-called ‘swarmalator’ systems exhibit rich spatiotemporal patterns, hinting at utility in ‘bottom-up’ robotic swarms. We review the theoretical work on swarmalators, identify possible realizations in Nature, and discuss their potential applications in technology. A review of swarmalators and their potential in bio-inspired computing

Continue Reading…


Read More

Generating the Ultimate List of 41 Data Science Podcasts by Crowdsourcing Google Results

(This article was first published on Learn R Programming & Build a Data Science Career | Michael Toth, and kindly contributed to R-bloggers)

Confession time: years ago, I was skeptical of podcasts. I was a music-only listener on commutes. Can you imagine? But around 2016, I gave in and finally took the plunge into podcasts. And I’m so glad that I did.

Since then, I’ve seen enormous benefits, all attributable at least in part to the podcasts I’ve listened to. I’ve improved my programming, learned new skills, and started multiple income-generating businesses.

Naturally, I’ve been interested in data science podcasts. Initially, I found my data science podcasts through people I followed on Twitter whose shows I knew about. But I’m always on the lookout for new podcasts, so I took to the internet to find recommendations.

If you’re a podcast listener, you’ve probably learned that the current state of podcast discovery leaves much to be desired. It’s very hard to find new podcasts, especially for niche fields like data science and analytics.

So of course, when I tried searching for interesting data science podcasts, I kept finding the same shows recommended everywhere. And usually they were the podcasts I was already listening to!

I decided that if I wanted a bigger list of data science podcasts, I needed to go deep. I searched nearly 100 lists for recommended podcasts, and used them to compile the most complete list of data science podcasts you will find.

Click here for the full list of 41 podcasts

In this post I’m going to talk about how I collected this list and analyze the results.

Gathering a list of data science podcasts

I knew I wanted to build a big list. But I also knew that not all podcasts are created equal, and that I needed some way to differentiate the truly great podcasts on the list. I decided I would use podcast recommendations as a form of social proof, so that the most recommended podcasts would bubble to the top of the list. Here’s the method I decided to follow:

  1. I’d generate a list of search terms
  2. I’d perform a google search for each term on the list
  3. I’d open each of the top 10 Google links for that search term and note all the results
  4. I’d aggregate the complete results by podcast and use the total number of recommendations to create a “recommendation score”. Higher scores should, in theory, represent better podcasts. Or at least more well known podcasts.

Generating a list of search phrases (Step 1)

This step was relatively easy. I knew I wanted a list of data science podcasts, but I also wanted to include more specific subgenres like data visualization, as well as more broad but related topics like Python programming or SQL & Databases. I settled on this list of phrases:

  • Data Science Podcasts
  • Data Engineering Podcasts
  • Data Visualization Podcasts
  • Analytics Podcasts
  • SQL Podcasts
  • R Podcasts -reddit
  • Python Podcasts

Pretty straightforward. The only curious bit is the R Podcasts query. The Reddit URL structure is such that it leads to the podcasts subreddit. Without the explicit -reddit, all links were to that subreddit and unrelated to R programming.

Searching Google and aggregating (Steps 2-4)

Next, I’d search each of these phrases in Google and open up the first 10 sites that popped up in my search results.

Oftentimes these were lists others had created, for example “top 5 data science podcasts”. I copied down the list of podcasts and kept a running tally of how many times I’d seen a particular podcast represented.

Occasionally, the Google search would yield links to specific podcasts rather than to lists of podcasts. In this case, I would record this as well. My thinking is that ranking on Google for a particular search phrase is at least as reliable an indicator of podcast quality as inclusion on a “best data science podcasts” list. That said, this did particularly benefit those podcasts whose names closely matched my search query, as was the case with Data Engineering Podcast and The R-Podcast.

After going through all of the search queries, I had a list of podcast recommendations along with a tally of how often the podcast had been recommended in search results.

Finally, I removed any podcasts that had only 1 appearance in the results. There were a large number of these, and I felt it would dilute the value of the list to include them. This was somewhat arbitrary, but I think it makes for a stronger list overall.
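The tally-and-filter just described can be sketched in a few lines of R (podcast names and counts below are purely illustrative, not my actual data):

```r
# Illustrative sketch: tally how often each podcast appears across the
# scraped lists, then drop shows that appeared only once.
recommendations <- c("Data Stories", "Data Skeptic", "Data Stories",
                     "Linear Digressions", "Not So Standard Deviations",
                     "Data Stories", "Data Skeptic")
scores <- table(recommendations)
scores <- sort(scores[scores > 1], decreasing = TRUE)
# "Data Stories" ends up with a recommendation score of 3, "Data Skeptic"
# with 2; the single-appearance shows are dropped
```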

Analyzing the list of best data science podcasts

library(readr)      # read_csv
library(ggplot2)
library(hrbrthemes) # theme_ipsum
extrafont::loadfonts(quiet = TRUE) # Needed for hrbrthemes / mac dependency issue

podcasts <- read_csv('~/Desktop/best_podcasts.csv')

First I read in my compiled list of podcasts, then I used ggplot to graph the results.

ggplot(podcasts, aes(x = reorder(Title, Score), y = Score)) +
    geom_bar(stat = 'identity') +
    coord_flip() +
    labs(title = 'Top Data Science Podcasts',
         subtitle = 'Rankings crowdsourced from Google search results',
         x = '',
         y = 'Recommendation Score',
         caption = ' / @michael_toth') +
    theme_ipsum(base_size = 5, caption_size = 8, axis_title_size = 8) +
    theme(panel.grid.major.y = element_blank(),
          panel.grid.major.x = element_line(colour = 'white', linetype = 'dotted'),
          panel.grid.minor.x = element_line(colour = 'white', linetype = 'dotted'),
          panel.ontop = TRUE)


This is awesome! This list of 41 podcasts is more than I found anywhere else online while going through this exercise, and I’m confident it’s the most extensive list of data science podcasts you can find.

I had heard of many of the podcasts in the top 10–Data Stories, Data Skeptic, Partially Derivative, The O’Reilly Data Show, and Not So Standard Deviations–but others were new to me! I hadn’t heard of Linear Digressions, or Talking Machines, or Learning Machines 101. While I haven’t yet had a chance to listen to these, I’m excited to check them out!

So now we have a list of 41 data science podcasts to review. This is great, but I think we can do better! While these podcasts are all loosely related to data science, they’re quite different from one another in focus. For example, FiveThirtyEight Politics is a relatively mainstream podcast, Data Stories deals with topics in data visualization, and Data Skeptic is an all-around instructional data science podcast.

I thought some form of categorization would help guide my listening, so I created a broad list of categories and grouped them as best as I could. I decided I would group the podcasts into 8 categories:

  • General Data Science and Analytics
  • Relevant Mainstream Podcasts
  • Machine Learning & AI
  • Data Visualization
  • Data Engineering
  • SQL & Databases
  • R Programming
  • Python Programming

I went through this list and categorized them as best as I could according to topic. Equipped with these new categories, let’s take a look at the list of recommendations:

ggplot(podcasts, aes(x = reorder(Title, Score), y = Score, fill = Category)) +
    geom_bar(stat = 'identity') +
    coord_flip() +
    scale_fill_manual(values = c("#a6cee3","#1f78b4","#b2df8a","#33a02c","#fb9a99","#e31a1c","#fdbf6f","#ff7f00")) +
    labs(title = 'Top Data Science Podcasts',
         subtitle = 'Rankings crowdsourced from Google search results',
         x = '',
         y = 'Recommendation Score',
         caption = ' / @michael_toth') +
    theme_ipsum(base_size = 5, caption_size = 8, axis_title_size = 8) +
    theme(panel.grid.major.y = element_blank(),
          panel.grid.major.x = element_line(colour = 'white', linetype = 'dotted'),
          panel.grid.minor.x = element_line(colour = 'white', linetype = 'dotted'),
          panel.ontop = TRUE,
          legend.position = 'bottom')


While most of the top 10 are made up of what I’m calling general data science and analytics podcasts, Data Stories stands alone as the number one podcast and the only data visualization podcast in the top 10. Most of the language-specific podcasts are lower on the list, but that’s likely due to the fact that they are more specific in nature and not necessarily due to lower quality. The R podcast was the only language-specific podcast to crack the top 10, so I’m definitely going to check that one out!

Do you have favorite podcasts that aren’t included here? Let me know in the comments. Also let me know how you discover new podcasts, I’d love to improve my discovery! If you’re interested, you can get the full list of data science podcasts below:

Get the spreadsheet of all 41 data science podcasts organized by topic

Every week I publish concise tutorials 🎓 and career advice 💻 for data science and analytics workers. I will help you learn R programming, build your data science career, and raise your salary.

To leave a comment for the author, please follow the link and comment on their blog: Learn R Programming & Build a Data Science Career | Michael Toth. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

R Packages worth a look

Discretization and Grouping for Logistic Regression (glmdisc)
A Stochastic-Expectation-Maximization (SEM) algorithm (Celeux et al. (1995) <https:// …

LIME-Based Explanations with Interpretable Inputs Based on Ceteris Paribus Profiles (
Local explanations of machine learning models describe, how features contributed to a single prediction. This package implements an explanation method …

Multiple Fill and Color Scales in ‘ggplot2’ (ggnewscale)
Use multiple fill and color scales in ‘ggplot2’.

Plots for Circular Data (cplots)
Provides functions to produce some circular plots for circular data, in a height- or area-proportional manner. They include barplots, smooth density pl …

Stochastic Precipitation Downscaling with the RainFARM Method (rainfarmr)
An implementation of the RainFARM (Rainfall Filtered Autoregressive Model) stochastic precipitation downscaling method (Rebora et al. (2006) <doi:10 …

Estimation of the Proportion of Treatment Effect Explained by Surrogate Outcome Information (SurrogateOutcome)
Provides functions to estimate the proportion of treatment effect on a censored primary outcome that is explained by the treatment effect on a censored …

Continue Reading…


Read More

Magister Dixit

“Distinguishing between feature selection and dimensionality reduction might seem counter-intuitive at first, since feature selection will eventually lead (reduce dimensionality) to a smaller feature space. In practice, the key difference between the terms “feature selection” and “dimensionality reduction” is that in feature selection, we keep the “original feature axis”, whereas dimensionality reduction usually involves a transformation technique.” Sebastian Raschka ( August 24, 2014 )
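The distinction is easy to see in code; a minimal illustration on synthetic data, with `prcomp` (PCA) standing in for a generic transformation technique:

```r
# Feature selection keeps a subset of the original feature axes;
# dimensionality reduction (here PCA) builds new, transformed axes.
set.seed(7)
X <- matrix(rnorm(100 * 5), ncol = 5)

X_selected <- X[, c(1, 3)]        # selection: columns 1 and 3, untouched
X_reduced  <- prcomp(X)$x[, 1:2]  # reduction: first two principal components

identical(X_selected[, 1], X[, 1])  # TRUE: the original axis survives intact
```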

Continue Reading…


Read More

Using ecmwfr to measure global warming

(This article was first published on Code R, and kindly contributed to R-bloggers)

For my research I needed to download gridded weather data from ERA-Interim, which is a big dataset generated by the ECMWF. Getting long-term data through their website is very time consuming and requires a lot of clicks. Thankfully, I came across the nifty ecmwfr R package that allowed me to do it with ease. One of the great things about open source is that users can also be collaborators, so I made a few suggestions and offered some code.

Now that a new version is on its way to CRAN, I wanted to show a quick example of what you can do with the package.

To download data from the ECMWF servers you need to have an account. So if you’re following along at home, you’ll need to create one and add the API key to the ecmwfr keyring. This is all done interactively with this command:


wf_set_key(service = "webapi")

This will take you to the correct URL where you can get your key, and then sorts everything out to use it.

Now, in order to get data from the ECMWF servers I need to have a valid request. Unfortunately, since the ECMWF hosts a huge number of different datasets with different data streams and time resolutions, building a valid request from scratch is rather complicated. The easiest way is to go to the website and use the point-and-click interface to create a basic request for the dataset of interest. In my case, I will use monthly data from ERA Interim.

ERA Interim request

ERA Interim request

As you can see, there’s no way of retrieving every year in one request using the web interface. But at the bottom of the page there’s a link that says “View the MARS request”.

ERA Interim MARS

ERA Interim MARS

Here, I get a valid request that I can modify slightly. In R, I converted this template into a list using the “MARS to list” RStudio addin (but you can do it manually). I added format = "netcdf" at the end to get the data as a NetCDF file.

I then pass that list to the wf_archetype() function.

ERAI_monthly <- wf_archetype(
   request = list(
      class   = "ei",
      dataset = "interim",
      date    = "19790101/19790201/19790301/19790401/19790501/19790601/19790701/19790801/19790901/19791001/19791101/19791201",
      expver  = "1",
      grid    = "0.75/0.75",
      levtype = "sfc",
      param   = "167.128",
      stream  = "moda",
      type    = "an",
      target  = "output",
      format  = "netcdf"),
   dynamic_fields = c("date", "grid", "target"))

Now ERAI_monthly is a function with arguments “date”, “grid” and “target”. The reason I don’t just change the list willy-nilly is that I want to be sure to always get a valid request pointing to the ERA Interim dataset. For this short example it’s probably overkill, but it’s useful in a bigger project with multiple data requests.

As you can see, the MARS format for dates can be rather long to type, so I’ll create a custom function to make this easier:

format_dates <- function(dates) {
   dates <- as.Date(dates)
   paste0(lubridate::year(dates),
          formatC(lubridate::month(dates), width = 2, flag = "0"),
          formatC(lubridate::day(dates), width = 2, flag = "0"),
          collapse = "/")
}

format_dates(c("2018-01-01", "2018-02-01"))
## [1] "20180101/20180201"

Now I’m ready to download data! I was born in August 1988, so I will be looking at how the month of August has changed since that year. I’m also not terribly interested in local changes, so I’ll use a 3° by 3° resolution.

dates <- seq.Date(as.Date("1988-08-01"), as.Date("2018-08-01"), "1 year")

my_request <- ERAI_monthly(date = format_dates(dates), 
                           grid = "3/3",
                           target = "")
str(my_request)
## List of 11
##  $ class  : chr "ei"
##  $ dataset: chr "interim"
##  $ date   : chr "19880801/19890801/19900801/19910801/19920801/19930801/19940801/19950801/19960801/19970801/19980801/19990801/200"| __truncated__
##  $ expver : chr "1"
##  $ grid   : chr "3/3"
##  $ levtype: chr "sfc"
##  $ param  : chr "167.128"
##  $ stream : chr "moda"
##  $ type   : chr "an"
##  $ target : chr ""
##  $ format : chr "netcdf"

And now, wf_request() to download the data. This will take some time. Not because I’m downloading a huge file (it’s not; it’s about 455 kB) but because the ECMWF has to process the request and collect the data. So if you’re following along at home, you can go make some tea or, if you’re like me, mate. 🍵

wf_request(request = my_request,
           user = "", 
           transfer = TRUE,
           path = "data", 
           verbose = FALSE)

Now that I have my data saved as “”, I’ll just need to load it and analyse it. I’ll be using my metR package for that.

august_temp <- ReadNetCDF("data/")

First a quick look at the data.

str(august_temp)
## Classes 'data.table' and 'data.frame':   226920 obs. of  4 variables:
##  $ longitude: int  0 3 6 9 12 15 18 21 24 27 ...
##  $ latitude : int  90 90 90 90 90 90 90 90 90 90 ...
##  $ t2m      : num  273 273 273 273 273 ...
##  $ time     : POSIXct, format: "1988-08-01" "1988-08-01" ...
##  - attr(*, ".internal.selfref")=

It’s a data frame with a value of t2m for each longitude, latitude and time. The temperature is in Kelvin. Let’s take a look at one field.

# world map
world <- list(geom_path(data = map_data("world2"), 
                        aes(long, lat, group = group), 
                        size = 0.2, color = "gray50"))

ggplot(august_temp[time == time[1]], aes(longitude, latitude)) +
   geom_contour_fill(aes(z = t2m - 273.15)) +
   world +
   scale_fill_divergent("2m temperature (°C)")

The tropics are warmer than the poles, as it should be.

After getting to know the data, I’ll compute the linear trend of temperature at each gridpoint. I’ll use a very crude method to assess the statistical significance of the trend.

trends <- august_temp[, FitLm(year = year(time), t2m, se = TRUE), 
                      by = .(longitude, latitude)] 
trends[, p.value :=  pt(abs(estimate)/std.error, df, lower.tail = FALSE)]

ggplot(trends[term == "year"], aes(longitude, latitude)) +
   geom_contour_fill(aes(z = estimate*10), 
                     breaks = AnchorBreaks(0, 0.25, exclude = 0)) +
   stat_subset(aes(subset = p.value <= 0.01), 
               geom = "point", size = 0.1, alpha = 0.5) +
   world +
   scale_fill_divergent("2m temperature \ntrend (°C/decade)") +
   metR:::theme_field() +
   labs(subtitle = "August mean temperature change 1988-2018", 
        caption = "Data: ERA Interim")

Not surprisingly, the trend is positive almost everywhere, although not everywhere statistically significant (using this very crude method). Of note, there hasn’t been much increase in august mean temperature where I live.

I will construct a (crude) global mean August temperature (“GMAT”) timeseries by averaging all gridpoints for each year (weighting by the cosine of latitude).

gmat <- august_temp[, .(t2m = weighted.mean(t2m, cos(latitude*pi/180))), 
                   by = year(time)]

ggplot(gmat, aes(year, t2m - 273.15)) +
   geom_line() +
   geom_smooth(method = "lm") +
   scale_y_continuous("Global august 2m temperature (°C)")

Again, not surprisingly, global temperature is going up. Let’s compute the rate of increase:

trend <- lm(t2m ~ I(year/10), data = gmat)
summary(trend)
## Call:
## lm(formula = t2m ~ I(year/10), data = gmat)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28121 -0.05954 -0.01535  0.06890  0.28129 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 255.78307    4.71318  54.270  < 2e-16 ***
## I(year/10)    0.16756    0.02353   7.121 7.77e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.1172 on 29 degrees of freedom
## Multiple R-squared:  0.6362, Adjusted R-squared:  0.6236 
## F-statistic: 50.71 on 1 and 29 DF,  p-value: 7.772e-08

Since I(year/10) increases by one each decade, its coefficient is the trend directly: a rate of about 0.17 °C/decade, which is consistent with estimates using better methods. 🔥

To leave a comment for the author, please follow the link and comment on their blog: Code R.

Continue Reading…


Read More

At Easter, beware chocolate-scoffing dogs

Vets report a spike in canine intoxication cases when the Easter Bunny visits

Continue Reading…


Read More

Thanks for reading!