My Data Science Blogs

May 25, 2019

China is surprisingly carbon-efficient—but still the world’s biggest emitter

It will take an unprecedented reduction in China’s emissions per head to stave off severe warming


May 24, 2019

What’s new on arXiv

T-EDGE: Temporal WEighted MultiDiGraph Embedding for Ethereum Transaction Network Analysis

Recently, graph embedding techniques have been widely used in the analysis of various networks, but most existing embedding methods omit the temporal and weighted information of edges, which may be informative in financial transaction networks. The open nature of Ethereum, a blockchain-based platform, gives us an unprecedented opportunity for data mining in this area. By taking the realistic rules and features of transaction networks into consideration, we propose to model the Ethereum transaction network as a Temporal Weighted Multidigraph (TWMDG) where each node is a unique Ethereum account and each edge represents a transaction weighted by amount and assigned a timestamp. In a TWMDG, we define the problem of Temporal Weighted Multidigraph Embedding (T-EDGE) by incorporating both temporal and weighted information of the edges, the purpose being to capture more comprehensive properties of dynamic transaction networks. To evaluate the effectiveness of the proposed embedding method, we conduct experiments on predictive tasks, including temporal link prediction and node classification, on real-world transaction data collected from Ethereum. Experimental results demonstrate that T-EDGE outperforms baseline embedding methods, indicating that time-dependent walks and the multiplicity of edges are informative and essential for time-sensitive transaction networks.
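
As a rough illustration of the data structure described in this abstract, the sketch below stores a few made-up transactions as a temporal weighted multidigraph and draws one time-respecting walk, choosing the next edge with probability proportional to its amount. The account names, amounts, timestamps, and the `build_twmdg` and `temporal_walk` helpers are hypothetical; the sampling rule is an assumption for illustration and may differ from the strategies used in the paper.

```python
import random
from collections import defaultdict

# Hypothetical transactions: (sender, receiver, amount, timestamp).
# Parallel edges between the same pair of accounts are allowed (multidigraph).
TRANSACTIONS = [
    ("a", "b", 5.0, 1),
    ("a", "b", 1.0, 4),
    ("b", "c", 2.0, 3),
    ("b", "c", 7.0, 6),
    ("c", "a", 3.0, 8),
]


def build_twmdg(transactions):
    """Adjacency list: account -> list of (receiver, amount, timestamp)."""
    graph = defaultdict(list)
    for sender, receiver, amount, timestamp in transactions:
        graph[sender].append((receiver, amount, timestamp))
    return graph


def temporal_walk(graph, start, max_len, rng):
    """Walk along edges whose timestamps never decrease.

    Among the admissible out-edges, the next one is sampled with
    probability proportional to its amount (an illustrative choice).
    """
    walk, node, now = [start], start, float("-inf")
    for _ in range(max_len):
        candidates = [e for e in graph.get(node, []) if e[2] >= now]
        if not candidates:
            break  # no later transaction leaves this account
        weights = [amount for _, amount, _ in candidates]
        node, _, now = rng.choices(candidates, weights=weights, k=1)[0]
        walk.append(node)
    return walk


graph = build_twmdg(TRANSACTIONS)
walk = temporal_walk(graph, "a", max_len=4, rng=random.Random(0))
print(walk)
```

Walks of this kind could then be fed to a skip-gram style embedding model, in the same way that static random-walk embeddings are usually trained.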

Power of the Few: Analyzing the Impact of Influential Users in Collaborative Recommender Systems

Like other social systems, in collaborative filtering a small number of ‘influential’ users may have a large impact on the recommendations of other users, thus affecting the overall behavior of the system. Identifying influential users and studying their impact on other users is an important problem because it provides insight into how small groups can inadvertently or intentionally affect the behavior of the system as a whole. Modeling these influences can also shed light on patterns and relationships that would otherwise be difficult to discern, hopefully leading to more transparency in how the system generates personalized content. In this work we first formalize the notion of ‘influence’ in collaborative filtering using an Influence Discrimination Model. We then empirically identify and characterize influential users and analyze their impact on the system under different underlying recommendation algorithms and across three different recommendation domains: job, movie and book recommendations. Insights from these experiments can help in designing systems that are not only optimized for accuracy, but are also tuned to mitigate the impact of influential users when it might lead to potential imbalance or unfairness in the system’s outcomes.

Clustered Multitask Nonnegative Matrix Factorization for Spectral Unmixing of Hyperspectral Data

In this paper, a new algorithm based on a clustered multitask network is proposed to solve the spectral unmixing problem in hyperspectral imagery. Each pixel in the hyperspectral image is considered a node in this network, and the nodes are clustered using the fuzzy c-means clustering method. A diffusion least mean square strategy is used to optimize the proposed cost function. To evaluate the proposed method, experiments are conducted on synthetic and real datasets. Simulation results based on spectral angle distance, abundance angle distance and reconstruction error metrics illustrate the advantage of the proposed algorithm compared with other methods.

Control Theory Meets POMDPs: A Hybrid Systems Approach

Partially observable Markov decision processes (POMDPs) provide a modeling framework for a variety of sequential decision making under uncertainty scenarios in artificial intelligence (AI). Since the states are not directly observable in a POMDP, decision making has to be performed based on the output of a Bayesian filter (continuous beliefs). Hence, POMDPs are often computationally intractable to solve exactly and researchers resort to approximate methods often using discretizations of the continuous belief space. These approximate solutions are, however, prone to discretization errors, which has made POMDPs ineffective in applications, wherein guarantees for safety, optimality, or performance are required. To overcome the complexity challenge of POMDPs, we apply notions from control theory. The goal is to determine the reachable belief space of a POMDP, that is, the set of all possible evolutions given an initial belief distribution over the states and a set of actions and observations. We begin by casting the problem of analyzing a POMDP into analyzing the behavior of a discrete-time switched system. For estimating the reachable belief space, we find over-approximations in terms of sub-level sets of Lyapunov functions. Furthermore, in order to verify safety and optimality requirements of a given POMDP, we formulate a barrier certificate theorem, wherein we show that if there exists a barrier certificate satisfying a set of inequalities along with the belief update equation of the POMDP, the safety and optimality properties are guaranteed to hold. In both cases, we show how the calculations can be decomposed into smaller problems that can be solved in parallel. The conditions we formulate can be computationally implemented as a set of sum-of-squares programs. We illustrate the applicability of our method by addressing two problems in active ad scheduling and machine teaching.
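
The belief update equation mentioned in this abstract is the standard Bayesian filter for a discrete POMDP. The sketch below is a minimal version of that update, assuming hypothetical transition and observation tables `T` and `O` for a toy two-state model; each fixed (action, observation) pair then plays the role of one mode of a discrete-time switched system acting on the belief.

```python
def belief_update(belief, action, observation, T, O):
    """One Bayesian filter step for a discrete POMDP.

    belief[s]        : current probability of state s
    T[action][s][s2] : P(s2 | s, action)
    O[action][s2][o] : P(o | s2, action)
    """
    n = len(belief)
    unnormalized = []
    for s2 in range(n):
        # Prediction step: push the belief through the transition model.
        predicted = sum(T[action][s][s2] * belief[s] for s in range(n))
        # Correction step: weight by the likelihood of the observation.
        unnormalized.append(O[action][s2][observation] * predicted)
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnormalized]


# Toy two-state model with a single action (index 0) and two observations.
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[[0.8, 0.2], [0.3, 0.7]]]
new_belief = belief_update([0.5, 0.5], action=0, observation=0, T=T, O=O)
print(new_belief)
```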

Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation

Convolutional neural networks have been widely deployed in various application scenarios. In order to extend the applications’ boundaries to some accuracy-crucial domains, researchers have been investigating approaches to boost accuracy through either deeper or wider network structures, which bring with them an exponential increase in computational and storage cost and delay the response time. In this paper, we propose a general training framework named self distillation, which notably enhances the performance (accuracy) of convolutional neural networks by shrinking the size of the network rather than enlarging it. Different from traditional knowledge distillation, a knowledge transformation methodology among networks which forces student neural networks to approximate the softmax layer outputs of pre-trained teacher neural networks, the proposed self distillation framework distills knowledge within the network itself. The networks are first divided into several sections. Then the knowledge in the deeper portion of the networks is squeezed into the shallow ones. Experiments further demonstrate the generality of the proposed self distillation framework: the average accuracy improvement is 2.65%, varying from a minimum of 0.61% in ResNeXt to a maximum of 4.07% in VGG19. In addition, it can also provide flexibility of depth-wise scalable inference on resource-limited edge devices. Our code will be released on GitHub soon.
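
To make the idea of squeezing deeper knowledge into shallower sections more concrete, here is a small sketch of one plausible loss for a shallow classifier attached to an early section: cross-entropy on the true label plus a KL term that pulls its softened output towards that of the deepest classifier. The abstract does not give the exact loss, so the function names, the `alpha` weighting, and the `temperature` value are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(shallow_logits, deep_logits, label,
                           alpha=0.5, temperature=3.0):
    """Hypothetical loss for one shallow classifier.

    hard: cross-entropy against the ground-truth label.
    soft: KL from the deepest classifier's softened output (the "teacher")
          to the shallow classifier's softened output (the "student").
    """
    hard = -math.log(softmax(shallow_logits)[label])
    soft = kl_divergence(softmax(deep_logits, temperature),
                         softmax(shallow_logits, temperature))
    return (1.0 - alpha) * hard + alpha * soft


shallow = [2.0, 0.5, -1.0]
loss_same = self_distillation_loss(shallow, shallow, label=0)
loss_diff = self_distillation_loss(shallow, [4.0, -1.0, -2.0], label=0)
print(loss_same, loss_diff)
```

When the shallow and deep outputs agree, the KL term vanishes and only the label term remains, which is why `loss_same` is the smaller of the two values.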

The Unexpected Unexpected and the Expected Unexpected: How People’s Conception of the Unexpected is Not That Unexpected

The answers people give when asked to ‘think of the unexpected’ for everyday event scenarios appear to be more expected than unexpected. There are expected unexpected outcomes that closely adhere to the given information in a scenario, based on familiar disruptions and common plan-failures. There are also unexpected unexpected outcomes that are more inventive, that depart from given information, adding new concepts/actions. However, people tend to conceive of the unexpected as the former more than the latter. Study 1 tests these proposals by analysing the object-concepts people mention in their reports of the unexpected and the agreement between their answers. Study 2 shows that object-choices are weakly influenced by recency, the order of sentences in the scenario. The implications of these results for ideas in philosophy, psychology and computing are discussed.

Graph Mining Meets Crowdsourcing: Extracting Experts for Answer Aggregation

Aggregating responses from crowd workers is a fundamental task in the process of crowdsourcing. In cases where a few experts are overwhelmed by a large number of non-experts, most answer aggregation algorithms such as the majority voting fail to identify the correct answers. Therefore, it is crucial to extract reliable experts from the crowd workers. In this study, we introduce the notion of ‘expert core’, which is a set of workers that is very unlikely to contain a non-expert. We design a graph-mining-based efficient algorithm that exactly computes the expert core. To answer the aggregation task, we propose two types of algorithms. The first one incorporates the expert core into existing answer aggregation algorithms such as the majority voting, whereas the second one utilizes information provided by the expert core extraction algorithm pertaining to the reliability of workers. We then give a theoretical justification for the first type of algorithm. Computational experiments using synthetic and real-world datasets demonstrate that our proposed answer aggregation algorithms outperform state-of-the-art algorithms.
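
The first type of algorithm in this abstract can be pictured with a toy example: plain majority voting fails when non-experts outnumber experts, while the same vote restricted to a reliable subset recovers the correct answers. In the sketch below the worker names, answers, and the `expert_core` set are made up, and the core is simply given rather than computed, since the extraction algorithm itself is not described here.

```python
from collections import Counter


def majority_vote(answers, workers=None):
    """answers maps worker -> {task: label}; workers optionally restricts who votes."""
    voters = answers.keys() if workers is None else workers
    tallies = {}
    for worker in voters:
        for task, label in answers[worker].items():
            tallies.setdefault(task, Counter())[label] += 1
    return {task: tally.most_common(1)[0][0] for task, tally in tallies.items()}


# Hypothetical crowd: the true answers are t1 = "A" and t2 = "B".
answers = {
    "e1": {"t1": "A", "t2": "B"},
    "e2": {"t1": "A", "t2": "B"},
    "n1": {"t1": "B", "t2": "A"},
    "n2": {"t1": "B", "t2": "A"},
    "n3": {"t1": "B", "t2": "B"},
}
expert_core = ["e1", "e2"]  # assumed to be the output of an extraction step

crowd_result = majority_vote(answers)
core_result = majority_vote(answers, workers=expert_core)
print(crowd_result, core_result)
```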

Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence

Learning agents that are not only capable of taking tests but also innovating is becoming the next hot topic in AI. One of the most promising paths towards this vision is multi-agent learning, where agents act as the environment for each other, and improving each agent means proposing new problems for the others. However, existing evaluation platforms are either not compatible with multi-agent settings, or limited to a specific game. That is, there is not yet a general evaluation platform for research on multi-agent intelligence. To this end, we introduce Arena, a general evaluation platform for multi-agent intelligence with \NumGames games of diverse logic and representations. Furthermore, multi-agent intelligence is still at the stage where many problems remain unexplored. Thus, we provide a building toolkit for researchers to invent and build novel multi-agent problems from the provided game set with little effort. Finally, we provide Python implementations of five state-of-the-art deep multi-agent reinforcement learning baselines. Along with the baseline implementations, we release a set of 100 best agents/teams that we can train with different training schemes for each game, as the base for evaluating agents with population performance, so that the research community can perform comparisons under a stable and uniform standard.

Alpha MAML: Adaptive Model-Agnostic Meta-Learning

Model-agnostic meta-learning (MAML) is a meta-learning technique to train a model on a multitude of learning tasks in a way that primes the model for few-shot learning of new tasks. The MAML algorithm performs well on few-shot learning problems in classification, regression, and fine-tuning of policy gradients in reinforcement learning, but comes with the need for costly hyperparameter tuning for training stability. We address this shortcoming by introducing an extension to MAML, called Alpha Model-agnostic meta-learning, to incorporate an online hyperparameter adaptation scheme that eliminates the need to tune meta-learning and learning rates. Our results with the Omniglot database demonstrate a substantial reduction in the need to tune MAML training hyperparameters and improvement to training stability with less sensitivity to hyperparameter choice.

A Fast and Scalable Implementation Method for Competing Risks Data with the R Package fastcmprsk

Advancements in medical informatics tools and high-throughput biological experimentation make large-scale biomedical data routinely accessible to researchers. Competing risks data are typical in biomedical studies where individuals are at risk to more than one cause (type of event) which can preclude the others from happening. The Fine-Gray model is a popular and well-appreciated model for competing risks data and is currently implemented in a number of statistical software packages. However, current implementations are not computationally scalable for large-scale competing risks data. We have developed an R package, fastcmprsk, that uses a novel forward-backward scan algorithm to significantly reduce the computational complexity for parameter estimation by exploiting the structure of the subject-specific risk sets. Numerical studies compare the speed and scalability of our implementation to current methods for unpenalized and penalized Fine-Gray regression and show impressive gains in computational efficiency.

AutoDispNet: Improving Disparity Estimation with AutoML

Much research work in computer vision is being spent on optimizing existing network architectures to obtain a few more percentage points on benchmarks. Recent AutoML approaches promise to relieve us from this effort. However, they are mainly designed for comparatively small-scale classification tasks. In this work, we show how to use and extend existing AutoML techniques to efficiently optimize large-scale U-Net-like encoder-decoder architectures. In particular, we leverage gradient-based neural architecture search and Bayesian optimization for hyperparameter search. The resulting optimization does not require a large company-scale compute cluster. We show results on disparity estimation that clearly outperform the manually optimized baseline and reach state-of-the-art performance.

Shortest Path Algorithms between Theory and Practice

Utilizing graph algorithms is a common activity in computer science. Algorithms that perform computations on large graphs are not always efficient. This work investigates the Single-Source Shortest Path (SSSP) problem, which is considered to be one of the most important and most studied graph problems. This thesis contains a review of the SSSP problem in both theory and practice. In addition, it discusses a new single-source shortest-path algorithm that achieves the same O(n \cdot m) time bound as the traditional Bellman-Ford-Moore algorithm but outperforms it and other state-of-the-art algorithms in practice. The work comprises three parts. The first discusses some basic shortest-path and negative-cycle-detection algorithms in the literature from the theoretical and practical points of view. The second contains a discussion of a new algorithm for the single-source shortest-path problem that outperforms most state-of-the-art algorithms for several well-known families of graphs. The main idea behind the proposed algorithm is to select the fewest most-effective vertices to scan. We also provide a discussion of correctness, termination, and the proof of the worst-case time bound of the proposed algorithm. This section also suggests two different implementations of the proposed algorithm: the first runs faster, while the second performs fewer operations. Finally, an extensive computational study of the different shortest-path algorithms is conducted. The results are presented using a new evaluation metric for shortest-path algorithms. A discussion of the outcomes, strengths, and weaknesses of the various shortest-path algorithms is also included in this work.
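
For reference, the classical baseline named in this abstract can be written in a few lines. The sketch below is a textbook Bellman-Ford pass with early termination and negative-cycle detection, which runs in O(n \cdot m) time in the worst case; it is not the new algorithm proposed in the thesis, and the example graphs are made up.

```python
def bellman_ford(num_vertices, edges, source):
    """Return shortest distances from source, or None if a negative cycle is reachable.

    edges is a list of (u, v, weight) triples for directed edges u -> v.
    """
    dist = [float("inf")] * num_vertices
    dist[source] = 0.0
    # At most n - 1 rounds of relaxing every edge.
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break  # distances are already final
    # One extra round: any further improvement implies a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            return None
    return dist


distances = bellman_ford(4, [(0, 1, 4), (0, 2, 5), (1, 2, -2), (2, 3, 3)], source=0)
cyclic = bellman_ford(2, [(0, 1, 1), (1, 0, -2)], source=0)
print(distances, cyclic)
```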

Graph-based Semi-Supervised & Active Learning for Edge Flows

We present a graph-based semi-supervised learning (SSL) method for learning edge flows defined on a graph. Specifically, given flow measurements on a subset of edges, we want to predict the flows on the remaining edges. To this end, we develop a computational framework that imposes certain constraints on the overall flows, such as (approximate) flow conservation. These constraints render our approach different from classical graph-based SSL for vertex labels, which posits that tightly connected nodes share similar labels and leverages the graph structure accordingly to extrapolate from a few vertex labels to the unlabeled vertices. We derive bounds for our method’s reconstruction error and demonstrate its strong performance on synthetic and real-world flow networks from transportation, physical infrastructure, and the Web. Furthermore, we provide two active learning algorithms for selecting informative edges on which to measure flow, which has applications for optimal sensor deployment. The first strategy selects edges to minimize the reconstruction error bound and works well on flows that are approximately divergence-free. The second approach clusters the graph and selects bottleneck edges that cross cluster-boundaries, which works well on flows with global trends.
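
The flow conservation constraint can be illustrated on a tiny graph. The sketch below fixes the measured flows and fills in the unmeasured ones by plain gradient descent on the sum of squared node divergences; this is a simplified stand-in for the framework in the abstract (it omits any regularization), and the function names, step size, and three-node cycle are assumptions.

```python
def divergence(num_nodes, edges, flows):
    """Net inflow minus outflow at every node."""
    div = [0.0] * num_nodes
    for (u, v), f in zip(edges, flows):
        div[u] -= f
        div[v] += f
    return div


def fill_flows(num_nodes, edges, known, steps=200, lr=0.1):
    """Gradient descent on the sum of squared divergences.

    known maps an edge index to its measured flow; those entries stay fixed.
    """
    flows = [known.get(i, 0.0) for i in range(len(edges))]
    for _ in range(steps):
        div = divergence(num_nodes, edges, flows)
        for i, (u, v) in enumerate(edges):
            if i not in known:
                # d/df_i of sum_w div[w]^2 is 2 * (div[v] - div[u]).
                flows[i] -= lr * 2.0 * (div[v] - div[u])
    return flows


# Directed 3-cycle with a single measured edge carrying 2 units of flow.
edges = [(0, 1), (1, 2), (2, 0)]
flows = fill_flows(3, edges, known={0: 2.0})
print(flows)
```

On a cycle the only conserving completion sends the same flow around every edge, so the two unmeasured edges converge to the measured value.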

Neural Metric Learning for Fast End-to-End Relation Extraction

Relation extraction (RE) is an indispensable information extraction task in several disciplines. RE models typically assume that named entity recognition (NER) is already performed in a previous step by another independent model. Several recent efforts, under the theme of end-to-end RE, seek to exploit inter-task correlations by modeling both NER and RE tasks jointly. Earlier work in this area commonly reduces the task to a table-filling problem wherein an additional expensive decoding step involving beam search is applied to obtain globally consistent cell labels. In efforts that do not employ table-filling, global optimization in the form of CRFs with Viterbi decoding for the NER component is still necessary for competitive performance. We introduce a novel neural architecture utilizing the table structure, based on repeated applications of 2D convolutions for pooling local dependency and metric-based features, without the need for global optimization. We validate our model on the ADE and CoNLL04 datasets for end-to-end RE and demonstrate a gain of approximately 1% (in F-score) over prior best results, with training and testing times that are nearly four times faster, the latter being highly advantageous for time-sensitive end user applications.

Multilinear Compressive Learning

Compressive Learning is an emerging topic that combines signal acquisition via compressive sensing and machine learning to perform inference tasks directly on a small number of measurements. Many data modalities naturally have a multi-dimensional or tensorial format, with each dimension or tensor mode representing different features such as the spatial and temporal information in video sequences or the spatial and spectral information in hyperspectral images. However, in existing compressive learning frameworks, the compressive sensing component utilizes either random or learned linear projection on the vectorized signal to perform signal acquisition, thus discarding the multi-dimensional structure of the signals. In this paper, we propose Multilinear Compressive Learning, a framework that takes into account the tensorial nature of multi-dimensional signals in the acquisition step and builds the subsequent inference model on the structurally sensed measurements. Our theoretical complexity analysis shows that the proposed framework is more efficient compared to its vector-based counterpart in both memory and computation requirement. With extensive experiments, we also empirically show that our Multilinear Compressive Learning framework outperforms the vector-based framework in object classification and face recognition tasks, and scales favorably when the dimensionalities of the original signals increase, making it highly efficient for high-dimensional multi-dimensional signals.
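
A quick back-of-the-envelope count suggests where the memory saving could come from. The sketch below compares the number of entries in one dense projection matrix applied to the vectorized signal with the number of entries in one small projection matrix per tensor mode. The signal and measurement shapes are made up, and the assumption that the multilinear sensing operator is a set of separable mode-wise projections is an illustration, not a statement of the paper's exact operator.

```python
from math import prod


def vectorized_params(signal_shape, measurement_shape):
    """Entries of a single dense matrix mapping vec(signal) to vec(measurements)."""
    return prod(signal_shape) * prod(measurement_shape)


def multilinear_params(signal_shape, measurement_shape):
    """Entries of one (m_k x I_k) projection matrix per tensor mode."""
    return sum(i * m for i, m in zip(signal_shape, measurement_shape))


signal_shape = (32, 32, 3)      # hypothetical small colour image
measurement_shape = (8, 8, 2)   # hypothetical compressed measurements

dense = vectorized_params(signal_shape, measurement_shape)
separable = multilinear_params(signal_shape, measurement_shape)
print(dense, separable)
```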

Sequential training algorithm for neural networks

A sequential training method for large-scale feedforward neural networks is presented. Each layer of the neural network is decoupled and trained separately. After training is completed for each layer, the layers are combined. The performance of the resulting network would be sub-optimal compared to full network training if the optimal solution of the latter could be achieved. However, achieving the optimal solution for the full network may be infeasible or require long computing time. The proposed sequential approach reduces the required computing resources significantly and should converge better, as only a single layer is optimised at each optimisation step. The required modifications of existing algorithms to implement the sequential training are minimal. The performance is verified by a simple example.

Enforcing constraints for time series prediction in supervised, unsupervised and reinforcement learning

We assume that we are given a time series of data from a dynamical system and our task is to learn the flow map of the dynamical system. We present a collection of results on how to enforce constraints coming from the dynamical system in order to accelerate the training of deep neural networks to represent the flow map of the system as well as increase their predictive ability. In particular, we provide ways to enforce constraints during training for all three major modes of learning, namely supervised, unsupervised and reinforcement learning. In general, the dynamic constraints need to include terms which are analogous to memory terms in model reduction formalisms. Such memory terms act as a restoring force which corrects the errors committed by the learned flow map during prediction. For supervised learning, the constraints are added to the objective function. For the case of unsupervised learning, in particular generative adversarial networks, the constraints are introduced by augmenting the input of the discriminator. Finally, for the case of reinforcement learning and in particular actor-critic methods, the constraints are added to the reward function. In addition, for the reinforcement learning case, we present a novel approach based on homotopy of the action-value function in order to stabilize and accelerate training. We use numerical results for the Lorenz system to illustrate the various constructions.

Story Ending Prediction by Transferable BERT

Recent advances, such as GPT and BERT, have shown success in incorporating a pre-trained transformer language model and fine-tuning operation to improve downstream NLP systems. However, this framework still has some fundamental problems in effectively incorporating supervised knowledge from other related tasks. In this study, we investigate a transferable BERT (TransBERT) training framework, which can transfer not only general language knowledge from large-scale unlabeled data but also specific kinds of knowledge from various semantically related supervised tasks, for a target task. Particularly, we propose utilizing three kinds of transfer tasks, including natural language inference, sentiment classification, and next action prediction, to further train BERT based on a pre-trained model. This enables the model to get a better initialization for the target task. We take story ending prediction as the target task to conduct experiments. The final result, an accuracy of 91.8%, dramatically outperforms previous state-of-the-art baseline methods. Several comparative experiments give some helpful suggestions on how to select transfer tasks. Error analysis shows the strengths and weaknesses of BERT-based models for story ending prediction.

Cross-referencing using Fine-grained Topic Modeling

Cross-referencing, which links passages of text to other related passages, can be a valuable study aid for facilitating comprehension of a text. However, cross-referencing requires first, a comprehensive thematic knowledge of the entire corpus, and second, a focused search through the corpus specifically to find such useful connections. Due to this, cross-reference resources are prohibitively expensive and exist only for the most well-studied texts (e.g. religious texts). We develop a topic-based system for automatically producing candidate cross-references which can be easily verified by human annotators. Our system utilizes fine-grained topic modeling with thousands of highly nuanced and specific topics to identify verse pairs which are topically related. We demonstrate that our system can be cost effective compared to having annotators acquire the expertise necessary to produce cross-reference resources unaided.

Multinomial Distribution Learning for Effective Neural Architecture Search

Architectures obtained by Neural Architecture Search (NAS) have achieved highly competitive performance in various computer vision tasks. However, the prohibitive computation demand of forward-backward propagation in deep neural networks and searching algorithms makes it difficult to apply NAS in practice. In this paper, we propose a Multinomial Distribution Learning for extremely effective NAS, which considers the search space as a joint multinomial distribution, i.e., the operation between two nodes is sampled from this distribution, and the optimal network structure is obtained by the operations with the most likely probability in this distribution. Therefore, NAS can be transformed to a multinomial distribution learning problem, i.e., the distribution is optimized to have high expectation of the performance. Besides, a hypothesis that the performance ranking is consistent in every training epoch is proposed and demonstrated to further accelerate the learning process. Experiments on CIFAR-10 and ImageNet demonstrate the effectiveness of our method. On CIFAR-10, the structure searched by our method achieves 2.4% test error, while being 6.0× (only 4 GPU hours on GTX1080Ti) faster compared with state-of-the-art NAS algorithms. On ImageNet, our model achieves 75.2% top-1 accuracy under MobileNet settings (MobileNet V1/V2), while being 1.2× faster with measured GPU latency. Test code is available at https://…/MDENAS

Factor Models for High-Dimensional Tensor Time Series

Large tensor (multi-dimensional array) data are now routinely collected in a wide range of applications, due to modern data collection capabilities. Often such observations are taken over time, forming tensor time series. In this paper we present a factor model approach for analyzing high-dimensional dynamic tensor time series and multi-category dynamic transport networks. Two estimation procedures along with their theoretical properties and simulation results are presented. Two applications are used to illustrate the model and its interpretations.

Massively Parallel Computation via Remote Memory Access

We introduce the Adaptive Massively Parallel Computation (AMPC) model, which is an extension of the Massively Parallel Computation (MPC) model. At a high level, the AMPC model strengthens the MPC model by storing all messages sent within a round in a distributed data store. In the following round, all machines are provided with random read access to the data store, subject to the same constraints on the total amount of communication as in the MPC model. Our model is inspired by the previous empirical studies of distributed graph algorithms using MapReduce and a distributed hash table service. This extension allows us to give new graph algorithms with much lower round complexities compared to the best known solutions in the MPC model. In particular, in the AMPC model we show how to solve maximal independent set in O(1) rounds and connectivity/minimum spanning tree in O(\log\log_{m/n} n) rounds both using O(n^\delta) space per machine for constant \delta < 1. In the same memory regime for MPC, the best known algorithms for these problems require polylog n rounds. Our results imply that the 2-Cycle conjecture, which is widely believed to hold in the MPC model, does not hold in the AMPC model.

Which Tasks Should Be Learned Together in Multi-task Learning?

Many computer vision applications require solving multiple tasks in real-time. A neural network can be trained to solve multiple tasks simultaneously using ‘multi-task learning’. This saves computation at inference time as only a single network needs to be evaluated. Unfortunately, this often leads to inferior overall performance as task objectives compete, which consequently poses the question: which tasks should and should not be learned together in one network when employing multi-task learning? We systematically study task cooperation and competition and propose a framework for assigning tasks to a few neural networks such that cooperating tasks are computed by the same neural network, while competing tasks are computed by different networks. Our framework offers a time-accuracy trade-off and can produce better accuracy using less inference time than not only a single large multi-task neural network but also many single-task networks.

Gradient tree boosting with random output projections for multi-label classification and multi-output regression

In many applications of supervised learning, multiple classification or regression outputs have to be predicted jointly. We consider several extensions of gradient boosting to address such problems. We first propose a straightforward adaptation of gradient boosting exploiting multiple output regression trees as base learners. We then argue that this method is only expected to be optimal when the outputs are fully correlated, as it forces the partitioning induced by the tree base learners to be shared by all outputs. We then propose a novel extension of gradient tree boosting to specifically address this issue. At each iteration of this new method, a regression tree structure is grown to fit a single random projection of the current residuals and the predictions of this tree are fitted linearly to the current residuals of all the outputs, independently. Because of this linear fit, the method can adapt automatically to any output correlation structure. Extensive experiments are conducted with this method, as well as other algorithmic variants, on several artificial and real problems. Randomly projecting the output space is shown to provide a better adaptation to different output correlation patterns and is therefore competitive with the best of the other methods in most settings. Thanks to model sharing, the convergence speed is also improved, reducing the computing times (or the complexity of the model) to reach a specific accuracy.

RaFM: Rank-Aware Factorization Machines

Factorization machines (FM) are a popular model class to learn pairwise interactions by a low-rank approximation. Different from existing FM-based approaches which use a fixed rank for all features, this paper proposes a Rank-Aware Factorization machine (RaFM) model which adopts pairwise interactions from embeddings with different ranks. The proposed model achieves a better performance on real-world datasets where different features have significantly varying frequencies of occurrences. Moreover, we prove that the RaFM model can be stored, evaluated, and trained as efficiently as one single FM, and under some reasonable conditions it can be even significantly more efficient than FM. RaFM improves the performance of FMs in both regression tasks and classification tasks while incurring less computational burden, therefore also has attractive potential in industrial applications.
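
For readers unfamiliar with the base model, the sketch below evaluates a standard second-order factorization machine in two ways: by summing over all feature pairs, and by the well-known linear-time reformulation of the pairwise term. It uses a single fixed rank for every feature, so it shows plain FM rather than the rank-aware RaFM variant; the weights and the input vector are made up.

```python
def fm_predict_naive(x, w0, w, V):
    """y = w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j, computed pair by pair."""
    n = len(x)
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            y += dot * x[i] * x[j]
    return y


def fm_predict_fast(x, w0, w, V):
    """Same prediction in O(n * k) time.

    Uses sum_{i<j} <V_i, V_j> x_i x_j
       = 0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i (V_if x_i)^2].
    """
    n, k = len(x), len(V[0])
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s2 = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        y += 0.5 * (s * s - s2)
    return y


# Hypothetical parameters: 3 features, rank-2 embeddings.
x = [1.0, 0.0, 2.0]
w0, w = 0.1, [0.2, -0.3, 0.4]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.5]]

naive = fm_predict_naive(x, w0, w, V)
fast = fm_predict_fast(x, w0, w, V)
print(naive, fast)
```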

The Curious Case of Machine Learning In Malware Detection

In this paper, we argue that machine learning techniques are not ready for malware detection in the wild. Given the current trend in malware development and the increase of unconventional malware attacks, we expect that dynamic malware analysis is the future for antimalware detection and prevention systems. A comprehensive review of machine learning for malware detection is presented. Then, we discuss how malware detection in the wild presents unique challenges for the current state-of-the-art machine learning techniques. We define three critical problems that limit the success of malware detectors powered by machine learning in the wild. Next, we discuss possible solutions to these challenges and present the requirements of next-generation malware detection. Finally, we outline potential research directions in machine learning for malware detection.

Microblog Hashtag Generation via Encoding Conversation Contexts

Automatic hashtag annotation plays an important role in content understanding for microblog posts. To date, progress made in this field has been restricted to phrase selection from limited candidates, or word-level hashtag discovery using topic models. Different from previous work considering hashtags to be inseparable, our work is the first effort to annotate hashtags with a novel sequence generation framework via viewing the hashtag as a short sequence of words. Moreover, to address the data sparsity issue in processing short microblog posts, we propose to jointly model the target posts and the conversation contexts initiated by them with bidirectional attention. Extensive experimental results on two large-scale datasets, newly collected from English Twitter and Chinese Weibo, show that our model significantly outperforms state-of-the-art models based on classification. Further studies demonstrate our ability to effectively generate rare and even unseen hashtags, which is however not possible for most existing methods.

Semantic flow in language networks

In this study we propose a framework to characterize documents based on their semantic flow. The proposed framework encompasses a network-based model that connects sentences based on their semantic similarity. Semantic fields are detected using standard community detection methods. As the story unfolds, transitions between semantic fields are represented in Markov networks, which in turn are characterized via network motifs (subgraphs). Here we show that the proposed framework can be used to classify books according to their style and publication dates. Remarkably, even without a systematic optimization of parameters, philosophy and investigative books were discriminated with an accuracy rate of 92.5%. Because this model captures semantic features of texts, it could be used as an additional feature in traditional network-based models of texts that capture only syntactical/stylistic information, as is the case with word adjacency (co-occurrence) networks.

Evolving Rewards to Automate Reinforcement Learning

Many continuous control tasks have easily formulated objectives, yet using them directly as a reward in reinforcement learning (RL) leads to suboptimal policies. Therefore, many classical control tasks guide RL training using complex rewards, which require tedious hand-tuning. We automate the reward search with AutoRL, an evolutionary layer over standard RL that treats reward tuning as hyperparameter optimization and trains a population of RL agents to find a reward that maximizes the task objective. AutoRL, evaluated on four Mujoco continuous control tasks over two RL algorithms, shows improvements over baselines, with the biggest uplift for more complex tasks.

A Case Study: Exploiting Neural Machine Translation to Translate CUDA to OpenCL

The sequence-to-sequence (seq2seq) model for neural machine translation has significantly improved the accuracy of language translation. There have been new efforts to use this seq2seq model for program language translation or program comparisons. In this work, we present the detailed steps of using a seq2seq model to translate CUDA programs to OpenCL programs, which both have very similar programming styles. Our work shows (i) a training input set generation method, (ii) pre/post processing, and (iii) a case study using Polybench-gpu-1.0, NVIDIA SDK, and Rodinia benchmarks.

Quantifying Robotic Swarm Coverage

In the field of swarm robotics, the design and implementation of spatial density control laws has received much attention, with less emphasis being placed on performance evaluation. This work fills that gap by introducing an error metric that provides a quantitative measure of coverage for use with any control scheme. The proposed error metric is continuously sensitive to changes in the swarm distribution, unlike commonly used discretization methods. We analyze the theoretical and computational properties of the error metric and propose two benchmarks to which error metric values can be compared. The first uses the realizable extrema of the error metric to compute the relative error of an observed swarm distribution. We also show that the error metric extrema can be used to help choose the swarm size and effective radius of each robot required to achieve a desired level of coverage. The second benchmark compares the observed distribution of error metric values to the probability density function of the error metric when robot positions are randomly sampled from the target distribution. We demonstrate the utility of this benchmark in assessing the performance of stochastic control algorithms. We prove that the error metric obeys a central limit theorem, develop a streamlined method for performing computations, and place the standard statistical tests used here on a firm theoretical footing. We provide rigorous theoretical development, computational methodologies, numerical examples, and MATLAB code for both benchmarks.

On Selecting Stable Predictors in Time Series Models

We extend the feature selection methodology to dependent data and propose a novel time series predictor selection scheme that accommodates statistical dependence in a more typical i.i.d. sub-sampling based framework. Furthermore, the machinery of mixing stationary processes allows us to quantify the improvements of our approach over any base predictor selection method (such as lasso) even in a finite sample setting. Using the lasso as a base procedure, we demonstrate the applicability of our methods to simulated and several real time series datasets.

Regions In a Linked Dataset For Change Detection

Linked Datasets (LDs) are constantly evolving and the applications using a Linked Dataset (LD) may face several issues such as outdated data or broken interlinks due to evolution of the dataset. To overcome these issues, the detection of changes in LDs during their evolution has proven crucial. As LDs evolve frequently, the change detection during the evolution should also be done at frequent intervals. However, due to limitation of available computational resources such as capacity to fetch data from LD and time to detect changes, the frequent change detection may not be possible with existing change detection techniques. This research proposes to explore the notion of prioritization of regions (subsets) in LDs for change detection with the aim of achieving optimal accuracy and efficient use of available computational resources. This will facilitate the detection of changes in an evolving LD at frequent intervals and will allow the applications to update their data closest to real-time data.

Things You May Not Know About Adversarial Example: A Black-box Adversarial Image Attack

Numerous methods for crafting adversarial examples were proposed recently with high attack success rates. Most of the existing works first normalize images into a continuous vector domain, and then craft adversarial examples in the continuous vector space. However, ‘adversarial’ examples may become benign after de-normalizing them back into the discrete integer domain, known as the discretization problem. The discretization problem was mentioned in some work, but was largely disregarded and has received relatively little attention. In this work, we conduct the first comprehensive study of this discretization problem. We theoretically analyzed 34 representative methods and empirically studied 20 representative open source tools for crafting discretization images. Our findings reveal that almost all existing works suffer from the discretization problem and the problem is far more serious than we thought. This suggests that the discretization problem should be taken into account when crafting adversarial examples. As a first step towards addressing the discretization problem, we propose a black-box attack method to encode the adversarial example searching problem as a derivative-free optimization problem. Our method is able to craft ‘real’ adversarial images by derivative-free search on the discrete integer domain. Experimental results show that our method achieves significantly higher attack success rates on the discrete integer domain than most of the other tools, whether white-box or black-box. Moreover, our method is able to handle any model that is not differentiable and we successfully break the winner of NIPS 17 competition on defense with a 95% success rate.

Learning to Memorize in Neural Task-Oriented Dialogue Systems

In this thesis, we leverage the neural copy mechanism and memory-augmented neural networks (MANNs) to address existing challenges in neural task-oriented dialogue learning. We show the effectiveness of our strategy by achieving good performance in multi-domain dialogue state tracking, retrieval-based dialogue systems, and generation-based dialogue systems. We first propose a transferable dialogue state generator (TRADE) that leverages its copy mechanism to get rid of dialogue ontology and share knowledge between domains. We also evaluate unseen domain dialogue state tracking and show that TRADE enables zero-shot dialogue state tracking and can adapt to new few-shot domains without forgetting the previous domains. Second, we utilize MANNs to improve retrieval-based dialogue learning. They are able to capture dialogue sequential dependencies and memorize long-term information. We also propose a recorded delexicalization copy strategy to replace real entity values with ordered entity types. Our models are shown to surpass other retrieval baselines, especially when the conversation has a large number of turns. Lastly, we tackle generation-based dialogue learning with two proposed models, the memory-to-sequence (Mem2Seq) and global-to-local memory pointer network (GLMP). Mem2Seq is the first model to combine multi-hop memory attention with the idea of the copy mechanism. GLMP further introduces the concept of response sketching and double pointers copying. We show that GLMP achieves the state-of-the-art performance on human evaluation.

DivGraphPointer: A Graph Pointer Network for Extracting Diverse Keyphrases

Keyphrase extraction from documents is useful to a variety of applications such as information retrieval and document summarization. This paper presents an end-to-end method called DivGraphPointer for extracting a set of diversified keyphrases from a document. DivGraphPointer combines the advantages of traditional graph-based ranking methods and recent neural network-based approaches. Specifically, given a document, a word graph is constructed from the document based on word proximity and is encoded with graph convolutional networks, which effectively capture document-level word salience by modeling long-range dependency between words in the document and aggregating multiple appearances of identical words into one node. Furthermore, we propose a diversified point network to generate a set of diverse keyphrases out of the word graph in the decoding process. Experimental results on five benchmark data sets show that our proposed method significantly outperforms the existing state-of-the-art approaches.

Structured Summarization of Academic Publications

We propose SUSIE, a novel summarization method that can work with state-of-the-art summarization models in order to produce structured scientific summaries for academic articles. We also created PMC-SA, a new dataset of academic publications, suitable for the task of structured summarization with neural networks. We apply SUSIE combined with three different summarization models on the new PMC-SA dataset and we show that the proposed method improves the performance of all models by as much as 4 ROUGE points.

Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations

Post-hoc explanations of machine learning models are crucial for people to understand and act on algorithmic predictions. An intriguing class of explanations is through counterfactuals, hypothetical examples that show people how to obtain a different prediction. We posit that effective counterfactual explanations should satisfy two properties: feasibility of the counterfactual actions given user context and constraints, and diversity among the counterfactuals presented. To this end, we propose a framework for generating and evaluating a diverse set of counterfactual explanations based on average distance and determinantal point processes. To evaluate the actionability of counterfactuals, we provide metrics that enable comparison of counterfactual-based methods to other local explanation methods. We further address necessary tradeoffs and point to causal implications in optimizing for counterfactuals. Our experiments on three real-world datasets show that our framework can generate a set of counterfactuals that are diverse and well approximate local decision boundaries.

An Objective Evaluation Metric for image fusion based on Del Operator

In this paper, a novel objective evaluation metric for image fusion is presented. Notable advantages of the proposed metric are that it has no parameters, its result is a probability in the range [0, 1], and it is free from illumination dependence. This metric is easy to implement and the result is computed in four steps: (1) Smoothing the images using a Gaussian filter. (2) Transforming images to a vector field using the Del operator. (3) Computing the normal distribution function (μ, σ) for each corresponding pixel, and converting to the standard normal distribution function. (4) Computing the probability of being a well-behaved fusion method as the result. To judge the quality of the proposed metric, it is compared to thirteen well-known non-reference objective evaluation metrics, where eight fusion methods are employed on seven experiments of multimodal medical images. The experimental results and statistical comparisons show that, in contrast to previous objective evaluation metrics, the proposed one performs better in terms of both agreeing with human visual perception and evaluating fusion methods that are not performed at the same level.

Earlier Attention? Aspect-Aware LSTM for Aspect Sentiment Analysis

Aspect-based sentiment analysis (ABSA) aims to predict fine-grained sentiments of comments with respect to given aspect terms or categories. In previous ABSA methods, the importance of aspect has been realized and verified. Most existing LSTM-based models take aspect into account via the attention mechanism, where the attention weights are calculated after the context is modeled in the form of contextual vectors. However, aspect-related information may be already discarded and aspect-irrelevant information may be retained in classic LSTM cells in the context modeling process, which can be improved to generate more effective context representations. This paper proposes a novel variant of LSTM, termed as aspect-aware LSTM (AA-LSTM), which incorporates aspect information into LSTM cells in the context modeling stage before the attention mechanism. Therefore, our AA-LSTM can dynamically produce aspect-aware contextual representations. We experiment with several representative LSTM-based models by replacing the classic LSTM cells with the AA-LSTM cells. Experimental results on SemEval-2014 Datasets demonstrate the effectiveness of AA-LSTM.

Butterfly: Robust One-step Approach towards Wildly-unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) trains with clean labeled data in source domain and unlabeled data in target domain to classify target-domain data. However, in real-world scenarios, it is hard to acquire fully-clean labeled data in source domain due to the expensive labeling cost. This brings us a new but practical adaptation called wildly-unsupervised domain adaptation (WUDA), which aims to transfer knowledge from noisy labeled data in source domain to unlabeled data in target domain. To tackle the WUDA, we present a robust one-step approach called Butterfly, which trains four networks. Specifically, two networks are jointly trained on noisy labeled data in source domain and pseudo-labeled data in target domain (i.e., data in mixture domain). Meanwhile, the other two networks are trained on pseudo-labeled data in target domain. By using dual-checking principle, Butterfly can obtain high-quality target-specific representations. We conduct experiments to demonstrate that Butterfly significantly outperforms other baselines on simulated and real-world WUDA tasks in most cases.

Reinforcement Learning for Learning of Dynamical Systems in Uncertain Environment: a Tutorial

In this paper, a review of model-free reinforcement learning for learning of dynamical systems in uncertain environments is discussed. For this purpose, the Markov Decision Process (MDP) is reviewed. Furthermore, some learning algorithms such as Temporal Difference (TD) learning, Q-Learning, and Approximate Q-learning, the model-free algorithms which constitute the main part of this article, are investigated, and the benefits and drawbacks of each algorithm are discussed. The concepts discussed in each section are explained with details and examples.
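As a rough illustration of the Q-Learning algorithm named in the abstract above, here is a minimal tabular sketch. The toy chain environment, the function name `q_learning`, and all hyperparameter values are assumptions made for this example and do not come from the tutorial itself.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain: start at state 0, goal at the right end."""
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0 moves left, action 1 moves right
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy action selection (ties go to the right).
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Off-policy TD update: bootstrap from the best next action.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```

Because the update bootstraps from `max(q[next_state])` rather than from the action actually taken, the learned values describe the greedy policy even though the agent explores with probability `epsilon`.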

Distilled News

Information-driven bars for financial machine learning: imbalance bars

In previous articles we talked about tick bars, volume bars and dollar bars, alternative types of bars which allow market activity-dependent sampling based on the number of ticks, volume or dollar value exchanged. Additionally, we saw how these bars display better statistical properties such as lower serial correlation when compared to traditional time-based bars. In this article we will talk about information-driven bars and specifically about imbalance bars. These bars aim to extract information encoded in the observed sequence of trades and notify us of a change in the imbalance of trades. The early detection of an imbalance change will allow us to anticipate a potential change of trend before reaching a new equilibrium.
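To make the idea of imbalance bars concrete, here is a simplified sketch of tick imbalance bars. It assumes a fixed threshold for closing a bar, whereas the full method estimates the expected imbalance dynamically; the function names and the assumed sign of the first trade are illustrative choices, not taken from the article.

```python
def tick_rule(prices):
    """Sign each trade: +1 on an uptick, -1 on a downtick, else repeat the last sign."""
    signs = []
    prev_sign = 1  # assumed sign for the very first trade
    for i, price in enumerate(prices):
        if i > 0:
            diff = price - prices[i - 1]
            if diff > 0:
                prev_sign = 1
            elif diff < 0:
                prev_sign = -1
        signs.append(prev_sign)
    return signs


def tick_imbalance_bars(prices, threshold):
    """Return the trade indices at which a bar closes.

    A bar closes once the absolute cumulative signed imbalance reaches the
    (fixed, assumed) threshold; the imbalance then resets to zero.
    """
    bar_ends = []
    imbalance = 0
    for i, sign in enumerate(tick_rule(prices)):
        imbalance += sign
        if abs(imbalance) >= threshold:
            bar_ends.append(i)
            imbalance = 0
    return bar_ends
```

With prices `[10, 11, 12, 13, 12, 13, 12, 11, 10, 9]` and a threshold of 3, bars close at indices 2 and 9: the first after a run of buys, the second after a run of sells, while the choppy middle section produces no bar.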

recorder: Validate Predictors in New Data

recorder 0.8.1 is now available on CRAN. recorder is a lightweight toolkit to validate new observations before computing their corresponding predictions with a predictive model.
With recorder the validation process consists of two steps:
• record relevant statistics and meta data of the variables in the original training data for the predictive model
• use these data to run a set of basic validation tests on the new set of observations.
Now we will take a deeper look into what recorder has to offer.
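The two-step process above can be sketched generically as follows. This is not the recorder package's API (recorder is an R package); it is a hypothetical Python illustration of the record-then-validate idea, restricted to numeric variables, with made-up function names and checks.

```python
def record(training_rows):
    """Step 1: record the type and observed range of each variable in the training data."""
    stats = {}
    for row in training_rows:
        for name, value in row.items():
            entry = stats.setdefault(
                name, {"type": type(value), "min": value, "max": value}
            )
            entry["min"] = min(entry["min"], value)
            entry["max"] = max(entry["max"], value)
    return stats


def validate(stats, new_row):
    """Step 2: run basic checks on one new observation; return the failed checks."""
    failures = []
    for name, entry in stats.items():
        if name not in new_row:
            failures.append(f"{name}: missing")
            continue
        value = new_row[name]
        if not isinstance(value, entry["type"]):
            failures.append(f"{name}: unexpected type")
        elif not entry["min"] <= value <= entry["max"]:
            failures.append(f"{name}: outside training range")
    for name in new_row:
        if name not in stats:
            failures.append(f"{name}: not seen in training")
    return failures
```

An empty list from `validate` means the observation passed every recorded check, so it can be handed to the predictive model; a non-empty list names the variables that should be inspected first.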

Becoming a machine learning company means investing in foundational technologies

In this post, I share slides and notes from a keynote I gave at the Strata Data Conference in London earlier this year. I will highlight the results of a recent survey on machine learning adoption, and along the way describe recent trends in data and machine learning (ML) within companies. This is a good time to assess enterprise activities, as there are many indications a number of companies are already beginning to use machine learning. For example, in a July 2018 survey that drew more than 11,000 respondents, we found strong engagement among companies: 51% stated they already had machine learning models in production.

Microsoft wants to apply AI ‘to the entire application developer lifecycle’

At its Build 2018 developer conference a year ago, Microsoft previewed Visual Studio IntelliCode, which uses AI to offer intelligent suggestions that improve code quality and productivity. In April, Microsoft launched Visual Studio 2019 for Windows and Mac. At that point, IntelliCode was still an optional extension that Microsoft was openly offering as a preview. But at Build 2019 earlier this month, Microsoft shared that IntelliCode’s capabilities are now generally available for C# and XAML in Visual Studio 2019 and for Java, JavaScript, TypeScript, and Python in Visual Studio Code. Microsoft also now includes IntelliCode by default in Visual Studio 2019.

The human problem of AI

When it comes to most things business, AI is making its mark as the must-have technology. Whether we are talking about customer-facing chatbots to help with engagement and conversion or AI working in the background to help make critical business decisions, AI is everywhere. And the expectations of what it can and should be able to do are often sky-high. When those expectations aren’t met, however, it’s not always the tech that’s to blame. More likely, it’s the humans who brought it on board. Here are some of the most common human errors when it comes to implementing AI.
Mistake #1: Confusing automation with AI
Mistake #2: Not determining success factors
Mistake #3: Not getting organizational buy-in
Mistake #4: Not considering the impact on the entire customer journey
Mistake #5: Not understanding the cause of the problems you’re trying to solve

Computational Socioeconomics

Uncovering the structure of socioeconomic systems and timely estimation of socioeconomic status are significant for economic development. The understanding of socioeconomic processes provides foundations to quantify global economic development, to map regional industrial structure, and to infer individual socioeconomic status. In this review, we will make a brief manifesto about a new interdisciplinary research field named Computational Socioeconomics, followed by a detailed introduction to data resources, computational tools, data-driven methods, theoretical models and novel applications at multiple resolutions, including the quantification of global economic inequality and complexity, the mapping of regional industrial structure and urban perception, the estimation of individual socioeconomic status and demographics, and the real-time monitoring of emergent events. This review, together with pioneering works we have highlighted, will draw increasing interdisciplinary attention and induce a methodological shift in future socioeconomic studies.

Practical Strategies to Handle Missing Values

One of the major challenges in most data science projects is to figure out a way to get clean data. 60 to 80 percent of the total time is spent on cleaning the data before you can make any meaningful sense of it. This is true for both BI and Predictive Analytics projects. To improve the effectiveness of the data cleaning process, the current trend is to migrate from the manual data cleaning to more intelligent machine learning-based processes.

Writing Quotes like Aristotle with Recurrent Neural Networks

It’s all thanks to a machine learning architecture known as the Recurrent Neural Network (RNN).

Stock Market Analysis Using ARIMA

Time series are a big component of our everyday lives. They are in fact used in medicine (EEG analysis), finance (stock prices) and electronics (sensor data analysis). Many machine learning models have been created in order to tackle these types of tasks; two examples are ARIMA (AutoRegressive Integrated Moving Average) models and RNNs (Recurrent Neural Networks).

A New Way to look at GANs

A Generative Adversarial Network is an extremely interesting deep neural network architecture able to generate new data (often images) that resembles the data given during training (or in mathematical terms, matches the same distribution). Immediately after discovering GANs and how they work, I got intrigued. There is something special, maybe magical, about generating realistic looking images in an unsupervised manner. One area of GAN research that really caught my attention has been image-to-image translation: the ability to turn an image into another image keeping some sort of correspondence (for example turning a horse into a zebra or an apple into an orange). Academic papers like the one introducing CycleGAN (a particular architecture which uses two GANs ‘helping’ each other to perform image to image translation) showed me a powerful and captivating deep learning application that I immediately wanted to try and implement myself.

Enabling Cognitive Visual Question Answering

Exploring a hybrid approach to visual question answering through deeper integration of OpenCog and a Vision Subsystem.

Creative Artificial Intelligence… Towards New Horizons

In the creative world, an evolution is happening thanks to the innovations of machine learning and of deep learning. This change gives a glimpse into unpublished creative horizons in terms of design, cybernetic art, music, writing, imaging and video. This is an unprecedented revolution, leading to an unstoppable change, as well as a major return to creativity. There is still a question on the lips of many creative people, artists, producers, directors and creative agencies: ‘Will robots steal our jobs?’ The answer is NO. On the contrary, the emergence of this creative artificial intelligence is going to improve their everyday life. Before understanding the 3 keys of this major return to creativity (creativity, jobs and ethics), understanding the past and the founding events is necessary.

How does the reality TV show Cops stack up with real-life crime figures?

Creators of podcast Running from Cops watched 846 episodes and compared the numbers they found with national crime figures

Reality TV can fail to live up to its name, conveying a version of the world that is a sidestep from the truth. And Cops was a show that was crucial to the genre – it’s not only the longest-running reality show in history, it’s also one of the first.

A new podcast, Running from Cops, explores how the TV show helped to shape the same criminal justice system that it depicted. Dan Taberski tracks down some of the people whose crimes and arrests were broadcast for all to see and uncovers how the show was made.

Book Memo: “Computational Methods for Numerical Analysis with R”

Computational Methods for Numerical Analysis with R is an overview of traditional numerical analysis topics presented using R. This guide shows how common functions from linear algebra, interpolation, numerical integration, optimization, and differential equations can be implemented in pure R code. Every algorithm described is given with a complete function implementation in R, along with examples to demonstrate the function and its use. Computational Methods for Numerical Analysis with R is intended for those who already know R, but are interested in learning more about how the underlying algorithms work. As such, it is suitable for statisticians, economists, and engineers, and others with a computational and numerical background.

R Packages worth a look

Computing General Equilibrium (CGE)
Developing general equilibrium models, computing general equilibrium and simulating economic dynamics with structural dynamic models in LI (2019, ISBN: …

C Resource Cleanup via Exit Handlers (cleancall)
Wrapper of .Call() that runs exit handlers to clean up C resources. Helps managing C (non-R) resources while using the R API.

Basket Trial Analysis (basket)
Implementation of multisource exchangeability models for Bayesian analyses of prespecified subgroups arising in the context of basket trial design and …

Identifiability of Linear Structural Equation Models (SEMID)
Provides routines to check identifiability or non-identifiability of linear structural equation models as described in Drton, Foygel, and Sullivant (20 …

Document worth reading: “When Gaussian Process Meets Big Data: A Review of Scalable GPs”

The vast quantity of information brought by big data as well as the evolving computer hardware encourages success stories in the machine learning community. Meanwhile, it poses challenges for the Gaussian process (GP), a well-known non-parametric and interpretable Bayesian model, which suffers from cubic complexity in the training size. To improve the scalability while retaining the desirable prediction quality, a variety of scalable GPs have been presented. But they have not yet been comprehensively reviewed and discussed in a unifying way in order to be well understood by both academia and industry. To this end, this paper is devoted to reviewing state-of-the-art scalable GPs involving two main categories: global approximations which distill the entire data and local approximations which divide the data for subspace learning. Particularly, for global approximations, we mainly focus on sparse approximations comprising prior approximations which modify the prior but perform exact inference, and posterior approximations which retain the exact prior but perform approximate inference; for local approximations, we highlight the mixture/product of experts that conducts model averaging from multiple local experts to boost predictions. To present a complete review, recent advances for improving the scalability and model capability of scalable GPs are reviewed. Finally, the extensions and open issues regarding the implementation of scalable GPs in various scenarios are reviewed and discussed to inspire novel ideas for future research avenues. When Gaussian Process Meets Big Data: A Review of Scalable GPs

May 23, 2019

Magister Dixit

“Maturity is the capacity to endure uncertainty.” John Finley

Whats new on arXiv

EENA: Efficient Evolution of Neural Architecture

Latest algorithms for automatic neural architecture search perform remarkably well but are basically directionless in the search space and computationally expensive in the training of every intermediate architecture. In this paper, we propose a method for efficient architecture search called EENA (Efficient Evolution of Neural Architecture) with mutation and crossover operations guided by information that has already been learned, to speed up this process and consume less computational effort by reducing redundant searching and training. On CIFAR-10 classification, EENA using minimal computational resources (0.65 GPU-days) can design a highly effective neural architecture which achieves 2.56% test error with 8.47M parameters. Furthermore, the best architecture discovered is also transferable to CIFAR-100.

Fairness in Machine Learning with Tractable Models

Machine Learning techniques have become pervasive across a range of different applications, and are now widely used in areas as disparate as recidivism prediction, consumer credit-risk analysis and insurance pricing. The prevalence of machine learning techniques has raised concerns about the potential for learned algorithms to become biased against certain groups. Many definitions have been proposed in the literature, but the fundamental task of reasoning about probabilistic events is a challenging one, owing to the intractability of inference. The focus of this paper is taking steps towards the application of tractable models to fairness. Tractable probabilistic models have emerged that guarantee that conditional marginals can be computed in time linear in the size of the model. In particular, we show that sum product networks (SPNs) enable an effective technique for determining the statistical relationships between protected attributes and other training variables. If a subset of these training variables are found by the SPN to be independent of the training attribute then they can be considered ‘safe’ variables, from which we can train a classification model without concern that the resulting classifier will result in disparate outcomes for different demographic groups. Our initial experiments on the ‘German Credit’ data set indicate that this processing technique significantly reduces disparate treatment of male and female credit applicants, with a small reduction in classification accuracy compared to state of the art. We will also motivate the concept of ‘fairness through percentile equivalence’, a new definition predicated on the notion that individuals at the same percentile of their respective distributions should be treated equivalently, and this prevents unfair penalisation of those individuals who lie at the extremities of their respective distributions.

Knowledge-Based Sequential Decision-Making Under Uncertainty

Deep reinforcement learning (DRL) algorithms have achieved great success on sequential decision-making problems, yet are criticized for their lack of data-efficiency and explainability. In particular, explainability of subtasks is critical in hierarchical decision-making since it enhances the transparency of black-box-style DRL methods and helps RL practitioners to better understand the high-level behavior of the system. To improve the data-efficiency and explainability of DRL, declarative knowledge is introduced in this work and a novel algorithm is proposed by integrating DRL with symbolic planning. Experimental analysis on publicly available benchmarks validates the explainability of the subtasks and shows that our method can outperform the state-of-the-art approach in terms of data-efficiency.

Vector Field Neural Networks

This work begins by establishing a mathematical formalization between different geometrical interpretations of Neural Networks, providing a first contribution. From this starting point, a new interpretation is explored, using the idea of implicit vector fields moving data as particles in a flow. A new architecture, Vector Fields Neural Networks (VFNN), is proposed based on this interpretation, with the vector field becoming explicit. A specific implementation of the VFNN using Euler’s method to solve ordinary differential equations (ODEs) and Gaussian vector fields is tested. The first experiments present visual results highlighting the important features of the new architecture and providing another contribution with the geometrically interpretable regularization of model parameters. Then, the new architecture is evaluated for different hyperparameters and inputs, with the objective of evaluating the influence on model performance, computational time, and complexity. The VFNN model is compared against the known basic models Naive Bayes, Feed Forward Neural Networks, and Support Vector Machines (SVM), showing comparable, or better, results for different datasets. Finally, the conclusion provides many new questions and ideas for improvement of the model that can be used to increase model performance.

Non-negative matrix factorization based on generalized dual divergence

A theoretical framework for non-negative matrix factorization based on generalized dual Kullback-Leibler divergence, which includes members of the exponential family of models, is proposed. A family of algorithms is developed using this framework and their convergence is proven using the Expectation-Maximization algorithm. The proposed approach generalizes some existing methods for different noise structures and contrasts with the recently proposed quasi-likelihood approach, thus providing a useful alternative for non-negative matrix factorizations. A measure to evaluate the goodness-of-fit of the resulting factorization is described. This framework can be adapted to include penalty, kernel and discriminant functions as well as tensors.

Model interpretation through lower-dimensional posterior summarization

Nonparametric regression models have recently surged in their power and popularity, accompanying the trend of increasing dataset size and complexity. While these models have proven their predictive ability in empirical settings, they are often difficult to interpret, and by themselves often do not address the underlying inferential goals of the analyst or decision maker. In this paper, we propose a modular two-stage approach for creating parsimonious, interpretable summaries of complex models which allow freedom in the choice of modeling technique and the inferential target. In the first stage, a flexible model is fit which is believed to be as accurate as possible. Then, in the second stage, a lower-dimensional summary model is fit which is suited to interpretably explain global or local predictive trends in the original model. The summary is refined and refitted as necessary to give adequate explanations of the original model, and we provide heuristics for this summary search. Our methodology is an example of posterior summarization, and so these summaries naturally come with valid Bayesian uncertainty estimates. We apply our technique and demonstrate its strengths on several real datasets.

Online Multivariate Anomaly Detection and Localization for High-dimensional Settings

This paper considers the real-time detection of anomalies in high-dimensional systems. The goal is to detect anomalies quickly and accurately so that the appropriate countermeasures can be taken in time, before the system is harmed. We propose a sequential and multivariate anomaly detection method that scales well to high-dimensional datasets. The proposed method follows a nonparametric, i.e., data-driven, and semi-supervised approach, i.e., trains only on nominal data. Thus, it is applicable to a wide range of applications and data types. Thanks to its multivariate nature, it can quickly and accurately detect challenging anomalies, such as changes in the correlation structure and stealth low-rate cyberattacks. Its asymptotic optimality and computational complexity are comprehensively analyzed. In conjunction with the detection method, an effective technique for localizing the anomalous data dimensions is also proposed. We further extend the proposed detection and localization methods to a supervised setup where an additional anomaly dataset is available, and combine the proposed semi-supervised and supervised algorithms to obtain an online learning algorithm under the semi-supervised framework. The practical use of the proposed algorithms is demonstrated in DDoS attack mitigation, and their performance is evaluated using a real IoT-botnet dataset and simulations.

Simple Black-box Adversarial Attacks

We propose an intriguingly simple method for the construction of adversarial images in the black-box setting. In contrast to the white-box scenario, constructing black-box adversarial images has the additional constraint on query budget, and efficient attacks remain an open problem to date. With only the mild assumption of continuous-valued confidence scores, our highly query-efficient algorithm utilizes the following simple iterative principle: we randomly sample a vector from a predefined orthonormal basis and either add it to or subtract it from the target image. Despite its simplicity, the proposed method can be used for both untargeted and targeted attacks, resulting in unprecedented query efficiency in both settings. We demonstrate the efficacy and efficiency of our algorithm on several real world settings including the Google Cloud Vision API. We argue that our proposed algorithm should serve as a strong baseline for future black-box attacks, in particular because it is extremely fast and its implementation requires less than 20 lines of PyTorch code.
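The iterative principle in this abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the choice of the standard (pixel) basis, the step size `epsilon`, the acceptance rule (keep a step only if the confidence of the true label drops) and the toy softmax classifier `make_toy_model` are all assumptions made for illustration.

```python
import numpy as np

def simple_blackbox_attack(predict_proba, x, label, epsilon=0.2, max_queries=1000, seed=0):
    """Untargeted sketch: lower the confidence of `label` one basis direction at a time."""
    rng = np.random.default_rng(seed)
    x_adv = x.astype(float).copy()
    best = predict_proba(x_adv)[label]
    queries = 1
    # Visit the standard (pixel) basis in random order, without replacement.
    for idx in rng.permutation(x_adv.size):
        if queries >= max_queries:
            break
        for sign in (+1.0, -1.0):
            candidate = x_adv.copy()
            candidate.flat[idx] += sign * epsilon
            p = predict_proba(candidate)[label]
            queries += 1
            if p < best:  # keep the step only if the confidence drops
                x_adv, best = candidate, p
                break
    return x_adv, best, queries

def make_toy_model(dim=16, n_classes=3, seed=1):
    """Hypothetical stand-in for a black-box classifier: a random softmax-linear model."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_classes, dim))
    def predict_proba(x):
        z = W @ x.ravel()
        e = np.exp(z - z.max())
        return e / e.sum()
    return predict_proba

model = make_toy_model()
x0 = np.zeros(16)
label = int(np.argmax(model(x0)))
p0 = model(x0)[label]
x_adv, p_adv, n_queries = simple_blackbox_attack(model, x0, label)
```

Each coordinate is perturbed at most once, so the perturbation stays within `epsilon` per coordinate, and the attacker only ever sees the confidence scores returned by `predict_proba`.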

FlexNGIA: A Flexible Internet Architecture for the Next-Generation Tactile Internet

From virtual reality and telepresence, to augmented reality, holoportation, and remotely controlled robotics, these future network applications promise an unprecedented development for society, economics and culture by revolutionizing the way we live, learn, work and play. In order to deploy such futuristic applications and to cater to their performance requirements, recent trends stressed the need for the Tactile Internet, an Internet that, according to the International Telecommunication Union, combines ultra low latency with extremely high availability, reliability and security. Unfortunately, today’s Internet falls short when it comes to providing such stringent requirements due to several fundamental limitations in the design of the current network architecture and communication protocols. This brings the need to rethink the network architecture and protocols, and efficiently harness recent technological advances in terms of virtualization and network softwarization to design the Tactile Internet of the future. In this paper, we start by analyzing the characteristics and requirements of future networking applications. We then highlight the limitations of the traditional network architecture and protocols and their inability to cater to these requirements. Afterward, we put forward a novel network architecture adapted to the Tactile Internet called FlexNGIA, a Flexible Next-Generation Internet Architecture. We then describe some use-cases where we discuss the potential mechanisms and control loops that could be offered by FlexNGIA in order to ensure the required performance and reliability guarantees for future applications. Finally, we identify the key research challenges to further develop FlexNGIA towards a full-fledged architecture for the future Tactile Internet.

An Essay on Optimization Mystery of Deep Learning

Despite the huge empirical success of deep learning, theoretical understanding of the learning process of neural networks is still lacking. This is why some of its features seem ‘mysterious’. We emphasize two mysteries of deep learning: the generalization mystery and the optimization mystery. In this essay we review and draw connections between several selected works concerning the latter.

Reference-Based Sequence Classification

Sequence classification is an important data mining task in many real world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving classification accuracy comparable to that of state-of-the-art sequence classification algorithms.

Spectral Metric for Dataset Complexity Assessment

In this paper, we propose a new measure to gauge the complexity of image classification problems. Given an annotated image dataset, our method computes a complexity measure called the cumulative spectral gradient (CSG) which strongly correlates with the test accuracy of convolutional neural networks (CNN). The CSG measure is derived from the probabilistic divergence between classes in a spectral clustering framework. We show that this metric correlates with the overall separability of the dataset and thus its inherent complexity. As will be shown, our metric can be used for dataset reduction, to assess which classes are more difficult to disentangle, and approximate the accuracy one could expect to get with a CNN. Results obtained on 11 datasets and three CNN models reveal that our method is more accurate and faster than previous complexity measures.

Stochastically Dominant Distributional Reinforcement Learning

We describe a new approach for mitigating risk in the Reinforcement Learning paradigm. Instead of reasoning about expected utility, we use second-order stochastic dominance (SSD) to directly compare the inherent risk of random returns induced by different actions. We frame the RL optimization within the space of probability measures to accommodate the SSD relation, treating Bellman’s equation as a potential energy functional. This brings us to Wasserstein gradient flows, for which the optimality and convergence are well understood. We propose a discrete-measure approximation algorithm called the Dominant Particle Agent (DPA), and we demonstrate how safety and performance are better balanced with DPA than with existing baselines.

DeepSwarm: Optimising Convolutional Neural Networks using Swarm Intelligence

In this paper we propose DeepSwarm, a novel neural architecture search (NAS) method based on Swarm Intelligence principles. At its core DeepSwarm uses Ant Colony Optimization (ACO) to generate an ant population that uses the pheromone information to collectively search for the best neural architecture. Furthermore, by using local and global pheromone update rules our method ensures the balance between exploitation and exploration. On top of this, to make our method more efficient we combine progressive neural architecture search with weight reusability. Furthermore, due to the nature of ACO our method can incorporate heuristic information which can further speed up the search process. After systematic and extensive evaluation, we discover that on three different datasets (MNIST, Fashion-MNIST, and CIFAR-10) when compared to existing systems our proposed method demonstrates competitive performance. Finally, we open source DeepSwarm as a NAS library and hope it can be used by more deep learning researchers and practitioners.

Recurrent Kalman Networks: Factorized Inference in High-Dimensional Deep Feature Spaces

In order to integrate uncertainty estimates into deep time-series modelling, Kalman Filters (KFs) (Kalman et al., 1960) have been integrated with deep learning models. However, such approaches typically rely on approximate inference techniques such as variational inference, which makes learning more complex and often less scalable due to approximation errors. We propose a new deep approach to Kalman filtering which can be learned directly in an end-to-end manner using backpropagation without additional approximations. Our approach uses a high-dimensional factorized latent state representation for which the Kalman updates simplify to scalar operations and thus avoids hard to backpropagate, computationally heavy and potentially unstable matrix inversions. Moreover, we use locally linear dynamic models to efficiently propagate the latent state to the next time step. The resulting network architecture, which we call Recurrent Kalman Network (RKN), can be used for any time-series data, similar to an LSTM (Hochreiter & Schmidhuber, 1997), but uses an explicit representation of uncertainty. As shown by our experiments, the RKN obtains much more accurate uncertainty estimates than an LSTM or Gated Recurrent Units (GRUs) (Cho et al., 2014) while also showing a slightly improved prediction performance, and it outperforms various recent generative models on an image imputation task.

Cleaned Similarity for Better Memory-Based Recommenders

Memory-based collaborative filtering methods like user or item k-nearest neighbors (kNN) are a simple yet effective solution to the recommendation problem. The backbone of these methods is the estimation of the empirical similarity between users/items. In this paper, we analyze the spectral properties of the Pearson and the cosine similarity estimators, and we use tools from random matrix theory to argue that they suffer from noise and eigenvalue spreading. We argue that, unlike the Pearson correlation, the cosine similarity naturally possesses the desirable property of eigenvalue shrinkage for large eigenvalues. However, due to its zero-mean assumption, it overestimates the largest eigenvalues. We quantify this overestimation and present a simple re-scaling and noise cleaning scheme. This results in better performance of the memory-based methods compared to their vanilla counterparts.

AM-LFS: AutoML for Loss Function Search

Designing an effective loss function plays an important role in visual analysis. Most existing loss function designs rely on hand-crafted heuristics that require domain experts to explore the large design space, which is usually sub-optimal and time-consuming. In this paper, we propose AutoML for Loss Function Search (AM-LFS) which leverages REINFORCE to search loss functions during the training process. The key contribution of this work is the design of the search space, which can guarantee the generalization and transferability on different vision tasks by including a number of prevailing loss functions in a unified formulation. We also propose an efficient optimization framework which can dynamically optimize the parameters of the loss function’s distribution during training. Extensive experimental results on four benchmark datasets show that, without any tricks, our method outperforms existing hand-crafted loss functions in various computer vision tasks.

Merging versus Ensembling in Multi-Study Machine Learning: Theoretical Insight from Random Effects

A critical decision point when training predictors using multiple studies is whether these studies should be combined or treated separately. We compare two multi-study learning approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets. We consider 1) merging all of the datasets and training a single learner, and 2) cross-study learning, which involves training a separate learner on each dataset and combining the resulting predictions. In a linear regression setting, we show analytically and confirm via simulation that merging yields lower prediction error than cross-study learning when the predictor-outcome relationships are relatively homogeneous across studies. However, as heterogeneity increases, there exists a transition point beyond which cross-study learning outperforms merging. We provide analytic expressions for the transition point in various scenarios and study asymptotic properties.

POPQORN: Quantifying Robustness of Recurrent Neural Networks

The vulnerability to adversarial attacks has been a critical issue for deep neural networks. Addressing this issue requires a reliable way to evaluate the robustness of a network. Recently, several methods have been developed to compute robustness quantification for neural networks, namely, certified lower bounds of the minimum adversarial perturbation. Such methods, however, were devised for feed-forward networks, e.g. multi-layer perceptron or convolutional networks. It remains an open problem to quantify robustness for recurrent networks, especially LSTM and GRU. For such networks, there exist additional challenges in computing the robustness quantification, such as handling the inputs at multiple steps and the interaction between gates and states. In this work, we propose POPQORN (Propagated-output Quantified Robustness for RNNs), a general algorithm to quantify robustness of RNNs, including vanilla RNNs, LSTMs, and GRUs. We demonstrate its effectiveness on different network architectures and show that the robustness quantification on individual steps can lead to new insights.

Online Distributed Estimation of Principal Eigenspaces

Principal components analysis (PCA) is a widely used dimension reduction technique with an extensive range of applications. In this paper, an online distributed algorithm is proposed for recovering the principal eigenspaces. We further establish its rate of convergence and show how it relates to the number of nodes employed in the distributed computation, the effective rank of the data matrix under consideration, and the gap in the spectrum of the underlying population covariance matrix. The proposed algorithm is illustrated on low-rank approximation and k-means clustering tasks. The numerical results show a substantial computational speed-up vis-a-vis standard distributed PCA algorithms, without compromising learning accuracy.


Bayesian optimization in SAS

My colleague Wayne Thompson has written a series of blogs about machine learning best practices. In this series' fifth post, Autotune models to avoid local minimum breakdowns, Wayne discusses two autotune methods, grid search and Bayesian optimization. According to another colleague’s paper, Automated Hyperparameter Tuning for Effective Machine Learning, Bayesian optimization is currently popular for hyperparameter optimization. Recently, I read a post on GitHub that demonstrated the Bayesian optimization procedure through a great demo using Python, and I wondered if I could build the same with the SAS matrix language, SAS/IML. In this article, I will show you how Bayesian optimization works through this simple demo.

Revisit Bayesian optimization

Bayesian optimization builds a surrogate model of the black-box function that maps hyperparameters to the objective value, based on the observations collected so far, and uses an acquisition function to select the next hyperparameter setting to evaluate.

1. Surrogate model

The functions that map hyperparameters to the objective are often black-box functions, and Gaussian Process regression models are popular because of their capability to fit black-box functions with limited observations. For more details about Gaussian Process regression, please refer to the book Gaussian Processes for Machine Learning.
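To make the surrogate model concrete, here is a minimal NumPy sketch of the Gaussian Process posterior mean and standard deviation. It is an illustration, not the gpRegression macro from this post: the squared-exponential kernel, the fixed hyperparameters (`length_scale`, `signal_var`, `noise_var`) and the function names are assumptions, and no marginal-likelihood fitting is done here.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    sq_dist = (a - b.T) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-6, **kernel_args):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train, **kernel_args) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, **kernel_args)
    prior_var = np.full(len(x_test), kernel_args.get("signal_var", 1.0))
    # A Cholesky factorization avoids forming the inverse of K explicitly.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = prior_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def target(x):
    # Noise-free version of the target function used in the demo of this post.
    return -np.sin(3 * x) - x**2 + 0.7 * x

x_train = np.array([-0.9, 1.1])
y_train = target(x_train)
x_test = np.linspace(-1.0, 2.0, 7)
mean, std = gp_posterior(x_train, y_train, x_test)
mean_at_train, std_at_train = gp_posterior(x_train, y_train, x_train)
```

With only two observations the posterior passes (almost) through the observed points, and the standard deviation grows with distance from them, which is the uncertainty the acquisition function later exploits.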

2. Acquisition function and Location Propose function

The acquisition function evaluates points in the search space and the location propose function finds points which will likely lead to maximum improvement and are worth trying. The acquisition function trades off exploitation and exploration. Exploitation selects points having high objective predictions, while exploration selects points having high variance. There are several acquisition functions, and I used expected improvement, which is the same as in the GitHub post that I'm modeling.
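As an illustration of this trade-off, the sketch below computes expected improvement for a maximization problem from a surrogate's predicted mean and standard deviation: EI = (mu - best - xi) * Phi(z) + sigma * phi(z), with z = (mu - best - xi) / sigma, and EI = 0 where sigma = 0. The function name and the exploration parameter `xi` are assumptions for this sketch, not taken from the SAS macros.

```python
import math
import numpy as np

# Vectorized standard normal helper built on the standard library.
_erf = np.vectorize(math.erf, otypes=[float])

def expected_improvement(mean, std, best_observed, xi=0.01):
    """Expected improvement over `best_observed` when maximizing."""
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    std = np.atleast_1d(np.asarray(std, dtype=float))
    improvement = mean - best_observed - xi
    ei = np.zeros_like(mean)
    positive = std > 0  # EI is defined as 0 where the model is certain
    z = improvement[positive] / std[positive]
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    ei[positive] = improvement[positive] * cdf + std[positive] * pdf
    return ei

# Three candidate points: no uncertainty, low uncertainty, high uncertainty.
mean = np.array([0.0, 0.5, 0.5])
std = np.array([0.0, 0.1, 1.0])
ei = expected_improvement(mean, std, best_observed=0.4, xi=0.0)
```

The second and third candidates have the same predicted mean, but the third has the larger standard deviation and therefore the larger expected improvement; that is the exploration side of the trade-off.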

3. Sequential optimization procedure

Bayesian optimization is a sequential optimization procedure. The algorithm is as follows:

For t=1, 2, ... do

  • Find the next sampling point by optimizing the acquisition function over the Gaussian Process (GP).
  • Obtain a possibly noisy sample from the objective function.
  • Add the sample to the previous samples and update the GP.

end for
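The loop above can be sketched end to end in Python. This is a simplified illustration under stated assumptions rather than a port of the SAS/IML code: the GP hyperparameters are held fixed instead of being refit at every iteration, the next point is chosen by a grid search over the acquisition values, and a compact GP surrogate and expected improvement are inlined so that the block is self-contained.

```python
import math
import numpy as np

def target(x, noise=0.0, rng=None):
    """Target function from the demo, with optional Gaussian noise."""
    x = np.asarray(x, dtype=float)
    eps = rng.normal(size=x.shape) if rng is not None else 0.0
    return -np.sin(3 * x) - x**2 + 0.7 * x + noise * eps

def gp_predict(x_train, y_train, x_grid, length=1.0, noise_var=0.04):
    """Zero-mean GP with a unit-variance squared-exponential kernel (fixed hyperparameters)."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)
    K = k(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = k(x_train, x_grid)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def expected_improvement(mean, std, best, xi=0.01):
    std = np.maximum(std, 1e-12)  # avoid division by zero
    z = (mean - best - xi) / std
    cdf = 0.5 * (1 + np.vectorize(math.erf, otypes=[float])(z / math.sqrt(2)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    return (mean - best - xi) * cdf + std * pdf

rng = np.random.default_rng(123)
x_grid = np.arange(-1.0, 2.0, 0.01)       # search space
x_train = np.array([-0.9, 1.1])           # initial samples
noise = 0.2
y_train = target(x_train, noise, rng)

for t in range(15):
    # 1. Update the GP and find the next point by maximizing the acquisition.
    mean, std = gp_predict(x_train, y_train, x_grid, noise_var=noise**2)
    ei = expected_improvement(mean, std, best=y_train.max())
    x_next = x_grid[np.argmax(ei)]
    # 2. Obtain a noisy sample from the objective function.
    y_next = target(x_next, noise, rng)
    # 3. Add the sample to the previous samples.
    x_train = np.append(x_train, x_next)
    y_train = np.append(y_train, y_next)
```

Using the best noisy observation as the incumbent is a simplification; with noisy samples, the best posterior mean at the sampled points is a common alternative.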

Implementation with SAS/IML

1. Gaussian Process Regression

In the code, the SAS macro gpRegression is used to fit a Gaussian Process regression model by maximizing the marginal likelihood. For more details, please read chapter 5 of Gaussian Processes for Machine Learning.

2. Bayesian optimization

The SAS macro bayesianOptimization includes two functions: the acquisition function and location propose function.

3. Visualization

Finally, the macro plotAll shows you the step-by-step optimization procedure visually.

The code for these macros and for the following demo is available on my GitHub project.

Demo implementation with SAS/IML

proc iml;
   * gprFit and proposeLocation are not defined in this listing;
   * they come from the code in the GitHub project mentioned above;
   * Target function;
   start f(X, noise=0);
       return (-sin(3*X) - X##2 + 0.7*X + noise * randfun(nrow(X), "Normal"));
   finish;
   * Search space of sampling points;
   X = T(do(-1, 1.99, 0.01));
   call randseed(123);
   Y = f(X, 0);
   * Initialize samples;
   X_train = {-0.9, 1.1};
   noise = 0.2;
   Y_train = f(X_train, noise);
   * Initial parameters of GP regression model;
   gprParms = {1 1 0};
   * Max iterations for sequential Bayesian Optimization;
   n_iter = 15;
   do i=1 to n_iter;
      * Update Gaussian process with existing samples;
      gprParms = gprFit(gprParms);
      * Obtain next sampling point from the acquisition function (acquisition);
      proposeResults = proposeLocation(X, X_train, Y_train, gprParms);
      X_next = proposeResults$"max_x";
      * Stop early when no location is proposed;
      if X_next=. then leave;
      * Obtain next noisy sample from the objective function;
      Y_next = f(X_next, noise);
      * Add sample to previous samples;
      X_train = X_train//X_next;
      Y_train = Y_train//Y_next;
      * Save all proposed sampling points into a matrix;
      allProposed = allProposed//(j(1, 1, i)||X_next);
   end;
   * Save all proposed sampling points into a SAS dataset;
   create allProposed from allProposed [colname={"Iteration" "X"}];
   append from allProposed;
   close allProposed;
quit;

Visualize the step-by-step Bayesian optimization procedure

Figure 1: Target function and initial two samples.

Figure 2: Proposed location and acquisition function plot of the first two iterations.

Let's skip to the last two iterations.

Figure 3: Proposed location and acquisition function plot of the last two iterations.

To demonstrate the whole procedure visually, I created an animation file.

Figure 4: Animation of the step-by-step optimization.

The following two plots show the convergence trend across iterations. The left plot displays the distance between consecutive proposed points by iteration; starting from iteration 9, the distance is close to zero. The right plot shows the value of the best selected sample; starting from iteration 8, the value barely changes.

Figure 5: Distance between consecutive proposed points by iteration.
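The two convergence traces described above are straightforward to compute from the sequence of proposed points and their observed values. Below is a minimal Python sketch; the function name and the sample numbers are hypothetical and are not the values from the plots.

```python
import numpy as np

def convergence_traces(x_proposed, y_observed):
    """Return distances between consecutive proposals and the running best value."""
    x = np.asarray(x_proposed, dtype=float)
    y = np.asarray(y_observed, dtype=float)
    distances = np.abs(np.diff(x))            # |x_t - x_{t-1}| per iteration
    best_so_far = np.maximum.accumulate(y)    # best sample value seen up to each iteration
    return distances, best_so_far

# Hypothetical proposals and observations, for illustration only.
x_proposed = [0.5, -0.2, -0.35, -0.36, -0.36]
y_observed = [-0.9, 0.38, 0.5, 0.49, 0.51]
distances, best_so_far = convergence_traces(x_proposed, y_observed)
```

A distance trace that stays near zero together with a flat running best is the pattern the plots show from iterations 8 and 9 onward.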

The full code is available on GitHub.

The post Bayesian optimization in SAS appeared first on The SAS Data Science Blog.


If you did not already know

Porcellio Scaber Algorithm (PSA)
Bio-inspired algorithms have received a significant amount of attention in both academic and engineering communities. In this paper, based on the observation of two major survival rules of a species of woodlice, i.e., porcellio scaber, we design and propose an algorithm called the porcellio scaber algorithm (PSA) for solving optimization problems, including differentiable and non-differentiable ones as well as the case with local optima. Numerical results based on benchmark problems are presented to validate the efficacy of PSA. …

GeoSay
Automatic extraction of buildings in remote sensing images is an important but challenging task and finds many applications in different fields such as urban planning, navigation and so on. This paper addresses the problem of building extraction in very high-spatial-resolution (VHSR) remote sensing (RS) images, whose spatial resolution is often as fine as half a meter and which provide rich information about buildings. Based on the observation that buildings in VHSR-RS images are always more distinguishable in geometry than in texture or spectral domain, this paper proposes a geometric building index (GBI) for accurate building extraction, by computing the geometric saliency from VHSR-RS images. More precisely, given an image, the geometric saliency is derived from a mid-level geometric representation based on meaningful junctions that can locally describe geometrical structures of images. The resulting GBI is finally measured by integrating the derived geometric saliency of buildings. Experiments on three public and commonly used datasets demonstrate that the proposed GBI achieves state-of-the-art performance and shows impressive generalization capability. Additionally, GBI preserves both the exact position and accurate shape of single buildings compared to existing methods. …

Frequent Pattern Mining
The problem of frequent pattern mining is that of finding relationships among the items in a database. The problem can be stated as follows. Given a database D with transactions T1 … TN, determine all patterns P that are present in at least a fraction s of the transactions. The fraction s is referred to as the minimum support. The parameter s can be expressed either as an absolute number, or as a fraction of the total number of transactions in the database. Each transaction Ti can be considered a sparse binary vector, or as a set of discrete values representing the identifiers of the binary attributes that are instantiated to the value of 1. The problem was originally proposed in the context of market basket data in order to find frequent groups of items that are bought together. Thus, in this scenario, each attribute corresponds to an item in a superstore, and the binary value represents whether or not it is present in the transaction. Since the problem was originally proposed, it has been applied to numerous other applications in the context of data mining, Web log mining, sequential pattern mining, and software bug analysis. …
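The problem statement can be made concrete with a tiny brute-force sketch in Python. It enumerates candidate itemsets directly rather than using a pruning algorithm such as Apriori, so it only suits toy inputs; the function name, the `max_size` cap and the example transactions are assumptions for illustration.

```python
from itertools import combinations

def frequent_patterns(transactions, min_support, max_size=3):
    """Return {pattern: support} for itemsets present in at least a fraction
    `min_support` of the transactions (brute force, small inputs only)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for pattern in combinations(items, size):
            # Count the transactions that contain every item of the pattern.
            count = sum(1 for t in transactions if set(pattern) <= t)
            if count / n >= min_support:
                result[pattern] = count / n
    return result

# Market-basket style toy database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
patterns = frequent_patterns(transactions, min_support=0.5)
```

Here `min_support=0.5` means a pattern must occur in at least two of the four transactions; for example, ("beer", "diapers") qualifies while ("beer", "bread") does not.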

Random KNN Feature Selection (RKNN-FS)
We present RKNN-FS, an innovative feature selection procedure for ‘small n, large p problems.’ RKNN-FS is based on Random KNN (RKNN), a novel generalization of traditional nearest-neighbor modeling. RKNN consists of an ensemble of base k-nearest neighbor models, each constructed from a random subset of the input variables. To rank the importance of the variables, we define a criterion on the RKNN framework, using the notion of support. A two-stage backward model selection method is then developed based on this criterion. Empirical results on microarray data sets with thousands of variables and relatively few samples show that RKNN-FS is an effective feature selection approach for high-dimensional data. RKNN is similar to Random Forests in terms of classification accuracy without feature selection. However, RKNN provides much better classification accuracy than RF when each method incorporates a feature-selection step. Our results show that RKNN is significantly more stable and more robust than Random Forests for feature selection when the input data are noisy and/or unbalanced. Further, RKNN-FS is much faster than the Random Forests feature selection method (RF-FS), especially for large scale problems, involving thousands of variables and multiple classes. …


Whats new on arXiv

Transparency in Maintenance of Recruitment Chatbots

We report on experiences with implementing conversational agents in the recruitment domain based on a machine learning (ML) system. Recruitment chatbots mediate communication between job-seekers and recruiters by exposing ML data to recruiter teams. Errors are difficult to understand, communicate, and resolve because they may span and combine UX, ML, and software issues. In an effort to improve organizational and technical transparency, we came to rely on a key contact role. Though effective for design and development, the centralization of this role poses challenges for transparency in sustained maintenance of this kind of ML-based mediating system.

Edge-Assisted Hierarchical Federated Learning with Non-IID Data

Federated Learning (FL) is capable of leveraging massively distributed private data, e.g., on mobile phones and IoT devices, to collaboratively train a shared machine learning model with the help of a cloud server. However, its iterative training process results in intolerable communication latency, and causes huge burdens on the backbone network. Thus, reducing the communication overhead is critical to implement FL in practice. Meanwhile, the model performance degradation due to the unique non-IID data distribution at different devices is another big issue for FL. In this paper, by introducing the mobile edge computing platform as an intermediary structure, we propose a hierarchical FL architecture to reduce the communication rounds between users and the cloud. In particular, a Hierarchical Federated Averaging (HierFAVG) algorithm is proposed, which allows multiple local aggregations at each edge server before one global aggregation at the cloud. We establish the convergence of HierFAVG for both convex and non-convex objective functions with non-IID user data. It is demonstrated that HierFAVG can reach a desired model performance with less communication, and outperform the traditional Federated Averaging algorithm.
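The idea of several edge-level aggregations per cloud aggregation can be sketched on a toy least-squares problem. This is only an illustration of the hierarchy described in the abstract, not the paper's HierFAVG algorithm: the model, the data generator `make_client`, the round counts and the learning rate are all assumptions.

```python
import numpy as np

def local_sgd(w, X, y, steps, lr):
    """A few full-batch gradient steps on a client's least-squares loss."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def hier_fed_avg(edges, w0, cloud_rounds=20, edge_rounds=2, local_steps=5, lr=0.05):
    """`edges` is a list of edge servers, each a list of (X, y) client datasets."""
    w_cloud = w0.copy()
    for _ in range(cloud_rounds):
        edge_models, edge_sizes = [], []
        for clients in edges:
            w_edge = w_cloud.copy()
            sizes = np.array([len(y) for _, y in clients], dtype=float)
            # Several local aggregations at the edge before contacting the cloud.
            for _ in range(edge_rounds):
                local = [local_sgd(w_edge.copy(), X, y, local_steps, lr) for X, y in clients]
                w_edge = np.average(local, axis=0, weights=sizes)
            edge_models.append(w_edge)
            edge_sizes.append(sizes.sum())
        # One global aggregation at the cloud, weighted by data size.
        w_cloud = np.average(edge_models, axis=0, weights=edge_sizes)
    return w_cloud

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])

def make_client(shift, n=50):
    # Non-IID inputs: each client's features are centred at a different value.
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    return X, y

edges = [
    [make_client(-1.0), make_client(-0.5)],
    [make_client(0.5), make_client(1.0)],
]
w_hat = hier_fed_avg(edges, w0=np.zeros(2))
```

With these settings the clients talk to their edge server 40 times but the edge servers talk to the cloud only 20 times, which is the kind of communication saving the abstract refers to.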

Deep Compressed Sensing

Compressed sensing (CS) provides an elegant framework for recovering sparse signals from compressed measurements. For example, CS can exploit the structure of natural images and recover an image from only a few random measurements. CS is flexible and data efficient, but its application has been restricted by the strong assumption of sparsity and costly reconstruction process. A recent approach that combines CS with neural network generators has removed the constraint of sparsity, but reconstruction remains slow. Here we propose a novel framework that significantly improves both the performance and speed of signal recovery by jointly training a generator and the optimisation process for reconstruction via meta-learning. We explore training the measurements with different objectives, and derive a family of models based on minimising measurement errors. We show that Generative Adversarial Nets (GANs) can be viewed as a special case in this family of models. Borrowing insights from the CS perspective, we develop a novel way of improving GANs using gradient information from the discriminator.

BrainTorrent: A Peer-to-Peer Environment for Decentralized Federated Learning

Access to sufficient annotated data is a common challenge in training deep neural networks on medical images. As annotating data is expensive and time-consuming, it is difficult for an individual medical center to reach large enough sample sizes to build their own, personalized models. As an alternative, data from all centers could be pooled to train a centralized model that everyone can use. However, such a strategy is often infeasible due to the privacy-sensitive nature of medical data. Recently, federated learning (FL) has been introduced to collaboratively learn a shared prediction model across centers without the need for sharing data. In FL, clients are locally training models on site-specific datasets for a few epochs and then sharing their model weights with a central server, which orchestrates the overall training process. Importantly, the sharing of models does not compromise patient privacy. A disadvantage of FL is the dependence on a central server, which requires all clients to agree on one trusted central body, and whose failure would disrupt the training process of all clients. In this paper, we introduce BrainTorrent, a new FL framework without a central server, particularly targeted towards medical applications. BrainTorrent presents a highly dynamic peer-to-peer environment, where all centers directly interact with each other without depending on a central body. We demonstrate the overall effectiveness of FL for the challenging task of whole brain segmentation and observe that the proposed server-less BrainTorrent approach does not only outperform the traditional server-based one but reaches a similar performance to a model trained on pooled data.

Stability of Linear Structural Equation Models of Causal Inference

We consider the numerical stability of the parameter recovery problem in Linear Structural Equation Model (LSEM) of causal inference. A long line of work starting from Wright (1920) has focused on understanding which sub-classes of LSEM allow for efficient parameter recovery. Despite decades of study, this question is not yet fully resolved. The goal of this paper is complementary to this line of work; we want to understand the stability of the recovery problem in the cases when efficient recovery is possible. Numerical stability of Pearl’s notion of causality was first studied in Schulman and Srivastava (2016) using the concept of condition number where they provide ill-conditioned examples. In this work, we provide a condition number analysis for the LSEM. First we prove that under a sufficient condition, for a certain sub-class of LSEM that are bow-free (Brito and Pearl (2002)), the parameter recovery is stable. We further prove that randomly chosen input parameters for this family satisfy the condition with a substantial probability. Hence for this family, on a large subset of parameter space, recovery is numerically stable. Next we construct an example of LSEM on four vertices with unbounded condition number. We then corroborate our theoretical findings via simulations as well as real-world experiments for a sociology application. Finally, we provide a general heuristic for estimating the condition number of any LSEM instance.

AlgoNet: $C^\infty$ Smooth Algorithmic Neural Networks

Artificial neural networks revolutionized many areas of computer science in recent years since they provide solutions to a number of previously unsolved problems. On the other hand, for many problems, classic algorithms exist, which typically exceed the accuracy and stability of neural networks. To combine these two concepts, we present a new kind of neural networks: algorithmic neural networks (AlgoNets). These networks integrate smooth versions of classic algorithms and data structures into the topology of neural networks. A forward AlgoNet includes algorithmic layers into existing architectures while a backward AlgoNet can solve inverse problems without or with only weak supervision. In addition, we present the algonet package, a PyTorch based library that includes, inter alia, a smooth evaluated programming language, a smooth 3D mesh renderer, and smooth sorting algorithms.

Deep Learning for Multi-Scale Changepoint Detection in Multivariate Time Series

Many real-world time series, such as in health, have changepoints where the system’s structure or parameters change. Since changepoints can indicate critical events such as onset of illness, it is highly important to detect them. However, existing methods for changepoint detection (CPD) often require user-specified models and cannot recognize changes that occur gradually or at multiple time-scales. To address both, we show how CPD can be treated as a supervised learning problem, and propose a new deep neural network architecture to efficiently identify both abrupt and gradual changes at multiple timescales from multivariate data. Our proposed pyramid recurrent neural network (PRN) provides scale-invariance using wavelets and pyramid analysis techniques from multi-scale signal processing. Through experiments on synthetic and real-world datasets, we show that PRN can detect abrupt and gradual changes with higher accuracy than the state of the art and can extrapolate to detect changepoints at novel scales not seen in training.

How Case Based Reasoning Explained Neural Networks: An XAI Survey of Post-Hoc Explanation-by-Example in ANN-CBR Twins

This paper surveys an approach to the XAI problem, using post-hoc explanation by example, that hinges on twinning Artificial Neural Networks (ANNs) with Case-Based Reasoning (CBR) systems, so-called ANN-CBR twins. A systematic survey of 1100+ papers was carried out to identify the fragmented literature on this topic and to trace its influence through to more recent work involving Deep Neural Networks (DNNs). The paper argues that this twin-system approach, especially using ANN-CBR twins, presents one possible coherent, generic solution to the XAI problem (and, indeed, XCBR problem). The paper concludes by road-mapping some future directions for this XAI solution involving (i) further tests of feature-weighting techniques, (ii) explorations of how explanatory cases might best be deployed (e.g., in counterfactuals, near-miss cases, a fortiori cases), and (iii) the raising of the unwelcome and much-ignored issue of human user evaluation.

Contrastive Fairness in Machine Learning

We present contrastive fairness, a new direction in causal inference applied to algorithmic fairness. Earlier methods dealt with the ‘what if?’ question (counterfactual fairness, NeurIPS’17). We establish the theoretical and mathematical implications of the contrastive question ‘why this and not that?’ in context of algorithmic fairness in machine learning. This is essential to defend the fairness of algorithmic decisions in tasks where a person or sub-group of people is chosen over another (job recruitment, university admission, company layovers, etc). This development is also helpful to institutions to ensure or defend the fairness of their automated decision making processes. A test case of employee job location allocation is provided as an illustrative example.


Bayesian cell counting Pt. 2 - Growth over time

I’ve started growing yeast in my closet-turned-laboratory. There’s a reason why I am growing yeast, but that’ll be for another post. For this experiment, I wanted to use my new hemocytometer to do cell counts periodically over the next few days to gather data.

A nutrient-rich bioreactor (an Erlenmeyer flask with wort) was left at room temperature with plenty of aeration (a magnetic stirrer) for about 2.5 days. My collected data is below.

| hours since inoculation | cells counted |
| ----------------------- | ------------- |
| 0                       | 20            |
| 12.5                    | 21            |
| 17.5                    | 28            |
| 23                      | 34            |
| 36.5                    | 34            |
| 42.5                    | 31            |
| 48                      | 32            |
| 65                      | 32            |


As I mentioned in my previous post, counting cells is a very noisy process, so we want to keep that in mind as we analyze this data. We have two options to proceed:

  1. Assume each sample is independent, and run the same Bayesian model outlined in the previous article on each observed count, and plot the posterior distributions over time. This has the advantage of not assuming any parametric form of growth, but it has the serious disadvantage of not pooling any of the information (i.e. information at time 23 is very relevant to inference at time 17.5 and 36.5).
  2. Assume some parametric growth model with unknown parameters, and fit to these parameters.

I like option two, as it’s more of a challenge, and it can be used for interpolation within the data points and extrapolation outside of the observed data points.

Typically microorganism growth after inoculation has 3 phases: lag-phase, log-phase and stationary phase. The lag-phase is a period where the organisms become accustomed to their new environment, and have low reproduction. The log-phase is a poorly-named phase that represents the period of high (exponential) reproduction. Finally, after the medium has been depleted or the organism concentration has become too high, the organisms stop reproducing and they enter the stationary phase. This type of growth looks a lot like logistic growth, so let’s use that model:

$$(\text{yeast/mL})_t = P_0 + \frac{K}{1 + \exp(-r\cdot(t - \delta))}$$

The lag-phase is modeled by the \(\delta\) parameter - the larger this is, the longer the lag-phase went on for. From sources, the lag-phase usually lasts less than 24h. Given my initial bioreactor conditions, a lookup table suggests I should see about 50% growth. Furthermore, using the volume of the medium and the estimated concentration of my inoculate, I have an estimate for my initial concentration. All these give me priors for my estimates.
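
To get a feel for how the parameters shape the curve, here is a minimal NumPy sketch of the growth equation above. The function name `growth_curve` and the parameter values are hypothetical, picked only for illustration; they are not estimates from my data.

```python
import numpy as np

def growth_curve(t, P0, K, r, delta):
    """Yeast/mL at time t (hours) under the shifted logistic model."""
    return P0 + K / (1.0 + np.exp(-r * (t - delta)))

# Hypothetical parameter values, chosen only to illustrate the shape.
P0, K, r, delta = 100e6, 50e6, 0.5, 12.0

hours = np.array([0.0, 12.0, 24.0, 48.0])
conc = growth_curve(hours, P0, K, r, delta)

# At t == delta the logistic term is exactly half of K.
midpoint = growth_curve(delta, P0, K, r, delta)
```

Note that the curve approaches P0 only as t goes far below delta, and P0 + K as t grows large, so P0 is a floor and not exactly the concentration at time 0.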


We can model this in PyMC3 like so (this is modified code from my previous yeast-counting blog article).

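import numpy as np
import pymc3 as pm

MILLION = 1e6
# TOTAL_SQUARES and SQUARES_COUNTED describe the hemocytometer grid and come from
# the previous article; the values here are assumed placeholders, so substitute your own.
TOTAL_SQUARES = 25
SQUARES_COUNTED = 4

yeast_counted =    np.array([20, 21,   28,   34, 34,   31,   32, 32])
hours_since_inoc = np.array([0,  12.5, 17.5, 23, 36.5, 42.5, 48, 65])
n_obs = yeast_counted.shape[0]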

def logistic(t, K, r, delta_t):
    return K / (1 + np.exp(-r * (t - delta_t)))

with pm.Model() as model:

    K = pm.Normal("K", mu=50 * MILLION, sd=25 * MILLION) # about 50% growth was expected
    P0 = pm.Normal("P0", mu=100 * MILLION, sd=25 * MILLION)
    r = pm.Exponential("r", lam=2.5)
    delta_t = pm.Uniform("delta_t", lower=0, upper=24) # lag phase stops in the first 24 hours

    yeast_conc = P0 + logistic(hours_since_inoc, K, r, delta_t)

    shaker1_volume = pm.Normal("shaker1 volume (mL)", mu=9.0, sigma=0.05, shape=n_obs)
    shaker2_volume = pm.Normal("shaker2 volume (mL)", mu=9.0, sigma=0.05, shape=n_obs)

    yeast_slurry_volume = pm.Normal("initial yeast slurry volume (mL)", mu=1.0, sigma=0.01, shape=n_obs)
    shaker1_to_shaker2_volume =    pm.Normal("shaker1 to shaker2 (mL)", mu=1.0, sigma=0.01, shape=n_obs)

    dilution_shaker1 = pm.Deterministic("dilution_shaker1", yeast_slurry_volume  / (yeast_slurry_volume + shaker1_volume))
    final_dilution_factor = pm.Deterministic("dilution_shaker2", dilution_shaker1 * shaker1_to_shaker2_volume / (shaker1_to_shaker2_volume + shaker2_volume))

    volume_of_chamber = pm.Gamma("volume of chamber (mL)", mu=0.0001, sd=0.0001 / 20)

    # why is Poisson justified? in my final shaker, I have yeast_conc * final_dilution_factor * shaker2_volume number of yeast
    # I remove volume_of_chamber / shaker2_volume fraction of them, hence it's a binomial with very high count, and very low probability.
    yeast_visible = pm.Poisson("cells in visible portion", mu=yeast_conc * final_dilution_factor * volume_of_chamber, shape=n_obs)

    number_of_counted_cells = pm.Binomial("number of counted cells", yeast_visible, SQUARES_COUNTED/TOTAL_SQUARES, observed=yeast_counted, shape=n_obs)

    trace = pm.sample(2000, tune=20000)
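
The Poisson justification in the code comments can be checked numerically. The sketch below compares a binomial with a very large count and a very small probability against a Poisson with the same mean. The numbers are round, hypothetical values loosely based on the prior means in the model (roughly ten million cells in the final shaker, each with a 1-in-100,000 chance of ending up in the chamber), not measurements.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability of k successes, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(k, mu):
    """Poisson probability of k events with mean mu, computed in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

# Hypothetical round numbers: many cells, tiny chance each lands in the chamber.
n_cells = 10_000_000
p_in_chamber = 1e-5
mu = n_cells * p_in_chamber  # expected number of cells in the chamber

# Largest pointwise difference between the two distributions over k = 0..200.
max_gap = max(
    abs(binomial_pmf(k, n_cells, p_in_chamber) - poisson_pmf(k, mu))
    for k in range(201)
)
```

With a count this large and a probability this small, the two probability mass functions are practically indistinguishable, which is why the simpler Poisson is used in the model.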

We are mostly interested in our posteriors for the parameters of the growth model.

Actually, that’s not true: we aren’t really interested in the parameters. Most don’t have an easy interpretation (except for delta_t). What we are really interested in is the posterior of the growth curve. Recall this is a distribution. To demonstrate this, we can sample from the parameters’ posteriors and drop those values into the growth curve. For example, if we sampled 4 times:

Each of these curves looks very different, which should give us pause when we make inference about our growth. What happens if we keep sampling growth curves, and then average over all of them - what does that curve look like? What about the error bars on that? This is easy to do (and part of the reason I appreciate Bayesian computation). On the same graph, I’m also going to plot hundreds of potential realizations, as it’s important not to get too focused on the mean as being the “truth” - there is still lots of variation!
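
As a rough sketch of that averaging step: evaluate the growth curve once per posterior draw, then take the mean and percentiles across draws at each time point. The draws below are randomly generated stand-ins so the snippet runs on its own; they are not my posterior. With the model above you would presumably pull the arrays from `trace["P0"]`, `trace["K"]`, `trace["r"]` and `trace["delta_t"]` instead. The helper name `growth_curve` is hypothetical.

```python
import numpy as np

def growth_curve(t, P0, K, r, delta):
    """Shifted logistic growth curve, vectorized over all arguments."""
    return P0 + K / (1.0 + np.exp(-r * (t - delta)))

rng = np.random.default_rng(0)
n_draws = 500

# Stand-in draws, NOT the real posterior (see the note above).
P0 = rng.normal(120e6, 15e6, n_draws)
K = rng.normal(70e6, 15e6, n_draws)
r = rng.exponential(0.4, n_draws)
delta = rng.uniform(0, 24, n_draws)

t = np.linspace(0, 65, 131)

# One curve per draw via broadcasting: result has shape (n_draws, len(t)).
curves = growth_curve(t[None, :], P0[:, None], K[:, None], r[:, None], delta[:, None])

mean_curve = curves.mean(axis=0)                            # pointwise posterior mean
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)   # pointwise 95% band
```

Plotting a few hundred rows of `curves` faintly behind `mean_curve` gives the "many realizations plus mean" picture described above.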

Lovely. We can see that the estimate for our initial concentration was about 122M ± 18M yeast/mL, and for our final concentration about 191M ± 15M yeast/mL. Eyeballing, the lag phase stopped just prior to 8h (though we need to reconcile this with the posterior mean of delta_t, which is 13.5).

What was our estimated yield? I mentioned previously that 50% is expected - how did we do? This computation is also easy in Bayesian inference: we just look at the distribution of the ratio of yeast/mL at time 60 over yeast/mL at time 0. Doing this, the average value of this distribution is 59%.
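
A sketch of that ratio computation, again with randomly generated stand-in draws in place of the real posterior, and a hypothetical `growth_curve` helper. I am assuming here that the yield figure means growth relative to the starting concentration, i.e. the ratio minus one.

```python
import numpy as np

def growth_curve(t, P0, K, r, delta):
    """Shifted logistic growth curve, vectorized over the parameter draws."""
    return P0 + K / (1.0 + np.exp(-r * (t - delta)))

rng = np.random.default_rng(0)
n_draws = 500

# Stand-in draws, NOT the real posterior; swap in the sampled parameter arrays.
P0 = rng.normal(120e6, 15e6, n_draws)
K = rng.normal(70e6, 15e6, n_draws)
r = rng.exponential(0.4, n_draws)
delta = rng.uniform(0, 24, n_draws)

# One ratio per draw: concentration at hour 60 over concentration at hour 0.
ratio = growth_curve(60.0, P0, K, r, delta) / growth_curve(0.0, P0, K, r, delta)

growth_pct = 100.0 * (ratio - 1.0)                 # percent growth per draw
mean_growth = growth_pct.mean()                    # point summary
interval = np.percentile(growth_pct, [2.5, 97.5])  # 95% credible interval
```

Because the ratio is computed draw by draw, the uncertainty in all four parameters carries through to the yield automatically.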


This was a fun little experiment to mix statistics and biology. Further extensions include adding covariates, and modeling death of cells too (there is a finite amount of energy in the medium, so this should happen eventually). For now, however, I’m washing the yeast and eventually going to dehydrate them.


visualizing artisanal data

A sampling of the submissions received this month


This month, we explored the concept of artisanal data: a dataset collected entirely on your own—electronically, manually, via surveys, or by observation. Guest author Mike Cisneros challenged us to analyze and find the conclusions that we can have full confidence in because we are the true stewards of the data.  

Fifty-three readers submitted their bespoke visualizations. Accountability was a recurring theme: many commented that their intent was to hold themselves accountable for achieving a goal, or that this exercise prompted them to begin doing so. Not surprisingly, hot topics included exercise, weight loss, and spending. It was neat to see the range of approaches and tools that readers applied when creating their visualizations, and we encourage you to scroll through the entire post to be inspired by how your peers collected, analyzed and were influenced by their unique datasets.

When reflecting on the submissions, Mike observed the thoughtful considerations participants applied:

“There are many lessons to be gleaned here; not only from the data you chose to collect and the way in which you presented it, but also from the issues and considerations raised in the process. In setting out the challenge to collect your own data this month, I had worried that it might be too burdensome for people; thankfully, dozens of you proved me wrong, and I thank all of you for your efforts.”

Scroll through the entire post to see further commentary from Mike with examples of these underlying themes: data can be humanizing (Andre and Alyssa), visual metaphors can be evocative (Hilje, Kate and Tiffany) and individual experiences can be universal (Julia and Penola).

Other standout entries include Colin and Lotte, who applied an effective takeaway title to their exercise trends while Liz selected clean, uncluttered charts to highlight trends in her personal hobby, reading. Jared took a trip down memory lane with a cool visual timeline of his evolving professional development and soft skills while Lance employed annotations with categorization to visualize the length of names in his family. Several readers discovered something perhaps previously unknown: Haley realized the effect of Amazon Prime on her spending habits, Rebekah validated how she’s been occupying her time post-grad school, Tania discovered the impact of clicking on Gmail ads, and Pris evaluated consistency in implementing her New Year’s resolution.

As an added bonus, Cole highlighted the benefit of having a guest author this month: she was able to participate! Her submission provided a peek into the labor-intensive process of writing her second book, storytelling with data: let’s practice!, which will be published soon.

To everyone who submitted examples: THANK YOU for taking the time to create and share your work! The submissions are posted below in alphabetical order. If you tweeted or thought you submitted one but don't see it here, upload your submission as a .png here and we'll work to include any late entries this week (just a reminder that tweeting on its own isn't enough—we don't have time to scrape Twitter for entries.)

The next monthly challenge will launch on June 1st. Until then, check out the archives of previous months’ challenges on our #SWDchallenge page. Happy reading!


For this month's #SWDChallenge I have pulled together my Work Train travel for 2019 up to the end of April. I travel quite a bit for my role as Snr Data & Viz officer and we generally are required to travel for product development meetings, team planning meetings, current project face to face meetings and team working days. Our core team is disparately located around the UK, meaning meet ups are not always quick commutes.
Interactive viz | Blog


Interactive visualization of organic home garden productivity over time, including soil amendment and environmental effects. Total production by growing season & vegetable category is on top. Selected by filter, the lower graph displays individual vegetable yields with environmental input data. Future year iterations with sufficient historical data may include a yield forecast.
Interactive viz


Alyssa’s submission exemplifies how data can be humanizing. Mike notes, “Alyssa’s submission hit on several key topics in data visualization: trust, both between collector and visualizer and trust in the accuracy of the observations; careful handling of personally-identifiable information (PII); the importance of not depicting subjective categories as having absolute values; and in being transparent about what data is not shown, as well as what is shown.”


I created this dataset from the narrative sleep log of an anonymous patient at the Counseling Center. They kept incredibly detailed notes, and any uncertainty is reflected in the design (missing data, not putting precise labels on hours slept, etc). I feel incredibly lucky to have been trusted with this dataset and hope that the final product reflects one part of the way PTSD impacted the patient's life. A note on the design: the amounts of sleep designated as incapacitated/impaired/operational/rested are intentionally not labelled for three reasons. First, everyone's sleep needs are different, so the number of hours of sleep isn't as informative as the impact that sleep has on a person's life. Second, I want to avoid pissing contests over sleep deprivation ("wow, you consider THAT impaired? I haven't slept more than that in fifteen years," etc). Third, the number of hours of sleep in the log were rounded to the half hour, and according to the patient may have been off by 15-20 minutes since it's difficult to know the precise moment they fell asleep. Therefore, I would prefer to give a vaguer impression (low end of operational, high end of impaired, etc) rather than facilitate falsely precise estimates of the number of hours the patient slept.

The humanizing factor carries over to Andre’s submission. Mike notes, “Andre turned the challenge into an opportunity to combine his interests with his young son’s interests, allowing them to work together on a visualization they could both enjoy. While we hope, at times, to make human connections through the outputs of our data visualization process, Andre showed us that we can also strengthen human connections through its creation.”


My 5 year old son, Túlio, absolutely loves pokemon cards. He doesn't yet know all the rules but is fascinated by the cards' types, numbers, colors and powers. Since he spends A LOT OF TIME looking at and analyzing each one of them, I thought it would be interesting to build a visualization showing all of his cards. After cataloging all 153 of them, we made a histogram and a bar plot with pencil and paper. Then I thought it would be a good idea to show the correlation between the cards' "attack damage" and "health points". I also used color to distinguish the cards' "evolution stages". Túlio is really getting into data visualization and I am really getting into pokemon cards!


This entry has the most personal data: my own blood pressure and heart rate. The motivation was to obtain a view over time of these data so that I could track trends and also to share it with my physician. The data was collected from the OMRON BP786N blood pressure monitor, and then recorded in a simple CSV file containing date, systolic pressure, diastolic pressure and heart rate. A script called "bpadd" records the data and then calls a decksh script to visualize and show the data.



In my kitchen ceiling there is a slow onset water damage that seems to happen over time. The upstairs neighbor has no water near the floor that could be responsible for this. A contractor suggested it must be condensation build-up running down onto the ceiling, slowly causing this damage. I wanted to investigate this further and confirm that rain was not running down inside the walls to cause this damage. To this end, I built a crude moisture sensor using a couple of transistors, resistors and two needles I could stick into the drywall. To this I attached an Arduino with a timer flash drive logger in order to measure the moisture over time. I then plotted precipitation on top of this moisture data using a JavaScript chart library to see if rain has anything to do with this. This data was collected from the Midwest in June, so it was hot outside. I believe the spikes visible in the chart correlate with the air conditioner removing moisture from the air and the sensor is measuring the moisture over the surface of the wall rather than internally. This was a fun project where data visualization helped me see that rain does lead to an overall increase in moisture within the drywall!



Starting in March 2019, I really got into data visualization (viz) and started practicing with data viz community projects, like the #SWDchallenge. Practice fosters improvement and learning! This month’s #SWDchallenge involved collecting our own data, and I have been keeping track of my weight. Here, we look at my 2019 weight with a focus on the time (before and) after I really got into data viz. It looks like some of my weight loss is correlated with my newfound data viz enthusiasm!


My family participates in an annual Thanksgiving Day 5K run, the Trinity Turkey Trot in Princeton NJ. I have an informal approach to getting in shape for this and other 5Ks, with mixed results. Peter Drucker says that “you can't manage what you can't measure.” With this adage in mind, I started tracking my workouts in late 2015. The free MapMyRun web site, sponsored by Under Armour, provides the tools I need. I draw each course I run on a road map of the area and store the course for future reference and comparison. By course and date, I record information on individual workouts: length, time, and number of steps. I also track time spent doing other types of exercise. MapMyRun calculates Average Pace per Mile. I was able to download my MapMyRun workout records to a .csv file. I then imported it into Excel, where my first assumption, “the data is clean”, was dashed. Date of workout was either in DD-MMM-YY format or a string “MMM. DD, YYYY”. Rather than exercising Excel functions or manually correcting rows, I imported the dataset to Tableau Public, which decoded the formatting, and set me up to use Tableau Public to build the chart. I kept Workout Types that were relevant to the analysis of my running, and discarded others. Average Pace per Mile values were generally reasonable. I discarded rows with zero or extremely large numbers. I anticipated that my running pace leading up to a race would impact my 5K race time. After looking at the data, I think that the number of workouts in the weeks before a race is also important. The chart itself took several iterations to get to where it adequately showed what I envisioned. Not perfect, but close enough.
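
For readers who hit the same mixed-date problem without Tableau, here is a small Python sketch of one way to handle it: try each known format in turn. The two `strptime` patterns are assumptions about what “DD-MMM-YY” and “MMM. DD, YYYY” look like in the export, and the function name `parse_workout_date` and sample strings are hypothetical.

```python
from datetime import datetime

# Assumed patterns for the two formats described above.
FORMATS = ("%d-%b-%y", "%b. %d, %Y")

def parse_workout_date(text):
    """Return a date from either supported format, or raise ValueError."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")

# Hypothetical sample values in each format.
first = parse_workout_date("23-Nov-17")
second = parse_workout_date("Nov. 23, 2017")
```

The `%b` code expects three-letter English month abbreviations under the default locale, so an export that writes something like “Sept.” would need an extra cleanup step.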



This submission tracked all the expenses that I had for my cars from when I purchased my first car in July 2012 up to April 2019. This was the 3rd of 4 visualizations; starting with total overall expenses, then stripping out major expenses before getting to minor expense breakdown. It was eye opening to see how much I used to spend on gasoline and on maintenance!


If you are often tired and then track your sleep but don't change any of your sleep habits, then mostly what you'll learn is that you aren't getting enough sleep (which you already knew because you are tired). I did find out that I tend to get a little more sleep on Thursday and Friday evenings, when I would have expected to get the most on the weekends. The missing days are when I went to bed but didn't put my fitness tracker on. (I only wear it at night to track my sleep.) I used the FitBit API and an R script to get the raw data. Tableau was what I used to create the visualization. You can hover over each day and see the waterfall of that night's sleep.
Interactive viz



I wanted to evaluate my grocery expenses before and after signing up for the $5 Meal Plan service. It also occurred to me that other food expenses such as restaurants, fast food, coffee shops, etc. should be included since groceries could be substituted for eating out and vice versa. I pulled my dataset from, after re-categorizing many transactions, and used Power BI to create the visualization.



One benefit of a guest #SWDchallenge is that I get to participate! My submission plots my progress over the past year writing my second book, storytelling with data: let's practice! General learnings: I write best when I travel and have a harder time concentrating at home (home days are better spent planning content or editing). I also write well from cafes: the background buzz puts me in my head in a way that works well for getting words out. Also: next time, I should collect better data if I want to make a graph—it would have been great to have other metrics, like word count, and also I found myself wishing for more frequent and consistent data points over time. This is only part of the picture: the book includes over 100 hands-on exercises and more than 250 visuals. I'm looking forward to sharing it with you and it should be available this fall!



I started wanting to know how many days I had walked 10,000 steps or more.  So I downloaded my fitbit data for the last 30 days. As I work some days and provide childcare on others as well as family days out, I spliced the data by weekday to see if there were any trends. I created a bar chart of average steps to show which days I was less likely to hit my 10k steps target. The line chart breaks this up to show that on two days I hit my target less often. But it also highlighted that on the other days there was a large variance between the number of steps counted. Analysing the data as aggregate bar charts and individual points on a line made me think about different ways I could improve my average step count. Looking back over my activity for April made me realize how varied my walking patterns were for each day of the month.



This month's #SWDchallenge gave me the push to create something I've been wanting to visualize for a while now, a timeline of my dog's life. Next week he will be 15 years old!
Interactive | Twitter



After reading the book Dear Data I started collecting some of my personal data with the purpose of visualising it later this year. With this month's #SWDChallenge I had a great opportunity to start visualising one of these datasets. I decided to visualise the data about the cups of coffee I drink during the day.




I've tracked our home utility use for years as a means to monitor the efficiency of the heating system. Several years ago we replaced an old furnace, and I used this challenge to estimate the annual savings. While I knew the new furnace was saving us money, the annual savings was less than expected.



I wanted to visualize my grocery shopping from the last couple of months. Did I spend more every month? Did I place more orders? Simple curiosity!


Earlier this year I used the Strava web connector to import my running data into Tableau. I was aiming to make a visual exploration of my running activities. Although I knew them quite well, I was wondering if a lot of difference would appear, for instance in intensity, location and length of the activities between 2017 and 2018. I think it did. I not only ran more in 2018, I also enlarged my running area, especially while preparing for a big ultra run in Austria.



Because of a health problem, for some months the only physical activity I can do is walk. To stimulate myself a little I use a heart rate monitor and these are the data from September to November 2018. I can select a date and see how much I walked that day, how long and the length of my stride.
Interactive viz | Blog


I downloaded my order history from Amazon and saw that I started ordering a lot more in 2016, the year I signed up for Amazon Prime. That caught my attention, so I pulled out some other data points that I found interesting too. This was created in Tableau.


I use the library a lot and wanted to explore how many items I had on loan each day during 2018. The number of items went from 7 to 24 and I always had something on loan. Unfortunately I didn't have time to add annotations, but I can point you to a few things, like March, with the sun out and people enjoying the snowy and sunny season after the long dark polar night, or the road trip we took in July.

Hilje’s submission demonstrates that visual metaphor can be evocative. Mike writes, “The design replicates the look and feel of a festival map. As she explains in her blog, the stories themselves are, in many cases, as interesting as (or more interesting than) the aggregated data. Her choices in what she chose to emphasize, and how, truly honor that consideration.”


I collected data on the ultimate festival experience. You can read the entire story on my blog.



I initially wanted to create a timeline of, well, something. After a bit of a brainstorm it hit me. I've recently been reflecting on past roles, and the skills I've been learning along the way, so why not visualize that?


For the May 2019 #SWDChallenge, we were challenged to work with a dataset that we have collected or created ourselves and use that data to create any type of chart we would like. I decided to create a visual profile of the content that I consume via Twitter and Podcasts. I would love to see a content profile on friends, family and the people I engage with online. Digital Marketers are already building this type of stuff into their algorithms, I'm sure. For my visualization, I decided to create a chart that is often used for population comparisons...a Population Pyramid. Although this isn't the type of data normally used in a Population Pyramid, I thought it was an effective way to represent the data. Hopefully you agree, but please reach out if you have feedback or suggestions.



As a BI Consultant in the Netherlands, I travel a lot to my clients across the country. Ever since I started working as a consultant in 2014, I have been saving all the locations I visit in Google Maps. For this month's #SWDChallenge I downloaded all these locations from Google Takeout and loaded them into Power BI. I visualized the result with a heatmap.


Julia’s submission shows us how individual experiences can be universal. Mike comments, “Julia used a technique that I always appreciate, which is to use real-life, universal comparisons that help audiences grasp the magnitude of certain things; in this case, comparing the length of yarn she has used for her knitting projects over the last six years to the distance between two cities in Europe.”


I was curious what a summary of my knitting project history would look like. I took my data from Ravelry and plotted it using R and GIMP.


Kate also shows us the power of a visual metaphor. Mike commented, “Kate’s animated coffee drinking viz also made use of an interesting visual metaphor, in the sense that she designed a radar chart that was reminiscent, in motion, of a coffee stain slowly spreading across a table. Sometimes these small touches make our work memorable in a way that a less-considered visual approach might not.”


I’d like to share an updated animated viz of all the coffee I drank in Italy while on vacation in February. I had fun trying to visualize this using a clock and learned some new animation tricks. You can watch my animation and read all about the “rules” of Italian coffee drinking and whether or not I followed these “rules” at my blog.



In January 2019, I bought a pack of 14 colored bands for my Fitbit. I tracked the color I wore each day, wondering which would end up being the most-used. I actively tried to avoid wearing black or gray, but still wore those neutrals one out of every three days. The biggest surprise was pink at number two, because I rarely wear pink clothing or accessories! I realized that this particular pink is fairly close to my skin tone, so I was treating it as a neutral.

May19SWDChallenge - Kelly Gilbert.png


While it took me some time to settle on what data to use for this challenge, once I settled on this topic I had a lot of fun putting it together (in Tableau). While the topic might be a little different, I have complete confidence in the quality of the data and the conclusions it produced!


This month’s #SWDChallenge by guest author, Mike Cisneros, inspired me to put to use a data sample I tracked in Observe, Collect, Draw: A Visual Journal. I visualized the data using a bar chart and added annotations and images to bring life to the chart.



I chose to use my Goodreads data from this year for this challenge. This visualization went through a few iterations, so big thanks to my boss, Chris, for helping me refine it! I made this visualization with Tableau.


My entry could also have the subtitle “How not to train for a 10K run”. In the spring of ’18 I signed up for a 10K run in October. After the run, my hip started causing problems and I had to cut down on my exercise. I am down to running 3K and only slowly increasing the distance. Loads of people do 10K runs and I wondered why I couldn’t do the same without injuries. The chart clearly shows the steep increase in distance. It does not show that I also played with speed and altitude. Everything you’re not supposed to do at once. The chart also shows that when I started training for the run, I also skipped the visits to my rowing club, missing out on maintaining some big muscle groups. Since the data set is mine, I did a bit of manipulation with the data. April, for instance, showed that I had been out riding. I knew that was not true; I had simply chosen the wrong type of workout in my tracker and I changed it to rowing. I also changed an orienteering run in October to running for simplicity. I considered adding comments in the chart but chose to leave the space on the right-hand side empty and let the lack of exercise “do all the noise”.
A 5K run is coming up in June and I have learned my lesson… :-)


I visualized Netflix activity data for our Gilmore Girls watching trends in the Spring of 2019 using Excel. What I found was that the more we disliked a season, the more episodes we watched per day (quicker rate). So we like to SAVOR our favorites.


For my submission I used weight and nutrition data stored in MyFitnessPal. Particularly I wanted to explore any connection between an effort I have been making in reducing my carbohydrates and my long lasting goal of weight loss. The goal was to show if there was a visible correlation.


Inspired by my university experience, I decided to do an introspection about my erratic (and often inadequate) sleep schedule. Doctors usually recommend between 7 and 9 hours of sleep. I used my sleep patterns from my Fitbit to visualize how I sleep. This helps me see where I can improve upon in terms of budgeting my daily schedule.


It was my first freelancing SAP HANA implementation project which involved creation of new ABAP reports on SAP HANA system. As the project was critical for go-live, I worked on weekends too and sometimes extended for more than 9 hours per day. DECO report was an MIS report which was a mini project itself. My aim was to project the object which I worked upon and the time spent on it; for which the Gantt Chart was the perfect choice.



Hi, when the description of this month's challenge came up I knew exactly what topic I would use. The main love of my life (apart from wife and kids obviously) is exercise and in particular being outside. Out of interest in 2007 I started keeping very simple records of how much I did. When I look back on it now over 11 years, it's like a barometer of some of the main milestones and events in my life. I also use it to make sure I'm always at a certain level of exercise as I get older and busier with family and work.

Paul C

Number of steps and heartbeats per minute since purchasing my Smart Watch.


Paul T

I entered a 6 week fitness challenge and got a bit carried away. Lost 11kg in 6 weeks and had a lot of fun doing so (oddly).

Penola demonstrates how individual experiences can be universal. Mike notes, “I loved the way that Penola combined two numeric variables with a qualitative variable, because the story of her viz wasn’t in how a stock price fluctuated over time, but rather in how her friend reacted to those variables. The humble line chart, presented in this manner, conveyed emotion and connected with an audience through that universal recognition of hope, excitement, panic, sadness, and relief.”


I chose to use personal data of texts received from a friend investing in the stock market for the 1st time. Being a dedicated & passionate vegan, she wanted to invest in the Beyond Meat IPO and show support. Texts received from her were numerous. I did not include every detailed text, just those that highlighted her experience. Approval was granted to use texts.



This was my first challenge, need to learn more about data viz.



My 2019 New Years Resolution was to become more consistent with my meditation. I collected my meditation data from the Headspace app to create a visualization showing my sessions completed and trends.



I used notes I leave on all the recipes I try in my cookbooks as the data for this challenge, creating a basic bar chart to show the number of new recipes I tried each month over a two year span. My hypothesis was the year post-grad school would have significantly higher numbers per month compared to the year during grad school.


After suffering a TBI in January 2015, Alice's life changed forever and an intense routine of therapies is now her new normal. This visualization helps us to keep track of her claims with our health insurance plan.



Four years ago my wife and I had our first child. I had a new job at Tableau and was a baby in the data visualization world. We had a lot of challenges keeping track of when and who was feeding our baby. "There's got to be an app for that!" There was. We started logging everything and kept it up for the other two babies that came along. We had all this data and started to ask questions like, "when do we drop this nap?", "When do we go from 5 bottles a day to 4?". The app had some decent visualizations, good for looking at daily perspectives and none for comparing other kids. I exported the data as a csv and went to town.
Interactive viz



In this challenge, I looked into my effort to lower my commute costs and how efficient my changes are.
interactive viz



Kea is a pet parrot with a disease that has no cure, requires injections to manage symptoms, and requires careful weight monitoring. The data used are her weights recorded in 2019. I learned that the longer days (more vitamin D) have no obvious effect on her weight. House guests and the injections likely lead to a decrease in weight, but not always. I already knew she had stayed above the concern level of 540 grams. I used different colored and larger markers for the injection dates, to note her benchmark high weight, and for a special condition of multiple tornadoes in the area. The injections add to her stress, which can lead to weight loss; and the tornadoes were an unusual event that added to everyone's stress. Annotating the benchmark high weight added info and was helpful since that value is above the highest axis value shown. Since it's the only point above 565, eliminating the '570' value on the axis got rid of visual clutter. I used thicker orange lines to note when there were house guests, and thin gray lines otherwise. House guests can add to her stress, which leads to weight loss. I added annotations for context of the vertical axis, legend info, and for the horizontal line at 540 grams. I added title as an annotation to better control font size and color, and position. I used a white text box to cover up the legend of 535 on the vertical axis, which wasn't relevant but was needed to get the axis values that I wanted. I eliminated the horizontal axis and the grid lines on the vertical axis because they didn't add information. I intentionally started vertical axis values shown at the 'concern level' of 540 rather than lower values, such as 500 for the 'critical' level. Changes of 3 grams or more can be significant, especially if there are multiple days with decreases. The minor tick marks (hidden) are 1 gram changes. (Scale is accurate to 0.5 grams.) 
I explored eliminating markers for 'normal' days and decided to use very small markers of the same color as the connecting lines. Thought this added information without being visually intrusive. I tested printing in grayscale to make sure color choices did not lead to visual confusion on a non-color printer. Avoided red/green issue for colored markers.



I chose to download my twitter archive for analysis... the initial impetus was that I noticed that I have been primarily retweeting recently, and I wondered what my patterns of use have been over the years that I have been using Twitter. The most interesting thing I saw was that since I tend to spend the summer outside, and often on the road, my August tweets are only about 2% of the total. I'm basically offline for the month of August, and this has apparently been consistent for the last 9 years.



For this month’s challenge I attempted to learn a new technique in Tableau, a matrix style view.  Using made up data I built this. Happy with the outcome and with real data there may well be some worthwhile trends and stories to pick out in future versions.
Interactive viz



This is an analysis on the groceries I have bought over time.
Interactive viz


Since I first saw Gmail ads, I have always insisted I would never be the kind of person to click on them due to how creepy they are. Now that I can easily get all my data from here, I thought I'd put this assumption to the test.

Tiffany demonstrates that visual metaphor can be evocative. Mike notes, “Tiffany’s tale of KonMari-ing her closet was presented in a manner that built slowly over time as the audience interacted with it, mimicking the way the piles would grow and be sorted in a real-life KonMari process.”


Interactive data viz best seen on desktop. The data collected was an inventory of the items in my closet and their resulting fate during a KonMari decluttering session. The data took some time to collect, making the decluttering even harder!


Click ♥ if you've made it to the bottom—this helps us know that the time it takes to pull this together is worthwhile! Check out the #SWDchallenge page for more. Thanks for reading!

Continue Reading…


Read More

Unfolded ISTA and Orthogonality Regularization -implementation -

** Nuit Blanche is now on Twitter: @NuitBlog **

Xiaohan sent me the following a few months ago:
Dear Igor,
I'm a long-time fan of your blog and want to share our two recent NIPS papers with you. One is a theory paper on unfolding sparse recovery algorithms into deep networks (and a spotlight oral of NIPS'18); the other is an empirical exploration of applying orthogonality regularizations to training deep CNNs, with many techniques inspired by sparse optimization.
The first paper proves the theoretical linear convergence (as upper bound) of unfolded ISTA networks (LISTA), and proposes two new structures (weight and threshold) to facilitate that fast convergence and significantly boost performance. The work is done in collaboration with Jialin Liu and Wotao Yin in Math@UCLA.

The second paper proposes several orthogonality regularizations on CNN weights, by penalizing the distance between the Gram matrix of weights and identity under different metrics. We show that orthogonality evidently accelerates and stabilizes the empirical training convergence, as well as improves final accuracies. The most powerful regularization was derived from the Restricted Isometry Property (RIP).


It would be great if you could distribute their information to potentially interested audience on nuit-blanche.
Best regards,
Xiaohan Chen
Dept. Computer Science & Engineering
Texas A&M University, College Station, TX, U.S.

Thanks Xiaohan !

In recent years, unfolding iterative algorithms as neural networks has become an empirical success in solving sparse recovery problems. However, its theoretical understanding is still immature, which prevents us from fully utilizing the power of neural networks. In this work, we study unfolded ISTA (Iterative Shrinkage Thresholding Algorithm) for sparse signal recovery. We introduce a weight structure that is necessary for asymptotic convergence to the true sparse signal. With this structure, unfolded ISTA can attain a linear convergence, which is better than the sublinear convergence of ISTA/FISTA in general cases. Furthermore, we propose to incorporate thresholding in the network to perform support selection, which is easy to implement and able to boost the convergence rate both theoretically and empirically. Extensive simulations, including sparse vector recovery and a compressive sensing experiment on real image data, corroborate our theoretical results and demonstrate their practical usefulness. We have made our codes publicly available: this https URL.
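For readers unfamiliar with the baseline being unfolded, here is a minimal NumPy sketch of classical ISTA for the LASSO objective. It is not the authors' LISTA code: the unfolded network would replace the fixed matrix and threshold below with per-layer learned ones. The problem sizes, the value of `lam`, and the synthetic signal are assumptions chosen only for illustration.

```python
import numpy as np

def soft_threshold(v, theta):
    """Elementwise shrinkage: sign(v) * max(|v| - theta, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(A, b, lam, n_iter=500):
    """Classical ISTA for min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # Gradient step on the least-squares term, then shrinkage.
        x = soft_threshold(x + A.T @ (b - A @ x) / L, lam / L)
    return x

def objective(A, b, lam, x):
    """LASSO objective value at x."""
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Synthetic sparse recovery problem (all values are illustrative assumptions).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100)) / np.sqrt(50)
x_true = np.zeros(100)
x_true[[3, 27, 64]] = [1.5, -2.0, 1.0]
b = A @ x_true

x_hat = ista(A, b, lam=0.01)
print("objective at zero:", objective(A, b, 0.01, np.zeros(100)))
print("objective at x_hat:", objective(A, b, 0.01, x_hat))
```

Each pass through the loop corresponds to one "layer" when the algorithm is unfolded; the abstract's claim is about how fast such layers can converge once their weights are learned with the proposed structure.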

This paper seeks to answer the question: as the (near-) orthogonality of weights is found to be a favorable property for training deep convolutional neural networks, how can we enforce it in more effective and easy-to-use ways? We develop novel orthogonality regularizations on training deep CNNs, utilizing various advanced analytical tools such as mutual coherence and restricted isometry property. These plug-and-play regularizations can be conveniently incorporated into training almost any CNN without extra hassle. We then benchmark their effects on state-of-the-art models: ResNet, WideResNet, and ResNeXt, on several most popular computer vision datasets: CIFAR-10, CIFAR-100, SVHN and ImageNet. We observe consistent performance gains after applying those proposed regularizations, in terms of both the final accuracies achieved, and faster and more stable convergences. We have made our codes and pre-trained models publicly available: this https URL.
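As a rough illustration of "penalizing the distance between the Gram matrix of weights and identity", the sketch below uses the squared Frobenius norm as the distance. That choice of metric, the plain 2-D weight matrix (a convolution kernel would first need reshaping), the step size, and the function names are all assumptions made here for illustration; the paper compares several metrics and this is not its released code.

```python
import numpy as np

def soft_orthogonality_penalty(W):
    """Squared Frobenius distance between the Gram matrix W^T W and the identity."""
    gram = W.T @ W
    return np.sum((gram - np.eye(W.shape[1])) ** 2)

def penalty_gradient(W):
    """Gradient of the penalty with respect to W: 4 * W (W^T W - I)."""
    return 4.0 * W @ (W.T @ W - np.eye(W.shape[1]))

rng = np.random.default_rng(0)
W0 = 0.5 * rng.standard_normal((8, 4))  # hypothetical layer weights

# A matrix with orthonormal columns has (numerically) zero penalty.
Q, _ = np.linalg.qr(W0)
print("penalty of orthonormal Q:", soft_orthogonality_penalty(Q))

# Plain gradient descent on the penalty alone pulls W toward orthogonality.
W = W0.copy()
for _ in range(200):
    W = W - 0.01 * penalty_gradient(W)
print("penalty before:", soft_orthogonality_penalty(W0))
print("penalty after: ", soft_orthogonality_penalty(W))
```

In training, a term like this would be added to the task loss with a small coefficient rather than minimized on its own.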

Follow @NuitBlog or join the CompressiveSensing Reddit, the Facebook page, the Compressive Sensing group on LinkedIn  or the Advanced Matrix Factorization group on LinkedIn

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

Other links:
Paris Machine||@Archives||LinkedIn||Facebook|| @ParisMLGroup
About LightOn: Newsletter ||@LightOnIO|| on LinkedIn || on CrunchBase || our Blog
About myself: LightOn || Google Scholar || LinkedIn ||@IgorCarron ||Homepage||ArXiv

Continue Reading…


Read More

✚ Setting Visualization Expectations to Avoid Audience Confusion (The Process #41)

People misinterpret charts all of the time, because they go in with the wrong expectations before they even fully interpret what a chart is about. Read More

Continue Reading…


Read More

Royal Society of Biology: Introduction to Reproducible Analyses in R

(This article was first published on Emma R, and kindly contributed to R-bloggers)

Learn to experiment with R to make analyses and figures more reproducible

If you’re in the UK and not too far from York you might be interested in a Royal Society of Biology course which forms part of the Industry Skills Certificate. More details at this link

Introduction to Reproducible Analyses in R

24 June 2019 from 10:00 until 16:00

The course is aimed at researchers at all stages of their careers interested in experimenting with R to make their analyses and figures more reproducible.

No previous coding experience is assumed and you can work on your own laptop or use a computer at the training venue, the University of York.

To leave a comment for the author, please follow the link and comment on their blog: Emma R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Exploring feedback from data and governance experts: A research-based response to the Data Transparency Advisory Group report

Facebook has undertaken a number of efforts to increase transparency and facilitate accountability and oversight by providing insight into our process for creating metrics that are meaningful and relevant both internally to our teams and externally to the broader community of people who use Facebook. To inform this process, we have a set of principles that govern how we think about metrics and analytics. The principles are grounded both in Facebook’s company values and in research principles on transparency and accountability.

We developed metrics to comport with these principles. But we also wanted to ensure we had meaningful input from experts who had studied transparency in the context of governance. This involved both exploring the existing range of academic research around data transparency specifically and governance models more generally, and building a formal process for seeking and incorporating expert feedback.

To ensure we could appropriately balance benefits and risks of transparency in creating valid and informative metrics, we established a formal process to solicit feedback and provide a public assessment of our metrics. We established the Data Transparency Advisory Group (DTAG), which comprises international experts in measurement, statistics, criminology, and governance, with the core function of providing an independent, public assessment of our Community Standards Enforcement Report (CSER) specifically and our measurement efforts related to content moderation more broadly.

Defining research-based criteria for data transparency

A long history of social science research has contributed to our understanding of the effects of transparency, including economic and political science research on the role of transparency in building effective and efficient governance (Stiglitz, 2000; Lindstedt and Naurin, 2010; Brunetti and Weder, 2003) and political philosophy and legal theory on the role of transparency and open deliberation in the civilizing effect on political behavior and generating procedural fairness (Cohn, White, and Sanders, 2000; Elster, 1998).

These bodies of research have been reviewed and applied to social media in several different contexts, notably in the Santa Clara Principles, a theory- and research-based set of recommendations for tech companies. Taken together, this social science and legal literature focuses on transparency as a key tool to ensure accountability and fairness in institutions charged with defining and enforcing rules and standards. We support the spirit of the Santa Clara Principles on Transparency and Accountability in Content Moderation and, informed by the DTAG report’s findings on the challenges of content moderation at scale, are committed to continuing to share more about how we enforce our Community Standards in the future.

At the same time, it is worth noting that transparency is not without its limitations and risks, especially when that transparency involves data and metrics. Karr (2008) summarizes three key concerns, typically associated with public use of government information, which also apply to the data that technology companies would release on content moderation:

  • Addressing an inherent tension between the comprehensiveness of the data and the comprehensibility of information to individuals not intimately familiar with underlying processes.
  • Managing the trade-off between providing appropriate levels of detail to ensure the data is useful while also ensuring there are appropriate protections for private or confidential information.
    • The risk of misinterpretation is higher for data based on information collected as part of operational processes. Often, these data are not structured in a way that purpose-built data sets (e.g., census data) are and thus can be difficult to interpret and face higher potential for misinterpretation. This issue often arises with crime enforcement statistics that are typically derived from police reports and court data used for individuals in these systems to execute their core functions rather than for measurement purposes.
  • Some information that is valuable for transparency is also sensitive and/or dangerous to release publicly.

Drawing insights from this research on transparency, we developed metrics focused on two distinct but related components.

First, does the metric reflect how we think, prioritize, and assess ourselves internally? The reason for this criterion is that it allows external scrutiny and public assessment of the problem space and the outcomes of our efforts, which can enable accountability.

Second, does the metric allow insight into our processes or actions? Although aggregate counts are not sufficient to provide process insights, the metrics we release provide the information needed to understand the frequency and nature of actions we take and for others to judge for themselves whether those actions appear appropriate in the context of the overall problem space.

Approach to external assessment

Naturally, this process resulted in metrics that balanced some of the external requests and suggestions with the technical constraints and realities of enforcement at scale. The next part of the measurement process then related to whether the metrics we viewed as most relevant and operationally feasible were also metrics that met the objective of allowing insight and ultimately public accountability for our moderation processes. This is why we established the DTAG.

The DTAG was asked to answer this core question: Do the publicly available metrics provide accurate and meaningful measures of the problem area and Facebook’s activities to address it? It was not set up to audit Facebook’s data labeling, data use, or privacy protection practices. As such, the DTAG’s findings reflect its review of the processes and information provided to the group, but not unrestricted access to Facebook systems and processes. This approach helped balance the critical role of external review of our processes with protections for user privacy and complexities in the underlying data architecture that would have limited relevance for external analysts.

To support their assessment, Facebook provided the DTAG with detailed confidential information about how we develop our policies, how we detect violations of these policies at scale, and how we have developed measurement-specific processes to sample, label, and quantify violations of policies. These measurement processes were intended to create analytically rigorous data that would accurately capture the scale and scope of content violations on the platform for different types of violations.

The DTAG members were not given access to user data and did not audit the underlying data used to construct the metrics, but they were provided with more detailed breakdowns of the counts and inputs into the metrics. They also had access to the engineers and data scientists who designed and executed these metrics to answer questions or concerns and provide any requested details on how the metrics were developed.

The group reviewed, assessed, and provided feedback about our approach to measurement along two parallel tracks. First, we worked closely with the group to present our existing metrics, how and why we developed them, technical constraints, as well as best practices in how these metrics should be defined and calculated. This step was critical to bring a range of outside expertise directly into our analytic process — for technical statistical applications, but also to ensure these metrics would comport with best practices. We used their findings both to improve how we defined our metrics and to ensure the information we released alongside the numbers usefully informed readers about our processes and practices.

Second, the group was charged with providing an external report detailing our methods in the context of our processes to provide an external assessment of whether these metrics were in fact a reasonable approach to measuring violations of Facebook’s rules and standards. The DTAG was among the first set of external technical experts who were exposed to the scale and scope of Facebook’s measurement operation and then permitted to freely publish their independent assessment. They recently released their external report including these assessments and their overall recommendations.

This report was reviewed only to ensure no predefined confidential information was included. The findings and recommendations are those of the DTAG and represent an external perspective on the metrics and measures contained in the CSER specifically and Facebook’s efforts to conduct rigorous measurement on violations of its rules and standards more generally.

Summary of DTAG findings and recommendations

After a rigorous review, the DTAG found our processes reasonable and that our core metrics are consistent with best practices from other settings. In particular, the DTAG noted that with hundreds of thousands of pieces of content added every minute, the processes combining automated and human detection and review were appropriate. The findings thus also highlighted the inherent trade-offs and technical challenges that must be balanced in building an effective detection and enforcement regime at scale. The DTAG concluded that the metrics in the CSER were reasonable ways of measuring violations of our Community Standards and that they comport with the best practices that are most analogous to metrics of crime currently published by a number of different governmental agencies globally. For more detailed information on the DTAG findings, see this Newsroom post.

Based on its assessment, the DTAG also offered a number of suggestions that Facebook is now systematically reviewing and determining how best to address. The DTAG offered 15 recommendations in its report.

Table 1 below summarizes the DTAG recommendations and how Facebook is incorporating its feedback into our broader transparency and measurement efforts. Of the 15 recommendations:

  • Five recommendations will be implemented in upcoming reports, briefings, or other settings.
  • Six recommendations are being actively explored to determine how best to operationalize the suggestion.
  • Four recommendations are being considered for alternative solutions, in the context of the underlying concerns or issues that the DTAG raised. The DTAG’s proposed approaches may not be feasible or optimal given other constraints.

DTAG recommendation #1: Release accuracy rates. Measuring accuracy is complex for a number of reasons. One major reason is that there is a range of “ambiguous” content, meaning specific kinds of text, images, or videos about which reasonable people might disagree when assessing whether these violate our Community Standards. Given this complexity, it can be difficult to define whether the content was “accurately” labeled and therefore to establish standard measures of accuracy based on false positives, false negatives, true positives, and true negatives. To address this issue, we are working to refine our policies to reduce the content that is “ambiguous.” In practice though, given the range of topics and issues covered under the Community Standards, there will always be ambiguous content, and thus in parallel, we are exploring what could serve as a meaningful metric for accuracy that is robust to the inclusion (or exclusion) of ambiguous content.
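To make the sensitivity to ambiguous content concrete, the sketch below computes precision and recall on a small, entirely hypothetical reviewed sample under three treatments of the ambiguous items: dropped, counted as violating, or counted as benign. The counts, labels, and function names are invented for illustration and do not describe Facebook's data or its measurement process.

```python
from collections import Counter

def precision_recall(pairs):
    """Precision and recall from (truth_is_violating, was_actioned) pairs."""
    c = Counter()
    for truth, actioned in pairs:
        if truth and actioned:
            c["tp"] += 1
        elif truth and not actioned:
            c["fn"] += 1
        elif actioned:
            c["fp"] += 1
        else:
            c["tn"] += 1
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    return precision, recall

def resolve(sample, ambiguous_as):
    """Turn labels into booleans; ambiguous_as is True, False, or None (drop)."""
    out = []
    for label, actioned in sample:
        if label == "ambiguous":
            if ambiguous_as is None:
                continue
            out.append((ambiguous_as, actioned))
        else:
            out.append((label == "violating", actioned))
    return out

# Hypothetical reviewed sample of (label, was_actioned) pairs.
sample = (
    [("violating", True)] * 80 + [("violating", False)] * 20
    + [("benign", True)] * 10 + [("benign", False)] * 870
    + [("ambiguous", True)] * 12 + [("ambiguous", False)] * 8
)

for name, choice in [("excluded", None), ("as violating", True), ("as benign", False)]:
    p, r = precision_recall(resolve(sample, choice))
    print(f"ambiguous {name}: precision={p:.3f} recall={r:.3f}")
```

Even with only 2% of the sample marked ambiguous, the reported precision moves noticeably depending on the treatment, which is the kind of instability a robust accuracy metric would need to avoid.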

DTAG recommendation #2: Release review-and-appeal rates and reversal rates separately from an accuracy metric. Facebook is planning to release metrics that capture the amount of content that is appealed and how much content it restores. These metrics will help provide transparency about additional aspects of the governance and moderation process distinct from the accuracy rates discussed above.

DTAG recommendation #3: Provide information about the percentage of posts that are actioned by automation and the percentage actioned by humans. Such a metric could help highlight that while Facebook may want to increase automatic detection of content that violates our standards, in many cases we may not want to increase automatic actions for specific violation types. We currently share the percent of content that is detected before anyone reports it, but humans then review some of this content. And how useful automation technology is in removing content varies. For instance, such actions might be more relevant for imagery depicting adult sexual activity rather than content that may be hate speech or harassment, which requires greater language and cultural context to assess. As a result, we agree it may be helpful in the future to more explicitly define automated detection versus automated actions along with their relative accuracy, and we will continue to explore ways to do this in a manner that is both meaningful and accurate.

DTAG recommendation #4: Check reviewers’ judgments not only against an internal “correct” interpretation of the Community Standards but also against users’ interpretations of the Community Standards. Although such comparisons are useful in some settings, we do not believe they represent meaningful measures against which Facebook would operate. This is because our internal research suggests users are often unaware of, or do not understand, the Community Standards themselves or the processes by which they are applied. Moreover, when users report, the rates at which their reports can be actioned are very low. There are three possible reasons for these low rates: a lack of context on Facebook’s side that would illuminate what is wrong; a misunderstanding about how we apply our Community Standards; and, sometimes, abusive mass reporting. Taken together, these findings show that Facebook should do more to help users understand the rules and inform their reporting. We have already begun some of this work and, in that context, will explore ways to more systematically research how users interpret these Community Standards in the broad range of cultural and regional settings in which our platforms are used.

DTAG recommendation #5: Report prevalence measures not only as a percentage of the total estimated number of views but also as a percentage of the total estimated number of posts. This was one of the most technically challenging suggestions, and we discussed it extensively with the DTAG. The idea was that we develop a violating content rate — that is, a measure of prevalence in which a unit of observation is an individual piece of content, and the measure is the fraction of all content that is violating. This rate would serve as a supplement to the current prevalence measure, which is based on views of content. So even though our current viewership-based prevalence metric is more of a “consumption metric” — indicating how much of violating content is (intentionally or not) consumed — the proposed content-based prevalence measure is a “production metric” — indicating the amount of violating content out of how much material exists on the platform. In discussions, we explored whether it was feasible to estimate the number of distinct posts that contribute to violating viewership. Initially, we explored whether a content-based metric can be constructed with the Hansen-Hurwitz estimator using the sampling rate of views and the number of views each sampled post received in the given time frame. The DTAG agreed, however, that such an approach would be limited because: (1) the uncertainty on this population estimate would likely be very large because the viewership distribution of material is heavily skewed; (2) the views-based prevalence sampling has been optimized to minimize error by focusing on material that is more likely to be viewed, and so content that is not viewed by anyone would skew the distribution; and (3) such a count would require sampling continuously in real-time or risk missing content that is proactively removed. In such conditions, the underlying assumption that the number of views is proportional to the population of content would not hold. 
Moreover, the degree to which such a sample might be biased would be different for various types of violations. These discussions with the DTAG made clear that in order to construct a consistent estimate of a violating content rate, we would need an entirely new and separate labeling and measurement effort. Given this, we considered this metric as an interesting addition but not a priority effort relative to other suggestions for which we are working to improve or expand metrics. We are exploring other ways to provide insight into “production” of violating content through more targeted analysis and research.
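To illustrate the estimator discussed under recommendation #5, here is a minimal sketch of a Hansen-Hurwitz estimate of the number of violating posts from a views-based sample. It is a toy under stated assumptions, not Facebook's measurement pipeline: the population, the function name `hansen_hurwitz_total`, and the sample size are all hypothetical. Sampling a view uniformly at random selects a post with probability proportional to its views, so each sampled violating post is weighted by the inverse of that probability.

```python
import random

def hansen_hurwitz_total(sampled_values, sampled_probs):
    """Hansen-Hurwitz estimate of a population total from a with-replacement sample.

    sampled_values[i] is y_i for the i-th draw; sampled_probs[i] is the
    per-draw selection probability p_i of the unit drawn.
    """
    n = len(sampled_values)
    return sum(y / p for y, p in zip(sampled_values, sampled_probs)) / n

# Hypothetical toy population: (views, is_violating) for each post.
posts = [(1000, 0), (500, 1), (50, 0), (10, 1), (5, 1), (1, 0), (1, 1)]
total_views = sum(views for views, _ in posts)
true_violating_posts = sum(flag for _, flag in posts)

# Drawing a random view picks post i with probability views_i / total_views.
rng = random.Random(0)
draws = rng.choices(posts, weights=[views for views, _ in posts], k=100_000)

values = [flag for _, flag in draws]
probs = [views / total_views for views, _ in draws]
estimated_violating_posts = hansen_hurwitz_total(values, probs)

print(true_violating_posts, round(estimated_violating_posts, 2))
```

Even in this toy, a rarely viewed violating post carries a very large weight whenever it is drawn, which is the skew-driven uncertainty described in point (1), and a post with zero views can never be sampled at all, which relates to point (2).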

DTAG recommendation #6: Explore ways of relating prevalence metrics to real-world harm. We do not believe such a relationship could be meaningfully established in the context of our Content Standards Enforcement Report, but we certainly view it as part of a broader research agenda that is active both inside and outside of Facebook. Extensive research by Facebook and external scholars has found that misinformation and disinformation are associated with a range of harms that vary by regional, political, and social context. While the causal relationship between online information and offline behaviors, including violence, is still being explored, both internal Facebook work and external scholarly research have highlighted risks from amplification and rapid spread of information facilitated by social media platforms. (See, for instance, work by Dunn et al., 2017, on health information.) Facebook continues to explore these issues both through internal research and through support of external research (e.g., our recent call for proposals).

DTAG recommendation #7: Explore ways of accounting for the seriousness of a violation in the prevalence and proactivity metrics. This suggestion is also part of a broader effort by both research and data science teams. But there are a number of complexities related to constructing valid and reliable measures that are relevant across broad violation types and consistent in global contexts. This difficulty is similar to the complexity of measuring severity in the context of crime, both in determining meaningful concepts and in designing feasible and valid measures. (See, for example, Greenfield and Paoli, 2013; Sherman et al., 2016; and Ramchand et al., 2009.)

DTAG recommendation #8: Report prevalence measures in subpopulations. The DTAG also had a number of suggestions on releasing more disaggregated data related to content standards enforcement. These breakdowns are feasible for these count-based metrics but may not be for prevalence because of the current stratified sampling approach. Despite some technical constraints, we recognize the value in having different subpopulations of the various metrics and the value of exploring which breakdowns may be most useful and the ways in which we may present and share such information.

DTAG recommendation #9: Report actioned content and proactively actioned content as a proportion of estimated violating content. This recommendation relies on the creation of a content rate metric discussed in recommendation #5. We are exploring ways we can more transparently discuss and quantify how actioned content and prevalence rates are related to help readers better understand how to compare metrics based on distinct units of observation (e.g., compare views-based rates to content-based counts).

DTAG recommendation #10: Break out actioned content measures by type of action taken. We agree that this additional detail would be useful in understanding the way different policy and enforcement tools are used to balance various principles of maintaining voice on important issues — such as discussions of suicide or graphic depictions of human rights violations — while providing protections for users who might be made to feel unsafe or emotionally triggered by such content. We are exploring appropriate breakdowns of this metric to ensure it meaningfully captures some of these issues.

DTAG recommendation #11: Explore ways of accounting for changes in the Community Standards and changes in technology when reporting metrics in the CSER. This is an important point, and we agree with the DTAG. Policy changes could drive changes in metrics too. Currently, much of our narrative discussion focuses on changes in technology, and we are exploring how to include policy changes in the narrative, as well as how to account for these changes, to allow for consistency over time. In the interim, we have released a recent updates section to our Community Standards webpage, which can allow readers to identify changes in the Community Standards over time.

DTAG recommendation #12: Explore ways to enhance bottom-up (as opposed to top-down) governance. The DTAG offered suggestions related to exploring additional partnerships, diversifying input into our processes, and integrating expert evaluations more broadly. These suggestions are a critical and growing component of our ongoing efforts in the content moderation and governance space. We are exploring a range of ways to engage experts and conduct evaluations such as the DTAG — in particular, with our Content Policy Research Initiative workshops and funding opportunities — to help researchers better understand and study our policies and processes. We have also engaged in a range of collaborative processes in designing oversight mechanisms, including an open solicitation for public comments.

DTAG recommendation #13: Enhance components of procedural justice in the Community Standards enforcement and appeal-and-review process.
Over the past year, Facebook has engaged in dozens of sessions to help experts and users better understand how we set the rules and enforce them at scale. Facebook also conducts user research, which involves both qualitative and quantitative methods, to understand what users think about the rules as well as how we could more clearly communicate these rules. We also built and continue to scale an appeals process and increased communication with users on how we enforce these rules, consistent with principles of procedural justice. As part of this work, Facebook continues to research and test ways to better inform users of our rules consistent with the principles of procedural justice and transparency. (See, for example, work with Tyler et al., 2018.) Work in this area has been and will continue to be a core aspect of Facebook’s content governance efforts.

DTAG recommendation #14: Publicly release anonymized or otherwise aggregated versions of the data used to calculate prevalence and other metrics in the CSER.
We are exploring ways to make accessible — either via public release or based on a more moderated application-based process (similar to access to sensitive government data) — anonymized or otherwise aggregated versions of the data used to calculate prevalence, content actioned, and proactive detection rate metrics.

DTAG recommendation #15: Modify the formatting, presentation, and text of CSER documents to make them more accessible and intelligible to readers. This recommendation, though not related to the metrics themselves, is critical for identifying ways that we could meaningfully enhance transparency. We incorporated many of these suggestions into our narrative descriptions and explanations, including improving the clarity of some of the descriptions, discussing how policy changes affected movements in the metrics, and generally improving the accessibility of the report. Many of these changes will be reflected in upcoming reports, and we will continue to consider how best to ensure that the language and details that accompany the metrics are expressed as clearly as possible.

Next steps and future collaborations

Facebook will release the third iteration of its Community Standards Enforcement Report, which reflects a number of the DTAG’s recommendations along with Facebook’s own efforts to expand its measurement efforts. Facebook will continue to explore ways to expand and improve its Community Standards enforcement transparency by working with external experts. We have conducted dozens of engagements with researchers globally to increase awareness about and research regarding our policies (including explicit efforts to support more extensive research collaborations).

We are building innovative ways to share data, and in parallel, we are working to identify new ways to expand research that can improve understanding of Facebook scale and constraints while preserving independence of external analysis. Working with experts to ensure we are developing, enforcing, and reporting on our Community Standards will continue to be a key tool in ensuring that Facebook is a safe and inclusive platform globally.


Brunetti, Aymo and Weder, Beatrice (2003). “A Free Press Is Bad News for Corruption,” Journal of Public Economics, 87(7–8): 1801–24

Cohn, E. S., White, S. O. & Sanders, J. (2000). Distributive and procedural justice in seven nations. Law and Human Behavior, 24, 553-579.

Dawes, Sharon S. (2010). Stewardship and Usefulness: Policy Principles for Information-Based Transparency. Government Information Quarterly, Volume 27, Issue 4, Pages 377-383.

Efron, B. (1987). “Better Bootstrap Confidence Intervals,” Journal of the American Statistical Association, 82(397): 171–185. doi:10.2307/2289144. JSTOR 2289144.

Elster, J. (1998). Deliberation and constitution making. In Deliberative Democracy, ed. Jon Elster. New York: Cambridge University Press, pp. 97-122.

Greenfield, V. A. and Paoli, L. (2013). “A Framework to Assess the Harms of Crimes.” British Journal of Criminology, 53(5): 864–886.

Karr, A. F. (2008). Citizen access to government statistical information. In Digital Government (pp. 503-529). Springer, Boston, MA.

Lindstedt, C. & Naurin, D. (2010). Transparency is not enough: Making transparency effective in reducing corruption. International Political Science Review, 31(3), 301-322.

Ramchand, R., MacDonald, J. M., Haviland, A., et al. (2009). “A Developmental Approach for Measuring the Severity of Crimes,” Journal of Quantitative Criminology, 25: 129.

Sherman, L., Neyroud, P. W. & Neyroud, E. (2016). “The Cambridge Crime Harm Index: Measuring Total Harm from Crime Based on Sentencing Guidelines,” Policing: A Journal of Policy and Practice, 10(3): 171–183.

Stiglitz, Joseph E. (2000). “The Contributions of the Economics of Information to Twentieth Century Economics,” Quarterly Journal of Economics, 115(4): 1441–78.

Tyler, T. R., Boeckmann, R J., Smith, H J. & Huo, Yuen J. (1997). Social Justice in a Diverse Society. Boulder: Westview Press.

Tyler, T. R. (2000). Social justice: Outcome and procedure. International Journal of Psychology, 35, 117-125.

Tyler, T. R. (2006). Psychological perspectives on legitimacy and legitimation. Annual Review of Psychology, 57, 375-400.

The post Exploring feedback from data and governance experts: A research-based response to the Data Transparency Advisory Group report appeared first on Facebook Research.

Continue Reading…


Read More

Counting and illustrating Game of Thrones deaths

Shelly Tan, for The Washington Post, has been counting on-screen deaths in Game of Thrones over the past few years. As the season ended, Tan described her process in an entertaining Twitter thread:

I kept thinking about how her process transfers to counting all things. You know, like the decennial Census. The hand-wavy process always seems so straightforward. It’s like, sure, it’ll take a while, but the challenge is just time. But then you get into it, and there’s all these small bumps along the way that make everything more complicated. And then you’re like, great, well, I’ve already come this far. Better keep on counting.

Continue Reading…


Read More

Your Guide to Natural Language Processing (NLP)

This extensive post covers NLP use cases, basic examples, Tokenization, Stop Words Removal, Stemming, Lemmatization, Topic Modeling, the future of NLP, and more.

Continue Reading…


Read More

What is Data Science?

Data science has been called the sexiest career of 2019. What is data science, really, and how can you get your foot in the door of this exciting career?

The post What is Data Science? appeared first on Dataquest.

Continue Reading…


Read More

End-to-End Machine Learning: Making videos from images

Video is a natural way for us to understand three dimensional and time varying information. Read this short post on how to achieve the creation of videos from still images.

Continue Reading…


Read More

Data science best practices with pandas (video tutorial)

The pandas library is a powerful tool for multiple phases of the data science workflow, including data cleaning, visualization, and exploratory data analysis. However, the size and complexity of the pandas library makes it challenging to discover the best way to accomplish any given task.

In this in-depth tutorial, which I presented at PyCon 2019, you'll use pandas to answer questions about a real-world dataset. Through each exercise, you'll learn important data science skills as well as "best practices" for using pandas. By the end of the tutorial, you'll be more fluent at using pandas to correctly and efficiently answer your own data science questions.

This is an intermediate level tutorial, so if you're new to pandas, I recommend starting with my other video series: Easier data analysis with pandas.

If you want to follow along with the exercises at home, you can download the dataset and notebook from GitHub.

Here are some of the topics covered in the video:

  • adjusting for bias in your dataset
  • handling missing values
  • choosing an appropriate plot
  • customizing your plot
  • using the datetime data type
  • filtering using loc versus query
  • using multiple aggregation functions
  • checking for small sample sizes
  • method chaining
  • verifying your results using random samples
  • evaluating a "stringified" Python container
  • applying a custom function to a Series
  • writing lambda functions

Let me know if you have any questions, and I'm happy to answer them!

P.S. If you like this video, you should check out my interactive pandas course, Analyzing Police Activity with pandas.

Continue Reading…


Read More

Spotlight on: Julia Silge, Stack Overflow

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

Julia Silge is joining us as one of our keynote speakers at EARL London 2019. We can’t wait to hear Julia’s full keynote, but until then she kindly answered a few questions. Julia shared with us what we can expect from her address – which will focus on how Stack Overflow uses R and their recent developer survey.

Hi Julia! Tell us about the StackOverflow Developer Survey and your role at Stack Overflow

The Stack Overflow Developer Survey is the largest and most comprehensive survey of people who code around the world each year. This year, we had almost 90,000 respondents who shared their opinions on topics including their favourite technologies, their priorities in looking for a job, and what music they listen to while coding. I am the data scientist who works on this survey, and I am involved throughout the process from initial design to writing copy about results. We have an amazing team who works together on this project, including a project manager, designers, community managers, marketers, and developers.

My role focuses on data analysis. Before the survey was fielded, I worked with one of our UX researchers on question writing, so that our expectations for data analysis were aligned, as well as using data from previous years’ surveys and our site to choose which technologies to include this year. After the survey was fielded, I cleaned and analyzed the data, created data visualizations, and wrote the text for both our developer-facing and business-facing reports.

Why did you use R to analyse the survey?

All of our data science tooling at Stack Overflow is R-centric, but specifically, with our annual survey, we are working with a complex dataset on a tight schedule and the R ecosystem provides the fluent data analysis tools we need to deliver compelling results on time. From munging complicated raw data to creating beautiful visualizations to delivering data deliverables via an API, R is the right tool for the job for us.

Were there results from the survey this year that came as a surprise?

This is such a rich dataset to get to work with, full of interesting things to notice! One result this year that I didn’t expect ahead of time was with our question about whether a respondent eventually wanted to move from technical work into people management. We found that younger, less experienced respondents were more likely to say that they wanted to make the switch! Once I thought about it more carefully, I came to think that those more experienced folks with an interest in managing probably had already shifted careers and were not there to answer that question anymore. Another result that was a surprise to me was just how many different kinds of metal people listen to, more than I even knew existed!

Do you see the gender imbalance improving?

Although our annual survey has a broad capacity for informing useful and actionable conclusions, including about gender, our results don’t represent everyone in the developer community evenly. We know that people from marginalized groups and underrepresented groups in tech participate on Stack Overflow at lower rates than they participate in the software workforce. This means that we undersample such groups in our survey (because of how we invite respondents to the survey, mostly on our site itself). Over the past few years, we have seen incremental improvement in the proportion of responses that are from marginalized or underindexed groups such as minority genders or minority racial/ethnic groups; we are so happy to see this because we want to hear from everyone who codes, everywhere. We believe the biggest driver of this kind of positive change is and will continue to be improving the balance of who participates on Stack Overflow itself, and we are committed to making Stack Overflow a more welcoming and inclusive platform. This kind of work can be difficult and slow, but we are in it for the long haul.

What future trends might you be able to predict from the survey?

One trend we’ve seen over the past several years that I expect to continue is the normalization of salaries for data work. Several years ago, people who worked as data scientists were extreme outliers in salary. Salaries for data scientists have started to move toward the norm for software engineering work, especially if you control for education (for example, comparing a data scientist with a master’s degree to a software engineer with a master’s degree). I don’t see this as entirely bad news, because it is associated with some standardization of data science as a role and increased industry agreement about what a data scientist is, what a data engineer is, how to hire for these roles, and what career paths might look like.

Given Python’s rise again this year, do you see this continuing? How will this affect the use of R?

Python has exhibited a meteoric rise over the past several years and is the fastest-growing major programming language in the world. Python has been climbing in the ranks of our survey over the past several years, edging past first PHP, then C#, then Java this year. It currently sits just below SQL in the ranking. I have a hard time imagining that next year more developers will say they use Python than say they use SQL! You can dig this interview up next year and point out my prediction failure if I am wrong.

In terms of R and R’s future, it’s important to note that R’s use has also been growing dramatically on Stack Overflow, both absolutely and relatively. R is now a top 10 to top 15 programming language (both in questions asked and traffic). Data technologies are in general growing a lot, and there are many factors that go into an individual or an organization deciding to embrace R, or Python, or both.

Thanks Julia! 

You can catch Julia and a whole host of other brilliant speakers at EARL London on 10-12 September at The Tower Hotel London.

We have discounted early bird tickets available for a limited time – please visit the EARL site for more information, we hope to see you there!

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Why we need to open up AI black boxes right now

AI is a tricky subject to wrap your head around, which is why we talked to experts at IBM. We were enlightened by Romeo Kienzler, head of IoT at IBM and frequent speaker at Data Natives. Next, Matthias Biniok, lead Watson Architect DACH at IBM, who designed CIMON, the smiling

The post Why we need to open up AI black boxes right now appeared first on Dataconomy.

Continue Reading…


Read More

Bayesian cell counting

Let’s say you are interested in counting the concentration of cells in some sample. This is a pretty common task: sperm counts, blood cell counts, plankton counts. Microbiologists are always counting. Let’s use the example of yeast counting, which is traditional in beer and wine making. The brewery has a sample of yeast slurry, a highly concentrated amount of yeast, and they would like to know how concentrated it is, so they can add the correct amount to a batch.

The first step is to dilute your sample. The original concentration could be as high as billions per mL, so even a fraction of a mL is still going to be too concentrated to count properly. This is solved by successive dilutions of the original sample:


Each time we transfer to a new test tube, we dilute the sample by a factor of 10.

After this is done, a very small amount of the final diluted slurry is added to a hemocytometer. A hemocytometer is a glass slide with a visible grid that an observer can place under a microscope and count cells on. Under the microscope, it looks like this (where yellow dots represent cells):


The hemocytometer is designed such that a known volume exists under the inner 25 squares (usually 0.0001 mL). The observer will pick, a priori, 5 squares to count cells in, typically the 4 corner squares and the middle square. (Why only 5? Unlike in the image above, there could be thousands of cells, and counting all 25 squares would take too long.) Since we know the volume of the 25 squares, and the dilution rate, we can recover our original slurry concentration:

$$ \text{cells/mL} = (\text{cells counted}) \cdot 5 \cdot (\text{dilution factor}) / 0.0001 \text{mL} $$
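As a sanity check on the formula, here is a minimal sketch of the arithmetic. The function name and default arguments are illustrative assumptions; the numbers (49 cells counted in 5 of 25 squares, a 1000x dilution, a 0.0001 mL chamber) are the ones used in this post.

```python
def naive_cells_per_ml(cells_counted, squares_counted=5, total_squares=25,
                       dilution_factor=1000, chamber_volume_ml=0.0001):
    """Scale the counted cells up to the whole chamber, then undo the dilution."""
    # Counting 5 of 25 squares means multiplying by 25 / 5 = 5.
    cells_in_chamber = cells_counted * total_squares / squares_counted
    return cells_in_chamber * dilution_factor / chamber_volume_ml

estimate = naive_cells_per_ml(49)
print(f"{estimate:.3g} cells/mL")
```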

Given I counted 49 yeast and my dilution was 1000x, my estimate is 2.45B yeast/mL. Great! We are done, right? No way, consider all the sources of uncertainty we glossed over:

  1. Did we accurately measure out exactly 9mL of water in each test tube?
  2. Did we accurately extract 1mL of slurry between each test tube?
  3. Did we get lucky/unlucky with our 0.0001 mL sample for the hemocytometer?
  4. Did the hemocytometer manufacturer have some QA over the volume of the chamber?
  5. Did we get lucky/unlucky with cell numbers in the 5 counting squares?

So we should expect high variance in our estimate because of the many sources of noise, and because they are layered on top of one another. Let’s redo this with Bayesian statistics so we can model our uncertainty.
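To get a feel for how these layered noise sources spread the naive estimate, here is a rough forward simulation using only the standard library. Everything in it is an assumption made for illustration: the true concentration of 2B cells/mL, the measurement standard deviations (borrowed from the priors used in the model in this post), and a normal rather than gamma distribution for the chamber volume.

```python
import math
import random
import statistics

def draw_poisson(rng, mean):
    """Knuth's method: multiply uniforms until the product drops below exp(-mean)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_naive_estimate(rng, true_conc=2e9):
    """Simulate one noisy count and push it through the naive formula."""
    dilution = 1.0
    for _ in range(3):  # three transfers of ~1 mL into ~9 mL of water
        transferred = rng.gauss(1.0, 0.01)
        water = rng.gauss(9.0, 0.05)
        dilution *= transferred / (transferred + water)
    chamber_volume = rng.gauss(0.0001, 0.0001 / 20)
    cells_in_chamber = draw_poisson(rng, true_conc * dilution * chamber_volume)
    # Each cell in the chamber falls in a counted square with probability 5/25.
    cells_counted = sum(rng.random() < 5 / 25 for _ in range(cells_in_chamber))
    # The naive formula assumes exactly 1000x dilution and exactly 0.0001 mL.
    return cells_counted * 5 * 1000 / 0.0001

rng = random.Random(42)
estimates = [simulate_naive_estimate(rng) for _ in range(2000)]
mean_estimate = statistics.mean(estimates)
spread = statistics.stdev(estimates)
print(f"mean {mean_estimate:.3g}, sd {spread:.3g} cells/mL")
```

The standard deviation printed at the end is the run-to-run spread of the naive estimate when the true concentration is held fixed, which is the variance the list above is warning about.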

Here’s the code for an observation of 49 yeast cells counted. Each source of noise is a random variable. I’ve added some priors that I think are sensible. 

import pymc3 as pm

# constants used below
BILLION = 1e9
TOTAL_SQUARES = 25

squares_counted = 5
yeast_counted = 49

with pm.Model() as model:
    yeast_conc = pm.Normal("cells/mL", mu=2 * BILLION, sd=0.4 * BILLION)

    shaker1_volume = pm.Normal("shaker1 volume (mL)", mu=9.0, sd=0.05)
    shaker2_volume = pm.Normal("shaker2 volume (mL)", mu=9.0, sd=0.05)
    shaker3_volume = pm.Normal("shaker3 volume (mL)", mu=9.0, sd=0.05)

    yeast_slurry_volume = pm.Normal("initial yeast slurry volume (mL)", mu=1.0, sd=0.01)
    shaker1_to_shaker2_volume = pm.Normal("shaker1 to shaker2 (mL)", mu=1.0, sd=0.01)
    shaker2_to_shaker3_volume = pm.Normal("shaker2 to shaker3 (mL)", mu=1.0, sd=0.01)

    dilution_shaker1 = yeast_slurry_volume       / (yeast_slurry_volume + shaker1_volume)
    dilution_shaker2 = shaker1_to_shaker2_volume / (shaker1_to_shaker2_volume + shaker2_volume)
    dilution_shaker3 = shaker2_to_shaker3_volume / (shaker2_to_shaker3_volume + shaker3_volume)
    final_dilution_factor = dilution_shaker1 * dilution_shaker2 * dilution_shaker3

    volume_of_chamber = pm.Gamma("volume of chamber (mL)", mu=0.0001, sd=0.0001 / 20)

    # why is Poisson justified? in my final shaker, I have yeast_conc * final_dilution_factor * shaker3_volume number of yeast
    # I remove volume_of_chamber / shaker3_volume fraction of them, hence it's a binomial with very high count, and very low probability.
    yeast_visible = pm.Poisson("cells in visible portion", mu=yeast_conc * final_dilution_factor * volume_of_chamber)

    number_of_counted_cells = pm.Binomial("number of counted cells", yeast_visible, squares_counted/TOTAL_SQUARES, observed=yeast_counted)

    trace = pm.sample(5000, tune=1000)

pm.plot_posterior(trace, varnames=['cells/mL'])

The posterior for cells/mL is below:

We can see that the width of the credible interval is about a billion. That’s surprisingly large, but it makes sense: we only have a single, noisy observation. We can see the influence of the prior here as well. Note that the posterior’s mean is about 2.25B, much closer to the prior’s mean of 2B than our naive estimate of 2.45B above. This brings up another point: using the naive formula above is like saying: “I observed a single coin flip, saw heads, so all coin flips are heads.” Sounds silly, but it’s identical inference. With Bayesian statistics, we get to lie in a much more reassuring bed!
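The pull of the prior can be checked without a sampler. The sketch below is a simplified grid approximation, not the full model: it assumes the dilution and chamber volume are exactly their nominal values, so the only randomness left is the count itself. A Poisson count thinned by a binomial is again Poisson, so the 49 counted cells are treated as Poisson with mean concentration × (1/1000) × 0.0001 × (5/25). The grid bounds and step are arbitrary choices.

```python
import math

PRIOR_MEAN, PRIOR_SD = 2e9, 0.4e9  # same normal prior on cells/mL as the model
# Nominal constants: 1000x dilution, 0.0001 mL chamber, 5 of 25 squares counted.
EXPECTED_COUNT_PER_CONC = (1 / 1000) * 0.0001 * (5 / 25)
yeast_counted = 49

# Candidate concentrations from 0.5B to 4.0B cells/mL in steps of 1M.
grid = [0.5e9 + i * 1e6 for i in range(3501)]

def log_posterior(conc):
    """Unnormalized log posterior: normal prior plus Poisson likelihood."""
    rate = conc * EXPECTED_COUNT_PER_CONC
    log_prior = -0.5 * ((conc - PRIOR_MEAN) / PRIOR_SD) ** 2
    log_likelihood = yeast_counted * math.log(rate) - rate
    return log_prior + log_likelihood

log_post = [log_posterior(c) for c in grid]
peak = max(log_post)
weights = [math.exp(lp - peak) for lp in log_post]  # subtract peak for stability
total = sum(weights)

posterior_mean = sum(c * w for c, w in zip(grid, weights)) / total
posterior_sd = math.sqrt(
    sum(w * (c - posterior_mean) ** 2 for c, w in zip(grid, weights)) / total
)
print(f"posterior mean {posterior_mean:.3g}, sd {posterior_sd:.3g} cells/mL")
```

Because this version ignores the volume noise, its posterior should be somewhat narrower than the one from the full model, but the mean should still sit between the prior mean of 2B and the naive estimate of 2.45B.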

Read the next article in this series, modelling growth

Continue Reading…


Read More

When Too Likely Human Means Not Human: Detecting Automatically Generated Text

Passably-human automated text generation is a reality. How do we best go about detecting it? As it turns out, being too predictably human may actually be a reasonably good indicator of not being human at all.

Continue Reading…


Read More

Pushing the guy in front of the trolley

So. I was reading the London Review of Books the other day and came across this passage by the philosopher Kieran Setiya:

Some of the most striking discoveries of experimental philosophers concern the extent of our own personal inconsistencies . . . how we respond to the trolley problem is affected by the details of the version we are presented with. It also depends on what we have been doing just before being presented with the case. After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five. . . .

I’m not up on this literature, but I was suspicious. Watching a TV show for 5 minutes can change your view so strongly?? I was reminded of the claim from a few years ago, that subliminal smiley faces had huge effects on attitudes toward immigration—it turns out the data showed no such thing. And I was bothered, because it seemed that a possibly false fact was being used as part of a larger argument about philosophy. The concept of “experimental philosophy”—that’s interesting, but only if the experiments make sense.

So I thought I’d look into this particular example.

I started by googling *saturday night live trolley problem* which led me to this article in Slate by Daniel Engber, “Does the Trolley Problem Have a Problem?: What if your answer to an absurd hypothetical question had no bearing on how you behaved in real life?”

OK, so Engber’s skeptical too. I searched in the article for Saturday Night Live and found this passage:

Trolley-problem studies also tell us people may be more likely to favor the good of the many over the rights of the few when they’re reading in a foreign language, smelling Parmesan cheese, listening to sound effects of people farting, watching clips from Saturday Night Live, or otherwise subject to a cavalcade of weird and subtle morality-bending factors in the lab.

Which contained a link to this two-page article in Psychological Science by Piercarlo Valdesolo and David DeSteno, “Manipulations of Emotional Context Shape Moral Judgment.”

From that article:

The structure of such dilemmas often requires endorsing a personal moral violation in order to uphold a utilitarian principle. The well-known footbridge dilemma is illustrative. In it, the lives of five people can be saved through sacrificing another. However, the sacrifice involves pushing a rather large man off a footbridge to stop a runaway trolley before it kills the other five. . . . the proposed dual-process model of moral judgment suggests another unexamined route by which choice might be influenced: contextual sensitivity of affect. . . .

We examined this hypothesis using a paradigm in which 79 participants received a positive or neutral affect induction and immediately afterward were presented with the footbridge and trolley dilemmas embedded in a small set of nonmoral distractors.[1] The trolley dilemma is logically equivalent to the footbridge dilemma, but does not require consideration of an emotion-evoking personal violation to reach a utilitarian outcome; consequently, the vast majority of individuals select the utilitarian option for this dilemma.[2]

Here are the two footnotes to the above passage:

[1] Given that repeated consideration of dilemmas describing moral violations would rapidly reduce positive mood, we utilized responses to the matched set of the footbridge and trolley dilemmas as the primary dependent variable.

[2] Precise wording of the dilemmas can be found in Thomson (1986) or obtained from the authors.

I don’t understand footnote 1 at all. From my reading of it, I’d think that a matched set of the dilemmas corresponds to each participant in the experiment getting both questions, and then in the analysis having the responses compared. But from the published article it’s not clear what’s going on, as only 77 people seem to have been asked about the trolley dilemma compared to 79 asked about the footbridge (I don’t know what happened to those two missing responses) and, in any case, the dependent or outcome variable in the analyses is the response to each question, one at a time. I’m not saying this to pick at the paper; I just don’t quite see how their analysis matches their described design. The problem isn’t just two missing people, it’s also that the numbers don’t align. In the data for the footbridge dilemma, 38 people get the control condition (“a 5-min segment taken from a documentary on a small Spanish village”) and 41 get the treatment (“a 5-min comedy clip taken from ‘Saturday Night Live’”). The entire experiment is said to have 79 participants. But for the trolley dilemma, it says that 40 got the control and 37 got the treatment. Maybe data were garbled in some way? The paper was published in 2006, long before data sharing was any sort of standard, and this little example reminds us why we now think it good practice to share all data and experimental conditions.

Regarding footnote 2: I don’t have a copy of Thomson (1986) at hand, but some googling led me to this description by Michael Waldmann and Alex Wiegmann:

In the philosopher’s Judith Thomson’s (1986) version of the trolley dilemma, a situation is described in which a trolley whose brakes fail is about to run over five workmen who work on the tracks. However, the trolley could be redirected by a bystander on a side track where only one worker would be killed (bystander problem). Is it morally permissible for the bystander to throw the switch or is it better not to act and let fate run its course?

Now for the data. Valdesolo and DeSteno find the following results:

– Flip-the-switch-on-the-trolley problem (no fat guy, no footbridge): 38/40 flip the switch under the control condition, 33/37 flip the switch under the “Saturday Night Live” condition. That’s an estimated treatment effect of -0.06 with standard error 0.06.

– Footbridge problem (trolley, fat guy, footbridge): 3/38 push the man under the control condition, 10/41 push the man under the “Saturday Night Live” condition. That’s an estimated treatment effect of 0.16 with standard error 0.08.
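The two estimates above can be reproduced from the reported counts. The following is a minimal sketch, assuming the usual difference-in-proportions estimate with an unpooled binomial standard error; the function name `diff_in_proportions` is made up for illustration and is not from the original article.

```python
import math

def diff_in_proportions(y_control, n_control, y_treated, n_treated):
    """Treatment-minus-control difference in proportions and its standard error."""
    p_c = y_control / n_control
    p_t = y_treated / n_treated
    estimate = p_t - p_c
    # Unpooled standard error: variances of the two independent proportions add.
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treated)
    return estimate, se

# Counts as reported in the text above.
switch_est, switch_se = diff_in_proportions(38, 40, 33, 37)  # flip the switch
push_est, push_se = diff_in_proportions(3, 38, 10, 41)       # push the man

print(f"switch: {switch_est:.3f} +/- {switch_se:.3f}")
print(f"push:   {push_est:.3f} +/- {push_se:.3f}")
```

Rounded to two decimals, the standard errors come out at 0.06 and 0.08, and the footbridge estimate sits right at the boundary between 0.16 and 0.17.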

So from this set of experiments alone, I would not say it’s accurate to write that “After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five.” For one thing, it’s not clear who the participants are in these experiments, so the description “Americans” seems too general. But, beyond that, we have a treatment with an effect -0.06 +/- 0.06 in one experiment and 0.16 +/- 0.08 in another: the evidence seems equivocal. Or, to put it another way, I wouldn’t expect such a large difference (“three times more likely”) to replicate in a new study or to be valid in the general population. (See for example section 2.1 of this paper for another example. The bias occurs because the study is noisy and there is selection on statistical significance.)
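To see why selection on statistical significance biases noisy estimates upward, here is a small simulation sketch. The true effect of 0.05 is a purely hypothetical value chosen for illustration (nobody knows the real effect); the standard error of 0.08 is the one from the footbridge comparison above.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.05  # hypothetical assumption, not an estimate from any study
se = 0.08           # standard error from the footbridge comparison

# Simulate many replications of the same noisy study.
estimates = rng.normal(true_effect, se, size=100_000)

# Keep only the replications that would be declared statistically significant.
significant = np.abs(estimates) > 1.96 * se

power = significant.mean()
exaggeration = np.mean(np.abs(estimates[significant])) / true_effect

print(f"share of studies reaching significance: {power:.2f}")
print(f"average exaggeration among those: {exaggeration:.1f}x")
```

Under these assumptions, any estimate that clears the significance threshold must be at least about three times the true effect, so the published estimate is an overestimate by construction.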

At this point I thought it best to dig deeper. Setiya’s article is a review of the book, “Philosophy within Its Proper Bounds,” by Edouard Machery. I looked up the book on Amazon, searched for “trolley,” and found this passage:

From this I learned that there were some follow-up experiments. The two papers cited are Divergent effects of different positive emotions on moral judgment, by Nina Strohminger, Richard Lewis, and David Meyer (2011), and To push or not to push? Affective influences on moral judgment depend on decision frame, by Bernhard Pastötter, Sabine Gleixner, Theresa Neuhauser, and Karl-Heinz Bäuml (2013).

I followed the links to both papers. Machery describes these as replications, but none of the studies in question are exact replications, as the experimental conditions differ from the original study. Strohminger et al. use audio clips of comedians, inspirational stories, and academic lectures: no Saturday Night Live, no video clips at all. And Pastötter et al. don’t use video or comedy: they use audio clips of happy or sad-sounding music.

I’m not saying that these follow-up studies have no value or that they should not be considered replications of the original experiment, in some sense. I’m bringing them up partly because details matter (after all, if the difference between a serious video and a comedy video could have a huge effect on a survey response, one could also imagine that it makes a difference whether stimuli involve speech or music, or whether they are audio or video) but also because of the flexibility, the “researcher degrees of freedom,” involved in whether to consider something as a replication at all. Recall that when a study does not successfully replicate, a common reaction is to point out differences between the old and new experimental conditions and then declare that the new study was not a real replication. But if the new study’s results are in the same direction as the old’s, then it’s treated as a replication, no questions asked. So the practice of counting replications has a heads-I-win, tails-you-lose character. (For an extreme example, recall Daryl Bem’s paper where he claimed to present dozens of replications of his controversial ESP study. One of those purported replications was entitled “Further testing of the precognitive habituation effect using spider stimuli.” I think we can be pretty confident that if the spider experiment didn’t yield the desired results, Bem could’ve just said it wasn’t a real replication because his own experiment didn’t involve spiders at all.)

Anyway, that’s just terminology. I have no problem with the Strohminger et al. and Pastötter et al. studies, which we can simply call follow-up experiments.

And, just to be clear, I agree that there’s nothing special about an SNL video or for that matter about a video at all. My concern about the replication studies is more of a selection issue: if a new study doesn’t replicate the original claim, then a defender can say it’s not a real replication. I guess we could call that “the no true replication fallacy”! Kinda like those notorious examples where people claimed that a failed replication didn’t count because it was done in a different country, or the stimulus was done for a different length of time, or the outdoor temperature was different.

The real question is, what did they find and how do these findings relate to the larger claim?

And the answer is, it’s complicated.

First, the two new studies only look at the footbridge scenario (where the decision is whether to push the fat man), not the flip-the-switch-on-the-trolley scenario, which is not so productive to study because most people are already willing to flip the switch. So the new studies do not allow comparison of the two scenarios. (Strohminger et al. used 12 high conflict moral dilemmas; see here.)

Second, the two new studies looked at interactions rather than main effects.

The Strohminger et al. analysis is complicated and I didn’t follow all the details, but I don’t see a direct comparison estimating the effect of listening to comedy versus something else. In any case, though, I think this experiment (55 people in what seems to be a between-person design) would be too small to reliably estimate the effect of interest, considering how large the standard error was in the original N=79 study.

Pastötter et al. had no comedy at all and found no main effect; rather, as reported by Machery, they found an effect whose sign depended on framing (whether the question was asked as “Do you think it is appropriate to be active and push the man?” or “Do you think it is appropriate to be passive and not push the man?”):

I guess the question is, does the constellation of these results represent a replication of the finding that “situational cues or causal factors influencing people’s affective states—emotions or moods—have consistent effects on people’s general judgments about cases”?

And my answer is: I’m not sure. With this sort of grab bag of different findings (sometimes main effects, sometimes interactions) with different experimental conditions, I don’t really know what to think. I guess that’s the advantage of large preregistered replications: for all their flaws, they give us something to focus on.

Just to be clear: I agree that effects don’t have to be large to be interesting or important. But at the same time it’s not enough to just say that effects exist. I have no doubt that affective states affect survey responses, and these effects will be of different magnitudes and directions for different people and in different situations (hence the study of interactions as well as main effects). There have to be some consistent or systematic patterns for this to be considered a scientific effect, no? So, although I agree that effects don’t need to be large, I also don’t think a statement such as “emotions influence judgment” is enough either.

One thing that does seem clear, is that details matter, and lots of the details get garbled in the retelling. For example, Setiya reports that “Americans are three times more likely” to say they’d push someone, but that factor of 3 is based on a small noisy study on an unknown population, and for which I’ve not seen any exact replication, so to make that claim is a big leap of faith, or of statistical inference. Meanwhile, Engber refers to the flip-the-switch version of the dilemma, for which case the data show no such effect of the TV show. More generally, everyone seems to like talking about Saturday Night Live, I guess because it evokes vivid images, even though the larger study had no TV comedy at all but compared clips of happy or sad-sounding music.

What have we learned from this journey?

Reporting science is challenging, even for skeptics. None of the authors discussed above (Setiya, Engber, or Machery) are trying to sell us on this research, and none of them have a vested interest in making overblown claims. Indeed, I think it would be fair to describe Setiya and Engber as skeptics in this discussion. But even skeptics can get lost in the details. We all have a natural desire to smooth over the details and go for the bigger story. But this is tricky when the bigger story, whatever it is, depends on details that we don’t fully understand. Presumably our understanding in 2018 of affective influences on these survey responses should not depend on exactly how an experiment was done in 2006, but the description of the effects is framed in terms of that 2006 study, and with each lab’s experiment measuring something a bit different, I find it very difficult to put everything together.

This relates to the problem we discussed the other day, of psychology textbooks putting a complacent spin on the research in their field. The desire for a smooth and coherent story gets in the way of the real-world complexity that motivates this research in the first place.

There’s also another point that Engber emphasizes, which is the difference between a response to a hypothetical question, and an action in the external world. Paradoxically, one reason why I can accept that various irrelevant interventions (for example, watching a comedy show or a documentary film) could have a large effect on the response to the trolley question is that this response is not something that most people have thought about before. In contrast, I found similar claims involving political attitudes and voting (for example, the idea that 20% of women change their presidential preference depending on time of the month) to be ridiculous, in part because most people already have settled political views. But then, if the only reason we find the trolley claims plausible is that people aren’t answering them thoughtfully, then we’re really only learning about people’s quick reactions, not their deeper views. Quick reactions are important too; we should just be clear if that’s what we’re studying.

P.S. Edouard Machery and Nina Strohminger offered useful comments that influenced what I wrote above.


Announcing the winners of the Content Policy Research on Social Media Platforms research awards

In February, Facebook launched a request for proposals focused on content policies, specifically around hate speech and preventing offline harm. We were interested in supporting research that will help us develop better content policies and assess possible interventions. We asked for proposals that took a variety of approaches, including experimental and observational studies, along with qualitative and analytic research to understand the mechanisms by which online rhetoric influences offline events.

We received 184 proposals from 38 countries; a selection committee comprising members of the integrity research and policy teams evaluated them. We received proposals on a broad range of relevant research questions that are well-positioned to contribute to the global discussion on content policy issues. Thank you to all the researchers who took the time to submit a proposal, and congratulations to the winners.

Research award winners

Comparative Thematic Analysis Online Comms Strategy of Global Terrorist Organizations
Stevie Weinberg, Deborah Housen-Couriel, Rami Efrati, Eitan Azani, Michael Barak, Uri Ben Yaakov, Doron Rokah, Erez Kreiner, Sigalit Maor-Hirsh, Gabriel Weimann, Assaf Moghadam, Boaz Ganor, Yaakov Perry (International Institute for Counter-Terrorism)

Defending Online and Offline Civility: Analyzing and Debunking Hate Speech
Michael Hameleers, Toni van der Meer (University of Amsterdam)

Examining the Techniques and Tools Used by Intimate Partner Abusers Online
Nicola Dell (Cornell University), Savita Bailur (Caribou Digital)

Fair or Flawed: Experiments on Perceptions of Hate Speech Moderation
João Fernando Ferreira Gonçalves (Erasmus University Rotterdam), Gina Masullo Chen (The University of Texas at Austin), Marisa Torres da Silva (Nova University Lisbon)

Facebook content policy and hate in Zimbabwe
Patience Zirima, Prisiel Samu (Media Monitoring Project of Zimbabwe)

Hate Speech Detection via Deep Learning from Very Large Datasets
Amir Adler (Massachusetts Institute of Technology)

Identifying and Analyzing Hate in News Comments in Facebook
Fabrício Benevenuto (Universidade Federal de Minas Gerais)

Identifying and Examining Islamophobic Speech and Imagery
Anisah Beth Bagasra, Burton Speakman (Kennesaw State University)

Identifying Pull and Push Factors Impacting Reaction to Facebook Posts
Muhammad Umer Khan (Centre for Peace, Security and Developmental Studies and The Grief Directory), Fatima Ali Haider, Narmeen Hamid (The Grief Directory), Fizza Batool (Centre for Peace, Security and Developmental Studies)

Monitoring Cross-Platform Trends in Hate Speech
Jeremy Blackburn (University of Alabama at Birmingham), Barry Bradlyn (University of Illinois at Urbana-Champaign)

Networked Crowdsourcing: An Online Experiment in Content Moderation
Damon Centola, Douglas Guilbeault (University of Pennsylvania)

Pixels Hurt More Than Sticks and Stones: Confronting Cyber-Bullying on Facebook
Tom Kwanya, Angela Kogos, Claudior Onsare, Erick Ogolla, Lucy Kibe (The Technical University of Kenya)

Racial Bias in Reports and Evaluations of Potentially Harmful Speech
Nathan N. Cheek (The Trustees of Princeton University)

Regulating Hate Speech in the Asia Pacific
Katharine Gelber, Kirril Shields (University of Queensland), Aim Sinpeng, Fiona Martin (University of Sydney)

Social Media Platforms and Dirty Political Campaigns
Ernesto Schargrodsky (Universidad Torcuato Di Tella), Rafael Di Tella (Harvard Business School), Sebastian Galiani (University of Maryland)

Terrorist Content Classifier
Charlie Winter, Shiraz Maher (King’s College London)

The How and Where of Hate Speech: A 100 US City Mixed-Methods Study
Rumi Chunara, Stephanie Cook (New York University)

Tracking Causalities of Escalation Towards Dangerous Speech
Anulekha Nandi, Ritu Srivastava (Digital Empowerment Foundation)

Transparency on the Removal of Harmful Speech: Users, Platforms, and Law
Fabricio Bertini Pasquot Polido, Gustavo Ramos Rodrigues, Lahis Pasquali Kurtz, Lucas Costa dos Anjos, Luiza Couto Chaves Brandão, Paloma Rocillo Rolim do Carmo (Institute for Research on Internet & Society)

To view our currently open research awards and to subscribe to our email list, visit our Research Awards page.

The post Announcing the winners of the Content Policy Research on Social Media Platforms research awards appeared first on Facebook Research.


Data Science News for May 2019

Here is the latest data science news for May 2019.

From Data Science 101

General Data Science


Shadows of customers on the wall – key takeaways from the “AI in e-commerce” business breakfast

E-commerce is among the fastest growing industries globally, with about 18% growth in 2018: consumers worldwide purchased $2.86 trillion worth of e-goods in 2018, compared to $2.43 trillion in 2017.

Because digital commerce is data-driven, the industry is ripe territory for AI. However, lack of knowledge and uncertainty remain the most prominent obstacles to this technology gaining a stronger foothold. To address these obstacles, and Google Cloud co-organized a business breakfast to discuss the challenges and opportunities and share their remarks on artificial intelligence (AI) in e-commerce. Joining Google and were experts from, Sotrender, and iProspect, all companies that deliver sophisticated tools for digital business.

Plato’s data cave

“When it comes to building AI applications, it’s all about the data,” said Paweł Osterreicher, Director of Strategy & Business Development at, during his presentation. He pointed out that the simplest analytics in smaller businesses can be done within an Excel spreadsheet or with pen and paper. Preparing a simple segmentation within a client group or spotting best-performing products doesn’t pose a huge challenge. But those are only the tip of the iceberg. “The more sophisticated insights we gain, the more complicated the task becomes. And that’s where specialized software comes in,” he said.

“The greatest challenge is a lack of flexibility. There is no jack-of-all-trades among the popular tools, and each has its limitations. The problem is when a tool doesn’t fit a company’s needs. And, to be honest, that’s a common situation,” Osterreicher continued. Companies thus often need to tweak the tools at their disposal to make them fit or get used to missing insights from their data.

“Most companies process only a fraction of their data and operate with only half the picture. They are like the prisoners in Plato’s cave, watching only the shadows customers cast on the wall, with no access to or true grasp of their real form.”

The only way to analyze data in a convenient and cost-effective way is to leverage machine learning models. Machines are able to effectively spot patterns even in seemingly insignificant details.

“Sometimes information about how long customers hover over a button or how they go about filling in an online form is a first step to obtaining meaningful information. The model is only as good as the data it was built on,” concludes Osterreicher.

Retail reinvented

In another presentation, Jakub Skuratowicz focused on the technical aspects of how companies use AI. There are numerous ways for companies to benefit from AI, be it building engagement, personalizing the user experience or detecting fraud.

Google’s expert showed a new application of image search for omnichannel commerce. First applied by the Nordstrom clothing company, the app enabled users to take a photograph of an item and then search for it in the shop’s database. Thus, the customer could quickly buy the product online or check its availability.

“By using Google Cloud Platform-delivered machine learning tools, the company reached 95% accuracy in recognizing an item shown in a photograph”

AI also thrives in recommendation engines. “It was common to recommend the user another version of the product – a different size of a dress, for example. That’s pointless. Why would one need another of the same dress, only slightly bigger?” Skuratowicz asked. Instead, the AI-powered model recommended products that complemented the one that had been searched for, like sunglasses or a scarf to go with the dress.

Skuratowicz also showed how AI spots fraudulent transactions in international e-commerce. “Manual or semi-automatic checking can be effective, but machine learning makes it more scalable,” he said. By applying AI-based solutions, the international logistics provider Pitney Bowes boosted the accuracy of its fraudulent transaction detection by 80% while reducing false positives by 50%.

The mind barrier

The presentations were followed by a panel discussion on machine learning in e-commerce. As the panelists remarked, the AI-powered future of e-commerce is a challenge that not all companies are ready for.

In response to a question about the state of data-proficiency in e-commerce companies, Arkadiusz Wiśniewski, Director of Data and Technology at iProspect, had this to say:

“Some data are easy to collect, while others provide a challenge, so we have an incomplete view. The legal situation in Europe poses an additional challenge, so it is better to focus on the data owned and make the best use of it.”

“Data-readiness depends to a great extent on company size. But most businesses lack the skills and data to effectively apply machine learning techniques,” agreed Jarosław Trybuchowicz, owner of

The panelists agreed that the situation is hard even though data is becoming a commodity. “Sometimes the problem is the opposite. Despite having huge amounts of data, companies don’t get insights from it. They simply don’t know what questions to ask and what insights to look for,” added’s Borys Sobiegraj.

The panelists likewise agreed that the key to success for enterprises employing machine learning is to know and properly organize their own data. Getting the data is the first challenge; “deciding what to do with it is a different story altogether,” said Jakub Nowacki, Lead Machine Learning Engineer at Sotrender. “Another challenge is extracting value that often lies in matching the data from different sources. If a company is unable to determine the impact of a sales campaign, then what is the purpose of analytics?” he added.

A question-answer session and networking time followed the discussion panel. The next business breakfast is planned for Q3.




The post Shadows of customers on the wall – key takeaways from the “AI in e-commerce” business breakfast appeared first on


Three Critical Aspects of Design Thinking for Big Data Solutions

Design thinking continues to be all the rage amongst organizations of all kinds – from academia to startups, to agencies and consultancies, or large enterprises. The concept is popular today not because it’s new per se, but its approach to problem-solving fits well with the digital transformation that companies are

The post Three Critical Aspects of Design Thinking for Big Data Solutions appeared first on Dataconomy.


Applications of data science and machine learning in financial services

The O’Reilly Data Show Podcast: Jike Chong on the many exciting opportunities for data professionals in the U.S. and China.

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China.

We had a great conversation spanning many topics, including:

  • Potential applications of data science in financial services.

  • The current state of data science in financial services in both the U.S. and China.

  • His experience recruiting, training, and managing data science teams in both the U.S. and China.


Four short links: 23 May 2019

Deep Fakes, GPU-Friendly Codec, Retro OS, and Production Readiness

  1. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models -- astonishing work, where you can essentially do deep-fakes from one or two photos. See the YouTube clip for amazing footage of it learning from historical photos and even a painting. (via Dmitry Ulyanov)
  2. Basis Universal GPU Texture Codec -- open source codec for a super-compressed image file format that can be quickly transcoded to something ready for GPUs. See this Hacker News comment for a very readable explanation of why it's important for game developers.
  3. Serenity -- open source OS for x86 machines, which seems like Unix with Windows 98 UI.
  4. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction -- We present a rubric as a set of 28 actionable tests, and offer a scoring system to measure how ready for production a given machine learning system is. With an implementation in Excel.


Magister Dixit

“Advanced Analytics is ‘the analysis of all kinds of data using sophisticated quantitative methods (for example, statistics, descriptive and predictive data mining, simulation and optimization) to produce insights that traditional approaches to business intelligence (BI) – such as query and reporting – are unlikely to discover.’” rapidminer (2014)


Allen Institute Opens an Israeli Branch

This Tuesday, Prof. Oren Etzioni announced in his keynote talk at the Data Science Summit that the Seattle-based Allen Institute is opening an Israeli branch:

Prof. Yoav Goldberg from Bar Ilan is heading the research effort, with a focus on NLP.


If you did not already know

Rotation Equivariant Vector Field Networks
We propose a method to encode rotation equivariance or invariance into convolutional neural networks (CNNs). Each convolutional filter is applied with several orientations and returns a vector field that represents the magnitude and angle of the highest scoring rotation at the given spatial location. To propagate information about the main orientation of the different features to each layer in the network, we propose an enriched orientation pooling, i.e. max and argmax operators over the orientation space, allowing to keep the dimensionality of the feature maps low and to propagate only useful information. We name this approach RotEqNet. We apply RotEqNet to three datasets: first, a rotation invariant classification problem, the MNIST-rot benchmark, in which we improve over the state-of-the-art results. Then, a neuron membrane segmentation benchmark, where we show that RotEqNet can be applied successfully to obtain equivariance to rotation with a simple fully convolutional architecture. Finally, we improve significantly the state-of-the-art on the problem of estimating cars’ absolute orientation in aerial images, a problem where the output is required to be covariant with respect to the object’s orientation. …

Kernel Graph Convolutional Neural Network
Graph kernels have been successfully applied to many graph classification problems. Typically, a kernel is first designed, and then an SVM classifier is trained based on the features defined implicitly by this kernel. This two-stage approach decouples data representation from learning, which is suboptimal. On the other hand, Convolutional Neural Networks (CNNs) have the capability to learn their own features directly from the raw data during training. Unfortunately, they cannot handle irregular data such as graphs. We address this challenge by using graph kernels to embed meaningful local neighborhoods of the graphs in a continuous vector space. A set of filters is then convolved with these patches, pooled, and the output is then passed to a feedforward network. With limited parameter tuning, our approach outperforms strong baselines on 7 out of 10 benchmark datasets. …

Graph Optimized Convolutional Network (GOCN)
Graph Convolutional Networks (GCNs) have been widely studied for graph data representation and learning tasks. Existing GCNs generally use a fixed single graph which may lead to weak suboptimal for data representation/learning and are also hard to deal with multiple graphs. To address these issues, we propose a novel Graph Optimized Convolutional Network (GOCN) for graph data representation and learning. Our GOCN is motivated based on our re-interpretation of graph convolution from a regularization/optimization framework. The core idea of GOCN is to formulate graph optimization and graph convolutional representation into a unified framework and thus conducts both of them cooperatively to boost their respective performance in GCN learning scheme. Moreover, based on the proposed unified graph optimization-convolution framework, we propose a novel Multiple Graph Optimized Convolutional Network (M-GOCN) to naturally address the data with multiple graphs. Experimental results demonstrate the effectiveness and benefit of the proposed GOCN and M-GOCN. …

RONA
The soaring demand for intelligent mobile applications calls for deploying powerful deep neural networks (DNNs) on mobile devices. However, the outstanding performance of DNNs notoriously relies on increasingly complex models, which in turn is associated with an increase in computational expense far surpassing mobile devices’ capacity. What is worse, app service providers need to collect and utilize a large volume of users’ data, which contain sensitive information, to build the sophisticated DNN models. Directly deploying these models on public mobile devices presents prohibitive privacy risk. To benefit from the on-device deep learning without the capacity and privacy concerns, we design a private model compression framework RONA. Following the knowledge distillation paradigm, we jointly use hint learning, distillation learning, and self learning to train a compact and fast neural network. The knowledge distilled from the cumbersome model is adaptively bounded and carefully perturbed to enforce differential privacy. We further propose an elegant query sample selection method to reduce the number of queries and control the privacy loss. A series of empirical evaluations as well as the implementation on an Android mobile device show that RONA can not only compress cumbersome models efficiently but also provide a strong privacy guarantee. For example, on SVHN, when a meaningful $(9.83,10^{-6})$-differential privacy is guaranteed, the compact model trained by RONA can obtain 20$\times$ compression ratio and 19$\times$ speed-up with merely 0.97% accuracy loss. …


Counting and interval censoring analysis

Let’s say you have an initial population of (micro-)organisms, and you are curious about their survival rates. A common summary statistic of their survival is the half-life. How might you collect data to measure their survival? Since we are dealing with micro-organisms, we can’t track individual lifetimes. What we might do is periodically count the number of organisms still alive. Suppose our dataset looks like:

T = [0,    2,   4,   7  ]  # in hours
N = [1000, 914, 568, 112]

I’ll present two so-so solutions to finding the half-life, and then one much better solution.

Solution 1: Linear interpolation

We can plot this over time:

We can eyeball the half-life to be about 4.3h. Note that this is a linear interpolation between two points, and thus this method doesn’t consider all the data (i.e. it’s not a global method).
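As a rough sketch of what the eyeballing amounts to, the crossing point can be interpolated between the two counts that bracket half of the initial population. The helper name `interpolate_crossing` is made up for this illustration; the exact interpolated value lands in the same neighbourhood as the eyeballed figure.

```python
# Counts from the dataset above.
T = [0, 2, 4, 7]           # hours
N = [1000, 914, 568, 112]  # organisms still alive

def interpolate_crossing(times, counts, target):
    """Linearly interpolate the time at which the count first drops to target."""
    pairs = list(zip(times, counts))
    for (t0, n0), (t1, n1) in zip(pairs, pairs[1:]):
        if n0 >= target >= n1 and n0 != n1:
            return t0 + (n0 - target) * (t1 - t0) / (n0 - n1)
    raise ValueError("target is not bracketed by the observed counts")

half_life = interpolate_crossing(T, N, N[0] / 2)
print(half_life)
```

Only the counts at hours 4 and 7 enter the calculation, which is exactly why the method is not global.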

Solution 2: curve-fitting an exponential model

Exponential death is a common model in ecology because of its simplicity. We can try to find a \(\beta\) value such that the sum of squares of the observed values minus \(1000 \exp(-\beta t)\) is minimized. That’s not hard to do with scipy’s minimize functions. The best estimate for \(\beta\) is about 0.17. But, as we can see, it’s not a very good fit:

We could choose a more flexible model (like a Weibull distribution), but we will still run up against a fundamental problem: the variance on the estimates will be very high because we are only considering 4 data points, and adding more parameters to the fitting model will only make things worse.
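The exponential fit from Solution 2 can be sketched without any dependencies. This version assumes the decay model 1000·exp(−βt) with β > 0, and it swaps scipy's minimizer for a plain grid search over β, which is enough for a one-parameter problem. The grid bounds are an arbitrary choice for illustration.

```python
import math

T = [0, 2, 4, 7]
N = [1000, 914, 568, 112]

def sse(beta):
    """Sum of squared errors between observed counts and N0 * exp(-beta * t)."""
    return sum((n - N[0] * math.exp(-beta * t)) ** 2 for t, n in zip(T, N))

# Grid of candidate decay rates from 0.000 to 1.000 per hour (assumed range).
grid = [i / 1000 for i in range(1001)]
beta_hat = min(grid, key=sse)

# Fitted counts at the observation times, to compare against N.
fitted = [N[0] * math.exp(-beta_hat * t) for t in T]
print(beta_hat, [round(f) for f in fitted])
```

Comparing `fitted` with `N` shows the problem described above: the curve undershoots the early counts and overshoots the last one.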

Solution 3: consider the entire population and censoring

Think again about how the data was collected for a moment. We took a count, waited a few hours, and took a count again. The drop in the population between two counts is made up of individuals who died sometime in that interval. So what we have is a case of interval censoring. That is, I know that 86 individuals died sometime between 0 and 2 (lower and upper bound), 346 died sometime between 2 and 4 (lower and upper bound), etc. Finally, we are left with 112 that are right censored, that is, we stopped watching and don’t observe their death.
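The bookkeeping described above can be sketched in a few lines: differences between consecutive counts give the interval-censored deaths, and whatever is still alive at the last count is right censored. The tuple layout `(start, stop, count)` is just a convention for this illustration.

```python
T = [0, 2, 4, 7]
N = [1000, 914, 568, 112]

# Deaths between consecutive counts are interval censored: (start, stop, count).
intervals = [
    (t0, t1, n0 - n1)
    for t0, t1, n0, n1 in zip(T, T[1:], N, N[1:])
]

# Survivors at the last count are right censored: they die in (last_t, infinity).
intervals.append((T[-1], float("inf"), N[-1]))

for start, stop, count in intervals:
    print(start, stop, count)
```

The counts add back up to the initial population, so every organism is accounted for exactly once.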

We can use survival analysis to answer this problem (hey, let’s use lifelines!). Importantly, we can treat all 1000 organisms as individual observations to make the inference stronger. First, we’ll do some data manipulation. Some notes below.

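import numpy as np
import pandas as pd

df = pd.DataFrame({
  "deltas": [86, 346, 456, 112],   # organisms that died in each interval
  "start":  [0,  2,   4,   7],     # lower bound of the interval, in hours
  "stop":   [2,  4 ,  7,   np.inf], # upper bound of the interval, in hours
  "observed_death": [False, False, False, False]
})
print(df)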

   deltas  start  stop  observed_death
0      86      0   2.0           False
1     346      2   4.0           False
2     456      4   7.0           False
3     112      7   inf           False

We have observed_death as always False, because we don’t have exact measurements on any of the deaths (that is, I have no organisms where I can say “this guy lived exactly X hours”). Why do I have the infinity in the last row? Well, we don’t observe the final 112 organisms die either, but we know they will die between hour 7 and infinity, so we code that in too.

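from lifelines import WeibullFitter

wf = WeibullFitter()
# Each row stands for `deltas` organisms, so the counts are passed as weights.
wf.fit_interval_censoring(df["start"], df["stop"], event_observed=df["observed_death"], weights=df["deltas"])
wf.print_summary()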

<lifelines.WeibullFitter: fitted with 4 observations, 4 censored>
number of subjects = 4
  number of events = 0
    log-likelihood = -1182.357
        hypothesis = lambda_ != 1, rho_ != 1

         coef  se(coef)  lower 0.95  upper 0.95      p  -log2(p)
lambda_  5.09      0.07        4.94        5.23 <0.005       inf
rho_     2.49      0.08        2.35        2.64 <0.005    282.95

Note the very small standard errors. And if we plot the resulting survival curve, we see an excellent fit.

From this, we can report the half-life or, even better, present the survival distribution, since it carries the most information.
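As a quick check on the fit, the half-life can be computed directly from the two point estimates in the summary table, assuming the Weibull survival function S(t) = exp(−(t/λ)^ρ). Setting S(t) = 0.5 gives t = λ·(ln 2)^(1/ρ). The sketch below ignores the uncertainty in the estimates and only uses the rounded values printed above.

```python
import math

# Point estimates copied from the fitted summary above.
lambda_, rho_ = 5.09, 2.49

def survival(t):
    """Weibull survival function S(t) = exp(-(t / lambda) ** rho)."""
    return math.exp(-((t / lambda_) ** rho_))

# Solve S(t) = 0.5 for t.
half_life = lambda_ * math.log(2) ** (1 / rho_)

# Expected number of survivors out of 1000 at the observation times.
expected_counts = [round(1000 * survival(t)) for t in (0, 2, 4, 7)]
print(half_life, expected_counts)
```

The expected counts sit close to the observed 1000, 914, 568 and 112, and the half-life comes out near the value eyeballed in Solution 1.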

Multiple cultures

Let's say we actually have two cultures we are working with, and we take population measurements at the same times. However, each culture has a different environment: one has double the sugar concentration in its medium compared with the other. We can model this using a regression survival model:


culture_1 = pd.DataFrame({
  "deltas": [86, 346, 456, 112],
  "start":  [0,  2,   4,   7],
  "stop":   [2,  4 ,  7,   np.inf],
  "conc_sugars": [0.15, 0.15, 0.15, 0.15],
  "observed_death": [False, False, False, False]
})

culture_2 = pd.DataFrame({
  "deltas": [50, 202, 566, 182],
  "start":  [0,  2,   4,   7],
  "stop":   [2,  4 ,  7,   np.inf],
  "conc_sugars": [0.30, 0.30, 0.30, 0.30],
  "observed_death": [False, False, False, False]
})

# Stack both cultures into one frame; each row is weighted by its count in "deltas".
df = pd.concat([culture_2, culture_1])

from lifelines import WeibullAFTFitter

w_aft = WeibullAFTFitter().fit_interval_censoring(df, lower_bound_col='start', upper_bound_col='stop', event_col='observed_death', weights_col='deltas')

So what is the impact of sugar on the microorganisms' death? The coefficient associated with sugar concentration in this AFT model is interpreted as "higher means live longer". We get a coefficient of 0.90. Thus organisms with 0.15 sugar concentration "experience" time at a 14.4% faster rate than organisms with 0.30 sugar concentration (why? \(\exp(0.90 (0.30-0.15))\)), and their median life is 0.64 hours less.
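The arithmetic behind the "faster rate" statement is short enough to spell out. In an accelerated failure time model a covariate rescales time by exp(coefficient × covariate), so the ratio between the two cultures depends only on the difference in sugar concentration. The variable names below are illustrative; the numbers are the ones quoted in the text.

```python
import math

coef_sugar = 0.90        # coefficient reported in the text
low, high = 0.15, 0.30   # sugar concentrations of the two cultures

# Ratio of time scales between the high-sugar and low-sugar cultures.
acceleration = math.exp(coef_sugar * (high - low))

# Percentage by which the low-sugar culture "experiences" time faster.
percent_faster = (acceleration - 1) * 100
print(acceleration, percent_faster)
```

The result is a factor of roughly 1.14, which is where the figure of about 14% comes from.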


This is a simple case of when you want to use interval censoring instead of more naive ways. By thinking about the data generating process, we gain lots of statistical efficiency and better predictions. 

Continue Reading…


Read More

The Delta-Method and Autograd

One of the reasons I’m really excited about autograd is that it lets me transform my abstract parameters into business logic. Let me explain with an example. Suppose I am modeling customer churn, and I have fitted a Weibull survival model using maximum likelihood estimation. I have two parameter estimates: lambda-hat and rho-hat. I also have their covariance matrix, which tells me how much uncertainty is present in the estimates (in lifelines, this is under the variance_matrix_ property). From this, I can plot the survival curve, which is fine, but what I really want is a measure of lifetime value.

Suppose customers give us $10 each time period for the first 3 time periods, and then $30 each time period afterwards. For a single user, the average LTV (up to timeline) calculation might look like:

from autograd import numpy as np
from lifelines import WeibullFitter

# create a Weibull model with fake data
wf = WeibullFitter().fit(np.arange(1, 100))

def LTV(params, timeline):
    lambda_, rho_ = params
    sf = np.exp(-(timeline/lambda_) ** rho_)
    clv = 10 * (timeline <= 3) * sf + 30 * (timeline > 3) * sf
    return clv.sum()

timeline = np.arange(10)
params = np.array([wf.lambda_, wf.rho_])

LTV(params, timeline)
# 214.76

This gives me a point estimate. But I also want the variance, since there is uncertainty in lambda-hat and rho-hat. There is a technique called the delta-method that allows me to transform variance from one domain to another. But to use that, I need the gradient of LTV with respect to lambda and rho. For this problem, I could probably write out the gradient, but it would be messy and error-prone. I can use autograd instead to do this.

from autograd import grad
gradient_LTV = grad(LTV)

gradient_LTV(params, timeline)
# array([ 0.15527043, 10.78900217])

The output of the above is the gradient at the params value. Very easy. Now, if I want the variance of the LTV, the delta method tells me to multiply the gradient by the covariance matrix and then by the gradient’s transpose (check: this should return a scalar and not a vector/matrix):

var = gradient_LTV(params, timeline)\
    .dot(wf.variance_matrix_)\
    .dot(gradient_LTV(params, timeline)) # 3.127

So, my best estimate of LTV (at timeline=10) is 214.76, with a variance of 3.127. From this, we can build confidence intervals, etc.
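To make the last two steps concrete, here is a dependency-free sketch of the delta-method quadratic form, Var(f) ≈ ∇fᵀ Σ ∇f, followed by a normal-approximation 95% interval built from the estimate and variance quoted above. The function name and the small covariance matrix are hypothetical and exist only to exercise the formula; the 1.96 multiplier assumes the LTV estimate is approximately normal.

```python
import math

def delta_method_variance(gradient, covariance):
    """Quadratic form g^T * Sigma * g for gradient g and covariance matrix Sigma."""
    k = len(gradient)
    return sum(
        gradient[i] * covariance[i][j] * gradient[j]
        for i in range(k)
        for j in range(k)
    )

# Hypothetical gradient and covariance matrix, only to exercise the function.
toy_variance = delta_method_variance([1.0, 2.0], [[2.0, 1.0], [1.0, 3.0]])

# Normal-approximation 95% interval from the numbers quoted in the text.
ltv, var = 214.76, 3.127
half_width = 1.96 * math.sqrt(var)
interval = (ltv - half_width, ltv + half_width)
print(toy_variance, interval)
```

The quadratic form always returns a scalar, which is the sanity check mentioned earlier.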

Continue Reading…


Read More

Document worth reading: “The relationship between Biological and Artificial Intelligence”

Intelligence can be defined as a predominantly human ability to accomplish tasks that are generally hard for computers and animals. Artificial Intelligence [AI] is a field attempting to accomplish such tasks with computers. AI is becoming increasingly widespread, as are claims of its relationship with Biological Intelligence. Often these claims are made to imply higher chances of a given technology succeeding, working on the assumption that AI systems which mimic the mechanisms of Biological Intelligence should be more successful. In this article I will discuss the similarities and differences between AI and the extent of our knowledge about the mechanisms of intelligence in biology, especially within humans. I will also explore the validity of the assumption that biomimicry in AI systems aids their advancement, and I will argue that existing similarity to biological systems in the way Artificial Neural Networks [ANNs] tackle tasks is due to design decisions, rather than inherent similarity of underlying mechanisms. This article is aimed at people who understand the basics of AI (especially ANNs), and would like to be better able to evaluate the often wild claims about the value of biomimicry in AI. The relationship between Biological and Artificial Intelligence

Continue Reading…


Read More

Whats new on arXiv

Data Markets to support AI for All: Pricing, Valuation and Governance

We discuss a data market technique based on intrinsic (relevance and uniqueness) as well as extrinsic value (influenced by supply and demand) of data. For intrinsic value, we explain how to perform valuation of data in absolute terms (i.e just by itself), or relatively (i.e in comparison to multiple datasets) or in conditional terms (i.e valuating new data given currently existing data).

FAT-DeepFFM: Field Attentive Deep Field-aware Factorization Machine

Click through rate (CTR) estimation is a fundamental task in personalized advertising and recommender systems. Recent years have witnessed the success of both the deep learning based model and attention mechanism in various tasks in computer vision (CV) and natural language processing (NLP). How to combine the attention mechanism with deep CTR model is a promising direction because it may ensemble the advantages of both sides. Although some CTR model such as Attentional Factorization Machine (AFM) has been proposed to model the weight of second order interaction features, we posit the evaluation of feature importance before explicit feature interaction procedure is also important for CTR prediction tasks because the model can learn to selectively highlight the informative features and suppress less useful ones if the task has many input features. In this paper, we propose a new neural CTR model named Field Attentive Deep Field-aware Factorization Machine (FAT-DeepFFM) by combining the Deep Field-aware Factorization Machine (DeepFFM) with Compose-Excitation network (CENet) field attention mechanism which is proposed by us as an enhanced version of Squeeze-Excitation network (SENet) to highlight the feature importance. We conduct extensive experiments on two real-world datasets and the experiment results show that FAT-DeepFFM achieves the best performance and obtains different improvements over the state-of-the-art methods. We also compare two kinds of attention mechanisms (attention before explicit feature interaction vs. attention after explicit feature interaction) and demonstrate that the former one outperforms the latter one significantly.

The Statistical Finite Element Method

The finite element method (FEM) is one of the great triumphs of modern day applied mathematics, numerical analysis and algorithm development. Engineering and the sciences benefit from the ability to simulate complex systems with FEM. At the same time the ability to obtain data by measurements from these complex systems, often through sensor networks, poses the question of how one systematically incorporates data into the FEM, consistently updating the finite element solution in the face of mathematical model misspecification with physical reality. This paper presents a statistical construction of FEM which goes beyond forward uncertainty propagation or solving inverse problems, and for the first time provides the means for the coherent synthesis of data and FEM.

IPC: A Benchmark Data Set for Learning with Graph-Structured Data

Benchmark data sets are an indispensable ingredient of the evaluation of graph-based machine learning methods. We release a new data set, compiled from International Planning Competitions (IPC), for benchmarking graph classification, regression, and related tasks. Apart from the graph construction (based on AI planning problems) that is interesting in its own right, the data set possesses distinctly different characteristics from popularly used benchmarks. The data set, named IPC, consists of two self-contained versions, grounded and lifted, both including graphs of large and skewedly distributed sizes, posing substantial challenges for the computation of graph models such as graph kernels and graph neural networks. The graphs in this data set are directed and the lifted version is acyclic, offering the opportunity of benchmarking specialized models for directed (acyclic) structures. Moreover, the graph generator and the labeling are computer programmed; thus, the data set may be extended easily if a larger scale is desired. The data set is accessible from https://…/IPC-graph-data.

End-to-End Entity Resolution for Big Data: A Survey

One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions.

Transfer Entropy in Continuous Time

Transfer entropy (TE) was introduced by Schreiber in 2000 as a measurement of the predictive capacity of one stochastic process with respect to another. Originally stated for discrete time processes, we expand the theory of TE to stochastic processes indexed over a compact interval taking values in a Polish state space. We provide a definition for continuous time TE using the Radon-Nikodym Theorem, random measures, and projective limits of probability spaces. Furthermore, we provide necessary and sufficient conditions to obtain this definition as a limit of discrete time TE, as well as illustrate its application via an example involving Poisson point processes. As a derivative of continuous time TE, we also define the transfer entropy rate between two processes and show that (under mild assumptions) their stationarity implies a constant rate. We also investigate TE between homogeneous Markov jump processes and discuss some open problems and possible future directions.

Joint Learning of Neural Networks via Iterative Reweighted Least Squares

In this paper, we introduce the problem of jointly learning feed-forward neural networks across a set of relevant but diverse datasets. Compared to learning a separate network from each dataset in isolation, joint learning enables us to extract correlated information across multiple datasets to significantly improve the quality of learned networks. We formulate this problem as joint learning of multiple copies of the same network architecture and enforce the network weights to be shared across these networks. Instead of hand-encoding the shared network layers, we solve an optimization problem to automatically determine how layers should be shared between each pair of datasets. Experimental results show that our approach outperforms baselines without joint learning and those using pretraining-and-fine-tuning. We show the effectiveness of our approach on three tasks: image classification, learning auto-encoders, and image generation.

Data Processing Protocol for Regression of Geothermal Times Series with Uneven Intervals

Regression of data generated in simulations or experiments has important implications in sensitivity studies, uncertainty analysis, and prediction accuracy. Depending on the nature of the physical model, data points may not be evenly distributed. It is not often practical to choose all points for regression of a model because it doesn’t always guarantee a better fit. Fitness of the model is highly dependent on the number of data points and the distribution of the data along the curve. In this study, the effect of the number of points selected for regression is investigated and various schemes aimed to process regression data points are explored. Time series data i.e., output varying with time, is our prime interest mainly the temperature profile from enhanced geothermal system. The objective of the research is to find a better scheme for choosing a fraction of data points from the entire set to find a better fitness of the model without losing any features or trends in the data. A workflow is provided to summarize the entire protocol of data preprocessing, regression of mathematical model using training data, model testing, and error analysis. Six different schemes are developed to process data by setting criteria such as equal spacing along axes (X and Y), equal distance between two consecutive points on the curve, constraint in the angle of curvature, etc. As an example for the application of the proposed schemes, 1 to 20% of the data generated from the temperature change of a typical geothermal system is chosen from a total of 9939 points. It is shown that the number of data points, to a degree, has negligible effect on the fitted model depending on the scheme. The proposed data processing schemes are ranked in terms of R2 and NRMSE values.

On Conditioning GANs to Hierarchical Ontologies

The recent success of Generative Adversarial Networks (GAN) is a result of their ability to generate high quality images from a latent vector space. An important application is the generation of images from a text description, where the text description is encoded and further used in the conditioning of the generated image. Thus the generative network has to additionally learn a mapping from the text latent vector space to a highly complex and multi-modal image data distribution, which makes the training of such models challenging. To handle the complexities of fashion image and meta data, we propose Ontology Generative Adversarial Networks (O-GANs) for fashion image synthesis that is conditioned on an hierarchical fashion ontology in order to improve the image generation fidelity. We show that the incorporation of the ontology leads to better image quality as measured by Fréchet Inception Distance and Inception Score. Additionally, we show that the O-GAN achieves better conditioning results evaluated by implicit similarity between the text and the generated image.

Recent Advances in Diversified Recommendation

With the rapid development of recommender systems, accuracy is no longer the only golden criterion for evaluating whether the recommendation results are satisfying or not. In recent years, diversity has gained tremendous attention in recommender systems research, which has been recognized to be an important factor for improving user satisfaction. On the one hand, diversified recommendation helps increase the chance of answering ephemeral user needs. On the other hand, diversifying recommendation results can help the business improve product visibility and explore potential user interests. In this paper, we are going to review the recent advances in diversified recommendation. Specifically, we first review the various definitions of diversity and generate a taxonomy to shed light on how diversity have been modeled or measured in recommender systems. After that, we summarize the major optimization approaches to diversified recommendation from a taxonomic view. Last but not the least, we project into the future and point out trending research directions on this topic.

An Information Theoretic Interpretation to Deep Neural Networks

It is commonly believed that the hidden layers of deep neural networks (DNNs) attempt to extract informative features for learning tasks. In this paper, we formalize this intuition by showing that the features extracted by DNN coincide with the result of an optimization problem, which we call the `universal feature selection’ problem, in a local analysis regime. We interpret the weights training in DNN as the projection of feature functions between feature spaces, specified by the network structure. Our formulation has direct operational meaning in terms of the performance for inference tasks, and gives interpretations to the internal computation results of DNNs. Results of numerical experiments are provided to support the analysis.

MAIA: A Microservices-based Architecture for Industrial Data Analytics

In recent decades, it has become a significant tendency for industrial manufacturers to adopt decentralization as a new manufacturing paradigm. This enables more efficient operations and facilitates the shift from mass to customized production. At the same time, advances in data analytics give more insights into the production lines, thus improving its overall productivity. The primary objective of this paper is to apply a decentralized architecture to address new challenges in industrial analytics. The main contributions of this work are therefore two-fold: (1) an assessment of the microservices’ feasibility in industrial environments, and (2) a microservices-based architecture for industrial data analytics. Also, a prototype has been developed, analyzed, and evaluated, to provide further practical insights. Initial evaluation results of this prototype underpin the adoption of microservices in industrial analytics with less than 20ms end-to-end processing latency for predicting movement paths for 100 autonomous robots on a commodity hardware server. However, it also identifies several drawbacks of the approach, which is, among others, the complexity in structure, leading to higher resource consumption.

Machine Learning based English Sentiment Analysis

Sentiment analysis or opinion mining aims to determine attitudes, judgments and opinions of customers for a product or a service. This is a great system to help manufacturers or servicers know the satisfaction level of customers about their products or services. From that, they can have appropriate adjustments. We use a popular machine learning method, being Support Vector Machine, combine with the library in Waikato Environment for Knowledge Analysis (WEKA) to build Java web program which analyzes the sentiment of English comments belongs one in four types of woman products. That are dresses, handbags, shoes and rings. We have developed and test our system with a training set having 300 comments and a test set having 400 comments. The experimental results of the system about precision, recall and F measures for positive comments are 89.3%, 95.0% and 92.1%; for negative comments are 97.1%, 78.5% and 86.8%; and for neutral comments are 76.7%, 86.2% and 81.2%.

Continue Reading…


Read More

Analysing the HIV pandemic, Part 4: Classification of lab samples

(This article was first published on R Views, and kindly contributed to R-bloggers)

Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio

Phillip (Armand) Bester is a medical scientist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa

In this post we complete our series on analysing the HIV pandemic in Africa. Previously we covered the bigger picture of HIV infection in Africa, and a pipeline for drug resistance testing of samples in the lab.

Then, in part 3 we saw that sometimes the same patient’s genotype must be repeatedly analysed in the lab, from samples taken years apart.

Let’s say we genotyped a patient five years ago and we have a current genotype sequence. It should be possible to retrieve the previous sequence from a database of sequences without relying only, or at all, on identifiers. People may change their surname when they remarry, and transcription errors can occur, which makes finding previous samples tedious and error-prone. So instead of using patient information to look for previous samples to include, we can use the sequence data itself and then confirm that the sequences belong to the same patient or investigate any irregularities. If we suspect mother-to-child transmission from our analysis, we confirm this with the healthcare worker who sent the sample.

In this final part, we discuss how the inter- and intra-patient HIV genetic distances were analyzed using logistic regression to gain insights into the probability distribution of these two classes. In other words, the goal is to find a way to tell whether two genetic samples are from the same person or from two different people.

Samples from the same person can have slightly different genetic sequences, due to mutations and other errors. This is especially relevant when comparing samples of genetic material from retroviruses.

Preliminary analysis

To help answer this question, we downloaded data from the Los Alamos HIV sequence database (specifically, Virus HIV-1, subtype C, genetic region POL CDS).

Each observation is the (dis)similarity distance between different samples.

## Warning: package 'ggplot2' was built under R version 3.5.2
pt_distance <- 
  read_csv("", col_types = "ccdccf")

## # A tibble: 6 x 6
##   sample1                sample2                 distance sub   area  type 
## 1 KI_797.67744.AB874124… KI_481.67593.AB873933.…   0.0644 B     INT   Inter
## 2 502-2794.39696.JF3202… WC3.27170.EF175209.B.U…   0.0418 B     INT   Inter
## 3 KI_882.67653.AB874186… KI_813.67589.AB874131.…   0.0347 B     INT   Inter
## 4 HTM360.13332.DQ322231… C11-2069070.63977.AB87…   0.0487 B     INT   Inter
## 5 O5598.34737.GQ372062.… LM49.4011.AF086817.B.T…   0.0360 B     INT   Inter
## 6 GKN.45901.HQ026515.B.… C11-2069083.65198.AB87…   0.0699 B     INT   Inter

Next, plot a histogram of the distance between samples. This clearly shows that the distance between samples of the same subject (intra-patient) is smaller than the distance between different subjects (inter-patient). This is not surprising.

However, the histogram also shows that there is no sharp demarcation between these types. Simply eyeballing the data suggests that one could use an arbitrary threshold of around 0.025 to indicate whether the samples are from the same person or from different people.

pt_distance %>% 
  mutate(
    type = forcats::fct_rev(type)
  ) %>% 
  ggplot(aes(x = distance, fill = type)) +
  geom_histogram(binwidth = 0.001) +
  facet_grid(rows = vars(type), scales = "free_y") +
  scale_fill_manual(values = c("red", "blue")) +
  coord_cartesian(xlim = c(0, 0.1)) +
  ggtitle("Histogram of phylogenetic distance by type")


Since we have two sample types (intra-patient vs inter-patient), this is a binary classification problem.

Logistic regression is a simple algorithm for binary classification, and a special case of a generalized linear model (GLM). In R, you can use the glm() function to fit a GLM, and to specify a logistic regression, use the family = binomial argument.

In this case we want to train a model with distance as the independent variable and type as the dependent variable, i.e. type ~ distance.

We train on 100,000 (n = 1e5) observations purely to reduce computation time:

pt_sample <- 
  pt_distance %>% 
  sample_n(1e5)
model <- glm(type ~ distance, data = pt_sample, family = binomial)
## Warning: fitted probabilities numerically 0 or 1 occurred

(Note that sometimes the model throws a warning indicating numerical problems. This happens because the overlap between intra and inter is very small. If there is a very sharp dividing line between classes, the logistic regression algorithm has trouble converging.)

However, in this case the numerical problems don’t actually cause a practical problem with the model itself.

The model summary tells us that the distance variable is highly significant (indicated by the ***):

## Call:
## glm(formula = type ~ distance, family = binomial, data = pt_sample)
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4035  -0.0050  -0.0010  -0.0002   8.4904  
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    5.7887     0.1796   32.23   <2e-16 ***
## distance    -355.1454     9.3247  -38.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 23659.2  on 99999  degrees of freedom
## Residual deviance:  1440.5  on 99998  degrees of freedom
## AIC: 1444.5
## Number of Fisher Scoring iterations: 12

Now we can use the model to compute a prediction for a range of genetic distances (from 0 to 0.05) and create a plot.

newdata <-  data.frame(distance = seq(0, 0.05, by = 0.001))
pred <- predict(model, newdata, type = "response")
plot_inter <- 
  pt_sample %>% 
  filter(distance <= 0.05, type == "Inter")
plot_intra <- 
  pt_sample %>% 
  filter(distance <= 0.05, type == "Intra")

threshold <-  with(newdata, approx(pred, distance, xout = 0.5))$y

ggplot() +
  geom_point(data = plot_inter, aes(x = distance, y = 0), alpha = 0.05, col = "blue") +
  geom_point(data = plot_intra, aes(x = distance, y = 1), alpha = 0.05, col = "red") +
  geom_rug(data = plot_inter, aes(x = distance, y = 0), col = "blue") +
  geom_rug(data = plot_intra, aes(x = distance, y = 0), col = "red") +
  geom_line(data = newdata, aes(x = distance, y = pred)) +
  annotate(x = 0.005, y = 0.9, label = "Type == intra", geom = "text", col = "red") +
  annotate(x = 0.04, y = 0.1, label = "Type == inter", geom = "text", col = "blue") +
  geom_vline(xintercept = threshold, col = "grey50") +
  ggtitle("Model results", subtitle = "Predicted probability that Type == 'Intra'") +
  xlab("Phylogenetic distance")

Logistic regression essentially fits an s-curve that indicates the probability. In this case, for small distances (lower than ~0.01) the probability of being the same person (i.e., type is intra) is almost 100%. For distances greater than 0.03 the probability of being type intra is almost zero (i.e., the model predicts type inter).

The model puts the distance threshold at approximately 0.016.
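The threshold can also be recovered by hand from the two coefficients in the model summary, because the fitted probability equals 0.5 exactly where the linear predictor (intercept + slope × distance) is zero. The sketch below is in Python rather than R and hard-codes the printed coefficients; since the training sample is random, a rerun of the model would give slightly different numbers.

```python
import math

# Coefficients copied from the model summary above.
intercept = 5.7887
slope = -355.1454

def prob_intra(distance):
    """Predicted probability that a pair of samples is intra-patient."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * distance)))

# The probability is exactly 0.5 where the linear predictor is zero.
threshold = -intercept / slope

print(threshold, prob_intra(0.0), prob_intra(0.03))
```

This agrees with the value of about 0.016 obtained by interpolating the predicted curve, and it shows how steeply the probability falls on either side of the threshold.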

The practical value of this work

In part 2, we discussed how researchers developed an automated pipeline of phylogenetic analysis. The project was designed to run on the Raspberry Pi, a very low-cost computing device. This meant that the cost of implementation of the project is low, and the project has been implemented at the National Health Laboratory Service (NHLS) in South Africa.

In this part, we described the very simple logistic regression model that runs as part of the pipeline. In addition to the descriptive analysis, e.g., heat maps and trees (as described in part 3), this logistic regression predicts whether two samples were obtained from the same person or from two different people. This prediction helps the laboratory staff to identify potential contamination of samples, or indeed to match samples from people who weren’t matched properly by their name and other identifying information (e.g., through spelling mistakes or name changes).

Finally, it’s interesting to note that traditionally the decision whether two samples were intra-patient or inter-patient was made on heuristics, instead of modelling. For example, a heuristic might say that if the genetic distance between two samples is less than 0.01, they should be considered a match from a single person.

Heuristics are easy to implement in the lab, but the origin of a heuristic can get lost over time. When that happens, it is possible that the heuristic is no longer applicable to the sample population.

This modelling gave the researchers a tool to establish confidence intervals around predictions. In addition, it is now possible to repeat the model for many different local sample populations of interest, and thus have a tool that is better able to discriminate given the most recent data.


In this multi-part series of HIV in Africa we covered four topics:

  • In part 1, we analysed the incidence of HIV in sub-Saharan Africa, with special mention of the effect of the widespread availability of anti-retroviral (ARV) drugs during 2004. Since then, there has been a rapid decline in HIV infection rates in South Africa.
  • In part 2, we described the PhyloPi project – a phylogenetic pipeline to analyse HIV in the lab, available for the low-cost Raspberry Pi. This work was published in the PLoS ONE journal: “PhyloPi: An affordable, purpose built phylogenetic pipeline for the HIV drug resistance testing facility”.
  • Then, part 3 described the biological mechanism by which HIV mutates, and how this can be modeled using a Markov chain and visualized as heat maps and phylogenetic trees.
  • This final part covered how we used a very simple logistic regression model to identify if two samples in the lab came from the same person or two different people.

Closing thoughts

Dear readers,

I hope that you enjoyed this series on ‘Analysing the HIV pandemic’ using R and some of the tools available as part of the tidyverse packages. Learning R provided me not only with a tool set to analyse data problems, but also a community. Being a biologist, I was not sure of the best approach for solving the problem of inter- and intra-patient genetic distances. I contacted Andrie from RStudio, and not only did he help us with this, but he was also excited about it. It was a pleasure telling you about our journey on this blog site, and a privilege doing this with experts.


To leave a comment for the author, please follow the link and comment on their blog: R Views.


Comparing Frequentist, Bayesian and Simulation methods and conclusions

(This article was first published on Posts on R Lover ! a programmer, and kindly contributed to R-bloggers)

So, a programmer, a frequentist, and a Bayesian walk into a bar. No, this post
isn’t really on the path to some politically incorrect stereotypical humor. Just
trying to make it fun and catch your attention. As the title implies, this post
is really about applying the differing viewpoints and methodologies inherent in
those approaches to statistics. To be honest, I’m not even going to spend a lot
of time on the details of the methods; I want to focus on the conclusions each
of these people would draw after analyzing the same data using their favored
methods. This post isn’t so much about how they would proceed as about what
they would conclude at the end of the analysis. I make no claim about which of
these fine people has the best way of doing things. Although I was raised a
frequentist, I am more and more enamored of Bayesian methods, and while my
tagline is accurate (“R Lover !a programmer”), I will admit a love for making
computers do the hard work for me, so if that involves a little programming,
I’m all for it.


Last week I saw a nice post from Anindya Mozumdar on the R Bloggers feed. The
topic material was fun for me (analyzing the performance of male 100m sprinters
and the fastest man on earth), as was exploring the concepts in Allen B.
Downey’s book Think Stats: Probability and Statistics for Programmers, which is
at my very favorite price point, free download, and Creative Commons licensed.
The post got me interested in following up a bit and thinking things through
“out loud” as a blog post.

This post is predicated on your having read that post.
Please do. I’ll wait until you come back.

Welcome back. I’m going to try to repeat as little as possible from that blog
post and just make comparisons. So, to continue our little teaser: a
programmer, a frequentist, and a Bayesian walk into a bar and start arguing
about whether Usain Bolt really is the fastest man on earth. The programmer has
told us how they would go about answering the question. The answer was:

There is only a 1.85% chance of seeing a difference as large as the observed
difference if there is actually no difference between the (median) timings of
Usain Bolt and Asafa Powell.

and was derived by counting observations across 10,000 simulations of the data
using the infer package and looking at differences between median timings. Our
null hypothesis was that there is no “real” difference between Bolt and Powell,
even though our data has a median for Bolt of 9.90 and a median for Powell of
9.95. That is, after all, a very small difference. But our simulation allows us
to reject that null hypothesis and favor the alternative that the difference is
real.
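For readers who want to see the mechanics of that kind of simulation, here is a minimal sketch of a permutation test on the difference in medians. It uses base R rather than the infer package, the timings are randomly generated stand-ins (not the real race data), and the one-sided comparison is an assumption about how the p-value was counted.

```r
set.seed(2019)
# Hypothetical timings, NOT the real data from the post.
bolt   <- round(rnorm(70, mean = 9.90, sd = 0.10), 2)
powell <- round(rnorm(95, mean = 9.95, sd = 0.08), 2)

timing <- c(bolt, powell)
runner <- rep(c("bolt", "powell"), times = c(length(bolt), length(powell)))

# Test statistic: Powell's median minus Bolt's median.
median_diff <- function(timing, runner) {
  median(timing[runner == "powell"]) - median(timing[runner == "bolt"])
}

observed <- median_diff(timing, runner)

# Under the null the runner labels are exchangeable, so shuffle them
# 10,000 times and recompute the statistic each time.
null_diffs <- replicate(10000, median_diff(timing, sample(runner)))

# Share of shuffled differences at least as large as the observed one.
p_value <- mean(null_diffs >= observed)
p_value
```

The p-value here is simply a count: how often label shuffling alone produces a gap as big as the one in the data.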

Should we be confident that we are 100% – 1.85% = 98% likely to be correct? NO!
As Downey notes:

For most problems, we only care about the order of magnitude: if the p-value
is smaller than 1/100, the effect is likely to be real; if it is greater than
1/10, probably not. If you think there is a difference between a 4.8%
(significant!) and 5.2% (not significant!), you are taking it too seriously.

Can we say that Bolt will win a race with Powell 98% of the time? Again a
resounding NO! Are we 98% certain that the “true” difference in their medians is
.05 seconds? NOPE.

Back to the future

Okay we’ve heard the programmer’s story at our little local bar. It’s time to
let our frequentist have their moment in the limelight. Technically the best
term would be Neyman-Pearson Frequentist but we’re not going to stand on
formality. It is nearly a century old and stands as an “improvement” on Fisher’s
significance testing. A nice little summary here on

I’m not here to belabor the nuances but frequentist methods are among the
oldest and arguably the most prevalent method in many fields. They are often the
first method people learned in college and sometimes the only method. They still
drive most of the published research in many fields although other methods are
taking root.

Before the frequentist can tell their tale, though, let’s make sure they have the
same data as in the earlier post. Let’s load all the libraries we’re going to
use and very quickly reproduce the process Anindya Mozumdar went through to
scrape and load the data. We’ll have a tibble named male_100 that contains
the requisite data, and we’ll confirm that the mean and median summaries for the
top 6 runners are identical to those in the earlier post. Note that I am
suppressing messages as the libraries load, since R 3.6.0 has gotten quite
chatty in this regard.


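library(rvest)        # read_html(), html_nodes(), html_text()
library(readr)        # read_fwf(), fwf_positions(), cols()
library(dplyr)        # %>%, select(), mutate(), filter(), summarise()
library(forcats)      # fct_reorder()
library(ggstatsplot)  # ggbetweenstats()
library(BayesFactor)  # anovaBF(), ttestBF(), extractBF()

# NOTE: the source URL was not preserved in this copy of the post
male_100_html <- read_html("")
male_100_pres <- male_100_html %>%
  html_nodes(xpath = "//pre")
male_100_htext <- male_100_pres %>%
  html_text()
male_100_htext <- male_100_htext[[1]]

male_100 <- readr::read_fwf(male_100_htext, skip = 1, n_max = 3178,
                            col_types = cols(.default = col_character()),
                            col_positions = fwf_positions(
                              c(1, 16, 27, 35, 66, 74, 86, 93, 123),
                              c(15, 26, 34, 65, 73, 85, 92, 122, 132)
                            ))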

male_100 <- male_100 %>%
  select(X2, X4) %>% 
  transmute(timing = X2, runner = X4) %>%
  mutate(timing = gsub("A", "", timing),
         timing = as.numeric(timing)) %>%
  filter(runner %in% c("Usain Bolt", "Asafa Powell", "Yohan Blake",
                       "Justin Gatlin", "Maurice Greene", "Tyson Gay")) %>%
  mutate_if(is.character, as.factor)

male_100
## # A tibble: 520 x 2
##    timing runner       
##  1   9.58 Usain Bolt   
##  2   9.63 Usain Bolt   
##  3   9.69 Usain Bolt   
##  4   9.69 Tyson Gay    
##  5   9.69 Yohan Blake  
##  6   9.71 Tyson Gay    
##  7   9.72 Usain Bolt   
##  8   9.72 Asafa Powell 
##  9   9.74 Asafa Powell 
## 10   9.74 Justin Gatlin
## # … with 510 more rows
male_100$runner <- forcats::fct_reorder(male_100$runner, male_100$timing)

male_100 %>%
  group_by(runner) %>%
  summarise(mean_timing = mean(timing)) %>%
  arrange(mean_timing)
## # A tibble: 6 x 2
##   runner         mean_timing
## 1 Usain Bolt            9.90
## 2 Asafa Powell          9.94
## 3 Tyson Gay             9.95
## 4 Justin Gatlin         9.96
## 5 Yohan Blake           9.96
## 6 Maurice Greene        9.97
male_100 %>%
  group_by(runner) %>%
  summarise(median_timing = median(timing)) %>%
  arrange(median_timing)
## # A tibble: 6 x 2
##   runner         median_timing
## 1 Usain Bolt              9.9 
## 2 Asafa Powell            9.95
## 3 Yohan Blake             9.96
## 4 Justin Gatlin           9.97
## 5 Maurice Greene          9.97
## 6 Tyson Gay               9.97

Most of the code above is simply shortened from the original post. The only
thing that is completely new is forcats::fct_reorder(male_100$runner, male_100$timing) which takes the runner factor and reorders it according to
the median by runner. This doesn’t matter for the calculations we’ll do but it
will make the plots look nicer.
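To see what that reordering does, here is a tiny sketch with made-up labels and timings. It uses base R's reorder() as a stand-in so the example needs no packages; forcats::fct_reorder() likewise uses the median as its default summary function.

```r
# Hypothetical data: three runners, two timings each.
timing <- c(9.9, 10.1, 9.7, 9.8, 10.0, 10.2)
runner <- factor(c("B", "B", "A", "A", "C", "C"), levels = c("C", "B", "A"))

levels(runner)  # original level order: "C" "B" "A"

# Reorder the levels by each runner's median timing (fastest first).
runner_by_median <- reorder(runner, timing, FUN = median)
levels(runner_by_median)  # "A" "B" "C"
```

Only the level order changes; the values themselves stay attached to the same observations, which is why the calculations are unaffected.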

Testing, testing!

One of the issues with a frequentist approach compared to a programmer’s approach
is that there are a lot of different tests you could choose. But in this case,
wearing my frequentist hat, there really are only two choices: a one-way ANOVA or
the Kruskal-Wallis test, which uses ranks and eliminates some assumptions.

This also gives me a chance to talk about a great package that supports both
frequentist and Bayesian methods and completely integrates visualizing your
data with analyzing your data, which IMHO is the only way to go. The package is
ggstatsplot. Full disclosure: I’m a minor contributor to the package, but please
know that the true hero of the package is Indrajeet Patil. It’s stable, mature,
well tested and well maintained. Try it out.

So let’s assume we want to run a classic one-way ANOVA first (Welch’s method, so
we don’t have to assume equal variances across groups). Assuming that the
omnibus F test is significant, let’s say we’d like to look at the pairwise
comparisons and adjust the p values for multiple comparisons using Holm. We’re
big fans of visualizing the data by runner, and of course we’d like to plot
things like the mean and median and the number of races per runner. We’d also
like to know our effect size (we’ll stick with eta squared), and we’d like it
all as elegant as possible.

Doing this analysis using frequentist methods in R is not difficult. Heck,
I’ve even blogged about it myself, it’s so “easy”. The benefit of
ggbetweenstats from ggstatsplot is that it pretty much allows you to do just
about everything in one command, seamlessly mixing the plot and the results
into one output. We’re only going to scratch the surface of all the
customization possibilities.

ggbetweenstats(data = male_100, 
               x = runner, 
               y = timing,
               type = "p",
               var.equal = FALSE,
               pairwise.comparisons = TRUE,
               partial = FALSE,
               effsize.type = "biased",
               point.jitter.height = 0, 
               title = "Parametric (Mean) testing assuming unequal variances",
               ggplot.component = ggplot2::scale_y_continuous(breaks = seq(9.6, 10.4, .2), 
                                                              limits = (c(9.6,10.4))),
               messages = FALSE
)

Our conclusion is similar to that drawn by simulation. We can clearly reject the
null that all these runners have the same mean time. Using Games-Howell and
controlling for multiple comparisons with Holm, however, we can only show
support for the difference between Usain Bolt and Maurice Greene. There is
insufficient evidence to reject the null for all the other possible pairings.
(You can actually tell ggbetweenstats to show the p value for all the pairings
but that gets cluttered quickly).
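If you want to see the same two steps without the plot, here is a minimal base R sketch on simulated timings (the three runner labels and their numbers are hypothetical). Note the swap: it uses pairwise Welch t-tests with a Holm adjustment, not the Games-Howell procedure, so the pairwise p values will not match ggbetweenstats exactly.

```r
set.seed(1)
# Simulated data: three hypothetical runners with unequal variances.
sim <- data.frame(
  runner = factor(rep(c("A", "B", "C"), times = c(40, 50, 60))),
  timing = c(rnorm(40, 9.90, 0.10), rnorm(50, 9.95, 0.08), rnorm(60, 9.97, 0.06))
)

# Omnibus test: Welch's one-way ANOVA (no equal-variance assumption).
omnibus <- oneway.test(timing ~ runner, data = sim, var.equal = FALSE)
omnibus$p.value

# Pairwise Welch t-tests (pool.sd = FALSE), Holm-adjusted for multiplicity.
pairwise <- pairwise.t.test(sim$timing, sim$runner,
                            p.adjust.method = "holm", pool.sd = FALSE)
pairwise$p.value
```

The pairwise result is a lower-triangular matrix of adjusted p values, one cell per pair of runners.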

From a frequentist perspective there is a whole set of non-parametric tests
that are available for use. They typically make fewer assumptions about the data
we have and often operate on the ranks of the outcome variable (timing) rather
than on the raw values.

The only thing we need to change in our input to the function is type = "np" and the title.

ggbetweenstats(data = male_100, 
               x = runner, 
               y = timing,
               type = "np",
               var.equal = FALSE,
               pairwise.comparisons = TRUE,
               partial = FALSE,
               effsize.type = "biased",
               point.jitter.height = 0, 
               title = "Non-Parametric (Rank) testing",
               ggplot.component = ggplot2::scale_y_continuous(breaks = seq(9.6, 10.4, .2), 
                                                              limits = (c(9.6,10.4))),
               messages = FALSE
)

Without getting overly consumed by the exact numbers, note the very similar
results for the overall test, but also that we are now more confident about
the difference between Usain Bolt and Justin Gatlin. I highlight that
because there is a common misconception that non-parametric tests are always less
powerful (sensitive) than their parametric cousins.
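A minimal base R sketch of the rank-based route, again on simulated timings with hypothetical runner labels. The pairwise step uses Holm-adjusted Wilcoxon rank-sum tests as a stand-in; it is an assumption that this approximates, rather than reproduces, whatever pairwise procedure ggbetweenstats applies for type = "np".

```r
set.seed(2)
# Simulated data: three hypothetical runners.
sim <- data.frame(
  runner = factor(rep(c("A", "B", "C"), times = c(40, 50, 60))),
  timing = c(rnorm(40, 9.90, 0.10), rnorm(50, 9.95, 0.08), rnorm(60, 9.97, 0.06))
)

# Omnibus rank test.
kw <- kruskal.test(timing ~ runner, data = sim)

# The test only "sees" ranks: replacing timings by their ranks
# leaves the statistic unchanged.
kw_on_ranks <- kruskal.test(rank(timing) ~ runner, data = sim)
c(raw = unname(kw$statistic), ranks = unname(kw_on_ranks$statistic))

# Pairwise rank-sum tests with Holm adjustment.
pw <- pairwise.wilcox.test(sim$timing, sim$runner, p.adjust.method = "holm")
pw$p.value
```

Because only the ordering matters, a handful of unusually slow or fast races has much less pull on the result than it would on a mean.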

Asking the question differently (see Learning Statistics with R)

Much of the credit for this section goes to Danielle Navarro (bookdown translation: Emily Kothe) in Learning Statistics with R.

It usually takes several lessons or even an entire semester to teach the
frequentist method, because null hypothesis testing is a very elaborate
contraption that people (well, in my experience, very smart undergraduate
students) find very hard to master. In contrast, the Bayesian approach to
hypothesis testing “feels” far more intuitive. Let’s apply it to our current
question.

We’re at the bar, the three of us, wondering whether Usain Bolt is really the
fastest or whether all these individual data points really are just a random
mosaic of data noise. Both the programmer and the frequentist set the testing up
conceptually the same way: can we use the data to reject the null that all the
runners are the same? Convinced they’re not all the same, they applied the same
general procedure to reject (or not) the hypothesis that any pair was the same,
for example Bolt versus Powell (for the record I’m not related to either). They
differ in computational methods and assumptions but not in overarching method.

At the end of their machinations they have no ability to talk about how likely
(probable) it is that runner 1 will beat runner 2. Oftentimes that’s exactly
what you really want to know. There are two hypotheses that we want to compare,
a null hypothesis h0 that all the runners run equally fast and an alternative
hypothesis h1 that they don’t. Prior to looking at the data, while we’re
sitting at the bar, we have no real strong belief about which hypothesis is true
(odds are 1:1 in our naive state). We have our data and we want it to inform our
thinking. Unlike frequentist statistics, Bayesian statistics allows us to talk
about the probability that the null hypothesis is true (which is a complete
no-no in a frequentist context). Better yet, it allows us to calculate the
posterior probability of the null hypothesis, using Bayes’ rule and our data.

In practice, most Bayesian data analysts tend not to talk in terms of the raw
posterior probabilities. Instead, we/they tend to talk in terms of the posterior
odds ratio. Think of it like betting. Suppose, for instance, the posterior
probability of the null hypothesis is 25%, and the posterior probability of the
alternative is 75%. The alternative hypothesis h1 is three times as probable as the
null h0, so we say that the odds are 3:1 in favor of the alternative.
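The arithmetic behind that betting language is short enough to write down. This is a minimal sketch with hypothetical helper names: posterior odds are the prior odds multiplied by the Bayes factor, and odds and probabilities convert back and forth with one line each.

```r
# Hypothetical helpers, named for this illustration only.
prob_to_odds   <- function(p) p / (1 - p)
odds_to_prob   <- function(odds) odds / (1 + odds)
posterior_odds <- function(bayes_factor, prior_odds = 1) bayes_factor * prior_odds

# The example from the text: 75% for the alternative vs 25% for the null.
prob_to_odds(0.75)                        # 3, i.e. odds of 3:1

# Starting from 1:1 prior odds, a Bayes factor of 3 gives 3:1 posterior
# odds, which is a 75% posterior probability for the alternative.
odds_to_prob(posterior_odds(3, prior_odds = 1))   # 0.75
```

With 1:1 prior odds the Bayes factor and the posterior odds are the same number, which is why the two get used interchangeably at the bar.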

At the end of the Bayesian’s efforts they can make what feel like very natural
statements of interest, for example, “The evidence provided by our data
corresponds to odds of 42:1 that these runners are not all equally fast.”

Let’s try it using ggbetweenstats…

ggbetweenstats(data = male_100, 
               x = runner, 
               y = timing,
               type = "bf",
               var.equal = FALSE,
               pairwise.comparisons = TRUE,
               partial = FALSE,
               effsize.type = "biased",
               point.jitter.height = 0, 
               title = "Bayesian testing",
               messages = FALSE
)

Yikes! Not what I wanted to see in the bar. The pairwise comparisons have gone
away (we’ll get them back) and, worse yet, what the heck does loge(BF10) = 2.9
mean? I hate log conversions; I was promised a real number like 42:1! And who’s
Cauchy, and why is he there at 0.707?

Let’s break this down. loge(BF10) = 2.9 means BF10 = exp(2.9), or about 18, so
the good news is the odds are better than 18:1 that the runners are not equally
fast. Since rounding no doubt loses some accuracy, let’s use the BayesFactor
package directly and get a more accurate answer before we round.
anovaBF(timing ~ runner, data = male_100, rscaleFixed = .707) is what we want,
where rscaleFixed = .707 ensures we have the right Cauchy value.

anovaBF(timing ~ runner, data = male_100, rscaleFixed = .707)
## Bayes factor analysis
## --------------
## [1] runner : 19.04071 ±0.01%
## Against denominator:
##   Intercept only 
## ---
## Bayes factor type: BFlinearModel, JZS

Okay, that’s better. To Bayesian thinking, the odds are 19:1 against the hypothesis that they all run about the same speed, or 19:1 that they run at different speeds.

Hmmm. One of the strengths/weaknesses of the Bayesian approach is that people
can have their own sense of how strong 19:1 is. I like those odds. One of the
really nice things about the Bayes factor is the numbers are inherently
meaningful. If you run the data and you compute a Bayes factor of 4, it means
that the evidence provided by your data corresponds to betting odds of 4:1 in
favor of the alternative. However, there have been some attempts to quantify the
standards of evidence that would be considered meaningful in a scientific
context. One that is widely used is from Kass and Raftery (1995). (N.B. there are others, and I have deliberately selected one of the most conservative standards.)

Bayes factor value    Interpretation
1 – 3                 Negligible evidence
3 – 20                Positive evidence
20 – 150              Strong evidence
> 150                 Very strong evidence
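A small sketch that turns the table above into a lookup and also undoes the log conversion that caused the earlier grumbling. The function name is hypothetical, and treating each band as closed on the left (so that exactly 3 counts as "Positive") is my assumption, since the table does not say which side the boundaries belong to.

```r
# Hypothetical helper: label a Bayes factor (BF10 >= 1) using the bands above.
interpret_bf <- function(bf) {
  as.character(cut(bf,
                   breaks = c(1, 3, 20, 150, Inf),
                   labels = c("Negligible evidence", "Positive evidence",
                              "Strong evidence", "Very strong evidence"),
                   right = FALSE))  # bands are [1,3), [3,20), [20,150), [150,Inf)
}

# Undo the natural-log scale reported in the plot subtitle.
bf_from_log <- exp(2.9)
bf_from_log                       # about 18.17

interpret_bf(bf_from_log)         # "Positive evidence"
interpret_bf(19.04071)            # "Positive evidence"
interpret_bf(c(2, 21, 1355))      # one label per value
```

Values below 1 (evidence for the null) fall outside the table and come back as NA here.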

Okay we have “positive evidence” and we can quantify it, that’s good. But what
about all the pairwise comparisons? Can we take this down to all the
individual pairings? I’m on the edge of my bar stool here. What are the odds
Bolt really is faster than Powell? Can we quantify that without somehow breaking
the multiple comparisons rule?

The short answer is yes we can safely extend this methodology to incorporate
pairwise comparisons. We shouldn’t abuse the method and we should fit our model
with the best possible prior information but in general, as simulated

With Bayesian inference (and the correct prior), though, this problem
disappears. Amazingly enough, you don’t have to correct Bayesian inferences for
multiple comparisons.

With that in mind let’s build a quick little function that will allow us to pass
a data source and two names and run a Bayesian t-test via BayesFactor::ttestBF
to compare two runners. ttestBF returns a lot of info in a custom object so
we’ll use the extractBF function to grab it in a format where we can pluck out
the actual BF10.

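compare_runners_bf <- function(df, runner1, runner2) {
  # keep only the two runners; BayesFactor expects a plain data.frame
  ds <- df %>%
    filter(runner %in% c(runner1, runner2)) %>%
    droplevels() %>%
    as.data.frame()
  zzz <- ttestBF(formula = timing ~ runner, data = ds)
  yyy <- extractBF(zzz)  # data frame whose bf column holds BF10
  xxx <- paste0("The evidence provided by the data corresponds to odds of ", 
                round(yyy$bf, 0),
                ":1 that ", 
                runner1,
                " is faster than ",
                runner2 )
  return(xxx)
}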

Now that we have a function, we can see what the odds are that Bolt is faster
than each of the other 5 and print them one by one.

compare_runners_bf(male_100, "Usain Bolt", "Asafa Powell")
## [1] "The evidence provided by the data corresponds to odds of 5:1 that Usain Bolt is faster than Asafa Powell"
compare_runners_bf(male_100, "Usain Bolt", "Tyson Gay")
## [1] "The evidence provided by the data corresponds to odds of 5:1 that Usain Bolt is faster than Tyson Gay"
compare_runners_bf(male_100, "Usain Bolt", "Justin Gatlin")
## [1] "The evidence provided by the data corresponds to odds of 21:1 that Usain Bolt is faster than Justin Gatlin"
compare_runners_bf(male_100, "Usain Bolt", "Yohan Blake")
## [1] "The evidence provided by the data corresponds to odds of 8:1 that Usain Bolt is faster than Yohan Blake"
compare_runners_bf(male_100, "Usain Bolt", "Maurice Greene")
## [1] "The evidence provided by the data corresponds to odds of 1355:1 that Usain Bolt is faster than Maurice Greene"

Okay, now I feel like we’re getting somewhere with our bar discussions. Should I
feel inclined to make a little wager on, say, who buys the next round of drinks,
as a Bayesian I have some nice useful information. I’m not rejecting a null
hypothesis; I’m casting the information I have as a statement of the odds I think
I have of “winning”.

But of course this isn’t the whole story so please read on…

Who’s Cauchy and why does he matter?

Earlier I made light of the fact that the output from ggbetweenstats had
rCauchy = 0.707 and anovaBF uses rscaleFixed = .707. Now we need to spend
a little time actually understanding what that’s all about. Cauchy is
Augustin-Louis Cauchy, and the reason that’s relevant is that BayesFactor makes
use of his distribution as a default prior. I’m not even going to try to take
you into the details of the math, but it is important we have a decent
understanding of what we’re doing to our data.

The BayesFactor package has a few built-in “default” named settings. They all
have the same shape; they only differ by their scale, denoted by r. The three
named defaults are medium = 0.707, wide = 1, and ultrawide = 1.414. “Medium” is
the default. The scale controls how large, on average, the expected true effect
sizes are. For a particular scale, 50% of the true effect sizes are within the
interval (−r,r). For the default scale of “medium”, 50% of the prior effect
sizes are within the range (−0.7071,0.7071). Increasing r increases the sizes of
expected effects; decreasing r decreases the size of the expected effects.

BayesFactor blog site – February 23, 2014
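The "50% within (−r, r)" property is easy to check with base R's Cauchy distribution functions. This is only a sketch of the prior's shape; it does not call BayesFactor, and the variable names are mine.

```r
# The three named scales quoted above.
r_scales <- c(medium = sqrt(2) / 2, wide = 1, ultrawide = sqrt(2))

# Prior mass between -r and r for a Cauchy centered at 0 with scale r.
mass_within_r <- sapply(r_scales, function(r) {
  pcauchy(r, location = 0, scale = r) - pcauchy(-r, location = 0, scale = r)
})
mass_within_r   # 0.5 for every scale

# Prior mass inside the fixed "medium" window (-0.7071, 0.7071):
# it shrinks as r grows, i.e. wider priors expect larger effects.
mass_in_medium_window <- sapply(r_scales, function(r) {
  2 * pcauchy(sqrt(2) / 2, location = 0, scale = r) - 1
})
mass_in_medium_window
```

So r is simply the half-width of the interval that holds the middle half of the prior's effect sizes.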


Let’s compare it to a frequentist test we’re all likely to know, the t-test
(we’ll use the Welch variant). Our initial hypothesis is that Bolt’s mean times
are different from Powell’s mean times (two-sided), and then we test the
one-sided hypothesis that Bolt is faster. Then let’s go a little crazy and run
it one-sided but specify the mean difference that we “see” in our data, 0.038403
of a second faster, via mu = -0.038403.

  justtwo <- male_100 %>%
    filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>%
    droplevels() %>%
    as.data.frame()
  t.test(formula = timing ~ runner, data = justtwo)
##  Welch Two Sample t-test
## data:  timing by runner
## t = -2.5133, df = 111.58, p-value = 0.01339
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06868030 -0.00812721
## sample estimates:
##   mean in group Usain Bolt mean in group Asafa Powell 
##                   9.904930                   9.943333
  t.test(formula = timing ~ runner, data = justtwo, alternative = "less")
##  Welch Two Sample t-test
## data:  timing by runner
## t = -2.5133, df = 111.58, p-value = 0.006694
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##         -Inf -0.01306002
## sample estimates:
##   mean in group Usain Bolt mean in group Asafa Powell 
##                   9.904930                   9.943333
  t.test(formula = timing ~ runner, data = justtwo, alternative = "less", mu = -0.038403)
##  Welch Two Sample t-test
## data:  timing by runner
## t = -4.9468e-05, df = 111.58, p-value = 0.5
## alternative hypothesis: true difference in means is less than -0.038403
## 95 percent confidence interval:
##         -Inf -0.01306002
## sample estimates:
##   mean in group Usain Bolt mean in group Asafa Powell 
##                   9.904930                   9.943333

Hopefully that last one didn’t trip you up and you recognized that, by
definition, if the mean difference in our sample data is -0.038403 and we test
against mu = -0.038403, the t statistic is essentially zero and the p value is
0.5, a 50/50 proposition.
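The same thing can be shown on any two samples. In this minimal sketch the timings are simulated (not the real data): testing against a null value equal to the observed mean difference forces the t statistic to zero and the one-sided p value to 0.5.

```r
set.seed(3)
# Simulated stand-ins for two runners' timings.
x <- rnorm(70, mean = 9.90, sd = 0.10)
y <- rnorm(95, mean = 9.94, sd = 0.08)

observed_diff <- mean(x) - mean(y)

# Welch t-test (the default) against mu = the difference we already observed.
res <- t.test(x, y, alternative = "less", mu = observed_diff)
unname(res$statistic)   # essentially 0
res$p.value             # 0.5
```

A p value of 0.5 here says nothing about the runners; it only says the data sit exactly on the value we chose to test against.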

Let’s first just try different rCauchy values with ttestBF.

  justtwo <- male_100 %>%
    filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>%
    droplevels() %>%
    as.data.frame()
  ttestBF(formula = timing ~ runner, data = justtwo, rscale = "medium")
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 5.164791 ±0%
## Against denominator:
##   Null, mu1-mu2 = 0 
## ---
## Bayes factor type: BFindepSample, JZS
  ttestBF(formula = timing ~ runner, data = justtwo, rscale = "wide")
## Bayes factor analysis
## --------------
## [1] Alt., r=1 : 4.133431 ±0%
## Against denominator:
##   Null, mu1-mu2 = 0 
## ---
## Bayes factor type: BFindepSample, JZS
  ttestBF(formula = timing ~ runner, data = justtwo, rscale = .2)
## Bayes factor analysis
## --------------
## [1] Alt., r=0.2 : 6.104113 ±0%
## Against denominator:
##   Null, mu1-mu2 = 0 
## ---
## Bayes factor type: BFindepSample, JZS

Okay the default medium returns just what we reported earlier 5:1 odds. Going
wider gets us 4:1 and going narrower (believing the difference is smaller) takes
us to 6:1. Not huge differences but noticeable and driven by our data.

Let’s investigate directional hypotheses with ttestBF. First let’s ask: what’s
the evidence that Bolt is faster than Powell? NB: the order is driven by the
factor levels in the dataframe, not the order in the filter command below. Also
note that faster is a lower number.

  justtwo <- male_100 %>%
    filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>%
    droplevels() %>%
    as.data.frame()
  # notice these two just return the same answer in a different order
  ttestBF(formula = timing ~ runner, data = justtwo, nullInterval = c(0, Inf))
## Bayes factor analysis
## --------------
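## [1] Alt., r=0.707 0<d<Inf    : 0.04312062 ±0%
## [2] Alt., r=0.707 !(0<d<Inf) : 10.28646   ±0%
## Against denominator:
##   Null, mu1-mu2 = 0 
## ---
## Bayes factor type: BFindepSample, JZS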
  ttestBF(formula = timing ~ runner, data = justtwo, nullInterval = c(-Inf, 0))
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 -Inf<d<0    : 10.28646   ±0%
## [2] Alt., r=0.707 !(-Inf<d<0) : 0.04312062 ±0%
## Against denominator:
##   Null, mu1-mu2 = 0 
## ---
## Bayes factor type: BFindepSample, JZS

So the odds that Bolt has a bigger number, i.e. is slower than Powell, are
0.04:1, and conversely the odds that Bolt has a smaller timing (is faster) are
10:1. You can feel free to put these in the order that makes the most sense to
your workflow. They’re always going to be mirror images.

And yes in some circumstances it is perfectly rational to combine the
information by dividing those odds. See this blog post for a research

but accomplishing it is trivial. Running this code snippet essentially combines
what we know in both directions of the hypothesis.

  justtwo <- male_100 %>%
    filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>%
    droplevels() %>%
    as.data.frame()
  powellvbolt <- ttestBF(formula = timing ~ runner, data = justtwo, nullInterval = c(-Inf, 0))
  # divide the "Bolt is faster" interval by its complement
  powellvbolt[1]/powellvbolt[2]
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 -Inf<d<0 : 238.5509 ±0%
## Against denominator:
##   Alternative, r = 0.707106781186548, mu =/= 0 !(-Inf<d<0) 
## ---
## Bayes factor type: BFindepSample, JZS
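That combined number is nothing more exotic than the ratio of the two directional Bayes factors reported earlier. Both were computed against the same point null, so the null cancels in the division. A quick check using the values printed above (the variable names are mine):

```r
# Directional Bayes factors from the earlier output, each against mu1 - mu2 = 0.
bf_bolt_faster <- 10.28646
bf_bolt_slower <- 0.04312062

# Evidence for "Bolt is faster" relative to "Bolt is slower".
bf_faster_vs_slower <- bf_bolt_faster / bf_bolt_slower
bf_faster_vs_slower   # about 238.55
```

Read it as: if the only two options on the table are "faster" and "slower", the data favor "faster" by roughly 239:1.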

What have I learned

  • All three approaches yielded similar answers to our question at the bar.
  • Frequentist methods have stood the test of time and are pervasive in the
    literature.
  • Computational methods like resampling allow us to free ourselves
    from some of the restrictions and assumptions in classical hypothesis testing
    in an age when compute power is cheap.
  • Bayesian methods allow us to speak in
    the very human vernacular language of probabilities about our evidence.


I hope you’ve found this useful. I am always open to comments, corrections and
suggestions.

Chuck (ibecav at gmail dot com)

To leave a comment for the author, please follow the link and comment on their blog: Posts on R Lover ! a programmer.


The race to become Britain’s next PM

Rivals to succeed Theresa May are jockeying for position


May 22, 2019

MRAN snapshots, and you

For almost five years, the entire CRAN repository of R packages has been archived on a daily basis at MRAN. If you use CRAN snapshots from MRAN, we'd love to hear how you use them in this survey. If you're not familiar with the concept, or just want to learn more, read on.

Every day since September 17, 2014, we (Microsoft and, before the acquisition, Revolution Analytics) have archived a snapshot of the entire CRAN repository as a service to the R community. These daily snapshots have several uses:

  • As a longer-term archive of binary R packages. (CRAN keeps an archive of package source versions, but binary versions of packages are kept for a limited time. CRAN keeps package binaries only for the current R version and the prior major version, and only for the latest version of the package). 
  • As a static CRAN repository you can use like the standard CRAN repository, but frozen in time. This means changes to CRAN packages won't affect the behavior of R scripts in the future (for better or worse). options(repos="") provides a CRAN repository that works with R 3.3.3, for example — and you can choose any date since September 17, 2014.
  • The checkpoint package on CRAN provides a simple interface to these CRAN snapshots, allowing you to use a specific CRAN snapshot by specifying a date, and making it easy to manage multiple R projects each using different snapshots.
  • Microsoft R Open, Microsoft R Client, Microsoft ML Server and SQL Server ML Services all use fixed CRAN repository snapshots from MRAN by default.
  • The rocker project provides container instances for historical versions of R, tied to an appropriate CRAN snapshot from MRAN suitable for the corresponding R version.
MRAN time machine
Browse the MRAN time machine to find specific CRAN snapshots by date. (Tip: click the R logo to open the snapshot URL in its own new window.)

MRAN and the CRAN snapshot system were created at a time when reproducibility was an emerging concept in the R ecosystem. Now, there are several methods available to ensure that your R code works consistently, even as R and CRAN change. Beyond virtualization and containers, you have packages like packrat and miniCRAN, RStudio's package manager, and the full suite of tools for reproducible research.

As CRAN has grown and changes to packages have become more frequent, maintaining MRAN is an increasingly resource-intensive process. We're contemplating changes, like changing the frequency of snapshots, or thinning the archive of snapshots that haven't been used. But before we do that, we'd like to hear from the community. Have you used MRAN snapshots? If so, how are you using them? How many different snapshots have you used, and how often do you change that up? Please leave your feedback at the survey link below by June 14, and we'll use the feedback we gather in our decision-making process. Responses are anonymous, and we'll summarize the responses in a future blog post. Thanks in advance!

Take the MRAN survey here.


MRAN snapshots, and you

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

For almost five years, the entire CRAN repository of R packages has been archived on a daily basis at MRAN. If you use CRAN snapshots from MRAN, we'd love to hear how you use them in this survey. If you're not familiar with the concept, or just want to learn more, read on.

Every day since September 17, 2014, we (Microsoft and, before the acquisition, Revolution Analytics) have archived a snapshot of the entire CRAN repository as a service to the R community. These daily snapshots have several uses:

  • As a longer-term archive of binary R packages. (CRAN keeps an archive of package source versions, but binary versions of packages are kept for a limited time: CRAN keeps package binaries only for the current R version and the prior major version, and only for the latest version of the package.)
  • As a static CRAN repository you can use like the standard CRAN repository, but frozen in time. This means changes to CRAN packages won't affect the behavior of R scripts in the future (for better or worse). options(repos="") provides a CRAN repository that works with R 3.3.3, for example — and you can choose any date since September 17, 2014.
  • The checkpoint package on CRAN provides a simple interface to these CRAN snapshots, allowing you to use a specific CRAN snapshot by specifying a date, and making it easy to manage multiple R projects, each using a different snapshot.
  • Microsoft R Open, Microsoft R Client, Microsoft ML Server and SQL Server ML Services all use fixed CRAN repository snapshots from MRAN by default.
  • The rocker project provides container instances for historical versions of R, tied to an appropriate CRAN snapshot from MRAN suitable for the corresponding R version.
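The "frozen in time" repository idea above can be sketched in a few lines. This is only an illustration, not the checkpoint package's method: the URL pattern in `MRAN_SNAPSHOT_BASE`, the helper names `snapshot_url` and `repos_option`, and the example date are all assumptions, since the snapshot URL in the post was not preserved. The sketch builds the R `options(repos = ...)` statement for a chosen snapshot date and rejects dates before the first snapshot.

```python
from datetime import date

# Assumption: snapshot repositories live under this base URL, one path per day.
MRAN_SNAPSHOT_BASE = "https://mran.microsoft.com/snapshot"
FIRST_SNAPSHOT = date(2014, 9, 17)  # first daily snapshot, as stated in the post


def snapshot_url(day: date) -> str:
    """Return the (assumed) repository URL for the CRAN snapshot taken on `day`."""
    if day < FIRST_SNAPSHOT:
        raise ValueError(f"no snapshots exist before {FIRST_SNAPSHOT.isoformat()}")
    return f"{MRAN_SNAPSHOT_BASE}/{day.isoformat()}"


def repos_option(day: date) -> str:
    """Build the R statement that points a session at that snapshot."""
    return f'options(repos = c(CRAN = "{snapshot_url(day)}"))'


# Hypothetical date, chosen only for illustration.
print(repos_option(date(2017, 3, 15)))
```

Pasting the printed statement at the top of an R script would make `install.packages()` resolve packages as they were on that day, which is the behaviour the bullet on static repositories describes.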
MRAN time machine

Browse the MRAN time machine to find specific CRAN snapshots by date. (Tip: click the R logo to open the snapshot URL in its own new window.)

MRAN and the CRAN snapshot system were created at a time when reproducibility was an emerging concept in the R ecosystem. Now, there are several methods available to ensure that your R code works consistently, even as R and CRAN change. Beyond virtualization and containers, you have packages like packrat and miniCRAN, RStudio's package manager, and the full suite of tools for reproducible research.

As CRAN has grown and changes to packages have become more frequent, maintaining MRAN has become an increasingly resource-intensive process. We're contemplating changes, like changing the frequency of snapshots, or thinning the archive of snapshots that haven't been used. But before we do that, we'd like to hear from the community. Have you used MRAN snapshots? If so, how are you using them? How many different snapshots have you used, and how often do you change that up? Please leave your feedback at the survey link below by June 14, and we'll use the feedback we gather in our decision-making process. Responses are anonymous, and we'll summarize the responses in a future blog post. Thanks in advance!

Take the MRAN survey here.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Book Memo: “Distributed Computing in Big Data Analytics”

Big data technologies are used to achieve any type of analytics in a fast and predictable way, thus enabling better human- and machine-level decision making. Principles of distributed computing are the keys to big data technologies and analytics. The mechanisms related to data storage, data access, data transfer, visualization and predictive modeling using distributed processing across multiple low-cost machines are the key considerations that make big data analytics possible within a stipulated cost and time, and practical for consumption by humans and machines. However, the current literature on big data analytics lacks a holistic perspective that highlights the relation between big data analytics and distributed processing for ease of understanding and practitioner use. This book fills that gap by addressing key aspects of distributed processing in big data analytics. The chapters tackle the essential concepts and patterns of distributed computing widely used in big data analytics. The book also covers the main technologies which support distributed processing. Finally, it provides insight into applications of big data analytics, highlighting how principles of distributed computing are used in those situations. Practitioners and researchers alike will find this book a valuable tool for their work, helping them to select the appropriate technologies, while understanding the inherent strengths and drawbacks of those technologies.

Continue Reading…


Read More

Deep (learning) like Jacques Cousteau – Part 5 – Vector addition

(This article was first published on Embracing the Random | R, and kindly contributed to R-bloggers)

(TL;DR: You can add vectors that have the same number of elements.)

LaTeX and MathJax warning for those viewing my feed: please view
directly on website!

You want to know how to rhyme, you better learn how to add
It’s mathematics

Mos Def in ‘Mathematics’

Last time, we learnt about scalar multiplication. Let’s get to adding vectors!

Today’s topic: vector addition

We will follow the notation in Goodfellow, Ian, et al. If our vector
$\boldsymbol{x}$ has $n$ elements that are real numbers, then we can say
that $\boldsymbol{x}$ is an $n$-dimensional vector.

We can also say that $\boldsymbol{x}$ lies in some set of all vectors
that have the same dimensions as itself, $\mathbb{R}^n$. This might be a
bit abstract at first, but it’s not too bad at all.

Let’s define two vectors:

They have two elements. They are, therefore, two-dimensional vectors.
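As a minimal sketch of the TL;DR (vectors can be added when they have the same number of elements), the snippet below adds two vectors element by element. The function name `add_vectors` and the values of `x` and `y` are hypothetical and chosen only for illustration; they are not the vectors defined in the post.

```python
def add_vectors(u, v):
    """Add two vectors element by element; both must have the same number of elements."""
    if len(u) != len(v):
        raise ValueError(
            f"cannot add a {len(u)}-dimensional vector to a {len(v)}-dimensional one"
        )
    return [a + b for a, b in zip(u, v)]


# Hypothetical two-dimensional vectors of real numbers.
x = [1.0, 2.0]
y = [3.0, 4.0]

total = add_vectors(x, y)
print(total)  # → [4.0, 6.0]
```

The length check is the whole point: a two-dimensional vector and a three-dimensional vector do not lie in the same set, so their sum is not defined.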

What are some other two dimensional vectors made up of real numbers? We
could have:

There are infinitely many vectors with two elements that we could come up
with! How can we describe this infinite set of vectors made up of real
numbers? We can say that our two-dimensional vectors made up of real
numbers lie in this set:

$$\mathbb{R} \times \mathbb{R}$$

What in the world does this mean?

Cartesian products

To understand what the above product means, let’s use a simplified
example. Let’s define sets of integers like so:

What is the result of ? We can depict this operation in a

Continue Reading…


Read More

Top KDnuggets Tweets, May 15 – 21: 7 Steps to Mastering SQL for Data Science — 2019 Edition

Also: The Data Fabric for Machine Learning; 10 Free Must-Read Books for ML and Data Science; Another 10 Free Must-See Courses for Machine Learning and Data Science; WTF is a Tensor?!?

Continue Reading…


Read More

New Color Palette for R

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

As I was preparing some graphics for a presentation recently, I started digging into some of the different color palette options. My motivation was entirely about creating graphics that weren’t too visually overwhelming, which I found the default “rainbow” palette to be.

But as the creators of the viridis R package point out, we also need to think about how people with colorblindness might struggle with understanding graphics. If you create figures in R, I highly recommend checking it out at the link above!

Continue Reading…


Read More

R Packages worth a look

Data Bank for Statistical Analysis and Visualization (datarium)
Contains data organized by topics: categorical data, regression model, means comparisons, independent and repeated measures ANOVA, mixed ANOVA and ANCOVA.

Analyze Lines of R Code the Tidy Way (tidycode)
Analyze lines of R code using tidy principles. This allows you to input lines of R code and output a data frame with one row per function included. Add …

Interface to the ‘Zoltar’ Forecast Repository API (zoltr)
‘Zoltar’ <https://…/> is a website that provides a repository of model forecast resu …

Create Full Text Browsers from Annotated Token Lists (tokenbrowser)
Create browsers for reading full texts from a token list format. Information obtained from text analyses (e.g., topic modeling, word scaling) can be us …

Continue Reading…


Read More

Cost of College

We know that more education usually equals more income, but as the cost of education continues to rise, the challenge to earn a college degree also increases.

Continue Reading…


Read More

Let’s get it right

Article: Racial Bias in Conversational Artificial Intelligence

Conversational AI is an emerging field of human-computer interaction in which we use natural language to exchange information and pass commands to computers. Any single interface with digital devices that you can think of can either be replaced or augmented with AI-enabled conversational interfaces. Examples include chatbots and speech-based assistants like Siri. The modern smart city concept also involves a ‘citizen-centric’ services model which uses conversational AI interfaces to personalize and contextualize city services. One example is Citibot, a citizen engagement platform. Similarly, Vienna has WienBot, which allows residents and tourists to find common civic services such as parking, restrooms, restaurants and other facilities. They no longer need to rely on the kindness of strangers, or to scroll through long lists on websites.

Article: Building inclusion, fairness, and ethics into machine learning

Andrew Zaldivar is a Developer Advocate for Google AI. His job is to help to bring the benefits of AI to everyone. Andrew develops, evaluates, and promotes tools and techniques that can help communities build responsible AI systems, writing posts for the Google Developers blog and speaking at a variety of conferences. Before joining Google AI, Andrew was a Senior Strategist in Google’s Trust & Safety group and worked on protecting the integrity of some of Google’s key products by using machine learning to scale, optimize and automate abuse-fighting efforts. Prior to joining Google, Andrew completed his Ph.D. in Cognitive Neuroscience from the University of California, Irvine and was an Insight Data Science fellow. Here, Andrew shares details on his role at Google, his personal and professional passions, and how he applies his academic background to his work creating and sharing tools that help teams build more inclusive products and user experiences.

Paper: Software Engineering for Fairness: A Case Study with Hyperparameter Optimization

We assert that it is the ethical duty of software engineers to strive to reduce software discrimination. This paper discusses how that might be done. This is an important topic since machine learning software is increasingly being used to make decisions that affect people’s lives. Potentially, the application of that software will result in fairer decisions because (unlike humans) machine learning software is not biased. However, recent results show that the software within many data mining packages exhibits ‘group discrimination’; i.e. their decisions are inappropriately affected by ‘protected attributes’ (e.g., race, gender, age, etc.). There has been much prior work on validating the fairness of machine-learning models (by recognizing when such software discrimination exists). But after detection comes mitigation. What steps can ethical software engineers take to reduce discrimination in the software they produce? This paper shows that making fairness a goal during hyperparameter optimization can (a) preserve the predictive power of a model learned from a data miner while also (b) generating fairer results. To the best of our knowledge, this is the first application of hyperparameter optimization as a tool for software engineers to generate fairer software.

Paper: From What to How. An Overview of AI Ethics Tools, Methods and Research to Translate Principles into Practices

The debate about the ethical implications of Artificial Intelligence dates from the 1960s. However, in recent years symbolic AI has been complemented and sometimes replaced by Neural Networks and Machine Learning techniques. This has vastly increased its potential utility and impact on society, with the consequence that the ethical debate has gone mainstream. Such debate has primarily focused on principles – the what of AI ethics – rather than on practices, the how. Awareness of the potential issues is increasing at a fast rate, but the AI community’s ability to take action to mitigate the associated risks is still in its infancy. Therefore, our intention in presenting this research is to contribute to closing the gap between principles and practices by constructing a typology that may help practically-minded developers apply ethics at each stage of the pipeline, and to signal to researchers where further work is needed. The focus is exclusively on Machine Learning, but it is hoped that the results of this research may be easily applicable to other branches of AI. The article outlines the research method for creating this typology and the initial findings, and provides a summary of future research needs.

Paper: Contrastive Fairness in Machine Learning

We present contrastive fairness, a new direction in causal inference applied to algorithmic fairness. Earlier methods dealt with the ‘what if?’ question (counterfactual fairness, NeurIPS’17). We establish the theoretical and mathematical implications of the contrastive question ‘why this and not that?’ in the context of algorithmic fairness in machine learning. This is essential to defend the fairness of algorithmic decisions in tasks where a person or sub-group of people is chosen over another (job recruitment, university admission, company layoffs, etc). This development is also helpful to institutions to ensure or defend the fairness of their automated decision making processes. A test case of employee job location allocation is provided as an illustrative example.

Article: Ian McEwan on His New Novel and Ethics in the Age of A.I.

When we program morality into robots, are we doomed to disappoint them with our very human ethical inconsistency?

Article: The human problem of AI

When it comes to most things business, AI is making its mark as the must-have technology. Whether we are talking about customer-facing chatbots to help with engagement and conversion or AI working in the background to help make critical business decisions, AI is everywhere. And the expectations of what it can and should be able to do are often sky-high. When those expectations aren’t met, however, it’s not always the tech that’s to blame. More likely, it’s the humans who brought it on board. Here are some of the most common human errors when it comes to implementing AI.
Mistake #1: Confusing automation with AI
Mistake #2: Not determining success factors
Mistake #3: Not getting organizational buy-in
Mistake #4: Not considering the impact on the entire customer journey
Mistake #5: Not understanding the cause of the problems you’re trying to solve

Article: The Future of Life Institute (FLI)

Mission: ‘To catalyze and support research and initiatives for safeguarding life and developing optimistic visions of the future, including positive ways for humanity to steer its own course considering new technologies and challenges.’ We have technology to thank for all the ways in which today is better than the stone age, and technology is likely to keep improving at an accelerating pace. We are a charity and outreach organization working to ensure that tomorrow’s most powerful technologies are beneficial for humanity. With less powerful technologies such as fire, we learned to minimize risks largely by learning from mistakes. With more powerful technologies such as nuclear weapons, synthetic biology and future strong artificial intelligence, planning ahead is a better strategy than learning from mistakes, so we support research and other efforts aimed at avoiding problems in the first place. We are currently focusing on keeping artificial intelligence beneficial and we are also exploring ways of reducing risks from nuclear weapons and biotechnology. FLI is based in the Boston area, and welcomes the participation of scientists, students, philanthropists, and others nearby and around the world. Here is a video highlighting our activities from our first year.

Continue Reading…


Read More

Distilled News

Parsing Structured Documents with Custom Entity Extraction

There are lots of great tutorials on the web that explain how to classify chunks of text with Machine Learning. But what if, rather than just categorizing text, you want to categorize individual words?

The Almighty Policy Gradient in Reinforcement Learning

A simple step-by-step explanation of the concept of policy gradients and how they fit into reinforcement learning. Maybe too simple.

Achieving Artificial General Intelligence (AGI) using Self Models

Moravec’s paradox is the observation made by many AI researchers that high-level reasoning requires less computation than low-level unconscious cognition. This is an empirical observation that goes against the notion that greater computational capability leads to more intelligent systems. However, we have today computer systems that have super-human symbolic reasoning capabilities. Nobody is going to argue that a man with an abacus, a chess grandmaster or a champion Jeopardy player has any chance at besting a computer. Artificial symbolic reasoning is a technology that has been available for decades now, and this capability is without argument superior to what any human can provide. Despite this, nobody will claim that computers are conscious. Today, with the discovery of deep learning (i.e. intuition or unconscious reasoning machines), low-level unconscious cognition is within humanity’s grasp. In this article, I will explore the ramifications of a scenario where machine subjectivity or self-awareness is discovered prior to the discovery of intelligent machines. This is a scenario where self-awareness is not a higher reasoning capability. Let us ask: what if self-aware machines were discovered before intelligent machines? What would the progression of breakthroughs look like? What is the order of the milestones?

Integrating Machine Learning Models within Matured Business Process

Machine Learning today is reaching every business process of the enterprise, helping to create value, enhance customer experience or bring in operational efficiency. Businesses today have the necessary infrastructure, the right tooling and the data to generate insights faster than before. While Machine Learning models can have a significant and positive impact on how business processes are run, they can also turn out to be risky if put into live production without being monitored for a reasonable amount of time. A major hurdle arises when organizations have some form of business rules embedded into their critical business processes. These rules might have evolved over time, taking real-world domain knowledge into account, and might also be performing exceptionally well. In these scenarios, stakeholders typically push back on completely doing away with the existing rules ecosystem. The challenge with a rule-based system is that data and business scenarios change so fast today that either the rules are unable to catch up with real-world scenarios or it is very time consuming to create and maintain additional rules. With that background, this article is about how we can make use of the best of both worlds (Rules + Machine Learning) and also, over time, measure the performance of machine learning models with real-world data to see if they can exist by themselves.

Should You Be Recommending Deep Learning Solutions in Your Company?

If you are guiding your company’s digital journey, to what extent should you be advising them to adopt deep learning AI methods versus traditional and mature machine learning techniques?

Automated data report storytelling in R

In this article, you learn how to create automated data report storytelling in R for credit modelling. First you need to install the rmarkdown package into your R library. Assuming that you have installed rmarkdown, next you create a new rmarkdown script in R.

Customer Support Chatbots: Easier & More Effective Than You Think

Learn how to create your own free chatbot environment with just a few commands, as well as learning more about the benefits of customer service chatbots.

An Introductory Guide to Computer Vision

The fantasy that a machine is capable of simulating the human visual system is old. We’ve come a long way since the first university papers appeared back in the 1960s, as evidenced by the advent of modern systems trivially integrated into mobile applications. Today, computer vision is one of the hottest subfields of artificial intelligence and machine learning, given its wide variety of applications and tremendous potential. Its goal: to replicate the powerful capacities of human vision. But, what exactly is computer vision? How is it currently applied in different industries? What are some well-known business use cases? What tasks are typical to computer vision? In this guide, you’ll learn about the basic concept of computer vision and how it’s used in the real world. It’s a simple examination of a complex problem for anybody who has ever heard of computer vision but isn’t quite sure what it’s all about and how it’s applied.

R Studio Shortcuts and Tips – part 2

Welcome to the second part of R Studio shortcuts and tips! If you have not yet read R Studio shortcuts and tips – part one, I strongly recommend doing so before proceeding further.

Part 2: Simple EDA in R with inspectdf

Previously, I wrote a blog post showing a number of R packages and functions which you could use to quickly explore your data set. Since posting that, I’ve become aware of another exciting EDA package: inspectdf by Alastair Rushworth! As is very often the case, I became aware of this package in a twitter post by none other than Mara Averick.

Understanding the 3 most common loss functions for Machine Learning Regression

A loss function in Machine Learning is a measure of how accurately your ML model is able to predict the expected outcome, i.e. the ground truth. The loss function takes two items as input: the output value of our model and the ground truth expected value. The output of the loss function is called the loss, which is a measure of how well our model did at predicting the outcome. A high value for the loss means our model performed very poorly. A low value for the loss means our model performed very well. Selection of the proper loss function is critical for training an accurate model. Certain loss functions will have certain properties and help your model learn in a specific way. Some may put more weight on outliers, others on the majority. In this article we’re going to take a look at the 3 most common loss functions for Machine Learning Regression. I’ll explain how they work, their pros and cons, and how they can be most effectively applied when training regression models.
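To make the "two inputs, one loss value" idea concrete, here is a small sketch in plain Python. The excerpt does not name its three loss functions, so the choice of mean squared error, mean absolute error and Huber loss is an assumption (they are widely used for regression), and the data values are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error: squaring makes large misses dominate the loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for errors up to delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)


# Hypothetical data: three perfect predictions and one miss of 10 units.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]

print(mse(y_true, y_pred))    # → 25.0
print(mae(y_true, y_pred))    # → 2.5
print(huber(y_true, y_pred))  # → 2.375
```

The single large miss inflates the squared-error loss by a factor of ten relative to the absolute-error loss, which illustrates the remark that some losses put more weight on outliers than others.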

Reinforcement learning basics: stationary and non-stationary multi-armed bandit problem

The multi-armed (also called k-armed) bandit is an introductory reinforcement learning problem in which an agent has to make n choices among k different options. Each option delivers a (possibly) different reward from an unknown distribution which usually doesn’t change over time (i.e. it is stationary). If the distribution changes over time (i.e. it is not stationary), the problem gets harder because previous observations (i.e. previous games) are of little use. In either case, the goal is to maximize the total reward obtained. This article reviews one (of many) simple solutions for both a stationary and a non-stationary 5-armed bandit across 1000 games. Note that only some parts of the full code will be showcased here; for the fully functional notebook, please see this github repository.
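A rough sketch of one such simple solution follows. The excerpt does not say which method the article uses, so the epsilon-greedy agent here is an assumption, as are the Gaussian rewards, the arm means, the random-walk drift used to make the bandit non-stationary, and every parameter name. It is not the code from the linked notebook.

```python
import random


def run_bandit(true_means, games=1000, epsilon=0.1, step_size=None, drift=0.0, seed=0):
    """Play `games` rounds of a k-armed bandit with an epsilon-greedy agent.

    step_size=None uses sample-average updates (suited to a stationary bandit);
    a constant step_size weights recent rewards more (useful when means drift).
    drift > 0 moves every arm's mean by a small random step after each game.
    """
    rng = random.Random(seed)          # local generator keeps runs reproducible
    means = list(true_means)
    k = len(means)
    estimates = [0.0] * k              # current estimate of each arm's reward
    counts = [0] * k                   # how many times each arm was played
    total_reward = 0.0

    for _ in range(games):
        if rng.random() < epsilon:     # explore: pick a random arm
            arm = rng.randrange(k)
        else:                          # exploit: pick the best-looking arm
            arm = max(range(k), key=lambda a: estimates[a])

        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        alpha = step_size if step_size is not None else 1.0 / counts[arm]
        estimates[arm] += alpha * (reward - estimates[arm])
        total_reward += reward

        if drift:                      # non-stationary case: means take a random walk
            means = [m + rng.gauss(0.0, drift) for m in means]

    return estimates, counts, total_reward


arm_means = [0.1, 0.5, 1.0, 1.5, 2.0]  # hypothetical 5-armed bandit

est, counts, total = run_bandit(arm_means)                                    # stationary
est_ns, counts_ns, total_ns = run_bandit(arm_means, step_size=0.1, drift=0.05)  # non-stationary

print(counts, round(total, 1))
print(counts_ns, round(total_ns, 1))
```

The only difference between the two runs is the update rule: a sample average treats all past games equally, while a constant step size lets old observations fade, which addresses the point that previous games are of little use once the distribution changes.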

Detection Free Human Instance Segmentation using Pose2Seg and PyTorch

In recent years, research related to ‘humans’ in the computer vision community has become increasingly active because of the high demand for real-life applications, among them instance segmentation. The standard approach to image instance segmentation is to perform object detection first, and then segment the object from the detection bounding-box. More recently, deep learning methods like Mask R-CNN perform them jointly. However, as human-related tasks such as human recognition and tracking become more common, one might wonder why the uniqueness of the ‘human’ category is not taken into account. The uniqueness of the ‘human’ category can be well defined by the pose skeleton. Moreover, the human pose skeleton can be used to better distinguish instances with heavy occlusion than using bounding-boxes. In this post, I am going to review ‘Pose2Seg: Detection Free Human Instance Segmentation’, which presents a new pose-based instance segmentation framework for humans which separates instances based on human pose.

Machine Learning in Python NumPy: Neural Network in 9 Steps

1. Initialization
2. Data Generation
3. Train-test Splitting
4. Data Standardization
5. Neural Net Construction
6. Forward Propagation
7. Back-propagation
8. Iterative Optimization
9. Testing
This is how you can build a neural net from scratch using NumPy in 9 steps. Some of you might have already built neural nets using some high-level frameworks such as TensorFlow, PyTorch, or Keras. However, building a neural net using only low-level libraries enables us to truly understand the mathematics behind the mystery.

Deep Learning for Clinical Diagnostics

This is the fourth article in the series Deep Learning for Life Sciences. In the previous posts, I showed how to use Deep Learning on Ancient DNA, Deep Learning for Single Cell Biology and Deep Learning for Data Integration. Now we are going to dive into Biomedicine and learn why and how we should use Bayesian Deep Learning for patient safety.

Can we generate Automatic Cricket Commentary using Neural Networks ?

Like everything else, the world of cricket has gone through a lot of technological transformations in recent years. The way cricket is played and how it is viewed all around the world have both changed as a result. In this post we discuss whether neural networks are capable of generating cricket commentary just by watching the game. There has been some work in the literature (can be found here, here and here) but it does not use neural networks. Being a believer in end-to-end deep learning, I think neural networks will seal the deal on this task in the near future. This is a hard problem to tackle because, apart from visual feature extraction, it involves very complex temporal dynamics and handling of long-term dependencies. This is because commentary is generally highly contextualized by the development of the current game, its significance in a broader perspective (friendly match vs tournament), and the histories of the teams and players involved. Decontextualized explanation of what is happening appears to be an easier problem to solve, and I can think of an architecture that can be used for modelling this.

Continue Reading…


Read More

Going beyond the rainbow color scheme for statistical graphics

Yesterday in our discussion of easy ways to improve your graphs, a commenter wrote:

I recently read and enjoyed several articles about alternatives to the rainbow color palette. I particularly like the sections where they show how each color scheme looks under different forms of color-blindness and/or in black and white.

Here’s a couple of them (these are R-centric but relevant beyond that):

The viridis color palettes, by Bob Rudis, Noam Ross and Simon Garnier

Somewhere over the Rainbow, by Ross Ihaka, Paul Murrell, Kurt Hornik, Jason Fisher, Reto Stauffer, Claus Wilke, Claire McWhite, and Achim Zeileis.

I particularly like that second article, which includes lots of examples.

Continue Reading…


Read More

University of Edinburgh Jupyter Community nbgrader Hackathon

The University of Edinburgh is happy to announce our upcoming event as part of the Jupyter Community Workshop series funded by Bloomberg. The University will be hosting a three-day event, the core of which is a hackathon focused on adding improvements, fixes and extra documentation for the nbgrader extension. Alongside this we will also hold an afternoon of talks highlighting how Jupyter can be used in education at varying levels. The event will take place from 29 to 31 May at the University of Edinburgh, with the afternoon of talks taking place on 30 May.

The first and main part of the event will be the nbgrader hackathon. nbgrader is a Jupyter extension that allows for the creation and marking of notebook-based assignments. Here at the University, we have adopted nbgrader and our developers have integrated the extension into our Jupyter service Noteable. The hackathon will focus on improving the core features and extending the abilities of nbgrader; by adding such features, it will be easier for institutions to adopt and embed both Jupyter and nbgrader into their teaching practice.

The second, equally important part of our event is a series of talks aimed at highlighting the uses of Jupyter within education. As part of developing our local Jupyter service, we have uncovered many use cases across the University of how Jupyter can be adopted in a variety of disciplines and scenarios that we are keen to share. We will also be able to showcase an institutional approach to adopting and supporting Jupyter at scale. On top of this, there is also the opportunity to hear from many of our hackathon attendees. This series of talks is aimed at academic colleagues, and teaching and support staff at any level of education and includes an evening networking event to allow attendees to further explore how they may introduce Jupyter to their institution.

What are we working on?

We have worked with our local developers and scoured the nbgrader GitHub repo to devise a plan of attack for the hackathon in terms of features and improvements. We’re keen to engage with the wider community regarding these goals and have created a post on the Jupyter Discourse to allow further discussion.

  • Support for multiple courses/multiple classes: instructors who teach multiple courses using nbgrader, or students enrolled on multiple courses. Multiple Courses (ref: PR #1040) vs Support for multiple classes via Jupyterhub groups (PR #893)
  • Considerations for LTI use: Users/Courses not in the database at startup
  • Support for Multiple markers for one assignment (part of issue #1030, which extends #998)
  • API Tests have hard-coded file copy methods to pre-load the system to enable testing, and os.file_exists-type tests for release & submit tests
  • Generation of feedback copies for students within the formgrader UI (to mirror the existing terminal command). Also consider ability to disseminate this feedback back to students within nbgrader.

Want to get involved?

If you would like to be involved in the hackathon, we still have some funding left for travel and accommodation. We are looking for participants who have a good understanding of Jupyter and nbgrader and who would be able to attend all three days of the event. If you would be able to attend, please complete the following form.

If you can’t attend but want to have a say on what is worked on during the hackathon, then join the discussion on the Jupyter Discourse (if you’re new to Jupyter, this is an excellent place to head for community discussions).

If you’d like to come along to our Jupyter community afternoon on 30 May, book a place via Eventbrite.

University of Edinburgh Jupyter Community nbgrader Hackathon was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.


Facebook AI’s Joelle Pineau receives Governor General’s Innovation Award

Joelle Pineau, the Co-Managing Director of Facebook AI Research (FAIR) and head of FAIR Montreal, is one of six recipients of the Governor General’s Innovation Awards. The award recognizes Canadian leaders for their groundbreaking innovations and positive impact on the quality of life in the country.

Pineau’s research focuses on developing new models and algorithms for planning and learning in complex, partially observable domains. She is recognized for applying these algorithms to problems in robotics and in health care. Pineau is also a vocal advocate for increasing diversity among researchers and academics in the AI community.

In addition to leading Facebook AI’s research efforts in Montreal, she is also an associate professor at McGill University, where she co-directs the Reasoning and Learning Lab.

Pineau sat down to share her health care-related work and how AI can have a positive impact on the world. See the Facebook AI blog for the full Q&A.

The post Facebook AI’s Joelle Pineau receives Governor General’s Innovation Award appeared first on Facebook Research.


SQL Injection Protection

SQL injection is a common form of data theft. I am hopeful we can make SQL injection protection more common.

The 2018 TrustWave Global Security Report listed SQL Injection as the second most common technique for web attacks, trailing only cross-site scripting (XSS) attacks. This is a 38% increase from the previous year. That same report also shows SQL Injection ranked fifth on a list of vulnerabilities that can be identified through simple penetration testing.

You may look at the increase and think “whoa, attacks are increasing”. But I believe that what we are seeing is a rising awareness in security. No longer the stepchild, security is a first-class citizen in application design and deployment today. As companies focus on security, they deploy tools and systems to help identify exploits, leading to more reporting of attacks.

SQL Injection is preventable. That’s the purpose of this post today, to help you understand what SQL Injection is, how to identify when it is happening, and how to prevent it from being an issue.


SQL Injection Explained

SQL injection is the method where an adversary appends a SQL statement to the input field inside a web page or application, thereby sending their own custom request to a database. That request could be to read data, or download the entire database, or even delete all data completely.

The most common examples of SQL injection attacks are found inside username and password input boxes on a web page. This login design is standard for allowing users to access a website. Unfortunately, many websites do not take precautions to block SQL injection on these input fields, leaving them open to attack.

Let’s look at a sample website built for the fictional Contoso Clinic. The source code for this can be found at

On the Patients page you will find an input field at the top, next to a ‘Search’ button, and next to that a hyperlink for ‘SQLi Hints’.


[Image: Contoso Clinic SQL injection example]


Clicking on the SQLi Hints link will display some sample text to put into the search field.


[Image: SQL injection example]


I’m going to take the first statement and put it into the search field. Here is the result:




This is a common attack vector, as the adversary can use this method to determine what version of SQL Server is running. This is also a nice reminder to not allow your website to return such error details to the end user. More on that later.

Let’s talk a bit about how SQL injection works under the covers.


How SQL Injection works

The vulnerability in my sample website is the result of this piece of code:

return View(db.Patients.SqlQuery
("SELECT * FROM dbo.Patients
WHERE [FirstName] LIKE '%" + search + "%'
OR [LastName] LIKE '%" + search + "%'
OR [StreetAddress] LIKE '%" + search + "%'
OR [City] LIKE '%" + search + "%'
OR [State] LIKE '%" + search + "%'").ToList());

This is a common piece of code used by many websites. It is building a dynamic SQL statement based upon the input fields on the page. If I were to search the Patients page for ‘Rock’, the SQL statement sent to the database would then become:

SELECT * FROM dbo.Patients
WHERE [FirstName] LIKE '%Rock%'
OR [LastName] LIKE '%Rock%'
OR [StreetAddress] LIKE '%Rock%'
OR [City] LIKE '%Rock%'
OR [State] LIKE '%Rock%'

In the list of SQLi hints on that page you will notice that each example starts with a single quote, followed by a SQL statement, and ends with a comment marker (the two dashes). The single quote closes the string literal early, and the two dashes comment out whatever follows on that line. For the example I chose above, the resulting statement is as follows:

SELECT * FROM dbo.Patients
WHERE [FirstName] LIKE '%' OR CAST(@@version as int) = 1 --%'
OR [LastName] LIKE '%' OR CAST(@@version as int) = 1 --%'
OR [StreetAddress] LIKE '%' OR CAST(@@version as int) = 1 --%'
OR [City] LIKE '%' OR CAST(@@version as int) = 1 --%'
OR [State] LIKE '%' OR CAST(@@version as int) = 1 --%'

This results in the conversion error shown above. This also means that I can do interesting searches to return information about the database. Or I could do malicious things, like drop tables.
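To make the mechanics concrete, here is a minimal Python sketch of the same string concatenation. It is not the demo site's C# code: the helper names `build_query` and `strip_line_comments` are hypothetical, the column list is shortened to two columns, and the comment stripping is a rough approximation of what the database parser does.

```python
def build_query(search: str) -> str:
    """Build the search SQL by naive concatenation (deliberately vulnerable)."""
    columns = ["FirstName", "LastName"]  # shortened, illustrative column list
    clauses = [f"[{col}] LIKE '%{search}%'" for col in columns]
    return "SELECT * FROM dbo.Patients\nWHERE " + "\nOR ".join(clauses)


def strip_line_comments(sql: str) -> str:
    """Drop everything after '--' on each line.

    Naive on purpose: it ignores the case of '--' inside a string literal.
    """
    return "\n".join(line.split("--", 1)[0].rstrip() for line in sql.splitlines())


benign = build_query("Rock")
payload = "' OR CAST(@@version as int) = 1 --"
injected = build_query(payload)

print(benign)
print(strip_line_comments(injected))  # what is left once the comments are dropped
```

The second print shows that the attacker's expression is now part of the WHERE clause itself, rather than text being searched for.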

Chances are you have code like this, somewhere, right now. Let’s look at how to find out what your current code looks like.


SQL Injection Discovery

Discovering SQL injection is not trivial. You must examine your code to determine if it is vulnerable. You must also know if someone is actively trying SQL injection attacks against your website. Trying to roll your own solution can take considerable time and effort.

There are two tools I can recommend you use to help discover SQL injection.


Test Websites with sqlmap

One method is to use sqlmap, an open-source penetration testing project that will test websites for SQL injection vulnerabilities. This is a great way to uncover vulnerabilities in your code. However, sqlmap will not tell you if someone is actively using SQL injection against your website. You will need to use something else for alerts.


Azure Threat Detection

If you are using Azure SQL Database, then you have the option to enable Azure Threat Detection. This feature will discover code vulnerabilities as well as alert you to attacks. It also checks for anomalous client logins, data exfiltration, and harmful applications trying to access your database.

(For fairness, I should mention that AWS WAF allows for SQL injection detection, but their process is a bit more manual than Azure’s.)

If you try to roll your own discovery, you will want to focus on finding queries that have caused errors. Syntax errors, missing objects, permission errors, and UNION ALL errors are the most common. You can find a list of the common SQL Server error message numbers here.
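As a rough sketch of what a roll-your-own check could look like, the Python below counts suspicious errors per client and flags clients that trigger several of them. Everything here is an assumption for illustration: the log record shape, the threshold, and the watch list of error numbers, which is a recollection of common SQL Server errors and not the list the post links to.

```python
from collections import Counter

# Assumed watch list of SQL Server error numbers often seen during injection probing.
# Verify these against the official error message reference before relying on them.
SUSPICIOUS_ERRORS = {
    102: "incorrect syntax",
    105: "unclosed quotation mark",
    205: "UNION operand count mismatch",
    208: "invalid object name",
    229: "permission denied",
    245: "conversion failed",
}


def flag_clients(events, threshold=3):
    """Return clients whose count of suspicious errors meets the threshold."""
    counts = Counter(e["client"] for e in events if e["error"] in SUSPICIOUS_ERRORS)
    return {client: n for client, n in counts.items() if n >= threshold}


# Hypothetical, hand-written log records purely for illustration.
events = [
    {"client": "10.0.0.5", "error": 245},
    {"client": "10.0.0.5", "error": 102},
    {"client": "10.0.0.5", "error": 208},
    {"client": "10.0.0.9", "error": 102},
    {"client": "10.0.0.9", "error": 1205},  # not in the watch list
]
flagged = flag_clients(events)
print(flagged)
```

A single syntax error is usually just a bug; a burst of different errors from one client in a short window is the pattern worth a closer look.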

It warrants mentioning that not all SQL injection attacks are discoverable. But when it comes to security, you will never eliminate all risk; you take steps to lower your risk. SQL injection discovery is one way to lower your risk.


SQL Injection Protection

Detection of SQL Injection vulnerabilities and attacks is only part of the solution. In an ideal world, your application code would not allow for SQL Injection. Here’s a handful of ways you can lower your risk of SQL injection attacks.


Parameterize Your Queries

Parameterized queries, also known as ‘prepared statements’, are a good way to prevent SQL injection attacks against the database. For SQL Server, prepared statements are typically done using the sp_executesql system stored procedure.

Prepared statements should not allow an attacker to change the nature of the SQL statement by injecting additional code into the input field. I said “should”, because it is possible to write prepared statements in a way that would still be vulnerable to SQL injection. You must (1) know what you are doing and (2) learn to sanitize your inputs.

Traditionally, one argument against the use of prepared statements centers on performance. It is possible that a prepared statement may not perform as well as the original dynamic SQL statement. However, if you are reading this and believe performance is more important than security, you should reconsider your career in IT before someone does that for you.
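The idea of parameterization can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Contoso Clinic code: it uses the standard-library sqlite3 module with an in-memory SQLite database as a stand-in for SQL Server, and the table contents and the `search_patients` helper are made up.

```python
import sqlite3

# In-memory SQLite database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patients (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO Patients VALUES (?, ?)",
    [("Rocky", "Stone"), ("Ann", "Lee")],
)


def search_patients(conn, search):
    """Search with bound parameters; the input never becomes part of the SQL text."""
    pattern = f"%{search}%"
    sql = (
        "SELECT FirstName, LastName FROM Patients "
        "WHERE FirstName LIKE ? OR LastName LIKE ?"
    )
    return conn.execute(sql, (pattern, pattern)).fetchall()


normal = search_patients(conn, "Rock")
attack = search_patients(conn, "' OR 1=1 --")
print(normal)
print(attack)
```

The payload is treated as a literal string to search for, so it matches no rows instead of rewriting the WHERE clause. Note that `%` and `_` typed by the user still act as LIKE wildcards here, which is a separate concern from injection.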


Use Stored Procedures

Another available method is stored procedures. Stored procedures offer additional layers of security that prepared statements may not allow. While prepared statements require permissions on the underlying tables, stored procedures can execute against objects without the user having similar direct access.

Like prepared statements, stored procedures are not exempt from SQL injection. It is quite possible you could put vulnerable code into a stored procedure. You must take care to compose your stored procedures properly, making use of parameters. You should also consider validating the input parameters being passed to the procedure, either on the client side or in the procedure itself.
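Input validation on the client or application side could look something like the following Python sketch. The allow-list policy (letters, spaces and hyphens, at most 50 characters) and the function name `validate_search_term` are assumptions chosen for illustration; a real policy depends on your data, and names such as O'Brien would need a wider rule.

```python
import re

# Assumed policy: start with a letter, then letters, spaces or hyphens, 50 chars max.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z \-]{0,49}")


def validate_search_term(value: str) -> str:
    """Return the trimmed value if it matches the allow-list, else raise ValueError."""
    value = value.strip()
    if not NAME_PATTERN.fullmatch(value):
        raise ValueError("search term contains unsupported characters")
    return value


print(validate_search_term("  Rock "))
try:
    validate_search_term("' OR CAST(@@version as int) = 1 --")
except ValueError as exc:
    print("rejected:", exc)
```

Validation like this complements parameters rather than replacing them: it narrows what reaches the database, while parameters ensure that whatever does reach it is treated as data.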



You could use a security method such as EXECUTE AS to switch the context of the user as you make a request to the database. As mentioned above, stored procedures somewhat act in this manner by default. But EXECUTE AS can be used directly for requests such as prepared statements or ad-hoc queries.


Remove Extended Stored Procedures

Disabling the use of extended stored procedures is a good way to limit your risk with SQL injection. Not because you won’t be vulnerable, but because you limit the surface area for the attacker. By disabling these system procedures you limit a common way that an attacker can get details about your database system.


Sanitize Error Messages

You should never reveal error messages to the end user. Trap all errors and redirect to a log for review later. The less error information you bubble up, the better.
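One way to trap errors and keep the details out of the response is sketched below in Python. It is illustrative only: SQLite stands in for the real database, and the `run_search` helper, the response shape, and the incident id scheme are all assumptions.

```python
import logging
import sqlite3
import uuid

logger = logging.getLogger("app.db")

GENERIC_MESSAGE = "Something went wrong. Please try again later."


def run_search(conn, sql, params=()):
    """Run a query; on failure log the details and return only a generic message."""
    try:
        return {"ok": True, "rows": conn.execute(sql, params).fetchall()}
    except sqlite3.Error:
        incident_id = uuid.uuid4().hex  # lets support staff find the log entry
        logger.exception("query failed (incident %s)", incident_id)
        return {"ok": False, "message": GENERIC_MESSAGE, "incident": incident_id}


conn = sqlite3.connect(":memory:")
result = run_search(conn, "SELECT * FROM missing_table")
print(result["message"])
```

The full exception, including the table name and driver message, goes to the log; the end user sees only the generic text and an opaque id.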


Use Firewalls

Whitelisting of IP addresses is a good way to limit activity from anomalous users. Use of VPNs and VNETs to segment traffic can also reduce your risk.
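Real allow-listing is normally configured in the firewall or the cloud platform itself, but the underlying check is simple enough to sketch in Python. The networks below are documentation-reserved example ranges, and `is_allowed` is a hypothetical helper.

```python
import ipaddress

# Hypothetical allow-list: one office network and one single VPN address.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if the client address falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


print(is_allowed("203.0.113.42"))
print(is_allowed("192.0.2.1"))
```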



The #hardtruth here is that every database is susceptible to SQL injection attacks. No one platform is more at risk than any other. The weak link here is the code being written on top of the database. Most code development does not emphasize security enough, leaving applications open to attacks.

When you combine poor database security techniques along with poor code, you get the recipe for SQL Injection.



2018 TrustWave Global Security Report
Contoso Clinic Demo Application
sqlmap: Automatic SQL injection and database takeover tool
Azure SQL Database threat detection
Working with SQL Injection Match Conditions
How to Detect SQL Injection Attacks
sp_executesql (Transact-SQL)
Server Configuration Options (SQL Server)

The post SQL Injection Protection appeared first on Thomas LaRock.


Easy quick PCA analysis in R

(This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers)

Principal component analysis (PCA) is very useful for doing some basic quality control (e.g. looking for batch effects) and assessing how the data is distributed (e.g. finding outliers). A straightforward way is to make your own wrapper function for prcomp and ggplot2; another way is to use the one that comes with M3C or another package. M3C is a consensus clustering tool that makes some major modifications to the Monti et al. (2003) algorithm so that it behaves better; it also provides functions for data visualisation.

Let’s have a go on an example cancer microarray dataset.

# M3C loads an example dataset mydata with metadata desx
# do PCA

What prcomp has done is extract the eigenvectors of the data’s covariance matrix, then project the original data samples onto them using linear combinations. This yields PC scores, which are plotted on PC1 and PC2 here (eigenvectors 1 and 2). The eigenvectors are sorted by the variance they explain, and these first two capture the most variation in the data, but it might be a good idea to look at additional PCs, which is beyond this simple blog post and function.
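The eigenvector description above can be sketched directly with NumPy. This is an illustration of the idea, not what M3C or prcomp runs internally (prcomp computes the same result via a singular value decomposition of the centred data). The function name `pca_scores` and the synthetic data are assumptions, and here samples are rows and features are columns.

```python
import numpy as np


def pca_scores(data, n_components=2):
    """Project samples (rows) onto the top eigenvectors of the covariance matrix."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]  # re-sort so the largest variance is first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = centered @ eigenvectors[:, :n_components]
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, explained


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 5))  # synthetic: 50 samples x 5 features
data[:, 0] *= 10  # give one feature much more variance than the rest
scores, explained = pca_scores(data)
print(scores.shape)
print(explained)
```

The `explained` values are the fraction of total variance captured by each PC, which is a quick way to judge whether looking beyond PC1 and PC2 is worthwhile.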

You can see above there are no obvious outliers for removal here, which is good for cluster analysis. However, were there outliers, we could label all samples with their names using the ‘text’ parameter.


Now other objectives would be comparing samples with batch to make sure we do not have batch effects driving the variation in our data, and comparing with other variables such as gender and age. Since the metadata only contains tumour class we are going to use that next to see how it is related to these PC scores.

This is a categorical variable, so the ‘scale’ parameter should be set to 3, ‘controlscale’ is set to TRUE because I want to choose the colours, and ‘labels’ parameter passes the metadata tumour class into the function. I am increasing the ‘printwidth’ from its default value because the legend is quite wide.

For more information see the function documentation using ?pca.

# first reformat meta to replace NA with Unknown
desx$class <- as.character(desx$class)
desx$class[is.na(desx$class)] <- as.factor('Unknown')
# next do the plot

It now appears that the variation that governs these two PCs is indeed related to tumour class, which is good. But what if the variable is continuous, and we want to compare read mapping rate, read duplication percentage, or age with our data? A simple change of the parameters allows this too. Let’s make up a continuous variable, then add this to the plot. In this case we change the ‘scale’ parameter to reflect that we are using continuous data, and the spectral palette is used for the colours by default.

randomvec <- seq(1,50)

So, since this is a random variable, we can see it has no relationship with our data. Let’s just define a custom scale now. So we change ‘scale’ to 2, then use the ‘low’ and ‘high’ parameters to define the scale colour range.


Super easy, yet pretty cool. The idea here is just to minimise the amount of code hanging around for doing basic analyses like these. You can rip the code from here: Remember if your features are on different scales, the data needs transforming to be comparable beforehand.
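On the point about features on different scales, one common transformation is to standardize each feature to zero mean and unit variance before running PCA. The NumPy sketch below shows the idea; the `standardize` helper and the toy matrix are assumptions for illustration, and in R the `scale.` argument of prcomp serves a similar purpose.

```python
import numpy as np


def standardize(data):
    """Scale each feature (column) to zero mean and unit sample variance."""
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant features stay at zero instead of dividing by zero
    return (data - mean) / std


# Toy data: the second feature is on a much larger scale than the first.
data = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])
scaled = standardize(data)
print(scaled)
```

Without this step the feature with the largest raw variance tends to dominate the first PC regardless of how informative it is.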

M3C is part of a series of cluster analysis tools that we are releasing; the next is ‘Spectrum’, a fast adaptive spectral clustering tool, which is already on CRAN. Perhaps there will be time to blog about that in the future.

For quantitative analysis of the drivers of variation in the data, I recommend checking out David Watson’s great function in the ‘bioplotr’ package called ‘plot_drivers’. This compares the PCs with categorical and continuous variables and performs univariate statistical analysis on them, producing a very nice plot.



Fixing a Major Weakness in Machine Learning of Images with Hinton’s Capsule Networks

We explore Geoffrey Hinton's capsule networks to deal with rotational variance in images.


Thanks for reading!