# My Data Science Blogs

## February 21, 2019

### R Packages worth a look

Comprehensive, User-Friendly Toolkit for Probing Interactions (interactions)
A suite of functions for conducting and interpreting analysis of statistical interaction in regression models that was formerly part of the ‘jtools’ package …

Quick Serialization of R Objects (qs)
Provides functions for quickly writing and reading any R object to and from disk. This package makes use of the ‘zstd’ library for compression and decompression …

Simple ‘htmlwidgets’ Image Viewer with WebGL Brightness/Contrast (imageviewer)
Display 2D-matrix data as an interactive, zoomable gray-scale image viewer, providing tools for manual data inspection. The viewer window shows cursor …

### Document worth reading: “A Survey of Neuromorphic Computing and Neural Networks in Hardware”

Neuromorphic computing has come to refer to a variety of brain-inspired computers, devices, and models that contrast the pervasive von Neumann computer architecture. This biologically inspired approach has created highly connected synthetic neurons and synapses that can be used to model neuroscience theories as well as solve challenging machine learning problems. The promise of the technology is to create a brain-like ability to learn and adapt, but the technical challenges are significant, starting with an accurate neuroscience model of how the brain works, to finding materials and engineering breakthroughs to build devices to support these models, to creating a programming framework so the systems can learn, to creating applications with brain-like capabilities. In this work, we provide a comprehensive survey of the research and motivations for neuromorphic computing over its history. We begin with a 35-year review of the motivations and drivers of neuromorphic computing, then look at the major research areas of the field, which we define as neuro-inspired models, algorithms and learning approaches, hardware and devices, supporting systems, and finally applications. We conclude with a broad discussion on the major research topics that need to be addressed in the coming years to see the promise of neuromorphic computing fulfilled. The goals of this work are to provide an exhaustive review of the research conducted in neuromorphic computing since the inception of the term, and to motivate further work by illuminating gaps in the field where new research is needed. A Survey of Neuromorphic Computing and Neural Networks in Hardware

## February 20, 2019

### Top KDnuggets tweets, Feb 13-19: Intro to Scikit Learn: The Gold Standard of Python ML; The Essential Data Science Venn Diagram

Also: Cartoon: #MachineLearning Problems in 2118 #ValentinesDay; A must-read tutorial when you are starting your journey with #DeepLearning.

### What’s new on arXiv

Variational autoencoders are powerful algorithms for identifying dominant latent structure in a single dataset. In many applications, however, we are interested in modeling latent structure and variation that are enriched in a target dataset compared to some background—e.g. enriched in patients compared to the general population. Contrastive learning is a principled framework to capture such enriched variation between the target and background, but state-of-the-art contrastive methods are limited to linear models. In this paper, we introduce the contrastive variational autoencoder (cVAE), which combines the benefits of contrastive learning with the power of deep generative models. The cVAE is designed to identify and enhance salient latent features. The cVAE is trained on two related but unpaired datasets, one of which has minimal contribution from the salient latent features. The cVAE explicitly models latent features that are shared between the datasets, as well as those that are enriched in one dataset relative to the other, which allows the algorithm to isolate and enhance the salient latent features. The algorithm is straightforward to implement, has a similar run-time to the standard VAE, and is robust to noise and dataset purity. We conduct experiments across diverse types of data, including gene expression and facial images, showing that the cVAE effectively uncovers latent structure that is salient in a particular analysis.
Stochastic multiplayer games (SMGs) have gained attention in the field of strategy synthesis for multi-agent reactive systems. However, standard SMGs are limited to modeling systems where all agents have full knowledge of the state of the game. In this paper, we introduce the delayed-action games (DAGs) formalism, which simulates hidden-information games (HIGs) as SMGs by eliminating hidden information through delaying a player’s actions. The elimination of hidden information enables the use of off-the-shelf SMG model checkers to implement HIGs. Furthermore, we demonstrate how a DAG can be decomposed into a number of independent subgames. Since each subgame can be independently explored, parallel computation can be utilized to reduce the model checking time, while alleviating the state-space explosion problem that SMGs are notorious for. In addition, we propose a DAG-based framework for strategy synthesis and analysis. Finally, we demonstrate the applicability of the DAG-based synthesis framework on a case study of a human-on-the-loop unmanned aerial vehicle system that may be under stealthy attack, where the proposed framework is used to formally model, analyze and synthesize security-aware strategies for the system.
State-of-the-art models are now trained with billions of parameters, reaching hardware limits in terms of memory consumption. This has created a recent demand for memory-efficient optimizers. To this end, we investigate the limits and performance tradeoffs of memory-efficient adaptively preconditioned gradient methods. We propose extreme tensoring for high-dimensional stochastic optimization, showing that an optimizer needs very little memory to benefit from adaptive preconditioning. Our technique applies to arbitrary models (not necessarily with tensor-shaped parameters), and is accompanied by regret and convergence guarantees, which shed light on the tradeoffs between preconditioner quality and expressivity. On a large-scale NLP model, we reduce the optimizer memory overhead by three orders of magnitude, without degrading performance.
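As a rough illustration of how little memory adaptive preconditioning can get away with, here is a minimal pure-Python sketch of a factored second-moment preconditioner in the spirit the abstract describes (the function name and the rank-one row/column reconstruction are our own simplification, not the paper's exact method):

```python
import math

def factored_precondition(grad, row_acc, col_acc, eps=1e-8):
    """Sketch of a factored (memory-efficient) adaptive preconditioner.

    Instead of storing a full second-moment estimate per parameter (as
    Adam does, O(m*n) memory for an m x n gradient), keep only per-row
    and per-column accumulators: O(m + n) memory.
    """
    m, n = len(grad), len(grad[0])
    for i in range(m):                       # row sums of squared gradients
        row_acc[i] += sum(g * g for g in grad[i])
    for j in range(n):                       # column sums of squared gradients
        col_acc[j] += sum(grad[i][j] ** 2 for i in range(m))
    total = sum(row_acc) + eps
    # Rank-one reconstruction of the second moment, v_ij ~ r_i * c_j / total,
    # then divide the gradient by sqrt(v_ij) as adaptive methods do.
    return [[grad[i][j] / (math.sqrt(row_acc[i] * col_acc[j] / total) + eps)
             for j in range(n)] for i in range(m)]
```

For a 1000×1000 layer this stores 2000 accumulator values instead of a million; the extreme tensoring of the paper pushes the same idea further, to arbitrary tensorings of the parameter vector.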
The main goal of statistical learning theory is to provide a fundamental framework for the problem of decision making and model construction based on sets of data. Here, we present a brief introduction to the fundamentals of statistical learning theory, in particular the difference between empirical and structural risk minimization, including one of its most prominent implementations, i.e. the Support Vector Machine.
We present $\alpha$-loss, $\alpha \in [1,\infty]$, a tunable loss function for binary classification that bridges log-loss ($\alpha=1$) and $0$-$1$ loss ($\alpha = \infty$). We prove that $\alpha$-loss has an equivalent margin-based form and is classification-calibrated, two desirable properties for a good surrogate loss function for the ideal yet intractable $0$-$1$ loss. For logistic regression-based classification, we provide an upper bound on the difference between the empirical and expected risk for $\alpha$-loss by exploiting its Lipschitzianity along with recent results on the landscape features of empirical risk functions. Finally, we show that $\alpha$-loss with $\alpha = 2$ performs better than log-loss on MNIST for logistic regression.
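The abstract does not state the loss itself; a commonly cited form of $\alpha$-loss (an assumption here, not quoted from the paper) is $\ell_\alpha(p) = \frac{\alpha}{\alpha-1}\left(1 - p^{1-1/\alpha}\right)$ on the probability $p$ assigned to the true label, which recovers $-\log p$ as $\alpha \to 1$ and $1-p$ as $\alpha \to \infty$. A small numeric sketch:

```python
import math

def alpha_loss(p, alpha):
    """Assumed form of alpha-loss on the probability p of the true label:
    alpha = 1 gives log-loss, alpha = inf gives the 0-1-style loss 1 - p."""
    if alpha == 1:
        return -math.log(p)
    if math.isinf(alpha):
        return 1.0 - p
    return (alpha / (alpha - 1.0)) * (1.0 - p ** (1.0 - 1.0 / alpha))
```

Numerically, `alpha_loss(p, 1 + 1e-6)` agrees with `-log(p)` to about four decimal places, and the $\alpha = 2$ loss highlighted in the abstract sits between the two extremes.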
Marginal Structural Models (MSMs) (Robins, 2000) are the most popular models for causal inference from time-series observational data. However, they have two main drawbacks: (a) they do not capture subject heterogeneity, and (b) they only consider fixed time intervals and do not scale gracefully with longer intervals. In this work, we propose a new family of MSMs to address these two concerns. We model the potential outcomes as a three-dimensional tensor of low rank, where the three dimensions correspond to the agents, time periods and the set of possible histories. Unlike the traditional MSM, we allow the dimensions of the tensor to increase with the number of agents and time periods. We set up a weighted tensor completion problem as our estimation procedure, and show that the solution to this problem converges to the true model in an appropriate sense. Then we show how to solve the estimation problem, providing conditions under which we can approximately and efficiently solve the estimation problem. Finally we propose an algorithm based on projected gradient descent, which is easy to implement, and evaluate its performance on a simulated dataset.
We consider the problem of estimating the probability distribution of a discrete random variable in the setting where the observations are corrupted by outliers. Assuming that the discrete variable takes k values, the unknown parameter p is a k-dimensional vector belonging to the probability simplex. We first describe various settings of contamination and discuss the relation between these settings. We then establish minimax rates when the quality of estimation is measured by the total-variation distance, the Hellinger distance, or the L2-distance between two probability measures. Our analysis reveals that the minimax rates associated to these three distances are all different, but they are all attained by the maximum likelihood estimator. Note that the latter is efficiently computable even when the dimension is large. Some numerical experiments illustrating our theoretical findings are reported.
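For intuition, the maximum likelihood estimator mentioned here is simply the vector of empirical frequencies, and the total-variation distance is half the $L_1$ distance between pmfs; a minimal sketch (helper names are ours):

```python
from collections import Counter

def mle_pmf(samples, k):
    """MLE of a discrete distribution on {0, ..., k-1}: empirical frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[v] / n for v in range(k)]

def tv_distance(p, q):
    """Total-variation distance between two pmfs: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

Both run in O(n + k) time, which is the sense in which the abstract calls the MLE "efficiently computable even when the dimension is large."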
Recently, Generative Adversarial Networks (GANs) have emerged as a popular alternative for modeling complex high dimensional distributions. Most of the existing works implicitly assume that the clean samples from the target distribution are easily available. However, in many applications, this assumption is violated. In this paper, we consider the observation setting where the samples from the target distribution are given by the superposition of two structured components, and leverage GANs for learning the structure of the components. We propose two novel frameworks: denoising-GAN and demixing-GAN. The denoising-GAN assumes access to clean samples from the second component and tries to learn the other distribution, whereas demixing-GAN learns the distributions of both components at the same time. Through extensive numerical experiments, we demonstrate that the proposed frameworks can generate clean samples from unknown distributions, and provide competitive performance in tasks such as denoising, demixing, and compressive sensing.
Many modern neural network architectures are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels, including random noise. However, given the highly nonconvex nature of the training landscape, it is not clear what level and kind of overparameterization is required for first-order methods to converge to a global optimum that perfectly interpolates any labels. A number of recent theoretical works have shown that for very wide neural networks, where the number of hidden units is polynomially large in the size of the training data, gradient descent starting from a random initialization does indeed converge to a global optimum. However, in practice much more moderate levels of overparameterization seem to be sufficient, and in many cases overparameterized models seem to perfectly interpolate the training data as soon as the number of parameters exceeds the size of the training data by a constant factor. Thus there is a huge gap between the existing theoretical literature and practical experiments. In this paper we take a step towards closing this gap. Focusing on shallow neural nets and smooth activations, we show that (stochastic) gradient descent, when initialized at random, converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data. Our results also benefit from a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs).
The advent of modern technology, permitting the measurement of thousands of characteristics simultaneously, has given rise to floods of data characterized by many large or even huge datasets. This new paradigm presents extraordinary challenges to data analysis and the question arises: how can conventional data analysis methods, devised for moderate or small datasets, cope with the complexities of modern data? The case of high dimensional data is particularly revealing of some of the drawbacks. We look at the case where the number of characteristics measured in an object is at least as large as the number of observed objects and conclude that this configuration leads to geometrical and mathematical oddities and is an insurmountable barrier for the direct application of traditional methodologies. If scientists are going to ignore fundamental mathematical results arrived at in this paper and blindly use software to analyze data, the results of their analyses may not be trustworthy, and the findings of their experiments may never be validated. That is why new methods together with the wise use of traditional approaches are essential to progress safely through the present reality.
We study the interplay between memorization and generalization of overparametrized networks in the extreme case of a single training example. The learning task is to predict an output which is as similar as possible to the input. We examine both fully-connected and convolutional networks that are initialized randomly and then trained to minimize the reconstruction error. The trained networks take one of the two forms: the constant function (‘memorization’) and the identity function (‘generalization’). We show that different architectures exhibit vastly different inductive bias towards memorization and generalization. An important consequence of our study is that even in extreme cases of overparameterization, deep learning can result in proper generalization.
This paper introduces a new method for model selection and more generally hyperparameter selection in machine learning. The paper first proves a relationship between generalization error and a difference of description lengths of the training data; we call this difference differential description length (DDL). This allows prediction of generalization error from the training data *alone* by performing encoding of the training data. This can now be used for model selection by choosing the model that has the smallest predicted generalization error. We show how this encoding can be done for linear regression and neural networks. We provide experiments showing that this leads to smaller generalization error than cross-validation and traditional MDL and Bayes methods.
Originally inspired by neurobiology, deep neural network models have become a powerful tool of machine learning and artificial intelligence, where they are used to approximate functions and dynamics by learning from examples. Here we give a brief introduction to neural network models and deep learning for biologists. We introduce feedforward and recurrent networks and explain the expressive power of this modeling framework and the backpropagation algorithm for setting the parameters. Finally, we consider how deep neural networks might help us understand the brain’s computations.
If we assume that earthquakes are chaotic, and influenced locally, then chaos theory suggests that there should be a temporal association between earthquakes in a local region that should be revealed with statistical examination. To date no strong relationship has been shown (refs not prediction). However, earthquakes are basically failures of structured material systems, and when multiple failure mechanisms are present, prediction of failure is strongly inhibited without first separating the mechanisms. Here we show that by separating earthquakes statistically, based on their central tensor moment structure, along lines first suggested by a separation into mechanisms according to depth of the earthquake, a strong indication of temporal association appears. We show this in earthquakes above 200 km along the Pacific Ring of Fire, with a positive association in time between earthquakes of the same statistical type and a negative association in time between earthquakes of different types. Whether this can reveal either useful mechanistic information to seismologists, or can result in useful forecasts, remains to be seen.
We study a classic algorithmic problem through the lens of statistical learning. That is, we consider a matching problem where the input graph is sampled from some distribution. This distribution is unknown to the algorithm; however, an additional graph which is sampled from the same distribution is given during a training phase (preprocessing). More specifically, the algorithmic problem is to match $k$ out of $n$ items that arrive online to $d$ categories ($d\ll k \ll n$). Our goal is to design a two-stage online algorithm that retains a small subset of items in the first stage which contains an offline matching of maximum weight. We then compute this optimal matching in a second stage. The added statistical component is that before the online matching process begins, our algorithms learn from a training set consisting of another matching instance drawn from the same unknown distribution. Using this training set, we learn a policy that we apply during the online matching process. We consider a class of online policies that we term *thresholds policies*. For this class, we derive uniform convergence results both for the number of retained items and the value of the optimal matching. We show that the number of retained items and the value of the offline optimal matching deviate from their expectation by $O(\sqrt{k})$. This requires usage of less-standard concentration inequalities (standard ones give deviations of $O(\sqrt{n})$). Furthermore, we design an algorithm that outputs the optimal offline solution with high probability while retaining only $O(k\log \log n)$ items in expectation.
We study online linear regression problems in a distributed setting, where the data is spread over a network. In each round, each network node proposes a linear predictor, with the objective of fitting the *network-wide* data. It then updates its predictor for the next round according to the received local feedback and information received from neighboring nodes. The predictions made at a given node are assessed through the notion of regret, defined as the difference between their cumulative network-wide square errors and those of the best off-line network-wide linear predictor. Various scenarios are investigated, depending on the nature of the local feedback (full information or bandit feedback), on the set of available predictors (the decision set), and the way data is generated (by an oblivious or adaptive adversary). We propose simple and natural distributed regression algorithms, involving, at each node and in each round, a local gradient descent step and a communication and averaging step where nodes aim at aligning their predictors to those of their neighbors. We establish regret upper bounds typically in ${\cal O}(T^{3/4})$ when the decision set is unbounded and in ${\cal O}(\sqrt{T})$ in the case of a bounded decision set.
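A single round of the kind of algorithm described, a local gradient step followed by neighbor averaging, can be sketched as follows (a toy illustration with hypothetical names, not the paper's exact algorithm):

```python
def distributed_round(weights, data, neighbors, lr=0.1):
    """One round of distributed online linear regression (toy sketch).

    weights:   per-node weight vectors
    data:      per-node (x, y) observation for this round
    neighbors: adjacency list; node i averages with nodes in neighbors[i]
    """
    n = len(weights)
    stepped = []
    for i in range(n):
        w, (x, y) = weights[i], data[i]
        pred = sum(wj * xj for wj, xj in zip(w, x))
        grad = [2.0 * (pred - y) * xj for xj in x]     # gradient of (pred - y)^2
        stepped.append([wj - lr * gj for wj, gj in zip(w, grad)])
    # Communication step: each node averages its predictor with its neighbors'.
    return [[sum(c) / (1 + len(neighbors[i]))
             for c in zip(stepped[i], *[stepped[j] for j in neighbors[i]])]
            for i in range(n)]
```

Iterating this round on a small network drives every node's predictor toward the network-wide least-squares fit, which is the quantity the regret bounds above are measured against.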
We study the expressive power of kernel methods and the algorithmic feasibility of multiple kernel learning for a special rich class of kernels. Specifically, we define *Euclidean kernels*, a diverse class that includes most, if not all, families of kernels studied in the literature, such as polynomial kernels and radial basis functions. We then describe the geometric and spectral structure of this family of kernels over the hypercube (and to some extent for any compact domain). Our structural results allow us to prove meaningful limitations on the expressive power of the class as well as derive several efficient algorithms for learning kernels over different domains.
When searching for information, a human reader first glances over a document, spots relevant sections and then focuses on a few sentences for resolving her intention. However, the high variance of document structure makes it difficult to identify the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available dataset with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a best score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR LSTM model with bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 compared to state-of-the-art CNN classifiers with baseline segmentation.
We investigate conditions under which test statistics exist that can reliably detect examples that have been adversarially manipulated in a white-box attack. These statistics can be easily computed and calibrated by randomly corrupting inputs. They exploit certain anomalies that adversarial attacks introduce, in particular if they follow the paradigm of choosing perturbations optimally under p-norm constraints. Access to the log-odds is the only requirement to defend models. We justify our approach empirically, but also provide conditions under which detectability via the suggested test statistics is guaranteed to be effective. In our experiments, we show that it is even possible to correct test time predictions for adversarial attacks with high accuracy.
Rational decision making in its linguistic description means making logical decisions. In essence, a rational agent optimally processes all relevant information to achieve its goal. Rationality has two elements: the use of relevant information and the efficient processing of such information. In reality, relevant information is incomplete and imperfect, and the processing engine, which is a brain for humans, is suboptimal. Humans are risk averse rather than utility maximizers. In the real world, problems are predominantly non-convex, and this makes the idea of rational decision-making fundamentally unachievable; Herbert Simon called this bounded rationality. There is a trade-off between the amount of information used for decision-making and the complexity of the decision model used. This paper explores whether machine rationality is subjective and concludes that indeed it is.
In this paper, we present a dynamic non-diagonal regularization for interior point methods. The non-diagonal aspect of this regularization is implicit, since all the off-diagonal elements of the regularization matrices are cancelled out by those elements present in the Newton system, which do not contribute important information in the computation of the Newton direction. Such a regularization has multiple goals. The obvious one is to improve the spectral properties of the Newton system solved at each iteration of the interior point method. On the other hand, the regularization matrices introduce sparsity to the aforementioned linear system, allowing for more efficient factorizations. We also propose a rule for tuning the regularization dynamically based on the properties of the problem, such that sufficiently large eigenvalues of the non-regularized system are perturbed insignificantly. This alleviates the need for finding specific regularization values through experimentation, which is the most common approach in the literature. We provide perturbation bounds for the eigenvalues of the non-regularized system matrix and then discuss the spectral properties of the regularized matrix. Finally, we demonstrate the efficiency of the method applied to solve standard small and medium-scale linear and convex quadratic programming test problems.
We present a novel and hierarchical approach for supervised classification of signals spanning over a fixed graph, reflecting shared properties of the dataset. To this end, we introduce a Convolutional Cluster Pooling layer exploiting a multi-scale clustering in order to highlight, at different resolutions, locally connected regions on the input graph. Our proposal generalises well-established neural models such as Convolutional Neural Networks (CNNs) to irregular and complex domains by exploiting the weight-sharing property in a graph-oriented architecture. In this work, such property is based on the centrality of each vertex within its soft-assigned cluster. Extensive experiments on NTU RGB+D, CIFAR-10 and 20NEWS demonstrate the effectiveness of the proposed technique in capturing both local and global patterns in graph-structured data from different domains.
Session-based recommender systems (SBRS) are an emerging topic in the recommendation domain and have attracted much attention from both academia and industry in recent years. Most existing works focus only on modelling the general item-level dependency for recommendation tasks. However, there are many other challenges at different levels, e.g., item feature level and session level, and from various perspectives, e.g., item heterogeneity and intra- and inter-item feature coupling relations, associated with SBRS. In this paper, we provide a systematic and comprehensive review of SBRS and create a hierarchical and in-depth understanding of a variety of challenges in SBRS. To be specific, we first illustrate the value and significance of SBRS, followed by a hierarchical framework to categorize the related research issues and methods of SBRS and to reveal its intrinsic challenges and complexities. Further, a summary together with a detailed introduction of the research progress is provided. Lastly, we share some prospects in this research area.
Today’s AI still faces two major challenges. One is that in most industries, data exists in the form of isolated islands. The other is the strengthening of data privacy and security. We propose a possible solution to these challenges: secure federated learning. Beyond the federated learning framework first proposed by Google in 2016, we introduce a comprehensive secure federated learning framework, which includes horizontal federated learning, vertical federated learning and federated transfer learning. We provide definitions, architectures and applications for the federated learning framework, and provide a comprehensive survey of existing works on this subject. In addition, we propose building data networks among organizations based on federated mechanisms as an effective solution to allow knowledge to be shared without compromising user privacy.
Generating informative responses in end-to-end neural dialogue systems has attracted much attention in recent years. Various previous works leverage external knowledge and the dialogue contexts to generate such responses. Nevertheless, few have demonstrated the capability to incorporate the appropriate knowledge in response generation. Motivated by this, we propose a novel open-domain conversation generation model in this paper, which employs the posterior knowledge distribution to guide knowledge selection, thereby generating more appropriate and informative responses in conversations. To the best of our knowledge, we are the first to utilize the posterior knowledge distribution to facilitate conversation generation. Our experiments on both automatic and human evaluation clearly verify the superior performance of our model over the state-of-the-art baselines.
Before training a neural net, a classic rule of thumb is to randomly initialize the weights so that the variance of the preactivation is preserved across all layers. This is traditionally interpreted using the total variance due to randomness in both networks (weights) and samples. Alternatively, one can interpret the rule of thumb as preservation of the *sample* mean and variance for a fixed network, i.e., preactivation statistics computed over the random sample of training samples. The two interpretations differ little for a shallow net, but the difference is shown to be large for a deep ReLU net by decomposing the total variance into the network-averaged sum of the sample variance and square of the sample mean. We demonstrate that the latter term dominates in the later layers through an analytical calculation in the limit of infinite network width, and numerical simulations for finite width. Our experimental results from training neural nets support the idea that preserving sample statistics can be better than preserving total variance. We discuss the implications for the alternative rule of thumb that a network should be initialized to be at the ‘edge of chaos.’
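A quick way to see the variance-preserving rule of thumb at work is to simulate He-style initialization (std $\sqrt{2/\text{fan-in}}$, the standard choice for ReLU nets) and check that the preactivation sample variance stays roughly constant from layer to layer; a seeded pure-Python sketch (our own illustration, not the paper's experiment):

```python
import math, random

def relu_layer(x, width, rng):
    """Fully-connected layer with He-style init: std = sqrt(2 / fan_in),
    chosen so that ReLU's halving of the second moment is compensated."""
    fan_in = len(x)
    std = math.sqrt(2.0 / fan_in)
    w = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(width)]
    pre = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    return [max(0.0, p) for p in pre], pre

def variance(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(256)]
pre_vars = []
for _ in range(5):
    x, pre = relu_layer(x, 256, rng)
    pre_vars.append(variance(pre))   # stays roughly constant across layers
```

Dropping the factor of 2 (plain $1/\text{fan-in}$ scaling) makes the same quantity shrink by roughly half per ReLU layer, which is exactly the collapse the rule of thumb is designed to avoid.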
Learning to solve diagrammatic reasoning (DR) can be a challenging but interesting problem for the computer vision research community. It is believed that next-generation pattern recognition applications should be able to simulate the human brain to understand and analyze reasoning in images. However, due to the lack of benchmarks for diagrammatic reasoning, present research primarily focuses on visual reasoning that can be applied to real-world objects. In this paper, we present a diagrammatic reasoning dataset that provides a large variety of DR problems. In addition, we also propose a Knowledge-based Long Short Term Memory (KLSTM) to solve diagrammatic reasoning problems. Our proposed analysis is arguably the first work in this research area. Several state-of-the-art learning frameworks have been used to compare with the proposed KLSTM framework in the present context. Preliminary results indicate that the domain is highly related to computer vision and pattern recognition research with several challenging avenues.
We estimate model parameters of Lévy-driven causal CARMA random fields by fitting the empirical variogram to the theoretical counterpart using a weighted least squares (WLS) approach. Subsequent to deriving asymptotic results for the variogram estimator, we show strong consistency and asymptotic normality of the parameter estimator. Furthermore, we conduct a simulation study to assess the quality of the WLS estimator for finite samples. For the simulation we utilize numerical approximation schemes based on truncation and discretization of stochastic integrals and we analyze the associated simulation errors in detail. Finally, we apply our results to real data of the cosmic microwave background.
This paper presents a novel, high-performance, graphical processing unit-based algorithm for efficiently solving two-dimensional linear programs in batches. The domain of two-dimensional linear programs is particularly useful due to the prevalence of relevant geometric problems. Batch linear programming refers to solving numerous different linear programs within one operation. By solving many linear programs simultaneously and distributing workload evenly across threads, graphical processing unit utilization can be maximized. Speedups of over 22 times and 63 times are obtained against state-of-the-art graphics processing unit and CPU linear program solvers, respectively.
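Part of why 2D linear programs batch so well is that each one is tiny: the optimum of a bounded, feasible 2D LP lies at the intersection of two constraint boundaries, so brute-force pair enumeration suffices. A minimal single-LP sketch in pure Python (the paper's GPU algorithm is more sophisticated; the function name is ours):

```python
def solve_lp_2d(c, constraints, eps=1e-9):
    """Maximize c[0]*x + c[1]*y subject to a*x + b*y <= d for each (a, b, d).

    Brute force: enumerate intersections of all pairs of constraint
    boundaries and keep the best feasible vertex.
    """
    best = None
    n = len(constraints)
    for i in range(n):
        a1, b1, d1 = constraints[i]
        for j in range(i + 1, n):
            a2, b2, d2 = constraints[j]
            det = a1 * b2 - a2 * b1
            if abs(det) < eps:
                continue                      # parallel boundaries
            x = (d1 * b2 - d2 * b1) / det     # Cramer's rule
            y = (a1 * d2 - a2 * d1) / det
            if all(a * x + b * y <= d + 1e-7 for a, b, d in constraints):
                val = c[0] * x + c[1] * y
                if best is None or val > best[0]:
                    best = (val, x, y)
    return best   # (objective, x, y), or None if no feasible vertex is found
```

In a batch setting, each GPU thread would run a small loop like this for a different LP, which is how workload stays evenly distributed and utilization high.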
In this paper we propose to perform model ensembling in a multiclass or a multilabel learning setting using Wasserstein (W.) barycenters. Optimal transport metrics, such as the Wasserstein distance, allow incorporating semantic side information such as word embeddings. Using W. barycenters to find the consensus between models allows us to balance confidence and semantics in finding the agreement between the models. We show applications of Wasserstein ensembling in attribute-based classification, multilabel learning and image caption generation. These results show that W. ensembling is a viable alternative to basic geometric or arithmetic mean ensembling.
To relieve the pain of manually selecting machine learning algorithms and tuning hyperparameters, automated machine learning (AutoML) methods have been developed to automatically search for good models. Due to the huge model search space, it is impossible to try all models. Users tend to distrust automatic results and increase the search budget as much as they can, thereby undermining the efficiency of AutoML. To address these issues, we design and implement ATMSeer, an interactive visualization tool that supports users in refining the search space of AutoML and analyzing the results. To guide the design of ATMSeer, we derive a workflow of using AutoML based on interviews with machine learning experts. A multi-granularity visualization is proposed to enable users to monitor the AutoML process, analyze the searched models, and refine the search space in real time. We demonstrate the utility and usability of ATMSeer through two case studies, expert interviews, and a user study with 13 end users.

### Book Memo: “Keras to Kubernetes”

The Journey of a Machine Learning Model to Production. We have seen exponential growth in the use of Artificial Intelligence (AI) over the last few years. AI is becoming the new electricity and is touching every industry, from retail to manufacturing to healthcare to entertainment. Within AI, we’re seeing particular growth in Machine Learning (ML) and Deep Learning (DL) applications. ML is all about learning relationships from labeled (supervised) or unlabeled (unsupervised) data. DL has many layers of learning and can extract patterns from unstructured data like images, video, and audio. Machine Learning with Keras and Kubernetes takes you through real-world examples of building a Keras model for detecting logos in images. You will then take that trained model and package it as a web application container before learning how to deploy this model at scale on a Kubernetes cluster.

### Three surveys of AI adoption reveal key advice from more mature practices

An overview of emerging trends, known hurdles, and best practices in artificial intelligence.

Recently, O’Reilly Media published AI Adoption in the Enterprise: How Companies Are Planning and Prioritizing AI Projects in Practice, a report based on an industry survey. That was the third of three industry surveys conducted in 2018 to probe trends in artificial intelligence (AI), big data, and cloud adoption. The other two surveys were The State of Machine Learning Adoption in the Enterprise, released in July 2018, and Evolving Data Infrastructure, released in January 2019.

This article looks at those results in further detail, comparing high-level themes based on the three reports, plus related presentations at the Strata Data Conference and the AI Conference. These points would have been out of scope for any of the individual reports.

## Exploring new markets by repurposing AI applications

Looking across industry sectors in AI Adoption in the Enterprise, we see how technology, health care, and retail tend to be the leaders in AI adoption, whereas the public sector (government) tends to lag, along with education and manufacturing. Although that gap could be taken as commentary about the need for “data for social good,” it also points toward opportunities. Consider this: finance has enjoyed first-mover advantages in artificial intelligence adoption, as have the technology and retail sectors. Having matured in these practices, financial services firms are now exploring opportunities that just a few years ago might have been considered niches. For example, at our recent AI Conference in London, two talks—by Ashok Srivastava of Intuit and Johnny Ball of Fluidly—presented business applications for AI aimed at establishing safety nets for small businesses. Both teams applied anomaly detection techniques (for example, reused from aircraft engine monitoring) to spot when small businesses were likely to fail. That’s important since more than 50% of small businesses fail, mostly due to exactly those “anomalies”: cash flow problems and late payments.

Given how government and education trail as laggards in the AI space, could similar kinds of technology reuse apply there? For example, within the past few years, it’s become common practice in U.S. grade schools for teachers to provide detailed information online to parents about student assignments and grades. This data can be extremely helpful as early warning signals for at-risk students who might be failing school—although, quite frankly, few working parents can afford the time to track that much data. Moreover, few schools have resources to act on that data in aggregate. Even so, the anomaly detection used in small business cash-flow analysis is strikingly similar to what a homework “safety net” for students would need. Undoubtedly, there are areas within government (especially at the local level) where similar AI applications could lead to considerable public upside, but which would otherwise go understaffed due to budget constraints. As the enterprise adoption of AI continues to mature, we can hope that diffusion from the leaders to the laggards comes through similarly innovative acts of technology repurposing. The trick seems to be finding enough people with depth in both technical and business skills who can recognize business use cases for AI.

## Differentiated tooling

Looking at the “Tools for Building AI Applications” section of AI Adoption in the Enterprise for trends about technology adoption, we see how frameworks such as Spark NLP, scikit-learn, and H2O hold popularity in finance, whereas Google Cloud ML Engine gets higher share within the health care industry. Compared with analysis last year, both Keras and PyTorch have picked up significant gains over the category leader TensorFlow. Also, while there has been debate in the industry about the relative merits of using Jupyter Notebooks in production, usage has been growing dramatically. We see from this survey’s results that support for notebooks (23%) now leads over support for IDEs (17%).

The summary results about health care and life sciences create an interesting picture. 70 percent of all respondents from the health sector are using AI for R&D projects. Respondents from the health care sector also had significantly less trouble identifying appropriate use cases for AI, although hurdles for the sector seem to come later in the AI production lifecycle. In general, health care leads other verticals in how it checks for a broad range of AI-related risks, and this vertical makes more use of data visualization than others, as would be expected. It’s also gaining in use of reinforcement learning, which was not expected. Although we know of reinforcement learning production use cases in finance, we don’t have optics into how reinforcement learning is used in health care. That could be a good topic for a subsequent survey.

Admittedly, the survey for AI Adoption in the Enterprise drew from the initiated: 81% of respondents work for organizations that already use AI. We have much to learn from their collective experiences. For example, there’s a story unfolding in the contrast between mature practices and firms that are earlier in their journey toward AI adoption. Some of the key advice emerging from the mature organizations includes:

• Work toward overcoming challenges related to company culture or not being able to recognize the business use cases.
• Be mindful that the lack of data and lack of skilled people will pose ongoing challenges.
• While hiring data scientists, complement by also hiring people who can identify business use cases for AI solutions.
• Beyond just optimizing for business metrics, also check for model transparency and interpretability, fairness and bias, and that your AI systems are reliable and safe.
• Explore use cases beyond deep learning: other solutions have gained significant traction, including human-in-the-loop, knowledge graphs, and reinforcement learning.
• Look for value in applications of transfer learning, which is a nuanced technique the more advanced organizations recognize.
• Your organization probably needs to invest more in infrastructure engineering than it thinks, perpetually.

This is a story about the relative mix of priorities as a team gains experience. That experience is often gained by learning from early mistakes. In other words, there’s quite a long list of potential issues and concerns that an organization might consider at the outset of AI adoption in enterprise. However, “Go invest in everything, all at once” is not much of a strategy. Advice from leaders at the more sophisticated AI practices tends to be: “Here are the N things we tried early and have learned not to prioritize as much.” We hope that these surveys offer helpful guidance that other organizations can follow.

This is also a story about how to pace investments and sequence large initiatives effectively. For example, you must address the more foundational pain points early—such as problems with company culture, or the lack of enough personnel who can identify the business uses—or those will become blockers for other AI initiatives down the road. Meanwhile, some investments must be ongoing, such as hiring appropriate talent and working to improve data sets. As an executive, don’t assume that one-shot initiatives will work as a panacea. These are ongoing challenges and you must budget for them as such.

Speaking of budget, firms are clearly taking the matter of AI adoption seriously, allocating significant amounts of their IT budgets for AI-related projects. Even if your firm isn’t, you can pretty much bet that the competition will be. Which side of that bet will pay off?

## Heading toward a threshold point

Another issue emerged from the surveys that concerns messaging about AutoML. Adoption percentages for AutoML had been in single-digit territory in our earlier survey just two quarters ago. Now, we see many organizations making serious budget allocations toward integrating AutoML over the course of the next year. This is especially poignant for the more mature practices: 86% will be integrating AutoML within the next year, nearly two times that of the evaluation stage firms. That shift is timed almost precisely as cloud providers extend their AutoML offerings. For example, this was an important theme emphasized at Amazon’s recent re:Invent conference in Las Vegas. Both sides, demand and supply, are rolling the dice on AutoML in a big way.

Even so, there’s a risk that less-informed executives might interpret the growing uptake of AutoML as a signal that “AI capabilities are readily available off-the-shelf.” That’s anything but the case at hand. The process of leveraging AI capabilities, even within the AutoML category, depends on multi-year transformations for organizations. That effort requires substantial capital investments and typically an extensive evolution of mindshare by the leadership. It’s not an impulse buy. Another important point to remember is that AutoML is only one portion of the automation that's needed. See the recent Data Show Podcast interview “Building tools for enterprise data science” with Vitaly Gordon, VP of data science and engineering at Salesforce, about their TransmogrifAI open source project for machine learning workflows. It's clear that automating the model building and model search step—the AutoML part—is just one piece of the puzzle.

We’ve also known—since studies published in 2017 plus the analysis that followed—that a “digital divide” is growing in enterprise between the leaders and the laggards in AI adoption. See the excellent “Notes from the frontier: Making AI work,” by Michael Chui at McKinsey Global Institute, plus the related report, AI adoption advances, but foundational barriers remain. What we observe now in Q4 2018 and Q1 2019 is how the mature practices are investing significantly, and based on lessons learned, they’re investing more wisely. However, most of the laggards aren’t even beginning to invest in crucial transformations that will require years. We cannot overstress how this demonstrates a growing divide between “haves” and “have nots” among enterprise organizations. At some threshold point relatively soon, the “have nots” might simply fall too many years behind their competitors to be worth the investments that will be needed to catch up.

hist(df_result$steps)

With some additional visuals, you can see the results as well:

library(ggplot2)
library(gridExtra)
#par(mfrow=c(1,2))
p1 <- ggplot(df_result, aes(x=number, y=steps)) +
  geom_bar(stat='identity') +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 8))
p2 <- ggplot(df_result, aes(x=log10(number), y=steps)) +
  geom_point(alpha = 1/50)
grid.arrange(p1, p2, ncol=2, nrow=1)

And the graph: many numbers converge on the third step — roughly every 4th or 5th number. We would need to look into the steps of the solutions to see what these numbers have in common. That will follow, so stay tuned.

Fun fact: at the time of writing this blog post, the number 6174 was not a constant in base R.

As always, the code is available on Github. Happy Rrrring!

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

### R Packages worth a look

Similarity-Based Segmentation of Multidimensional Signals (segmenTier)
A dynamic programming solution to segmentation based on maximization of arbitrary similarity measures within segments. The general idea, theory and thi …

Analysis of Longitudinal Data with Irregular Observation Times (IrregLong)
Analysis of longitudinal data for which the times of observation are random variables that are potentially associated with the outcome process. The pac …

Shiny Matrix Input Field (shinyMatrix)
Implements a custom matrix input field.
Continue Reading…

### I Just Wanted The Data: Turning Tableau & Tidyverse Tears Into Smiles with Base R (An Encoding Detective Story)

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Those outside the Colonies may not know that Payless—a national chain that made footwear affordable for millions of ‘Muricans who can’t spare $100.00 USD for a pair of shoes their 7 year old will outgrow in a year—is closing. CNBC also had a story that featured a choropleth with a tiny button at the bottom that indicated one could get the data:

I should have known this would turn out to be a chore since they used Tableau—the platform of choice when you want to take advantage of all the free software libraries they use to power their premier platform which, in turn, locks up all the data for you so others can’t adopt, adapt and improve. Go. Egregious. Predatory. Capitalism.

Anyway.

I wanted the data to do some real analysis vs. producing a fairly unhelpful visualization (TLDR: layer in Census data for areas impacted, estimate job losses, compute nearest similar Payless stores to see the impact on transportation-challenged homes, etc. Y’know, citizen data journalism-y things), so I pressed the button and watched for the URL in Chrome (aye, for those that remember I moved to Firefox et al in 2018, I switched back; more on that in March) and copied it to try to make this post actually reproducible (a novel concept for Tableau fanbois):

library(tibble)

# https://www.cnbc.com/2019/02/19/heres-a-map-of-where-payless-shoesource-is-closing-2500-stores.html

tfil <- "~/Data/Sheet_3_data.csv"

download.file(
  "https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",
  tfil
)
## trying URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true'
##   cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true'
##   cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true': HTTP status was '410 Gone'


WAT

Truth be told I expected a time-boxed URL of some sort (prior experience FTW). Selenium or Splash were potential alternatives but I didn’t want to research the legality of more forceful scraping (I just wanted the data) so I manually downloaded the file (*the horror*) and proceeded to read it in. Well, try to read it in:

read_csv(tfil)
## Parsed with column specification:
## cols(
##   A = col_logical()
## )
## Warning: 2092 parsing failures.
## row col           expected actual                      file
##   1   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   2   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   3   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   4   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   5   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
## ... ... .................. ...... .........................
## See problems(...) for more details.
##
## # A tibble: 2,090 x 1
##    A
##
##  1 NA
##  2 NA
##  3 NA
##  4 NA
##  5 NA
##  6 NA
##  7 NA
##  8 NA
##  9 NA
## 10 NA
## # … with 2,080 more rows


WAT

Getting a single column back from readr::read_[ct]sv() is (generally) a tell-tale sign that the file format is amiss. Before donning a deerstalker (I just wanted the data!) I tried to just use good ol’ read.csv():

read.csv(tfil, stringsAsFactors=FALSE)
## Error in make.names(col.names, unique = TRUE) :
##   invalid multibyte string at 'A'
##   line 1 appears to contain embedded nulls
##   line 2 appears to contain embedded nulls
##   line 3 appears to contain embedded nulls
##   line 4 appears to contain embedded nulls
##   line 5 appears to contain embedded nulls


WAT

Actually the “WAT” isn’t really warranted since read.csv() gave us some super-valuable info via invalid multibyte string at 'A'. FF FE is a big signal that we’re working with a file in another encoding, as that’s a common “magic” sequence (a byte-order mark) at the start of such files.

But, I didn’t want to delve into my Columbo persona… I. Just. Wanted. The. Data. So, I tried the mind-bendingly fast and flexible helper from data.table:

data.table::fread(tfil)
##   File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.


AHA. UTF-16 (maybe). Let’s poke at the raw file:

x <- readBin(tfil, "raw", file.size(tfil)) ## also: read_file_raw(tfil)

x[1:100]
##   [1] ff fe 41 00 64 00 64 00 72 00 65 00 73 00 73 00 09 00 43 00
##  [21] 69 00 74 00 79 00 09 00 43 00 6f 00 75 00 6e 00 74 00 72 00
##  [41] 79 00 09 00 49 00 6e 00 64 00 65 00 78 00 09 00 4c 00 61 00
##  [61] 62 00 65 00 6c 00 09 00 4c 00 61 00 74 00 69 00 74 00 75 00
##  [81] 64 00 65 00 09 00 4c 00 6f 00 6e 00 67 00 69 00 74 00 75 00


There’s our ff fe (which is the beginning of the possibility it’s UTF-16), but that 41 00 harkens back to UTF-16’s older sibling UCS-2. The 0x00s are embedded nuls (likely to get bytes aligned). And there are a lot of 09s. Y’know what they are? They’re tabs. That’s right: Tableau named a file full of TSV records, in an unnecessarily elaborate encoding, “CSV”. Perhaps they broke the “T” on all their keyboards typing their product name so much.
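That byte-level reasoning can be mechanized. As a language-neutral illustration (my own sketch, not code from the original post; the function name is hypothetical), here is the same heuristic in C: a leading FF FE byte-order mark followed by code units whose high bytes are mostly 0x00 strongly suggests UTF-16LE text:

```c
#include <stddef.h>

/* Illustrative sketch (not from the original post): flag a buffer as likely
 * UTF-16LE when it starts with the FF FE byte-order mark and the high byte
 * of most little-endian code units is 0x00 (i.e., ASCII-range characters). */
int looks_like_utf16le(const unsigned char *buf, size_t n) {
    if (n < 4 || buf[0] != 0xFF || buf[1] != 0xFE)
        return 0;
    size_t zeros = 0, units = 0;
    for (size_t i = 2; i + 1 < n && units < 32; i += 2, units++) {
        if (buf[i + 1] == 0x00)   /* high byte of a little-endian code unit */
            zeros++;
    }
    return units > 0 && zeros * 10 >= units * 9;  /* at least 90% ASCII-range */
}
```

Fed the first bytes shown above (ff fe 41 00 64 00 …), it flags the file; plain ASCII delimited text does not trip it.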

### Living A Boy’s [Data] Adventure Tale

At this point we have:

• no way to support an automated, reproducible workflow
• an ill-named file for what it contains
• an overly-encoded file for what it contains
• many wasted minutes (which is likely by design to have us give up and just use Tableau. No. Way.)

At this point I’m in full-on Rockford Files (pun intended) mode and delved down to the command line to use an old, trusted sidekick enca:

### More fun with fast remainders when the divisor is a constant

In software, compilers can often optimize away integer divisions, and replace them with cheaper instructions, especially when the divisor is a constant. I recently wrote about some work on faster remainders when the divisor is a constant. I reported that it can be fruitful to compute the remainder directly, instead of first computing the quotient (as compilers are doing when the divisor is a constant).

To get good results, we can use an important insight that is not documented anywhere at any length: we can use 64-bit processor instructions to do 32-bit arithmetic. This is fair game and compilers could use this insight, but they do not do it systematically. Using this trick alone is enough to get substantial gains in some instances, if the algorithmic issues are just right.
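As a toy illustration of that insight (my own example, not from the article): a 32-bit modular multiplication overflows if performed entirely in 32-bit registers, but is exact once the product is formed with 64-bit instructions first:

```c
#include <stdint.h>

/* Toy example of using 64-bit instructions for 32-bit arithmetic:
 * form the full 64-bit product before reducing, so the multiply
 * of two 32-bit operands cannot overflow. */
uint32_t mulmod32(uint32_t a, uint32_t b, uint32_t d) {
    uint64_t product = (uint64_t)a * b;  /* exact: needs 64-bit arithmetic */
    return (uint32_t)(product % d);
}
```

For a = 4000000000, b = 3, d = 7, this returns 2, whereas the same expression evaluated in pure 32-bit arithmetic wraps the product modulo 2^32 and yields 1.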

So it is a bit complicated. Using 64-bit processor instructions for 32-bit arithmetic is sometimes useful. In addition, computing the remainder directly without first computing the quotient is sometimes useful. Let us collect a data point for fun and to motivate further work.

First let us consider how you might compute the remainder by leaving it up to the compiler to do the heavy lifting (D is a constant known to the compiler). I expect that the compiler will turn this code into a sequence of instructions over 32-bit registers:

uint32_t compilermod32(uint32_t a) {
  return a % D;
}



Then we can compute the remainder directly, using some magical mathematics and 64-bit instructions:

#define M ((uint64_t)(UINT64_C(0xFFFFFFFFFFFFFFFF) / (D) + 1))

uint32_t directmod64(uint32_t a) {
  uint64_t lowbits = M * a;
  return ((__uint128_t)lowbits * D) >> 64;
}


Finally, you can compute the remainder “indirectly” (by first computing the quotient) but using 64-bit processor instructions.

uint32_t indirectmod64(uint32_t a) {
  uint64_t quotient = ((__uint128_t)M * a) >> 64;
  return a - quotient * D;
}
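Before benchmarking, it is worth a sanity check that the three routines agree. A minimal self-contained harness (my sketch, restating the definitions above with D fixed to 22; __uint128_t assumes GCC or Clang):

```c
#include <stdint.h>

#define D 22
#define M ((uint64_t)(UINT64_C(0xFFFFFFFFFFFFFFFF) / (D) + 1))

/* Reference: let the compiler lower the constant division. */
uint32_t compilermod32(uint32_t a) { return a % D; }

/* Direct remainder via the magic constant M and a 128-bit multiply. */
uint32_t directmod64(uint32_t a) {
    uint64_t lowbits = M * a;
    return (uint32_t)(((__uint128_t)lowbits * D) >> 64);
}

/* Indirect: compute the quotient first, then subtract. */
uint32_t indirectmod64(uint32_t a) {
    uint64_t quotient = (uint64_t)(((__uint128_t)M * a) >> 64);
    return a - (uint32_t)(quotient * D);
}
```

All three should return a % 22 for every 32-bit a; checking a range of values (including UINT32_MAX) is a cheap guard before trusting any cycle counts.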


As a benchmark, I am going to compute a linear congruential generator (basically a recursive linear function with a remainder thrown in), using these three approaches, plus the naive one. I use the constant 22 as the divisor, a Skylake processor, and the GNU GCC 8.1 compiler. For each generated number, I measure the following number of CPU cycles (on average):

• slow (division instruction): 29 cycles
• compiler (32-bit): 12 cycles
• direct (64-bit): 10 cycles
• indirect (64-bit): 11 cycles

Depending on your exact platform, all three approaches (compiler, direct, indirect) could be a contender for best results. In fact, it is even possible that the division instruction could win out in some cases. For example, on ARM and POWER processors, the division instruction does beat some compilers.

Where does this leave us? There is no silver bullet but a simple C function can beat a state-of-the-art optimizing compiler. In many cases, we found that a direct computation of the 32-bit remainder using 64-bit instructions was best.

### DATAx Singapore Highlights, March 5-6

Join conversations with Oracle, WPP, Axiata, Dyson, IBM, Netflix, Visa, AIA, Google, Bloomberg & more as they share how they utilize technology and data science.

### Magister Dixit

“The validity of causal inferences depends on structural knowledge, which is fallible, to supplement the information in the data. As a consequence, no algorithm can quantify the accuracy of causal inferences from observational data.” Miguel A. Hernán, John Hsu, Brian Healy (July 12, 2018)

### Floor filler

(This article was first published on R on Gianluca Baio, and kindly contributed to R-bloggers)

As I posted recently, I’m involved in a couple of events, later this summer: our annual Summer School and the new(er) tradition of the R for HTA workshop.

I have to say that I’m very happy about how things are proceeding for both of them. The summer school was first advertised a few months back (I’ve posted on the blog, but we’ve also tried to reach other relevant mailing lists and groups, such as the HTA agencies in the EUnetHTA Network). And the dancefloor is quickly filling — there’s been a surge in registrations in the past couple of weeks and we now only have 4 places left. (I’m not expecting to have dance sessions when we reconvene in Florence, in June. Although usually people do have lots of fun, both at the Centro Studi, chilling on the terrace, or rolling down to Florence…).

The R for HTA workshop is even more impressive and pleasing, I think. We are well on our way to filling the 20 places for the short course on using R for Cost-Effectiveness Modelling: 12 places are already reserved! And we already have 16 registrations for the main event as well.

And we’re also finalising the “hackathon” — well challenge, to use the formal terminology — which sounds like an interesting exercise. We’ll publicise this shortly as well, so people can sign up for it too!


### Tourism’s boom is not universally welcome

Global tourism has been doing well, but several difficulties point to a slowdown in the coming years

### Descriptive/Summary Statistics with descriptr

We are pleased to introduce the descriptr package, a set of tools for
generating descriptive/summary statistics.

## Installation

# Install release version from CRAN
install.packages("descriptr")

# Install development version from GitHub
# install.packages("devtools")
devtools::install_github("rsquaredacademy/descriptr")

## Shiny App

descriptr includes a Shiny app which can be launched using

ds_launch_shiny_app()

or try the live version here.

See the descriptr website for detailed documentation on using the package.

## Data

We have modified the mtcars data to create a new data set mtcarz. The only
difference between the two data sets is related to the variable types.

str(mtcarz)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

## Data Screening

The ds_screener() function will screen a data set and return the following:

– Column/Variable Names
– Data Type
– Levels (in case of categorical data)
– Number of missing observations
– % of missing observations

ds_screener(mtcarz)
## -----------------------------------------------------------------------
## |  Column Name  |  Data Type  |  Levels   |  Missing  |  Missing (%)  |
## -----------------------------------------------------------------------
## |      mpg      |   numeric   |    NA     |     0     |       0       |
## |      cyl      |   factor    |   4 6 8   |     0     |       0       |
## |     disp      |   numeric   |    NA     |     0     |       0       |
## |      hp       |   numeric   |    NA     |     0     |       0       |
## |     drat      |   numeric   |    NA     |     0     |       0       |
## |      wt       |   numeric   |    NA     |     0     |       0       |
## |     qsec      |   numeric   |    NA     |     0     |       0       |
## |      vs       |   factor    |    0 1    |     0     |       0       |
## |      am       |   factor    |    0 1    |     0     |       0       |
## |     gear      |   factor    |   3 4 5   |     0     |       0       |
## |     carb      |   factor    |1 2 3 4 6 8|     0     |       0       |
## -----------------------------------------------------------------------
##
## Overall Missing Values           0
## Percentage of Missing Values     0 %
## Rows with Missing Values         0
## Columns With Missing Values      0

## Continuous Data

### Summary Statistics

The ds_summary_stats() function returns a comprehensive set of statistics including measures of location, variation, symmetry and extreme observations.
ds_summary_stats(mtcarz, mpg)
## ------------------------------ Variable: mpg ------------------------------
##
## Univariate Analysis
##
## N             32.00    Variance               36.32
## Missing        0.00    Std Deviation           6.03
## Mean          20.09    Range                  23.50
## Median        19.20    Interquartile Range     7.38
## Mode          10.40    Uncorrected SS      14042.31
## Trimmed Mean  19.95    Corrected SS         1126.05
## Skewness       0.67    Coeff Variation        30.00
## Kurtosis      -0.02    Std Error Mean          1.07
##
## Quantiles
##
## Quantile    Value
##
## Max         33.90
## 99%         33.44
## 95%         31.30
## 90%         30.09
## Q3          22.80
## Median      19.20
## Q1          15.43
## 10%         14.34
## 5%          12.00
## 1%          10.40
## Min         10.40
##
## Extreme Values
##
## Low             High
##
## Obs   Value     Obs   Value
## 15    10.4      20    33.9
## 16    10.4      18    32.4
## 24    13.3      19    30.4
## 7     14.3      28    30.4
## 17    14.7      26    27.3

You can pass multiple variables as shown below:

ds_summary_stats(mtcarz, mpg, disp)
## ------------------------------ Variable: mpg ------------------------------
##
## Univariate Analysis
##
## N             32.00    Variance               36.32
## Missing        0.00    Std Deviation           6.03
## Mean          20.09    Range                  23.50
## Median        19.20    Interquartile Range     7.38
## Mode          10.40    Uncorrected SS      14042.31
## Trimmed Mean  19.95    Corrected SS         1126.05
## Skewness       0.67    Coeff Variation        30.00
## Kurtosis      -0.02    Std Error Mean          1.07
##
## Quantiles
##
## Quantile    Value
##
## Max         33.90
## 99%         33.44
## 95%         31.30
## 90%         30.09
## Q3          22.80
## Median      19.20
## Q1          15.43
## 10%         14.34
## 5%          12.00
## 1%          10.40
## Min         10.40
##
## Extreme Values
##
## Low             High
##
## Obs   Value     Obs   Value
## 15    10.4      20    33.9
## 16    10.4      18    32.4
## 24    13.3      19    30.4
## 7     14.3      28    30.4
## 17    14.7      26    27.3
##
##
##
## ------------------------------ Variable: disp -----------------------------
##
## Univariate Analysis
##
## N             32.00    Variance             15360.80
## Missing        0.00    Std Deviation          123.94
## Mean         230.72    Range                  400.90
## Median       196.30    Interquartile Range    205.18
## Mode         275.80    Uncorrected SS     2179627.47
## Trimmed Mean 228.00    Corrected SS        476184.79
## Skewness       0.42    Coeff Variation         53.72
## Kurtosis      -1.07    Std Error Mean          21.91
##
## Quantiles
##
## Quantile    Value
##
## Max        472.00
## 99%        468.28
## 95%        449.00
## 90%        396.00
## Q3         326.00
## Median     196.30
## Q1         120.83
## 10%         80.61
## 5%          77.35
## 1%          72.53
## Min         71.10
##
## Extreme Values
##
## Low             High
##
## Obs   Value     Obs   Value
## 20    71.1      15    472
## 19    75.7      16    460
## 18    78.7      17    440
## 26    79        25    400
## 28    95.1      5     360

If you do not specify any variables, it will detect all the continuous variables in the data set and return summary statistics for each of them.

### Frequency Distribution

The ds_freq_table() function creates frequency tables for continuous variables. The default number of intervals is 5.

ds_freq_table(mtcarz, mpg, 4)
## Variable: mpg
## |---------------------------------------------------------------------------|
## |     Bins      |  Frequency  |  Cum Frequency  |  Percent  |  Cum Percent  |
## |---------------------------------------------------------------------------|
## |  10.4 - 16.3  |     10      |       10        |   31.25   |     31.25     |
## |---------------------------------------------------------------------------|
## |  16.3 - 22.1  |     13      |       23        |   40.62   |     71.88     |
## |---------------------------------------------------------------------------|
## |   22.1 - 28   |      5      |       28        |   15.62   |     87.5      |
## |---------------------------------------------------------------------------|
## |   28 - 33.9   |      4      |       32        |   12.5    |      100      |
## |---------------------------------------------------------------------------|
## |     Total     |     32      |        -        |  100.00   |       -       |
## |---------------------------------------------------------------------------|

#### Histogram

A plot() method has been defined which will generate a histogram.

k <- ds_freq_table(mtcarz, mpg, 4)
plot(k)

### Auto Summary

If you want to view summary statistics and frequency tables of all or a subset of variables in a data set, use ds_auto_summary().
ds_auto_summary_stats(mtcarz, disp, mpg)

## ------------------------------ Variable: disp -----------------------------
##
## ---------------------------- Summary Statistics ---------------------------
##
## ------------------------------ Variable: disp -----------------------------
##
##                          Univariate Analysis
##
##  N                       32.00      Variance             15360.80
##  Missing                  0.00      Std Deviation          123.94
##  Mean                   230.72      Range                  400.90
##  Median                 196.30      Interquartile Range    205.18
##  Mode                   275.80      Uncorrected SS     2179627.47
##  Trimmed Mean           228.00      Corrected SS        476184.79
##  Skewness                 0.42      Coeff Variation         53.72
##  Kurtosis                -1.07      Std Error Mean          21.91
##
##                               Quantiles
##
##       Quantile        Value
##
##       Max            472.00
##       99%            468.28
##       95%            449.00
##       90%            396.00
##       Q3             326.00
##       Median         196.30
##       Q1             120.83
##       10%             80.61
##       5%              77.35
##       1%              72.53
##       Min             71.10
##
##                             Extreme Values
##
##         Low                         High
##
##     Obs      Value            Obs      Value
##      20       71.1             15        472
##      19       75.7             16        460
##      18       78.7             17        440
##      26       79               25        400
##      28       95.1              5        360
##
##
##
## NULL
##
##
## -------------------------- Frequency Distribution -------------------------
##
##                                Variable: disp
## |---------------------------------------------------------------------------|
## |        Bins        | Frequency | Cum Frequency |  Percent  | Cum Percent  |
## |---------------------------------------------------------------------------|
## |    71.1 - 151.3    |    12     |      12       |   37.5    |     37.5     |
## |---------------------------------------------------------------------------|
## |   151.3 - 231.5    |     5     |      17       |   15.62   |    53.12     |
## |---------------------------------------------------------------------------|
## |   231.5 - 311.6    |     6     |      23       |   18.75   |    71.88     |
## |---------------------------------------------------------------------------|
## |   311.6 - 391.8    |     5     |      28       |   15.62   |     87.5     |
## |---------------------------------------------------------------------------|
## |    391.8 - 472     |     4     |      32       |   12.5    |     100      |
## |---------------------------------------------------------------------------|
## |       Total        |    32     |       -       |  100.00   |      -       |
## |---------------------------------------------------------------------------|
##
##
## ------------------------------ Variable: mpg ------------------------------
##
## ---------------------------- Summary Statistics ---------------------------
##
## ------------------------------ Variable: mpg ------------------------------
##
##                          Univariate Analysis
##
##  N                       32.00      Variance                36.32
##  Missing                  0.00      Std Deviation            6.03
##  Mean                    20.09      Range                   23.50
##  Median                  19.20      Interquartile Range      7.38
##  Mode                    10.40      Uncorrected SS       14042.31
##  Trimmed Mean            19.95      Corrected SS          1126.05
##  Skewness                 0.67      Coeff Variation         30.00
##  Kurtosis                -0.02      Std Error Mean           1.07
##
##                               Quantiles
##
##       Quantile        Value
##
##       Max             33.90
##       99%             33.44
##       95%             31.30
##       90%             30.09
##       Q3              22.80
##       Median          19.20
##       Q1              15.43
##       10%             14.34
##       5%              12.00
##       1%              10.40
##       Min             10.40
##
##                             Extreme Values
##
##         Low                         High
##
##     Obs      Value            Obs      Value
##      15       10.4             20       33.9
##      16       10.4             18       32.4
##      24       13.3             19       30.4
##       7       14.3             28       30.4
##      17       14.7             26       27.3
##
##
##
## NULL
##
##
## -------------------------- Frequency Distribution -------------------------
##
##                              Variable: mpg
## |-----------------------------------------------------------------------|
## |       Bins        | Frequency | Cum Frequency | Percent | Cum Percent |
## |-----------------------------------------------------------------------|
## |    10.4 - 15.1    |     6     |       6       |  18.75  |    18.75    |
## |-----------------------------------------------------------------------|
## |    15.1 - 19.8    |    12     |      18       |  37.5   |    56.25    |
## |-----------------------------------------------------------------------|
## |    19.8 - 24.5    |     8     |      26       |   25    |    81.25    |
## |-----------------------------------------------------------------------|
## |    24.5 - 29.2    |     2     |      28       |  6.25   |    87.5     |
## |-----------------------------------------------------------------------|
## |    29.2 - 33.9    |     4     |      32       |  12.5   |     100     |
## |-----------------------------------------------------------------------|
## |       Total       |    32     |       -       | 100.00  |      -      |
## |-----------------------------------------------------------------------|

### Group Summary

The ds_group_summary() function returns descriptive statistics of a continuous
variable for the different levels of a categorical variable.

k <- ds_group_summary(mtcarz, cyl, mpg)
k

##                                        mpg by cyl
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|                    4|                    6|                    8|
## -----------------------------------------------------------------------------------------
## |                  Obs|                   11|                    7|                   14|
## |              Minimum|                 21.4|                 17.8|                 10.4|
## |              Maximum|                 33.9|                 21.4|                 19.2|
## |                 Mean|                26.66|                19.74|                 15.1|
## |               Median|                   26|                 19.7|                 15.2|
## |                 Mode|                 22.8|                   21|                 10.4|
## |       Std. Deviation|                 4.51|                 1.45|                 2.56|
## |             Variance|                20.34|                 2.11|                 6.55|
## |             Skewness|                 0.35|                -0.26|                -0.46|
## |             Kurtosis|                -1.43|                -1.83|                 0.33|
## |       Uncorrected SS|              8023.83|              2741.14|              3277.34|
## |         Corrected SS|               203.39|                12.68|                 85.2|
## |      Coeff Variation|                16.91|                 7.36|                16.95|
## |      Std. Error Mean|                 1.36|                 0.55|                 0.68|
## |                Range|                 12.5|                  3.6|                  8.8|
## |  Interquartile Range|                  7.6|                 2.35|                 1.85|
## -----------------------------------------------------------------------------------------

ds_group_summary() returns a tibble which can be used for further analysis.

k$tidy_stats
## # A tibble: 3 x 15
##   cyl   length   min   max  mean median  mode    sd variance skewness
##   <fct> <int> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 4         11  21.4  33.9  26.7   26    22.8  4.51    20.3     0.348
## 2 6          7  17.8  21.4  19.7   19.7  21    1.45     2.11   -0.259
## 3 8         14  10.4  19.2  15.1   15.2  10.4  2.56     6.55   -0.456
## # ... with 5 more variables: kurtosis <dbl>, coeff_var <dbl>,
## #   std_error <dbl>, range <dbl>, iqr <dbl>
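Although descriptr is an R package, the grouped statistics above are easy to sanity-check in any language. A minimal Python sketch (with the mpg and cyl columns of mtcars hard-coded, and using only the standard library) that reproduces the core of ds_group_summary():

```python
# Group mtcars mpg by cyl and compute per-group descriptive statistics.
from collections import defaultdict
from statistics import mean, median, stdev

# mpg and cyl columns of the classic mtcars data set (32 observations)
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,
       8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4]

# collect mpg values per cylinder level
groups = defaultdict(list)
for c, m in zip(cyl, mpg):
    groups[c].append(m)

# per-group descriptive statistics, rounded to two decimals like descriptr
summary = {
    c: {"n": len(v), "min": min(v), "max": max(v),
        "mean": round(mean(v), 2), "median": median(v),
        "sd": round(stdev(v), 2)}
    for c, v in sorted(groups.items())
}
```

The per-group means come out as 26.66, 19.74, and 15.1, matching the Mean row of the table above.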

#### Box Plot

A plot() method has been defined for comparing distributions.

k <- ds_group_summary(mtcarz, cyl, mpg)
plot(k)

### Multiple Variables

If you want grouped summary statistics for multiple variables in a data set, use
ds_auto_group_summary().

ds_auto_group_summary(mtcarz, cyl, gear, mpg)
##                                        mpg by cyl
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|                    4|                    6|                    8|
## -----------------------------------------------------------------------------------------
## |                  Obs|                   11|                    7|                   14|
## |              Minimum|                 21.4|                 17.8|                 10.4|
## |              Maximum|                 33.9|                 21.4|                 19.2|
## |                 Mean|                26.66|                19.74|                 15.1|
## |               Median|                   26|                 19.7|                 15.2|
## |                 Mode|                 22.8|                   21|                 10.4|
## |       Std. Deviation|                 4.51|                 1.45|                 2.56|
## |             Variance|                20.34|                 2.11|                 6.55|
## |             Skewness|                 0.35|                -0.26|                -0.46|
## |             Kurtosis|                -1.43|                -1.83|                 0.33|
## |       Uncorrected SS|              8023.83|              2741.14|              3277.34|
## |         Corrected SS|               203.39|                12.68|                 85.2|
## |      Coeff Variation|                16.91|                 7.36|                16.95|
## |      Std. Error Mean|                 1.36|                 0.55|                 0.68|
## |                Range|                 12.5|                  3.6|                  8.8|
## |  Interquartile Range|                  7.6|                 2.35|                 1.85|
## -----------------------------------------------------------------------------------------
##
##
##
##                                        mpg by gear
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|                    3|                    4|                    5|
## -----------------------------------------------------------------------------------------
## |                  Obs|                   15|                   12|                    5|
## |              Minimum|                 10.4|                 17.8|                   15|
## |              Maximum|                 21.5|                 33.9|                 30.4|
## |                 Mean|                16.11|                24.53|                21.38|
## |               Median|                 15.5|                 22.8|                 19.7|
## |                 Mode|                 10.4|                   21|                   15|
## |       Std. Deviation|                 3.37|                 5.28|                 6.66|
## |             Variance|                11.37|                27.84|                44.34|
## |             Skewness|                -0.09|                  0.7|                 0.56|
## |             Kurtosis|                -0.38|                -0.77|                -1.83|
## |       Uncorrected SS|              4050.52|               7528.9|              2462.89|
## |         Corrected SS|               159.15|               306.29|               177.37|
## |      Coeff Variation|                20.93|                21.51|                31.15|
## |      Std. Error Mean|                 0.87|                 1.52|                 2.98|
## |                Range|                 11.1|                 16.1|                 15.4|
## |  Interquartile Range|                  3.9|                 7.08|                 10.2|
## -----------------------------------------------------------------------------------------

## Multiple Variable Statistics

The ds_tidy_stats() function returns summary/descriptive statistics for
variables in a data frame/tibble.

ds_tidy_stats(mtcarz, mpg, disp, hp)
## # A tibble: 3 x 16
##   vars    min   max  mean t_mean median  mode range variance  stdev  skew
##   <chr> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>  <dbl> <dbl>
## 1 disp   71.1 472   231.   228    196.  276.  401.   15361.  124.   0.420
## 2 hp     52   335   147.   144.   123   110   283     4701.   68.6  0.799
## 3 mpg    10.4  33.9  20.1   20.0   19.2  10.4  23.5     36.3   6.03 0.672
## # ... with 5 more variables: kurtosis <dbl>, coeff_var <dbl>, q1 <dbl>,
## #   q3 <dbl>, iqrange <dbl>

### Measures

If you want to view the measures of location, variation, symmetry, percentiles,
and extreme observations as tibbles, use the functions below. All of them,
except for ds_extreme_obs(), work with single or multiple variables. If
you do not specify the variables, they will return the results for all the
continuous variables in the data set.

#### Measures of Location

ds_measures_location(mtcarz)
## # A tibble: 6 x 5
##   var     mean trim_mean median   mode
##   <chr>  <dbl>     <dbl>  <dbl>  <dbl>
## 1 disp  231.      228    196.   276.
## 2 drat    3.60      3.58   3.70   3.07
## 3 hp    147.      144.   123    110
## 4 mpg    20.1      20.0   19.2   10.4
## 5 qsec   17.8      17.8   17.7   17.0
## 6 wt      3.22      3.20   3.32   3.44

#### Measures of Variation

ds_measures_variation(mtcarz)
## # A tibble: 6 x 7
##   var    range     iqr  variance      sd coeff_var std_error
##   <chr>  <dbl>   <dbl>     <dbl>   <dbl>     <dbl>     <dbl>
## 1 disp  401.   205.    15361.    124.         53.7   21.9
## 2 drat    2.17   0.840     0.286   0.535      14.9    0.0945
## 3 hp    283     83.5    4701.     68.6        46.7   12.1
## 4 mpg    23.5    7.38     36.3     6.03       30.0    1.07
## 5 qsec    8.40   2.01      3.19    1.79       10.0    0.316
## 6 wt      3.91   1.03      0.957   0.978      30.4    0.173
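The variation measures for mpg can be cross-checked outside R with only the Python standard library. One assumption here: the Q1/Q3 values printed above (15.43 and 22.80) match R's default quantile algorithm (type 7), which `statistics.quantiles(..., method="inclusive")` also implements.

```python
# Recompute the ds_measures_variation() row for mpg.
from math import sqrt
from statistics import mean, quantiles, stdev, variance

# mpg column of mtcars (32 observations)
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

value_range = max(mpg) - min(mpg)          # range: 33.9 - 10.4 = 23.5
q1, _, q3 = quantiles(mpg, n=4, method="inclusive")  # type-7 quartiles
iqr = q3 - q1                              # interquartile range
var = variance(mpg)                        # sample variance (n - 1 divisor)
sd = stdev(mpg)                            # sample standard deviation
coeff_var = 100 * sd / mean(mpg)           # coefficient of variation, in %
std_error = sd / sqrt(len(mpg))            # standard error of the mean
```

Each quantity agrees with the mpg row of the tibble above to the printed precision.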

#### Measures of Symmetry

ds_measures_symmetry(mtcarz)
## # A tibble: 6 x 3
##   var   skewness kurtosis
##   <chr>    <dbl>    <dbl>
## 1 disp     0.420  -1.07
## 2 drat     0.293  -0.450
## 3 hp       0.799   0.275
## 4 mpg      0.672  -0.0220
## 5 qsec     0.406   0.865
## 6 wt       0.466   0.417
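The skewness of 0.67 printed for mpg is consistent with the adjusted Fisher-Pearson coefficient G1 (the unadjusted moment coefficient would round to 0.64); which exact formula descriptr uses is an assumption here. A short Python sketch computing G1 from the central moments:

```python
# Compute the adjusted Fisher-Pearson skewness coefficient (G1) for mpg.
from math import sqrt

# mpg column of mtcars (32 observations)
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

n = len(mpg)
m = sum(mpg) / n
m2 = sum((x - m) ** 2 for x in mpg) / n    # second central moment
m3 = sum((x - m) ** 3 for x in mpg) / n    # third central moment
g1 = m3 / m2 ** 1.5                        # moment coefficient of skewness
G1 = g1 * sqrt(n * (n - 1)) / (n - 2)      # sample-size adjusted version
```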

#### Percentiles

ds_percentiles(mtcarz)
## # A tibble: 6 x 12
##   var     min  per1  per5 per10     q1 median     q3  per95  per90  per99
##   <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 disp  71.1  72.5  77.4  80.6  121.   196.   326    449    396.   468.
## 2 drat   2.76  2.76  2.85  3.01   3.08   3.70   3.92   4.31   4.21   4.78
## 3 hp    52    55.1  63.6  66     96.5  123    180    254.   244.   313.
## 4 mpg   10.4  10.4  12.0  14.3   15.4   19.2   22.8   31.3   30.1   33.4
## 5 qsec  14.5  14.5  15.0  15.5   16.9   17.7   18.9   20.1   20.0   22.1
## 6 wt     1.51  1.54  1.74  1.96   2.58   3.32   3.61   5.29   4.05   5.40
## # ... with 1 more variable: max <dbl>

## Categorical Data

### Cross Tabulation

The ds_cross_table() function creates two way tables of categorical variables.

ds_cross_table(mtcarz, cyl, gear)
##     Cell Contents
##  |---------------|
##  |     Frequency |
##  |       Percent |
##  |       Row Pct |
##  |       Col Pct |
##  |---------------|
##
##  Total Observations:  32
##
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------

If you want the above result as a tibble, use ds_twoway_table().

ds_twoway_table(mtcarz, cyl, gear)
## Joining, by = c("cyl", "gear", "count")
## # A tibble: 8 x 6
##   cyl   gear  count percent row_percent col_percent
##   <fct>  <fct> <int>   <dbl>       <dbl>       <dbl>
## 1 4     3         1  0.0312      0.0909      0.0667
## 2 4     4         8  0.25        0.727       0.667
## 3 4     5         2  0.0625      0.182       0.4
## 4 6     3         2  0.0625      0.286       0.133
## 5 6     4         4  0.125       0.571       0.333
## 6 6     5         1  0.0312      0.143       0.2
## 7 8     3        12  0.375       0.857       0.8
## 8 8     5         2  0.0625      0.143       0.4
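The percentages in the two-way table are simple ratios of each cell count to the grand, row, and column totals. A minimal Python sketch (counts hard-coded from the table above) that reproduces them:

```python
# Cell counts for cyl x gear from the cross tabulation of mtcars
counts = {(4, 3): 1, (4, 4): 8, (4, 5): 2,
          (6, 3): 2, (6, 4): 4, (6, 5): 1,
          (8, 3): 12, (8, 5): 2}

total = sum(counts.values())               # grand total: 32 observations

# accumulate row (cyl) and column (gear) totals
row_totals, col_totals = {}, {}
for (cyl, gear), n in counts.items():
    row_totals[cyl] = row_totals.get(cyl, 0) + n
    col_totals[gear] = col_totals.get(gear, 0) + n

def cell(cyl, gear):
    """Overall, row, and column percentages for one cell."""
    n = counts[(cyl, gear)]
    return {"count": n,
            "percent": round(n / total, 4),
            "row_percent": round(n / row_totals[cyl], 4),
            "col_percent": round(n / col_totals[gear], 4)}
```

For example, cell(8, 3) gives the 0.375 / 0.8571 / 0.8 triple shown for eight-cylinder, three-gear cars.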

A plot() method has been defined which will generate:

#### Grouped Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k)

#### Stacked Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k, stacked = TRUE)

#### Proportional Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k, proportional = TRUE)

### Frequency Table

The ds_freq_table() function creates frequency tables.

ds_freq_table(mtcarz, cyl)
##                              Variable: cyl
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent
## -----------------------------------------------------------------------
##    4          11             11              34.38            34.38
## -----------------------------------------------------------------------
##    6           7             18              21.88            56.25
## -----------------------------------------------------------------------
##    8          14             32              43.75             100
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -
## -----------------------------------------------------------------------

A plot() method has been defined which will create a bar plot.

k <- ds_freq_table(mtcarz, cyl)
plot(k)

### Multiple One Way Tables

The ds_auto_freq_table() function creates multiple one way tables by creating a
frequency table for each categorical variable in a data set. You can also
specify a subset of variables if you do not want all the variables in the data
set to be used.

ds_auto_freq_table(mtcarz)
##                              Variable: cyl
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent
## -----------------------------------------------------------------------
##    4          11             11              34.38            34.38
## -----------------------------------------------------------------------
##    6           7             18              21.88            56.25
## -----------------------------------------------------------------------
##    8          14             32              43.75             100
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -
## -----------------------------------------------------------------------
##
##                              Variable: vs
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent
## -----------------------------------------------------------------------
##    0          18             18              56.25            56.25
## -----------------------------------------------------------------------
##    1          14             32              43.75             100
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -
## -----------------------------------------------------------------------
##
##                              Variable: am
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent
## -----------------------------------------------------------------------
##    0          19             19              59.38            59.38
## -----------------------------------------------------------------------
##    1          13             32              40.62             100
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -
## -----------------------------------------------------------------------
##
##                             Variable: gear
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent
## -----------------------------------------------------------------------
##    3          15             15              46.88            46.88
## -----------------------------------------------------------------------
##    4          12             27              37.5             84.38
## -----------------------------------------------------------------------
##    5           5             32              15.62             100
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -
## -----------------------------------------------------------------------
##
##                             Variable: carb
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent
## -----------------------------------------------------------------------
##    1           7              7              21.88            21.88
## -----------------------------------------------------------------------
##    2          10             17              31.25            53.12
## -----------------------------------------------------------------------
##    3           3             20              9.38             62.5
## -----------------------------------------------------------------------
##    4          10             30              31.25            93.75
## -----------------------------------------------------------------------
##    6           1             31              3.12             96.88
## -----------------------------------------------------------------------
##    8           1             32              3.12              100
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -
## -----------------------------------------------------------------------

### Multiple Two Way Tables

The ds_auto_cross_table() function creates multiple two way tables by creating a
cross table for each unique pair of categorical variables in a data set. You
can also specify a subset of variables if you do not want all the variables in
the data set to be used.

ds_auto_cross_table(mtcarz, cyl, gear, am)
##     Cell Contents
##  |---------------|
##  |     Frequency |
##  |       Percent |
##  |       Row Pct |
##  |       Col Pct |
##  |---------------|
##
##  Total Observations:  32
##
##                                 cyl vs gear
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------
##
##
##                          cyl vs am
## -------------------------------------------------------------
## |              |                     am                     |
## -------------------------------------------------------------
## |          cyl |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            4 |            3 |            8 |           11 |
## |              |        0.094 |         0.25 |              |
## |              |         0.27 |         0.73 |         0.34 |
## |              |         0.16 |         0.62 |              |
## -------------------------------------------------------------
## |            6 |            4 |            3 |            7 |
## |              |        0.125 |        0.094 |              |
## |              |         0.57 |         0.43 |         0.22 |
## |              |         0.21 |         0.23 |              |
## -------------------------------------------------------------
## |            8 |           12 |            2 |           14 |
## |              |        0.375 |        0.062 |              |
## |              |         0.86 |         0.14 |         0.44 |
## |              |         0.63 |         0.15 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.406 |              |
## -------------------------------------------------------------
##
##
##                          gear vs am
## -------------------------------------------------------------
## |              |                     am                     |
## -------------------------------------------------------------
## |         gear |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            3 |           15 |            0 |           15 |
## |              |        0.469 |            0 |              |
## |              |            1 |            0 |         0.47 |
## |              |         0.79 |            0 |              |
## -------------------------------------------------------------
## |            4 |            4 |            8 |           12 |
## |              |        0.125 |         0.25 |              |
## |              |         0.33 |         0.67 |         0.38 |
## |              |         0.21 |         0.62 |              |
## -------------------------------------------------------------
## |            5 |            0 |            5 |            5 |
## |              |            0 |        0.156 |              |
## |              |            0 |            1 |         0.16 |
## |              |            0 |         0.38 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.406 |              |
## -------------------------------------------------------------

## Visualization

descriptr can help visualize multiple variables by automatically
detecting their data types.

### Continuous Data

ds_plot_scatter(mtcarz, mpg, disp, hp)

### Categorical Data

ds_plot_bar_stacked(mtcarz, cyl, gear, am)

## Learning More

The descriptr website includes comprehensive documentation on using the
package, including articles that cover various aspects of using descriptr.

## Feedback

All feedback is welcome. Issues (bugs and feature requests) can be posted to
the GitHub issue tracker. For help with code or other related questions, feel
free to reach me at hebbali.aravind@gmail.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

## February 19, 2019

### simulation fodder for future exams

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Here are two nice exercises for a future simulation exam, seen and solved on X validated. The first one is about simulating a Gibbs sampler associated with the joint target

$\exp\{-|x|-|y|-a|y-x|\}$

defined over IR² for a≥0 (or possibly a>-1). The conditionals are identical and non-standard, but a simple bound on the conditional density is the corresponding standard double exponential density, which makes for a straightforward accept-reject implementation. However, it is also feasible to break the full conditional into three parts, depending on the respective positions of x, y, and 0, and to obtain easily invertible cdfs on the three intervals. The second exercise is about simulating from the cdf

$F(x)=1-\exp\{-ax-bx^{p+1}/(p+1)\}$

which can be numerically inverted. It is however more fun to call for an accept-reject algorithm by bounding the density with a ½-½ mixture of an Exponential Exp(a) and of the 1/(p+1)-th power of an Exponential Exp(b/(p+1)). Since no extra constant appears in the solution, I suspect the (p+1) in b/(p+1) was introduced on purpose. As seen in the above fit for 10⁶ simulations (and a=1, b=2, p=3), there is no deviation from the target! There is however an even simpler resolution to the exercise: since the tail function 1-F(x) appears as the product of two tail functions, exp(-ax) and the other one, the cdf is the distribution of the minimum of two random variates, one with the Exp(a) distribution and the other one being the 1/(p+1)-th power of an Exponential Exp(b/(p+1)) distribution. Which of course returns a very similar histogram fit.
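The minimum-of-two-variates argument can be sketched in a few lines. This is a Python rendering (the original discussion is R-flavoured): the tail of F factors as exp(-ax) times exp(-b·x^(p+1)/(p+1)), i.e. the tails of an Exp(a) variate and of the 1/(p+1)-th power of an Exp(b/(p+1)) variate, so their minimum has cdf F.

```python
# Simulate from F(x) = 1 - exp{-a*x - b*x**(p+1)/(p+1)} as the minimum
# of two easy-to-draw random variates.
import math
import random

def rminF(a, b, p, rng):
    x1 = rng.expovariate(a)                              # Exp(a) variate
    x2 = rng.expovariate(b / (p + 1)) ** (1 / (p + 1))   # 1/(p+1)-th power
    return min(x1, x2)                                   # tails multiply

rng = random.Random(42)
a, b, p = 1.0, 2.0, 3
sample = [rminF(a, b, p, rng) for _ in range(200_000)]

# empirical tail probability at x = 0.5 vs the theoretical 1 - F(0.5)
x = 0.5
emp = sum(s > x for s in sample) / len(sample)
theo = math.exp(-a * x - b * x ** (p + 1) / (p + 1))
```

With 2×10⁵ draws the empirical tail matches the theoretical one to about three decimals, mirroring the histogram fit described above.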


### Apple’s privacy play keeps internet regulators at bay

In 2018, the GDPR changed how tech companies handle data privacy. In 2019, it’s influencing the public’s perception of internet privacy and changing how tech companies treat violations—and one another. Last month, I wrote about the state of internet privacy in the context of the GDPR and other regulations that

The post Apple’s privacy play keeps internet regulators at bay appeared first on Dataconomy.

### Document worth reading: “Regularization for Deep Learning: A Taxonomy”

Regularization is one of the crucial ingredients of deep learning, yet the term regularization has various definitions, and regularization methods are often studied separately from each other. In our work we present a systematic, unifying taxonomy to categorize existing methods. We distinguish methods that affect data, network architectures, error terms, regularization terms, and optimization procedures. We do not provide all details about the listed methods; instead, we present an overview of how the methods can be sorted into meaningful categories and sub-categories. This helps revealing links and fundamental similarities between them. Finally, we include practical recommendations both for users and for developers of new regularization methods. Regularization for Deep Learning: A Taxonomy

### Whats new on arXiv

To properly convey neural network architectures in publications, appropriate visualization techniques are of great importance. While most current deep learning papers contain such visualizations, these are usually handcrafted, which results in a lack of a common visual grammar, as well as a significant time investment. Since these visualizations are often crafted just before publication, they are also prone to contain errors, might deviate from the actual architecture, and are sometimes ambiguous to interpret. Current automatic network visualization toolkits focus on debugging the network itself, and are therefore not ideal for generating publication-ready visualizations, as they cater to a different level of detail. Therefore, we present an approach to automate this process by translating network architectures specified in Python into publication-ready network visualizations that can directly be embedded into any publication. To improve the readability of these visualizations, and in order to make them comparable, the generated visualizations obey a visual grammar, which we have derived based on the analysis of existing network visualizations. Besides carefully crafted visual encodings, our grammar also incorporates abstraction through layer accumulation, as is often done to reduce the complexity of the network architecture to be communicated. Thus, our approach not only reduces the time needed to generate publication-ready network visualizations, but also enables a unified and unambiguous visualization design.
Expectation-Maximization (EM) is the fallback method for parameter estimation of hidden (aka latent) variable models. Given the full batch of data, EM forms an upper-bound of the negative log-likelihood of the model at each iteration and then updates to the minimizer of this upper-bound. We introduce a versatile online variant of EM where the data arrives as a stream. Our motivation is based on the relative entropy divergences between two joint distributions over the hidden and visible variables. We view the EM upper-bound as a Monte Carlo approximation of an expectation and show that the joint relative entropy divergence induces a similar expectation form. As a result, we employ the divergence to the old model as the inertia term to motivate our online EM algorithm. Our motivation is more widely applicable than previous ones and leads to simple online updates for mixtures of exponential distributions, hidden Markov models, and the first known online update for Kalman filters. Additionally, the finite sample form of the inertia term lets us derive online updates when there is no closed form solution. Experimentally, sweeping the data with an online update converges much faster than the batch update. Our divergence based methods also lead to a simple way to combine hidden variable models, and this immediately gives efficient algorithms for the distributed setting.
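The mixture-of-exponentials case can be sketched as a streaming EM loop. The snippet below uses a generic stepwise update with decayed sufficient statistics (in the style of Cappé–Moulines stepwise EM), not the paper's relative-entropy inertia term, and all constants (step-size schedule, initial values) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stream from a mixture of two exponentials (rates 1 and 5, weights 0.6/0.4).
true_rates = np.array([1.0, 5.0])
stream = rng.exponential(1.0 / true_rates[rng.choice(2, size=20000, p=[0.6, 0.4])])

pi = np.array([0.5, 0.5])      # mixture weight estimates
lam = np.array([0.5, 2.0])     # rate estimates
s = pi.copy()                  # running E[responsibility]
t = s / lam                    # running E[responsibility * x]

for n, x in enumerate(stream, start=1):
    eta = (n + 10) ** -0.6                      # decaying step size
    r = pi * lam * np.exp(-lam * x) + 1e-300    # E-step: responsibilities
    r /= r.sum()
    s = (1 - eta) * s + eta * r                 # decayed sufficient statistics
    t = (1 - eta) * t + eta * r * x
    pi, lam = s / s.sum(), s / t                # closed-form M-step

print(pi.round(2), lam.round(2))
```

With enough data the weight and rate estimates should drift toward the generating values; the step-size exponent trades adaptation speed against stability.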
Large amounts of RDF/S data have been produced and published lately, and several modern applications require the provision of versioning and archiving services over such datasets. In this paper we propose a novel storage index for archiving versions of such datasets, called CPOI (compact partial order index), that exploits the fact that an RDF Knowledge Base (KB) is a graph (or equivalently a set of triples), and thus does not have a unique serialization (as happens with text). If we want to keep several versions stored, we actually want to store multiple sets of triples. CPOI is a data structure for storing such sets that aims at reducing the storage space, since this is important not only for reducing storage costs, but also for reducing the various communication costs and for enabling hosting in main memory (and thus processing efficiently) large quantities of data. CPOI is based on a partial order structure over sets of triple identifiers, where the triple identifiers are represented in a gapped form using variable length encoding schemes. For this index we evaluate analytically and experimentally various identifier assignment techniques and their space savings. The results show significant storage savings; specifically, the storage space of the compressed sets in large and realistic synthetic datasets is about 8% of the size of the uncompressed sets.
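The gap-plus-variable-length idea can be sketched in a few lines: sort the identifiers, store consecutive differences, and encode each gap as a 7-bits-per-byte varint. This is a generic LEB128-style sketch, not CPOI's actual encoding:

```python
def varint(n):
    """Variable-length (LEB128-style) encoding: 7 data bits per byte,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_set(ids):
    """Sort, gap-encode, then varint each gap; small gaps -> few bytes."""
    ids = sorted(ids)
    gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    return b"".join(varint(g) for g in gaps)

def decode_set(buf):
    ids, cur, shift, acc = [], 0, 0, 0
    for byte in buf:
        acc |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:          # high bit clear: gap complete
            cur += acc
            ids.append(cur)
            acc, shift = 0, 0
    return ids

triples = [1048576, 1048579, 1048600, 1048601]
blob = encode_set(triples)
print(len(blob), decode_set(blob) == triples)   # -> 6 True
```

Four identifiers near 2^20 would need 12 bytes as raw 3-byte integers; gap coding stores them in 6, and the savings grow with density, which is exactly what the identifier assignment techniques in the paper try to maximize.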
In this work, we propose ReStoCNet, a residual stochastic multilayer convolutional Spiking Neural Network (SNN) composed of binary kernels, to reduce the synaptic memory footprint and enhance the computational efficiency of SNNs for complex pattern recognition tasks. ReStoCNet consists of an input layer followed by stacked convolutional layers for hierarchical input feature extraction, pooling layers for dimensionality reduction, and a fully-connected layer for inference. In addition, we introduce residual connections between the stacked convolutional layers to improve the hierarchical feature learning capability of deep SNNs. We propose a Spike Timing Dependent Plasticity (STDP) based probabilistic learning algorithm, referred to as Hybrid-STDP (HB-STDP), incorporating Hebbian and anti-Hebbian learning mechanisms, to train the binary kernels forming ReStoCNet in a layer-wise unsupervised manner. We demonstrate the efficacy of ReStoCNet and the presented HB-STDP based unsupervised training methodology on the MNIST and CIFAR-10 datasets. We show that residual connections enable the deeper convolutional layers to self-learn useful high-level input features and mitigate the accuracy loss observed in deep SNNs devoid of residual connections. The proposed ReStoCNet offers >20x kernel memory compression compared to a full-precision (32-bit) SNN while yielding high enough classification accuracy on the chosen pattern recognition tasks.
Deep learning for supervised learning has achieved astonishing performance in various machine learning applications. However, annotated data is expensive and rare, and in practice only a small portion of data samples are annotated. Pseudo-ensembling-based approaches have achieved state-of-the-art results in computer vision tasks, but they still rely on the quality of an initial model built from labeled data, and with less labeled data model performance may degrade considerably. Domain constraints are another way to regularize the posterior, but they have limitations. In this paper, we propose a fuzzy domain-constraint-based framework which relaxes the requirements of traditional constraint learning and enhances model quality for semi-supervision. Simulation results show the effectiveness of our design.
In reinforcement learning episodes, the rewards and punishments are often non-deterministic, and there are invariably stochastic elements governing the underlying situation. Such stochastic elements are often numerous and cannot be known in advance, and they have a tendency to obscure the underlying reward and punishment patterns. Indeed, if stochastic elements were absent, the same outcome would occur every time and the learning problems involved could be greatly simplified. In addition, in most practical situations, the cost of an observation to receive either a reward or punishment can be significant, and one would wish to arrive at the correct learning conclusion by incurring minimum cost. In this paper, we present a stochastic approach to reinforcement learning which explicitly models the variability present in the learning environment and the cost of observation. Criteria and rules for learning success are quantitatively analyzed, and probabilities of exceeding the observation cost bounds are also obtained.
In reinforcement learning, a decision needs to be made at some point as to whether it is worthwhile to carry on with the learning process or to terminate it. In many such situations, stochastic elements are often present which govern the occurrence of rewards, with the sequential occurrences of positive rewards randomly interleaved with negative rewards. For most practical learners, the learning is considered useful if the number of positive rewards always exceeds the negative ones. A situation that often calls for learning termination is when the number of negative rewards exceeds the number of positive rewards. However, while this seems reasonable, the error of premature termination, whereby termination is enacted along with a conclusion of learning failure even though the positive rewards eventually far outnumber the negative ones, can be significant. In this paper, using combinatorial analysis we study the error probability in wrongly terminating a reinforcement learning activity which undermines the effectiveness of an optimal policy, and we show that the resultant error can be quite high. Whilst we demonstrate mathematically that such errors can never be eliminated, we propose some practical mechanisms that can effectively reduce such errors. Simulation experiments have been carried out, the results of which are in close agreement with our theoretical findings.
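A toy version of this premature-termination error can be checked by simulation. The gambler's-ruin model below (a +1/-1 reward walk with win probability p, terminating as soon as negatives outnumber positives) is only an illustration of the phenomenon, not the paper's combinatorial analysis; for such a walk the probability of ever dipping below zero is the classical ruin value (1-p)/p:

```python
import random

random.seed(42)

def premature_stop(p, horizon=2000):
    """Reward walk: +1 with probability p, -1 otherwise. Report whether
    negative rewards ever outnumber positive ones within the horizon."""
    s = 0
    for _ in range(horizon):
        s += 1 if random.random() < p else -1
        if s < 0:
            return True
    return False

p, trials = 0.7, 5000
est = sum(premature_stop(p) for _ in range(trials)) / trials
print(f"simulated error {est:.3f} vs gambler's-ruin value {(1 - p) / p:.3f}")
```

Even with a clearly favorable process (p = 0.7), this naive stop rule wrongly terminates roughly 43% of the time, which matches the closed form and illustrates why the paper's error probabilities can be "quite high".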
We study the problem of interpreting trained classification models in the setting of linguistic data sets. Leveraging a parse tree, we propose to assign least-squares based importance scores to each word of an instance by exploiting syntactic constituency structure. We establish an axiomatic characterization of these importance scores by relating them to the Banzhaf value in coalitional game theory. Based on these importance scores, we develop a principled method for detecting and quantifying interactions between words in a sentence. We demonstrate that the proposed method can aid in interpretability and diagnostics for several widely-used language models.
Flow-based generative models, conceptually attractive due to the tractability of both exact log-likelihood computation and latent-variable inference, and the efficiency of both training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. Despite their computational efficiency, the density estimation performance of flow-based generative models significantly falls behind that of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MaCow), a simple yet effective architecture of generative flow using masked convolution. By restricting the local connectivity in a small kernel, MaCow enjoys the properties of fast and stable training, and efficient sampling, while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap to autoregressive models.
We present a Bayesian multi-objective optimisation algorithm that allows the user to express preference-order constraints on the objectives of the type 'objective A is more important than objective B'. Rather than attempting to find a representative subset of the complete Pareto front, our algorithm searches for and returns only those Pareto-optimal points that satisfy these constraints. We formulate a new acquisition function based on expected improvement in dominated hypervolume (EHI) to ensure that the subset of the Pareto front satisfying the constraints is thoroughly explored. The hypervolume calculation only includes those points that satisfy the preference-order constraints, where the probability of a point satisfying the constraints is calculated from a gradient Gaussian Process model. We demonstrate our algorithm on both synthetic and real-world problems.
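At the heart of the EHI acquisition is a dominated-hypervolume computation. For two objectives under minimization this reduces to a staircase sweep; the snippet below is the standard unconstrained 2-D calculation, not the paper's preference-constrained variant:

```python
def hypervolume_2d(points, ref):
    """Area dominated by a 2-D point set (minimization) w.r.t. a reference
    point: sort by f1, sweep to keep the non-dominated staircase, sum strips."""
    front, best = [], float("inf")
    for f1, f2 in sorted(points):
        if f2 < best:                 # not dominated by anything to its left
            front.append((f1, f2))
            best = f2
    hv = 0.0
    for i, (f1, f2) in enumerate(front):
        nxt = front[i + 1][0] if i + 1 < len(front) else ref[0]
        hv += (nxt - f1) * (ref[1] - f2)
    return hv

pts = [(1, 4), (2, 2), (4, 1), (3, 3)]        # (3, 3) is dominated by (2, 2)
print(hypervolume_2d(pts, ref=(5, 5)))        # -> 11.0
```

The paper's constrained version would additionally weight or exclude points by their probability of satisfying the preference-order constraints before summing the strips.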
The ultimatum game has been a prominent paradigm in studying the evolution of fairness. It predicts that responders should accept any nonzero offer and proposers should offer the smallest possible amount according to orthodox game theory. However, the prediction strongly contradicts experimental findings, where responders usually reject low offers below $20\%$ and proposers usually make higher offers than expected. To explain the evolution of such fair behaviors, we here introduce empathy in group-structured populations by allowing a proportion $\alpha$ of the population to play empathetic strategies. Interestingly, we find that for high mutation probabilities, the mean offer decreases with $\alpha$ and the mean demand increases, implying empathy inhibits the evolution of fairness. For low mutation probabilities, the mean offer and demand approach the fair ones with increasing $\alpha$, implying empathy promotes the evolution of fairness. Furthermore, under both weak and strong intensities of natural selection, we analytically calculate the mean offer and demand for different levels of $\alpha$. Counterintuitively, we demonstrate that although a higher mutation probability leads to a higher level of fairness under weak selection, an intermediate mutation probability corresponds to the lowest level of fairness under strong selection. Our study provides systematic insights into the evolutionary origin of fairness in group-structured populations with empathetic strategies.
Many IoT (Internet of Things) systems run Android systems or Android-like systems. With the continuous development of machine learning algorithms, learning-based Android malware detection systems for IoT devices have gradually increased. However, these learning-based detection models are often vulnerable to adversarial samples. An automated testing framework is needed to help these learning-based malware detection systems for IoT devices perform security analysis. The current methods of generating adversarial samples mostly require training parameters of models and most of the methods are aimed at image data. To solve this problem, we propose a \textbf{t}esting framework for \textbf{l}earning-based \textbf{A}ndroid \textbf{m}alware \textbf{d}etection systems (TLAMD) for IoT Devices. The key challenge is how to construct a suitable fitness function to generate an effective adversarial sample without affecting the features of the application. By introducing genetic algorithms and some technical improvements, our test framework can generate adversarial samples for the IoT Android Application with a success rate of nearly 100\% and can perform black-box testing on the system.
We present VERIFAI, a software toolkit for the formal design and analysis of systems that include artificial intelligence (AI) and machine learning (ML) components. VERIFAI particularly seeks to address challenges with applying formal methods to perception and ML components, including those based on neural networks, and to model and analyze system behavior in the presence of environment uncertainty. We describe the initial version of VERIFAI which centers on simulation guided by formal models and specifications. Several use cases are illustrated with examples, including temporal-logic falsification, model-based systematic fuzz testing, parameter synthesis, counterexample analysis, and data set augmentation.
We consider a finite time horizon multi-armed bandit (MAB) problem in a Bayesian framework, for which we develop a general set of control policies that leverage ideas from information relaxations of stochastic dynamic optimization problems. In crude terms, an information relaxation allows the decision maker (DM) to have access to the future (unknown) rewards and incorporate them in her optimization problem to pick an action at time $t$, but penalizes the decision maker for using this information. In our setting, the future rewards allow the DM to better estimate the unknown mean reward parameters of the multiple arms, and optimize her sequence of actions. By picking different information penalties, the DM can construct a family of policies of increasing complexity that, for example, include Thompson Sampling and the true optimal (but intractable) policy as special cases. We systematically develop this framework of information relaxation sampling, propose an intuitive family of control policies for our motivating finite time horizon Bayesian MAB problem, and prove associated structural results and performance bounds. Numerical experiments suggest that this new class of policies performs well, in particular in settings where the finite time horizon introduces significant tension in the problem. Finally, inspired by the finite time horizon Gittins index, we propose an index policy that builds on our framework and notably outperforms state-of-the-art algorithms in our numerical experiments.
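Thompson Sampling, which the abstract cites as a special case of this framework, is easy to sketch for Bernoulli arms. The code below is the textbook Beta-Bernoulli version, not the information-relaxation policies themselves, and the arm means are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson(true_means, horizon=5000):
    """Beta-Bernoulli Thompson sampling: draw a mean from each arm's
    posterior, pull the argmax, then update that arm's Beta counts."""
    k = len(true_means)
    wins, losses = np.ones(k), np.ones(k)   # Beta(1, 1) priors
    total = 0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(wins, losses)))
        r = rng.random() < true_means[arm]
        total += r
        wins[arm] += r
        losses[arm] += 1 - r
    return total / horizon

avg = thompson([0.3, 0.5, 0.7])
print(f"average reward {avg:.3f} (best arm pays 0.7)")
```

Posterior sampling concentrates play on the best arm, so the average reward approaches 0.7; the paper's information-penalty policies interpolate between this and the intractable optimal policy.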
We consider a model of selective prediction, where the prediction algorithm is given a data sequence in an online fashion and asked to predict a pre-specified statistic of the upcoming data points. The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far. We prove that, even without any distributional assumption on the input data stream, a large family of statistics can be estimated to non-trivial accuracy. To give one concrete example, suppose that we are given access to an arbitrary binary sequence $x_1, \ldots, x_n$ of length $n$. Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some $t < n$ and $m \le n - t$, after seeing $t$ observations we predict the average of $x_{t+1}, \ldots, x_{t+m}$. We show that the expected squared error of our prediction can be bounded by $O\left(\frac{1}{\log n}\right)$, and prove a matching lower bound. This result holds for any sequence (that is not adaptive to when the prediction is made, or the predicted value), and the expectation of the error is with respect to the randomness of the prediction algorithm. Our results apply to more general statistics of a sequence of observations, and we highlight several open directions for future work.
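The setup is easy to play with numerically. The toy below simply measures the squared error of predicting a future window's average with the prefix average on a random binary sequence; it is not the paper's algorithm and does not achieve its $O(1/\log n)$ guarantee, but it shows the prediction-window mechanics (all window sizes are illustrative):

```python
import random

random.seed(7)

n = 1 << 14
x = [random.random() < 0.5 for _ in range(n)]   # an arbitrary binary sequence

def window_error(t, m):
    """After seeing t observations, predict the average of the next m
    using the prefix average; return the squared error."""
    pred = sum(x[:t]) / t
    actual = sum(x[t:t + m]) / m
    return (pred - actual) ** 2

errs = [window_error(t, m)
        for t, m in ((random.randrange(256, n // 2),
                      random.randrange(256, n // 2)) for _ in range(200))]
err = sum(errs) / len(errs)
print(f"mean squared prediction error: {err:.5f}")
```

On an i.i.d. sequence this prefix predictor is already accurate; the interesting part of the paper is that a randomized choice of when and how long to predict gives non-trivial accuracy even for adversarial sequences.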
To widen their accessibility and increase their utility, intelligent agents must be able to learn complex behaviors as specified by (non-expert) human users. Moreover, they will need to learn these behaviors within a reasonable amount of time while efficiently leveraging the sparse feedback a human trainer is capable of providing. Recent work has shown that human feedback can be characterized as a critique of an agent’s current behavior rather than as an alternative reward signal to be maximized, culminating in the COnvergent Actor-Critic by Humans (COACH) algorithm for making direct policy updates based on human feedback. Our work builds on COACH, moving to a setting where the agent’s policy is represented by a deep neural network. We employ a series of modifications on top of the original COACH algorithm that are critical for successfully learning behaviors from high-dimensional observations, while also satisfying the constraint of obtaining reduced sample complexity. We demonstrate the effectiveness of our Deep COACH algorithm in the rich 3D world of Minecraft with an agent that learns to complete tasks by mapping from raw pixels to actions using only real-time human feedback in 10-15 minutes of interaction.
Interactive Fiction (IF) games are complex textual decision making problems. This paper introduces NAIL, an autonomous agent for general parser-based IF games. NAIL won the 2018 Text Adventure AI Competition, where it was evaluated on twenty unseen games. This paper describes the architecture, development, and insights underpinning NAIL’s performance.
Semantic parsing is the task of mapping natural language to logic form. In question answering, semantic parsing can be used to map the question to logic form and execute the logic form to get the answer. One key problem for semantic parsing is the expensive labeling work. We study this problem in another way: we do not use the logic form at all. Instead we use only the schema and answer information. We think the logic-form step can be injected into the deep model; the reason we believe removing the logic-form step is possible is that humans can do the task without an explicit logic form. We use a BERT-based model and experiment on the WikiSQL dataset, a large natural-language-to-SQL dataset. Our experimental evaluations show that our model achieves baseline results on the WikiSQL dataset.
Wikipedia articles contain multiple links connecting a subject to other pages of the encyclopedia. In Wikipedia parlance, these links are called internal links or wikilinks. We present a complete dataset of the network of internal Wikipedia links for the $9$ largest language editions. The dataset contains yearly snapshots of the network and spans $17$ years, from the creation of Wikipedia in 2001 to March 1st, 2018. While previous work has mostly focused on the complete hyperlink graph which includes also links automatically generated by templates, we parsed each revision of each article to track links appearing in the main text. In this way we obtained a cleaner network, discarding more than half of the links and representing all and only the links intentionally added by editors. We describe in detail how the Wikipedia dumps have been processed and the challenges we have encountered, including the need to handle special pages such as redirects, i.e., alternative article titles. We present descriptive statistics of several snapshots of this network. Finally, we propose several research opportunities that can be explored using this new dataset.
The result of a temporal-probabilistic (TP) join with negation includes, at each time point, the probability with which a tuple of a positive relation ${\bf p}$ matches none of the tuples in a negative relation ${\bf n}$, for a given join condition $\theta$. TP outer and anti joins thus resemble the characteristics of relational outer and anti joins also in the case when there exist time points at which input tuples from ${\bf p}$ have non-zero probabilities to be $true$ and input tuples from ${\bf n}$ have non-zero probabilities to be $false$, respectively. For the computation of TP joins with negation, we introduce generalized lineage-aware temporal windows, a mechanism that binds an output interval to the lineages of all the matching valid tuples of each input relation. We group the windows of two TP relations into three disjoint sets based on the way attributes, lineage expressions and intervals are produced. We compute all windows in an incremental manner, and we show that pipelined computations allow for the direct integration of our approach into PostgreSQL. We thereby alleviate the prevalent redundancies in the interval computations of existing approaches, which is proven by an extensive experimental evaluation with real-world datasets.
A verification code is an automated test method used to distinguish between humans and computers. Humans can easily identify verification codes, whereas machines cannot. With the development of convolutional neural networks, automatically recognizing a verification code is now possible for machines. However, the advantages of convolutional neural networks depend on the data used by the training classifier, particularly the size of the training set. Therefore, identifying a verification code using a convolutional neural network is difficult when training data are insufficient. This study proposes an active and deep learning strategy to obtain new training data on a special verification code set without manual intervention. A feature learning model for a scene with less training data is presented in this work, and the verification code is identified by the designed convolutional neural network. Experiments show that the method can considerably improve the recognition accuracy of a neural network when the amount of initial training data is small.
Machine learning has become a critical component of modern data-driven online services. Typically, the training phase of machine learning techniques requires processing large-scale datasets which may contain private and sensitive information of customers. This imposes significant security risks since modern online services rely on cloud computing to store and process the sensitive data. In the untrusted computing infrastructure, security is becoming a paramount concern since the customers need to trust the third-party cloud provider. Unfortunately, this trust has been violated multiple times in the past. To overcome the potential security risks in the cloud, we answer the following research question: how to enable secure executions of machine learning computations in the untrusted infrastructure? To achieve this goal, we propose a hardware-assisted approach based on Trusted Execution Environments (TEEs), specifically Intel SGX, to enable secure execution of the machine learning computations over the private and sensitive datasets. More specifically, we propose a generic and secure machine learning framework based on Tensorflow, which enables secure execution of existing applications on the commodity untrusted infrastructure. In particular, we have built our system called TensorSCONE from the ground up by integrating TensorFlow with SCONE, a shielded execution framework based on Intel SGX. The main challenge of this work is to overcome the architectural limitations of Intel SGX in the context of building a secure TensorFlow system. Our evaluation shows that we achieve reasonable performance overheads while providing strong security properties with low TCB.
The main goal of this study is to investigate the robustness of graph-based Deep Learning (DL) models used for Internet of Things (IoT) malware classification against Adversarial Learning (AL). We designed two approaches to craft adversarial IoT software, including Off-the-Shelf Adversarial Attack (OSAA) methods, using six different AL attack approaches, and Graph Embedding and Augmentation (GEA). The GEA approach aims to preserve the functionality and practicality of the generated adversarial sample through a careful embedding of a benign sample to a malicious one. Our evaluations demonstrate that OSAAs are able to achieve a misclassification rate (MR) of 100%. Moreover, we observed that the GEA approach is able to misclassify all IoT malware samples as benign.
We develop an approach to learn an interpretable semi-parametric model of a latent continuous-time stochastic dynamical system, assuming noisy high-dimensional outputs sampled at uneven times. The dynamics are described by a nonlinear stochastic differential equation (SDE) driven by a Wiener process, with a drift evolution function drawn from a Gaussian process (GP) conditioned on a set of learnt fixed points and corresponding local Jacobian matrices. This form yields a flexible nonparametric model of the dynamics, with a representation corresponding directly to the interpretable portraits routinely employed in the study of nonlinear dynamical systems. The learning algorithm combines inference of continuous latent paths underlying observed data with a sparse variational description of the dynamical process. We demonstrate our approach on simulated data from different nonlinear dynamical systems.
We examine the practice of joint training for neural network ensembles, in which a multi-branch architecture is trained via a single loss. This approach has recently gained traction, with claims of greater accuracy per parameter along with increased parallelism. We introduce a family of novel loss functions generalizing multiple previously proposed approaches, with which we study theoretical and empirical properties of joint training. These losses interpolate smoothly between independent and joint training of predictors, demonstrating that joint training has several disadvantages not observed in prior work. However, with appropriate regularization via our proposed loss, the method shows new promise in resource limited scenarios and fault-tolerant systems, e.g., IoT and edge devices. Finally, we discuss how these results may have implications for general multi-branch architectures such as ResNeXt and Inception.
Algorithms for computing All-Pairs Shortest-Paths (APSP) are critical building blocks underlying many practical applications. The standard sequential algorithms, such as Floyd-Warshall and Johnson, quickly become infeasible for large input graphs, necessitating parallel approaches. In this work, we provide detailed analysis of parallel APSP performance on distributed memory clusters with Apache Spark. The Spark model allows for a portable and easy to deploy distributed implementation, and hence is attractive from the end-user point of view. We propose four different APSP implementations for large undirected weighted graphs, which differ in complexity and degree of reliance on techniques outside of pure Spark API. We demonstrate that Spark is able to handle APSP problems with over 200,000 vertices on a 1024-core cluster, and can compete with a naive MPI-based solution. However, our best performing solver requires auxiliary shared persistent storage, and is over two times slower than optimized MPI-based solver.
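For reference, the sequential Floyd-Warshall baseline the authors contrast against is just a triple loop over a dense distance matrix, $O(V^3)$ time and $O(V^2)$ memory, which is why it quickly becomes infeasible at the scales the paper targets:

```python
import math

def floyd_warshall(w):
    """Classic O(V^3) APSP on a dense adjacency matrix (math.inf = no edge):
    successively allow each vertex k as an intermediate hop."""
    n = len(w)
    d = [row[:] for row in w]
    for k in range(n):
        for i in range(n):
            dik = d[i][k]
            for j in range(n):
                if dik + d[k][j] < d[i][j]:
                    d[i][j] = dik + d[k][j]
    return d

INF = math.inf
graph = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
dist = floyd_warshall(graph)
print(dist[0])   # shortest paths from vertex 0 -> [0, 3, 5, 6]
```

At 200,000 vertices this loop would require on the order of 10^16 relaxations and a 40-billion-entry matrix, which motivates the distributed Spark and MPI formulations studied in the paper.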
Recent advances in deep neural networks, language modeling and language generation have introduced new ideas to the field of conversational agents. As a result, deep neural models such as sequence-to-sequence, Memory Networks, and the Transformer have become key ingredients of state-of-the-art dialog systems. While those models are able to generate meaningful responses even in unseen situations, they need a lot of training data to build a reliable model. Thus, most real-world systems stick to traditional approaches based on information retrieval and even hand-crafted rules, due to their robustness and effectiveness, especially for narrow-focused conversations. Here, we present a method that adapts a deep neural architecture from the domain of machine reading comprehension to re-rank the suggested answers from different models using the question as context. We train our model using negative sampling based on question-answer pairs from the Twitter Customer Support Dataset. The experimental results show that our re-ranking framework can improve the performance in terms of word overlap and semantics both for individual models as well as for model combinations.
As one of the most popular techniques for solving the ranking problem in information retrieval, Learning-to-rank (LETOR) has received a lot of attention both in academia and industry due to its importance in a wide variety of data mining applications. However, most existing LETOR approaches choose to learn a single global ranking function to handle all queries, and ignore the substantial differences that exist between queries. In this paper, we propose a domain generalization strategy to tackle this problem. We propose Query-Invariant Listwise Context Modeling (QILCM), a novel neural architecture which eliminates the detrimental influence of inter-query variability by learning \textit{query-invariant} latent representations, such that the ranking system could generalize better to unseen queries. We evaluate our techniques on benchmark datasets, demonstrating that QILCM outperforms previous state-of-the-art approaches by a substantial margin.
Designing neural network architectures is a task that lies somewhere between science and art. For a given task, some architectures are eventually preferred over others, based on a mix of intuition, experience, experimentation and luck. For many tasks, the final word is attributed to the loss function, while for some others a further perceptual evaluation is necessary to assess and compare performance across models. In this paper, we introduce the concept of capacity allocation analysis, with the aim of shedding some light on what network architectures focus their modelling capacity on, when used on a given task. We focus more particularly on spatial capacity allocation, which analyzes a posteriori the effective number of parameters that a given model has allocated for modelling dependencies on a given point or region in the input space, in linear settings. We use this framework to perform a quantitative comparison between some classical architectures on various synthetic tasks. Finally, we consider how capacity allocation might translate in non-linear settings.
Privacy-preserving data analysis is a rising challenge in contemporary statistics, as the privacy guarantees of statistical methods are often achieved at the expense of accuracy. In this paper, we investigate the tradeoff between statistical accuracy and privacy in mean estimation and linear regression, under both the classical low-dimensional and modern high-dimensional settings. A primary focus is to establish minimax optimality for statistical estimation with the $(\varepsilon,\delta)$-differential privacy constraint. To this end, we find that classical lower bound arguments fail to yield sharp results, and new technical tools are called for. We first develop a general lower bound argument for estimation problems with differential privacy constraints, and then apply the lower bound argument to mean estimation and linear regression. For these statistical problems, we also design computationally efficient algorithms that match the minimax lower bound up to a logarithmic factor. In particular, for the high-dimensional linear regression, a novel private iterative hard thresholding pursuit algorithm is proposed, based on a privately truncated version of stochastic gradient descent. The numerical performance of these algorithms is demonstrated by simulation studies and applications to real data containing sensitive information, for which privacy-preserving statistical methods are necessary.
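The simplest building block in this space is worth seeing concretely: a pure $\varepsilon$-differentially private mean via the Laplace mechanism. This is the textbook mechanism, not the paper's $(\varepsilon,\delta)$ minimax-optimal estimators, and the clipping range and $\varepsilon$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def private_mean(x, epsilon, lo=0.0, hi=1.0):
    """epsilon-DP mean via the Laplace mechanism: clip each record to
    [lo, hi], so changing one record moves the mean by at most
    (hi - lo) / n, then add Laplace noise scaled to that sensitivity."""
    x = np.clip(x, lo, hi)
    sensitivity = (hi - lo) / len(x)
    return x.mean() + rng.laplace(scale=sensitivity / epsilon)

data = rng.uniform(size=10000)
est = private_mean(data, epsilon=0.5)
print(f"true mean {data.mean():.4f}, private estimate {est:.4f}")
```

The accuracy-privacy tradeoff is visible in the noise scale (hi - lo) / (n * epsilon): tighter privacy (smaller epsilon) or smaller samples force proportionally more noise, which is the tension the paper's lower bounds make precise.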
The encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our ‘learning to downsample’ module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
Within online social networks (OSNs), many of our supposedly online friends may instead be fake accounts, called social bots, that are part of large groups purposely re-sharing targeted content. Here, we study retweeting behaviors on Twitter, with the ultimate goal of detecting retweeting social bots. We collect a dataset of 10M retweets. We design a novel visualization that we leverage to highlight benign and malicious patterns of retweeting activity. In this way, we uncover a ‘normal’ retweeting pattern that is peculiar to human-operated accounts, and 3 suspicious patterns related to bot activities. Then, we propose a bot detection technique that stems from the previous exploration of retweeting behaviors. Our technique, called Retweet-Buster (RTbust), leverages unsupervised feature extraction and clustering. An LSTM autoencoder converts the retweet time series into compact and informative latent feature vectors, which are then clustered with a hierarchical density-based algorithm. Accounts belonging to large clusters characterized by malicious retweeting patterns are labeled as bots. RTbust obtains excellent detection results, with F1 = 0.87, whereas competitors achieve F1 < 0.76. Finally, we apply RTbust to a large dataset of retweets, uncovering 2 previously unknown active botnets with hundreds of accounts.
Binary Stochastic Filtering (BSF), an algorithm for feature selection and neuron pruning, is proposed in this work. The filtering layer stochastically passes or filters out features based on individual weights, which are tuned during the neural network training process. By placing BSF after the neural network input, filtering of the input features is performed, i.e. feature selection. More than a 5-fold dimensionality decrease was achieved in the experiments. Placing a BSF layer between hidden layers allows filtering of neuron outputs and can be used for neuron pruning. Up to a 34-fold decrease in the number of weights in the network was reached, which corresponds to a significant increase in performance that is especially important for mobile and embedded applications.
Online detection of instantaneous changes in the generative process of a data sequence generally focuses on retrospective inference of such change points without considering their future occurrences. We extend the Bayesian Online Change Point Detection algorithm to also infer the number of time steps until the next change point (i.e., the residual time). This enables us to handle observation models which depend on the total segment duration, which is useful to model data sequences with temporal scaling. In addition, we extend the model by removing the i.i.d. assumption on the observation model parameters. The resulting inference algorithm for segment detection can be deployed in an online fashion, and we illustrate applications to synthetic and to two medical real-world data sets.
Sparse reward is one of the most challenging problems in reinforcement learning (RL). Hindsight Experience Replay (HER) attempts to address this issue by converting a failed experience to a successful one by relabeling the goals. Despite its effectiveness, HER has limited applicability because it lacks a compact and universal goal representation. We present Augmenting experienCe via TeacheR’s adviCE (ACTRCE), an efficient reinforcement learning technique that extends the HER framework using natural language as the goal representation. We first analyze the differences among goal representations, and show that ACTRCE can efficiently solve difficult reinforcement learning problems in challenging 3D navigation tasks, whereas HER with non-language goal representation failed to learn. We also show that with language goal representations, the agent can generalize to unseen instructions, and even generalize to instructions with unseen lexicons. We further demonstrate it is crucial to use hindsight advice to solve challenging tasks, and even a small amount of advice is sufficient for the agent to achieve good performance.
A measure of quality of a control system is a quantitative extension of the classical binary notion of controllability. In this article we study the quality of linear control systems from a frame-theoretic perspective. We demonstrate that all LTI systems naturally generate a frame on their state space, and that three standard measures of quality involving the trace, minimum eigenvalue, and the determinant of the controllability Gramian achieve their optimum values when this generated frame is tight. Motivated by this, and in view of some recent developments in frame-theoretic signal processing, we propose a natural measure of quality for continuous time LTI systems based on a measure of tightness of the frame generated by it and then discuss some properties of this frame-theoretic measure of quality.
We propose infinite mixture prototypes to adaptively represent both simple and complex data distributions for few-shot learning. Our infinite mixture prototypes represent each class by a set of clusters, unlike existing prototypical methods that represent each class by a single cluster. By inferring the number of clusters, infinite mixture prototypes interpolate between nearest neighbor and prototypical representations, which improves accuracy and robustness in the few-shot regime. We show the importance of adaptive capacity for capturing complex data distributions such as alphabets, with 25% absolute accuracy improvements over prototypical networks, while still maintaining or improving accuracy on the standard Omniglot and mini-ImageNet benchmarks. In clustering labeled and unlabeled data by the same clustering rule, infinite mixture prototypes achieve state-of-the-art semi-supervised accuracy. As a further capability, we show that infinite mixture prototypes can perform purely unsupervised clustering, unlike existing prototypical methods.

### Databricks Security Advisory: Critical Runc Vulnerability (CVE-2019-5736)

Databricks became aware of a new critical runc vulnerability (CVE-2019-5736) on February 12, 2019 that allows malicious container users to gain root access to the host operating system. This vulnerability affects many container runtimes, including Docker and LXC. The Databricks security team has evaluated the vulnerability and confirmed that, due to the Databricks platform architecture, there is no external vector by which an attacker could exploit the flaw to gain access to the host VM on which the containers reside.  Additionally, our architecture isolates each customer by providing each customer with a separate host VM located within the customer’s cloud services account, so this exploit would not permit any cross-customer access, even if the underlying container were compromised.

This CVE includes two attack vectors:

• Creating a new container using an attacker-controlled image.

Databricks only launches containers built by the Databricks engineering team, so malicious external users have no way of launching their own image.

• Attaching to an existing container to which the attacker previously had write access.

Only Databricks services can attach to existing containers. Users access containers through RPCs, and cannot attach to existing containers.

Though we believe the vulnerability is unlikely to be practically exploitable in our environment, Databricks engineering will push a hotfix that will be deployed as soon as reasonably possible.

### How does the exploit work in detail?

The exploit tries to compromise the container runtime binary in order to gain root access to the host. The container runtime is a binary program that runs on the host system and orchestrates process execution inside the container. It is designed to ensure that the container’s processes run in their own isolated namespace and with reduced privileges. On Docker, the default container runtime is the runC binary; on LXC it is the miscellaneous lxc-* utilities.

Taking lxc-attach as an example, a malicious user can mount the attack with the following steps:

• Replace a target binary inside the container with custom content that points back to the lxc-attach binary itself. For example, one can replace the container’s /bin/bash with the following content:

#!/proc/self/exe

In this way, /bin/bash (container path) becomes an executable script that uses /proc/self/exe to interpret its malicious content. Note that /proc/self/exe is a symbolic link created by the kernel for every process, which points to the binary that was executed for that process.

• Trick the container runtime into executing the target binary from the host system. When /bin/bash is executed inside the container, the target of /proc/self/exe is executed instead, which points to the container runtime binary on the host. In the example, when the attacker uses lxc-attach to run a command inside the container, lxc-attach invokes the container’s /bin/bash using the execve() syscall, which in turn runs /proc/self/exe, i.e. lxc-attach itself, to interpret the injected malicious payload.
• Proceed to write to the target of /proc/self/exe so as to overwrite the lxc-attach binary on the host. In general this will not succeed directly, as the kernel will not permit the binary to be overwritten while lxc-attach is executing. To overcome this, the attacker can instead open /proc/self/exe using the O_PATH flag to get a file descriptor <fd>, then reopen the binary as O_WRONLY through /proc/self/fd/<fd> and try to write to it in a busy loop from a newly forked subprocess. This eventually succeeds once the parent lxc-attach process exits. After this, the lxc-attach binary on the host is compromised and can be used to attack other containers or the host itself. The rewriting logic can be driven from the malicious payload injected into the target binary in step 1.
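The kernel protection described in the last step is easy to observe directly. The following Python sketch (a Linux-only illustration, not part of the advisory; the function name probe_write_refusal is ours) opens the currently running binary via /proc/self/exe with O_PATH and then attempts to reopen it write-only through /proc/self/fd/<fd>, as the exploit does. The kernel answers with ETXTBSY while the binary is executing (or EACCES if the process lacks write permission), which is exactly why the real exploit must keep retrying from a forked subprocess until the runtime process exits.

```python
import errno
import os

def probe_write_refusal():
    """Attempt to reopen the currently running binary for writing,
    mirroring the CVE-2019-5736 trick, and return the errno name
    the kernel refuses with (Linux only)."""
    # O_PATH yields a handle to the binary without read/write access.
    path_fd = os.open("/proc/self/exe", os.O_PATH)
    try:
        # Reopen that handle write-only via /proc/self/fd/<fd>.
        wfd = os.open("/proc/self/fd/%d" % path_fd, os.O_WRONLY)
    except OSError as e:
        # Typically ETXTBSY (binary busy while executing) or EACCES.
        return errno.errorcode[e.errno]
    else:
        os.close(wfd)
        return "OPEN_SUCCEEDED"  # would indicate no protection at all
    finally:
        os.close(path_fd)

if __name__ == "__main__":
    print(probe_write_refusal())
```

For a Python script, /proc/self/exe points at the interpreter binary; the refusal holds as long as any process is executing that binary, so the exploit’s busy loop only wins after the parent lxc-attach process terminates.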

Therefore, there are 3 major conditions to enable the attack:

1. The attacker must have or gain control of the content of the image in order to replace the target binary inside the container. This is achievable if the attacker controls the container image or previously had write access to the container.
2. The attacker must be able to invoke the container runtime on the host system through some external channel. This is the case if the host system exposes an API layer (e.g., the kubelet API server) that allows users to invoke the container runtime binary indirectly: for example, an API allowing a remote user to launch a container with a custom image, or to attach to a running container using lxc-attach or docker exec.
3. The attacker must have permission to overwrite the content of the host’s container runtime binary from the container. This is possible if the container is running as a privileged user on the host system, but impossible if it is running as an unprivileged user.

Databricks only exposes an API to launch containers with trusted Databricks Runtime images released by our engineering team, and these containers are not subject to modification by users prior to being attached or created. Since an image that was modified after creation cannot be used to take advantage of this exploit, the trusted container status renders the Databricks standard architecture unaffected. Additionally, Databricks workspace users access containers through an RPC server running inside the container, and so cannot attach to existing containers using the low-level container runtime binary.

--

The post Databricks Security Advisory: Critical Runc Vulnerability (CVE-2019-5736) appeared first on Databricks.

### How to Cope with the Rise of the Citizen Data Scientist

Gartner predicts that citizen data scientists will surpass data scientists in the amount of advanced analytics produced. Does that mean that Enterprise AI and augmented analytics render the job of a data scientist obsolete? Download this white paper to find out more.

### Playing With Pipe Notations

I am not sure if it is a good or bad idea. But let’s play with it a bit, and perhaps readers can submit their experience and opinions in the comments section.

## Right assignment

Right assignment is a bit of an oddity in programming languages. Offhand I can think of a few programming languages that use it: COBOL, TI-Basic, and Forth (due to its value-stack notation).

I have written a bit about right assignment in R in the past. And that led to some interesting discussion.

Frankly I thought right assignment was prohibited in Hadley Wickham’s own style guide. But as Gabe Becker taught me a while ago: there isn’t actually any right-arrow in R (so maybe “Use <-, not =, for assignment.” allows ->).

substitute(5 -> x)
# x <- 5


Another point: R pipes are very closely related to right assignment notation, so once you allow right assignment you don’t actually need pipes in the current R sense (though other forms of pipes such as Unix pipes would be a great addition).

## “then”

The idea of having a canonical “pronunciation” for symbols is not a new one. It is fairly standard practice in the Unix community (one reference here). The McIlroy Unix pipe “|” (which streams partial results, resulting in very powerful concurrent composition) is said to be read as “pipe”, “pipe to”, “to”, or “thru” (and a few more variations). In this era of keyboard shortcuts it is worth considering more verbose piping operators.

Let’s try the idea.

library("dplyr")
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

d <- data.frame(x = 1:3)

`%then%` <- magrittr::`%>%`

d %then%
mutate(., y = x + 1) %then%
knitr::kable(.)
#> Error in pipes[[i]]: subscript out of bounds


I’d say this fails on at least two counts: the first “%then%” doesn’t seem grammatical (as d is a noun), and magrittr pipes can’t be bound to a new name (as they are implemented by looking for themselves by name in captured unevaluated code).

However, the wrapr dot arrow pipe can take on new names.

Let’s try a variation, using a traditional pronunciation: “to”.

`%to%` <- wrapr::`%.>%`

d %to%
mutate(., y = x + 1) %to%
knitr::kable(.)

| x| y|
|--:|--:|
| 1| 2|
| 2| 3|
| 3| 4|

## Conclusion

I am still not sure about the above notation one way or the other. Notational prescriptions are at best proposals or “requests for comment”, and need to consider context and precedent to be useful.

### Book Memo: “Domain-Specific Knowledge Graph Construction”

The vast amounts of ontologically unstructured information on the Web, including HTML, XML and JSON documents, natural language documents, tweets, blogs, markups, and even structured documents like CSV tables, all contain useful knowledge that can present a tremendous advantage to the Artificial Intelligence community if extracted robustly, efficiently and semi-automatically as knowledge graphs. Domain-specific Knowledge Graph Construction (KGC) is an active research area that has recently witnessed impressive advances due to machine learning techniques like deep neural networks and word embeddings. This book will synthesize Knowledge Graph Construction over Web Data in an engaging and accessible manner. The book will describe a timely topic for both early- and mid-career researchers. Every year, more papers continue to be published on knowledge graph construction, especially for difficult Web domains. This work would serve as a useful reference, as well as an accessible but rigorous overview of this body of work. The book will present interdisciplinary connections when possible to engage researchers looking for new ideas or synergies. This will allow the book to be marketed in multiple venues and conferences. The book will also appeal to practitioners in industry and data scientists since it will have chapters on both data collection, as well as a chapter on querying and off-the-shelf implementations.

### Jupyter Community Workshop: Dashboarding with Project Jupyter

We have some exciting news about the Jupyter Community Workshop on dashboarding!

The workshop will be held in Paris, France, from June 3rd to June 6th, 2019. The event is being hosted at the Center for Interdisciplinary Research (CRI), in the heart of Paris.

The workshop committee consists of Maarten Breddels (Freelance), Pascal Bugnion (Faculty.ai), Sylvain Corlay (QuantStack), Alexandre Gramfort (INRIA), and Vidar Tonaas Fauske (Simula).

The workshop will last four days, with hands-on discussions, hacking sessions, and technical presentations. The goal of the event is to foster collaboration and the sharing of knowledge between downstream library authors and contributors, and favor upstream contributions.

In addition to the Community Workshop, we plan on holding a public Meetup on June 5th, in partnership with the PyData Paris Meetup, with a series of lightning talks on Project Jupyter and related projects.

Why a Workshop on Dashboarding?

The Jupyter ecosystem has great tools for teaching, exploration, and development. Dashboards allow users to interact with a kernel through interactive controls, plots, maps, etc., and allow researchers and data scientists to share their results with students, with their peers, and with the general public. Currently, Jupyter users who want dashboards are (mostly) forced towards other Python or R libraries, or must make direct use of front-end technologies and develop directly in JavaScript.

There are existing early technologies that allow serving dashboards based on notebooks, most notably voila. The goal of this workshop is to gather core Jupyter widgets developers, members of the community and users with experience in dashboarding to bring dashboarding to a level where it can be used by all members of the Jupyter ecosystem. Ultimately, we envisage users being able to develop and deploy dashboards entirely within the Jupyter ecosystem.

We will lay the foundations for dashboarding as a first-class citizen in the Jupyter ecosystem.

Acknowledgements

This would not have been possible without the generous support of Bloomberg, which funded this workshop series.

We are also grateful to the CRI for graciously hosting the dashboarding community workshop.

Jupyter Community Workshop: Dashboarding with Project Jupyter was originally published in Jupyter Blog on Medium.

### PDF Data Extraction: What You Need to Know

In our free guide, we show you how and where you can use extracted data from PDFs, and explain the necessary qualities you should be looking for when evaluating extraction tools.

### A Guide to Developer, Deep Dive, and Continuous Streaming Applications Talks at Spark + AI Summit

In January 2013 when Stephen O’Grady, an analyst at RedMonk, published “The New Kingmakers: How Developers Conquered the World,” the book’s central argument (then and still now) universally resonated with an emerging open-source community. He convincingly charts developers’ movement “out of the shadows and into the light as new influencers on society’s [technical landscape].”

Using the open-source software of their choice, and at the ready to study or contribute source code on GitHub, developers have built data products using open-source technologies that have shaped the industry today. O’Grady cites notable open-source examples that engendered successful software companies, as well as companies that employ open source to build their infrastructure stacks.

He asserts that developers make a difference; they chart the course, like the Kingmakers.

And this April, you can join many of these kingmakers at Spark + AI Summit 2019. Hear and learn from them as they offer their insights into how they combine data and AI to build data pipelines, and how they use and extend Apache Spark™ to solve tough data problems.

In this blog, we highlight selected sessions that speak to developers’ endeavors in combining the immense value of data and AI across three tracks: Developer, Deep Dives, and Continuous Streaming Applications.

## Developer

Naturally, let’s start with the Developer track. Ryan Blue of Netflix in his talk, Improving Apache Spark’s Reliability with DataSourceV2, will share Spark’s new DataSource V2 API, which allows working with data from tables and streams. With relevant changes to Spark SQL internals, the V2 allows developers to build reliable data pipelines from relevant data sources. For Spark developers writing data source connectors, this is a must talk to attend.

Enhanced in Spark 2.3, columnar storage is an efficient way to store DataFrames. In his talk, In-Memory Storage Evolution in Apache Spark, Dr. Kazuaki Ishizaki, PMC Spark committer and an ACM award winner, will discuss the evolution of in-memory storage: How Apache Arrow exchange format and Spark’s ColumnVector for storage enhance Spark SQL access and query performance on DataFrames.

Related to DataFrames and Spark SQL, Messrs DB Tsai and Cesar Delgado of Apple Inc will address how they handle deeply nested structures by making them first-class citizens in Spark SQL, giving them immense speed up in querying and processing humongous data for Apple Siri, a virtual assistant. Their talk, Making Nested Columns as First Citizen in Apache Spark SQL, is a good example to show developers how to extend Spark SQL.

Which brings us to Spark’s extensibility. Among the many features that attract developers to Spark, one is its extensibility with new language bindings or libraries. Messrs Tyson Condie and Rahul Potharaju of Microsoft will explain how they extended Spark with new .NET bindings in their talk: Introducing .NET bindings for Apache Spark.

Yet for all of Spark’s many merits, fast-paced adoption, and innovation from the wider community, developers face some challenges: how do you automate testing and assess the quality and performance of new developments? To that end, Messrs Bogdan Ghit and Nicolas Poggi of Databricks will share their work on building a new testing and validation framework for Spark SQL in their talk: Fast and Reliable Apache Spark SQL Engine.

## Technical Deep Dives

Since its introduction in 2016 as a track with developer-focused sessions, the technical deep dives track has grown in popularity and attendance. It attracts both data engineers and data scientists looking for deeper experience with the subject. This year, for example, three sessions stand out.

First, data privacy and protection have become imperative today in light of GDPR, especially in Europe. The talk Great Models with Great Privacy: Optimizing ML and AI Over Sensitive Data from CTO Sim Simeonov of Swoop will challenge the assumption that privacy implies worse predictions from ML models, by examining techniques used in production environments to mitigate this tradeoff.

Second, Spark SQL is at the core of Spark’s structured APIs, including Structured Streaming, and its efficient query processing engine. But what enables it? What’s under the hood that’s performant and why? Messrs Maryann Xue and Takuya Ueshin of Databricks’ Apache Spark core team will dive into pipeline execution, whole-stage code generation, memory management, and internals that make this engine fault-tolerant and performant. A valuable lesson into the Spark core internals is their talk: A Deep Dive into Query Execution Engine of Spark SQL.

And third, closely related to Spark SQL is an effort to extend Spark to support graph data in Spark SQL queries, enabling data scientists and engineers to inspect and update graph databases. As part of a proposed effort underway to integrate this work into Spark’s upcoming release, developers Alastair Green and Martin Junghanns from Neo4j will make the case for Cypher, a graph query language, in their talk: Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apache Spark.

## Continuous Applications and Structured Streaming

Structured Streaming has garnered a lot of interest in building end-to-end data pipelines or writing continuous applications that interact in real-time with data and other applications. Three deep-dive talks will give you insight into how.

First is from Tathagata Das of Databricks: Designing Structured Streaming Pipelines—How to Architect Things Right. Second is from Scott Klein of Microsoft: Using Azure Databricks, Structured Streaming & Deep Learning Pipelines, to Monitor 1,000+ Solar Farms in Real-Time. And third is from Brandon Hamric of Eventbrite: Near real-time analytics with Apache Spark: Ingestion, ETL, and Interactive Queries.

## Apache Spark Training Sessions

And finally, check out two training courses for big data developers to extend your knowledge of Apache Spark programming, how to build scalable data pipelines with Delta, and performance and tuning, respectively: APACHE SPARK™ PROGRAMMING AND DELTA and APACHE SPARK™ TUNING AND BEST PRACTICES.

## What’s Next

You can also peruse and pick sessions from the schedule. In the next blog, we will share our picks from sessions related to Data Science and Data Engineering tracks.


--

### Julia Child (2) vs. Frank Sinatra (3); Dorothy Parker advances

For yesterday‘s contest, Jonathan gave a strong argument:

First New Yorker showdown, just to see who will be taking on Veronica Geng in the finals. All the other contestants are just for show. I’m going with Liebling, because Parker wasn’t even the best New Yorker writer of her generation, being edged out by Benchley. Liebling dominated his era. If it comes down to Liebling vs. Geng, we’ll just exhume Harold Ross and make him pick.

But we’re looking for a talker, not a writer, so I’ll have to go with Dzhaughn:

After the Seance, we were chatting about the inspiration for this tournament. I said I thought Bruno was just a minor intellectual swindler rather than a real threat. Dorothy replied:

I used to think Latour was just something on a Schwinn dealer’s list*, but that was before I saw Julia’s child Oscar wildly strong-arm Lance with an ephronedrine-filled syringe merrily down the Streep, past a sidewalk cafe where the turing Pele and big bejeweled #23, in Brooks’ Brothers suits, were yakking over Smirnoff Martinis, eating a pile of franks, caesar salads, and some weirder dishes. James was on the phone, taking the TV network to hell and back over “letting that degenerate George Karl off the hook” for some remark, when, from behind a bush, sudden as a python, out springs teen-aged Babe D.-Z, among others! That geng didn’t look like they were here to serenade us with arias from Yardbird, that jazz oprah about Parker! No, they were there to revolt–air their own grievances–and when he stood to object, Babe just shoved LeBron and all his LeBling back onto LaPlace where he sat: Oof!

A bit of recursion is usually a good plan.

For today it’s the French Chef vs. the Chairman of the Board. Frank’s got a less screechy voice, but Julia should be able to handle the refreshments. Any thoughts?

Again, here’s the bracket and here are the rules.

### Automatic Machine Learning is broken

We take a look at the arguments against implementing a machine learning solution, and at the occasions when the problems faced are not ML problems at all and can perhaps be solved with optimization, exploratory data analysis, or simple statistics.

### My talk today (Tues 19 Feb) 2pm at the University of Southern California

At the Center for Economic and Social Research, Dauterive Hall (VPD), room 110, 635 Downey Way, Los Angeles:

The study of American politics as a window into understanding uncertainty in science

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We begin by discussing recent American elections in the context of political polarization, and we consider similarities and differences with European politics. We then discuss statistical challenges in the measurement of public opinion: inference from opinion polls with declining response rates has much in common with challenges in big-data analytics. From here we move to the recent replication crisis in science, and we argue that Bayesian methods are well suited to resolve some of these problems, if researchers can move away from inappropriate demands for certainty. We illustrate with examples in many different fields of research, our own and others’.

19 things we learned from the 2016 election (with Julia Azari), http://www.stat.columbia.edu/~gelman/research/published/what_learned_in_2016_5.pdf
The mythical swing voter (with Sharad Goel, Doug Rivers, and David Rothschild). http://www.stat.columbia.edu/~gelman/research/published/swingers.pdf
The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. http://www.stat.columbia.edu/~gelman/research/published/incrementalism_3.pdf
Honesty and transparency are not enough. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics14.pdf
The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. http://www.stat.columbia.edu/~gelman/research/published/bayes_management.pdf

The talk will mostly be about statistics, not political science, but it’s good to have a substantive home base when talking about methods.

### Running R and Python in Jupyter

The Jupyter Project began in 2014 for interactive and scientific computing. Fast forward 5 years, and Jupyter is now one of the most widely adopted data science IDEs on the market, giving the user access to Python and R.

### AI Trends that Paved the Way in 2018

With 2018 behind us, it’s been amazing to see AI projects gain steam and make significant impact across industries. In fact, a recent survey by CIO.com reports that 90% of enterprises are actively investing in AI.

What has fueled this innovation is the massive influx of organizations tapping into the potential of their data and the increasing availability of various machine learning technologies and frameworks. Furthermore, the cloud enables a new level of scale to match these massive data volumes without taking a hit on performance. Combined with the exponential growth in data volumes, AI has enabled companies to do amazing things — from accelerating drug discovery through genomics to preventing fraud in the securities market.

So what new trends and advances in 2019 will help address these challenges and move AI adoption further into the mainstream?

We asked some of the most innovative companies in the world this question and many others, to put our fingers on the pulse of where AI stands in the enterprise and gain a better understanding of how AI will continue to disrupt industries.

This blog, the first in a series of blog posts, provides highlights into what many thought were the most impactful trends and innovations in 2018 and what we should be excited about in 2019.

### What big data and AI innovations or trends did you see in 2018 that you were excited to see? And how do you think those innovations will continue to evolve and/or gain traction in 2019?

##### Deep Learning Goes Mainstream

“Keras and TensorFlow have been around for a while, but we’re seeing companies spanning the range from innovative startups to massive corporations using DL to unlock new business opportunities. At Quby we used a deep learning system in production for the first time in 2018. In 2019, TensorFlow 2.0 will be a huge milestone.”

• Stephen Galsworthy, Head of Data Science at Quby
##### The Unification of Analytics

“Companies like Databricks are consistently breaking down and refining the barriers and cost of entry for a truly unified data platform, so the boundaries between data lake, data science, streaming are not only compatible, but seamlessly integrated. You can often forget which part of the platform you’re using. And that’s a really good thing.”

• Stephen Harrison, Data Science Architect at Rue Gilt Groupe
##### The Democratization of Machine Learning

“The rise of productizing ML, and of tools that make ML far easier and more scalable, was a big step forward in 2018. You’re seeing lots of products that seek to embed ML into decision making, which makes ML far more accessible by de-emphasizing the algorithm and making it easier to leverage. Some of these are frameworks or automations of ML (MLflow, Einstein, etc.) and others are whole platforms where ML is the core.

Also, Reinforcement Learning has really taken off, and I think it’s incredibly exciting because it helps AI solve far more abstract problems. These haven’t made it into many products yet, but I think it’s really exciting and shows the future of AI.”

• Bradley Kent, AVP of Program Analytics at LoyaltyOne
##### Data Science is Permeating the Line of Business

“In 2018, the discussion on “Explainable AI,” trust and data bias was really encouraging. I believe that it is critical to develop AI that is explainable, provable and transparent. As is always the case, this journey towards trusted systems truly starts with the quality of the data used for AI training. This renewed focus in 2018 on labeled data that can be verified, validated and explained is exciting for us at Nielsen, as we are relentlessly focused on developing high-quality labeled data on consumer behavior. It is exciting that Explainable AI can lay the foundations for AI systems that can both be generalized across use cases and be trusted.”

• Mainak Mazumdar, Chief Research Officer at Nielsen
##### Novel Applications of Deep Learning

“There has been a lot of innovation in the last year in the field of deep learning that I am excited about! I think these innovations will create a lot of new AI applications, some of which are already in production and making massive changes in the industry. At Overstock, we use deep learning on multiple products, from email campaigns with predictive taxonomies to personalization modules that infer user style with deep learning. I’m excited to see how the industry, and specifically online retail, integrates more with deep learning and some novel applications that will follow.”

• Kamelia Aryafar, Chief Algorithm Officer at Overstock

Clearly, 2018 was the year of rapid adoption and democratization of the latest analytics innovations such as deep learning. It was encouraging to hear analytics leaders aligned on the importance and the progress around making analytics and AI simple across the organization. What came across clearly was the concept of unification across all facets of analytics — ensuring all stages of the analytics pipeline, the associated technologies, and the teams involved in data science and engineering, are seamlessly integrated and operating in harmony.

The next installment of this blog series will uncover predictions on what the next set of trends and innovations around AI, machine learning, and big data that will surface in 2019.

--

The post AI Trends that Paved the Way in 2018 appeared first on Databricks.

### I believe this study because it is consistent with my existing beliefs.

Kevin Lewis points us to this.

### Are BERT Features InterBERTible?

This is a short analysis of the interpretability of BERT contextual word representations. Does BERT learn a semantic vector representation like Word2Vec?

### Seasonality in NZ voting preference? by @ellis2013nz

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

There was a flurry of activity in the last couple of days on Twitter and the blogosphere, most notably Thomas Lumley’s excellent Stats Chat, relating to whether there is a pro-government bias in surveys of New Zealand voting intention in the summer. As the analysis I’ve seen used my nzelect R package, this motivated me to update it for recent polls.

## The nzelect update

nzelect hasn’t been updated on CRAN for some time, because about a year ago I made some major changes to the data model for the historical election results by voting place and I haven’t been able to complete testing and stabilisation of the result. I do hope to do this some time in the next few months. In the meantime, the version on GitHub has the current polling data, and I intend to keep it current. Political polls are very thin on the ground these days for New Zealand, so that’s not too big an ask! I’ve now spent a bit of time tidying it up and adding the three most recent polls, which I’d previously neglected.

Let’s start with the basics. One of the reasons I first put the package together several years ago was to help facilitate analysis of relatively long-run trends in political opinion. I wanted to lift analysis above over-interpretation of the last few noisy data points. Here’s the expressed voting intention of New Zealanders for the four currently largest parties in Parliament over time:

It’s striking, but unsurprising, how support for the Greens and New Zealand First has collapsed since coming into government with Labour (or just before, in the case of the Greens and their disappointing 2017 election campaign). The junior party in a coalition often suffers, as attention and kudos go to the leader of the larger party (in this case, Prime Minister Jacinda Ardern) while the smaller parties’ own base deals with the realities and compromises of being in government.

The other interesting (and disappointing, for statisticians and political scientists) observation from this chart is how obviously the number of polls has decreased. Topic for another day (or more likely, someone else to write about).

Here’s the R code for that chart:
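(The code block itself did not survive in this excerpt. Below is a minimal sketch of how such a chart could be built; the column names MidDate, Party and VotingIntention mirror those in nzelect's polls data frame, but the data here is synthetic and purely illustrative.)

```r
library(dplyr)
library(ggplot2)

set.seed(42)
# Synthetic stand-in for nzelect::polls (assumed columns: MidDate, Party, VotingIntention)
polls <- expand.grid(
  MidDate = seq(as.Date("2010-01-15"), as.Date("2019-01-15"), by = "quarter"),
  Party   = c("Labour", "National", "Green", "NZ First")
) %>%
  mutate(VotingIntention = runif(n(), 0.05, 0.45))

p <- ggplot(polls, aes(x = MidDate, y = VotingIntention, colour = Party)) +
  geom_point(alpha = 0.3) +                       # individual polls
  geom_smooth(se = FALSE, span = 0.3) +           # smoothed trend per party
  scale_y_continuous(labels = scales::percent) +  # proportions as percentages
  labs(x = "Polling date", y = "Voting intention", colour = "Party")
p
```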

There’s obvious interest for supporters of the left-of-centre parties in the combined vote for Labour and the Greens. That suggests the importance of this chart:

(Apologies for red-green colour blind people in the use of these party colours; the Greens are the lowest of the three lines and Labour the middle.)

It’s clear that to a significant degree electoral support for the two is substitutable, with the green and red lines moving in scissors-like counter directions at several key times since 2010, with the last 24 months just the most dramatic example.

It’s also clear that the combined support for the centre-left in New Zealand is pretty strong, having recovered to a point it hasn’t reached since several years before the end of the government led by Helen Clark in the 2000s.

Here’s the code for that chart. Note how I use the dates of elections to make a simple data frame of who is in power when, for the background rectangles; and leverage the parties_v vector of colours in nzelect to allocate the official party colours to both the parties’ lines and to the background fill.
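(That code also did not survive in this excerpt. The background-rectangle technique described above can be sketched roughly as follows; the election dates, governing parties and colours here are illustrative assumptions, not the values carried in nzelect's parties_v.)

```r
library(ggplot2)

# Illustrative table of who was in power between successive elections
governments <- data.frame(
  start = as.Date(c("2008-11-08", "2017-09-23")),
  end   = as.Date(c("2017-09-23", "2019-02-21")),
  party = c("National", "Labour")
)

p <- ggplot() +
  # Shaded background rectangles showing the governing party in each period
  geom_rect(data = governments,
            aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = party),
            alpha = 0.2) +
  scale_fill_manual(values = c(Labour = "red", National = "blue"))
p
```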

## Seasonality

Now, on to the question of seasonality. I don’t have much to add to the analysis of David Hood on Twitter and Thomas Lumley on StatsChat; I basically agree with their conclusions. I have the advantage of a few more polls because of the update to nzelect this morning.

Here is the expressed intended vote for the party of the Prime Minister over time:

The blue line for the National Party is higher than the equivalent for Labour Prime Ministers because National has tended to form a larger proportion of its governing coalitions than Labour in this time period. I could (and probably should) have added the intended vote for all parties currently in coalition government, but this is actually a pretty complicated thing to do so I’ve gone for the simpler approach for now. So long as we don’t make simplistic comparisons forgetting that New Zealand has a proportional representation system, that is ok for our purposes.

We obviously can’t tell anything about seasonality from this chart; there are too many data points and too much noise. Actually, one thing we can say for sure is that the seasonality isn’t strong: for highly seasonal data such as tourist numbers, the seasonality would be obvious even in a chart like this.

To see if there is a subtle seasonality effect, I tried modelling voting intention for the Prime Minister’s party on the month of the year, controlling for the party in power, a smooth trend over time, and whether or not it is an election month (otherwise September and November, with five of the six elections in this period, would certainly cloud the data). I used (as I nearly always do in this situation) Simon Wood’s excellent mgcv R package.

Having done that, we can approximate confidence intervals for the impact on voting preference of the month of the year. The next chart shows those estimates:

As the title says, it’s weak evidence of a weak effect, which might be around half a percentage point more positive for the Prime Minister’s party in the summer months than it is in June. Or it might be more than that, or even negative.

Here’s the code for that model and the last two charts:
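(The model code is also missing from this excerpt. A rough, runnable sketch of the kind of mgcv model described above, with synthetic data and hypothetical column names, might be:)

```r
library(mgcv)

set.seed(123)
# Synthetic stand-in for the real polling data (all column names hypothetical)
polls_pm <- data.frame(
  date_num       = seq(2000, 2019, length.out = 300),
  party          = factor(sample(c("Labour", "National"), 300, replace = TRUE)),
  month          = factor(sample(month.name, 300, replace = TRUE), levels = month.name),
  election_month = factor(sample(c("No", "Yes"), 300, replace = TRUE, prob = c(0.95, 0.05))),
  vote           = runif(300, 0.3, 0.5)
)

# Intended vote for the PM's party: a smooth trend over time, controlling for
# the party in power, calendar month, and whether it is an election month
model <- gam(vote ~ s(date_num) + party + month + election_month, data = polls_pm)
summary(model)
```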

It’s very motivating to see others using the nzelect package. Please tag me on Twitter, or let me know some other way, if you use it; that will encourage me to make further enhancements, and to get the new version with better historical data onto CRAN!

## Thanksgiving

I’m going to try to get into the habit of this at the end of each blog post. Without the hard work, innovation and sheer smarts of the open source community my blog (and many much more important things!) wouldn’t be possible. Here are just those in the R world whose code I used in this session (not all of it made it into the excerpt above, but that’s all the more reason to give thanks below).

| maintainer | email | no_packages | packages |
|---|---|---|---|
| Hadley Wickham | hadley@rstudio.com | 17 | assertthat, dplyr, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, testthat, tidyr, tidyverse, usethis |
| R Core Team | R-core@r-project.org | 10 | base, compiler, datasets, graphics, grDevices, grid, methods, stats, tools, utils |
| Gábor Csárdi | csardi.gabor@gmail.com | 9 | callr, cli, crayon, desc, pkgconfig, processx, ps, remotes, sessioninfo |
| Kirill Müller | | 6 | bindr, bindrcpp, hms, pillar, rprojroot, tibble |
| Winston Chang | winston@stdout.org | 4 | extrafont, extrafontdb, R6, Rttf2pt1 |
| Jim Hester | james.f.hester@gmail.com | 3 | fs, glue, withr |
| Lionel Henry | lionel@rstudio.com | 3 | purrr, rlang, tidyselect |
| Dirk Eddelbuettel | edd@debian.org | 3 | digest, Rcpp, x13binary |
| Yixuan Qiu | yixuan.qiu@cos.name | 3 | showtext, showtextdb, sysfonts |
| Jeroen Ooms | jeroen@berkeley.edu | 2 | curl, jsonlite |
| Yihui Xie | xie@yihui.name | 2 | knitr, xfun |
| R-core | R-core@R-project.org | 1 | nlme |
| Vitalie Spinu | spinuvit@gmail.com | 1 | lubridate |
| Michel Lang | michellang@gmail.com | 1 | backports |
| Patrick O. Perry | patperry@gmail.com | 1 | utf8 |
| Simon Wood | simon.wood@r-project.org | 1 | mgcv |
| Achim Zeileis | Achim.Zeileis@R-project.org | 1 | colorspace |
| Baptiste Auguie | baptiste.auguie@gmail.com | 1 | gridExtra |
| Gabor Csardi | csardi.gabor@gmail.com | 1 | prettyunits |
| Peter Ellis | peter.ellis2013nz@gmail.com | 1 | nzelect |
| Simon Urbanek | Simon.Urbanek@r-project.org | 1 | Cairo |
| James Hester | james.hester@rstudio.com | 1 | xml2 |
| Justin Talbot | justintalbot@gmail.com | 1 | labeling |
| Torsten Hothorn | Torsten.Hothorn@R-project.org | 1 | mvtnorm |
| Christoph Sax | christoph.sax@gmail.com | 1 | seasonal |
| Kevin Ushey | kevin@rstudio.com | 1 | rstudioapi |
| Max Kuhn | max@rstudio.com | 1 | generics |
| Stefan Milton Bache | stefan@stefanbache.dk | 1 | magrittr |
| Martin Maechler | | 1 | Matrix |
| Charlotte Wickham | cwickham@gmail.com | 1 | munsell |
| Brodie Gaslam | brodie.gaslam@yahoo.com | 1 | fansi |
| Matthew Lincoln | matthew.d.lincoln@gmail.com | 1 | clipr |
| Gavin L. Simpson | ucfagls@gmail.com | 1 | gratia |
| Marek Gagolewski | gagolews@rexamine.com | 1 | stringi |
| Jeremy Stephens | jeremy.f.stephens@vumc.org | 1 | yaml |
| Brian Ripley | ripley@stats.ox.ac.uk | 1 | MASS |
| Deepayan Sarkar | deepayan.sarkar@r-project.org | 1 | lattice |
| Claus O. Wilke | wilke@austin.utexas.edu | 1 | cowplot |
| Rasmus Bååth | rasmus.baath@gmail.com | 1 | beepr |
| Jennifer Bryan | jenny@stat.ubc.ca | 1 | cellranger |
| Alex Hayes | alexpghayes@gmail.com | 1 | broom |
| Simon Urbanek | simon.urbanek@r-project.org | 1 | audio |
| Jim Hester | jim.hester@rstudio.com | 1 | memoise |

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Monitoring Diabetes’ risk and BMI thanks to a Shiny dashboard

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hi everyone, and welcome back to our blog!
Valentine’s day has come, and I guess many of you have eaten a lot of sweets these days, so it’s the right time for a health check; we’ve got you covered, with a touch of R-based magic!

#### A little backstory: R-lab in 2018

In January 2018 I joined MilanoR, a community dedicated to bringing together local R users, aiming to share knowledge, best practices and good times with everyone who wants to get involved, at all skill levels; you can learn more about the project here.

Among all the event formats they have experimented with, the most interesting is the R-lab. An R-lab is a non-competitive workshop where everyone works together on a common effort, be it developing a Shiny dashboard, optimizing an existing one, or simply helping the main guest solve a business problem with R.

Cool, isn’t it? Check out some of the previous events on our blog!

During our January R-lab, we met Riccardo Rossi, computational biologist and bioinformatics facility manager at INGM.

After showing us the existing medical guidelines for assessing the risks of obesity, type 2 diabetes, hypertension and cardiovascular disease, he invited us to build a Shiny app that lets people keep their health status in check just by entering some key parameters, such as height, weight and age.

#### Creating the dashboard

After meeting Riccardo, I embraced this challenge and started thinking: how could I translate medical guidelines, expressed as formulas, into an easy, working piece of R code?

#### Assessing the risk: server functions

My first goal was to build two functions, one for assessing risk of obesity and the second one to assess the risk of type 2 diabetes.

For the sake of simplicity, I’ll only show the first one:

obesity_risk <- function(weight, height, gender) {

# Body mass index: weight (kg) divided by squared height (m)
bmi_2 = weight / (height^2)

# A BMI of 30 or more maps to 100% risk for either gender
if (bmi_2 >= 30) {
return("100%")
}

# Absolute risk band from the guidelines, by gender and BMI
if (gender == "female") {
ob_absolut = if (bmi_2 < 25) 1 else 19.5
} else {
ob_absolut = if (bmi_2 < 25) 1 else 13
}

# Risk relative to the 8% reference value in the guidelines
ob_relative = round((ob_absolut/100)/(8/100), 1)
paste0(ob_relative, "%")
}

Following the provided guidelines, this function calculates the user’s BMI, and returns the relative obesity risk. The other one does the same to assess the risk of contracting type 2 diabetes.
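To make the arithmetic concrete, here is a quick worked example using the guideline numbers quoted in the code above:

```r
weight <- 70; height <- 1.70   # a 70 kg person, 1.70 m tall
bmi <- weight / height^2       # ~24.2, below the 25 threshold
ob_absolut <- 1                # absolute risk band for BMI < 25
ob_relative <- round((ob_absolut / 100) / (8 / 100), 1)
ob_relative                    # 0.1, reported by the function as "0.1%"
```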

#### Interacting with the user

What’s an app without a user interacting with it? Front-end time!
Using the shiny and shinydashboard libraries, I designed a user-friendly interface that allows people to enter the needed personal data:

• age
• gender
• weight
• height
• waist
• lifestyle habits
dashboardPage(
dashboardHeader(title = "Hi, I'm Doctor Thomas!", titleWidth = 300),
dashboardSidebar(disable = TRUE),
dashboardBody(
fluidRow(

box(title = "Your data", solidHeader = TRUE, width = 3, status = "primary",
numericInput("age", "How old are you?", 25),
numericInput("weight", "What is your weight (kilos)?", 70),
numericInput("height", "What is your height (meters)?", 1.7),
numericInput("waist", "What is your waist size?", 70),
radioButtons("hypdrugs", "Do you take hypertension drugs?", choices = c("Yes", "No"))
)
)
)
)

#### Enough with the code, show me the dashboard!

The full working app is hosted here; let us know what you think about it.
If you’re interested in the full code, I will upload it on GitHub and edit this post. See you at the next meetup!

The post Monitoring Diabetes’ risk and BMI thanks to a Shiny dashboard appeared first on MilanoR.


### Hacking Deep Learning (Bar Ilan) Workshop Videos

Hacking Deep Learning (Bar Ilan) Workshop videos are now online. Thanks to my friend Prof. Yossi Keshet for organizing and inviting me!
One notable talk unfortunately missing from the videos is by Prof. Adi Shamir, described in this paper. The work analyzes how many pixels one needs to change to confuse a deep-learning-based classifier. The result is surprising: only a few! A related work is described here.

### Four short links: 19 February 2019

3D with Face Tracking, Cleaning Data, Data as Labor, Walking Robotics

1. Depth Index -- A JavaScript package that turns z-index into physically realistic depth, using PoseNet face tracking. Deep, man.
2. Data Cleaner's Cookbook -- This is version 1 of a cookbook that will help you check whether a data table (defined on the data tables page) is properly structured and free from formatting errors, inconsistencies, duplicates, and other data headaches. All the data-auditing and data-cleaning recipes on this website use GNU/Linux tools in a BASH shell and work on plain text files.
3. Should We Treat Data as Labor? Moving Beyond "Free" -- In this paper, we explore whether and how treating the market for data like a labor market could serve as a radical market that is practical in the near term.
4. Underactuated Robotics -- working notes used for a course being taught at MIT [on] Algorithms for Walking, Running, Swimming, Flying, and Manipulation. Even if you don't care about robotics, read this excellent Hacker News comment (words I don't say often) and you'll think about walking completely differently.

### Animate intermediate results of your algorithm

(This article was first published on Stanislas Morbieu - R, and kindly contributed to R-bloggers)

The R package gganimate makes it possible to animate plots. It is particularly useful for
visualizing the intermediate results of an algorithm, to see how it converges towards
the final result. The following illustrates this with k-means clustering.

The outline of this post is as follows: we first generate some artificial data to work with,
which lets us visualize the behavior of the algorithm. The k-means criterion and an algorithm to optimize it
are then presented and implemented in R so that the intermediate results are stored in a dataframe. Last, the content of the dataframe
is plotted dynamically with gganimate.

## Generate some data

To see how the algorithm behaves, we first need some data. Let’s
generate an artificial dataset:

library(mvtnorm)
library(dplyr)

generateGaussianData <- function(n, center, sigma, label) {
data = rmvnorm(n, mean = center, sigma = sigma)
data = data.frame(data)
names(data) = c("x", "y")
data = data %>% mutate(class=factor(label))
data
}

dataset <- {
# cluster 1
n = 50
center = c(5, 5)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)

# cluster 2
n = 50
center = c(1, 1)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)

# all data
data = bind_rows(data1, data2)
data$class = as.factor(data$class)
data
}


We generated a mixture of two Gaussians. There is nothing very special about it, except
that it is in two dimensions, which makes it easy to plot without the need for a dimensionality reduction method.

Let’s now move on to our algorithm.

## K-means

Here, I choose k-means since it is widely used for clustering and, moreover, admits a simple implementation.

For a given number of clusters K, and a set of N vectors $$x_i , i \in [1, N]$$,
K-means aims to minimize the following criterion:

\begin{equation*}
W(z, \mu) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} ||x_i - \mu_k||^2
\end{equation*}

with:

• $$z_{ik} \in \{0, 1\}$$ indicates whether the vector $$x_i$$ belongs to the cluster $$k$$;
• $$\mu_k \in \mathbb{R}^p$$, the center of the cluster $$k$$.

### Lloyd and Forgy algorithms

Several algorithms optimize the k-means criterion.
For instance, the R function kmeans() provides four algorithms:

kmeans(x, centers, iter.max = 10, nstart = 1,
algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
"MacQueen"), trace=FALSE)


In fact, Forgy and Lloyd algorithms are implemented the same way. We can see this in the source
code of kmeans():

edit(kmeans)


It opens the source code in your favorite text editor. At lines 56 and 57, “Forgy” and “Lloyd” are assigned
to the same number (2L) and are thus mapped to the same implementation:

nmeth <- switch(match.arg(algorithm), "Hartigan-Wong" = 1L,
Lloyd = 2L, Forgy = 2L, MacQueen = 3L)


In the following, we will implement this algorithm. After the initialization, it iterates over two steps until convergence:

• an assignment step which assigns the points to the clusters;
• an update step which updates the centroids of the clusters.

### Initialization

The initialization consists in selecting $$K$$ points at random and considering them as the centroids of the clusters:

dataset = dataset %>% mutate(sample = row_number())
centroids = dataset %>% sample_n(2) %>% mutate(cluster = row_number()) %>% select(x, y, cluster)


### Assignment step

The assignment step of k-means is equivalent to the E and C step of the CEM algorithm in the
Gaussian mixture model.
It assigns the points to the clusters according to the distances between the points and the centroids.
Let’s write $$z_k$$ the set of points in the cluster $$k$$:

\begin{equation*}
z_k = \left\{ i; z_{ik} = 1 \right\}
\end{equation*}

We estimate $$z_k$$ by:

\begin{equation*}
\hat{z}_k = \{ i; ||x_i - \mu_k||^2 \leq ||x_i - \mu_{k'}||^2; k' \neq k \}
\end{equation*}

A point $$x_i$$ is set to be in the cluster $$k$$ if the closest centroid
(using the euclidean distance) is the centroid $$\mu_k$$ of the cluster $$k$$. This is done by the following R code:

assignmentStep = function(samplesDf, centroids) {
d = samplesDf %>% select(x, y, sample)
# Repeat the centroids once per sample, keeping their coordinates and cluster id
repCentroids = bind_rows(replicate(nrow(d), centroids, simplify = FALSE)) %>%
transmute(xCentroid = x, yCentroid = y, cluster)
# Pair every sample with every centroid, compute squared distances,
# and keep the closest centroid for each sample
d %>% slice(rep(1:n(), each = nrow(centroids))) %>%
bind_cols(repCentroids) %>%
mutate(s = (x - xCentroid)^2 + (y - yCentroid)^2) %>%
group_by(sample) %>%
top_n(1, -s) %>%
select(cluster, x, y)
}


### Update step

In the update step, the centroid of a cluster is computed by taking the
mean of the points in the cluster, as defined in the previous step.
It corresponds to the M step of the Gaussian mixture model and it is done
in R with:

updateStep = function(samplesDf) {
samplesDf %>% group_by(cluster) %>%
summarise(x = mean(x), y = mean(y))
}


### Iterations

Let’s put together the steps defined above in a loop to complete the algorithm.
We define a maximum number of iterations maxIter and iterate over the two steps
until either convergence or maxIter is reached. It converges if the centroids
are the same in two consecutive iterations:

maxIter = 10
d = data.frame(sample=c(), cluster=c(), x=c(), y=c(), step=c())
dCentroids = data.frame(cluster=c(), x=c(), y=c(), step=c())
for (i in 1:maxIter) {
df = assignmentStep(dataset, centroids)
# Update the centroids from the new assignment, keeping the same column order
updatedCentroids = updateStep(df) %>% select(x, y, cluster)
if (all(updatedCentroids == centroids)) {
break
}
centroids = updatedCentroids
d = bind_rows(d, df %>% mutate(step=i))
dCentroids = bind_rows(dCentroids, centroids %>% mutate(step=i))
}


The above R code constructs two dataframes, d and dCentroids, which contain
respectively the assignments of the points and the centroids. The column step indicates
the iteration number and will be used to animate the plot.

## Plot

We are now ready to plot the data. For this, ggplot2
is used with some code specific to gganimate:

library(ggplot2)
library(gganimate)

a <- ggplot(d, aes(x = x, y = y, color=factor(cluster), shape=factor(cluster))) +
labs(color="Cluster", shape="Cluster", title="Step: {frame} / {nframes}") +
geom_point() +
geom_point(data=dCentroids, shape=10, size=5) +
transition_manual(step)
animate(a, fps=10)
anim_save("steps.gif")


The transition_manual function of gganimate animates the plot by filtering
the dataframe at each step according to the value of the column passed as a parameter (here step).
The variables frame and nframes are provided by gganimate and are used in the title;
they give the number of the current frame and the total number of frames, respectively.

The animate function takes the argument fps which stands for “frames per second”. This call
takes some time to process since it generates the animation. The animation is then stored in “steps.gif”:

## To sum up

This post gives an example of how to use gganimate to plot the intermediate results of an algorithm.
To do this, one has to:

• import gganimate;
• create a dataframe with an additional column which stores the iteration number;
• create a standard ggplot2 object;
• use the transition_manual function to specify the column used for the transition between the frames (the iteration number);
• generate the animation with animate;
• save the animation with anim_save.

We also covered the Lloyd and Forgy algorithms to optimize the k-means criterion.

Looking at the implementation of R functions is sometimes helpful.
For instance, we looked at the implementation of k-means to see that two of the
algorithms offered as arguments are in fact the same.