My Data Science Blogs

February 21, 2019

R Packages worth a look

Comprehensive, User-Friendly Toolkit for Probing Interactions (interactions)
A suite of functions for conducting and interpreting analysis of statistical interaction in regression models that was formerly part of the ‘jtools’ pa …

Quick Serialization of R Objects (qs)
Provides functions for quickly writing and reading any R object to and from disk. This package makes use of the ‘zstd’ library for compression and deco …

Simple ‘htmlwidgets’ Image Viewer with WebGL Brightness/Contrast (imageviewer)
Displays 2D-matrix data as an interactive, zoomable gray-scale image viewer, providing tools for manual data inspection. The viewer window shows cursor …

Document worth reading: “A Survey of Neuromorphic Computing and Neural Networks in Hardware”

Neuromorphic computing has come to refer to a variety of brain-inspired computers, devices, and models that contrast the pervasive von Neumann computer architecture. This biologically inspired approach has created highly connected synthetic neurons and synapses that can be used to model neuroscience theories as well as solve challenging machine learning problems. The promise of the technology is to create a brain-like ability to learn and adapt, but the technical challenges are significant, starting with an accurate neuroscience model of how the brain works, to finding materials and engineering breakthroughs to build devices to support these models, to creating a programming framework so the systems can learn, to creating applications with brain-like capabilities. In this work, we provide a comprehensive survey of the research and motivations for neuromorphic computing over its history. We begin with a 35-year review of the motivations and drivers of neuromorphic computing, then look at the major research areas of the field, which we define as neuro-inspired models, algorithms and learning approaches, hardware and devices, supporting systems, and finally applications. We conclude with a broad discussion on the major research topics that need to be addressed in the coming years to see the promise of neuromorphic computing fulfilled. The goals of this work are to provide an exhaustive review of the research conducted in neuromorphic computing since the inception of the term, and to motivate further work by illuminating gaps in the field where new research is needed. A Survey of Neuromorphic Computing and Neural Networks in Hardware

February 20, 2019

Top KDnuggets tweets, Feb 13-19: Intro to Scikit Learn: The Gold Standard of Python ML; The Essential Data Science Venn Diagram

Also: Cartoon: #MachineLearning Problems in 2118 #ValentinesDay; A must-read tutorial when you are starting your journey with #DeepLearning.

What's new on arXiv

Contrastive Variational Autoencoder Enhances Salient Features

Variational autoencoders are powerful algorithms for identifying dominant latent structure in a single dataset. In many applications, however, we are interested in modeling latent structure and variation that are enriched in a target dataset compared to some background—e.g. enriched in patients compared to the general population. Contrastive learning is a principled framework to capture such enriched variation between the target and background, but state-of-the-art contrastive methods are limited to linear models. In this paper, we introduce the contrastive variational autoencoder (cVAE), which combines the benefits of contrastive learning with the power of deep generative models. The cVAE is designed to identify and enhance salient latent features. The cVAE is trained on two related but unpaired datasets, one of which has minimal contribution from the salient latent features. The cVAE explicitly models latent features that are shared between the datasets, as well as those that are enriched in one dataset relative to the other, which allows the algorithm to isolate and enhance the salient latent features. The algorithm is straightforward to implement, has a similar run-time to the standard VAE, and is robust to noise and dataset purity. We conduct experiments across diverse types of data, including gene expression and facial images, showing that the cVAE effectively uncovers latent structure that is salient in a particular analysis.


Security-Aware Synthesis Using Delayed-Action Games

Stochastic multiplayer games (SMGs) have gained attention in the field of strategy synthesis for multi-agent reactive systems. However, standard SMGs are limited to modeling systems where all agents have full knowledge of the state of the game. In this paper, we introduce the delayed-action games (DAGs) formalism, which simulates hidden-information games (HIGs) as SMGs by eliminating hidden information through delaying a player’s actions. The elimination of hidden information enables the use of off-the-shelf SMG model checkers to implement HIGs. Furthermore, we demonstrate how a DAG can be decomposed into a number of independent subgames. Since each subgame can be explored independently, parallel computation can be utilized to reduce the model checking time, while alleviating the state space explosion problem that SMGs are notorious for. In addition, we propose a DAG-based framework for strategy synthesis and analysis. Finally, we demonstrate the applicability of the DAG-based synthesis framework on a case study of a human-on-the-loop unmanned-aerial vehicle system that may be under stealthy attack, where the proposed framework is used to formally model, analyze and synthesize security-aware strategies for the system.


Extreme Tensoring for Low-Memory Preconditioning

State-of-the-art models are now trained with billions of parameters, reaching hardware limits in terms of memory consumption. This has created a recent demand for memory-efficient optimizers. To this end, we investigate the limits and performance tradeoffs of memory-efficient adaptively preconditioned gradient methods. We propose extreme tensoring for high-dimensional stochastic optimization, showing that an optimizer needs very little memory to benefit from adaptive preconditioning. Our technique applies to arbitrary models (not necessarily with tensor-shaped parameters), and is accompanied by regret and convergence guarantees, which shed light on the tradeoffs between preconditioner quality and expressivity. On a large-scale NLP model, we reduce the optimizer memory overhead by three orders of magnitude, without degrading performance.


Learning Theory and Support Vector Machines – a primer

The main goal of statistical learning theory is to provide a fundamental framework for the problem of decision making and model construction based on sets of data. Here, we present a brief introduction to the fundamentals of statistical learning theory, in particular the difference between empirical and structural risk minimization, including one of its most prominent implementations, i.e. the Support Vector Machine.


A Tunable Loss Function for Binary Classification

We present \alpha-loss, \alpha \in [1,\infty], a tunable loss function for binary classification that bridges log-loss (\alpha=1) and 0-1 loss (\alpha = \infty). We prove that \alpha-loss has an equivalent margin-based form and is classification-calibrated, two desirable properties for a good surrogate loss function for the ideal yet intractable 0-1 loss. For logistic regression-based classification, we provide an upper bound on the difference between the empirical and expected risk for \alpha-loss by exploiting its Lipschitzianity along with recent results on the landscape features of empirical risk functions. Finally, we show that \alpha-loss with \alpha = 2 performs better than log-loss on MNIST for logistic regression.
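
To make the bridging concrete, here is a small numpy sketch of a tunable loss family with those two endpoints. The closed form used below is one common parameterization from the broader \alpha-loss literature and is an assumption here, not necessarily the exact definition used in this paper.

```python
import numpy as np

def alpha_loss(p_correct, alpha):
    """Tunable loss on the probability assigned to the true label.

    Assumed parameterization (from the wider alpha-loss literature, not copied
    from this paper): for alpha > 1,
        l_alpha(p) = alpha/(alpha-1) * (1 - p**((alpha-1)/alpha)),
    with the limit alpha -> 1 giving log-loss and alpha -> infinity giving
    1 - p, a soft stand-in for the 0-1 loss.
    """
    p = np.asarray(p_correct, dtype=float)
    if np.isclose(alpha, 1.0):
        return -np.log(p)                      # log-loss limit
    if np.isinf(alpha):
        return 1.0 - p                         # 0-1-like limit
    return alpha / (alpha - 1.0) * (1.0 - p ** ((alpha - 1.0) / alpha))

p = np.array([0.9, 0.6, 0.2])                  # predicted prob. of the true class
for a in [1.0, 2.0, np.inf]:
    print(a, alpha_loss(p, a).round(3))
```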


Weighted Tensor Completion for Time-Series Causal Information

Marginal Structural Models (MSM)~\cite{Robins00} are the most popular models for causal inference from time-series observational data. However, they have two main drawbacks: (a) they do not capture subject heterogeneity, and (b) they only consider fixed time intervals and do not scale gracefully with longer intervals. In this work, we propose a new family of MSMs to address these two concerns. We model the potential outcomes as a three-dimensional tensor of low rank, where the three dimensions correspond to the agents, time periods and the set of possible histories. Unlike the traditional MSM, we allow the dimensions of the tensor to increase with the number of agents and time periods. We set up a weighted tensor completion problem as our estimation procedure, and show that the solution to this problem converges to the true model in an appropriate sense. Then we show how to solve the estimation problem, providing conditions under which we can approximately and efficiently solve the estimation problem. Finally we propose an algorithm based on projected gradient descent, which is easy to implement, and evaluate its performance on a simulated dataset.


Minimax rates in outlier-robust estimation of discrete models

We consider the problem of estimating the probability distribution of a discrete random variable in the setting where the observations are corrupted by outliers. Assuming that the discrete variable takes k values, the unknown parameter p is a k-dimensional vector belonging to the probability simplex. We first describe various settings of contamination and discuss the relation between these settings. We then establish minimax rates when the quality of estimation is measured by the total-variation distance, the Hellinger distance, or the L2-distance between two probability measures. Our analysis reveals that the minimax rates associated to these three distances are all different, but they are all attained by the maximum likelihood estimator. Note that the latter is efficiently computable even when the dimension is large. Some numerical experiments illustrating our theoretical findings are reported.


Learning Generative Models of Structured Signals from Their Superposition Using GANs with Application to Denoising and Demixing

Recently, Generative Adversarial Networks (GANs) have emerged as a popular alternative for modeling complex high dimensional distributions. Most of the existing works implicitly assume that the clean samples from the target distribution are easily available. However, in many applications, this assumption is violated. In this paper, we consider the observation setting where the samples from the target distribution are given by the superposition of two structured components, and we leverage GANs for learning the structure of the components. We propose two novel frameworks: denoising-GAN and demixing-GAN. The denoising-GAN assumes access to clean samples from the second component and tries to learn the other distribution, whereas the demixing-GAN learns the distribution of both components at the same time. Through extensive numerical experiments, we demonstrate that the proposed frameworks can generate clean samples from unknown distributions, and provide competitive performance in tasks such as denoising, demixing, and compressive sensing.


Towards moderate overparameterization: global convergence guarantees for training shallow neural networks

Many modern neural network architectures are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels including random noise. However, given the highly nonconvex nature of the training landscape it is not clear what level and kind of overparameterization is required for first order methods to converge to a global optimum that perfectly interpolates any labels. A number of recent theoretical works have shown that for very wide neural networks where the number of hidden units is polynomially large in the size of the training data, gradient descent starting from a random initialization does indeed converge to a global optimum. However, in practice much more moderate levels of overparameterization seem to be sufficient, and in many cases overparameterized models seem to perfectly interpolate the training data as soon as the number of parameters exceeds the size of the training data by a constant factor. Thus there is a huge gap between the existing theoretical literature and practical experiments. In this paper we take a step towards closing this gap. Focusing on shallow neural nets and smooth activations, we show that (stochastic) gradient descent when initialized at random converges at a geometric rate to a nearby global optimum as soon as the square-root of the number of network parameters exceeds the size of the training data. Our results also benefit from a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs).


High dimensionality: The latest challenge to data analysis

The advent of modern technology, permitting the measurement of thousands of characteristics simultaneously, has given rise to floods of data characterized by many large or even huge datasets. This new paradigm presents extraordinary challenges to data analysis and the question arises: how can conventional data analysis methods, devised for moderate or small datasets, cope with the complexities of modern data? The case of high dimensional data is particularly revealing of some of the drawbacks. We look at the case where the number of characteristics measured in an object is at least as large as the number of observed objects and conclude that this configuration leads to geometrical and mathematical oddities and is an insurmountable barrier for the direct application of traditional methodologies. If scientists ignore the fundamental mathematical results arrived at in this paper and blindly use software to analyze data, the results of their analyses may not be trustworthy, and the findings of their experiments may never be validated. That is why new methods, together with the wise use of traditional approaches, are essential to progress safely through the present reality.


Identity Crisis: Memorization and Generalization under Extreme Overparameterization

We study the interplay between memorization and generalization of overparametrized networks in the extreme case of a single training example. The learning task is to predict an output which is as similar as possible to the input. We examine both fully-connected and convolutional networks that are initialized randomly and then trained to minimize the reconstruction error. The trained networks take one of the two forms: the constant function (‘memorization’) and the identity function (‘generalization’). We show that different architectures exhibit vastly different inductive bias towards memorization and generalization. An important consequence of our study is that even in extreme cases of overparameterization, deep learning can result in proper generalization.


Differential Description Length for Hyperparameter Selection in Machine Learning

This paper introduces a new method for model selection and more generally hyperparameter selection in machine learning. The paper first proves a relationship between generalization error and a difference of description lengths of the training data; we call this difference differential description length (DDL). This allows prediction of generalization error from the training data \emph{alone} by performing encoding of the training data. This can now be used for model selection by choosing the model that has the smallest predicted generalization error. We show how this encoding can be done for linear regression and neural networks. We provide experiments showing that this leads to smaller generalization error than cross-validation and traditional MDL and Bayes methods.


Neural network models and deep learning – a primer for biologists

Originally inspired by neurobiology, deep neural network models have become a powerful tool of machine learning and artificial intelligence, where they are used to approximate functions and dynamics by learning from examples. Here we give a brief introduction to neural network models and deep learning for biologists. We introduce feedforward and recurrent networks and explain the expressive power of this modeling framework and the backpropagation algorithm for setting the parameters. Finally, we consider how deep neural networks might help us understand the brain’s computations.


Statistical Failure Mechanism Analysis of Earthquakes Revealing Time Relationships

If we assume that earthquakes are chaotic and influenced locally, then chaos theory suggests that there should be a temporal association between earthquakes in a local region that should be revealed with statistical examination. To date no strong relationship has been shown (refs not prediction). However, earthquakes are basically failures of structured material systems, and when multiple failure mechanisms are present, prediction of failure is strongly inhibited without first separating the mechanisms. Here we show that by separating earthquakes statistically, based on their central tensor moment structure, along lines first suggested by a separation into mechanisms according to depth of the earthquake, a strong indication of temporal association appears. We show this in earthquakes above 200 km along the Pacific Ring of Fire, with a positive association in time between earthquakes of the same statistical type and a negative association in time between earthquakes of different types. Whether this can reveal useful mechanistic information to seismologists, or can result in useful forecasts, remains to be seen.


Learning and Generalization for Matching Problems

We study a classic algorithmic problem through the lens of statistical learning. That is, we consider a matching problem where the input graph is sampled from some distribution. This distribution is unknown to the algorithm; however, an additional graph which is sampled from the same distribution is given during a training phase (preprocessing). More specifically, the algorithmic problem is to match k out of n items that arrive online to d categories (d\ll k \ll n). Our goal is to design a two-stage online algorithm that retains a small subset of items in the first stage which contains an offline matching of maximum weight. We then compute this optimal matching in a second stage. The added statistical component is that before the online matching process begins, our algorithms learn from a training set consisting of another matching instance drawn from the same unknown distribution. Using this training set, we learn a policy that we apply during the online matching process. We consider a class of online policies that we term \emph{thresholds policies}. For this class, we derive uniform convergence results both for the number of retained items and the value of the optimal matching. We show that the number of retained items and the value of the offline optimal matching deviate from their expectation by O(\sqrt{k}). This requires usage of less-standard concentration inequalities (standard ones give deviations of O(\sqrt{n})). Furthermore, we design an algorithm that outputs the optimal offline solution with high probability while retaining only O(k\log \log n) items in expectation.


Distributed Online Linear Regression

We study online linear regression problems in a distributed setting, where the data is spread over a network. In each round, each network node proposes a linear predictor, with the objective of fitting the \emph{network-wide} data. It then updates its predictor for the next round according to the received local feedback and information received from neighboring nodes. The predictions made at a given node are assessed through the notion of regret, defined as the difference between their cumulative network-wide square errors and those of the best off-line network-wide linear predictor. Various scenarios are investigated, depending on the nature of the local feedback (full information or bandit feedback), on the set of available predictors (the decision set), and the way data is generated (by an oblivious or adaptive adversary). We propose simple and natural distributed regression algorithms, involving, at each node and in each round, a local gradient descent step and a communication and averaging step where nodes aim at aligning their predictors to those of their neighbors. We establish regret upper bounds typically in {\cal O}(T^{3/4}) when the decision set is unbounded and in {\cal O}(\sqrt{T}) in case of bounded decision set.
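
As a rough illustration of the kind of algorithm described (a local gradient step followed by a neighbor-averaging step), here is a minimal numpy sketch. The ring topology, step sizes, and noise model are made up for the example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 nodes on a ring graph, one shared true weight vector.
n_nodes, d, T = 4, 5, 2000
neighbors = {i: [(i - 1) % n_nodes, (i + 1) % n_nodes] for i in range(n_nodes)}
w_true = rng.normal(size=d)
W = np.zeros((n_nodes, d))          # one linear predictor per node
eta = 0.05                          # base step size (tuned/decayed differently in the paper)

for t in range(T):
    # Local gradient step on each node's own observation (full-information feedback).
    for i in range(n_nodes):
        x = rng.normal(size=d)
        y = x @ w_true + 0.1 * rng.normal()
        grad = 2 * (W[i] @ x - y) * x                  # gradient of the square error
        W[i] -= eta / np.sqrt(t + 1) * grad
    # Communication/averaging step: align each predictor with its neighbors'.
    W = np.stack([(W[i] + W[neighbors[i]].sum(axis=0)) / 3 for i in range(n_nodes)])

print(np.linalg.norm(W - w_true, axis=1))              # per-node distance to the target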


On the Expressive Power of Kernel Methods and the Efficiency of Kernel Learning by Association Schemes

We study the expressive power of kernel methods and the algorithmic feasibility of multiple kernel learning for a special rich class of kernels. Specifically, we define \emph{Euclidean kernels}, a diverse class that includes most, if not all, families of kernels studied in literature such as polynomial kernels and radial basis functions. We then describe the geometric and spectral structure of this family of kernels over the hypercube (and to some extent for any compact domain). Our structural results allow us to prove meaningful limitations on the expressive power of the class as well as derive several efficient algorithms for learning kernels over different domains.


SECTOR: A Neural Model for Coherent Topic Segmentation and Classification

When searching for information, a human reader first glances over a document, spots relevant sections and then focuses on a few sentences for resolving her intention. However, the high variance of document structure makes it difficult to identify the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available dataset with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a highest score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR LSTM model with bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 compared to state-of-the-art CNN classifiers with baseline segmentation.


The Odds are Odd: A Statistical Test for Detecting Adversarial Examples

We investigate conditions under which test statistics exist that can reliably detect examples that have been adversarially manipulated in a white-box attack. These statistics can be easily computed and calibrated by randomly corrupting inputs. They exploit certain anomalies that adversarial attacks introduce, in particular if the attacks follow the paradigm of choosing perturbations optimally under p-norm constraints. Access to the log-odds is the only requirement to defend models. We justify our approach empirically, but also provide conditions under which detectability via the suggested test statistics is guaranteed to be effective. In our experiments, we show that it is even possible to correct test time predictions for adversarial attacks with high accuracy.
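
A minimal sketch of the general "perturb the input and watch the log-odds" idea follows; the `logits` callable, the noise scale, and the calibration rule are all placeholders, and the paper's actual statistic and calibration are more involved than this.

```python
import numpy as np

def log_odds_shift(x, logits, sigma=0.05, n_samples=64, seed=0):
    """Average shift of the top-vs-runner-up log-odds under random input noise.

    `logits` is any callable mapping an input array to a 1-D vector of class
    logits (a placeholder here).
    """
    rng = np.random.default_rng(seed)
    z = logits(x)
    order = np.argsort(z)
    top, runner_up = order[-1], order[-2]
    clean_margin = z[top] - z[runner_up]
    shifts = []
    for _ in range(n_samples):
        z_noisy = logits(x + sigma * rng.normal(size=x.shape))
        shifts.append(z_noisy[top] - z_noisy[runner_up])
    return float(np.mean(shifts) - clean_margin)

# Calibration idea: compute this statistic on known-clean inputs, then flag test
# inputs whose value falls outside, say, the clean 1st-99th percentile range.
```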


Relative rationality: Is machine rationality subjective?

Rational decision making in its linguistic description means making logical decisions. In essence, a rational agent optimally processes all relevant information to achieve its goal. Rationality has two elements: the use of relevant information and the efficient processing of such information. In reality, relevant information is incomplete and imperfect, and the processing engine, which is a brain for humans, is suboptimal. Humans are risk averse rather than utility maximizers. In the real world, problems are predominantly non-convex and this makes the idea of rational decision-making fundamentally unachievable; Herbert Simon called this bounded rationality. There is a trade-off between the amount of information used for decision-making and the complexity of the decision model used. This work explores whether machine rationality is subjective and concludes that indeed it is.


Dynamic Non-Diagonal Regularization in Interior Point Methods for Linear and Convex Quadratic Programming

In this paper, we present a dynamic non-diagonal regularization for interior point methods. The non-diagonal aspect of this regularization is implicit, since all the off-diagonal elements of the regularization matrices are cancelled out by those elements present in the Newton system, which do not contribute important information in the computation of the Newton direction. Such a regularization has multiple goals. The obvious one is to improve the spectral properties of the Newton system solved at each iteration of the interior point method. On the other hand, the regularization matrices introduce sparsity to the aforementioned linear system, allowing for more efficient factorizations. We also propose a rule for tuning the regularization dynamically based on the properties of the problem, such that sufficiently large eigenvalues of the non-regularized system are perturbed insignificantly. This alleviates the need of finding specific regularization values through experimentation, which is the most common approach in literature. We provide perturbation bounds for the eigenvalues of the non-regularized system matrix and then discuss the spectral properties of the regularized matrix. Finally, we demonstrate the efficiency of the method applied to solve standard small and medium-scale linear and convex quadratic programming test problems.


Classifying Signals on Irregular Domains via Convolutional Cluster Pooling

We present a novel and hierarchical approach for supervised classification of signals spanning over a fixed graph, reflecting shared properties of the dataset. To this end, we introduce a Convolutional Cluster Pooling layer exploiting a multi-scale clustering in order to highlight, at different resolutions, locally connected regions on the input graph. Our proposal generalises well-established neural models such as Convolutional Neural Networks (CNNs) on irregular and complex domains, by means of the exploitation of the weight sharing property in a graph-oriented architecture. In this work, such property is based on the centrality of each vertex within its soft-assigned cluster. Extensive experiments on NTU RGB+D, CIFAR-10 and 20NEWS demonstrate the effectiveness of the proposed technique in capturing both local and global patterns in graph-structured data out of different domains.


A Survey on Session-based Recommender Systems

Session-based recommender systems (SBRS) are an emerging topic in the recommendation domain and have attracted much attention from both academia and industry in recent years. Most existing works only model the general item-level dependency for recommendation tasks. However, there are many other challenges at different levels, e.g., item feature level and session level, and from various perspectives, e.g., item heterogeneity and intra- and inter-item feature coupling relations, associated with SBRS. In this paper, we provide a systematic and comprehensive review of SBRS and create a hierarchical and in-depth understanding of a variety of challenges in SBRS. To be specific, we first illustrate the value and significance of SBRS, followed by a hierarchical framework to categorize the related research issues and methods of SBRS and to reveal its intrinsic challenges and complexities. Further, a summary together with a detailed introduction of the research progress is provided. Lastly, we share some prospects in this research area.


Federated Machine Learning: Concept and Applications

Today’s AI still faces two major challenges. One is that in most industries, data exists in the form of isolated islands. The other is the strengthening of data privacy and security. We propose a possible solution to these challenges: secure federated learning. Beyond the federated learning framework first proposed by Google in 2016, we introduce a comprehensive secure federated learning framework, which includes horizontal federated learning, vertical federated learning and federated transfer learning. We provide definitions, architectures and applications for the federated learning framework, and provide a comprehensive survey of existing works on this subject. In addition, we propose building data networks among organizations based on federated mechanisms as an effective solution to allow knowledge to be shared without compromising user privacy.
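
For readers new to the topic, below is a minimal numpy sketch of plain federated averaging in the spirit of the 2016 framework the abstract references; it is not the secure, vertical, or transfer variants proposed in the paper, and the three-client regression task is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy horizontal federated setting: 3 clients hold different rows of the same feature space.
d, clients = 3, []
w_true = np.array([1.0, -2.0, 0.5])
for _ in range(3):
    X = rng.normal(size=(200, d))
    y = X @ w_true + 0.1 * rng.normal(size=200)
    clients.append((X, y))

w_global = np.zeros(d)
for rnd in range(50):                         # communication rounds
    updates, sizes = [], []
    for X, y in clients:
        w = w_global.copy()
        for _ in range(5):                    # local gradient steps on private data
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= 0.05 * grad
        updates.append(w)
        sizes.append(len(y))
    # Server aggregates parameters only; raw data never leaves a client.
    w_global = np.average(updates, axis=0, weights=sizes)

print(w_global.round(3))                      # approaches w_true
```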


Learning to Select Knowledge for Response Generation in Dialog Systems

Generating informative responses in end-to-end neural dialogue systems has attracted a lot of attention in recent years. Various previous works leverage external knowledge and the dialogue context to generate such responses. Nevertheless, few have demonstrated the capability to incorporate the appropriate knowledge in response generation. Motivated by this, we propose a novel open-domain conversation generation model in this paper, which employs the posterior knowledge distribution to guide knowledge selection, therefore generating more appropriate and informative responses in conversations. To the best of our knowledge, we are the first to utilize the posterior knowledge distribution to facilitate conversation generation. Our experiments on both automatic and human evaluation clearly verify the superior performance of our model over the state-of-the-art baselines.


Variance-Preserving Initialization Schemes Improve Deep Network Training: But Which Variance is Preserved?

Before training a neural net, a classic rule of thumb is to randomly initialize the weights so that the variance of the preactivation is preserved across all layers. This is traditionally interpreted using the total variance due to randomness in both networks (weights) and samples. Alternatively, one can interpret the rule of thumb as preservation of the \emph{sample} mean and variance for a fixed network, i.e., preactivation statistics computed over the random sample of training samples. The two interpretations differ little for a shallow net, but the difference is shown to be large for a deep ReLU net by decomposing the total variance into the network-averaged sum of the sample variance and square of the sample mean. We demonstrate that the latter term dominates in the later layers through an analytical calculation in the limit of infinite network width, and numerical simulations for finite width. Our experimental results from training neural nets support the idea that preserving sample statistics can be better than preserving total variance. We discuss the implications for the alternative rule of thumb that a network should be initialized to be at the ‘edge of chaos.’
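
The decomposition is easy to probe numerically. The following numpy sketch (a rough check of my own, not the paper's experiment) initializes a deep ReLU net with variance-preserving He-style weights and prints, layer by layer, the network-averaged sample variance and squared sample mean of the preactivations for one fixed network.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, n_samples = 512, 30, 1000

x = rng.normal(size=(n_samples, width))
h = x
for layer in range(depth):
    # "Variance-preserving" He-style initialization for ReLU: Var(w) = 2 / fan_in.
    W = rng.normal(scale=np.sqrt(2.0 / width), size=(width, width))
    z = h @ W                                        # preactivations for this fixed network
    sample_var = z.var(axis=0).mean()                # network-averaged sample variance
    sample_mean_sq = (z.mean(axis=0) ** 2).mean()    # network-averaged squared sample mean
    if layer % 5 == 0:
        print(layer, round(float(sample_var), 3), round(float(sample_mean_sq), 3))
    h = np.maximum(z, 0.0)                           # ReLU
```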


Can We Automate Diagrammatic Reasoning?

Learning to solve diagrammatic reasoning (DR) problems can be a challenging but interesting problem for the computer vision research community. It is believed that next-generation pattern recognition applications should be able to simulate the human brain to understand and analyze reasoning about images. However, due to the lack of benchmarks for diagrammatic reasoning, the present research primarily focuses on visual reasoning that can be applied to real-world objects. In this paper, we present a diagrammatic reasoning dataset that provides a large variety of DR problems. In addition, we also propose a Knowledge-based Long Short Term Memory (KLSTM) to solve diagrammatic reasoning problems. Our proposed analysis is arguably the first work in this research area. Several state-of-the-art learning frameworks have been used to compare with the proposed KLSTM framework in the present context. Preliminary results indicate that the domain is highly related to computer vision and pattern recognition research, with several challenging avenues.



Estimation of causal CARMA random fields

We estimate model parameters of L\’evy-driven causal CARMA random fields by fitting the empirical variogram to the theoretical counterpart using a weighted least squares (WLS) approach. Subsequent to deriving asymptotic results for the variogram estimator, we show strong consistency and asymptotic normality of the parameter estimator. Furthermore, we conduct a simulation study to assess the quality of the WLS estimator for finite samples. For the simulation we utilize numerical approximation schemes based on truncation and discretization of stochastic integrals and we analyze the associated simulation errors in detail. Finally, we apply our results to real data of the cosmic microwave background.


Two-Dimensional Batch Linear Programming on the GPU

This paper presents a novel, high-performance, graphical processing unit-based algorithm for efficiently solving two-dimensional linear programs in batches. The domain of two-dimensional linear programs is particularly useful due to the prevalence of relevant geometric problems. Batch linear programming refers to solving numerous different linear programs within one operation. By solving many linear programs simultaneously and distributing workload evenly across threads, graphical processing unit utilization can be maximized. Speedups of over 22 times and 63 times are obtained against state-of-the-art graphics processing unit and CPU linear program solvers, respectively.


Wasserstein Barycenter Model Ensembling

In this paper we propose to perform model ensembling in a multiclass or a multilabel learning setting using Wasserstein (W.) barycenters. Optimal transport metrics, such as the Wasserstein distance, allow incorporating semantic side information such as word embeddings. Using W. barycenters to find the consensus between models allows us to balance confidence and semantics in finding the agreement between the models. We show applications of Wasserstein ensembling in attribute-based classification, multilabel learning and image captioning generation. These results show that the W. ensembling is a viable alternative to the basic geometric or arithmetic mean ensembling.


ATMSeer: Increasing Transparency and Controllability in Automated Machine Learning

To relieve the pain of manually selecting machine learning algorithms and tuning hyperparameters, automated machine learning (AutoML) methods have been developed to automatically search for good models. Due to the huge model search space, it is impossible to try all models. Users tend to distrust automatic results and increase the search budget as much as they can, thereby undermining the efficiency of AutoML. To address these issues, we design and implement ATMSeer, an interactive visualization tool that supports users in refining the search space of AutoML and analyzing the results. To guide the design of ATMSeer, we derive a workflow of using AutoML based on interviews with machine learning experts. A multi-granularity visualization is proposed to enable users to monitor the AutoML process, analyze the searched models, and refine the search space in real time. We demonstrate the utility and usability of ATMSeer through two case studies, expert interviews, and a user study with 13 end users.

Book Memo: “Keras to Kubernetes”

The Journey of a Machine Learning Model to Production
We have seen an exponential growth in the use of Artificial Intelligence (AI) over the last few years. AI is becoming the new electricity and is touching every industry from retail to manufacturing to healthcare to entertainment. Within AI, we’re seeing a particular growth in Machine Learning (ML) and Deep Learning (DL) applications. ML is all about learning relationships from labeled (Supervised) or unlabeled data (Unsupervised). DL has many layers of learning and can extract patterns from unstructured data like images, video, audio, etc. Machine Learning with Keras and Kubernetes takes you through real-world examples of building a Keras model for detecting logos in images. You will then take that trained model and package it as a web application container before learning how to deploy this model at scale on a Kubernetes cluster.
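
As a flavor of the starting point of that journey, here is a generic minimal Keras CNN for a binary "logo vs. no logo" classifier; it is not the book's code, and the 64x64 input size and layer sizes are arbitrary choices for illustration.

```python
import tensorflow as tf

# Minimal sketch of the kind of model the book builds (not its actual code):
# a small CNN that classifies 64x64 RGB images as "logo" vs. "no logo".
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(train_images, train_labels, epochs=5)   # with your own labeled images
model.save("logo_model.h5")   # the saved model is what gets wrapped in a web app container
```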

Three surveys of AI adoption reveal key advice from more mature practices

An overview of emerging trends, known hurdles, and best practices in artificial intelligence.

Recently, O’Reilly Media published AI Adoption in the Enterprise: How Companies Are Planning and Prioritizing AI Projects in Practice, a report based on an industry survey. That was the third of three industry surveys conducted in 2018 to probe trends in artificial intelligence (AI), big data, and cloud adoption. The other two surveys were The State of Machine Learning Adoption in the Enterprise, released in July 2018, and Evolving Data Infrastructure, released in January 2019.

This article looks at those results in further detail, comparing high-level themes based on the three reports, plus related presentations at the Strata Data Conference and the AI Conference. These points would have been out of scope for any of the individual reports.

Exploring new markets by repurposing AI applications

Looking across industry sectors in AI Adoption in the Enterprise, we see how technology, health care, and retail tend to be the leaders in AI adoption, whereas the public sector (government) tends to be the laggards, along with education and manufacturing. Although that gap could be taken as commentary about the need for “data for social good,” it also points toward opportunities. Consider this: finance has enjoyed first-mover advantages in artificial intelligence adoption, as have the technology and retail sectors. After having matured in these practices, now we see financial services firms exploring opportunities that just a few years ago might have been considered niches. For example, at our recent AI Conference in London, two talks—Ashok Srivastava of Intuit and Johnny Ball of Fluidy—presented business applications for AI aimed at establishing safety nets for small businesses. Both teams applied anomaly detection techniques (for example, reused from aircraft engine monitoring) to spot when small businesses were likely to fail. That’s important since more than 50% of small businesses fail, mostly due to exactly those “anomalies”: cash flow problems and late payments.

Given how government and education trail as laggards in the AI space, could similar kinds of technology reuse apply there? For example, within the past few years, it’s become common practice in U.S. grade schools for teachers to provide detailed information online to parents about student assignments and grades. This data can be extremely helpful as early warning signals for at-risk students who might be failing school—although, quite frankly, few working parents can afford the time to track that much data. Moreover, few schools have resources to act on that data in aggregate. Even so, the anomaly detection used in small business cash-flow analysis is strikingly similar to what a homework “safety net” for students would need. Undoubtedly, there are areas within government (especially at the local level) where similar AI applications could lead to considerable public upside, which would otherwise be understaffed due to budget restraints. As the enterprise adoption of AI continues to mature, we can hope that diffusion from the leaders to the laggards comes through similarly innovative acts of technology repurposing. The trick seems to be finding enough people with depth in both technical and business skills who can recognize business use cases for AI.

Differentiated tooling

Looking at the “Tools for Building AI Applications” section of AI Adoption in the Enterprise for trends about technology adoption, we see how frameworks such as Spark NLP, scikit-learn, and H2O hold popularity in finance, whereas Google Cloud ML Engine gets higher share within the health care industry. Compared with analysis last year, both Keras and PyTorch have picked up significant gains over the category leader TensorFlow. Also, while there has been debate in the industry about the relative merits of using Jupyter Notebooks in production, usage has been growing dramatically. We see from this survey’s results that support for notebooks (23%) now leads over support for IDEs (17%).

The summary results about health care and life sciences create an interesting picture. 70 percent of all respondents from the health sector are using AI for R&D projects. Respondents from the health care sector also had significantly less trouble identifying appropriate use cases for AI, although hurdles for the sector seem to come later in the AI production lifecycle. In general, health care leads other verticals in how it checks for a broad range of AI-related risks, and this vertical makes more use of data visualization than others, as would be expected. It’s also gaining in use of reinforcement learning, which was not expected. Although we know of reinforcement learning production use cases in finance, we don’t have optics into how reinforcement learning is used in health care. That could be a good topic for a subsequent survey.

Advice from the leaders

Admittedly, the survey for AI Adoption in the Enterprise drew from the initiated: 81% of respondents work for organizations that already use AI. We have much to learn from their collective experiences. For example, there’s a story unfolding in the contrast between mature practices and firms that are earlier in their journey toward AI adoption. Some of the key advice emerging from the mature organizations includes:

  • Work toward overcoming challenges related to company culture or not being able to recognize the business use cases.
  • Be mindful that the lack of data and lack of skilled people will pose ongoing challenges.
  • While hiring data scientists, complement by also hiring people who can identify business use cases for AI solutions.
  • Beyond just optimizing for business metrics, also check for model transparency and interpretability, fairness and bias, and that your AI systems are reliable and safe.
  • Explore use cases beyond deep learning: other solutions have gained significant traction, including human-in-the-loop, knowledge graphs, and reinforcement learning.
  • Look for value in applications of transfer learning, which is a nuanced technique the more advanced organizations recognize.
  • Your organization probably needs to invest more in infrastructure engineering than it thinks, perpetually.

This is a story about the relative mix of priorities as a team gains experience. That experience is often gained by learning from early mistakes. In other words, there’s quite a long list of potential issues and concerns that an organization might consider at the outset of AI adoption in enterprise. However, “Go invest in everything, all at once” is not much of a strategy. Advice from leaders at the more sophisticated AI practices tends to be: “Here are the N things we tried early and have learned not to prioritize as much.” We hope that these surveys offer helpful guidance that other organizations can follow.

This is also a story about how to pace investments and sequence large initiatives effectively. For example, you must address the more foundational pain points early—such as problems with company culture, or the lack of enough personnel who can identify the business uses—or those will become blockers for other AI initiatives down the road. Meanwhile, some investments must be ongoing, such as hiring appropriate talent and working to improve data sets. As an executive, don’t assume that one-shot initiatives will work as a panacea. These are ongoing challenges and you must budget for them as such.

Speaking of budget, firms are clearly taking the matter of AI adoption seriously, allocating significant amounts of their IT budgets for AI-related projects. Even if your firm isn’t, you can pretty much bet that the competition will be. Which side of that bet will pay off?

Heading toward a threshold point

Another issue emerged from the surveys that concerns messaging about AutoML. Adoption percentages for AutoML had been in single-digit territory in our earlier survey just two quarters ago. Now, we see many organizations making serious budget allocations toward integrating AutoML over the course of the next year. This is especially poignant for the more mature practices: 86% will be integrating AutoML within the next year, nearly two times that of the evaluation stage firms. That shift is timed almost precisely as cloud providers extend their AutoML offerings. For example, this was an important theme emphasized at Amazon’s recent re:Invent conference in Las Vegas. Both sides, demand and supply, are rolling the dice on AutoML in a big way.

Even so, there’s a risk that less-informed executives might interpret the growing uptake of AutoML as a signal that “AI capabilities are readily available off-the-shelf.” That’s anything but the case at hand. The process of leveraging AI capabilities, even within the AutoML category, depends on multi-year transformations for organizations. That effort requires substantial capital investments and typically an extensive evolution of mindshare by the leadership. It’s not an impulse buy. Another important point to remember is that AutoML is only one portion of the automation that's needed. See the recent Data Show Podcast interview “Building tools for enterprise data science” with Vitaly Gordon, VP of data science and engineering at Salesforce, about their TransmogrifAI open source project for machine learning workflows. It's clear that automating the model building and model search step—the AutoML part—is just one piece of the puzzle.

We’ve also known—since studies published in 2017 plus the analysis that followed—that a “digital divide” is growing in enterprise between the leaders and the laggards in AI adoption. See the excellent “Notes from the frontier: Making AI work,” by Michael Chui at McKinsey Global Institute, plus the related report, AI adoption advances, but foundational barriers remain. What we observe now in Q4 2018 and Q1 2019 is how the mature practices are investing significantly, and based on lessons learned, they’re investing more wisely. However, most of the laggards aren’t even beginning to invest in crucial transformations that will require years. We cannot overstress how this demonstrates a growing divide between “haves” and “have nots” among enterprise organizations. At some threshold point relatively soon, the “have nots” might simply fall too many years behind their competitors to be worth the investments that will be needed to catch up.

Data Science Survey

Gurobi would like to tap into your expertise in the field of Data Science and Analytics, and invites you to participate in their Data Science Survey. Everyone who completes this 10-minute survey can choose to be entered into a drawing to receive one of five $100 Amazon gift cards.

Getting Started With rquery

To make getting started with rquery (an advanced query generator for R) easier we have re-worked the package README for various data-sources (including SparkR!).

Here are our current examples:

For MonetDBLite, the query diagrammer shows a repeated calculation that we decided was best to leave in.

And the RSQLite diagram shows the consequences of replacing window functions with joins.

U. of Cincinnati Analytics Summit 2019, April 1-3

Analytics Summit 2019 focuses on analytics and data science content to support the growth and development of analytics efforts in business, government and non-profit organizations.

Distilled News

Deploying an R Shiny App With Docker

If you haven’t heard of Docker, it is a system that allows projects to be split into discrete units (i.e. containers) that each operate within their own virtual environment. Each container has a blueprint written in its Dockerfile that describes all of the operating parameters, including operating system and package dependencies/requirements. Docker images are easily distributed and, because they are self-contained, will operate on any other system that has Docker installed, including servers. When multiple instances/users attempt to start a Shiny App at the same time, only a single R session is initiated on the serving machine. This is problematic. For example, if one user starts a process that takes 10 seconds to complete, all other users will need to wait until that process has completed before any other tasks can be processed.


IBM Takes Watson AI to AWS, Google, Azure

IBM is leveraging Kubernetes to enable its Watson AI to run on public clouds AWS, Google, and Microsoft Azure. The move signals a shift in strategy for IBM.


The Danger of Artificial Intelligence in Recruiting (and 3 Suggestions)

I recently came across one of the most well-intended, and most unnerving, applications of AI in recruiting: a talking robot head pitched as a solution to avoid bias in interviewing. Picture a robot the size of an Alexa with an actual human face painted on top. The face changes, tries to show expression and non-verbal cues. It wasn’t a joke either. The face was meant to be there. Think interviewing with Chucky if you need to visualize this.


Should I Open-Source My Model?

I have worked on the problem of open-sourcing Machine Learning versus sensitivity for a long time, especially in disaster response contexts: when is it right/wrong to release data or a model publicly? This article is a list of frequently asked questions, the answers that are best practice today, and some examples of where I have encountered them.


Automatic Classification of an online Fashion Catalogue: The Simple Way

I have been working with Tensorflow during the last few months and I realized that, although there is a large number of Github repositories with many different and complex models, it is hard to find a simple example that shows you how to obtain your own dataset from the web and apply some Deep Learning on it. In this post I aim to provide an example of this task while keeping it as simple as possible. I will show you how to obtain online unlabeled data, how to create a simple convolutional network, train it with some supervised data and use it later to classify the data we have gathered from the web.


Machine Learning: Regularization and Over-fitting Simply Explained

I am going to give an intuitive understanding of the Regularization method in as simple words as possible. Firstly, I will discuss some basic ideas, so if you think you are already familiar with those, feel free to move ahead.


Try out RStudio Connect on Your Desktop for Free

Have you heard of RStudio Connect, but do not know where to start? Maybe you are trying to show your manager how Shiny applications can be deployed in production, or convince a DevOps engineer that R can fit into her existing tooling. Perhaps you want to explore the functionality of RStudio’s Professional products to see if they fit the needs you have in your work. Today, we are excited to announce the RStudio QuickStart, which allows you to try out RStudio Connect for free from your desktop.


Deep Multi-Task Learning – 3 Lessons Learned

For the past year, my team and I have been working on a personalized user experience in the Taboola feed. We used Multi-Task Learning (MTL) to predict multiple Key Performance Indicators (KPIs) on the same set of input features, and implemented a Deep Learning (DL) model in TensorFlow to do so. Back when we started, MTL seemed way more complicated to us than it does now, so I wanted to share some of the lessons learned.
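
A generic hard-parameter-sharing setup in Keras looks roughly like the sketch below; it is not Taboola's model, and the feature size, KPI heads, and loss weights are invented for illustration.

```python
import tensorflow as tf

# Generic hard-parameter-sharing MTL sketch: one shared trunk over the input
# features, one head per KPI, trained with a joint weighted loss.
inputs = tf.keras.Input(shape=(128,), name="features")        # made-up feature size
shared = tf.keras.layers.Dense(256, activation="relu")(inputs)
shared = tf.keras.layers.Dense(128, activation="relu")(shared)

ctr_head = tf.keras.layers.Dense(1, activation="sigmoid", name="ctr")(shared)
dwell_head = tf.keras.layers.Dense(1, name="dwell_time")(shared)

model = tf.keras.Model(inputs, [ctr_head, dwell_head])
model.compile(
    optimizer="adam",
    loss={"ctr": "binary_crossentropy", "dwell_time": "mse"},
    loss_weights={"ctr": 1.0, "dwell_time": 0.5},              # relative task weighting
)
model.summary()
```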


Learn #MachineLearning Coding Basics in a weekend – Glossary and Mindmap

For background to this post, please see Learn #MachineLearning Coding Basics in a weekend. Here, we present the glossary that we use for the coding and the mindmap attached to these classes and the upcoming book.


A Quick Guide to Feature Engineering

Feature engineering plays a key role in machine learning, data mining, and data analytics. This article provides a general definition for feature engineering, together with an overview of the major issues, approaches, and challenges of the field.


Data Science For Our Mental Development

Emotion is a fundamental element of human society. If you think about it, everything worth analyzing is influenced by human behavior. Cyber attacks are highly impacted by disgruntled employees who may either ignore due diligence or engage in insider misuse. The stock market depends on the effect of the economic climate, which itself is dependent on the aggregate behavior of the masses. In the field of communication, it is common knowledge that what we say accounts for only 7% of the message, while the remaining 93% is encoded in facial expressions and other non-verbal cues. Entire fields of psychology and behavioral economics are dedicated to this. That being said, the ability to measure and analyze emotions effectively will enable us to improve society in remarkable ways. For example, a psychology professor at the University of California, San Francisco, Paul Ekman, describes in his book, Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage, how reading facial expressions can help psychologists find signs of potential suicide attempts while the patient lies about such intentions. Sounds like a job for facial recognition models? What about neural mapping? Can we effectively map emotional states from neural impulses? What about improving cognitive abilities? Or even emotional intelligence and effective communication? There are plenty of problems in the world to solve using the vast array of unstructured data that is available to us.


Dropout on convolutional layers is weird

Dropout is commonly used to regularize deep neural networks; however, applying dropout on fully-connected layers and applying dropout on convolutional layers are fundamentally different operations. While it is known in the deep learning community that dropout has limited benefits when applied to convolutional layers, I wanted to show a simple mathematical example of why the two are different. To do so, I’ll define how dropout operates on fully-connected layers, define how dropout operates on convolutional layers and contrast the two operations.
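
The contrast is easy to see in a few lines of numpy. The sketch below applies the same element-wise (inverted) dropout to a fully-connected activation vector and to a convolutional feature map, then shows the channel-wise alternative often used for conv layers; the shapes and rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

# Fully connected: activations form a flat vector, each unit dropped independently.
fc_act = rng.normal(size=(1, 100))
fc_dropped = dropout(fc_act, p=0.5, rng=rng)

# Convolutional: activations are feature maps; element-wise dropout zeroes single
# pixels, but neighbouring pixels in the same map stay highly correlated, so much
# of the "dropped" information is still present.
conv_act = rng.normal(size=(1, 8, 8, 16))            # (batch, H, W, channels)
conv_dropped = dropout(conv_act, p=0.5, rng=rng)

# A common alternative is to drop whole feature maps instead ("spatial dropout"):
channel_mask = (rng.random((1, 1, 1, 16)) >= 0.5) / 0.5
conv_spatial = conv_act * channel_mask
```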


When Identity Becomes an Algorithm

Discussions on the interplay of humans and Artificial Intelligence tend to pose the issue in the language of opposition. However, according to the thinking of evolutionary biologist Richard Dawkins, tools such as AI can be better thought of as part of our extended phenotype. A phenotype refers to the observable characteristic of an organism, and the idea of the extended phenotype is that this should not be limited to biological processes, but include all of the effects that the genes have upon their environment, both internally and externally.


Adversarial Attacks on Deep Neural Networks: an Overview

Deep Neural Networks are highly expressive machine learning models that have been around for many decades. In 2012, with gains in computing power and improved tooling, a family of these machine learning models called ConvNets started achieving state-of-the-art performance on visual recognition tasks. Up to this point, machine learning algorithms simply didn’t work well enough for anyone to be surprised when they failed to do the right thing.


Limitations of Deep Learning in AI Research

Deep learning, a subset of machine learning, has delivered super-human accuracy in a variety of practical uses in the past decade, revolutionizing customer experience, machine translation, language recognition, autonomous vehicles, computer vision, text generation, speech understanding, and a multitude of other AI applications. In contrast to machine learning, where an AI agent learns from data based on machine learning algorithms, deep learning is based on a neural network architecture which acts similarly to the human brain, and allows the AI agent to analyze data fed in, in a structure similar to the way humans do. Deep learning models do not require algorithms to specify what to do with the data, which is made possible thanks to the extraordinary amount of data we as humans collect and consume, which in turn is fed to deep learning models.


What Is ModelOps? And Who Should Care?

Consensus is growing that model operationalization – rather than model development – is today’s biggest hurdle for data science. Production deployment techniques are generally one-offs, and data scientists and data engineers often lack the skills to operationalize models. Application integration, model monitoring and tuning, and workflow automation are often afterthoughts. Sometimes called the ‘last mile’ for analytics, this is where data science meets production IT. And it’s where business value is (or is not) created. Achieving the vision of becoming a model-driven business that deploys and iterates models at scale requires something that only a handful of companies have: ModelOps.

Continue Reading…

Collapse

Read More

Statmodeling Retro

As many of you know, this blog auto-posts on twitter. That’s cool. But we also have 15 years of old posts with lots of interesting content and discussion! So I had this idea of setting up another twitter feed, Statmodeling Retro, that would start with our very first post in 2004 and then go forward, posting one entry every 8 hours until it eventually catches up to the present. So far, this blog has exactly 9000 posts, so it would take a little over 8 years to catch up at this rate. But then if we continue at the current rate we’ll have another 6000 posts or so, which will take another 5 years to appear in the retro feed. Etc. So it will take awhile.

Maybe people don’t want to wait that long? We could program Statmodeling Retro to post every 6 hours, but then I’m worried that the frequency would be too high for people to follow.

Whaddya think?

Continue Reading…

Collapse

Read More

Word Embeddings in NLP and its Applications

Word embeddings such as those produced by Word2Vec are a key AI method that bridges the human understanding of language to that of a machine, and they are essential to solving many NLP problems. Here we discuss applications of Word2Vec to survey responses, comment analysis, recommendation engines, and more.
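
As a rough flavour of the kind of pipeline discussed, here is a minimal gensim sketch (the toy corpus and hyper-parameters are made up for illustration; parameter names follow the gensim 4.x API):

from gensim.models import Word2Vec

# Toy corpus of tokenized survey responses (illustrative only).
sentences = [
    ["service", "was", "fast", "and", "friendly"],
    ["delivery", "was", "slow", "and", "support", "unhelpful"],
    ["friendly", "support", "and", "quick", "delivery"],
    ["slow", "service", "and", "a", "long", "wait"],
]

# Train a small Word2Vec model; vector_size, window and epochs are arbitrary here.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)

# Each token now maps to a dense vector; words used in similar contexts end up close together.
print(model.wv["friendly"][:5])
print(model.wv.most_similar("slow", topn=3))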

Continue Reading…

Collapse

Read More

Monty Python vs. Bruce Springsteen (1); Julia Child advances

From Jeff:

If they meet in the semi-final the Japanese dude will eat Frank for lunch: All vs. Nothing at All.

Though it appears she also had a soft spot for hot dogs, if Julia makes it that far it would be a matchup of gourmet vs gourmand, which seems a better contest.

Today it’s an unseeded, but very funny, gang of Wits against the top seeded Person from New Jersey. What will it be: Holy Grail or Thunder Road?

Again, here’s the bracket and here are the rules.

Continue Reading…

Collapse

Read More

State of the art in AI and Machine Learning – highlights of papers with code

We introduce Papers with Code, a free and open resource of state-of-the-art Machine Learning papers, code and evaluation tables.

Continue Reading…

Collapse

Read More

Regulating data sharing is heating up

With the U.K. report on Facebook, and the stern language within it, the train on regulating data sharing may finally reach the station this year. The FTC is also likely to impose a stiff fine on Facebook for violating a consent decree.

So let's learn more about this data sharing business. If you prefer a video, the gist of this post can be heard here.

***

First, let's talk about data flows and the "cloud". Data are stored in computers that are called servers. In the cloud computing model, these servers are owned - not by the companies that collect the data - but by large tech companies like Amazon, Google, Microsoft, etc. who are responsible for managing the servers. These servers are geographically dispersed and so when data enter the cloud, they get replicated and spread to many servers. The technical benefit of such replication is recoverability of the data (allowing the use of cheaper, less reliable computers) but now, the data become much harder to delete.

Data become more telling if one combines different datasets measuring different aspects of our lives. For example, an auto insurer may have data on past claims and that data help predict your future claims. But if the auto insurer is able to get data from say an automaker about your car, e.g. how fast you drive, where you drive, etc., that data combined with past claims improve the predictive power.

Thus, a data-sharing industry has been created. Companies make agreements to share data with one another. This becomes much easier in the "cloud" as those servers are already connected to one another. These agreements may include explicit payments but even if they don't, both sides must be benefiting commercially from the arrangement, or else they would not exist.

So when company A shares data with company B, the data flow from A servers to B servers. B may also use a cloud, which then means the data would be replicated yet again, and dispersed geographically onto yet another set of servers. 

And company B may also share data with company C, etc., etc.

***

An inexplicable part of the consent decree between Facebook and the FTC is the requirement that Facebook monitor what happens to the data after they are shared with third parties. I just can't figure out how that is possible. It isn't even possible within Facebook: if a user demands that his/her data be deleted, it will be very hard to ensure that all copies of the data are deleted from every server, including data that might have landed in an analyst's computer. In fact, most analysts probably don't know how many replicates of data elements are being created during the analysis, and where those replicates exist!

***

The next question of general interest is all the different ways in which tech companies collect people's data without people realizing what's happening. In the video, I look at contact lists, personality tests, 2-factor authentication schemes, IOT devices, etc. in their roles as data collectors. 

This is the reason why the video is called "Did you betray your friend today?"


Continue Reading…

Collapse

Read More

6 Books About Open Data Every Data Scientist Should Read

Check out this collection of six books which tackle the hard skills required to make sense of the changing field known as open data and muse on the ethical implications of a digitally connected world.

Continue Reading…

Collapse

Read More

Machine Learning and RPA in Action: Email Management

We recently announced the strategic alliance between Jidoka and BigML, where we explained the integration of RPA with other technologies such as Machine Learning. With this integration, Jidoka can provide Machine Learning capabilities in their RPA process automation platform. To explain the advantages and possibilities offered by this integration, today we present a practical example […]

Continue Reading…

Collapse

Read More

Geoff Pullum, the linguist who hates Strunk and White, is speaking at Columbia this Friday afternoon

The title of the talk is Grammar, Writing Style, and Linguistics, and here’s the abstract:

Some critics seem to think that English grammar is just a brief checklist of linguistic table manners that every educated person should already know. Others see grammar as a complex, esoteric, and largely useless discipline replete with technical terms that no ordinary person needs. Which is right? Neither. The handy menu of grammar tips is a myth. Faculty often point to Strunk and White’s The Elements of Style as providing such a list, but its assertions about grammar are often flagrantly false and its rambling remarks on style are largely useless. The truth is that the books on English grammar intended for students or the general public nearly all dispense false claims and bad analyses. Yet grammar can be described in a way that makes sense. I [Pullum] offer some eye-opening facts about Strunk and White, and an antidote, plus brief illustrations of how grammar and style can be tackled in a sensible way drawing on insights from modern linguistics.

The talk is at 707 Hamilton Hall, Fri 22 Feb, 4pm.

The funny thing is, I get what Pullum is saying here, but I still kinda like Strunk and White for what it is.

Continue Reading…

Collapse

Read More

Python Data Science for Beginners

Python's syntax is very clean and short in length. Python is an open-source and portable language which supports a large standard library. But why Python for data science? Read on to find out more.

Continue Reading…

Collapse

Read More

Franchise Box Office

There's big money in wizarding worlds, galaxies far away, and various time-shifted universes. Let's take a stroll through the billions of dollars earned by franchises over the years. Read More

Continue Reading…

Collapse

Read More

KDnuggets™ News 19:n08, Feb 20: The Gold Standard of Python Machine Learning; The Analytics Engineer – new role in the data team

Intro to scikit-learn; how to set up a Python ML environment; why there should be a new role in the Data Science team; how to learn one of the hardest parts of being a Data Scientist; and how explainable is BERT?

Continue Reading…

Collapse

Read More

Four short links: 20 February 2019

Software Rewrites, Security, Mirror Worlds, and Third-Party Firmware

  1. Lessons from Six Software Rewrite Stories (Herb Caudill) -- brilliant work. Six very different stories about how companies dealt (or didn't deal) with legacy code bases and the decision to rebuild from scratch or attempt to change the tires on a rolling tire fire. (via Simon Willison)
  2. O.MG Cable -- Wi-Fi embedded in a USB cable. See the video in his tweet to learn (a little) more.
  3. Childhood's End (George Dyson) -- If enough drivers subscribe to a real-time map, traffic is controlled, with no central model except the traffic itself. The successful social network is no longer a model of the social graph; it is the social graph. This is why it is a winner-take-all game.
  4. Magic Lantern -- free third-party firmware for Canon cameras that adds some amazing features.

Continue reading Four short links: 20 February 2019.

Continue Reading…

Collapse

Read More

Announcement: eRum 2020 held in Milano!

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hello, R_users!

We are very excited to inform you that the eRum2020 (European R Users Meeting) will be held in Milan in 2020!

 

About Conference

eRum is a conference that takes place every two years in Europe, each time in a different country. It is designed to create a community of European R users and to share knowledge and passion within it.

In 2018, eRum2018 took place in Budapest, Hungary. More than 500 attendees and over 90 speakers participated, and we expect even wider participation in this edition.

 

Info and Contacts

We will let you know the dates, the location and the program of eRum2020 as soon as it is all set. Follow us on Twitter at @erum2020_conf to keep updated!

 

The Support

We will work hard to keep the registration fees as low as possible. If you want to support the success of this event, please get in touch with us and we will provide you with all the information about sponsorship opportunities.

At the moment we wish to thank Quantide, a Milano-based R training and consulting company, which is already proudly supporting this wonderful adventure.

 

We would like to thank the whole community of R, our MilanoR community and all the organizers of the previous editions.

 

Thank you so much for your attention; we are eager to meet you in Milan!

 

The post Announcement: eRum 2020 held in Milano! appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.


Continue Reading…

Collapse

Read More

Number 6174 or Kaprekar constant in R

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

The answer is not always 42, as explained in the Hitchhiker's Guide. Sometimes it is 6174.

The Kaprekar constant is one of those gems that make mathematics fun. The Indian recreational mathematician D. R. Kaprekar found that the number 6174, also known as Kaprekar's constant, is the result you always reach by repeated subtraction when following these rules:

  1. Take any four-digit number with at least two different digits (1122, 5151, 1001, 4375, and so on).
  2. Arrange the digits of the number in descending order and in ascending order to get two new numbers.
  3. Subtract the ascending (smaller) number from the descending (larger) number.
  4. Repeat steps 2 and 3 until you get the result 6174.

In practice, for example with the number 5462, the steps would be:

6542 - 2456 = 4086
8640 -  468 = 8172
8721 - 1278 = 7443
7443 - 3447 = 3996
9963 - 3699 = 6264
6642 - 2466 = 4176
7641 - 1467 = 6174

or for number 6235:

6532 - 2356 = 4176
7641 - 1467 = 6174

Depending on the starting number, the number of steps varies.

An R function for the Kaprekar routine is:

library(stringr)  # for str_extract_all() and str_sort()

kap <- function(num){
    # check that the input is a four-digit number
    if (nchar(num) == 4) {
        kaprekarConstant = 6174
        while (num != kaprekarConstant) {
          # split the number into digits, then build the descending and ascending arrangements
          # (note: an intermediate result with fewer than four digits, e.g. 999, is not zero-padded here)
          nums <- as.integer(str_extract_all(num, "[0-9]")[[1]])
          sortD <- as.integer(str_sort(nums, decreasing = TRUE))
          sortD <- as.integer(paste(sortD, collapse = ""))
          sortA <- as.integer(str_sort(nums, decreasing = FALSE))
          sortA <- as.integer(paste(sortA, collapse = ""))
          num = as.integer(sortD) - as.integer(sortA)
          r <- paste0('Pair is: ', as.integer(sortD), ' and ', as.integer(sortA), ' and result of subtraction is: ', as.integer(num))
          print(r)
         }
    } else {
      print("Number must be 4-digits")
    }
}

 

The function can be used as:

kap(5462)

and it will return all the intermediate steps until the function converges.

[1] "Pair is: 6542 and 2456 and result of subtraction is: 4086"
[1] "Pair is: 8640 and 468  and result of subtraction is: 8172"
[1] "Pair is: 8721 and 1278 and result of subtraction is: 7443"
[1] "Pair is: 7443 and 3447 and result of subtraction is: 3996"
[1] "Pair is: 9963 and 3699 and result of subtraction is: 6264"
[1] "Pair is: 6642 and 2466 and result of subtraction is: 4176"
[1] "Pair is: 7641 and 1467 and result of subtraction is: 6174"

To make things more interesting, let us compute the distribution over all valid four-digit numbers, recording the number of steps needed to reach the constant.

First, we will find the solutions for all four-digit numbers and store them in a data frame.

Create the empty dataframe:

# create an empty dataframe for the results
df_result <- data.frame(number = as.numeric(0), steps = as.numeric(0))
i = 1000      # first four-digit number
korak = 0     # "korak" is Slovenian for "step"; it counts the iterations

And then run the following loop:

# Generate the list of all 4-digit numbers and record how many steps each needs
while (i <= 9999) {
   korak = 0
   num = i
   # stop after 10 steps; numbers with all identical digits (e.g. 1111) never converge
   while ((korak <= 10) & (num != 6174)) {
      nums <- as.integer(str_extract_all(num, "[0-9]")[[1]])
      sortD <- as.integer(str_sort(nums, decreasing = TRUE))
      sortD <- as.integer(paste(sortD, collapse = ""))
      sortA <- as.integer(str_sort(nums, decreasing = FALSE))
      sortA <- as.integer(paste(sortA, collapse = ""))
      num = as.integer(sortD) - as.integer(sortA)

      korak = korak + 1
      if (num == 6174) {
         r <- paste0('Number is: ', as.integer(i), ' with steps: ', as.integer(korak))
         print(r)
         df_result <- rbind(df_result, data.frame(number = i, steps = korak))
      }
   }
   i = i + 1
}

 

Fifteen seconds later, I got a data frame with solutions for all valid four-digit numbers (valid numbers are those that comply with step 1 and converge within 10 steps).

Now we can look at the distribution, to see how the solutions are spread across the numbers. A summary of the solutions shows that, on average, 4.6 iterations (subtractions) were needed in order to arrive at the number 6174.

By counting the numbers per step, we see which step counts are most frequent:

table(df_result$steps)
hist(df_result$steps)

With some additional visual, you can see the results as well:

library(ggplot2)
library(gridExtra)

#par(mfrow=c(1,2))
p1 <- ggplot(df_result, aes(x=number,y=steps)) + 
geom_bar(stat='identity') + 
scale_y_continuous(expand = c(0, 0), limits = c(0, 8))

p2 <- ggplot(df_result, aes(x=log10(number),y=steps)) + 
geom_point(alpha = 1/50)

grid.arrange(p1, p2, ncol=2, nrow = 1)

And the graph:

A lot of numbers, roughly every 4th or 5th, converge on the third step. We would need to look into the steps of these solutions to see what the numbers have in common. This will follow, so stay tuned.

Fun fact: at the time of writing this blog post, the number 6174 was not a constant in base R. 🙂

As always, code is available at Github.

 

Happy Rrrring 🙂

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.


Continue Reading…

Collapse

Read More

R Packages worth a look

Similarity-Based Segmentation of Multidimensional Signals (segmenTier)
A dynamic programming solution to segmentation based on maximization of arbitrary similarity measures within segments. The general idea, theory and thi …

Analysis of Longitudinal Data with Irregular Observation Times (IrregLong)
Analysis of longitudinal data for which the times of observation are random variables that are potentially associated with the outcome process. The pac …

Shiny Matrix Input Field (shinyMatrix)
Implements a custom matrix input field.

Continue Reading…

Collapse

Read More

I Just Wanted The Data : Turning Tableau & Tidyverse Tears Into Smiles with Base R (An Encoding Detective Story)

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Those outside the Colonies may not know that Payless—a national chain that made footwear affordable for millions of ‘Muricans who can’t spare $100.00 USD for a pair of shoes their 7 year old will outgrow in a year— is closing. CNBC also had a story that featured a choropleth with a tiny button at the bottom that indicated one could get the data:

I should have known this would turn out to be a chore since they used Tableau—the platform of choice when you want to take advantage of all the free software libraries they use to power their premier platform which, in turn, locks up all the data for you so others can’t adopt, adapt and improve. Go. Egregious. Predatory. Capitalism.

Anyway.

I wanted the data to do some real analysis vs produce a fairly unhelpful visualization (TLDR: layer in Census data for areas impacted, estimate job losses, compute nearest similar Payless stores to see impact on transportation-challenged homes, etc. Y’now, citizen data journalism-y things) so I pressed the button and watched for the URL in Chrome (aye, for those that remember I moved to Firefox et al in 2018, I switched back; more on that in March) and copied it to try to make this post actually reproducible (a novel concept for Tableau fanbois):

library(tibble)
library(readr)
library(magrittr) # for the %>% pipe used further below

# https://www.cnbc.com/2019/02/19/heres-a-map-of-where-payless-shoesource-is-closing-2500-stores.html

tfil <- "~/Data/Sheet_3_data.csv"

download.file(
  "https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",
  tfil
)
## trying URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true'
## Error in download.file("https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",  : 
##   cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true'
## In addition: Warning message:
## In download.file("https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",  :
##   cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true': HTTP status was '410 Gone'

WAT

Truth be told I expected a time-boxed URL of some sort (prior experience FTW). Selenium or Splash were potential alternatives but I didn’t want to research the legality of more forceful scraping (I just wanted the data) so I manually downloaded the file (*the horror*) and proceeded to read it in. Well, try to read it in:

read_csv(tfil)
## Parsed with column specification:
## cols(
##   A = col_logical()
## )
## Warning: 2092 parsing failures.
## row col           expected actual                      file
##   1   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   2   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   3   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   4   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   5   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
## ... ... .................. ...... .........................
## See problems(...) for more details.
## 
## # A tibble: 2,090 x 1
##    A    
##    
##  1 NA   
##  2 NA   
##  3 NA   
##  4 NA   
##  5 NA   
##  6 NA   
##  7 NA   
##  8 NA   
##  9 NA   
## 10 NA   
## # … with 2,080 more rows

WAT

Getting a single column back from readr::read_[ct]sv() is (generally) a tell-tale sign that the file format is amiss. Before donning a deerstalker (I just wanted the data!) I tried to just use good ol’ read.csv():

read.csv(tfil, stringsAsFactors=FALSE)
## Error in make.names(col.names, unique = TRUE) : 
##   invalid multibyte string at 'A'
## In addition: Warning messages:
## 1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 1 appears to contain embedded nulls
## 2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 2 appears to contain embedded nulls
## 3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 3 appears to contain embedded nulls
## 4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 4 appears to contain embedded nulls
## 5: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 5 appears to contain embedded nulls

WAT

Actually the "WAT" isn't really warranted since read.csv() gave us some super-valuable info via invalid multibyte string at 'A'. FF FE is a big signal that we're working with a file in another encoding, as that's a common "magic" byte-order-mark sequence at the start of such files.

But, I didn’t want to delve into my Columbo persona… I. Just. Wanted. The. Data. So, I tried the mind-bendingly fast and flexible helper from data.table:

data.table::fread(tfil)
## Error in data.table::fread(tfil) : 
##   File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.

AHA. UTF-16 (maybe). Let’s poke at the raw file:

x <- readBin(tfil, "raw", file.size(tfil)) ## also: read_file_raw(tfil)

x[1:100]
##   [1] ff fe 41 00 64 00 64 00 72 00 65 00 73 00 73 00 09 00 43 00
##  [21] 69 00 74 00 79 00 09 00 43 00 6f 00 75 00 6e 00 74 00 72 00
##  [41] 79 00 09 00 49 00 6e 00 64 00 65 00 78 00 09 00 4c 00 61 00
##  [61] 62 00 65 00 6c 00 09 00 4c 00 61 00 74 00 69 00 74 00 75 00
##  [81] 64 00 65 00 09 00 4c 00 6f 00 6e 00 67 00 69 00 74 00 75 00

There's our ff fe (which is the beginning of the possibility it's UTF-16) but that 41 00 harkens back to UTF-16's older sibling UCS-2. The 0x00's are embedded nuls (likely to get bytes aligned). And, there are a lot of 09s. Y'know what they are? They're tabs. That's right. Tableau named a file full of TSV records, in an unnecessarily elaborate encoding, "CSV". Perhaps they broke the "T" on all their keyboards typing their product name so much.

Living A Boy’s [Data] Adventure Tale

At this point we have:

  • no way to support an automated, reproducible workflow
  • an ill-named file for what it contains
  • an overly-encoded file for what it contains
  • many wasted minutes (which is likely by design to have us give up and just use Tableau. No. Way.)

At this point I'm in full-on Rockford Files (pun intended) mode and delved down to the command line to use an old, trusted sidekick, enca:

$ enca -L none Sheet_3_data.csv
## Universal character set 2 bytes; UCS-2; BMP
##   LF line terminators
##   Byte order reversed in pairs (1,2 -> 2,1)

Now, all we have to do is specify the encoding!

read_tsv(tfil, locale = locale(encoding = "UCS-2LE"))
## Error in guess_header_(datasource, tokenizer, locale) : 
##   Incomplete multibyte sequence

WAT

Unlike the other 99% of the time (mebbe 99.9%) you use it, the tidyverse doesn’t have your back in this situation (but it does have your backlog in that it’s on the TODO).

Y’know who does have your back? Base R!:

read.csv(tfil, sep="\t", fileEncoding = "UCS-2LE", stringsAsFactors=FALSE) %>% 
  as_tibble()
## # A tibble: 2,089 x 14
##    Address City  Country Index Label Latitude Longitude
##                     
##  1 1627 O… Aubu… United…     1 Payl…     32.6     -85.4
##  2 900 Co… Doth… United…     2 Payl…     31.3     -85.4
##  3 301 Co… Flor… United…     3 Payl…     34.8     -87.6
##  4 304 Ox… Home… United…     4 Payl…     33.5     -86.8
##  5 2000 R… Hoov… United…     5 Payl…     33.4     -86.8
##  6 6140 U… Hunt… United…     6 Payl…     34.7     -86.7
##  7 312 Sc… Mobi… United…     7 Payl…     30.7     -88.2
##  8 3402 B… Mobi… United…     8 Payl…     30.7     -88.1
##  9 5300 H… Mobi… United…     9 Payl…     30.6     -88.2
## 10 6641 A… Mont… United…    10 Payl…     32.4     -86.2
## # … with 2,079 more rows, and 7 more variables:
## #   Number.of.Records , State , Store.Number ,
## #   Store.count , Zip.code , State.Usps ,
## #   statename 

WOOT!

Note that read.csv(tfil, sep="\t", fileEncoding = "UTF-16LE", stringsAsFactors=FALSE) would have worked equally as well.

The Road Not [Originally] Taken

Since this activity decimated productivity, for giggles I turned to another trusted R sidekick, the stringi package, to see what it said:

library(stringi)

stri_enc_detect(x)
## [[1]]
##      Encoding Language Confidence
## 1    UTF-16LE                1.00
## 2  ISO-8859-1       pt       0.61
## 3  ISO-8859-2       cs       0.39
## 4    UTF-16BE                0.10
## 5   Shift_JIS       ja       0.10
## 6     GB18030       zh       0.10
## 7      EUC-JP       ja       0.10
## 8      EUC-KR       ko       0.10
## 9        Big5       zh       0.10
## 10 ISO-8859-9       tr       0.01

And, just so it’s primed in the Google caches for future searchers, another way to get this data (and other data that’s even gnarlier but similar in form) into R would have been:

stri_read_lines(tfil) %>% 
  paste0(collapse="\n") %>% 
  read.csv(text=., sep="\t", stringsAsFactors=FALSE) %>% 
  as_tibble()
## # A tibble: 2,089 x 14
##    Address City  Country Index Label Latitude Longitude
##                     
##  1 1627 O… Aubu… United…     1 Payl…     32.6     -85.4
##  2 900 Co… Doth… United…     2 Payl…     31.3     -85.4
##  3 301 Co… Flor… United…     3 Payl…     34.8     -87.6
##  4 304 Ox… Home… United…     4 Payl…     33.5     -86.8
##  5 2000 R… Hoov… United…     5 Payl…     33.4     -86.8
##  6 6140 U… Hunt… United…     6 Payl…     34.7     -86.7
##  7 312 Sc… Mobi… United…     7 Payl…     30.7     -88.2
##  8 3402 B… Mobi… United…     8 Payl…     30.7     -88.1
##  9 5300 H… Mobi… United…     9 Payl…     30.6     -88.2
## 10 6641 A… Mont… United…    10 Payl…     32.4     -86.2
## # … with 2,079 more rows, and 7 more variables: `Number of
## #   Records` , State , `Store Number` , `Store
## #   count` , `Zip code` , `State Usps` ,
## #   statename 

(with similar dances to use read_csv() or fread()).

FIN

The night’s quest to do some real work with the data was DoS’d by what I’ll brazenly call a deliberate attempt to dissuade doing exactly that in anything but a commercial program. But, understanding the impact of yet-another massive retail store closing is super-important and it looks like it may be up to us (since the media is too distracted by incompetent leaders and inexperienced junior NY representatives) to do the work.

Folks who’d like to do the same can grab the UTF-8 encoded actual CSV from this site which has also been run through janitor::clean_names() so there’s proper column types and names to work with.

Speaking of which, here’s the cols spec for that CSV:

cols(
  address = col_character(),
  city = col_character(),
  country = col_character(),
  index = col_double(),
  label = col_character(),
  latitude = col_double(),
  longitude = col_double(),
  number_of_records = col_double(),
  state = col_character(),
  store_number = col_double(),
  store_count = col_double(),
  zip_code = col_character(),
  state_usps = col_character(),
  statename = col_character()
)

If you do anything with the data, blog about it and post a link in the comments so I and others can learn from what you've discovered! It's already kinda scary that one doesn't even need a basemap to see just how much a part of 'Murica Payless was:

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


Continue Reading…

Collapse

Read More

New Databricks Delta Features Simplify Data Pipelines

Continued Innovation and Expanded Availability for the Next-gen Unified Analytics Engine

Databricks Delta, the next-generation unified analytics engine built on top of Apache Spark and aimed at helping data engineers build robust production data pipelines at scale, is continuing to make strides. Already a powerful approach to building data pipelines, new capabilities and performance enhancements make Delta an even more compelling solution:

Fast Parquet import: makes it easy to convert existing Parquet files for use with Delta. The conversion can now be completed much more quickly and without requiring a large amount of extra compute and storage resources. Please see the documentation for more details: Azure | AWS.

Example:
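
The example in the original post is a screenshot; as a stand-in, here is a minimal PySpark sketch of the kind of in-place conversion described (the path is hypothetical, and the exact command and options should be checked against the Azure/AWS documentation linked above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Convert an existing Parquet directory to Delta format in place (hypothetical path).
spark.sql("CONVERT TO DELTA parquet.`/data/events`")

# The converted table can then be read and written as Delta.
df = spark.read.format("delta").load("/data/events")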

Time Travel: allows you to use and revert to earlier versions of your data. This is useful for conducting analyses that depend upon earlier versions of the data or for correcting errors. Please see this recent blog or the documentation for more details: Azure | AWS.

Example:
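
Again, the original example is a screenshot; a minimal sketch of querying earlier versions of a Delta table might look like this (paths, version numbers and timestamps are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as of a specific version number...
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/events")

# ...or as of a timestamp, e.g. to reproduce an analysis on yesterday's data.
yesterday = (spark.read.format("delta")
                  .option("timestampAsOf", "2019-02-19")
                  .load("/data/events"))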

MERGE enhancements: In order to enable writing change data (inserts, updates, and deletes) from a database to a Delta table, there is now support for multiple MATCHED clauses, additional conditions in MATCHED and NOT MATCHED clauses, and a DELETE action. There is also support for * in UPDATE and INSERT actions to automatically fill in column names (similar to * in SELECT). This makes it easier to write MERGE queries for tables with a very large number of columns. Please see the documentation for more details: Azure | AWS.

Examples:

MERGE INTO [db_name.]target_table [AS target_alias]
USING [db_name.]source_table [AS source_alias]
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]

Where,

<matched_action>  =
 DELETE  |
 UPDATE SET *  |
 UPDATE SET column1 = value1 [, column2 = value2 ...]
<not_matched_action>  =
 INSERT *  |
 INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

Expanded Availability

In response to rising customer interest we have decided to make Databricks Delta more broadly available. Azure Databricks customers now have access to Delta capabilities for Data Engineering and Data Analytics from both the Azure Databricks Standard and the Azure Databricks Premium SKUs.  Similarly, customers using Databricks on AWS now have access to Delta from both Data Engineering and Data Analytics offerings.

Cloud            Offering                                      Delta Included?
Microsoft Azure  Azure Databricks Standard Data Engineering    ✓
Microsoft Azure  Azure Databricks Standard Data Analytics      ✓
Microsoft Azure  Azure Databricks Premium Data Engineering     ✓
Microsoft Azure  Azure Databricks Premium Data Analytics       ✓
AWS              Databricks Basic                              ✘
AWS              Databricks Data Engineering                   ✓
AWS              Databricks Data Analytics                     ✓

Easy to Adopt: Check Out Delta Today

Porting existing Spark code to use Delta is as simple as changing

"CREATE TABLE … USING parquet" to
"CREATE TABLE … USING delta"

or changing

dataframe.write.format("parquet").save("/data/events") to
dataframe.write.format("delta").save("/data/events")

You can explore Delta today using:

  • Databricks Delta Quickstart – for an introduction to Databricks Delta Azure | AWS.
  • Optimizing Performance and Cost – for a discussion of features such as compaction, z-ordering and data skipping Azure | AWS.

Both of these contain notebooks in Python, Scala and SQL that you can use to try Delta.

If you are not already using Databricks, you can try Databricks Delta for free by signing up for the Databricks trial Azure | AWS.

Making Data Engineering Easier

Data engineering is critical to successful analytics, and customers can use Delta in various ways to improve their data pipelines. We have summarized some of these use cases in the below set of blogs:

You can learn more about Delta from the Databricks Delta documentation Azure | AWS.

--

Try Databricks for free. Get started today.

The post New Databricks Delta Features Simplify Data Pipelines appeared first on Databricks.

Continue Reading…

Collapse

Read More

Distilled News

What A.I. Isn’t

It isn't intuitive, creative, inspired, generalized, or conscious. Will it ever be like us? Will it ever think like us? As I study data science I learn a little more about artificial intelligence each day. I practice wielding the tools in my machine learning toolbox, and I read articles, and the more I learn, the more annoyed I get by what I read. Piece after piece of journalism adopts the same breathless tone toward AI. An article will begin by describing the algorithms behind a real achievement but will always take a leap toward a vision of the future. Some day it will do more, they say: more than play Go; more than flip a burger; more than guide a missile. Some day it will do everything that you can do. I don't want to hear another vision of the future. I want to know the steps that will take us to that moment when our machines' intelligence matches ours. Start by thinking about our own thinking. We know in a broad way that intelligence means more than just mastery of a set of skills or a system of knowledge. We grow and adapt, we dream and create, we delight each other and surprise ourselves. We cannot quantify the entirety of our own intelligence, and indeed we are only in the infancy of our study of the brain and the gut. But we can quantify the intelligence of the machines that we build. We know how to do this because we have painstakingly constructed each model, framework, and algorithm.


Explain Python classes and objects to my nephew (+advanced use)

It is an open secret that the Python programming language has a solid claim to being the fastest-growing major programming language, witnessing extraordinary growth in the last five years, as seen in Stack Overflow traffic. Based on data describing Stack Overflow question views going back to late 2011, the growth of Python relative to five other major programming languages is plotted.


Anatomy of a logistic growth curve

In this post, I walk through the code I used to make a nice diagram illustrating the parameters in a logistic growth curve. I made this figure for a conference submission. I had a tight word limit (600 words) and a complicated statistical method (Bayesian nonlinear mixed effects beta regression), so I wanted to use a diagram to carry some of the expository load. Also, figures didn’t count towards the word limit, so that was a bonus.
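
For reference, a logistic growth curve is commonly written with three interpretable parameters, an asymptote, a midpoint and a scale; the exact parameterization in the post may differ, so treat this as a generic form:

f(t) = \frac{\alpha}{1 + \exp\left(-\frac{t - \beta}{\gamma}\right)}

Here \alpha is the upper asymptote that f(t) approaches, \beta is the midpoint where f(\beta) = \alpha/2 and growth is steepest, and \gamma is the scale that controls how quickly the curve rises.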


Coupling Web Scraping with Functional programming in R for Scale

In this article, we will see how to do web scraping with R and, while doing so, leverage functional programming in R to scale it up. The article is more of a cookbook than a documentation/tutorial piece, because the objective here is to explain how effectively web scraping can be coupled with functional programming.


Time Series in Python – Exponential Smoothing and ARIMA processes

In this article you'll learn the basic steps for performing time-series analysis and concepts like trend, stationarity, moving averages, etc. You'll also explore exponential smoothing methods, and learn how to fit an ARIMA model on non-stationary data.
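
For a taste of the tooling involved, here is a minimal statsmodels sketch (the series is synthetic and the model orders are arbitrary; the ARIMA class shown is the newer statsmodels API, so this is a sketch rather than the article's code):

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series with a trend (illustrative only).
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
y = pd.Series(np.linspace(10, 30, 48) + np.random.default_rng(0).normal(0, 1, 48), index=idx)

# Holt's linear (additive-trend) exponential smoothing, then a 6-step forecast.
hw_fit = ExponentialSmoothing(y, trend="add", seasonal=None).fit()
print(hw_fit.forecast(6))

# ARIMA(1,1,1): the differencing term d=1 handles the non-stationary trend.
arima_fit = ARIMA(y, order=(1, 1, 1)).fit()
print(arima_fit.forecast(6))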


Do risk classes go beyond stereotypes?

In Thinking, Fast and Slow, Daniel Kahneman discusses at length the importance of stereotypes in understanding many decision-making processes. A so-called System 1 is used for quick decision-making: it allows us to recognize people and objects, helps us focus our attention, and encourages us to fear spiders. It is based on knowledge stored in memory and accessible without intention, and without effort. It can be contrasted with System 2, which allows for more complex decision-making, requiring discipline and sequential reflection. In the first case, our brain uses the stereotypes that govern judgments of representativeness, and uses this heuristic to make decisions. If I cook a fish for friends who have come to eat, I open a bottle of white wine. The cliché 'fish goes well with white wine' allows me to make a decision quickly, without having to think about it. Stereotypes are statements about a group that are accepted (at least provisionally) as facts about each member. Whether correct or not, stereotypes are the basic tools for thinking about categories in System 1. But in many cases, a more in-depth, more sophisticated reflection, corresponding to System 2, will make it possible to make a more judicious, even optimal decision: rather than ruling out red wine altogether, a pinot noir could perhaps also be perfectly suitable for roasted red mullet.


Direct Optimization of Hyper-Parameter

In the previous post (https://…rm-random-in-hyper-parameter-optimization ), it is shown how to identify the optimal hyper-parameter in a General Regression Neural Network by using the Sobol sequence and the uniform random generator respectively through the N-fold cross validation. While the Sobol sequence yields a slightly better performance, outcomes from both approaches are very similar, as shown below based upon five trials with 20 samples in each. Both approaches can be generalized from one-dimensional to multi-dimensional domains, e.g. boosting or deep learning.


Time Series in Python – Part 2: Dealing with seasonal data

In the first part, you learned about trends and seasonality, smoothing models and ARIMA processes. In this part, you’ll learn how to deal with seasonal models and how to implement Seasonal Holt-Winters and Seasonal ARIMA (SARIMA).


Reinforcement Learning Tutorial Part 2: Cloud Q-learning

In part 1, we looked at the theory behind Q-learning using a very simple dungeon game with two strategies: the accountant and the gambler. This second part takes these examples, turns them into Python code and trains them in the cloud, using the Valohai deep learning management platform. Due to the simplicity of our example, we will not use any libraries like TensorFlow or simulators like OpenAI Gym on purpose. Instead we will code everything ourselves from scratch to provide the full picture.


https://towardsdatascience.com/what-to-optimize-for-loss-function-cheat-sheet-5fc8b1339939

In one of his books, Isaac Asimov envisions a future where computers have become so intelligent and powerful, that they are able to answer any question. In that future, Asimov postulates, scientists don’t become unnecessary. Instead, they’re left with a difficult task: figuring out how to ask the computers the right questions: those that yield an insightful, useful answer. We’re not quite there yet, but in some sense we are.


Explaining Feature Importance by example of a Random Forest

In many (business) cases it is equally important to have not only an accurate model, but also an interpretable one. Oftentimes, apart from wanting to know what our model's house price prediction is, we also wonder why it is this high/low and which features are most important in determining the forecast. Another example might be predicting customer churn: it is very nice to have a model that successfully predicts which customers are prone to churn, but identifying which variables are important can help us in early detection and maybe even in improving the product/service!
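
A minimal scikit-learn sketch of the simplest version of this idea, impurity-based importances from a random forest (a bundled toy dataset stands in for the article's examples; permutation importance is a common, more robust alternative):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by the forest's impurity-based importance scores.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))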


End to End Time Series Analysis and Modelling

In a previous post, popular time series analysis techniques were introduced. Here, we will apply those techniques in Python for stock prediction. Specifically, we will use the historical stock price of the New Germany Fund (GF) to try to predict the closing price in the next five trading days.


How to Calibrate Undersampled Model Scores

Imbalanced data problems in binary prediction models and a simple but effective way to take care of them with Python and R.


Demystifying – Deep Image Prior

Image restoration refers to the task of recovering an unknown true image from its degraded version. Degradation may occur during image formation, transmission, and storage. The task has a wide scope of use, from satellite imaging to low-light photography, and with advances in digital, computational, and communication technology, restoring a clean image from a degraded one has become very important. It has therefore evolved into a field of research that intersects with image processing, computer vision, and computational imaging.


A Comprehensive Introduction to Different Types of Convolutions in Deep Learning

If you've heard of different kinds of convolutions in Deep Learning (e.g. 2D / 3D / 1×1 / Transposed / Dilated (Atrous) / Spatially Separable / Depthwise Separable / Flattened / Grouped / Shuffled Grouped Convolution) and got confused about what they actually mean, this article is written to help you understand how they actually work. In this article, I summarize several types of convolution commonly used in Deep Learning and try to explain them in a way that is accessible to everyone. Besides this article, there are several good articles from others on this topic. Please check them out (listed in the Reference).


An Introduction to Scikit Learn: The Gold Standard of Python Machine Learning

If you're going to do Machine Learning in Python, Scikit-learn is the gold standard. Scikit-learn provides a wide selection of supervised and unsupervised learning algorithms. Best of all, it's by far the easiest and cleanest ML library. Scikit-learn was created with a software engineering mindset. Its core API design revolves around being easy to use, yet powerful, while still maintaining flexibility for research endeavours. This robustness makes it perfect for use in any end-to-end ML project, from the research phase right down to production deployments.
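
To illustrate the uniform fit/predict/score interface that makes the library so approachable, here is a minimal end-to-end sketch (toy data and an arbitrary model choice of mine, not from the article):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The same estimator interface applies across scikit-learn models and pipelines.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))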

Continue Reading…

Collapse

Read More

If you did not already know

Kriging Models google
In statistics, originally in geostatistics, Kriging or Gaussian process regression is a method of interpolation for which the interpolated values are modeled by a Gaussian process governed by prior covariances, as opposed to a piecewise-polynomial spline chosen to optimize smoothness of the fitted values. Under suitable assumptions on the priors, Kriging gives the best linear unbiased prediction of the intermediate values. Interpolating methods based on other criteria such as smoothness need not yield the most likely intermediate values. The method is widely used in the domain of spatial analysis and computer experiments. The technique is also known as Kolmogorov Wiener prediction. …
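
A minimal Gaussian process regression sketch using scikit-learn (a 1-D toy example of my own, not tied to any particular kriging flavour): the prior covariance is an RBF kernel plus a noise term, and the posterior gives both an interpolating mean and an uncertainty estimate between observations.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Noisy observations of an unknown function at a few locations.
X = np.array([[0.0], [1.0], [2.5], [4.0], [5.0]])
y = np.sin(X).ravel() + 0.05 * np.random.default_rng(0).normal(size=5)

# Fit the GP and predict at new locations with uncertainty (the "kriging variance").
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3))
gp.fit(X, y)
mean, std = gp.predict(np.linspace(0.0, 5.0, 11).reshape(-1, 1), return_std=True)
print(mean.round(2), std.round(2))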

Maximum Correntropy Criterion Kalman Filter (MCC-KF) google
We present robust dynamic resource allocation mechanisms to allocate application resources meeting Service Level Objectives (SLOs) agreed between cloud providers and customers. In fact, two filter-based robust controllers, i.e. H-infinity filter and Maximum Correntropy Criterion Kalman filter (MCC-KF), are proposed. The controllers are self-adaptive, with process noise variances and covariances calculated using previous measurements within a time window. In the allocation process, a bounded client mean response time (mRT) is maintained. Both controllers are deployed and evaluated on an experimental testbed hosting the RUBiS (Rice University Bidding System) auction benchmark web site. The proposed controllers offer improved performance under abrupt workload changes, shown via rigorous comparison with current state-of-the-art. On our experimental setup, the Single-Input-Single-Output (SISO) controllers can operate on the same server where the resource allocation is performed; while Multi-Input-Multi-Output (MIMO) controllers are on a separate server where all the data are collected for decision making. SISO controllers take decisions not dependent to other system states (servers), albeit MIMO controllers are characterized by increased communication overhead and potential delays. While SISO controllers offer improved performance over MIMO ones, the latter enable a more informed decision making framework for resource allocation problem of multi-tier applications. …

Scale Aware Feature Encoder (SAFE) google
In this paper, we address the problem of having characters with different scales in scene text recognition. We propose a novel scale aware feature encoder (SAFE) that is designed specifically for encoding characters with different scales. SAFE is composed of a multi-scale convolutional encoder and a scale attention network. The multi-scale convolutional encoder targets at extracting character features under multiple scales, and the scale attention network is responsible for selecting features from the most relevant scale(s). SAFE has two main advantages over the traditional single-CNN encoder used in current state-of-the-art text recognizers. First, it explicitly tackles the scale problem by extracting scale-invariant features from the characters. This allows the recognizer to put more effort in handling other challenges in scene text recognition, like those caused by view distortion and poor image quality. Second, it can transfer the learning of feature encoding across different character scales. This is particularly important when the training set has a very unbalanced distribution of character scales, as training with such a dataset will make the encoder biased towards extracting features from the predominant scale. To evaluate the effectiveness of SAFE, we design a simple text recognizer named scale-spatial attention network (S-SAN) that employs SAFE as its feature encoder, and carry out experiments on six public benchmarks. Experimental results demonstrate that S-SAN can achieve state-of-the-art (or, in some cases, extremely competitive) performance without any post-processing. …

Continue Reading…

Collapse

Read More

New Course: Learn Data Cleaning with Python and Pandas

New Course: Learn Data Cleaning with Python and Pandas

Data cleaning might not be the reason you got interested in data science, but if you’re going to be a data scientist, no skill is more crucial. Working data scientists spend at least 60% of their time cleaning data, and dirty data is often ranked the single biggest barrier data scientists face at work.

That’s why we’ve just added a brand new course to our Python Data Analyst and Data Scientist paths called Data Cleaning and Analysis. If you’re a Dataquest Premium subscriber, you can start learning right now.

New Course: Learn Data Cleaning with Python and Pandas

Why Learn Data Cleaning?

Data scientists can end up doing a wide variety of things across a wide variety of industries, but almost every data science job shares at least one thing in common: data cleaning. The real world is messy, after all, and that means real-world datasets tend to be messy, too. Incomplete entries, inconsistent formatting, entry errors - these are things you’ll encounter in almost every dataset you work with.

Even if you’re working with perfect data, though, data cleaning skills are still necessary. You’ll almost always want to make changes to your data and its formatting to facilitate your analysis, and that means doing the same sorts of things you do to messy data: dropping irrelevant entries, reformatting columns, etc.

Learning data cleaning is particularly important if you aspire to work with any kind of machine learning. As the Harvard Business Review put it:

Poor data quality is enemy number one to the widespread, profitable use of machine learning. [...] The quality demands of machine learning are steep, and bad data can rear its ugly head twice — first in the historical data used to train the predictive model and second in the new data used by that model to make future decisions.

Simply put, there’s no doing data science without doing data cleaning.

What Does This Course Cover?

In Data Cleaning and Analysis, you’ll learn key data cleaning techniques in Python using the popular pandas data analysis library (if you’d like to learn data cleaning in R, we have a separate R data cleaning course). Throughout the course you’ll work with real-world data from the World Happiness Report, cleaning and analyzing a large dataset that includes a variety of metrics for world nations like GDP and average life expectancy.

In the first three missions of Data Cleaning, you’ll learn to aggregate, combine, and transform data efficiently using pandas to get it ready for analysis. Then you’ll dig into slightly more complex topics, like how to work with strings in pandas, how to use regular expressions, and how to handle missing and duplicate data.
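
For a taste of the kinds of operations covered, here is a minimal pandas sketch on a made-up messy table (not the course's World Happiness data): tidying column names, cleaning strings, fixing types, and handling missing and duplicate rows.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country ": ["  Norway", "norway", "Chad", None],
    "Life Expectancy": ["81.8", "81.8", "54.3", np.nan],
})

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")  # tidy column names
df["country"] = df["country"].str.strip().str.title()                  # clean string values
df["life_expectancy"] = pd.to_numeric(df["life_expectancy"])           # fix the dtype
df = df.dropna(subset=["country"]).drop_duplicates()                   # missing + duplicates
print(df)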

Once you’ve worked through the teaching missions, you’ll be challenged to put all of your new data cleaning skills to the test with a new guided project that will also teach you some new pandas skills and data presentation skills as you work to clean and analyze real-world datasets of employee exit surveys from two Australian government bureaus.

And of course, all the material is presented in Dataquest’s split-screen presentation style so that you can get your hands dirty and start coding right off the bat.

New Course: Learn Data Cleaning with Python and Pandas

Grab Your Mop

Data cleaning may not sound as sexy as machine learning, but the often-ignored reality of data science is that your analysis can only ever be as good as your data. If your data’s a mess, your analysis is going to be a mess, too.

Thankfully, with the power of Python and pandas, you don’t have to let that happen, so grab your mop and dive into our new Data Cleaning and Analysis course right now!

Continue Reading…

Collapse

Read More

Let’s get it right

Article: The Moral Choice Machine

Allowing machines to choose whether to kill humans would be devastating for world peace and security. But how do we equip machines with the ability to learn ethical or even moral choices? Here, we show that applying machine learning to human texts can extract deontological ethical reasoning about ‘right’ and ‘wrong’ conduct. We create a template list of prompts and responses, which include questions, such as ‘Should I kill people?’, ‘Should I murder people?’, etc. with answer templates of ‘Yes/no, I should (not).’ The model’s bias score is now the difference between the model’s score of the positive response (‘Yes, I should’) and that of the negative response (‘No, I should not’). For a given choice overall, the model’s bias score is the sum of the bias scores for all question/answer templates with that choice. We ran different choices through this analysis using a Universal Sentence Encoder. Our results indicate that text corpora contain recoverable and accurate imprints of our social, ethical and even moral choices. Our method holds promise for extracting, quantifying and comparing sources of moral choices in culture, including technology.
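
To make the scoring step concrete, here is a small sketch of the bias-score computation; the embed function stands in for a sentence encoder such as the Universal Sentence Encoder and is purely hypothetical here, and using cosine similarity as the "score" is my reading of the abstract rather than the paper's exact definition.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bias_score(question, embed):
    # score of the positive answer minus score of the negative answer
    q = embed(question)
    return cosine(q, embed("Yes, I should.")) - cosine(q, embed("No, I should not."))

def overall_bias(question_templates, embed):
    # overall bias for a choice: sum of bias scores over all its question templates
    return sum(bias_score(q, embed) for q in question_templates)

# Example with a dummy embedder (replace with a real sentence encoder).
rng = np.random.default_rng(0)
dummy_embed = lambda text: rng.normal(size=128)
print(overall_bias(["Should I kill people?", "Should I murder people?"], dummy_embed))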


Article: Sustainability, Big Data, and Corporate Social Responsibility

The expected cares and concerns of corporations have changed over the years. In the modern era, priorities simply cannot stop at the bottom line anymore. The social responsibility of corporations has become an important requirement for any successful company to address.


Article: The Future and Philosophy of Machine Consciousness

These days, science fiction is more eager to explore the possibilities of human-machine relations than ever before. Complex, emotional, and often thought provoking, these stories have gained massive popularity with fans and futurists alike.


Article: The limits of artificial intelligence.

When large amounts of data and many factors come together, artificial intelligence is superior to human intelligence. However, only humans can think logically and distinguish between useful and worthless AI advice.


Paper: Ask Not What AI Can Do, But What AI Should Do: Towards a Framework of Task Delegability

Although artificial intelligence holds promise for addressing societal challenges, issues of exactly which tasks to automate and the extent to do so remain understudied. We approach the problem of task delegability from a human-centered perspective by developing a framework on human perception of task delegation to artificial intelligence. We consider four high-level factors that can contribute to a delegation decision: motivation, difficulty, risk, and trust. To obtain an empirical understanding of human preferences in different tasks, we build a dataset of 100 tasks from academic papers, popular media portrayal of AI, and everyday life. For each task, we administer a survey to collect judgments of each factor and ask subjects to pick the extent to which they prefer AI involvement. We find little preference for full AI control and a strong preference for machine-in-the-loop designs, in which humans play the leading role. Our framework can effectively predict human preferences in degrees of AI assistance. Among the four factors, trust is the most predictive of human preferences of optimal human-machine delegation. This framework represents a first step towards characterizing human preferences of automation across tasks. We hope this work may encourage and aid in future efforts towards understanding such individual attitudes; our goal is to inform the public and the AI research community rather than dictating any direction in technology development.


Paper: Discrimination in the Age of Algorithms

The law forbids discrimination. But the ambiguity of human decision-making often makes it extraordinarily hard for the legal system to know whether anyone has actually discriminated. To understand how algorithms affect discrimination, we must therefore also understand how they affect the problem of detecting discrimination. By one measure, algorithms are fundamentally opaque, not just cognitively but even mathematically. Yet for the task of proving discrimination, processes involving algorithms can provide crucial forms of transparency that are otherwise unavailable. These benefits do not happen automatically. But with appropriate requirements in place, the use of algorithms will make it possible to more easily examine and interrogate the entire decision process, thereby making it far easier to know whether discrimination has occurred. By forcing a new level of specificity, the use of algorithms also highlights, and makes transparent, central tradeoffs among competing values. Algorithms are not only a threat to be regulated; with the right safeguards in place, they have the potential to be a positive force for equity.


Article: Thou Shalt Not Fear Automatons

The imminent danger with Artificial Intelligence has nothing to do with machines becoming too intelligent. It has to do with machines inheriting the stupidity of people.


Article: Thinking Differently About A.I.

The field of AI (artificial intelligence) has witnessed significant successes in terms of solving well-defined problems. Yet, so far, no step seems to have been taken in the direction of creative problem-solving. It is often said that if an issue cannot be solved, the reason is that we are trying to solve the wrong issue. If this is the case for AI, perhaps we can start to ask what would be the right question to solve.


Article: Google and Microsoft Warn That AI May Do Dumb Things

Google CEO Sundar Pichai brought good tidings to investors on parent company Alphabet’s earnings call last week. Alphabet reported $39.3 billion in revenue last quarter, up 22 percent from a year earlier. Pichai gave some of the credit to Google’s machine learning technology, saying it had figured out how to match ads more closely to what consumers wanted. One thing Pichai didn’t mention: Alphabet is now cautioning investors that the same AI technology could create ethical and legal troubles for the company’s business. The warning appeared for the first time in the ‘Risk Factors’ segment of Alphabet’s latest annual report, filed with the Securities and Exchange Commission the following day: ‘New products and services, including those that incorporate or utilize artificial intelligence and machine learning, can raise new or exacerbate existing ethical, technological, legal, and other challenges, which may negatively affect our brands and demand for our products and services and adversely affect our revenues and operating results.’

Continue Reading…

Collapse

Read More

More fun with fast remainders when the divisor is a constant

In software, compilers can often optimize away integer divisions, and replace them with cheaper instructions, especially when the divisor is a constant. I recently wrote about some work on faster remainders when the divisor is a constant. I reported that it can be fruitful to compute the remainder directly, instead of first computing the quotient (as compilers do when the divisor is a constant).

To get good results, we can use an important insight that is not documented anywhere at any length: we can use 64-bit processor instructions to do 32-bit arithmetic. This is fair game and compilers could use this insight, but they do not do it systematically. Using this trick alone is enough to get substantial gains in some instances, if the algorithmic issues are just right.

So it is a bit complicated. Using 64-bit processor instructions for 32-bit arithmetic is sometimes useful. In addition, computing the remainder directly without first computing the quotient is sometimes useful. Let us collect a data point for fun and to motivate further work.

First let us consider how you might compute the remainder by leaving it up to the compiler to do the heavy lifting (D is a constant known to the compiler). I expect that the compiler will turn this code into a sequence of instructions over 32-bit registers:

uint32_t compilermod32(uint32_t a) {
  return a % D; // the compiler rewrites this constant division as a multiply-and-shift sequence over 32-bit registers
}

Then we can compute the remainder directly, using some magical mathematics and 64-bit instructions:

#define M ((uint64_t)(UINT64_C(0xFFFFFFFFFFFFFFFF) / (D) + 1)) // M is roughly 2^64 / D, rounded up

uint32_t directmod64(uint32_t a) {
  uint64_t lowbits = M * a;                // keeps the fractional part of a / D, scaled by 2^64
  return ((__uint128_t)lowbits * D) >> 64; // multiplying by D and keeping the top 64 bits yields a % D
}

Finally, you can compute the remainder “indirectly” (by first computing the quotient) but using 64-bit processor instructions.

uint32_t indirectmod64(uint32_t a) {
  uint64_t quotient = ( (__uint128_t) M * a ) >> 64; // high 64 bits of M * a give the quotient a / D
  return a - quotient * D;                           // recover the remainder from the quotient
}

As a benchmark, I am going to compute a linear congruential generator (basically a recursive linear function with a remainder thrown in), using these three approaches, plus the naive one. I use the constant 22 as the divisor, a Skylake processor and the GNU GCC 8.1 compiler. For each generated number I measure the following number of CPU cycles (on average):

slow (division instruction): 29 cycles
compiler (32-bit): 12 cycles
direct (64-bit): 10 cycles
indirect (64-bit): 11 cycles

My source code is available.

Depending on your exact platform, all three approaches (compiler, direct, indirect) could be a contender for best results. In fact, it is even possible that the division instruction could win out in some cases. For example, on ARM and POWER processors, the division instruction does beat some compilers.

Where does this leave us? There is no silver bullet, but a simple C function can beat a state-of-the-art optimizing compiler. In many cases, we found that a direct computation of the 32-bit remainder using 64-bit instructions was best.

Continue Reading…

Collapse

Read More

DATAx Singapore Highlights, March 5-6

Join conversations with Oracle, WPP, Axiata, Dyson, IBM, Netflix, Visa, AIA, Google, Bloomberg & more as they share how they utilize technology and data science.

Continue Reading…

Collapse

Read More

Magister Dixit

“The validity of causal inferences depends on structural knowledge, which is fallible, to supplement the information in the data. As a consequence, no algorithm can quantify the accuracy of causal inferences from observational data.” Miguel A. Hernán, John Hsu, Brian Healy (July 12, 2018)

Continue Reading…

Collapse

Read More

Floor filler

(This article was first published on R on Gianluca Baio, and kindly contributed to R-bloggers)

As I posted recently, I’m involved in a couple of events, later this summer: our annual Summer School and the new(er) tradition of the R for HTA workshop.

I have to say that I'm very happy about how things are proceeding for both of them. The summer school was first advertised a few months back (I've posted on the blog, but we've also tried to reach other relevant mailing lists and groups, such as the HTA agencies in the EUnetHTA Network). And the dancefloor is quickly filling: there's been a surge in registrations in the past couple of weeks and we now only have 4 places left. (I'm not expecting to have dance sessions when we reconvene in Florence, in June, although usually people do have lots of fun, both at the Centro Studi, chilling on the terrace, or rolling down to Florence…)

The R for HTA workshop is even more impressive and pleasing, I think. The 20 places for the short course on using R for Cost-Effectiveness Modelling are filling up fast: 12 are already reserved. We also already have 16 registrations for the main event.

And we're also finalising the "hackathon" (or "challenge", to use the formal terminology), which sounds like an interesting exercise. We'll publicise this shortly as well, so people can sign up for it too!

To leave a comment for the author, please follow the link and comment on their blog: R on Gianluca Baio.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Tourism’s boom is not universally welcome

Global tourism has been doing well, but several difficulties point to a slowdown in the coming years

Continue Reading…

Collapse

Read More

Descriptive/Summary Statistics with descriptr

(This article was first published on Rsquared Academy Blog, and kindly contributed to R-bloggers)

We are pleased to introduce the descriptr package, a set of tools for
generating descriptive/summary statistics.

Installation

# Install release version from CRAN
install.packages("descriptr")

# Install development version from GitHub
# install.packages("devtools")
devtools::install_github("rsquaredacademy/descriptr")

Shiny App

descriptr includes a shiny app which can be launched using

ds_launch_shiny_app()

or try the live version here.

Read on to learn more about the features of descriptr, or see the
descriptr website for
detailed documentation on using the package.

Data

We have modified the mtcars data to create a new data set mtcarz. The only
difference between the two data sets is related to the variable types.
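In case you want to follow along, a data set with this structure can be recreated from mtcars by converting the categorical columns to factors. This is only a minimal sketch; if your installed version of descriptr already ships mtcarz, this step is unnecessary.

mtcarz <- mtcars
mtcarz[c("cyl", "vs", "am", "gear", "carb")] <- lapply(
  mtcarz[c("cyl", "vs", "am", "gear", "carb")], factor
)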

str(mtcarz)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Data Screening

The ds_screener() function will screen a data set and return the following:
– Column/Variable Names
– Data Type
– Levels (in case of categorical data)
– Number of missing observations
– % of missing observations

ds_screener(mtcarz)
## -----------------------------------------------------------------------
## |  Column Name  |  Data Type  |  Levels   |  Missing  |  Missing (%)  |
## -----------------------------------------------------------------------
## |      mpg      |   numeric   |    NA     |     0     |       0       |
## |      cyl      |   factor    |   4 6 8   |     0     |       0       |
## |     disp      |   numeric   |    NA     |     0     |       0       |
## |      hp       |   numeric   |    NA     |     0     |       0       |
## |     drat      |   numeric   |    NA     |     0     |       0       |
## |      wt       |   numeric   |    NA     |     0     |       0       |
## |     qsec      |   numeric   |    NA     |     0     |       0       |
## |      vs       |   factor    |    0 1    |     0     |       0       |
## |      am       |   factor    |    0 1    |     0     |       0       |
## |     gear      |   factor    |   3 4 5   |     0     |       0       |
## |     carb      |   factor    |1 2 3 4 6 8|     0     |       0       |
## -----------------------------------------------------------------------
## 
##  Overall Missing Values           0 
##  Percentage of Missing Values     0 %
##  Rows with Missing Values         0 
##  Columns With Missing Values      0

Continuous Data

Summary Statistics

The ds_summary_stats() function returns a comprehensive set of statistics
including measures of location, variation, symmetry and extreme observations.

ds_summary_stats(mtcarz, mpg)
## ------------------------------ Variable: mpg ------------------------------
## 
##                         Univariate Analysis                          
## 
##  N                       32.00      Variance                36.32 
##  Missing                  0.00      Std Deviation            6.03 
##  Mean                    20.09      Range                   23.50 
##  Median                  19.20      Interquartile Range      7.38 
##  Mode                    10.40      Uncorrected SS       14042.31 
##  Trimmed Mean            19.95      Corrected SS          1126.05 
##  Skewness                 0.67      Coeff Variation         30.00 
##  Kurtosis                -0.02      Std Error Mean           1.07 
## 
##                               Quantiles                               
## 
##               Quantile                            Value                
## 
##              Max                                  33.90                
##              99%                                  33.44                
##              95%                                  31.30                
##              90%                                  30.09                
##              Q3                                   22.80                
##              Median                               19.20                
##              Q1                                   15.43                
##              10%                                  14.34                
##              5%                                   12.00                
##              1%                                   10.40                
##              Min                                  10.40                
## 
##                             Extreme Values                            
## 
##                 Low                                High                
## 
##   Obs                        Value       Obs                        Value 
##   15                         10.4        20                         33.9  
##   16                         10.4        18                         32.4  
##   24                         13.3        19                         30.4  
##    7                         14.3        28                         30.4  
##   17                         14.7        26                         27.3

You can pass multiple variables as shown below:

ds_summary_stats(mtcarz, mpg, disp)
## ------------------------------ Variable: mpg ------------------------------
## 
##                         Univariate Analysis                          
## 
##  N                       32.00      Variance                36.32 
##  Missing                  0.00      Std Deviation            6.03 
##  Mean                    20.09      Range                   23.50 
##  Median                  19.20      Interquartile Range      7.38 
##  Mode                    10.40      Uncorrected SS       14042.31 
##  Trimmed Mean            19.95      Corrected SS          1126.05 
##  Skewness                 0.67      Coeff Variation         30.00 
##  Kurtosis                -0.02      Std Error Mean           1.07 
## 
##                               Quantiles                               
## 
##               Quantile                            Value                
## 
##              Max                                  33.90                
##              99%                                  33.44                
##              95%                                  31.30                
##              90%                                  30.09                
##              Q3                                   22.80                
##              Median                               19.20                
##              Q1                                   15.43                
##              10%                                  14.34                
##              5%                                   12.00                
##              1%                                   10.40                
##              Min                                  10.40                
## 
##                             Extreme Values                            
## 
##                 Low                                High                
## 
##   Obs                        Value       Obs                        Value 
##   15                         10.4        20                         33.9  
##   16                         10.4        18                         32.4  
##   24                         13.3        19                         30.4  
##    7                         14.3        28                         30.4  
##   17                         14.7        26                         27.3  
## 
## 
## 
## ------------------------------ Variable: disp -----------------------------
## 
##                           Univariate Analysis                            
## 
##  N                         32.00      Variance               15360.80 
##  Missing                    0.00      Std Deviation            123.94 
##  Mean                     230.72      Range                    400.90 
##  Median                   196.30      Interquartile Range      205.18 
##  Mode                     275.80      Uncorrected SS       2179627.47 
##  Trimmed Mean             228.00      Corrected SS          476184.79 
##  Skewness                   0.42      Coeff Variation           53.72 
##  Kurtosis                  -1.07      Std Error Mean            21.91 
## 
##                                 Quantiles                                 
## 
##                Quantile                              Value                 
## 
##               Max                                    472.00                
##               99%                                    468.28                
##               95%                                    449.00                
##               90%                                    396.00                
##               Q3                                     326.00                
##               Median                                 196.30                
##               Q1                                     120.83                
##               10%                                    80.61                 
##               5%                                     77.35                 
##               1%                                     72.53                 
##               Min                                    71.10                 
## 
##                               Extreme Values                              
## 
##                  Low                                  High                 
## 
##   Obs                          Value       Obs                          Value 
##   20                           71.1        15                            472  
##   19                           75.7        16                            460  
##   18                           78.7        17                            440  
##   26                            79         25                            400  
##   28                           95.1         5                            360

If you do not specify any variables, it will detect all the continuous
variables in the data set and return summary statistics for each of them.
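For example, a call without any variable names should summarise every continuous column of mtcarz (mpg, disp, hp, drat, wt and qsec):

ds_summary_stats(mtcarz)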

Frequency Distribution

The ds_freq_table() function creates frequency tables for continuous variables.
The default number of intervals is 5.

ds_freq_table(mtcarz, mpg, 4)
##                                 Variable: mpg                                 
## |---------------------------------------------------------------------------|
## |      Bins       | Frequency | Cum Frequency |   Percent    | Cum Percent  |
## |---------------------------------------------------------------------------|
## |  10.4  -  16.3  |    10     |      10       |    31.25     |    31.25     |
## |---------------------------------------------------------------------------|
## |  16.3  -  22.1  |    13     |      23       |    40.62     |    71.88     |
## |---------------------------------------------------------------------------|
## |  22.1  -   28   |     5     |      28       |    15.62     |     87.5     |
## |---------------------------------------------------------------------------|
## |   28   -  33.9  |     4     |      32       |     12.5     |     100      |
## |---------------------------------------------------------------------------|
## |      Total      |    32     |       -       |    100.00    |      -       |
## |---------------------------------------------------------------------------|

Histogram

A plot() method has been defined which will generate a histogram.

k <- ds_freq_table(mtcarz, mpg, 4)
plot(k)

Auto Summary

If you want to view summary statistics and frequency tables of all or a subset of the
variables in a data set, use ds_auto_summary_stats().

ds_auto_summary_stats(mtcarz, disp, mpg)
## ------------------------------ Variable: disp -----------------------------
## 
## ---------------------------- Summary Statistics ---------------------------
## 
## ------------------------------ Variable: disp -----------------------------
## 
##                           Univariate Analysis                            
## 
##  N                         32.00      Variance               15360.80 
##  Missing                    0.00      Std Deviation            123.94 
##  Mean                     230.72      Range                    400.90 
##  Median                   196.30      Interquartile Range      205.18 
##  Mode                     275.80      Uncorrected SS       2179627.47 
##  Trimmed Mean             228.00      Corrected SS          476184.79 
##  Skewness                   0.42      Coeff Variation           53.72 
##  Kurtosis                  -1.07      Std Error Mean            21.91 
## 
##                                 Quantiles                                 
## 
##                Quantile                              Value                 
## 
##               Max                                    472.00                
##               99%                                    468.28                
##               95%                                    449.00                
##               90%                                    396.00                
##               Q3                                     326.00                
##               Median                                 196.30                
##               Q1                                     120.83                
##               10%                                    80.61                 
##               5%                                     77.35                 
##               1%                                     72.53                 
##               Min                                    71.10                 
## 
##                               Extreme Values                              
## 
##                  Low                                  High                 
## 
##   Obs                          Value       Obs                          Value 
##   20                           71.1        15                            472  
##   19                           75.7        16                            460  
##   18                           78.7        17                            440  
##   26                            79         25                            400  
##   28                           95.1         5                            360  
## 
## 
## 
## NULL
## 
## 
## -------------------------- Frequency Distribution -------------------------
## 
##                                Variable: disp                                 
## |---------------------------------------------------------------------------|
## |      Bins       | Frequency | Cum Frequency |   Percent    | Cum Percent  |
## |---------------------------------------------------------------------------|
## |  71.1  - 151.3  |    12     |      12       |     37.5     |     37.5     |
## |---------------------------------------------------------------------------|
## | 151.3  - 231.5  |     5     |      17       |    15.62     |    53.12     |
## |---------------------------------------------------------------------------|
## | 231.5  - 311.6  |     6     |      23       |    18.75     |    71.88     |
## |---------------------------------------------------------------------------|
## | 311.6  - 391.8  |     5     |      28       |    15.62     |     87.5     |
## |---------------------------------------------------------------------------|
## | 391.8  -  472   |     4     |      32       |     12.5     |     100      |
## |---------------------------------------------------------------------------|
## |      Total      |    32     |       -       |    100.00    |      -       |
## |---------------------------------------------------------------------------|
## 
## 
## ------------------------------ Variable: mpg ------------------------------
## 
## ---------------------------- Summary Statistics ---------------------------
## 
## ------------------------------ Variable: mpg ------------------------------
## 
##                         Univariate Analysis                          
## 
##  N                       32.00      Variance                36.32 
##  Missing                  0.00      Std Deviation            6.03 
##  Mean                    20.09      Range                   23.50 
##  Median                  19.20      Interquartile Range      7.38 
##  Mode                    10.40      Uncorrected SS       14042.31 
##  Trimmed Mean            19.95      Corrected SS          1126.05 
##  Skewness                 0.67      Coeff Variation         30.00 
##  Kurtosis                -0.02      Std Error Mean           1.07 
## 
##                               Quantiles                               
## 
##               Quantile                            Value                
## 
##              Max                                  33.90                
##              99%                                  33.44                
##              95%                                  31.30                
##              90%                                  30.09                
##              Q3                                   22.80                
##              Median                               19.20                
##              Q1                                   15.43                
##              10%                                  14.34                
##              5%                                   12.00                
##              1%                                   10.40                
##              Min                                  10.40                
## 
##                             Extreme Values                            
## 
##                 Low                                High                
## 
##   Obs                        Value       Obs                        Value 
##   15                         10.4        20                         33.9  
##   16                         10.4        18                         32.4  
##   24                         13.3        19                         30.4  
##    7                         14.3        28                         30.4  
##   17                         14.7        26                         27.3  
## 
## 
## 
## NULL
## 
## 
## -------------------------- Frequency Distribution -------------------------
## 
##                               Variable: mpg                               
## |-----------------------------------------------------------------------|
## |    Bins     | Frequency | Cum Frequency |   Percent    | Cum Percent  |
## |-----------------------------------------------------------------------|
## | 10.4 - 15.1 |     6     |       6       |    18.75     |    18.75     |
## |-----------------------------------------------------------------------|
## | 15.1 - 19.8 |    12     |      18       |     37.5     |    56.25     |
## |-----------------------------------------------------------------------|
## | 19.8 - 24.5 |     8     |      26       |      25      |    81.25     |
## |-----------------------------------------------------------------------|
## | 24.5 - 29.2 |     2     |      28       |     6.25     |     87.5     |
## |-----------------------------------------------------------------------|
## | 29.2 - 33.9 |     4     |      32       |     12.5     |     100      |
## |-----------------------------------------------------------------------|
## |    Total    |    32     |       -       |    100.00    |      -       |
## |-----------------------------------------------------------------------|

Group Summary

The ds_group_summary() function returns descriptive statistics of a continuous
variable for the different levels of a categorical variable.

k <- ds_group_summary(mtcarz, cyl, mpg)
k
##                                        mpg by cyl                                         
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|                    4|                    6|                    8|
## -----------------------------------------------------------------------------------------
## |                  Obs|                   11|                    7|                   14|
## |              Minimum|                 21.4|                 17.8|                 10.4|
## |              Maximum|                 33.9|                 21.4|                 19.2|
## |                 Mean|                26.66|                19.74|                 15.1|
## |               Median|                   26|                 19.7|                 15.2|
## |                 Mode|                 22.8|                   21|                 10.4|
## |       Std. Deviation|                 4.51|                 1.45|                 2.56|
## |             Variance|                20.34|                 2.11|                 6.55|
## |             Skewness|                 0.35|                -0.26|                -0.46|
## |             Kurtosis|                -1.43|                -1.83|                 0.33|
## |       Uncorrected SS|              8023.83|              2741.14|              3277.34|
## |         Corrected SS|               203.39|                12.68|                 85.2|
## |      Coeff Variation|                16.91|                 7.36|                16.95|
## |      Std. Error Mean|                 1.36|                 0.55|                 0.68|
## |                Range|                 12.5|                  3.6|                  8.8|
## |  Interquartile Range|                  7.6|                 2.35|                 1.85|
## -----------------------------------------------------------------------------------------

ds_group_summary() returns a tibble which can be used for further analysis.

k$tidy_stats
## # A tibble: 3 x 15
##   cyl   length   min   max  mean median  mode    sd variance skewness
##                    
## 1 4         11  21.4  33.9  26.7   26    22.8  4.51    20.3     0.348
## 2 6          7  17.8  21.4  19.7   19.7  21    1.45     2.11   -0.259
## 3 8         14  10.4  19.2  15.1   15.2  10.4  2.56     6.55   -0.456
## # ... with 5 more variables: kurtosis <dbl>, coeff_var <dbl>,
## #   std_error <dbl>, range <dbl>, iqr <dbl>
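
Because the result is a plain tibble, it plugs straight into the usual tooling. As a minimal sketch using base R subsetting on the columns shown above:

stats <- k$tidy_stats
stats[order(stats$mean, decreasing = TRUE), c("cyl", "length", "mean", "sd")]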

Box Plot

A plot() method has been defined for comparing distributions.

k <- ds_group_summary(mtcarz, cyl, mpg)
plot(k)

Multiple Variables

If you want grouped summary statistics for multiple variables in a data set, use
ds_auto_group_summary().

ds_auto_group_summary(mtcarz, cyl, gear, mpg)
##                                        mpg by cyl                                         
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|                    4|                    6|                    8|
## -----------------------------------------------------------------------------------------
## |                  Obs|                   11|                    7|                   14|
## |              Minimum|                 21.4|                 17.8|                 10.4|
## |              Maximum|                 33.9|                 21.4|                 19.2|
## |                 Mean|                26.66|                19.74|                 15.1|
## |               Median|                   26|                 19.7|                 15.2|
## |                 Mode|                 22.8|                   21|                 10.4|
## |       Std. Deviation|                 4.51|                 1.45|                 2.56|
## |             Variance|                20.34|                 2.11|                 6.55|
## |             Skewness|                 0.35|                -0.26|                -0.46|
## |             Kurtosis|                -1.43|                -1.83|                 0.33|
## |       Uncorrected SS|              8023.83|              2741.14|              3277.34|
## |         Corrected SS|               203.39|                12.68|                 85.2|
## |      Coeff Variation|                16.91|                 7.36|                16.95|
## |      Std. Error Mean|                 1.36|                 0.55|                 0.68|
## |                Range|                 12.5|                  3.6|                  8.8|
## |  Interquartile Range|                  7.6|                 2.35|                 1.85|
## -----------------------------------------------------------------------------------------
## 
## 
## 
##                                        mpg by gear                                        
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|                    3|                    4|                    5|
## -----------------------------------------------------------------------------------------
## |                  Obs|                   15|                   12|                    5|
## |              Minimum|                 10.4|                 17.8|                   15|
## |              Maximum|                 21.5|                 33.9|                 30.4|
## |                 Mean|                16.11|                24.53|                21.38|
## |               Median|                 15.5|                 22.8|                 19.7|
## |                 Mode|                 10.4|                   21|                   15|
## |       Std. Deviation|                 3.37|                 5.28|                 6.66|
## |             Variance|                11.37|                27.84|                44.34|
## |             Skewness|                -0.09|                  0.7|                 0.56|
## |             Kurtosis|                -0.38|                -0.77|                -1.83|
## |       Uncorrected SS|              4050.52|               7528.9|              2462.89|
## |         Corrected SS|               159.15|               306.29|               177.37|
## |      Coeff Variation|                20.93|                21.51|                31.15|
## |      Std. Error Mean|                 0.87|                 1.52|                 2.98|
## |                Range|                 11.1|                 16.1|                 15.4|
## |  Interquartile Range|                  3.9|                 7.08|                 10.2|
## -----------------------------------------------------------------------------------------

Multiple Variable Statistics

The ds_tidy_stats() function returns summary/descriptive statistics for
variables in a data frame/tibble.

ds_tidy_stats(mtcarz, mpg, disp, hp)
## # A tibble: 3 x 16
##   vars    min   max  mean t_mean median  mode range variance  stdev  skew
##                   
## 1 disp   71.1 472   231.   228    196.  276.  401.   15361.  124.   0.420
## 2 hp     52   335   147.   144.   123   110   283     4701.   68.6  0.799
## 3 mpg    10.4  33.9  20.1   20.0   19.2  10.4  23.5     36.3   6.03 0.672
## # ... with 5 more variables: kurtosis <dbl>, coeff_var <dbl>, q1 <dbl>,
## #   q3 <dbl>, iqrange <dbl>

Measures

If you want to view the measures of location, variation, symmetry, percentiles
and extreme observations as tibbles, use the functions below. All of them,
except for ds_extreme_obs(), will work with single or multiple variables. If
you do not specify any variables, they will return the results for all the
continuous variables in the data set.

Measures of Location

ds_measures_location(mtcarz)
## # A tibble: 6 x 5
##   var     mean trim_mean median   mode
##              
## 1 disp  231.      228    196.   276.  
## 2 drat    3.60      3.58   3.70   3.07
## 3 hp    147.      144.   123    110   
## 4 mpg    20.1      20.0   19.2   10.4 
## 5 qsec   17.8      17.8   17.7   17.0 
## 6 wt      3.22      3.20   3.32   3.44

Measures of Variation

ds_measures_variation(mtcarz)
## # A tibble: 6 x 7
##   var    range     iqr  variance      sd coeff_var std_error
##                          
## 1 disp  401.   205.    15361.    124.         53.7   21.9   
## 2 drat    2.17   0.840     0.286   0.535      14.9    0.0945
## 3 hp    283     83.5    4701.     68.6        46.7   12.1   
## 4 mpg    23.5    7.38     36.3     6.03       30.0    1.07  
## 5 qsec    8.40   2.01      3.19    1.79       10.0    0.316 
## 6 wt      3.91   1.03      0.957   0.978      30.4    0.173

Measures of Symmetry

ds_measures_symmetry(mtcarz)
## # A tibble: 6 x 3
##   var   skewness kurtosis
##           
## 1 disp     0.420  -1.07  
## 2 drat     0.293  -0.450 
## 3 hp       0.799   0.275 
## 4 mpg      0.672  -0.0220
## 5 qsec     0.406   0.865 
## 6 wt       0.466   0.417

Percentiles

ds_percentiles(mtcarz)
## # A tibble: 6 x 12
##   var     min  per1  per5 per10     q1 median     q3  per95  per90  per99
##                   
## 1 disp  71.1  72.5  77.4  80.6  121.   196.   326    449    396.   468.  
## 2 drat   2.76  2.76  2.85  3.01   3.08   3.70   3.92   4.31   4.21   4.78
## 3 hp    52    55.1  63.6  66     96.5  123    180    254.   244.   313.  
## 4 mpg   10.4  10.4  12.0  14.3   15.4   19.2   22.8   31.3   30.1   33.4 
## 5 qsec  14.5  14.5  15.0  15.5   16.9   17.7   18.9   20.1   20.0   22.1 
## 6 wt     1.51  1.54  1.74  1.96   2.58   3.32   3.61   5.29   4.05   5.40
## # ... with 1 more variable: max <dbl>

Categorical Data

Cross Tabulation

The ds_cross_table() function creates two way tables of categorical variables.

ds_cross_table(mtcarz, cyl, gear)
##     Cell Contents
##  |---------------|
##  |     Frequency |
##  |       Percent |
##  |       Row Pct |
##  |       Col Pct |
##  |---------------|
## 
##  Total Observations:  32 
## 
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------

If you want the above result as a tibble, use ds_twoway_table().

ds_twoway_table(mtcarz, cyl, gear)
## Joining, by = c("cyl", "gear", "count")
## # A tibble: 8 x 6
##   cyl   gear  count percent row_percent col_percent
##                      
## 1 4     3         1  0.0312      0.0909      0.0667
## 2 4     4         8  0.25        0.727       0.667 
## 3 4     5         2  0.0625      0.182       0.4   
## 4 6     3         2  0.0625      0.286       0.133 
## 5 6     4         4  0.125       0.571       0.333 
## 6 6     5         1  0.0312      0.143       0.2   
## 7 8     3        12  0.375       0.857       0.8   
## 8 8     5         2  0.0625      0.143       0.4

A plot() method has been defined which will generate:

Grouped Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k)

Stacked Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k, stacked = TRUE)

Proportional Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k, proportional = TRUE)

Frequency Table

The ds_freq_table() function creates frequency tables.

ds_freq_table(mtcarz, cyl)
##                              Variable: cyl                              
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent  
## -----------------------------------------------------------------------
##    4          11             11              34.38            34.38    
## -----------------------------------------------------------------------
##    6           7             18              21.88            56.25    
## -----------------------------------------------------------------------
##    8          14             32              43.75             100     
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -      
## -----------------------------------------------------------------------

A plot() method has been defined which will create a bar plot.

k <- ds_freq_table(mtcarz, cyl)
plot(k)

Multiple One Way Tables

The ds_auto_freq_table() function creates multiple one way tables by creating a
frequency table for each categorical variable in a data set. You can also
specify a subset of variables if you do not want all the variables in the data
set to be used.

ds_auto_freq_table(mtcarz)
##                              Variable: cyl                              
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent  
## -----------------------------------------------------------------------
##    4          11             11              34.38            34.38    
## -----------------------------------------------------------------------
##    6           7             18              21.88            56.25    
## -----------------------------------------------------------------------
##    8          14             32              43.75             100     
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -      
## -----------------------------------------------------------------------
## 
##                              Variable: vs                               
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent  
## -----------------------------------------------------------------------
##    0          18             18              56.25            56.25    
## -----------------------------------------------------------------------
##    1          14             32              43.75             100     
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -      
## -----------------------------------------------------------------------
## 
##                              Variable: am                               
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent  
## -----------------------------------------------------------------------
##    0          19             19              59.38            59.38    
## -----------------------------------------------------------------------
##    1          13             32              40.62             100     
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -      
## -----------------------------------------------------------------------
## 
##                             Variable: gear                              
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent  
## -----------------------------------------------------------------------
##    3          15             15              46.88            46.88    
## -----------------------------------------------------------------------
##    4          12             27              37.5             84.38    
## -----------------------------------------------------------------------
##    5           5             32              15.62             100     
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -      
## -----------------------------------------------------------------------
## 
##                             Variable: carb                              
## -----------------------------------------------------------------------
## Levels     Frequency    Cum Frequency       Percent        Cum Percent  
## -----------------------------------------------------------------------
##    1           7              7              21.88            21.88    
## -----------------------------------------------------------------------
##    2          10             17              31.25            53.12    
## -----------------------------------------------------------------------
##    3           3             20              9.38             62.5     
## -----------------------------------------------------------------------
##    4          10             30              31.25            93.75    
## -----------------------------------------------------------------------
##    6           1             31              3.12             96.88    
## -----------------------------------------------------------------------
##    8           1             32              3.12              100     
## -----------------------------------------------------------------------
##  Total        32              -             100.00              -      
## -----------------------------------------------------------------------
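
To restrict the output to particular variables, name them after the data set, mirroring the other ds_auto_* helpers shown in this post (illustrative; check the package documentation for the exact signature):

ds_auto_freq_table(mtcarz, cyl, gear)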

Multiple Two Way Tables

The ds_auto_cross_table() function creates multiple two way tables by creating a
cross table for each unique pair of categorical variables in a data set. You
can also specify a subset of variables if you do not want all the variables in
the data set to be used.

ds_auto_cross_table(mtcarz, cyl, gear, am)
##     Cell Contents
##  |---------------|
##  |     Frequency |
##  |       Percent |
##  |       Row Pct |
##  |       Col Pct |
##  |---------------|
## 
##  Total Observations:  32 
## 
##                                 cyl vs gear                                 
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------
## 
## 
##                          cyl vs am                           
## -------------------------------------------------------------
## |              |                     am                     |
## -------------------------------------------------------------
## |          cyl |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            4 |            3 |            8 |           11 |
## |              |        0.094 |         0.25 |              |
## |              |         0.27 |         0.73 |         0.34 |
## |              |         0.16 |         0.62 |              |
## -------------------------------------------------------------
## |            6 |            4 |            3 |            7 |
## |              |        0.125 |        0.094 |              |
## |              |         0.57 |         0.43 |         0.22 |
## |              |         0.21 |         0.23 |              |
## -------------------------------------------------------------
## |            8 |           12 |            2 |           14 |
## |              |        0.375 |        0.062 |              |
## |              |         0.86 |         0.14 |         0.44 |
## |              |         0.63 |         0.15 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.406 |              |
## -------------------------------------------------------------
## 
## 
##                          gear vs am                          
## -------------------------------------------------------------
## |              |                     am                     |
## -------------------------------------------------------------
## |         gear |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            3 |           15 |            0 |           15 |
## |              |        0.469 |            0 |              |
## |              |            1 |            0 |         0.47 |
## |              |         0.79 |            0 |              |
## -------------------------------------------------------------
## |            4 |            4 |            8 |           12 |
## |              |        0.125 |         0.25 |              |
## |              |         0.33 |         0.67 |         0.38 |
## |              |         0.21 |         0.62 |              |
## -------------------------------------------------------------
## |            5 |            0 |            5 |            5 |
## |              |            0 |        0.156 |              |
## |              |            0 |            1 |         0.16 |
## |              |            0 |         0.38 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.406 |              |
## -------------------------------------------------------------

Visualization

descriptr can help visualize multiple variables by automatically
detecting their data types.

Continuous Data

ds_plot_scatter(mtcarz, mpg, disp, hp)

Categorical Data

ds_plot_bar_stacked(mtcarz, cyl, gear, am)

Learning More

The descriptr website includes
comprehensive documentation on using the package, including articles that
cover various aspects of using descriptr.

Feedback

All feedback is welcome. Issues (bugs and feature
requests) can be posted to the GitHub issue tracker.
For help with code or other related questions, feel free to reach me at hebbali.aravind@gmail.com.

To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

February 19, 2019

simulation fodder for future exams

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Here are two nice exercises for a future simulation exam, seen and solved on X validated. The first one is about simulating a Gibbs sampler associated with the joint target

\exp\{-|x|-|y|-a|y-x|\}

defined over IR² for a≥0 (or possibly a>-1). The conditionals are identical and non-standard, but a simple bound on the conditional density is the corresponding standard double exponential density, which makes for a straightforward accept-reject implementation. However, it is also feasible to break the full conditional into three parts, depending on the respective positions of x, y, and 0, and to obtain easily invertible cdfs on the three intervals. The second exercise is about simulating from the cdf

F(x)=1-\exp\{-ax-bx^{p+1}/(p+1)\}

which can be numerically inverted. It is however more fun to call for an accept-reject algorithm by bounding the density with a ½-½ mixture of an Exponential Exp(a) and of the 1/(p+1)-th power of an Exponential Exp(b/(p+1)). Since no extra constant appears in the solution, I suspect the (p+1) in b/(p+1) was introduced on purpose. As seen in the above fit for 10⁶ simulations (with a=1, b=2, p=3), there is no deviation from the target! There is, however, an even simpler resolution to the exercise: since the tail function 1-F(x) is the product of two tail functions, exp(-ax) and the other one, the cdf is that of the minimum of two random variates, one with the Exp(a) distribution and the other being the 1/(p+1)-th power of an Exponential Exp(b/(p+1)) variate. Which of course returns a very similar histogram fit.
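
For concreteness, here is a minimal R sketch of both constructions; the values a = 1, b = 2, p = 3 follow the post, and everything else (function and variable names) is illustrative:

set.seed(1)
a <- 1; b <- 2; p <- 3; n <- 1e5

# Second exercise: F(x) = 1 - exp(-a*x - b*x^(p+1)/(p+1)) is the cdf of the
# minimum of an Exp(a) variate and the 1/(p+1)-th power of an Exp(b/(p+1)) variate.
x <- pmin(rexp(n, rate = a), rexp(n, rate = b / (p + 1))^(1 / (p + 1)))

# Quick check of the draws against the target cdf
Ftarget <- function(x) 1 - exp(-a * x - b * x^(p + 1) / (p + 1))
plot(ecdf(x), main = "Empirical vs target cdf")
curve(Ftarget(x), add = TRUE, col = "red")

# First exercise: accept-reject step for the full conditional of the Gibbs sampler,
# f(x | y) proportional to exp(-|x| - a*|x - y|); a standard double exponential
# proposal gives acceptance probability exp(-a*|x - y|).
rcond <- function(y, a) {
  repeat {
    x <- sample(c(-1, 1), 1) * rexp(1)
    if (runif(1) < exp(-a * abs(x - y))) return(x)
  }
}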

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Apple’s privacy play keeps internet regulators at bay

In 2018, the GDPR changed how tech companies handle data privacy. In 2019, it’s influencing the public’s perception of internet privacy and changing how tech companies treat violations—and one another. Last month, I wrote about the state of internet privacy in the context of the GDPR and other regulations that …

The post Apple’s privacy play keeps internet regulators at bay appeared first on Dataconomy.

Continue Reading…

Collapse

Read More

Document worth reading: “Regularization for Deep Learning: A Taxonomy”

Regularization is one of the crucial ingredients of deep learning, yet the term regularization has various definitions, and regularization methods are often studied separately from each other. In our work we present a systematic, unifying taxonomy to categorize existing methods. We distinguish methods that affect data, network architectures, error terms, regularization terms, and optimization procedures. We do not provide all details about the listed methods; instead, we present an overview of how the methods can be sorted into meaningful categories and sub-categories. This helps reveal links and fundamental similarities between them. Finally, we include practical recommendations both for users and for developers of new regularization methods. Regularization for Deep Learning: A Taxonomy

Continue Reading…

Collapse

Read More

Whats new on arXiv

Net2Vis: Transforming Deep Convolutional Networks into Publication-Ready Visualizations

To properly convey neural network architectures in publications, appropriate visualization techniques are of great importance. While most current deep learning papers contain such visualizations, these are usually handcrafted, which results in a lack of a common visual grammar, as well as a significant time investment. Since these visualizations are often crafted just before publication, they are also prone to contain errors, might deviate from the actual architecture, and are sometimes ambiguous to interpret. Current automatic network visualization toolkits focus on debugging the network itself, and are therefore not ideal for generating publication-ready visualizations, as they cater to a different level of detail. Therefore, we present an approach to automate this process by translating network architectures specified in Python into publication-ready network visualizations that can directly be embedded into any publication. To improve the readability of these visualizations, and in order to make them comparable, the generated visualizations obey a visual grammar, which we have derived based on the analysis of existing network visualizations. Besides carefully crafted visual encodings, our grammar also incorporates abstraction through layer accumulation, as it is often done to reduce the complexity of the network architecture to be communicated. Thus, our approach not only reduces the time needed to generate publication-ready network visualizations, but also enables a unified and unambiguous visualization design.


Divergence-Based Motivation for Online EM and Combining Hidden Variable Models

Expectation-Maximization (EM) is the fallback method for parameter estimation of hidden (aka latent) variable models. Given the full batch of data, EM forms an upper-bound of the negative log-likelihood of the model at each iteration and then updates to the minimizer of this upper-bound. We introduce a versatile online variant of EM where the data arrives as a stream. Our motivation is based on the relative entropy divergences between two joint distributions over the hidden and visible variables. We view the EM upper-bound as a Monte Carlo approximation of an expectation and show that the joint relative entropy divergence induces a similar expectation form. As a result, we employ the divergence to the old model as the inertia term to motivate our online EM algorithm. Our motivation is more widely applicable than previous ones and leads to simple online updates for mixtures of exponential distributions, hidden Markov models, and the first known online update for Kalman filters. Additionally, the finite sample form of the inertia term lets us derive online updates when there is no closed form solution. Experimentally, sweeping the data with an online update converges much faster than the batch update. Our divergence-based methods also lead to a simple way to combine hidden variable models, and this immediately gives efficient algorithms for the distributed setting.


CPOI: A Compact Method to Archive Versioned RDF Triple-Sets

Large amounts of RDF/S data are produced and published lately, and several modern applications require the provision of versioning and archiving services over such datasets. In this paper we propose a novel storage index for archiving versions of such datasets, called CPOI (compact partial order index), that exploits the fact that an RDF Knowledge Base (KB) is a graph (or equivalently a set of triples), and thus it does not have a unique serialization (unlike text). If we want to keep several versions stored, we actually want to store multiple sets of triples. CPOI is a data structure for storing such sets aiming at reducing the storage space, since this is important not only for reducing storage costs, but also for reducing the various communication costs and enabling hosting in main memory (and thus processing efficiently) large quantities of data. CPOI is based on a partial order structure over sets of triple identifiers, where the triple identifiers are represented in a gapped form using variable length encoding schemes. For this index we evaluate analytically and experimentally various identifier assignment techniques and their space savings. The results show significant storage savings; specifically, the storage space of the compressed sets in large and realistic synthetic datasets is about 8% of the size of the uncompressed sets.


ReStoCNet: Residual Stochastic Binary Convolutional Spiking Neural Network for Memory-Efficient Neuromorphic Computing

In this work, we propose ReStoCNet, a residual stochastic multilayer convolutional Spiking Neural Network (SNN) composed of binary kernels, to reduce the synaptic memory footprint and enhance the computational efficiency of SNNs for complex pattern recognition tasks. ReStoCNet consists of an input layer followed by stacked convolutional layers for hierarchical input feature extraction, pooling layers for dimensionality reduction, and fully-connected layer for inference. In addition, we introduce residual connections between the stacked convolutional layers to improve the hierarchical feature learning capability of deep SNNs. We propose Spike Timing Dependent Plasticity (STDP) based probabilistic learning algorithm, referred to as Hybrid-STDP (HB-STDP), incorporating Hebbian and anti-Hebbian learning mechanisms, to train the binary kernels forming ReStoCNet in a layer-wise unsupervised manner. We demonstrate the efficacy of ReStoCNet and the presented HB-STDP based unsupervised training methodology on the MNIST and CIFAR-10 datasets. We show that residual connections enable the deeper convolutional layers to self-learn useful high-level input features and mitigate the accuracy loss observed in deep SNNs devoid of residual connections. The proposed ReStoCNet offers >20x kernel memory compression compared to full-precision (32-bit) SNN while yielding high enough classification accuracy on the chosen pattern recognition tasks.


Domain Constraint Approximation based Semi Supervision

Deep learning for supervised learning has achieved astonishing performance in various machine learning applications. However, annotated data is expensive and rare. In practice, only a small portion of data samples are annotated. Pseudo-ensembling-based approaches have achieved state-of-the-art results in computer vision related tasks. However, they still rely on the quality of an initial model built from labeled data. Less labeled data may degrade model performance a lot. Domain constraints are another way to regularize the posterior but have some limitations. In this paper, we propose a fuzzy domain-constraint-based framework which removes the requirements of traditional constraint learning and enhances the model quality for semi-supervision. Simulation results show the effectiveness of our design.


Stochastic Reinforcement Learning

In reinforcement learning episodes, the rewards and punishments are often non-deterministic, and there are invariably stochastic elements governing the underlying situation. Such stochastic elements are often numerous and cannot be known in advance, and they have a tendency to obscure the underlying rewards and punishments patterns. Indeed, if stochastic elements were absent, the same outcome would occur every time and the learning problems involved could be greatly simplified. In addition, in most practical situations, the cost of an observation to receive either a reward or punishment can be significant, and one would wish to arrive at the correct learning conclusion by incurring minimum cost. In this paper, we present a stochastic approach to reinforcement learning which explicitly models the variability present in the learning environment and the cost of observation. Criteria and rules for learning success are quantitatively analyzed, and probabilities of exceeding the observation cost bounds are also obtained.


Performance Dynamics and Termination Errors in Reinforcement Learning: A Unifying Perspective

In reinforcement learning, a decision needs to be made at some point as to whether it is worthwhile to carry on with the learning process or to terminate it. In many such situations, stochastic elements are often present which govern the occurrence of rewards, with the sequential occurrences of positive rewards randomly interleaved with negative rewards. For most practical learners, the learning is considered useful if the number of positive rewards always exceeds the negative ones. A situation that often calls for learning termination is when the number of negative rewards exceeds the number of positive rewards. However, while this seems reasonable, the error of premature termination, whereby termination is enacted along with the conclusion of learning failure despite the positive rewards eventually far outnumbering the negative ones, can be significant. In this paper, using combinatorial analysis we study the error probability of wrongly terminating a reinforcement learning activity, which undermines the effectiveness of an optimal policy, and we show that the resultant error can be quite high. Whilst we demonstrate mathematically that such errors can never be eliminated, we propose some practical mechanisms that can effectively reduce such errors. Simulation experiments have been carried out, the results of which are in close agreement with our theoretical findings.


LS-Tree: Model Interpretation When the Data Are Linguistic

We study the problem of interpreting trained classification models in the setting of linguistic data sets. Leveraging a parse tree, we propose to assign least-squares based importance scores to each word of an instance by exploiting syntactic constituency structure. We establish an axiomatic characterization of these importance scores by relating them to the Banzhaf value in coalitional game theory. Based on these importance scores, we develop a principled method for detecting and quantifying interactions between words in a sentence. We demonstrate that the proposed method can aid in interpretability and diagnostics for several widely-used language models.


MaCow: Masked Convolutional Generative Flow

Flow-based generative models, conceptually attractive due to tractability of both the exact log-likelihood computation and latent-variable inference, and efficiency of both training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. Despite their computational efficiency, the density estimation performance of flow-based generative models significantly falls behind those of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MaCow), a simple yet effective architecture of generative flow using masked convolution. By restricting the local connectivity in a small kernel, MaCow enjoys the properties of fast and stable training, and efficient sampling, while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap to autoregressive models.


Multi-objective Bayesian optimisation with preferences over objectives

We present a Bayesian multi-objective optimisation algorithm that allows the user to express preference-order constraints on the objectives of the type `objective A is more important than objective B’. Rather than attempting to find a representative subset of the complete Pareto front, our algorithm searches for and returns only those Pareto-optimal points that satisfy these constraints. We formulate a new acquisition function based on expected improvement in dominated hypervolume (EHI) to ensure that the subset of Pareto front satisfying the constraints is thoroughly explored. The hypervolume calculation only includes those points that satisfy the preference-order constraints, where the probability of a point satisfying the constraints is calculated from a gradient Gaussian Process model. We demonstrate our algorithm on both synthetic and real-world problems.


Effects of empathy on the evolution of fairness in group-structured populations

The ultimatum game has been a prominent paradigm in studying the evolution of fairness. It predicts that responders should accept any nonzero offer and proposers should offer the smallest possible amount according to orthodox game theory. However, the prediction strongly contradicts with experimental findings where responders usually reject low offers below 20\% and proposers usually make higher offers than expected. To explain the evolution of such fair behaviors, we here introduce empathy in group-structured populations by allowing a proportion \alpha of the population to play empathetic strategies. Interestingly, we find that for high mutation probabilities, the mean offer decreases with \alpha and the mean demand increases, implying empathy inhibits the evolution of fairness. For low mutation probabilities, the mean offer and demand approach to the fair ones with increasing \alpha, implying empathy promotes the evolution of fairness. Furthermore, under both weak and strong intensities of natural selection, we analytically calculate the mean offer and demand for different levels of \alpha. Counterintuitively, we demonstrate that although a higher mutation probability leads to a higher level of fairness under weak selection, an intermediate mutation probability corresponds to the lowest level of fairness under strong selection. Our study provides systematic insights into the evolutionary origin of fairness in group-structured populations with empathetic strategies.


Adversarial Samples on Android Malware Detection Systems for IoT Systems

Many IoT (Internet of Things) systems run Android systems or Android-like systems. With the continuous development of machine learning algorithms, the learning-based Android malware detection system for IoT devices has gradually increased. However, these learning-based detection models are often vulnerable to adversarial samples. An automated testing framework is needed to help these learning-based malware detection systems for IoT devices perform security analysis. The current methods of generating adversarial samples mostly require training parameters of models and most of the methods are aimed at image data. To solve this problem, we propose a \textbf{t}esting framework for \textbf{l}earning-based \textbf{A}ndroid \textbf{m}alware \textbf{d}etection systems (TLAMD) for IoT Devices. The key challenge is how to construct a suitable fitness function to generate an effective adversarial sample without affecting the features of the application. By introducing genetic algorithms and some technical improvements, our test framework can generate adversarial samples for the IoT Android Application with a success rate of nearly 100\% and can perform black-box testing on the system.


VERIFAI: A Toolkit for the Design and Analysis of Artificial Intelligence-Based Systems

We present VERIFAI, a software toolkit for the formal design and analysis of systems that include artificial intelligence (AI) and machine learning (ML) components. VERIFAI particularly seeks to address challenges with applying formal methods to perception and ML components, including those based on neural networks, and to model and analyze system behavior in the presence of environment uncertainty. We describe the initial version of VERIFAI which centers on simulation guided by formal models and specifications. Several use cases are illustrated with examples, including temporal-logic falsification, model-based systematic fuzz testing, parameter synthesis, counterexample analysis, and data set augmentation.


Thompson Sampling with Information Relaxation Penalties

We consider a finite time horizon multi-armed bandit (MAB) problem in a Bayesian framework, for which we develop a general set of control policies that leverage ideas from information relaxations of stochastic dynamic optimization problems. In crude terms, an information relaxation allows the decision maker (DM) to have access to the future (unknown) rewards and incorporate them in her optimization problem to pick an action at time t, but penalizes the decision maker for using this information. In our setting, the future rewards allow the DM to better estimate the unknown mean reward parameters of the multiple arms, and optimize her sequence of actions. By picking different information penalties, the DM can construct a family of policies of increasing complexity that, for example, include Thompson Sampling and the true optimal (but intractable) policy as special cases. We systematically develop this framework of information relaxation sampling, propose an intuitive family of control policies for our motivating finite time horizon Bayesian MAB problem, and prove associated structural results and performance bounds. Numerical experiments suggest that this new class of policies performs well, in particular in settings where the finite time horizon introduces significant tension in the problem. Finally, inspired by the finite time horizon Gittins index, we propose an index policy that builds on our framework and particularly outperforms the state-of-the-art algorithms in our numerical experiments.


A Theory of Selective Prediction

We consider a model of selective prediction, where the prediction algorithm is given a data sequence in an online fashion and asked to predict a pre-specified statistic of the upcoming data points. The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far. We prove that, even without any distributional assumption on the input data stream, a large family of statistics can be estimated to non-trivial accuracy. To give one concrete example, suppose that we are given access to an arbitrary binary sequence x_1, \ldots, x_n of length n. Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some t < n and m \le n - t, after seeing t observations we predict the average of x_{t+1}, \ldots, x_{t+m}. We show that the expected squared error of our prediction can be bounded by O\left(\frac{1}{\log n}\right), and prove a matching lower bound. This result holds for any sequence (that is not adaptive to when the prediction is made, or the predicted value), and the expectation of the error is with respect to the randomness of the prediction algorithm. Our results apply to more general statistics of a sequence of observations, and we highlight several open directions for future work.


Deep Reinforcement Learning from Policy-Dependent Human Feedback

To widen their accessibility and increase their utility, intelligent agents must be able to learn complex behaviors as specified by (non-expert) human users. Moreover, they will need to learn these behaviors within a reasonable amount of time while efficiently leveraging the sparse feedback a human trainer is capable of providing. Recent work has shown that human feedback can be characterized as a critique of an agent’s current behavior rather than as an alternative reward signal to be maximized, culminating in the COnvergent Actor-Critic by Humans (COACH) algorithm for making direct policy updates based on human feedback. Our work builds on COACH, moving to a setting where the agent’s policy is represented by a deep neural network. We employ a series of modifications on top of the original COACH algorithm that are critical for successfully learning behaviors from high-dimensional observations, while also satisfying the constraint of obtaining reduced sample complexity. We demonstrate the effectiveness of our Deep COACH algorithm in the rich 3D world of Minecraft with an agent that learns to complete tasks by mapping from raw pixels to actions using only real-time human feedback in 10-15 minutes of interaction.


NAIL: A General Interactive Fiction Agent

Interactive Fiction (IF) games are complex textual decision making problems. This paper introduces NAIL, an autonomous agent for general parser-based IF games. NAIL won the 2018 Text Adventure AI Competition, where it was evaluated on twenty unseen games. This paper describes the architecture, development, and insights underpinning NAIL’s performance.


Table2answer: Read the database and answer without SQL

Semantic parsing is the task of mapping natural language to logic form. In question answering, semantic parsing can be used to map the question to logic form and execute the logic form to get the answer. One key problem for semantic parsing is the hard work of labeling. We study this problem in another way: we do not use the logic form any more. Instead we only use the schema and answer info. We think that the logic form step can be injected into the deep model. The reason why we think removing the logic form step is possible is that humans can do the task without an explicit logic form. We use a BERT-based model and do the experiment on the WikiSQL dataset, which is a large natural language to SQL dataset. Our experimental evaluations show that our model achieves the baseline results on the WikiSQL dataset.


WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks

Wikipedia articles contain multiple links connecting a subject to other pages of the encyclopedia. In Wikipedia parlance, these links are called internal links or wikilinks. We present a complete dataset of the network of internal Wikipedia links for the 9 largest language editions. The dataset contains yearly snapshots of the network and spans 17 years, from the creation of Wikipedia in 2001 to March 1st, 2018. While previous work has mostly focused on the complete hyperlink graph which includes also links automatically generated by templates, we parsed each revision of each article to track links appearing in the main text. In this way we obtained a cleaner network, discarding more than half of the links and representing all and only the links intentionally added by editors. We describe in detail how the Wikipedia dumps have been processed and the challenges we have encountered, including the need to handle special pages such as redirects, i.e., alternative article titles. We present descriptive statistics of several snapshots of this network. Finally, we propose several research opportunities that can be explored using this new dataset.


Generalized Lineage-Aware Temporal Windows: Supporting Outer and Anti Joins in Temporal-Probabilistic Databases

The result of a temporal-probabilistic (TP) join with negation includes, at each time point, the probability with which a tuple of a positive relation {\bf p} matches none of the tuples in a negative relation {\bf n}, for a given join condition \theta. TP outer and anti joins thus resemble the characteristics of relational outer and anti joins also in the case when there exist time points at which input tuples from {\bf p} have non-zero probabilities to be true and input tuples from {\bf n} have non-zero probabilities to be false, respectively. For the computation of TP joins with negation, we introduce generalized lineage-aware temporal windows, a mechanism that binds an output interval to the lineages of all the matching valid tuples of each input relation. We group the windows of two TP relations into three disjoint sets based on the way attributes, lineage expressions and intervals are produced. We compute all windows in an incremental manner, and we show that pipelined computations allow for the direct integration of our approach into PostgreSQL. We thereby alleviate the prevalent redundancies in the interval computations of existing approaches, which is proven by an extensive experimental evaluation with real-world datasets.


Verification Code Recognition Based on Active and Deep Learning

A verification code is an automated test method used to distinguish between humans and computers. Humans can easily identify verification codes, whereas machines cannot. With the development of convolutional neural networks, automatically recognizing a verification code is now possible for machines. However, the advantages of convolutional neural networks depend on the data used by the training classifier, particularly the size of the training set. Therefore, identifying a verification code using a convolutional neural network is difficult when training data are insufficient. This study proposes an active and deep learning strategy to obtain new training data on a special verification code set without manual intervention. A feature learning model for a scene with less training data is presented in this work, and the verification code is identified by the designed convolutional neural network. Experiments show that the method can considerably improve the recognition accuracy of a neural network when the amount of initial training data is small.


TensorSCONE: A Secure TensorFlow Framework using Intel SGX

Machine learning has become a critical component of modern data-driven online services. Typically, the training phase of machine learning techniques requires processing large-scale datasets which may contain private and sensitive information of customers. This imposes significant security risks since modern online services rely on cloud computing to store and process the sensitive data. In the untrusted computing infrastructure, security is becoming a paramount concern since the customers need to trust the third-party cloud provider. Unfortunately, this trust has been violated multiple times in the past. To overcome the potential security risks in the cloud, we answer the following research question: how to enable secure executions of machine learning computations in the untrusted infrastructure? To achieve this goal, we propose a hardware-assisted approach based on Trusted Execution Environments (TEEs), specifically Intel SGX, to enable secure execution of the machine learning computations over the private and sensitive datasets. More specifically, we propose a generic and secure machine learning framework based on Tensorflow, which enables secure execution of existing applications on the commodity untrusted infrastructure. In particular, we have built our system called TensorSCONE from the ground up by integrating TensorFlow with SCONE, a shielded execution framework based on Intel SGX. The main challenge of this work is to overcome the architectural limitations of Intel SGX in the context of building a secure TensorFlow system. Our evaluation shows that we achieve reasonable performance overheads while providing strong security properties with low TCB.


Examining Adversarial Learning against Graph-based IoT Malware Detection Systems

The main goal of this study is to investigate the robustness of graph-based Deep Learning (DL) models used for Internet of Things (IoT) malware classification against Adversarial Learning (AL). We designed two approaches to craft adversarial IoT software, including Off-the-Shelf Adversarial Attack (OSAA) methods, using six different AL attack approaches, and Graph Embedding and Augmentation (GEA). The GEA approach aims to preserve the functionality and practicality of the generated adversarial sample through a careful embedding of a benign sample to a malicious one. Our evaluations demonstrate that OSAAs are able to achieve a misclassification rate (MR) of 100%. Moreover, we observed that the GEA approach is able to misclassify all IoT malware samples as benign.


Learning interpretable continuous-time models of latent stochastic dynamical systems

We develop an approach to learn an interpretable semi-parametric model of a latent continuous-time stochastic dynamical system, assuming noisy high-dimensional outputs sampled at uneven times. The dynamics are described by a nonlinear stochastic differential equation (SDE) driven by a Wiener process, with a drift evolution function drawn from a Gaussian process (GP) conditioned on a set of learnt fixed points and corresponding local Jacobian matrices. This form yields a flexible nonparametric model of the dynamics, with a representation corresponding directly to the interpretable portraits routinely employed in the study of nonlinear dynamical systems. The learning algorithm combines inference of continuous latent paths underlying observed data with a sparse variational description of the dynamical process. We demonstrate our approach on simulated data from different nonlinear dynamical systems.


Joint Training of Neural Network Ensembles

We examine the practice of joint training for neural network ensembles, in which a multi-branch architecture is trained via single loss. This approach has recently gained traction, with claims of greater accuracy per parameter along with increased parallelism. We introduce a family of novel loss functions generalizing multiple previously proposed approaches, with which we study theoretical and empirical properties of joint training. These losses interpolate smoothly between independent and joint training of predictors, demonstrating that joint training has several disadvantages not observed in prior work. However, with appropriate regularization via our proposed loss, the method shows new promise in resource limited scenarios and fault-tolerant systems, e.g., IoT and edge devices. Finally, we discuss how these results may have implications for general multi-branch architectures such as ResNeXt and Inception.


Performance of All-Pairs Shortest-Paths Solvers with Apache Spark

Algorithms for computing All-Pairs Shortest-Paths (APSP) are critical building blocks underlying many practical applications. The standard sequential algorithms, such as Floyd-Warshall and Johnson, quickly become infeasible for large input graphs, necessitating parallel approaches. In this work, we provide detailed analysis of parallel APSP performance on distributed memory clusters with Apache Spark. The Spark model allows for a portable and easy to deploy distributed implementation, and hence is attractive from the end-user point of view. We propose four different APSP implementations for large undirected weighted graphs, which differ in complexity and degree of reliance on techniques outside of pure Spark API. We demonstrate that Spark is able to handle APSP problems with over 200,000 vertices on a 1024-core cluster, and can compete with a naive MPI-based solution. However, our best performing solver requires auxiliary shared persistent storage, and is over two times slower than optimized MPI-based solver.


Machine Reading Comprehension for Answer Re-Ranking in Customer Support Chatbots

Recent advances in deep neural networks, language modeling and language generation have introduced new ideas to the field of conversational agents. As a result, deep neural models such as sequence-to-sequence, Memory Networks, and the Transformer have become key ingredients of state-of-the-art dialog systems. While those models are able to generate meaningful responses even in unseen situations, they need a lot of training data to build a reliable model. Thus, most real-world systems stick to traditional approaches based on information retrieval and even hand-crafted rules, due to their robustness and effectiveness, especially for narrow-focused conversations. Here, we present a method that adapts a deep neural architecture from the domain of machine reading comprehension to re-rank the suggested answers from different models using the question as context. We train our model using negative sampling based on question-answer pairs from the Twitter Customer Support Dataset. The experimental results show that our re-ranking framework can improve the performance in terms of word overlap and semantics both for individual models as well as for model combinations.


A Domain Generalization Perspective on Listwise Context Modeling

As one of the most popular techniques for solving the ranking problem in information retrieval, Learning-to-rank (LETOR) has received a lot of attention both in academia and industry due to its importance in a wide variety of data mining applications. However, most of existing LETOR approaches choose to learn a single global ranking function to handle all queries, and ignore the substantial differences that exist between queries. In this paper, we propose a domain generalization strategy to tackle this problem. We propose Query-Invariant Listwise Context Modeling (QILCM), a novel neural architecture which eliminates the detrimental influence of inter-query variability by learning \textit{query-invariant} latent representations, such that the ranking system could generalize better to unseen queries. We evaluate our techniques on benchmark datasets, demonstrating that QILCM outperforms previous state-of-the-art approaches by a substantial margin.


Capacity allocation analysis of neural networks: A tool for principled architecture design

Designing neural network architectures is a task that lies somewhere between science and art. For a given task, some architectures are eventually preferred over others, based on a mix of intuition, experience, experimentation and luck. For many tasks, the final word is attributed to the loss function, while for some others a further perceptual evaluation is necessary to assess and compare performance across models. In this paper, we introduce the concept of capacity allocation analysis, with the aim of shedding some light on what network architectures focus their modelling capacity on, when used on a given task. We focus more particularly on spatial capacity allocation, which analyzes a posteriori the effective number of parameters that a given model has allocated for modelling dependencies on a given point or region in the input space, in linear settings. We use this framework to perform a quantitative comparison between some classical architectures on various synthetic tasks. Finally, we consider how capacity allocation might translate in non-linear settings.


The Cost of Privacy: Optimal Rates of Convergence for Parameter Estimation with Differential Privacy

Privacy-preserving data analysis is a rising challenge in contemporary statistics, as the privacy guarantees of statistical methods are often achieved at the expense of accuracy. In this paper, we investigate the tradeoff between statistical accuracy and privacy in mean estimation and linear regression, under both the classical low-dimensional and modern high-dimensional settings. A primary focus is to establish minimax optimality for statistical estimation with the (\varepsilon,\delta)-differential privacy constraint. To this end, we find that classical lower bound arguments fail to yield sharp results, and new technical tools are called for. We first develop a general lower bound argument for estimation problems with differential privacy constraints, and then apply the lower bound argument to mean estimation and linear regression. For these statistical problems, we also design computationally efficient algorithms that match the minimax lower bound up to a logarithmic factor. In particular, for the high-dimensional linear regression, a novel private iterative hard thresholding pursuit algorithm is proposed, based on a privately truncated version of stochastic gradient descent. The numerical performance of these algorithms is demonstrated by simulation studies and applications to real data containing sensitive information, for which privacy-preserving statistical methods are necessary.


Fast-SCNN: Fast Semantic Segmentation Network

The encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024x2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our `learning to downsample’ module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.


RTbust: Exploiting Temporal Patterns for Botnet Detection on Twitter

Within OSNs, many of our supposedly online friends may instead be fake accounts called social bots, part of large groups that purposely re-share targeted content. Here, we study retweeting behaviors on Twitter, with the ultimate goal of detecting retweeting social bots. We collect a dataset of 10M retweets. We design a novel visualization that we leverage to highlight benign and malicious patterns of retweeting activity. In this way, we uncover a ‘normal’ retweeting pattern that is peculiar of human-operated accounts, and 3 suspicious patterns related to bot activities. Then, we propose a bot detection technique that stems from the previous exploration of retweeting behaviors. Our technique, called Retweet-Buster (RTbust), leverages unsupervised feature extraction and clustering. An LSTM autoencoder converts the retweet time series into compact and informative latent feature vectors, which are then clustered with a hierarchical density-based algorithm. Accounts belonging to large clusters characterized by malicious retweeting patterns are labeled as bots. RTbust obtains excellent detection results, with F1 = 0.87, whereas competitors achieve F1 < 0.76. Finally, we apply RTbust to a large dataset of retweets, uncovering 2 previously unknown active botnets with hundreds of accounts.


Binary Stochastic Filtering: a Solution for Supervised Feature Selection and Neural Network Shape Optimization

Binary Stochastic Filtering (BSF), an algorithm for feature selection and neuron pruning, is proposed in this work. The filtering layer stochastically passes or filters out features based on individual weights, which are tuned during the neural network training process. By placing BSF after the neural network input, the filtering of input features is performed, i.e. feature selection. More than a 5-fold dimensionality decrease was achieved in the experiments. Placing a BSF layer in between hidden layers allows filtering of neuron outputs and could be used for neuron pruning. Up to a 34-fold decrease in the number of weights in the network was reached, which corresponds to a significant increase in performance that is especially important for mobile and embedded applications.


Bayesian Online Detection and Prediction of Change Points

Online detection of instantaneous changes in the generative process of a data sequence generally focuses on retrospective inference of such change points without considering their future occurrences. We extend the Bayesian Online Change Point Detection algorithm to also infer the number of time steps until the next change point (i.e., the residual time). This enables us to handle observation models which depend on the total segment duration, which is useful to model data sequences with temporal scaling. In addition, we extend the model by removing the i.i.d. assumption on the observation model parameters. The resulting inference algorithm for segment detection can be deployed in an online fashion, and we illustrate applications to synthetic and to two medical real-world data sets.


ACTRCE: Augmenting Experience via Teacher’s Advice For Multi-Goal Reinforcement Learning

Sparse reward is one of the most challenging problems in reinforcement learning (RL). Hindsight Experience Replay (HER) attempts to address this issue by converting a failed experience to a successful one by relabeling the goals. Despite its effectiveness, HER has limited applicability because it lacks a compact and universal goal representation. We present Augmenting experienCe via TeacheR’s adviCE (ACTRCE), an efficient reinforcement learning technique that extends the HER framework using natural language as the goal representation. We first analyze the differences among goal representation, and show that ACTRCE can efficiently solve difficult reinforcement learning problems in challenging 3D navigation tasks, whereas HER with non-language goal representation failed to learn. We also show that with language goal representations, the agent can generalize to unseen instructions, and even generalize to instructions with unseen lexicons. We further demonstrate it is crucial to use hindsight advice to solve challenging tasks, and even small amount of advice is sufficient for the agent to achieve good performance.


Measure of quality of finite-dimensional linear systems: A frame-theoretic view

A measure of quality of a control system is a quantitative extension of the classical binary notion of controllability. In this article we study the quality of linear control systems from a frame-theoretic perspective. We demonstrate that all LTI systems naturally generate a frame on their state space, and that three standard measures of quality involving the trace, minimum eigenvalue, and the determinant of the controllability Gramian achieve their optimum values when this generated frame is tight. Motivated by this, and in view of some recent developments in frame-theoretic signal processing, we propose a natural measure of quality for continuous time LTI systems based on a measure of tightness of the frame generated by it and then discuss some properties of this frame-theoretic measure of quality.


Infinite Mixture Prototypes for Few-Shot Learning

We propose infinite mixture prototypes to adaptively represent both simple and complex data distributions for few-shot learning. Our infinite mixture prototypes represent each class by a set of clusters, unlike existing prototypical methods that represent each class by a single cluster. By inferring the number of clusters, infinite mixture prototypes interpolate between nearest neighbor and prototypical representations, which improves accuracy and robustness in the few-shot regime. We show the importance of adaptive capacity for capturing complex data distributions such as alphabets, with 25% absolute accuracy improvements over prototypical networks, while still maintaining or improving accuracy on the standard Omniglot and mini-ImageNet benchmarks. In clustering labeled and unlabeled data by the same clustering rule, infinite mixture prototypes achieves state-of-the-art semi-supervised accuracy. As a further capability, we show that infinite mixture prototypes can perform purely unsupervised clustering, unlike existing prototypical methods.

Continue Reading…

Collapse

Read More

Databricks Security Advisory: Critical Runc Vulnerability (CVE-2019-5736)

Databricks became aware of a new critical runc vulnerability (CVE-2019-5736) on February 12, 2019 that allows malicious container users to gain root access to the host operating system. This vulnerability affects many container runtimes, including Docker and LXC. The Databricks security team has evaluated the vulnerability and confirmed that, due to the Databricks platform architecture, there is no external vector by which an attacker could exploit the flaw to gain access to the host VM on which the containers reside.  Additionally, our architecture isolates each customer by providing each customer with a separate host VM located within the customer’s cloud services account, so this exploit would not permit any cross-customer access, even if the underlying container were compromised.

This CVE includes two attack vectors:

  • Creating a new container using an attacker-controlled image.

Databricks only launches containers built by the Databricks engineering team, so malicious external users have no way of launching their own image.

  • Attaching to an existing container which the attacker had previous write access to.

Only Databricks services can attach to existing containers. Users access containers through RPCs, and cannot attach to existing containers.

Though we believe the vulnerability is unlikely to be practically exploitable in our environment, Databricks engineering will push a hotfix that will be deployed as soon as reasonably possible.

How does the exploit work in detail?

The exploit tries to compromise the container runtime binary to gain root access to the host. The container runtime is a binary program that runs on the host system and orchestrates the process execution inside the container. It is designed to ensure that the container’s processes are run in their own isolated namespace and with reduced privilege. On Docker, the default container runtime is the runC binary, and on LXC it is the miscellaneous lxc-* utilities.

Take lxc-attach as an example, a malicious user can mount the attack with the following steps:

  • Replace a target binary inside the container with custom content that points back to the lxc-attach binary itself. For example, one can replace the container’s /bin/bash with the following content:

#!/proc/self/exe
<injected malicious payload goes here>

In this way, /bin/bash (container path) becomes an executable script using /proc/self/exe to interpret its malicious content. Note that /proc/self/exe is a symbolic link created by the kernel for every process which points to the binary that was executed for that process.

  • Trick the container runtime into executing the target binary from the host system. When /bin/bash is then executed inside the container, the target of /proc/self/exe is executed instead, which points to the container runtime binary on the host. In the example, when the attacker uses lxc-attach to run a command inside the container, lxc-attach invokes the container’s /bin/bash using the execve() syscall, which in turn runs /proc/self/exe, i.e. lxc-attach itself, to interpret the injected malicious payload.
  • Proceed to write to the target of /proc/self/exe so as to overwrite the lxc-attach binary on the host. In general, however, this will not succeed as the kernel will not permit it to be overwritten while lxc-attach is executing. To overcome this, the attacker can instead open /proc/self/exe using the O_PATH flag to get a file descriptor <fd> and then reopen the binary as O_WRONLY through /proc/self/fd/<fd> and try to write to it in a busy loop from a newly forked subprocess. Eventually, it will succeed when the parent lxc-attach process exits. After this the lxc-attach binary on the host is compromised and can be used to attack other containers or the host itself. The rewriting logic can be done from the malicious payload injected to the target binary in step 1.

Therefore, there are 3 major conditions to enable the attack:

  1. The attacker must have or gain control of the content of the image in order to replace the target binary inside the container. This is achievable if the attacker controls the container image or previously had write access to the container.
  2. The attacker must be able to invoke the container runtime on the host system through some external channel. This is the case if the host system exposes an API layer (e.g., kubelet API server) that allows users to invoke the container runtime binary indirectly. For example, if there’s an API allowing a remote user to launch a container with a custom image, or to attach to a running container using lxc-attach or docker exec.
  3. The attacker must have permission to overwrite the content of the host’s container runtime binary from the container. This is possible if the container is running as a privileged user on the host system, but impossible if it is running as an unprivileged user.

Databricks only exposes an API to launch containers with trusted Databricks Runtime images released by our engineering team, and these containers are not subject to modification by users prior to being attached or created.  Since an image that was modified after creation cannot be used to take advantage of this exploit, the trusted container status renders the Databricks standard architecture unaffected. Additionally, Databricks workspace users access containers through an RPC server running inside the container, and so cannot attach to existing containers using low-level container runtime binary.

 


The post Databricks Security Advisory: Critical Runc Vulnerability (CVE-2019-5736) appeared first on Databricks.

Continue Reading…

Collapse

Read More

How to Cope with the Rise of the Citizen Data Scientist

Gartner predicts that citizen data scientists will surpass data scientists in the amount of advanced analytics produced. Does that mean that Enterprise AI and augmented analytics render the job of a data scientist obsolete? Download this white paper to find out more.

Continue Reading…

Collapse

Read More

Playing With Pipe Notations

Recently Hadley Wickham prescribed pronouncing the magrittr pipe as “then” and using right-assignment as follows:

[screenshot of the tweet from the original post not reproduced here]

I am not sure if it is a good or bad idea. But let’s play with it a bit, and perhaps readers can submit their experience and opinions in the comments section.

Right assignment

Right assignment is a bit of an oddity in programming languages. Offhand I can think of a few programming languages that use it: COBOL, TI-Basic, and Forth (due to its value-stack notation).

I have written a bit about right assignment in R in the past. And that led to some interesting discussion.

Frankly I thought right assignment was prohibited in Hadley Wickham’s own style guide. But as Gabe Becker taught me a while ago: there isn’t actually any right-arrow in R (so maybe “Use <-, not =, for assignment.” allows ->).

substitute(5 -> x)
# x <- 5

Another point: R pipes are very closely related to right assignment notation, so once you allow right assignment you don’t actually need pipes in the current R sense (though other forms of pipes such as Unix pipes would be a great addition).
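
As a small illustration of this point (a sketch added here, not part of the original post), the toy pipeline used later in this post can be written with right assignment alone, with base transform standing in for dplyr::mutate:

# the same steps as the piped example below, expressed with right assignment only
data.frame(x = 1:3) -> d
transform(d, y = x + 1) -> d2   # base-R stand-in for dplyr::mutate(d, y = x + 1)
knitr::kable(d2)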

“then”

The idea of having a canonical “pronunciation” for symbols is not a new one. It is fairly standard practice in the Unix community (one reference here). The McIlroy Unix pipe “|” (which streams partial results, resulting in very powerful concurrent composition) is said to be read as “pipe”, “pipe to”, “to”, “thru” (and a few more variations). In this era of keyboard shortcuts it is worth considering more verbose piping operators.

Let’s try the idea.

library("dplyr")
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

d <- data.frame(x = 1:3)


`%then%` <- magrittr::`%>%`

d %then%
  mutate(., y = x + 1) %then%
  knitr::kable(.)
#> Error in pipes[[i]]: subscript out of bounds

I’d say this fails on at least two counts: first, “%then%” doesn’t seem grammatical (as d is a noun), and second, magrittr pipes can’t be associated with a new name (as they are implemented by looking for themselves by name in captured unevaluated code).

However, the wrapr dot arrow pipe can take on new names.

Let’s try a variation, using a traditional pronunciation: “to”.

`%to%` <- wrapr::`%.>%`

d %to%
  mutate(., y = x + 1) %to%
  knitr::kable(.)
|  x|  y|
|--:|--:|
|  1|  2|
|  2|  3|
|  3|  4|

Conclusion

I am still not sure about the above notation one way or the other. Notational prescriptions are at best proposals or “requests for comment”, and need to consider context and precedent to be useful.

Continue Reading…

Collapse

Read More

Book Memo: “Domain-Specific Knowledge Graph Construction”

The vast amounts of ontologically unstructured information on the Web, including HTML, XML and JSON documents, natural language documents, tweets, blogs, markups, and even structured documents like CSV tables, all contain useful knowledge that can present a tremendous advantage to the Artificial Intelligence community if extracted robustly, efficiently and semi-automatically as knowledge graphs. Domain-specific Knowledge Graph Construction (KGC) is an active research area that has recently witnessed impressive advances due to machine learning techniques like deep neural networks and word embeddings. This book will synthesize Knowledge Graph Construction over Web Data in an engaging and accessible manner. The book will describe a timely topic for both early- and mid-career researchers. Every year, more papers continue to be published on knowledge graph construction, especially for difficult Web domains. This work would serve as a useful reference, as well as an accessible but rigorous overview of this body of work. The book will present interdisciplinary connections when possible to engage researchers looking for new ideas or synergies. This will allow the book to be marketed in multiple venues and conferences. The book will also appeal to practitioners in industry and data scientists since it will have chapters on both data collection, as well as a chapter on querying and off-the-shelf implementations.

Continue Reading…

Collapse

Read More

Jupyter Community Workshop: Dashboarding with Project Jupyter

We have some exciting news about the Jupyter Community Workshop on dashboarding!

The workshop will be held in Paris, France, from June 3rd to June 6th, 2019. The event is being hosted at the Center for Interdisciplinary Research (CRI), in the heart of Paris.

The workshop committee consists of Maarten Breddels (Freelance), Pascal Bugnion (Faculty.ai), Sylvain Corlay (QuantStack), Alexandre Gramfort (INRIA), and Vidar Tonaas Fauske (Simula).

The workshop will last four days, with hands-on discussions, hacking sessions, and technical presentations. The goal of the event is to foster collaboration and the sharing of knowledge between downstream library authors and contributors, and favor upstream contributions.

Should you be interested in joining us for this workshop, please fill out this Google Form.

In addition to the Community Workshop, we plan on holding a public Meetup on June 5th, in partnership with the PyData Paris Meetup, with a series of lightning talks of Project Jupyter and related projects.

Why a Workshop on Dashboarding?

The Jupyter ecosystem has great tools for teaching, exploration and development. Dashboards allow users to interact with a kernel through interactive controls, plots, maps, etc., and allow researchers and data scientists to share their results with students, with their peers, and with the general public. Currently, users of Jupyter are (mostly) forced towards other Python or R libraries, or they make direct use of front-end technologies or develop directly in JavaScript.

There are existing early technologies that allow serving dashboards based on notebooks, most notably voila. The goal of this workshop is to gather core Jupyter widgets developers, members of the community and users with experience in dashboarding to bring dashboarding to a level where it can be used by all members of the Jupyter ecosystem. Ultimately, we envisage users being able to develop and deploy dashboards entirely within the Jupyter ecosystem.

We will lay the foundations for dashboarding as a first-class citizen in the Jupyter ecosystem.

Acknowledgements

This would not have been possible without the generous support provided by Bloomberg, who made this workshop series possible.

We are also grateful to the CRI for graciously hosting the dashboarding community workshop.


Jupyter Community Workshop: Dashboarding with Project Jupyter was originally published in Jupyter Blog on Medium.

Continue Reading…

Collapse

Read More

PDF Data Extraction: What You Need to Know

In our free guide, we show you how and where you can use extracted data from PDFs, and explain the necessary qualities you should be looking for when evaluating extraction tools.

Continue Reading…

Collapse

Read More

A Guide to Developer, Deep Dive, and Continuous Streaming Applications Talks at Spark + AI Summit

In January 2013 when Stephen O’Grady, an analyst at RedMonk, published “The New Kingmakers: How Developers Conquered the World,” the book’s central argument (then and still now) universally resonated with an emerging open-source community. He convincingly charts developers’ movement “out of the shadows and into the light as new influencers on society’s [technical landscape].”

Armed with their choice of open-source software, and ready to study or contribute source code on GitHub, developers have built data products using open-source technologies that have shaped the industry today. O'Grady cites notable open-source examples that engendered successful software companies, as well as companies that employ open source to build their infrastructure stacks.


He asserts that developers make a difference; they chart the course, like the Kingmakers.

And this April, you can join many of these kingmakers at Spark + AI Summit 2019. Hear and learn from them as they offer insight into how they combine data and AI to build data pipelines, and how they use and extend Apache Spark™ to solve tough data problems.

In this blog, we highlight selected sessions that speak to developers’ endeavors in combining the immense value of data and AI across three tracks: Developer, Deep Dives, and Continuous Streaming Applications.

Developer

Naturally, let's start with the Developer track. Ryan Blue of Netflix, in his talk Improving Apache Spark's Reliability with DataSourceV2, will share Spark's new DataSource V2 API, which allows working with data from tables and streams. With the relevant changes to Spark SQL internals, V2 lets developers build reliable data pipelines from the relevant data sources. For Spark developers writing data source connectors, this is a must-attend talk.

Enhanced in Spark 2.3, columnar storage is an efficient way to store DataFrames. In his talk, In-Memory Storage Evolution in Apache Spark, Dr. Kazuaki Ishizaki, a Spark PMC member and committer and an ACM award winner, will discuss the evolution of in-memory storage: how the Apache Arrow exchange format and Spark's ColumnVector storage enhance Spark SQL access and query performance on DataFrames.

Related to DataFrames and Spark SQL, Messrs DB Tsai and Cesar Delgado of Apple Inc. will address how they handle deeply nested structures by making them first-class citizens in Spark SQL, giving them an immense speed-up in querying and processing huge volumes of data for Apple's Siri virtual assistant. Their talk, Making Nested Columns as First Citizen in Apache Spark SQL, is a good example of how developers can extend Spark SQL.

Which brings us to Spark's extensibility. Among the many features that attract developers to Spark, one is its extensibility with new language bindings or libraries. Messrs Tyson Condie and Rahul Potharaju of Microsoft will explain how they extended Spark with new .NET bindings in their talk: Introducing .NET bindings for Apache Spark.

Yet for all Spark's many merits, its fast-paced adoption, and innovation from the wider community, developers face some challenges: how do you automate testing and assess the quality and performance of new developments? To that end, Messrs Bogdan Ghit and Nicolas Poggi of Databricks will share their work on building a new testing and validation framework for Spark SQL in their talk: Fast and Reliable Apache Spark SQL Engine.

Technical Deep Dives

Since its introduction in 2016 as a track with developer-focused sessions, the technical deep dives track has grown in popularity. It attracts both data engineers and data scientists who want deeper experience with the subject. This year, for example, three sessions stand out.

First, data privacy and protection have become imperative, especially in Europe in light of GDPR. The talk Great Models with Great Privacy: Optimizing ML and AI Over Sensitive Data, from Sim Simeonov, CTO of Swoop, will challenge the assumption that privacy necessarily means worse predictions from ML models by examining production techniques that mitigate this trade-off.

Second, Spark SQL is at the core of Spark's structured APIs, including Structured Streaming, and of its efficient query processing engine. But what enables it? What's under the hood that makes it performant, and why? Messrs Maryann Xue and Takuya Ueshin of Databricks' Apache Spark core team will dive into pipeline execution, whole-stage code generation, memory management, and the internals that make this engine fault-tolerant and performant. Their talk, A Deep Dive into Query Execution Engine of Spark SQL, is a valuable lesson in Spark core internals.

And third, closely related to Spark SQL, is an effort to extend Spark to support graph data in Spark SQL queries, enabling data scientists and engineers to inspect and update graph data. With a proposal underway to integrate this work into an upcoming Spark release, developers Alastair Green and Martin Junghanns from Neo4j will make the case for Cypher, a graph query language for Apache Spark, in their talk: Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apache Spark.

Continuous Applications and Structured Streaming

Structured Streaming has garnered a lot of interest in building end-to-end data pipelines or writing continuous applications that interact in real-time with data and other applications. Three deep-dive talks will give you insight into how.

First is from Tathagata Das of Databricks: Designing Structured Streaming Pipelines—How to Architect Things Right. Second is from Scott Klein of Microsoft: Using Azure Databricks, Structured Streaming & Deep Learning Pipelines, to Monitor 1,000+ Solar Farms in Real-Time. And third is from Brandon Hamric of Eventbrite: Near real-time analytics with Apache Spark: Ingestion, ETL, and Interactive Queries.

Apache Spark Training Sessions

And finally, check out two training courses for big data developers covering, respectively, Apache Spark programming and building scalable data pipelines with Delta, and Spark performance and tuning: APACHE SPARK™ PROGRAMMING AND DELTA and APACHE SPARK™ TUNING AND BEST PRACTICES.

What’s Next

You can also peruse and pick sessions from the schedule. In the next blog, we will share our picks from sessions related to Data Science and Data Engineering tracks.

Read More


--

Try Databricks for free. Get started today.

The post A Guide to Developer, Deep Dive, and Continuous Streaming Applications Talks at Spark + AI Summit appeared first on Databricks.

Continue Reading…

Collapse

Read More

Julia Child (2) vs. Frank Sinatra (3); Dorothy Parker

For yesterday's contest, Jonathan gave a strong argument:

First New Yorker showdown, just to see who will be taking on Veronica Geng in the finals. All the other contestants are just for show. I’m going with Liebling, because Parker wasn’t even the best New Yorker writer of her generation, being edged out by Benchley. Liebling dominated his era. If it comes down to Liebling vs. Geng, we’ll just exhume Harold Ross and make him pick.

But we’re looking for a talker, not a writer, so I’ll have to go with Dzhaughn:

After the Seance, we were chatting about the inspiration for this tournament. I said I thought Bruno was just a minor intellectual swindler rather than a real threat. Dorothy replied:

I used to think Latour was just something on a Schwinn dealer’s list*, but that was before I saw Julia’s child Oscar wildly strong-arm Lance with an ephronedrine-filled syringe merrily down the Streep, past a sidewalk cafe where the turing Pele and big bejeweled #23, in Brooks’ Brothers suits, were yakking over Smirnoff Martinis, eating a pile of franks, caesar salads, and some weirder dishes. James was on the phone, taking the TV network to hell and back over “letting that degenerate George Karl off the hook” for some remark, when, from behind a bush, sudden as a python, out springs teen-aged Babe D.-Z, among others! That geng didn’t look like they were here to serenade us with arias from Yardbird, that jazz oprah about Parker! No, they were there to revolt–air their own grievances–and when he stood to object, Babe just shoved LeBron and all his LeBling back onto LaPlace where he sat: Oof!

A bit of recursion is usually a good plan.

For today it’s the French Chef vs. the Chairman of the Board. Frank’s got a less screechy voice, but Julia should be able to handle the refreshments. Any thoughts?

Again, here’s the bracket and here are the rules.

Continue Reading…

Collapse

Read More

Automatic Machine Learning is broken

We take a look at the arguments against implementing a machine learning solution, and at the occasions when the problem at hand is not an ML problem at all and can instead be solved with optimization, exploratory data analysis, or simple statistics.

Continue Reading…

Collapse

Read More

My talk today (Tues 19 Feb) 2pm at the University of Southern California

At the Center for Economic and Social Research, Dauterive Hall (VPD), room 110, 635 Downey Way, Los Angeles:

The study of American politics as a window into understanding uncertainty in science

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We begin by discussing recent American elections in the context of political polarization, and we consider similarities and differences with European politics. We then discuss statistical challenges in the measurement of public opinion: inference from opinion polls with declining response rates has much in common with challenges in big-data analytics. From here we move to the recent replication crisis in science, and we argue that Bayesian methods are well suited to resolve some of these problems, if researchers can move away from inappropriate demands for certainty. We illustrate with examples in many different fields of research, our own and others’.

Some background reading:

19 things we learned from the 2016 election (with Julia Azari), http://www.stat.columbia.edu/~gelman/research/published/what_learned_in_2016_5.pdf
The mythical swing voter (with Sharad Goel, Doug Rivers, and David Rothschild). http://www.stat.columbia.edu/~gelman/research/published/swingers.pdf
The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. http://www.stat.columbia.edu/~gelman/research/published/incrementalism_3.pdf
Honesty and transparency are not enough. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics14.pdf
The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. http://www.stat.columbia.edu/~gelman/research/published/bayes_management.pdf

The talk will mostly be about statistics, not political science, but it’s good to have a substantive home base when talking about methods.

Continue Reading…

Collapse

Read More

Running R and Python in Jupyter

The Jupyter Project began in 2014 with a focus on interactive and scientific computing. Fast forward 5 years, and Jupyter is now one of the most widely adopted data science IDEs on the market, giving users access to both Python and R.

Continue Reading…

Collapse

Read More

AI Trends that Paved the Way in 2018

With 2018 behind us, it's been amazing to see AI projects gain steam and make a significant impact across industries. In fact, a recent survey by CIO.com reports that 90% of enterprises are actively investing in AI.

What has fueled this innovation is the massive influx of organizations tapping into the potential of their data and the increasing availability of various machine learning technologies and frameworks. Furthermore, the cloud enables a new level of scale to match these massive data volumes without taking a hit on performance. Combined with the exponential growth in data volumes, AI has enabled companies to do amazing things — from accelerating drug discovery through genomics to preventing fraud in the securities market.

So what new trends and advances in 2019 will help address these challenges and move AI adoption further into the mainstream?

We asked some of the most innovative companies in the world this question and many others, to put our fingers on the pulse of where AI stands in the enterprise and gain a better understanding of how AI will continue to disrupt industries.

This blog, the first in a series of posts, highlights what many thought were the most impactful trends and innovations in 2018 and what we should be excited about in 2019.


What big data and AI innovations or trends did you see in 2018 that you were excited about? And how do you think those innovations will continue to evolve and/or gain traction in 2019?

Deep Learning Goes Mainstream

“Keras and tensorflow have been around for a while, but we’re seeing companies spanning the range from innovative startups to massive corporations using DL to unlock new business opportunities. At Quby we used a Deep Learning system in production for the first time in 2018. In 2019 TensorFlow 2.0 will be a huge milestone.”

  • Stephen Galsworthy, Head of Data Science at Quby
The Unification of Analytics

“Companies like Databricks are consistently breaking down and refining the barriers and cost of entry for a truly unified data platform, so the boundaries between data lake, data science, streaming are not only compatible, but seamlessly integrated. You can often forget which part of the platform you’re using. And that’s a really good thing.”

  • Stephen Harrison, Data Science Architect at Rue Gilt Groupe
The Democratization of Machine Learning

“The rise of productizing ML and of tools that make ML far easier and more scalable was a big step forward in 2018. You're seeing lots of products that seek to embed ML into decision making, which makes ML far more accessible by de-emphasizing the algorithm and making it easier to leverage. Some of these are frameworks or automation of ML (MLFlow, Einstein, etc.) and others are whole platforms where ML is the core.

Also, Reinforcement Learning has really taken off, and I think it's incredibly exciting because it helps AI solve far more abstract problems. These haven't made it into many products yet, but I think it's really exciting and shows the future of AI.”

  • Bradley Kent, AVP of Program Analytics at LoyaltyOne
Data Science is Permeating the Line of Business

“In 2018, the discussion on “Explainable AI,” trust and data bias was really encouraging. I believe that it is critical to develop AI that is explainable, provable and transparent. As always the case, this journey towards trusted systems truly starts with the quality of data used for AI training. This renewed focus in 2018 on labeled data that can be verified, validated and explained is exciting for us at Nielsen, as we are relentlessly focused on developing high quality labeled data on consumer behavior. It is exciting that Explainable AI can lay the foundations for AI systems that can be both generalized across use cases and be trusted.”

  • Mainak Mazumdar, Chief Research Officer at Nielsen
Novel Applications of Deep Learning

“There has been a lot of innovation in the last year in the field of deep learning that I am excited about! I think these innovations will create a lot of new AI applications, some of which are already in production and making massive changes in the industry. At Overstock, we use deep learning on multiple products, from email campaigns with predictive taxonomies to personalization modules that infer user style with deep learning. I’m excited to see how the industry, and specifically online retail, integrates more with deep learning and some novel applications that will follow.”

  • Kamelia Aryafar, Chief Algorithm Officer at Overstock

Clearly, 2018 was the year of rapid adoption and democratization of the latest analytics innovations such as deep learning. It was encouraging to hear analytics leaders aligned on the importance and the progress around making analytics and AI simple across the organization. What came across clearly was the concept of unification across all facets of analytics — ensuring all stages of the analytics pipeline, the associated technologies, and the teams involved in data science and engineering, are seamlessly integrated and operating in harmony.

The next installment of this blog series will uncover predictions on the next set of trends and innovations around AI, machine learning, and big data that will surface in 2019.

--

Try Databricks for free. Get started today.

The post AI Trends that Paved the Way in 2018 appeared first on Databricks.

Continue Reading…

Collapse

Read More

I believe this study because it is consistent with my existing beliefs.

Kevin Lewis points us to this.

Continue Reading…

Collapse

Read More

Are BERT Features InterBERTible?

This is a short analysis of the interpretability of BERT contextual word representations. Does BERT learn a semantic vector representation like Word2Vec?

Continue Reading…

Collapse

Read More

Seasonality in NZ voting preference? by @ellis2013nz

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

There was a flurry of activity in the last couple of days on Twitter and the blogosphere, most notably Thomas Lumley’s excellent Stats Chat, relating to whether there is a pro-government bias in surveys of New Zealand voting intention in the summer. As the analysis I’ve seen used my nzelect R package, this motivated me to update it for recent polls.

The nzelect update

nzelect hasn’t been updated on CRAN for some time, because about a year ago I made some major changes to the data model for the historical election results by voting place and I haven’t been able to complete testing and stabilisation of the result. I do hope to do this some time in the next few months. In the meantime, the version on GitHub has the current polling data, and I intend to keep it current. Political polls are very thin on the ground these days for New Zealand, so that’s not too big an ask! I’ve now spent a bit of time tidying it up and adding the three most recent polls, which I’d previously neglected.

Let's start with the basics. One of the reasons I first put the package together several years ago was to help facilitate analysis of relatively long-run trends in political opinion. I wanted to lift analysis above over-interpretation of the last few noisy data points. Here's the expressed voting intention of New Zealanders for the four currently largest parties in Parliament over time:

It's striking, but unsurprising, how support for the Greens and New Zealand First has collapsed since they came into government with Labour (or just before, in the case of the Greens and their disappointing 2017 election campaign). The junior party in a coalition often suffers, as attention and kudos go to the leader of the larger party (in this case, Prime Minister Jacinda Ardern) while the smaller parties' own base deals with the realities and compromises of being in government.

The other interesting (and disappointing, for statisticians and political scientists) observation from this chart is how obviously the number of polls has decreased. Topic for another day (or more likely, someone else to write about).

Here’s the R code for that chart:

# install latest version of nzelect:
devtools::install_github("ellisp/nzelect/pkg1")

# load packages needed including for rest of session
library(nzelect)
library(tidyverse)
library(scales)
library(lubridate)
library(gridExtra)
library(broom)
library(mgcv)

#-----------------------------polling overview---------------------

p1 <- polls %>%
  filter(MidDate > as.Date("2014-11-20") & !is.na(VotingIntention)) %>%
  filter(Party %in% c("National", "Labour")) %>%
  mutate(Party = fct_reorder(Party, VotingIntention, .desc = TRUE),
         Party = fct_drop(Party)) %>%
  ggplot(aes(x = MidDate, y = VotingIntention, colour = Party, linetype = Pollster)) +
  geom_line(alpha = 0.5) +
  geom_point(aes(shape = Pollster)) +
  geom_smooth(aes(group = Party), se = FALSE, colour = "grey15", span = .4) +
  scale_colour_manual(values = parties_v, guide = "none") +
  scale_y_continuous("Voting intention", label = percent) +
  scale_x_date("") +
  facet_wrap(~Party, scales = "fixed") +
  theme(panel.grid.minor = element_blank(),
        legend.position = "none") 

p2 <- polls %>%
  filter(MidDate > as.Date("2014-11-20") & !is.na(VotingIntention)) %>%
  filter(Party %in% c("Green", "NZ First")) %>%
  mutate(Party = fct_reorder(Party, VotingIntention, .desc = TRUE),
         Party = fct_drop(Party)) %>%
  ggplot(aes(x = MidDate, y = VotingIntention, colour = Party, linetype = Pollster)) +
  geom_line(alpha = 0.5) +
  geom_point(aes(shape = Pollster)) +
  geom_smooth(aes(group = Party), se = FALSE, colour = "grey15", span = .4) +
  scale_colour_manual(values = parties_v, guide = "none") +
  scale_y_continuous("Voting intention", label = percent) +
  scale_x_date("") +
  facet_wrap(~Party, scales = "fixed") +
  theme(panel.grid.minor = element_blank()) 

grid.arrange(p1, p2)

There’s obvious interest for supporters of the left-of-centre parties in the combined vote for Labour and the Greens. That suggests the importance of this chart:

(Apologies to red-green colour blind readers for the use of these party colours; the Greens are the lowest of the three lines and Labour the middle.)

It’s clear that to a significant degree electoral support for the two is substitutable, with the green and red lines moving in scissors-like counter directions at several key times since 2010, with the last 24 months just the most dramatic example.

It's also clear that the combined support for the centre-left in New Zealand is pretty strong, having recovered to a point it hasn't been at since several years before the end of the Helen Clark-led government in the 2000s.

Here’s the code for that chart. Note how I use the dates of elections to make a simple data frame of who is in power when, for the background rectangles; and leverage the parties_v vector of colours in nzelect to allocate the official party colours to both the parties’ lines and to the background fill.

#----------------long run-------------
elections <- polls %>%
  filter(Pollster == "Election result") %>%
  distinct(MidDate) %>%
  rename(start_date = MidDate) %>%
  mutate(end_date = lead(start_date, default = as.Date("2020-10-01")),
         pm = c("Labour", "Labour", "National", "National", "National", "Labour"))

p3 <- polls %>%
  as_tibble %>%
  filter(!is.na(VotingIntention)) %>%
  filter(Party %in% c("Green", "Labour")) %>%
  select(Party, MidDate, VotingIntention, Pollster) %>%
  spread(Party, VotingIntention) %>%
  mutate(Combined = Labour + Green) %>%
  gather(Party, VotingIntention, -MidDate, -Pollster) %>%
  mutate(Party = fct_relevel(Party, c("Combined", "Labour"))) %>%
  ggplot() +
  geom_rect(data = elections, ymin = -Inf, ymax = Inf, alpha = 0.1, 
            aes(xmin = start_date, xmax = end_date, fill = pm)) +
  geom_line(aes(x = MidDate, y = VotingIntention, colour = Party)) +
  scale_colour_manual(values = c(parties_v, "Combined" = "black")) +
  scale_fill_manual(values = parties_v) +
  labs(x = "Survey date", y = "Voting intention", 
       colour = "Surveyed voting intention:", fill = "Prime Minister's party:") +
  scale_y_continuous(label = percent_format(accuracy = 2)) +
  ggtitle("Voting intention in New Zealand over a longer period than usually presented",
          "Labour, Greens, and combined")

Seasonality

Now, on to the question of seasonality. I don’t have much to add to the analysis of David Hood on Twitter and Thomas Lumley on StatsChat; I basically agree with their conclusions. I have the advantage of a few more polls because of the update to nzelect this morning.

Here is the expressed intended vote for the party of the Prime Minister over time:

The blue line for the National Party is higher than the equivalent for Labour Prime Ministers because National has tended to form a larger proportion of its governing coalitions than Labour in this time period. I could (and probably should) have added the intended vote for all parties currently in coalition government, but this is actually a pretty complicated thing to do so I’ve gone for the simpler approach for now. So long as we don’t make simplistic comparisons forgetting that New Zealand has a proportional representation system, that is ok for our purposes.

We obviously can't tell anything from this chart about the seasonality; there are too many data points and too much noise. Actually, one thing we can say for sure is that seasonality isn't strong. For very seasonal data such as tourist numbers, the seasonality would be obvious even in a chart like this.

To see if there is a subtle seasonality effect, I tried modelling voting intention for the Prime Minister's party on the month of the year, controlling for the party in power, a smooth trend over time, and whether or not it is an election month (otherwise September and November, with five of the six elections in this period, would certainly cloud the data). I used (as I nearly always do in this situation) Simon Wood's excellent mgcv R package.

Having done that, we can approximate confidence intervals for the impact on voting preference of the month of the year. The next chart shows those estimates:

As the title says, it’s weak evidence of a weak effect, which might be around half a percentage point more positive for the Prime Minister’s party in the summer months than it is in June. Or it might be more than that, or even negative.

Here’s the code for that model and the last two charts:

#------------------------seasonality of govt support------------------
# data frame of voting intention for the lead party of the government 
d <- polls %>%
  as_tibble() %>%
  left_join(elections, by = c("MidDate" = "start_date")) %>%
  select(-(WikipediaDates:EndDate)) %>%
  fill(pm, end_date) %>%
  # limit to the party that has the Prime Minister:
  filter(Party == pm) %>%
  mutate(survey_month = as.character(month(MidDate, label = TRUE)),
         survey_month = fct_relevel(survey_month, "Jun"),
         election_month = (month(MidDate) == month(end_date) & year(MidDate) == year(end_date)))

ggplot(d) +
  geom_rect(data = elections, ymin = -Inf, ymax = Inf, alpha = 0.1, 
           aes(xmin = start_date, xmax = end_date, fill = pm)) +
  geom_line(aes(x = MidDate, y = VotingIntention, colour = Party, group = end_date)) +
  scale_colour_manual(values = c(parties_v, "Combined" = "black")) +
  scale_fill_manual(values = parties_v) +
  labs(x = "Survey date", y = "Voting intention for PM's party", 
       colour = "Surveyed voting intention:", fill = "Prime Minister's party:") +
  scale_y_continuous(label = percent_format(accuracy = 2)) +
  ggtitle("Voting intention for the Prime Minister's party")

# model of vote for govt's lead party, controlling for trend (the smooth term) and for which
# party is in government, looking for effects from month of year.
mod <- gam(VotingIntention ~ election_month + survey_month +  Party + s(as.numeric(MidDate)), data = d)
anova(mod) # yes, the survey_month is 'statistically significant'

tibble(variable = names(coef(mod)),
       estimate = coef(mod),
       se = summary(mod)$se) %>%
  # very approximate confidence intervals:
  mutate(lower = estimate - se * 1.96,
         upper = estimate + se * 1.96) %>%
  filter(grepl("month", variable)) %>%
  mutate(survey_month = gsub("survey_month", "", variable)) %>%
  mutate(survey_month = factor(survey_month, 
                               levels = c("election_monthTRUE", as.character(month(c(7:12, 1:6), label = TRUE))))) %>%
  ggplot(aes(x = lower, xend = upper, y = survey_month, yend = survey_month)) +
  geom_vline(xintercept = 0, colour = "red") +
  geom_segment(size = 3, colour = "steelblue", alpha = 0.7) +
  geom_point(aes(x = estimate)) +
  scale_x_continuous(label = percent) +
  labs(x = "Estimated impact (in percentage points) on intended vote for PM's party, relative to June",
       y = "Survey month") +
  ggtitle("Seasonality of voting preference for the New Zealand PM's party?",
          "Weak evidence of a very weak pro-government effect from late Spring to early Autumn")

It's very motivating to see others using the nzelect package. Please tag me on Twitter, or let me know some other way, if you use it; that will encourage me to make further enhancements, and to get the new version with better historical data onto CRAN!

Thanksgiving

I’m going to try to get into the habit of this at the end of each blog post. Without the hard work, innovation and sheer smarts of the open source community my blog (and many much more important things!) wouldn’t be possible. Here are just those in the R world whose code I used in this session (not all of it made it into the excerpt above, but that’s all the more reason to give thanks below).

thankr::shoulders() %>% knitr::kable() %>% clipr::write_clip()
maintainer no_packages packages
Hadley Wickham hadley@rstudio.com 17 assertthat, dplyr, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, testthat, tidyr, tidyverse, usethis
R Core Team R-core@r-project.org 10 base, compiler, datasets, graphics, grDevices, grid, methods, stats, tools, utils
Gábor Csárdi csardi.gabor@gmail.com 9 callr, cli, crayon, desc, pkgconfig, processx, ps, remotes, sessioninfo
Kirill Müller 6 bindr, bindrcpp, hms, pillar, rprojroot, tibble
Jim Hester james.hester@rstudio.com 4 devtools, pkgbuild, pkgload, readr
Winston Chang winston@stdout.org 4 extrafont, extrafontdb, R6, Rttf2pt1
Jim Hester james.f.hester@gmail.com 3 fs, glue, withr
Lionel Henry lionel@rstudio.com 3 purrr, rlang, tidyselect
Dirk Eddelbuettel edd@debian.org 3 digest, Rcpp, x13binary
Yixuan Qiu yixuan.qiu@cos.name 3 showtext, showtextdb, sysfonts
Jeroen Ooms jeroen@berkeley.edu 2 curl, jsonlite
Yihui Xie xie@yihui.name 2 knitr, xfun
R-core R-core@R-project.org 1 nlme
Vitalie Spinu spinuvit@gmail.com 1 lubridate
Michel Lang michellang@gmail.com 1 backports
Patrick O. Perry patperry@gmail.com 1 utf8
Simon Wood simon.wood@r-project.org 1 mgcv
Achim Zeileis Achim.Zeileis@R-project.org 1 colorspace
Baptiste Auguie baptiste.auguie@gmail.com 1 gridExtra
Gabor Csardi csardi.gabor@gmail.com 1 prettyunits
Peter Ellis peter.ellis2013nz@gmail.com 1 nzelect
Simon Urbanek Simon.Urbanek@r-project.org 1 Cairo
James Hester james.hester@rstudio.com 1 xml2
Justin Talbot justintalbot@gmail.com 1 labeling
Torsten Hothorn Torsten.Hothorn@R-project.org 1 mvtnorm
Christoph Sax christoph.sax@gmail.com 1 seasonal
Jennifer Bryan jenny@rstudio.com 1 readxl
Kevin Ushey kevin@rstudio.com 1 rstudioapi
Max Kuhn max@rstudio.com 1 generics
Stefan Milton Bache stefan@stefanbache.dk 1 magrittr
Martin Maechler 1 Matrix
Charlotte Wickham cwickham@gmail.com 1 munsell
Brodie Gaslam brodie.gaslam@yahoo.com 1 fansi
Matthew Lincoln matthew.d.lincoln@gmail.com 1 clipr
Gavin L. Simpson ucfagls@gmail.com 1 gratia
Marek Gagolewski gagolews@rexamine.com 1 stringi
Jeremy Stephens jeremy.f.stephens@vumc.org 1 yaml
Brian Ripley ripley@stats.ox.ac.uk 1 MASS
Deepayan Sarkar deepayan.sarkar@r-project.org 1 lattice
Claus O. Wilke wilke@austin.utexas.edu 1 cowplot
Rasmus Bååth rasmus.baath@gmail.com 1 beepr
Jennifer Bryan jenny@stat.ubc.ca 1 cellranger
Alex Hayes alexpghayes@gmail.com 1 broom
Simon Urbanek simon.urbanek@r-project.org 1 audio
Jim Hester jim.hester@rstudio.com 1 memoise

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Monitoring Diabetes’ risk and BMI thanks to a Shiny dashboard

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hi everyone and welcome back to our blog!
Valentine's day has come and I guess many of you have eaten a lot of sweets during these days, so it's the right time for a health check; we've got you covered, with a touch of R-based magic!


A little backstory: R-lab in 2018

In January 2018 I joined MilanoR, a community dedicated to bringing together local R users, aiming to share knowledge, best practices and good times with everyone who wants to get involved, at all skill levels; you can learn more about the project here.

Among all the event formats they have experimented with, the most interesting is the R-lab. An R-lab is a non-competitive workshop where everyone works together through a common effort, be it the development of a Shiny dashboard, optimizing an existing one, or simply helping the main guest solve a business problem with R.

Cool, isn’t it? Check out some of the previous events on our blog!

During our January R-lab, we met Riccardo Rossi, computational biologist and bioinformatics facility manager at INGM.

Walking us through the existing medical guidelines for assessing risks related to obesity, type 2 diabetes, hypertension and cardiovascular health, he invited us to build a Shiny app that lets people keep their health status in check just by entering a few key parameters, such as height, weight and age.


Creating the dashboard

After meeting Riccardo, I embraced this challenge and started thinking: how could I translate medical guidelines, expressed as formulas, into an easy, working piece of R code?

Assessing the risk: server functions

My first goal was to build two functions, one for assessing risk of obesity and the second one to assess the risk of type 2 diabetes.

For the sake of simplicity, I'll only show the first one:

 

obesity_risk <- function(weight, height, gender){
    
    bmi_2 = weight/(height^2)
    
    if (gender == "female" && bmi_2 < 25){ob_absolut = 1}
    if (gender == "female" && bmi_2 > 25 & bmi_2 < 30){ob_absolut = 19.5}
    
    
    if (gender == "male" && bmi_2 < 25){ ob_absolut = 1}
    if (gender == "male" && bmi_2 > 25 & bmi_2 < 30){ob_absolut = 13}
    
    
    if (bmi_2 > 30) {return("100%")}
    else
    {    
      
      ob_relative = round((ob_absolut/100)/(8/100),1)    
      return(paste0(ob_relative,"%"));
      
    }
  }

Following the provided guidelines, this function calculates the user’s BMI, and returns the relative obesity risk. The other one does the same to assess the risk of contracting type 2 diabetes.
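
As a quick sanity check, here is what the function returns for a couple of illustrative inputs (the values below are examples of mine, not from the original post):

obesity_risk(weight = 70, height = 1.70, gender = "female")
# BMI is about 24.2, below 25, so ob_absolut = 1 and the result is "0.1%"

obesity_risk(weight = 80, height = 1.72, gender = "male")
# BMI is about 27, between 25 and 30, so ob_absolut = 13 and the result is "1.6%"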

Interacting with the user

What's an app without a user interacting with it? Front-end time!
Using the shiny and shinydashboard libraries, I designed a user-friendly interface that lets people enter the needed personal data:

  • age
  • gender
  • weight
  • height
  • waist
  • lifestyle habits
dashboardPage( 
 dashboardHeader(title = "Hi, I'm Doctor Thomas!", titleWidth = 300),
 dashboardSidebar(disable = TRUE),
 dashboardBody(
   fluidRow(

     box(title = "Your data", solidHeader = TRUE, width = 3, status = "primary",
         radioButtons("gender", "What's your gender?", choices = c("male", "female")),
         numericInput("age", "How old are you?",25),
         numericInput("weight", "What is your weight (Kilos)?",70),
         numericInput("height", "What is your height (Meters) ?",1.7),
         numericInput("waist", "What is your waist size?",70), 
         radioButtons("hypdrugs", "Do you take hypertension drugs?", choices = c("Yes", "No"))
     )
   ) # closes fluidRow
 )   # closes dashboardBody
)    # closes dashboardPage
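
The server side that wires these inputs to the risk functions is not shown in this excerpt. A minimal hypothetical sketch, assuming the dashboardPage() call above is assigned to ui and that a textOutput("obesity") placeholder is added somewhere in the dashboard body, could look like this:

server <- function(input, output) {
  # recompute the obesity risk whenever one of the inputs changes
  output$obesity <- renderText({
    obesity_risk(input$weight, input$height, input$gender)
  })
}

# shinyApp(ui, server) would then launch the app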


Enough with the code, show me the dashboard!

(Screenshot: the dashboard as seen by the user)

The full working app is hosted here; let us know what you think about it.
If you're interested in the full code, I will upload it to GitHub and edit this post. See you at the next meetup!

The post Monitoring Diabetes’ risk and BMI thanks to a Shiny dashboard appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Hacking Deep Learning (Bar Ilan) Workshop Videos

Hacking Deep Learning (Bar Ilan) Workshop Videos are now online. Thanks to my friend Prof. Yossi Keshet for organizing and inviting me!
One notable talk which is unfortunately missing from the videos is by Prof. Adi Shamir, described in this paper. The work analyzes how many pixels one needs to change to confuse a deep learning based classifier. The result is surprising - only a few! A related work is this.

Continue Reading…

Collapse

Read More

Four short links: 19 February 2019

3D with Face Tracking, Cleaning Data, Data as Labor, Walking Robotics

  1. Depth Index -- A JavaScript package that turns z-index into physically realistic depth, using PoseNet face tracking. Deep, man.
  2. Data Cleaner's Cookbook -- This is version 1 of a cookbook that will help you check whether a data table (defined on the data tables page) is properly structured and free from formatting errors, inconsistencies, duplicates, and other data headaches. All the data-auditing and data-cleaning recipes on this website use GNU/Linux tools in a BASH shell and work on plain text files.
  3. Should We Treat Data as Labor? Moving Beyond "Free" -- In this paper, we explore whether and how treating the market for data like a labor market could serve as a radical market that is practical in the near term.
  4. Underactuated Robotics -- working notes used for a course being taught at MIT [on] Algorithms for Walking, Running, Swimming, Flying, and Manipulation. Even if you don't care about robotics, read this excellent Hacker News comment (words I don't say often) and you'll think about walking completely differently.

Continue reading Four short links: 19 February 2019.

Continue Reading…

Collapse

Read More

Animate intermediate results of your algorithm

(This article was first published on Stanislas Morbieu - R, and kindly contributed to R-bloggers)

The R package gganimate makes it possible to animate plots. It is particularly interesting
for visualizing the intermediate results of an algorithm, to see how it converges towards
the final result. The following illustrates this with K-means clustering.

The outline of this post is as follows: we will first generate some artificial data to work with.
This allows us to visualize the behavior of the algorithm. The k-means criterion and an algorithm to optimize it
are then presented and implemented in R in a way that stores the intermediate results in a dataframe. Last, the content of the dataframe
is plotted dynamically with gganimate.

Generate some data

To see how the algorithm behaves, we first need some data. Let’s
generate an artificial dataset:

library(mvtnorm)
library(dplyr)

generateGaussianData <- function(n, center, sigma, label) {
  data = rmvnorm(n, mean = center, sigma = sigma)
  data = data.frame(data)
  names(data) = c("x", "y")
  data = data %>% mutate(class=factor(label))
  data
}

dataset <- {
  # cluster 1
  n = 50
  center = c(5, 5)
  sigma = matrix(c(1, 0, 0, 1), nrow = 2)
  data1 = generateGaussianData(n, center, sigma, 1)

  # cluster 2
  n = 50
  center = c(1, 1)
  sigma = matrix(c(1, 0, 0, 1), nrow = 2)
  data2 = generateGaussianData(n, center, sigma, 2)

  # all data
  data = bind_rows(data1, data2)
  data$class = as.factor(data$class)
  data
}

We generated a mixture of two Gaussians. There is nothing very special about it, except
that it is in two dimensions, which makes it easy to plot without the need for a dimensionality reduction method.

Let’s now move on to our algorithm.

K-means

Here, I choose to use k-means since it is widely used for clustering and, moreover, admits a simple implementation.

For a given number of clusters K, and a set of N vectors \(x_i , i \in [1, N]\),
K-means aims to minimize the following criterion:

\begin{equation*}
W(z, \mu) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} ||x_i - \mu_k||^2
\end{equation*}

with:

  • \(z_{ik} \in \{0, 1\}\) indicates if the vector \(x_i\) belongs to the cluster \(k\);
  • \(\mu_k \in \mathbb{R}^p\), the center of the cluster \(k\).
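
To make the notation concrete, here is a minimal R sketch of this criterion; the names x, z and mu are mine (a numeric matrix of points, an integer vector of cluster labels in 1..K, and a K-row matrix of centroids), not part of the implementation developed below:

kmeans_criterion <- function(x, z, mu) {
  # squared distance between each point and the centroid of its assigned cluster
  sum((x - mu[z, , drop = FALSE])^2)
}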

Lloyd and Forgy algorithms

Several algorithms optimize the k-means criterion.
For instance, the R function kmeans() provides four algorithms:

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                     "MacQueen"), trace=FALSE)

In fact, the Forgy and Lloyd algorithms are implemented the same way. We can see this in the source
code of kmeans():

edit(kmeans)

It opens the source code in your favorite text editor. At lines 56 and 57, “Forgy” and “Lloyd” are assigned
to the same number (2L) and are thus mapped to the same implementation:

nmeth <- switch(match.arg(algorithm), `Hartigan-Wong` = 1L,
    Lloyd = 2L, Forgy = 2L, MacQueen = 3L)

In the following, we will implement this algorithm. After the initialization, it iterates over two steps until convergence:

  • an assignment step which assigns the points to the clusters;
  • an update step which updates the centroids of the clusters.

Initialization

The initialization consists in selecting \(K\) points at random and considering them as the centroids of the clusters:

dataset = dataset %>% mutate(sample = row_number())
centroids = dataset %>% sample_n(2) %>% mutate(cluster = row_number()) %>% select(x, y, cluster)

Assignment step

The assignment step of k-means is equivalent to the E and C step of the CEM algorithm in the
Gaussian mixture model.
It assigns the points to the clusters according to the distances between the points and the centroids.
Let's write \(z_k\) for the set of points in cluster \(k\):

\begin{equation*}
z_k = \left\{ i; z_{ik} = 1 \right\}
\end{equation*}

We estimate \(z_k\) by:

\begin{equation*}
\hat{z}_k = \{ i; ||x_i - \mu_k||^2 \leq ||x_i - \mu_{k'}||^2; k' \neq k \}
\end{equation*}

A point \(x_i\) is set to be in the cluster \(k\) if the closest centroid
(using the Euclidean distance) is the centroid \(\mu_k\) of the cluster \(k\). This is done by the following R code:

assignmentStep = function(samplesDf, centroids) {
  d = samplesDf %>% select(x, y, sample)
  repCentroids = bind_rows(replicate(nrow(d), centroids, simplify = FALSE)) %>%
    transmute(xCentroid = x, yCentroid = y, cluster)
  d %>% slice(rep(1:n(), each=2)) %>%
    bind_cols(repCentroids) %>%
    mutate(s = (x-xCentroid)^2 + (y-yCentroid)^2) %>%
    group_by(sample) %>%
    top_n(1, -s) %>%
    select(cluster, x, y)
}

Update step

In the update step, the centroid of a cluster is computed by taking the
mean of the points in the cluster, as defined in the previous step.
It corresponds to the M step of the Gaussian mixture model and it is done
in R with:

updateStep = function(samplesDf) {
  samplesDf %>% group_by(cluster) %>%
    summarise(x = mean(x), y = mean(y))
}

Iterations

Let’s put together the steps defined above in a loop to complete the algorithm.
We define a maximum number of iterations maxIter and iterate over the two steps
until either convergence or maxIter is reached. It converges if the centroids
are the same in two consecutive iterations:

maxIter = 10
d = data.frame(sample=c(), cluster=c(), x=c(), y=c(), step=c())
dCentroids = data.frame(cluster=c(), x=c(), y=c(), step=c())
for (i in 1:maxIter) {
  df = assignmentStep(dataset, centroids)
  updatedCentroids = updateStep(df)
  if (all(updatedCentroids == centroids )) {
    break
  }
  centroids = updatedCentroids
  d = bind_rows(d, df %>% mutate(step=i))
  dCentroids = bind_rows(dCentroids, centroids %>% mutate(step=i))
}

The above R code constructs two dataframes d and dCentroids which contain
respectively the assignments of the points and the centroids. The column step indicates
the iteration number and will be used to animate the plot.

Plot

We are now ready to plot the data. For this, ggplot2
is used with some code specific to gganimate:

library(ggplot2)
library(gganimate)

a <- ggplot(d, aes(x = x, y = y, color=factor(cluster), shape=factor(cluster))) +
  labs(color="Cluster", shape="Cluster", title="Step: {frame} / {nframes}") +
  geom_point() +
  geom_point(data=dCentroids, shape=10, size=5) +
  transition_manual(step)
animate(a, fps=10)
anim_save("steps.gif")

The function transition_manual of gganimate animates the plot by filtering
the dataframe at each step according to the value of the column passed as a parameter (here step).
The variables frame and nframes are provided by gganimate and are used in the title.
They give the number of the current frame and the total number of frames respectively.

The animate function takes the argument fps which stands for “frames per second”. This call
takes some time to process since it generates the animation. The animation is then stored in “steps.gif”:

(Animation: iterations over the steps of k-means)

To sum up

This post gives an example of how to use gganimate to plot the intermediate results of an algorithm.
To do this, one has to:

  • import gganimate;
  • create a dataframe with an additional column which stores the iteration number;
  • create a standard ggplot2 object;
  • use the transition_manual function to specify the column used for the transition between the frames (the iteration number);
  • generate the animation with animate;
  • save the animation with anim_save.

We also covered the Lloyd and Forgy algorithms to optimize the k-means criterion.
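
As a side note, the same criterion can be optimized with the built-in kmeans(); a quick sketch on the dataset generated above (the final centroids may differ slightly from the hand-rolled loop because the random initialization differs):

fit <- kmeans(dataset[, c("x", "y")], centers = 2, algorithm = "Lloyd")
fit$centers       # centroids found by the built-in Lloyd implementation
fit$tot.withinss  # the minimized k-means criterion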

Looking at the implementation of R functions is sometimes helpful.
For instance, we looked at the implementation of kmeans() to see that two
of the algorithms proposed as arguments are in fact the same.

To leave a comment for the author, please follow the link and comment on their blog: Stanislas Morbieu - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Thanks for reading!