# My Data Science Blogs

## November 17, 2018

### What's new on arXiv

Often, a community becomes alarmed when high rates of cancer are noticed, and residents suspect that the cancer cases could be caused by a known source of hazard. In response, the CDC recommends that departments of health perform a standardized incidence ratio (SIR) analysis to determine whether the observed cancer incidence is higher than expected. This approach has several limitations that are well documented in the literature. In this paper we propose a novel causal inference approach to cancer cluster investigations, rooted in the potential outcomes framework. Assuming that a source of hazard representing a potential cause of increased cancer rates in the community is identified a priori, we introduce a new estimand called the causal SIR (cSIR). The cSIR is a ratio defined as the expected cancer incidence in the exposed population divided by the expected cancer incidence under the (counterfactual) scenario of no exposure. To estimate the cSIR we need to overcome two main challenges: 1) identify unexposed populations that are as similar as possible to the exposed one to inform estimation under the counterfactual scenario of no exposure, and 2) make inference on cancer incidence in these unexposed populations using publicly available data that are often available at a much higher level of spatial aggregation than what is desired. We overcome the first challenge by relying on matching. We overcome the second challenge by developing a Bayesian hierarchical model that borrows information from other sources to impute cancer incidence at the desired finer level of spatial aggregation. We apply our proposed approach to determine whether trichloroethylene vapor exposure has caused increased cancer incidence in Endicott, NY.
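In potential-outcomes notation (the symbols here are mine, not quoted from the paper), the estimand just described can be sketched as:

```latex
% Causal SIR for an exposed community: Y(1) and Y(0) denote cancer incidence
% with and without exposure, and A = 1 marks the exposed population.
\mathrm{cSIR} = \frac{\mathbb{E}\left[\, Y(1) \mid A = 1 \,\right]}{\mathbb{E}\left[\, Y(0) \mid A = 1 \,\right]}
```

The denominator is counterfactual, which is why the matched unexposed populations and the imputation model are needed to estimate it.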
This paper attempts to address the issues of machine learning in its current implementation. It is known that machine learning algorithms require a significant amount of data for training purposes, whereas recent developments in deep learning have increased this requirement dramatically. The performance of an algorithm depends on the quality of data, and hence algorithms are only as good as the data they are trained on. Supervised learning is developed based on human learning processes, by analysing named (i.e. annotated) objects, scenes and actions. Whether training on large quantities of data (i.e. big data) is the right or the wrong approach is debatable. The fact is that training algorithms the same way we learn ourselves comes with limitations. This paper discusses the issues around applying a human-like approach to train algorithms and the implications of this approach when using limited data. Several current studies involving non-data-driven algorithms and natural examples are also discussed, and certain alternative approaches are suggested.
Consider a data publishing setting for a dataset composed of non-private features correlated with a set of private features not necessarily present in the dataset. The objective of the publisher is to maximize the amount of information about the non-private features in a revealed dataset (utility), while keeping the information leaked about the private attributes bounded (privacy). Here, both privacy and utility are measured using information leakage measures that arise in adversarial settings. We consider a practical setting wherein the publisher uses an estimate of the joint distribution of the features to design the privacy mechanism. For any privacy mechanism, we provide probabilistic upper bounds for the discrepancy between the privacy guarantees for the empirical and true distributions, and similarly for utility. These bounds follow from our main technical results regarding the Lipschitz continuity of the considered information leakage measures. We also establish the statistical consistency of the notion of optimal privacy mechanism. Finally, we introduce and study the family of uniform privacy mechanisms, mechanisms designed upon an estimate of the joint distribution which are capable of providing privacy to a whole neighborhood of the estimated distribution, and thereby, guaranteeing privacy for the true distribution with high probability.
Constrained sequential pattern mining aims at identifying frequent patterns on a sequential database of items while observing constraints defined over the item attributes. We introduce novel techniques for constraint-based sequential pattern mining that rely on a multi-valued decision diagram representation of the database. Specifically, our representation can accommodate multiple item attributes and various constraint types, including a number of non-monotone constraints. To evaluate the applicability of our approach, we develop an MDD-based prefix-projection algorithm and compare its performance against a typical generate-and-check variant, as well as a state-of-the-art constraint-based sequential pattern mining algorithm. Results show that our approach is competitive with or superior to these other methods in terms of scalability and efficiency.
In unsupervised learning, dimensionality reduction is an important tool for data exploration and visualization. Because these aims are typically open-ended, it can be useful to frame the problem as looking for patterns that are enriched in one dataset relative to another. These pairs of datasets occur commonly, for instance a population of interest vs. control, or signal vs. signal-free recordings. However, there are few methods that work on sets of data, as opposed to data points or sequences. Here, we present a probabilistic model for dimensionality reduction to discover signal that is enriched in the target dataset relative to the background dataset. The data in these sets do not need to be paired or grouped beyond set membership. By using a probabilistic model where some structure is shared between the two datasets and some is unique to the target dataset, we are able to recover interesting structure in the latent space of the target dataset. The method also has the advantages of a probabilistic model, namely that it allows for the incorporation of prior information, handles missing data, and can be generalized to different distributional assumptions. We describe several possible variations of the model and demonstrate the application of the technique to de-noising, feature selection, and subgroup discovery settings.
Deep learning involves a difficult non-convex optimization problem, which is often solved by stochastic gradient (SG) methods. While SG is usually effective, it may not be robust in some situations. Recently, Newton methods have been investigated as an alternative optimization technique, but nearly all existing studies consider only fully-connected feedforward neural networks. They do not investigate other types of networks such as Convolutional Neural Networks (CNN), which are more commonly used in deep-learning applications. One reason is that Newton methods for CNN involve complicated operations, and so far no works have conducted a thorough investigation. In this work, we give details of all building blocks including function, gradient, and Jacobian evaluation, and Gauss-Newton matrix-vector products. These basic components are very important because with them further developments of Newton methods for CNN become possible. We show that an efficient MATLAB implementation can be done in just several hundred lines of code and demonstrate that the Newton method gives competitive test accuracy.
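For context, the Gauss-Newton building blocks this abstract enumerates typically fit together as follows; the notation (per-sample Jacobian $J_i$, output-space loss Hessian $B_i$ over $l$ training instances) is the generic one, not quoted from the paper:

```latex
% Gauss-Newton approximation to the Hessian of the training loss:
G = \frac{1}{l} \sum_{i=1}^{l} J_i^{\top} B_i J_i,
\qquad
Gv = \frac{1}{l} \sum_{i=1}^{l} J_i^{\top} \bigl( B_i \, (J_i v) \bigr)
% The product Gv never forms G explicitly: it needs only Jacobian-vector and
% transposed-Jacobian-vector products, which is what makes Newton-CG feasible.
```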
Timely prediction of clinically critical events in Intensive Care Unit (ICU) is important for improving care and survival rate. Most of the existing approaches are based on the application of various classification methods on explicitly extracted statistical features from vital signals. In this work, we propose to eliminate the high cost of engineering hand-crafted features from multivariate time-series of physiologic signals by learning their representation with a sequence-to-sequence auto-encoder. We then propose to hash the learned representations to enable signal similarity assessment for the prediction of critical events. We apply this methodological framework to predict Acute Hypotensive Episodes (AHE) on a large and diverse dataset of vital signal recordings. Experiments demonstrate the ability of the presented framework in accurately predicting an upcoming AHE.
In Business Intelligence, accurate predictive modeling is key for providing adaptive decisions. We studied predictive modeling problems in this research, motivated by real-world cases that Microsoft data scientists encountered while dealing with e-commerce transaction fraud control decisions using transaction streaming data in an uncertain probabilistic decision environment. The values of most online transaction-related features can return instantly, while the true fraud labels only return after a stochastic delay. Using partially mature data directly for predictive modeling in an uncertain probabilistic decision environment would lead to significant inaccuracy in risk decision-making. To improve estimation of the probabilistic prediction environment, which leads to more accurate predictive modeling, two frameworks, Current Environment Inference (CEI) and Future Environment Inference (FEI), are proposed. These frameworks generate decision-environment-related features using long-term fully mature and short-term partially mature data, and the values of those features are estimated using a variety of learning methods, including linear regression, random forests, gradient boosted trees, artificial neural networks, and recurrent neural networks. Performance tests were conducted using e-commerce transaction data from Microsoft. Testing results suggest that the proposed frameworks significantly improve the accuracy of decision environment estimation.
This paper examines the possibility of, and the possible advantages to learning the filters of convolutional neural networks (CNNs) for image analysis in the wavelet domain. We are stimulated by both Mallat’s scattering transform and the idea of filtering in the Fourier domain. It is important to explore new spaces in which to learn, as these may provide inherent advantages that are not available in the pixel space. However, the scattering transform is limited by its inability to learn in between scattering orders, and any Fourier domain filtering is limited by the large number of filter parameters needed to get localized filters. Instead we consider filtering in the wavelet domain with learnable filters. The wavelet space allows us to have local, smooth filters with far fewer parameters, and learnability can give us flexibility. We present a novel layer which takes CNN activations into the wavelet space, learns parameters and returns to the pixel space. This allows it to be easily dropped in to any neural network without affecting the structure. As part of this work, we show how to pass gradients through a multirate system and give preliminary results.
Deep neural networks have shown superior performance in many regimes at remembering familiar patterns from large amounts of data. However, the standard supervised deep learning paradigm is still limited when facing the need to learn new concepts efficiently from scarce data. In this paper, we present a memory-augmented neural network motivated by the process of human concept learning. The training procedure, imitating the concept formation course of humans, learns how to distinguish samples from different classes and aggregate samples of the same kind. In order to better utilize the advantages originating from human behavior, we propose a sequential process during which the network decides how to remember each sample at every step. In this sequential process, a stable and interactive memory serves as an important module. We validate our model on some typical one-shot learning tasks and also on an exploratory outlier detection problem. In all the experiments, our model is highly competitive, matching or outperforming strong baselines.
Deep domain adaptation has recently achieved great success. Compared with shallow domain adaptation, deep domain adaptation has shown higher predictive performance and a stronger capacity to tackle structured data (e.g., image and sequential data). The underlying idea of deep domain adaptation is to bridge the gap between source and target domains in a joint feature space so that a supervised classifier trained on labeled source data can be nicely transferred to the target domain. This idea is certainly appealing and motivating, but from a theoretical perspective, no theory has been developed to support it. In this paper, we develop a rigorous theory to explain why we can bridge the relevant gap in an intermediate joint space. In light of our proposed theory, it turns out that there is a strong connection between deep domain adaptation and the Wasserstein (WS) distance. More specifically, our theory revolves around the following points: i) first, we propose a context wherein we can perform transfer learning perfectly, and ii) second, we further prove that by bridging the relevant gap and minimizing some reconstruction errors, we are minimizing a WS distance between the pushforward source distribution and the target distribution via a transport that maps from the source to the target domain.
In machine learning, a nonparametric forecasting algorithm for time series data has been proposed, called the kernel spectral hidden Markov model (KSHMM). In this paper, we propose a technique for short-term wind-speed prediction based on KSHMM. We numerically compared the performance of our KSHMM-based forecasting technique to other techniques with machine learning, using wind-speed data offered by the National Renewable Energy Laboratory. Our results demonstrate that, compared to these methods, the proposed technique offers comparable or better performance.
Interactive visualizations are arguably the most important tool to explore, understand, and convey facts about data. In the past years, the database community has been working on different techniques for Approximate Query Processing (AQP) that aim to deliver an approximate query result within a fixed time bound to better support interactive visualizations. However, classical AQP approaches suffer from various problems that limit their applicability for the ad-hoc exploration of a new data set: (1) classical AQP approaches that perform online sampling can support ad-hoc exploration queries but yield low quality if executed over rare subpopulations; (2) classical AQP approaches that rely on offline sampling can use some form of biased sampling to mitigate these problems but require a priori knowledge of the workload, which is often not realistic if users want to explore a new database. In this paper, we present a new approach to AQP called Model-based Approximate Query Processing, which leverages generative models learned over the complete database to answer SQL queries at interactive speeds. Unlike classical AQP approaches, generative models allow us to compute responses to ad-hoc queries while delivering high-quality estimates even over rare subpopulations. In our experiments with real and synthetic data sets, we show that Model-based AQP can in many scenarios return more accurate results in a shorter runtime. Furthermore, we think that the techniques of using generative models presented in this paper can be used not only for AQP in databases but also for other database problems, including Query Optimization and Data Cleaning.
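As a toy illustration of the idea (not the paper's actual system), one can learn a crude generative model of a table, here just independent per-column categorical distributions, and answer an aggregate query by sampling synthetic rows instead of scanning the data. All class and method names below are hypothetical:

```java
import java.util.Random;

public class ModelBasedAQP {
    // Fit a categorical marginal for one integer-coded column. Treating the
    // columns as independent is a deliberately crude modeling assumption;
    // the paper's generative models are far richer.
    public static double[] fitMarginal(int[] col, int card) {
        double[] p = new double[card];
        for (int v : col) p[v] += 1.0 / col.length;
        return p;
    }

    // Draw one value from a categorical distribution via inverse CDF.
    public static int sample(double[] p, Random rng) {
        double u = rng.nextDouble(), c = 0.0;
        for (int i = 0; i < p.length; i++) { c += p[i]; if (u < c) return i; }
        return p.length - 1;
    }

    // Approximate SELECT COUNT(*) WHERE a = 1 AND b = 0 over a table of nRows
    // rows by Monte Carlo over synthetic rows drawn from the model.
    public static double approxCount(double[] pa, double[] pb,
                                     int nRows, int nSamples, long seed) {
        Random rng = new Random(seed);
        int hits = 0;
        for (int s = 0; s < nSamples; s++)
            if (sample(pa, rng) == 1 && sample(pb, rng) == 0) hits++;
        return (double) hits / nSamples * nRows;
    }
}
```

The point of the sketch is only that query answering touches the (small) model, never the base table, which is what makes interactive-speed, ad-hoc estimates possible.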
Different layers of deep convolutional neural networks (CNNs) encode different levels of information. High-layer features always contain more semantic information, while low-layer features contain more detailed information. However, low-layer features suffer from background clutter and semantic ambiguity. During visual recognition, the combination of low-layer and high-layer features plays an important role in context modulation. Directly combining high-layer and low-layer features may introduce background clutter and semantic ambiguity along with the detailed information. In this paper, we propose a general network architecture to concatenate CNN features of different layers in a simple and effective way, called the Selective Feature Connection Mechanism (SFCM). Low-layer features are selectively linked to high-layer features with a feature selector which is generated by the high-layer features. The proposed connection mechanism can effectively overcome the above-mentioned drawbacks. We demonstrate the effectiveness, superiority, and universal applicability of this method on many challenging computer vision tasks, such as image classification, scene text detection, and image-to-image translation.
A contextual care protocol is used by a medical practitioner for patient healthcare, given the context or situation that the specified patient is in. This paper proposes a method to build an automated self-adapting protocol which can help make relevant, early decisions for effective healthcare delivery. The hybrid model leverages neural networks and decision trees. The neural network estimates the chances of each disease and each tree in the decision trees represents care protocol for a disease. These trees are subject to change in case of aberrations found by the diagnosticians. These corrections or prediction errors are clustered into similar groups for scalability and review by the experts. The corrections as suggested by the experts are incorporated into the model.
The Age of Information (AoI) metric has recently received much attention as a measure of the freshness of information in a network. In this paper, we study the average AoI in a generic setting with correlated sensor information, which is relevant to multiple Internet of Things (IoT) scenarios. The system consists of physical sources with discrete-time updates that are observed by a set of sensors. Each source update may be observed by multiple sensors, and hence the sensor observations are correlated. We devise a model that is simple, but still capable of capturing the main tradeoffs. We propose two sensor scheduling policies that minimize the AoI of the sources: one that requires the system parameters to be known a priori, and one based on contextual bandits in which the parameters are unknown and need to be learned. We show that both policies are able to exploit the sensor correlation to reduce the AoI, resulting in a large reduction in AoI compared to schedules that are random or based on round-robin.
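For readers new to the metric, the time-average AoI that such scheduling policies minimize is commonly defined as follows (the standard textbook definition, not quoted from this paper):

```latex
% Age at time t, where u(t) is the generation time of the freshest update
% received by time t:
\Delta(t) = t - u(t)
% Time-average AoI over a growing horizon T:
\bar{\Delta} = \lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} \Delta(t)\, dt
```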
Deep learning adoption in the financial services industry has been limited due to a lack of model interpretability. However, several techniques have been proposed to explain predictions made by a neural network. We provide an initial investigation into these techniques for the assessment of credit risk with neural networks.
Language models, being at the heart of many NLP problems, are always of great interest to researchers. Neural language models come with the advantage of distributed representations and long-range contexts. With its particular dynamics that allow the cycling of information within the network, the Recurrent Neural Network (RNN) becomes an ideal paradigm for neural language modeling. The Long Short-Term Memory (LSTM) architecture solves the inadequacies of the standard RNN in modeling long-range contexts. In spite of a plethora of RNN variants, the possibility of adding multiple memory cells to LSTM nodes has seldom been explored. Here we propose a multi-cell node architecture for LSTMs and study its applicability to neural language modeling. The proposed multi-cell LSTM language models outperform the state-of-the-art results on the well-known Penn Treebank (PTB) setup.
In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set, and guide supervised training to spend time on a class proportional to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with classifier and training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.

### Convert Data Frame to Dictionary List in R

In R, there are a couple of ways to convert a column-oriented data frame to a row-oriented dictionary list or the like, e.g. a list of lists.

In the code snippets below, I show each approach and how to extract keys and values from the dictionary. As shown in the benchmark, it appears that the generic R data structure is still the most efficient.

```r
### LIST() FUNCTION IN BASE PACKAGE ###
x1 <- as.list(iris[1, ])
names(x1)
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
x1[["Sepal.Length"]]
# [1] 5.1
```

```r
### ENVIRONMENT-BASED SOLUTION ###
envn_dict <- function(x) {
  e <- new.env(hash = TRUE)
  for (name in names(x)) assign(name, x[, name], e)
  return(e)
}

x2 <- envn_dict(iris[1, ])
ls(x2)
# [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"
x2[["Sepal.Length"]]
# [1] 5.1
```

```r
### COLLECTIONS PACKAGE ###
coll_dict <- function(x) {
  d <- collections::Dict$new()
  for (name in names(x)) d$set(name, x[, name])
  return(d)
}

x3 <- coll_dict(iris[1, ])
x3$keys()
# [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"
x3$get("Sepal.Length")
# [1] 5.1
```

```r
### HASH PACKAGE ###
hash_dict <- function(x) {
  d <- hash::hash()
  for (name in names(x)) d[[name]] <- x[, name]
  return(d)
}

x4 <- hash_dict(iris[1, ])
hash::keys(x4)
# [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"
hash::values(x4, "Sepal.Length")
# Sepal.Length
#          5.1
```

```r
### DATASTRUCTURES PACKAGE ###
data_dict <- function(x) {
  d <- datastructures::hashmap()
  for (name in names(x)) d[name] <- x[, name]
  return(d)
}

x5 <- data_dict(iris[1, ])
datastructures::keys(x5)
# [1] "Species"      "Sepal.Width"  "Petal.Length" "Sepal.Length" "Petal.Width"
datastructures::get(x5, "Sepal.Length")
# [1] 5.1
```

```r
### FROM PYTHON ###
py2r_dict <- function(x) {
  return(reticulate::py_dict(names(x), x, TRUE))
}

x6 <- py2r_dict(iris[1, ])
x6$keys()
# [1] "Petal.Length" "Sepal.Length" "Petal.Width"  "Sepal.Width"  "Species"
x6["Sepal.Length"]
# [1] 5.1

### CONVERT DATAFRAME TO DICTIONARY LIST ###
to_list <- function(df, fn) {
  l <- list()
  for (i in seq(nrow(df))) l[[i]] <- fn(df[i, ])
  return(l)
}

rbenchmark::benchmark(
  replications = 100, order = "elapsed", relative = "elapsed",
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"),
  "BASE::LIST"              = to_list(iris, as.list),
  "BASE::ENVIRONMENT"       = to_list(iris, envn_dict),
  "COLLECTIONS::DICT"       = to_list(iris, coll_dict),
  "HASH::HASH"              = to_list(iris, hash_dict),
  "DATASTRUCTURES::HASHMAP" = to_list(iris, data_dict),
  "RETICULATE::PY_DICT"     = to_list(iris, py2r_dict)
)
#                      test replications elapsed relative user.self sys.self
# 1              BASE::LIST          100   0.857    1.000     0.857    0.000
# 2       BASE::ENVIRONMENT          100   1.607    1.875     1.607    0.000
# 4              HASH::HASH          100   2.600    3.034     2.600    0.000
# 3       COLLECTIONS::DICT          100   2.956    3.449     2.956    0.000
# 5 DATASTRUCTURES::HASHMAP          100  16.070   18.751    16.071    0.000
# 6     RETICULATE::PY_DICT          100  18.030   21.039    18.023    0.008
```

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

### If you did not already know

**Halide**
Halide is a computer programming language designed for writing digital image processing code that takes advantage of memory locality, vectorized computation, and multi-core CPUs and GPUs. Halide is implemented as an internal domain-specific language (DSL) in C++. The main innovation Halide brings is the separation of the algorithm being implemented from its execution schedule, i.e.
code specifying the loop nesting, parallelization, loop unrolling, and vector instructions. These two are usually interleaved, and experimenting with changing the schedule requires the programmer to rewrite large portions of the algorithm with every change. With Halide, changing the schedule does not require any changes to the algorithm, which allows the programmer to experiment with scheduling and find the most efficient one. …

**DNN Dataflow Choice Is Overrated**

**Learning Active Learning (LAL)**
In this paper, we suggest a novel data-driven approach to active learning: Learning Active Learning (LAL). The key idea behind LAL is to train a regressor that predicts the expected error reduction for a potential sample in a particular learning state. By treating the query selection procedure as a regression problem, we are not restricted to dealing with existing AL heuristics; instead, we learn strategies based on experience from previous active learning experiments. We show that LAL can be learnt from a simple artificial 2D dataset and yields strategies that work well on real data from a wide range of domains. Moreover, if some domain-specific samples are available to bootstrap active learning, the LAL strategy can be tailored for a particular problem. …

**Dimensional Clustering**
This paper introduces a new clustering technique, called *dimensional clustering*, which clusters each data point by its latent *pointwise dimension*, a measure of the dimensionality of the data set local to that point. Pointwise dimension is invariant under a broad class of transformations. As a result, dimensional clustering can be usefully applied to a wide range of datasets. Concretely, we present a statistical model which estimates the pointwise dimension of a dataset around the points in that dataset using the distance of each point from its $n^{\text{th}}$ nearest neighbor.
We demonstrate the applicability of our technique to the analysis of dynamical systems, images, and complex human movements. …

### Simple table size estimates and 128-bit numbers (Java Edition)

Suppose that you are given a table. You know the number of rows, as well as how many distinct values each column has. For example, you know that there are two genders (in this particular table). Maybe there are 73 distinct age values. For a concrete example, take the standard Adult data set, which is made of 48842 rows.

How many distinct entries do you expect the table to have? That is, if you remove all duplicate rows, what is the number of rows left?

There is a standard formula for this problem: Cardenas' formula. It uses the simplistic model where there is no relationship between the distinct columns. In practice, it will tend to overestimate the number of rows. However, despite its simplicity, it often works really well. Let p be the product of all column cardinalities, and let n be the number of rows; then the Cardenas estimate is p * (1 - (1 - 1/p)^n). Simple, right? You can implement it in Java easily enough:

```java
double cardenas64(long[] cards, int n) {
  double product = 1;
  for (int k = 0; k < cards.length; k++) {
    product *= cards[k];
  }
  return product * (1 - Math.pow(1 - 1.0 / product, n));
}
```

So let us put in the numbers… my column cardinalities are 16, 16, 15, 5, 2, 94, 21648, 92, 42, 7, 9, 2, 6, 73, 119; and I have 48842 rows. So what is Cardenas' prediction? Zero. At least, that's what the Java function returns. Why is that? The first problem is that 1 - 1/p is 1 when p is that large. And even if you could compute 1 - 1/p accurately enough, taking it to the power of 48842 is a problem.

So what do you do? You can switch to something more accurate than double precision, that is, quadruple precision (also called binary128). There are no native 128-bit floats in Java, but you can emulate them using the BigDecimal class. The code gets much uglier.
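As an aside (mine, not from the post): before reaching for 128-bit arithmetic, the same estimate can be computed stably in ordinary doubles by rewriting p(1 - (1 - 1/p)^n) as -p·expm1(n·log1p(-1/p)):

```java
public class CardenasStable {
    // Cardenas estimate p * (1 - (1 - 1/p)^n) rewritten with log1p/expm1,
    // which avoid the catastrophic cancellation in 1 - 1/p when p is huge.
    public static double estimate(long[] cards, int n) {
        double product = 1;
        for (long c : cards) product *= c;
        return -product * Math.expm1(n * Math.log1p(-1.0 / product));
    }
}
```

With the post's cardinalities and 48842 rows, this returns approximately 48842 (essentially no duplicate rows expected) rather than the zero produced by the naive double-precision version. It still cannot represent products of cardinalities beyond the double range, which is where the BigDecimal route below remains necessary.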
Elegance aside, I assumed it would be a walk in the park, but I found that the implementation of the power function was numerically unstable, so I had to roll my own (from multiplications). The core function looks like this:

```java
double cardenas128(long[] cards, int n) {
  BigDecimal product = product(cards);
  BigDecimal oneover = BigDecimal.ONE.divide(product, MathContext.DECIMAL128);
  BigDecimal proba = BigDecimal.ONE.subtract(oneover, MathContext.DECIMAL128);
  proba = lemirepower(proba, n);
  return product.subtract(product.multiply(proba, MathContext.DECIMAL128),
                          MathContext.DECIMAL128).doubleValue();
}
```

It scales up to billions of rows and up to products of cardinalities that do not fit in any of Java's native types. Though the computation involves fancy data types, it is probably more than fast enough for most applications.

### Document worth reading: "Multi-Agent Reinforcement Learning: A Report on Challenges and Approaches"

Reinforcement Learning (RL) is a learning paradigm concerned with learning to control a system so as to maximize an objective over the long term. This approach to learning has received immense interest in recent times, and success manifests itself in the form of human-level performance on games like *Go*. While RL is emerging as a practical component in real-life systems, most successes have been in Single Agent domains. This report will instead specifically focus on challenges that are unique to Multi-Agent Systems interacting in mixed cooperative and competitive environments. The report concludes with advances in the paradigm of training Multi-Agent Systems called *Decentralized Actor, Centralized Critic*, based on an extension of MDPs called *Decentralized Partially Observable MDPs*, which has seen a renewed interest lately.
Multi-Agent Reinforcement Learning: A Report on Challenges and Approaches

### MongoDB Atlas: Connector for Apache Spark now Officially Certified for Azure Databricks

This is a guest blog from our partners at MongoDB, by Bryan Reinero and Dana Groce.

We are happy to announce that the MongoDB Connector for Apache Spark is now officially certified for Azure Databricks. MongoDB Atlas users can integrate Spark and MongoDB in the cloud for advanced analytics and machine learning workloads by using the MongoDB Connector for Apache Spark, which is fully supported and maintained by MongoDB. The MongoDB Connector for Apache Spark exposes all of Spark's libraries, including Scala, Java, Python, and R. MongoDB data is materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming, and SQL APIs.

The MongoDB Connector for Apache Spark can take advantage of MongoDB's aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data it needs – for example, analyzing all customers located in a specific geography. This is very different from simple NoSQL data stores that do not offer secondary indexes or in-database aggregations and require the extraction of all data based on a simple primary key, even if only a subset of that data is needed for the Spark process. This results in more processing overhead, more hardware, and longer time-to-insight for data scientists and engineers.

Additionally, MongoDB's workload isolation makes it easy for users to efficiently process data drawn from multiple sources into a single database with zero impact on other business-critical database operations. Running Spark on MongoDB also reduces operational overhead by greatly simplifying your architecture and increasing the speed at which analytics can be executed.
MongoDB Atlas, our on-demand, fully-managed cloud database service for MongoDB, makes it even easier to run sophisticated analytics processing by eliminating the operational overhead of managing database clusters directly. By combining Azure Databricks and MongoDB Atlas, users can benefit from a fully managed analytics platform, freeing engineering resources to focus on their core business domain and deliver actionable insights quickly. -- Try Databricks for free. Get started today. The post MongoDB Atlas: Connector for Apache Spark now Officially Certified for Azure Databricks appeared first on Databricks. Continue Reading… ### RcppGetconf 0.0.3 (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers) A second and minor update for the RcppGetconf package for reading system configuration — not unlike getconf from the libc library — is now on CRAN. Changes are minor. We avoid an error on a long-dead operating system cherished in one particular corner of the CRAN world. In doing so some files were updated so that dynamically loaded routines are now registered too. The short list of changes in this release follows: #### Changes in RcppGetconf version 0.0.3 (2018-11-16) • Examples no longer run on Solaris where they appear to fail. Courtesy of CRANberries, there is a diffstat report. More about the package is at the local RcppGetconf page and the GitHub repo. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings. To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Continue Reading… ### Book Memo: “Reinforcement Learning for Optimal Feedback Control” A Lyapunov-Based Approach Reinforcement Learning for Optimal Feedback Control develops model-based and data-driven reinforcement learning methods for solving optimal control problems in nonlinear deterministic dynamical systems. In order to achieve learning under uncertainty, data-driven methods for identifying system models in real-time are also developed. The book illustrates the advantages gained from the use of a model and the use of previous experience in the form of recorded data through simulations and experiments. The book’s focus on deterministic systems allows for an in-depth Lyapunov-based analysis of the performance of the methods described during the learning phase and during execution. To yield an approximate optimal controller, the authors focus on theories and methods that fall under the umbrella of actor-critic methods for machine learning. They concentrate on establishing stability during both the learning and execution phases, and on adaptive model-based and data-driven reinforcement learning, which typically relies on instantaneous input-output measurements. Continue Reading… ### Despite California’s inferno, global wildfires are fizzling out Climate change makes fires worse, but agricultural development limits them Continue Reading… ## November 16, 2018 ### Announcing the Winners of the Facebook Crisis Informatics Research Awards We are pleased to announce the winners of the Facebook Crisis Informatics Research Awards. As social media and messaging apps continue to become important communication tools in people’s everyday lives, they have also come to play important roles in how people prepare for, respond to, and recover from disasters.
With these research awards, Facebook is supporting new and innovative research in the area of crisis informatics to make social media tools more useful to people responding to or impacted by a disaster. This research will unlock new tools and products to improve disaster response and recovery. No Facebook data will be provided to award recipients. The winning proposals represent a diverse range of problems, methodologies and academic disciplines. One proposal will use social media data to enhance available information on flood conditions in order to predict the impact and extent of floods. Another takes crisis informatics from a typical focus on natural-hazard induced disasters to one dealing with conflict zones, including the use of mixed methods to understand displacement and information flows among refugee populations. A third winning proposal will build statistical models that predict how populations will move in response to natural hazards, including sheltering in place, evacuations and returning to affected areas. This diverse slate of research stands to advance our understanding of and development of tools for dealing with crises, and will push the field of crisis informatics in new and exciting directions. The three award winners and their topic areas are: Capturing the Big Picture from Social Media for Global Flood Awareness Carlos Castillo (Universitat Pompeu Fabra) and Milan Kalaš Data & Design Methods for Supporting Crisis Informatics in Conflict Zones Karen E. Fisher (University of Washington), Eaid Yafi (University Kuala Lumpur), Jevin West (University of Washington), Kate Starbird (University of Washington) and Reem Talhouk (University of Newcastle) Real-time Prediction of Movement Patterns during Disaster Times Cynthia Chen (University of Washington), Feilong Wang (University of Washington) and Xiangyang Guan (University of Washington) Thank you to all the researchers that submitted proposals, and congratulations to the winners. 
To view our currently open research awards and to subscribe to our Research Awards email list, visit our Research Awards page. Continue Reading… ### Example of Overfitting (This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers) I occasionally see queries on various social media as to overfitting — what is it?, etc. I’ll post an example here. (I mentioned it at my talk the other night on our novel approach to missing values, but had a bug in the code. Here is the correct account.) The dataset is prgeng, on wages of programmers and engineers in Silicon Valley as of the 2000 Census. It’s included in our polyreg package, which we developed as an alternative to neural networks. But it is quite useful in its own right, as it makes it very convenient to fit multivariate polynomial models. (Not as easy as it sounds; e.g. one must avoid cross products of orthogonal dummy variables, powers of those variables, etc.) First I split the data into training and test sets:

```r
> set.seed(88888)
> getPE()
> pe1 <- pe[,c(1,2,4,6,7,12:16,3)]
> testidxs <- sample(1:nrow(pe1),1000)
> testset <- pe1[testidxs,]
> trainset <- pe1[-testidxs,]
```

As a base, I fit an ordinary degree-1 model and found the mean absolute prediction error:

```r
> lmout <- lm(wageinc ~ .,data=trainset)
> predvals <- predict(lmout,testset[,-11])
> mean(abs(predvals - testset[,11]))
[1] 25456.98
```

Next, I tried a quadratic model:

```r
> pfout <- polyFit(trainset,deg=2)
> mean(abs(predict(pfout,testset[,-11]) - testset[,11]))
[1] 24249.83
```

Note that, though originally there were 10 predictors, now with polynomial terms we have 46. I kept going:

| deg | MAPE | # terms |
|-----|----------|---------|
| 3 | 23951.69 | 118 |
| 4 | 23974.76 | 226 |
| 5 | 24340.85 | 371 |
| 6 | 24554.34 | 551 |
| 7 | 36463.61 | 767 |
| 8 | 74296.09 | 1019 |

One must keep in mind the effect of sampling variation, and repeated trials would be useful here, but it seems that the data can stand at least a cubic fit and possibly as much as degree 5 or even 6.
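One quick sanity check on these term counts is Tukey's rule of thumb that the number of parameters p should stay below √n; with n ≈ 20,000 observations here, √n ≈ 141, which brackets the cubic fit (term counts taken from the figures above; a sketch, not part of the original analysis):

```python
import math

n = 20_000                  # approximate training-set size from the post
p_max = math.sqrt(n)        # Tukey's rule of thumb: p <= sqrt(n) ~ 141.4

# number of model terms at each polynomial degree, as reported above
terms = {1: 10, 2: 46, 3: 118, 4: 226, 5: 371, 6: 551, 7: 767, 8: 1019}

degrees_within_rule = [d for d, p in terms.items() if p <= p_max]
# 118 terms (degree 3) passes the rule; 226 terms (degree 4) does not
```

The rule thus points to degree 3 as the largest "safe" fit, in line with where the test error bottoms out.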
To be conservative, it would seem wise to stop at degree 3. That’s also consistent with the old Tukey rule of thumb that we should have p ≤ sqrt(n), n being about 20,000 here. In any event, the effects of overfitting here are dramatic, starting at degree 7. It should be noted that I made no attempt to clean the data, nor to truncate predicted values at 0, etc. It should also be noted that, starting at degree 4, R emitted warnings, “prediction from a rank-deficient fit may be misleading.” It is well known that at high degrees, polynomial terms can have multicollinearity issues. Indeed, this is a major point we make in our arXiv paper cited above. There we argue that neural networks are polynomial models in disguise, with the effective degree of the polynomial increasing a lot at each successive layer, and thus multicollinearity increasing from layer to layer. We’ve confirmed this empirically. We surmise that this is a major cause of convergence problems in NNs. Finally, whenever I talk about polynomials and NNs, I hear the standard (and correct) concern that polynomials grow rapidly at the edges of the data. True, but I would point out that if you accept NNs = polynomials, then the same is true for NNs. We’re still working on the polynomials/NNs project. More developments to be announced soon. But for those who are wondering about overfitting, the data here should make the point. To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist. Continue Reading… ### How to Choose an AI Vendor This report explores why it is so challenging to choose an AI vendor and what you should consider as you seek a partner in AI.
Download now. Continue Reading… ### UnitedHealth Group: Senior Principal Data Scientist [Telecommute, Central or Eastern Time Zones] An exceptional Data Scientist with a passion for working with healthcare data, hands-on skills using various big-data and ML platform tools, and proven experience delivering business impact. Continue Reading… ### Distilled News Today’s trend of Artificial Intelligence (AI) and the increased level of automation in manufacturing allow firms to flexibly connect assets and improve productivity through data-driven insights that have not been possible before. As more automation is used in manufacturing, the speed of response required in dealing with maintenance issues is going to get faster, and automated decisions as to what’s the best option from an economic standpoint are getting more complex. I published an implementation of Faster R-CNN with the MXNet C++ frontend. You can use it as a comprehensive example of the MXNet C++ frontend: it has a custom data loader for the MS COCO dataset, implements a custom target-proposal layer without modifying the MXNet library, contains error-checking code (missing from the current C++ API), and includes Eigen and NDArray integration samples. Feel free to leave comments and proposals. I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation. I love the formattable package, but I always struggle to remember its syntax. A quick Google search reveals that I’m not alone in this struggle. This post is intended as a reminder for myself of how the package works – and hopefully you’ll find it useful too! RQDA is an R package for Qualitative Data Analysis, a free (free as in freedom) qualitative analysis software application (BSD license). It works on Windows, Linux/FreeBSD and Mac OSX platforms. RQDA is an easy to use tool to assist in the analysis of textual data.
At the moment it only supports plain text formatted data. All the information is stored in a SQLite database via the RSQLite R package. The GUI is based on RGtk2, with the aid of gWidgetsRGtk2. It includes a number of standard Computer-Aided Qualitative Data Analysis features. In addition it seamlessly integrates with R, which means that a) statistical analysis on the coding is possible, and b) functions for data manipulation and analysis can be easily extended by writing R functions. To some extent, RQDA and R make an integrated platform for both quantitative and qualitative data analysis. Literature reviews are the cornerstone of science. Keeping abreast of developments within any given field of enquiry has become increasingly difficult given the enormous amounts of new research. Databases and search technology have made finding relevant literature easy, but keeping a coherent overview of the discourse within a field of enquiry is an ever more encompassing task. Scholars have proposed many approaches to analysing literature, which can be placed along a continuum from traditional narrative methods to systematic analytic syntheses of text using machine learning. Traditional reviews are biased because they rely entirely on the interpretation of the researcher. Analytical approaches follow a process that is more like scientific experimentation. These systematic methods are reproducible in the way literature is searched and collated but still rely on subjective interpretation. Machine learning provides new methods to analyse large swaths of text. Although these methods sound exciting, they are incapable of providing insight on their own: machine learning cannot interpret a text; it can only summarise and structure a corpus, and human interpretation is still needed to make sense of the information. This article introduces a mixed-method technique for reviewing literature, combining qualitative and quantitative methods.
I used this method to analyse literature published by the International Water Association as part of my dissertation into water utility marketing. You can read the code below, or download it from GitHub. Detailed information about the methodology is available through FigShare. Delivery time prediction has long been a part of city logistics, but refining accuracy has recently become very important for services such as Deliveroo, Foodpanda and Uber Eats which deliver food on-demand. These services and similar ones must receive an order and have it delivered within ~30 minutes to appease their users. In these situations +/- 5 minutes can make a big difference so it’s very important for customer satisfaction that the initial prediction is highly accurate and that any delays are communicated effectively. Assume that we have some data that looks like the above, and the red arrow shows the first principal component, while the blue arrow shows the second principal component. How would normalization as well as standardization change those two? Also, can we perform some kind of normalization with respect to the variance we have? (I will make more posts as I go along.) Computers are fast: they store and manipulate data using electronic signals that travel across their silicon internals at hundreds of thousands of miles per hour. For comparison, the fastest signals in the human nervous system travel at about 250mph, which is about 3 million times slower, and those speeds are only possible for unconscious signals – signal speeds observed for conscious thought and calculation are typically orders of magnitude slower still. Basically, we’re never going to be able to out-calculate a computer. How would you like to speed up your language modeling (LM) tasks by 1000%, with nearly no drop in accuracy? A recent paper from Grave et al.
(2017), called ‘Efficient softmax approximation for GPUs’, shows how you can gain a massive speedup in one of the most time-consuming aspects of language modeling, the computation-heavy softmax step, through their ‘adaptive softmax’. The giant speedup from using the adaptive softmax comes with only minimal costs in accuracy, so anyone who is doing language modeling should definitely consider using it. Here in Part 1 of this blog post, I’ll fully explain the adaptive softmax, then in Part 2 I’ll walk you step by step through a Pytorch implementation (with an accompanying Jupyter notebook), which uses Pytorch’s built-in AdaptiveLogSoftmaxWithLoss function. In Part 1 of this blog post, I explained how the adaptive softmax works, and how it can speed up your language model by up to 1000%. Here in Part 2, I’ll walk you step by step through a Pytorch implementation (with an accompanying Jupyter notebook), which uses Pytorch’s built-in AdaptiveLogSoftmaxWithLoss function. For preprocessing you will need fastai (see https://docs.fast.ai ), a deep learning library that runs on top of Pytorch and simplifies training neural networks. [For those who want to learn state-of-the-art deep learning techniques, I highly recommend Jeremy Howard’s fast.ai course, which is available online for free: https://…/]. I decided to use the Wikitext-2 dataset, which is a relatively small dataset that contains ~2 million tokens and ~33 thousand words in the vocabulary. Once the data has been downloaded and properly formatted in csv files, fastai makes it easy to quickly tokenize, numericalize, and create a data-loader that will be used for training. I also downloaded GloVe pre-trained word vectors that are used for the models’ word embeddings. Finally, I created a modeler class that handles training.
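The intuition behind the adaptive softmax speedup can be sketched with a back-of-the-envelope cost model: frequent words live in a small "head" softmax, and rare words in a "tail" that is only evaluated when needed. All sizes and the Zipf-like frequency model below are illustrative assumptions, not numbers from the paper:

```python
# Toy cost model for a two-level (adaptive) softmax.
# The cost of one softmax is proportional to the number of output units.
V = 50_000          # vocabulary size (illustrative)
head = 2_000        # most frequent words kept in the head (illustrative)

# Zipf-like unnormalised word frequencies: p(rank i) ~ 1/i
weights = [1 / i for i in range(1, V + 1)]
total = sum(weights)
p_head = sum(weights[:head]) / total   # chance a token is a frequent word

full_cost = V                          # full softmax: always V units
# Head softmax has head + 1 units (one extra for the tail-cluster token);
# the tail softmax (V - head units) is evaluated only for rare tokens.
adaptive_cost = (head + 1) + (1 - p_head) * (V - head)
speedup = full_cost / adaptive_cost
```

Under these assumptions roughly 70% of tokens never touch the tail, so the expected per-token cost drops severalfold; the real implementation uses multiple tail clusters and GPU-aware sizing for a larger gain.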
Continue Reading… ### Because it's Friday: The physics of The Expanse For a science fiction show set hundreds of years in the future, The Expanse is unusual in that it takes very few liberties with science as we understand it today. The solar system is made up of the familiar planets we know (other than the colonists and space stations spread throughout the system), communication is limited by the speed of light, and spaceships operate under standard Newtonian principles. You can get a good sense in this highlight reel of the visual effects of the show: The series was recently saved from cancellation by Amazon: the first three seasons are streaming on Amazon Prime now (in the US at least) and I can't wait for the fourth season and the continuation of the excellent story, which hews closely to the source novels. That's all from the blog for this week. Next week will be a short week for us with US Thanksgiving coming up: if you're celebrating, have a great holiday! Continue Reading… ### Magister Dixit “1. Think carefully about which projects you take on. 2. Use as much data as you can from as many places as possible. 3. Don’t just use internal customer data. 4. Have a clear sampling strategy. 5. Always use a holdout sample. 6. Spend time on ‘throwaway’ modelling. 7. Refresh your model regularly. 8. Make sure your insights are meaningful to other people. 9. Use your model in the real world.” Rachel Clinton ( January 7, 2015 ) Continue Reading… ### UnitedHealth Group: Data Analytics and Reporting Lead [Minnetonka, MN or Telecommute] This would be a hands-on technical lead role that will require end-to-end ownership of our UnitedHealthcare Fin360 applications. Continue Reading… ### UnitedHealth Group: Clinical Data Statistical Analyst – SQL SAS (Clinician Required) [Telecommute] Leverage your data analytic and project management skills to lead programs that focus on improving HEDIS rates and impacting the quality of care for our members.
Continue Reading… ### R Packages worth a look Header-Only C++ Mathematical Optimization Library for ‘Armadillo’ (RcppEnsmallen) ‘Ensmallen’ is a templated C++ mathematical optimization library (by the ‘MLPACK’ team) that provides a simple set of abstractions for writing an object … Network Tools for Memory Research (memnet) Efficient implementations of network science tools to facilitate research into human (semantic) memory. In its current version, the package contains se … Sequential Quadratic Programming for Fast Maximum-Likelihood Estimation of Mixture Proportions (mixsqp) Provides optimization algorithms based on sequential quadratic programming (SQP) for maximum likelihood estimation of the mixture proportions in a fini … Continue Reading… ### The tidy caret interface in R (This article was first published on poissonisfish, and kindly contributed to R-bloggers) Among the most popular off-the-shelf machine learning packages available to R, caret ought to stand out for its consistency. It reaches out to a wide range of dependencies that deploy and support model building using a uniform, simple syntax. I have been using caret extensively for the past three years, with a precious partial least squares (PLS) tutorial in my track record. A couple of years ago, the creator and maintainer of caret, Max Kuhn, joined RStudio, where he has been contributing new packages to the ongoing tidy-paranoia – the supporting recipes, yardstick, rsample and many other packages that are part of the tidyverse paradigm, which I knew little about. As it happens, caret is now best used with some of these. As an aspiring data scientist with fashionable hex-stickers on my laptop cover and a tendency to start any sentence with ‘big data’, I set out to learn the tidyverse, going Super Mario using pipes (%>%, Ctrl + Shift + M). Overall, I found the ‘tidy’ approach quite enjoyable and efficient.
Besides streamlining model design in a sequential manner, recipes and the accompanying ‘tidy’ packages fix a common problem in the statistical learning realm that curtails predictive accuracy: data leakage. You can now generate a ‘recipe’ that lays out the plan for all the feature engineering (or data pre-processing, depending on how many hex-stickers you have) and execute it separately for validation and resampling splits (i.e. split first, process later), thus providing a more robust validation scheme. It also enables recycling any previous computations afterwards. I drew a lot of inspiration from a webinar and a presentation from Max himself to come up with this tutorial. Max is now writing a new book, entitled Feature Engineering and Selection: A Practical Approach for Predictive Models, that will include applications of caret supported by recipes. I very much look forward to reading it, considering he did a fantastic job with Applied Predictive Modeling. The aim here is to test this new caret framework on the DREAM Olfaction Prediction Challenge, a crowd-sourced initiative launched in January 2015. This challenge demonstrated that predictive models can identify perceptual attributes (e.g. coffee-like, spicy and sweet notes) in a collection of fragrant compounds from physicochemical features (e.g. atom types). Results and conclusions were condensed into a research article [1] that reports an impressive predictive accuracy for a large number of models and teams, over various perceptive attributes. Perhaps more impressive yet is the fact that it anticipates the very scarce knowledge of human odour receptors and their molecular activation mechanisms. Our task, more concretely, will be to predict population-level pleasantness of hundreds of compounds from physicochemical features, using the training set of the competition. Finally, a short announcement. Many people have been raising reproducibility issues with previous entries, due to either package archiving (e.g.
GenABEL, for genetic mapping) or cross-incompatibility (e.g. caret, for PLS). There were easy fixes for the most part, but I have nevertheless decided to start packing scripts together with sessionInfo() reports in unique GitHub repositories. The bundle for this tutorial is available under https://github.com/monogenea/olfactionPrediction. I hope this will help track down the exact package versions and system configurations I use, so that anyone can anytime fully reproduce these results. No hex-stickers required! NOTE that most code chunks containing pipes are corrupted. PLEASE refer to the materials from the repo. ## The DREAM Olfaction Prediction Challenge The DREAM Olfaction Prediction Challenge training set consists of 338 compounds characterised by 4,884 physicochemical features (the design matrix), and profiled by 49 subjects with respect to 19 semantic descriptors, such as ‘flower’, ‘sweet’ and ‘garlic’, together with intensity and pleasantness (all perceptual attributes). Two different dilutions were used per compound. The perceptual attributes were given scores in the 0-100 range. The two major goals of the competition were to use the physicochemical features in modelling i) the perceptual attributes at the subject-level and ii) both the mean and standard deviation of the perceptual attributes at the population-level. How ingenious it was to model the standard deviation, the variability in how something tastes to different people! In the end, the organisers garnered important insights about the biochemical basis of flavour, from teams all over the world. The resulting models were additionally used to reverse-engineer nonexistent compounds from target sensory profiles – an economically exciting prospect. According to the publication, the top-five best predicted perceptual attributes at the population-level (i.e. mean values across all 49 subjects) were ‘garlic’, ‘fish’, ‘intensity’, pleasantness and ‘sweet’ (cf. Fig. 
3A in [1]), all exhibiting average prediction correlations of $r \ge 0.5$, far greater than the average subject-level predictions (cf. Fig. 2D in [1]). This should come as no surprise since averaging over subjects is more likely to flesh out universal mechanisms. This averaging also helps stabilise the subject-level variance – no subject will necessarily utilise the full 0-100 scoring range; some will tend to use narrow intervals (low variance) while some will use the whole range instead (high variance). Since pleasantness is one of the most general olfactory properties, I propose modelling the population-level median pleasantness of all compounds, from all physicochemical features. The median, as opposed to the mean, is less sensitive to outliers and an interesting departure from the original approach we can later compare against. # Let’s get started with R We can start off by loading all the necessary packages and pulling the data directly from the DREAM Olfaction Prediction GitHub repository. More specifically, we will create a directory called data and import both the profiled perceptual attributes (TrainSet.txt) and the physicochemical features that comprise our design matrix (molecular_descriptors_data.txt). Because we are tidy people, we will use readr to parse these as tab-separated values (TSV). I also suggest re-writing the column name Compound Identifier in the sensory profiles into CID, to match the design matrix molFeats.
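The robustness claim about the median is easy to verify numerically: a single extreme value drags the mean far more than the median (toy numbers for illustration, nothing from the actual dataset):

```python
from statistics import mean, median

scores = [35, 40, 45, 50, 55, 60, 65]   # well-behaved ratings
tainted = scores + [100_000]            # one wild outlier added

mean_shift = abs(mean(tainted) - mean(scores))       # enormous
median_shift = abs(median(tainted) - median(scores)) # tiny
```

Here the mean jumps by over 12,000 while the median moves by only 2.5, which is exactly why the median is the safer population-level summary when some subjects score erratically.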
```r
# Mon Oct 29 13:17:55 2018 ------------------------------
## DREAM olfaction prediction challenge
library(caret)
library(rsample)
library(tidyverse)
library(recipes)
library(magrittr)
library(doMC)
# Create directory and download files
dir.create("data/")
ghurl <- "https://github.com/dream-olfaction/olfaction-prediction/raw/master/data/"
download.file(paste0(ghurl, "TrainSet.txt"),
              destfile = "data/trainSet.txt")
download.file(paste0(ghurl, "molecular_descriptors_data.txt"),
              destfile = "data/molFeats.txt")
# Read files with readr, select least diluted compounds
responses <- read_tsv("data/trainSet.txt") %>%
  rename(CID = `Compound Identifier`)
molFeats <- read_tsv("data/molFeats.txt") # molecular features
```

Next, we filter compounds that are common to both datasets and reorder them accordingly. The profiled perceptual attributes span a total of 49 subjects, two different dilutions and some replications (cf. Fig. 1A in [1]). Determine the median pleasantness (i.e. VALENCE/PLEASANTNESS) per compound, across all subjects and dilutions while ignoring missingness with na.rm = T. The last line will ensure the order of the compounds is identical between this new dataset and the design matrix, so that we can subsequently introduce population-level pleasantness as a new column termed Y in the new design matrix X. We no longer need CID so we can discard it. Regarding missingness, I found maxima of 0.006% and 23% NA content over all columns and rows, respectively. A couple of compounds could be excluded but I would move on.

```r
# Determine intersection of compounds in features and responses
commonMols <- intersect(responses$CID, molFeats$CID)
responses <- filter(responses, CID %in% commonMols)
molFeats <- filter(molFeats, CID %in% commonMols)
medianPlsnt <- responses %>%
  group_by(CID) %>%
  dplyr::summarise(pleasantness = median(`VALENCE/PLEASANTNESS`, na.rm = T))
all(medianPlsnt$CID == molFeats$CID) # TRUE - rownames match
# Concatenate predictors (molFeats) and population pleasantness
X <- molFeats %>%
  mutate(Y = medianPlsnt$pleasantness) %>%
  select(-CID)
```

In examining the structure of the design matrix, you will see there are many skewed binary variables.
In the most extreme case, if a binary predictor is either all zeros or all ones throughout, it is said to be a zero-variance predictor and, if anything, it will harm the model training process. There is still the risk of having near-zero-variance predictors, which can be controlled based on various criteria (e.g. number of samples, size of splits). Of course, this can also impact quantitative, continuous variables. Here we will use nearZeroVar from caret to identify faulty predictors that are either zero-variance or display values whose frequency exceeds a predefined threshold. Having a 5x repeated 5-fold cross-validation in mind, I suggest setting freqCut = 4, which will rule out i) binary variables with $|f(0) - f(1)| \ge 0.60$, and ii) continuous variables with $\frac{\# most common value}{\# second most common value} \ge 4$. See ?nearZeroVar for more information. From 4,870 features we are left with 2,554 – a massive reduction by almost a half.

```r
# Filter nzv
X <- X[,-nearZeroVar(X, freqCut = 4)] # == 80/20
```

Now we will use rsample to define the train-test and cross-validation splits. Please use set.seed as it is, to ensure you will generate the same partitions and arrive at the same results. Here I have allocated 10% of the data to the test set; by additionally requesting stratification based on the target Y, we are sure to have a representative subset. We can next pass the new objects to training and testing to effectively split the design matrix into train and test sets, respectively. The following steps consist of setting up a 5x repeated 5-fold cross-validation for the training set. Use vfold_cv and convert the output to a caret-like object via rsample2caret. Next, initialise a trainControl object and overwrite index and indexOut using the corresponding elements in the newly converted vfold_cv output.
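The frequency-ratio rule behind nearZeroVar can be sketched in a few lines (a simplified version: caret's real implementation additionally applies a percent-unique cut-off, omitted here):

```python
from collections import Counter

def near_zero_var(column, freq_cut=4):
    """True if the most common value occurs at least freq_cut times as
    often as the second most common one (or the column is constant)."""
    top = Counter(column).most_common(2)
    if len(top) < 2:          # constant column: zero variance
        return True
    return top[0][1] / top[1][1] >= freq_cut
```

With freq_cut = 4 a binary column is dropped once one level reaches 80% of the samples, matching the 80/20 comment in the R code.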
```r
# Split train/test with rsample
set.seed(100)
initSplit <- initial_split(X, prop = .9, strata = "Y")
trainSet <- training(initSplit)
testSet <- testing(initSplit)
# Create 5-fold cross-validation, convert to caret class
set.seed(100)
myFolds <- vfold_cv(trainSet, v = 5, repeats = 5) %>%
  rsample2caret()
ctrl <- trainControl(method = "cv", selectionFunction = "oneSE")
ctrl$index <- myFolds$index
ctrl$indexOut <- myFolds$indexOut
```

Prior to modelling, I will create two reference variables – binVars, to identify binary variables, and missingVars, to identify any variables containing missing values. These will help with i) excluding binary variables from mean-centering and unit-variance scaling, and ii) restricting mean-based imputation to variables that do contain missing values, respectively.

```r
# binary vars
binVars <- which(sapply(X, function(x){all(x %in% 0:1)}))
missingVars <- which(apply(X, 2, function(k){any(is.na(k))}))
```

Recipes are very simple in design. The call to recipe is first given the data it is supposed to be applied on, as well as a lm-style formula. The recipes package contains a family of functions with a step_... prefix, involved in encodings, date conversions, filters, imputation, normalisation, dimension reduction and a lot more. These can be linked using pipes, forming a logical sequence of operations that starts with the recipe call. Supporting functions such as all_predictors, all_outcomes and all_nominal delineate specific groups of variables at any stage in the sequence. One can also use the names of the variables, akin to my usage of binVars and missingVars. I propose writing a basic recipe myRec that does the following:

• Yeo-Johnson [2] transformation of all quantitative predictors;
• Mean-center all quantitative predictors;
• Unit-variance scale all quantitative predictors;
• Impute missing values with the mean of the incident variables.
# Design recipe
myRec <- recipe(Y ~ ., data = trainSet) %>%
  step_YeoJohnson(all_predictors(), -binVars) %>%
  step_center(all_predictors(), -binVars) %>%
  step_scale(all_predictors(), -binVars) %>%
  step_meanimpute(missingVars)

In brief, the Yeo-Johnson procedure transforms variables to be as Gaussian-like as possible. Before delving into the models, let us tweak the recipe into pcaRec, conduct a principal component analysis and examine how different compounds distribute along the first two principal components. Colour them based on their population-level pleasantness – red for very pleasant, blue for very unpleasant and the gradient in between – via colGrad.

# simple PCA, plot
pcaRec <- myRec %>%
  step_pca(all_predictors())
myPCA <- prep(pcaRec, training = trainSet) %>%
  juice()
colGrad <- trainSet$Y/100 # add color

plot(myPCA$PC1, myPCA$PC2,
col = rgb(colGrad, 0, 1 - colGrad, alpha = .5),
pch = 16, xlab = "PC1", ylab = "PC2")
legend("topleft", pch = 16,
col = rgb(c(1,.5,0), 0, c(0,.5,1), alpha = .5),
legend = c("Pleasant", "Neutral", "Unpleasant"), bty = "n")


The compounds do not seem to cluster into groups, nor do they clearly separate with respect to pleasantness.

Note that pcaRec itself is just a recipe in waiting. Unless a recipe is passed to caret::train (which preps it internally), you need to either ‘bake’ or ‘juice’ it to process and extract the data as instructed. The difference between the two is that ‘juicing’ outputs the data the recipe was trained on (e.g. the training set), whereas ‘baking’ can be applied to process any other dataset. ‘Baking’ is done with bake, provided the recipe was trained using prep. I hope this explains why I used juice above!
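Since the Yeo-Johnson transformation does the heavy lifting in the recipe, it may help to see it written out. A minimal Python sketch for a single value and a fixed lambda (step_YeoJohnson estimates lambda from the data, which we skip here):

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of a single value for a given lambda.
    lambda = 1 leaves non-negative data essentially unchanged;
    lambda = 0 reduces to a shifted log."""
    if x >= 0:
        if lmbda != 0:
            return ((x + 1) ** lmbda - 1) / lmbda
        return math.log(x + 1)
    else:
        if lmbda != 2:
            return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
        return -math.log(-x + 1)
```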

Next we have the model training. I propose training six regression models – k-nearest neighbours (KNN), elastic net (ENET), support vector machine with a radial basis function kernel (SVM), random forests (RF), extreme gradient boosting (XGB) and Quinlan’s Cubist (CUB) – aiming at minimising the root mean squared error (RMSE). Note I am using tuneGrid and tuneLength interchangeably: I prefer to supply predefined hyper-parameters to the simpler models I am more familiar with, and run the rest with a number of defaults via tuneLength.

Parallelise the computations if possible. On macOS, I can use registerDoMC from the doMC package to multi-thread the training procedure. If you are running a different OS, use a different library if necessary: to the best of my knowledge, doMC also works on Linux, but Windows users might need the doParallel package instead.

# Train
doMC::registerDoMC(10)
knnMod <- train(myRec, data = trainSet,
method = "knn",
tuneGrid = data.frame(k = seq(5, 25, by = 4)),
trControl = ctrl)

enetMod <- train(myRec, data = trainSet,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, length.out = 5),
lambda = seq(.5, 2, length.out = 5)),
trControl = ctrl)

svmMod <- train(myRec, data = trainSet,
method = "svmRadial",
tuneLength = 8,
trControl = ctrl)

rfMod <- train(myRec, data = trainSet,
method = "ranger",
tuneLength = 8,
num.trees = 5000,
trControl = ctrl)

xgbMod <- train(myRec, data = trainSet,
method = "xgbTree",
tuneLength = 8,
trControl = ctrl)

cubMod <- train(myRec, data = trainSet,
method = "cubist",
tuneLength = 8,
trControl = ctrl)

modelList <- list("KNN" = knnMod,
"ENET" = enetMod,
"SVM" = svmMod,
"RF" = rfMod,
"XGB" = xgbMod,
"CUB" = cubMod)


Once the training is over, we can compare the six optimal cross-validated models based on their RMSE across all $5 \times 5 = 25$ resamples, using some bwplot magic sponsored by lattice. In my case the runtime was unusually long (>2 hours) compared to previous releases.

bwplot(resamples(modelList),
metric = "RMSE")


The cross-validated RMSE does not vary considerably across the six optimal models. To conclude, I propose creating a model ensemble by simply averaging the predictions of all six optimal models on the test set, to compare observed and predicted population-level pleasantness in this hold-out subset of compounds.

# Validate on test set with ensemble
allPreds <- sapply(modelList, predict, newdata = testSet)
ensemblePred <- rowSums(allPreds) / length(modelList)

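The ensembling step is nothing more than a per-row average of the per-model predictions; in language-agnostic terms, a minimal Python sketch (with an rmse helper added for reference, since that is the metric optimised above):

```python
def ensemble_mean(preds_by_model):
    """Row-wise mean over per-model prediction vectors, i.e. the
    rowSums(allPreds) / length(modelList) step above."""
    n_models = len(preds_by_model)
    return [sum(row) / n_models for row in zip(*preds_by_model)]

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length vectors."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5
```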
# Plot predicted vs. observed; create PNG
plot(ensemblePred, testSet$Y,
xlim = c(0,100), ylim = c(0,100),
xlab = "Predicted", ylab = "Observed",
pch = 16, col = rgb(0, 0, 0, .25))
abline(a = 0, b = 1)
writeLines(capture.output(sessionInfo()), "sessionInfo")

The ensemble fit on the test set is not terrible – slightly underfit, predicting poorest at the two extremes of the pleasantness range. All in all, it attained a prediction correlation of $r \approx 0.73$, slightly larger than the mean reported (cf. Fig. 3A in [1]). Note, however, that there are only 31 compounds in the test set. A model like this could easily be moved into production, or repurposed to solve a molecular structure from a sensory profile of interest, as mentioned earlier.

## Wrap-up

The revamped caret framework is still under development, but it seems to offer

• A substantially expanded, unified syntax for models and utils. Keep an eye on textrecipes, an upcoming complementary package for processing textual data;
• A sequential streamlining of extraction, processing and modelling, enabling recycling of previous computations;
• Executing-after-splitting, thereby precluding leaky validation strategies.

On the other hand, I hope to see improvements in

• Decoupling from proprietary frameworks in the likes of tidyverse, although there are alternatives;
• The runtime. I suspect the recipes themselves are the culprit, not the models. Future updates will certainly fix this.

Regarding the DREAM Olfaction Prediction Challenge, we were able to predict population-level perceived pleasantness from physicochemical features with an accuracy as good as that achieved, on average, by the competing teams. Our model ensemble seems underfit, judging from the narrow range of predicted values. Maybe we could be less restrictive about the near-zero-variance predictors and use a repeated 3-fold cross-validation. I hope you had as much fun as I did dissecting a fascinating problem while unveiling a bit of the new caret.
You can use this code as a template for exploring different recipes and models. If you encounter any problems or have any questions, do not hesitate to post a comment underneath – we all learn from each other!

## References

[1] Keller, Andreas et al. (2017). Predicting human olfactory perception from chemical features of odor molecules. Science, 355(6327), 820-826.

[2] Yeo, In-Kwon and Johnson, Richard (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87(4), 954-959.

To leave a comment for the author, please follow the link and comment on their blog: poissonisfish. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

### Top 10 Python Data Science Libraries

The third part of our series investigating the top Python Libraries across Machine Learning, AI, Deep Learning and Data Science.

Continue Reading…

### Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4

Apache Spark 2.4 introduces 29 new built-in functions for manipulating complex types (for example, array type), including higher-order functions. Before Spark 2.4, there were two typical solutions for manipulating complex types directly: 1) exploding the nested structure into individual rows, applying some functions, and then creating the structure again; 2) building a user-defined function (UDF). In contrast, the new built-in functions can directly manipulate complex types, and the higher-order functions can manipulate complex values with an anonymous lambda function, similar to UDFs but with much better performance.
In this blog, through some examples, we’ll show some of these new built-in functions and how to use them to manipulate complex data types.

## Typical solutions

Let’s first review the typical solutions with the following examples.

### Option 1 – Explode and Collect

We use explode to break the array into individual rows and evaluate val + 1, and then use collect_list to restructure the array:

SELECT id, collect_list(val + 1) AS vals
FROM (SELECT id, explode(vals) AS val FROM input_tbl) x
GROUP BY id

This is error-prone and inefficient for three reasons. First, we have to be diligent to ensure that the re-collected arrays are made exactly from the original arrays by grouping them by a unique key. Second, we need a group-by, which means a shuffle operation; a shuffle operation is not guaranteed to keep the element order of the re-collected array. And finally, it is expensive.

### Option 2 – User Defined Function

Next, we use a Scala UDF which takes a Seq[Int] and adds 1 to each element in it:

def addOne(values: Seq[Int]): Seq[Int] = {
  values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_: Seq[Int]): Seq[Int])

or we can also use a Python UDF, and then:

SELECT id, plusOneInt(vals) AS vals FROM input_tbl

This is simpler and faster and does not suffer from correctness pitfalls, but it might still be inefficient because the data serialization into Scala or Python can be expensive. You can see the examples in a notebook in a blog that we published and try them.

## New Built-in Functions

Let’s see the new built-in functions for manipulating complex types directly. The notebook lists the examples for each function. The signatures and arguments for each function are annotated with their respective types: T or U to denote array element types, and K, V to denote map key and value types.
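To make the mechanics of Option 1 concrete, here is a plain-Python imitation of the explode + group-by + collect_list round trip (function and variable names are ours, not Spark's; the point is the regrouping step, which in Spark involves a shuffle):

```python
from collections import defaultdict

def explode_collect(rows, f):
    """Mimic Option 1: explode each (id, vals) row into single elements,
    apply f to each element, then regroup the elements by id."""
    exploded = [(i, v) for i, vals in rows for v in vals]
    regrouped = defaultdict(list)
    for i, v in exploded:
        regrouped[i].append(f(v))
    return dict(regrouped)
```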
## Higher-Order Functions

For further manipulation of array and map types, we used known SQL syntax for the anonymous lambda function, and higher-order functions that take lambda functions as arguments. The syntax for a lambda function is as follows:

argument -> function body
(argument1, argument2, ...) -> function body

The left side of the -> symbol defines the argument list, and the right side defines the function body, which can use the arguments and other variables to calculate the new value.

### Transform with Anonymous Lambda Function

Let’s see an example with the transform function, which employs an anonymous lambda function. Here we have a table of data with 3 columns: key, an integer; values, an array of integers; and nested_values, an array of arrays of integers.

key | values | nested_values
1 | [1, 2, 3] | [[1, 2, 3], [], [4, 5]]

When we execute the following SQL:

SELECT TRANSFORM(values, element -> element + 1) FROM data;

the transform function iterates over the array, applies the lambda function to add 1 to each element, and creates a new array. We can also use variables other than the arguments in the lambda function, for example key, which comes from the outer context (a column of the table):

SELECT TRANSFORM(values, element -> element + key) FROM data;

If you want to manipulate a deeply nested column, like nested_values in this case, you can use nested lambda functions:

SELECT TRANSFORM(
  nested_values,
  arr -> TRANSFORM(arr,
    element -> element + key + SIZE(arr)))
FROM data;

The internal lambda function can use key and arr, which come from the outer context: a column of the table and an argument of the outer lambda function. Note, you can see the typical-solution versions of these examples in the notebook, and the examples of the other higher-order functions are included in the notebook for built-in functions.
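The semantics of TRANSFORM, including outer-context capture and nesting, can be mimicked in a few lines of Python (a sketch; transform here is our own helper, not Spark's API):

```python
def transform(arr, f):
    """Semantics of Spark's TRANSFORM higher-order function:
    apply the lambda to each element, returning a new array."""
    return [f(x) for x in arr]

key = 1                                # outer context, like the table column
values = [1, 2, 3]
nested = [[1, 2, 3], [], [4, 5]]

plus_one = transform(values, lambda e: e + 1)
plus_key = transform(values, lambda e: e + key)   # captures outer `key`
# nested lambdas: inner one uses both `key` and the outer argument `arr`
nested_t = transform(nested, lambda arr: transform(arr, lambda e: e + key + len(arr)))
```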
## Conclusion

Spark 2.4 introduced 24 new built-in functions, such as array_union, array_max/min, etc., and 5 higher-order functions, such as transform, filter, etc., for manipulating complex types. The whole list and their examples are in this notebook. If you have any complex values, consider using them and let us know of any issues. We would like to thank contributors from the Apache Spark community: Alex Vayda, Bruce Robbins, Dylan Guedes, Florent Pepin, H Lu, Huaxin Gao, Kazuaki Ishizaki, Marco Gaido, Marek Novotny, Neha Patil, Sandeep Singh, and many others.

## Read More

To find out more about higher-order and built-in functions, see the following resources:

1. Try the accompanying notebook
2. Read the previous blog on higher-order functions
3. Watch the Spark + AI Summit Europe talk on higher-order functions

The post Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4 appeared first on Databricks.

Continue Reading…

### Book of hand-painted ski maps

When you go skiing or snowboarding, you get a map of the mountain that shows the terrain and where you can go. James Niehues is the man behind many of these hand-painted ski maps around the world, and he has a Kickstarter to catalog his life’s work. This is kind of amazing. I went skiing a lot as a kid, and I have distinct memories of these maps. I would stand at the top of the mountain, rip off one of my gloves with my teeth, and then pull out a folded map from a zipped pocket. I never knew they were by the same man, but in retrospect, it makes sense.

Continue Reading…

### Using Uncertainty to Interpret your Model

We outline why you should care about uncertainty and discuss the different types, including model, data and measurement uncertainty, and what different purposes they all serve.
Continue Reading…

### Report from the Enterprise Applications of the R Language conference

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Mark Niemann-Ross

Mango Solutions presented the EARL conference in Seattle this November and I was fortunate enough to have time to attend. During the conference I took notes on the presentations, which I’ll pass along to you.

### Thoughts on the future of R in industry

The EARL conference started with a panel discussion on R. Moderated by Doug Ashton from Mango Solutions, the panel included Julia Silge from Stack Overflow, David Smith from Microsoft, and Joe Cheng from RStudio. Topics ranged from the tidyverse, R certification, and R as a general-purpose language to Shiny and the future of R. I captured some interesting quotes from the panel members. Regarding R certification, David Smith points out that certifications are important for big companies trying to filter employment applications; he mentions that certification is a minimum bar for some HR departments. Julia Silge mentions that Stack Overflow has found certifications to be less influential in the hiring process. On R as a general-purpose language: Joe Cheng feels that R is useful for more than just statistics, but that RStudio isn’t interested in developing general-purpose tools. There was discussion around Python as the “second best” language for a lot of applications, and agreement that R should remain focused on data science. Most interesting was the discussion regarding the future of R. Julia Silge points out that Stack Overflow data shows R growing fast year over year — at about the same rate as Python. There are a lot of new users, and packages need to take that into account.

### I learned more about Natural Language Processing

Tim Oldfield introduced the conference as NOT a code-based conference. However, Julia Silge didn't hesitate to illustrate her discussion with appropriate code.
And seriously, how would you discuss natural language processing without using the language of the trade? I won't get into the terms (TF-IDF) and technology (tidytext) of Ms. Silge's presentation. I will mention she does a great job of summarizing how and why to perform text mining. Like all good tech, you can easily scratch the surface of text mining in fifteen minutes; a thorough understanding requires years of hard research. If you'd like an introduction to her work, take a look at her article She Giggles, He Gallops, analyzing gender tropes in film.

### I gained an understanding of machine learning

David Smith from Microsoft presented a session on neural nets, machine learning and transfer learning. More than just a theoretical chat, David illustrated the concepts with working tools. I’ve read quite a bit about machine learning – but this illustration really brings it home. Oh — and it’s pretty damn funny. (David posted this in a blog entry here.)

### I learned to (sort of) love Excel

Eina Ooka found herself forced to incorporate Excel into her data workflow. Now — we all have opinions about Excel in data science — but Eina points out that for multidisciplinary data science teams, it’s good for data storage, modeling, and reports. There are issues around reproducibility and source control, and for those, R is a good solution. But Eina summarizes that Excel is still useful; not all projects can move away from it.

### Data science teams without structured, intentional collaboration leak knowledge and waste resources

Stephanie Kirmer provided real-life experience and lessons learned on data science teams. Her themes included collaboration, version control, reproducibility, institutional knowledge, and other concerns, all addressed through the consistent use of R packages. One of her most interesting concepts was using packages to capture institutional knowledge.
Documenting procedures as functions, contained in a standardized package, provides stability and a common tool. This package then becomes an on-boarding tool for new hires and a knowledge repository for departing staff.

### I learned risk can be studied and quantified

Risk is the chance that something bad will happen. David Severski from Starbucks revealed some of the tools used by the coffee giant, specifically OpenFAIR (Open Factor Analysis of Information Risk) and EvaluatoR (an R package for risk evaluation). Dave points out that R is an elegant language for data tasks. It has an ecosystem of tools for simulations and statistics, making risk evaluation a plug-and-play process. Personally, I don’t have call for risk evaluation, but it’s interesting to get a quick peek into the language and concerns of this specialty.

### I was reminded of the Science in Data Science

Mark Gallivan reminds us about the science in data science. He researched the effect of Zika on travel by searching Twitter for #babymoon, then cross-referenced the geolocation of the tweets to draw conclusions about the impact of Zika on the travel plans of pregnant women. This is one of those useful presentations on the nuts and bolts of research.

### I gained knowledge for non-R conversations

On November 7th I attended the EARL (Enterprise Applications of the R Language) conference in Seattle. Two days later I attended Orycon, the annual science fiction convention in Portland, Oregon. After every conference I attend, I do a private post-mortem: I ask myself whether the conference was worthwhile, whether I’d attend again, and whether my time was well spent. EARL is a deep dive into the application of the R language. Attendees are assumed to have a deep understanding of R, statistics, and a domain; the quintessential definition of data science. Orycon is a gathering of science fiction fans.
It includes a crowd of cosplayers imitating the latest Star Wars characters — but I’ll ignore them for this discussion. To be clear, the people who appreciate science fiction are deeply involved in science fact. As a result of attending EARL, I’m better prepared to understand the talent and experience sitting slightly under the radar at Orycon. I already knew the methods the experts used to perform and document their research. Thanks to David Smith’s “not hotdog” I understand machine learning at an operational level, so can skip over that discussion — or correct bad science from a misinformed espouser of pseudo-fact.

Mark is an author and educator who teaches about R and Raspberry Pi at LinkedIn Learning.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Continue Reading…

### R-bloggers weekly – top R posts from last week (2018-11-04 till 2018-11-10)

Most liked R posts from last week, sorted based on the number of likes they got on Twitter. Enjoy:

Continue Reading…

### Sorry I didn’t get that! How to understand what your users want

Creating a chatbot is difficult: it involves knowledge of many AI-hard tasks, such as Natural Language Understanding, Machine Comprehension, Inference, and Automatic Language Generation (in fact, solving these tasks is close to solving AI), and it requires large human effort.

Continue Reading…

### 2018: How did people actually vote? (The real story, not the exit polls.)

Following up on the post that we linked to last week, here’s Yair’s analysis, using Mister P, of how everyone voted.
Like Yair, I think these results are much better than what you’ll see from exit polls, partly because the analysis is more sophisticated (MRP gives you state-by-state estimates in each demographic group), partly because he’s using more data (tons of pre-election polls), and partly because I think his analysis does a better job of correcting for bias (systematic differences between the sample and population). As Yair puts it: We spent the last week combining all of the information available — pre-election projections, early voting, county and congressional district election results, precinct results where available, and polling data — to come up with our estimates. In future election years, maybe Yair’s results, or others constructed using similar methods, will become the standard, and we’ll be able to forget exit polls, or relegate them to a more minor part of our discourse. Anyway, here’s what Yair found (charts in the original post): the breakdown by age (wow); changes since the previous midterm election, in who voted and how they voted; ethnicity and education; and urban/suburban/rural. Yair’s got more at the link. And here’s our summary of what happened in 2018, that we posted a few days after the election.

Continue Reading…

### Is AI Just a Buzzword? 4 Experts Weigh In on the Future of AI

Perhaps even more so than big data or blockchain, AI is fast becoming the buzzword on everyone’s lips. Machine learning has been a promising field for years, but with the astonishing success of deep learning techniques, we’re rapidly being propelled into an automated future. But can AI withstand the hype? The post Is AI Just a Buzzword? 4 Experts Weigh In on the Future of AI appeared first on Dataconomy.

Continue Reading…

### Four short links: 16 November 2018

Illuminated Paper, Software Forge, Leak Checklist, and PC on ESP

1. IllumiPaper -- illuminated elements built into regular paper, with implementation.
2. sr.ht -- (pronounced "sir hat") a software forge like GitHub or GitLab, but with interesting strengths (e.g., very lightweight pages, and the CI system).
3. Leak Mitigation Checklist -- If you just leaked sensitive information in public source code, read this document as part of your emergency procedure.
4. Emulating an IBM PC on an ESP8266 -- an 8086 PC-XT emulation with 640K RAM, 80×25 CGA composite video, and a 1.44MB MS-DOS disk on an ESP12E without additional components. (via Alasdair Allen)

Continue Reading…

### IBM Watson IoT’s Chief Data Scientist Advocates For Ethical Deep Learning and Building The AI Humans Really Need

In terms of pioneering data-based technology, IBM are the gold standard. Indeed, IBM has held the record for receiving the most patents every year for the past 25 years and has developed countless revolutionary technologies, from SQL to the world’s fastest supercomputer. It should come as no surprise that IBM

Continue Reading…

### Basics of Entity Resolution

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism. Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.
### Naming Your Problem

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem. The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

1. Deduplication: eliminating duplicate (exact) copies of repeated data.
2. Record linkage: identifying records that reference the same entity across different sources.
3. Canonicalization: converting data with more than one possible representation into a standard form.

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective.
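Two of the three tasks above are easy to sketch for the exact-match case; here is a minimal Python illustration of deduplication and (toy) canonicalization. Real entity resolution must also handle fuzzy matches, which is where Dedupe comes in; the helper names below are ours:

```python
def deduplicate(records):
    """Task 1: drop exact duplicate records, preserving order."""
    seen, unique = set(), []
    for r in records:
        if r not in seen:
            seen.add(r)
            unique.append(r)
    return unique

def canonicalize(name):
    """Task 3: reduce a value with many representations to one standard
    form (toy rule: lowercase, strip punctuation and extra spaces)."""
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())
```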
This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library. In this post, we will explore the basic functionalities of Dedupe, walk through how the library works under the hood, and perform a demonstration on two different datasets.

## About Dedupe

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as its primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well — in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

### How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

### Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the csv_example.py from the dedupe-examples. To get Dedupe running, we'll need to install unidecode, future, and dedupe.
In your terminal (we recommend doing so inside a virtual environment):

git clone https://github.com/DistrictDataLabs/dedupe-examples.git
cd dedupe-examples
pip install unidecode
pip install future
pip install dedupe

Then we'll run the csv_example.py file to see what Dedupe can do:

python csv_example.py

### Blocking and Affine Gap Distance

Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction. But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have.

That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below? How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field.
The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
    {'field' : 'Name', 'type': 'String'},
    {'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Address', 'type': 'String', 'has missing' : True},
    {'field' : 'Purchases', 'type': 'String'},
]

Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking, affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

• the first three digits of the phone number
• the full name
• the first five characters of the name
• a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, which is a variation on Hamming distance that makes subsequent consecutive deletions or insertions cheaper. Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL.
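The blocking idea itself is simple to sketch: group records by a predicate value, and only ever compare within a group. A minimal Python illustration using the first-three-digits-of-phone predicate from the list above (the data and helper names are ours):

```python
from collections import defaultdict
from itertools import combinations

def block_by(records, predicate):
    """Group records by a blocking predicate so that only records
    sharing a predicate value are candidates for comparison."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[predicate(rec)].append(rec)
    return blocks

def candidate_pairs(blocks):
    """Pairs to compare: only within-block combinations."""
    return [pair for block in blocks.values() for pair in combinations(block, 2)]

people = [
    {"id": 452, "phone": "2025550101"},
    {"id": 821, "phone": "2025550199"},
    {"id": 233, "phone": "3345550123"},
]
blocks = block_by(people, lambda r: r["phone"][:3])
pairs = candidate_pairs(blocks)   # only the two 202 records get compared
```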
There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).

The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that the features most predictive of matches carry more weight in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.

## Active Learning

You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

```
Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished
```

Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and longer if your dataset is large. You are presented with four options:

• (y)es: confirms that the two references are to the same entity
• (n)o: labels the two references as not the same entity
• (u)nsure: does not label the two references as either the same entity or different entities
• (f)inished: ends the active learning session and triggers the supervised learning phase

You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.

As you can see in the example above, some comparison decisions are very easy. The first pair has zero matches across the four attributes being examined, so the verdict is almost certainly a non-match. On the second, we have a 3/4 exact match, with the fourth attribute being fuzzy in that one entity contains a piece of the matched entity: Ryerson vs. Chicago Public Schools Ryerson. A human would be able to discern these as two references to the same entity, and we can label it as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning, and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

```python
fields = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Phone', 'type': 'String', 'has missing' : True},
]
```

A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.
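To make the predicate-blocking idea described above concrete, here is a plain-Python sketch, not Dedupe's internals. The names and phone numbers are hypothetical, but the record IDs and area codes mirror the phone-area-code example above (IDs 452 and 821 in the 202 block, 233 and 699 in the 334 block, and 720 with no phone number):

```python
from collections import defaultdict

# Hypothetical customer records; only the IDs and area codes follow the example above.
records = [
    {'id': 452, 'name': 'Pat Smith', 'phone': '202-555-0179'},
    {'id': 821, 'name': 'Patricia Smith', 'phone': '202-555-0162'},
    {'id': 233, 'name': 'J. Lee', 'phone': '334-555-0123'},
    {'id': 699, 'name': 'Jane Lee', 'phone': '334-555-0188'},
    {'id': 720, 'name': 'Sam Jones', 'phone': None},
]

def area_code_predicate(record):
    """Blocking key: the first three digits of the phone number, or None."""
    phone = record.get('phone')
    return phone[:3] if phone else None

def block(records, predicate):
    """Group record IDs by blocking key so only same-block pairs are compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[predicate(record)].append(record['id'])
    return dict(blocks)

print(block(records, area_code_predicate))
# Three blocks: a '202' block, a '334' block, and a None block.
```

Only pairs that share a block key would go on to the more expensive field-by-field distance comparisons, which is what makes blocking such an effective way to cut down the number of comparisons.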
## Something a Bit More Challenging

To try Dedupe on a more challenging project, we decided to deduplicate the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the WhiteHouse.gov website, part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.
The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

| Database Field | Field Description |
| --- | --- |
| NAMELAST | Last name of entity |
| NAMEFIRST | First name of entity |
| NAMEMID | Middle name of entity |
| UIN | Unique Identification Number |
| BDGNBR | Badge Number |
| Type of Access | Access type to White House |
| TOA | Time of arrival |
| POA | Post on arrival |
| TOD | Time of departure |
| POD | Post on departure |
| APPT_MADE_DATE | When the appointment date was made |
| APPT_START_DATE | When the appointment date is scheduled to start |
| APPT_END_DATE | When the appointment date is scheduled to end |
| APPT_CANCEL_DATE | When the appointment date was canceled |
| Total_People | Total number of people scheduled to attend |
| LAST_UPDATEDBY | Who was the last person to update this event |
| POST | Classified as 'WIN' |
| LastEntryDate | When the last update to this instance was made |
| TERMINAL_SUFFIX | ID for terminal used to process visitor |
| visitee_namelast | The visitee's last name |
| visitee_namefirst | The visitee's first name |
| MEETING_LOC | The location of the meeting |
| MEETING_ROOM | The room number of the meeting |
| CALLER_NAME_LAST | Last name of the person authorizing the visitor |
| CALLER_NAME_FIRST | First name of the person authorizing the visitor |
| CALLER_ROOM | Room of the person authorizing the visitor |
| Description | Description of the event or visit |
| RELEASE_DATE | The date this set of logs was released to the public |

### Loading the Data

Using the API, the White House Visitor Log Requests can be exported in a variety of formats, including .json, .csv, .xlsx, .pdf, .xml, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

```python
import requests

def getData(url, fname):
    """
    Download the dataset from the webpage.
    """
    response = requests.get(url)
    # response.content is bytes, so open the file in binary mode
    with open(fname, 'wb') as f:
        f.write(response.content)

DATAURL = "https://open.whitehouse.gov/api/views/p86s-ychb/rows.csv?accessType=DOWNLOAD"
ORIGFILE = "fixtures/whitehouse-visitors.csv"

getData(DATAURL, ORIGFILE)
```

Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

## Tailoring the Code

Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

```python
import csv
import psycopg2
from dateutil import parser
from datetime import datetime

conn = None
DATABASE = your_db_name
USER = your_user_name
HOST = your_hostname
PASSWORD = your_password

try:
    conn = psycopg2.connect(database=DATABASE, user=USER,
                            host=HOST, password=PASSWORD)
    print("I've connected")
except:
    print("I am unable to connect to the database")

cur = conn.cursor()
```

The other challenge with our dataset is the numerous missing values and datetime formatting irregularities. We wanted to be able to use the datetime strings to help with entity resolution, so we wanted to get the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module and PostgreSQL's fairly forgiving 'varchar' type. This function takes the csv data as input, parses the datetime fields we're interested in, and outputs a database table that retains the desired columns ('lastname', 'firstname', 'uin', 'apptmade', 'apptstart', 'apptend', 'meeting_loc'). Keep in mind this will take a while to run.
```python
# Column indices of the datetime fields to parse (apptmade, apptstart, apptend),
# inferred from the INSERT statement below.
DATEFIELDS = [10, 11, 12]

def dateParseSQL(nfile):
    cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
                   (visitor_id SERIAL PRIMARY KEY,
                    lastname varchar,
                    firstname varchar,
                    uin varchar,
                    apptmade varchar,
                    apptstart varchar,
                    apptend varchar,
                    meeting_loc varchar);''')
    conn.commit()
    with open(nfile, 'r') as infile:
        reader = csv.reader(infile, delimiter=',')
        next(reader, None)  # skip the header row
        for row in reader:
            for field in DATEFIELDS:
                if row[field] != '':
                    try:
                        dt = parser.parse(row[field])
                        row[field] = dt.toordinal()  # We also tried dt.isoformat()
                    except:
                        continue
            sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
                   VALUES (%s,%s,%s,%s,%s,%s,%s)"
            cur.execute(sql, (row[0], row[1], row[3], row[10], row[11], row[12], row[21],))
            conn.commit()
    print("All done!")

dateParseSQL(ORIGFILE)
```

About 60 of our rows contained non-ASCII characters, which we dropped using this SQL command:

```sql
delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';
```

For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.
```python
import tempfile
import argparse
import csv
import os

import dedupe
import psycopg2
from psycopg2.extras import DictCursor
```

Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

```python
KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'

FIELDS = [{'field': 'firstname', 'variable name': 'firstname',
           'type': 'String', 'has missing': True},
          {'field': 'lastname', 'variable name': 'lastname',
           'type': 'String', 'has missing': True},
          {'field': 'uin', 'variable name': 'uin',
           'type': 'String', 'has missing': True},
          {'field': 'meeting_loc', 'variable name': 'meeting_loc',
           'type': 'String', 'has missing': True}
          ]
```

We modified a function Dan wrote to generate the predicate blocks:

```python
def candidates_gen(result_set):
    lset = set  # local alias for the built-in set
    block_id = None
    records = []
    i = 0
    for row in result_set:
        if row['block_id'] != block_id:
            if records:
                yield records
            block_id = row['block_id']
            records = []
            i += 1
            if i % 10000 == 0:
                print('{} blocks'.format(i))
        smaller_ids = row['smaller_ids']
        if smaller_ids:
            smaller_ids = lset(smaller_ids.split(','))
        else:
            smaller_ids = lset([])
        records.append((row[KEY_FIELD], row, smaller_ids))
    if records:
        yield records
```

And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

```python
def find_dupes(args):
    deduper = dedupe.Dedupe(FIELDS)
    with psycopg2.connect(database=args.dbname,
                          host='localhost',
                          cursor_factory=DictCursor) as con:
        with con.cursor() as c:
            c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
            row = c.fetchone()
            count = row['count']
            sample_size = int(count * args.sample)

            print('Generating sample of {} records'.format(sample_size))
            with con.cursor('deduper') as c_deduper:
                c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
                temp_d = dict((i, row) for i, row in enumerate(c_deduper))
                deduper.sample(temp_d, sample_size)
                del(temp_d)

            if os.path.exists(args.training):
                print('Loading training file from {}'.format(args.training))
                with open(args.training) as tf:
                    deduper.readTraining(tf)

            print('Starting active learning')
            dedupe.convenience.consoleLabel(deduper)

            print('Starting training')
            deduper.train(ppc=0.001, uncovered_dupes=5)

            print('Saving new training file to {}'.format(args.training))
            with open(args.training, 'w') as training_file:
                deduper.writeTraining(training_file)

            deduper.cleanupTraining()

            print('Creating blocking_map table')
            c.execute("""
                DROP TABLE IF EXISTS blocking_map
                """)
            c.execute("""
                CREATE TABLE blocking_map
                (block_key VARCHAR(200), %s INTEGER)
                """ % KEY_FIELD)

            for field in deduper.blocker.index_fields:
                print('Selecting distinct values for "{}"'.format(field))
                c_index = con.cursor('index')
                c_index.execute("""
                    SELECT DISTINCT %s FROM %s
                    """ % (field, SOURCE_TABLE))
                field_data = (row[field] for row in c_index)
                deduper.blocker.index(field_data, field)
                c_index.close()

            print('Generating blocking map')
            c_block = con.cursor('block')
            c_block.execute("""
                SELECT * FROM %s
                """ % SOURCE_TABLE)
            full_data = ((row[KEY_FIELD], row) for row in c_block)
            b_data = deduper.blocker(full_data)

            print('Inserting blocks into blocking_map')
            # open the temp file in text mode so csv.writer works under Python 3
            csv_file = tempfile.NamedTemporaryFile(mode='w', prefix='blocks_',
                                                   delete=False)
            csv_writer = csv.writer(csv_file)
            csv_writer.writerows(b_data)
            csv_file.close()
            f = open(csv_file.name, 'r')
            c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
            f.close()
            os.remove(csv_file.name)
            con.commit()

            print('Indexing blocks')
            c.execute("""
                CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
                """)
            c.execute("DROP TABLE IF EXISTS plural_key")
            c.execute("DROP TABLE IF EXISTS plural_block")
            c.execute("DROP TABLE IF EXISTS covered_blocks")
            c.execute("DROP TABLE IF EXISTS smaller_coverage")

            print('Calculating plural_key')
            c.execute("""
                CREATE TABLE plural_key
                (block_key VARCHAR(200),
                 block_id SERIAL PRIMARY KEY)
                """)
            c.execute("""
                INSERT INTO plural_key (block_key)
                SELECT block_key FROM blocking_map
                GROUP BY block_key HAVING COUNT(*) > 1
                """)

            print('Indexing block_key')
            c.execute("""
                CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
                """)

            print('Calculating plural_block')
            c.execute("""
                CREATE TABLE plural_block
                AS (SELECT block_id, %s
                    FROM blocking_map INNER JOIN plural_key
                    USING (block_key))
                """ % KEY_FIELD)

            print('Adding {} index'.format(KEY_FIELD))
            c.execute("""
                CREATE INDEX plural_block_%s_idx
                ON plural_block (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            c.execute("""
                CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
                ON plural_block (block_id, %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Creating covered_blocks')
            c.execute("""
                CREATE TABLE covered_blocks AS
                (SELECT %s,
                        string_agg(CAST(block_id AS TEXT), ','
                                   ORDER BY block_id) AS sorted_ids
                 FROM plural_block
                 GROUP BY %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Indexing covered_blocks')
            c.execute("""
                CREATE UNIQUE INDEX covered_blocks_%s_idx
                ON covered_blocks (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            print('Committing')

            print('Creating smaller_coverage')
            c.execute("""
                CREATE TABLE smaller_coverage AS
                (SELECT %s, block_id,
                        TRIM(',' FROM split_part(sorted_ids,
                                                 CAST(block_id AS TEXT), 1))
                        AS smaller_ids
                 FROM plural_block
                 INNER JOIN covered_blocks
                 USING (%s))
                """ % (KEY_FIELD, KEY_FIELD))
            con.commit()

            print('Clustering...')
            c_cluster = con.cursor('cluster')
            c_cluster.execute("""
                SELECT *
                FROM smaller_coverage
                INNER JOIN %s
                USING (%s)
                ORDER BY (block_id)
                """ % (SOURCE_TABLE, KEY_FIELD))
            clustered_dupes = deduper.matchBlocks(
                candidates_gen(c_cluster), threshold=0.5)

            print('Creating entity_map table')
            c.execute("DROP TABLE IF EXISTS entity_map")
            c.execute("""
                CREATE TABLE entity_map (
                    %s INTEGER,
                    canon_id INTEGER,
                    cluster_score FLOAT,
                    PRIMARY KEY(%s)
                )""" % (KEY_FIELD, KEY_FIELD))

            print('Inserting entities into entity_map')
            for cluster, scores in clustered_dupes:
                cluster_id = cluster[0]
                for key_field, score in zip(cluster, scores):
                    c.execute("""
                        INSERT INTO entity_map
                        (%s, canon_id, cluster_score)
                        VALUES (%s, %s, %s)
                        """ % (KEY_FIELD, key_field, cluster_id, score))

            print('Indexing head_index')
            c_cluster.close()
            c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
            con.commit()


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dbname', dest='dbname', default='whitehouse',
                        help='database name')
    parser.add_argument('-s', '--sample', default=0.10, type=float,
                        help='sample size (percentage, default 0.10)')
    parser.add_argument('-t', '--training', default='training.json',
                        help='name of training file')
    args = parser.parse_args()
    find_dupes(args)
```

## Active Learning Observations

We ran multiple experiments:

• Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
• Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) - one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed more common to us and that sounded more domestic, since those occur much more frequently in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed that there were a lot of variations in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed what seemed to be many nicknames, or that could have been references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g.
Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).

One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase, and thus on the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

```
lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished
```

Some were hard:

```
lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished
```

## Results

What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number and to have the same meeting location and the same uin (which, confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier: people who seem to spend a lot of time at the White House, like staffers and other political appointees.
During the dedupe process, we computed that there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect these kinds of figures, knowing that people visit for repeat business or social functions.

### Within-Visit Duplicates

```
lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671
```

### Across-Visit Duplicates (Frequent Fliers)

```
lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692

lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085
```

```
lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121

lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331
```

## Conclusion

In this beginner's guide to entity resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments.

Ultimately, we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration. Please return to the District Data Labs blog for upcoming posts on entity resolution and discussion of a number of other topics important to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!
### Data Exploration with Python, Part 2

This is the second post in our Data Exploration with Python series. Before reading this post, make sure to check out Data Exploration with Python, Part 1!

Mise en place (noun): In a professional kitchen, the disciplined organization and preparation of equipment and food before service begins.

When performing exploratory data analysis (EDA), it is important to not only prepare yourself (the analyst) but to prepare your data as well. As we discussed in the previous post, a small amount of preparation will often save you a significant amount of time later on. So let's review where we should be at this point and then continue our exploration process with data preparation.

In Part 1 of this series, we were introduced to the data exploration framework we will be using. As a reminder, here is what that framework looks like. We also introduced the example data set we are going to be using to illustrate the different phases and stages of the framework. Here is what that looks like.

We then familiarized ourselves with our data set by identifying the types of information and entities encoded within it. We also reviewed several data transformation and visualization methods that we will use later to explore and analyze it. Now we are at the last stage of the framework's Prep Phase, the Create stage, where our goal will be to create additional categorical fields that will make our data easier to explore and allow us to view it from new perspectives.

## Renaming Columns to be More Intuitive

Before we dive in and start creating categories, however, we have an opportunity to improve our categorization efforts by examining the columns in our data and making sure their labels intuitively convey what they represent. Just as with the other aspects of preparation, changing them now will save us from having to remember what displ or co2TailpipeGpm mean when they show up on a chart later.
In my experience, these small, detail-oriented enhancements at the beginning of your process usually compound and preserve cognitive cycles that you can later apply to extracting insights. We can use the code below to rename the columns in our vehicles data frame.

```python
vehicles.columns = ['Make','Model','Year','Engine Displacement','Cylinders',
                    'Transmission','Drivetrain','Vehicle Class','Fuel Type',
                    'Fuel Barrels/Year','City MPG','Highway MPG','Combined MPG',
                    'CO2 Emission Grams/Mile','Fuel Cost/Year']
```

## Thinking About Categorization

Now that we have changed our column names to be more intuitive, let's take a moment to think about what categorization is and examine the categories that currently exist in our data set. At the most basic level, categorization is just a way that humans structure information: how we hierarchically create order out of complexity. Categories are formed based on attributes that entities have in common, and they present us with different perspectives from which we can view and think about our data.

Our primary objective in this stage is to create additional categories that will help us further organize our data. This will prove beneficial not only for the exploratory analysis we will conduct but also for any supervised machine learning or modeling that may happen further down the data science pipeline. Seasoned data scientists know that the better your data is organized, the better the downstream analyses you will be able to perform and the more informative the features you will have to feed into your machine learning models.

In this stage of the framework, we are going to create additional categories in 3 distinct ways:

• Category Aggregations
• Binning Continuous Variables
• Clustering

Now that we have a better idea of what we are doing and why, let's get started.
### Aggregating to Higher-Level Categories

The first way we are going to create additional categories is by identifying opportunities to create higher-level categories out of the variables we already have in our data set. In order to do this, we need to get a sense of what categories currently exist in the data. We can do this by iterating through our columns and printing out the name, the number of unique values, and the data type for each.

```python
def unique_col_values(df):
    for column in df:
        print("{} | {} | {}".format(
            df[column].name, len(df[column].unique()), df[column].dtype
        ))

unique_col_values(vehicles)
```

```
Make | 126 | object
Model | 3491 | object
Year | 33 | int64
Engine Displacement | 65 | float64
Cylinders | 9 | float64
Transmission | 43 | object
Drivetrain | 7 | object
Vehicle Class | 34 | object
Fuel Type | 13 | object
Fuel Barrels/Year | 116 | float64
City MPG | 48 | int64
Highway MPG | 49 | int64
Combined MPG | 45 | int64
CO2 Emission Grams/Mile | 550 | float64
Fuel Cost/Year | 58 | int64
```

From looking at the output, it is clear that we have some numeric columns (int64 and float64) and some categorical columns (object). For now, let's focus on the six categorical columns in our data set.

• Make: 126 unique values
• Model: 3,491 unique values
• Transmission: 43 unique values
• Drivetrain: 7 unique values
• Vehicle Class: 34 unique values
• Fuel Type: 13 unique values

When aggregating and summarizing data, having too many categories can be problematic. The average human is said to be able to hold about 7 objects at a time in short-term working memory. Accordingly, I have noticed that once you exceed 8-10 discrete values in a category, it becomes increasingly difficult to get a holistic picture of how the entire data set is divided up.

What we want to do is examine the values in each of our categorical variables to determine where opportunities exist to aggregate them into higher-level categories.
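A quick way to examine those values is pandas' value_counts, which shows each discrete value and how often it occurs. The sketch below uses a tiny hypothetical stand-in for the vehicles data frame, since the real one is loaded earlier in the series:

```python
import pandas as pd

# Small hypothetical stand-in for the real vehicles data frame.
vehicles = pd.DataFrame({
    'Transmission': ['Automatic 4-spd', 'Manual 5-spd', 'Automatic 4-spd',
                     'Automatic (S6)', 'Manual 6-spd'],
    'Drivetrain': ['2-Wheel Drive', 'Front-Wheel Drive', 'Rear-Wheel Drive',
                   'Front-Wheel Drive', '4-Wheel Drive'],
})

# Frequency of each discrete value, most common first.
for column in ['Transmission', 'Drivetrain']:
    print(vehicles[column].value_counts(), '\n')
```

Scanning these frequency tables is usually enough to spot shared prefixes or substrings (like the leading 'A' and 'M' in Transmission) that suggest a natural higher-level grouping.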
The way this is typically done is by using a combination of clues from the current categories and any domain knowledge you may have (or be able to acquire). For example, imagine aggregating by Transmission, which has 43 discrete values in our data set. It is going to be difficult to derive insights because any aggregated metrics will be distributed across more categories than you can hold in short-term memory. However, if we examine the different transmission categories with the goal of finding common features that we can group on, we find that all 43 values fall into one of two transmission types, Automatic or Manual.

Let's create a new Transmission Type column in our data frame and, with the help of the loc method in pandas, assign it a value of Automatic where the first character of Transmission is the letter A and a value of Manual where the first character is the letter M.

```python
AUTOMATIC = "Automatic"
MANUAL = "Manual"

vehicles.loc[vehicles['Transmission'].str.startswith('A'),
             'Transmission Type'] = AUTOMATIC
vehicles.loc[vehicles['Transmission'].str.startswith('M'),
             'Transmission Type'] = MANUAL
```

We can apply the same logic to the Vehicle Class field. We originally have 34 vehicle classes, but we can distill those down into 8 vehicle categories, which are much easier to remember.
```python
small = ['Compact Cars','Subcompact Cars','Two Seaters','Minicompact Cars']
midsize = ['Midsize Cars']
large = ['Large Cars']

vehicles.loc[vehicles['Vehicle Class'].isin(small),
             'Vehicle Category'] = 'Small Cars'
vehicles.loc[vehicles['Vehicle Class'].isin(midsize),
             'Vehicle Category'] = 'Midsize Cars'
vehicles.loc[vehicles['Vehicle Class'].isin(large),
             'Vehicle Category'] = 'Large Cars'
vehicles.loc[vehicles['Vehicle Class'].str.contains('Station'),
             'Vehicle Category'] = 'Station Wagons'
vehicles.loc[vehicles['Vehicle Class'].str.contains('Truck'),
             'Vehicle Category'] = 'Pickup Trucks'
vehicles.loc[vehicles['Vehicle Class'].str.contains('Special Purpose'),
             'Vehicle Category'] = 'Special Purpose'
vehicles.loc[vehicles['Vehicle Class'].str.contains('Sport Utility'),
             'Vehicle Category'] = 'Sport Utility'
vehicles.loc[vehicles['Vehicle Class'].str.lower().str.contains('van'),
             'Vehicle Category'] = 'Vans & Minivans'
```

Next, let's look at the Make and Model fields, which have 126 and 3,491 unique values respectively. While I can't think of a way to get either of those down to 8-10 categories, we can create another potentially informative field by concatenating Make and the first word of the Model field into a new Model Type field. This would allow us to, for example, categorize all Chevrolet Suburban C1500 2WD vehicles and all Chevrolet Suburban K1500 4WD vehicles as simply Chevrolet Suburbans.

```python
vehicles['Model Type'] = (vehicles['Make'] + " " +
                          vehicles['Model'].str.split().str.get(0))
```

Finally, let's look at the Fuel Type field, which has 13 unique values. On the surface, that doesn't seem too bad, but upon further inspection, you'll notice some complexity embedded in the categories that could probably be organized more intuitively.
```python
vehicles['Fuel Type'].unique()
```

```
array(['Regular', 'Premium', 'Diesel', 'Premium and Electricity',
       'Premium or E85', 'Premium Gas or Electricity', 'Gasoline or E85',
       'Gasoline or natural gas', 'CNG', 'Regular Gas or Electricity',
       'Midgrade', 'Regular Gas and Electricity', 'Gasoline or propane'],
      dtype=object)
```

This is interesting and a little tricky because some categories contain a single fuel type and others contain multiple fuel types. In order to organize this better, we will create two sets of categories from these fuel types. The first will be a set of columns that can represent the different combinations while still preserving the individual fuel types.

```python
vehicles['Gas'] = 0
vehicles['Ethanol'] = 0
vehicles['Electric'] = 0
vehicles['Propane'] = 0
vehicles['Natural Gas'] = 0

vehicles.loc[vehicles['Fuel Type'].str.contains(
    'Regular|Gasoline|Midgrade|Premium|Diesel'), 'Gas'] = 1
vehicles.loc[vehicles['Fuel Type'].str.contains('E85'), 'Ethanol'] = 1
vehicles.loc[vehicles['Fuel Type'].str.contains('Electricity'), 'Electric'] = 1
vehicles.loc[vehicles['Fuel Type'].str.contains('propane'), 'Propane'] = 1
vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'), 'Natural Gas'] = 1
```

As it turns out, 99% of the vehicles in our database have gas as a fuel type, either by itself or combined with another fuel type. Since that is the case, let's create a second set of categories: specifically, a new Gas Type field that extracts the type of gas (Regular, Midgrade, Premium, Diesel, or Natural) each vehicle accepts.
```python
vehicles.loc[vehicles['Fuel Type'].str.contains(
    'Regular|Gasoline'), 'Gas Type'] = 'Regular'
vehicles.loc[vehicles['Fuel Type'] == 'Midgrade',
             'Gas Type'] = 'Midgrade'
vehicles.loc[vehicles['Fuel Type'].str.contains('Premium'),
             'Gas Type'] = 'Premium'
vehicles.loc[vehicles['Fuel Type'] == 'Diesel',
             'Gas Type'] = 'Diesel'
vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),
             'Gas Type'] = 'Natural'
```

An important thing to note about what we have done with all of the categorical fields in this section is that, while we created new categories, we did not overwrite the original ones. We created additional fields that will allow us to view the information contained within the data set at different (often higher) levels. If you need to drill down to the more granular original categories, you can always do that. However, now we have a choice, whereas before we performed these category aggregations, we did not.

### Creating Categories from Continuous Variables

The next way we can create additional categories in our data is by binning some of our continuous variables: breaking them up into different categories based on a threshold or distribution. There are multiple ways you can do this, but I like to use quintiles because it gives me one middle category, two categories outside of that which are moderately higher and lower, and then two extreme categories at the ends. I find that this is a very intuitive way to break things up and provides some consistency across categories.

In our data set, I've identified 4 fields that we can bin this way. Binning essentially looks at how the data is distributed, creates the necessary number of bins by splitting up the range of values (either equally or based on explicit boundaries), and then categorizes records into the appropriate bin based on where their continuous value falls. Pandas has a qcut method that makes binning extremely easy, so let's use that to create our quintiles for each of the continuous variables we identified.
```python
efficiency_categories = ['Very Low Efficiency', 'Low Efficiency', 'Moderate Efficiency',
                         'High Efficiency', 'Very High Efficiency']
vehicles['Fuel Efficiency'] = pd.qcut(vehicles['Combined MPG'], 5,
                                      labels=efficiency_categories)

engine_categories = ['Very Small Engine', 'Small Engine', 'Moderate Engine',
                     'Large Engine', 'Very Large Engine']
vehicles['Engine Size'] = pd.qcut(vehicles['Engine Displacement'], 5,
                                  labels=engine_categories)

emission_categories = ['Very Low Emissions', 'Low Emissions', 'Moderate Emissions',
                       'High Emissions', 'Very High Emissions']
vehicles['Emissions'] = pd.qcut(vehicles['CO2 Emission Grams/Mile'], 5,
                                labels=emission_categories)

fuelcost_categories = ['Very Low Fuel Cost', 'Low Fuel Cost', 'Moderate Fuel Cost',
                       'High Fuel Cost', 'Very High Fuel Cost']
vehicles['Fuel Cost'] = pd.qcut(vehicles['Fuel Cost/Year'], 5,
                                labels=fuelcost_categories)
```

### Clustering to Create Additional Categories

The final way we are going to prepare our data is by clustering to create additional categories. There are a few reasons why I like to use clustering for this. First, it takes multiple fields into consideration at the same time, whereas the other categorization methods only consider one field at a time. This allows you to group together entities that are similar across a variety of attributes but might not be close enough in any individual attribute to get grouped together. Clustering also creates new categories for you automatically, which takes much less time than combing through the data yourself to identify patterns across attributes that you can form categories on. It will automatically group similar items together for you. The third reason I like to use clustering is that it will sometimes group things in ways that you, as a human, may not have thought of. I'm a big fan of humans and machines working together to optimize analytical processes, and this is a good example of the value that machines bring to the table.
I'll write more about my thoughts on that in future posts, but for now, let's move on to clustering our data. The first thing we are going to do is isolate the columns we want to use for clustering. These are going to be columns with numeric values, as the clustering algorithm will need to compute distances in order to group similar vehicles together.

```python
cluster_columns = ['Engine Displacement', 'Cylinders', 'Fuel Barrels/Year',
                   'City MPG', 'Highway MPG', 'Combined MPG',
                   'CO2 Emission Grams/Mile', 'Fuel Cost/Year']
```

Next, we want to scale the features we are going to cluster on. There are a variety of ways to normalize and scale variables, but I'm going to keep things relatively simple and just use Scikit-Learn's MaxAbsScaler, which divides each value by the maximum absolute value for that feature. This preserves the distributions in the data and converts the values in each field to a number between 0 and 1 (technically -1 and 1, but we don't have any negatives).

```python
from sklearn import preprocessing

scaler = preprocessing.MaxAbsScaler()
vehicle_clusters = scaler.fit_transform(vehicles[cluster_columns])
vehicle_clusters = pd.DataFrame(vehicle_clusters, columns=cluster_columns)
```

Now that our features are scaled, let's write a couple of functions. The first is a kmeans_cluster function that will k-means cluster a given data frame into a specified number of clusters and then return a copy of the original data frame with those clusters appended in a column named Cluster.

```python
from sklearn.cluster import KMeans

def kmeans_cluster(df, n_clusters=2):
    model = KMeans(n_clusters=n_clusters, random_state=1)
    clusters = model.fit_predict(df)
    cluster_results = df.copy()
    cluster_results['Cluster'] = clusters
    return cluster_results
```

Our second function, called summarize_clustering, is going to count the number of vehicles that fall into each cluster and calculate the cluster means for each feature.
It will merge the counts and means into a single data frame and then return that summary to us.

```python
def summarize_clustering(results):
    cluster_size = results.groupby(['Cluster']).size().reset_index()
    cluster_size.columns = ['Cluster', 'Count']
    cluster_means = results.groupby(['Cluster'], as_index=False).mean()
    cluster_summary = pd.merge(cluster_size, cluster_means, on='Cluster')
    return cluster_summary
```

We now have functions for what we need to do, so the next step is to actually cluster our data. But wait, our kmeans_cluster function is supposed to accept a number of clusters. How do we determine how many clusters we want? There are a number of approaches for figuring this out, but for the sake of simplicity, we are just going to plug in a couple of numbers and visualize the results to arrive at a good enough estimate. Remember earlier in this post where we were trying to aggregate our categorical variables to fewer than 8-10 discrete values? We are going to apply the same logic here. Let's start out with 8 clusters and see what kind of results we get.

```python
cluster_results = kmeans_cluster(vehicle_clusters, 8)
cluster_summary = summarize_clustering(cluster_results)
```

After running the couple of lines of code above, your cluster_summary should look similar to the following. By looking at the Count column, you can tell that some clusters have significantly more records in them (e.g. Cluster 7) and others have significantly fewer (e.g. Cluster 3). Other than that, though, it is difficult to notice anything informative about the summary. I don't know about you, but to me, the rest of the summary just looks like a bunch of decimals in a table. This is a prime opportunity to use a visualization to discover insights faster. With just a couple of import statements and a single line of code, we can light this summary up in a heatmap so that we can see the contrast between all those decimals and between the different clusters.
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)
```

In this heatmap, the rows represent the features and the columns represent the clusters, so we can compare how similar or different the clusters look relative to each other. Our goal for clustering these features is to ultimately create meaningful categories out of the clusters, so we want to get to the point where we can clearly distinguish one from the others. This heatmap allows us to do that quickly and visually. With this goal in mind, it is apparent that we probably have too many clusters because:

• Clusters 3, 4, and 7 look pretty similar
• Clusters 2 and 5 look similar as well
• Clusters 0 and 6 are also a little close for comfort

From the way our heatmap currently looks, I'm willing to bet that we can cut the number of clusters in half and get clearer boundaries. Let's re-run the clustering, summary, and heatmap code for 4 clusters and see what kind of results we get.

```python
cluster_results = kmeans_cluster(vehicle_clusters, 4)
cluster_summary = summarize_clustering(cluster_results)
sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)
```

These clusters look more distinct, don't they? Clusters 1 and 3 look like polar opposites of each other, Cluster 0 looks pretty well balanced across all the features, and Cluster 2 looks about halfway between Cluster 0 and Cluster 1. We now have a good number of clusters, but we still have a problem: it is difficult to remember what Clusters 0, 1, 2, and 3 mean. So, as a next step, I like to assign descriptive names to the clusters based on their properties. In order to do this, we need to look at the levels of each feature for each cluster and come up with intuitive natural language descriptions for them.
You can have some fun and get as creative as you want here, but keep in mind that the objective is for you to be able to remember the characteristics of whatever label you assign to each cluster.

• Cluster 1 vehicles seem to have large engines that consume a lot of fuel, process it inefficiently, produce a lot of emissions, and cost a lot to fill up. I'm going to label them Large Inefficient.
• Cluster 3 vehicles have small, fuel-efficient engines that don't produce a lot of emissions and are relatively inexpensive to fill up. I'm going to label them Small Very Efficient.
• Cluster 0 vehicles are fairly balanced across every category, so I'm going to label them Midsized Balanced.
• Cluster 2 vehicles have large engines but are more moderately efficient than the vehicles in Cluster 1, so I'm going to label them Large Moderately Efficient.

Now that we have come up with these descriptive names for our clusters, let's add a Cluster Name column to our cluster_results data frame, and then copy the cluster names over to our original vehicles data frame.

```python
cluster_names = {0: 'Midsized Balanced',
                 1: 'Large Inefficient',
                 2: 'Large Moderately Efficient',
                 3: 'Small Very Efficient'}
cluster_results['Cluster Name'] = cluster_results['Cluster'].map(cluster_names)

vehicles = vehicles.reset_index(drop=True)
vehicles['Cluster Name'] = cluster_results['Cluster Name']
```

## Conclusion

In this post, we examined several ways to prepare a data set for exploratory analysis. First, we looked at the categorical variables we had and attempted to find opportunities to roll them up into higher-level categories. After that, we converted some of our continuous variables into categorical ones by binning them into quintiles based on how relatively high or low their values were.
Finally, we used clustering to efficiently create categories that automatically take multiple fields into consideration. The result of all this preparation is that we now have several columns containing meaningful categories that will provide different perspectives of our data and allow us to acquire as many insights as possible. Now that we have these meaningful categories, our data set is in really good shape, which means that we can move on to the next phase of our data exploration framework. In the next post, we will cover the first two stages of the Explore Phase and demonstrate various ways to visually aggregate, pivot, and identify relationships between fields in our data. Make sure to subscribe to the DDL blog so that you get notified when we publish it!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

### Data Exploration with Python, Part 3

This is the third post in our Data Exploration with Python series. Before reading this post, make sure to check out Part 1 and Part 2! Preparing yourself and your data like we have done thus far in this series is essential to analyzing your data well. However, the most exciting part of Exploratory Data Analysis (EDA) is actually getting in there, exploring the data, and discovering insights. That's exactly what we are going to start doing in this post. We will begin with the cleaned and prepped vehicle fuel economy data set that we ended up with at the end of the last post. This version of the data set contains:

• The higher-level categories we created via category aggregations.
• The quintiles we created by binning our continuous variables.
• The clusters we generated via k-means clustering based on numeric variables.
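As an aside, the number of k-means clusters in the last post was chosen by eyeballing heatmap summaries. A common alternative is the elbow heuristic, sketched below on synthetic data (my own illustration, not part of the original series): fit k-means for a range of k values and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled vehicle features
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Inertia for each candidate cluster count; the "elbow" is where the
# curve flattens out
inertias = {k: KMeans(n_clusters=k, random_state=1, n_init=10).fit(X).inertia_
            for k in range(2, 9)}

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting k against inertia makes the elbow easier to spot; with four well-separated blobs, the drop typically flattens around k = 4.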
Now, without further ado, let's embark on our insight-finding mission!

## Making Our Data Smaller: Filter + Aggregate

One of the fundamental ways to extract insights from a data set is to reduce the size of the data so that you can look at just a piece of it at a time. There are two ways to do this: filtering and aggregating. With filtering, you are essentially removing rows, columns, or both in order to focus on a subset of the data that interests you. With aggregation, the objective is to group records that have similar categorical attributes and then perform some calculation (count, sum, mean, etc.) on one or more numerical fields so that you can observe and identify differences between records that fall into each group. To begin filtering and aggregating our data set, we could write a function like the one below to aggregate based on a group_field that we provide, counting the number of rows in each group. To make things more intuitive and easier to interpret, we will also sort the data from most frequent to least and format it in a pandas data frame with appropriate column names.

```python
def agg_count(df, group_field):
    grouped = df.groupby(group_field).size()
    grouped = grouped.sort_values(ascending=False)
    grouped = pd.DataFrame(grouped).reset_index()
    grouped.columns = [group_field, 'Count']
    return grouped
```

Now that we have this function in our toolkit, let's use it. Suppose we were looking at the Vehicle Category field in our data set and were curious about the number of vehicles in each category that were manufactured last year (2016). Here is how we would filter the data and use the agg_count function to transform it to show what we wanted to know.

```python
vehicles_2016 = vehicles[vehicles['Year'] == 2016]
category_counts = agg_count(vehicles_2016, 'Vehicle Category')
```

This gives us what we want in tabular form, but we could take it a step further and visualize it with a horizontal bar chart.
```python
ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Category (2016) \n')
```

Now that we know how to do this, we can filter, aggregate, and plot just about anything in our data set with just a few lines of code. For example, here is the same metric but filtered for a different year (1985).

```python
vehicles_1985 = vehicles[vehicles['Year'] == 1985]
category_counts = agg_count(vehicles_1985, 'Vehicle Category')

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Category (1985) \n')
```

If we wanted to stick with the year 2016 but drill down to the more granular Vehicle Class, we could do that as well.

```python
class_counts = agg_count(vehicles_2016, 'Vehicle Class')

ax = sns.barplot(data=class_counts, x='Count', y='Vehicle Class')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Class (2016) \n')
```

We could also look at vehicle counts by manufacturer.

```python
make_counts = agg_count(vehicles_2016, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Make (2016) \n')
```

What if we wanted to filter by something other than the year? We could do that by simply creating a different filtered data frame and passing that to our agg_count function. Below, instead of filtering by Year, I've filtered on the Fuel Efficiency field, which contains the fuel efficiency quintiles we generated in the last post. Let's choose the Very High Efficiency value so that we can see how many very efficient vehicles each manufacturer has made.
```python
very_efficient = vehicles[vehicles['Fuel Efficiency'] == 'Very High Efficiency']
make_counts = agg_count(very_efficient, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Very Fuel Efficient Vehicles by Make \n')
```

What if we wanted to perform some other calculation, such as averaging, instead of counting the number of records that fall into each group? We can just create a new function called agg_avg that calculates the mean of a designated numerical field.

```python
def agg_avg(df, group_field, calc_field):
    grouped = df.groupby(group_field, as_index=False)[calc_field].mean()
    grouped = grouped.sort_values(calc_field, ascending=False)
    grouped.columns = [group_field, 'Avg ' + str(calc_field)]
    return grouped
```

We can then simply swap out the agg_count function with our new agg_avg function and indicate what field we would like to use for our calculation. Below is an example showing the average fuel efficiency, represented by the Combined MPG field, by vehicle category.

```python
category_avg_mpg = agg_avg(vehicles_2016, 'Vehicle Category', 'Combined MPG')

ax = sns.barplot(data=category_avg_mpg, x='Avg Combined MPG', y='Vehicle Category')
ax.set(xlabel='\n Average Combined MPG')
plt.title('Average Combined MPG by Category (2016) \n')
```

## Pivoting the Data for More Detail

Up until this point, we've been looking at our data at a pretty high level, aggregating by a single variable. Sure, we were able to drill down from Vehicle Category to Vehicle Class to get a more granular view, but we only looked at the data one hierarchical level at a time. Next, we're going to go into further detail by taking a look at two or three variables at a time. The way we are going to do this is via pivot tables and their visual equivalents, pivot heatmaps.
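Before building that out on the real data, here is a toy sketch of what a pivot table computes (a hypothetical mini data set, not the vehicles data): two categorical fields define the rows and columns, and each cell aggregates the records that fall into that combination.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Category': ['Small Car', 'Small Car', 'SUV', 'SUV', 'SUV'],
    'Gas Type': ['Regular', 'Premium', 'Regular', 'Regular', 'Premium'],
    'MPG': [30, 28, 18, 20, 17],
})

# Count the records in each Category x Gas Type combination
pivot = df.pivot_table(values='MPG', index='Category',
                       columns='Gas Type', aggfunc=np.size)
print(pivot)
```

Swapping np.size for np.mean (or any other aggregation function) would average the MPG values in each cell instead of counting them.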
First, we will create a pivot_count function, similar to the agg_count function we created earlier, that will transform whatever data frame we feed it into a pivot table with the rows, columns, and calculated field we specify.

```python
import numpy as np

def pivot_count(df, rows, columns, calc_field):
    df_pivot = df.pivot_table(values=calc_field,
                              index=rows,
                              columns=columns,
                              aggfunc=np.size
                              ).dropna(axis=0, how='all')
    return df_pivot
```

We will then use this function on our vehicles_2016 data frame, with the Fuel Efficiency quintiles we created in the last post representing the rows, the Engine Size quintiles representing the columns, and a count of the vehicles that have a Combined MPG value in each cell.

```python
effic_size_pivot = pivot_count(vehicles_2016, 'Fuel Efficiency',
                               'Engine Size', 'Combined MPG')
```

This is OK, but it would be faster to analyze visually. Let's create a heatmap that will color the magnitude of the counts and present us with a more intuitive view.

```python
fig, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
plt.title('Fuel Efficiency vs. Engine Size (2016) \n')
```

Just like we did earlier with our horizontal bar charts, we can easily filter by a different year and get a different perspective. For example, here's what this heatmap looks like for 1985.

```python
effic_size_pivot = pivot_count(vehicles_1985, 'Fuel Efficiency',
                               'Engine Size', 'Combined MPG')

fig, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
plt.title('Fuel Efficiency vs. Engine Size (1985) \n')
```

With these pivot heatmaps, we are not limited to just two variables. We can pass a list of variables for either axis (rows or columns), and it will display all the different combinations of values for those variables.
```python
effic_size_category = pivot_count(vehicles_2016,
                                  ['Engine Size', 'Fuel Efficiency'],
                                  'Vehicle Category', 'Combined MPG')

fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(effic_size_category, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
plt.title('Fuel Efficiency + Engine Size vs. Vehicle Category (2016) \n')
```

In this heatmap, we have Engine Size and Fuel Efficiency combinations represented by the rows, and we've added a third variable (Vehicle Category) across the columns. So now we can see a finer level of detail about what types of cars had what size engines and what level of fuel efficiency last year. As a final example for this section, let's create a pivot heatmap that plots Make against Vehicle Category for 2016. We saw earlier, in the bar chart that counted vehicles by manufacturer, that BMW made the largest number of specific models last year. This pivot heatmap will let us see how those counts are distributed across vehicle categories, giving us a better sense of each auto company's current offerings in terms of the breadth vs. depth of vehicle types they make.

```python
effic_size_pivot = pivot_count(vehicles_2016, 'Make',
                               'Vehicle Category', 'Combined MPG')

fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
plt.title('Make vs. Vehicle Category (2016) \n')
```

## Visualizing Changes Over Time

So far in this post, we've been looking at the data at given points in time. The next step is to take a look at how the data has changed over time. We can do this relatively easily by creating a multi_line function that accepts a data frame and x/y fields and then plots them on a multi-line chart.

```python
def multi_line(df, x, y):
    ax = df.groupby([x, y]).size().unstack(y).plot(figsize=(15, 8), cmap='Set2')
    return ax
```

Let's use this function to visualize our vehicle categories over time. The resulting chart shows the number of vehicles in each category that were manufactured each year.
```python
multi_line(vehicles, 'Year', 'Vehicle Category')
plt.xlabel('\n Year')
plt.title('Vehicle Categories Over Time \n')
```

We can see from the chart that Small Cars have generally dominated across the board, with a small decline in the late 90s that started to pick back up in the early 2000s. We can also see the introduction and rise in popularity of SUVs starting in the late 90s, and the declining popularity of trucks in recent years. If we wanted to, we could zoom in and filter for specific manufacturers to see how their offerings have changed over the years. Since BMW had the largest number of vehicles last year, and we saw in the pivot heatmap that those were mostly small cars, let's filter for just their vehicles to see whether they have always made a lot of small cars or if this is a more recent phenomenon.

```python
bmw = vehicles[vehicles['Make'] == 'BMW']

multi_line(bmw, 'Year', 'Vehicle Category')
plt.xlabel('\n Year')
plt.title('BMW Vehicle Categories Over Time \n')
```

We can see in the chart above that they started off making a reasonable number of small cars and then seemed to ramp up production of those types of vehicles in the late 90s. We can contrast this with a company like Toyota, which started out making a lot of small cars back in the 1980s and then seemingly made a decision to gradually manufacture fewer of them over the years, focusing instead on SUVs, pickup trucks, and midsize cars.

```python
toyota = vehicles[vehicles['Make'] == 'Toyota']

multi_line(toyota, 'Year', 'Vehicle Category')
plt.xlabel('\n Year')
plt.title('Toyota Vehicle Categories Over Time \n')
```

## Examining Relationships Between Variables

The final way we are going to explore our data in this post is by examining the relationships between numerical variables in our data.
Doing this will provide us with better insight into which fields are highly correlated, what the nature of those correlations is, what typical combinations of numerical values exist in our data, and which combinations are anomalies. When looking at relationships between variables, I often like to start with a scatter matrix because it gives me a bird's-eye view of the relationships between all the numerical fields in my data set. With just a couple lines of code, we can not only create a scatter matrix, but also factor in a layer of color that can represent, for example, the clusters we generated at the end of the last post.

```python
select_columns = ['Engine Displacement', 'Cylinders', 'Fuel Barrels/Year',
                  'City MPG', 'Highway MPG', 'Combined MPG',
                  'CO2 Emission Grams/Mile', 'Fuel Cost/Year', 'Cluster Name']

sns.pairplot(vehicles[select_columns], hue='Cluster Name', height=3)
```

From here, we can see that there are some strong positive linear relationships in our data, such as the correlations between the MPG fields, and also among the fuel cost, barrels, and CO2 emissions fields. There are some hyperbolic relationships in there as well, particularly between the MPG fields and engine displacement, fuel cost, barrels, and emissions. Additionally, we can get a sense of the size of our clusters, how they are distributed, and the level of overlap between them. Once we have this high-level overview, we can zoom in on anything that looks interesting. For example, let's take a closer look at Engine Displacement plotted against Combined MPG.

```python
sns.lmplot(x='Engine Displacement', y='Combined MPG', data=vehicles,
           hue='Cluster Name', height=8, fit_reg=False)
```

In addition to being able to see the hyperbolic correlation between these two variables, we can see that our Small Very Efficient cluster resides in the upper left, followed by our Midsized Balanced cluster, which looks smaller and more compact than the others.
After that, we have our Large Moderately Efficient cluster and, finally, our Large Inefficient cluster on the bottom right. We can also see that there are a few red points at the very top left and a few purple points at the very bottom right that we may want to investigate further to get a sense of what types of vehicles we are likely to see at the extremes. Try identifying some of those on your own by filtering the data set like we did earlier in the post. While you're at it, try creating additional scatter plots that zoom in on other numerical field combinations from the scatter matrix above. There are a bunch of other insights to be found in this data set, and all it takes is a little exploration!

## Conclusion

We have covered quite a bit in this post, and I hope I've provided you with some good examples of how, with just a few tools in your arsenal, you can embark on a robust insight-finding expedition and discover truly interesting things about your data. Now that you have some structure in your process and some tools for exploring data, you can let your creativity run wild a little and come up with the filter, aggregate, pivot, and scatter combinations that are most interesting to you. Feel free to experiment and post any interesting insights you're able to find in the comments! Also, make sure to stay tuned, because in the next (and final) post of this series, I'm going to cover how to identify and think about the different networks that are present in your data and how to explore them using graph analytics. Click the subscribe button below and enter your email so that you receive a notification as soon as it's published!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!
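As a quick appendix to the relationship analysis above: the patterns eyeballed in a scatter matrix can also be quantified with a Pearson correlation matrix. The sketch below uses toy data standing in for the real columns (my own illustration with a made-up cost formula, not the actual vehicles data) to show how a strong inverse relationship and an unrelated column show up in df.corr().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
mpg = rng.uniform(10, 50, 200)

toy = pd.DataFrame({
    'Combined MPG': mpg,
    'Fuel Cost/Year': 30000 / mpg,                  # inversely related by construction
    'Engine Displacement': rng.uniform(1, 8, 200),  # independent of MPG here
})

corr = toy.corr()  # Pearson correlations between all numeric columns
print(corr.round(2))
```

Values near -1 or +1 flag the strong relationships worth zooming in on with a scatter plot; values near 0 suggest the fields vary independently.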
### Inflated counts for cleared rape cases

Newsy, Reveal and ProPublica look into rape cases in the U.S. and law enforcement's use of exceptional clearance. The designation allows police to clear cases when they have enough evidence to make an arrest and know who and where the suspect is, but can't make an arrest for reasons outside their control. Experts say it's supposed to be used sparingly. Data culled from various police departments shows the designation is used more often than one would expect.

### Where Camp fire destroyed homes

The Camp fire death toll rose to 63, with 631 people missing, as of yesterday. The Los Angeles Times provides some graphics showing the scale of the fire and the buildings that burned. Ugh. I live a few hundred miles away and the smoke is bad enough that my son's school is closed today. It has not been a good year for California in terms of wildfires.

### Document worth reading: "Saliency Prediction in the Deep Learning Era: An Empirical Investigation"

Visual saliency models have enjoyed a big leap in performance in recent years, thanks to advances in deep learning and large-scale annotated data. Despite enormous effort and huge breakthroughs, however, models still fall short of human-level accuracy. In this work, I explore the landscape of the field with an emphasis on new deep saliency models, benchmarks, and datasets. A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large-scale video datasets. Further, I identify factors that contribute to the gap between models and humans and discuss the remaining issues that need to be addressed to build the next generation of more powerful saliency models.
Some specific questions that are addressed include: in what ways current models fail, how to remedy them, what can be learned from cognitive studies of attention, how explicit saliency judgments relate to fixations, how to conduct fair model comparisons, and what the emerging applications of saliency models are. Saliency Prediction in the Deep Learning Era: An Empirical Investigation

### Hey, check this out: Columbia's Data Science Institute is hiring research scientists and postdocs!

Here's the official announcement:

The Institute's Postdoctoral and Research Scientists will help anchor Columbia's presence as a leader in data-science research and applications and serve as resident experts in fostering collaborations with the world-class faculty across all schools at Columbia University. They will also help guide, plan and execute data-science research, applications and technological innovations that address societal challenges and related University-wide initiatives.

Postdoc Fellows Requirements: PhD degree

Research Scientist (Open Rank) Requirements: PhD degree + (see position description for more information). A Research Scientist who will conduct independent cutting-edge research in the foundations or application of data science or related fields, or be involved in interdisciplinary research through a collaboration between the Data Science Institute and the various schools at the University. APPLY NOW

Research Scientist (Open Rank) Requirements: PhD degree + (see position description for more information). A Research Scientist who will serve as one of Columbia's resident experts to foster collaborations with faculty across all the schools at Columbia University.
APPLY NOW

Candidates for all Research Scientist positions must apply using the links above, which direct to the Columbia HR portal for each position, whereas Postdoc Fellows submit materials via: DSI-Fellows@columbia.edu

I'm part of the Data Science Institute, so if you want to work with me or others here at Columbia, you should apply.

### What's new on arXiv

The attention mechanism, first derived from human intuition and later adapted to machine translation for automatic token alignment, is a simple method for encoding sequence data based on the importance score each element is assigned. It has been widely applied to, and has attained significant improvements in, various tasks in natural language processing, including sentiment classification, text summarization, question answering, dependency parsing, etc. In this paper, we survey recent works and provide an introductory summary of the attention mechanism in different NLP problems, aiming to provide our readers with basic knowledge of this widely used method, discuss its different variants for different tasks, explore its association with other techniques in machine learning, and examine methods for evaluating its performance.

Recent advances in machine learning allow us to analyze and describe the content of high-dimensional data like text, audio, images or other signals. In order to visualize that data in 2D or 3D, Dimensionality Reduction (DR) techniques are usually employed. Most of these techniques, e.g., PCA or t-SNE, produce static projections without taking into account corrections from humans or other data exploration scenarios. In this work, we propose the interactive Similarity Projection (iSP), a novel interactive DR framework based on similarity embeddings, where we form a differentiable objective based on the user interactions and perform learning using gradient descent, with an end-to-end trainable architecture. Two interaction scenarios are evaluated.
First, a common methodology in multidimensional projection is to project a subset of data, arrange it into classes or clusters, and project the rest of the (unseen) dataset based on that manipulation, in a kind of semi-supervised interpolation. We report results that outperform competitive baselines in a wide range of metrics and datasets. Second, we explore the scenario of manipulating some classes, while enriching the optimization with high-dimensional neighbor information. Apart from improving classification precision and clustering on images and text documents, the new emerging structure of the projection unveils semantic manifolds. For example, on the Head Pose dataset, by just dragging the faces looking far left to the left and those looking far right to the right, all faces are re-arranged on a continuum, even on the vertical axis (face up and down). This end-to-end framework can be used for fast, visual semi-supervised learning, manifold exploration, interactive domain adaptation of neural embeddings and transfer learning. The problem of short text matching is formulated as follows: given a pair of sentences or questions, a matching model determines whether the two inputs mean the same thing or not. Models that can automatically identify questions with the same meaning have a wide range of applications in question answering sites and modern chatbots. In this article, we describe the approach by team hahu to solve this problem in the context of the ‘CIKM AnalytiCup 2018 – Cross-lingual Short-text Matching of Question Pairs’ that is sponsored by Alibaba. Our solution is an end-to-end system based on current advances in deep learning which avoids heavy feature engineering and achieves improved performance over traditional machine-learning approaches. The log-loss scores for the first and second rounds of the contest are 0.35 and 0.39 respectively. 
The team was ranked 7th out of 1027 teams in the organizers’ overall ranking scheme, which consisted of the two contest scores as well as innovation and system integrity, understanding of the data, and practicality of the solution for business. Recent work has raised concerns about the risk of unintended bias in algorithmic decision making systems in use today, which can affect individuals unfairly based on race, gender or religion, among other possible characteristics. While many bias metrics and fairness definitions have been proposed in recent years, there is no consensus on which metric/definition should be used, and there are very few available resources to operationalize them. Therefore, despite recent awareness, auditing for bias and fairness when developing and deploying algorithmic decision making systems is not yet a standard practice. We present Aequitas, an open source bias and fairness audit toolkit that is an intuitive and easy-to-use addition to the machine learning workflow, enabling users to seamlessly test models for several bias and fairness metrics in relation to multiple population sub-groups. We believe Aequitas will facilitate informed and equitable decisions around developing and deploying algorithmic decision making systems for data scientists, machine learning researchers and policymakers alike. This paper presents a novel approach to the technical analysis of wireheading in intelligent agents. Inspired by the natural analogues of wireheading and their prevalent manifestations, we propose modeling such phenomena in Reinforcement Learning (RL) agents as psychological disorders. In a preliminary step towards evaluating this proposal, we study the feasibility and dynamics of emergent addictive policies in Q-learning agents in the tractable environment of the game of Snake. 
We consider a slightly modified setting for this game, in which the environment provides a ‘drug’ seed alongside the original ‘healthy’ seed for the consumption of the snake. We adopt and extend an RL-based model of natural addiction to Q-learning agents in this setting, and derive sufficient parametric conditions for the emergence of addictive behaviors in such agents. Furthermore, we evaluate our theoretical analysis with three sets of simulation-based experiments. The results demonstrate the feasibility of addictive wireheading in RL agents, and provide promising avenues for further research on the psychopathological modeling of complex AI safety problems. Networks are fundamental building blocks for representing data and computations. Remarkable progress in learning in structurally defined (shallow or deep) networks has recently been achieved. Here we introduce an evolutionary exploratory search and learning method for topologically flexible networks under the constraint of producing elementary computational steady-state input-output operations. Our results include: (1) the identification of networks, over four orders of magnitude, implementing computation of steady-state input-output functions, such as a band-pass filter, a threshold function, and an inverse band-pass function. Next, (2) the learned networks are technically controllable, as only a small number of driver nodes are required to move the system to a new state. Furthermore, we find that the fraction of required driver nodes is constant during evolutionary learning, suggesting a stable system design. (3) Our framework allows multiplexing of different computations using the same network. For example, using a binary representation of the inputs, the network can readily compute three different input-output functions. Finally, (4) the proposed evolutionary learning demonstrates transfer learning. 
If the system learns one function A, then learning B requires, on average, fewer steps than learning B from tabula rasa. We conclude that the constrained evolutionary learning produces large, robust, controllable circuits, capable of multiplexing and transfer learning. Our study suggests that network-based computations of steady-state functions, representing either cellular modules of cell-to-cell communication networks or internal molecular circuits communicating within a cell, could be a powerful model for biologically inspired computing. This complements conceptualizations such as attractor-based models, or reservoir computing. We present the state of the art in representing and reasoning with fuzzy knowledge in Semantic Web Languages such as the triple languages RDF/RDFS, the conceptual languages of the OWL 2 family and rule languages. We further show how one may generalise them to so-called annotation domains, which also cover, e.g., temporal and provenance extensions. Characterizing statistical properties of solutions of inverse problems is essential for decision making. Bayesian inversion offers a tractable framework for this purpose, but current approaches are computationally infeasible for most realistic imaging applications in the clinic. We introduce two novel deep learning based methods for solving large-scale inverse problems using Bayesian inversion: a sampling-based method using a WGAN with a novel mini-discriminator, and a direct approach that trains a neural network using a novel loss function. The performance of both methods is demonstrated on image reconstruction in ultra-low-dose 3D helical CT. We compute the posterior mean and standard deviation of the 3D images, followed by a hypothesis test to assess whether a ‘dark spot’ in the liver of a cancer-stricken patient is present. 
Both methods are computationally efficient and our evaluation shows very promising performance that clearly supports the claim that Bayesian inversion is usable for 3D imaging in time-critical applications. In this paper we study the theoretical properties of the simultaneous multiscale change point estimator (SMUCE) proposed by Frick et al. (2014) in regression models with dependent error processes. Empirical studies show that in this case the change point estimate is inconsistent, but it is not known if alternatives suggested in the literature for correlated data are consistent. We propose a modification of SMUCE scaling the basic statistic by the long run variance of the error process, which is estimated by a difference-type variance estimator calculated from local means from different blocks. For this modification we prove model consistency for physically dependent error processes and illustrate the finite sample performance by means of a simulation study. We study the age of information in a multiple source-multiple destination setting with a focus on its scaling in large wireless networks. There are $n$ nodes that are randomly paired with each other on a fixed area to form $n$ source-destination (S-D) pairs. We propose a three-phase transmission scheme which utilizes local cooperation between the nodes by forming what we call mega update packets to serve multiple S-D pairs at once. We show that under the proposed scheme the average age of an S-D pair scales as $O(n^{\frac{1}{4}})$ as the number of users, $n$, in the network grows. To the best of our knowledge, this is the best age scaling result for a multiple source-multiple destination setting. Probabilistic programs with dynamic computation graphs can define measures over sample spaces with unbounded dimensionality, and thereby constitute programmatic analogues to Bayesian nonparametrics. 
Owing to the generality of this model class, inference relies on ‘black-box’ Monte Carlo methods that are generally not able to take advantage of conditional independence and exchangeability, which have historically been the cornerstones of efficient inference. We here seek to develop a ‘middle ground’ between probabilistic models with fully dynamic and fully static computation graphs. To this end, we introduce a combinator library for the Probabilistic Torch framework. Combinators are functions that accept models and return transformed models. We assume that models are dynamic, but that model composition is static, in the sense that combinator application takes place prior to evaluating the model on data. Combinators provide primitives for both model and inference composition. Model combinators take the form of classic functional programming constructs such as map and reduce. These constructs define a computation graph at a coarsened level of representation, in which nodes correspond to models, rather than individual variables. Inference combinators – such as enumeration, importance resampling, and Markov Chain Monte Carlo operators – assume a sampling semantics for model evaluation, in which application of combinators preserves proper weighting. Owing to this property, models defined using combinators can be trained using stochastic methods that optimize either variational or wake-sleep style objectives. As a validation of this principle, we use combinators to implement black-box inference for hidden Markov models. Continue Reading… ### Distilled News Could synthetic data be the solution to rapidly train artificial intelligence (AI) algorithms? There are advantages and disadvantages to synthetic data; however, many technology experts believe that synthetic data is the key to democratizing machine learning and to accelerating the testing and adoption of artificial intelligence algorithms in our daily lives. 
I wanted to figure out where I should train my deep learning models online for the lowest cost and least hassle. I wasn’t able to find a good comparison of GPU cloud service providers, so I decided to make my own. Feel free to skip to the pretty charts if you know all about GPUs and TPUs and just want the results. I’m not looking at serving models in this article, but I might in the future. Follow me to make sure you don’t miss out. Accurate forecasts are key components of successful data-driven businesses. We may forecast the need for internal resources (e.g. call center staffing), key metrics that drive other business decisions (e.g. electricity demand to decide on constructing a new power plant), or customer demand for products we distribute (retail demand forecasting). Time horizons of our forecasts may differ widely, from years in advance to only a few minutes. At Uber, we test most new features and products with the help of experiments in order to understand and quantify their impact on our marketplace. The analysis of experimental results traditionally focuses on calculating average treatment effects (ATEs). Since averages reduce an entire distribution to a single number, however, any heterogeneity in treatment effects will go unnoticed. Instead, we have found that calculating quantile treatment effects (QTEs) allows us to effectively and efficiently characterize the full distribution of treatment effects and thus capture the inherent heterogeneity in treatment effects when thousands of riders and drivers interact within Uber’s marketplace. Besides providing a more nuanced picture of the effect of a new algorithm, this analysis is relevant to our business because people remember negative experiences more strongly than positive ones (see Baumeister et al. (2001)). In this article, we describe what QTEs are, how exactly they provide additional insights beyond ATEs, why they are relevant for a business such as Uber’s, and how we calculate them. 
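The QTE idea described above is easy to make concrete: the QTE at quantile q is simply the q-th quantile of the treated outcomes minus the q-th quantile of the control outcomes. A minimal sketch (in Python, with synthetic data; the function name and data are made up for illustration, and this is not Uber's implementation):

```python
import random
import statistics

def quantile_treatment_effects(treated, control, quantiles=(10, 25, 50, 75, 90)):
    """QTE at quantile q: the q-th percentile of treated outcomes minus
    the q-th percentile of control outcomes (both computed marginally)."""
    # statistics.quantiles(..., n=100) returns the 1st..99th percentiles
    t_q = statistics.quantiles(treated, n=100)
    c_q = statistics.quantiles(control, n=100)
    return {q: t_q[q - 1] - c_q[q - 1] for q in quantiles}

random.seed(42)
control = [random.gauss(0, 1) for _ in range(10_000)]
# Hypothetical treatment: shifts everyone by 1 and stretches the upper tail,
# so the effect is heterogeneous across the outcome distribution.
treated = [x + 1 + (0.5 * x if x > 0 else 0.0) for x in control]

qte = quantile_treatment_effects(treated, control)
# An ATE (a single average) would hide that the effect grows with the
# quantile; the QTEs make the heterogeneity visible.
print(qte)
```

Here the QTE at the 90th percentile is noticeably larger than at the 10th, which is exactly the kind of pattern a single ATE would average away.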
This final article in the series Model evaluation, model selection, and algorithm selection in machine learning presents overviews of several statistical hypothesis testing approaches, with applications to machine learning model and algorithm comparisons. This includes statistical tests based on target predictions for independent test sets (the downsides of using a single test set for model comparisons were discussed in previous articles) as well as methods for algorithm comparisons by fitting and evaluating models via cross-validation. Lastly, this article will introduce nested cross-validation, which has become a common and recommended method of choice for algorithm comparisons for small to moderately-sized datasets. Then, at the end of this article, I provide a list of my personal suggestions concerning model evaluation, selection, and algorithm selection, summarizing the several techniques covered in this series of articles. For many new data scientists transitioning into AI and deep learning, the Keras framework is an efficient tool. Keras is a powerful and easy-to-use Python library for developing and evaluating deep learning models. In this article, we’ll lay out the welcome mat to the framework. You should walk away with a handful of useful features to keep in mind as you get up to speed. In the words of the developers, ‘Keras is a high-level neural networks API, written in Python and developed with a focus on enabling fast experimentation.’ It has been open sourced since its initial release in March 2015. Its documentation can be found on keras.io with source code on GitHub. Starting off, you’ll learn about Artificial Intelligence and then move on to machine learning and deep learning. You will further learn how machine learning is different from deep learning, and about the various kinds of algorithms that fall under these two domains of learning. Finally, you will be introduced to some real-life applications where machine learning and deep learning are being applied. 
The promise of machine learning has shown many stunning results in a wide variety of fields. Natural language processing is no exception, and it is one of those fields where machine learning has been able to show general artificial intelligence (not entirely, but at least partially), achieving some brilliant results for genuinely complicated tasks. Now, NLP (natural language processing) is not a new field, and neither is machine learning. But the fusion of the two fields is quite contemporary, and it only promises to make further progress. This is one of those hybrid applications which everyone (with a budget smart-phone) comes across daily. For example, take ‘keyboard word suggestion’ into account, or intelligent auto-completions; all of these are byproducts of the amalgamation of NLP and Machine Learning, and quite naturally they have become inseparable parts of our lives. Machine learning is an important topic in lots of industries right now. It’s a fast-moving field with lots of active research and receives huge amounts of media attention. This post isn’t intended to be an introduction to machine learning, or a comprehensive overview of the state of the art. Instead it will show how models built using machine learning can be leveraged from within Excel. This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments. For normal deviates there is, of course, a well-known scaling correction that returns an unbiased estimate for observed standard deviations. In this blog post, I’ll use the data that I cleaned in a previous blog post, which you can download here. If you want to follow along, download the monthly data. In the previous blog post, I used the auto.arima() function to very quickly get a ‘good-enough’ model to predict future monthly total passengers flying from LuxAirport. ‘Good-enough’ models can be all you need in a lot of situations, but perhaps you’d like to have a better model. 
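Regarding the note above on correcting standard deviation estimates for normal deviates: the well-known scaling correction referred to there is the c4 factor, with E[s] = c4(n)·σ, so dividing an observed standard deviation s by c4(n) removes the bias. A quick sketch (in Python, purely illustrative):

```python
import math

def c4(n):
    """E[s] = c4(n) * sigma for the sample sd of n iid normal deviates,
    with c4(n) = sqrt(2/(n-1)) * gamma(n/2) / gamma((n-1)/2)."""
    return math.sqrt(2 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

# The bias matters for tiny samples and vanishes as n grows:
print(round(c4(2), 4))    # 0.7979: s underestimates sigma by ~20% on average
print(round(c4(10), 4))   # 0.9727
print(round(c4(100), 4))  # 0.9975
```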
I will show here how you can get a better model by searching through a grid of hyper-parameters. A few years ago, when I first became aware of Topological Data Analysis (TDA), I was really excited by the possibility that the elegant theorems of Algebraic Topology could provide some new insights into the practical problems of data analysis. But time has passed, and the sober assessment of Larry Wasserman seems to describe where things stand. If you are excited about Machine Learning, and you’re interested in how it can be applied to Gaming or Optimization, this article is for you. We’ll see the basics of Reinforcement Learning, and more specifically Deep Reinforcement Learning (Neural Networks + Q-Learning), applied to the game Snake. Let’s dive into it! This post is about building a shallow neural network (NN) from scratch for a classification problem using the numpy library in Python, and also comparing its performance against logistic regression (using scikit-learn). Building an NN from scratch helps in understanding how an NN works under the hood, and it is essential for building effective models. Without delay, let’s dive into building our simple shallow NN model from scratch. If you’re a data scientist in industry, have you ever had to face your customers and tell them that a project might not live up to their expectations because of data sparsity? Conventional machine learning and deep learning algorithms have traditionally been designed to work in isolation. These algorithms are trained to solve specific tasks. The models have to be rebuilt from scratch once the feature-space distribution changes. Transfer learning is the idea of overcoming the isolated learning paradigm and utilizing knowledge acquired for one task to solve related ones. In this article, we will provide comprehensive coverage of the concepts, scope and real-world applications of transfer learning and even showcase some hands-on examples. 
To be more specific, we will be covering the following. There has been a renewed interest in machine learning in the last few years. This revival seems to be driven by strong fundamentals: loads of data being emitted by sensors across the globe, cheap storage, and the lowest-ever computational costs! The buzz around machine learning has developed a keen interest among techies to get their hands on ML. However, before diving into the ocean of ML, here are a few basic concepts that you should be familiar with. Keep this handy, as you will come across these terms frequently while learning ML. If you have ever worked with time series predictions, I am quite sure you are well aware of the strains and pains that come with them. One moment you think you have cracked the stock market, the next moment you are lying in the bath crying and cursing your inaccurate models (I really don’t recommend that you try and predict the stock market, you will most likely not reap the benefits you think you will). What I am trying to say is that time series predictions are difficult and always require a very specialized data scientist to implement them. As a data scientist, one of the most integral aspects of our job is to relay and display data to ‘non-data scientists’, in formats that provide visually actionable data. In my opinion, one of the coolest parts of our job is interacting with data, especially when there is visual interaction. At times when we might want to build an interactive application, one of the options available to us is a framework called Dash. Dash is an open source Python framework for building web applications, created and maintained by the people at Plotly. Dash’s web graphics are completely interactive because the framework is built on top of Plotly.js, a JavaScript library written and maintained by Plotly. This means that after importing the Dash framework into a Python file you can build a web application writing strictly in Python, with no other languages necessary. 
Continue Reading… ### If you did not already know Information Extraction Technology With the rise of the digital age, there is an explosion of information in the form of news, articles, social media, and so on. Much of this data lies in unstructured form, and manually managing and effectively making use of it is tedious, boring and labor-intensive. This explosion of information and the need for more sophisticated and efficient information handling tools gives rise to Information Extraction (IE) and Information Retrieval (IR) technology. Information Extraction systems take natural language text as input and produce structured information specified by certain criteria that is relevant to a particular application. Various sub-tasks of IE, such as Named Entity Recognition, Coreference Resolution, Named Entity Linking, Relation Extraction and Knowledge Base reasoning, form the building blocks of various high-end Natural Language Processing (NLP) tasks such as Machine Translation, Question-Answering Systems, Natural Language Understanding, Text Summarization and Digital Assistants like Siri, Cortana and Google Now. This paper introduces Information Extraction technology and its various sub-tasks, and highlights state-of-the-art research in various IE subtasks, current challenges and future research directions. … Implicit Maximum Likelihood Estimation Implicit probabilistic models are models defined naturally in terms of a sampling procedure, and they often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results. 
… Lovasz Convolutional Network (LCN) Semi-supervised learning on graph-structured data has received significant attention with the recent introduction of graph convolution networks (GCNs). While traditional methods have focused on optimizing a loss augmented with Laplacian regularization, GCNs perform an implicit Laplacian-type regularization to capture local graph structure. In this work, we propose Lovasz convolutional networks (LCNs), which are capable of incorporating global graph properties. LCNs achieve this by utilizing Lovasz’s orthonormal embeddings of the nodes. We analyse local and global properties of graphs and demonstrate settings where LCNs tend to work better than GCNs. We validate the proposed method on standard random graph models such as stochastic block models (SBM) and certain community structure based graphs, where LCNs outperform GCNs and learn more intuitive embeddings. We also perform extensive binary and multi-class classification experiments on real world datasets to demonstrate LCN’s effectiveness. In addition to simple graphs, we also demonstrate the use of LCNs on hypergraphs by identifying settings where they are expected to work better than GCNs. … Continue Reading… ### Obama-Trump voters turn back to Democrats Senate Democrats did especially well where Donald Trump had gained the most ground Continue Reading… ### Using a genetic algorithm for the hyperparameter optimization of a SARIMA model (This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers) ## Introduction In this blog post, I’ll use the data that I cleaned in a previous blog post, which you can download here. If you want to follow along, download the monthly data. In my last blog post I showed how to perform a grid search the “tidy” way. As an example, I looked for the right hyperparameters of a SARIMA model. 
However, the goal of the post was not hyperparameter optimization per se, so I did not bother with tuning the hyperparameters on a validation set, and used the test set both for validating the hyperparameters and for testing the forecast. Of course, this is not great, because doing this might lead to overfitting the hyperparameters to the test set. So in this blog post I split my data into training, validation and testing sets, and use a genetic algorithm to look for the hyperparameters. Again, this is not the most optimal way to go about this problem, since the {forecast} package contains the very useful auto.arima() function. I just wanted to see what kind of solution a genetic algorithm would return, and also try different cost functions. If you’re interested, read on!

## Setup

Let’s first load some libraries and define some helper functions (the helper functions were explained in the previous blog posts):

library(tidyverse)
library(forecast)
library(rgenoud)
library(parallel)
library(lubridate)
library(furrr)
library(tsibble)
library(brotools)

ihs <- function(x){
  log(x + sqrt(x**2 + 1))
}

to_tibble <- function(forecast_object){
  point_estimate <- forecast_object$mean %>%
    as_tsibble() %>%
    rename(point_estimate = value,
           date = index)

  upper <- forecast_object$upper %>%
    as_tsibble() %>%
    spread(key, value) %>%
    rename(date = index,
           upper80 = `80%`,
           upper95 = `95%`)

  lower <- forecast_object$lower %>%
    as_tsibble() %>%
    spread(key, value) %>%
    rename(date = index,
           lower80 = `80%`,
           lower95 = `95%`)

  reduce(list(point_estimate, upper, lower), full_join)
}
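As an aside, the ihs() helper defined in the setup above, log(x + sqrt(x^2 + 1)), is exactly the inverse hyperbolic sine, so R's built-in asinh() computes the same thing. A quick numeric check of the identity (shown here in Python, where the built-in is math.asinh; the same holds for R's asinh()):

```python
import math

def ihs(x):
    # Same definition as in the post: log(x + sqrt(x^2 + 1))
    return math.log(x + math.sqrt(x ** 2 + 1))

# The identity ihs(x) == asinh(x) holds for all real x, including zero
# and negative values; unlike log(), ihs is defined at 0 (ihs(0) == 0),
# which is why it is popular as a log-like transform for data with zeros.
for x in [-5.0, -1.5, 0.0, 2.7, 1e6]:
    assert math.isclose(ihs(x), math.asinh(x), rel_tol=1e-9, abs_tol=1e-12)
print(ihs(0.0))  # 0.0
```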

avia_clean_monthly <- read_csv("https://raw.githubusercontent.com/b-rodrigues/avia_par_lu/master/avia_clean_monthy.csv")
## Parsed with column specification:
## cols(
##   destination = col_character(),
##   date = col_date(format = ""),
##   passengers = col_integer()
## )

Let’s split the data into a train set, a validation set and a test set:

avia_clean_train <- avia_clean_monthly %>%
  select(date, passengers) %>%
  filter(year(date) < 2013) %>%
  group_by(date) %>%
  summarise(total_passengers = sum(passengers)) %>%
  pull(total_passengers) %>%
  ts(., frequency = 12, start = c(2005, 1))

avia_clean_validation <- avia_clean_monthly %>%
  select(date, passengers) %>%
  filter(between(year(date), 2013, 2016)) %>%
  group_by(date) %>%
  summarise(total_passengers = sum(passengers)) %>%
  pull(total_passengers) %>%
  ts(., frequency = 12, start = c(2013, 1))

avia_clean_test <- avia_clean_monthly %>%
  select(date, passengers) %>%
  filter(year(date) >= 2016) %>%
  group_by(date) %>%
  summarise(total_passengers = sum(passengers)) %>%
  pull(total_passengers) %>%
  ts(., frequency = 12, start = c(2016, 1))

logged_test_data <- ihs(avia_clean_test)

logged_validation_data <- ihs(avia_clean_validation)

logged_train_data <- ihs(avia_clean_train)

I will train the models on data from 2005 to 2012, look for the hyperparameters on data from 2013
to 2016, and test the accuracy on data from 2016 to March 2018. For this kind of exercise, the ideal
situation would be to perform cross-validation. Doing this with time-series data is not obvious,
because of the autocorrelation between observations, which would be broken by the independent
sampling that CV requires. Also, if for example you do leave-one-out CV,
you could end up trying to predict a point in, say, 2017, with data
from 2018, which does not make sense. So you should be careful about that. {forecast} is able
to perform CV for time series, and scikit-learn, the Python machine learning package, can perform
cross-validation of time series data
too. I will not do it in this blog post and will simply focus on the genetic algorithm part.
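The rolling-origin scheme that makes cross-validation valid for time series (implemented, for instance, by tsCV() in {forecast} and TimeSeriesSplit in scikit-learn) can be sketched in a few lines. A minimal, language-agnostic illustration, shown here in Python (the function name is made up for the example):

```python
def rolling_origin_splits(n_obs, min_train, horizon):
    """Yield (train_indices, test_indices) pairs in which the training window
    always ends before the test window begins, so no future data leaks in."""
    for end in range(min_train, n_obs - horizon + 1):
        yield list(range(end)), list(range(end, end + horizon))

# 10 monthly observations, at least 6 used for training, 2-step-ahead forecasts:
splits = list(rolling_origin_splits(10, min_train=6, horizon=2))
for tr, te in splits:
    assert max(tr) < min(te)  # every test index lies strictly after the training set
print(len(splits))  # 3 folds: the training window ends at 6, 7 and 8
```

Each fold trains only on the past and evaluates on the immediate future, which respects the autocorrelation structure that ordinary shuffled CV would destroy.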

Let’s start by defining the cost function to minimize. I’ll try several, in the first one I will
minimize the RMSE:

cost_function_rmse <- function(param, train_data, validation_data, forecast_periods){
  order <- param[1:3]
  season <- c(param[4:6], 12)
  model <- purrr::possibly(arima, otherwise = NULL)(x = train_data,
                                                    order = order,
                                                    seasonal = season,
                                                    method = "ML")
  if(is.null(model)){
    return(9999999)
  } else {
    forecast_model <- forecast::forecast(model, h = forecast_periods)
    point_forecast <- forecast_model$mean
    sqrt(mean((point_forecast - validation_data) ** 2))
  }
}

If arima() is not able to estimate a model for the given parameters, I force it to return NULL, and in that case the cost function returns a very high cost. If a model was successfully estimated, then I compute the RMSE. Let’s also take a look at what auto.arima() says:

starting_model <- auto.arima(logged_train_data)
summary(starting_model)
## Series: logged_train_data
## ARIMA(1,0,2)(2,1,0)[12]
##
## Coefficients:
##          ar1      ma1     ma2     sar1     sar2
##       0.9754  -0.7872  0.2091  -0.7285  -0.4413
## s.e.  0.0261   0.1228  0.1213   0.1063   0.1150
##
## sigma^2 estimated as 0.004514:  log likelihood=105.61
## AIC=-199.22   AICc=-198.13   BIC=-184.64
##
## Training set error measures:
##                       ME       RMSE        MAE        MPE      MAPE
## Training set 0.008398036 0.06095102 0.03882593 0.07009285 0.3339574
##                   MASE       ACF1
## Training set 0.4425794 0.02073886

Let’s compute the cost at this vector of parameters:

cost_function_rmse(c(1, 0, 2, 2, 1, 0),
                   train_data = logged_train_data,
                   validation_data = logged_validation_data,
                   forecast_periods = 65)
## [1] 0.1731473

Ok, now let’s start optimizing the hyperparameters. Let’s help the genetic algorithm a little bit by defining where it should perform the search:

domains <- matrix(c(0, 3, 0, 2, 0, 3, 0, 3, 0, 2, 0, 3), byrow = TRUE, ncol = 2)

This matrix constrains the first parameter to lie between 0 and 3, the second one between 0 and 2, and so on.
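For readers unfamiliar with what genoud() does under the hood, the core loop of a genetic algorithm over integer parameter vectors constrained to box domains can be sketched as follows. This is a deliberately simplified illustration in Python with a toy cost function (elitism plus single-coordinate mutation only); rgenoud's actual algorithm is more sophisticated, combining several genetic operators with derivative-based optimization:

```python
import random

def genetic_minimize(cost, domains, pop_size=50, generations=100, seed=1):
    """Minimize `cost` over integer vectors. `domains` is a list of
    (low, high) bounds, one pair per parameter, like genoud's Domains matrix."""
    rng = random.Random(seed)
    dim = len(domains)
    pop = [[rng.randint(lo, hi) for lo, hi in domains] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=cost)
        elite = scored[: pop_size // 5]   # keep the best 20% unchanged
        pop = list(elite)
        while len(pop) < pop_size:
            child = list(rng.choice(elite))
            i = rng.randrange(dim)        # mutate one randomly chosen coordinate
            lo, hi = domains[i]
            child[i] = rng.randint(lo, hi)
            pop.append(child)
    return min(pop, key=cost)

# Toy cost whose (unknown to the algorithm) optimum is (1, 0, 2, 2, 1, 0),
# with the same box domains as in the post:
target = (1, 0, 2, 2, 1, 0)
domains = [(0, 3), (0, 2), (0, 3), (0, 3), (0, 2), (0, 3)]
best = genetic_minimize(lambda p: sum((a - b) ** 2 for a, b in zip(p, target)),
                        domains)
print(best)
```

In the post, the cost being minimized is of course the validation RMSE of the SARIMA fit rather than this toy squared-distance function.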
Let’s call the genoud() function from the {rgenoud} package, and use 8 cores:

cl <- makePSOCKcluster(8)
clusterExport(cl, c('logged_train_data', 'logged_validation_data'))

tic <- Sys.time()
auto_arima_rmse <- genoud(cost_function_rmse,
                          nvars = 6,
                          data.type.int = TRUE,
                          starting.values = c(1, 0, 2, 2, 1, 0), # <- from auto.arima
                          Domains = domains,
                          cluster = cl,
                          train_data = logged_train_data,
                          validation_data = logged_validation_data,
                          forecast_periods = length(logged_validation_data),
                          hard.generation.limit = TRUE)
toc_rmse <- Sys.time() - tic

makePSOCKcluster() is a function from the {parallel} package. I must also export the global variables logged_train_data and logged_validation_data. If I don’t do that, the workers called by genoud() will not know about these variables and an error will be returned. The option data.type.int = TRUE forces the algorithm to look only for integers, and hard.generation.limit = TRUE forces the algorithm to stop after 100 generations. The process took 7 minutes, which is faster than doing the grid search. What was the solution found?

auto_arima_rmse
## $value
## [1] 0.0001863039
##
## $par
## [1] 3 2 1 1 2 1
##
## $gradients
## [1] NA NA NA NA NA NA
##
## $generations
## [1] 11
##
## $peakgeneration
## [1] 1
##
## $popsize
## [1] 1000
##
## $operators
## [1] 122 125 125 125 125 126 125 126   0

Let’s train the model using the arima() function at these parameters:

best_model_rmse <- arima(logged_train_data,
                         order = auto_arima_rmse$par[1:3],
                         seasonal = list(order = auto_arima_rmse$par[4:6],
                                         period = 12),
                         method = "ML")

summary(best_model_rmse)
##
## Call:
## arima(x = logged_train_data, order = auto_arima_rmse$par[1:3], seasonal = list(order = auto_arima_rmse$par[4:6],
##     period = 12), method = "ML")
##
## Coefficients:
##           ar1      ar2      ar3      ma1     sar1     sma1
##       -0.6999  -0.4541  -0.0476  -0.9454  -0.4996  -0.9846
## s.e.   0.1421   0.1612   0.1405   0.1554   0.1140   0.2193
##
## sigma^2 estimated as 0.006247:  log likelihood = 57.34,  aic = -100.67
##
## Training set error measures:
##                         ME       RMSE        MAE          MPE      MAPE
## Training set -0.0006142355 0.06759545 0.04198561 -0.005408262 0.3600483
##                   MASE         ACF1
## Training set 0.4386693 -0.008298546

Let’s extract the forecasts:

best_model_rmse_forecast <- forecast::forecast(best_model_rmse, h = 65)

best_model_rmse_forecast <- to_tibble(best_model_rmse_forecast)
## Joining, by = "date"
## Joining, by = "date"
starting_model_forecast <- forecast(starting_model, h = 65)

starting_model_forecast <- to_tibble(starting_model_forecast)
## Joining, by = "date"
## Joining, by = "date"

and plot the forecast to see how it looks:

avia_clean_monthly %>%
group_by(date) %>%
summarise(total = sum(passengers)) %>%
mutate(total_ihs = ihs(total)) %>%
ggplot() +
ggtitle("Minimization of RMSE") +
geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
geom_ribbon(data = best_model_rmse_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#666018", alpha = 0.2) +
geom_line(data = best_model_rmse_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#8e9d98") +
geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#98431e", alpha = 0.2) +
geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#a53031") +
theme_blog()

The yellowish line and confidence intervals come from the genetic algorithm minimizing the RMSE,
and the reddish ones from auto.arima(). Interesting; the point estimate is very precise, but the
confidence intervals are very wide. Low bias, high variance.

Now, let’s try with another cost function, where I minimize the BIC, similar to the auto.arima() function:

cost_function_bic <- function(param, train_data, validation_data, forecast_periods){
order <- param[1:3]
season <- c(param[4:6], 12)
model <- purrr::possibly(arima, otherwise = NULL)(x = train_data, order = order,
seasonal = season,
method = "ML")
if(is.null(model)){
return(9999999)
} else {
BIC(model)
}
}

Let’s take a look at the cost at the parameter values returned by auto.arima():

cost_function_bic(c(1, 0, 2, 2, 1, 0),
train_data = logged_train_data,
validation_data = logged_validation_data,
forecast_periods = 65)
## [1] -184.6397

Let the genetic algorithm run again:

cl <- makePSOCKcluster(8)
clusterExport(cl, c('logged_train_data', 'logged_validation_data'))

tic <- Sys.time()

auto_arima_bic <- genoud(cost_function_bic,
nvars = 6,
data.type.int = TRUE,
starting.values = c(1, 0, 2, 2, 1, 0), # <- from auto.arima
Domains = domains,
cluster = cl,
train_data = logged_train_data,
validation_data = logged_validation_data,
forecast_periods = length(logged_validation_data),
hard.generation.limit = TRUE)
toc_bic <- Sys.time() - tic

This time, it took 6 minutes, a bit faster than before. Let’s take a look at the solution:

auto_arima_bic
## $value
## [1] -201.0656
##
## $par
## [1] 0 1 1 1 0 1
##
## $gradients
## [1] NA NA NA NA NA NA
##
## $generations
## [1] 12
##
## $peakgeneration
## [1] 1
##
## $popsize
## [1] 1000
##
## $operators
## [1] 122 125 125 125 125 126 125 126   0

Let’s train the model at these parameters:

best_model_bic <- arima(logged_train_data, order = auto_arima_bic$par[1:3],
seasonal = list(order = auto_arima_bic$par[4:6], period = 12),
method = "ML")

summary(best_model_bic)
##
## Call:
## arima(x = logged_train_data, order = auto_arima_bic$par[1:3], seasonal = list(order = auto_arima_bic$par[4:6],
##     period = 12), method = "ML")
##
## Coefficients:
##           ma1    sar1    sma1
##       -0.6225  0.9968  -0.832
## s.e.   0.0835  0.0075   0.187
##
## sigma^2 estimated as 0.004145:  log likelihood = 109.64,  aic = -211.28
##
## Training set error measures:
##                       ME       RMSE        MAE        MPE      MAPE
## Training set 0.003710982 0.06405303 0.04358164 0.02873561 0.3753513
##                   MASE        ACF1
## Training set 0.4553447 -0.03450603

And let’s plot the results:

best_model_bic_forecast <- forecast::forecast(best_model_bic, h = 65)

best_model_bic_forecast <- to_tibble(best_model_bic_forecast)
## Joining, by = "date"
## Joining, by = "date"

avia_clean_monthly %>%
group_by(date) %>%
summarise(total = sum(passengers)) %>%
mutate(total_ihs = ihs(total)) %>%
ggplot() +
ggtitle("Minimization of BIC") +
geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
geom_ribbon(data = best_model_bic_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#5160a0", alpha = 0.2) +
geom_line(data = best_model_bic_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#208480") +
geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#98431e", alpha = 0.2) +
geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#a53031") +
theme_blog()

The solutions are very close, both in terms of point estimates and confidence intervals. Bias increased, but variance lowered… This gives me an idea! What if I minimize the RMSE, while keeping the number of parameters low, as a kind of regularization?
This is somewhat what minimising the BIC does, but let’s try a more “naive” approach:

cost_function_rmse_low_k <- function(param, train_data, validation_data, forecast_periods, max.order){
order <- param[1:3]
season <- c(param[4:6], 12)
if(param[1] + param[3] + param[4] + param[6] > max.order){
return(9999999)
} else {
model <- purrr::possibly(arima, otherwise = NULL)(x = train_data, order = order,
seasonal = season,
method = "ML")
}
if(is.null(model)){
return(9999999)
} else {
forecast_model <- forecast::forecast(model, h = forecast_periods)
point_forecast <- forecast_model$mean
sqrt(mean(point_forecast - validation_data) ** 2)
}
}

This is also similar to what auto.arima() does; by default, the max.order argument in auto.arima()
is set to 5, and is the sum of p + q + P + Q. So I’ll try something similar.
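To make the constraint concrete, here is a quick check (using the parameter vector returned by auto.arima() above) that the condition in the cost function counts only the AR and MA orders, not the differencing orders:

```r
# param is laid out as (p, d, q, P, D, Q); the constraint in the cost function
# sums the AR and MA orders only, skipping the differencing orders d and D
param <- c(1, 0, 2, 2, 1, 0)  # <- from auto.arima
order_sum <- param[1] + param[3] + param[4] + param[6]  # p + q + P + Q
order_sum
## [1] 5
```

So the auto.arima() starting values sit exactly at the default max.order of 5.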

Let’s take a look at the cost at the parameter values returned by auto.arima():

cost_function_rmse_low_k(c(1, 0, 2, 2, 1, 0),
train_data = logged_train_data,
validation_data = logged_validation_data,
forecast_periods = 65,
max.order = 5)
## [1] 0.1731473

Let’s see what will happen:

cl <- makePSOCKcluster(8)
clusterExport(cl, c('logged_train_data', 'logged_validation_data'))

tic <- Sys.time()

auto_arima_rmse_low_k <- genoud(cost_function_rmse_low_k,
nvars = 6,
data.type.int = TRUE,
starting.values = c(1, 0, 2, 2, 1, 0), # <- from auto.arima
max.order = 5,
Domains = domains,
cluster = cl,
train_data = logged_train_data,
validation_data = logged_validation_data,
forecast_periods = length(logged_validation_data),
hard.generation.limit = TRUE)
toc_rmse_low_k <- Sys.time() - tic

It took 1 minute to train this one, quite fast! Let’s take a look:

auto_arima_rmse_low_k
## $value
## [1] 0.002503478
##
## $par
## [1] 1 2 0 3 1 0
##
## $gradients
## [1] NA NA NA NA NA NA
##
## $generations
## [1] 11
##
## $peakgeneration
## [1] 1
##
## $popsize
## [1] 1000
##
## $operators
## [1] 122 125 125 125 125 126 125 126   0

And let’s plot it:

best_model_rmse_low_k <- arima(logged_train_data, order = auto_arima_rmse_low_k$par[1:3],
seasonal = list(order = auto_arima_rmse_low_k$par[4:6], period = 12),
method = "ML")

summary(best_model_rmse_low_k)
##
## Call:
## arima(x = logged_train_data, order = auto_arima_rmse_low_k$par[1:3], seasonal = list(order = auto_arima_rmse_low_k$par[4:6],
##     period = 12), method = "ML")
##
## Coefficients:
##           ar1     sar1     sar2     sar3
##       -0.6468  -0.7478  -0.5263  -0.1143
## s.e.   0.0846   0.1171   0.1473   0.1446
##
## sigma^2 estimated as 0.01186:  log likelihood = 57.88,  aic = -105.76
##
## Training set error measures:
##                        ME      RMSE        MAE         MPE      MAPE
## Training set 0.0005953302 0.1006917 0.06165919 0.003720452 0.5291736
##                   MASE       ACF1
## Training set 0.6442205 -0.3706693

best_model_rmse_low_k_forecast <- forecast::forecast(best_model_rmse_low_k, h = 65)

best_model_rmse_low_k_forecast <- to_tibble(best_model_rmse_low_k_forecast)
## Joining, by = "date"
## Joining, by = "date"

avia_clean_monthly %>%
group_by(date) %>%
summarise(total = sum(passengers)) %>%
mutate(total_ihs = ihs(total)) %>%
ggplot() +
ggtitle("Minimization of RMSE + low k") +
geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
geom_ribbon(data = best_model_rmse_low_k_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#5160a0", alpha = 0.2) +
geom_line(data = best_model_rmse_low_k_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#208480") +
geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#98431e", alpha = 0.2) +
geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#a53031") +
theme_blog()

Looks like this was not the right strategy. There might be a better cost function than the ones I have tried, but it looks like minimizing the BIC is the way to go. Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates or buy me an espresso.
To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Continue Reading…

## November 15, 2018

### Introducing Drexel new online MS in Data Science

Drexel’s new online MS in Data Science is the degree that launched a thousand opportunities. Complete your courses on your own schedule, while still building meaningful relationships with your professors and classmates.

Continue Reading…

### R Packages worth a look

Generates Multivariate Nonnormal Data and Determines How Many Factors to Retain (RGenData)
The GenDataSample() and GenDataPopulation() functions create, respectively, a sample or population of multivariate nonnormal data using methods describ …

Statistical Learning Functions (LearningRlab)
Aids in learning statistical functions incorporating the result of calculus done with each function and how they are obtained, that is, which equations …

Smooth Time-Dependent ROC Curve Estimation (smoothROCtime)
Computes smooth estimations for the Cumulative/Dynamic and Incident/Dynamic ROC curves, in presence of right censorship, based on the bivariate kernel …

Continue Reading…

### URI: Director, Data Analytics/DataSpark [Kingston, RI]

Provide leadership and administrative oversight for all data and analytic responsibilities of DataSpark. Apply by Dec 14, 2018.

Continue Reading…

### In case you missed it: October 2018 roundup

In case you missed them, here are some articles from October of particular interest to R users.

Peter Provost ports some 80's-era BASIC programs for kids to R.

In a podcast for Fringe FM, I discuss the ethics of AI, Microsoft and Open Source, and the R Community.
Roundup of AI, Machine Learning and Data Science news from October 2018.

In this episode of "Guy in a Cube", R is used to visualize Anscombe's Quartet via Power BI.

Di Cook suggests using computer vision to automate statistical model assessment for machine learning in the 2018 Belz Lecture.

R provides the analysis behind a front-page story on bridge safety in the Baltimore Sun.

Tomas Kalibera describes the big impacts of a small tweak to the logical comparison operators in R.

The Economist is now using R to calculate its famous "Big Mac Index".

Behind-the-scenes details of how R gets built on Windows, from a presentation by Jeroen Ooms.

The R Consortium has accepted another round of grant applications for R community projects.

A list of upcoming R conferences.

A recap of AI, Machine Learning and Data Science announcements from the Microsoft Ignite conference.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Continue Reading…

### Magister Dixit

“There is a dizzying array of algorithms from which to choose, and just making the choice between them presupposes that you have sufficiently advanced mathematical background to understand the alternatives and make a rational choice. The options are also changing, evolving constantly as a result of the work of some very bright, very dedicated researchers who are continually refining existing algorithms and coming up with new ones.” Ted Dunning, Ellen Friedman (2014)

Continue Reading…

### Quoting in R

Many R users appear to be big fans of "code capturing" or "non standard evaluation" (NSE) interfaces.
In this note we will discuss quoting and non-quoting interfaces in R. The above terms are simply talking about interfaces where a name to be used is captured from the source code the user typed, and thus does not need quote marks. For example:

d <- data.frame(x = 1)
d$x
## [1] 1

Notice both during data.frame creation and column access: the column name is given without quotes and also accessed without quotes.

This differs from using a standard value oriented interface as in the following:

d[["x"]]
## [1] 1

A natural reason for R users to look for automatic quoting is: it helps make working with columns in data.frames (R's primary data analysis structure) look much like working with variables in the environment. Without the quotes a column name looks very much like a variable name. And thinking of columns as variables is a useful mindset.
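As a small illustration, base R's with() evaluates an expression with a data.frame's columns visible as ordinary variables:

```r
d <- data.frame(x = c(1, 2, 3))

# the column is referred to like a variable, no quotes needed
with(d, mean(x))
## [1] 2

# equivalent standard, value-oriented access
mean(d[["x"]])
## [1] 2
```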

Another place implicit quoting shows up is with R's "combine" operator, where one can write either of the following.

c(a = "b")
##   a
## "b"
c("a" = "b")
##   a
## "b"

The wrapr package brings in a new function: qc() or "quoting c()" that gives a very powerful and convenient way to elide quotes.

library(wrapr)

qc(a = b)
##   a
## "b"

Notice quotes are not required on either side of the name assignment. Again, eliding quotes is not that big a deal, and not to everyone’s taste. For example, I have never seen a Python user feel they are missing anything because they write {"a": "b"} to construct their own named dictionary structure.

That being said, qc() is a very convenient and consistent notation if you do want to work in an NSE style.

For example, if it ever bothered you that dplyr join takes the join column names as a character vector you can use qc() to instead write:

dplyr::full_join(
iris, iris,
by = qc(Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width,
Species))

(Actually I very much like that the join takes the columns as a vector, as it is much easier to program over.) I feel the qc() grouping of the columns makes it easier for a reader to see which arguments are the column set than a use of ... would. Please take, as an example, the following dplyr::group_by():

library(dplyr)

starwars %>%
group_by(homeworld, species, add = FALSE) %>%
summarize(mass = mean(mass, na.rm = TRUE))
## # A tibble: 58 x 3
## # Groups:   homeworld [?]
##    homeworld      species    mass
##    <chr>          <chr>     <dbl>
##  1 Alderaan       Human      64
##  2 Aleen Minor    Aleena     15
##  3 Bespin         Human      79
##  4 Bestine IV     Human     110
##  5 Cato Neimoidia Neimodian  90
##  6 Cerea          Cerean     82
##  7 Champala       Chagrian  NaN
##  8 Chandrila      Human     NaN
##  9 Concord Dawn   Human      79
## 10 Corellia       Human      78.5
## # ... with 48 more rows

When coming back to such code later, I find the following notation to be easier to read:

library(seplyr)

starwars %>%
group_by_se(qc(homeworld, species), add = FALSE) %>%
summarize(mass = mean(mass, na.rm = TRUE))
## # A tibble: 58 x 3
## # Groups:   homeworld [?]
##    homeworld      species    mass
##    <chr>          <chr>     <dbl>
##  1 Alderaan       Human      64
##  2 Aleen Minor    Aleena     15
##  3 Bespin         Human      79
##  4 Bestine IV     Human     110
##  5 Cato Neimoidia Neimodian  90
##  6 Cerea          Cerean     82
##  7 Champala       Chagrian  NaN
##  8 Chandrila      Human     NaN
##  9 Concord Dawn   Human      79
## 10 Corellia       Human      78.5
## # ... with 48 more rows

In the above we can clearly see which arguments to the grouping command are intended to be column names, and which are not.

qc() is a powerful NSE tool that annotates and contains where we are expecting quoting behavior. Some possible applications include examples such as the following.

# install many packages
install.packages(qc(testthat, knitr, rmarkdown, R.rsp))

# select columns
iris[, qc(Petal.Length, Petal.Width, Species)]

# control a for-loop
for(col in qc(Petal.Length, Petal.Width)) {
iris[[col]] <- sqrt(iris[[col]])
}

# control a vapply
vapply(qc(Petal.Length, Petal.Width),
function(col) {
sum(is.na(iris[[col]]))
}, numeric(1))

The idea is: with qc() the user can switch name capturing notation at will, with no prior-arrangement needed in the functions or packages used. Also the parenthesis in qc() make for more legible code: a reader can see which arguments are being quoted and taken as a group.

As of wrapr 1.7.0 qc() incorporates bquote() functionality. bquote() is R's built-in quasi-quotation facility. It was added to R in August of 2003 by Thomas Lumley, and doesn’t get as much attention as it deserves.

A quoting tool such as qc() becomes a quasi-quoting tool if we add a notation that signals we do not wish to quote. In R the standard notation for this is ".()" (Lisp uses a back-tick, the data.table package uses "..", and the rlang package uses "!!"). The bquote()-enabled version of qc() lets us write code such as the following.

library(wrapr)

extra_column = "Species"

qc(Petal.Length, Petal.Width, extra_column)
## [1] "Petal.Length" "Petal.Width"  "extra_column"
qc(Petal.Length, Petal.Width, .(extra_column))
## [1] "Petal.Length" "Petal.Width"  "Species"

Notice it is un-ambiguous what is going on above. The first qc() quotes all of its arguments into strings. The second works much the same, with the exception of names marked with .(). This ability to "break out" or turn off quoting is convenient if we are working with a combination of values we wish to type in directly and others we wish to take from variables.

qc() allows substitution on the left-hand sides of assignments, if we use the alternate := notation for assignment (a convention put forward by data.table, and later adopted by dplyr).

library(wrapr)

left_name = "a"
right_value = "b"

qc(.(left_name) := .(right_value))
##   a
## "b"

The wrapr package also exports an implementation for :=. So one could also write:


left_name := right_value
##   a
## "b"

The hope is that the qc() and := operators are well behaved enough to commute, in the sense that the following two statements should return the same value.

library(wrapr)

qc(a := b, c := d)
##   a   c
## "b" "d"
qc(a, c) := qc(b, d)
##   a   c
## "b" "d"

The idea is: when there is a symmetry it is often evidence you are using the right concepts.

In conclusion: the goal of wrapr::qc() is to put a very regular and controllable quoting facility directly into the hands of the R user. This allows the R user to treat just about any R function or package as if the function or package itself implemented argument quoting and quasi-quotation capabilities.

### A deep dive into glmnet: standardize

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, we will focus on the standardize option.

For reference, here is the full signature of the glmnet function:

glmnet(x, y, family=c("gaussian","binomial","poisson","multinomial","cox","mgaussian"),
weights, offset=NULL, alpha = 1, nlambda = 100,
lambda.min.ratio = ifelse(nobs<nvars,0.01,0.0001), lambda=NULL,
standardize = TRUE, intercept=TRUE, thresh = 1e-07, dfmax = nvars + 1,
pmax = min(dfmax * 2+20, nvars), exclude, penalty.factor = rep(1, nvars),
lower.limits=-Inf, upper.limits=Inf, maxit=100000,
type.gaussian=ifelse(nvars<500,"covariance","naive"),
type.logistic=c("Newton","modified.Newton"),
standardize.response=FALSE, type.multinomial=c("ungrouped","grouped"))


Unless otherwise stated, $n$ will denote the number of observations, $p$ will denote the number of features, and fit will denote the output/result of the glmnet call. The data matrix is denoted by $X \in \mathbb{R}^{n \times p}$ and the response is denoted by $y \in \mathbb{R}^n$.

standardize

When standardize = TRUE (default), columns of the data matrix x are standardized, i.e. each column of x has mean 0 and standard deviation 1. More specifically, we have that for each $j = 1, \dots, p$,

$\displaystyle\sum_{i=1}^n X_{ij} = 0$, and $\sqrt{\displaystyle\sum_{i=1}^n \frac{X_{ij}^2}{n}} = 1$.

Why might we want to do this? Standardizing our features before model fitting is common practice in statistical learning. This is because if our features are on vastly different scales, the features with larger scales will tend to dominate the action. (One instance where we might not want to standardize our features is if they are already all measured along the same scale, e.g. meters or kilograms.)

Notice that the standardization here is slightly different from that offered by the scale function: scale(x, center = TRUE, scale = TRUE) gives the standardization

$\displaystyle\sum_{i=1}^n X_{ij} = 0$, and $\sqrt{\displaystyle\sum_{i=1}^n \frac{X_{ij}^2}{n-1}} = 1$.

We verify this with a small data example. Generate data according to the following code:

n <- 100; p <- 5; true_p <- 2
set.seed(950)
X <- matrix(rnorm(n * p), nrow = n)
beta <- matrix(c(rep(1, true_p), rep(0, p - true_p)), ncol = 1)
y <- X %*% beta + 3 * rnorm(n)


Create a version of the data matrix which has standardized columns:

X_centered <- apply(X, 2, function(x) x - mean(x))
Xs <- apply(X_centered, 2, function(x) x / sqrt(sum(x^2) / n))


Next, we run glmnet on Xs and y with both possible options for standardize:

library(glmnet)
fit <- glmnet(Xs, y, standardize = TRUE)
fit2 <- glmnet(Xs, y, standardize = FALSE)


We can check that we get the same fit in both cases (modulo numerical precision):

sum(fit$lambda != fit2$lambda)
# 0
max(abs(fit$beta - fit2$beta))
# 6.661338e-16


The documentation notes that the coefficients returned are on the original scale. Let’s confirm that with our small data set. Run glmnet with the original data matrix and standardize = TRUE:

fit3 <- glmnet(X, y, standardize = TRUE)


For each column $j$, our standardized variables are $Z_j = \dfrac{X_j - \mu_j}{s_j}$, where $\mu_j$ and $s_j$ are the mean and standard deviation of column $j$ respectively. If $\beta_j$ and $\gamma_j$ represent the model coefficients of fit2 and fit3 respectively, then we should have

\begin{aligned} \beta_0 + \sum_{j=1}^p \beta_j Z_j &= \gamma_0 + \sum_{j=1}^p \gamma_j X_j, \\ \beta_0 + \sum_{j=1}^p \beta_j \frac{X_j - \mu_j}{s_j} &= \gamma_0 + \sum_{j=1}^p \gamma_j X_j, \\ \left( \beta_0 - \sum_{j=1}^p \frac{\beta_j \mu_j}{s_j} \right) + \sum_{j=1}^p \frac{\beta_j}{s_j} X_j &= \gamma_0 + \sum_{j=1}^p \gamma_j X_j, \end{aligned}

i.e. we should have $\gamma_0 = \beta_0 - \sum_{j=1}^p \frac{\beta_j \mu_j}{s_j}$ and $\gamma_j = \frac{\beta_j}{s_j}$ for $j = 1, \dots, p$. The code below checks that this is indeed the case (modulo numerical precision):

# get column means and SDs
X_mean <- colMeans(X)
X_sd <- apply(X_centered, 2, function(x) sqrt(sum(x^2) / n))

# check difference for intercepts
fit2_int <- coefficients(fit2)[1,]
fit3_int <- coefficients(fit3)[1,]
temp <- fit2_int - colSums(diag(X_mean / X_sd) %*% fit2$beta)
max(abs(temp - fit3_int))
# 1.110223e-16

# check difference for feature coefficients
temp <- diag(1 / X_sd) %*% fit2$beta

## Parallel operation

Another advantage of this method is that it can be parallelized in a data-parallel fashion: multiple workers can update the posterior simultaneously on different minibatches of data. Expectation propagation provides a principled way of combining the parameter updates computed by parallel workers. Indeed, this is what the PBODL algorithm of Tencent does.
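As a rough sketch of why this combination is well defined (a toy illustration of my own, not the PBODL algorithm): if each worker returns a Gaussian posterior over the same scalar parameter, multiplying the worker posteriors and dividing out the shared prior K - 1 times recovers the posterior given all minibatches. In natural (precision) parameters this is just sums and differences:

```r
# Toy illustration (assumed Gaussian posteriors, scalar parameter):
# combine K worker posteriors N(mu_k, 1/prec_k) that all started
# from the common prior N(prior_mu, 1/prior_prec).
combine_posteriors <- function(worker_mu, worker_prec,
                               prior_mu = 0, prior_prec = 1) {
  K <- length(worker_mu)
  # product of K Gaussians divided by prior^(K-1), in natural parameters
  prec <- sum(worker_prec) - (K - 1) * prior_prec
  mu <- (sum(worker_mu * worker_prec) - (K - 1) * prior_mu * prior_prec) / prec
  list(mu = mu, prec = prec)
}

# with a single worker, the combination is just that worker's posterior
res <- combine_posteriors(worker_mu = 2, worker_prec = 4)
res
## $mu: 2, $prec: 4
```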

Bayesian online learning comes with great advantages. In recommender systems and ads marketplaces, the Bayesian online learning approach handles the user and item cold-start problem gracefully. If there is a new ad in the system that has to be recommended, initially our network might say "I don't really know", and then gradually hone in on a confident prediction as more data is observed.

One can use the uncertainty estimates in a Bayesian network to perform active learning: actively proposing which items to label in order to increase predictive performance in the future. In the context of ads, this may be a useful feature if implemented with care.

Finally, there is the useful principle that Bayes never forgets: if we perform exact Bayesian updates on global parameters of an exchangeable model, the posterior will retain information about all datapoints indefinitely, including very early ones. This advantage is clearly demonstrated in DeepMind's work on catastrophic forgetting (Kirkpatrick et al, 2017). This capacity to keep a long memory of course diminishes the more approximations we have to make, which leads me to drawbacks.

## Drawbacks

The ADF method is only approximate, and over time the approximations made in each step may accumulate, causing the network to essentially forget what it learned from earlier datapoints. It is also worth pointing out that in stationary situations the posterior over parameters is expected to shrink, especially in the last hidden layers of the network, where there are often fewer parameters. This posterior compression is often countered by introducing explicit forgetting: without evidence to the contrary, the variance of each parameter is marginally increased in each step. Forgetting and the lack of infinite memory may turn out to be an advantage, not a bug, in non-stationary situations where more recent datapoints are more representative of future test cases.
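A minimal sketch of what explicit forgetting looks like for a single Gaussian-distributed parameter (again a toy model of my own, not probabilistic backprop itself): each step performs a conjugate Gaussian update, then inflates the posterior variance slightly, so the posterior never collapses completely and recent observations keep some influence:

```r
# One ADF-style step for a scalar parameter with Gaussian observation noise,
# followed by explicit forgetting (multiplicative variance inflation).
adf_step <- function(mu, v, obs, noise_var = 1, forget = 1.01) {
  v_post <- 1 / (1 / v + 1 / noise_var)          # conjugate Gaussian update
  mu_post <- v_post * (mu / v + obs / noise_var)
  list(mu = mu_post, v = v_post * forget)        # inflate variance: forgetting
}

state <- list(mu = 0, v = 10)  # vague prior
for (obs in c(1.2, 0.8, 1.1)) {
  state <- adf_step(state$mu, state$v, obs)
}
# the posterior mean moves toward the data while the variance shrinks,
# but the forget factor keeps it from shrinking all the way to zero
```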

A practical drawback of a method like this in production environments is that probabilistic forward- and backpropagation require non-trivial custom implementations. Probabilistic backprop does not use reverse-mode automatic differentiation, i.e. vanilla backprop, as a subroutine. As a consequence, one cannot rely on the extensively tested and widely used autodiff machinery in TensorFlow or PyTorch, and supporting probabilistic backprop in new layer types might require significant effort. It is possible that high-quality message passing packages such as the recently open-sourced Infer.NET will see a renaissance after all.