# My Data Science Blogs

## January 19, 2019

### “Either the results are completely wrong, or Nasa has confirmed a major breakthrough in space propulsion.”

Daniel Lakeland points us to this news article by David Hambling from 2014, entitled “Nasa validates ‘impossible’ space drive.” Here’s Hambling:

> Nasa is a major player in space science, so when a team from the agency this week presents evidence that “impossible” microwave thrusters seem to work, something strange is definitely going on. Either the results are completely wrong, or Nasa has confirmed a major breakthrough in space propulsion. . . .
>
> He [Roger Shawyer] has built a number of demonstration systems, but critics reject his relativity-based theory and insist that, according to the law of conservation of momentum, it cannot work.
>
> According to good scientific practice, an independent third party needed to replicate Shawyer’s results. As Wired.co.uk reported, this happened last year when a Chinese team built its own EmDrive and confirmed that it produced 720 mN (about 72 grams) of thrust, enough for a practical satellite thruster. . . . a US scientist, Guido Fetta, has built his own propellant-less microwave thruster, and managed to persuade Nasa to test it out. The test results were presented on July 30 at the 50th Joint Propulsion Conference in Cleveland, Ohio. Astonishingly enough, they are positive. . . .

OK, that was 3.5 years ago. Any followups? A quick google search revealed this article by Giulio Prisco from 2017, “Theoretical Physicists Are Getting Closer to Explaining How NASA’s ‘Impossible’ EmDrive Works: The EmDrive propulsion system might be able to take us to the stars, but first it must be reconciled with the laws of physics.”

If I wanted to be snarky, I’d say they could do a 2-for-1 deal and power the EmDrive with cold fusion. But my physics knowledge is weak, so I’ll just say . . . who knows, maybe this is the interstellar drive we’ve all been waiting for! I’ll believe it once it appears in PNAS.

### Why Ice Cream Is Linked to Shark Attacks – Correlation/Causation Smackdown

Why are soda and ice cream each linked to violence? This article delivers the final word on what people mean by "correlation does not imply causation."
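The standard resolution is a lurking confounder: hot weather drives both ice cream sales and beach swimming (and hence shark encounters). A minimal simulation sketch (all variable names and coefficients here are invented for illustration) shows a strong correlation appearing between two series with no causal link between them:

```python
import random

random.seed(0)

# Hypothetical toy model: temperature (the confounder) drives both series;
# neither one causes the other.
n = 1000
temps = [random.gauss(20, 5) for _ in range(n)]
ice_cream_sales = [2.0 * t + random.gauss(0, 3) for t in temps]
shark_attacks = [0.5 * t + random.gauss(0, 3) for t in temps]

def pearson(xs, ys):
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Strongly positive correlation, yet the causal effect of one on the other is zero.
r = pearson(ice_cream_sales, shark_attacks)
```

Conditioning on the confounder (comparing days with similar temperatures) makes the association largely disappear, which is the practical content of the slogan.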

### What’s new on arXiv

We present a novel notion of outlier, called the Concentration Free Outlier Factor, or CFOF. As a main contribution, we formalize the notion of concentration of outlier scores and theoretically prove that CFOF does not concentrate in Euclidean space for any arbitrarily large dimensionality. To the best of our knowledge, there are no other proposals of data analysis measures related to the Euclidean distance for which theoretical evidence has been provided that they are immune to the concentration effect. We determine the closed form of the distribution of CFOF scores in arbitrarily large dimensionalities and show that the CFOF score of a point depends on its squared norm standard score and on the kurtosis of the data distribution, thus providing a clear and statistically founded characterization of this notion. Moreover, we leverage this closed form to provide evidence that the definition does not suffer from the hubness problem affecting other measures. We prove that the number of CFOF outliers coming from each cluster is proportional to cluster size and kurtosis, a property that we call semi-locality. We determine that semi-locality characterizes existing reverse nearest neighbor-based outlier definitions, thus clarifying the exact nature of their observed local behavior. We also formally prove that classical distance-based and density-based outliers concentrate both for bounded and unbounded sample sizes and for fixed and variable values of the neighborhood parameter. We introduce the fast-CFOF algorithm for detecting outliers in large high-dimensional datasets. The algorithm has linear cost, supports multi-resolution analysis, and is embarrassingly parallel. Experiments highlight that the technique is able to efficiently process huge datasets, to deal even with large values of the neighborhood parameter, to avoid concentration, and to obtain excellent accuracy.
Networked sensors and actuators are now prevalent in many real-world systems such as smart buildings, factories, power plants, and data centers, and generate substantial amounts of multivariate time series data for these systems. The rich sensor data can be continuously monitored for intrusion events through anomaly detection. However, conventional threshold-based anomaly detection methods are inadequate due to the dynamic complexities of these systems, while supervised machine learning methods are unable to exploit the large amounts of data due to the lack of labeled data. On the other hand, current unsupervised machine learning approaches have not fully exploited the spatial-temporal correlation and other dependencies amongst the multiple variables (sensors/actuators) in the system for detecting anomalies. In this work, we propose an unsupervised multivariate anomaly detection method based on Generative Adversarial Networks (GANs). Instead of treating each data stream independently, our proposed MAD-GAN framework considers the entire variable set concurrently to capture the latent interactions amongst the variables. We also fully exploit both the generator and discriminator produced by the GAN, using a novel anomaly score called DR-score to detect anomalies by discrimination and reconstruction. We have tested our proposed MAD-GAN using two recent datasets collected from real-world CPS: the Secure Water Treatment (SWaT) and the Water Distribution (WADI) datasets. Our experimental results showed that the proposed MAD-GAN is effective in reporting anomalies caused by various cyber-intrusions in these complex real-world systems.
Motivated by the success of using black-box predictive algorithms as subroutines for online decision-making, we develop a new framework for designing online policies given access to an oracle providing statistical information about an offline benchmark. Having access to such prediction oracles enables simple and natural Bayesian selection policies, and raises the question as to how these policies perform in different settings. Our work makes two important contributions towards tackling this question: First, we develop a general technique we call *compensated coupling* which can be used to derive bounds on the expected regret (i.e., additive loss with respect to a benchmark) for any online policy and offline benchmark; Second, using this technique, we show that the Bayes Selector has constant expected regret (i.e., independent of the number of arrivals and resource levels) in any online packing and matching problem with a finite type-space. Our results generalize and simplify many existing results for online packing and matching problems, and suggest a promising pathway for obtaining oracle-driven policies for other online decision-making settings.
The Internet of Things (IoT) has gained substantial attention over the past years, and the main discussion has been how to process the amount of data that it generates, which has led to the edge computing paradigm. Whether it is called fog, edge, or mist, the principle remains that cloud services must become available closer to clients. This document presents ongoing work on future edge systems that are built to provide steadfast IoT services to users by bringing storage and processing power closer to peripheral parts of networks. Designing such infrastructures is becoming much more challenging as the number of IoT devices keeps growing. Production-grade deployments have to meet very high performance requirements, and end-to-end solutions involve significant investments. In this paper, we aim at providing a solution to extend the range of the edge model to the very farthest nodes in the network. Specifically, we focus on providing reliable storage and computation capabilities immediately on wireless IoT sensor nodes. This extended edge model will allow end users to manage their IoT ecosystem without necessarily relying on gateways or Internet provider solutions. In this document, we introduce Achlys, a prototype implementation of an edge node that is a concrete port of the Lasp programming library on the GRiSP Erlang embedded system. This way, we aim at addressing the need for a general-purpose edge that is both resilient and consistent in terms of storage and network. Finally, we study example use cases that could take advantage of integrating the Achlys framework and discuss future work for the latter.
The next generation of embedded Information and Communication Technology (ICT) systems consists of interconnected, collaborative intelligent systems able to perform autonomous tasks. Training and deployment of such systems on edge devices, however, require a fine-grained integration of data and tools to achieve high accuracy and satisfy functional and non-functional requirements. In this work, we present a modular AI pipeline as an integrating framework to bring data, algorithms, and deployment tools together. By these means, we are able to interconnect the different entities or stages of particular systems and provide an end-to-end development of AI products. We demonstrate the effectiveness of the AI pipeline by solving an Automatic Speech Recognition challenge, and we show all the steps leading to an end-to-end development for keyword-spotting tasks: importing, partitioning, and pre-processing of speech data, training of different neural network architectures, and their deployment on heterogeneous embedded platforms.
Prevailing user authentication schemes on smartphones rely on explicit user interaction, where a user types in a passcode or presents a biometric cue such as face, fingerprint, or iris. In addition to being cumbersome and obtrusive to the users, such authentication mechanisms pose security and privacy concerns. Passive authentication systems can tackle these challenges by frequently and unobtrusively monitoring the user’s interaction with the device. In this paper, we propose a Siamese Long Short-Term Memory network architecture for passive authentication, where users can be verified without requiring any explicit authentication step. We acquired a dataset comprising measurements from 30 smartphone sensor modalities for 37 users. We evaluate our approach on 8 dominant modalities, namely, keystroke dynamics, GPS location, accelerometer, gyroscope, magnetometer, linear accelerometer, gravity, and rotation sensors. Experimental results find that, within 3 seconds, a genuine user can be correctly verified 97.15% of the time at a false accept rate of 0.1%.
For optimization of a sum of functions in a distributed computing environment, we present a novel communication efficient Newton-type algorithm that enjoys a variety of advantages over similar existing methods. Similar to Newton-MR, our algorithm, DINGO, is derived by optimization of the gradient’s norm as a surrogate function. DINGO does not impose any specific form on the underlying functions, and its application range extends far beyond convexity. In addition, the distribution of the data across the computing environment can be almost arbitrary. Further, the underlying sub-problems of DINGO are simple linear least-squares, for which a plethora of efficient algorithms exist. Lastly, DINGO involves a few hyper-parameters that are easy to tune. Moreover, we theoretically show that DINGO is not sensitive to the choice of its hyper-parameters in that a strict reduction in the gradient norm is guaranteed, regardless of the selected hyper-parameters. We demonstrate empirical evidence of the effectiveness, stability and versatility of our method compared to other relevant algorithms, on both convex and non-convex problems.
SkinnerDB is designed from the ground up for reliable join ordering. It maintains no data statistics and uses no cost or cardinality models. Instead, it uses reinforcement learning to learn optimal join orders on the fly, during the execution of the current query. To that purpose, we divide the execution of a query into many small time slices. Different join orders are tried in different time slices. We merge result tuples generated according to different join orders until a complete result is obtained. By measuring execution progress per time slice, we identify promising join orders as execution proceeds. Along with SkinnerDB, we introduce a new quality criterion for query execution strategies. We compare expected execution cost against execution cost for an optimal join order. SkinnerDB features multiple execution strategies that are optimized for that criterion. Some of them can be executed on top of existing database systems. For maximal performance, we introduce a customized execution engine, facilitating fast join order switching via specialized multi-way join algorithms and tuple representations. We experimentally compare SkinnerDB’s performance against various baselines, including MonetDB, Postgres, and adaptive processing methods. We consider various benchmarks, including the join order benchmark and TPC-H variants with user-defined functions. Overall, the overheads of reliable join ordering are negligible compared to the performance impact of the occasional, catastrophic join order choice.
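The slice-and-measure idea can be caricatured as a bandit problem: each candidate join order is an arm, observed progress per time slice is the reward, and future slices are allocated epsilon-greedily. The sketch below is our own toy with invented progress rates, not SkinnerDB’s actual regret-bounded strategy:

```python
import random

random.seed(1)

# Hidden "tuples produced per time slice" for each candidate join order;
# the executor only observes noisy per-slice progress.
true_rate = {"RST": 5.0, "STR": 1.0, "TRS": 0.5}
progress = {o: 0.0 for o in true_rate}
slices = {o: 0 for o in true_rate}

for t in range(500):
    if t < 3 or random.random() < 0.1:        # explore occasionally
        order = random.choice(list(true_rate))
    else:                                      # exploit best observed rate
        order = max(progress, key=lambda o: progress[o] / max(slices[o], 1))
    progress[order] += random.gauss(true_rate[order], 0.5)
    slices[order] += 1

best = max(slices, key=slices.get)             # order that won the most slices
```

After a few hundred slices, nearly all execution time goes to the join order with the best measured progress, which is the intuition behind learning the order during the query itself.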
Graph matching is an important and persistent problem in computer vision and pattern recognition for finding node-to-node correspondence between graph-structured data. However, as widely used, graph matching that incorporates pairwise constraints can be formulated as a quadratic assignment problem (QAP), which is NP-complete and results in intrinsic computational difficulties. In this paper, we present a functional representation for graph matching (FRGM) that aims to provide more geometric insights on the problem and reduce the space and time complexities of corresponding algorithms. To achieve these goals, we represent a graph endowed with edge attributes by a linear function space equipped with a functional such as inner product or metric, that has an explicit geometric meaning. Consequently, the correspondence between graphs can be represented as a linear representation map of that functional. Specifically, we reformulate the linear functional representation map as a new parameterization for Euclidean graph matching, which is associative with geometric parameters for graphs under rigid or nonrigid deformations. This allows us to estimate the correspondence and geometric deformations simultaneously. The use of the representation of edge attributes rather than the affinity matrix enables us to reduce the space complexity by two orders of magnitude. Furthermore, we propose an efficient optimization strategy with low time complexity to optimize the objective function. The experimental results on both synthetic and real-world datasets demonstrate that the proposed FRGM can achieve state-of-the-art performance.
In this paper, we deal with the problem of estimating the intervention effect in the statistical causal analysis using the structural equation model and the causal diagram. The intervention effect is defined as a causal effect on the response variable $Y$ when the causal variable $X$ is fixed to a certain value by an external operation and is defined based on the causal diagram. The intervention effect is defined as a function of the probability distributions in the causal diagram; in general, however, these distributions are unknown and must be estimated from data. In other words, the steps of the estimation of the intervention effect using the causal diagram are as follows: 1. Estimate the causal diagram from the data, 2. Estimate the probability distributions in the causal diagram from the data, 3. Calculate the intervention effect. However, if the problem of estimating the intervention effect is formulated in the statistical decision theory framework, estimation with this procedure is not necessarily optimal. In this study, we formulate the problem of estimating the intervention effect for two cases, where the causal diagram is known and where it is unknown, in the framework of statistical decision theory and derive the optimal decision method under the Bayesian criterion. We show the effectiveness of the proposed method through numerical simulations.
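For the textbook confounder diagram Z → X, Z → Y, X → Y, step 3 reduces to the back-door adjustment formula P(y | do(x)) = Σ_z P(y | x, z) P(z). A worked toy example with invented binary-variable probabilities:

```python
# Invented probabilities for a diagram Z -> X, Z -> Y, X -> Y (all binary).
p_z = {0: 0.6, 1: 0.4}               # P(Z = z)
p_y1_given_xz = {                    # P(Y = 1 | X = x, Z = z)
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.5, (1, 1): 0.7,
}

def p_y1_do(x):
    """Back-door adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

# Average causal effect of setting X = 1 versus X = 0 on P(Y = 1).
effect = p_y1_do(1) - p_y1_do(0)
```

With these numbers, P(Y=1 | do(X=1)) = 0.5·0.6 + 0.7·0.4 = 0.58 and P(Y=1 | do(X=0)) = 0.18, giving an intervention effect of 0.40; the paper’s point is that estimating the diagram and the distributions separately and then plugging in, as done here, need not be decision-theoretically optimal.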
Sentence embedding is a significant research topic in the field of natural language processing (NLP). Generating sentence embedding vectors reflecting the intrinsic meaning of a sentence is a key factor to achieve an enhanced performance in various NLP tasks such as sentence classification and document summarization. Therefore, various sentence embedding models based on supervised and unsupervised learning have been proposed after the advent of research regarding the distributed representation of words. They were evaluated through semantic textual similarity (STS) tasks, which measure the degree of semantic preservation of a sentence, and neural network-based supervised embedding models generally yielded state-of-the-art performance. However, these models have a limitation in that they have multiple parameters to update, thereby requiring a tremendous amount of labeled training data. In this study, we propose an efficient approach that learns a transition matrix that refines a sentence embedding vector to reflect the latent semantic meaning of a sentence. The proposed method has two practical advantages: (1) it can be applied to any sentence embedding method, and (2) it can achieve robust performance in STS tasks irrespective of the number of training examples.
This paper considers a high dimensional linear regression model with correlated variables. A variety of methods have been developed in recent years, yet it is still challenging to maintain accurate estimation when there are complex correlation structures among predictors and the response. We propose an adaptive and ‘reversed’ penalty for regularization to solve this problem. This penalty does not shrink variables but focuses on removing the shrinkage bias and encouraging a grouping effect. Combining the l_1 penalty and the Minimax Concave Penalty (MCP), we propose two methods called Smooth Adjustment for Correlated Effects (SACE) and Generalized Smooth Adjustment for Correlated Effects (GSACE). Compared with the traditional adaptive estimator, the proposed methods have less influence from the initial estimator and can reduce the false negatives of the initial estimation. The proposed methods can be seen as linear functions of the new penalty’s tuning parameter, and are shown to estimate the coefficients accurately in both the extremely highly correlated and the weakly correlated variable settings. Under mild regularity conditions we prove that the methods satisfy a certain oracle property. We show by simulations and applications that the proposed methods often outperform other methods.
Sequential decision-making (SDM) plays a key role in intelligent robotics, and can be realized in very different ways, such as supervised learning, automated reasoning, and probabilistic planning. The three families of methods follow different assumptions and have different (dis)advantages. In this work, we aim at a robot SDM framework that exploits the complementary features of learning, reasoning, and planning. We utilize long short-term memory (LSTM) for passive state estimation with streaming sensor data, and commonsense reasoning and probabilistic planning (CORPP) for active information collection and task accomplishment. In experiments, a mobile robot is tasked with estimating human intentions using their motion trajectories, declarative contextual knowledge, and human-robot interaction (dialog-based and motion-based). Results suggest that our framework performs better than its no-learning and no-reasoning versions in a real-world office environment.
This paper surveys the machine learning literature and presents machine learning as optimization models. Such models can benefit from the advancement of numerical optimization techniques which have already played a distinctive role in several machine learning settings. Particularly, mathematical optimization models are presented for commonly used machine learning approaches for regression, classification, clustering, and deep neural networks, as well as new emerging applications in machine teaching and empirical model learning. The strengths and the shortcomings of these models are discussed and potential research directions are highlighted.
Domain adaptation has become a prominent problem setting in machine learning and related fields. This review asks the questions: when and how can a classifier learn from a source domain and generalize to a target domain? As for when, we review conditions that allow for cross-domain generalization error bounds. As for how, we present a categorization of approaches, divided into what we refer to as sample-based, feature-based, and inference-based methods. Sample-based methods focus on weighting individual observations during training based on their importance to the target domain. Feature-based methods focus on mapping, projecting, and representing features such that a source classifier performs well on the target domain, and inference-based methods focus on alternative estimators, such as robust, minimax, or Bayesian ones. Our categorization highlights recurring ideas and raises a number of questions important to further research.
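The sample-based idea can be sketched in a few lines: reweight each source observation by the density ratio p_target(x)/p_source(x), known here by construction but estimated in practice, so that a weighted average over source data estimates a target-domain quantity. All names and numbers below are our own illustration:

```python
import math
import random

random.seed(2)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution (used for the known density ratio)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Source and target domains differ only in their mean.
mu_source, mu_target, sigma = 0.0, 1.0, 1.0
xs = [random.gauss(mu_source, sigma) for _ in range(20000)]
weights = [normal_pdf(x, mu_target, sigma) / normal_pdf(x, mu_source, sigma) for x in xs]

# Self-normalized importance-weighted estimate of the target mean,
# computed from source samples only.
weighted_mean = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
plain_mean = sum(xs) / len(xs)
```

The unweighted source mean stays near 0, while the reweighted estimate moves close to the target mean of 1, which is exactly the mechanism sample-based methods exploit during training.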
TensorFlow.js is a library for building and executing machine learning algorithms in JavaScript. TensorFlow.js models run in a web browser and in the Node.js environment. The library is part of the TensorFlow ecosystem, providing a set of APIs that are compatible with those in Python, allowing models to be ported between the Python and JavaScript ecosystems. TensorFlow.js has empowered a new set of developers from the extensive JavaScript community to build and deploy machine learning models and enabled new classes of on-device computation. This paper describes the design, API, and implementation of TensorFlow.js, and highlights some of the impactful use cases.
In this work, we study value function approximation in reinforcement learning (RL) problems with high dimensional state or action spaces via a generalized version of representation policy iteration (RPI). We consider the limitations of proto-value functions (PVFs) at accurately approximating the value function in low dimensions and we highlight the importance of feature learning for an improved low-dimensional value function approximation. Then, we adopt different representation learning algorithms on graphs to learn the basis functions that best represent the value function. We empirically show that node2vec, an algorithm for scalable feature learning in networks, and the Variational Graph Auto-Encoder consistently outperform the commonly used smooth proto-value functions in low-dimensional feature space.
Data competitions rely on real-time leaderboards to rank competitor entries and stimulate algorithm improvement. While such competitions have become quite popular and prevalent, particularly in supervised learning formats, their implementations by the host are highly variable. Without careful planning, a supervised learning competition is vulnerable to overfitting, where the winning solutions are so closely tuned to the particular set of provided data that they cannot generalize to the underlying problem of interest to the host. This paper outlines some important considerations for strategically designing relevant and informative data sets to maximize the learning outcome from hosting a competition based on our experience. It also describes a post-competition analysis that enables robust and efficient assessment of the strengths and weaknesses of solutions from different competitors, as well as greater understanding of the regions of the input space that are well-solved. The post-competition analysis, which complements the leaderboard, uses exploratory data analysis and generalized linear models (GLMs). The GLMs not only expand the range of results we can explore, they also provide more detailed analysis of individual sub-questions including similarities and differences between algorithms across different types of scenarios, universally easy or hard regions of the input space, and different learning objectives. When coupled with a strategically planned data generation approach, the methods provide richer and more informative summaries to enhance the interpretation of results beyond just the rankings on the leaderboard. The methods are illustrated with a recently completed competition to evaluate algorithms capable of detecting, identifying, and locating radioactive materials in an urban environment.
We propose a new architecture that learns to attend to different Convolutional Neural Network (CNN) layers (i.e., different levels of abstraction) and different spatial locations (i.e., specific positions within a given feature map) in a sequential manner to perform the task at hand. Specifically, at each Recurrent Neural Network (RNN) timestep, a CNN layer is selected and its output is processed by a spatial soft-attention mechanism. We refer to this architecture as the Unified Attention Network (UAN), since it combines the ‘what’ and ‘where’ aspects of attention, i.e., ‘what’ level of abstraction to attend to, and ‘where’ the network should look. We demonstrate the effectiveness of this approach on two computer vision tasks: (i) image-based camera pose and orientation regression and (ii) indoor scene classification. We evaluate our method on standard benchmarks for camera localization (Cambridge, 7-Scene, and TUM-LSI datasets) and for scene classification (MIT-67 indoor dataset), and show that our method improves upon the results of previous methods. Empirically, we show that combining ‘what’ and ‘where’ aspects of attention improves network performance on both tasks.
The majority of conversations a dialogue agent sees over its lifetime occur after it has already been trained and deployed, leaving a vast store of potential training signal untapped. In this work, we propose the self-feeding chatbot, a dialogue agent with the ability to extract new training examples from the conversations it participates in. As our agent engages in conversation, it also estimates user satisfaction in its responses. When the conversation appears to be going well, the user’s responses become new training examples to imitate. When the agent believes it has made a mistake, it asks for feedback; learning to predict the feedback that will be given improves the chatbot’s dialogue abilities further. On the PersonaChat chit-chat dataset with over 131k training examples, we find that learning from dialogue with a self-feeding chatbot significantly improves performance, regardless of the amount of traditional supervision.
In this paper, we introduce a new data freshness metric, relative Age of Information (rAoI), and examine it in a single server system with various packet management schemes. The (classical) AoI metric was introduced to measure the staleness of status updates at the receiving end with respect to their generation at the source. This metric addresses systems where the timings of update generation at the source are absolute and can be designed separately or jointly with the transmission schedules. In many decentralized applications, transmission schedules are blind to update generation timing, and the transmitter can know the timing of an update packet only after it arrives. As such, an update becomes stale after a new one arrives. The rAoI metric measures how fresh the data is at the receiver with respect to the data at the transmitter. It introduces a particularly explicit dependence on the arrival process in the evaluation of age. We investigate several queuing disciplines and provide closed form expressions for rAoI and numerical comparisons.
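For intuition, the classical AoI that rAoI builds on is easy to compute from a log of delivered updates: at time t, the age is t minus the generation time of the freshest update received by t. A minimal sketch with invented timestamps (our illustration, not the paper’s queueing analysis):

```python
# Invented timestamps: (generation_time, reception_time) of delivered updates.
updates = [(0.0, 1.0), (2.0, 2.5), (3.0, 5.0)]

def age_at(t, updates):
    """Classical instantaneous AoI: t minus the generation time of the
    freshest update received by time t (infinite before the first arrival)."""
    gens = [g for g, r in updates if r <= t]
    return t - max(gens) if gens else float("inf")
```

Each reception resets the age to the update’s delay, after which it grows linearly, giving the familiar sawtooth; rAoI replaces the absolute generation timeline with the age of the data currently held at the transmitter.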
In this paper we propose a new training loop for deep reinforcement learning agents with an evolutionary generator. Evolutionary procedural content generation has been used in the creation of maps and levels for games before. Our system incorporates an evolutionary map generator to construct a training curriculum that is evolved to maximize loss within the state-of-the-art Double Dueling Deep Q Network architecture with prioritized replay. We present a case study in which we prove the efficacy of our new method on Attackers and Defenders, a game we made with a large, discrete action space. Our results demonstrate that training on an evolutionarily-curated curriculum (directed sampling) of maps both expedites training and improves generalization when compared to a network trained on an undirected sampling of maps.
We develop a likelihood free inference procedure for conditioning a probabilistic model on a predicate. A predicate is a Boolean valued function which expresses a yes/no question about a domain. Our contribution, which we call predicate exchange, constructs a softened predicate which takes values in the unit interval [0, 1] as opposed to simply true or false. Intuitively, 1 corresponds to true, and a high value (such as 0.999) corresponds to ‘nearly true’ as determined by a distance metric. We define a Boolean algebra for soft predicates, such that they can be negated, conjoined, and disjoined arbitrarily. A softened predicate can serve as a tractable proxy to a likelihood function for approximate posterior inference. However, to target exact inference, we temper the relaxation by a temperature parameter and add an accept/reject phase using replica exchange Markov chain Monte Carlo, which exchanges states between a sequence of models conditioned on predicates at varying temperatures. We describe a lightweight implementation of predicate exchange that provides a language-independent layer which can be implemented on top of existing modeling formalisms.
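A minimal sketch of the softened-predicate idea (the function names and the exponential kernel are our illustration, not the paper’s API): a soft predicate returns a value in [0, 1], and negation, conjunction, and disjunction follow a simple fuzzy Boolean algebra:

```python
import math

def soft_eq(x, y, temp=1.0):
    """Soft equality: 1.0 iff x == y, decaying with distance; the
    temperature controls how sharply 'nearly true' falls off."""
    return math.exp(-abs(x - y) / temp)

def soft_not(p):
    return 1.0 - p

def soft_and(p, q):
    return min(p, q)

def soft_or(p, q):
    return max(p, q)
```

Composing such predicates yields a tractable surrogate likelihood, and annealing `temp` toward 0 recovers the hard true/false predicate, which is where the replica-exchange step comes in.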

### R Packages worth a look

Inventory Analytics and Cost Calculations (inventorize)
Facilitate inventory analysis calculations. The package heavily relies on my studies; it includes calculations of inventory metrics, profit ca …

Ordinal Outcomes: Generalized Linear Models with the Log Link (lcpm)
An implementation of the Log Cumulative Probability Model (LCPM) and Proportional Probability Model (PPM) for which the Maximum Likelihood Estimates ar …

The adapted pair correlation function transfers the concept of the pair correlation function from point patterns to patterns of objects of finite size …

### If you did not already know

Effect Size
In statistics, an effect size is a quantitative measure of the strength of a phenomenon. Examples of effect sizes are the correlation between two variables, the regression coefficient, the mean difference, or even the risk with which something happens, such as how many people survive after a heart attack for every one person that does not survive. For each type of effect-size, a larger absolute value always indicates a stronger effect. Effect sizes complement statistical hypothesis testing, and play an important role in statistical power analyses, sample size planning, and in meta-analyses. Especially in meta-analysis, where the purpose is to combine multiple effect-sizes, the standard error of effect-size is of critical importance. The S.E. of effect-size is used to weight effect-sizes when combining studies, so that large studies are considered more important than small studies in the analysis. The S.E. of effect-size is calculated differently for each type of effect-size, but generally only requires knowing the study’s sample size (N), or the number of observations in each group (n’s). Reporting effect sizes is considered good practice when presenting empirical research findings in many fields. The reporting of effect sizes facilitates the interpretation of the substantive, as opposed to the statistical, significance of a research result. Effect sizes are particularly prominent in social and medical research. Relative and absolute measures of effect size convey different information, and can be used complementarily. …
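As a concrete instance, Cohen’s d (a standardized mean difference) and the usual large-sample approximation of its standard error, the quantity used above to weight studies in meta-analysis, can be sketched as follows (the SE formula is the common approximation, stated here without derivation):

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference between two groups."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

def se_of_d(d, n1, n2):
    """Large-sample approximation to the standard error of d; it shrinks
    as the group sizes grow, so bigger studies get more weight."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

d = cohens_d(10.0, 2.0, 50, 9.0, 2.0, 50)   # equal SDs of 2, mean gap of 1
se = se_of_d(d, 50, 50)
```

Here d = 0.5 (a “medium” effect by the usual rule of thumb), and the SE depends essentially only on the two sample sizes, as the definition above notes.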

Total Distance Multivariance
We introduce two new measures for the dependence of $n \ge 2$ random variables: ‘distance multivariance’ and ‘total distance multivariance’. Both measures are based on the weighted $L^2$-distance of quantities related to the characteristic functions of the underlying random variables. They extend distance covariance (introduced by Szekely, Rizzo and Bakirov) and generalized distance covariance (introduced in part I) from pairs of random variables to $n$-tuplets of random variables. We show that total distance multivariance can be used to detect the independence of $n$ random variables and has a simple finite-sample representation in terms of distance matrices of the sample points, where distance is measured by a continuous negative definite function. Based on our theoretical results, we present a test for independence of multiple random vectors which is consistent against all alternatives. …

Data Transfer Project (DTP)
The Data Transfer Project was formed in 2017 to create an open-source, service-to-service data portability platform so that all individuals across the web could easily move their data between online service providers whenever they want. The contributors to the Data Transfer Project believe portability and interoperability are central to innovation. Making it easier for individuals to choose among services facilitates competition, empowers individuals to try new services and enables them to choose the offering that best suits their needs.
Data Transfer Project (DTP) is a collaboration of organizations committed to building a common framework with open-source code that can connect any two online service providers, enabling a seamless, direct, user initiated portability of data between the two platforms.
The Data Transfer Project uses services´ existing APIs and authorization mechanisms to access data. It then uses service specific adapters to transfer that data into a common format, and then back into the new service´s API. …

### Magister Dixit

“Preprocessing is often the most time-consuming phase in knowledge discovery, and preprocessing transformations are interdependent in unexpected ways.” Markus Vattulainen ( 2015-10-11 )

### Managers in football matter much less than most fans think

They struggle to sustain success when switching clubs

## January 18, 2019

### SHRSS: Data Analytics Engineer [Davie, FL]

SHRSS is seeking a Data Analytics Engineer in Davie, FL, to be responsible for coordinating the design, development, implementation, maintenance and support of data engineering solutions, predictive models and applications. This position requires SAS experience.

### Google on Responsible AI Practices

Great and beautifully written advice for any data science setting:

Enjoy.

### A startup: Data Scientist [Remote (US)]

A Startup is seeking a talented and highly motivated Data Scientist for a unique and exciting opportunity with a small team looking to build a sports wagering business.

### Book Memo: “Adaptive Resonance Theory in Social Media Data Clustering”

Roles, Methodologies, and Applications

Social media data contains our communication and online sharing, mirroring our daily life. This book looks at how we can use and what we can discover from such big data:

- Basic knowledge (data & challenges) on social media analytics
- Clustering as a fundamental technique for unsupervised knowledge discovery and data mining
- A class of neural-inspired algorithms, based on adaptive resonance theory (ART), tackling challenges in big social media data clustering
- Step-by-step practices of developing unsupervised machine learning algorithms for real-world applications in the social media domain

Adaptive Resonance Theory in Social Media Data Clustering stands on the fundamental breakthrough in cognitive and neural theory, i.e. adaptive resonance theory, which simulates how a brain processes information to perform memory, learning, recognition, and prediction. It presents initiatives on the mathematical demonstration of ART’s learning mechanisms in clustering, and illustrates how to extend the base ART model to handle the complexity and characteristics of social media data and perform associative analytical tasks.

### Webinar: 2019 AI Trends: Filtering the Noise

Check out Dataiku's exclusive webinar on Feb 7, 11am EST, "2019 AI Trends: Filtering the Noise," featuring insights from Léo Drefus-Schmidt, Lead Data Scientist at Dataiku.

### Building credit scorecards using statistical methods and business logic

Whether you’re applying for your first credit card or shopping for a second home – or anywhere in between – you’ll probably encounter an application process. As part of that process, banks and other lenders use a scorecard to determine your likelihood to pay off that loan.

Naturally, this means credit scoring is an important data science topic for banks and any business that works with the banking industry.

Since I have previous experience with customer analytics, but not specifically with financial risk, I’ve been learning how to develop a credit scorecard, and I wanted to share what I’ve learned including my thoughts and code implementation.

## Scorecards and the value of credit scoring

There are two basic types of scorecards: behavioral scorecards and application scorecards.

1. Behavioral scorecards deal more with predicting or scoring current customers and their likelihood to default.
2. Application scorecards are used when new customers apply for loans to predict their likelihood to be profitable customers, and to associate a score to them.

For banks, credit scoring helps manage risk. As consumers we’re bombarded with offers. It’s up to the business to assess the credit worthiness and credit scores of consumers to identify optimal product solutions based on risk, turnaround times, incorrect credit denials and more.

If credit is offered when it shouldn’t be, then a future loss is likely. If turnaround times to approve or deny credit have long lag times or a bank inaccurately denies a good customer credit, they could lose those customers to competitors. In those situations, it might be a long time before you get them back.

Using credit scoring can optimize risk and maximize profitability for businesses.

## Credit scoring data

The training data for the credit scoring example in this post is real customer bank data that has been massaged and anonymized for obvious reasons. The features - what are called characteristics in credit scoring - include the number of children, number in household, age, time at address, time at current job, has a telephone, income, etc. Our target variable will be a binary variable with the values “bad” or “good”, indicating whether the customer defaulted over some historical period.

## The credit scoring code

For this analysis I’m using the SAS open source library called SWAT (Scripting Wrapper for Analytics Transfer) to code in Python and execute SAS CAS action sets. SWAT acts as a bridge between the Python language and CAS action sets. CAS action sets are analogous to libraries in Python or packages in R. The main difference, and benefit, is that the algorithms within these action sets have been highly parallelized to run on a CAS (Cloud Analytic Services) server. The CAS server is a distributed in-memory engine where I can do all my heavy lifting or computations. The code and Jupyter Notebook are available on GitHub.

## The credit scoring method

### Weight of evidence

I first transform my data using the weight of evidence (WOE) method. This method attempts to find a monotonic relationship between the input features and the target variable by splitting each feature into bins and assigning a weight to each bin. Suppose a WOE transformation on income level included a bin for income between $100k and $150k; then all observations within that bin would receive the same WOE value, which can be computed using the formula below.

Weight of Evidence Calculation

Consider a bin for income level between $100k and $150k. Of all the “good” observations in our data, 30 percent come from this income bin, while only 10 percent of the “bad” observations do. Using these proportions, you could state that we have 3:1 odds that a person with income between $100k and $150k is a good credit candidate versus a bad one. We then take the natural log and multiply by 100 for easier numeric representation, and we have our WOE value for all observations that fall within our income level bin.
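As a sanity check on the arithmetic above, here is the WOE calculation in plain Python. This is a minimal sketch using the toy proportions from the example, not real data:

```python
import math

def woe(pct_good, pct_bad):
    """Weight of evidence for one bin: natural log of the
    good/bad distribution ratio, scaled by 100 as described above."""
    return math.log(pct_good / pct_bad) * 100

# 30% of all "good" observations and 10% of all "bad" observations
# fall in the example income bin -> 3:1 odds of being good.
income_bin_woe = woe(0.30, 0.10)
print(round(income_bin_woe, 2))  # ln(3) * 100 ≈ 109.86
```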

Weight of Evidence Example

### Data pipeline

Let’s apply this transformation to our entire data set. Using the data preprocess action set in SAS makes it very easy to build data pipelines (Figure 1). Data pipelines help automate common manual data science steps. This action set can build out large data pipelines for a variety of transformations across any continuous or nominal features.

There are only a few steps to building a single pipeline:

1. Assign variables to roles.
2. Build variable transformations.
3. Append transformations to later apply to data.
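The SAS action set handles the details, but the shape of these three steps can be sketched in plain Python as declarative "request packages". All names here are illustrative stand-ins for the real parameters, not the actual SWAT API:

```python
# Step 1: assign variables to roles (hypothetical feature names).
roles = {
    "continuous": ["income", "age", "time_at_address"],
    "nominal": ["has_telephone"],
    "target": "default_flag",   # event of interest: "bad"
}

# Step 2: build variable transformations. Continuous features get a
# binned WOE transform (discretize); nominal features get a
# categorical WOE transform (cattrans).
req_pack1 = {"name": "woe_continuous", "inputs": roles["continuous"],
             "discretize": {"transform": "woe", "min_bins": 2, "max_bins": 10}}
req_pack2 = {"name": "woe_nominal", "inputs": roles["nominal"],
             "cattrans": {"transform": "woe"}}

# Step 3: append the transformations to apply them together later.
req_packs = [req_pack1, req_pack2]
```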

First, we assign the features to their roles, with respect to the transformation and modeling, separating nominal and continuous variables as well as the target (Figure 1).

Figure 1: Assign features to roles

Next, we create the first transformation called req_pack1, which is just short for request package and is the parameter found in the datapreprocess.transform action. I give that transformation a name, pass the list of features and the target, and specify the event of interest, which is ”bad” in this case.

I call discretize in Python to bin the continuous values and specify the WOE transformation. Within that transformation is a regularization parameter where you can specify a range of bins using the min and max NBins parameter. This enables a search across those bins to find the optimal bin number using information value (IV). IV is a common statistic used in classification models to gauge the predictive power of your feature set.

The second transformation, which I label as req_pack2, is nearly identical except I’m transforming the nominal inputs and therefore need to use cattrans instead of discretize. The cattrans parameter stands for categorical transformation.

We then append those lists together to later pass the transformation outline to our transform action.

Figure 2: Set up weight of evidence transformation

### Data Transformation

Now that we have our data pipeline in place, let’s transform the data (Figure 3). I first reference our data using the table parameter. Then I provide the req_packs list that I created in Figure 2 with all the transformations. I specify the output table (casout) to be called woe_transform. Next, I use copyVars to copy the target and _customerID feature over to our new transformed table from the original table. Then I give a global prefix of “woe” to all our newly transformed features. The code parameter saves the transformation as a code table. This will be used later for scoring new data, which benefits teams that want to collaborate on models or build out deeper end-to-end pipelines for recurring jobs.

Finally, we’ll look at a preview of our new WOE table (Figure 3). Notice the identical values for some of our customers. Remember this happens because those observations, for a given variable, fall into the same bin and therefore receive the same weight.

Figure 3: Transform data and view new dataset.

### Visualizing transformation results

The transform action creates several output tables on the CAS server like the one from Figure 3. One of those tables is called VarTransInfo, which contains the IV statistic for our features. It’s good practice to look at IV for our features to understand their predictive power and to determine whether it’s necessary to include those features in our model. Below is the calculation for IV, and Figure 4 is a plot of those IV values for each of our features. Strong features typically have an IV > 0.3, weak features < 0.02, and anything > 0.5 may be suspicious and need a closer look. Figure 4 shows that our features split roughly in half between strong and middling. Also, the age variable looks suspiciously strong. For now, we’ll keep all the features.
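The IV statistic is computed from the same per-bin good/bad distributions used for WOE: it sums, over bins, the difference in proportions times the (unscaled) WOE. A minimal sketch with made-up bin proportions for a hypothetical three-bin feature:

```python
import math

def information_value(good_pcts, bad_pcts):
    """IV = sum over bins of (%good - %bad) * ln(%good / %bad)."""
    return sum((g - b) * math.log(g / b)
               for g, b in zip(good_pcts, bad_pcts))

# Hypothetical feature: how goods and bads distribute across 3 bins.
good = [0.5, 0.3, 0.2]
bad  = [0.2, 0.3, 0.5]
iv = information_value(good, bad)
print(round(iv, 2))  # ≈ 0.55
```

By the rules of thumb above, an IV of 0.55 would flag this feature as suspiciously strong and worth a closer look.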

Figure 4: Information value (IV) calculation

We can also plot the WOE values of our features by their assigned bins. Aside from viewing these WOE values across bins, a data scientist may want to go back and manually configure the number of bins (combining or splitting up) per a specific variable that makes more logical sense. It’s important to know that WOE attempts to create separation within individual features as it relates to a target variable so there should be differences across bins. Figure 5 shows WOE values that you would expect to see across bins indicating separation within those features.

Figure 5: Plot of WOE value by bin for several features

This is really where a data scientist needs to understand the business in order to ensure the weighting trend across bins is logical. For example, it makes business sense to a bank that the more cash a customer has available, the higher the weighting should be. The same type of logic applies to time at a job, age, or profession group: we would expect those weightings to increase across the bins as well.

## Logistic regression for scorecards

The next step is to fit a logistic regression model using our newly transformed WOE dataset. I will show the simple code to train the model and explain the parameters.

Figure 6 below shows the training code. Here are the steps involved for training the model:

1. Assign model inputs and concatenate “woe_” to all original column names so it references the woe_transform dataset correctly.
2. Reference the woe_transform dataset using the table parameter.
3. Specify the target as well as the reference (reference ‘good’ or ‘bad’ in terms of the model).
4. Then, specify forward selection method for variable selection. Forward selection starts with an empty model and adds a single variable at each iteration based on a specific criterion (AIC, AICC, SBC, etc.).
5. Use the code parameter to save our logistic regression code.
6. Create a name for the new output table using casout and then copy over the target and _customerID variables again.
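Outside of SAS, the modeling step itself can be sketched as a tiny batch-gradient-descent logistic regression on a single WOE-style feature. This is toy data with no variable selection, purely to illustrate what the action is fitting:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy WOE-transformed feature and binary target (1 = "good").
x = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):  # plain batch gradient descent on log loss
    preds = [sigmoid(w * xi + b) for xi in x]
    w -= lr * sum((p - yi) * xi for p, yi, xi in zip(preds, y, x)) / len(x)
    b -= lr * sum(p - yi for p, yi in zip(preds, y)) / len(x)

# The fitted model separates the two classes at probability 0.5.
assert all((sigmoid(w * xi + b) > 0.5) == bool(yi) for xi, yi in zip(x, y))
```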

Figure 6: Logistic regression model

## Creating the scorecard

The final step is to scale the model into a scorecard. We’ll be using a common scaling method. We’ll need both our logistic regression coefficients that we got from fitting our model as well as our WOE dataset with the transformed WOE values. We’ll score our training table to derive the logit or log odds values. Since our score from a logistic regression is in log odds form we need to convert that to a point system for our scorecard. We do that conversion by applying some scaling methods.

The first value is the target score. This can be considered a baseline score. For this scorecard we scaled the points to 600. The target score of 600 corresponds to a good/bad target odds of 30 to 1 (target_odds = 30). Scaling does not affect the predictive strength of the scorecard, so if you select 800 as your score for scaling it won’t be an issue.

The next variable is called pts_double_odds, which means points to double the odds: an increase of 20 points in the score doubles the odds that the applicant is good. For example, if you have a score of 600 you have 30:1 odds of being a good credit candidate, but a score of 620 raises you to 60:1 odds of being considered good. Figure 8 below shows a visualization of the exponential relationship between predicted odds and score. Below in Figure 7 you can see the simple calculations and how they’re used to derive our scorecard score variable.

Figure 7: Scorecard scaling and logic

Figure 8: Predicted odds by score plot
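The scaling described above follows the standard scorecard construction (see Siddiqi in the references): a factor converts log-odds into points to double the odds, and an offset anchors 600 points at 30:1 odds. A minimal sketch using the values from this post:

```python
import math

target_score = 600     # baseline score
target_odds = 30       # 30:1 good/bad odds at the baseline score
pts_double_odds = 20   # +20 points doubles the odds

factor = pts_double_odds / math.log(2)
offset = target_score - factor * math.log(target_odds)

def score(log_odds):
    """Convert a model's log-odds of 'good' into scorecard points."""
    return offset + factor * log_odds

print(round(score(math.log(30))))  # 600 by construction
print(round(score(math.log(60))))  # 620: doubling the odds adds 20 points
```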

## Visualizations

We can look at a distribution of the scorecard score variable here in Figure 9 with the mean score being 456.

Figure 9: Distribution of scores with mean score

And then we can also see how those scores correlate to our probability of being good or bad credit customers for our business, in Figure 10. You can see the nice sigmoidal shape of the plot.

Figure 10: Plot of scores by predicted probability

Finally, we can use those scores to determine tiers of consumers based on their credit score. Depending on the product, loan, etc. there will be varying bands for these tiers. For my tiering system I just selected quartiles to illustrate the building of cutoffs, but you could select any variation with a variety by product or offering.

Figure 11: Customer groups based on percentile scores

## Rejection inference

I want to briefly mention rejection inference, since it is an important step in credit scoring. To this point we’ve fit a logistic regression model based on a label of good or bad and scaled those scores into a scorecard. This entire process has looked at the current customer base, which has mostly complete data and known credit (good or bad). However, applications for credit can often be missing a lot of data, which leads to a denial of credit. Denial of credit in this case is due to our biased model that only looks at complete records of people we know to be good or bad. We need to include some method to investigate the denials and incorporate that information back into our model, so it is less biased and generalizes better.

This is what rejection inference achieves. In short, we look at those rejected customers, who have unknown credit status, treat that data separately, and re-classify them as good or bad. This is often achieved by rule-based approaches, proportional assignment similar to the original logit model, augmenting original scores from the logit for the denials, etc. This topic could be a standalone discussion since there are a variety of methods as well as schools of thought. For now, I’ll just state that to make your model less biased and better at generalizing, the denials due to missing data should be investigated and incorporated.

## Summary

Overall, using weight of evidence transformations and a logistic regression model to derive scores for customers can be a very powerful tool in the hands of a data scientist who also uses logical sense of the business. There are many ways to build these scoring models and while this is just one I hope it’s helpful in giving guidance or spurring new ideas. Plus, any time a data scientist can take complex problems, like transformations and credit scoring, and illustrate the results it’s a win for both the practitioner and the organization.

Bonus: Scoring function or REST API

I created a function that used the code parameter from our transformation and logistic regression to score new incoming data in batch. It can save time and show how to use your code for scoring new data.

Looking at Figure 12, the first step is generating a small test set, which I do by grabbing a couple of observations from the training data (just for testing purposes). Then I load that data to the CAS server. Here you can see the function I built called model_scoring. It takes five parameters: the name of the CAS connection, the code from the WOE transformation, the code from the logistic regression model, the test table name and the scored table name. Within the model_scoring function there are three steps:

1. runcodetable - woe transform.
2. impute – replace missing woe values with 0.
3. runcodetable – logistic regression using woe transform values.
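The shape of those three steps can be sketched outside SAS as a small batch-scoring function. The WOE mapping and coefficients below are hypothetical placeholders; the real version runs the saved CAS code tables instead:

```python
import math

# Hypothetical artifacts saved from training: a per-bin WOE lookup
# and logistic regression coefficients on the WOE feature.
woe_map = {"income": {"low": -40.0, "mid": 10.0, "high": 95.0}}
coefs = {"intercept": -1.0, "woe_income": 0.02}

def model_scoring(rows):
    """Batch-score new records: WOE transform -> impute -> logit."""
    scored = []
    for row in rows:
        # Step 1: apply the WOE transform via lookup.
        woe_val = woe_map["income"].get(row.get("income_bin"))
        # Step 2: impute missing WOE values with 0, as in the post.
        if woe_val is None:
            woe_val = 0.0
        # Step 3: apply the logistic regression to get P(good).
        z = coefs["intercept"] + coefs["woe_income"] * woe_val
        scored.append(1.0 / (1.0 + math.exp(-z)))
    return scored

probs = model_scoring([{"income_bin": "high"}, {"income_bin": None}])
assert probs[0] > probs[1]  # higher WOE -> higher probability of "good"
```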

I use that scoring function to score the new data. At this point you would apply those simple scorecard calculations however you want to scale, and you would have a ready to use scorecard.

Figure 12: Scoring new data with scoring function

All of this can also be done using REST APIs. Every analytic asset on CAS is abstracted using a REST endpoint. This means that your data and your data processes are just a few REST calls away, which allows for easy integration of SAS technology into your business process or other applications. I accessed these action sets and actions using Python, but with REST you can access any of these assets in the language of your choice.

References

1. Siddiqi, Naeem. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. 1st ed., Wiley, 2005.
2. SAS Developer of ‘dataPreProcess’ Action Set and Huge Thanks to Biruk Gebremariam
3. Full Jupyter Notebook Demo on GitHub

The post Building credit scorecards using statistical methods and business logic appeared first on The SAS Data Science Blog.

### On Points Insights: Senior Python Developer with Big Data skills [Bay Area, CA]

On Points Insights is seeking a Senior Python Developer with Big Data skills in Bay Area, CA, to Interpret internal or external business issues and recommend best practices, solve complex problems, and take a broad perspective to identify innovative solutions.

### New videos from Databricks Academy: Introduction to Machine Learning Series and the Apache Spark™ Cost-Based Optimizer

Databricks’ commitment to education is at the center of the work we do. Through Instructor-Led Training, Certification, and Self-Paced Training, Databricks Academy provides strong pathways for users to learn Apache Spark and Databricks, and to push their knowledge to the next level.

In that spirit, we are pleased to present some great new free content. First up is a series of short videos to help anyone get started with Machine Learning on Apache Spark and Databricks. We follow that with a sample module from one of our 3-day Instructor-Led training classes.

## Series: An Introduction to Machine Learning

In the past few weeks, Databricks Academy launched two new self-paced courses: Structured Streaming and Introduction to Data Science and Machine Learning. As part of the Machine Learning class launch, we created a series of videos featuring course developer Conor Murphy. If you’d like to follow along with Conor on your own computer, simply download the code. If you don’t have a Databricks account yet, get started for free on Databricks Community Edition.


In this video, Conor introduces the core concepts of Machine Learning and Distributed Learning, and how Distributed Machine Learning is done with Apache Spark. He also sets up the goal of the entire video series: building an end-to-end machine learning pipeline using Databricks.

In order to work on a data set, the data must be imported into a Databricks workspace. In this video, Conor provides a concrete example of importing data into Databricks. Once the data is loaded, Conor uses Databricks to do Exploratory Data Analysis (EDA) and Visualization of salient aspects of the data set.

There are three main abstractions in Apache Spark’s Machine Learning Library: Transformers, Estimators, and Pipelines. In this video, Conor discusses the `.transform()` and `.fit()` methods implemented in Transformers and Estimators, respectively, and how they are used to construct a full machine learning Pipeline. Conor then walks through the implementation of such a pipeline using Spark in Databricks.

In this video, Conor prepares the data for final model fitting. He first demonstrates the preparation of a Train/Test Split on the data set and discusses the importance of this technique in terms of preventing model overfitting. Next, Conor shows how to use Spark ML Transformers to complete data preparation via a Featurization Pipeline.

Finally, Conor completes the end-to-end machine learning pipeline by training models on the full Pipelines developed throughout this series. He also shows how to use performance metrics to assess the performance of these Pipelines. Having selected a final model, Conor demonstrates how to save the model for later use.

We hope that you find these videos informative, as well as entertaining! The full video playlist is here. You can learn more about Machine Learning using Databricks in the Introduction to Data Science and Machine Learning available at Databricks Academy.

## Apache Spark Cost-Based Optimizer

Here we present an example module from Apache Spark Tuning and Best Practices, one of Databricks Academy’s 3-day Instructor-Led Training courses.

In this video, Databricks Instructor Jacob Parr presents the Apache Spark Cost-Based Optimizer. He’ll first explain various optimizers and how they are used within Apache Spark, and then go into detail on the Cost-Based Optimizer, providing examples on actual data with code samples.

--

### Voloridge: Quant Data Analyst [Jupiter, FL]

We are seeking an enthusiastic, self-motivated Data Services Analyst to work collaboratively with our data scientists and our operations team to meet the Voloridge mission of delivering superior risk-adjusted returns using proprietary modeling techniques.

### Document worth reading: “Rethinking the Artificial Neural Networks: A Mesh of Subnets with a Central Mechanism for Storing and Predicting the Data”

The Artificial Neural Networks (ANNs) have been originally designed to function like a biological neural network, but does an ANN really work in the same way as a biological neural network? As we know, the human brain holds information in its memory cells, so if the ANNs use the same model as our brains, they should store datasets in a similar manner. The most popular type of ANN architecture is based on a layered structure of neurons, whereas a human brain has trillions of complex interconnections of neurons continuously establishing new connections, updating existing ones, and removing the irrelevant connections across different parts of the brain. In this paper, we propose a novel approach to building ANNs which are truly inspired by the biological network containing a mesh of subnets controlled by a central mechanism. A subnet is a network of neurons that hold the dataset values. We attempt to address the following fundamental questions: (1) What is the architecture of the ANN model? Whether the layered architecture is the most appropriate choice? (2) Whether a neuron is a process or a memory cell? (3) What is the best way of interconnecting neurons and what weight-assignment mechanism should be used? (4) How to incorporate prior knowledge, bias, and generalizations for features extraction and prediction? Our proposed ANN architecture leverages the accuracy on textual data and our experimental findings confirm the effectiveness of our model. We also collaborate with the construction of the ANN model for storing and processing the images. Rethinking the Artificial Neural Networks: A Mesh of Subnets with a Central Mechanism for Storing and Predicting the Data

### At JMM 2019!

I registered for this year’s Joint Math Meeting by claiming to be Press so I think it’s only fair that I blog from the conference.

I got here Wednesday, met up with my BFF Aaron Abrams, and we promptly dashed to a fancypants reception to meet up with my buddy Ken Ribet. And yes, both of these wonderful men were wearing knitted hats that I knitted for them in the blistering Baltimore weather. Ken happens to be the outgoing AMS President so has lots of fancypants receptions to go to, and he was kind enough to let us in. The highlight, besides reminiscences with him and others, was when I got to write on a board about how Ken has been a great mentor to me since I was 18, welcoming me with open arms into the warm and wonderful community of mathematics. I also got to (re)meet Francis Su, who is awesome.

Then, yesterday I was honored to receive the MAA Euler Book Prize along with a bunch of adorable nerds receiving all kinds of mathematical honors onstage. It was fun, and afterwards there was a reception, which I went to. Then after that I ran over to a Budapest Semesters in Math reunion, and then the MAA dinner for prize winners. So that’s pretty much three more parties, bringing my total to four as of last night. If you’re wondering what else I did besides party, the answer is I totally checked out the Exhibitor Hall and went to lunch with an editor from Cambridge University Press and a friend of mine who might write a book. Yes, we went to a pub.

This morning so far I’ve been to the HCSSiM reunion breakfast, I’m having drinks with Ina Mette, AMS editor, and I’m looking for receptions to crash later (please leave a comment if you know of any good ones!).

Finally, tomorrow I’ll be giving the Gerald and Judith Porter Lecture, which will be great in part because I got to meet Gerald and Judith Porter last night and they’re very cool. Also, the title of my talk is “Big Data, Inequality, and Democracy”, which are three topics I love talking about. I’m considering inviting the entire audience to the aforementioned pub afterwards.

Besides my alcohol consumption, I have a few comments to make.

First, math nerds are and always will be unbelievably adorable.

Second, unlike many past years when I’ve visited JMM, I am less pessimistic about the future of mathematics. I was quite worried, for many years, that MOOCs and other “flipped classroom” type scenarios would take over calculus teaching. I’m no longer so worried about that, because I simply haven’t heard of it working on a broad scale.

Third, on the other hand, from the little I’ve understood talking to people, the other effect I’ve been worrying about, namely the slow replacement of tenured faculty by adjunct staff, doesn’t seem to be abating. So I will say that the profession of academic mathematics is not a growing or improving field in terms of quality of life for the median Ph.D. grad.

Fourth, I’m kind of surprised how slowly the world of publishing in math has changed, and its flip side, the world of credentialing. It seems like there’s just as much gaming, counting, and other kind of dumb metric stuff going on as ever. I guess it’s because I’m on the outside now looking in, but I’m wondering when people will start seriously contributing to things like the Stack Project – and figure out a way of giving credit to people for those contributions – because it seems like the obvious future of mathematical contributions. Tell me if I’m wrong.

### Books to Read While the Algae Grow in Your Fur, December 2018

Attention conservation notice: I have no taste. I also have no qualifications to discuss poetry or leftist political theory. I do know something about spatiotemporal data analysis, but you don't care about that.

Gidon Eshel, Spatiotemporal Data Analysis
I assigned this as a textbook in my fall class on data over space and time, because I needed something which covered spatiotemporal data analysis, especially principal components analysis, for students who could be taking linear regression at the same time, and was cheap. This met all my requirements.
The book is divided into two parts. Part I is a review or crash course in linear algebra, building up to decomposing square matrices in terms of their eigenvalues and eigenvectors, and then the singular value decomposition of arbitrary matrices. (Some prior acquaintance with linear algebra will help, but not very much is needed.) Part II is about data analysis, covering some basic notions of time series and autocorrelation, linear regression models estimated by least squares, and "empirical orthogonal functions", i.e., principal components analysis, i.e., eigendecomposition of covariance or correlation matrices. As for "cheap", while the list price is (currently) an outrageous \$105, it's on JSTOR, so The Kids had free access to the PDF through the university library. In retrospect, there were strengths to the book, and some serious weaknesses --- some absolute, some just for my needs. The most important strength is that Eshel writes like a human being, and not a bloodless textbook. His authorial persona is not (thankfully) much like mine, but it's a likeable and enthusiastic one. This is related to his trying really, really hard to explain everything as simply as possible, and with multitudes of very detailed worked examples. I will probably be assigning Part I of the book, on linear algebra, as refresher material to my undergrads for years. He is also very good at constantly returning to physical insight to motivate data-analytic procedures. (The highlight of this, for me, was section 9.7 [pp. 185ff] on when and why an autonomous, linear, discrete-time AR(1) or VAR(1) model will arise from a forced, nonlinear, continuous-time dynamical system.) If this had existed when I was a physics undergrad, or starting grad school, I'd have loved it. Turning to the weaknesses, some of them are, as I said, merely ways in which he didn't write the book to meet my needs. 
His implied reader is very familiar with physics, and not just the formal, mathematical parts but also the culture (e.g., the delight in complicated compound units of measurement, saying "ensemble" when other disciplines say "distribution" or "population"). In fact, the implied reader is familiar with, or at least learning, climatology. But that reader has basically no experience with statistics, and only a little probability (so that, e.g., they're not familiar with rules for algebra with expectations and covariances*). Since my audience was undergraduate and masters-level statistics students, most of whom had only the haziest memories of high school physics, this was a mis-match.

Other weaknesses are, to my mind, a bit more serious, because they reflect more on the intrinsic content.

• A trivial but real one: the book is printed in black and white, but many figures are (judging by the text) intended to be in color, and are scarcely comprehensible without it. (The first place this really struck me was p. 141 and Figure 9.4, but there were lots of others.) The electronic version is no better.

• The climax of the book (chapter 11) is principal components analysis. This is really, truly important, so it deserves a lot of treatment. But it's not a very satisfying stopping place: what do you do with the principal components once you have them? What about the difference between principal components / empirical orthogonal functions and factor models? (In the book's terms, the former does a low-rank approximation to the sample covariance matrix, $\mathbf{v} \approx \mathbf{w}^T \mathbf{w}$, while the latter treats it as low-rank-plus-diagonal-noise, $\mathbf{v} \approx \mathbf{w}^T\mathbf{w} + \mathbf{d}$, an importantly different thing.) What about nonlinear methods of dimensionality reduction? My issue isn't so much that the book didn't do everything, as that it didn't give readers even hints of where to look.
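The difference between the two decompositions is easy to see numerically. Here is a small sketch (Python/NumPy, with made-up data and dimensions, not anything from the book) contrasting the rank-$k$ eigendecomposition of a sample covariance matrix with the same low-rank part plus a fitted diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: 500 observations of 5 variables whose covariance
# is (approximately) rank-2 plus independent noise.
n, p, k = 500, 5, 2
W = rng.normal(size=(k, p))          # loadings (the book's w)
scores = rng.normal(size=(n, k))     # latent factors
X = scores @ W + 0.1 * rng.normal(size=(n, p))

V = np.cov(X, rowvar=False)          # sample covariance (the book's v)

# PCA / EOF: keep the top-k eigenpairs of V, giving the best
# rank-k approximation V ~ W^T W.
vals, vecs = np.linalg.eigh(V)       # eigenvalues in ascending order
top = vecs[:, -k:] * np.sqrt(vals[-k:])
V_pca = top @ top.T

# Factor-model form: rank-k part plus a diagonal "uniqueness" matrix D.
D = np.diag(np.diag(V - V_pca))
V_factor = V_pca + D

# The diagonal term is what distinguishes the two: V_factor matches the
# diagonal of V exactly, while V_pca need not.
print(np.allclose(np.diag(V_factor), np.diag(V)))  # True
print(np.linalg.norm(V - V_pca) >= np.linalg.norm(V - V_factor))
```

The factor-model form reproduces the variables' individual variances exactly, at the cost of treating variable-specific variance as noise rather than shared structure; that is exactly why the two are importantly different things.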
• There are places where the book's exposition is not very internally coherent. Chapter 8, on autocorrelation, introduces the topic with an example where $x(t) = s(t) + \epsilon(t)$, for a deterministic signal function $s(t)$ and white noise $\epsilon(t)$. Fair enough; this is a trend-plus-noise representation. But it then switches to modeling the autocorrelations as arising from processes where $x(t) = \int_{-\infty}^{t}{w(u) x(u) du} + \xi(t)$, where again $\xi(t)$ is white noise. (Linear autoregressions are the discrete-time analogs.) These are distinct classes of processes. (Readers will find it character-building to try to craft a memory kernel $w(u)$ which matches the book's running signal-plus-noise example, where $s(t) = e^{-t/120}\cos{\frac{2\pi t}{49}}$.)

• I am all in favor of physicists' heuristic mathematical sloppiness, especially in introductory works, but there are times when it turns into mere confusion. The book persistently conflates time or sample averages with expectation values. The latter are ensemble-level quantities, deterministic functionals of the probability distribution. The former are random variables. Under various laws of large numbers or ergodic theorems, the former converge on the latter, but they are not the same. Eshel knows they are not the same, and sometimes talks about how they are not the same, but the book's notation persistently writes them both as $\langle x \rangle$, and the text sometimes flat-out identifies them. (For one especially painful example among many, p. 185.) Relatedly, the book conflates parameters (again, ensemble-level quantities, functions of the data-generating process) and estimators of those parameters (random variables).

• The treatment of multiple regression is unfortunate. $R^2$ does not measure goodness of fit. (It's not even a measure of how well the regression predicts or explains.) At some level, Eshel knows this, since his recommendation for how to pick regressors is not "maximize $R^2$".
On the other hand, his prescription for picking regressors (sec. 9.6.4, pp. 180ff) is rather painful to read, and completely at odds with his stated rationale of using regression coefficients to compare alternative explanations (itself a bad, though common, idea). Very strikingly, the terms "cross-validation" and "bootstrap" do not appear in his index**. Now, to be clear, Eshel isn't worse in his treatment of regression than most non-statisticians, and he certainly understands the algebra backwards and forwards. But his advice on the craft of regression is, to be polite, weak and old-fashioned.

Summing up, the linear-algebra refresher/crash-course of Part I is great, and I even like the principal components chapters in Part II, as far as they go. But it's not ideal for my needs, and there are a bunch of ways I think it could be improved for anyone's needs. What to assign instead, I have no idea.

*: This is, I think, why he doesn't explain the calculation of the correlation time and effective sample size in sec. 8.2 (pp. 123--124), just giving a flat statement of the result, though it's really easy to prove with those tools. I do appreciate finally learning the origin of this beautiful and practical result --- G. I. Taylor, "Diffusion by Continuous Movements", Proceedings of the London Mathematical Society, series 2, volume 20 (1922), pp. 196--212 (though the book's citing it with the wrong year, confusing the series number with an issue number, and giving no page numbers was annoying). ^

**: The absence of "ridge regression" and "Tikhonov regularization" from the index is all the more striking because they appear in section 9.3.3 as "a more general, weighted, dual minimization formalism", which, compared to ordinary least squares, is described as "sprinkling added power ... on the diagonal of an otherwise singular problem".
This is, of course, a place where it would be really helpful to have a notion of cross-validation, to decide how much to sprinkle. ^

Nick Srnicek and Alex Williams, Inventing the Future: Postcapitalism and a World Without Work

It's --- OK, I guess? They have some good points against what they call "folk politics", namely, that it has conspicuously failed to accomplish anything, so doubling down on more of it seems like a bad way to change the world. And they really want to change the world: the old twin goals of increasing human power over the world, and eliminating human power over other humans, are very much still there, though they might not quite adopt that formula. To get there, their basic idea is to push for a "post-work world", one where people don't have to work to survive, because they're entitled to a more-than-subsistence basic income as a matter of right. They realize that making that work will require lots of politics and pushes for certain kinds of technological progress rather than others. This is the future they want --- to finally enter (in Marx's words) "the kingdom of freedom", where we will be able to get on with all the other problems, and possibilities, confronting us.

As for getting there: like a long, long line of leftist intellectuals from the 1960s onwards, Srnicek and Williams are very taken with the idea, going back to Gramsci, that the key to achieving socialism is to first achieve ideological "hegemony". To put it crudely, this means trying to make your ideas such broadly-diffused, widely-accepted, scarcely-noticed common notions that when madmen in authority channel voices from the air, they channel you. (In passing: Occupy may have done nothing to reduce economic inequality, but Gramsci's success as a strategist may be measured by the fact that he wrote in a Fascist prison.) Part of this drive for hegemony is pushing for new ideas in economics --- desirable in itself, but they are sure in advance of what inquiry should find*.
Beyond this, and saying that many tactics will need to be tried out by a whole "ecology" of organizations and groups, they're pretty vague. There's some wisdom here --- who could propound a detailed plan to get to post-work post-capitalism? --- but also more ambiguity than they acknowledge. Even if a drive for a generous basic income (and all that would go with it) succeeds, the end result might not be anything like the sort of post-capitalism Srnicek and Williams envisage, if only because what we learn and experience along the way might change what seems feasible and desirable. (This is a Popperian point against Utopian plans, but it can be put in other language quite easily**.) I think Srnicek and Williams might be OK with the idea that their desired future won't be realized, so long as some better future is, and that the important point is to get people on the left not to prefigure better worlds in occasional carnivals of defiance, but to try to make them happen. Saying that doing this will require organization, concrete demands, and leadership is pretty sensible, though they do disclaim trying to revive the idea of a vanguard party.

Large portions of the book are, unfortunately, given over to insinuating, without ever quite saying, that post-work is not just desirable and possible, but a historical necessity to which we are impelled by the inexorable development of capitalism, as foreseen by the Prophet. (They also talk about how Marx's actual scenario for how capitalism would develop, and end, not only has not come to pass yet, but is pretty much certain never to come to pass.) Large portions of the book are given over to wide-ranging discussions of lots of important issues, all of which, apparently, they grasp through the medium of books and articles published by small, left-wing presses strongly influenced by post-structuralism --- as it were, the world viewed through the Verso Books catalog.
(Perry Anderson had the important advantage, as a writer and thinker, of being formed outside the rather hermetic subculture/genre he helped create; these two are not so lucky.) Now, I recognize that good ideas usually emerge within a community that articulates its own distinctive tradition, so some insularity can be all to the good. In this case, I am not all that far from the authors' tradition, and sympathetic to it. But still, the effect of these two (overlapping) writerly defects is that once the book announced a topic, I often felt I could have written the subsequent passage myself; I was never surprised by what they had to say. Finishing this was a slog. I came into the book a mere Left Popperian and market socialist, friendly to the idea of a basic income, and came out the same way. My mind was not blown, or even really changed, about anything. But it might encourage some leftist intellectuals to think constructively about the future, which would be good.

Shorter: Read Peter Frase's Four Futures instead.

*: They are quite confident that modern computing lets us have an efficient planned economy, a conclusion they support not by any technical knowledge of the issue but by citations to essays in literary magazines and collections of humanistic scholarship. As I have said before, I wish that were the case, if only because it would be insanely helpful for my own work, but I think that's just wrong. In any case, this is an important point for socialists, since it's very consequential for the kind of socialism we should pursue. It should be treated much more seriously, i.e., rigorously and knowledgeably, than they treat it. Fortunately, a basic income is entirely compatible with market socialism, as are other measures to ensure that people don't have to sell their labor power in order to live.
**: My own two-minute stab at making chapter 9 of The Open Society and Its Enemies sound suitable for New Left Review: "The aims of the progressive forces, always multifarious, develop dialectically in the course of the struggle to attain them. Those aims can never be limited by the horizon of any abstract, pre-conceived telos, even one designated 'socialism', but will always change and grow through praxis." (I admit "praxis" may be a bit behind the times.) ^

A. E. Stallings, Like: Poems

Beautiful stuff from one of my favorite contemporary poets. "Swallows" and "Epic Simile" give a fair impression of what you'll find. This also includes a lot of the poems discussed in Cynthia Haven's "Crossing Borders" essay.

Continue Reading…

### Data over Space and Time, Lectures 21--24

### Books to Read While the Algae Grow in Your Fur, September 2018

Attention conservation notice: I have no taste. I also have no qualifications to discuss geography, the alt-right, 19th century American history, political philosophy, or the life and works of Joseph Conrad.

Gilbert Seldes, The Stammering Century

A sympathetic, at times even loving, account of selected 19th century American cranks, and crank movements, tracing them all back to Jonathan Edwards, both in the inflection he gave to Calvinism, and in his cultivating outbreaks of enthusiasm. Strongly recommended to those interested in weird Americana, and, of course, psychoceramics.

Stanley Fish, Save the World on Your Own Time

A plea to university faculty to teach their subject matter, and just teach their subject matter, rather than use our teaching to try to "save the world". I am very sympathetic, but I don't think Fish is really fair to some fairly obvious counter-arguments:

• Sometimes, the consensus of a discipline on a key subject matter runs smack into a current political or cultural controversy --- e.g., evolutionary biology or climatology. To refuse to engage that is to fail in teaching our disciplines.
To (as Fish suggests) "academicize" the point by studying the controversy itself fails to convey crucial points of our disciplines. (And anyway biologists and climatologists aren't sociologists or historians, and would be operating outside their domain of expertise.)

• We may have options available to us in our teaching which are equally good from a disciplinary standpoint, but carry very different connotations. If I am teaching time series analysis, from a purely statistical viewpoint it doesn't matter whether I draw my examples from finance or from environmental toxicology, but it'd be (faux) naive to pretend that this choice wouldn't carry connotations to the students. Of course, what my students would make of those connotations is another matter. One of Fish's sounder points is that the way our students understand our lessons, especially the subtler aspects of them, is so far beyond our control, and so idiosyncratic from student to student, that it's futile to aim at changing their attitudes in the way some of us profess to do. (Fish didn't originate the line about "how am I supposed to indoctrinate my students when I can't even make them do the reading?", but I'm pretty sure he'd endorse it.) I might please myself by using environmental examples in my time-series class, and I might even fulfill a legitimate pedagogical purpose of showing the students something about the range of applicability of the methods, but I shouldn't fool myself that I am raising their consciousness.

• At least since the medieval universities were founded to train professionals in medicine, law and theology, higher education has always had practical aims. American higher education was certainly never intended as the self-justifying pursuit of inutility which Fish longs for. So why not ask "useful for what?" (Cf.)
Now, this is a short book, and one can forgive a pamphlet for not being a comprehensive treatise, and in particular for not considering all possible ramifications and objections. I become less forgiving, however, when a short book has a lot of space given over to, among other things:

• An account of what sounds like its author's nervous breakdown after he gave up being a dean;

• A loving description of the author's frankly-eccentric approach to teaching composition and syntax by making his students invent an artificial language (not much burdened by knowledge of linguistics);

• A disquisition on how, because Milton wrote poems, he couldn't also have been trying to make political or theological points in his poetry, because (you guessed it) poetry is a self-referential, self-justifying activity [*];

and so on. I feel like Fish probably has it in him to write a better-proportioned book on these themes, which engaged better with objections; I'd be interested to read it.

*: This is a frankly astonishing argument from someone of Fish's obvious erudition; I can't decide whether it's more rhetorically or historically ill-informed. If poetry can be used to write astronomy textbooks, it can be used to score theological points.

Maya Jasanoff, The Dawn Watch: Joseph Conrad in a Global World

Part biography of Conrad, part exposition of his most important novels, part an effort to portray him as a prophet of a newly-globalizing world, and so connect him to our own time. I think it really works quite well on all fronts.

Daniel Dorling, Mark Newman and Anna Barford, The Atlas of the Real World: Mapping the Way We Live

A collection of interesting (if not always very uplifting) cartograms. Since Mark is a friend and collaborator, and once upon a time we wrote something using his cartogram-making technique, I won't pretend to objectivity, but I will say this is fascinating and I wish it could be perpetually updated.
(Posted now because of my policy / compulsion of not recommending books until I've read them cover to cover.) (I am, however, puzzled by the international-trade cartograms that use net exports or imports by industry; this seems very misleading when a lot of countries both export and import substantially in the same category.)

Mike Wendling, Alt-Right: From 4Chan to the White House

No great revelations, but a decent, straightforward journalistic account of the movement, or rather collection of more-or-less related and overlapping movements and tendencies, and some of the principal ideologues/grifters. Owing to the vagaries of publication, this basically ends with Charlottesville, and with the conclusion that the movement is on its way to implosion. I suspect this is right for whatever attempt there was at a coherent movement of (sort-of) younger, (pseudo-) sophisticated people. As events since then have amply shown, however, there is no shortage online of disorganized people spread somewhere on a spectrum from paranoia to frothing hatred, and encouraging each other to ever more elaborate delusions. (Written before one of those fuckheads shot up my neighborhood and killed someone I cared about.)

Amy Gutmann, Identity in Democracy

This is calm and sensible, and a bit depressing to still be discussing a decade and a half later, when a lot of the topical examples are very dated. Curiously, from my point of view, the book takes which identities are politically relevant as given, rather than as endogenous to the political-cultural process.

Continue Reading…

### Books to Read While the Algae Grow in Your Fur, October 2018

Attention conservation notice: I have no taste. I also have no qualifications to discuss corporate fraud.

John Carreyrou, Bad Blood: Secrets and Lies in a Silicon Valley Startup

This is a deservedly-famous story, told meticulously. It says some very bad things about the culture around Silicon Valley which made this fraud (and waste) possible.
(To be scrupulously fair, investment companies with experience in medical devices and the like don't seem to have bought in.) It also says some very bad things about our elites more broadly, since lots of influential people who were in no position to know anything useful about whether Theranos could fulfill its promises endorsed them, apparently on the basis of will-to-believe and their own arrogance. (I hereby include by reference Khurana's book on the charisma of corporate CEOs, and Xavier Marquez's great post on charisma.) The real heroes here are, of course, the people who quietly kept following through on established procedures and regulations, and refused to bend to considerable pressure.

Luca D'Andrea, Beneath the Mountain

Mind candy: in which a stranger investigates the secrets of a small, isolated community's past, for multiple values of "past".

Walter Jon Williams, Quillifer

Misadventures of a rogue in a fantasy world whose technology level seems to be about the 1500s in our world. Quillifer has some genuinely horrible things happen to him, and brings others on himself, but keeps bouncing back, and keeps his eye on various main chances (befitting the only law clerk I can think of in fantasy literature who isn't just cannon-fodder). I didn't like him, exactly, but I was definitely entertained.

Continue Reading…

### Data Over Space and Time

Collecting posts related to this course (36-3467/36-667).

Continue Reading…

### Books to Read While the Algae Grow in Your Fur, November 2018

Attention conservation notice: I have no taste. I also have no qualifications to discuss the history of photography, or of black Pittsburgh.

Cheryl Finley, Laurence Glasco and Joe W. Trotter, with an introduction by Deborah Willis, Teenie Harris, Photographer: Image, Memory, History

A terrific collection of Harris's photos of (primarily) Pittsburgh's black community from the 1930s to the 1970s, with good biographical and historical-contextual essays. Disclaimer: Prof.
Trotter is also on the faculty at CMU, but I don't believe we've ever actually met.

Ben Aaronovitch, Lies Sleeping

Mind candy: the latest installment in the long-running supernatural-procedural mystery series, where the Folly gets tangled up with the Matter of Britain.

Charles Stross, The Labyrinth Index

Mind candy: the latest installment in Stross's long-running Lovecraftian spy-fiction series. I imagine a novel about the US Presidency being taken over by a malevolent occult force seemed a lot more amusing before 2016, when this must have been mostly written. It's a good installment, but only suitable for those already immersed in the story.

Anna Lee Huber, The Anatomist's Wife and A Brush with Shadows

Mind-candy, historical mystery flavor. These are the first and sixth books in the series, because I couldn't lay hands on 2--5, but I will.

Continue Reading…

### Data over Space and Time: Self-Evaluation and Lessons Learned

Attention conservation notice: Academic navel-gazing, about a class you didn't take, in a subject you don't care about, at a university you don't attend.

Well, that went better than it could have, especially since it was the first time I've taught a new undergraduate course since 2011.

Some things that worked well:

1. The over-all choice of methods topics --- combining descriptive/exploratory techniques and generative models and their inference. Avoiding the ARIMA alphabet soup as much as possible both played to my prejudices and avoided interference with a spring course.

2. The over-all kind and range of examples (mostly environmental and social-historical) and the avoidance of finance. I could have done some more economics, and some more neuroscience.

3. The recurrence of linear algebra and eigen-analysis (in smoothing, principal components, linear dynamics, and Markov processes) seems to have helped some students, and at least not hurt the others.

4. The in-class exercises did wonders for attendance.
Whether doing the exercises, or that attendance, improved learning is hard to say. Some students specifically praised them in their anonymous feedback, and nobody complained.

Some things did not work so well:

1. I was too often late in posting assignments, and too many of them had typos when first posted. (This was a real issue with the final. To any of the students reading this: my apologies once again.) I also had a lot of trouble calibrating how hard the assignments would be, so the opening problem sets were a lot more work than the later ones. (In my partial defense about late assignments, there were multiple problem sets which I never posted, after putting a lot of time into them, because my initial idea either proved much too complicated for this course when fully executed, or because I was, despite much effort, simply unable to reproduce published papers*. Maybe next time, if there is a next time, these efforts can see the light of day.)

2. I let the grading get really, really behind the assignments. (Again, my apologies.)

3. I gave less emphasis to spatial and spatio-temporal models in the second, generative half of the course than they really deserve. E.g., Markov random fields and cellular automata (and kin) probably deserve at least a lecture each, perhaps more.

4. I didn't build in enough time for review in my initial schedule, so I ended up making some painful cuts. (In particular, nonlinear autoregressive models.)

5. My attempt to teach Fourier analysis was a disaster. It needs much more time and preparation than I gave it.

6. We didn't get very much at all into how to think your way through building a new model, as opposed to estimating, simulating, predicting, checking, etc., a given model.

7. I have yet to figure out how to get the students to do the readings before class.

If I got to teach this again, I'd keep the same over-all structure, but re-work all the assignments, and re-think, very carefully, how much time I spent on which topics.
Some of these issues would of course go away if there were a second semester to the course, but that's not going to happen.

*: I now somewhat suspect that one of the papers I tried to base an assignment on is just wrong, or at least could not have done the analysis the way it says it did. This is not the first time I've encountered something like this through teaching... ^

Continue Reading…

### Anthony Bourdain (3) vs. A. J. Liebling; Steve Martin advances

Yesterday's decision was pretty easy, as almost all the commenters talked about Steve Martin, pro and con. Letterman was pretty much out of the picture. Indeed, the best argument in favor of Letterman came from Jonathan, who wrote:

I'll go with Letterman because he looks like he could use the work.

Conversely, the strongest argument against Martin came from Adam, who wrote:

Steve Martin once said: I know what you're saying, you're saying, "Steve, where do you find time to juggle?" Well, I juggle in my mind. … Whoops. so that's the problem: he might just do magic in his head. and that's no fun to watch. Then again, along the same lines as zbicyclist, he might be able to shed some light on the stuff you post on here.

In the same routine, he said: And then on the other hand science, you know, is just pure empiricism and by virtue of its method it excludes metaphysics. And uh, I guess I wouldn't believe in anything if not for my lucky astrology mood watch.

Take the strongest case for Dave, and the strongest case against Steve, and Steve still comes out on top. So, no contest.

And now for today's contest, featuring two people from the Creative Eaters category. (It's the nature of the random assignment of unseeded competitors that sometimes two people from the same category will face off in the first round.) Seeded #3 in the group is legendary globetrotting tell-it-like-it-is chef Anthony Bourdain. You can't go wrong with Bourdain. But his unseeded opponent is formidable too: A. J.
Liebling, one of the greatest and most versatile reporters who's ever lived, author of The Honest Rainmaker and many other classics, and the inspiration for O.G. blogger Mickey Kaus's invention of the concept of Liebling optimality. Bourdain was skinny and Liebling was fat; make of that what you will. So give it your best: this round could turn out to be important!

Again, the full bracket is here, and here are the rules: We're trying to pick the ultimate seminar speaker. I'm not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above. I'll decide each day's winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Continue Reading…

### Two takes on Google Maps navigation software

On my (new) YouTube channel, called "Fung with Data", I am using short clips to explain how data, software and algorithms work behind the scenes to influence our daily decision making. The third episode just launched today, and it addresses the question of whether Google Maps or GPS navigators can really find you the "fastest route" to your destination. Lots of people I know swear by the software; how does it work? Click here to see the video. This video is the second in a series about Google Maps. Episode 2 presents the basics of how route optimization works. Click here to see the short clip. Subscribe to the channel to get notified when the next episode shows up.

Continue Reading…

### How to Monitor Machine Learning Models in Real-Time

We present practical methods for near real-time monitoring of machine learning systems which detect system-level or model-level faults and can see when the world changes.

Continue Reading…

### Reviewing 2018 and Previewing 2019

TLDR

• Kaggle ended 2018 with 2.5MM members, up from 1.4MM at the end of 2017 and 840K when we were acquired in March 2017.
• We had 1.55MM logged-in users visit Kaggle in 2018, up 73% from 895K in 2017.
• In 2019, we aim to grow the community past 4MM members.

Kaggle Kernels

Kaggle Kernels is our hosted data science environment. It allows our users to author, execute, and share code written in Python and R. Kaggle Kernels entered 2018 as a data science scratchpad. In 2018, we added key pieces of functionality that make it a powerful environment. This includes the ability to use a GPU backend and collaborate with other users. We had 346K users author kernels in 2018, up 3.1x from 111K in 2017. Some of the most upvoted kernels from this year were:

Datasets

Kaggle's datasets platform allows our community to share datasets with each other. We currently have ~14K datasets that have been shared publicly by our community. We entered the year only supporting public datasets, which limited the use cases for our datasets. In 2018, we added the ability for datasets to be kept private or shared with collaborators. This makes Kaggle a good destination for projects that aren't intended to be publicly shared. We had 78K private datasets uploaded in 2018. We had 11K public datasets uploaded to Kaggle in 2018, up from 3.4K in 2017. 731K users downloaded datasets in 2018, up 2.2x from 335K in 2017. Some of the most downloaded datasets from this year include:

Competitions

Machine learning competitions were Kaggle's first product. Companies and researchers post machine learning problems and our community competes to build the most accurate algorithm. We launched 52 competitions in 2018, up from 38 in 2017. We had 181K users make submissions, up 48% from 122K in 2017. One of the most exciting competitions of 2018 was the second round of the $1.15MM Zillow Prize to improve the Zestimate home valuation algorithm.

The competitions team focused its product efforts towards support for kernels-only competitions, where users submit code rather than predictions. We launched 8 Kernels-only competitions in 2018. In 2019, we’re aiming to harden kernels-only support and use it for an increasing portion of our competitions, including targeting newer areas of AI such as reinforcement learning and GANs.

Kaggle InClass is a free version of our competitions platform that allows professors to host competitions for their students. In 2018, we hosted competitions for 2,247 classes, up from 1,217 in 2017.

We had 55K students submit to InClass competitions, up 77% from the 31K in 2017.

Kaggle Learn

We launched Kaggle Learn in 2018. Kaggle Learn is ultra short-form data science education inside Kaggle Kernels. Kaggle Learn grew from 3 courses at launch to 11 courses by year end. 143K users did the Kaggle Learn exercises in 2018.

Other highlights

As the amount of content on Kaggle increased dramatically in 2018, we have started putting meaningful emphasis on improving the discoverability of that content. This year, we added notifications, revamped our newsfeed and made improvements to search. Improving discoverability is going to continue to be a big theme in 2019.

We added an API to allow our users to programmatically interact with the major parts of our site.

We hosted our second annual machine learning and data science survey. With 24K responses, it is the world’s largest ML survey.

Focus for 2019

In 2019, we will continue to grow the community, with a goal of passing 4MM members. We aim to do this by:

• adding functionality that makes Kaggle Kernels and our datasets platform useful beyond learning and hobby projects, i.e., for real-world problems
• improving discoverability of the content on Kaggle: we have a huge number of kernels and datasets that users can build off, but it’s often hard for our users to find what they’re looking for
• transitioning competitions to start to run newer competition types (kernels-only, RL- and GAN-related competitions)
• continuing to create Kaggle Learn content to bring new machine learners to Kaggle

How You Can Help

Continue sharing your thoughts on our product, community, and platform. User feedback is invaluable in our development roadmap.

Thanks for being here!

Team Kaggle 2018

### Automated Machine Learning in Python

An organization can also reduce the cost of hiring many experts by applying AutoML in their data pipeline. AutoML also reduces the amount of time it would take to develop and test a machine learning model.
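To make the idea concrete, here is a minimal, stdlib-only sketch of what "automated" model development means: instead of a person hand-picking a model, a loop fits every candidate, scores each on held-out data, and keeps the winner. This is an illustration of the concept only, not the API of any particular AutoML library; the candidate models and data here are invented for the example.

```python
import random

random.seed(42)

# Toy data: y is roughly linear in x.
data = [(x, 2 * x + random.gauss(0, 1)) for x in [i / 10 for i in range(200)]]
train, valid = data[::2], data[1::2]

def fit_mean(pts):
    """Baseline model: always predict the mean of y."""
    m = sum(y for _, y in pts) / len(pts)
    return lambda x: m

def fit_linear(pts):
    """Closed-form simple linear regression y = a + b*x."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, pts):
    return sum((model(x) - y) ** 2 for x, y in pts) / len(pts)

# The "automated" part: try every candidate, score on held-out data, keep the best.
candidates = {"mean": fit_mean, "linear": fit_linear}
scores = {name: fit and mse(fit(train), valid) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
```

Real AutoML systems extend this same loop to preprocessing choices, model families, and hyperparameters, often with smarter search than brute force.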

### Summer Internships 2019

We are excited to announce the second formal summer internship program at RStudio. The goal of this program is to enable RStudio employees to collaborate with students to do work that will help both RStudio users and the broader R community, and help ensure that the community of R developers is as diverse as its community of users. Over the course of the internship, you will work with experienced data scientists, software developers, and educators to create and share new tools and ideas.

The internship pays approximately $12,000 USD (paid hourly), lasts 10-12 weeks, and will start around June 1 (depending on your availability). Applications are open now and close at the end of February. To qualify, you must currently be a student (broadly construed - if you think you’re a student, you probably qualify) and have some experience writing code in R and using Git and GitHub. To demonstrate these skills, your application needs to include a link to a package, Shiny app, or data analysis repository on GitHub. It’s OK if you create something specifically for this application: we just want to know that you’re already familiar with the mechanics of collaborative development in R.

RStudio is a geographically distributed team, which means you can be based anywhere in the United States (we hope to expand the program to support interns in other countries next year). This means that unless you are based in Boston or Seattle, you will be working 100% remotely, though you will meet with your mentor regularly online, and we will pay for you to travel to one face-to-face work sprint with them.

We are recruiting interns for the following projects:

- **Calibrated Peer Review** - Prototype some tools to conduct experiments to see whether calibrated peer review is a useful and feasible feedback strategy in introductory data science classes and industry workshops. (Mine Çetinkaya-Rundel)
- **Tidy Blocks** - Prototype and evaluate a block-based version of the tidyverse so that young students can do simple analysis using an interface like Scratch. (Greg Wilson)
- **Data Science Training for Software Engineers** - Develop course materials to teach basic data analysis to programmers using software engineering problems and data sets. (Greg Wilson)
- **Tidy Practice** - Develop practice projects for learners to tackle to practice tidyverse (or other) skills using interesting real-world data. (Alison Hill)
- **Teaching and Learning with RStudio** - Create a one-stop guide to teaching with RStudio, similar to Teaching and Learning with Jupyter (https://jupyter4edu.github.io/jupyter-edu-book/). (Alison Hill)
- **Object Scrubbers** - A lot of R objects contain elements that could be recreated, and these can result in large object sizes for large data sets. Also, terms, formulas, and other objects can carry the entire global environment with them when they are saved. This internship would help write a set of methods that would scrub different types of objects to reduce their size on disk. (Max Kuhn and Davis Vaughan)
- **Production Testing Tools for Data Science Pipelines** - This project will build on “applicability domain” methods from computational chemistry to create functions that can be included in a dplyr pipeline to perform statistical checks on data in production. (Max Kuhn)
- **Shiny Enhancements** - There are several Shiny and Shiny-related projects available, depending on the intern’s interests and skill set. Possible topics include: Shiny UI enhancements, improving performance bottlenecks by rewriting in C and C++, fixing bugs, and creating a set of higher-order reactives for more sophisticated reactive programming. (Barret Schloerke)
- **ggplot2 Enhancements** - Contribute to ggplot2 or an associated package (like scales). You’ll write R code for graphics, but mostly you’ll learn the challenges of managing a large, popular open source project, including the care needed to avoid breaking changes and actively gardening issues. Your work will impact the millions of people who use ggplot2. (Hadley Wickham)
- **R Markdown Enhancements** - R Markdown is a cornerstone product of RStudio used by millions to create documents in their own publishing pipelines. The code base has grown organically over several years; the goal of this project is to refactor it.
This involves tidying up inconsistencies in formatting, adding a comprehensive test suite, and improving the consistency and coverage of documentation. (Rich Iannone)

Apply now! The application deadline is February 22nd. RStudio is committed to being a diverse and inclusive workplace. We encourage applicants of different backgrounds, cultures, genders, experiences, abilities and perspectives to apply. All qualified applicants will receive equal consideration without regard to race, color, national origin, religion, sexual orientation, gender, gender identity, age, or physical disability. However, applicants must legally be able to work in the United States.

### Window Aggregate operator in batch mode in SQL Server 2019

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

So this came as a surprise when working on calculating simple statistics on my dataset, in particular min, max and median. The first two are trivial. The last one is the one that caught my attention. While looking for the fastest way to calculate the median for a given dataset, I stumbled upon an interesting thing: the window function performed super slowly, while calling R or Python using sp_execute_external_script outperformed the window function. This raised a couple of questions. But first, I created a sample table and populated it with sample rows:

DROP TABLE IF EXISTS t1;
GO

CREATE TABLE t1
 (id INT IDENTITY(1,1) NOT NULL
 ,c1 INT
 ,c2 SMALLINT
 ,t VARCHAR(10)
 )

SET NOCOUNT ON;
INSERT INTO t1 (c1,c2,t)
SELECT x.* FROM
( SELECT
  ABS(CAST(NEWID() AS BINARY(6)) %1000) AS c1
 ,ABS(CAST(NEWID() AS BINARY(6)) %1000) AS c2
 ,'text' AS t
) AS x
 CROSS JOIN (SELECT number FROM master..spt_values) AS n
 CROSS JOIN (SELECT number FROM master..spt_values) AS n2
GO 2

The query generated – in my case – a little over 13 million records, just enough to test the performance.
So, starting with calculating the median by sorting the first and second half of the rows respectively, the calculation time was surprisingly long:

-- Itzik Solution
SELECT (
 (SELECT MAX(c1) FROM
   (SELECT TOP 50 PERCENT c1 FROM t1 ORDER BY c1) AS BottomHalf)
 +
 (SELECT MIN(c1) FROM
   (SELECT TOP 50 PERCENT c1 FROM t1 ORDER BY c1 DESC) AS TopHalf)
) / 2 AS Median

Before and after each run, I cleared the stored execution plan. The execution on 13 million rows took – on my laptop – around 45 seconds.

The next query for the median calculation was a window function query:

SELECT DISTINCT
 PERCENTILE_CONT(0.5)
  WITHIN GROUP (ORDER BY c1)
  OVER (PARTITION BY (SELECT 1)) AS MedianCont
FROM t1

To my surprise, the performance was even worse, and at this point I have to say I was running this on SQL Server 2017 with CU7. But luckily, I also had SQL Server 2019 CTP 2.0 installed, and here, with no further optimization, the query ran in a little over 1 second. The difference between the versions was enormous. I could replicate the same results by switching the database compatibility level between 140 and 150, respectively.

ALTER DATABASE SQLRPY SET COMPATIBILITY_LEVEL = 140;
GO

SELECT DISTINCT
 PERCENTILE_CONT(0.5)
  WITHIN GROUP (ORDER BY c1)
  OVER (PARTITION BY (SELECT 1)) AS MedianCont140
FROM t1

ALTER DATABASE SQLRPY SET COMPATIBILITY_LEVEL = 150;
GO

SELECT DISTINCT
 PERCENTILE_CONT(0.5)
  WITHIN GROUP (ORDER BY c1)
  OVER (PARTITION BY (SELECT 1)) AS MedianCont150
FROM t1

The answer was found in the execution plan. When running the window function under compatibility level 140, the execution plan creates a nested loop twice, for both the upper and lower 50% of the dataset. This plan is somewhat similar to the explicit upper/lower 50% query, but with only one nested loop. The difference is that when running the window function calculation of the median on SQL Server 2017, the query optimizer chooses row execution mode for the built-in window function with WITHIN GROUP. This was, as far as I knew, not an issue since SQL Server 2016, where the batch mode operator for window aggregation was already in use. When switching to compatibility level 150 and running the same window function, the execution plan is as expected, and the window aggregate uses batch mode.

When calculating the median using R:

sp_Execute_External_Script
 @language = N'R'
 ,@script = N'd <- InputDataSet
OutputDataSet <- data.frame(median(d$c1))'
,@input_data_1 = N'select c1 from t1'
WITH RESULT SETS (( Median_R VARCHAR(100) ));
GO

or Python:

sp_Execute_External_Script
@language = N'Python'
,@script = N'
import pandas as pd
dd = pd.DataFrame(data=InputDataSet)
os2 = dd.median()[0]
OutputDataSet = pd.DataFrame({''a'':os2}, index=[0])'
,@input_data_1 = N'select c1 from t1'
WITH RESULT SETS (( Median_Python VARCHAR(100) ));
GO

Both execute and return the results in about 5 seconds, so there is no big difference between R and Python when handling 13 million rows for calculating simple statistics.

To wrap up: if you find yourself in a situation where you need to calculate – as in my case – the median or a similar statistic using a window function within group, R or Python will be the fastest solutions, followed by T-SQL. Unless you have the ability to use SQL Server 2019, in which case T-SQL is your best choice.

Code and the plans, used in this blog post are available, as always at Github.


### Causal Inference 2: Illustrating Interventions via a Toy Example

Last week I had the honor to lecture at the Machine Learning Summer School in Stellenbosch, South Africa. I chose to talk about Causal Inference, despite being a newcomer to this whole area. In fact, I chose it exactly because I'm a newcomer: causal inference has been a blind spot for me for a long time. I wanted to communicate some of the intuitions and ideas I learned and developed over the past few months, which I wish someone had explained to me earlier.

Now, I'm turning my presentation into a series of posts, starting with this one, building on the previous one I wrote in May. In this one, I will present the toy example I used in my talk to explain interventions. I call this the three scripts toy example. I was not sure if people are going to get it, but I got good feedback on it from the audience, so I'm hoping you will find it useful, too.

## Three scripts

Imagine you teach a programming course and you ask students to write a python script that samples from a 2D Gaussian distribution with a certain mean and covariance. Some of the solutions will be correct, but as there are multiple ways to sample from a Gaussian, you might see very different solutions. For example, here are three scripts that would implement the same, correct sampling behaviour:
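The original snippets are not reproduced here, so below is a hypothetical reconstruction consistent with the behaviour described in this post: three scripts that all sample the same joint Gaussian over $(x, y)$ (mean $(1, 1)$, with $\mathrm{Var}(x)=1$, $\mathrm{Var}(y)=5$, $\mathrm{Cov}(x,y)=2$), but with different generative orderings. The exact constants are my own choices, picked so that the later intervention results match the plots described in the text.

```python
import random

def blue():
    # x first, then y as a function of x  (x -> y)
    x = random.gauss(1, 1)
    y = 2 * x - 1 + random.gauss(0, 1)
    return x, y

def green():
    # y first, then x as a function of y  (y -> x)
    y = random.gauss(1, 5 ** 0.5)
    x = 1 + 0.4 * (y - 1) + random.gauss(0, 0.2 ** 0.5)
    return x, y

def red():
    # a latent common cause z drives both  (x <- z -> y)
    z = random.gauss(0, 1)
    x = 1 + z
    y = 1 + 2 * z + random.gauss(0, 1)
    return x, y
```

Sampling repeatedly from any of the three produces statistically indistinguishable clouds of $(x, y)$ points, which is exactly the setup the post needs.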

Below each of the code snippets I plotted samples drawn by repeatedly executing the scripts. As you can see, all three scripts produce the same joint distribution between $x$ and $y$. You can feed these distributions into a two-sample test, and you will find that they are indeed indistinguishable from each other.

Based on the joint distribution the three scripts are indistinguishable.

## Interventions

But despite the three scripts being equivalent in that they generate the same distribution, they are not exactly the same. For example, they behave differently if we interfere with, or intervene in, the execution.

Consider this thought experiment: I am a hacker, and I can inject code to the python interpreter. For every line of code from the snippet, I can insert a line of code of my choice. Let's say that I really want to set the value of $x$ to $3$, so I use my code injection ability and insert the line x=3 after each line of code of yours. So what actually gets executed is this:
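Using the same hypothetical reconstructions as above (my own stand-ins for the post's actual snippets), the hacked versions of two of the scripts would look like this, with the injected `x = 3` after every original line:

```python
import random

def blue_hacked():
    # original: x -> y, with x = 3 injected after each line
    x = random.gauss(1, 1)
    x = 3                                        # injected
    y = 2 * x - 1 + random.gauss(0, 1)
    x = 3                                        # injected
    return x, y

def green_hacked():
    # original: y -> x; y is sampled before x is ever touched
    y = random.gauss(1, 5 ** 0.5)
    x = 3                                        # injected
    x = 1 + 0.4 * (y - 1) + random.gauss(0, 0.2 ** 0.5)
    x = 3                                        # injected (this one wins)
    return x, y
```

Note the asymmetry: in the blue script the injection happens before $y$ is computed, so $y$ inherits the forced value of $x$; in the green script $y$ has already been sampled, so forcing $x$ cannot touch it.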

We can now run the scripts in this hacked interpreter and see how the intervention changes the distribution of $x$ and $y$:

Of course, we see that the value of $x$ is no longer random; it's deterministically set to $3$, which results in all samples lining up along the $x=3$ vertical line. But, interestingly, the distribution of $y$ is different for the different scripts. In the blue script, $y$ has a mean around $5$ while the green and red scripts produce a distribution of $y$ centered around a mean of $1$. Here is a better look at the marginal distribution of $y$ under the intervention:

I labelled this plot $p(y\vert do(X=3))$ which semantically means the distribution of $y$ under the intervention where I set the value of $X$ to $3$. This is generally different from the conditional distribution $p(y\vert x=3)$, which of course is the same for all three scripts. Below I show these conditionals - excuse me for the massive estimation errors here, I was lazy creating these plots, but believe me they technically are all the same:

The important point here is that

the scripts behave differently under intervention.

We have a situation where the scripts are indistinguishable when you only look at the joint distribution of the samples they produce, yet they behave differently under intervention.

Consequently,

the joint distribution of data alone is insufficient to predict behaviour under interventions.

## Causal Diagrams

If the joint distribution is insufficient, what level of description would allow us to make predictions about how the scripts behave under intervention? If I have the full source code, I can of course execute the modified scripts, i.e. run an experiment and directly observe how the intervention affects the distribution.

However, it turns out, you don't need the full source code. It is sufficient to know the causal diagram corresponding to the source code. The causal diagram encodes causal relationships between variables, with an arrow pointing from causes to effects. Here is what the causal diagrams would look like for these scripts:

We can see that, even though they produce the same joint distribution, the scripts have different causal diagrams. And this additional knowledge of the causal structure allows us to make inferences about the intervention without actually running experiments with that intervention. To do this in a general setting, we can use do-calculus, explained in a little more detail in my earlier post.

Graphically, to simulate the effect of an intervention, you mutilate the graph by removing all edges that point into the variable on which the intervention is applied, in this case $x$.

At the top row you see the three diagrams that describe the three scripts. In the second row are the mutilated graphs where all incoming edges to $x$ have been removed. In the first script, the graph looks the same after mutilation. From this, we can conclude that $p(y\vert do(x)) = p(y\vert x)$, i.e. that the distribution of $y$ under intervention $X=3$ is the same as the conditional distribution of $y$ conditioned on $X=3$. In the second script, after mutilation, $x$ and $y$ become disconnected, therefore independent. From this, we can conclude that $p(y\vert do(X=3)) = p(y)$. Changing the value of $x$ does nothing to change the value of $y$, so whatever you set $X$ to be, $y$ is just going to sample from its marginal distribution. The same argument holds for the third causal diagram.

The significance of this is the following: By only looking at the causal diagram, we are now able to predict how the scripts are going to behave under the intervention $X=3$. We can compute and plot $p(y\vert do(X=3))$ for the three scripts by only using data observed during the normal (non-intervened) condition, without ever having to run the experiment or simulate the intervention.

The causal diagram allows us to predict how the models will behave under intervention, without carrying out the intervention

Here is proof of this: I could estimate the distribution of $y$ observed during the intervention experiment using only samples from the script under the normal (non-intervention) situation. This is called causal inference from observational data.
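The same estimation can be sketched in a few lines. The generative script below is again my hypothetical $x \rightarrow y$ reconstruction, not the post's actual code: because the first diagram has no incoming edges into $x$, $p(y\vert do(X=3)) = p(y\vert x=3)$, which we can estimate by conditioning observational samples on a narrow bin around $x=3$; for the other two diagrams, mutilation disconnects $x$ from $y$, so $p(y\vert do(X=3)) = p(y)$, the plain marginal.

```python
import random

random.seed(0)

# Observational samples from the (hypothetical) x -> y script:
# x ~ N(1,1), y = 2x - 1 + N(0,1).
samples = []
for _ in range(200_000):
    x = random.gauss(1, 1)
    y = 2 * x - 1 + random.gauss(0, 1)
    samples.append((x, y))

# Diagram 1 (x -> y): p(y|do(x)) = p(y|x).
# Estimate E[y | x ~= 3] from samples falling in a narrow bin around 3.
near3 = [y for x, y in samples if abs(x - 3) < 0.1]
e_do_script1 = sum(near3) / len(near3)                      # ~ 2*3 - 1 = 5

# Diagrams 2 and 3 (y -> x, or x <- z -> y): mutilation disconnects x,
# so p(y|do(x)) = p(y); the marginal mean of y is the answer.
e_do_scripts23 = sum(y for _, y in samples) / len(samples)  # ~ 1
```

Neither estimate required running the intervention; only observational samples plus the causal diagram were used.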

## The morale of the story

The morale of this story is summed up in the following picture:

Let's consider all the questions you would want to be able to answer given some data (i.i.d. samples from a joint distribution). Having access to data, or more generally, the joint distribution it was sampled from, allows you to answer a great many questions, and solve many tasks. For example, you can do supervised machine learning by approximating $p(y \vert x)$ and then use it for many things, such as image labelling. These questions together make up the blue set.

However, we have already seen that some questions cannot be answered using data/joint distribution alone. Notably, if you want to make predictions about how the system you study would behave under certain interventions or perturbations, you typically won't be able to make such inferences based on the data you have. These types of questions lie outside the blue set.

However, if you complement your data with causal assumptions encoded in a causal diagram - a directed acyclic graph where nodes are your variables - you can exploit these extra assumptions to start answering these questions, shown by the green area.

I am also showing an even larger set of questions still, but I won't tell you what this refers to just yet. I'm leaving that to a future post.

## Where does the diagram come from?

The questions I got most frequently after the lecture were these: what if I don't know what the graph looks like? And what if I get the graph wrong? There are many ways to answer these questions.

To me, a rehabilitated Bayesian, the most appealing answer is that you have to accept that your analysis is conditional on the graph you choose, and your conclusions are valid under the assumptions encoded there. In a way, causal inference from observational data is subjective. When you publish a result, you should caveat it with "under these assumptions, this is true". Readers can then dispute and question your assumptions if they disagree.

As to how to obtain a graph, it varies by application. If you work with online systems such as a recommender system, the causal diagram is actually pretty simple to draw, as it corresponds to how various subsystems are hooked up and feed data to one another. In other applications, notably in healthcare, a little more guesswork and thought may be involved.

Finally, you can use various causal discovery techniques to try to identify the causal diagram from the data itself. Theoretically, recovering the full causal graph from data is impossible in the general case. However, if you add certain additional smoothness or independence assumptions to the mix, you may be able to recover the graph from the data with some reliability.

## Summary

We have seen that modeling the joint distribution can only get you so far, and if you want to predict the effect of interventions, i.e. calculate $p(y\vert do(x))$-like quantities, you have to add a causal graph to your analysis.

An optimistic take on this is that, in the i.i.d. setting, drawing a causal diagram and using do-calculus (or something equivalent) can significantly broaden the set of problems you can tackle with machine learning.

The pessimist's take on this is that if you are not aware of all this, you might be trying to address questions in the green bit, without realizing that the data won't be able to give you an answer to such questions.

Whichever way you look at it, causal inference from observational data is an important topic to be aware of.

### Let Curiosity Drive: Fostering Innovation in Data Science

[This blog post is a more nuanced version of my HBR article titled Curiosity Driven Data Science.]

The real value of data science lies not in making existing processes incrementally more efficient but rather in the creation of new algorithmic capabilities that enable step-function changes in value. However, such capabilities are rarely asked for in a top-down fashion. Instead, they are discovered and revealed through curiosity-driven tinkering by data scientists. For companies ready to jump on the data science bandwagon I offer this advice: think less about how data science will support and execute your plans and think more about how to create an environment to empower your data scientists to come up with ideas you’ve never dreamed of.

At Stitch Fix, we have more than 100 data scientists who have created several dozens of algorithmic capabilities, generating 100s of millions of dollars in benefits. We have algorithms for recommender systems, merchandise buying, inventory management, client relationship management, logistics, operations—we even have algorithms for designing clothes! Each provides material and measurable returns, enabling us to better serve our clients, while providing a protective barrier against competition. Yet, virtually none of these capabilities were asked for—not by executives, product managers, or domain experts, and not even by a data science manager. Instead, they were born out of curiosity and extracurricular tinkering by data scientists.

Data scientists are a curious bunch, especially the talented ones. They work towards stated goals, and they are focused on and accountable to achieving certain performance metrics. But they are also easily distracted—in a good way. In the course of doing their work they stumble on various patterns, phenomena, and anomalies that are unearthed during their data sleuthing. This goads the data scientist’s curiosity: “Is there a latent dimension that can characterize a client’s style?” “If we modeled clothing fit as a distance measure could we improve client feedback?” “Can successful features from existing styles be systematically re-combined to create better ones?” Such curiosity questions can be insatiable, and the data scientist knows the answers lie hidden in the reams of historical data. Tinkering ensues. They don’t ask permission (eafp). In some cases, explanations can be found quickly, in only a few hours or so. Other times, it takes longer because each answer evokes new questions and hypotheses, leading to more tinkering. But the work is contained to non-sanctioned side-work, at least for now. They’ll tinker on their own time if they need to—evenings and weekends if they must. Yet, no one asked them to; curiosity is a powerful force.

Are they wasting their time? No! Data science tinkering is typically accompanied by evidence for the merits of the exploration. Statistical measures like AUC, RMSE, and R-squared quantify the amount of predictive power the data scientist’s exploration is adding. They are also equipped with the business context to allow them to assess viability and potential impact of a solution that leverages their new insights. If there is no “there” there, they stop. But, when compelling evidence is found and coupled with big potential, the data scientist is emboldened. The exploration flips from being curiosity-driven to impact-driven. “If we incorporate this latent style space into our styling algorithms we can better recommend products.” “This fit feature will materially increase client satisfaction.” “These new designs will do very well with this client segment.” Note the difference in tone. Much of the uncertainty has been allayed and replaced with potential impact. No longer satisfied with mere historical data, the data scientist is driven to more rigorous methods—randomized controlled trials or “AB Testing,” which can provide true causal impact. She wants to see how her insights perform in real life. She cobbles together a new algorithm based on the newly revealed insights and exposes it to a sample of clients in an experiment. She’s already confident it will improve the client experience and business metrics, but she needs to know by how much. If the experiment yields a big enough win, she’ll roll it out to all clients. In some cases, it may require additional work to build a robust capability around her new insights. This will almost surely go beyond what can be considered “side work” and she’ll need to collaborate with others for engineering and process changes. But she will have already validated her hypothesis and quantified the impact, giving her a clear case for its prioritization within the business.

The essential thing to note here is that no one asked the data scientists to explore. Managers, PMs, domain experts—none of them saw the unexplained phenomenon that the data scientist stumbled upon. This is what tipped her off to start tinkering. And, the data scientist didn’t have to ask permission to explore because it’s low-cost enough that it just happens fluidly in the course of their work, or they are compelled by curiosity to flesh it out on their own time. In fact, if they had asked permission to explore their initial itch, managers and stakeholders probably would have said “no.” The insights and resulting capabilities are often so unintuitive and/or esoteric that, without the evidence to support it, they don’t seem like a good use of time or resources.

These two things—low cost exploration and empirical evidence—set data science apart from other business functions. Sure, other departments are curious too: “I wonder if clients would respond better to this type of creative?” a marketer might ponder. “Would a new user interface be more intuitive?” a product manager inquires, etc. But those questions can’t be answered with historical data. Exploring those ideas requires actually building something, which is costly. And justifying the cost is often difficult since there’s no evidence that suggests the ideas will work. But with data science’s low-cost exploration and risk-reducing evidence, more ideas are explored which, in turn, leads to more innovation.

Sounds great, right? It is! But this doesn’t happen by will alone. You can’t just declare as an organization that “we’ll do this too.” This is a very different way of doing things. Many established organizations are set up to resist change. Such a new approach can create so much friction with the existing processes that the organization rejects it in the same way antibodies attack a foreign substance entering the body. It’s going to require fundamental changes to the organization that extend beyond the addition of a data science team. You need to create an environment in which it can thrive.

First, you have to position data science as its own entity. Don’t bury it under another department like marketing, product, engineering, etc. Instead, make it its own department, reporting to the CEO. In some cases the data science team can be completely autonomous in producing value for the company. In other cases, it will need to collaborate with other departments to provide solutions. Yet, it will do so as equal partners—not as a support staff that merely executes on what is asked of them. Recall that most algorithmic capabilities won’t be asked for; they are discovered through exploration. So, instead of positioning data science as a supportive team in service to other departments, make it responsible for business goals. Then, hold it accountable to hitting those goals—but enable the data scientists to come up with the solutions.

Next, you need to equip the data scientists with all the technical resources they need to be autonomous. They’ll need full access to data as well as the compute resources to process their explorations. Requiring them to ask permission or request resources will impose a cost and less exploration will occur. My recommendation is to leverage a cloud architecture where the compute resources are elastic and nearly infinite.

The data scientists will also need to have the skills to provision their own processors and conduct their own exploration. They will have to be great generalists. Most companies divide their data scientists into teams of functional specialists—say, Modelers, Machine Learning Engineers, Data Engineers, Causal Inference Analysts, etc. While this may provide greater focus, it also necessitates coordination among many specialists to pursue any exploration. This increases costs and fewer explorations will be conducted. Instead, leverage “full-stack data scientists” that possess varied skills to do all the specialty functions. Of course, data scientists can’t be experts in everything. Providing a robust data platform can help abstract them from the intricacies of distributed processing, auto-scaling, graceful degradation, etc. This way the data scientist focuses more on driving business value through testing and learning, and less on technical specialty. The cost of exploration is lowered and therefore more things are tried, leading to more innovation.

Finally, you need a culture that will support a steady process of learning and experimentation. This means the entire company must have common values for things like learning by doing, being comfortable with ambiguity, and balancing long- and short-term returns. These values need to be shared across the entire organization as they cannot survive in isolation.

Before you jump in and implement this at your company, be aware that it will be hard if not impossible to implement at an older, more established company. I’m not sure it could have worked, even at Stitch Fix, if we hadn’t enabled data science to be successful from the very beginning. Data Science was not “inserted” into the organization. Rather, data science was native to us even in the formative years, and hence, the necessary ways-of-working are more natural.

This is not to say data science is necessarily destined for failure at older, more mature companies, although it is certainly harder than starting from scratch. Some companies have been able to pull off miraculous changes. It’s too important not to try. The benefits of this model are substantial, and for companies that have the data assets to create a sustaining competitive advantage through algorithmic capabilities, it’s worth considering whether this approach can work for you.

#### Postscript

People often ask me, “Why not provide the time for data scientists to be creative like Google’s 20 percent time?” We’ve considered this several times after seeing many successful innovations emerge from data science tinkering. In spirit, it’s a great idea. Yet, we have concerns that a structured program for innovation may have unintended consequences.

Such programs may be too open-ended and lead to research where there is no actual problem to solve. The real examples I depicted above all stemmed from observation—patterns, anomalies, an unexplained phenomenon, etc. They were observed first and then researched. It’s less likely to lead to impact the other way around.

Structured programs may also set expectations too high. I suspect there would be a tendency to think of the creative time as a PhD dissertation, requiring novelty and a material contribution to the community (e.g., “I’d better consult with my manager on what to spend my 20 percent time on”). I’d prefer a more organic process that is driven from observation. The data scientists should feel no shame in switching topics often or modifying their hypotheses. They may even find that their stated priorities are the most important thing they can be doing.

So, instead of a structured program, let curiosity drive. By providing ownership of business goals and generalized roles, tinkering and exploration become a natural and fluid part of the role. In fact, it’s hard to quell curiosity. Even if one were to explicitly ban curiosity projects, data scientists would simply conduct their explorations surreptitiously. Itches must be scratched!

### A ladder of responses to criticism, from the most responsible to the most destructive

In a recent discussion thread, I mentioned how I’m feeling charitable toward David Brooks, Michael Barone, and various others whose work I’ve criticized over the years, because their responses have been so civilized and moderate.

Consider the following range of responses to an outsider pointing out an error in your published work:

1. Look into the issue and, if you find there really was an error, fix it publicly and thank the person who told you about it.

2. Look into the issue and, if you find there really was an error, quietly fix it without acknowledging you’ve ever made a mistake.

3. Look into the issue and, if you find there really was an error, don’t ever acknowledge or fix it, but be careful to avoid this error in your future work.

4. Avoid looking into the question, ignore the possible error, act as if it had never happened, and keep making the same mistake over and over.

5. If forced to acknowledge the potential error, actively minimize its importance, perhaps throwing in an “everybody does it” defense.

6. Attempt to patch the error by misrepresenting what you’ve written, introducing additional errors in an attempt to protect your original claim.

7. Attack the messenger: attempt to smear the people who pointed out the error in your work, lie about them, and enlist your friends in the attack.

We could probably add a few more rungs to the ladder, but the basic idea is that response 1 is optimal, responses 2 and 3 are unfortunate but understandable, response 4 represents at the very least a lost opportunity for improvement, and responses 5, 6, and 7 increasingly pollute the public discourse.

David Brooks is a pretty solid 4 on that scale, which isn’t great but in retrospect is like a breath of fresh air, given the 6’s and 7’s we’ve been encountering lately.

Most of the responses I’ve seen, in academic research and also the news media, have been 1’s. Or, at worst, 2’s and 3’s. From that perspective, Brooks’s stubbornness (his 4 on the above scale) has been frustrating. But it can be, and has been, much worse. So I appreciate that, however Brooks handles criticism of his own writing, he does not go on the attack. Similarly, I was annoyed when Gregg Easterbrook did response 2, but, in retrospect, that 2 doesn’t seem so bad at all.

As I said, I put the above into a comment thread, but I thought it’s something we might want to refer to more generally, so it’s convenient to give it its own post.

### Comparing Machine Learning Models: Statistical vs. Practical Significance

Is model A or B more accurate? Hmm… In this blog post, I’d love to share my recent findings on model comparison.
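The post itself is not excerpted here, but the standard starting point for the statistical half of that question is a paired test on the two models' predictions over the same test set, for example McNemar's exact test on the discordant cases. A minimal sketch (not necessarily the post's method; the counts below are made up):

```python
from math import comb

def mcnemar_exact(n01, n10):
    """Exact McNemar test on paired classifier predictions.

    n01 = test cases only model A gets wrong,
    n10 = test cases only model B gets wrong.
    Under the null of equal accuracy, the discordant cases split 50/50,
    so the two-sided p-value is a doubled binomial tail probability.
    """
    n = n01 + n10
    k = min(n01, n10)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Example: out of the discordant test cases, A alone errs on 15, B alone on 35.
p = mcnemar_exact(15, 35)
print(p < 0.01)  # a small p-value: the accuracy gap is unlikely to be chance
```

Practical significance is the separate question: with a large enough test set even a tiny accuracy gap produces a small p-value, so the discordant counts should also be judged against the size of the test set and the cost of an error.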

### Four short links: 18 January 2019

Remove Filters, Quantum Cables, Embedded Vision, and Citizen Developers

1. Desnapify -- deep convolutional generative adversarial network (DCGAN) trained to remove Snapchat filters from selfie images.
2. Quantum Computer Component Shortage (MIT TR) -- cables for superconducting quantum computing experiments turn out to be hard to find at Radio Shack. Reminder: QC is in its infancy.
3. SOD -- an embedded computer vision and machine learning library (CPU optimized and IoT capable).
4. Devsumer -- interesting argument: lots of people have exposure to programming via Hour of Code-type things, and IT departments are too busy to build all the apps people want, so [a] number of products have emerged that allow people to build simple software applications, or to use templated applications for their own workflow or productivity. You can think of this as taking a SQL database or Excel spreadsheet and turning it into an app platform.

### Personality quiz with traits on a spectrum

Ah, the online personality quiz, oh how I missed you. Oh wait, this one is slightly different. For FiveThirtyEight, Maggie Koerth-Baker and Julia Wolfe provide a quiz used by psychologists to gauge personality traits:

First, the Big Five doesn’t put people into neat personality “types,” because that’s not how personalities really work. Instead, the quiz gives you a score on five different traits: extraversion, agreeableness, conscientiousness, negative emotionality and openness to experience. For each of those traits, you’re graded on a scale from 0 to 100, depending on how strongly you associate with that trait. So, for example, this quiz won’t tell you whether you’re an extravert or an introvert — instead, it tells you your propensity toward extraversion. Every trait is graded on a spectrum, with a few people far out on the extremes and a lot of people in the middle.

Dang it. I really wanted to know what Harry Potter character I am.

### Rcrastinate is moving.

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

Hi all, this is just an announcement.

I am moving Rcrastinate to a blogdown-based solution and am therefore leaving blogger.com. If you’re interested in the new setup and how you could do the same yourself, please check out the all shiny and new Rcrastinate over at

In my first post over there, I am giving a short summary on how I started the whole thing. I hope that the new Rcrastinate is also integrated into R-bloggers soon.

Thanks for being here, see you over there.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### POPL 2019 Most Influential Paper Award for research that led to Facebook Infer

We’re excited to congratulate Cristiano Calcagno, Dino Distefano, Peter W. O’Hearn (Facebook) and Hongseok Yang (KAIST) on receiving the ACM SIGPLAN Most Influential POPL Paper Award for their paper “Compositional shape analysis by means of bi-abduction,” which they presented at POPL 2009. The techniques described in this paper were utilized in the tool Infer, which was open-sourced in 2015 and is also used at other major tech companies.

Cristiano, Dino and Peter all joined the Facebook London engineering team in 2013 with the acquisition of Monoidics, a startup they founded. We recently caught up with the three of them to learn more about the research paper that won the Most Influential Paper Award at POPL 2019.

Q: What was the research about?

It was about writing a computer program that can show (prove) that another program does not have memory safety errors — like crashes from accessing null or undefined pointers, and memory leaks. This kind of program can also find bugs when the attempt to show their absence fails. This work is part of research areas known as program verification and automatic program analysis. At the time, reasoning accurately about memory was one of the main open problems in the field, holding back the application of reasoning techniques to practical industrial code.
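The class of errors being proved absent can be pictured with a toy example. Python stands in here for the C-family code these tools actually analyze, and the functions are invented for illustration:

```python
def find_user(users, name):
    """Returns the matching record, or None if absent -- the 'null' case."""
    for u in users:
        if u["name"] == name:
            return u
    return None

def greeting_unsafe(users, name):
    # BUG: dereferences the result without a None check; crashes when absent.
    # This is the kind of defect a memory-safety prover tries to rule out.
    return "Hello, " + find_user(users, name)["name"]

def greeting_safe(users, name):
    u = find_user(users, name)  # the analysis asks: can u be None here?
    return "Hello, " + u["name"] if u is not None else "Hello, stranger"

users = [{"name": "Ada"}]
print(greeting_safe(users, "Ada"))    # Hello, Ada
print(greeting_safe(users, "Grace"))  # Hello, stranger
```

Showing that `greeting_safe` can never crash, for every possible input, is a proof of memory safety; failing to show it for `greeting_unsafe` surfaces the bug.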

Q: What led you to do the initial research?

Partial success, followed by daring to dream what might be possible if we went far beyond what was possible at the time.

It started when we (the four authors, Cristiano, Dino, Peter, Hongseok) met in a cafe in the Shoreditch area of East London to talk about research ideas. We had published a paper at the CAV ’08 conference which proved memory safety properties for C programs up to 10k lines of code, and found some bugs in a Microsoft device driver that had been confirmed and fixed by the Windows kernel team. This was absolutely at the leading edge of the research literature. But one of us, Cristiano, was very dissatisfied. “These results aren’t good enough,” he said, because “10k is too small.” Cristiano and Dino were dreaming about using our techniques to start a company, and Cristiano saw that we needed to go much further than the academic leading edge at the time if we were to tackle the kinds of large codebases one finds in the major companies that might be clients.

“What is the main problem blocking application to one million lines of code?” Cristiano then asked. It was an outlandish question, because the leading research had seldom reached even one thousand lines for this sort of problem, let alone a million. Spurred by Cristiano, Peter shot back: “We need a new, compositional version of our techniques.” Peter outlined the basic scheme for an analysis where little bits of proof would be done independently and then stitched together, and he gave an argument for how it should scale if we could do it…but he did not explain how the scheme could actually be realized; it might have just been a fanciful idea.

We all, the four authors, set about working on converting this so-far-fanciful scheme — automatic, compositional memory safety proving (shape analysis, in the jargon) — into something real. Hongseok made a valuable step, based on mining information from failed proofs to automatically discover preconditions for individual functions. Then Dino made a breakthrough, in discovering a logical concept (that we eventually called bi-abduction) to do the stitching that Peter had asked for. Dino wrote a prototype reasoning tool that had encouraging initial results, and it was clear we were onto something. We all worked on the theory that would become the POPL paper, and Cristiano led the further implementation effort including a new species of automatic theorem prover, a “bi-abductive prover.” The new reasoning technique soon powered its way to over a million lines of code in an experiment, we had a nice celebration, and then set about writing the POPL ’09 paper to tell the world about it.
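The compositional idea can be sketched in a few lines of toy code. This is only the flavor of the approach, not the bi-abduction algorithm from the paper: each function gets a summary computed once, in isolation, and callers are checked against callee summaries alone, which is what makes the analysis scale and re-run incrementally on just the changed functions.

```python
from functools import cache

# Toy "program": each function either returns a possibly-null value,
# or calls another function and may or may not null-check the result.
PROGRAM = {
    "lookup":           {"calls": None,     "returns_null": True},
    "caller_checked":   {"calls": "lookup", "checks_null": True},
    "caller_unchecked": {"calls": "lookup", "checks_null": False},
}

@cache
def may_return_null(name):
    """Per-function summary, computed independently of any call site."""
    return PROGRAM[name].get("returns_null", False)

def check(name):
    """Stitch the caller's body against the callee's summary."""
    f = PROGRAM[name]
    # Bug: dereferencing a possibly-null result without a null check.
    return may_return_null(f["calls"]) and not f["checks_null"]

print(check("caller_checked"))    # False: guarded dereference
print(check("caller_unchecked"))  # True: potential null dereference
```

The point of the real technique is that `lookup` is analyzed once no matter how many callers it has, and when a caller changes, only that caller needs re-checking against the cached summaries.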

Q: What happened when the paper was originally published, and how was it received in the community then?

The paper got a good reception at POPL, but an even better reception when we presented demos. Cristiano has a slogan — seeing is believing — that we kept in mind when thinking how to present our ideas.

Dino worked out a demo where we would analyze a program of a few hundred thousand lines of code (often, OpenSSL), and this would take 20 minutes or so, so that the analysis was finished during a talk we would be giving. Then, we would fix one of the bugs the analysis had found, re-run the analysis, and confirm that the bug was gone, all right on the spot in front of the audience. The re-run would take a few seconds rather than tens of minutes, because we only needed to re-analyze the changed part of the program: the algorithm was incremental. This illustrated very concretely how the reasoning technique might be used during program development, not only in a long-running batch mode. This often caused excitement in the audience, because the potential of the technique was clear to see. Incidentally, it is this incrementality that led to subsequent impact at Facebook (as we describe in the next question).

Although the majority of the reaction was enthusiastic, a small number of insiders in the program analysis research community reacted less positively. Although we went from a thousand to a million lines of code, we received complaints that the million-line-analysis did not do the exact same thing as the thousand-line-variant. We had been up front and admitted as much — our new algorithm sometimes lost precision — but some people thought the job of the compositional technique should be to match the slow-running one in everything except speed. In hindsight, moving this sort of proving to big codebases was entering a new and unfamiliar arena, and perhaps not everybody appreciated how many opportunities scaling to big code could offer.

Q: How has the work been built upon?

Shortly after the POPL paper, Cristiano and Dino founded a startup company, Monoidics, to commercialize the techniques, and they convinced Peter to join them. After hardening the techniques and trying them out with some customers, Monoidics was acquired in 2013, and a team of seven including Cristiano, Dino and Peter joined the Facebook London engineering office. Their tool, Infer, was deployed inside Facebook in early 2014, and since then has discovered over 100,000 issues which have been fixed by Facebook’s developers before code reaches production.

In the development of Infer the techniques from the POPL paper were generalized to other reasoning techniques, but all of the checks in Infer share the compositional analysis architecture and incremental algorithmic basis of the original Infer. Infer participates as a bot during Facebook’s code review process, where it automatically comments on code modifications that are submitted for check-in to our code bases. Although the code bases are large — in the tens of millions of lines of code — Infer runs quickly on the code changes in an incremental fashion, reminiscent of the early Infer demos described previously.

Infer is run on the codebases for the WhatsApp, Messenger, Instagram and Facebook mobile apps. Thus the techniques in the POPL paper have directly affected apps used by billions of people on a regular basis. Infer was open-sourced in 2015 and is also used at other companies, including Amazon, Mozilla and Spotify.

Q: Were there any surprises along the way?

We knew that speed, scale and incrementality were important, but just how important took us by surprise.

When we landed in the company we learned that Facebook’s codebases were being modified at what seemed like a breathtaking rate, to the tune of hundreds of thousands of times per week. Infer would have to run on all these code changes. This was far beyond anything we had imagined at the time of the POPL paper, and even far beyond anything we had ever heard about for a deployment of logical reasoning. It turned out that Infer’s incrementality meant that it was up to the task: thousands of analysis runs per day would each perform the little bits of independent proof made possible by Infer’s compositional technique.

Scale was one thing, but an even bigger surprise was how important incrementality — dealing quickly with code changes — was for addressing the human side of the engineering problem we faced. We first tried a “batch” deployment of Infer, where it was run overnight and produced bug lists that developers might act on. Issues were presented to developers outside their workflow and the fix rate — the rate at which they fixed the issues discovered — was near 0%. Then, when we switched on Infer in an incremental-online mode, as a bot during code review, the fix rate shot up to 70%. We were stunned. Although the POPL paper had provided the techniques to enable this form of analysis, we had no idea at the time how powerful the effect of its fast incremental analysis of code changes would be.

In retrospect, we were fortunate that Facebook perceived the importance of the scale and incrementality in the POPL ’09 techniques. We now recall that when we talked with Facebook about potentially joining, it was laser focused on the limitations of batch deployment (the typical model presumed in academia at the time) and could see the potential of our techniques, even if we did not fully appreciate it at the time.

Q: What is your current focus?

We’re working on a variety of problems of importance to the company, including problems related to concurrent programming, to program performance, and to UI programming. Stay tuned!

### Factor Analysis in R with Psych Package: Measuring Consumer Involvement

(This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers)

The post Factor Analysis in R with Psych Package: Measuring Consumer Involvement appeared first on The Lucid Manager.

The first step for anyone who wants to promote or sell something is to understand the psychology of potential customers. Getting into the minds of consumers is often problematic because measuring psychological traits is a complex task. Researchers have developed many parameters that describe our feelings, attitudes, personality and so on. One of these measures is consumer involvement, which is a measure of the attitude people have towards a product or service.

The most common method to measure psychological traits is to ask people a battery of questions. Analysing these answers is complicated because it is difficult to relate the responses to a survey to the software of the mind. While the answers given by survey respondents are the directly measured variables, what we would like to know are the hidden (latent) states in the mind of the consumer. Factor Analysis is a technique that helps to discover latent variables within a set of response data, such as a customer survey.

The basic principle of measuring consumer attitudes is that the consumer’s state of mind causes them to respond to questions in a certain way. Factor analysis seeks to reverse this causality by looking for patterns in the responses that are indicative of the consumer’s state of mind. Using a computing analogy, factor analysis is a technique to reverse-engineer the source code by analysing the input and output.

This article introduces the concept of consumer involvement and how it can be predictive of other important marketing metrics such as service quality. An example using data from tap water consumers illustrates the theory. The data collected from these consumers is analysed using factor analysis in R, using the psych package.

## What is Consumer Involvement?

Involvement is a marketing metric that describes the relevance of a product or service in somebody’s life. Judy Zaichkowsky defines consumer involvement formally as “a person’s perceived relevance of the object based on inherent needs, values, and interests”. People who own a car will most likely be highly involved with purchasing and driving the vehicle due to the money involved and the social role it plays in developing their public self. Consumers will most likely have a much lower level of involvement with the instant coffee they drink than with the clothes they wear.

From a managerial point of view, involvement is crucial because it is causally related to willingness to pay and perceptions of quality.  Consumers with a higher level of involvement are willing to pay more for a service and have a more favourable perception of quality. Understanding involvement in the context of urban water supply is also important because sustainably managing water as a common pool resource requires the active involvement of all users.

The level of consumer involvement depends on a complex array of factors, which are related to psychology, situational factors and the marketing mix of the service provider. The lowest level of involvement is considered a state of inertia which occurs when people habitually purchase a product without comparing alternatives.

Cult products have the highest possible level of involvement as customers are fully devoted to a particular product or brand. Commercial organisations use this knowledge to their advantage by maximising the level of consumer involvement through branding and advertising. This strategy is used effectively by the bottled water industry. Manufacturers focus on enhancing the emotional aspects of their product rather than on improving the cognitive elements. Water utilities tend to use a reversed strategy and emphasise the cognitive aspects of tap water, the pipes, plants and pumps, rather than trying to create an emotional relationship with their consumers.

## Measuring Consumer Involvement

Asking consumers directly about their level of involvement would not lead to a stable answer because each respondent will interpret the question differently. The best way to measure psychological states or psychometrics is to ask a series of questions that are linguistically related to the topic of interest.

The most cited method to measure consumer involvement is the Personal Involvement Inventory, developed by Judy Zaichkowsky. This index is a two-dimensional scale consisting of:

• cognitive involvement (importance, relevance, meaning, value and need)
• affective involvement (involvement, fascination, appeal, excitement and interest).

The survey instrument consists of ten semantic-differential items. A Semantic Differential is a type of a rating scale designed to measure the meaning of objects, events or concepts. The concept that is being measured, such as involvement, is translated into a list of several synonyms and their associated antonyms.

In the involvement survey, participants are asked to position their views between two extremes such as Worthless and Valuable or Boring and Interesting. The level of involvement is defined as the sum of all answers, which is a number between 10 and 70.

Personal Involvement Inventory (Zaichkowsky 1994).

## Exploratory Analysis

For my dissertation about customer service in water utilities, I measured the level of involvement that consumers have with tap water. 832 tap water consumers completed this survey in Australia and the United States.

This data set contains other information, and the code selects only those variable names starting with “p” (for Personal Involvement Inventory). Before any data is analysed, customers who provided the same answer to all items, or did not respond to all questions, are removed, as these are most likely invalid responses. This leaves 757 rows of data.

A boxplot is a convenient way to view the responses to multiple survey items in one visualisation. This plot immediately shows an interesting pattern in the answers. It seems that responses to the first five items were generally higher than those for the last five items. This result seems to indicate a demarcation between cognitive and affective involvement.

Responses to Personal Involvement Index by tap water consumers.

The next step in the exploratory analysis is to investigate how these items correlate with each other. The correlation plot below shows that all items strongly correlate with each other. In correspondence with the boxplots above, the first five and the last five items correlate more strongly among themselves. This plot suggests that the two dimensions of the involvement index correlate with each other.

Correlation matrix for the Personal Involvement Index

## Factor Analysis in R

Factor Analysis is often confused with Principal Component Analysis because the outcomes are very similar when applied to the same data set. Both methods are similar but have a different purpose. Principal Component Analysis is a data-reduction technique that serves to reduce the number of variables in a problem. The specific purpose of Factor Analysis is to uncover latent variables. The mathematical principles for both techniques are similar, but not the same, and the two should not be confused.

One of the most important decisions in factor analysis is how to rotate the factors. There are two types of rotation: orthogonal and oblique. In simple terms, orthogonal rotations keep the dimensions uncorrelated, while oblique rotations allow the dimensions to relate to each other. Given the strong correlations in the correlation plot and the fact that both dimensions measure involvement, this analysis uses an oblique rotation. The visualisation below shows how each of the items loads on the two dimensions, and how the two dimensions relate to each other.

Factor analysis in R with Psych package.

This simple exploratory analysis shows the basic principle of how to analyse psychometric data. The psych package has a lot more specialised tools to dig deeper into the information. This article has not assessed the validity of this construct, or evaluated the reliability of the factors. Perhaps that is for a future article.

## The R Code

You can view the code below. Go to my Github Repository to see the code and the data source.

## ConsumerInvolvement.R
library(tidyverse)
library(psych)
## Read the survey data and keep only the involvement items.
## "customers.csv" is a placeholder; see the GitHub repository for the data source.
consumers <- read_csv("customers.csv") %>%
select(starts_with("p"))
dim(consumers)

## Data cleaning
sdevs <- apply(consumers, 1, sd, na.rm = TRUE)
incomplete <- apply(consumers, 1, function(i) any(is.na(i)))
consumers <- consumers[sdevs != 0 & !incomplete, ]
dim(consumers)

## Exploratory Analysis
consumers %>%
rownames_to_column(var = "Subject") %>%
gather(Item, Response, -Subject) %>%
ggplot(aes(Item, Response)) + geom_boxplot(fill = "#f7941d") +
ggtitle("Personal Involvement Index",
subtitle = paste("Tap Water Consumers USA and Australia (n =",
nrow(consumers), ")"))
ggsave("involvement_explore.png", dpi = 300)

png("involvement_correlation.png", width = 1024, height = 1024)
corPlot(consumers)
dev.off()

piiFac <- fa(consumers, nfactors = 2, rotate = "oblimin")

png("involvement_factors.png", width = 1024, height = 768)
fa.diagram(piiFac)
dev.off()



### Document worth reading: “Can Entropy Explain Successor Surprisal Effects in Reading?”

Human reading behavior is sensitive to surprisal: more predictable words tend to be read faster. Unexpectedly, this applies not only to the surprisal of the word that is currently being read, but also to the surprisal of upcoming (successor) words that have not been fixated yet. This finding has been interpreted as evidence that readers can extract lexical information parafoveally. Calling this interpretation into question, Angele et al. (2015) showed that successor effects appear even in contexts in which those successor words are not yet visible. They hypothesized that successor surprisal predicts reading time because it approximates the reader’s uncertainty about upcoming words. We test this hypothesis on a reading time corpus using an LSTM language model, and find that successor surprisal and entropy are independent predictors of reading time. This independence suggests that entropy alone is unlikely to be the full explanation for successor surprisal effects. Can Entropy Explain Successor Surprisal Effects in Reading?
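For readers who want the abstract's two quantities pinned down: a word's surprisal is -log2 of its conditional probability, and the entropy of the next-word distribution is the expected surprisal, i.e., the reader's uncertainty about what comes next. A small sketch with a made-up distribution (a study like this one would take the probabilities from a language model, such as the paper's LSTM):

```python
from math import log2

def surprisal(p):
    """Surprisal in bits of an outcome with probability p."""
    return -log2(p)

def entropy(dist):
    """Entropy in bits of a next-word distribution = expected surprisal."""
    return sum(p * surprisal(p) for p in dist.values() if p > 0)

# A toy next-word distribution after some context
dist = {"the": 0.5, "a": 0.25, "his": 0.125, "her": 0.125}
print(surprisal(dist["the"]))  # 1.0 bit: a highly predictable word
print(surprisal(dist["her"]))  # 3.0 bits: a less predictable word
print(entropy(dist))           # 1.75 bits: uncertainty about the next word
```

The paper's question is whether reading times track the successor word's surprisal (a property of the word actually read next) over and above the entropy (a property of the current uncertainty, available before the next word is seen).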

### Are you parallelizing your raster operations? You should!

(This article was first published on R – : Francesco Bailo :, and kindly contributed to R-bloggers)

If you plan to do anything with the raster package you should definitely consider parallelizing all your processes, especially if you are working with very large image files. I couldn’t find any blog post describing how to parallelize with the raster package (it is well documented in the package documentation, though). So here are my notes.

Let’s first get some raster data from here; any file will do, but I’m using the Cambodian population data for 2015 (KHM_ppp_v2b_2015_UNadj).

library(raster)
khm_pop.r <-
raster("~/Downloads/KHM_ppp_v2b_2015_UNadj/KHM_ppp_v2b_2015_UNadj.tif")

We can plot it with

library(rasterVis)
library(viridis)
library(ggplot2)
rasterVis::gplot(khm_pop.r) +
geom_tile(aes(fill = log(value)))  +
viridis::scale_fill_viridis(direction = -1,
na.value='#FFFFFF00') +
theme_bw()

## Projection

Now, let’s first project the raster without any parallelization.

start_time <- Sys.time()
res1 <-
projectRaster(khm_pop.r,
crs = '+proj=utm +zone=48 +datum=WGS84 +units=m +no_defs')
end_time <- Sys.time()
end_time - start_time
## Time difference of 1.088329 mins
rasterVis::gplot(res1) +
geom_tile(aes(fill = log(value)))  +
viridis::scale_fill_viridis(direction = -1,
na.value='#FFFFFF00') +
theme_bw()

And now let’s parallelize the process. There are two approaches to parallelization with raster objects (see ?clusterR for the package maintainers’ documentation):

1. By including the raster function between a beginCluster() and an endCluster().
2. By using clusterR() like in clusterR(x, fun, args=NULL, cl=mycluster), where mycluster is a cluster object generated for example by getCluster().

Yet clusterR() doesn’t work with merge, crop, mosaic, disaggregate, aggregate, resample, projectRaster, focal, distance, buffer and direction.

Let’s try the first approach first.

start_time <- Sys.time()
beginCluster()
## 4 cores detected, using 3
  res2 <-
projectRaster(khm_pop.r,
crs = '+proj=utm +zone=48 +datum=WGS84 +units=m +no_defs')
## Using cluster with 3 nodes
  endCluster()
end_time <- Sys.time()
end_time - start_time
## Time difference of 1.548856 mins
rasterVis::gplot(res2) +
geom_tile(aes(fill = log(value)))  +
viridis::scale_fill_viridis(direction = -1, na.value='#FFFFFF00') +
theme_bw()

## Maths

To test the second approach, let’s use the calc() and sqrt() functions, first without parallelization:

start_time <- Sys.time()
calc(khm_pop.r, sqrt)
## class       : RasterLayer
## dimensions  : 5205, 6354, 33072570  (nrow, ncol, ncell)
## resolution  : 0.0008333, 0.0008333  (x, y)
## extent      : 102.3375, 107.6323, 10.35008, 14.6874  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names       : layer
## values      : 0.02269337, 42.87014  (min, max)
end_time <- Sys.time()
end_time - start_time
## Time difference of 3.316296 secs

and then with parallelization, this time with clusterR():

start_time <- Sys.time()
beginCluster()
## 4 cores detected, using 3
clusterR(khm_pop.r, sqrt)
## class       : RasterLayer
## dimensions  : 5205, 6354, 33072570  (nrow, ncol, ncell)
## resolution  : 0.0008333, 0.0008333  (x, y)
## extent      : 102.3375, 107.6323, 10.35008, 14.6874  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names       : layer
## values      : 0.02269337, 42.87014  (min, max)
endCluster()
end_time <- Sys.time()
end_time - start_time
## Time difference of 16.49228 secs


### New allegations against Donald Trump raise the odds of impeachment

But gamblers still think there is a 65% chance he will complete his first term

### If you did not already know

INFODENS
The advent of representation learning methods enabled large performance gains on various language tasks, alleviating the need for manual feature engineering. While engineered representations are usually based on some linguistic understanding and are therefore more interpretable, learned representations are harder to interpret. Empirically studying the complementarity of both approaches can provide more linguistic insights that would help reach a better compromise between interpretability and performance. We present INFODENS, a framework for studying learned and engineered representations of text in the context of text classification tasks. It is designed to simplify the tasks of feature engineering as well as provide the groundwork for extracting learned features and combining both approaches. INFODENS is flexible, extensible, with a short learning curve, and is easy to integrate with many of the available and widely used natural language processing tools. …

Variational Adaptive Newton (VAN)
We present the Variational Adaptive Newton (VAN) method, a black-box optimization method especially suitable for explorative-learning tasks such as active learning and reinforcement learning. Similar to Bayesian methods, VAN estimates a distribution that can be used for exploration, but requires computations that are similar to continuous optimization methods. Our theoretical contribution reveals that VAN is a second-order method that unifies existing methods in the distinct fields of continuous optimization, variational inference, and evolution strategies. Our experimental results show that VAN performs well on a wide variety of learning tasks. This work presents a general-purpose explorative-learning method that has the potential to improve learning in areas such as active learning and reinforcement learning. …

Hyperspherical Convolution (SphereConv)
Convolution as inner product has been the founding basis of convolutional neural networks (CNNs) and the key to end-to-end visual representation learning. Benefiting from deeper architectures, recent CNNs have demonstrated increasingly strong representation abilities. Despite such improvement, the increased depth and larger parameter space have also led to challenges in properly training a network. In light of such challenges, we propose hyperspherical convolution (SphereConv), a novel learning framework that gives angular representations on hyperspheres. We introduce SphereNet, deep hyperspherical convolution networks that are distinct from conventional inner product based convolutional networks. In particular, SphereNet adopts SphereConv as its basic convolution operator and is supervised by generalized angular softmax loss – a natural loss formulation under SphereConv. We show that SphereNet can effectively encode discriminative representation and alleviate training difficulty, leading to easier optimization, faster convergence and comparable (even better) classification accuracy over convolutional counterparts. We also provide some theoretical insights for the advantages of learning on hyperspheres. In addition, we introduce the learnable SphereConv, i.e., a natural improvement over prefixed SphereConv, and SphereNorm, i.e., hyperspherical learning as a normalization method. Experiments have verified our conclusions. …
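To make the angular idea concrete, here is a tiny sketch (my own illustration, not code from the paper) of a SphereConv-style response using the linear variant g(theta) = 1 - 2*theta/pi, which depends only on the angle between the kernel and the input patch:

```python
import math

def sphere_conv(w, x):
    """Linear SphereConv response: g(theta) = 1 - 2*theta/pi, where theta
    is the angle between kernel w and input patch x (both assumed nonzero)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    # clamp to guard against floating-point drift outside [-1, 1]
    cos_theta = max(-1.0, min(1.0, dot / (norm_w * norm_x)))
    theta = math.acos(cos_theta)
    return 1 - 2 * theta / math.pi

print(sphere_conv([1, 0], [1, 0]))   # aligned vectors -> 1.0
print(sphere_conv([1, 0], [0, 1]))   # orthogonal vectors -> 0.0
print(sphere_conv([1, 0], [-1, 0]))  # opposite vectors -> -1.0
```

Unlike a plain inner product, the response is invariant to the magnitudes of `w` and `x`, which is the property the paper exploits to ease optimization on hyperspheres.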

### Distilled News

It looks like Christmas is a little early this year. Here’s a little something from me to all of you out there: a map to navigate ML services on AWS. With all the new stuff launched at re:Invent, I’m quite sure it will come in handy!
This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more.
The problem is simple: we have a slot machine with n arms and a limited number of trials, and we don’t know which arms will give us the most money. Assuming that the probability distribution does not change over time (meaning this is a stationary problem), which arm should we pull? Should we pull the arm that gave us the most reward in the past, or should we explore in the hope of finding a better arm? There are multiple solutions to this problem, and people usually measure regret in order to rank them. (Regret is, simply put, the penalty we incur for not pulling the optimal arm.) To minimize regret, we just have to pull the arm that has the highest probability of giving us a reward. But I wanted to look at an additional measurement: specifically, I will also take into account how well each algorithm estimates the probability distribution for each arm (the probability that it will give us a reward). We’ll compare their performance on a smaller scale, in which we only have 12 arms, and on a larger scale, in which we have 1000 arms. Finally, the aim of this post is to provide a simple implementation of each solution for non-mathematicians (like me), so the theoretical guarantees and proofs are not discussed; instead, I have provided links for readers who wish to study this problem more in depth. Below is the list of methods we are going to compare.
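As one concrete baseline (my own sketch, not code from the post), epsilon-greedy explores a random arm with probability epsilon and otherwise exploits the arm with the highest estimated reward probability:

```python
import random

def epsilon_greedy(true_probs, n_trials=10000, epsilon=0.1, seed=0):
    """Play a stationary Bernoulli bandit with the epsilon-greedy strategy.

    Returns the per-arm estimated reward probabilities and the total regret
    relative to always pulling the best arm."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms          # number of pulls per arm
    estimates = [0.0] * n_arms     # estimated reward probability per arm
    total_reward = 0
    for _ in range(n_trials):
        if rng.random() < epsilon:                       # explore
            arm = rng.randrange(n_arms)
        else:                                            # exploit
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        # incremental mean update of this arm's estimate
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    regret = n_trials * max(true_probs) - total_reward
    return estimates, regret

estimates, regret = epsilon_greedy([0.2, 0.5, 0.8])
```

The per-arm `estimates` are exactly the extra quantity the post proposes to inspect: how closely each algorithm recovers the true reward probabilities, in addition to its regret.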
Researchers borrowed equations from calculus to redesign the core machinery of deep learning so it can model continuous processes like changes in health.
Since Pearson developed principal component analysis (PCA) in 1901, feature learning (also called representation learning) has been studied for more than 100 years. During this period, many ‘shallow’ feature learning methods were proposed based on various learning criteria and techniques, up until the popular deep learning research of recent years. In this advanced review, we describe the historical profile of shallow feature learning research and introduce the important developments in deep learning models. In particular, we survey deep architectures that benefit from the optimization of their width and depth, as these models have achieved new records in many applications, such as image classification and object detection. Finally, several interesting directions of deep learning are presented and briefly discussed.
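Since the review starts from Pearson’s 1901 method, a quick reminder of what that earliest feature learner computes: the leading eigenvector of the covariance matrix of the centered data. A minimal pure-Python sketch using power iteration (illustrative only; real analyses would use a linear-algebra library):

```python
def first_principal_component(data, n_iter=200):
    """Power iteration on the sample covariance matrix to find the leading
    principal component of the rows of `data` (a list of equal-length lists)."""
    n = len(data)
    d = len(data[0])
    # center each column
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # sample covariance matrix
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # power iteration: repeatedly multiply and renormalize
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Points lying near the line y = x: leading component is roughly [0.71, 0.70]
pc1 = first_principal_component([[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9]])
```

Deep representation learning replaces this single linear projection with many stacked nonlinear ones, which is the historical arc the review traces.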
The k-nearest neighbors algorithm is characterized as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data – likely to contain noise and imperfections – are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing values imputation have targeted these weaknesses. As a result, these issues have turned out as strengths and the k-nearest neighbors rule has become a core algorithm to identify and correct imperfect data, removing noisy and redundant samples, or imputing missing values, transforming Big Data into Smart Data – which is data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context is investigated. This includes a brief overview of Smart Data, current and future trends for the k-nearest neighbor algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis in a series of big datasets that provide guidelines as to how to use the k-nearest neighbor algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark Packages have been developed including all the Smart Data algorithms analyzed.
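To illustrate one of the preprocessing roles mentioned above, here is a toy sketch (mine, not the paper’s Spark implementation) of kNN-based missing-value imputation: a missing entry is filled with the mean of that column over the k nearest complete rows, measured on the columns the incomplete row does have:

```python
import math

def knn_impute(rows, k=3):
    """Fill missing values (None) using the k nearest complete rows.

    Distance is Euclidean over the observed columns of the incomplete row.
    Assumes at least k rows have no missing values."""
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        observed = [j for j, v in enumerate(r) if v is not None]
        def dist(c):
            return math.sqrt(sum((r[j] - c[j]) ** 2 for j in observed))
        neighbours = sorted(complete, key=dist)[:k]
        filled.append([v if v is not None else
                       sum(c[j] for c in neighbours) / len(neighbours)
                       for j, v in enumerate(r)])
    return filled

data = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [1.05, None]]
filled = knn_impute(data, k=2)  # the None is filled from the two nearby rows
```

The paper’s point is that the same neighbour search that makes vanilla kNN slow on Big Data becomes, with scalable implementations, the workhorse for cleaning it.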
Now that knowledge of machine learning is making its way into offices all around the world, company leaders have a strong desire to automate processes that have existed manually for years. Google’s Jasmeet Bhatia, a talented machine learning specialist, explained to us the ways in which Google is innovating unique processes meant to facilitate effective automation at the Data Science Salon in New York City in September 2018.
Clusterlab is a CRAN package (https://…/index.html ) for the routine testing of clustering algorithms. It can simulate positive controls (data-sets with >1 clusters) and negative controls (data-sets with 1 cluster). Why test clustering algorithms? Because they often fail to identify the true K in practice, published algorithms are not always well tested, and we need to know about ones that behave strangely. I’ve found in many of my own experiments that the clustering algorithms many people are using are not necessarily the ones that provide the most sensible results. I can give a good example below.
Last week the R package ruimtehol was released on CRAN (https://…/ruimtehol ) allowing R users to easily build and apply neural embedding models on text data. It wraps the ‘StarSpace’ library https://…/StarSpace allowing users to calculate word, sentence, article, document, webpage, link and entity ’embeddings’. By using the ’embeddings’, you can perform text based multi-label classification, find similarities between texts and categories, do collaborative-filtering based recommendation as well as content-based recommendation, find out relations between entities, calculate graph ’embeddings’ as well as perform semi-supervised learning and multi-task learning on plain text. The techniques are explained in detail in the paper: ‘StarSpace: Embed All The Things!’ by Wu et al. (2017), available at https://…/1709.03856. You can get started with some common text analytical use cases by using the presentation we have built below. Enjoy!
A comparison of two transfer learning methods in Natural Language Processing: ‘ULMFiT’ and the ‘OpenAI Transformer’ for a multi-class classification task involving Twitter data.
Latent Dirichlet Allocation (LDA) is a ‘generative probabilistic model’ of a collection of composites made up of parts. In terms of topic modeling, the composites are documents and the parts are words and/or phrases (n-grams). But you could apply LDA to DNA and nucleotides, pizzas and toppings, molecules and atoms, employees and skills, or keyboards and crumbs.
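The generative story behind LDA can be sketched directly: each composite draws a mixture over topics from a Dirichlet prior, then each part is drawn from a topic sampled from that mixture. A toy illustration of that sampling process (the topic names and word distributions are made up; this generates documents, it does not fit the model):

```python
import random

def generate_corpus(topics, alpha, n_docs=3, doc_len=8, seed=1):
    """Sample documents from LDA's generative story.

    topics: dict mapping topic name -> dict of word -> probability.
    alpha:  symmetric Dirichlet concentration for per-document mixtures."""
    rng = random.Random(seed)
    topic_names = list(topics)
    docs = []
    for _ in range(n_docs):
        # draw a Dirichlet sample via normalized Gamma variates
        weights = [rng.gammavariate(alpha, 1) for _ in topic_names]
        total = sum(weights)
        mixture = [w / total for w in weights]
        doc = []
        for _ in range(doc_len):
            topic = rng.choices(topic_names, weights=mixture)[0]
            words, probs = zip(*topics[topic].items())
            doc.append(rng.choices(words, weights=probs)[0])
        docs.append(doc)
    return docs

topics = {
    "pizza": {"cheese": 0.5, "crust": 0.3, "basil": 0.2},
    "dna":   {"adenine": 0.4, "guanine": 0.4, "helix": 0.2},
}
corpus = generate_corpus(topics, alpha=0.5)
```

Fitting LDA is the inverse problem: given only the documents, recover plausible topic distributions and per-document mixtures, whatever the composites and parts happen to be.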
In 2016, a Reddit user made a confession. FiletOfFish1066 had automated all of his work tasks and spent around six years ‘doing nothing’. While the original post seems to have disappeared from Reddit, there are numerous reports about the admission. The original poster suggested that he (all the stories refer to FiletOfFish1066 as male) spent about 50 hours doing ‘real work’. The rest? Nothing. When his employer found out, FiletOfFish1066 was fired. I think this is the worst mistake an employer can make. He should have been given a pay rise. But that’s a topic for another article. Let’s talk about hiring algorithms to work for you, just like FiletOfFish1066 had a bunch of algorithms working for him.
mlrose provides functionality for implementing some of the most popular randomization and search algorithms, and applying them to a range of different optimization problem domains. In this tutorial, we will discuss what is meant by an optimization problem and step through an example of how mlrose can be used to solve one. This is the first in a series of three tutorials. Parts 2 and 3 will be published over the next two weeks.
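The excerpt doesn’t show mlrose’s actual API, so as a stand-in here is a generic randomized hill-climbing sketch of the kind such libraries implement, applied to the One Max problem (maximize the number of 1-bits); all names here are mine:

```python
import random

def random_hill_climb(fitness, state, neighbours, max_iters=1000, seed=0):
    """Generic randomized hill climbing for maximization: repeatedly try a
    random neighbour of the current best state and keep it if it improves."""
    rng = random.Random(seed)
    best, best_fit = state, fitness(state)
    for _ in range(max_iters):
        candidate = rng.choice(neighbours(best))
        cand_fit = fitness(candidate)
        if cand_fit > best_fit:
            best, best_fit = candidate, cand_fit
    return best, best_fit

def flip_one(bits):
    """Neighbourhood of a bit string: all single-bit flips."""
    return [bits[:i] + [1 - bits[i]] + bits[i + 1:] for i in range(len(bits))]

# One Max: the fitness of a bit string is simply its sum
best, fit = random_hill_climb(sum, [0] * 10, flip_one, max_iters=500)
```

An optimization problem in this setting is just the triple (state space, neighbourhood, fitness); swapping in a different fitness function or neighbourhood changes the problem without touching the search loop.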

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new RcppArmadillo bugfix release arrived at CRAN today. The version 0.9.200.7.0 is another minor bugfix release, and based on the new Armadillo bugfix release 9.200.7 from earlier this week. I also just uploaded the Debian version, and Uwe’s systems have already created the CRAN Windows binary.

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use, with a syntax deliberately close to that of Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 559 other packages on CRAN.

This release just brings minor upstream bug fixes, see below for details (and we also include the updated entry for the November bugfix release).

#### Changes in RcppArmadillo version 0.9.200.7.0 (2019-01-17)

• Fixes in 9.200.7 compared to 9.200.5:

• handling complex compound expressions by trace()

• handling .rows() and .cols() by the Cube class

#### Changes in RcppArmadillo version 0.9.200.5.0 (2018-11-09)

• Changes in this release

• linking issue when using fixed size matrices and vectors

• faster handling of common cases by princomp()

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


### R Packages worth a look

Interface to the ‘ECMWF’ Data Web Services (ecmwfr)
Programmatic interface to the ‘ECMWF’ public dataset web services (<https://…/> ). Allows …

Information Preserving Regression-Based Tools for Statistical Disclosure Control (RegSDC)
Implementation of the methods described in the paper with the above title: Langsrud. (2019) <doi:10.1007/s11222-018-9848-9>. Open view-only versi …

Variational Approximations for Generalized Additive Models (vagam)
Fits generalized additive models (GAMs) using a variational approximations (VA) framework. In brief, the VA framework provides a fully or at least clos …

### The Tentpoles of Data Science

What makes for a good data scientist? This is a question I asked a long time ago and am still trying to figure out the answer. Seven years ago, I wrote:

I was thinking about the people who I think are really good at data analysis and it occurred to me that they were all people I knew. So I started thinking about people that I don’t know (and there are many) but are equally good at data analysis. This turned out to be much harder than I thought. And I’m sure it’s not because they don’t exist, it’s just because I think good data analysis chops are hard to evaluate from afar using the standard methods by which we evaluate people.

Now that time has passed and I’ve had an opportunity to see what’s going on in the world of data science, what I think about good data scientists, and what seems to make for good data analysis, I have a few more ideas on what makes for a good data scientist. In particular, I think there are broadly five “tentpoles” for a good data scientist. Each tentpole represents a major area of activity that will to some extent be applied in any given data analysis.

When I ask myself the question “What is data science?” I tend to think of the following five components. Data science is

• the application of design thinking to data problems;
• the creation and management of workflows for transforming and processing data;
• the negotiation of human relationships to identify context, allocate resources, and characterize audiences for data analysis products;
• the application of statistical methods to quantify evidence; and
• the transformation of data analytic information into coherent narratives and stories

My contention is that if you are a good data scientist, then you are good at all five of the tentpoles of data science. Conversely, if you are good at all five tentpoles, then you’ll likely be a good data scientist.

## Design Thinking

Listeners of my podcast know that Hilary Parker and I are fans of design thinking. Having recently spent eight episodes discussing Nigel Cross’s book Design Thinking, it’s clear I think this is a major component of good data analysis.

The main focus here is developing a proper framing of a problem and homing in on the most appropriate question to ask. Many good data scientists are distinguished by their ability to think of a problem in a new way. Figuring out the best way to ask a question requires knowledge and consideration of the audience and what it is they need. I think it’s also important to frame the problem in a way that is personally interesting (if possible) so that you, as the analyst, are encouraged to look at the data analysis as a systems problem. This requires digging into all the details and looking into areas that others who are less interested might overlook. Finally, alternating between divergent and convergent thinking is useful for exploring the problem space via potential solutions (rough sketches), but also synthesizing many ideas and bringing oneself to focus on a specific question.

Another important area that design thinking touches is the solicitation of domain knowledge. Many would argue that having domain knowledge is a key part of developing a good data science solution. But I don’t think being a good data scientist is about having specific knowledge of biology, web site traffic, environmental health, or clothing styles. Rather, if you want to have an impact in any of those areas, it’s important to be able to solicit the relevant information—including domain knowledge—for solving the problem at hand. I don’t have a PhD in environmental health sciences, and my knowledge of that area is not at the level of someone who does. But I believe that over my career, I have solicited the relevant information from experts and have learned the key facts that are needed to conduct data science research in this area.

## Workflows

Over the past 15 years or so, there has been a growing discussion of the importance of good workflows in the data analysis community. At this point, I’d say a critical job of a data scientist is to develop and manage the workflows for a given data problem. Most likely, it is the data scientist who will be in a position to observe how the data flows through a team or across different pieces of software, and so the data scientist will know how best to manage these transitions. If a data science problem is a systems problem, then the workflow indicates how different pieces of the system talk to each other. While the tools of data analytic workflow management are constantly changing, the importance of the idea persists and staying up-to-date with the best tools is a key part of the job.

In the scientific arena the end goal of good workflow management is often reproducibility of the scientific analysis. But good workflow can also be critical for collaboration, team management, and producing good science (as opposed to merely reproducible science). Having a good workflow can also facilitate sharing of data or results, whether it’s with another team at the company or with the public more generally, as in the case of scientific results. Finally, being able to understand and communicate how a given result has been generated through the workflow can be of great importance when problems occur and need to be debugged.

## Human Relationships

In previous posts I’ve discussed the importance of context, resources, and audience for producing a successful data analysis. Being able to grasp all of these things typically involves having good relationships with other people, either within a data science team or outside it. In my experience, poor relationships can often lead to poor work.

It’s a rare situation where a data scientist works completely alone, accountable to no one, only presenting to themselves. Usually, resources must be obtained to do the analysis in the first place and the audience (i.e. users, customers, viewers, scientists) must be characterized to understand how a problem should be framed or a question should be asked. All of this will require having relationships with people who can provide the resources or the information that a data scientist needs.

Failures in data analysis can often be traced back to a breakdown in human relationships and in communication between team members. As the Duke Saga showed us, dramatic failures do not occur because someone didn’t know what a p-value was or how to fit a linear regression. In that particular case, knowledgeable people reviewed the analysis, identified exactly all the serious problems, raised the issues with the right people, and…were ignored. There is no statistical method that I know of that can prevent disaster from occurring under this circumstance. Unfortunately, for outside observers, it’s usually impossible to see this process happening, and so we tend to attribute failures to the parts that we can see.

## Statistical Methods

Applying statistical methods is obviously essential to the job of a data scientist. In particular, it means knowing what methods are most appropriate for different situations and different kinds of data, and which methods are best suited to answering different kinds of questions. Proper application of statistical methods is clearly important to doing good data analysis, but it’s also important for data scientists to know what methods can be reasonably applied given the constraints on resources. If an analysis must be done by tomorrow, one cannot apply a method that requires two days to complete. However, if the method that requires two days is the only appropriate method, then additional time or resources must be negotiated (thus necessitating good relationships with others).

I don’t think much more needs to be said here as I think most assume that knowledge of statistical methods is critical to being a good data scientist. That said, one important aspect that falls into this category is the implementation of statistical methods, which can be more or less complex depending on the size of the data. Sophisticated computational algorithms and methods may need to be applied or developed from scratch if a problem is too big to work on off-the-shelf software. In such cases, a good data scientist will need to know how to implement these methods so that the problem can be solved. While it is sometimes necessary to collaborate with an expert in this area who can implement a complex algorithm, this creates a new layer of communication and another relationship that must be properly managed.

## Narratives and Stories

Even the simplest of analyses can produce an overwhelming amount of results and being able to distill that information into a coherent narrative or story is critical to the success of an analysis. If a great analysis is done, but no one can understand it, did it really happen? Narratives and stories serve as dimension reduction for results and allow an audience to navigate a specified path through the sea of information.

Data scientists have to prioritize what is important and what is not and present things that are relevant to the audience. Part of building a good narrative is choosing the right presentation materials to tell the story, whether they be plots, tables, charts, or text. There is rarely an optimal choice that serves all situations because what works best will be highly audience- and context-dependent. Data scientists need to be able to “read the room”, so to speak, and make the appropriate choices. Many times, when I’ve seen critiques of data analyses, it’s not the analysis that is being criticized but rather the choice of narrative. If the data scientist chooses to emphasize one aspect but the audience thinks another aspect is more important, the analysis will seem “wrong” even though the application of the methods to the data is correct.

A hallmark of good communication about a data analysis is providing a way for the audience to reason about the data and to understand how the data are tied to the result. This is a data analysis after all, and we should be able to see for ourselves how the data inform the conclusion. As an audience member in this situation, I’m not as interested in just trusting the presenter and their conclusions.

## Describing a Good Data Scientist

When thinking of some of the best data scientists I’ve known over the years, I think they are all good at the five tentpoles I’ve described above. However, what about the converse? If you met someone who demonstrated that they were good at these five tentpoles, would you think they were a good data scientist? I think the answer is yes, and to get a sense of this, one need look no further than a typical job advertisement for a data science position.

I recently saw this job ad from my Johns Hopkins colleague Elana Fertig. She works in the area of computational biology and her work involves analyzing large quantities of data to draw connections between people’s genes and cancer (if I may make a gross oversimplification). She is looking for a postdoctoral fellow to join her lab and the requirements listed for the position are typical of many ads of this type:

• PhD in computational biology, biostatistics, biomedical engineering, applied mathematics, or a related field.
• Proficiency in programming with R/Bioconductor and/or python for genomics analysis.
• Experience with high-performance computing clusters and LINUX scripting.
• Techniques for reproducible research and version control, including but not limited to experience generating knitr reports, GitHub repositories, and R package development.
• Problem-solving skills and independence.
• The ability to work as part of a multidisciplinary team.
• Excellent written and verbal communication skills.

This is a job where complex statistical methods will be applied to large biological datasets. As a result, knowledge of the methods or the biology will be useful, and knowing how to implement these methods on a large scale (i.e. via cluster computing) will be important. Knowing techniques for reproducible research requires knowledge of the proper workflows and how to manage them throughout an analysis. Problem-solving skills are practically synonymous with design thinking; working as part of a multidisciplinary team requires negotiating human relationships; and developing narratives and stories requires excellent written and verbal communication skills.

## Summary

A good data scientist can be hard to find, and part of the reason is because being a good data scientist requires mastering skills in a wide range of areas. However, these five tentpoles are not haphazardly chosen; rather they reflect the interwoven set of skills that are needed to solve complex data problems. Focusing on being good at these five tentpoles means sacrificing time spent studying other things. To the extent that we can coalesce around the idea of convincing people to do exactly that, data science will become a distinct field with its own identity and vision.

### forecast 8.5

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

The latest minor release of the forecast package has now been approved on CRAN and should be available in the next day or so.

Version 8.5 contains the following new features

• Updated tsCV() to handle exogenous regressors.
• Reimplemented naive(), snaive(), rwf() for substantial speed improvements.
• Added support for passing arguments to auto.arima() unit root tests.
• Improved auto.arima() stepwise search algorithm (some neighbouring models were missed previously).

We haven’t done a major release for two years, and there is unlikely to be another one now. Instead, we are working hard on fable, a tidyverse replacement for the forecast package.

The forecast package will continue to be maintained, but no new features will be added.


## January 17, 2019

### Book Memo: “The Enterprise Big Data Lake”

Delivering on the Promise of Hadoop and Data Science in the Enterprise. This is a handbook for decision-makers, covering everything from the initial research and decision-making process through planning, choosing products, and implementing, and, crucially, maintaining and governing the modern data lake. It covers all these issues in practical and actionable terms for both the managerial and the IT professional.

### Whats new on arXiv

Change point detection algorithms have numerous applications in fields of scientific and economic importance. We consider the problem of change point detection on compositional multivariate data (each sample is a probability mass function), which is a practically important sub-class of general multivariate data. While the problem of change-point detection is well studied in the univariate setting, and there are a few viable implementations for general multivariate data, the existing methods do not perform well on compositional data. In this paper, we propose a parametric approach for change point detection in compositional data. Moreover, using simple transformations on the data, we extend our approach to handle any general multivariate data. Experimentally, we show that our method performs significantly better on compositional data and is competitive on general data compared to the available state-of-the-art implementations.
We propose nnstreamer, a software system that handles neural networks as filters of stream pipelines, applying the stream processing paradigm to neural network applications. A new trend accompanying the widespread adoption of deep neural network applications is on-device AI; i.e., processing neural networks directly on mobile devices or edge/IoT devices instead of cloud servers. Emerging privacy issues, data transmission costs, and operational costs signify the need for on-device AI, especially when a huge number of devices with real-time data processing are deployed. Nnstreamer efficiently handles neural networks with complex data stream pipelines on devices, improving the overall performance significantly with minimal effort. Besides, nnstreamer simplifies neural network pipeline implementations and allows reusing off-the-shelf multimedia stream filters directly; thus it reduces development costs significantly. Nnstreamer is already being deployed with a product releasing soon and is open source software applicable to a wide range of hardware architectures and software platforms.
This paper proposes a novel adaptive guidance system developed using reinforcement meta-learning with a recurrent policy and value function approximator. The use of recurrent network layers allows the deployed policy to adapt real time to environmental forces acting on the agent. We compare the performance of the DR/DV guidance law, an RL agent with a non-recurrent policy, and an RL agent with a recurrent policy in four difficult tasks with unknown but highly variable dynamics. These tasks include a safe Mars landing with random engine failure and a landing on an asteroid with unknown environmental dynamics. We also demonstrate the ability of a recurrent policy to navigate using only Doppler radar altimeter returns, thus integrating guidance and navigation.
In this study, a novel topology optimization approach based on conditional Wasserstein generative adversarial networks (CWGAN) is developed to replicate the conventional topology optimization algorithms in an extremely computationally inexpensive way. CWGAN consists of a generator and a discriminator, both of which are deep convolutional neural networks (CNN). The limited samples of data, quasi-optimal planar structures, needed for training purposes are generated using the conventional topology optimization algorithms. With CWGANs, the topology optimization conditions can be set to a required value before generating samples. CWGAN truncates the global design space by introducing an equality constraint by the designer. The results are validated by generating an optimized planar structure using the conventional algorithms with the same settings. A proof of concept is presented which is known to be the first such illustration of fusion of CWGANs and topology optimization.
The goal of this article is to inspire data scientists to participate in the debate on the impact that their professional work has on society, and to become active in public debates on the digital world as data science professionals. How do ethical principles (e.g., fairness, justice, beneficence, and non-maleficence) relate to our professional lives? What responsibility do we bear as professionals by virtue of our expertise in the field? More specifically, this article makes an appeal to statisticians to join that debate, and to be part of the community that establishes data science as a proper profession in the sense of Airaksinen, a philosopher working on professional ethics. As we will argue, data science has one of its roots in statistics and extends beyond it. To shape the future of statistics, and to take responsibility for the statistical contributions to data science, statisticians should actively engage in the discussions. First the term data science is defined, and the technical changes that have led to a strong influence of data science on society are outlined. Next the systematic approach from CNIL is introduced. Prominent examples are given for ethical issues arising from the work of data scientists. Further, we provide reasons why data scientists should engage in shaping the morality around data science and in formulating codes of conduct and codes of practice for the field. Next we present established ethical guidelines for the related fields of statistics and computing machinery. Thereafter, the steps necessary for the community to develop professional ethics for data science are described. Finally we give our starting statement for the debate: data science is at the focal point of current societal development. Without becoming a profession with professional ethics, data science will fail in building trust in its interaction with and its much-needed contributions to society!
Recent GAN-based architectures have been able to deliver impressive performance on the general task of image-to-image translation. In particular, it was shown that a wide variety of image translation operators may be learned from two image sets, containing images from two different domains, without establishing an explicit pairing between the images. This was made possible by introducing clever regularizers to overcome the under-constrained nature of the unpaired translation problem. In this work, we introduce a novel architecture for unpaired image translation, and explore several new regularizers enabled by it. Specifically, our architecture comprises a pair of GANs, as well as a pair of translators between their respective latent spaces. These cross-translators enable us to impose several regularizing constraints on the learnt image translation operator, collectively referred to as latent cross-consistency. Our results show that our proposed architecture and latent cross-consistency constraints are able to outperform the existing state-of-the-art on a wide variety of image translation tasks.
We present a method to reconstruct networks of socialbots given minimal input. Then we use Kernel Density Estimates of Botometer scores from 47,000 social networking accounts to find clusters of automated accounts, discovering over 5,000 socialbots. This statistical and data driven approach allows for inference of thresholds for socialbot detection, as illustrated in a case study we present from Guatemala.
As more researchers have become aware of and passionate about algorithmic fairness, there has been an explosion in papers laying out new metrics, suggesting algorithms to address issues, and calling attention to issues in existing applications of machine learning. This research has greatly expanded our understanding of the concerns and challenges in deploying machine learning, but there has been much less work on seeing how the rubber meets the road. In this paper we provide a case study on the application of fairness in machine learning research to a production classification system, and offer new insights into how to measure and address algorithmic fairness issues. We discuss open questions in implementing equality of opportunity and describe our fairness metric, conditional equality, that takes into account distributional differences. Further, we provide a new approach to improve on the fairness metric during model training and demonstrate its efficacy in improving performance for a real-world product.
Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related, and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the Predictive, Descriptive, Relevant (PDR) framework for discussing interpretations. The PDR framework provides three overarching desiderata for evaluation: predictive accuracy, descriptive accuracy and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post-hoc categories, with sub-groups including sparsity, modularity and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often under-appreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.
Interval-censored data analysis is important in biomedical statistics for any type of time-to-event response where the time of response is not known exactly, but rather only known to occur between two assessment times. Many clinical trials and longitudinal studies generate interval-censored data; one common example occurs in medical studies that entail periodic follow-up. In this paper we propose a survival forest method for interval-censored data based on the conditional inference framework. We describe how this framework can be adapted to the situation of interval-censored data. We show that the tuning parameters have a non-negligible effect on the survival forest performance and guidance is provided on how to tune the parameters in a data-dependent way to improve the overall performance of the method. Using Monte Carlo simulations we find that the proposed survival forest is at least as effective as a survival tree method when the underlying model has a tree structure, performs similarly to an interval-censored Cox proportional hazards model fit when the true relationship is linear, and outperforms the survival tree method and Cox model when the true relationship is nonlinear. We illustrate the application of the method on a tooth emergence data set.
Large amounts of public data produced by enterprises are in semi-structured PDF form. Tabular data extraction from reports and other published data in PDF format is of interest for various data consolidation purposes such as analysing and aggregating financial reports of a company. Queries into the structured tabular data in PDF format are normally processed in an unstructured manner through means like text match. This is mainly because the binary format of PDF documents is optimized for layout and rendering and does not lend itself to automated parsing of data. Moreover, even the same table type in PDF files varies in schema, row or column headers, which makes it difficult for a query plan to cover all relevant tables. This paper proposes a deep learning based method to enable SQL-like query and analysis of financial tables from annual reports in PDF format. This is achieved through table type classification and nearest row search. We demonstrate that using word embeddings trained on Google News for header matching clearly outperforms the text-match based approach of a traditional database. We also introduce a practical system that uses this technology to query and analyse finance tables in PDF documents from various sources.
In many security and healthcare systems, the detection and diagnosis systems use a sequence of sensors/tests. Each test outputs a prediction of the latent state and carries an inherent cost. However, the correctness of the predictions cannot be evaluated since the ground truth annotations may not be available. Our objective is to learn strategies for selecting a test that gives the best trade-off between accuracy and costs in such Unsupervised Sensor Selection (USS) problems. Clearly, learning is feasible only if ground truth can be inferred (explicitly or implicitly) from the problem structure. It is observed that this happens if the problem satisfies the ‘Weak Dominance’ (WD) property. We set up the USS problem as a stochastic partial monitoring problem and develop an algorithm with sub-linear regret under the WD property. We argue that our algorithm is optimal and evaluate its performance on problem instances generated from synthetic and real-world datasets.
A simultaneous change-point detection and estimation in a piece-wise constant model is a common task in modern statistics. If, in addition, the whole estimation can be performed automatically, in just one single step without going through any hypothesis tests for non-identifiable models, or unwieldy classical a-posteriori methods, it becomes an interesting, but also challenging idea. In this paper we introduce an estimation method based on the quantile LASSO approach. Unlike standard LASSO approaches, our method does not rely on typical assumptions usually required for the model errors, such as sub-Gaussian or Normal distribution. The proposed quantile LASSO method can effectively handle heavy-tailed random error distributions, and, in general, it offers a more complex view of the data, as one can obtain any conditional quantile of the target distribution, not just the conditional mean. It is proved that under some reasonable assumptions the number of change-points is not underestimated with probability tending to one, and, in addition, when the number of change-points is estimated correctly, the change-point estimates provided by the quantile LASSO are consistent. Numerical simulations are used to demonstrate these results and to illustrate the empirical performance and robustness of the proposed quantile LASSO method.
The Laplace approximation has been one of the workhorses of Bayesian inference. It often delivers good approximations in practice despite the fact that it does not strictly take into account where the volume of posterior density lies. Variational approaches avoid this issue by explicitly minimising the Kullback-Leibler divergence DKL between a postulated posterior and the true (unnormalised) logarithmic posterior. However, they rely on a closed form DKL in order to update the variational parameters. To address this, stochastic versions of variational inference have been devised that approximate the intractable DKL with a Monte Carlo average. This approximation allows calculating gradients with respect to the variational parameters. However, variational methods often postulate a factorised Gaussian approximating posterior. In doing so, they sacrifice a-posteriori correlations. In this work, we propose a method that combines the Laplace approximation with the variational approach. The advantages are that we maintain: applicability on non-conjugate models, posterior correlations and a reduced number of free variational parameters. Numerical experiments demonstrate improvement over the Laplace approximation and variational inference with factorised Gaussian posteriors.
In this paper, we propose a model for non-cooperative Markov games with time-consistent risk-aware players. In particular, our model characterizes the risk arising from both the stochastic state transitions and the randomized strategies of the other players. We give an appropriate equilibrium concept for our risk-aware Markov game model and we demonstrate the existence of such equilibria in stationary strategies. We then propose and analyze a simulation-based $Q$-learning type algorithm for equilibrium computation, and work through the details for some specific risk measures. Our numerical experiments on a two player queuing game demonstrate the worth and applicability of our model and corresponding $Q$-learning algorithm.

### Practice Makes Perfect – Free Data Science Interviews

The year 2019 has begun and many people have plans to become a data scientist. That is because data scientist has been ranked as one of the top jobs for the last several years. Even after learning the necessary skills, preparing for and completing interviews can be an intimidating task. That is why interview practice is so important, and Pramp provides a free online environment for practicing data science interviews.

## What is Pramp?

Pramp is a free peer-to-peer matching platform that enables you to practice a technical interview. After signing up, here is how the process works:

1. You schedule an interview by choosing a date and time for when you would like the interview to occur.
2. You then prepare for the interview with the materials Pramp provides. Pramp will supply interview questions and guidelines for being best prepared.
3. Finally, you conduct the interview where you and the other person take turns interviewing each other.
4. If desired, the process can be repeated multiple times.

Pramp takes two people preparing for a data science interview and matches them together.

Being on both sides of the interview is surprisingly helpful. It allows you to practice your responses, and it allows you to understand what is important to the person asking questions. It is often more about understanding the problem and thinking through a solution, rather than identifying a right or wrong answer.

As a bit of a bonus, if you enjoyed interviewing with your peer and you’d like to practice with him/her again, Pramp has a feature for that. Who knows, that peer may become a friend or a coworker in the future.

## Why Practice the Interview?

Even the best data scientists and engineers struggle to pass technical interviews. Let’s face it, technical interviews are challenging and intimidating. For many, the biggest challenge isn’t the coding question, but rather staying focused while solving a problem out loud and under time pressure in front of an interviewer.

Data from over 180,000 interviews scheduled on Pramp has shown that those who completed face-to-face mock interviews performed significantly better than those who just practiced alone. Plus, Pramp users have already found jobs at companies like Google, Amazon, Facebook, Twitter, Microsoft, Spotify, and many others.

Pramp, a Y Combinator-funded company, has tackled the challenge of technical interviews by offering a free peer-to-peer mock interview platform helping data scientists and engineers practice technical interviews. In addition to data science, Pramp also offers interviews for:

• Data Structures and Algorithms
• System Design
• Frontend Development
• Behavioral Interviews

If you are looking to get into a data scientist or other technical role in 2019, Pramp is a site which can help you be better prepared for the interview.

### A Year Learning Data Science at Dataquest

Specifically, this article will give you an idea of what you can expect to learn, and what kind of jobs you might be able to apply for, as you work through a year of learning data science with us.

Of course, one of the advantages of an online course like Dataquest’s is that you can work at your own pace and tailor your study to your background, skipping courses with content you're already comfortable with. For the purposes of this article, we’re going to make some conservative assumptions:

• You’re busy, and can only dedicate around five hours per week to your studies.
• You have no previous programming experience.
• You have no math training (beyond high-school algebra).

A lot of our students work faster than that, so this is a pretty conservative estimate of what you can get done in a year. But if you put in at least five hours a week, we expect that in a year, you could finish the Data Analyst path, or get more than halfway through the Data Scientist path, and would be qualified for a variety of entry-level data analysis and data science jobs.

Let’s take a closer look at what that year would look like, what you would learn on the Data Scientist in Python path, and how you could best take advantage of your subscription over the course of the year. (Your experience on the Data Analyst path in Python or R would be very similar, however).

## January-February: Learning Python

The first eight weeks of your year would likely be spent learning Python. You might be able to get through our introductory and intermediate Python courses a little bit faster if you rush, but building a solid Python foundation is important for almost everything that comes afterwards. It’s worth taking a little extra time here to be sure you can understand and apply all the concepts.

The good news is that even if you started these eight weeks with zero coding experience, you’re going to end them as a programmer. After these courses, you’ll be able to confidently apply most of the important concepts of Python programming (from basics like functions and for loops to more advanced concepts like regular expressions and list comprehensions), and you’ll also be comfortable working with Jupyter Notebooks, an important tool for data scientists who use Python.
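To make that concrete, here is the flavor of what those first courses cover: functions, for loops (via a list comprehension), and a regular expression. The post titles below are invented for illustration, not taken from any Dataquest lesson.

```python
import re

# Filter a list of (made-up) post titles down to the ones that
# mention Python, using a compiled regular expression inside a
# list comprehension -- two concepts from the introductory courses.
titles = [
    "Show HN: A Python scraper",
    "Ask HN: Best laptop for coding?",
    "Why python beats spreadsheets",
]

pattern = re.compile(r"\bpython\b", re.IGNORECASE)  # word-boundary match
python_titles = [t for t in titles if pattern.search(t)]

print(python_titles)
# ['Show HN: A Python scraper', 'Why python beats spreadsheets']
```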

As you learn those skills and techniques, you will also have gotten a great introduction to the fundamentals of data analysis in Python. All of our courses have you working with real-world data, and as part of these courses you’ll get to apply what you’ve learned doing guided projects analyzing what app store profiles lead to more app downloads and what successful Hacker News posts have in common with each other.

These two classes alone won’t be enough to get you a job in data science, but by the end of the eight weeks, you’ll find that you’ve learned enough to do some basic data analysis on your own, and probably code some other things, too! Just these eight weeks would be enough to give you some skills that might help you save some time on analytical tasks in your current job.

You can take advantage of Premium help and support at any time during your studies, but these first eight weeks will be an especially good time to reach out if you encounter any roadblocks, need a second pair of eyes on your code, or simply want to figure out whether your understanding of a concept is correct. These courses are the foundation upon which the rest of your data science “house” will be built, so being extra thorough here will pay dividends later.

## March-May: Data Cleaning, Data Analysis, and Data Visualization

These twelve weeks are where the rubber really starts to meet the road in applying your new Python skills to accomplish typical data science tasks. You’ll go through four courses here, and each one of them is crucial for doing data science.

In the first course, Pandas and NumPy Fundamentals, you’ll learn how to use the pandas library, a crucial tool for real-world data analysis tasks. You’ll also learn about NumPy, another useful Python package, and you’ll learn to make them play nicely together. Then you’ll apply that learning with a guided project analyzing real-world eBay car sales data.
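A tiny sketch of how pandas and NumPy "play nicely together": a pandas column is backed by a NumPy array, so NumPy functions apply to it directly, and pandas adds grouped operations on top. The listings below are invented, loosely echoing the eBay car sales project (the column names are our own, not the course's).

```python
import numpy as np
import pandas as pd

# Hypothetical used-car listings (made-up numbers).
cars = pd.DataFrame({
    "model": ["golf", "passat", "golf", "polo"],
    "price": [4500, 7200, 3900, 2500],
    "odometer_km": [150000, 90000, 125000, 180000],
})

# NumPy functions operate directly on a pandas column.
log_price = np.log(cars["price"])

# pandas adds grouped aggregation: mean price per model.
mean_price = cars.groupby("model")["price"].mean()
print(mean_price["golf"])  # (4500 + 3900) / 2 = 4200.0
```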

From there, you’ll move into two courses about data visualization. The first, Exploratory Data Visualization, will teach you how to use the matplotlib package together with pandas to do exploratory visualizations that will help you make sense of your data and guide you in your analysis. The second, Storytelling Through Data Visualization, will teach you more about how to make aesthetic, readable charts using Seaborn to ensure that you know how to communicate your data clearly to others (a crucial skill in any data science job). In these courses, you’ll synthesize what you’ve learned in guided projects analyzing topics like the gender gap in college degrees and geographical flight patterns (all using real-world data, of course).

Finally, you’ll move into a course on data cleaning, one of the most un-sexy but essential skills in any data scientist’s toolkit. You’ll learn to explore and clean datasets, how to combine multiple datasets into a single, clean source, and work through some guided projects analyzing data from NYC high schools and a survey about Star Wars.
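The "combine multiple datasets into a single, clean source" step often starts with nothing more than normalizing column names and stacking frames. A minimal sketch, with two invented survey extracts standing in for the messy real-world files:

```python
import pandas as pd

# Two hypothetical extracts with inconsistent column naming.
part_a = pd.DataFrame({"School Name": ["A", "B"], "sat_score": [1350, 1200]})
part_b = pd.DataFrame({"school name": ["C"], "sat_score": [1400]})

def clean_columns(df):
    """Lowercase column names and replace spaces with underscores."""
    df = df.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df

# Stack the cleaned frames into one dataset with a fresh index.
combined = pd.concat([clean_columns(part_a), clean_columns(part_b)],
                     ignore_index=True)
print(combined.columns.tolist())  # ['school_name', 'sat_score']
```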

By mid-May (your 20th week), you’ll have acquired many of the foundational data science skills, and you should be well-equipped to start taking on your own data science projects. You might not be ready for a full-time data science job just yet, but you’ll know enough to be able to solve real-world problems with data science in a way that might impact your current job.

For example, Dataquest student Curtly Critchlow was able to take an Excel data analysis nightmare that took him a full week of work each month and turn it into a project that took just a few minutes after he finished our Pandas and NumPy course.

During these weeks, though, you may encounter a psychological phenomenon sometimes referred to as ‘The Dip’. This happens often in the course of learning a new skill; once you get beyond the beginner phase, big gains come a bit more slowly, and the novelty of studying something new has worn off. The result can be a bit of a dip in your natural level of motivation.

But don’t worry: we’ll help you fight the dip! All of our courses use interesting, real-world data to combat this effect by keeping you interested in the analysis, and you’ll be solving different and interesting problems in each course.

These twelve weeks would also be a good time for you to get more involved with our Members-Only Slack community, where you can network and collaborate with other students and get help from peers and data scientists alike. The energy of collective learning can be a great motivator, and by this point, you’ll have learned enough to start helping other students. Teaching others what you’ve learned is a great way to reinforce your own learning.

## May-July: Learning Command Line and Git

As we get towards the middle of our year of data science, it’s time to cover some skills that are hugely important for working in data science: operating with the command line and using git to develop projects collaboratively.

In the first two courses, you’ll learn to work with the command line. You’ll get comfortable navigating around without the use of a GUI, and working with Python scripts and packages from the command line. Then you’ll move on to more advanced topics including searches with grep, building shell pipelines, and using some new tools like Jupyter console. You’ll also get some more training in data cleaning using a tool called csvkit. To cement these skills, you’ll work on real-world projects like analyzing years of Hacker News headlines to see what words, domains, and submission times are most likely to result in a highly-upvoted post.

From there, you’ll move into our course on Git and Version Control, where you’ll learn why version control is important, and how you can use git both locally and on Git remotes like Github (the biggest public code repository and a platform for code-sharing and collaboration that’s used by software developers all over the world). You’ll also learn project management techniques like how to merge branches and resolve merge conflicts that will make it easier to work as part of a collaborative data science team. And of course, you’ll get Git installed and get your own Github set up.

At this point in your study, it’s a great time to start thinking about portfolio projects. Having a Github or some other portfolio page with compelling projects is key to landing a job in data science, and Dataquest is full of guided projects that you can absolutely use for a portfolio. You’ll have worked through some of these already, so this is a good moment to look back and think about adding some polish to your favorites so you’ve got some cool projects ready when you start applying for jobs.

## July-October: Learn SQL, APIs, and Web Scraping

Over these twelve weeks, you’ll take four more courses, all of them focused on helping you work with data sources more efficiently.

You’ll start with three SQL courses. In the first, you’ll learn the basics, like how to explore and analyze data in SQL, and how to use SQLite with Python. Then you’ll move into more intermediate topics like querying across multiple tables, and you’ll begin getting practice answering business questions using SQL. Finally, you’ll dig into the advanced stuff, like PostgreSQL and using database indexes to speed up your SQL queries.
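Using SQLite from Python needs nothing beyond the standard library. Here is a toy version of the pattern the first SQL course teaches, with a small invented table (the country figures are illustrative, not from the course data):

```python
import sqlite3

# An in-memory database: no file on disk, ideal for experimenting.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("Aruba", 103889), ("Albania", 3029278), ("Algeria", 39542166)],
)

# Answer a question with a query: which countries have more than
# a million people?
rows = conn.execute(
    "SELECT name FROM facts WHERE population > 1000000 ORDER BY name"
).fetchall()
print([r[0] for r in rows])  # ['Albania', 'Algeria']
conn.close()
```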

And while your new SQL skills will be crucial for working with most of the databases out there, there are plenty of other data sources you’ll want to work with, so after the SQL courses you’ll move into a course on APIs and web scraping that’ll teach you how to query APIs and scrape data from websites that don’t have APIs.
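The two data-source skills differ mainly in the format you parse: APIs usually hand back JSON, while scraping means pulling structure out of HTML. A standard-library sketch of both, using a made-up API response and an invented HTML snippet (no network calls):

```python
import json
from html.parser import HTMLParser

# 1) APIs: a typical JSON response body (invented) parsed natively.
response_body = '{"results": [{"title": "Post 1"}, {"title": "Post 2"}]}'
data = json.loads(response_body)
titles = [item["title"] for item in data["results"]]

# 2) Scraping: collect link targets from a page snippet.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkCollector()
parser.feed('<p><a href="/jobs">Jobs</a> <a href="/about">About</a></p>')
print(titles, parser.links)
```

In practice you would fetch `response_body` over HTTP and use a library like requests or BeautifulSoup, but the parsing logic is the same.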

To cement these skills, you’ll answer some more real-world business questions with SQL, and dive into data from the CIA World Factbook.

If you haven’t already, this would be a great time to schedule one-on-one office hours with one of our data scientists and discuss your career plans. It’s a chance to get a second set of eyes on your resume, get some advice about building your portfolio, or just get some input on the types of roles you should apply for.


That might sound premature, but don’t sell yourself short - by this point, you have almost all of the key skills you need for entry-level data analyst positions. And while it’s scary to apply for jobs before you feel ready, the payoff can be massive. That was certainly the case for Miguel Couto, a Dataquest student who applied for jobs about halfway through our Data Scientist path even though he didn’t think he was ready. He ended up getting three full-time job offers and is now working happily as a data analyst.

## October-December: Learn Statistics for Data Science

By this point, you’ll have the programming skills to do a lot of data analysis, but you still need a solid understanding of statistics and probability to be able to get the most out of them, so in the final section of your year of Dataquest, you’ll take a sequence of three courses aimed at giving you a solid stats foundation and helping you apply these concepts in Python.

You’ll start with the basics, like learning different sampling techniques for taking good samples from your data. Then you’ll start looking at distributions, measuring variability, and locating and comparing values with z-scores. Finally, you’ll learn more about probability and dig into advanced topics like significance testing and the chi-squared test.
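Two of those ideas fit in a few lines of standard-library Python: simple random sampling and z-scores (how many standard deviations a value sits from the mean). The daily rental counts below are invented for illustration:

```python
import random
import statistics

# Hypothetical daily bike-rental counts.
rentals = [120, 95, 130, 160, 110, 105, 140, 90, 150, 100]

# Simple random sample of 4 days (seeded so it's repeatable).
random.seed(0)
sample = random.sample(rentals, 4)

# z-score of the busiest day against the full data.
mean = statistics.mean(rentals)       # 120
stdev = statistics.pstdev(rentals)    # population standard deviation
z = (160 - mean) / stdev

print(round(z, 2))  # 1.75
```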

As usual, as you work through these courses, you’ll be using real-world data to answer interesting questions, like how a bike-sharing company can anticipate rental patterns. And you’ll be able to apply your new skills to cool guided projects like figuring out winning Jeopardy strategies and determining whether a movie ratings site’s ratings are biased.

This is a great time to branch out a bit more and start making some connections in the data science community. Our Members Slack is a great place to start. You may also want to work on building your brand as a data scientist by getting yourself out there in other ways, like by writing a tutorial for the Dataquest blog.

## This Is Just the Beginning of Your Data Science Journey

This statistics and probability section is the final one you’ll complete in your year of Dataquest study, assuming you’ve stuck to just five hours per week. It also represents the end of the Data Analyst path. At this point, you’ll be well-qualified to apply for data analyst positions; we have many students like Pol Brigneti, who’ve finished our Data Analyst path and found full-time data analyst positions. If that’s your choice, then you’ve got two extra weeks in the year that you can use to polish projects, tackle a few more guided projects for your portfolio, and start applying for jobs.

There’s still plenty more learning on our Data Science path, too. If you continue studying, in the final two weeks of the year, you’ll get to dig into the hottest topic in data science: machine learning.

And remember, this is a pretty conservative estimate. Spending a little more time each week studying will get you further, faster. At around 10 hours per week, we estimate you’d be able to finish the entire Data Scientist path in a year.

Even if you don’t aspire to go all the way through the Data Science path, it pays to keep learning while you’re searching for jobs, and even after you find employment. That’s what Miguel told us after he got his full-time job just halfway through our Data Science path. “Even though I’m starting a job in January, I’m still going to be active, and I’ll keep on studying, because obviously I want to reach other paths.”

“And I still think Dataquest is the best option,” he told us. “If I had to choose only one, I’d choose Dataquest.”

Commit to your year of Dataquest. Buy an annual subscription and save 50% (limited time only, offer expires Feb. 4)

### Why Applied MSc in Data Engineering? Data Engineers are in greater demand than Data Scientists

2 graduate programmes now available at Data ScienceTech Institute in France: Applied MSc in Data Engineering and Applied MSc in Data Science & Artificial Intelligence, with enterprise-level certifications included in each. There is a 100% conversion to an internship and 90% to a job contract.

### Facebook announces Probability and Programming research award at POPL 2019

As part of the Facebook-sponsored evening reception for all POPL conference attendees, Facebook is launching a Probability and Programming Languages research award, presented by Satish Chandra. Facebook is looking for proposals that address fundamental problems at the intersection of machine learning, programming languages and software engineering. Find out more details about the Probability and Programming research award in the next section.

### Probability and Programming Research Award

At Facebook, we are doing forward-looking research, as well as putting into production concrete results from several of these threads. We introduced HackPPL, which extends our internal PHP dialect into a full-fledged probabilistic programming language, and are creating extensions to Python to eliminate string-based API patterns. We have started various language-centric projects around acceleration and differentiable programming. We also have a portfolio of projects in the “big code” space, exploring several topics such as code search and recommendation, automatic bug fixing, and program synthesis using machine learning. Together, this work hopes to have impact across all of Facebook’s infrastructure.

To foster further innovation in these topics, and to deepen our collaboration with academia, Facebook is pleased to invite faculty and graduate students to respond to this call for research proposals pertaining to the aforementioned topics.

For more information and to respond to the Probability and Programming request for proposals, visit the research award page.

Facebook is a platinum sponsor of the 46th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL). The conference is being held Sunday, January 13th through Saturday, January 19th at the Hotel Cascais Miragem in Cascais/Lisbon, Portugal.

A few Facebook researchers are attending the conference to present their work, give talks and participate in academic outreach. Here are the papers and keynotes being presented by Facebook Research:

Paper: Building Your Own Modular Static Analyser with Infer as part of the TutorialFest
Jules Villard, Ezgi Çiçek, Dino Distefano, Nikos Gorogiannis, and Peter O’Hearn

Paper: A True Positives Theorem for a Static Race Detector as part of the Concurrency track
Nikos Gorogiannis, Peter O’Hearn, and Ilya Sergey (Yale-NUS College)

### Magister Dixit

“To tackle sector-wide challenges, we need a range of voices involved.” Jake Porway ( October 1, 2015 )

### What is driving Europe’s tech economy?

Here is what “The State of European Tech 2018”, a data-driven analytical report by Atomico, reveals about the factors that are accelerating the growth of tech in the European economy. The comparison of Europe’s tech scene with Silicon Valley in the U.S. and Asia (particularly China) has been a topic

The post What is driving Europe’s tech economy? appeared first on Dataconomy.

### Curtly: “Dataquest changed my life”

Curtly Critchlow came to data science for a reason: he was struggling with a data nightmare.

Curtly, 25, works at the Livestock Development Authority in Guyana, South America. “Our agency is responsible for visiting farmers and helping them to become better farmers,” he said. To do that, they travel to hundreds of farms across the country, providing services and generating data with each visit. It’s an important program, but it created a massive data headache for Curtly, who was responsible for wrangling all of that data.

To record data, the agency was using a Dropbox account with ten regional folders, each containing six files of data for the six services the agency offers in each region. For example, veterinary visits in one region would be logged in the vet file in that region’s folder, while vet visits in another region would go into a different folder and file. Working with that data in any meaningful way meant trying to combine sixty different files into six Excel spreadsheets. And that meant scrubbing huge mountains of data by hand.

“It took up an entire week every month trying to combine and clean the data,” Curtly said. And because of the large volume of data, doing any kind of operation in those Excel sheets took forever. Even a simple copy-paste command might take five minutes to execute because Excel was so bogged down.

“At this point in time I had no idea that data science was even a career,” Curtly said. “But I reached a point where Excel wasn’t enough, so I was looking for alternative solutions.”

“I began reading up [data science],” he said. “I realized, hey, I actually like data science. I’m not qualified for it, but I like it, so how can I learn it?” He started watching data science presentations on Youtube, and when he found a good one, he reached out to its creator to ask how he could learn more. The creator advised him to check out Dataquest.

He worked through a few of our free courses, and quickly saw it was very different from the video-lecture-based courses he’d tried before. At first, he said, “I kinda wanted videos,” but it didn’t take long for him to realize that the Dataquest approach was working for him. He felt he was learning more by reading, and that it was giving him good practice for reading software documentation (which is rarely available in video tutorial form).

His boss at the Livestock Development Authority was happy too, because Dataquest training was a lot more affordable than many of the available alternatives.

In terms of coding, Curtly was starting from absolute zero. “I was clueless about everything you guys were teaching,” he said. “When you don’t know anything there’s this fear that it’s confusing.” But the learn-by-doing structure of Dataquest lessons — and the fact that he could turn to teachers and fellow students for help — made it easier for him to make progress than it was with other solutions he’d tried. (“I wasn’t too satisfied with DataCamp,” he said. “Dataquest is way better.”)

Curtly’s focus has been to move slowly and deliberately through his courses: “As I learn, I try to apply what I’ve learned into my everyday environment to ensure that I’m really learning the concepts.” And although he says he’s only 22% of the way through the Data Scientist path on the site, it has already made a huge impact. “I’ve learned so much,” he said. “My life has been so much easier with that 22% knowledge.”

One example? That nightmare Excel file he was trying to work with each month. About eight months into his studies, after finishing our new pandas course in September, he went back to that problem and started writing code. At first, it didn’t work, but when he ran into problems he couldn’t solve, he turned to our Slack community for help. “Eventually,” he said, “I got it to work perfectly. It took about a minute to combine all 60 files into six files, and it was just a sweet feeling, an amazing feeling.”

But of course, that accomplishment won’t be the end of Curtly’s journey. In fact, in the long term, he’s got big plans. “My goal is to have my own data science startup that will help businesses and agencies increase their productivity.” “Meeting Dataquest was life-changing,” he said.

Feeling inspired? Dive in and start (or continue) your own data science journey.

### Make Teaching R Quasi-Quotation Easier

To make teaching R quasi-quotation easier, it would be nice if R string-interpolation and quasi-quotation both used the same notation. They are related concepts, so some commonality of notation would actually be clarifying, and would help teach the concepts. We will define both of the above terms, and demonstrate the relation between the two concepts.

## String-interpolation

String-interpolation is the name for substituting a value into a string. For example:

library("wrapr")

variable <- as.name("angle")

sinterp(
'variable name is .(variable)'
)
## [1] "variable name is angle"

Notice the ".(variable)" portion was replaced with the actual variable name "angle". For string interpolation we are intentionally using the ".()" notation that Thomas Lumley picked in 2003 when he introduced quasi-quotation into R (a different concept than string-interpolation, but the topic of our next section).

String interpolation is a common need, and there are many R packages that supply variations of such functionality.
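To show what such interpolation involves under the hood, here is a minimal ".()" interpolator sketched in base R. This is an illustrative sketch only (the helper name sinterp_sketch is made up for this example); it is not wrapr's actual implementation, and real packages handle many more edge cases.

```r
# A minimal ".(name)" interpolator in base R (illustrative sketch only,
# not wrapr's actual implementation).
sinterp_sketch <- function(s, env = parent.frame()) {
  # find every ".(name)" token in the string
  tokens <- regmatches(s, gregexpr("\\.\\(([^)]+)\\)", s))[[1]]
  for (tok in tokens) {
    # strip the ".(" prefix and ")" suffix to get the variable name
    nm <- sub("^\\.\\(", "", sub("\\)$", "", tok))
    # look the variable up and splice its value into the string
    val <- as.character(get(nm, envir = env))
    s <- sub(tok, val, s, fixed = TRUE)
  }
  s
}

variable <- as.name("angle")
sinterp_sketch("variable name is .(variable)")
## [1] "variable name is angle"
```

The sketch makes the two halves of the job visible: a pattern match to find the ".()" markers, and an environment lookup to fetch each value.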

## Quasi-quotation

A related idea is "quasi-quotation" which substitutes a value into a general expression. For example:

angle <- 1:10
variable <- as.name("angle")

evalb(

plot(x = .(variable),
y = sin(.(variable)))

)

Notice how in the above plot the actual variable name "angle" was substituted into the graphics::plot() arguments, allowing this name to appear on the axis labels.

evalb() is a very simple function built on top of base::bquote():

print(evalb)
## function(..., where = parent.frame()) {
##   force(where)
##   exprq <- bquote(..., where = where)
##   eval(exprq,
##        envir = where,
##        enclos = where)
## }
## <bytecode: 0x7fa0181b4470>
## <environment: namespace:wrapr>

All evalb() does is call bquote() on its arguments and then evaluate the result. A way to teach this is to call bquote() alone.

bquote(

plot(x = .(variable),
y = sin(.(variable)))

)
## plot(x = angle, y = sin(angle))

And we see the un-executed code with the substitutions performed.
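The same two-step pattern, build the expression with bquote() and then run it with eval(), works for any call, not just plot(). A small self-contained base-R example, using sum() instead of plot() so the result is a printable value:

```r
# bquote() substitutes .(x) markers into an unevaluated call;
# eval() then executes that call: the same two steps evalb() performs.
angle <- 1:10
variable <- as.name("angle")

expr <- bquote(sum(.(variable)))
print(expr)
## sum(angle)

eval(expr)
## [1] 55
```

Separating the two steps this way lets students inspect the substituted-but-unexecuted expression before running it.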

There are many quasi-quotation systems in R, base::bquote() among them.

If you don’t want to wrap your plot() call in evalb() you can instead pre-adapt the function. Below we create a new function plotb() that is intended as shorthand for eval(bquote(plot(...))).

plotb <- bquote_function(graphics::plot)

plotb(x = .(variable),
y = sin(.(variable)))
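A rough base-R sketch of what such a function adapter can look like is below. This is a hypothetical teaching implementation (the name bquote_function_sketch is made up here); wrapr's actual bquote_function() is more careful about argument matching and environments.

```r
# Sketch of a bquote-adapting wrapper: capture the call unevaluated,
# run bquote() substitution over it, then evaluate the result.
bquote_function_sketch <- function(f) {
  force(f)
  function(...) {
    where <- parent.frame()
    cl <- match.call()      # the call as written, ".()" markers intact
    cl[[1]] <- quote(.f)    # point the call at the wrapped function
    # apply bquote() substitution to the captured call
    cl <- do.call(bquote, list(cl, where = where))
    # evaluate where ".f" is bound, without touching the caller's frame
    ev <- new.env(parent = where)
    assign(".f", f, envir = ev)
    eval(cl, envir = ev)
  }
}

sumb <- bquote_function_sketch(sum)
angle <- 1:10
variable <- as.name("angle")
sumb(.(variable))
## [1] 55
```

The key trick is that match.call() captures the arguments before evaluation, so the ".()" markers survive long enough for bquote() to substitute them.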

## Conclusion

When string-interpolation and quasi-quotation use the same notation we can teach them quickly as simple related concepts.