# My Data Science Blogs

## June 22, 2018

### If you did not already know

We present a gradient-tree-boosting-based structured learning model for jointly disambiguating named entities in a document. Gradient tree boosting is a widely used machine learning algorithm that underlies many top-performing natural language processing systems. Surprisingly, most work limits the use of gradient tree boosting to regular classification or regression problems, despite the structured nature of language. To the best of our knowledge, our work is the first to employ the structured gradient tree boosting (SGTB) algorithm for collective entity disambiguation. By defining global features over previous disambiguation decisions and jointly modeling them with local features, our system is able to produce globally optimized entity assignments for mentions in a document. Exact inference is prohibitively expensive for our globally normalized model. To solve this problem, we propose Bidirectional Beam Search with Gold path (BiBSG), an approximate inference algorithm that is a variant of the standard beam search algorithm. BiBSG makes use of global information from both past and future to perform better local search. Experiments on standard benchmark datasets show that SGTB significantly improves upon published results. Specifically, SGTB outperforms the previous state-of-the-art neural system by nearly 1% absolute accuracy on the popular AIDA-CoNLL dataset. …

ChoiceNet
In this paper, we focus on the supervised learning problem with corrupted training data. We assume that the training dataset is generated from a mixture of a target distribution and other unknown distributions. We estimate the quality of each data point by revealing the correlation between the generating distribution and the target distribution. To this end, we present a novel framework referred to here as ChoiceNet that can robustly infer the target distribution in the presence of inconsistent data. We demonstrate that the proposed framework is applicable to both classification and regression tasks. ChoiceNet is extensively evaluated in comprehensive experiments, where we show that it consistently outperforms existing baseline methods in handling noisy data. In particular, ChoiceNet is successfully applied to autonomous driving tasks, where it learns a safe driving policy from a dataset of mixed quality. In the classification task, we apply the proposed method to the CIFAR-10 dataset and it shows superior performance in terms of robustness to noisy labels. …

Long short-term memory (LSTM) is normally used in recurrent neural networks (RNNs) as the basic recurrent unit. However, conventional LSTM assumes that the state at the current time step depends only on the previous time step. This assumption constrains the time-dependency modeling capability. In this study, we propose a new variation of LSTM, advanced LSTM (A-LSTM), for better temporal context modeling. We employ A-LSTM in a weighted-pooling RNN for emotion recognition. A-LSTM outperforms conventional LSTM by 5.5% relative. The A-LSTM-based weighted-pooling RNN can also complement the state-of-the-art emotion classification framework. This shows the advantage of A-LSTM. …

### Fast Convex Pruning of Deep Neural Networks - implementation -

Ali just sent me the following:

Hi Igor,

Just wanted to share our recent paper on pruning neural networks, which makes a strong connection with the compressed sensing literature:

- Paper: "Fast convex pruning of deep neural networks", https://arxiv.org/abs/1806.06457
- Code + implementation instructions: https://dnntoolbox.github.io/Net-Trim/

Thanks;
-Ali

Thanks Ali !

Fast Convex Pruning of Deep Neural Networks by Alireza Aghasi, Afshin Abdi, Justin Romberg
We develop a fast, tractable technique called Net-Trim for simplifying a trained neural network. The method is a convex post-processing module, which prunes (sparsifies) a trained network layer by layer, while preserving the internal responses. We present a comprehensive analysis of Net-Trim from both the algorithmic and sample complexity standpoints, centered on a fast, scalable convex optimization program. Our analysis includes consistency results between the initial and retrained models before and after Net-Trim application and guarantees on the number of training samples needed to discover a network that can be expressed using a certain number of nonzero terms. Specifically, if there is a set of weights that uses at most $s$ terms that can re-create the layer outputs from the layer inputs, we can find these weights from $O(s \log(N/s))$ samples, where $N$ is the input size. These theoretical results are similar to those for sparse regression using the Lasso, and our analysis uses some of the same recently developed tools (namely, recent results on the concentration of measure and convex analysis). Finally, we propose an algorithmic framework based on the alternating direction method of multipliers (ADMM), which allows a fast and simple implementation of Net-Trim for network pruning and compression.
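
The sample-complexity claim mirrors sparse regression with the Lasso, so the flavor of the result can be illustrated with a toy compressed-sensing recovery. The sketch below is my own illustration of that regime (ISTA for the Lasso), not the Net-Trim algorithm itself: it recovers an $s$-sparse vector from $m$ random measurements, with $m$ on the order of a few times $s \log(N/s)$.

```python
import numpy as np

# Illustrative sketch of the sparse-regression regime the paper invokes
# (ISTA for the Lasso) -- not Net-Trim itself.
rng = np.random.default_rng(1)
N, s, m = 200, 5, 60                 # ambient size, sparsity, samples ~ s*log(N/s)
w_true = np.zeros(N)
w_true[rng.choice(N, s, replace=False)] = rng.normal(size=s)
X = rng.normal(size=(m, N)) / np.sqrt(m)
y = X @ w_true                       # noiseless measurements

lam = 0.05
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L for the smooth part of the objective
w = np.zeros(N)
for _ in range(5000):
    g = w + step * X.T @ (y - X @ w)                          # gradient step
    w = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft-threshold
```

With far fewer samples than unknowns ($m = 60$, $N = 200$), the estimate lands close to the planted sparse vector, which is the phenomenon the Net-Trim guarantees formalize for network layers.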


### Two thousand five hundred ways to say the same thing

Wallethub published a credit card debt study, which includes the following map:

Let's describe what's going on here.

The map plots cities (N = 2,562) in the U.S. Each city is represented by a bubble. The color of the bubble ranges from purple to green, encoding the city's percentile rank by the amount of credit card debt paid down by consumers. Purple represents the 1st percentile (the lowest amount of paydown) while green represents the 99th percentile (the highest amount of paydown).

Bubble size encodes exactly the same data, apparently in a coarser gradation. The more purple the color, the smaller the bubble; the more green the color, the larger the bubble.

***

The design decisions are baffling.

Purple is more noticeable than green, yet it signifies the less important cities, those with the smaller paydowns.

With over 2,500 bubbles crowding onto the map, over-plotting is inevitable. The purple bubbles are printed last, dominating the attention, even though they represent the least important cities (1st percentile). The green bubbles, despite being larger, lie underneath the smaller purple bubbles.

What might be the message of this chart? Our best guess is: the map explores the regional variation in the paydown rate of credit card debt.

The analyst provides all the data beneath the map.

From this table, we learn that the ranking is not based on total amount of debt paydown, but the amount of paydown per household in each city (last column). That makes sense.

Shouldn't it be ranked by the paydown rate instead of the per-household number? Dividing the "Total Credit Card Paydown by City" by "Total Credit Card Debt Q1 2018" should yield the paydown rate. Surprise! This formula yields a column consisting entirely of 4.16%.

What does this mean? They applied the national paydown rate of 4.16% to every one of the 2,562 cities in the country. If they had plotted the paydown rate, every city would attain the same color. To create "variability," they plotted the per-household debt paydown amount. Said differently, the color scale encodes not credit card paydown, as asserted, but the amount of credit card debt per household in each city.
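
The arithmetic is easy to verify. A minimal sketch with hypothetical city figures (the numbers below are made up; only the 4.16% rate comes from the study) shows that under a uniform rate, ranking by per-household paydown is exactly ranking by per-household debt:

```python
# Hypothetical city figures: total credit card debt and household counts.
debt = {"Anytown": 50_000_000, "Somewhere": 120_000_000, "Elsewhere": 80_000_000}
households = {"Anytown": 20_000, "Somewhere": 45_000, "Elsewhere": 30_000}

RATE = 0.0416  # the single national paydown rate applied to every city

paydown_per_household = {c: RATE * debt[c] / households[c] for c in debt}
debt_per_household = {c: debt[c] / households[c] for c in debt}

# Ranking by per-household paydown is identical to ranking by per-household
# debt, because every city is scaled by the same constant 4.16%.
rank_by_paydown = sorted(debt, key=paydown_per_household.get)
rank_by_debt = sorted(debt, key=debt_per_household.get)
```

The two rankings can never differ: multiplying every city's number by the same constant cannot reorder them.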

Here is a scatter plot of the credit card amount against the paydown amount.

A perfect alignment!

This credit card debt paydown map is an example of a QDV chart, in which there isn't a clear question, there is almost no data, and the visual contains several flaws. (See our Trifecta checkup guide.) We are presented 2,562 ways of saying the same thing: 4.16%.

P.S. [6/22/2018] Added scatter plot, and cleaned up some language.

### R Packages worth a look

Render ‘Plotly’ Maps without an Internet Connection (plotlyGeoAssets)
Includes ‘JavaScript’ files that allow ‘plotly’ maps to render without an internet connection.

Sample Design, Drawing & Data Analysis Using Data Frames (sampler)
Determine sample sizes, draw samples, and conduct data analysis using data frames. It specifically enables you to determine simple random sample sizes, stratified sample sizes, and complex stratified sample sizes using a secondary variable such as population; draw simple random samples and stratified random samples from sampling data frames; determine which observations are missing from a random sample, missing by strata, duplicated within a dataset; and perform data analysis, including proportions, margins of error and upper and lower bounds for simple, stratified and cluster sample designs.
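
For context, the "simple random sample size" such packages compute is usually Cochran's formula with a finite-population correction. A quick sketch of that calculation (my own illustration, not code from sampler):

```python
import math

def srs_sample_size(N, e=0.05, z=1.96, p=0.5):
    """Cochran's formula with finite-population correction: the standard
    calculation behind simple-random-sample-size helpers."""
    n0 = z ** 2 * p * (1 - p) / e ** 2           # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / N))    # correct for finite N

print(srs_sample_size(10000))   # sample size for +/-5% margin at 95% confidence
```

For a population of 10,000 this gives 370, the familiar textbook answer for a 5% margin of error at 95% confidence.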

Feature Selection and Ranking by Simultaneous Perturbation Stochastic Approximation (spFSR)
An implementation of feature selection and ranking via simultaneous perturbation stochastic approximation (SPSA-FSR) based on works by V. Aksakalli and M. Malekipirbazari (2015) <arXiv:1508.07630> and Zeren D. Yenice et al. (2018) <arXiv:1804.05589>. The SPSA-FSR algorithm searches for a locally optimal set of features that yield the best predictive performance using a specified error measure such as mean squared error (for regression problems) or accuracy rate (for classification problems). This package requires an object of class ‘task’ and an object of class ‘Learner’ from the ‘mlr’ package.

Change-Point Analysis of High-Dimensional Time Series via Binary Segmentation (hdbinseg)
Binary segmentation methods for detecting and estimating multiple change-points in the mean or second-order structure of high-dimensional time series as described in Cho and Fryzlewicz (2014) <doi:10.1111/rssb.12079> and Cho (2016) <doi:10.1214/16-EJS1155>.

Sensitivity Analysis for Comparative Methods (sensiPhy)
An implementation of sensitivity analysis for phylogenetic comparative methods. The package is an umbrella of statistical and graphical methods that estimate and report different types of uncertainty in PCM: (i) Species Sampling uncertainty (sample size; influential species and clades). (ii) Phylogenetic uncertainty (different topologies and/or branch lengths). (iii) Data uncertainty (intraspecific variation and measurement error).

### Document worth reading: “Deep Face Recognition: A Survey”

Driven by graphics processing units (GPUs), massive amounts of annotated data and more advanced algorithms, deep learning has recently taken the computer vision community by storm and has benefited real-world applications, including face recognition (FR). Deep FR methods leverage deep networks to learn more discriminative representations, significantly improving the state of the art and surpassing human performance (97.53%). In this paper, we provide a comprehensive survey of deep FR methods, including data, algorithms and scenes. First, we summarize the commonly used datasets for training and testing. Then, the data preprocessing methods are categorized into two classes: ‘one-to-many augmentation’ and ‘many-to-one normalization’. Second, for algorithms, we summarize different network architectures and loss functions used in the state-of-the-art methods. Third, we review several scenes in deep FR, such as video FR, 3D FR and cross-age FR. Finally, some potential deficiencies of the current methods and several future directions are highlighted. Deep Face Recognition: A Survey

### I am the supercargo

In a form of sympathetic magic, many built life-size replicas of airplanes out of straw and cut new military-style landing strips out of the jungle, hoping to attract more airplanes. – Wikipedia

Twenty years ago, Geri Halliwell left the Spice Girls, so I’ve been thinking about Cargo Cults a lot.

As an analogy for what I’m gonna talk about, it’s … inapt, but only if you’ve looked up Cargo Cults. But I’m going with it because it’s pride week and Drag Race is about to start.

The thing is, it can be hard to identify if you’re a member of a Cargo Cult. The whole point is that from within the cult, everything seems good and sensible and natural. Or, to quote today’s titular song,

They say “our John Frum’s coming,
He’s bringing cargo…” and the rest
At least they don’t expect to be
Surviving their own deaths.

This has been on my mind on and off for a while now. Mostly from a discussion I had with someone in the distant-enough-to-not-actually-remember-who-I-was-talking-to past, where we were arguing about something (I’m gonna guess non-informative vs informative priors, but honestly I do not remember) and this person suggested that the thing I didn’t like was a good idea, at least in part, because Harold Jeffreys thought it was a good idea.

A technical book written in the 1930s being used as a coup de grâce to end a technical argument in 2018 screams cargo cult to me. But is that fair? (Extreme narrator voice: It is not fair.)

I guess this is one of the problems we need to deal with as a field: how do we maintain the best bits of our old knowledge (pre-computation, early computation, MCMC, and now) while dealing with the rapidly evolving nature of modern data and modern statistical questions?

So how do you avoid cult like behaviour? Well, as a child of Nü-Metal*, I think there’s only one real answer:

Break stuff

I am a firm believer that before you use a method, you should know how to break it. Describing how to break something should be an essential part of describing a new piece of statistical methodology (or, for that matter, of resurrecting an existing one). At the risk of getting all Dune on you, he who can destroy a thing controls a thing.

(We’re getting very masc4masc here. Who’d’ve thought that me with a hangover was so into Sci-Fi & Nü-Metal? Next thing you know I’ll be doing a straight-faced reading of Ender’s Game. Look for me at 2am explaining to a woman who’d really rather not be still talking to me that they’re just called “buggers” because they look like bugs.)

So let’s break something.

This isn’t meant to last, this is for right now

Specifically let’s talk about breaking leave-one-out cross validation (LOO-CV) for computing the expected log-predictive density (elpd or sometimes LOO-elpd). Why? Well, partly because I also read that paper that Aki commented on a few weeks back that made me think more about the dangers of accidentally starting a cargo cult. (In this analogy, the cargo is a R package and a bunch of papers.)

One of the fabulous things about this job is that there are two things you really can’t control: how people will use the tools you construct, and how long they will continue to take advice that turned out not to be the best (for serious, cool it with the Cauchy priors!).

So it’s really important to clearly communicate flaws in method both when it’s published and later on. This is, of course, in tension with the desire to actually get work published, so we do  what we can.

Now, Aki’s response was basically definitive, so I’m mostly not going to talk about the paper. I’m just going to talk about LOO.

One step closer to the edge

One of the oldest criticisms of using LOO for model selection is that it is not necessarily consistent when the model list contains the true data generating model (the infamous, but essentially useless** M-Closed setting). This contrasts with model selection using Bayes’ Factors, which are consistent in the useless asymptotic regime. (Very into Nü-Metal. Very judgemental.)

Being that judge-y without explaining the context is probably not good practice, so let’s actually look at the famous case where model selection will not be consistent: Nested models.

For a very simple example, let’s consider two potential models:

$\text{M1:}\; y_i \sim N(\mu, 1)$

$\text{M2:}\; y_i \sim N(\mu + \beta x_i, 1)$

The covariate $x_i$ can be anything, but for simplicity, let’s take it to be $x_i \sim N(0,1)$.

And to put us in an M-Closed setting, let’s assume the data that we are seeing is drawn from the first model (M1) with $\mu = 0$. In this situation, model selection based on the LOO expected log predictive density will be inconsistent.

Spybreak!

To see this, we need to understand what the LOO methods are using to select models. It is the ability to predict a new data point coming from the (assumed iid) data generating mechanism. If two models asymptotically produce the same one point predictive distribution, then the LOO-elpd criterion will not be able to separate them.  This is different to Bayes’ factors, which will always choose the simplest of the models that make the same predictions.

Let’s look at what happens asymptotically. (And now you see why I focussed on such simple models: I’m quite bad at maths.)

Because these models are regular and have finite-dimensional parameters, they both satisfy all of the conditions of the Bernstein-von Mises theorem (which I once wrote about in these pages during an epic panic attack), which means that in both cases the posterior for the model parameters $\theta$ after observing $n$ data points satisfies $\theta_j^{(n)} = \theta_{(j)}^* + \mathcal{O}_p(n^{-1/2})$. Here:

• $\theta_j^{(n)}$ is the random variable distributed according to the posterior for model $j$ after $n$ observations,
• $\theta_{(j)}^*$ is the true parameter from model $j$ that would generate the data; in this case $\theta_{(1)}^*=0$ and $\theta_{(2)}^*=(0,0)^T$,
• and $\mathcal{O}_p(n^{-1/2})$ is a random variable with (finite) standard deviation that goes to zero like $n^{-1/2}$ as $n$ increases.

Arguing loosely (again: quite bad at maths), the LOO-elpd criterion is trying to compute $E_{\theta_j^{(n)}}\left[\log p(y\mid\theta_j^{(n)})\right]$, which asymptotically looks like $\log p(y\mid\theta_j^*)+\mathcal{O}_p(n^{-1/2})$.

This means that, asymptotically, both of these models will give rise to the same posterior predictive distribution and hence LOO-elpd will not be able to tell between them.
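
A quick simulation makes this concrete. The sketch below uses plug-in maximum-likelihood predictives as a cheap stand-in for the full posterior predictive (so it illustrates the asymptotic argument rather than doing faithful Bayesian LOO): the LOO log-score gap between M1 and M2 stays $O(1)$ however large $n$ gets, so it never decisively separates the models.

```python
import numpy as np

LOG2PI = np.log(2 * np.pi)

def norm_logpdf(y, mu):
    # log density of N(mu, 1)
    return -0.5 * LOG2PI - 0.5 * (y - mu) ** 2

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = rng.normal(size=n)          # truly generated by M1 (mu = 0)

def loo_m1(y):
    # M1: refit mu on the n-1 retained points, score the held-out point
    s = y.sum()
    mu_loo = (s - y) / (len(y) - 1)    # vectorized leave-one-out means
    return norm_logpdf(y, mu_loo).sum()

def loo_m2(y, x):
    # M2: refit (mu, beta) by least squares with point i held out
    total = 0.0
    idx = np.arange(len(y))
    for i in idx:
        keep = idx != i
        A = np.column_stack([np.ones(keep.sum()), x[keep]])
        coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        total += norm_logpdf(y[i], coef[0] + coef[1] * x[i])
    return total

diff = loo_m1(y) - loo_m2(y, x)   # stays O(1): no decisive preference for M1
```

Even though M1 is the true model, the gap hovers around the half-a-parameter overfitting penalty rather than growing with $n$, which is exactly the inconsistency described above.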

Take a look around

LOO-elpd can’t tell them apart, but we sure can! The thing is, the argument of inconsistency in this case only really holds water if you never actually look at the parameter estimates. If you know that you have nested models (ie that one is the special case of another), you should just look at the estimates to see if there’s any evidence for the more complex model.  Or, if you want to do it more formally, consider the family of potential nested models as your M-Complete model class and use something like projpred to choose the simplest one.

All of which is to say that this inconsistency is mathematically a very real thing, but it should not cause practical problems unless you use model selection tools blindly and thoughtlessly.

For a bonus extra fact: this type of setup will also cause the stacking weights we (Yuling, Aki, Andrew, and me) proposed not to stabilize, because any convex combination will asymptotically give the same distribution. So be careful if you’re trying to interpret model stacking weights as posterior model probabilities.

Have a cigar

But I said I was going to break things. And so far I’ve just propped up the method yet again.

The thing is, there is a much bigger problem with LOO-elpd. The problem is the assumption that leaving one observation out is enough to get a good approximation to the average value of the posterior log-predictive over a new data set. This is all fine when the data consists of iid draws from some model.

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead.  Or spatial data, where you should leave out large-enough spatial regions that the point you are predicting is effectively independent of all of the points that remain in the data set. Or when your data has multilevel structure, where you really should leave out whole strata.

In all of these cases, cross validation can be a useful tool, but it’s k-fold cross validation that’s needed rather than LOO-CV. Moreover, if your data is weird, it can be hard to design a cross validation scheme that’s defensible. Worse still, while LOO is cheap (thanks to Aki and Jonah’s work on the loo package), k-fold CV requires re-fitting the model a lot of times, which can be extremely expensive.
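
For the time-series case, the "leave out the whole future" idea can be sketched as an expanding-window fold generator (an illustration of the scheme; the function name is mine, not something from the loo package):

```python
import numpy as np

def leave_future_out_folds(n, n_folds, min_train):
    """Expanding-window folds for time-ordered data: each fold trains on an
    initial segment and scores the block that immediately follows, so no
    future information leaks into the training set."""
    block = (n - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * block
        test_end = min(train_end + block, n)
        yield np.arange(train_end), np.arange(train_end, test_end)

folds = list(leave_future_out_folds(n=100, n_folds=4, min_train=20))
```

Each held-out block sits strictly after everything it is predicted from, unlike LOO-CV, which would happily condition on the future when scoring a mid-series point.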

All of this is to say that if you want to avoid an accidental LOO cargo cult, you need to be very aware of the assumptions and limitations of the method and to use it wisely, rather than automatically. There is no such thing as an automatic statistician.

* One of the most harrowing days of my childhood involved standing at the checkout of the Target in Buranda (a place that has not changed in 15 years, btw) and having to choose between buying the first Linkin Park album and the first Coldplay album. You’ll be pleased to know that I made the correct choice.

** When George Box said that “All models are wrong” he was saying that M-Closed is a useless assumption that is never fulfilled.

The post I am the supercargo appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Distilled News

Automating machine learning by providing techniques that autonomously find the best algorithm, hyperparameter configuration and preprocessing is helpful for both researchers and practitioners. Therefore, it is not surprising that automated machine learning has become a very interesting field of research. Bayesian optimization has proven to be a very successful tool for automated machine learning. In the first part of the thesis we present different approaches to improve Bayesian optimization by means of transfer learning. We present three different ways of considering meta-knowledge in Bayesian optimization, i.e. search space pruning, initialization and transfer surrogate models. Finally, we present a general framework for Bayesian optimization combined with meta-learning and conduct a comparison among existing work on two different meta-data sets. A conclusion is that in particular the meta-target driven approaches provide better results. Choosing algorithm configurations based on the improvement on the meta-knowledge combined with the expected improvement yields best results. The second part of this thesis is more application-oriented. Bayesian optimization is applied to large data sets and used as a tool to participate in machine learning challenges. We compare its autonomous performance and its performance in combination with a human expert. At two ECML-PKDD Discovery Challenges, we are able to show that automated machine learning outperforms human machine learning experts. Finally, we present an approach that automates the process of creating an ensemble of several layers, different algorithms and hyperparameter configurations. These kinds of ensembles are jokingly called Frankenstein ensembles and proved their benefit on versatile data sets in many machine learning challenges. 
We compare our approach Automatic Frankensteining with the current state of the art for automated machine learning on 80 different data sets and can show that it outperforms them on the majority using the same training time. Furthermore, we compare Automatic Frankensteining on a large-scale data set to more than 3,500 machine learning expert teams and are able to outperform more than 3,000 of them within 12 CPU hours.
In today’s world, every customer is faced with multiple choices. For example, if I’m looking for a book to read without any specific idea of what I want, there’s a wide range of possibilities for how my search might pan out. I might waste a lot of time browsing around on the internet and trawling through various sites hoping to strike gold. I might look for recommendations from other people. But if there was a site or app which could recommend me books based on what I have read previously, that would be a massive help. Instead of wasting time on various sites, I could just log in and voila! 10 recommended books tailored to my taste. This is what recommendation engines do and their power is being harnessed by most businesses these days. From Amazon to Netflix, Google to Goodreads, recommendation engines are one of the most widely used applications of machine learning techniques. In this article, we will cover various types of recommendation engine algorithms and the fundamentals of creating them in Python. We will also see the mathematics behind the workings of these algorithms. Finally, we will create our own recommendation engine using matrix factorization.
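
The matrix-factorization approach mentioned at the end can be sketched in a few lines of NumPy (a toy gradient-descent factorizer on a made-up ratings matrix, my own illustration rather than the article's code):

```python
import numpy as np

# Toy ratings matrix (0 = unobserved); factorize as R ~ U @ V.T
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
k, lr, reg = 2, 0.01, 0.02       # latent dims, learning rate, L2 penalty
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(5000):
    E = mask * (R - U @ V.T)     # error on observed entries only
    U += lr * (E @ V - reg * U)  # gradient steps on both factor matrices
    V += lr * (E.T @ U - reg * V)

pred = U @ V.T                   # predicted ratings, including missing cells
```

The zeros in `R` get filled with predictions learned from the observed entries, which is the basic mechanism behind matrix-factorization recommenders.
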
In this tutorial, you will build four models using Latent Dirichlet Allocation (LDA) and K-Means clustering machine learning algorithms.
Today, Python is one of the most popular programming languages and it has replaced many languages in the industry. There are various reasons for its popularity, and one of them is that Python has a large collection of libraries. With Python, data scientists need not spend all day debugging; they just need to invest time in determining which library works best for their ongoing projects. So, what is a Python library? It is a collection of methods and functions that enables you to carry out many actions without writing your own code.
Tensorflow 1.5 implementation of Chris Moody’s Lda2vec
Machine learning is increasingly moving from hand-designed models to automatically optimized pipelines using tools such as H20, TPOT, and auto-sklearn. These libraries, along with methods such as random search, aim to simplify the model selection and tuning parts of machine learning by finding the best model for a dataset with little to no manual intervention. However, feature engineering, an arguably more valuable aspect of the machine learning pipeline, remains almost entirely a human labor. Feature engineering, also known as feature creation, is the process of constructing new features from existing data to train a machine learning model. This step can be more important than the actual model used because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to a task is absolutely crucial (see the excellent paper ‘A Few Useful Things to Know about Machine Learning’). Typically, feature engineering is a drawn-out manual process, relying on domain knowledge, intuition, and data manipulation. This process can be extremely tedious and the final features will be limited both by human subjectivity and time. Automated feature engineering aims to help the data scientist by automatically creating many candidate features out of a dataset from which the best can be selected and used for training. In this article, we will walk through an example of using automated feature engineering with the featuretools Python library. We will use an example dataset to show the basics (stay tuned for future posts using real-world data). The complete code for this article is available on GitHub.
In order to solve tasks, deep neural networks (DNNs) progressively transform input data into a sequence of complex representations (i.e., patterns of activations across individual neurons). Understanding these representations is critically important, not only for interpretability, but also so that we can more intelligently design machine learning systems. However, understanding these representations has proven quite difficult, especially when comparing representations across networks. In a previous post, we outlined the benefits of Canonical Correlation Analysis (CCA) as a tool for understanding and comparing the representations of convolutional neural networks (CNNs), showing that they converge in a bottom-up pattern, with early layers converging to their final representations before later layers over the course of training.
This 29-part course consists of tutorials on ML concepts and algorithms, as well as end-to-end follow-along ML examples, quizzes, and hands-on projects. You can think of this course as a ‘Free Online Nano Book’.
This paper addresses a key NLP problem known as sarcasm detection using a combination of models based on convolutional neural networks (CNNs). Detection of sarcasm is important in other areas such as affective computing and sentiment analysis because such expressions can flip the polarity of a sentence.
As a beginner at deep learning, one of the things I realized is that there isn’t much online documentation that covers all the deep learning tricks in one place. There are lots of small best practices, ranging from simple tricks like initializing weights and regularization to slightly complex techniques like cyclic learning rates, that can make training and debugging neural nets easier and more efficient. This inspired me to write this series of blogs where I will cover as many nuances as I can to make implementing deep learning simpler for you. While writing this blog, the assumption is that you have a basic idea of how neural networks are trained. An understanding of weights, biases, hidden layers, activations and activation functions will make the content clearer. I would recommend this course if you wish to build a basic foundation of deep learning. Note: whenever I refer to layers of a neural network, it implies the layers of a simple neural network, i.e. the fully connected layers. Of course some of the methods I talk about apply to convolutional and recurrent neural networks as well. In this blog I am going to talk about the issues related to initialization of weight matrices and ways to mitigate them. Before that, let’s just cover some basics and notations that we will be using going forward.
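
As a taste of the initialization issue the post covers, the two standard schemes look like this (a minimal NumPy sketch; the function names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # He initialization: variance 2/fan_in keeps activation variance
    # roughly constant through ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot: variance 2/(fan_in + fan_out), suited to tanh/sigmoid.
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

W = he_init(1024, 512)   # weight matrix for a 1024 -> 512 ReLU layer
```

Both schemes scale the weight variance to the layer width so that signals neither explode nor vanish as they propagate through many layers.
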
A mechanistic model for the relationship between x and y sometimes needs parameter estimation. When model linearisation does not work, we need to use non-linear modelling.
You may be interested in my new arXiv paper, joint work with Xi Cheng, an undergraduate at UC Davis (now heading to Cornell for grad school); Bohdan Khomtchouk, a post doc in biology at Stanford; and Pete Mohanty, a Science, Engineering & Education Fellow in statistics at Stanford. The paper is of a provocative nature, and we welcome feedback.
I have frequent conversations with R champions and Systems Administrators responsible for R, in which they ask how they can measure and analyze the usage of their servers. Among the many solutions to this problem, one of my favourites is to use an RRD database and RRDtool.
RRDtool is the OpenSource industry standard, high performance data logging and graphing system for time series data. RRDtool can be easily integrated in shell scripts, perl, python, ruby, lua or tcl applications.

## June 21, 2018

### Idle thoughts lead to R internals: how to count function arguments

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

“Some R functions have an awful lot of arguments”, you think to yourself. “I wonder which has the most?”

It’s not an original thought: the same question as applied to the R base package is an exercise in the Functions chapter of the excellent Advanced R. Much of the information in this post came from there.

There are lots of R packages. We’ll limit ourselves to those packages which ship with R, and which load on startup. Which ones are they?

What packages load on starting R?
Start a new R session and type search(). Here’s the result on my machine:

> search()
 [1] ".GlobalEnv"        "tools:rstudio"     "package:stats"
 [4] "package:graphics"  "package:grDevices" "package:utils"
 [7] "package:datasets"  "package:methods"   "Autoloads"
[10] "package:base"

We’re interested in the packages with priority = base. Next question:

How can I see and filter for package priority?
You don’t need dplyr for this, but it helps.

library(tidyverse)

installed.packages() %>%
  as.tibble() %>%
  filter(Priority == "base") %>%
  select(Package, Priority)

# A tibble: 14 x 2
   Package   Priority
   <chr>     <chr>
1 base      base
2 compiler  base
3 datasets  base
4 graphics  base
5 grDevices base
6 grid      base
7 methods   base
8 parallel  base
9 splines   base
10 stats     base
11 stats4    base
12 tcltk     base
13 tools     base
14 utils     base
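As an aside (my addition, not in the original post), a base-R equivalent needs no dplyr at all, since `installed.packages()` returns a character matrix we can subset directly:

```r
# Base-R equivalent: subset the matrix rows where Priority is "base".
# %in% is used rather than == because most packages have Priority NA,
# and NA == "base" yields NA rather than FALSE.
ip <- installed.packages()
base_pkgs <- ip[ip[, "Priority"] %in% "base", c("Package", "Priority")]
base_pkgs[, "Package"]
```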


Comparing to the output from search(), we want to look at: stats, graphics, grDevices, utils, datasets, methods and base.

How can I see all the objects in a package?
Like this, for the base package. For other packages, just change base to the package name of interest.

ls("package:base")


However, not every object in a package is a function. Next question:

How do I know if an object is a function?
The simplest way is to use is.function().

is.function(ls)
[1] TRUE


What if the function name is stored as a character variable, “ls”? Then we can use get():

is.function(get("ls"))
[1] TRUE


But wait: what if two functions from different packages have the same name and we have loaded both of those packages? Then we specify the package too, using the pos argument.

is.function(get("Position", pos = "package:base"))
[1] TRUE
is.function(get("Position", pos = "package:ggplot2"))
[1] FALSE


So far, so good. Now, to the arguments.

How do I see the arguments to a function?
Now things start to get interesting. In R, function arguments are called formals. There is a function of the same name, formals(), to show the arguments for a function. You can also use formalArgs() which returns a vector with just the argument names:

formalArgs(ls)
[1] "name"      "pos"       "envir"     "all.names" "pattern"   "sorted"


But that won’t work for every function. Let’s try abs():

formalArgs(abs)
NULL


The issue here is that abs() is a primitive function, and primitives don’t have formals. Our next two questions:

How do I know if an object is a primitive?
Hopefully you guessed that one:

is.primitive(abs)
[1] TRUE


How do I see the arguments to a primitive?
You can use args(), and you can pass the output of args() to formals() or formalArgs():

args(abs)
function (x)
NULL

formalArgs(args(abs))
[1] "x"


However, there are a few objects which are primitive functions for which this doesn’t work. Let’s not worry about those.

is.primitive(`:`)
[1] TRUE

formalArgs(args(`:`))
NULL
Warning message:
In formals(fun) : argument is not a function


So what was the original question again?
Let’s put all that together. We want to find the base packages which load on startup, list their objects, identify which are functions or primitive functions, list their arguments and count them up.

We’ll create a tibble by pasting the arguments for each function into a comma-separated string, then pulling the string apart using unnest_tokens() from the tidytext package.

library(tidytext)
library(tidyverse)

pkgs <- installed.packages() %>%
  as.tibble() %>%
  filter(Priority == "base",
         Package %in% c("stats", "graphics", "grDevices", "utils",
                        "datasets", "methods", "base")) %>%
  select(Package) %>%
  rowwise() %>%
  mutate(fnames = paste(ls(paste0("package:", Package)), collapse = ",")) %>%
  unnest_tokens(fname, fnames, token = stringr::str_split,
                pattern = ",", to_lower = FALSE) %>%
  filter(is.function(get(fname, pos = paste0("package:", Package)))) %>%
  mutate(is_primitive = ifelse(is.primitive(get(fname, pos = paste0("package:", Package))),
                               1,
                               0),
         num_args = ifelse(is.primitive(get(fname, pos = paste0("package:", Package))),
                           length(formalArgs(args(fname))),
                           length(formalArgs(fname)))) %>%
  ungroup()


That throws out a few warnings where, as noted, args() doesn’t work for some primitives.
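If those warnings bother you, one option (a sketch of my own, not from the original post) is to wrap the argument count in `tryCatch()` and record `NA` for the primitives that `args()` cannot handle:

```r
# Count a function's arguments by name, returning NA instead of
# emitting a warning (or error) when args() cannot produce a
# closure representation for a primitive
safe_num_args <- function(fname) {
  tryCatch(length(formalArgs(args(fname))),
           warning = function(w) NA_integer_,
           error   = function(e) NA_integer_)
}

safe_num_args("ls")   # 6, matching formalArgs(ls) above
```

The resulting `NA`s can then be dropped with `filter(!is.na(num_args))` before counting.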

And the winner is –

pkgs %>%
  top_n(10) %>%
  arrange(desc(num_args))

Selecting by num_args
# A tibble: 10 x 4
   Package  fname            is_primitive num_args
   <chr>    <chr>                   <dbl>    <dbl>
1 graphics legend                      0       39
2 graphics stars                       0       33
3 graphics barplot.default             0       30
4 stats    termplot                    0       28
6 stats    heatmap                     0       24
7 base     scan                        0       22
8 graphics filled.contour              0       21
9 graphics hist.default                0       21
10 stats    interaction.plot            0       21


– the function legend() from the graphics package, with 39 arguments. From the base package itself, scan(), with 22 arguments.

Just to wrap up, some histograms of argument number by package suggest that the base graphics functions tend to be the most verbose.

pkgs %>%
  ggplot(aes(num_args)) +
  geom_histogram() +
  facet_wrap(~Package, scales = "free_y") +
  theme_bw() +
  labs(x = "arguments", title = "R base function arguments by package")



### Book Memo: “Essentials of Time Series for Financial Applications”

 Essentials of Time Series for Financial Applications serves as an agile reference for upper level students and practitioners who desire a formal, easy-to-follow introduction to the most important time series methods applied in financial applications (pricing, asset management, quant strategies, and risk management). Real-life data and examples developed with EViews illustrate the links between the formal apparatus and the applications. The examples either directly exploit the tools that EViews makes available or use programs that by employing EViews implement specific topics or techniques. The book balances a formal framework with as few proofs as possible against many examples that support its central ideas. Boxes are used throughout to remind readers of technical aspects and definitions and to present examples in a compact fashion, with full details (workout files) available in an on-line appendix. The more advanced chapters provide discussion sections that refer to more advanced textbooks or detailed proofs.

### What's new on arXiv

Robotic systems continuously interact with complex dynamical systems in the physical world. Reliable predictions of spatiotemporal evolution of these dynamical systems, with limited knowledge of system dynamics, are crucial for autonomous operation. In this paper, we present HybridNet, a framework that integrates data-driven deep learning and model-driven computation to reliably predict spatiotemporal evolution of dynamical systems even with inexact knowledge of their parameters. A data-driven deep neural network (DNN) with Convolutional LSTM (ConvLSTM) as the backbone is employed to predict the time-varying evolution of the external forces/perturbations. On the other hand, the model-driven computation is performed using Cellular Neural Network (CeNN), a neuro-inspired algorithm to model dynamical systems defined by coupled partial differential equations (PDEs). CeNN converts the intricate numerical computation into a series of convolution operations, enabling a trainable PDE solver. With a feedback control loop, HybridNet can learn the physical parameters governing the system’s dynamics in real-time, and accordingly adapt the computation models to enhance prediction accuracy for time-evolving dynamical systems. The experimental results on two dynamical systems, namely, heat convection-diffusion system, and fluid dynamical system, demonstrate that the HybridNet produces higher accuracy than the state-of-the-art deep learning based approach.
Weight thresholding is a simple technique that aims at reducing the number of edges in weighted networks that are otherwise too dense for the application of standard graph-theoretical methods. We show that the community structure of real weighted networks is very robust under weight thresholding, as it is maintained even when most of the edges are removed. This is due to the correlation between topology and weight that characterizes real networks. On the other hand, the behaviour of other properties is generally system dependent.
Local surrogate models, which approximate the local decision boundary of a black-box classifier, constitute one approach to generating explanations for the rationale behind an individual prediction made by the black box. This paper highlights the importance of defining the right locality, the neighborhood on which a local surrogate is trained, in order to approximate accurately the local black-box decision boundary. Unfortunately, as shown in this paper, this issue is not only a parameter or sampling distribution challenge and has a major impact on the relevance and quality of the approximation of the local black-box decision boundary and thus on the meaning and accuracy of the generated explanation. To overcome the identified problems, quantified with an adapted measure and procedure, we propose to generate surrogate-based explanations for individual predictions based on a sampling centered on a particular place of the decision boundary, relevant for the prediction to be explained, rather than on the prediction itself as is classically done. We evaluate the novel approach compared to state-of-the-art methods and a straightforward improvement thereof on four UCI datasets.
Using variational Bayes neural networks, we develop an algorithm capable of accumulating knowledge into a prior from multiple different tasks. The result is a rich and meaningful prior capable of few-shot learning on new tasks. The posterior can go beyond the mean field approximation and yields good uncertainty on the performed experiments. Analysis on toy tasks shows that it can learn from significantly different tasks while finding similarities among them. Experiments on Mini-Imagenet yield a new state of the art with 74.5% accuracy on 5-shot learning. Finally, we provide experiments showing that other existing methods can fail to perform well in different benchmarks.
The family of Expectation-Maximization (EM) algorithms provides a general approach to fitting flexible models for large and complex data. The expectation (E) step of EM-type algorithms is time-consuming in massive data applications because it requires multiple passes through the full data. We address this problem by proposing an asynchronous and distributed generalization of the EM called the Distributed EM (DEM). Using DEM, existing EM-type algorithms are easily extended to massive data settings by exploiting the divide-and-conquer technique and widely available computing power, such as grid computing. The DEM algorithm reserves two groups of computing processes called \emph{workers} and \emph{managers} for performing the E step and the maximization step (M step), respectively. The samples are randomly partitioned into a large number of disjoint subsets and are stored on the worker processes. The E step of DEM algorithm is performed in parallel on all the workers, and every worker communicates its results to the managers at the end of local E step. The managers perform the M step after they have received results from a $\gamma$-fraction of the workers, where $\gamma$ is a fixed constant in $(0, 1]$. The sequence of parameter estimates generated by the DEM algorithm retains the attractive properties of EM: convergence of the sequence of parameter estimates to a local mode and linear global rate of convergence. Across diverse simulations focused on linear mixed-effects models, the DEM algorithm is significantly faster than competing EM-type algorithms while having a similar accuracy. The DEM algorithm maintains its superior empirical performance on a movie ratings database consisting of 10 million ratings.
Most recent work on interpretability of complex machine learning models has focused on estimating $\textit{a posteriori}$ explanations for previously trained models around specific predictions. $\textit{Self-explaining}$ models where interpretability plays a key role already during learning have received much less attention. We propose three desiderata for explanations in general — explicitness, faithfulness, and stability — and show that existing methods do not satisfy them. In response, we design self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models. Faithfulness and stability are enforced via regularization specifically tailored to such models. Experimental results across various benchmark datasets show that our framework offers a promising direction for reconciling model complexity and interpretability.
Binary neural networks (BNN) have been studied extensively since they run dramatically faster at lower memory and power consumption than floating-point networks, thanks to the efficiency of bit operations. However, contemporary BNNs whose weights and activations are both single bits suffer from severe accuracy degradation. To understand why, we investigate the representation ability, speed and bias/variance of BNNs through extensive experiments. We conclude that the error of BNNs is predominantly caused by the intrinsic instability (training time) and non-robustness (train \& test time). Inspired by this investigation, we propose the Binary Ensemble Neural Network (BENN) which leverages ensemble methods to improve the performance of BNNs with limited efficiency cost. While ensemble techniques have been broadly believed to be only marginally helpful for strong classifiers such as deep neural networks, our analyses and experiments show that they are naturally a perfect fit to boost BNNs. We find that our BENN, which is faster and much more robust than state-of-the-art binary networks, can even surpass the accuracy of the full-precision floating number network with the same architecture.
Several researchers have argued that a machine learning system’s interpretability should be defined in relation to a specific agent or task: we should not ask if the system is interpretable, but to whom is it interpretable. We describe a model intended to help answer this question, by identifying different roles that agents can fulfill in relation to the machine learning system. We illustrate the use of our model in a variety of scenarios, exploring how an agent’s role influences its goals, and the implications for defining interpretability. Finally, we make suggestions for how our model could be useful to interpretability researchers, system developers, and regulatory bodies auditing machine learning systems.
Recent advances in Convolutional Neural Networks (CNN) have achieved remarkable results in localizing objects in images. In these networks, the training procedure usually requires providing bounding boxes or the maximum number of expected objects. In this paper, we address the task of estimating object locations without annotated bounding boxes, which are typically hand-drawn and time consuming to label. We propose a loss function that can be used in any Fully Convolutional Network (FCN) to estimate object locations. This loss function is a modification of the Average Hausdorff Distance between two unordered sets of points. The proposed method does not require one to ‘guess’ the maximum number of objects in the image, and has no notion of bounding boxes, region proposals, or sliding windows. We evaluate our method with three datasets designed to locate people’s heads, pupil centers and plant centers. We report an average precision and recall of 94% for the three datasets, and an average location error of 6 pixels in 256×256 images.
In this paper, we introduce a skill-balancing mechanism for adversarial non-player characters (NPCs), called Skilled Experience Catalogue (SEC). The objective of this mechanism is to approximately match the skill level of an NPC to an opponent in real-time. We test the technique in the context of a First-Person Shooter (FPS) game. Specifically, the technique adjusts a reinforcement learning NPC’s proficiency with a weapon based on its current performance against an opponent. Firstly, a catalogue of experience, in the form of stored learning policies, is built up by playing a series of training games. Once the NPC has been sufficiently trained, the catalogue acts as a timeline of experience with incremental knowledge milestones in the form of stored learning policies. If the NPC is performing poorly, it can jump to a later stage in the learning timeline to be equipped with more informed decision-making. Likewise, if it is performing significantly better than the opponent, it will jump to an earlier stage. The NPC continues to learn in real-time using reinforcement learning but its policy is adjusted, as required, by loading the most suitable milestones for the current circumstances.
Matrix factorization (MF) is extensively used to mine the user preference from explicit ratings in recommender systems. However, the reliability of explicit ratings is not always consistent, because many factors may affect the user’s final evaluation on an item, including commercial advertising and a friend’s recommendation. Therefore, mining the reliable ratings of user is critical to further improve the performance of the recommender system. In this work, we analyze the deviation degree of each rating in overall rating distribution of user and item, and propose the notion of user-based rating centrality and item-based rating centrality, respectively. Moreover, based on the rating centrality, we measure the reliability of each user rating and provide an optimized matrix factorization recommendation algorithm. Experimental results on two popular recommendation datasets reveal that our method gets better performance compared with other matrix factorization recommendation algorithms, especially on sparse datasets.
Neural networks allow Q-learning reinforcement learning agents such as deep Q-networks (DQN) to approximate complex mappings from state spaces to value functions. However, this also brings drawbacks when compared to other function approximators such as tile coding or their generalisations, radial basis functions (RBF) because they introduce instability due to the side effect of globalised updates present in neural networks. This instability does not even vanish in neural networks that do not have any hidden layers. In this paper, we show that simple modifications to the structure of the neural network can improve stability of DQN learning when a multi-layer perceptron is used for function approximation.
Multiple kernel learning (MKL) methods are generally believed to perform better than single kernel methods. However, some empirical studies show that this is not always true: the combination of multiple kernels may yield an even worse performance than using a single kernel. There are two possible reasons for the failure: (i) most existing MKL methods assume that the optimal kernel is a linear combination of base kernels, which may not hold true; and (ii) some kernel weights are inappropriately assigned due to noises and carelessly designed algorithms. In this paper, we propose a novel MKL framework by following two intuitive assumptions: (i) each kernel is a perturbation of the consensus kernel; and (ii) the kernel that is close to the consensus kernel should be assigned a large weight. Impressively, the proposed method can automatically assign an appropriate weight to each kernel without introducing additional parameters, as existing methods do. The proposed framework is integrated into a unified framework for graph-based clustering and semi-supervised classification. We have conducted experiments on multiple benchmark datasets and our empirical results verify the superiority of the proposed framework.
Abstract Dialectical Frameworks (ADFs) generalize Dung’s argumentation frameworks allowing various relationships among arguments to be expressed in a systematic way. We further generalize ADFs so as to accommodate arbitrary acceptance degrees for the arguments. This makes ADFs applicable in domains where both the initial status of arguments and their relationship are only insufficiently specified by Boolean functions. We define all standard ADF semantics for the weighted case, including grounded, preferred and stable semantics. We illustrate our approach using acceptance degrees from the unit interval and show how other valuation structures can be integrated. In each case it is sufficient to specify how the generalized acceptance conditions are represented by formulas, and to specify the information ordering underlying the characteristic ADF operator. We also present complexity results for problems related to weighted ADFs.
The identification of semantic relations between terms within texts is a fundamental task in Natural Language Processing which can support applications requiring a lightweight semantic interpretation model. Currently, semantic relation classification concentrates on relations which are evaluated over open-domain data. This work provides a critique on the set of abstract relations used for semantic relation classification with regard to their ability to express relationships between terms which are found in a domain-specific corpora. Based on this analysis, this work proposes an alternative semantic relation model based on reusing and extending the set of abstract relations present in the DOLCE ontology. The resulting set of relations is well grounded, allows to capture a wide range of relations and could thus be used as a foundation for automatic classification of semantic relations.
Deep learning (DL) has achieved remarkable progress over the past decade and been widely applied to many safety-critical applications. However, the robustness of DL systems recently receives great concerns, such as adversarial examples against computer vision systems, which could potentially result in severe consequences. Adopting testing techniques could help to evaluate the robustness of a DL system and therefore detect vulnerabilities at an early stage. The main challenge of testing such systems is that its runtime state space is too large: if we view each neuron as a runtime state for DL, then a DL system often contains massive states, rendering testing each state almost impossible. For traditional software, combinatorial testing (CT) is an effective testing technique to reduce the testing space while obtaining relatively high defect detection abilities. In this paper, we perform an exploratory study of CT on DL systems. We adapt the concept in CT and propose a set of coverage criteria for DL systems, as well as a CT coverage guided test generation technique. Our evaluation demonstrates that CT provides a promising avenue for testing DL systems. We further pose several open questions and interesting directions for combinatorial testing of DL systems.
Natural language definitions of terms can serve as a rich source of knowledge, but structuring them into a comprehensible semantic model is essential to enable them to be used in semantic interpretation tasks. We propose a method and provide a set of tools for automatically building a graph world knowledge base from natural language definitions. Adopting a conceptual model composed of a set of semantic roles for dictionary definitions, we trained a classifier for automatically labeling definitions, preparing the data to be later converted to a graph representation. WordNetGraph, a knowledge graph built out of noun and verb WordNet definitions according to this methodology, was successfully used in an interpretable text entailment recognition approach which uses paths in this graph to provide clear justifications for entailment decisions.
As the backbone technology of machine learning, deep neural networks (DNNs) have quickly ascended to the spotlight. Running DNNs on resource-constrained mobile devices is, however, by no means trivial, since it incurs high performance and energy overhead. Meanwhile, offloading DNNs to the cloud for execution suffers from unpredictable performance due to uncontrolled, long wide-area network latency. To address these challenges, in this paper, we propose Edgent, a collaborative and on-demand DNN co-inference framework with device-edge synergy. Edgent pursues two design knobs: (1) DNN partitioning that adaptively partitions DNN computation between device and edge, in order to leverage hybrid computation resources in proximity for real-time DNN inference. (2) DNN right-sizing that accelerates DNN inference through early-exit at a proper intermediate DNN layer to further reduce the computation latency. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent’s effectiveness in enabling on-demand low-latency edge intelligence.
Machine learning (ML), especially deep learning is made possible by the availability of big data, enormous compute power and, often overlooked, development tools or frameworks. As the algorithms become mature and efficient, more and more ML inference is moving out of datacenters/cloud and deployed on edge devices. This model deployment process can be challenging as the deployment environment and requirements can be substantially different from those during model development. In this paper, we propose a new ML development and deployment approach that is specially designed and optimized for inference-only deployment on edge devices. We build a prototype and demonstrate that this approach can address all the deployment challenges and result in more efficient and high-quality solutions.
We propose a novel reinforcement learning approach for finite Markov decision processes (MDPs) with delayed rewards. In this work, biases of temporal difference (TD) estimates are proved to be corrected only exponentially slowly in the number of delay steps. Furthermore, variances of Monte Carlo (MC) estimates are proved to increase the variance of other estimates, the number of which can exponentially grow in the number of delay steps. We introduce RUDDER, a return decomposition method, which creates a new MDP with same optimal policies as the original MDP but with redistributed rewards that have largely reduced delays. If the return decomposition is optimal, then the new MDP does not have delayed rewards and TD estimates are unbiased. In this case, the rewards track Q-values so that the future expected reward is always zero. We experimentally confirm our theoretical results on bias and variance of TD and MC estimates. On artificial tasks with different lengths of reward delays, we show that RUDDER is exponentially faster than TD, MC, and MC Tree Search (MCTS). RUDDER outperforms rainbow, A3C, DDQN, Distributional DQN, Dueling DDQN, Noisy DQN, and Prioritized DDQN on the delayed reward Atari game Venture in only a fraction of the learning time. RUDDER considerably improves the state-of-the-art on the delayed reward Atari game Bowling in much less learning time. Source code is available at https://…/baselines-rudder, with demonstration videos at https://goo.gl/EQerZV.
We propose and analyze a new family of algorithms for training neural networks with ReLU activations. Our algorithms are based on the technique of alternating minimization: estimating the activation patterns of each ReLU for all given samples, interleaved with weight updates via a least-squares step. We consider three different cases of this model: (i) a single ReLU; (ii) 1-hidden layer networks with $k$ hidden ReLUs; (iii) 2-hidden layer networks. We show that under standard distributional assumptions on the input data, our algorithm provably recovers the true ‘ground truth’ parameters in a linearly convergent fashion; furthermore, our method requires only $O(d)$ samples for the single ReLU case and $\widetilde{O}(dk^2)$ samples in the 1-hidden layer case. We also extend this framework to deeper networks, and empirically demonstrate its convergence to a global minimum.
High dimensional piecewise stationary graphical models represent a versatile class for modelling time varying networks arising in diverse application areas, including biology, economics, and social sciences. There has been recent work in offline detection and estimation of regime changes in the topology of sparse graphical models. However, the online setting remains largely unexplored, despite its high relevance to applications in sensor networks and other engineering monitoring systems, as well as financial markets. To that end, this work introduces a novel scalable online algorithm for detecting an unknown number of abrupt changes in the inverse covariance matrix of sparse Gaussian graphical models with small delay. The proposed algorithm is based upon monitoring the conditional log-likelihood of all nodes in the network and can be extended to a large class of continuous and discrete graphical models. We also investigate asymptotic properties of our procedure under certain mild regularity conditions on the graph size, sparsity level, number of samples, and pre- and post-changes in the topology of the network. Numerical works on both synthetic and real data illustrate the good performance of the proposed methodology both in terms of computational and statistical efficiency across numerous experimental settings.

### Open Source Datasets with Kaggle

This post was written by Vanessa Sochat, a Research Software Engineer for Stanford Research Computing and the Stanford School of Medicine. She is the primary developer at Stanford for Singularity, a driver to bring Research Applications support to Stanford, and lead developer of Singularity Hub and Singularity Registry, both frameworks optimized for deployment of container-based workflows and "science as a service" capabilities. This piece was originally posted on her personal blog here.

Data sharing is hard, but we all know that there is great potential for discovery and reward [1]. A typical “sharing operation” might look like passing around a portable drive, putting compressed archives on some university or cloud server, or bulk storing on a secure university cluster (and living in fear it will be purged). Is this an optimal approach? Is it easy enough to do? To answer this question, let’s think about the journey that one of our datasets might take. It looks like this:

This flow of events is often cyclical, because the generation of data is more of a stream, and the only reason the data stops flowing from steps 1 to 6 is because we decide to stop collecting it. In the most ideal scenario we would have these steps be totally automated. Step 1 might be generation of images at the MRI scanner, step 2 might be automated scripts to convert the initial file format to the format the researcher desires, 3 is moving to private cluster storage, 4 is use by the research group, and then steps 5 and 6 (if they happen at all) are additional work to again process and transfer the data to a shared location.

Typically, we stop at step 4, because that’s where the lab is content, the analysis is done, and papers are written. Ironically, it’s steps 5 and 6 that would open up a potential firehose of discovery. But the unspoken word is that if I share my dataset and you publish first, I lose out. Data is akin to oranges that must be squeezed of all their juice before giving up for others to inspect, so of course I wouldn’t want to do it. But arguably, if sharing the dataset itself could produce a paper (or something similar), and if steps 5 and 6 were easy, we would have a lot more data sharing. This is the topic that I want to discuss today, and while there is no production solution available, I will show how it’s very easy to share your data as a Kaggle Dataset.

## Living Data

I’ve talked about the idea of living data before, and in summary it’s the idea that we can update our understanding of the world, the answer to some interesting question, as new data comes in. It’s the idea that representation of knowledge as a static PDF isn’t good enough because it only represents one point in time. Instead, living data asserts that the knowledge we accumulate to confirm or deny hypotheses is a living and changing thing. In order to make this living and changing thing a reality, it needs to be easy to provide that feed. Right now, sharing data is a manual afterthought of the publication process. Many journals now encourage or require it, and researchers can upload to various platforms some single timepoint of the dataset. While this practice is better than nothing, I don’t think that it is optimal for learning about the world. Instead of a static article we should have a feed of data that goes into an algorithm and pops out a new answer. We would want the data sharing to happen automatically as the data is produced, and available to all who want to study it. This is probably way too lofty a goal for now, but we can imagine something in the middle of the two extremes. How about a simple pipeline to automatically generate and share a dataset? It might look something like this:

Steps 4 through 6 still happen (the researchers doing analyses) but instead of one group coveting the data, it’s available to thousands. The change is that we’ve added a helper, continuous integration, to step 3 to make it easy to process and share the data. We typically think of continuous integration (CI) for use in testing or deployment, but it also could be a valuable tool for data sharing. Let’s just call this idea “continuous data,” because that’s sort of what it is. Once the data is processed and plopped onto storage for the research group, it also might have this continuous data step that packages it up for sharing.
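As a sketch of what that continuous data step could look like, here is a small Python job a CI service might run whenever new data lands in storage. All paths and file names here are made up for illustration:

```python
import tarfile
import time
from pathlib import Path

# Hypothetical locations: where processed data lands (step 3 of the
# pipeline), and where the CI job stages a packaged snapshot for sharing.
data_dir = Path("data_out")
stage_dir = Path("stage")
data_dir.mkdir(exist_ok=True)
stage_dir.mkdir(exist_ok=True)

# Stand-in for a file the lab's processing pipeline produced.
(data_dir / "results.csv").write_text("subject,score\n1,0.9\n")

# Package everything under data_dir into a dated snapshot archive.
snapshot = stage_dir / f"dataset-{time.strftime('%Y%m%d')}.tar.gz"
with tarfile.open(snapshot, "w:gz") as tar:
    tar.add(data_dir, arcname="data")

print(snapshot.name)
```

A real job would replace the stand-in file with whatever the instruments and processing pipeline produced, then hand the archive to an upload client or API and notify downstream consumers.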

## TLDR

We need to incentivize data sharing at the step of storage, and provide support for researchers to do this. Institutions must staff a missing layer of data engineers, and prioritize the development of organizational standards and tooling for this task. In the meantime, small research computing groups can help researchers achieve this goal. Researchers should reach out to get help to share their datasets.

## Kaggle API

While a larger, institutional-level effort would be ideal, in the meantime we can take advantage of open-source, free-to-use resources like Kaggle. I think that Kaggle has potential to do what Github did for early scientific reproducibility. If it’s easy and fun to share datasets, and if there is potential reward, Kaggle can have an impact on scaled discovery and collaboration. But we have to start somewhere! I decided to start with showing that I can use the Kaggle API to upload a dataset. It’s fairly easy to do in the web interface, and it’s also easy to do from the command line. In a nutshell, all we need is a directory with data files and a metadata (json) file that we can point the API client to. For example, here is one of my datasets that I uploaded:

The datapackage.json file just describes the content being uploaded.

So how hard is it to share your datasets for others to use and discover? You download a credential file to authenticate with the service. Then you put files (.tar.gz or .csv) in a folder, create a json file, and point the tool at it. It is so easy, and you could practically do all these things without any extra help. It would be so trivial to plug a script like this into some continuous integration to update a dataset as it is added to storage.
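The whole flow can be sketched in a few lines. The username, dataset slug, and file contents below are hypothetical, and the metadata field names follow the current Kaggle client (which may differ slightly from the datapackage.json naming above):

```python
import json
from pathlib import Path

# 1. Credentials: the client looks for an API token downloaded from your
#    Kaggle account settings page, placed at ~/.kaggle/kaggle.json.

# 2. A dataset is just a folder with data files plus a small metadata file.
folder = Path("my-dataset")
folder.mkdir(exist_ok=True)
(folder / "data.csv").write_text("a,b\n1,2\n")  # stand-in data file

metadata = {
    "title": "My Example Dataset",          # shown on the dataset page
    "id": "myusername/my-example-dataset",  # <username>/<dataset-slug>
    "licenses": [{"name": "CC0-1.0"}],
}
(folder / "dataset-metadata.json").write_text(json.dumps(metadata, indent=2))

# 3. With credentials in place, the upload itself is one shell command:
#      kaggle datasets create -p my-dataset
#    and later versions can be pushed with:
#      kaggle datasets version -p my-dataset -m "updated from storage"
```

That last `version` command is exactly the piece a continuous integration job could run on a schedule.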

## Tools for You!

I put together a Docker container serving a brief example here that I used to interact with the Kaggle API and generate a few datasets. I’ll walk through the basic logic of the scripts here. The kaggle command line client does a good job on its own for many tasks, but as a developer, I wanted much more control over things like specification of metadata and clean creation of files. I also wanted it Dockerized so that I could do a create operation mostly isolated from my host.

### Build the container

The image is provided on Docker Hub but you can always build it on your own:

I didn’t expose the create script as an entrypoint because I wanted the interaction to be an interactive “shell into the container and understand what’s going on.” You can do that like this.

Notice that we are binding our kaggle API credentials to root’s home so they are discovered by the client, and we are also binding some directory with data files (for our dataset upload) by way of specifying volumes (-v): The dataset in question is a Dinosaur Dataset called Zenodo ML, specifically a sample of the data that converts the numpy arrays to actual png images. For those interested, the script I used to reorganize and generate the data subset is provided here. The original rationale for doing this was that I simply couldn’t share the entire dinosaur dataset on Kaggle (too big!). My idea was that sharing a subset would be useful, and those interested could then download the entire dataset. If you are interested, the finished dataset stanfordcompute/code-images is here.

### Create a Dataset

The script create_dataset.py is located in the working directory you shell into, and the usage accepts the arguments you would expect to generate a dataset. You can run the script without arguments to see details,

And for this post, it’s easier to just see an example. I had my data files (.tar.gz files) in /tmp/data/ARCHIVE, so first I prepared a space separated list of fullpaths to them:

and I wanted to upload them to a new dataset called vanessa/code-images. My command would look like this:

The arguments above are the following:

• keywords: comma-separated list of keywords (no spaces!)
• files: full paths to the data files to upload
• title: the title to give the dataset (put in quotes if it has spaces)
• name: the name of the dataset itself (no spaces or special characters; good practice to put in quotes)
• username: your Kaggle username, or the name of an organization that the dataset will belong to

It will generate a temporary directory with a data package:

And add your files to it, for example, here is how my temporary folder was filled:

In retrospect I didn’t need to copy the files here too, but I did this because I don’t typically like to do any kind of operation on raw data (in case something goes awry). The tool will then show you the metadata file (the one we have already shown above) and then start the upload. This can take some time, and it will show a URL when finished!

Hugely important! There is some kind of post processing that happens, and this can take many additional hours (it did for me given the size of my uploads). My dataset did not actually exist at the URL given until the following morning, so you should be patient. Until it’s done you will get a 404. You can go for a run, or call it a day. Since there is a lot of additional metadata and description / helpers needed on your part for the dataset, it’s recommended to go to the URL when it’s available and do things like add an image, description, examples, etc. The upload is done (by default with my tool) as private so that the user can check things over before making public. Is this manual work? For the first upload, yes, but subsequent versions of your dataset don’t necessarily require it. It’s also the case that the tooling is growing and changing rapidly, and you should (likely) expect exciting changes!

## Vision for Reproducible Science

Institutions need to have data as a priority, and help researchers with the burden of managing their own data. A researcher should be able to get support to organize their data, and then make it programmatically accessible. This must go beyond the kind of “archive” that is provided by a traditional library and delve into APIs, notifications, and deployment or analysis triggers. While we don’t have these production systems, it all starts with simple solutions to easily create and share datasets. The vision that I have would have a strong relationship between where the compute happens (our research computing clusters) and where the data is stored (and automatically shared via upload or API). It looks like this:

Notifications can range anywhere from 1) going into a feed to alert another researcher of new data, 2) triggering a CI job to re-upload from storage to a shared location, or 3) triggering a build and deployment of a new version of some container that has the data as a dependency.

### We need data engineers

An institution needs to allocate resources and people that solely help researchers with data. It shouldn’t be the case that a new PI needs to start from scratch, every time, to set up his or her lab to collect, organize, and then process data. The machines that collect data should collect it and send it to a designated location based on a standard format.

### We need collaborative platforms

I believe that there is some future where researchers can collaborate on research together, leading to some kind of publication, with data feeds provided by other researchers, via a collaborative platform. It feels like a sort of “if you build it, they will come” scenario, and the interesting question is “Who will build it?”

Right now, our compute clusters are like the wild west!

Sure, we have local law enforcement to prevent unwanted cowboys from entering the wrong side of the wild desert (file and directory permissions), but it’s largely up to the various groups to decide how to organize their files. As a result, we see the following:


1. We forget where things are
2. We forget what things are
3. Data and scripts used for papers get lost and forgotten
4. Every space looks different

We’ve all been here: we have some new dataset to work with, but we have run out of space, so we email our research computing group to ask why (and can I have more?), then send an email to our lab to “clean up those files!”, and then wind up deleting some set of data that (a few years earlier) we deemed highly important, but it can’t be important anymore because “Ain’t nobody got disk space for that.”

Imagine a new reality where the researchers themselves aren’t totally responsible for the organization, care, and metadata surrounding their data. They get to focus on doing science. They get help from a data engineer to do this, and it’s done with a ridiculous amount of detail and concern for metadata that no normal human would have. The cost of not doing this is insurmountable: wasted time losing and finding things, work that can’t be reproduced, and needless friction getting from point [get data] to point [working with data].

## Remaining Challenges

There are still several challenges that we need to think about.

### Where is the connection to academic publishing?

I’m going to focus on Kaggle because I haven’t found a similar, successful platform for working on datasets together. The feel that I get from Kaggle is one of “Let’s have fun, learn, and do machine learning” or “Let’s compete in this competition for a prize.” I see a graduate student wanting to try a competition in his or her spare time to win something, or to learn and have fun, but not to further his or her research. As I understand it now, Kaggle doesn’t have a niche that the academic researcher fits into. But when I think about it, “competition” isn’t so different from “collaboration” in that many people are working at once to solve a similar problem. Both have questions that the challenge aims to answer, and metric(s) that can be assessed to evaluate the goodness of a solution. The interesting thing here is that Kaggle, like Github, is a relatively unbiased platform that we could choose to use in a different way. Academic researchers could choose to make a “competition” that would actually encompass researchers working together to answer a scientific question. The missing piece is having some additional rules and tooling around giving participants and data providers avenues to publish and get credit for their contributions.

If we want to create incentive to share data and thus drive discovery, we need to address this missing incentive piece. It should be the case that a graduate student or researcher can further his or her career by using a platform like Kaggle. It should be easy, and it should be fun. Let’s imagine if a competition wasn’t a competition at all, but in fact a collaboration. A graduate student would go to his or her PI, say “Hey I found this Kaggle dataset that is trying to solve a similar problem, why don’t I try this method there?” The PI would be OK with that because it would be the same as the student solving the problem independently, but with a leg up having some data preprocessing handled, and others to discuss the challenge with. The graduate student would enter his or her kernel entry to (still) optimize some metric(s), and efforts would be automatically summarized into some kind of publication that parallels a paper. The peer review would be tied into these steps, as the work would be completely open. All those who contributed, from creating the initial dataset, to submission to discussing solutions, would be given credit as having taken part in the publication. If it’s done right, the questions themselves would also be organized, conceptually, so we can start mapping the space of hypotheses.

### How can we start to get a handle on all these methods?

Methods are like containers. Currently, in most papers, they aren’t sufficient to reproduce the work. It would also be hard to derive a complete ontology of methods and their links to functions from text alone (yes, I actually started this as a graduate school project, and long since abandoned it in favor of projects that my committees would deem “meaningful”). But given that we have code, arguably the methods could be automatically derived (and possibly even linked to the documentation sources). Can I imagine a day when the code is so close to the publication that we drastically cut down the time spent on a methods section? Or the day when a methods section actually can reproduce the work because it’s provided in a container? Yep.

### What about sensitive information in data?

Taking care to remove sensitive information goes without saying. This is something that is scary to think about, especially in this day and age when it seems like there is no longer any such thing as privacy. Any data sharing initiative or pipeline must take privacy and (if necessary) protocols for deidentification (and similar) into account.

### Where is the incentive for the institution?

This is an even harder question. How would an institution get incentive to put funding into people and resources just for data? From what I’ve seen, years and years go by and people make committees and talk about things. Maybe this is what needs to happen, but it’s hard to be sitting in Silicon Valley and watch companies skip over the small talk and just get it done. Maybe it’s not perfect the first time, but it’s a lot easier to shape a pot once you have the clay spinning.

## Summary

These are my thoughts on this for now! We don’t have a perfect solution, but we have ways to share our data to allow for others to discover. I have hopes that the team at Kaggle will get a head start on thinking about incentives for researchers, and this will make it easy for software engineers in academia to then help the researchers share their data. These are the steps I would take:

1. Create simple tool / example to share data (this post)
2. Create incentive for sharing of academic datasets (Collaborative, open source publications?)
3. Support a culture for academics to share, and do some test cases
4. Have research software engineers help researchers!

And then tada! We have open source, collaborative sharing of datasets, and publication. Speaking of the last point, if you are a researcher with a cool dataset (even if it’s messy) and you want help to share it, please reach out and I will help you. If you have some ideas or thinking on how we can do a toy example of the above, I hope you reach out too.

Are you interested in a dataset to better understand software? Check out the Code Images Kaggle Dataset that can help to do that. If you use the dataset, here is a reference for it:

### What Makes People the Most Happy

It's in the details of 100,000 moments. I analyzed the crowd-sourced corpus to see what brought the most smiles. Read More

### Magister Dixit

“Tell me, and I will forget. Show me and I may remember. Involve me, and I will understand.” Confucius

### Spark + AI Summit Europe Agenda Announced

London, as a financial center and cosmopolitan city, has its historical charm, cultural draw, and technical allure for everyone, whether you are an artist, entrepreneur or high-tech engineer. As such, we are excited to announce that London is our next stop for Spark + AI Summit Europe, from October 2-4th, 2018, so prepare yourself for the largest Spark + AI community gathering in EMEA!

Today, we announced our agenda for Spark + AI Summit Europe, with over 100 sessions across 11 tracks, including AI Use Cases, Deep Learning Techniques, Productionizing Machine Learning, and Apache Spark Streaming. Sign up before July 27th for early registration and save £300.00.

While we will announce all our exceptional keynotes soon, we are delighted to have these notable technical visionaries as part of the keynotes: Databricks CEO and Co-founder Ali Ghodsi; Matei Zaharia, the original creator of Apache Spark and Databricks chief technologist; Reynold Xin, Databricks co-founder and chief architect; and Soumith Chintala, creator of PyTorch and AI researcher at Facebook.

Along with these visionary keynotes, our agenda features a stellar lineup of community talks led by engineers, data scientists, researchers, entrepreneurs, and machine learning experts from Facebook, Microsoft, Uber, CERN, IBM, Intel, Redhat, Pinterest and, of course, Databricks. There is also a full day of hands-on Apache Spark and Deep Learning training, with courses for both beginners and advanced users, on both AWS and Azure clouds.

All of the above keynotes and sessions will reinforce the idea that Data + AI and Unified Analytics are an integral part of accelerating innovation. For example, early this month, we had our first expanded Spark + AI Summit at Moscone Center in San Francisco, where over 4,000 Spark and Machine Learning enthusiasts attended, representing over 40 countries and regions. The overall theme of Data + AI as a unifying and driving force of innovation resonated in many sessions, including notable keynotes, as Apache Spark forays into new frontiers because of its capability to unify new data workloads and capacity to process data at scale.

With four new tracks, over 180 sessions, Apache Spark and Deep Learning training on both AWS and Azure Cloud, and myriad community-related events, the San Francisco summit was a huge success and a new experience for many attendees! One attendee notes:

We want our European attendees to have a similar experience and gain the same knowledge, so make this your moment, keep calm and come to London in October. With an early bird registration, you can save £300.00.

--

The post Spark + AI Summit Europe Agenda Announced appeared first on Databricks.

### How Sierra Leone is beating tropical diseases

SIERRA LEONE is one of the world’s poorest countries. From 1991 to 2002, it suffered a devastating civil war that claimed 70,000 lives and wrecked the health system. What little remained of it was gutted by an Ebola outbreak in 2014, which killed many of its doctors and nurses.

### College Admissions Will Never Be Fair

I wrote a new Bloomberg View essay about the Harvard admissions kerfuffle:

#### College Admissions Will Never Be Fair

##### If we recognize this, we can build a better system.

My other Bloomberg columns are listed here.

### AI, Machine Learning and Data Science Roundup: June 2018

A monthly roundup of news about Artificial Intelligence, Machine Learning and Data Science. This is an eclectic collection of interesting blog posts, software announcements and data applications I've noted over the past month or so.

## Open Source AI, ML & Data Science News

Intel open-sources NLP Architect, a Python library for deep learning with natural language.

Gym Retro, an open source platform for reinforcement learning research on video games.

Facebook open-sources DensePose, a toolkit to transform 2D images of people into a 3-D surface map of the human body.

MLflow, an open source machine learning platform from Databricks, has been released.

## Industry News

In a 12-minute documentary video and accompanying Wired article, Facebook describes how it uses Machine Learning to improve quality of the News Feed.

In the PYPL language rankings, Python ranks #1 in popularity and R is #7; both are rising.

Google announces its ethical principles for AI applications, and AI applications it will not pursue.

Wolfram Research launches the Wolfram Neural Network Repository, with implementations of around 70 neural net models.

Google Cloud introduces preemptible pricing for GPUs, with discounts compared to GPUs attached to non-preemptible VMs.

Conversational AIs conducting full-duplex conversations: Microsoft XiaoIce and Google Duplex.

### Microsoft News

Microsoft's head of AI research, Harry Shum, on "raising" ethical AI.

Microsoft has acquired Semantic Machines, a startup focused on conversational AI.

Microsoft is developing a bias-detection tool, with the goal of reducing discrimination in applied AI.

Microsoft's Bot Builder SDKv4 further simplifies the process of developing conversational bots.

Cognitive Services Labs, which offers previews of emerging Microsoft Cognitive Services technologies, adds labs for anomaly detection, ink analysis and more.

ML.NET 0.2, a cross-platform open source machine learning framework for .NET developers, has been released.

Microsoft R Open 3.5.0 has been released.

Azure Databricks now provides a machine-learning runtime and GPU support.

## Learning resources

A tutorial on visualizing machine learning models with LIME, a package for R and Python.

A visual introduction to Machine Learning, Part II: Model Tuning and the Bias-Variance Tradeoff, with an in-depth and graphically elegant look at decision trees.

Materials for five new AI-oriented courses have been published to the LearnAI Materials site.

A Developer's Guide to Building AI Applications: a free e-book from O'Reilly and Microsoft.

Microsoft Professional Program for Artificial Intelligence, a free, self-paced on-line certification in AI skills.

The Azure AI Lab provides complete worked applications for generative image synthesis, cognitive search, automated drone flight, artistic style transfer and machine reading comprehension.

CVAE-GAN: a new generative algorithm for synthesizing novel realistic images, like faces.

Land cover mapping with aerial images, using deep learning and FPGAs.

Find previous editions of the monthly AI roundup here

### What is it like to be a machine learning engineer in 2018?

A personal account as to why 2018 is going to be a fun year for machine learning engineers.

BigML Associations can help identify which pairs (or groups) of items occur together more frequently than expected. A typical use case for association rule discovery is market basket analysis, where the goal is to find the products that are usually purchased together by customers. BigML’s Associations is able to output such interesting associations from your […]
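The “more frequently than expected” part is typically quantified with measures like support and lift. As a toy illustration (with invented baskets, not BigML’s API):

```python
# Toy market-basket data: each basket is the set of items in one purchase.
baskets = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"eggs"},
]
n = len(baskets)

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / n

# Lift compares how often items co-occur against what independence would
# predict; lift > 1 means "together more often than expected".
lift = support({"bread", "butter"}) / (support({"bread"}) * support({"butter"}))
print(round(lift, 2))  # → 1.67
```

Association rule miners search the space of such item combinations efficiently rather than enumerating every pair, but the quantity being ranked is the same.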

### History of the word ‘data’

Sandra Rendgen describes the history of “data” the word and where it stands in present day.

All through the evolution of statistics through the 19th century, data was generated by humans, and the scientific methodology of measuring and recording data had been a constant topic of debate. This is not trivial, as the question of how data is generated also answers the question of whether and how it is capable of delivering a “true” (or at least “approximated”) representation of reality. The notion that data begins to exist when it is recorded by the machine completely obscures the role that human decisions play in its creation. Who decided which data to record, who programmed the cookie, who built the sensor? And more broadly – what is the specific relationship of any digital data set to reality?

Oh, so there’s more to it than just singular versus plural. Imagine that.


### An Intuitive Introduction to Gradient Descent

This post provides a good introduction to Gradient Descent, covering the intuition, variants and choosing the learning rate.
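As a minimal companion to that intuition, here is plain gradient descent on a one-dimensional quadratic; the learning rate is exactly the knob such posts discuss (too large diverges, too small crawls):

```python
# Minimize f(x) = (x - 3)^2 with gradient descent.
# The gradient is f'(x) = 2 * (x - 3).
def grad(x):
    return 2 * (x - 3)

x = 0.0    # starting point
lr = 0.1   # learning rate

for _ in range(100):
    x -= lr * grad(x)  # step against the gradient

print(round(x, 4))  # → 3.0, the minimum
```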

## Introduction

The idea of grouping data science tasks has been bouncing around in my head for the past six or so months, and recently I came across a paper by Miguel Hernán¹ that summarised the ideas nicely. In fact, the metaphorical "pillars" idea came from his paper. This blog article is a combination of his ideas and my own. With that, I present what I (and he) consider the three pillars of data science.

## Description

Description is what I feel the majority of data scientists in industry focus on. This could be something as simple as a mean, a conversion rate, a time series, etc., or something more advanced like a statistical test. Description also includes data mining algorithms like clustering and association rules.

This is a well explored area, thanks to the 20th century focus on descriptive statistics and methods. This was the result of a few simultaneous conditions:

1. Due to the lack of computers and often inability to share datasets, statisticians and scientists needed ways to summarise datasets.
2. Hand in hand with advances in healthcare and agriculture came ways to understand the effectiveness of these advancements. With not only finite but small sample sizes, and lives on the line, it was of paramount importance to measure the variance and potential errors.
3. Towards the end of the century, when computers became more available, new methods became available too: GLMs, bootstrapping, data mining.

While this is what I feel is the most common pillar in industry, it's not talked about much in the community. I wonder if this is because of the classical feel of these techniques? Or perhaps, and this is what I would prefer but do not believe, they are so well studied and understood that we need not talk about them. (Why do I not believe this? Take a look at Andrew Gelman's terrific blog almost any time he posts an article. It's always full of mistakes other researchers make, let alone people in industry.) No, most of the community's attention is given to the next pillar: prediction.

## Prediction

We spend most of our time thinking, implementing, watching and talking about prediction. Many times a day, a new ML prediction algorithm or article will surface on Hacker News. Contrast where we are today with the previous century: we have lots of compute power, available datasets and accessible technology for prediction. Prediction, in some cases, has been wildly profitable for companies. (Not to say description hasn't been either. Pharma companies rely on descriptive methods to validate their work. Banks and insurance companies, too, rely on descriptive methods to guide their business investments; no other technique could help.) I say some cases, because I do see a lot of excessive application of prediction, often where description (or the third pillar) would be sufficient.

The next pillar is the least talked about. Surprisingly too, considering the importance of it.

## Causal Inference

Sometimes I feel ashamed to admit how ignorant I really am. To be transparent, I didn't learn much causal inference (or care) until about nine months ago. And even then, it was just fumbling around in a dark room of ideas. Slowly, however, I picked up the pieces and learned what to learn. So why should a data scientist know about causal inference?

From my experience, a data scientist gets asked questions that can only be answered with a causal inference lens. Questions like, to use my company Shopify as an example:

1. What is the impact on sales of merchants who adopt our point-of-sale system?
2. What is causing merchants to leave our platform?
3. What is the effect of server latency on checkouts?

These questions, because of the complex nature of the system, cannot be answered with traditional summary statistics. Confounders abound, and populations are biased. We require new tools and ideas that come from causal inference.

Randomised control trials (RCTs) are very common in tech companies. This classical technique is actually generalised in causal inference frameworks. Beyond this, as a data scientist working in industry, you'll have a new perspective on what your observational data can (and cannot!) deliver.

What I find worrying is that, as a community, we are not talking about causal inference enough. This could cause serious harm. If practitioners in industry aren't aware of techniques from causal inference, they will use the wrong tool and could come to the wrong conclusion. For example, both Simpson's Paradox and Lord's Paradox, two very common phenomena, are easily explained in a causal framework, but without this framework, people can make disastrous and harmful conclusions. For example, in the low birthweight paradox (see paper below), without causal inference there is the potential to conclude that smoking helps low birthweight infants! Furthermore, I rarely see school programs or bootcamps in data science teaching causal inference. It's almost exclusively prediction and technology!
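Simpson's Paradox is easy to reproduce numerically. With invented counts (severity is the confounder that decides who gets which treatment), a treatment can win in every subgroup yet lose in the pooled table:

```python
# Invented (successes, trials) counts for treatments A and B, stratified
# by case severity. Severity confounds: A is given mostly to severe cases.
strata = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

def rate(stratum, treatment):
    ok, n = strata[stratum][treatment]
    return ok / n

# Within every stratum, A has the higher success rate...
assert rate("mild", "A") > rate("mild", "B")
assert rate("severe", "A") > rate("severe", "B")

# ...but pooled over strata, A looks worse.
def pooled(treatment):
    ok = sum(strata[s][treatment][0] for s in strata)
    n = sum(strata[s][treatment][1] for s in strata)
    return ok / n

print(round(pooled("A"), 3), round(pooled("B"), 3))  # → 0.78 0.826
```

A causal framework tells you when to report the stratified rates (adjust for the confounder) rather than the pooled ones, which is precisely the judgment a purely descriptive summary cannot make.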

So, rather than describe the slow and meandering way I discovered causal inference, I'll post some links below to give you a head start (probably read these in order):

There's so much more, but at least this gives you a place to learn what to learn.

## Conclusion

This article will likely evolve over time, as I discover more and more about the boundaries of these three pillars. But I hope the reader recognises a pillar they are lacking in (for me it was causal inference). Based on this construction, I do expect there to be huge potential for practitioners of causal inference (econometricians and epidemiologists come to mind) in teaching data science.

### Translating music to predict a musician’s body movements

When pianists play a musical piece on a piano, their body reacts to the music. Their fingers strike piano keys to create music. They move their arms to play on different octaves. Violin players draw the bow with one hand across the strings and touch lightly or pluck the strings with the other hand’s fingers. Faster bowing produces a faster music pace.

In pursuit of the long-term goal of using augmented and artificial intelligence to help teach people how to play musical instruments, this research investigated whether the correlation between music signals and finger movements can be predicted computationally. We show that indeed it can. To our knowledge, this is the first time such an idea has been tested.

Our goal was to create an animation of an avatar that moves its hands in the way a pianist or violinist would do, just by hearing the audio. Our research introduces a method that inputs violin or piano music, and outputs a video of skeleton predictions that are further used to animate an avatar, and we successfully demonstrate that natural body dynamics can be predicted. This research was presented in our paper Audio to Body Dynamics at the 2018 Conference on Computer Vision and Pattern Recognition (CVPR) conference.

## Research challenges

Predicting body movement from a music signal is a highly challenging computational problem. To tackle it we needed a good training set of videos, we needed to be able to accurately predict body poses in those videos, and our algorithm needed to be able to find the correlation between music and body.

There is no available training data for such a purpose. Traditionally, state-of-the-art prediction of natural body movement from video sequences (not audio) used motion capture sequences created in a lab. To replicate a traditional approach, we would need to bring a pianist to a laboratory and have them play several hours with sensors attached to their fingers and body joints. This is hard to execute and not easily generalizable.

Instead, we leveraged publicly available videos of highly skilled musicians playing online, which could also potentially allow a higher degree of diversity in data. We collected 3.6 hours of violin and 4.4 hours of piano recital “in the wild” videos from the Internet and processed the videos by detecting the upper body and fingers in each frame of each video.

We then built a long short-term memory (LSTM) neural network that learns the correlation between audio features and body skeleton landmarks. Predicted points were applied onto a rigged avatar to create the animation, with the final output an avatar that moves according to the audio input.

The output skeletons are promising, and produce interesting body dynamics. To best experience our results, watch the videos with audio turned on.

## Potential applications

The research was inspired by a system we had created at the University of Washington that can find the correlation between a person’s speech and how the lips move. Our hypothesis that body gestures can be predicted from audio signals shows promising initial results. We believe the correlation between audio and human body movement has the potential for a variety of applications in VR/AR and recognition.

One potential application is to use AR to teach people how to play musical instruments. People could potentially learn from the best pianists in the world because we’re using professional pianists for training videos. When the experience is shown in AR, a person can walk around the avatar in 3D and zoom in to the fingers to see what movements are being made. It is exciting to show how AI can help people create music by grasping which movements make great performances from real-world examples.

This work has shown the potential AR has to change the way we learn new capabilities. We are excited to show the beginning of the potential capabilities for music.

### Detecting Sarcasm with Deep Convolutional Neural Networks

Sarcasm detection is important in areas such as affective computing and sentiment analysis because sarcastic expressions can flip the polarity of a sentence.

### Avoiding a Data Science Hype Bubble

In this post, Josh Poduska, Chief Data Scientist at Domino Data Lab, advocates for a common taxonomy of terms within the data science industry. The proposed definitions enable data science professionals to cut through the hype and increase the speed of data science innovation.

# Introduction

The noise around AI, data science, machine learning, and deep learning is reaching a fever pitch. As this noise has grown, our industry has experienced a divergence in what people mean when they say “AI”, “machine learning”, or “data science”. It can be argued that our industry lacks a common taxonomy. If there is a taxonomy, then we, as data science professionals, have not done a very good job of adhering to it. This has consequences. Two consequences include the creation of a hype-bubble that leads to unrealistic expectations and an increasing inability to communicate, especially with non-data science colleagues. In this post, I’ll cover concise definitions and then argue how it is vital to our industry that we be consistent with how we define terms like “AI”.

## Concise Definitions

• Data Science: A discipline that uses code and data to build models that are put into production to generate predictions and explanations.
• Machine Learning: A class of algorithms or techniques for automatically capturing complex data patterns in the form of a model.
• Deep Learning: A class of machine learning algorithms that uses neural networks with more than one hidden layer.
• AI: A category of systems that operate in a way that is comparable to humans in the degree of autonomy and scope.
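Taking the deep learning definition literally, "deep" only requires more than one hidden layer. A minimal sketch in plain Python, with made-up, untrained weights purely to illustrate the structure:

```python
def dense(vec, weights, bias):
    """Fully connected layer: one row of weights per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, x) for x in vec]

def deep_net(x):
    # Two hidden layers, so this qualifies as "deep" under the definition above.
    h1 = relu(dense(x, [[0.5, -0.2], [0.1, 0.8]], [0.0, 0.1]))
    h2 = relu(dense(h1, [[0.3, 0.4], [-0.6, 0.2]], [0.1, 0.0]))
    return dense(h2, [[1.0, -1.0]], [0.0])[0]  # single output unit

score = deep_net([1.0, 2.0])
```

A network with a single hidden layer would be machine learning but not, by this definition, deep learning.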

# Hype

Our terms have a lot of star power. They inspire people to dream and imagine a better world which leads to their overuse. More buzz around our industry raises the tide that lifts all boats, right? Sure, we all hope the tide will continue to rise. But, we should work for a sustainable rise and avoid a hype bubble that will create widespread disillusionment if it bursts.

I recently attended Domino’s rev conference, a summit for data science leaders and practitioners. I heard multiple leaders seeking advice on how to help executives, mid-level managers, and even new data scientists have proper expectations of data science projects without sacrificing enthusiasm for data science. Unrealistic expectations slow down progress by deflating the enthusiasm when projects yield less than utopian results. They also make it harder than it should be to agree on project success metrics and ROI goals.

The frequent overuse of “AI” when referring to any solution that makes any kind of prediction has been a major cause of this hype. As a result, people instinctively associate data science projects with near-perfect, human-like autonomous solutions. Or, at a minimum, people perceive that data science can easily solve their specific predictive need, without any regard to whether their organizational data will support such a model.

# Communication

Incorrect use of terms also gums up conversations. This can be especially damaging in the early planning phases of a data science project when a cross-functional team assembles to articulate goals and design the end solution. I know a data science manager who requires his team of data scientists to be literally locked in a room for an hour with business leaders before he will approve any new data science project. Okay, the door is not literally locked, but it is shut, and he does require them to discuss the project for a full hour. They’ve seen a reduction in project rework as they’ve focused on early alignment with business stakeholders. The challenge of explaining data science concepts is hard enough as it is. We only make this harder when we can’t define our own terms.

I’ve been practicing data science for a long time now. I’ve worked with hundreds of analytical leaders and practitioners from all over the world. Since AI and deep learning came on the scene, I’ve increasingly had to pause conversations and ask questions to discover what people really mean when they use certain terms. For example, how would you interpret these statements which are based on conversations I’ve had?

• “Our goal is to make our solution AI-driven within 5 years.”
• “We need to get better at machine learning before we invest in deep learning.”
• “We use AI to predict fraud so our customers can spend with confidence.”
• “Our study found that organizations investing in AI realize a 10% revenue boost.”

Confusing, right?

One has to ask a series of questions to be able to understand what is really going on.

The most common term-confusion I hear is when someone talks about AI solutions, or doing AI, when they really should be talking about building a deep learning or machine learning model. It seems that far too often the interchange of terms is on purpose, with the speaker hoping to get a hype-boost by saying “AI”. Let’s dive into each of the definitions and see if we can come to an agreement on a taxonomy.

# Data Science

First of all, I view data science as a scientific discipline, like any other scientific discipline. Take biology, for example. Biology encompasses a set of ideas, theories, methods, and tools. Experimentation is common. The biological research community is continually adding to the discipline’s knowledge base. Data science is no different. Practitioners do data science. Researchers advance the field with new theory, concepts, and tools.

The practice of data science involves marrying code (usually some statistical programming language) with data to build models. This includes the important and dominant initial steps of data acquisition, cleansing, and preparation. Data science models usually make predictions (e.g., predict loan risk, predict disease diagnosis, predict how to respond to a chat, predict what objects are in an image). Data science models can also explain or describe the world for us (e.g., which combinations of factors are most influential in making a disease diagnosis, or which customers are most similar to each other and how). Finally, these models are put into production to make predictions and explanations when applied to new data. Data science is a discipline that uses code and data to build models that are put into production to generate predictions and explanations.

It can be difficult to craft a definition for data science while, at the same time, distinguishing it from statistical analysis. I came to the data science profession via educational training in math and statistics as well as professional experience as a statistician. Like many of you, I was doing data science before it was a thing.

Statistical analysis is based on samples, controlled experiments, probabilities, and distributions. It usually answers questions about the likelihood of events or the validity of statements. It uses algorithms such as the t-test, chi-square, ANOVA, DOE, and response surface designs. These algorithms sometimes build models too. For example, response surface designs are techniques to estimate the polynomial model of a physical system based on observed explanatory factors and how they relate to the response factor.

One key point in my definition is that data science models are applied to new data to make future predictions and descriptions, or “put into production”. While it is true that response surface models can be used on new data to predict a response, it is usually a hypothetical prediction about what might happen if the inputs were changed. The engineers then change the inputs and observe the responses that are generated by the physical system in its new state. The response surface model is not put into production. It does not take new input settings by the thousands, over time, in batches or streams, and predict responses.

My data science definition is by no means fool-proof, but I believe putting predictive and descriptive models into production starts to capture the essence of data science.

# Machine Learning

Machine learning as a term goes back to the 1950s. Today, it is viewed by data scientists as a set of techniques that are used within data science. It is a toolset or a class of techniques for building the models mentioned above. Instead of a human explicitly articulating the logic for a model, machine learning enables computers to generate (or learn) models on their own. This is done by processing an initial set of data, discovering complex hidden patterns in that data, and capturing those patterns in a model so they can be applied later to new data in order to make predictions or explanations. The magic behind this process of automatically discovering patterns lies in the algorithms. Algorithms are the workhorses of machine learning. Common machine learning algorithms include the various neural network approaches, clustering techniques, gradient boosting machines, random forests, and many more. If data science is a discipline like biology, then machine learning is like microscopy or genetic engineering. It is a class of tools and techniques with which the discipline is practiced.
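The contrast between explicitly articulated logic and a model the computer generates from data can be made concrete with a deliberately tiny learner. This is an illustrative sketch, not one of the algorithms named above: it "learns" a single decision threshold from labeled examples instead of having a human write the rule:

```python
def learn_threshold(examples):
    """Learn a one-number 'model' (a decision threshold) from labeled data.
    examples: list of (value, label) pairs with labels 0/1."""
    values = sorted(v for v, _ in examples)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        # Accuracy of the rule "predict 1 when value > t" on the training data.
        acc = sum((v > t) == bool(y) for v, y in examples) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# The pattern (small values -> class 0, large values -> class 1) is discovered
# from the data, not written into the code by a human.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
threshold = learn_threshold(data)
predict = lambda v: int(v > threshold)
```

Real machine learning algorithms capture far richer patterns, but the shape is the same: data in, model out, and the model is then applied to new inputs.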

# Deep Learning

Deep learning is the easiest of these terms to define. Deep learning is a class of machine learning algorithms that uses neural networks with more than one hidden layer. Neural networks themselves date back to the 1950s. Deep learning algorithms first became popular in the 1980s, went through a lull in the 1990s and 2000s, and have seen a revival in our decade thanks to relatively small tweaks in the way deep networks are constructed that proved to have astonishing effects. Deep learning can be applied to a variety of use cases including image recognition, chat assistants, and recommender systems. For example, Google Speech, Google Photos, and Google Search are some of the original solutions built using deep learning.

# AI

AI has been around for a long time, since well before the recent hype storm co-opted it as a buzzword. How do we, as data scientists, define it? When and how should we use it? What is AI to us? Honestly, I’m not sure anyone really knows. This might be our “emperor has no clothes” moment. We have the ambiguity and the resulting hype that comes from the promise of something new and unknown. The CEO of a well-known data science company was recently talking with our team at Domino when he mentioned “AI”. He immediately caught himself and said, “I know that doesn’t really mean anything. I just had to start using it because everyone is talking about it. I resisted for a long time but finally gave in.”

That said, I’ll take a stab at it: AI is a category of systems that people hope to create which have the defining characteristic that they will be comparable to humans in the degree of autonomy and scope of operation.

To extend our analogy, if data science is like biology and machine learning is like genetic engineering, then AI is like disease resistance. It’s the end result, a set of solutions or systems that we are striving to create through the application of machine learning (often deep learning) and other techniques.

Here’s the bottom line. I believe that we need to draw a distinction between techniques that are part of AI solutions, AI-like solutions, and true AI solutions. This includes AI building blocks, solutions with AI-ish qualities, and solutions that approach human autonomy and scope. These are three separate things. People just say “AI” for all three far too often.

For example,

• Deep learning is not AI. It is a technique that can be used as part of an AI solution.
• Most data science projects are not AI solutions. A customer churn model is not an AI solution, no matter if it used deep learning or logistic regression.
• A self-driving car is an AI solution. It is a solution that operates with complexity and autonomy that approaches what humans are capable of doing.

Remember those cryptic statements from above? In each case I asked questions to figure out exactly what was going on under the hood. Here is what I found.

• An executive said: “Our goal is to make our solution AI-driven within 5 years.”
The executive meant: “We want to have a couple machine learning models in production within 5 years.”
• A manager said: “We need to get better at machine learning before we invest in deep learning.”
The manager meant: “We need to train our analysts in basic data science principles before we are ready to try deep learning approaches.”
• A marketer said: “We use AI to predict fraud so our customers can spend with confidence.”
The marketer meant: “Our fraud score is based on a logistic regression model that has been working well for years.”
• An industry analyst said: “Our study found that organizations investing in AI realize a 10% revenue boost.”
The industry analyst meant: “Organizations that have any kind of predictive model in production realize a 10% revenue boost.”

Whether you 100% agree with my definitions or not, I think we can all agree that there is too much hype in our industry today, especially around AI. Each of us has seen how this hype limits real progress. I argue that a lot of the hype is from misuse of the terms of data science. My ask is that, as data science professionals, we try harder to be conscious of how we use these key terms, and that we politely help others who work with us learn to use these terms in the right way. I believe that the quicker we can iterate to an agreed-upon taxonomy and insist on adherence to it, the quicker we can cut through hype and increase our speed of innovation as we build the solutions of today and tomorrow.

The post Avoiding a Data Science Hype Bubble appeared first on Data Science Blog by Domino.

### Answering the question, What predictors are more important?, going beyond p-value thresholding and ranking

Daniel Kapitan writes:

We are in the process of writing a paper on the outcome of cataract surgery. A (very rough!) draft can be found here, to provide you with some context:  https://www.overleaf.com/read/wvnwzjmrffmw.

Using standard classification methods (Python sklearn, with synthetic oversampling to address the class imbalance), we are able to predict a poor outcome with sufficient sensitivity (> 60%) and specificity (>95%) to be of practical use at our clinics as a clinical decision support tool. As we are writing up our findings and methodology, we have an interesting debate on how to interpret what the most relevant features (i.e. patient characteristics) are.

My colleagues, who are trained as epidemiologists/doctors, have been taught to do standard univariate testing, using a p-value threshold to identify statistically significant features.

Those of us who come from machine learning (including myself) are more inclined to just feed all the data into an algorithm (we’re comparing logistic regression and random forest), and then evaluate feature importance a posteriori.

The results from the two approaches are substantially different. Comparing the first approach (using sklearn SelectKBest) and the second (using sklearn Random Forest), for example, the variable ‘age’ ends up somewhere halfway down the ranking (p-value 0.005 with F_classif) vs. in the top 6 (feature importance from random forest).
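The two rankings being compared can be reproduced with sklearn in a few lines. The data here are synthetic and the parameters illustrative, not the cataract study's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the clinical data.
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)

# Approach 1: univariate F-tests; rank features by ascending p-value.
selector = SelectKBest(f_classif, k="all").fit(X, y)
univariate_rank = np.argsort(selector.pvalues_)

# Approach 2: random forest; rank features by descending impurity importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
forest_rank = np.argsort(forest.feature_importances_)[::-1]
```

The univariate ranking scores each feature in isolation, while the forest's importances account for interactions, which is one reason the two orderings can disagree.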

As a regular reader of your blog, I am aware of the ongoing debate regarding p-values, reproducible science etc. Although I get the gist of it, my understanding of statistics is too limited to convincingly argue for or against the two approaches. Googling the subject, I come across some (partial) answers:

https://stats.stackexchange.com/questions/291210/is-it-wrong-to-choose-features-based-on-p-value.

I would appreciate it if you could provide some feedback and/or suggestions on how to address this question. It will help us gain confidence in applying machine learning in day-to-day clinical practice.

First, I think it would help to define what you mean by “most relevant features” in a predictive model. That is, before deciding on your procedure to estimate relevance, to declare based on the data what are the most relevant features, first figure out how you would define relevance. As Rubin puts it: What would you do if you had all the data?

I don’t mind looking at classification error etc., but I think it’s hard to make any progress at all here without some idea of your goals.

Why do you want to evaluate the importance of predictors in your model?

You might have a ready answer to this question, and that’s fine—it’s not supposed to be a trick. Once we better understand the goals, it might be easier to move to questions of estimation and inference.

Kapitan replied:

My aim of understanding the importance of predictors is to support clinical reasoning. Ideally, the results of the predictor should be ‘understandable’ such that the surgeon can explain why a patient is classified as a high risk patient. I.e. I would like to combine clinical reasoning (inference, as evidenced in ‘classical’ clinical studies) with the observed patterns (correlation). Perhaps this is a tall order, but I think worth trying. This is one of the reasons why I prefer using tree-based algorithms (rather than neural networks), because it is less of a black box.

To give a specific example: patients with multiple ocular co-morbidities are expected to have a high risk of poor outcome. Various clinical studies have tried to ‘prove’ this, but never in relation to patterns (i.e. feature importance) obtained from machine learning. Now, the current model tells us that co-morbidities are not that important (relative to the other features).

Another example: laterality ends up as the second most important feature in the random forest model. Looking at the data, it may be the case that left eyes have a higher risk of poor outcome. Talking to doctors, this could be explained by the fact that, since most doctors are right-handed, operating on a left eye is slightly more complex. But looking at the data naively (histograms on subpopulations), the difference does not seem significant. Laterality ends up in the bottom range with univariate testing.

I understand that the underlying statistics are different (linear vs non-linear) and intuitively I tend to ‘believe’ the results from random forest more. What I’m looking for is sound arguments and reasoning if and why this is indeed the case.

To start with, you should forget about statistical significance and start thinking about uncertainty. For example, if your estimated coefficient is 200 with a standard error of 300, and on a scale where 200 is a big effect, then all you can say is that you’re uncertain: maybe it’s a good predictor in the population, maybe not.
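The arithmetic behind that example: an approximate 95% interval of estimate ± 1.96 standard errors spans zero, so even the sign of the effect is uncertain.

```python
estimate, se = 200.0, 300.0    # the coefficient and standard error above
z = 1.96                       # approximate 95% normal quantile
low, high = estimate - z * se, estimate + z * se
spans_zero = low < 0.0 < high  # the data are consistent with no effect at all
```

The interval runs from roughly -388 to 788: the point estimate looks big, but the data are also consistent with a near-zero or negative coefficient.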

Next, try to answer questions as directly as possible. For example, “patients with multiple ocular co-morbidities are expected to have high risk of poor outcome.” To start with, look at the data. Look at the average outcome as a function of the number of ocular co-morbidities. It should be possible to look at this directly. Here’s another example: “it may be the case that left-eyes have a higher risk of poor outcome.” Can you look at this directly? A statement such as “Laterality ends up in the bottom range with univariate testing,” does not seem interesting to me; it’s an indirect question framed in statistical terms (“the bottom range,” “univariate testing”), and I think it’s better to try to ask the question more directly.
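The direct look suggested here is just a grouped mean. A sketch with hypothetical records (real data would come from the study itself):

```python
from collections import defaultdict

def mean_outcome_by(records, key):
    """Average outcome at each level of `key` (e.g. number of co-morbidities)."""
    totals = defaultdict(lambda: [0.0, 0])
    for rec in records:
        t = totals[rec[key]]
        t[0] += rec["poor_outcome"]
        t[1] += 1
    return {k: s / n for k, (s, n) in sorted(totals.items())}

# Hypothetical patients: 1 = poor outcome, 0 = good outcome.
patients = [
    {"comorbidities": 0, "poor_outcome": 0},
    {"comorbidities": 0, "poor_outcome": 0},
    {"comorbidities": 1, "poor_outcome": 0},
    {"comorbidities": 1, "poor_outcome": 1},
    {"comorbidities": 2, "poor_outcome": 1},
    {"comorbidities": 2, "poor_outcome": 1},
]
rates = mean_outcome_by(patients, "comorbidities")
```

A table or plot of these per-group rates answers the clinical question directly, without routing it through a feature-importance score.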

Another tip is that different questions can require different analyses. Instead of fitting one model and trying to tell a story with each coefficient, list your questions one at a time and try to answer each one using the data. Kinda like Bill James: he didn’t throw all his baseball data into a single analysis and then sit there reading off conclusions; no, he looked at his questions one at a time.

### Deep Learning Best Practices – Weight Initialization

In this blog I am going to talk about the issues related to initialization of weight matrices and ways to mitigate them. Before that, let’s just cover some basics and notations that we will be using going forward.

### A Comparative Review of the BlueSky Statistics GUI for R

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

### Introduction

BlueSky Statistics’ desktop version is a free and open source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses. A commercial version is also available, which includes technical support and a version for Windows terminal servers such as Remote Desktop or Citrix. Mac, Linux, or tablet users could run it via a terminal server.

This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

### Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

### Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others install in multiple steps, such as the R Commander (two steps) and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

The main BlueSky installation is easily performed in a single step. The installer provides its own embedded copy of R, simplifying the installation and ensuring complete compatibility between BlueSky and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have BlueSky control any version of R you choose, but if the version differs too much, you may run into occasional problems.

### Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

BlueSky is a fairly new open source project, and at the moment all the add-on modules are provided by the company. However, BlueSky’s capabilities approach the comprehensiveness of R Commander, which currently has the most add-ons available. The BlueSky developers are working to create an Internet repository for module distribution.

### Startup

Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer not to write code. Others, such as the R Commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start BlueSky directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

### Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

BlueSky starts up by showing you its main Application screen (Figure 1) and prompts you to enter data with an empty spreadsheet-style data editor. You can start entering data immediately, though at first, the variables are simply named var1, var2…. You might think you can rename them by clicking on their names, but such changes are done in a different manner, one that will be very familiar to SPSS users. There are two tabs at the bottom left of the data editor screen, which are labeled “Data” and “Variables.” The “Data” tab is shown by default, but clicking on the “Variables” tab takes you to a screen (Figure 2) which displays the metadata: variable names, labels, types, classes, values, and measurement scale.

Figure 1. The main BlueSky Application screen.

The big advantage that SPSS offers is that you can change the settings of many variables at once. So if you had, say, 20 variables for which you needed to set the same factor labels (e.g. 1=strongly disagree…5=Strongly Agree) you could do it once and then paste them into the other 19 with just a click or two. Unfortunately, that’s not yet fully implemented in BlueSky. Some of the metadata fields can be edited directly. For the rest, you must instead follow the directions at the top of that screen and right click on each variable, one at a time, to make the changes. Complete copy and paste of metadata is planned for a future version.

Figure 2. The Variables screen in the data editor. The “Variables” tab in the lower left is selected, letting us see the metadata for the same variables as shown in Figure 1.

You can enter numeric or character data in the editor right after starting BlueSky. The first time you enter character data, it will offer to convert the variable from numeric to character and wait for you to approve the change. This is very helpful as it’s all too easy to type the letter “O” when meaning to type a zero “0”, or the letter “I” instead of number one “1”.

To add rows, the Data tab is clearly labeled, “Click here to add a new row”. It would be much faster if the Enter key did that automatically.

To add variables you have to go to the Variables tab and right-click on the row of any variable (variable names are in rows on that screen), then choose “Insert new variable at end.”

To enter factor data, it’s best to leave it numeric, such as 1 or 2 for male and female, then set the labels (which are called values in SPSS terminology) afterwards. The reason for this is that once labels are set, you must enter them from drop-down menus. While that ensures no invalid values are entered, it slows down data entry. The developers’ future plans include automatic display of labels upon entry of numeric values.

If you instead decide to make the variable a factor before entering numeric data, it’s best to enter the numbers as labels as well. It’s an oddity of R that factors are numeric inside, while displaying labels that may or may not be the same as the numbers they represent.

To enter dates, enter them as character data and use the “Data> Compute” menu to convert the character data to a date. When I reported this problem to the developers, they said they would add this to the “Variables” metadata tab so you could set it to be a date variable before entering the data.

If you have another data set to enter, you can start the process again by clicking “File> New”, and a new editor window will appear in a new tab. You can switch data sets simply by clicking on a tab, and that data set’s window will pop to the front for you to see. When doing analyses, or saving data, the data set that’s displayed in the editor is the one that will be used. That approach feels very natural; what you see is what you get.

Saving the data is done with the standard “File > Save As” menu. You must save each one to its own file. While R allows multiple data sets (and other objects such as models) to be saved to a single file, BlueSky does not. Its developers chose to simplify what their users have to learn by limiting each file to a single data set. That is a useful simplification for GUI users. If a more advanced R user sends a compound file containing many objects, BlueSky will detect it and offer to open one data set (data frame) at a time.

Figure 3. Output window showing standard journal-style tables. Syntax editor has been opened and is shown on right side.

### Data Import

The open source version of BlueSky supports the following file formats, all located under “File> Open”:

• Comma Separated Values (.csv)
• Plain text files (.txt)
• Excel (old and new xls file types)
• Dbase’s DBF
• SPSS (.sav)
• SAS binary files (sas7bdat)
• Standard R workspace files (RData) with individual data frame selection

The SQL database formats are found under the “File> Import Data” menu. The supported formats include:

• Microsoft Access
• Microsoft SQL Server
• MySQL
• PostgreSQL
• SQLite

### Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard handle only a few of these functions. Others, such as the R Commander, can handle many, but not all, of them.
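One of those tasks, reshaping from wide to long, can be sketched in a few lines. Tools like BlueSky's Reshape menus (or pandas' melt for Python users) do this for real datasets; the records here are made up:

```python
def wide_to_long(rows, id_col, value_cols):
    """Turn each wide record into one long row per (id, variable) pair."""
    return [
        {id_col: row[id_col], "variable": col, "value": row[col]}
        for row in rows
        for col in value_cols
    ]

# Two measurements per patient stored in separate columns (wide format).
wide = [{"patient": "A", "pre": 120, "post": 110},
        {"patient": "B", "pre": 135, "post": 128}]
long_rows = wide_to_long(wide, "patient", ["pre", "post"])
```

The long format has one row per measurement, which is what many plotting and modeling tools expect; the wide-to-long and long-to-wide operations are inverses.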

BlueSky offers one of the most comprehensive sets of data management tools of any R GUI. The “Data” menu offers the following set of tools. Not shown is an extensive set of character and date/time functions which appear under “Compute.”

1. Missing Values
2. Compute
3. Bin Numeric Variables
4. Recode (able to recode many at once)
5. Make Factor Variable (able to convert many at once)
6. Transpose
7. Transform (able to transform many at once)
8. Sample Dataset
9. Delete Variables
10. Standardize Variables (able to standardize many at once)
11. Aggregate (outputs results to a new dataset)
12. Aggregate (outputs results to a printed table)
13. Subset (outputs to a new dataset)
14. Subset (outputs results to a printed table)
15. Merge Datasets
16. Sort (outputs results to a new dataset)
17. Sort (outputs results to a printed table)
18. Refresh Grid
19. Concatenate Multiple Variables (handling missing values)
20. Legacy (does same things but using base R code)
21. Reshape (long to wide)
22. Reshape (wide to long)

Continued here…

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Managing risk in machine learning models

The O’Reilly Data Show Podcast: Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain.

In this episode of the Data Show, I spoke with Andrew Burt, chief privacy officer at Immuta, and Steven Touw, co-founder and CTO of Immuta. Burt recently co-authored an upcoming white paper on managing risk in machine learning models, and I wanted to sit down with them to discuss some of the proposals they put forward to organizations that are deploying machine learning.

Some high-profile examples of models gone awry have raised awareness among companies of the need for better risk management tools and processes. There is now a growing interest in ethics among data scientists, specifically in tools for monitoring bias in machine learning models. In a previous post, I listed some of the key considerations organizations should keep in mind as they move models to production, but the upcoming report co-authored by Burt goes far beyond that and recommends lines of defense, including a description of the key roles that are needed.

Continue reading Managing risk in machine learning models.

### Four short links: 21 June 2018

Chinese Internet, Booting Linux, Pull Requests, and Commercialized Commons

1. Beijing Wants to Rewrite the Rules of the Internet -- China’s cyber governance plan appears to have three objectives. One is a legitimate desire to address substantial cybersecurity challenges, like defending against cyber attacks and keeping stolen personal data off the black market. A second is the impulse to support domestic industry, in order to wean the government off its dependence on foreign technology components for certain IT products deemed essential to economic and national security. (In effect, these requirements exclude foreign participation, or make foreign participation only possible on Beijing’s terms.) The third goal is to expand Beijing’s power to surveil and control the dissemination of economic, social, and political information online.
2. How Modern Linux Systems Boot -- "Sometimes the reasons for failure are obscure and annoying" could appear in every man page.
3. The Art of Humanizing Pull Requests -- What are PR’s, how to effectively create a PR, how to give feedback on PR’s, and how to respond to feedback. For the junior dev in your life.
4. How Markets Co-opted Free Software's Most Powerful Weapon (YouTube) -- Benjamin Mako Hill's LibrePlanet 2018 keynote. new proprietary, firm-controlled, and money-based models are increasingly replacing, displacing, outcompeting, and potentially reducing what’s available in the commons.[...] In the talk, I talk about how this happened and what I think it means for folks who are committed to working in commons. I talk a little bit about what the free culture and free software should do now that mass collaboration, these communities’ most powerful weapon, is being used against them. (via copyrighteous)

We often make important decisions when resources are scarce.

At Stitch Fix, whenever we serve a customer, we must choose the right day to style that client’s fix, the right stylist for that client’s particular aesthetic, the right warehouse for that client’s shipping address. But stylists and warehouses are in high demand, with many clients competing for their time and attention. Constrained optimization helps us get work to stylists and warehouses in a manner that is fair and efficient, and gives our clients the best possible experience.

This post is an introduction to constrained optimization aimed at data scientists and developers fluent in Python, but without any background in operations research or applied math. We’ll demonstrate how optimization modeling can be applied to real problems at Stitch Fix. At the end of this article, you should be able to start modeling your own business problems.

## Optimization Software

My constrained optimization package of choice is the Python library pyomo, an open source project for defining and solving optimization problems. Pyomo is simple to install:

pip install pyomo


Pyomo is just the interface for defining and running your model. You also need a solver to do the heavy lifting. Pyomo will call out to the solver to analyze your problem, choose an appropriate algorithm, and apply that algorithm to find a solution. If you just want to get started quickly, go for GLPK, the GNU Linear Programming Kit. GLPK runs slowly on large problems but is free and dirt simple to install. On a Mac:

brew install glpk


The fastest open-source solver is CBC, but install can be a bit trickier. The commercial Gurobi software is expensive but state of the art: students and academic researchers can snag a free educational license.

Constrained optimization is a tool for minimizing or maximizing some objective, subject to constraints. For example, we may want to build new warehouses that minimize the average cost of shipping to our clients, constrained by our budget for building and operating those warehouses. Or, we might want to purchase an assortment of merchandise that maximizes expected revenue, limited by a minimum number of different items to stock in each department and our manufacturers’ minimum order sizes.

Here’s the catch: all objectives and constraints must be linear or quadratic functions of the model’s fixed inputs (parameters, in the lingo) and free variables.

Constraints are limited to equalities and non-strict inequalities. (Re-writing strict inequalities in these terms can require some algebraic gymnastics.) Conventionally, all terms including free variables live on the lefthand side of the equality or inequality, leaving only constants and fixed parameters on the righthand side.
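As a small illustration (mine, not from the original post) of that algebraic gymnastics: a strict inequality such as $x < b$ cannot be handed to the solver directly, but it can be approximated or restated in non-strict form:

```latex
% For a continuous variable, pick a small tolerance \epsilon > 0:
x < b \quad\Longrightarrow\quad x \le b - \epsilon
% For an integer-valued variable, the rewrite is exact:
x < b \quad\Longrightarrow\quad x \le b - 1
```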

To build your model, you must first formalize your objective function and constraints. Once you’ve expressed these terms mathematically, it’s easy to turn the math into code and let pyomo find the optimal solution.

### Matching Clients and Stylists

At Stitch Fix, our styling team styles many thousands of clients every day. Our stylists are as diverse as our clients, and matching a customer to the best suited stylist has a big impact on client satisfaction.

We want to match up stylists and clients in a way that maximizes the total number of happy customers, without overworking any stylist and while making sure that every client gets styled exactly once. Formally, the client-stylist matching problem can be expressed as:

$$\max_{x} \sum_{s \in S} \sum_{c \in C} p_{s,c} \, x_{s,c} \quad \text{subject to} \quad \sum_{c \in C} x_{s,c} \le w_s \;\; \forall s \in S, \qquad \sum_{s \in S} x_{s,c} = 1 \;\; \forall c \in C, \qquad x_{s,c} \in \{0, 1\}$$

Where:

• $S$ is the set of all stylists working today;
• $C$ is the set of all clients we need to style today;
• $p_{s,c}$ is the probability that customer $c$ will be happy working with stylist $s$;
• $x_{s,c}$ is 1 if we assign stylist $s$ to work with customer $c$ and 0 otherwise; and
• $w_s$ is the acceptable workload of each stylist: the maximum number of clients stylist $s$ can work with today.

Here, $p$ (the matrix of happy-customer probabilities for each possible client-stylist pairing) and $w$ (the vector of acceptable workloads for each stylist) are determined by external factors. They’re known and fixed, and we don’t have any control over them. In optimization terms, they are parameters of the model.

In contrast, $x$, the matrix of assignments, is under our complete control—within the limits defined by our constraints. We get to determine which entries are 1 and which are 0. Once we’ve coded up our model, the solver will find the values of the elements of $x$ that maximize the objective function.

## Toy Scenario

Before going further, we’ll generate some toy data. Pyomo is easy to work with when data is in dictionary format, with keys treated as the named indices of a vector. For matrices, use tuples as keys, with the first element representing the row index, and the second element corresponding to the column index of the matrix.
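As a plain-Python sketch of those shapes (illustrative names and values, not the post’s actual data):

```python
# Vector-like data: one key per named index.
workloads = {"Alex": 1, "Jesse": 3}

# Matrix-like data: tuple keys, (row_index, column_index).
happiness = {
    ("Alex", "Trista"): 0.9,   # row "Alex", column "Trista"
    ("Alex", "Bob"): 0.4,
    ("Jesse", "Trista"): 0.6,
    ("Jesse", "Bob"): 0.8,
}

# Looking up the "cell" for stylist Jesse and client Bob:
print(happiness[("Jesse", "Bob")])  # → 0.8
```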

Our toy data assigns a random chance of a happy outcome for each potential client-stylist pairing. In the real world, these values are a function of the history between the client and stylist, if any, as well as both parties’ stated and implicit style preferences.

import numpy as np

stylists = {
    "Alex": 1, "Jennifer": 2, "Andrew": 2,
    "DeAnna": 2, "Jesse": 3}

clients = (
    "Trista", "Meredith", "Aaron", "Bob", "Jillian",
    "Ali", "Ashley", "Emily", "Desiree", "Byron")

happiness_probabilities = dict(
    ((stylist, client), np.random.rand())
    for stylist in stylists
    for client in clients)


### Baseline

At this point, it can be useful to run a naive decision process on your data as a baseline. We tested two baseline methods on 10,000 variations of the above scenario. In the first method, assignments were made randomly. In the second method, we paired clients and stylists greedily, iteratively making the match with the highest predicted happiness probability until no clients remained or all stylists had a full workload.

Random assignment produces on average 5 happy clients, while greedy assignment averages 7.53 happy clients. If constrained optimization modeling can beat those numbers, we’ll know we’ve done useful work!
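A minimal sketch of the greedy baseline described above, on a tiny hypothetical instance (the names, workloads, and probabilities here are illustrative, not the post’s data):

```python
# Illustrative data: stylist -> max workload, and happiness probabilities.
stylists = {"Alex": 1, "Jesse": 2}
clients = ["Trista", "Bob", "Ali"]
happiness = {
    ("Alex", "Trista"): 0.9, ("Alex", "Bob"): 0.4, ("Alex", "Ali"): 0.7,
    ("Jesse", "Trista"): 0.6, ("Jesse", "Bob"): 0.8, ("Jesse", "Ali"): 0.5,
}

def greedy_assign(stylists, clients, happiness):
    """Repeatedly make the single best remaining match until every
    client is styled or every stylist is at full workload."""
    remaining = set(clients)
    capacity = dict(stylists)
    assignment = {}
    # Consider pairs from highest to lowest predicted happiness.
    for (stylist, client), p in sorted(
            happiness.items(), key=lambda kv: kv[1], reverse=True):
        if client in remaining and capacity[stylist] > 0:
            assignment[client] = stylist
            remaining.discard(client)
            capacity[stylist] -= 1
    return assignment

print(greedy_assign(stylists, clients, happiness))
```

Because the greedy pass grabs the best pair first, Alex (workload 1) goes to Trista, and the remaining clients fall to Jesse even when a globally better arrangement exists—exactly the gap the optimization model closes.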

Once you know your objective, constraints, parameters, and variables, you can implement your model in pyomo. We begin by importing the pyomo environment and instantiating a concrete model object.

from pyomo import environ as pe
model = pe.ConcreteModel()


### Indices

In pyomo, indices are the sets of things that your parameters and variables can be defined over. In our case, that means the set of clients $C$ and the set of stylists $S$.

We tell the model about these indices by setting fields on the model object:

model.stylists = pe.Set(
    initialize=stylists.keys())
model.clients = pe.Set(
    initialize=clients)


### Parameters

Remember, parameters are the fixed external factors that we have no control over: the matrix of happiness probabilities for each possible client-stylist pairing, and the vector of acceptable workloads for each stylist.

In pyomo, a parameter is defined over some index. For the workload vector, that index is our set of stylists. For the happy-outcome matrix, the index is the cartesian product of the set of stylists and the set of clients: all possible client-stylist pairings.
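As a plain-Python illustration (standard library only, not pyomo itself), the cartesian-product index is just every (stylist, client) pair; the names below are hypothetical:

```python
from itertools import product

stylists = ["Alex", "Jesse"]   # illustrative index sets
clients = ["Trista", "Bob"]

# The happiness matrix is defined over every (stylist, client) pair,
# i.e. the cartesian product of the two index sets.
pairs = list(product(stylists, clients))
print(pairs)
# → [('Alex', 'Trista'), ('Alex', 'Bob'), ('Jesse', 'Trista'), ('Jesse', 'Bob')]
```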

We can also pass an argument specifying the range of values that a parameter can take. This isn’t required, but it helps the solver choose an efficient algorithm well suited to the particular optimization problem embodied by your model.

model.happiness_probabilities = pe.Param(
    # On pyomo Set objects, the '*' operator returns the cartesian product
    model.stylists * model.clients,
    # The dictionary mapping (stylist, client) pairs to chances of a happy outcome
    initialize=happiness_probabilities,
    # Happiness probabilities are real numbers between 0 and 1
    within=pe.UnitInterval)

model.workloads = pe.Param(
    # One workload value per stylist
    model.stylists,
    # The stylists dict maps each stylist to an acceptable workload
    initialize=stylists,
    # Workloads are whole numbers of clients
    within=pe.NonNegativeIntegers)


### Variables

Next, we define our model’s free variables. Variables look similar to fixed parameters: they’re defined over some index in the model, and are restricted to some domain of possible values. The key difference is that the solver is able to manipulate the values of a variable at will. Our only variable is the assignment matrix $x$, which can take a value of 1 when a stylist is assigned to a given client, or 0 otherwise.

model.assignments = pe.Var(
    # Defined over the client-stylist matrix
    model.stylists * model.clients,
    # Possible values are 0 and 1
    domain=pe.Binary)


Note that the free variables in a constrained optimization problem don’t need to be binary. A different formulation of the client-stylist matching problem could assign real-valued probabilities to each possible pairing, rather than a hard 0 or 1. The merchandise-assortment example mentioned earlier would use an integer-valued variable to represent the quantity of each item we want to purchase.

### Objective

We now have everything we need to express our model’s objective function: $\sum_{s \in S} \sum_{c \in C} p_{s,c} \, x_{s,c}$, the expected total number of happy clients that will result from our assignments. In pyomo, the summation function takes any number of parameters and variables defined over the same indices and returns an expression representing the inner product of those terms.

A pyomo Objective also specifies whether to maximize or minimize its value. In our case, we want lots of happy clients, so we’ll maximize.

model.objective = pe.Objective(
    expr=pe.summation(model.happiness_probabilities, model.assignments),
    sense=pe.maximize)


### Constraints

Finally, we must tell the model about our problem’s constraints:

$$\sum_{c \in C} x_{s,c} \le w_s \;\; \forall s \in S \qquad \text{and} \qquad \sum_{s \in S} x_{s,c} = 1 \;\; \forall c \in C$$

Like parameters and variables, pyomo constraints are defined over some index. The constraint’s rule is applied to each element in the index set. A rule is implemented as a function that takes the model and an item in the index set, and returns an equality or weak inequality constructed via a python comparison operator (==, >=, or <=). All terms of the comparison that include a model variable must live on the lefthand side of the equation.

Pyomo parameters and variables can be indexed like python dictionaries to retrieve their value for specific indices. Those values may be manipulated with standard python arithmetic operators, allowing us to easily express the constraints formulated above.

def respect_workload(model, stylist):
    # Count up all the clients assigned to the stylist
    n_clients_assigned_to_stylist = sum(
        model.assignments[stylist, client]
        for client in model.clients)
    # What's the max number of clients this stylist can work with?
    max_clients = model.workloads[stylist]
    # Make sure that sum is no more than the stylist's workload
    return n_clients_assigned_to_stylist <= max_clients

model.respect_workload = pe.Constraint(
    # For each stylist in the set of all stylists...
    model.stylists,
    # Ensure that total assigned clients at most equal workload!
    rule=respect_workload)
def one_stylist_per_client(model, client):
    # Count up all the stylists assigned to the client
    n_stylists_assigned_to_client = sum(
        model.assignments[stylist, client]
        for stylist in model.stylists)
    # Make sure that sum is equal to one
    return n_stylists_assigned_to_client == 1

model.one_stylist_per_client = pe.Constraint(
    # For each client in the set of all clients...
    model.clients,
    # Ensure that exactly one stylist is assigned!
    rule=one_stylist_per_client)


## Solve It!

We’re almost done: all we need to do is apply a solver to our model and retrieve the solution.

# Swap out "glpk" for "cbc" or "gurobi" if using another solver
solver = pe.SolverFactory("glpk")
# Add the keyword arg tee=True for a detailed trace of the solver's work.
solution = solver.solve(model)


Each variable in the model has a get_values() method, which, after solving, will return a mapping of indices to optimal values at that index. Variables also have a pprint() method, which will print a nicely formatted table of values—but we don’t really care about all the 0s in our assignment matrix, so let’s fish out the 1s:

assignments = model.assignments.get_values().items()
for (stylist, client), assigned in sorted(assignments):
    if assigned == 1:
        print("{} will be styled by {}".format(client.rjust(8), stylist))


 Desiree will be styled by Alex
  Ashley will be styled by Andrew
     Bob will be styled by Andrew
   Emily will be styled by DeAnna
 Jillian will be styled by DeAnna
   Aaron will be styled by Jennifer
  Trista will be styled by Jennifer
     Ali will be styled by Jesse
   Byron will be styled by Jesse
Meredith will be styled by Jesse


### Comparison to Baseline

If we run our constrained optimization model on the same 10,000 variations of the toy scenario on which we tested our random and greedy baselines, we find that our global optimization strategy produces on average 7.91 satisfied customers. That’s a 5% lift over the greedy baseline: not bad!

## Final Thoughts

Constrained optimization is a tool that complements and magnifies the power of your predictive modeling skills. You’ll start to see opportunities to apply these techniques everywhere: wherever models and domain knowledge show that different resources have different costs or outcomes under different circumstances; wherever resource allocations are currently controlled by naive business rules and heuristics.

The magic lies in the problem framing. Go seek out inefficiencies. Formalize them. Optimize!

### Unsupervised Learning with Stein's Unbiased Risk Estimator - implementation -

Rich just sent me the following:

Hi Igor,
We have a new paper that uses Stein’s Unbiased Risk Estimator (SURE) to train neural networks directly from noisy measurements without any ground truth data. We demonstrate training neural networks with SURE in order to solve the denoising and compressive sensing problems.
Paper: https://arxiv.org/abs/1805.10531
Software: https://github.com/ricedsp/D-AMP_Toolbox
We would be very grateful if you shared it on Nuit Blanche!
richb
Richard G. Baraniuk
Victor E. Cameron Professor of Electrical and Computer Engineering
Founder and Director, OpenStax
Rice University

Sure Rich !

Learning from unlabeled and noisy data is one of the grand challenges of machine learning. As such, it has seen a flurry of research with new ideas proposed continuously. In this work, we revisit a classical idea: Stein's Unbiased Risk Estimator (SURE). We show that, in the context of image recovery, SURE and its generalizations can be used to train convolutional neural networks (CNNs) for a range of image denoising and recovery problems *without any ground truth data*.
Specifically, our goal is to reconstruct an image x from a *noisy* linear transformation (measurement) of the image. We consider two scenarios: one where no additional data is available and one where we have measurements of other images that are drawn from the same noisy distribution as x, but have no access to the clean images. Such is the case, for instance, in the context of medical imaging, microscopy, and astronomy, where noise-less ground truth data is rarely available.
We show that in this situation, SURE can be used to estimate the mean-squared-error loss associated with an estimate of x. Using this estimate of the loss, we train networks to perform denoising and compressed sensing recovery. In addition, we also use the SURE framework to partially explain and improve upon an intriguing result presented by Ulyanov et al. in "Deep Image Prior": that a network initialized with random weights and fit to a single noisy image can effectively denoise that image.


### Tax cuts: a shopping trolley guide to what the changes mean for you

Our calculator shows how much difference the tax cuts will actually make to your weekly budget

Now that the Coalition’s $144bn income tax cuts have passed the federal parliament, it’s worth outlining exactly what people on different incomes will get. Given the tax cuts were previously described as a “sandwich and milkshake” tax cut (a reference to a 2003 comment by Howard-era minister Amanda Vanstone), here we’ve represented how far the tax cut will go towards a range of goods and services: Continue reading... Continue Reading… ### If you did not already know Internet Shopping Problem Introduced by Blazewicz et al. (2010), where a customer wants to buy a list of products at the lowest possible total cost from shops which offer discounts when purchases exceed a certain threshold. The problem is NP-hard. … EigenPro 2.0 In recent years machine learning methods that nearly interpolate the data have achieved remarkable success. In many settings achieving near-zero training error leads to excellent test results. In this work we show how the mathematical and conceptual simplicity of interpolation can be harnessed to construct a framework for very efficient, scalable and accurate kernel machines. Our main innovation is in constructing kernel machines that output solutions mathematically equivalent to those obtained using standard kernels, yet capable of fully utilizing the available computing power of a parallel computational resource, such as GPU. Such utilization is key to strong performance since much of the computational resource capability is wasted by the standard iterative methods. The computational resource and data adaptivity of our learned kernels is based on theoretical convergence bounds. The resulting algorithm, which we call EigenPro 2.0, is accurate, principled and very fast. For example, using a single GPU, training on ImageNet with$1.3\times 10^6$data points and$1000labels takes under an hour, while smaller datasets, such as MNIST, take seconds. 
Moreover, as the parameters are chosen analytically, based on the theory, little tuning beyond selecting the kernel and kernel parameter is needed, further facilitating the practical use of these methods. … Customer Experience Management (CEM) Customer experience management (CEM or CXM) is the process that companies use to oversee and track all interactions with a customer during the duration of their relationship. This involves the strategy of building around the needs of individual customers. According to Jeananne Rae, companies are realizing that ‘building great consumer experiences is a complex enterprise, involving strategy, integration of technology, orchestrating business models, brand management and CEO commitment.’ … Continue Reading… ### Non-Linear Model in R Exercises (This article was first published on R-exercises, and kindly contributed to R-bloggers) A mechanistic model for the relationship between x and y sometimes needs parameter estimation. When model linearisation does not work,we need to use non-linear modelling. There are three main differences between non-linear and linear modelling in R: 1. specify the exact nature of the equation 2. replace the lm() with nls() which means nonlinear least squares 3. sometimes we also need to specify the model parameters a,b, and c. In this exercise, we will use the same dataset as the previous exercise in polynomial regression here. Download the data-set here. A quick overview of the dataset. Response variable = number of invertebrates (INDIV) Explanatory variable = the area of each clump (AREA) Additional possible response variables = Species richness of invertebrates (SPECIES) Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Exercise 1 Load dataset. 
Specify the model, try to use power function with nls() and a=0.1 and b=1 as initial parameter number Exercise 2 A quick check by creating plot residual versus fitted model since normal plot will not work Exercise 3 Try to build self start function of the powered model Exercise 4 Generate the asymptotic model Exercise 5 Compared the asymptotic model to the powered one using AIC. What can we infer? Exercise 6 Plot the model in one graph Exercise 7 Predict across the data and plot all three lines To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... Continue Reading… ### Document worth reading: “Deep Probabilistic Programming Languages: A Qualitative Study” Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand. This paper addresses this problem directly by explaining deep probabilistic programming languages and indirectly by characterizing their current strengths and weaknesses. Deep Probabilistic Programming Languages: A Qualitative Study Continue Reading… ### Distilled News This tool finds the Granger causality relationship among the input time series and visualizes the results in a directed causal graph and a directed adjacency matrix. It applies the Lasso-Granger and Copula-Granger algorithms with length of lag l=1. For more information, please see the following papers: • Andrew Arnold, Yan Liu, and Naoki Abe. 
Temporal Causal Modeling with Graphical Granger Methods, KDD 2007. • Taha Bahadori, Yan Liu. An Examination of Practical Granger Causality Inference, SDM 2013 In my most recent blog, I discussed the idea of aligning the supply of services to market demand. My conceptualization of ‘alignment’ specifically relates to time intervals: i.e. having people at the right place and at the right time – for example, to take advantage of opportunities – is a sign of alignment. Alignment for me is often about the relationship between capacity and incapacity: the ability to supply services versus the inability to satisfy the market demand for those services. In this blog I will be considering the interpretation of charts to evaluate the effectiveness of strategic allocation. In this study, we predict the outcome of the football matches in the FIFA World Cup 2018 to be held in Russia this summer. We do this using classification models over a dataset of historic football results that includes attributes from the playing teams by rating them in attack, midfield, defence, aggression, pressure, chance creation and building ability. This last training data was a result of merging international matches results with AE games ratings of the teams considering the timeline of the matches with their respective statistics. Final predictions show the four countries with the most chances of getting to the semifinals as France, Brazil, Spain and Germany while giving Spain as the winner. In the exercises below, we will explore more in Time Series analysis.The previous exercise is here,Please follow this in sequence. We discuss a new approach for selecting features from a large set of features, in an unsupervised machine learning framework. 
In supervised learning such as linear regression or supervised clustering, it is possible to test the predicting power of a set of features (also called independent variables by statisticians, or predictors) using metrics such as goodness of fit with the response (the dependent variable), for instance using the R-squared coefficient. This makes the process of feature selection rather easy. Here this is not feasible. The context could be pure clustering, with no training sets available, for instance in a fraud detection problem. We are also dealing with discrete and continuous variables, possibly including dummy variables that represent categories, such as gender. We assume that no simple statistical model explains the data, so the framework here is model-free, data-driven. In this context, traditional methods are based on information theory metrics to determine which subset of features brings the largest amount of information. Teaser: Tensor Networks can be seen as a higher-order generalization of traditional deep neural networks, and yet they lack an explicit non-linearity such as applying the ReLU or sigmoid function as we do with neural nets. A deeper understanding of what nonlinearity actually means, however, reveals that tensor networks can indeed learn non-linear functions. The non-linearity of tensor networks arises soley from the architecture and topology of the network itself. In Data Science, we can use clustering analysis to gain some valuable insights from our data by seeing what groups the data points fall into when we apply a clustering algorithm. Today, we´re going to look at 5 popular clustering algorithms that data scientists need to know and their pros and cons! 
• K-Means Clustering • Mean-Shift Clustering • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) • Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM) • Agglomerative Hierarchical Clustering This is a tutorial for beginners interested in learning about MNIST and Softmax regression using machine learning (ML) and TensorFlow. When we start learning programming, the first thing we learned to do was to print ‘Hello World.’ It´s like Hello World, the entry point to programming, and MNIST, the starting point for machine learning. At first glance, this reminds us of AI, when a machine decides how to manage a task based on statistical data. In fact, this concept is part of the AI phenomenon and makes it possible to develop machine intelligence and improve the decision-making process. According to NVidia, ‘Machine Learning at its most basic is the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world’. In a nutshell, a machine analyzes and recommends information without human participation. This work can be done manually but takes a plenty of time and effort. Thanks to huge computing power, modern machines perform data mining, data analytics and predictive modeling more effectively than people do. The next section will be dedicated to a recommender technique based on the machine learning approach. Continue Reading… ### Whats new on arXiv We present an approach to identify concise equations from data using a shallow neural network approach. In contrast to ordinary black-box regression, this approach allows understanding functional relations and generalizing them from observed data to unseen parts of the parameter space. We show how to extend the class of learnable equations for a recently proposed equation learning network to include divisions, and we improve the learning and model selection strategy to be useful for challenging real-world data. 
For systems governed by analytical expressions, our method can in many cases identify the true underlying equation and extrapolate to unseen domains. We demonstrate its effectiveness by experiments on a cart-pendulum system, where only 2 random rollouts are required to learn the forward dynamics and successfully achieve the swing-up task. Do-calculus is concerned with estimating the interventional distribution of an action from the observed joint probability distribution of the variables in a given causal structure. All identifiable causal effects can be derived using the rules of do-calculus, but the rules themselves do not give any direct indication whether the effect in question is identifiable or not. Shpitser and Pearl constructed an algorithm for identifying joint interventional distributions in causal models, which contain unobserved variables and induce directed acyclic graphs. This algorithm can be seen as a repeated application of the rules of do-calculus and known properties of probabilities, and it ultimately either derives an expression for the causal distribution, or fails to identify the effect, in which case the effect is non-identifiable. In this paper, the R package causaleffect is presented, which provides an implementation of this algorithm. Functionality of causaleffect is also demonstrated through examples. Obtaining a non-parametric expression for an interventional distribution is one of the most fundamental tasks in causal inference. Such an expression can be obtained for an identifiable causal effect by an algorithm or by manual application of do-calculus. Often we are left with a complicated expression which can lead to biased or inefficient estimates when missing data or measurement errors are involved. We present an automatic simplification algorithm that seeks to eliminate symbolically unnecessary variables from these expressions by taking advantage of the structure of the underlying graphical model. 
Our method is applicable to all causal effect formulas and is readily available in the R package causaleffect. We derive a formal, decision-based method for comparing the performance of counterfactual treatment regime predictions using the results of experiments that give relevant information on the distribution of treated outcomes. Our approach allows us to quantify and assess the statistical significance of differential performance for optimal treatment regimes estimated from structural models, extrapolated treatment effects, expert opinion, and other methods. We apply our method to evaluate optimal treatment regimes for conditional cash transfer programs across countries where predictions are generated using data from experimental evaluations in other countries and pre-program data in the country of interest. We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can be trained by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models. With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. 
Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that with a single RNN architecture and pre-trained fixed embeddings, inst2vec outperforms specialized approaches for performance prediction (compute device mapping, optimal thread coarsening); and algorithm classification from raw code (104 classes), where we set a new state-of-the-art. Machine learning has an emerging critical role in high-performance computing to modulate simulations, extract knowledge from massive data, and replace numerical models with efficient approximations. Decision forests are a critical tool because they provide insight into model operation that is critical to interpreting learned results. While decision forests are trivially parallelizable, the traversals of tree data structures incur many random memory accesses and are very slow. We present memory packing techniques that reorganize learned forests to minimize cache misses during classification. 
The resulting layout is hierarchical. At low levels, we pack the nodes of multiple trees into contiguous memory blocks so that each memory access fetches data for multiple trees. At higher levels, we use leaf cardinality to identify the most popular paths through a tree and collocate those paths in cache lines. We extend this layout with out-of-order execution and cache-line prefetching to increase memory throughput. Together, these optimizations increase the performance of classification in ensembles by a factor of four over an optimized C++ implementation and a factor of 50 over a popular R language implementation. We empirically investigate learning from partial feedback in neural machine translation (NMT), when partial feedback is collected by asking users to highlight a correct chunk of a translation. We propose a simple and effective way of utilizing such feedback in NMT training. We demonstrate how the common machine translation problem of domain mismatch between training and deployment can be reduced solely based on chunk-level user feedback. We conduct a series of simulation experiments to test the effectiveness of the proposed method. Our results show that chunk-level feedback outperforms sentence-based feedback by up to 2.61% BLEU absolute. Smart factories are on the verge of becoming the new industrial paradigm, wherein optimization permeates all aspects of production, from concept generation to sales. To fully pursue this paradigm, flexibility in the production means as well as in their timely organization is of paramount importance. AI is playing a major role in this transition, but the scenarios encountered in practice might be challenging for current tools. Task planning is one example where AI enables more efficient and flexible operation through an online automated adaptation and rescheduling of the activities to cope with new operational constraints and demands. 
In this paper we present SMarTplan, a task planner specifically conceived to deal with real-world scenarios in the emerging smart factory paradigm. Including both special-purpose and general-purpose algorithms, SMarTplan is based on current automated reasoning technology and it is designed to tackle complex application domains. In particular, we show its effectiveness on a logistic scenario, by comparing its specialized version with the general purpose one, and extending the comparison to other state-of-the-art task planners. Fraud detection is a difficult problem that can benefit from predictive modeling. However, the verification of a prediction is challenging; for a single insurance policy, the model only provides a prediction score. We present a case study where we reflect on different instance-level model explanation techniques to aid a fraud detection team in their work. To this end, we designed two novel dashboards combining various state-of-the-art explanation techniques. These enable the domain expert to analyze and understand predictions, dramatically speeding up the process of filtering potential fraud cases. Finally, we discuss the lessons learned and outline open research issues. The restricted Boltzmann machine is a network of stochastic units with undirected interactions between pairs of visible and hidden units. This model was popularized as a building block of deep learning architectures and has continued to play an important role in applied and theoretical machine learning. Restricted Boltzmann machines carry a rich structure, with connections to geometry, applied algebra, probability, statistics, machine learning, and other areas. The analysis of these models is attractive in its own right and also as a platform to combine and generalize mathematical tools for graphical models with hidden variables. 
This article gives an introduction to the mathematical analysis of restricted Boltzmann machines, reviews recent results on the geometry of the sets of probability distributions representable by these models, and suggests a few directions for further investigation. Deep neural networks have been proven powerful at processing perceptual data, such as images and audio. However, for tabular data, tree-based models are more popular. A nice property of tree-based models is their natural interpretability. In this work, we present Deep Neural Decision Trees (DNDT) — tree models realised by neural networks. A DNDT is intrinsically interpretable, as it is a tree. Yet as it is also a neural network (NN), it can be easily implemented in NN toolkits, and trained with gradient descent rather than greedy splitting. We evaluate DNDT on several tabular datasets, verify its efficacy, and investigate similarities and differences between DNDT and vanilla decision trees. Interestingly, DNDT self-prunes at both split and feature-level. Inverse reinforcement learning is the problem of inferring the reward function of an observed agent, given its policy or behavior. Researchers perceive IRL both as a problem and as a class of methods. By categorically surveying the current literature in IRL, this article serves as a reference for researchers and practitioners in machine learning to understand the challenges of IRL and select the approaches best suited for the problem at hand. The survey formally introduces the IRL problem along with its central challenges, which include accurate inference, generalizability, correctness of prior knowledge, and growth in solution complexity with problem size. The article elaborates how the current methods mitigate these challenges. We further discuss the extensions of traditional IRL methods: (i) inaccurate and incomplete perception, (ii) incomplete model, (iii) multiple rewards, and (iv) non-linear reward functions. 
This discussion concludes with some broad advances in the research area and currently open research questions. Tensors are higher-order extensions of matrices. In recent work [Kilmer and Martin, 2011], the authors introduced the notion of the t-product, a generalization of matrix multiplication for tensors of order three. The multiplication is based on a convolution-like operation, which can be implemented efficiently using the Fast Fourier Transform (FFT). Based on the t-product, tensors carry a linear algebraic structure similar to that of matrices; for example, there is a computable tensor SVD (t-SVD). Using properties of the FFT, [C. Lu, et al., 2018] gives a more efficient way of computing the t-product and t-SVD. We develop a Matlab toolbox to implement several basic operations on tensors based on the t-product. The toolbox is available at https://…/tproduct. Convolutional Neural Networks (CNNs) are complex systems. They are trained so they can adapt their internal connections to recognize images, texts and more. It is both interesting and helpful to visualize the dynamics within such deep artificial neural networks so that people can understand how these artificial networks are learning and making predictions. In the field of scientific simulations, visualization tools like Paraview have long been utilized to provide insights and understandings. We present in situ TensorView to visualize the training and functioning of CNNs as if they are systems of scientific simulations. In situ TensorView is a loosely coupled in situ visualization open framework that provides multiple viewers to help users to visualize and understand their networks. It leverages the capability of co-processing from Paraview to provide real-time visualization during training and predicting phases. This avoids heavy I/O overhead for visualizing large dynamic systems. Only a small number of lines of code are injected into the TensorFlow framework. 
The visualization can provide guidance to adjust the architecture of networks, or compress the pre-trained networks. We showcase visualizing the training of LeNet-5 and VGG16 using in situ TensorView. Using neural networks in practical settings would benefit from the ability of the networks to learn new tasks throughout their lifetimes without forgetting the previous tasks. This ability is limited in the current deep neural networks by a problem called catastrophic forgetting, where training on new tasks tends to severely degrade performance on previous tasks. One way to lessen the impact of the forgetting problem is to constrain parameters that are important to previous tasks to stay close to the optimal parameters. Recently, multiple competitive approaches for computing the importance of the parameters with respect to the previous tasks have been presented. In this paper, we propose a learning to optimize algorithm for mitigating catastrophic forgetting. Instead of trying to formulate a new constraint function ourselves, we propose to train another neural network to predict parameter update steps that respect the importance of parameters to the previous tasks. In the proposed meta-training scheme, the update predictor is trained to minimize loss on a combination of current and past tasks. We show experimentally that the proposed approach works in the continual learning setting. Continue Reading…

### Scraping Responsibly with R

(This article was first published on Blog-rss on stevenmortimer.com, and kindly contributed to R-bloggers)

I recently wrote a blog post here comparing the number of CRAN downloads an R package gets relative to its number of stars on GitHub. What I didn’t really think about during my analysis was whether or not scraping CRAN was a violation of its Terms and Conditions. I simply copied and pasted some code from R-bloggers that seemed to work and went on my merry way. 
In hindsight, it would have been better to check whether or not the scraping was allowed and maybe find a better way to get the information I needed. Of course, there was a much easier way to get the CRAN package metadata using the function tools::CRAN_package_db(), thanks to a hint from Maëlle Salmon provided in this tweet.

## How to Check if Scraping is Permitted

Also provided by Maëlle’s tweet was the recommendation to use the robotstxt package (currently having 27 Stars + one Star that I just added!). It doesn’t seem to be well known, as it only has 6,571 total downloads. I’m hoping this post will help spread the word. It’s easy to use! In this case I’ll check whether or not CRAN permits bots on specific resources of the domain. My other blog post analysis originally started with trying to get a list of all current R packages on CRAN by parsing the HTML from https://cran.rstudio.com/src/contrib. The page looks like this:

The question is whether or not scraping this page is permitted according to the robots.txt file on the cran.rstudio.com domain. This is where the robotstxt package can help us out. We can check simply by supplying the domain and path that is used to form the full link we are interested in scraping. If the paths_allowed() function returns TRUE then we should be allowed to scrape; if it returns FALSE then we are not permitted to scrape.

library(robotstxt)
paths_allowed(
  paths = "/src/contrib",
  domain = "cran.rstudio.com",
  bot = "*"
)
#> [1] TRUE

In this case the value that is returned is TRUE, meaning that bots are allowed to scrape that particular path. This was how I originally scraped the list of current R packages, even though you don’t really need to do that since there is the wonderful function tools::CRAN_package_db(). After retrieving the list of packages, I decided to scrape details from the DESCRIPTION file of each package. Here is where things get interesting. 
CRAN’s robots.txt file shows that scraping the DESCRIPTION file of each package is not allowed. Furthermore, you can verify this using the robotstxt package:

paths_allowed(
  paths = "/web/packages/ggplot2/DESCRIPTION",
  domain = "cran.r-project.org",
  bot = "*"
)
#> [1] FALSE

However, when I decided to scrape the package metadata I did it by parsing the HTML from the canonical package link that resolves to the index.html page for the package. For example, https://cran.r-project.org/package=ggplot2 resolves to https://cran.r-project.org/web/packages/ggplot2/index.html. If you check whether scraping is allowed on this page, the robotstxt package says that it is permitted.

paths_allowed(
  paths = "/web/packages/ggplot2/index.html",
  domain = "cran.r-project.org",
  bot = "*"
)
#> [1] TRUE

paths_allowed(
  paths = "/web/packages/ggplot2",
  domain = "cran.r-project.org",
  bot = "*"
)
#> [1] TRUE

This is a tricky situation because I can access the same information that is in the DESCRIPTION file just by going to the index.html page for the package, where scraping seems to be allowed. In the spirit of respecting CRAN, it logically follows that I should not be scraping the package index pages if the individual DESCRIPTION files are off-limits. This is despite there being no formal instruction from the robots.txt file about package index pages. All in all, it was an interesting bit of work, and I am glad that I was able to learn about the robotstxt package so I can have it in my toolkit going forward. Remember to Always Scrape Responsibly!

DISCLAIMER: I only have a basic understanding of how robots.txt files work, based on allowing or disallowing specified paths. I believe in this case CRAN’s robots.txt broadly permitted scraping, but too narrowly disallowed just the DESCRIPTION files. Perhaps this goes back to an older time when those DESCRIPTION files really were the best place for people to start scraping, so it made sense to disallow them. Or the reason could be something else entirely. 
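For readers outside R, the same kind of check can be reproduced with Python’s standard-library urllib.robotparser. The robots.txt excerpt below is hypothetical, written only to mirror the situation described above; it is not CRAN’s actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt mirroring the rules discussed above -- not CRAN's actual file
robots_txt = """\
User-agent: *
Disallow: /web/packages/ggplot2/DESCRIPTION
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The DESCRIPTION file itself is disallowed...
print(rp.can_fetch("*", "https://cran.r-project.org/web/packages/ggplot2/DESCRIPTION"))  # False
# ...but the package index page is not matched by the rule
print(rp.can_fetch("*", "https://cran.r-project.org/web/packages/ggplot2/index.html"))   # True
```

Feeding parse() a list of lines avoids any network access, which is handy for testing paths against a robots.txt you have already downloaded.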
To leave a comment for the author, please follow the link and comment on their blog: Blog-rss on stevenmortimer.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more... Continue Reading…

### Explaining Keras image classification models with lime

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

What I did not show in that post was how to use the model for making predictions. This I will do here. But predictions alone are boring, so I’m adding explanations for the predictions using the lime package. I have already written a few blog posts (here, here and here) about LIME and have given talks (here and here) about it, too. None of them applies LIME to image classification models, though. With the new(ish) release from March of Thomas Lin Pedersen’s lime package, lime is now not only on CRAN but also natively supports Keras and image classification models. Thomas wrote a very nice article about how to use keras and lime in R! Here, I am following this article to use Imagenet (VGG16) to make and explain predictions of fruit images, and then I am extending the analysis to last week’s model and comparing it with the pretrained net. 
## Loading libraries and models

library(keras)   # for working with neural nets
library(lime)    # for explaining models
library(magick)  # for preprocessing images
library(ggplot2) # for additional plotting

• Loading the pretrained Imagenet model

model <- application_vgg16(weights = "imagenet", include_top = TRUE)
model

## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ===========================================================================
## input_1 (InputLayer)             (None, 224, 224, 3)           0
## block1_conv1 (Conv2D)            (None, 224, 224, 64)          1792
## block1_conv2 (Conv2D)            (None, 224, 224, 64)          36928
## block1_pool (MaxPooling2D)       (None, 112, 112, 64)          0
## block2_conv1 (Conv2D)            (None, 112, 112, 128)         73856
## block2_conv2 (Conv2D)            (None, 112, 112, 128)         147584
## block2_pool (MaxPooling2D)       (None, 56, 56, 128)           0
## block3_conv1 (Conv2D)            (None, 56, 56, 256)           295168
## block3_conv2 (Conv2D)            (None, 56, 56, 256)           590080
## block3_conv3 (Conv2D)            (None, 56, 56, 256)           590080
## block3_pool (MaxPooling2D)       (None, 28, 28, 256)           0
## block4_conv1 (Conv2D)            (None, 28, 28, 512)           1180160
## block4_conv2 (Conv2D)            (None, 28, 28, 512)           2359808
## block4_conv3 (Conv2D)            (None, 28, 28, 512)           2359808
## block4_pool (MaxPooling2D)       (None, 14, 14, 512)           0
## block5_conv1 (Conv2D)            (None, 14, 14, 512)           2359808
## block5_conv2 (Conv2D)            (None, 14, 14, 512)           2359808
## block5_conv3 (Conv2D)            (None, 14, 14, 512)           2359808
## block5_pool (MaxPooling2D)       (None, 7, 7, 512)             0
## flatten (Flatten)                (None, 25088)                 0
## fc1 (Dense)                      (None, 4096)                  102764544
## fc2 (Dense)                      (None, 4096)                  16781312
## predictions (Dense)              (None, 1000)                  4097000
## ===========================================================================
## Total params: 138,357,544
## Trainable params: 138,357,544
## Non-trainable params: 0
## ___________________________________________________________________________

model2 <- load_model_hdf5(filepath = "/Users/shiringlander/Documents/Github/DL_AI/Tutti_Frutti/fruits-360/keras/fruits_checkpoints.h5")
model2

## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ===========================================================================
## conv2d_1 (Conv2D)                (None, 20, 20, 32)            896
## activation_1 (Activation)        (None, 20, 20, 32)            0
## conv2d_2 (Conv2D)                (None, 20, 20, 16)            4624
## leaky_re_lu_1 (LeakyReLU)        (None, 20, 20, 16)            0
## batch_normalization_1 (BatchNorm (None, 20, 20, 16)            64
## max_pooling2d_1 (MaxPooling2D)   (None, 10, 10, 16)            0
## dropout_1 (Dropout)              (None, 10, 10, 16)            0
## flatten_1 (Flatten)              (None, 1600)                  0
## dense_1 (Dense)                  (None, 100)                   160100
## activation_2 (Activation)        (None, 100)                   0
## dropout_2 (Dropout)              (None, 100)                   0
## dense_2 (Dense)                  (None, 16)                    1616
## activation_3 (Activation)        (None, 16)                    0
## ===========================================================================
## Total params: 167,300
## Trainable params: 167,268
## Non-trainable params: 32
## ___________________________________________________________________________

## Load and prepare images

Here, I am loading and preprocessing two images of fruits (and yes, I am cheating a bit because I am choosing images where I expect my model to work, as they are similar to the training images…). 
• Banana

test_image_files_path <- "/Users/shiringlander/Documents/Github/DL_AI/Tutti_Frutti/fruits-360/Test"
img <- image_read('https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Banana-Single.jpg/272px-Banana-Single.jpg')
img_path <- file.path(test_image_files_path, "Banana", 'banana.jpg')
image_write(img, img_path)
#plot(as.raster(img))

• Clementine

img2 <- image_read('https://cdn.pixabay.com/photo/2010/12/13/09/51/clementine-1792_1280.jpg')
img_path2 <- file.path(test_image_files_path, "Clementine", 'clementine.jpg')
image_write(img2, img_path2)
#plot(as.raster(img2))

### Superpixels

The segmentation of an image into superpixels is an important step in generating explanations for image models. It is important both that the segmentation is correct and follows meaningful patterns in the picture, and that the size/number of superpixels is appropriate. If the important features in the image are chopped into too many segments, the permutations will probably damage the picture beyond recognition in almost all cases, leading to a poor or failing explanation model. As the size of the object of interest varies, it is impossible to set up hard rules for the number of superpixels to segment into: the larger the object is relative to the size of the image, the fewer superpixels should be generated. Using plot_superpixels it is possible to evaluate the superpixel parameters before starting the time-consuming explanation function (help(plot_superpixels)).

plot_superpixels(img_path, n_superpixels = 35, weight = 10)
plot_superpixels(img_path2, n_superpixels = 50, weight = 20)

From the superpixel plots we can see that the clementine image has a higher resolution than the banana image. 
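To see why the superpixel count matters, it helps to look at the perturbation step itself: LIME-style image explainers sample binary on/off vectors over the segments and replace the switched-off segments with a neutral colour. The NumPy sketch below illustrates this on a toy 4x4 ‘image’ with four superpixels; it is an illustrative stand-in, not the lime package’s actual implementation.

```python
import numpy as np

def perturb_image(image, segments, active, background=0.5):
    """Replace every superpixel whose id is not flagged active with a background value."""
    out = image.copy()
    keep = np.isin(segments, np.flatnonzero(active))  # pixels belonging to active segments
    out[~keep] = background
    return out

# Toy 4x4 grayscale "image" and a 2x2 grid of superpixels with ids 0..3
image = np.arange(16, dtype=float).reshape(4, 4)
segments = np.repeat(np.repeat(np.arange(4).reshape(2, 2), 2, axis=0), 2, axis=1)

# One sampled on/off vector: segments 0 and 3 kept, 1 and 2 grayed out
active = np.array([1, 0, 0, 1])
perturbed = perturb_image(image, segments, active)
```

The more segments an object is split into, the more of these random maskings destroy it entirely, which is exactly why over-segmenting leads to poor explanations.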
## Prepare images for Imagenet

image_prep <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = c(224, 224))
    x <- image_to_array(img)
    x <- array_reshape(x, c(1, dim(x)))
    x <- imagenet_preprocess_input(x)
  })
  do.call(abind::abind, c(arrays, list(along = 1)))
}

• test predictions

res <- predict(model, image_prep(c(img_path, img_path2)))
imagenet_decode_predictions(res)

## [[1]]
##   class_name class_description        score
## 1  n07753592            banana 0.9929747581
## 2  n03532672              hook 0.0013420776
## 3  n07747607            orange 0.0010816186
## 4  n07749582             lemon 0.0010625814
## 5  n07716906  spaghetti_squash 0.0009176208
##
## [[2]]
##   class_name class_description      score
## 1  n07747607            orange 0.78233224
## 2  n07753592            banana 0.04653566
## 3  n07749582             lemon 0.03868873
## 4  n03134739      croquet_ball 0.03350329
## 5  n07745940        strawberry 0.01862431

• load labels and train explainer

model_labels <- readRDS(system.file('extdata', 'imagenet_labels.rds', package = 'lime'))
explainer <- lime(c(img_path, img_path2), as_classifier(model, model_labels), image_prep)

Training the explainer (explain() function) can take pretty long. It will be much faster with the smaller images in my own model, but with the bigger Imagenet it takes a few minutes to run.

explanation <- explain(c(img_path, img_path2), explainer, n_labels = 2, n_features = 35, n_superpixels = 35, weight = 10, background = "white")

• plot_image_explanation() only supports showing one case at a time

plot_image_explanation(explanation)

clementine <- explanation[explanation$case == "clementine.jpg",]
plot_image_explanation(clementine)

## Prepare images for my own model

• test predictions (analogous to training and validation images)
test_datagen <- image_data_generator(rescale = 1/255)

test_generator = flow_images_from_directory(
test_image_files_path,
test_datagen,
target_size = c(20, 20),
class_mode = 'categorical')

predictions <- as.data.frame(predict_generator(model2, test_generator, steps = 1))

fruits_classes_indices_df <- data.frame(indices = unlist(fruits_classes_indices))
fruits_classes_indices_df <- fruits_classes_indices_df[order(fruits_classes_indices_df$indices), , drop = FALSE]
colnames(predictions) <- rownames(fruits_classes_indices_df)
t(round(predictions, digits = 2))

##             [,1] [,2]
## Kiwi           0 0.00
## Banana         1 0.11
## Apricot        0 0.00
## Avocado        0 0.00
## Cocos          0 0.00
## Clementine     0 0.87
## Mandarine      0 0.00
## Orange         0 0.00
## Limes          0 0.00
## Lemon          0 0.00
## Peach          0 0.00
## Plum           0 0.00
## Raspberry      0 0.00
## Strawberry     0 0.01
## Pineapple      0 0.00
## Pomegranate    0 0.00

for (i in 1:nrow(predictions)) {
  cat(i, ":")
  print(unlist(which.max(predictions[i, ])))
}

## 1 :Banana
## 2
## 2 :Clementine
## 6

This seems to be incompatible with lime, though (or if someone knows how it works, please let me know), so I prepared the images similarly to the Imagenet images.

image_prep2 <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = c(20, 20))
    x <- image_to_array(img)
    x <- reticulate::array_reshape(x, c(1, dim(x)))
    x <- x / 255
  })
  do.call(abind::abind, c(arrays, list(along = 1)))
}

• prepare labels

fruits_classes_indices_l <- rownames(fruits_classes_indices_df)
names(fruits_classes_indices_l) <- unlist(fruits_classes_indices)
fruits_classes_indices_l

##            9           10            8            2           11
##       "Kiwi"     "Banana"    "Apricot"    "Avocado"      "Cocos"
##            3           13           14            7            6
## "Clementine"  "Mandarine"     "Orange"      "Limes"      "Lemon"
##            1            5            0            4           15
##      "Peach"       "Plum"  "Raspberry" "Strawberry"  "Pineapple"
##           12
## "Pomegranate"

• train explainer

explainer2 <- lime(c(img_path, img_path2), as_classifier(model2, fruits_classes_indices_l), image_prep2)
explanation2 <- explain(c(img_path, img_path2), explainer2, n_labels = 1, n_features = 20, n_superpixels = 35, weight = 10, background = "white")

• plot feature weights to find a good threshold for plotting block (see below)

explanation2 %>%
  ggplot(aes(x = feature_weight)) +
  facet_wrap(~ case, scales = "free") +
  geom_density()

• plot predictions

plot_image_explanation(explanation2, display = 'block', threshold = 5e-07) 
clementine2 <- explanation2[explanation2$case == "clementine.jpg",]
plot_image_explanation(clementine2, display = 'block', threshold = 0.16)
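The fiddliest part of the code above is the index-to-label bookkeeping: the class-indices mapping has to be inverted so that column j of the prediction matrix gets the right fruit name before taking the row-wise argmax. A minimal, language-agnostic Python sketch of the same logic, with made-up class indices and scores for illustration:

```python
# Hypothetical subset of a Keras-style class_indices mapping: name -> column index
class_indices = {"Banana": 1, "Clementine": 0, "Kiwi": 2}

# Invert to index -> name, then order labels by column index
index_to_label = {idx: name for name, idx in class_indices.items()}
labels = [index_to_label[j] for j in range(len(index_to_label))]

# Two made-up prediction rows (one per test image), columns in index order
predictions = [
    [0.02, 0.95, 0.03],
    [0.87, 0.11, 0.02],
]

# Row-wise argmax, mapped back to class names
winners = [labels[max(range(len(row)), key=row.__getitem__)] for row in predictions]
print(winners)  # ['Banana', 'Clementine']
```

Skipping the inversion step (i.e., labelling columns in dictionary order rather than index order) silently shuffles the class names, which is exactly the bug this bookkeeping guards against.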

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] ggplot2_2.2.1 magick_1.9    lime_0.4.0    keras_2.1.6
##
## loaded via a namespace (and not attached):
##  [1] stringdist_0.9.5.1 reticulate_1.8     xfun_0.2
##  [4] lattice_0.20-35    colorspace_1.3-2   htmltools_0.3.6
##  [7] yaml_2.1.19        base64enc_0.1-3    rlang_0.2.1
## [10] pillar_1.2.3       later_0.7.3        foreach_1.4.4
## [13] plyr_1.8.4         tensorflow_1.8     stringr_1.3.1
## [16] munsell_0.5.0      blogdown_0.6       gtable_0.2.0
## [19] htmlwidgets_1.2    codetools_0.2-15   evaluate_0.10.1
## [22] labeling_0.3       knitr_1.20         httpuv_1.4.4.1
## [25] tfruns_1.3         parallel_3.5.0     curl_3.2
## [28] Rcpp_0.12.17       xtable_1.8-2       scales_0.5.0
## [31] backports_1.1.2    promises_1.0.1     jsonlite_1.5
## [34] abind_1.4-5        mime_0.5           digest_0.6.15
## [37] stringi_1.2.3      bookdown_0.7       shiny_1.1.0
## [40] grid_3.5.0         rprojroot_1.3-2    tools_3.5.0
## [43] magrittr_1.5       lazyeval_0.2.1     shinythemes_1.1.1
## [46] glmnet_2.0-16      tibble_1.4.2       whisker_0.3-2
## [49] zeallot_0.1.0      Matrix_1.2-14      gower_0.1.2
## [52] assertthat_0.2.0   rmarkdown_1.10     iterators_1.0.9
## [55] R6_2.2.2           compiler_3.5.0

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

## June 20, 2018

### R Packages worth a look

Facilities for Simulating from ODE-Based Models (RxODE)
Facilities for running simulations from ordinary differential equation (ODE) models, such as pharmacometric and other compartmental models. A compilation manager translates the ODE model into C, compiles it, and dynamically loads the object code into R for improved computational efficiency. An event table object facilitates the specification of complex dosing regimens (optional) and sampling schedules. NB: this package requires both C and Fortran compilers; for details on their use with R, please see Section 6.3, Appendix A, and Appendix D in the ‘R Installation and Administration’ manual. The code is mostly released under the GPL; the VODE and LSODA solvers are in the public domain. Details are available in inst/COPYRIGHTS.

Stanford ‘ATLAS’ Search Engine API (atlas)
Stanford ‘ATLAS’ (Advanced Temporal Search Engine) is a powerful tool for constructing cohorts of patients extremely quickly and efficiently. This package is designed to interface directly with an instance of the ‘ATLAS’ search engine and facilitates API queries and data dumps. A good knowledge of the temporal query language is a prerequisite for constructing queries efficiently. More information is available at <https://…/start>.

In-place Operators for R (inplace)
Provides in-place operators for R that are equivalent to ‘+=’, ‘-=’, ‘*=’ and ‘/=’ in C++. They can be applied to integer and double vectors and matrices, and in-place sweep operations are also available.

Simulation Extrapolation Inverse Probability Weighted Generalized Estimating Equations (swgee)
Simulation extrapolation and inverse probability weighted generalized estimating equations method for longitudinal data with missing observations and measurement error in covariates. References: Yi, G. Y. (2008) <doi:10.1093/biostatistics/kxm054>; Cook, J. R. and Stefanski, L. A. (1994) <doi:10.1080/01621459.1994.10476871>; Little, R. J. A. and Rubin, D. B. (2002, ISBN:978-0-471-18386-0).

A User-Oriented Statistical Toolkit for Analytical Variance Estimation (gustave)
Provides a toolkit for analytical variance estimation in survey sampling. Apart from the implementation of standard variance estimators, its main feature is to help the sampling expert produce easy-to-use variance estimation ‘wrappers’, where systematic operations (linearization, domain estimation) are handled in a consistent and transparent way for the end user.

### Le Monde puzzle [#1053]

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

An easy arithmetic Le Monde mathematical puzzle again:

1. If coins come in units of 1, x, and y, what is the optimal value of (x,y) that minimises the maximal number of coins needed to pay an arbitrary price between 1 and 149?
2.  If the number of units is now four, what is the optimal choice?

The first question is fairly easy to code

coinz <- function(x, y){
  z <- 1:149
  if (y < x){ tmp <- x; x <- y; y <- tmp }  # ensure x <= y
  # greedy count: largest denomination first, then x, then 1-unit coins
  max(z %/% y + (z %% y) %/% x + (z %% y) %% x)
}

and returns M=12 as the maximal number of coins, corresponding to x=4 and y=22 and attained at a price tag of 129. For the second question, one unit is necessarily 1 (!) and there is just an extra loop to add to the above, which returns M=8, with the other units taking several possible values:

[1] 40 11  3
[1] 41 11  3
[1] 55 15  4
[1] 56 15  4


A quick search revealed that this problem (or a variant) is solved in many places, from stackexchange (for an average number of coins rather than a maximal one—why an average?, as it does not make sense when looking at real prices), to a paper by Shalit calling for the 18¢ coin, to Freakonomics, to Wikipedia, although the latter is about finding the minimum number of coins summing up to a given value, using fixed currency denominations (a knapsack problem). This Wikipedia page made me realise that my solution is not necessarily optimal: I use the remainders from the larger denominations in my code, while there may be more efficient divisions. For instance, running the following dynamic programming code

coz <- function(x, y){
  minco <- c(0, 1:149)   # minco[p + 1] = fewest coins needed for price p
  for (p in 1:149){
    if (x <= p) minco[p + 1] <- min(minco[p + 1], minco[p - x + 1] + 1)
    if (y <= p) minco[p + 1] <- min(minco[p + 1], minco[p - y + 1] + 1)
  }
  max(minco[-1])
}

returns the lower value of M=11 (with x=7,y=23) in the first case and M=7 in the second one.
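The gap between the greedy count and the true optimum is easy to check independently. Here is a quick mirror of both computations in Python (function names are mine; the greedy and dynamic-programming logic follows the description above):

```python
def greedy_max(x, y, top=149):
    """Worst-case coin count over prices 1..top when paying greedily
    with denominations 1 < x < y (largest coin first)."""
    worst = 0
    for p in range(1, top + 1):
        coins = p // y + (p % y) // x + (p % y) % x
        worst = max(worst, coins)
    return worst

def dp_max(x, y, top=149):
    """Worst-case minimal coin count over prices 1..top, computed by the
    standard change-making dynamic programme."""
    minc = list(range(top + 1))          # price p paid entirely in 1-unit coins
    for coin in (x, y):
        for p in range(coin, top + 1):
            minc[p] = min(minc[p], minc[p - coin] + 1)
    return max(minc[1:])

print(greedy_max(4, 22))   # 12, attained at the price 129
print(dp_max(7, 23))       # 11, the optimum reported for the first question
```

Minimising greedy_max over all pairs reproduces the (4,22) solution, while minimising dp_max finds the strictly better (7,23), matching the figures in the post.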


WHEN Donald Trump declared on March 22nd that he planned to impose a 25% tariff on $50bn a year of imports from China, the prospect of a trade war between the world’s two largest economies briefly spooked stock markets. Continue Reading…

### Magister Dixit

“Sometimes some data scientists seem to ignore this: you can think of using the most sophisticated and trendy algorithm, come up with brilliant ideas, imagine the most creative visualizations but, if you do not know how to get the data and handle it in the exact way you need it, all of this becomes worthless.” Hernán Resnizky (May 15, 2015)

Continue Reading…

### Top KDnuggets tweets, Jun 6–19: #MachineLearning predicts #WorldCup2018 winner; 10 More Free Must-Read Books for Data Science

Also: Google #AI principles; #Cartoon: FIFA #WorldCup #Football and #MachineLearning; Introduction to Game Theory; Top 20 Recent Research Papers on Machine Learning and Deep Learning

Continue Reading…

### PYPL Language Rankings: Python ranks #1, R at #7 in popularity

The new PYPL Popularity of Programming Languages (June 2018) index ranks Python at #1 and R at #7. Like the similar TIOBE language index, the PYPL index uses Google search activity to rank language popularity. PYPL, however, focuses on people searching for tutorials in the respective languages as a proxy for popularity. By that measure, Python has always been more popular than R (as you'd expect from a more general-purpose language), but both have been growing at similar rates. The chart below includes the three data-oriented languages tracked by the index (and note the vertical scale is logarithmic). Another language ranking was also released recently: the annual KDnuggets Analytics, Data Science and Machine Learning Poll. These rankings, however, are derived not from search trends but from self-selected poll respondents, which perhaps explains the presence of Rapidminer at the #2 spot.
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Continue Reading…

### Book Memo: “Practical Text Analytics”

Maximizing the Value of Your Text Data
This book explores the process of text analytics in order to increase the accessibility of information available in unstructured text data. Unlike other books available in the text analytics field, Practical Text Analytics opens the door to business analysts and practitioners that may not have extensive coding experience or knowledge of the area.
This allows readers without a programming background to take advantage of the nearly limitless information currently shrouded by text. Text analytics can help organizations derive insights for their business from text-based content like emails, documents, or social media posts. This book covers the elements involved in creating a text-mining pipeline. While analysts will not use every element in every project, each tool provides a potential segment in the final pipeline. Understanding the options is key to choosing the appropriate elements in designing and conducting text analysis.

Continue Reading…

### Low-Resource NMT Awards

Facebook is pleased to announce the research award recipients for the Low-resource Neural Machine Translation (NMT) call for proposals. This effort is expected to contribute to the field of NMT through research into novel, strongly performing models under low-resource training conditions and/or comparable corpora mining techniques for low-resource language pairs. Facebook selected the top 5 proposals. Of these, 3 were focused on low-resource modeling and 2 were focused on data mining approaches. The Principal Investigators are:

• Trevor Cohn, University of Melbourne, Australia: Nearest neighbor search over vector space representations of massive corpora: An application to low-resource NMT
• Victor O.K. Li, The University of Hong Kong, Hong Kong: Population-Based Meta-learning for Low-Resource Neural Machine Translation
• David McAllester, Toyota Technological Institute at Chicago, USA: Phrase Based Unsupervised Machine Translation
• Alexander Rush, Harvard University, USA: More Embeddings, Less Parameters: Unsupervised NMT by Learning to Reorder
• William Wang, University of California, Santa Barbara, USA: Hierarchical Deep Reinforcement Learning for Semi-Supervised Low-Resource Comparable Corpora Mining

Continue Reading…

### Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

We here at Win-Vector LLC have some really big news we would like the R community’s help sharing. vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark. vtreat is a very complete and rigorous tool for preparing messy real-world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you. Thanks to the rquery package, this data preparation transform can now be directly applied to databases and big data systems such as PostgreSQL, Amazon RedShift, Apache Spark, or Google BigQuery. Or, thanks to the data.table and rqdatatable packages, even fast large in-memory transforms are possible. We have some basic examples of the new vtreat capabilities here and here.

Continue Reading…

### Neural Networks Are Essentially Polynomial Regression

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

You may be interested in my new arXiv paper, joint work with Xi Cheng, an undergraduate at UC Davis (now heading to Cornell for grad school); Bohdan Khomtchouk, a post doc in biology at Stanford; and Pete Mohanty, a Science, Engineering & Education Fellow in statistics at Stanford. The paper is of a provocative nature, and we welcome feedback.
A summary of the paper is:

• We present a very simple, informal mathematical argument that neural networks (NNs) are in essence polynomial regression (PR). We refer to this as NNAEPR.
• NNAEPR implies that we can use our knowledge of the “old-fashioned” method of PR to gain insight into how NNs — widely viewed somewhat warily as a “black box” — work inside.
• One such insight is that the outputs of an NN layer will be prone to multicollinearity, with the problem becoming worse with each successive layer. This in turn may explain why convergence issues often develop in NNs. It also suggests that NN users tend to use overly large networks.
• NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.
• We investigated this on a wide variety of datasets, and found that in every case PR did as well as, and often better than, NNs.
• We have developed a feature-rich R package, polyreg, to facilitate using PR in multivariate settings.

Much work remains to be done (see paper), but our results so far are very encouraging. By using PR, one can avoid the headaches of NN, such as selecting good combinations of tuning parameters, dealing with convergence problems, and so on. Also available are the slides for our presentation at GRAIL on this project.

To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.
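The multicollinearity claim is easy to poke at numerically: successive powers of x are strongly correlated, so polynomial features (and, per the paper's argument, NN layer outputs) tend toward collinearity. A throwaway sketch of this, not the authors' polyreg package:

```python
def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

xs = [i / 500.0 - 1.0 for i in range(1001)]             # evenly spaced grid on [-1, 1]
r13 = corr(xs, [x ** 3 for x in xs])                    # x vs x^3
r24 = corr([x ** 2 for x in xs], [x ** 4 for x in xs])  # x^2 vs x^4
print(round(r13, 2), round(r24, 2))                     # roughly 0.92 and 0.96
```

High pairwise correlation among regressors is exactly the condition under which least-squares fits become ill-conditioned, which is the mechanism the NNAEPR bullet points appeal to.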
Continue Reading…

### How to Do Distributed Deep Learning for Object Detection Using Horovod on Azure

This post is co-authored by Mary Wahl, Data Scientist, Xiaoyong Zhu, Program Manager, Siyu Yang, Software Development Engineer, and Wee Hyong Tok, Principal Data Scientist Manager, at Microsoft.

Object detection powers some of the most widely adopted computer vision applications, from people counting in crowd control to pedestrian detection used by self-driving cars. Training an object detection model can take weeks on a single GPU, a prohibitively long time for experimenting with hyperparameters and model architectures. This blog will show how you can train an object detection model by distributing deep learning training to multiple GPUs. These GPUs can be on a single machine or several machines. You will learn how to perform distributed deep learning on Azure, and how you can do this using Horovod running on Azure Batch AI.

Object Detection

Object detection combines the task of classification with localization, outputting both a category and a set of coordinates representing the bounding box for each object that it detects in the image, as illustrated in Figure 1 below.

Figure 1. Different computer vision tasks (source)

Over the past few years, many exciting deep learning approaches for object detection have emerged. Models such as Faster R-CNN use a two-stage procedure to first propose regions containing some object, followed by classification of the object in each region and adjustment of its bounding box. Using a one-stage approach, models such as You Only Look Once (YOLO) and Single Shot MultiBox Detector (SSD), or RetinaNet, consider a fixed set of boxes for detection and skip the region proposal stage. These are a few examples of the array of model architectures available to you for doing object detection.
Instead of taking the raw image as input, these object detection models work off the feature map produced by a backbone network, which is often the convolutional layers of a classification network such as ResNet. Amongst different object detection techniques, several promising approaches have been introduced recently (e.g. Cascade R-CNN and Scale Normalization for Image Pyramids (SNIP)). This paper provides a good overview of the trade-offs between different object detection architectures.

Figure 2. Object Detection Approaches (trade-offs between accuracy and inference time). Marker shapes indicate meta-architecture and colors indicate feature extractor. Each (meta-architecture, feature extractor) pair corresponds to multiple points on this plot due to changing input sizes, stride, etc. (source)

Why Distributed Deep Learning?

While in many situations a powerful GPU can carry out model training in a reasonable amount of time, for elaborate models such as object detectors it can take days or weeks to complete. To make hyperparameter search and rapid iterative experimentation practical, we look to speed up training time by distributing the computation to multiple GPUs in a computer, or even across a cluster of computers. Below we briefly discuss the several ways distributed training can be accomplished, and introduce Horovod, a distributed deep learning framework that can be used with TensorFlow, Keras and PyTorch.

Model Parallelism and Data Parallelism

The gain in speed from distributing training to more than one GPU comes from parallelizing compute operations across multiple processes, each running on a separate GPU, for instance. There are two approaches for doing this. In the model parallelism approach, the parameters of the model are distributed across multiple devices and one batch of data is processed in each iteration. This is helpful for very large models that cannot fit on a single device.
Parallelizing a model requires it to be implemented with the compute resource in mind, and so there is no easy way to rely on a framework to do this for new models or device settings.

Figure 3. Model parallelism

When using data parallelism, the same copy of the training script (a replica) is run on all devices, but each device reads in a different chunk of data at each iteration. The gradients computed by all copies are averaged by some mechanism and the model gets updated.

Figure 4. Data parallelism

Parameter Server vs Ring-Allreduce Algorithm for Gradient Updates

When considering data parallelism, a key question is how we combine the gradients computed by each replica so that a single model is updated. In distributed TensorFlow, parameter servers are used to average the gradients. Each process running in a distributed TensorFlow setup plays either a worker or a parameter server role. Workers process training data, compute the gradients of the model parameters, and send them to one or more parameter servers to be averaged, later obtaining a copy of the updated model for the next iteration. To use this, training code designed for running on a single GPU needs to be carefully adapted, a rather error-prone process. In addition, this distribution scheme often suffers from various scaling inefficiencies, and GPUs are not fully utilized.

Figure 5. Parameter Server Approach (Source: Horovod Presentation)

A different approach to gradient averaging, called ring-allreduce, was popularized by Baidu in early 2017 and first implemented as a fork of TensorFlow. In this approach, workers are connected in a ring, communicate with their two neighboring workers, and can average gradients and disperse them without a central parameter server. Below is an illustration of how ring-allreduce works.

Figure 6. Ring-allreduce approach

Why Horovod?

A key consideration in distributed deep learning is how to efficiently use the resources that are available (CPUs, GPUs, and more).
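Before turning to Horovod itself, the ring exchange is easy to simulate. The toy below (pure Python, all names mine; real implementations such as Horovod's pipeline the chunks and overlap communication with computation) runs the two phases, reduce-scatter and then all-gather, for n workers whose gradients are split into n chunks:

```python
def ring_allreduce(grads):
    """Average gradients across workers arranged in a ring.

    grads: list of n gradient vectors, each split into n chunks
    (here: n lists of n numbers). Returns the per-worker averaged copies.
    """
    n = len(grads)
    # Phase 1: reduce-scatter. In step s, worker i sends chunk (i - s) % n
    # to its right neighbour, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, grads[i][(i - s) % n]) for i in range(n)]
        for i, c, v in sends:
            grads[(i + 1) % n][c] += v
    # Worker i now holds the complete sum of chunk (i + 1) % n.
    # Phase 2: all-gather. Completed chunks travel once around the ring,
    # overwriting the stale partial sums on the other workers.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, grads[i][(i + 1 - s) % n]) for i in range(n)]
        for i, c, v in sends:
            grads[(i + 1) % n][c] = v
    # Divide the sums by n to obtain averages.
    return [[v / n for v in g] for g in grads]

out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# every worker ends up with the element-wise mean [4.0, 5.0, 6.0]
```

After 2(n-1) steps every worker holds the identical averaged gradient, and each worker only ever talked to its ring neighbours, which is the bandwidth argument for the approach.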
Horovod is an open source project initially developed at Uber that implements the ring-allreduce algorithm, first designed for TensorFlow. It provides several advantages when compared to the default TensorFlow implementation:

• The Horovod API enables you to easily convert a training script designed to run on one GPU to a distributed-training-ready script using a few lines of code. We will demonstrate how to do this in the next section.
• Horovod improves training speed by more fully utilizing the GPUs.

Horovod works with different deep learning frameworks: TensorFlow, Keras and PyTorch. Models written using these frameworks can be easily trained on Azure Batch AI, which has native support for Horovod. In addition, Batch AI enables you to train models used for different use cases at scale. The trained models can be deployed to the cloud, or to edge devices.

Model Training and Deployment

In this section, you will learn how you can build, train and deploy an object detection model using Azure. We will use the following resources:

• Dataset: Common Objects in Context (COCO), a canonical object detection dataset, stored in Azure Blob Storage for easy access during training and evaluation.
• Technique: a Keras implementation of the RetinaNet object detector, with a ResNet-50 backbone, and Horovod for distributed training.
• Azure: Azure Batch AI, a new Azure service for training models on GPUs with managed infrastructure, with Horovod support, plus development and deployment tools.

The diagram below illustrates the architecture of our solution. Once the COCO dataset is placed in Azure Blob Storage, we train a RetinaNet (described below) to perform object detection using Horovod on Azure Batch AI, so that training is distributed to multiple GPUs. We then deploy the model as a REST endpoint accessible to users and applications using Azure Machine Learning. Each of these steps is discussed in the following sections.

Figure 7.
Training an object detection model on Azure

Obtaining the COCO Dataset

COCO (Common Objects in Context) is a commonly used dataset for benchmarking object detection models. The COCO 2017 training and validation sets contain over 120k images representing scenes in everyday life, annotated with bounding boxes labeling 80 classes of common objects such as bicycles and cars, humans and pets, foods, and furniture. We downloaded the 2017 training and validation images and annotations from the COCO dataset download page, unzipped the files, and used the AzCopy utility to transfer them into a blob container on an Azure Storage Account for fast and easy access during training.

Figure 8. Sample image from the COCO dataset

Azure Batch AI Cluster Deployment

Azure Batch AI is a service that helps users provision and manage clusters of virtual machines for deep learning training jobs. We used the Azure Command Line Interface (CLI) to deploy our Batch AI cluster: a cluster containing two Azure Data Science Virtual Machines of size NC24. Each VM has four NVIDIA K80 GPUs, as well as a good mix of CPU, memory, and storage resources. During deployment, the Batch AI service coordinates setup tasks that need to be performed on each VM in the cluster. These include:

• Installation of less common Python and Anaconda packages
• Mounting the blob container with our training data
• Ensuring each VM can access a file share on our Azure Storage Account where scripts, logs, and output models will be stored.
After specifying our cluster’s configuration, we created the cluster with the following command:

az batchai cluster create -n cocohorovod --image UbuntuDSVM --resource-group yourrgname -c cluster.json

Training RetinaNet Object Detector with Horovod

RetinaNet, an architecture developed by Tsung-Yi Lin and colleagues (2018), is a state-of-the-art object detector that combines the fast inference speed of one-stage detectors with accuracy surpassing that of previous detectors, including those using two-stage approaches. Below is the result on the COCO dataset for multiple architectures; RetinaNet achieves the highest COCO AP (Average Precision) score with decent inference time.

Figure 9. Performance of RetinaNet using COCO Dataset (Lin et al., 2018)

We adapted an existing implementation of RetinaNet, keras-retinanet, for distributed training using Horovod, with just a handful of modifications to the repository’s train.py script. One of the advantages of Horovod is its simplicity: you only need to modify a few lines of your code without touching the rest of the code.
Notably, you need to:

• Import the Horovod package
• Configure the number of GPUs visible to Horovod
• Use a distributed optimizer to wrap a regular optimizer, such as Adam
• Add necessary callbacks to avoid conflicts between workers when saving the model

First, we added Horovod import and initialization statements at the top of the script:

import horovod.keras as hvd
hvd.init()

We also modified the method that defines the TensorFlow session's configuration, so that each running instance of the script would be assigned one GPU by Horovod:

def get_session():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    return tf.Session(config=config)

Then, we wrapped the script's parameter optimizer (Adam) with Horovod's distributed optimizer, to coordinate training between workers:

training_model.compile(
    loss={'regression': losses.smooth_l1(),
          'classification': losses.focal()},
    optimizer=hvd.DistributedOptimizer(
        keras.optimizers.adam(lr=1e-5, clipnorm=0.001)))

And finally, we replaced the script's callback for saving model checkpoints with versions recommended for Horovod:

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
if hvd.rank() == 0:  # only one worker saves the checkpoint file
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch:02d}.h5'))

With these simple modifications, the script is ready for distributed training with Horovod. We uploaded the modified script to our storage account's file share so that it can be accessed by our Batch AI cluster during training.

Training with keras-retinanet

We used Azure Batch AI to create a training job using all eight available GPUs on our cluster. The training job configuration file, based on the Horovod recipe in the Azure Batch AI repository, specifies the arguments used to call the training script as well as where trained models and logs will be stored.
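For context on the losses.focal() term used above: the focal loss is RetinaNet's key ingredient, down-weighting easy (mostly background) examples so the dense one-stage detector is not swamped by them. A minimal scalar sketch of the binary form (my own helper for illustration, not the keras-retinanet implementation):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive class; y: true label (0 or 1).
    gamma down-weights well-classified examples; alpha balances the classes.
    With gamma=0 and alpha=1 this reduces to plain cross-entropy -log(p_t).
    """
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1):
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

With gamma = 2, the well-classified example's cross-entropy is scaled by (1 - 0.9)^2 = 0.01 while the hard one's is scaled by 0.81, so hard examples dominate the summed loss.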
The job is launched from the Azure CLI with the following command:

az batchai job create -n trainingjob -r xviewhorovod -c job.json

This eight-worker job ran at a rate of one epoch every 670 seconds, roughly 7.2 times faster than when a single GPU was used:

Figure 10. Mean training epoch length vs. number of workers (GPUs)

We were able to monitor the logs from the Azure CLI using a streaming view. We simply specify the name of the job, the directory of interest (in our case, the directory where stdout and stderr are written), and the name of the file to view:

az batchai job stream-file -d stdOutErr -j trainingjob -n stdout.txt

We can also list the output trained models and checkpoint files directly from the CLI, including a URL to download each file:

az batchai job list-files -n trainingjob -d outputfiles

Use Visual Studio Tools for AI to Submit Jobs to Batch AI

Once the cluster is configured, we can use Visual Studio Tools for AI to submit jobs to the Batch AI cluster instead of using CLI commands. This allows the developer to use IDE features in Visual Studio, such as IntelliSense, and to take advantage of the easily scalable compute resources of the cloud.

Figure 11. Using Visual Studio Tools for AI to submit keras-retinanet training jobs to Batch AI

Using Azure Machine Learning to Operationalize the Object Detection Model

Azure Machine Learning services enable models to be operationalized as REST endpoints that can be consumed by your applications and other users.
To do this, we can use the Azure CLI and specify the required configurations using the Azure Machine Learning (AML) operationalization module, as follows:

az ml service create realtime -f score.py --model-file model.pkl -s service_schema.json -n coco_retinanet -r python --collect-model-data true -c aml_config\conda_dependencies.yml

In this specific case for keras-retinanet, we need to convert the model checkpoint file (which contains only the layers that were updated during training) to an inference model with all model layers present. This is done as follows:

keras_retinanet/bin/convert_model.py /path/to/training/model.h5 /path/to/save/inference/model.h5

Summary

In this blog post, we showed you how to do distributed deep learning using Horovod on Azure. We walked through training a RetinaNet object detector on the COCO dataset, with distributed training enabled through Horovod and Batch AI. In subsequent blogs, we will share results of our experiments with different models and training schemes. If you have questions or comments, please leave a message on our GitHub repository.

Mary, Xiaoyong, Siyu & Wee Hyong

Acknowledgements: We would like to thank Mathew Salvaris, Ilia Karmanov, and Miguel Fierro from Microsoft, who shared their insights into distributed training and Horovod.

Continue Reading…

### Technical Content Personalization

Part 3 of this series moves on from segmenting audiences to the technological side of the process.

Continue Reading…

### Five Careers to Consider for Data Enthusiasts

Do you get excited every time you have to crunch numbers? Do you love knowing how many times people use their smartphones every day? Are you fascinated by how businesses use data to make decisions? If you do, then you might be ideally suited for a career where you could
The post Five Careers to Consider for Data Enthusiasts appeared first on Dataconomy.

Continue Reading…

### Are banks ready for Payments Services Directive, Part 2 (PSD2)?
The arrival of the first pillar of the Payments Services Directive, Part 2 (PSD2) in January this year laid the groundwork for a more open banking system. It is set to transform the financial services industry in the European Union (EU) by putting the customers in control of both their
The post Are banks ready for Payments Services Directive, Part 2 (PSD2)? appeared first on Dataconomy.

Continue Reading…

### The 5 Clustering Algorithms Data Scientists Need to Know

Today, we're going to look at 5 popular clustering algorithms that data scientists need to know, and their pros and cons!

Continue Reading…

### Deep Mesh Projectors for Inverse Problems - implementation -

Ivan just let me know of the following instance of the Great Convergence:

Dear Igor,
A few weeks ago you featured two interesting papers that use random projections to train robust convnets (http://nuit-blanche.blogspot.com/2018/05/adversarial-noise-layer-regularize.html). I wanted to let you know about our related work that is a bit different in spirit: we learn to solve severely ill-posed inverse problems by learning to reconstruct low-dimensional projections of the unknown model instead of the full model. When we choose the low-dimensional subspaces to be piecewise-constant on random meshes, the projected inverse maps are much simpler to learn (in terms of Lipschitz stability constants, say), leading to a comparably better behaved inverse. If you're interested, the paper is here: https://arxiv.org/abs/1805.11718 and the code here: https://github.com/swing-research/deepmesh
I would be grateful if you could advertise the work on Nuit Blanche.
Best wishes,
Ivan

Thanks Ivan!

We develop a new learning-based approach to ill-posed inverse problems. Instead of directly learning the complex mapping from the measured data to the reconstruction, we learn an ensemble of simpler mappings from data to projections of the unknown model into random low-dimensional subspaces.
We form the reconstruction by combining the estimated subspace projections. Structured subspaces of piecewise-constant images on random Delaunay triangulations allow us to address inverse problems with extremely sparse data and still get good reconstructions of the unknown geometry. This choice also makes our method robust against arbitrary data corruptions not seen during training. Further, it marginalizes the role of the training dataset, which is essential for applications in geophysics where ground-truth datasets are exceptionally scarce.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there!

Continue Reading…

### When does the quest for beauty lead science astray?

Under the heading, “please blog about this,” Shravan Vasishth writes:

This book by a theoretical physicist [Sabine Hossenfelder] is awesome. The book trailer is here. Some quotes from her blog:

“theorists in the foundations of physics have been spectacularly unsuccessful with their predictions for more than 30 years now.”

“Everyone is happily producing papers in record numbers, but I go around and say this is a waste of money. Would you give me a job? You probably wouldn’t. I probably wouldn’t give me a job either.”

“The story here isn’t that theorists have been unsuccessful per se, but that they’ve been unsuccessful and yet don’t change their methods.”

“And that’s the real story here: Scientists get stuck on unsuccessful methods.”

She deserves to be world famous.

I have no idea who deserves to be world famous, but Shravan’s email was intriguing enough to motivate me to follow the link and read Hossenfelder’s blog, which had some discussion of the idea that the quest for beauty in mathematical theories has led physics astray. Hossenfelder also draws some connections between the crisis in physics and the reproducibility crisis in social and behavioral science.
The two crises are different—in physics, the problem (as I see it from the outside) is that the theories are so complicated and so difficult to test with data (requiring extremely high energies, etc.), whereas in the human sciences many prominent theories are so ridiculous and so easy to refute that this creates an entirely different sort of crisis or panic. Hossenfelder writes, “In my community [in physics], it has become common to justify the publication of new theories by claiming the theories are falsifiable. But falsifiability is a weak criterion for a scientific hypothesis. It’s necessary, but certainly not sufficient, for many hypotheses are falsifiable yet almost certainly wrong.” Yup. Theories in the human sciences are typically too vague to ever be wrong, exactly; instead, they are set up to make predictions which, when falsified, are simply folded back into the theory (as in this recent example). Rather than say I think a particular social science theory is false, I prefer to say that I expect the phenomenon of interest to be highly variable, with effects that depend unpredictably on the context, hence trying to verify or estimate parameters of these theories using a black-box experimental approach will typically be hopeless. Again, I can’t really comment on how this works in physics. Back to beauty I will not try to define what is beauty in a scientific theory; instead I’ll point you toward Hossenfelder’s discussions in her blog. I’ll share a few impressions, though. Newton’s laws, relativity theory, quantum mechanics, classical electromagnetism, the second law of thermodynamics, the ideal gas law: these all do seem beautiful to me. At a lesser level, various statistical theories such as the central limit theorem, stable laws, the convergence of various statistical estimators, Bayes’ theorem, regression to the mean, they’re beautiful too. And I have a long list of statistics stories that I keep misplacing . . . 
they’re all beautiful, at a slightly lower level than the above theorems. I don’t want to pit theories against each other in a beauty contest; I’m just listing the above to acknowledge that I too think about beauty when constructing and evaluating theories. And I do see Hossenfelder’s point, that criteria of beauty do not always work when guiding research choices. Let me give some examples. They’re all in my own subfields of research within statistics and political science, so not really of general interest, but I bet you could come up with similar stories in other fields. Anyway, these are my examples where the quest for beauty can get in the way of progress: 1. Conjugate prior distributions. People used to use inverse-gamma(1,1) or inverse-gamma(0.001, 0.001) priors because of their seeming naturalness; it was only relatively recently realized that these priors embodied very strong information. Similarly with Jeffreys priors, noninformative priors, and other ideas out there that had mathematical simplicity and few if any adjustable parameters (hence, some “beauty” in the sense of physics models). It took me a while to see the benefit of weakly informative priors. The apparent ugliness of user-specified parameters gives real benefits in allowing one to include constraining information for regularization. That said, I do think that transforming to unit scale can make sense, and so, once we understood what we were doing, we could recover some of the lost beauty. 2. Predictive information criteria. AIC was fine under certain asymptotic limits and for linear models with flat priors but does not work in general; hence DIC, WAIC, and other alternatives. Aki and I spent a lot of time trying to figure out the right formula for effective number of parameters, and then we suddenly realized that there was no magic formula.
Freed from this alchemic goal, we were able to attack the problem of predictive model evaluation directly using leave-one-out cross-validation and leave the beautiful formulas behind. 3. Lots of other examples: R-hat, R-squared, all sorts of other things. Sometimes there’s a beautiful formula, sometimes not. Using beauty as a criterion is not so terrible, as long as you realize that sometimes the best solution, at least for now, is not so beautiful. 4. Five ways to write the same model. That’s the title of section 12.5 of my book with Jennifer, and it represents a breakthrough for us: after years spent trying to construct the perfect general notation, we realized the (obvious in retrospect) point that different notations will make sense for different problems. And this in turn freed us when writing our new book to be even more flexible with our notation for regression. 5. Social science theories. Daniel Drezner once memorably criticized “piss-poor monocausal social science”—but it’s my impression that, to many people who have not thought seriously about the human sciences, monocausal explanations are more beautiful. A naive researcher might well think that there’s something clean and beautiful about the theory that women are much more likely to support Barack Obama during certain times of the month, or that college students with fat arms are more likely to support redistribution of income, without realizing the piranha problem that these monocausal theories can’t coexist in a single world. In this case, it’s an unsophisticated quest for beauty that’s leading certain fields of science astray—so this is different than physics, where the ideas of beauty are more refined (I mean it, I’m not being sarcastic here; actually this entire post and just about everything I write is sincere; I’m just emphasizing my sincerity right here because I’m thinking that there’s something about talking about a “refined” idea of beauty that might sound sarcastic, and it’s not). 
But maybe it’s another version of the same impulse. Why care about beauty in theories? Let’s try to unpack this a bit. Constructing and choosing theories based on mathematical beauty and simplicity has led to some famous successes, associated with the names of Copernicus, Kepler, Newton, Laplace, Hamilton, Maxwell, Einstein, Dirac, and Gell-Mann—just to name a few! Or, to move away from physics, there’s Pasteur, Darwin, Galton, Smith, Ricardo, etc. The so-called modern synthesis in biology reconciling Mendelian inheritance and natural selection—that’s beautiful. The unification of various laws of chemistry based on quantum mechanics: again, a seeming menagerie is reframed as the product of an underlying simple structure. Then, more recently, the quest for mathematical beauty has led to some dead ends, as discussed above. From a sociology-of-science perspective, that makes sense: if a method yields success, you keep pushing it to its limits until it stops working. The question remains: Is there some fundamental reason why it can make sense to prefer beautiful theories, beyond slogans such as “Occam’s razor” (which I hate; see here and here) or “the unreasonable effectiveness of mathematics” or whatever? I think there is. Here’s how I see it. Rather than focusing on beautiful theories that make us happy, let’s consider theories that are not so beautiful and make us uncomfortable. Where does this discomfort come from? Putting on my falsificationist Bayesian hat for a moment (actually, I never took it off; I even sleep with it on!), I’d say the discomfort has to come from some clash of knowledge, some way in which the posterior distribution corresponding to our fitted model is not in accord with some prior belief we have. Similar to the Why ask why? problem in social science. But where exactly is this happening? Let’s think about some examples.
In astronomy, the Copernican or Keplerian system is more pleasant to contemplate than an endless set of spinning epicycles. In economics, the invisible hand of Smith etc. seems like a better explanation, overall, than “the benevolence of the butcher, the brewer, or the baker.” Ugly theories are full of moving parts that all have to fit together just right to work. In contrast, beautiful theories are more self-sustaining. The problem is that the world is complicated, and beautiful theories only explain part of what we see. So we’re always in some intermediate zone, where our beautiful theories explain some stylized facts about the world, but some ugliness is needed to explain the rest. Continue Reading…

### Don’t let your ethical judgement go to sleep

We need to build organizations that are self-critical and avoid corporate self-deception. Brian LaRossa's article “Questioning Graphic Design’s Ethicality” is an excellent discussion of ethics and design that pays particular attention to designers' professional environments. He's particularly good on the power and dependency relationships between designers and their employers. While programmers and data scientists don't work under the same conditions as graphics designers, most of what LaRossa writes should be familiar to anyone involved with product development. In untangling the connection between employment, power, design, and ethics, LaRossa points to “Ethical Fading: The Role of Self-Deception in Unethical Behavior,” by Ann Tenbrunsel and David Messick. Tenbrunsel and Messick write: Codes of conduct have in some cases produced no discernible difference in behavior. Efforts designed to reduce unethical behavior are therefore best directed on the sequence leading up to the unethical action. This is similar to a point that DJ Patil, Hilary Mason, and I are making in an upcoming report. Oaths and codes of conduct rarely change the way a person acts.
If we are going to think seriously about data ethics, we need tools, such as checklists, that force us to engage with ethical issues as we're working on a project. But Tenbrunsel and Messick are making a deeper point. The "sequence leading up to the unethical action" isn't just the product development process. An "unethical project" doesn't jump out from behind a tree and attack you. It's rare for everything to be going just fine, and then your manager says, "Hey, I want you to do something evil." Rather, unethical action is the result of a series of compromises that started long before the action. It's not a single bad decision; it's the endpoint of a chain of decisions, none of which were terribly bad or explicitly evil. Their point isn't that bad people make ethical compromises. The point is that good people make these compromises. Many motivations cause ethics to fade into the background. If what you're asked to do at any stage isn't obviously unethical, you're likely to go ahead with it. You're likely to go ahead with it for any number of reasons: you would rather not have a confrontation with management, you don't trust assurances that participation in the project is a choice, you find the project technically interesting, you have friends working on the project. Developers, whether they're programmers, data scientists, or designers, are dependent on their employers and want to move ahead in their careers. You don't go ahead with an unethical project "against your better judgement"; you go ahead with it without judgement because these other motivations have suspended your judgement. This is what Tenbrunsel and Messick call the "ethical fade." 
They go on to make many excellent points: about the use of euphemism to disguise ethical problems (words like "externalities" and "collateral damage" hide the consequences of a decision); the "slippery slope" nature of these problems ("this new project is similar to the one we did last year, which wasn't so bad"); the influence of self-interest in post mortems; and the way different contexts can affect how an ethical decision is perceived. I strongly recommend reading their paper. But it's even more important to think about our own histories and to become aware of how we have put our own ethical sensibilities to sleep. It's easy to imagine how the "ethical fade" takes place. A group of developers might form a startup around a new technology for face recognition. It's an interesting and challenging problem, with many applications. But they can't build a business model around tagging family snapshots. So, they start accepting advertising and make some deals with advertisers around personal information, perhaps selling contact information for potential customers who are using the advertiser's product at a party. That's questionable, but easy to ignore. There are plenty of advertising-based businesses, and that's how companies without business models support themselves. Then the stakes grow: a lucrative opportunity appears to combine face recognition with other tracking technologies, perhaps in the context of a dating app. That sounds like fun, right? But without asking a lot of serious questions about how the app is used, and who will use it, it's a gateway for stalkers. Another version of the app could be used to track protestors and political dissidents, or to target individuals for "fake news" based on whom they're associating with. Applications like this aren't built because people start out to be evil. And they're not built because face recognition inevitably leads to ethical disaster. 
They're built because questions aren't being asked along the way, in large part because they weren't asked at the beginning. Everyone started out with the best of intentions. Finding a business model for the would-be unicorn startup was more pressing than the possibility that their app would harm someone. Nobody wanted to question their friends when a potential client wanted to give them a big check. There were undoubtedly many smaller decisions along the way: the language they used when talking about selling data, hiring for cultural fit, and creating a group monoculture that didn't have anyone sensitive to the risks. The developers aren't evil; they've just put their ethical judgement to sleep ("broad is the way that leads to destruction"). Ultimately, the only way to prevent self-deception is to recognize its pervasive and universal presence. We have to learn to become self-critical, and to realize that our motives and actions are almost never pure. But more than that, we need to build organizations that are self-critical and avoid corporate self-deception. Those organizations will use tools like checklists to ensure that ethical issues are discussed at every stage of a product's life. They will take time for ethical questioning and create space for employees to question decisions—even to stop production when they see unanticipated problems appearing. That is the biggest challenge we face: how do we address our own tendency to self-deception and, beyond that, how do we start, encourage, and maintain an ethical conversation within our organizations? It's not an easy task, but it may be the most important task we face if we're going to practice data science and software development responsibly. Continue Reading…

### Top 12 Essential Command Line Tools for Data Scientists

This post is a short overview of a dozen Unix-like operating system command line tools which can be useful for data science tasks.
The list does not include any general file management commands (pwd, ls, mkdir, rm, ...) or remote session management tools (rsh, ssh, ...), but is instead made up of utilities which would be useful from a data science perspective, generally those related to varying degrees of data inspection and processing. They are all included within a typical Unix-like operating system as well. It is admittedly elementary, but I encourage you to seek out additional command examples where appropriate. Tool names link to Wikipedia entries as opposed to man pages, as the former are generally more friendly to newcomers, in my view.

1. wget

wget is a file retrieval utility, used for downloading files from remote locations. In its most basic form, wget is used as follows to download a remote file:

~$ wget https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv

--2018-03-20 18:27:21--  https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.20.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3716 (3.6K) [text/plain]
Saving to: ‘iris.csv’

iris.csv            100%[=======================================================================================================>]   3.63K  --.-KB/s    in 0s

2018-03-20 18:27:21 (19.9 MB/s) - ‘iris.csv’ saved [3716/3716]

2. cat

cat is a tool for outputting file contents to the standard output. The name comes from concatenate.

More complex use cases include combining files together (actual concatenation), appending file(s) to another, numbering file lines, and more.
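For instance, actual concatenation and line numbering can be sketched as follows (the two throwaway files here are invented purely for illustration):

```shell
# create two tiny one-row CSV fragments to combine
printf '5.1,3.5,1.4,0.2,setosa\n' > part1.csv
printf '6.3,3.3,6,2.5,virginica\n' > part2.csv

# actual concatenation: join the fragments into a single file
cat part1.csv part2.csv > combined.csv

# -n prefixes each output line with its line number
cat -n combined.csv
```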

~$ cat iris.csv

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa
...
6.7,3,5.2,2.3,virginica
6.3,2.5,5,1.9,virginica
6.5,3,5.2,2,virginica
6.2,3.4,5.4,2.3,virginica
5.9,3,5.1,1.8,virginica

3. wc

The wc command is used for producing word counts, line counts, byte counts, and related statistics from text files. The default output for wc, when run without options, is a single line consisting of, left to right, line count, word count (note that a string without breaks is counted as a single word, so each comma-separated row here counts as one word), character count, and filename(s).

~$ wc iris.csv

151  151 3716 iris.csv
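In practice you usually want wc -l for a quick row count, often excluding the CSV header; a minimal sketch (the tiny file is created inline so the example stands alone):

```shell
# build a tiny CSV: one header line plus three data rows
printf 'sepal_length,species\n5.1,setosa\n4.9,setosa\n6.3,virginica\n' > mini.csv

# total line count, header included
wc -l < mini.csv

# data rows only: drop the header with tail, then count
tail -n +2 mini.csv | wc -l
```

This prints 4 and then 3: tail -n +2 starts output at line 2, discarding the header row.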

4. head

head outputs the first n lines of a file (10, by default) to standard output. The number of lines displayed can be set with the -n option.

~$ head -n 5 iris.csv

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa

5. tail

Any guesses as to what tail does?

~$ tail -n 5 iris.csv

6.7,3,5.2,2.3,virginica
6.3,2.5,5,1.9,virginica
6.5,3,5.2,2,virginica
6.2,3.4,5.4,2.3,virginica
5.9,3,5.1,1.8,virginica
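head and tail also compose: piping one into the other pulls out a middle slice of a file. A small sketch on a generated file:

```shell
# make a 10-line file, one number per line
seq 1 10 > nums.txt

# lines 4 through 6: keep the first 6 lines, then the last 3 of those
head -n 6 nums.txt | tail -n 3
```

This prints 4, 5, and 6, one per line.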


6. find

find is a utility for searching the file system for particular files.

The following searches the tree structure starting in the current directory (".") for any file starting with "iris" and ending in any number of characters ("-name 'iris*'") of regular file type ("-type f"):

~$ find . -name 'iris*' -type f

./iris.csv
./notebooks/kmeans-sharding-init/sharding/tests/results/iris_time_results.csv
./notebooks/ml-workflows-python-scratch/iris_raw.csv
./notebooks/ml-workflows-python-scratch/iris_clean.csv
...

7. cut

cut is used for slicing out sections of a line of text from a file. While these slices can be made using a variety of criteria, cut can be useful for extracting columnar data from CSV files. This outputs the fifth column ("-f 5") of the iris.csv file using the comma as field delimiter ("-d ','"):

~$ cut -d ',' -f 5 iris.csv

species
setosa
setosa
setosa
...
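cut also accepts a comma-separated field list, which is handy for keeping a few columns together (fields always come out in file order, regardless of the order you list them). A sketch on a tiny inline file:

```shell
# two rows in the iris layout
printf 'sepal_length,sepal_width,petal_length,petal_width,species\n5.1,3.5,1.4,0.2,setosa\n' > mini.csv

# keep only the first and fifth columns
cut -d ',' -f 1,5 mini.csv
```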

8. uniq

uniq modifies the output of text files to standard output by collapsing identical consecutive lines into a single copy. On its own, this may not seem too terribly interesting, but when used to build pipelines at the command line (piping the output of one command into the input of another, and so on), this can become useful.
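One caveat: because only consecutive identical lines are collapsed, unsorted input is normally piped through sort first. A minimal sketch:

```shell
# duplicates that are not adjacent
printf 'b\na\nb\na\n' > labels.txt

# uniq alone sees no adjacent repeats, so nothing collapses...
uniq -c labels.txt

# ...sorting first groups the duplicates, and -c counts them
sort labels.txt | uniq -c
```

(The iris example below gets away without sort only because the file is already grouped by species.)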

The following gives us a unique count of the iris dataset class names held in the fifth column, along with their counts:

~$ tail -n 150 iris.csv | cut -d "," -f 5 | uniq -c

50 setosa
50 versicolor
50 virginica

9. awk

awk isn't actually a "command," but is instead a full programming language. It is meant for processing and extracting text, and can be invoked from the command line in single-line command form. Mastery of awk would take some time, but until then here is a sample of what it can accomplish. Considering that our sample file — iris.csv — is rather limited (especially when it comes to diversity of text), this line will invoke awk, search for the string "setosa" within the given file ("iris.csv"), and print to standard output, one by one, the lines in which it is encountered (each held in the $0 variable):

~$ awk '/setosa/ { print $0 }' iris.csv

5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa
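awk is also handy for quick numeric summaries. As a sketch, the following averages a numeric column while skipping the header; the tiny single-column file is invented for illustration:

```shell
# a one-column CSV: header plus three values
printf 'sepal_length\n5.0\n6.0\n7.0\n' > mini.csv

# NR > 1 skips the header line; the END block runs after the last line
awk -F ',' 'NR > 1 { sum += $1; n++ } END { print sum / n }' mini.csv
```

This prints 6, the mean of the three values.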

10. grep

grep is another text processing tool, this one for string and regular expression matching.
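Beyond plain substrings, grep takes regular expressions, and -c counts matching lines rather than printing them; a small sketch on a tiny inline file:

```shell
# three rows to search
printf '5.1,setosa\n6.3,virginica\n5.8,virginica\n' > mini.csv

# -c: count matching lines instead of printing them
grep -c 'virginica' mini.csv

# -E: extended regular expressions; lines whose first field is 5.something
grep -E '^5\.[0-9]' mini.csv
```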

~$ grep -i "vir" iris.csv

6.3,3.3,6,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3,5.9,2.1,virginica
...

If you spend much time doing text processing at the command line, grep is definitely a tool you will get to know well. See here for some more useful details.

11. sed

sed is a stream editor, yet another text processing and transformation tool, similar to awk. Let's use it below to change the occurrences of "setosa" in our iris.csv file to "iris-setosa," using this line:

~$ sed 's/setosa/iris-setosa/g' iris.csv > output.csv
~$ head output.csv

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,iris-setosa
4.9,3,1.4,0.2,iris-setosa
4.7,3.2,1.3,0.2,iris-setosa
...

12. history

history is pretty straightforward, but also pretty useful, especially if you're depending on replicating some data preparation you accomplished at the command line.

~$ history

547  tail iris.csv
548  tail -n 150 iris.csv
549  tail -n 150 iris.csv | cut -d "," -f 5 | uniq -c
550  clear
551  history


And there you have a simple introduction to 12 handy command line tools. This is only a taste of what is possible at the command line for data science (or any other goal, for that matter). Free yourself from the mouse and watch your productivity increase.

Editor's note: This was originally posted on KDNuggets, and has been reposted with permission. Author Matthew Mayo is a Machine Learning Researcher and the Editor of KDnuggets.