# My Data Science Blogs

## December 16, 2017

### What’s new on arXiv

In this paper, we explore and compare multiple solutions to the problem of data augmentation in image classification. Previous work has demonstrated the effectiveness of data augmentation through simple techniques, such as cropping, rotating, and flipping input images. We artificially constrain our access to data to a small subset of the ImageNet dataset, and compare each data augmentation technique in turn. One of the more successful data augmentation strategies is the traditional transformations mentioned above. We also experiment with GANs to generate images of different styles. Finally, we propose a method to allow a neural net to learn augmentations that best improve the classifier, which we call neural augmentation. We discuss the successes and shortcomings of this method on various datasets.
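As a rough illustration (my own sketch, not the authors’ code), the traditional transformations mentioned here, cropping, rotating, and flipping, can be written in a few lines of numpy:

```python
import numpy as np

def augment(image, rng):
    """Apply one traditional augmentation at random: crop, rotate, or flip."""
    choice = rng.integers(3)
    if choice == 0:
        # random crop to 3/4 of the original size
        h, w = image.shape[:2]
        ch, cw = 3 * h // 4, 3 * w // 4
        top = rng.integers(h - ch + 1)
        left = rng.integers(w - cw + 1)
        return image[top:top + ch, left:left + cw]
    if choice == 1:
        # rotate by a random multiple of 90 degrees
        return np.rot90(image, k=rng.integers(1, 4))
    # horizontal flip
    return image[:, ::-1]

img = np.arange(64).reshape(8, 8)   # stand-in for a real image
augmented = [augment(img, np.random.default_rng(s)) for s in range(5)]
```

In a real pipeline these would be applied on the fly during training, so each epoch sees slightly different versions of the same images.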
In this paper, the ellipsoid method for linear programming is derived using only minimal knowledge of algebra and matrices. Unfortunately, most authors first describe the algorithm and only later prove its correctness, which requires a good knowledge of linear algebra.
Recently there has been a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for representation learning and classification. However, the mathematical reasons for this success remain elusive. This tutorial will review recent work that aims to provide a mathematical justification for several properties of deep networks, such as global optimality, geometric stability, and invariance of the learned representations.
We prove characterization theorems for relative entropy (also known as Kullback-Leibler divergence), q-logarithmic entropy (also known as Tsallis entropy), and q-logarithmic relative entropy. All three have been characterized axiomatically before, but we show that earlier proofs can be simplified considerably, at the same time relaxing some of the hypotheses.
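For concreteness, here is a minimal sketch (my own, not from the paper) of the three quantities being characterized; note that the q-logarithmic versions recover the ordinary ones as q approaches 1:

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def q_entropy(p, q):
    """q-logarithmic (Tsallis) entropy: (1 - sum_i p_i^q) / (q - 1)."""
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

def q_relative_entropy(p, r, q):
    """q-logarithmic relative entropy: (1 - sum_i p_i^q r_i^(1-q)) / (1 - q)."""
    return (1.0 - sum(pi ** q * ri ** (1.0 - q)
                      for pi, ri in zip(p, r))) / (1.0 - q)
```

For a fair coin, q_entropy with q near 1 is close to the Shannon entropy ln 2, and q_relative_entropy with q near 1 is close to the Kullback-Leibler divergence.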
Deep Learning (DL) aims at learning \emph{meaningful representations}. A meaningful representation is one that yields a significant performance improvement in the associated Machine Learning (ML) task when it replaces the raw data as input. However, optimal architecture design and model parameter estimation in DL algorithms are widely considered to be intractable. Evolutionary algorithms are well suited to complex, non-convex problems because they are gradient-free and insensitive to local optima. In this paper, we propose a computationally economical algorithm for evolving \emph{unsupervised deep neural networks} to efficiently learn \emph{meaningful representations}, which is well suited to the current Big Data era, where sufficient labeled data for training is often expensive to acquire. In the proposed algorithm, finding an appropriate architecture and initial parameter values for the ML task at hand is modeled by a computationally efficient gene encoding approach, which can effectively model tasks with large numbers of parameters. In addition, a local search strategy is incorporated to facilitate exploitation and further improve performance. Furthermore, a small proportion of labeled data is used during the evolutionary search to ensure that the learnt representations are meaningful. The performance of the proposed algorithm has been thoroughly investigated on classification tasks. Specifically, the proposed algorithm consistently reaches a classification error rate of $1.15\%$ on MNIST, which is a very promising result against state-of-the-art unsupervised DL algorithms.
Recent studies have discovered that deep networks are capable of memorizing the entire data even when the labels are completely random. Since deep models are trained on big data where labels are often noisy, the ability to overfit noise can lead to poor performance. To overcome the overfitting on corrupted training data, we propose a novel technique to regularize deep networks in the data dimension. This is achieved by learning a neural network called MentorNet to supervise the training of the base network, namely, StudentNet. Our work is inspired by curriculum learning and advances the theory by learning a curriculum from data by neural networks. We demonstrate the efficacy of MentorNet on several benchmarks. Comprehensive experiments show that it is able to significantly improve the generalization performance of the state-of-the-art deep networks on corrupted training data.
Advanced computing and data acquisition technologies have made possible the collection of high-dimensional data streams in many fields. Efficient online monitoring tools which can correctly identify any abnormal data stream for such data are highly sought after. However, most of the existing monitoring procedures directly apply the false discovery rate (FDR) controlling procedure to the data at each time point, and the FDR at each time point (the point-wise FDR) is either specified by users or determined by the in-control (IC) average run length (ARL). If the point-wise FDR is specified by users, the resulting procedure lacks control of the global FDR and keeps users in the dark in terms of the IC-ARL. If the point-wise FDR is determined by the IC-ARL, the resulting procedure does not give users the flexibility to choose the number of false alarms (Type-I errors) they can tolerate when identifying abnormal data streams, which often makes the procedure too conservative. To address those limitations, we propose a two-stage monitoring procedure that can control both the IC-ARL and Type-I errors at the levels specified by users. As a result, the proposed procedure allows users to choose not only how often they expect any false alarms when all data streams are IC, but also how many false alarms they can tolerate when identifying abnormal data streams. With this extra flexibility, our proposed two-stage monitoring procedure is shown in the simulation study and real data analysis to outperform the existing methods.
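For readers unfamiliar with FDR control, the standard per-time-point building block such procedures rely on is the Benjamini-Hochberg step-up rule; a minimal sketch (this is the generic rule, not the authors’ two-stage procedure):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected under BH FDR control at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value clears its step-up threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

# one "time point" with five data streams
rejected = benjamini_hochberg([0.01, 0.04, 0.03, 0.005, 0.8])
```

Applied naively at every time point, this controls only the point-wise FDR, which is exactly the limitation the abstract describes.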
With the advent of the Internet, a large amount of digital text is generated every day in the form of news articles, research publications, blogs, question answering forums and social media. It is important to develop techniques for extracting information automatically from these documents, as a lot of important information is hidden within them. This extracted information can be used to improve access and management of knowledge hidden in large text corpora. Several applications such as Question Answering and Information Retrieval would benefit from this information. Entities like persons and organizations form the most basic unit of the information. Occurrences of entities in a sentence are often linked through well-defined relations; e.g., occurrences of a person and an organization in a sentence may be linked through relations such as employed at. The task of Relation Extraction (RE) is to identify such relations automatically. In this paper, we survey several important supervised, semi-supervised and unsupervised RE techniques. We also cover the paradigms of Open Information Extraction (OIE) and Distant Supervision. Finally, we describe some of the recent trends in RE techniques and possible future research directions. This survey would be useful for three kinds of readers: i) newcomers in the field who want to quickly learn about RE; ii) researchers who want to know how the various RE techniques evolved over time and what the possible future research directions are; and iii) practitioners who just need to know which RE technique works best in various settings.
Deep learning with 3D data such as reconstructed point clouds and CAD models has received great research interest recently. However, the capability of using point clouds with convolutional neural networks has so far not been fully explored. In this technical report, we present a convolutional neural network for semantic segmentation and object recognition with 3D point clouds. At the core of our network is point-wise convolution, a convolution operator that can be applied at each point of a point cloud. Our fully convolutional network design, while being simple to implement, can yield competitive accuracy in both semantic segmentation and object recognition tasks.

### Burn-in for MCMC, why we prefer the term warm-up

Here’s what we say on p.282 of BDA3:

In the simulation literature (including earlier editions of this book), the warm-up period is called burn-in, a term we now avoid because we feel it draws a misleading analogy to industrial processes in which products are stressed in order to reveal defects. We prefer the term ‘warm-up’ to describe the early phase of the simulations in which the sequences get closer to the mass of the distribution.

Stan does adaptation during the warm-up phase.

### If you did not already know

Stacked Kernel Network (SKN)
Kernel methods are powerful tools to capture nonlinear patterns behind data. They implicitly learn high (even infinite) dimensional nonlinear features in the Reproducing Kernel Hilbert Space (RKHS) while making the computation tractable by leveraging the kernel trick. Classic kernel methods learn a single layer of nonlinear features, whose representational power may be limited. Motivated by the recent success of deep neural networks (DNNs) that learn multi-layer hierarchical representations, we propose a Stacked Kernel Network (SKN) that learns a hierarchy of RKHS-based nonlinear features. SKN interleaves several layers of nonlinear transformations (from a linear space to a RKHS) and linear transformations (from a RKHS to a linear space). Similar to DNNs, a SKN is composed of multiple layers of hidden units, but each is parameterized by a RKHS function rather than a finite-dimensional vector. We propose three ways to represent the RKHS functions in SKN: (1) nonparametric representation, (2) parametric representation and (3) random Fourier feature representation. Furthermore, we expand SKN into a CNN-style architecture called the Stacked Kernel Convolutional Network (SKCN). SKCN learns a hierarchy of RKHS-based nonlinear features by convolution, with each filter also parameterized by a RKHS function rather than a finite-dimensional matrix as in a CNN, which makes it suitable for image inputs. Experiments on various datasets demonstrate the effectiveness of SKN and SKCN, which outperform competitive methods. …
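Of the three representations, random Fourier features are the easiest to sketch: draw random frequencies so that an inner product of cosine features approximates an RBF kernel. This is the generic Rahimi-Recht construction, not the paper’s code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 3, 2000, 1.0   # input dim, number of random features, bandwidth

W = rng.normal(0.0, 1.0 / sigma, size=(D, d))  # random frequencies
b = rng.uniform(0.0, 2 * np.pi, size=D)        # random phases

def rff(x):
    """Map x so that rff(x) @ rff(y) approximates the RBF kernel k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x = np.array([0.1, 0.2, 0.3])
y = np.array([0.3, 0.1, 0.0])
approx = rff(x) @ rff(y)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
```

The approximation error shrinks like one over the square root of D, which is why a few thousand features usually suffice.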

TauCharts
Javascript charts with a focus on data, design and flexibility. Free open source D3.js-based library. TauCharts is a data-focused charting library. Our goal is to help people build complex interactive visualizations easily.
Achieve Charting Zen With TauCharts

AOGParsing Operator
This paper presents a method of learning qualitatively interpretable models in object detection using popular two-stage region-based ConvNet detection systems (i.e., R-CNN). R-CNN consists of a region proposal network and a RoI (Region-of-Interest) prediction network. By interpretable models, we focus on weakly-supervised extractive rationale generation, that is, learning to unfold latent discriminative part configurations of object instances automatically and simultaneously in detection without using any supervision for part configurations. We utilize a top-down hierarchical and compositional grammar model embedded in a directed acyclic AND-OR Graph (AOG) to explore and unfold the space of latent part configurations of RoIs. We propose an AOGParsing operator to substitute the RoIPooling operator widely used in R-CNN, so the proposed method is applicable to many state-of-the-art ConvNet based detection systems. The AOGParsing operator aims to harness both the explainable rigor of top-down hierarchical and compositional grammar models and the discriminative power of bottom-up deep neural networks through end-to-end training. In detection, a bounding box is interpreted by the best parse tree derived from the AOG on-the-fly, which is treated as the extractive rationale generated for interpreting detection. In learning, we propose a folding-unfolding method to train the AOG and ConvNet end-to-end. In experiments, we build on top of the R-FCN and test the proposed method on the PASCAL VOC 2007 and 2012 datasets with performance comparable to state-of-the-art methods. …

### Science and Technology links (December 15th, 2017)

1. Scientists found a human gene which, when inserted into mice, makes their brain grow larger. David Brin has a series of classical sci-fi books where we “uplift” animals so that they become as smart as we are. Coincidence? I think not.
2. Should we be more willing to accept new medical therapies? Are we too cautious? Some people think so:

Sexual reproduction, had it been invented rather than evolved would never have passed any regulatory body’s standards (John Harris)

3. Apple has released the iMac Pro. It is massively expensive (from 5k$ to well over 10k$). It comes with a mysterious co-processor named T2 which is rumored to be a powerful ARM processor derived from an iPhone processor. It can encrypt your data without performance penalty.
4. Playing video games can make older people smarter? You might think so after reading the article Playing Super Mario 64 increases hippocampal grey matter in older adults.
5. At a high level, technologists like to point out that technology improves at an exponential rate. A possible mechanism is that the more sophisticated you are, the faster you can improve technology. The exponential curve is a robust result: just look at per capita GDP or the total number of pictures taken per year.

Many people like to point out that technology does not, strictly speaking, improve at an exponential rate. In practice, we experience plateaus when looking at any given technology.

Rob Miles tweeted a powerful counterpoint:

This concern about an ‘explosion’ is absurd. Yes, the process looks exponential, but it’s bounded – every real world exponential is really just the start of a sigmoid, it will have to plateaux. There’s only a finite amount of plutonium in the device (…) Explosives already exist, so nukes aren’t very concerning

His point, in case you missed it, is that it is easy to rationally dismiss potentially massive disruptions as necessarily “small” in some sense.

6. Gene therapies are a bit scary. Who wants to get his genetic code played with? Some researchers suggest that we could accomplish a lot simply by activating or turning off genes using a variation on the technology currently used to modify genes (e.g., CRISPR/Cas9).
7. Why do Middle Eastern girls crush boys in school?

A boy doesn’t need to study hard to have a good job. But a girl needs to work hard to get a respectable job.

8. Google has this cool feature whereby it automatically catalogs celebrities and displays their biographical information upon request. If you type my name in Google, right now, my picture should come up. However, the biographical information is about someone else (I am younger than 62). To make matters worse, my name comes up along with a comedian (Salvail) who was recently part of a sexual scandal. Maybe it is a warning that you should not take everything Google says as the truth? But we knew this already, didn’t we?

In case you want to dig deeper into the problem… “Daniel Lemire” is also the name of a somewhat famous Canadian comedian. I think we look nothing alike and we have had entirely distinct careers. It should be trivial for machine learning to distinguish us.

## December 15, 2017


### R Packages worth a look

Robust Testing in GLMs (flipscores)
Provides two robust tests for testing in GLMs, by sign-flipping score contributions. The tests are often robust against overdispersion, heteroscedasticity and, in some cases, ignored nuisance variables. See Hemerik and Goeman (2017) <doi:10.1007/s11749-017-0571-1>.

Plot Population Demographic History (POPdemog)
Plot demographic graphs for single/multiple populations from coalescent simulation program input. Currently, this package can support the ‘ms’, ‘msHot’, ‘MaCS’, ‘msprime’, ‘SCRM’, and ‘Cosi2’ simulation programs. It does not check the simulation program input for correctness, but assumes the simulation program input has been validated by the simulation program. More features will be added to this package in the future, please check the ‘GitHub’ page for the latest updates: <https://…/POPdemog>.

Synthetic Population Generator (humanleague)
Generates high-entropy integer synthetic populations from marginal and (optionally) seed data using quasirandom sampling, in arbitrary dimensionality (Smith, Lovelace and Birkin (2017) <doi:10.18564/jasss.3550>). The package also provides an implementation of the Iterative Proportional Fitting (IPF) algorithm (Zaloznik (2011) <doi:10.13140/2.1.2480.9923>).
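The IPF algorithm this package implements is simple to sketch outside R as well: alternately scale rows and columns of a seed matrix until both margins match. A generic numpy illustration, unrelated to the package’s actual implementation:

```python
import numpy as np

def ipf(seed, row_targets, col_targets, iters=100):
    """Iterative proportional fitting: scale a seed matrix to match margins."""
    m = seed.astype(float).copy()
    for _ in range(iters):
        m *= (row_targets / m.sum(axis=1))[:, None]   # match row sums
        m *= (col_targets / m.sum(axis=0))[None, :]   # match column sums
    return m

seed = np.ones((2, 2))               # uninformative seed
rows = np.array([30.0, 70.0])        # target row margin
cols = np.array([40.0, 60.0])        # target column margin
fitted = ipf(seed, rows, cols)
```

With a uniform seed this converges to the independence table; an informative seed preserves the seed’s interaction structure while matching the margins.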

Proportional Hazards Mixed-Effects Model (PHMM) (phmm)
Fits a proportional hazards model incorporating random effects, using an EM algorithm with Markov chain Monte Carlo at the E-step. Vaida and Xu (2000) <doi:10.1002/1097-0258(20001230)19:24%3C3309::AID-SIM825%3E3.0.CO;2-9>.

Finding the Number of Significant Principal Components (PCDimension)
Implements methods to automate the Auer-Gervini graphical Bayesian approach for determining the number of significant principal components. Automation uses clustering, change points, or simple statistical models to distinguish ‘long’ from ‘short’ steps in a graph showing the posterior number of components as a function of a prior parameter.

High-Dimensional Variable Selection with Presence-Only Data (PUlasso)
Efficient algorithm for solving the PU (Positive and Unlabelled) problem in low or high dimensional settings with a lasso or group lasso penalty. The algorithm uses Maximization-Minorization and (block) coordinate descent. Sparse calculation and parallel computing via ‘OpenMP’ are supported for computational speed-up. See Hyebin Song, Garvesh Raskutti (2017) <arXiv:1711.08129>.

### R data structures for Excel users

Introducing yourself to R as an Excel user can be tricky, especially when you don’t have much programming experience. It requires that you switch from one mental model of the data that exists in an interactive spreadsheet to one that exists in vectors and lists. Steph de Silva provides a translation of these data structures for Excel users.


### NIPS Conversation AI Workshop

I only attended NIPS for the Conversation AI workshop, so my thoughts are limited to that. I really liked the subtitle of the workshop: "today's practice and tomorrow's potential." Since I'm on a product team trying to build chatbots that are actually effective, it struck me as exactly the right tone.

Several presentations were related to the Alexa prize. When reading these papers, keep in mind that contestants were subject to extreme sample complexity constraints. Semifinalists had circa 500 on-policy dialogs and finalists less than 10 times more. This is because 1) the Alexa chat function is not the primary purpose of the device so not all end users participated and 2) they had to distribute the chats to all contestants.

The result of sample complexity constraints is a “bias against variance”, as I've discussed before. In the Alexa prize, that meant the winners had the architecture of “learned mixture over mostly hand-specified substrategies.” In other words, the (scarce) on-policy data was limited to adjusting the mixture weights. (The MILA team had substrategies that were trained unsupervised on forum data, but it looks like the other substrategies were providing most of the benefit.) Sample complexity constraints are pervasive in dialog, but nonetheless the conditions of the contest were more extreme than what I encounter in practice so if you find yourself with more on-policy data consider more aggressive usage.

Speaking of sample complexity constraints, we have found pre-training representations on MT tasks a la CoVE is extremely effective in practice for multiple tasks. We are now playing with ELMo-style pre-training using language modeling as the pre-training task (very promising: no parallel corpus needed!).

Another sample complexity related theme I noticed at the workshop was the use of functional role dynamics. Roughly speaking, this is modeling the structure of the dialog independent of the topic. Once topics are abstracted, the sample complexity of learning what are reasonably structured conversations seems low. Didericksen et al. combined a purely structural L1 model with a simple topically-sensitive L2 (tf-idf) to build a retrieval based dialog simulator. Analogously for their Alexa prize submission, Serban et al. learned a dialog simulator from observational data which utilized only functional role and sentiment information and then applied Q-learning: this was more effective than off-policy reinforce with respect to some metrics.
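To make the tf-idf retrieval component concrete, here is a toy sketch (my own, with made-up candidate responses, not any of the workshop systems): score candidates against the user’s utterance by tf-idf cosine similarity and return the best match.

```python
import math
from collections import Counter

# Hypothetical candidate responses for a retrieval-based dialog system.
responses = [
    "i love playing chess on weekends",
    "the weather has been cold lately",
    "chess strategy takes years",
]
docs = [r.split() for r in responses]

def idf(word):
    """Inverse document frequency over the candidate pool."""
    df = sum(word in d for d in docs)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(tokens):
    """Sparse tf-idf vector as a word -> weight dict."""
    return {w: c * idf(w) for w, c in Counter(tokens).items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(utterance):
    """Index of the candidate response most similar to the utterance."""
    q = tfidf(utterance.split())
    return max(range(len(docs)), key=lambda i: cosine(q, tfidf(docs[i])))
```

A purely structural model would then decide *when* a topical retrieval like this is appropriate, which is the division of labor described above.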

Overall the workshop gave me enough optimism to continue plugging away despite the underwhelming performance of current dialog systems.

### Building an Audio Classifier using Deep Neural Networks

Using a deep convolutional neural network architecture to classify audio and how to effectively use transfer learning and data-augmentation to improve model accuracy using small datasets.

### Transitioning to Data Science: How to become a data scientist, and how to create a data science team

"A good data scientist in my mind is the person that takes the science part in data science very seriously; a person who is able to find problems and solve them using statistics, machine learning, and distributed computing."

### The piranha problem in social psychology / behavioral economics: The “take a pill” model of science eats itself

A fundamental tenet of social psychology and behavioral economics, at least as it is presented in the news media, and as taught and practiced in many business schools, is that small “nudges,” often the sorts of things that we might not think would affect us at all, can have big effects on behavior. Thus the claims that elections are decided by college football games and shark attacks, or that the subliminal flash of a smiley face can cause huge changes in attitudes toward immigration, or that single women were 20% more likely to vote for Barack Obama, or three times more likely to wear red clothing, during certain times of the month, or that standing in a certain position for two minutes can increase your power, or that being subliminally primed with certain words can make you walk faster or slower, etc.

The model of the world underlying these claims is not just the “butterfly effect” that small changes can have big effects; rather, it’s that small changes can have big and predictable effects. It’s what I sometimes call the “button-pushing” model of social science, the idea that if you do X, you can expect to see Y. Indeed, we sometimes see the attitude that the treatment should work every time, so much so that any variation is explained away with its own story.

In response to this attitude, I sometimes present the “piranha argument,” which goes as follows: There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data.

The analogy is to a fish tank full of piranhas: it won’t take long before they eat each other.

An example

I recently came across an old post which makes the piranha argument pretty clearly in an example. I was talking about a published paper from 2013, “‘Black and White’ thinking: Visual contrast polarizes moral judgment,” which fell in the embodied cognition category. The claim in that paper was that “incidental visual cues without any affective connotation can similarly shape moral judgment by priming a certain mindset . . . exposure to an incidental black and white visual contrast leads people to think in a ‘black and white’ manner, as indicated by more extreme moral judgments.”

The study had the usual statistical problem of forking paths so I don’t think it makes sense to take its empirical claims seriously. But that’s not where I want to go today. Rather, my point here is the weakness of the underlying theory, in light of all the many many other possible stories that have been advanced to explain attitudes and behavior.

Here’s what I wrote:

I don’t know whether to trust this claim, in light of the equally well-documented finding, “Blue and Seeing Blue: Sadness May Impair Color Perception.” Couldn’t the Zarkadi and Schnall result be explained by an interaction between sadness and moral attitudes? It could go like this: Sadder people have difficulty with color perception so they are less sensitive to the different backgrounds in the images in question. Or maybe it goes the other way: sadder people have difficulty with color perception so they are more sensitive to black-and-white patterns.

I’m also worried about possible interactions with day of the month for female participants, given the equally well-documented findings correlating cycle time with political attitudes and—uh oh!—color preferences. Again, these factors could easily interact with perceptions of colors and also moral judgment.

What a fun game! Anyone can play.

Hey—here’s another one. I have difficulty interpreting this published finding in light of the equally well-documented finding that college students have ESP. Given Zarkadi and Schnall’s expectations as stated in their paper, isn’t it possible that the participants in their study simply read their minds? That would seem to be the most parsimonious explanation of the observed effect.

Another possibility is the equally well-documented himmicanes and hurricanes effect—I could well imagine something similar with black-and-white or color patterns.

But I’ve saved the best explanation for last.

We can most easily understand the effect discovered by Zarkadi and Schnall in the context of the well-known smiley-face effect. If a cartoon smiley face flashed for a fraction of a second can create huge changes in attitudes, it stands to reason that a chessboard pattern can have large effects too. The game of chess, after all, was invented in Persia, and so it makes sense that being primed by a chessboard will make participants think of Iran, which in turn will polarize their thinking, with liberals and conservatives scurrying to their opposite corners. In contrast, a blank pattern or a colored grid will not trigger these chess associations.

Aha, you might say: chess may well have originated in Persia but now it’s associated with Russia. But that just bolsters my point! An association with Russia will again remind younger voters of scary Putin and bring up Cold War memories for the oldsters in the house: either way, polarization here we come.

In a world in which merely being primed with elderly-related words such as “Florida” and “bingo” causes college students to walk more slowly (remember, Daniel Kahneman told us “You have no choice but to accept that the major conclusions of these studies are true”), it is no surprise that being primed with a chessboard can polarize us.

I can already anticipate the response to the preregistered replication that fails: There is an interaction with the weather. Or with relationship status. Or with parents’ socioeconomic status. Or, there was a crucial aspect of the treatment that was buried in the 17th paragraph of the published paper but turns out to be absolutely necessary for this phenomenon to appear.

Or . . . hey, I have a good one: The recent nuclear accord with Iran and rapprochement with Russia over ISIS has reduced tension with those two chess-related countries, so this would explain a lack of replication in a future experiment.

I wrote the above in a silly way but my point is real: once you accept that all these large effects are out there, it becomes essentially impossible to interpret any claim—even from experimental data—as it can also be explained as an interaction of two previously-identified large effects.

Randomized experiment is not enough

Under the button-pushing model of science, there’s nothing better than a randomized experiment: it’s the gold standard! Really, though, there are two big problems with the sort of experimental data described above:

1. Measurement error. When measurements are noisy and biased, any patterns you see will not in general replicate—that is, type M and type S errors will be large. Meanwhile, forking paths allow researchers the illusion of success, over and over again, and enablers such as the editors of PNAS keep this work in the public eye.

2. Interactions. Even if you do unequivocally establish a treatment effect from your data, the estimate only applies to the population and scenario under study: psychology students in university X in May, 2017; or Mechanical Turk participants in May, 2017, asked about topic Y; etc. And in the “tank full of piranhas” context where just about anything can have a large effect—from various literatures, there’s menstrual cycle, birth order, attractiveness of parents, lighting in the room, subliminal smiley faces, recent college football games, parents’ socioeconomic status, outdoor temperature, names of hurricanes, the grid pattern on the edge of the survey form, ESP, the demographic characteristics of the experimenter, and priming on just about any possible stimulus. In this piranha-filled world, the estimate from any particular experiment is pretty much uninterpretable.
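Type M (magnitude) and type S (sign) errors are easy to demonstrate by simulation: with a small true effect and noisy measurements, the estimates that clear the significance bar are wildly exaggerated and sometimes have the wrong sign. A minimal sketch (the numbers are illustrative only):

```python
import random
import statistics

random.seed(1)
true_effect, se = 0.2, 1.0       # small true effect, noisy estimate
significant = []
for _ in range(20000):
    estimate = random.gauss(true_effect, se)   # one noisy study
    if abs(estimate) > 1.96 * se:              # reaches "significance"
        significant.append(estimate)

# Type M error: significant estimates exaggerate the true effect.
exaggeration = statistics.mean(abs(e) for e in significant) / true_effect
# Type S error: some significant estimates have the wrong sign entirely.
wrong_sign = sum(e < 0 for e in significant) / len(significant)
```

In this setup the significant estimates overstate the true effect many times over, which is exactly why noisy-but-significant findings fail to replicate.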

To put it another way: if you do one of these experiments and find a statistically significant pattern, it’s not enough for you to defend your own theory. You also have to make the case that just about everything else in the social psychology / behavioral economics literature is wrong. Cos otherwise your findings don’t generalize. But we don’t typically see authors of this sort of paper disputing the rest of the field: they all seem happy thinking of all this work as respectable.

I put this post in the Multilevel Modeling category because ultimately I think we should think about all these effects, or potential effects, in context. All sorts of confusion arise when thinking about them one step at a time. For just one more example, consider ovulation-and-voting and smiley-faces-and-political attitudes. Your time of the month could affect your attention span and your ability to notice or react to subliminal stimuli. Thus, any ovulation-and-voting effect could be explained merely as an interaction with subliminal images on TV ads, for example. And any smiley-faces-and-political-attitudes effect could, conversely, be explained as an interaction with changes in attitudes during the monthly cycle. I don’t believe any of these stories; my point is that if you really buy into the large-predictable-effects framework of social psychology, then it does not make sense to analyze these experiments in isolation.

P.S. Just to clarify: I don’t think all effects are zero. We inherit much of our political views from our parents; there’s also good evidence that political attitudes and voting behavior are affected by economic performance, candidate characteristics, and the convenience of voter registration. People can be persuaded by political campaigns to change their opinions, and attitudes are affected by events, as we’ve seen with the up-and-down attitudes on health care reform in recent decades. The aquarium’s not empty. It’s just not filled with piranhas.

### How To Write, Deploy, and Interact with Ethereum Smart Contracts on a Private Blockchain

Here are the rules: if you read this post all the way through, you have to deploy a smart contract on your private Ethereum blockchain yourself. I give you all the code I used here on GitHub, so you have no excuses not to.

But if you don’t follow the rules and you only want to read, hopefully this helps give a perspective of starting with nothing and ending with a blockchain app.

By the end, you’ll have started a local private Ethereum blockchain, connected two different nodes as peers, written and compiled a smart contract, and have a web interface that allows users to ask questions, deploy the questions on the blockchain, and then lets the users answer.

If you’re confused, run into an error, or want to say something else, go ahead and write a comment, get in contact, or say something on Twitter.

Oh, and here’s the GitHub repo, so go ahead and fork it (if you don’t want to copy-paste all the code here), and if you make updates you want to share, I’ll throw them in the README.

### Private Blockchain Creation

To create a single node, we need the following genesis.json, which represents the initial block on the private blockchain.

```
//genesis.json
{
  "alloc": {},
  "config": {
    "chainID": 72,
    "eip155Block": 0,
    "eip158Block": 0
  },
  "nonce": "0x0000000000000000",
  "difficulty": "0x4000",
  "mixhash": "0x0000000000000000000000000000000000000000000000000000000000000000",
  "coinbase": "0x0000000000000000000000000000000000000000",
  "timestamp": "0x00",
  "parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
  "gasLimit": "0xffffffff"
}
```

If you want a fairly full explanation of the fields, look at this Stack Overflow answer. The big ones in our case are difficulty, which is low because we don’t want to wait long for blocks to be mined on our test network, and gasLimit, which is high so that a single block can hold enough work to process every transaction we throw at it.
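Those two hex fields are easier to judge in decimal. A quick standalone Node check (my own sanity check, not code from the post) decodes them:

```javascript
// Decode the hex-encoded genesis fields to decimal, to make the
// "low difficulty, high gasLimit" claim concrete.
const genesis = {
  difficulty: "0x4000",
  gasLimit: "0xffffffff",
};

function hexToDecimal(hex) {
  // parseInt understands the "0x" prefix when given base 16.
  return parseInt(hex, 16);
}

console.log(hexToDecimal(genesis.difficulty)); // 16384
console.log(hexToDecimal(genesis.gasLimit));   // 4294967295
```

Mainnet difficulty is many orders of magnitude larger, which is why blocks arrive almost instantly on this private chain.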

Go ahead and open a terminal, make sure geth is installed in whatever way works for your OS, and then cd into the folder where you saved your genesis.json file. Running the following command will initialize the blockchain for this node.

```
$ geth --datadir "/Users/USERNAME/Library/PrivEth" init genesis.json
```

`--datadir` specifies where we want all the data for the blockchain to be located. On a Mac, the default is ~/Library/Ethereum. Since we have multiple nodes running, we can’t have them sharing the same data folder, so we specify one explicitly. Linux and Windows machines have different default datadirs, so take a look at those to see where they should be located in general.

After running this init command with the genesis.json file we want to use, go check out that `--datadir` directory. You’ll see a bunch of files, so feel free to poke around. Not necessary right now, but you’ll want to look around there eventually.

For this to be a blockchain, we need more than one node, and for nodes to become peers, they need the same genesis file. So we run the same command as above, from the same directory, but this time with a different datadir.

```
$ geth --datadir "/Users/USERNAME/Library/PrivEth2" init genesis.json
```

With all the code here, we’re going to be working in the same directory; the command line arguments are what keep the two processes separate.

Initializing the chain for both nodes.

When running geth with a different `--datadir`, you’ll be running separate nodes no matter where you ran the command from. Just remember to specify the `--datadir` each time so it doesn’t fall back to the default. Also note that I changed the names of these datadirs along the way, so watch out if you see different names in the screenshots.

### Opening the Consoles

So far, we’ve done three things: 1) created a genesis.json file in a working directory of your choosing, 2) picked a directory to store the blockchain for one node and initialized its first block, and 3) done the same for the other node in a different directory. Very little code and a few commands. The next step is to log into the geth console for each node.
The console will start the geth process and also give us a way to run web3 commands in the terminal.

```
$ geth --datadir "/Users/jackschultz/Library/EthPrivLocal" --networkid 72 --port 30301 --nodiscover console
```

There are a couple more options here.

• `--networkid` plays a similar role to chainID in the genesis.json file; all that matters is that we avoid network ids 1–4, which are reserved for the public networks.
• `--port` specifies the port this node listens on for peer connections (we’ll actually talk to the node from web3.js over its .ipc file, but each node still needs its own port). The default is 30303, so we’ll keep it in that area, but this is our first node, so 30301 it is.
• `--nodiscover` tells geth not to look for peers on its own. This is actually important in our case. This is a private network: we don’t want our nodes connecting to, or being discovered by, nodes we haven’t specified.

With the first geth node running, run the same command in a different terminal with the second `--datadir` and a different `--port`, and you’ll have both nodes running.

Starting the consoles.

### Creating Initial Coinbase Account for Each Node

With the console running from the command above, we want to create our main coinbase account. If you’re curious, I used the passphrase ‘passphrase’. You’ll see we need it in our Node app down the road.

```
> personal.listAccounts
[]
> personal.newAccount()
Passphrase:
Repeat passphrase:
"0x538341f72db4b64e320e6c7c698499ca68a6880c"
> personal.listAccounts
["0x538341f72db4b64e320e6c7c698499ca68a6880c"]
```

Run the same commands in the other node’s console as well. Create that new account.
Since this is the first account this node has created, you’ll see it’s also the coinbase:

```
> eth.coinbase
"0x538341f72db4b64e320e6c7c698499ca68a6880c"
```

Another piece of information you can grab on the console:

```
> personal.listWallets
[{
  accounts: [{
    address: "0x538341f72db4b64e320e6c7c698499ca68a6880c",
    url: "keystore:///Users/jackschultz/Library/EthPrivLocal/keystore/UTC--2017-12-09T16-21-48.056824000Z--538341f72db4b64e320e6c7c698499ca68a6880c"
  }],
  status: "Locked",
  url: "keystore:///Users/jackschultz/Library/EthPrivLocal/keystore/UTC--2017-12-09T16-21-48.056824000Z--538341f72db4b64e320e6c7c698499ca68a6880c"
}]
```

There you’ll see more information about the account than just its address, including where the key material is stored: inside the `--datadir` you specified. So if you’re still curious how the data lives in your filesystem, go check out that directory now.

### Connecting Nodes as Peers

We have multiple nodes running, and we need to connect them as peers. First, check whether we have any.

```
> admin.peers
[]
```

So sad. But this is what we expected, since we started the console with a non 1–4 network id and the `--nodiscover` flag. It means we need to tell each node explicitly how to connect to the other, which we do by sharing the enode address.

```
> admin.nodeInfo.enode
"enode://13b835d68917bd4970502b53d8125db1e124b466f6473361c558ea481e31ce4197843ec7d8684011b15ce63def5eeb73982d04425af3a0b6f3437a030878c8a9@[::]:30301?discport=0"
```

This is the enode information geth uses to connect to other nodes so they can share information about transactions and successfully mined blocks. To connect the nodes, we call addPeer: copy the return value of admin.nodeInfo.enode from one node, and run the following command on the other.
```
> admin.addPeer("enode://13b835d68917bd4970502b53d8125db1e124b466f6473361c558ea481e31ce4197843ec7d8684011b15ce63def5eeb73982d04425af3a0b6f3437a030878c8a9@[::]:30301?discport=0")
```

This tells one node how to reach the other, asks that node to link up, and they become each other’s peers. To check, run admin.peers on both nodes and you’ll see they’re together. Something like:

```
> admin.peers
[{
  caps: ["eth/63"],
  id: "99bf59fe629dbea3cb3da94be4a6cff625c40da21dfffacddc4f723661aa1aa77cd4fb7921eb437b0d5e9333c01ed57bfc0d433b9f718a2c95287d3542f2e9a8",
  name: "Geth/v1.7.1-stable-05101641/darwin-amd64/go1.9.1",
  network: {
    localAddress: "[::1]:30301",
    remoteAddress: "[::1]:50042"
  },
  protocols: {
    eth: {
      difficulty: 935232,
      head: "0x8dd2dc7968328c8bbd5aacc53f87e590a469e5bde3945bee0f6ae13392503d17",
      version: 63
    }
  }
}]
```

To add the peer, you only need to tell one node to connect to the other, so check the second node and you’ll see similar output. Peers on peers.

### Checking Balances and Mining

Now that the nodes are connected, we’re in the realm of money. Before we mine, we want to check the balance of our main account.

```
> eth.getBalance(eth.coinbase)
0
```

Again, so sad. Since we didn’t allocate any ether to this account in the genesis block, we need to start mining to earn some. In the console, miner.start() starts mining and miner.stop() stops it. While mining, we’re not only watching how many ethers the accounts earn; we also want to watch the two peered nodes interact. In the picture below, you’ll see I checked the balance of the main account on both nodes. Then on node 1, I started the miner, let it run for ~5 seconds, and stopped it after 7 full blocks. Checking the balance on that side gives 35 ether, though the number printed in the console is denominated in Wei.
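That 35 ether lines up: 7 blocks at the 5-ether block reward, printed in Wei. A quick standalone check with BigInt (my own sketch; no web3 needed, since web3.fromWei does this conversion for you):

```javascript
// 1 ether = 10^18 Wei; geth consoles report balances in Wei.
// BigInt keeps the 18-digit numbers exact.
const WEI_PER_ETHER = 10n ** 18n;
const BLOCK_REWARD = 5n * WEI_PER_ETHER; // 5 ether per mined block on this chain

const balance = 7n * BLOCK_REWARD; // 7 blocks mined
console.log(balance.toString());                       // "35000000000000000000"
console.log((balance / WEI_PER_ETHER).toString());     // "35"
```

This is why the raw console numbers always carry so many zeros: everything is integer Wei, never fractional ether.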
On the other node, you’ll see that it received the 7 blocks mined by node 1. Beginning to mine.

### Transactions

Working with smart contracts requires special transactions, but before getting that far, we want to know how to create transactions that send ether to the other account. On one node, let’s take the coinbase account and unlock it.

```
> coinbaseAddress = eth.coinbase
> personal.unlockAccount(coinbaseAddress)
Unlock account 0x554585d7c4e5b5569158c33684657772c0d0b7e1
Passphrase:
true
```

Now copy the address of the other node’s coinbase account, and back in the node with the unlocked account:

```
> hisAddress = "0x846774a81e8e48379c6283a3aa92e9036017172a"
```

After this, the sendTransaction command is somewhat simple.

```
> eth.sendTransaction({from: eth.coinbase, to: hisAddress, value: 100000000})
INFO [12-09|10:29:36] Submitted transaction  fullhash=0x776689315d837b5f0d9220dc7c0e7315ef45907e188684a6609fde8fcd97dd57 recipient=0x846774A81E8E48379C6283a3Aa92E9036017172A
"0x776689315d837b5f0d9220dc7c0e7315ef45907e188684a6609fde8fcd97dd57"
```

One thing that will confuse you more than once is why these value numbers carry so many zeros. Values are denominated in Wei, so we don’t have to deal with floating point numbers, which could behave differently on different systems. This will come into play again with the gas we’ll need to specify for contract deployment and transactions. If you’re wondering how little ether we just sent:

```
> web3.fromWei(100000000, 'ether')
"0.0000000001"
```

To get the transaction mined, and to see the difference in balances, start the miner on a node and stop it after a block has been mined. Then check the balances to see the change.

```
> miner.start()
...............
> miner.stop()
> web3.eth.getBalance(eth.coinbase)
59999999999900000000
> web3.eth.getBalance(hisAddress)
100000000
```

Alright, check out this giant picture below.
Again, node 1 is on the left, node 2 on the right. First I check the balances of each coinbase account on each node. Then on node 1, I copy node 2’s address, send the transaction, and see the node log that a transaction was submitted. Then I start the miner. You’ll see that block 8 has txs=1, meaning the transaction was mined into that block. After a few more blocks, I stop the miner and check node 1’s balance: 12 blocks with rewards of 5 ether each, minus the 100000000 Wei we gave away.

Then I go back to node 2 and check the balance of its coinbase account: it’s 0. That’s when I remembered I had restarted the console for node 1 and hadn’t set the two nodes back up as peers. So I print node 1’s enode and add it as a peer on node 2. Right after adding the peer, node 2 receives the blocks it missed, including the one with our transaction. Checking the balance again shows the 100000000 Wei. This is how to send ether locally.

### Intermission

At this point, we’re about half done! Working entirely in terminals, we have a private Ethereum blockchain running locally, with two nodes that have accounts, are peers with each other, and can send transactions back and forth. That’s pretty good, so if you want to take a second to calm down and get a slightly better understanding, go ahead. But at some point, we want to move on.

### Write a Contract on Remix

Moving on! With the geth nodes running, the next step is getting into contracts. When writing posts like this, it takes a long time to pick a simple yet worthwhile example, and that was the case when picking a type of contract to use. The one I decided on lets people answer yes / no, or true / false, questions. The final v1 code for the Solidity contract is below. A few notes before you look at it:

• We’re just using global variables in this case for the question, who asked it, who has answered it, and the values for the answers.
Solidity also has structs where we could store this data, but we’re talking about deployment, not Solidity, so we won’t go too in depth with that.
• I’m using uints to store the yes / no answers instead of bools. In Solidity, a mapping from addresses to bool defaults to false, while for a uint the default is zero. That gives us the three states we need (unanswered, yes, no). I could have used an enum here, but like I said, we’re staying simple.
• The answerQuestion method is somewhat complicated in its logic and all the if statements. Go through it if you want a sense of how we’re adjusting the variables.
• We have a get function that returns all the information we want to show about the status of the contract on the page. We could split it up to return pieces of information separately, but we might as well bundle it and avoid multiple queries.
• Not only are there other ways to store this data in the contract, there are tons of other ways to write it! For example, we could keep a list of all the accounts that voted true or false and loop through those to see if an account has answered yet.

```
pragma solidity ^0.4.0;

contract Questions {

    //global variables that aren't in a struct
    mapping(address => uint) public answers; //0 means hasn't answered, 1 means yes, 2 means no
    string question;
    address asker;
    uint trues;
    uint falses;

    /// __init__
    function Questions(string _question) public {
        asker = msg.sender;
        question = _question;
    }

    //We need a way to validate whether or not they've answered before.
    //The default of a mapping value is 0, which is why 0 means "hasn't answered."
    function answerQuestion (bool _answer) public {
        if (answers[msg.sender] == 0 && _answer) { //haven't answered yet
            answers[msg.sender] = 1; //they vote true
            trues += 1;
        } else if (answers[msg.sender] == 0 && !_answer) {
            answers[msg.sender] = 2; //falsity
            falses += 1;
        } else if (answers[msg.sender] == 2 && _answer) { //false switching to true
            answers[msg.sender] = 1; //true
            trues += 1;
            falses -= 1;
        } else if (answers[msg.sender] == 1 && !_answer) { //true switching to false
            answers[msg.sender] = 2; //falsity
            trues -= 1;
            falses += 1;
        }
    }

    function getQuestion() public constant returns (string, uint, uint, uint) {
        return (question, trues, falses, answers[msg.sender]);
    }
}
```

I store this contract in contracts/Question.sol, but instead of compiling locally, I use Remix, which handles a lot: it surfaces errors and warnings in the code and produces the compiled output we need. To see the compiled information, click the Details button on the “compile” tab in the upper right and a bunch of information pops up. The data we’re looking for is the byteCode and the ABI. Right below that is the web3 deploy information, which is exactly what we’re going to mimic! But rather than having giant strings on a single line, we’re going to import the information from a JSON file. Gotta keep that data separate.
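Before leaving the contract: the answerQuestion branching is easier to follow outside Solidity. Here is a plain-JavaScript mirror of the same state machine (0 = unanswered, 1 = yes, 2 = no), purely as an illustration, not code from the post:

```javascript
// In-memory mirror of the contract's state: the answers mapping plus tallies.
function makeQuestion() {
  return { answers: {}, trues: 0, falses: 0 };
}

// Same four branches as the Solidity answerQuestion method.
function answerQuestion(q, voter, answer) {
  const prev = q.answers[voter] || 0; // missing key -> 0, like Solidity's mapping default
  if (prev === 0 && answer)       { q.answers[voter] = 1; q.trues += 1; }
  else if (prev === 0 && !answer) { q.answers[voter] = 2; q.falses += 1; }
  else if (prev === 2 && answer)  { q.answers[voter] = 1; q.trues += 1; q.falses -= 1; }
  else if (prev === 1 && !answer) { q.answers[voter] = 2; q.trues -= 1; q.falses += 1; }
  // re-voting the same way falls through all branches: nothing changes
}

const q = makeQuestion();
answerQuestion(q, "0xabc", true);  // first vote: yes
answerQuestion(q, "0xabc", false); // switch to no
console.log(q.trues, q.falses);    // 0 1
```

The tallies only ever move when a voter’s state actually changes, which is the whole point of the four-way branch.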
```
//childContractv1.json
{
  "abi": [{"constant":true,"inputs":[{"name":"","type":"address"}],"name":"answers","outputs":[{"name":"","type":"uint256"}],"payable":false,"stateMutability":"view","type":"function"},{"constant":true,"inputs":[],"name":"getQuestion","outputs":[{"name":"","type":"string"},{"name":"","type":"uint256"},{"name":"","type":"uint256"},{"name":"","type":"uint256"}],"payable":false,"stateMutability":"view","type":"function"},{"constant":false,"inputs":[{"name":"_answer","type":"bool"}],"name":"answerQuestion","outputs":[],"payable":false,"stateMutability":"nonpayable","type":"function"},{"inputs":[{"name":"_question","type":"string"}],"payable":false,"stateMutability":"nonpayable","type":"constructor"}],
  "byteCode": "0x6060604052341561000f57600080fd5b6040516106d23803806106d28339810160405280805182019190505033600260006101000a81548173ffffffffffffffffffffffffffffffffffffffff021916908373ffffffffffffffffffffffffffffffffffffffff1602179055508060019080519060200190610082929190610089565b505061012e565b828054600181600116156101000203166002900490600052602060002090601f016020900481019282601f106100ca57805160ff19168380011785556100f8565b828001600101855582156100f8579182015b828111156100f75782518255916020019190600101906100dc565b5b5090506101059190610109565b5090565b61012b91905b8082111561012757600081600090555060010161010f565b5090565b90565b6105958061013d6000396000f300606060405260043610610057576000357c0100000000000000000000000000000000000000000000000000000000900463ffffffff1680635e9618e71461005c578063eff38f92146100a9578063f9e049611461014c575b600080fd5b341561006757600080fd5b610093600480803573ffffffffffffffffffffffffffffffffffffffff16906020019091905050610171565b6040518082815260200191505060405180910390f35b34156100b457600080fd5b6100bc610189565b6040518080602001858152602001848152602001838152602001828103825286818151815260200191508051906020019080838360005b8381101561010e5780820151818401526020810190506100f3565b50505050905090810190601f16801561013b5780820380516001836020036101000a031916815260200191505b509550505050505060405180910390f35b341561015757600080fd5b61016f60048080351515906020019091905050610287565b005b60006020528060005260406000206000915090505481565b610191610555565b600080600060016003546004546000803373ffffffffffffffffffffffffffffffffffffffff1673ffffffffffffffffffffffffffffffffffffffff16815260200............................600460008282540392505081905550610550565b60016000803373ffffffffffffffffffffffffffffffffffffffff1673ffffffffffffffffffffffffffffffffffffffff168152602001908152602001600020541480156104e3575080155b1561054f5760026000803373ffffffffffffffffffffffffffffffffffffffff1673ffffffffffffffffffffffffffffffffffffffff16815260200190815260200160002081905550600160036000828254039250508190555060016004600082825401925050819055505b5b5b5b50565b6020604051908101604052806000815250905600a165627a7a7230582043defebf8fa91b1cd010927004a7ff4816a1040b9cabd4ddd22122a9816742ff0029"
}
```

Go ahead and straight copy this file, but I’d say go to Remix and work with the compiler there so you get a feel for it as well. One quick thing to mention: the byteCode string needs to start with “0x”. When you copy the byte code field from Remix, you only get the hex digits.

### NodeJS Time

From above, every time I said node, I meant the geth / blockchain node. You’ll keep seeing the word “node” here, but when you see a capital N, we mean NodeJS. We have the v1 contract compiled and stored in a file; now we need a Node instance running. There are four endpoints:

• GET ‘/’, which will have a form to ask a new question,
• POST ‘/questions/new’, which deploys the new question contract on the blockchain,
• GET ‘/questions?address=0xXXXX…’, which shows the question with the current answers and a form to send or update your answer, and
• POST ‘/questions?address=0xXXXX…’, which handles the answering.
Deploying a Question

Preface: before getting into blockchains I hadn’t used Node in forever, so some of the syntax and practices might be off here. I’ll go through the three endpoints that talk to the blockchain. The first is a POST request to deploy a new question. I threw in the code that’s needed to connect to your locally running geth as well.

```
const Web3 = require('web3');
const net = require('net');
const compiledContract = require('./contracts/contractv1');

web3IPC = '/Users/jackschultz/Library/PrivEth/geth.ipc';
let web3 = new Web3(web3IPC, net);

const byteCode = compiledContract.byteCode;
const QuestionContract = new web3.eth.Contract(compiledContract.abi);

web3.eth.getCoinbase(function(err, cba) {
  coinbaseAddress = cba;
  console.log(coinbaseAddress);
});
const coinbasePassphrase = 'passphrase';

app.post('/', (req, res) => {
  const question = req.body.question;
  web3.eth.personal.unlockAccount(coinbaseAddress, coinbasePassphrase, function(err, uares) {
    QuestionContract.deploy({data: byteCode, arguments: [question]})
      .send({from: coinbaseAddress, gas: 2000000})
      .on('receipt', function (receipt) {
        console.log("Contract Address: " + receipt.contractAddress);
        res.redirect('/questions?address=' + receipt.contractAddress);
      });
  });
});
```

When we hit the endpoint, the first step after grabbing the question from the request body is to unlock the account we’re deploying from; this is necessary so we’re not impersonating someone else. Once we get the callback, we deploy the contract: the data of the transaction is the entire byteCode, and we pass the question string as the argument for the contract’s constructor. We send it from the coinbase address with a gas allowance of 2000000 (a limit on units of gas, which are paid for in Wei). There are more than a few callbacks we could use here, but the only one we’re interested in right now is ‘receipt’, which gives us the contract’s address after it’s been mined.
In terms of UI, the way this is written, the page hangs, waiting for the contract to be mined, before redirecting to the question’s page. This probably isn’t a good idea for a widely used DAPP, because mining a block on the public Ethereum network takes ~14.5 seconds on average. But on our private blockchain, we set the difficulty so low that blocks are mined very quickly, so it isn’t an issue.

Viewing a Question

Now that we have a question that exists, we want to go ahead and view it! We use the web3.utils.isAddress function to verify that the address is not only a valid hex string but also has a valid checksum, which confirms it’s a real address. Then our getQuestion method returns a result that’s a dictionary of the return values: in our case, the question, the number of trues, the number of falses, and whether the person running the node has answered the question yet.

```
app.get('/questions', function(req, res) {
  const contractAddress = req.query.address;
  if (web3.utils.isAddress(contractAddress)) {
    QuestionContract.options.address = contractAddress;
    QuestionContract.methods.getQuestion().call(function(err, gqres) {
      //using number strings to get the data from the method
      const question = gqres['0'];
      const trues = gqres['1'];
      const falses = gqres['2'];
      const currentAnswerInt = parseInt(gqres['3'], 10);
      data = {contractAddress: contractAddress, question: question, currentAnswerInt: currentAnswerInt, trues: trues, falses: falses};
      res.render('question', data);
    });
  } else {
    res.status(404).send("No question with that address.");
  }
});
```

Answering the Question

When we POST to that question URL, we go through much of the same process: validate the input, validate the address, and then call the answerQuestion method with the required parameters. As with question creation, the browser hangs until the block with the update transaction is mined.
```
app.post('/questions', function(req, res) {
  const contractAddress = req.query.address;
  const answerValue = req.body.answer == 'true' ? true : false;
  if (web3.utils.isAddress(contractAddress)) {
    web3.eth.personal.unlockAccount(coinbaseAddress, coinbasePassphrase, function(err, uares) {
      QuestionContract.options.address = contractAddress;
      QuestionContract.methods.answerQuestion(answerValue).send({from: coinbaseAddress, gas: 2000000})
        .on('receipt', function (receipt) {
          console.log(`Question with address ${contractAddress} updated.`);
        });
    });
  }
});
```
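Both question handlers lean on web3.utils.isAddress, which checks the hex format and the EIP-55 checksum (the checksum part needs keccak-256). If you only want the format half, a regex is enough; this fallback sketch is my own, and is deliberately NOT equivalent to what web3 does:

```javascript
// Format-only address check: "0x" followed by exactly 40 hex characters.
// Unlike web3.utils.isAddress, this does not validate the EIP-55
// mixed-case checksum, so it will accept well-formed but nonexistent
// or miscapitalized addresses.
function looksLikeAddress(s) {
  return /^0x[0-9a-fA-F]{40}$/.test(s);
}

console.log(looksLikeAddress("0x538341f72db4b64e320e6c7c698499ca68a6880c")); // true
console.log(looksLikeAddress("0x1234"));                                     // false
```

The checksum check matters in the handlers above because a mistyped address would otherwise send a transaction into the void.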

HTML

As for the HTML, I’m not going to bother posting it here because it’s quite simple. I didn’t bother to use a css template because it doesn’t matter in a backend post like this. You’ll see screenshots of the basic interface below while I talk about running the code.

Running The Code

Now all the code is out there. You’ll have four terminal tabs open. Two are running geth:

```
geth --datadir /Users/jackschultz/Library/PrivEth --networkid 40 --port 30301 --nodiscover console
geth --datadir /Users/jackschultz/Library/PrivEth2 --networkid 40 --port 30302 --nodiscover console
```

and the other two are running the Node apps, connected to separate geth processes and listening on different localhost ports. I added config files, named primary and secondary, that point to the .ipc path and the port each Node app should run on.

```
NODE_ENV=primary node app.js
NODE_ENV=secondary node app.js
```
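The post doesn’t show those config files, so here is one plausible shape for the lookup; the names, paths, and ports are my guesses (the 4001/4002 ports appear in the screenshots later), not the author’s actual files:

```javascript
// Hypothetical config selection keyed on NODE_ENV.
// Defaults to "primary" when NODE_ENV is unset or unknown.
const configs = {
  primary:   { ipcPath: '/Users/jackschultz/Library/PrivEth/geth.ipc',  httpPort: 4001 },
  secondary: { ipcPath: '/Users/jackschultz/Library/PrivEth2/geth.ipc', httpPort: 4002 },
};

function configFor(env) {
  return configs[env] || configs.primary;
}

console.log(configFor(process.env.NODE_ENV).httpPort);
```

Keying everything off one environment variable means the two app processes can share one codebase while talking to different geth nodes.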

I threw in some pictures here so readers can see more of what’s on my screen. With that, let’s go to the browser and start interacting. First up is the home page, where you can ask a question.

Will they??

Then when you hit the submit button, you’ll see the logging from the Node app; in the geth console, you’ll start the miner and then stop it once the block with this transaction is mined.

To answer, you’ll submit the form, then start and stop the mining. When you’re doing this yourself, a fun thing to do is start the miner before submitting the answer form so you can get a sense of how quickly blocks are mined with this small level of difficulty defined in the genesis block.

Check out the terminal below. In the top Node terminal you’ll see some logging about validating the address, and then logging when we’re redirected to the same page but with updated information. In the geth console, you can see when the transaction was submitted, along with which block the transaction was mined on.

Obviously they will.

Now that we answered the question from the primary node, let’s check out the secondary one.

On the right side of the picture, you’ll see the top two terminals showing the Node and geth interactions, and then on the bottom is the primary geth which you can see that it received blocks with a transaction in it because the two geth nodes are peers. After the question was answered by the node on port 4002, I reloaded the page on port 4001 and we can see the result.

Of course they will.

Just to show that we can switch back to false, I changed the answer from port 4002 to false (which is wrong, cause the Bucks are definitely going to make the playoffs), and then you can see the console logging the information of what went through.

I changed my answer back to true after taking this screenshot.

### Conclusion

If you’ve gotten this far and have the code running yourself, big congrats. Like most of these posts, this one is much longer than I initially imagined. The goal was to go through and explain every step of working with a smart contract, rather than dropping you somewhere in the middle.

As above, if you have feedback of any kind, get in touch: comments, contact, or Twitter.

### Machine Learning & Artificial Intelligence: Main Developments in 2017 and Key Trends in 2018

As we bid farewell to one year and look to ring in another, KDnuggets has solicited opinions from numerous Machine Learning and AI experts as to the most important developments of 2017 and their 2018 key trend predictions.

### Whats new on arXiv

Predictive business process monitoring is concerned with the analysis of events produced during the execution of a business process in order to predict as early as possible the final outcome of an ongoing case. Traditionally, predictive process monitoring methods are optimized with respect to accuracy. However, in environments where users make decisions and take actions in response to the predictions they receive, it is equally important to optimize the stability of the successive predictions made for each case. To this end, this paper defines a notion of temporal stability for predictive process monitoring and evaluates existing methods with respect to both temporal stability and accuracy. We find that methods based on XGBoost and LSTM neural networks exhibit the highest temporal stability. We then show that temporal stability can be enhanced by hyperparameter-optimizing random forests and XGBoost classifiers with respect to inter-run stability. Finally, we show that time series smoothing techniques can further enhance temporal stability at the expense of slightly lower accuracy.
In this paper, we propose a mixture of probabilistic partial canonical correlation analysis (MPPCCA) that extracts Causal Patterns from two multivariate time series. Causal patterns refer to the signal patterns within interactions of two elements having multiple types of mutually causal relationships, rather than a mixture of simultaneous correlations or the presence or absence of a causal relationship between the elements. In multivariate statistics, partial canonical correlation analysis (PCCA) evaluates the correlation between two multivariates after subtracting the effect of a third multivariate. PCCA can calculate the Granger Causality Index (which tests whether a time series can be predicted from another time series), but is not applicable to data containing multiple partial canonical correlations. After introducing the MPPCCA, we propose an expectation-maximization (EM) algorithm that estimates the parameters and latent variables of the MPPCCA. The MPPCCA is expected to extract multiple partial canonical correlations from data series without any supervised signals to split the data into clusters. The method was then evaluated in synthetic data experiments. In the synthetic dataset, our method estimated the multiple partial canonical correlations more accurately than the existing method. To determine the types of patterns detectable by the method, experiments were also conducted on real datasets. The method estimated the communication patterns in motion-capture data. The MPPCCA is applicable to various types of signals such as brain signals, human communication, and nonlinear complex multibody systems.
We report on our experiences of helping staff of the Scottish Longitudinal Study to create synthetic extracts that can be released to users. In particular, we focus on how the synthesis process can be tailored to produce synthetic extracts that will provide users with similar results to those that would be obtained from the original data. We make recommendations for synthesis methods and illustrate how the staff creating synthetic extracts can evaluate their utility at the time they are being produced. We discuss measures of utility for synthetic data and show that one tabular utility measure is exactly equivalent to a measure calculated from a propensity score. The methods are illustrated by using the R package $synthpop$ to create synthetic versions of data from the 1901 Census of Scotland.
We investigate star-galaxy classification for astronomical surveys in the context of four methods enabling the interpretation of black-box machine learning systems. The first is outputting and exploring the decision boundaries as given by decision tree based methods, which enables the visualization of the classification categories. Secondly, we investigate how the Mutual Information based Transductive Feature Selection (MINT) algorithm can be used to perform feature pre-selection. If one would like to provide only a small number of input features to a machine learning classification algorithm, feature pre-selection provides a method to determine which of the many possible input properties should be selected. Third is the use of the tree-interpreter package to enable popular decision tree based ensemble methods to be opened, visualized, and understood. This is done by additional analysis of the tree based model, determining not only which features are important to the model, but how important a feature is for a particular classification given its value. Lastly, we use decision boundaries from the model to revise an already existing method of classification, essentially asking the tree based method where decision boundaries are best placed and defining a new classification method. We showcase these techniques by applying them to the problem of star-galaxy separation using data from the Sloan Digital Sky Survey (hereafter SDSS). We use the output of MINT and the ensemble methods to demonstrate how more complex decision boundaries improve star-galaxy classification accuracy over the standard SDSS frames approach (reducing misclassifications by up to $\approx33\%$). We then show how tree-interpreter can be used to explore how relevant each photometric feature is when making a classification on an object by object basis.
Recent deep learning (DL) models have moved beyond static network architectures to dynamic ones, handling data where the network structure changes every example, such as sequences of variable lengths, trees, and graphs. Existing dataflow-based programming models for DL—both static and dynamic declaration—either cannot readily express these dynamic models, or are inefficient due to repeated dataflow graph construction and processing, and difficulties in batched execution. We present Cavs, a vertex-centric programming interface and optimized system implementation for dynamic DL models. Cavs represents dynamic network structure as a static vertex function $\mathcal{F}$ and a dynamic instance-specific graph $\mathcal{G}$, and performs backpropagation by scheduling the execution of $\mathcal{F}$ following the dependencies in $\mathcal{G}$. Cavs bypasses expensive graph construction and preprocessing overhead, allows for the use of static graph optimization techniques on pre-defined operations in $\mathcal{F}$, and naturally exposes batched execution opportunities over different graphs. Experiments comparing Cavs to two state-of-the-art frameworks for dynamic NNs (TensorFlow Fold and DyNet) demonstrate the efficacy of this approach: Cavs achieves a near one order of magnitude speedup on training of various dynamic NN architectures, and ablations demonstrate the contribution of our proposed batching and memory management strategies.
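Cavs' split into a static vertex function $\mathcal{F}$ and an instance-specific graph $\mathcal{G}$ can be mimicked in plain Python: define $\mathcal{F}$ once, then schedule it over whatever graph each example brings. The toy $\mathcal{F}$ and tree below are illustrative stand-ins, not Cavs' actual API:

```python
# Instance-specific graph G: child lists per node (a small parse tree).
G = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
inputs = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0}

def F(x, child_states):
    """Static vertex function: combines a node's input with its
    children's states (a stand-in for, say, a Tree-LSTM cell)."""
    return x + sum(child_states)

def forward(G, inputs):
    # Schedule F bottom-up: a node runs once all its children have run,
    # i.e. the dependency order Cavs derives from G.
    state, done = {}, set()
    pending = [n for n in G]
    while pending:
        n = pending.pop()
        if all(c in done for c in G[n]):
            state[n] = F(inputs[n], [state[c] for c in G[n]])
            done.add(n)
        else:
            pending.insert(0, n)
    return state

print(forward(G, inputs)[0])  # root state → 15.0
```

Because $\mathcal{F}$ is fixed, a real system can optimize it once and batch its invocations across many graphs; only the cheap scheduling over $\mathcal{G}$ changes per example.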
The performance of optimization algorithms relies crucially on their parameterizations. Finding good parameter settings is called algorithm tuning. Using a simple simulated annealing algorithm, we demonstrate how optimization algorithms can be tuned using the sequential parameter optimization toolbox (SPOT). SPOT provides several tools for automated and interactive tuning. The underlying concepts of the SPOT approach are explained, including key techniques such as exploratory fitness landscape analysis and response surface methodology. Many examples illustrate how SPOT can be used for understanding the performance of algorithms and gaining insight into an algorithm's behavior. Furthermore, we demonstrate how SPOT can be used as an optimizer and how a sophisticated ensemble approach is able to combine several meta models via stacking.
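SPOT itself is an R toolbox, but the tuning problem it addresses is easy to sketch: a simulated annealing run's quality depends on its cooling rate, so a tuner evaluates candidate rates and keeps the best. The sketch below brute-forces three rates on a 1-D test function (SPOT would fit a surrogate model instead); all settings are illustrative:

```python
import math
import random

def simulated_annealing(cooling, steps=500, seed=1):
    """Minimise f(x) = x^2 with a given cooling rate."""
    rng = random.Random(seed)
    x, temp = 5.0, 1.0
    best = x * x
    for _ in range(steps):
        cand = x + rng.gauss(0, 0.5)
        delta = cand * cand - x * x
        # Accept improvements always, worsenings with Boltzmann probability.
        if delta < 0 or rng.random() < math.exp(-delta / max(temp, 1e-12)):
            x = cand
        best = min(best, x * x)
        temp *= cooling
    return best

# Crude "tuning": evaluate a few cooling rates and keep the best one.
results = {c: simulated_annealing(c) for c in (0.90, 0.99, 0.999)}
print(min(results, key=results.get), results)
```

Plotting `results` over a denser grid of cooling rates is exactly the kind of exploratory fitness landscape analysis the abstract refers to.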
Generative adversarial networks (GANs) are innovative techniques for learning generative models of complex data distributions from samples. Despite remarkable recent improvements in generating realistic images, one of their major shortcomings is the fact that in practice, they tend to produce samples with little diversity, even when trained on diverse datasets. This phenomenon, known as mode collapse, has been the main focus of several recent advances in GANs. Yet there is little understanding of why mode collapse happens and why existing approaches are able to mitigate it. We propose a principled approach to handling mode collapse, which we call packing. The main idea is to modify the discriminator to make decisions based on multiple samples from the same class, either real or artificially generated. We borrow analysis tools from binary hypothesis testing—in particular the seminal result of Blackwell [Bla53]—to prove a fundamental connection between packing and mode collapse. We show that packing naturally penalizes generators with mode collapse, thereby favoring generator distributions with less mode collapse during the training process. Numerical experiments on benchmark datasets suggest that packing provides significant improvements in practice as well.
Recent improvements in deep reinforcement learning have made it possible to solve problems in many 2D domains such as Atari games. However, in complex 3D environments, numerous learning episodes are required, which may be too time-consuming or even impossible, especially in real-world scenarios. We present a new architecture that combines external knowledge and deep reinforcement learning using only visual input. A key concept of our system is augmenting the image input with environment feature information and combining two sources of decision. We evaluate the performance of our method in a 3D partially-observable environment from the Microsoft Malmo platform. Experimental evaluation exhibits higher performance and faster learning compared to a single reinforcement learning model.
Principal component analysis (PCA) is widely adopted for chemical process monitoring, and numerous PCA-based systems have been developed to solve various fault detection and diagnosis problems. Since PCA-based methods assume that the monitored process is linear, nonlinear PCA models, such as autoencoder models and kernel principal component analysis (KPCA), have been proposed and applied to nonlinear process monitoring. However, KPCA-based methods need to perform eigen-decomposition (ED) on the kernel Gram matrix, whose dimensions depend on the number of training data. Moreover, kernel parameters fixed in advance cannot be optimal for every fault, as different faults may need different parameters to maximize their respective detection performance. Autoencoder models lack the orthogonality constraints that are crucial for PCA-based algorithms. To address these problems, this paper proposes a novel nonlinear method, called neural component analysis (NCA), which trains a feedforward neural network with orthogonality constraints like those used in PCA. NCA can adaptively learn its parameters through backpropagation, and the dimensionality of the nonlinear features is independent of the number of training samples. Extensive experimental results on the Tennessee Eastman (TE) benchmark process show the superiority of NCA in terms of missed detection rate (MDR) and false alarm rate (FAR). The source code of NCA can be found at https://…/Neural-Component-Analysis.git.
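The PCA-style orthogonality constraint the abstract emphasizes can be imposed as a soft penalty $\|W^\top W - I\|_F^2$ on a projection matrix. The minimal numpy sketch below trains a single linear layer with that penalty by gradient descent; it is a toy stand-in for NCA's network, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# One linear encoding layer W (5 inputs -> 3 components); the penalty
# ||W^T W - I||_F^2 pushes W toward orthonormal columns, mimicking the
# PCA-style constraint NCA trains with.
W = rng.normal(size=(5, 3)) * 0.1

def penalty(W):
    d = W.T @ W - np.eye(W.shape[1])
    return float(np.sum(d * d))

start = penalty(W)
for _ in range(200):
    grad = 4 * W @ (W.T @ W - np.eye(W.shape[1]))  # d/dW ||W^T W - I||_F^2
    W -= 0.01 * grad
print(round(start, 3), round(penalty(W), 6))
```

In a full model this penalty term would simply be added to the reconstruction loss, so backpropagation drives the learned features toward orthogonality while fitting the data.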
Directed latent variable models that formulate the joint distribution as $p(x,z) = p(z) p(x \mid z)$ have the advantage of fast and exact sampling. However, these models have the weakness of needing to specify $p(z)$, often with a simple fixed prior that limits the expressiveness of the model. Undirected latent variable models discard the requirement that $p(z)$ be specified with a prior, yet sampling from them generally requires an iterative procedure such as blocked Gibbs-sampling that may require many steps to draw samples from the joint distribution $p(x, z)$. We propose a novel approach to learning the joint distribution between the data and a latent code which uses an adversarially learned iterative procedure to gradually refine the joint distribution, $p(x, z)$, to better match the data distribution on each step. GibbsNet offers the best of both worlds, in theory and in practice. It achieves the speed and simplicity of a directed latent variable model, and is guaranteed (assuming the adversarial game reaches the global minimum of the virtual training criterion) to produce samples from $p(x, z)$ with only a few sampling iterations. It achieves the expressiveness and flexibility of an undirected latent variable model: GibbsNet does away with the need for an explicit $p(z)$ and can do attribute prediction, class-conditional generation, and joint image-attribute modeling in a single model which is not trained for any of these specific tasks. We show empirically that GibbsNet is able to learn a more complex $p(z)$ and show that this leads to improved inpainting and iterative refinement of $p(x, z)$ for dozens of steps and stable generation without collapse for thousands of steps, despite being trained on only a few steps.
Often the challenge associated with tasks like fraud and spam detection[1] is the lack of all likely patterns needed to train suitable supervised learning models. In order to overcome this limitation, such tasks are attempted as outlier or anomaly detection tasks. We also hypothesize that outliers have behavioral patterns that change over time. Limited data and continuously changing patterns make learning significantly difficult. In this work we propose an approach that detects outliers in large data sets by relying on data points that are consistent. The primary contribution of this work is that it quickly helps retrieve samples for both consistent and non-outlier data sets and is also mindful of new outlier patterns. No prior knowledge of each set is required to extract the samples. The method consists of two phases: in the first phase, consistent data points (non-outliers) are retrieved by an ensemble of unsupervised clustering techniques, and in the second phase a one-class classifier trained on the consistent data point set is applied to the remaining sample set to identify the outliers. The approach is tested on three publicly available data sets and the performance scores are competitive.
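The two-phase idea can be sketched in a few lines of numpy: phase one keeps points that multiple crude criteria agree are central, phase two fits a simple one-class rule on that consistent set and flags everything outside it. The "ensemble" and one-class rule below are deliberately minimal stand-ins for the paper's clustering ensemble and one-class classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))
X = np.vstack([X, [[8.0, 8.0], [9.0, -7.0]]])  # two planted outliers

# Phase 1: keep points that two crude centrality criteria agree on
# (an ensemble stand-in for the paper's unsupervised clustering step).
d_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
d_med = np.linalg.norm(X - np.median(X, axis=0), axis=1)
consistent = (d_mean < np.percentile(d_mean, 90)) & \
             (d_med < np.percentile(d_med, 90))

# Phase 2: a one-class rule fit only on the consistent set flags the rest.
center = X[consistent].mean(axis=0)
radius = np.linalg.norm(X[consistent] - center, axis=1).max()
outliers = np.linalg.norm(X - center, axis=1) > radius
print(int(outliers.sum()), outliers[-2:])
```

Because the one-class rule is trained only on points every criterion trusts, newly appearing outlier patterns still fall outside it even though they were never seen during training.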
The search for interpretable reinforcement learning policies is of high academic and industrial interest. Especially for industrial systems, domain experts are more likely to deploy autonomously learned controllers if they are understandable and convenient to evaluate. Basic algebraic equations are supposed to meet these requirements, as long as they are restricted to an adequate complexity. Here we introduce the genetic programming for reinforcement learning (GPRL) approach based on model-based batch reinforcement learning and genetic programming, which autonomously learns policy equations from pre-existing default state-action trajectory samples. GPRL is compared to a straightforward method which utilizes genetic programming for symbolic regression, yielding policies that imitate an existing well-performing, but non-interpretable, policy. Experiments on three reinforcement learning benchmarks, i.e., mountain car, cart-pole balancing, and the industrial benchmark, demonstrate the superiority of our GPRL approach compared to the symbolic regression method. GPRL is capable of producing well-performing interpretable reinforcement learning policies from pre-existing default trajectory data.
It is a grand challenge to model the emergence of swarm intelligence, and many principles and models have been proposed. However, existing models do not capture the nature of swarm intelligence and are not generic enough to describe various types of emergence phenomena. In this work, we propose a contradiction-centric model for the emergence of swarm intelligence, in which individuals' contradictions dominate their appearances whilst they are associated and interacting to update their contradictions. This model hypothesizes that 1) the emergence of swarm intelligence is rooted in the development of individuals' contradictions and in the interactions among associated individuals, and 2) swarm intelligence is essentially a combinative reflection of the configurations of contradictions inside individuals and the distributions of contradictions among individuals. To verify the feasibility of the model, we simulate four types of swarm intelligence. As the simulations show, our model is truly generic and can describe the emergence of a variety of swarm intelligence; it is also very simple and can be easily applied to demonstrate the emergence of swarm intelligence without needing complicated computations.
Kernel Principal Component Analysis (KPCA) is a popular dimensionality reduction technique with a wide range of applications. However, it suffers from poor scalability. Various approximation methods have been proposed in the past to overcome this problem: the Nyström method, Randomized Nonlinear Component Analysis (RNCA) and Streaming Kernel Principal Component Analysis (SKPCA) all deal with the scalability issue of KPCA. Despite having theoretical guarantees, their performance in real-world learning tasks has not been explored previously. In this work, SKPCA, RNCA and the Nyström method are evaluated on the task of classification for several real-world datasets. The results obtained indicate that SKPCA-based features give much better classification accuracy than the other methods for a very large dataset.
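The Nyström method mentioned above sidesteps KPCA's scalability problem by approximating the full $n \times n$ Gram matrix from $m \ll n$ landmark points as $K \approx C W^{+} C^{\top}$. A numpy sketch on toy data (the kernel, bandwidth and landmark count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

def rbf(A, B, gamma=0.05):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

# Nystrom: K ~= C W^+ C^T built from m landmarks, so the O(n^3)
# eigendecomposition of the full n x n Gram matrix is avoided.
m = 50
idx = rng.choice(len(X), size=m, replace=False)
C = rbf(X, X[idx])          # n x m cross-kernel
W = rbf(X[idx], X[idx])     # m x m landmark kernel
K_approx = C @ np.linalg.pinv(W) @ C.T

K = rbf(X, X)  # exact Gram matrix, for comparison only
err = float(np.linalg.norm(K - K_approx) / np.linalg.norm(K))
print(round(err, 3))
```

Note that the approximation is exact on the landmark-landmark block (since $W W^{+} W = W$); the quality elsewhere depends on how fast the kernel's spectrum decays and on how the landmarks are chosen.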
The study of deep recurrent neural networks (RNNs) and, in particular, of deep Reservoir Computing (RC) is gaining increasing research attention in the neural networks community. The recently introduced deep Echo State Network (deepESN) model opened the way to an extremely efficient approach for designing deep neural networks for temporal data. At the same time, the study of deepESNs has shed light on the intrinsic properties of the state dynamics developed by hierarchical compositions of recurrent layers, i.e., on the bias of depth in RNN architectural design. In this paper, we summarize the advancements in the development, analysis and applications of deepESNs.
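A deepESN is conceptually simple: stack untrained reservoirs so that each layer reads the state sequence of the layer below, and train only a readout on top. The minimal numpy sketch below builds two such layers (sizes, scalings and the input signal are illustrative choices, not the paper's setup):

```python
import numpy as np

def reservoir(n_in, n_units, rho=0.9, seed=0):
    """One untrained recurrent layer, rescaled to spectral radius rho
    (the usual echo-state stability condition)."""
    r = np.random.default_rng(seed)
    W_in = r.uniform(-0.1, 0.1, (n_units, n_in))
    W = r.normal(size=(n_units, n_units))
    W *= rho / max(abs(np.linalg.eigvals(W)))
    return W_in, W

def run(W_in, W, u):
    x = np.zeros(W.shape[0])
    states = np.zeros((len(u), W.shape[0]))
    for t, ut in enumerate(u):
        x = np.tanh(W_in @ np.atleast_1d(ut) + W @ x)
        states[t] = x
    return states

# Stacking reservoirs is the deepESN idea: layer 2 reads layer 1's
# states, so deeper layers tend to develop slower, more abstract dynamics.
u = np.sin(np.linspace(0, 8 * np.pi, 200))
s1 = run(*reservoir(1, 30, seed=1), u)
s2 = run(*reservoir(30, 30, seed=2), s1)
print(s1.shape, s2.shape)
```

Since the recurrent weights are never trained, the only learning cost is a linear regression from the concatenated layer states to the targets, which is what makes the approach so efficient.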
Real-time text processing systems are required in many domains to quickly identify patterns, trends, sentiments, and insights. Nowadays, social networks, e-commerce stores, blogs, scientific experiments, and server logs are the main sources generating huge volumes of text data. Processing huge text data in real time, however, requires building a data processing pipeline, and the main challenge in building such a pipeline is to minimize latency when processing high-throughput data. In this paper, we explain and evaluate our proposed real-time text processing pipeline using open-source big data tools which minimize the latency to process data streams. Our proposed data processing pipeline is based on Apache Kafka for data ingestion, Apache Spark for in-memory data processing, Apache Cassandra for storing processed results, and the D3 JavaScript library for visualization. We evaluate the effectiveness of the proposed pipeline under varying deployment scenarios by performing sentiment analysis on a Twitter dataset. Our experimental evaluations show less than one minute of latency when processing $466,700$ Tweets in $10.7$ minutes with three virtual machines allocated to the proposed pipeline.
Class imbalance classification is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, in real-life applications the minority class instances represent the concept of greater interest. Recently, several techniques based on sampling methods (under-sampling the majority class and over-sampling the minority class), cost-sensitive learning methods, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach with the boosting (AdaBoost) algorithm, called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to the RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We evaluated the performance of the CUSBoost algorithm against state-of-the-art ensemble learning methods like AdaBoost, RUSBoost and SMOTEBoost on 13 imbalanced binary and multi-class datasets with various imbalance ratios. The experimental results show that CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets.
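The clustering-based under-sampling step at the heart of CUSBoost can be sketched independently of the boosting loop: cluster the majority class and draw equally from each cluster. The tiny k-means and toy data below are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
majority = rng.normal(0, 1, size=(300, 2))
minority = rng.normal(3, 1, size=(30, 2))

def kmeans(X, k, iters=20, seed=0):
    """A minimal k-means, standing in for the paper's clustering step."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Draw equally from every cluster so the reduced majority set keeps its
# internal structure (the key difference from RUSBoost's random sampling).
k = 5
labels = kmeans(majority, k)
per_cluster = len(minority) // k
sizes = [(labels == j).sum() for j in range(k)]
sampled = np.vstack([
    majority[labels == j][rng.choice(sizes[j], per_cluster,
                                     replace=sizes[j] < per_cluster)]
    for j in range(k) if sizes[j] > 0
])
print(sampled.shape, minority.shape)
```

In the full algorithm this balanced set (`sampled` plus the minority class) is resampled and fed to each AdaBoost round in place of the raw imbalanced data.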
Running agent-based models (ABMs) is a burdensome computational task, especially so when considering the flexibility ABMs intrinsically provide. This paper uses a bundle of model configuration parameters, along with results obtained from a validated ABM, to train several Machine Learning methods to identify socioeconomically optimal cases. A larger space of possible parameters and combinations of parameters is then used as input to predict optimal cases and confirm parameter calibration. The parameters of the optimal cases are then compared to the baseline model. This exploratory initial exercise confirms the adequacy of most of the parameters and rules and suggests a change of direction for two parameters. Additionally, it helps highlight metropolitan regions of higher quality of life. A better understanding of ABM mechanisms and parameters' influence may nudge policy-making slightly closer to the optimal level.
Subset selection for multiple linear regression aims to construct a regression model that minimizes errors by selecting a small number of explanatory variables. Once a model is built, various statistical tests and diagnostics are conducted to validate the model and to determine whether regression assumptions are met. Most traditional approaches require human decisions at this step, for example, the user adding or removing a variable until a satisfactory model is obtained. However, this trial-and-error strategy cannot guarantee that a subset that minimizes the errors while satisfying all regression assumptions will be found. In this paper, we propose a fully automated model building procedure for multiple linear regression subset selection that integrates model building and validation based on mathematical programming. The proposed model minimizes mean squared errors while ensuring that the majority of the important regression assumptions are met. When no subset satisfies all of the considered regression assumptions, our model provides an alternative subset that satisfies most of these assumptions. Computational results show that our model yields better solutions (i.e., satisfying more regression assumptions) compared to benchmark models while maintaining similar explanatory power.
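The trade-off the abstract describes can be sketched with a brute-force baseline: enumerate small subsets, fit each by least squares, and score fit against subset size. BIC is used below as a simple stand-in for the paper's mathematical-programming objective and assumption checks; the data are hypothetical:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 120, 6
X = rng.normal(size=(n, p))
# Hypothetical data: only features 0 and 3 truly drive the response.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 0.1, size=n)

best = (np.inf, None)
for size in range(1, 4):
    for subset in itertools.combinations(range(p), size):
        Xs = X[:, subset]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        mse = float(np.mean((y - Xs @ beta) ** 2))
        # BIC trades fit against subset size; the paper's mathematical
        # program would additionally enforce regression diagnostics here.
        bic = n * np.log(mse) + size * np.log(n)
        if bic < best[0]:
            best = (bic, subset)
print(best[1])
```

Raw MSE alone would always favor larger subsets; a size penalty (or, as in the paper, explicit constraints) is what removes the trial-and-error step from model building.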
Inference in the presence of outliers is an important field of research, as outliers are ubiquitous and may arise across a variety of problems and domains. Bayesian optimization is a method that heavily relies on probabilistic inference. This allows outstanding sample efficiency because the probabilistic machinery provides a memory of the whole optimization process. However, that virtue becomes a disadvantage when the memory is populated with outliers, inducing bias in the estimation. In this paper, we present an empirical evaluation of Bayesian optimization methods in the presence of outliers. The empirical evidence shows that Bayesian optimization with robust regression often produces suboptimal results. We then propose a new algorithm which combines robust regression (a Gaussian process with a Student-t likelihood) with outlier diagnostics to classify data points as outliers or inliers. By using a scheduler for the classification of outliers, our method is more efficient and converges better than standard robust regression. Furthermore, we show that even in controlled situations with no expected outliers, our method is able to produce better results.
The field of deep learning has seen significant advancement in recent years. However, much of the existing work has focused on real-valued numbers. Recent work has shown that a deep learning system using complex numbers can be deeper for a fixed parameter budget than its real-valued counterpart. In this work, we explore the benefits of generalizing one step further into the hyper-complex numbers, quaternions specifically, and provide the architecture components needed to build deep quaternion networks. We go over quaternion convolutions, present a quaternion weight initialization scheme, and give algorithms for quaternion batch normalization. These pieces are tested by end-to-end training on the CIFAR-10 and CIFAR-100 data sets, showing improved convergence compared to a real-valued network.
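The building block underneath quaternion convolutions is the Hamilton product, which mixes all four components of weight and input quaternions rather than multiplying channels independently. A self-contained sketch:

```python
def hamilton(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) tuples —
    the elementwise operation quaternion convolutions apply between
    weight and input quaternions."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw*qw - px*qx - py*qy - pz*qz,
            pw*qx + px*qw + py*qz - pz*qy,
            pw*qy - px*qz + py*qw + pz*qx,
            pw*qz + px*qy - py*qx + pz*qw)

i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
print(hamilton(i, j))  # i * j = k -> (0, 0, 0, 1)
```

Note the product is non-commutative (`j * i = -k`), which is why the ordering of weights and inputs matters when defining quaternion layers.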
Learning customer preferences from an observed behaviour is an important topic in the marketing literature. Structural models typically model forward-looking customers or firms as utility-maximizing agents whose utility is estimated using methods of Stochastic Optimal Control. We suggest an alternative approach to study dynamic consumer demand, based on Inverse Reinforcement Learning (IRL). We develop a version of the Maximum Entropy IRL that leads to a highly tractable model formulation that amounts to low-dimensional convex optimization in the search for optimal model parameters. Using simulations of consumer demand, we show that observational noise for identical customers can be easily confused with an apparent consumer heterogeneity.
In this paper, we provide a Rapid Orthogonal Approximate Slepian Transform (ROAST) for the discrete vector one obtains when collecting a finite set of uniform samples from a baseband analog signal. The ROAST offers an orthogonal projection which is an approximation to the orthogonal projection onto the leading discrete prolate spheroidal sequence (DPSS) vectors (also known as Slepian basis vectors). As such, the ROAST is guaranteed to accurately and compactly represent not only oversampled bandlimited signals but also the leading DPSS vectors themselves. Moreover, the subspace angle between the ROAST subspace and the corresponding DPSS subspace can be made arbitrarily small. The complexity of computing the representation of a signal using the ROAST is comparable to the FFT, which is much less than the complexity of using the DPSS basis vectors. We also give non-asymptotic results to guarantee that the proposed basis not only provides a very high degree of approximation accuracy in a mean-square error sense for bandlimited sample vectors, but also that it can provide high-quality approximations of all sampled sinusoids within the band of interest.

### Whats new on arXiv

This paper proposes a method for estimating the causal effect of a discrete intervention in observational time-series data using encoder-decoder recurrent neural networks (RNNs). Encoder-decoder networks, a special class of RNNs suitable for handling variable-length sequential data, are used to predict a counterfactual time-series of treated unit outcomes. The proposed method does not rely on pretreatment covariates, and encoder-decoder networks are capable of learning nonconvex combinations of control unit outcomes to construct a counterfactual. To demonstrate the proposed method, I extend a field experiment studying the effect of radio advertisements on electoral competition to observational time-series.
Adversarial perturbations can pose a serious threat for deploying machine learning systems. Recent works have shown the existence of image-agnostic perturbations that can fool classifiers over most natural images. Existing methods present optimization approaches that solve for a fooling objective with an imperceptibility constraint to craft the perturbations. However, for a given classifier, they generate one perturbation at a time, which is a single instance from the manifold of adversarial perturbations. Also, in order to build robust models, it is essential to explore the manifold of adversarial perturbations. In this paper, we propose, for the first time, a generative approach to model the distribution of adversarial perturbations. The architecture of the proposed model is inspired by that of GANs and is trained using fooling and diversity objectives. Our trained generator network attempts to capture the distribution of adversarial perturbations for a given classifier and readily generates a wide variety of such perturbations. Our experimental evaluation demonstrates that perturbations crafted by our model (i) achieve state-of-the-art fooling rates, (ii) exhibit wide variety and (iii) deliver excellent cross model generalizability. Our work can be deemed as an important step in the process of inferring about the complex manifolds of adversarial perturbations.
This paper proposes adversarial attacks for Reinforcement Learning (RL) and then improves the robustness of Deep Reinforcement Learning (DRL) algorithms to parameter uncertainties with the help of these attacks. We show that even a naively engineered attack successfully degrades the performance of a DRL algorithm. We further improve the attack using gradient information from an engineered loss function, which leads to further degradation in performance. These attacks are then leveraged during training to improve the robustness of RL within a robust control framework. We show that this adversarial training of DRL algorithms like Deep Double Q-learning and Deep Deterministic Policy Gradients leads to a significant increase in robustness to parameter variations for RL benchmarks such as the Cart-pole, Mountain Car, Hopper and Half Cheetah environments.
Research in Artificial Intelligence is breaking technology barriers every day. New algorithms and high-performance computing are making things possible which we could only have imagined earlier. Though the enhancements in AI are making life easier for human beings day by day, there is a constant fear that AI-based systems will pose a threat to humanity. People in the AI community have a diverse set of opinions regarding the pros and cons of AI mimicking human behavior. Instead of worrying about AI advancements, we propose a novel idea of cognitive agents, including both humans and machines, living together in a complex adaptive ecosystem, collaborating on human computation for producing essential social goods while promoting the sustenance, survival and evolution of the agents' life cycle. We highlight several research challenges and technology barriers in achieving this goal. We propose a governance mechanism around this ecosystem to ensure ethical behaviors of all cognitive agents. Along with a novel set of use-cases of Cogniculture, we discuss the road map ahead for this journey.
Progress in deep learning is slowed by the days or weeks it takes to train large models. The natural solution of using more hardware is limited by diminishing returns, and leads to inefficient use of additional resources. In this paper, we present a large batch, stochastic optimization algorithm that is both faster than widely used algorithms for fixed amounts of computation, and also scales up substantially better as more computational resources become available. Our algorithm implicitly computes the inverse Hessian of each mini-batch to produce descent directions; we do so without either an explicit approximation to the Hessian or Hessian-vector products. We demonstrate the effectiveness of our algorithm by successfully training large ImageNet models (Inception-V3, Resnet-50, Resnet-101 and Inception-Resnet-V2) with mini-batch sizes of up to 32000 with no loss in validation error relative to current baselines, and no increase in the total number of steps. At smaller mini-batch sizes, our optimizer improves the validation error in these models by 0.8-0.9%. Alternatively, we can trade off this accuracy to reduce the number of training steps needed by roughly 10-30%. Our work is practical and easily usable by others — only one hyperparameter (learning rate) needs tuning, and furthermore, the algorithm is as computationally cheap as the commonly used Adam optimizer.
The motivation of the current study was to design an algorithm that can speed up the processing of a query. The important feature is generating code dynamically for a specific query. We present a technique of code generation that is applied to query processing on a raw file. The idea is to customize a query program with a given query and generate machine- and query-specific source code. The generated code is compiled by GCC, Clang or any other C/C++ compiler, and the compiled file is dynamically linked to the main program for further processing. Code generation reduces the cost of generalizing query processing, and it avoids the overhead of conventional interpretation while achieving high performance. Database Management Systems (DBMSs) perform excellent jobs in many aspects of big data, such as storage, indexing, and analysis. DBMSs typically format entire data sets and load them into their storage layer, which increases data-to-query time, i.e., the time it takes to convert data into a specific schema and persist them on disk. Ideally, DBMSs should adapt to the input data and extract only the columns associated with a given query, not the entire data set. Therefore, a query engine on a raw file can reduce the cost of conventional general operators and avoid some unnecessary procedures, such as fully scanning, tokenizing and parsing the whole data set. In the current study, we introduce our code-generation approach for in-situ processing of raw files, which is based on the template approach and the hype approach. The approach minimizes the data-to-query time and achieves high performance for query processing. Our work offers several benefits: reducing branches and instructions, unrolling loops, eliminating unnecessary data type checks and optimizing the binary code with a compiler on the local machine.
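The specialize-then-compile step can be illustrated in miniature: emit source code that hard-wires one query's columns and predicate, compile it, and run it over raw rows. The paper compiles C/C++ with GCC or Clang; Python's built-in `compile`/`exec` stands in for that step here, and the query itself is hypothetical:

```python
# Hypothetical query: SELECT col0 FROM raw WHERE col1 > 10.
query = {"select": 0, "where_col": 1, "where_gt": 10.0}

# Emit a scan function specialised to this exact query: no per-row
# interpretation of the query plan, no type checks, no unused columns.
src = (
    "def scan(rows):\n"
    f"    return [r[{query['select']}] for r in rows "
    f"if r[{query['where_col']}] > {query['where_gt']}]\n"
)
namespace = {}
exec(compile(src, "<generated>", "exec"), namespace)

rows = [(1.0, 5.0), (2.0, 20.0), (3.0, 11.0)]
print(namespace["scan"](rows))  # → [2.0, 3.0]
```

The constants baked into `src` are exactly what lets a real compiler remove branches, unroll loops and drop data-type checks, which is where the speedups described above come from.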
While off-policy temporal difference methods have been broadly used in reinforcement learning due to their efficiency and simple implementation, their Bayesian counterparts have been relatively understudied. This is mainly because the max operator in the Bellman optimality equation brings non-linearity and inconsistent distributions over value function. In this paper, we introduce a new Bayesian approach to off-policy TD methods using Assumed Density Filtering, called ADFQ, which updates beliefs on action-values (Q) through an online Bayesian inference method. Uncertainty measures in the beliefs not only are used in exploration but they provide a natural regularization in the belief updates. We also present a connection between ADFQ and Q-learning. Our empirical results show the proposed ADFQ algorithms outperform comparing algorithms in several task domains. Moreover, our algorithms improve general drawbacks in BRL such as computational complexity, usage of uncertainty, and nonlinearity.
Matrix decomposition is a popular and fundamental approach in machine learning and data mining. It has been successfully applied in various fields. Most matrix decomposition methods focus on decomposing a data matrix from one single source. However, it is common that data come from different sources with heterogeneous noise. A few matrix decomposition methods have been extended for such multi-view data integration and pattern discovery, but few were designed to explicitly consider the heterogeneity of noise in such multi-view data. To this end, we propose a joint matrix decomposition framework (BJMD), which models the heterogeneity of noise by Gaussian distributions in a Bayesian framework. We develop two algorithms to solve this model: one is a variational Bayesian inference algorithm, which makes full use of the posterior distribution; the other is a maximum a posteriori algorithm, which is more scalable and can be easily parallelized. Extensive experiments on synthetic and real-world datasets demonstrate that BJMD, by considering the heterogeneity of noise, is superior or competitive to the state-of-the-art methods.
The quest for performant networks has been a significant force that drives the advancements of deep learning in recent years. While rewarding, improving network design has never been an easy journey. The large design space combined with the tremendous cost required for network training poses a major obstacle to this endeavor. In this work, we propose a new approach to this problem, namely, predicting the performance of a network before training, based on its architecture. Specifically, we develop a unified way to encode individual layers into vectors and bring them together to form an integrated description via LSTM. Taking advantage of the recurrent network’s strong expressive power, this method can reliably predict the performances of various network architectures. Our empirical studies showed that it not only achieved accurate predictions but also produced consistent rankings across datasets — a key desideratum in performance prediction.
In this work, we propose a hybrid parallel Jaya optimisation algorithm for a multi-core environment with the aim of solving large-scale global optimisation problems. The proposed algorithm is called HHCPJaya, and combines the hyper-population approach with the hierarchical cooperation search mechanism. The HHCPJaya algorithm divides the population into many small subpopulations, each of which focuses on a distinct block of the original population dimensions. In the hyper-population approach, we increase the small subpopulations by assigning more than one subpopulation to each core, and each subpopulation evolves independently to enhance the explorative and exploitative nature of the population. We combine this hyper-population approach with the two-level hierarchical cooperative search scheme to find global solutions from all subpopulations. Furthermore, we incorporate an additional updating phase on the respective subpopulations based on global solutions, with the aim of further improving the convergence rate and the quality of solutions. Several experiments applying the proposed parallel algorithm in different settings prove that it demonstrates sufficient promise in terms of the quality of solutions and the convergence rate. Furthermore, a relatively small computational effort is required to solve complex and large-scale optimisation problems.
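The base update that HHCPJaya parallelizes is the standard Jaya rule: each individual moves toward the population's best solution and away from its worst, with uniform random weights. A serial numpy sketch on the sphere function (the population size, dimensions and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):
    return float((x ** 2).sum())

# Plain (serial) Jaya; HHCPJaya runs many such subpopulations in
# parallel, each focusing on a block of the original dimensions.
pop = rng.uniform(-5, 5, size=(20, 4))
init = min(sphere(x) for x in pop)
for _ in range(200):
    fit = np.array([sphere(x) for x in pop])
    best, worst = pop[fit.argmin()], pop[fit.argmax()]
    r1, r2 = rng.random(pop.shape), rng.random(pop.shape)
    # Jaya update: x' = x + r1*(best - |x|) - r2*(worst - |x|).
    cand = pop + r1 * (best - np.abs(pop)) - r2 * (worst - np.abs(pop))
    improve = np.array([sphere(c) for c in cand]) < fit
    pop[improve] = cand[improve]  # greedy acceptance, as in Jaya

best_val = min(sphere(x) for x in pop)
print(round(init, 3), round(best_val, 6))
```

Jaya has no algorithm-specific control parameters (only population size and iteration budget), which is one reason it parallelizes cleanly into the hyper-population scheme the abstract describes.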
Recently, Yuan et al. (2016) have shown the effectiveness of using Long Short-Term Memory (LSTM) for performing Word Sense Disambiguation (WSD). Their proposed technique outperformed the previous state-of-the-art on several benchmarks, but neither the training data nor the source code was released. This paper presents the results of a reproduction study of this technique using only openly available datasets (GigaWord, SemCore, OMSTI) and software (TensorFlow). The results show that state-of-the-art performance can be obtained with far less data than Yuan et al. suggest. All code and trained models are made freely available.
Retrieving nearest neighbors across correlated data in multiple modalities, such as image-text pairs on Facebook and video-tag pairs on YouTube, has become a challenging task due to the huge amount of data. Multimodal hashing methods that embed data into binary codes can boost retrieval speed and reduce storage requirements. Unsupervised multimodal hashing methods are usually inferior to supervised ones, but supervised methods require large amounts of manually labeled data; the method proposed in this paper is therefore a semi-supervised multimodal hashing method that uses only a portion of the labels. It first computes the transformation matrices for the data matrices and the label matrix. Then, with these transformation matrices, fuzzy logic is introduced to estimate a label matrix for the unlabeled data. Finally, it uses the estimated label matrix to learn hashing functions for data in each modality and generate a unified binary code matrix. Experiments show that with 50% of the labels the proposed semi-supervised method achieves performance in the middle of the compared supervised methods, and with 90% of the labels it approaches the best supervised method. With only 10% of the labels, the proposed method still competes with the weakest of the compared supervised methods.
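The retrieval machinery this method builds on, binarizing embedded features and ranking by Hamming distance, can be sketched in a few lines (an illustrative sketch only: the random projection below stands in for the paper's learned semi-supervised transformation matrices).

```python
import numpy as np

rng = np.random.default_rng(0)

def learn_projection(dim, n_bits):
    # Placeholder for the learned transformation; a plain random projection
    # stands in for the paper's semi-supervised learning step.
    return rng.standard_normal((dim, n_bits))

def encode(X, W):
    # Binarize projected features into +/-1 codes.
    return np.where(X @ W >= 0, 1, -1)

def hamming_search(query_code, db_codes):
    # Rank database items by Hamming distance to the query code.
    dists = np.sum(db_codes != query_code, axis=1)
    return np.argsort(dists), dists

X = rng.standard_normal((100, 32))   # toy "image" features
W = learn_projection(32, 16)
codes = encode(X, W)
order, dists = hamming_search(codes[0], codes)
# the query's own code is at Hamming distance 0 from itself
```

Hamming distance on short binary codes is what makes the retrieval fast: it reduces to XOR-and-popcount operations rather than floating-point comparisons.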
We introduce a Random Attention Model (RAM) allowing for a large class of stochastic consideration maps in the context of an otherwise canonical limited attention model for decision theory. The model relies on a new restriction on the unobserved, possibly stochastic consideration map, termed \textit{Monotonic Attention}, which is intuitive and nests many recent contributions in the literature on limited attention. We develop revealed preference theory within RAM and obtain precise testable implications for observable choice probabilities. Using these results, we show that a set (possibly a singleton) of strict preference orderings compatible with RAM is identifiable from the decision maker’s choice probabilities, and establish a representation of this identified set of unobserved preferences as a collection of inequality constraints on her choice probabilities. Given this nonparametric identification result, we develop uniformly valid inference methods for the (partially) identifiable preferences. We showcase the performance of our proposed econometric methods using simulations, and provide a general-purpose software implementation of our estimation and inference results in the \texttt{R} software package \texttt{ramchoice}. Our proposed econometric methods are computationally very fast to implement.
In an era of big data there is a growing need for memory-bounded learning algorithms. In the last few years researchers have investigated what cannot be learned under memory constraints. In this paper we focus on the complementary question of what can be learned under memory constraints. We show that if a hypothesis class fulfills a combinatorial condition defined in this paper, there is a memory-bounded learning algorithm for this class. We prove that certain natural classes fulfill this combinatorial property and thus can be learned under memory constraints.
Convolutional neural networks (CNNs) are similar to ‘ordinary’ neural networks in the sense that they are made up of hidden layers consisting of neurons with ‘learnable’ parameters. These neurons receive inputs, perform a dot product, and follow it with a non-linearity. The whole network expresses the mapping between raw image pixels and their class scores. Conventionally, the Softmax function is the classifier used at the last layer of this network. However, there have been studies (Alalshekmubarak and Smith, 2013; Agarap, 2017; Tang, 2013) conducted to challenge this norm. The cited studies introduce the usage of a linear support vector machine (SVM) in an artificial neural network architecture. This project is yet another take on the subject, and is inspired by Tang (2013). Empirical data show that the CNN-SVM model was able to achieve a test accuracy of ~99.04% on the MNIST dataset (LeCun, Cortes, and Burges, 2010), while the CNN-Softmax model achieved a test accuracy of ~99.23% on the same dataset. Both models were also tested on the recently published Fashion-MNIST dataset (Xiao, Rasul, and Vollgraf, 2017), which is supposed to be a more difficult image classification task than MNIST (Zalandoresearch, 2017). This proved to be the case, as CNN-SVM reached a test accuracy of ~90.72%, while CNN-Softmax reached ~91.86%. These results may be improved if data preprocessing techniques are employed on the datasets, and if the base CNN model is made relatively more sophisticated than the one used in this study.
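The swap being studied, replacing the softmax cross-entropy at the last layer with an SVM-style hinge loss on the class scores, is easy to see on raw score vectors (a sketch; note that Tang (2013) actually trains with the squared, L2-SVM variant of the hinge):

```python
import numpy as np

def softmax_xent(scores, y):
    # Cross-entropy of the softmax distribution w.r.t. the true class y.
    z = scores - scores.max()               # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[y]

def multiclass_hinge(scores, y, margin=1.0, squared=False):
    # Multiclass hinge: penalize classes whose score comes within `margin`
    # of the true class score. squared=True gives the L2-SVM loss.
    m = np.maximum(0.0, scores - scores[y] + margin)
    m[y] = 0.0
    return np.sum(m ** 2) if squared else np.sum(m)

scores = np.array([10.0, 0.0, 0.0])  # confident, correct prediction (y = 0)
# hinge loss is exactly 0 once the margin is satisfied;
# softmax cross-entropy is small but never exactly 0
```

The practical difference is that the hinge loss stops pushing once the margin is met, while cross-entropy keeps driving the correct-class score upward.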
We study the problem of inducing interpretability in KG embeddings. Specifically, we explore the Universal Schema (Riedel et al., 2013) and propose a method to induce interpretability. Many vector space models have been proposed for the problem; however, most of these methods do not address the interpretability (semantics) of individual dimensions. In this work, we study this problem and propose a method for inducing interpretability in KG embeddings using entity co-occurrence statistics. The proposed method significantly improves interpretability while maintaining comparable performance on other KG tasks.
Nowadays, eye tracking is the most widely used technology for detecting areas of interest. This kind of technology requires specialized equipment that records the user’s eyes. In this paper, we propose SneakPeek, a different approach that detects areas of interest on images displayed in web pages based on the users’ zooming and panning actions through the image. We have validated our proposed solution with a group of test subjects who performed a test in our on-line prototype. As this is the first iteration of the algorithm, we have found both good and bad results, depending on the type of image. Specifically, SneakPeek works best with medium/big objects in medium/big images; the reason is the limitation on detecting small objects, even as smartphone screens keep getting bigger and bigger. SneakPeek can be adapted to any website by simply adapting the controller interface for the specific case.
This paper presents a novel method, called Analysis-of-marginal-Tail-Means (ATM), for parameter optimization over a large, discrete design space. The key advantage of ATM is that it offers effective and robust optimization performance for both smooth and rugged response surfaces, using only a small number of function evaluations. This method can therefore tackle a wide range of engineering problems, particularly in applications where the performance metric to optimize is ‘black-box’ and expensive to evaluate. The ATM framework unifies two parameter optimization methods in the literature: the Analysis-of-marginal-Means (AM) approach (Taguchi, 1986), and the Pick-the-Winner (PW) approach (Wu et al., 1990). In this paper, we show that by providing a continuum between AM and PW via the novel idea of marginal tail means, the proposed method offers a balance between three fundamental trade-offs. By adaptively tuning these trade-offs, ATM can then provide excellent optimization performance over a broad class of response surfaces using limited data. We illustrate the effectiveness of ATM using several numerical examples, and demonstrate how such a method can be used to solve two real-world engineering design problems.
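The continuum between AM and PW that marginal tail means provide can be seen on a toy two-factor design (a simplified one-factor illustration under a minimization convention; the full ATM procedure adaptively tunes the tail fraction, which is not shown here):

```python
import numpy as np

def marginal_tail_mean(Y, k):
    """Y[i, j] = response with factor A at level i and factor B at level j
    (smaller is better). For each level of A, average its k smallest
    responses over the levels of B."""
    return np.sort(Y, axis=1)[:, :k].mean(axis=1)

Y = np.array([[3.0, 9.0],
              [4.0, 5.0]])

am = marginal_tail_mean(Y, k=2)  # k = n: Analysis-of-marginal-Means (AM)
pw = marginal_tail_mean(Y, k=1)  # k = 1: Pick-the-Winner (PW)
# AM prefers level 1 of factor A (marginal mean 4.5 vs 6.0);
# PW prefers level 0 (best single run 3.0 vs 4.0)
```

Intermediate values of k trade off AM's robustness to rugged responses against PW's sensitivity to the best individual runs, which is exactly the dial ATM tunes.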
We present a cascade architecture for keyword spotting with speaker verification on mobile devices. By pairing a small computational footprint with specialized digital signal processing (DSP) chips, we are able to achieve low power consumption while continuously listening for a keyword.
Online social networks have an increasing influence on our society; they may play decisive roles in politics and can be crucial for the fate of companies. Such services compete with each other and some may even break down rapidly. Using social network datasets, we show the main factors leading to such a dramatic collapse. At the early stage it is mostly the loosely bound users who disappear; later, collective effects play the main role, leading to cascading failures. We present a theory based on a generalised threshold model to explain the findings and show how the collapse time can be estimated in advance using the dynamics of the churning users. Our results shed light on possible mechanisms of instabilities in other competing social processes.
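The cascading part of such a collapse can be sketched with a minimal threshold model (a hypothetical illustration of the mechanism; the paper's generalised model also covers the early, spontaneous churn of loosely bound users, which is not modeled here):

```python
def churn_cascade(neighbors, threshold, seed_churned):
    """Iteratively churn any user whose fraction of churned neighbors
    reaches `threshold`, until no further change occurs."""
    churned = set(seed_churned)
    changed = True
    while changed:
        changed = False
        for user, nbrs in neighbors.items():
            if user in churned or not nbrs:
                continue
            if sum(n in churned for n in nbrs) / len(nbrs) >= threshold:
                churned.add(user)
                changed = True
    return churned

# a small line network: a - b - c - d
net = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
# once the loosely bound user "a" leaves, the failure cascades
print(sorted(churn_cascade(net, 0.5, {"a"})))  # -> ['a', 'b', 'c', 'd']
```

Raising the threshold to 0.6 on the same network stops the cascade at the seed, which is the kind of sensitivity that makes collapse-time estimation from churn dynamics possible.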
In recent years, deep convolutional neural networks (CNNs) have shown record-shattering performance in a variety of computer vision problems, such as visual object recognition, detection and segmentation. These methods have also been utilized in the medical image analysis domain for lesion segmentation, anatomical segmentation and classification. We present an extensive literature review of CNN techniques applied to brain magnetic resonance imaging (MRI) analysis, focusing on the architectures, pre-processing, data-preparation and post-processing strategies used in these works. The aim of this study is three-fold. Our primary goal is to report how different CNN architectures have evolved into today’s state-of-the-art methods, discussing the architectures in detail and examining the pros and cons of the models as evaluated on public datasets. Second, this paper is intended to be a detailed reference for research activity on deep CNNs in brain MRI analysis. Finally, our goal is to present a perspective on the future of CNNs, which we believe will be among the fastest-growing approaches in brain image analysis in the coming years.
A large fraction of the arithmetic operations required to evaluate deep neural networks (DNNs) are due to matrix multiplications, both in convolutional and fully connected layers. Matrix multiplications can be cast as $2$-layer sum-product networks (SPNs) (arithmetic circuits), disentangling multiplications and additions. We leverage this observation for end-to-end learning of low-cost (in terms of multiplications) approximations of linear operations in DNN layers. Specifically, we propose to replace matrix multiplication operations by SPNs, with widths corresponding to the budget of multiplications we want to allocate to each layer, and learning the edges of the SPNs from data. Experiments on CIFAR-10 and ImageNet show that this method applied to ResNet yields significantly higher accuracy than existing methods for a given multiplication budget, or leads to the same or higher accuracy compared to existing methods while using significantly fewer multiplications. Furthermore, our approach allows fine-grained control of the tradeoff between arithmetic complexity and accuracy of DNN models. Finally, we demonstrate that the proposed framework is able to rediscover Strassen’s matrix multiplication algorithm, i.e., it can learn to multiply $2 \times 2$ matrices using only $7$ multiplications instead of $8$.
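The 7-multiplication algorithm the framework rediscovers is Strassen's; for $2 \times 2$ matrices it looks like this:

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications
    instead of the naive 8 (Strassen, 1969)."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    # recombine the 7 products with additions only
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```

Trading one multiplication for extra additions is exactly the kind of structure a sum-product network with a width-7 multiplication budget can express, which is why the learned SPN can recover it.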
We propose a family of nonconvex optimization algorithms that are able to save gradient and negative curvature computations to a large extent, and are guaranteed to find an approximate local minimum with improved runtime complexity. At the core of our algorithms is the division of the entire domain of the objective function into small and large gradient regions: our algorithms only perform gradient descent based procedure in the large gradient region, and only perform negative curvature descent in the small gradient region. Our novel analysis shows that the proposed algorithms can escape the small gradient region in only one negative curvature descent step whenever they enter it, and thus they only need to perform at most $N_{\epsilon}$ negative curvature direction computations, where $N_{\epsilon}$ is the number of times the algorithms enter small gradient regions. For both deterministic and stochastic settings, we show that the proposed algorithms can potentially beat the state-of-the-art local minima finding algorithms. For the finite-sum setting, our algorithm can also outperform the best algorithm in a certain regime.
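A deterministic toy version of the two-region idea, a gradient step when the gradient is large and a negative-curvature step when it is small, might look like the following (an illustrative sketch only, not the authors' algorithm or its complexity-optimal step sizes):

```python
import numpy as np

def two_region_descent(f, grad, hess, x, eps=1e-4, eta=0.1, max_iter=1000):
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > eps:
            x = x - eta * g                 # large-gradient region: gradient step
        else:
            w, V = np.linalg.eigh(hess(x))  # eigenvalues in ascending order
            if w[0] >= -eps:                # no significant negative curvature:
                return x                    # approximate local minimum found
            v = eta * V[:, 0]               # most-negative-curvature direction
            x = x + v if f(x + v) < f(x - v) else x - v
    return x

# double-well test function with a negative-curvature saddle at the origin
f = lambda x: (x[0] ** 2 - 1) ** 2 + x[1] ** 2
grad = lambda x: np.array([4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2 - 4, 0.0], [0.0, 2.0]])

x_star = two_region_descent(f, grad, hess, [0.0, 0.0])
# a single negative-curvature step escapes the origin,
# then gradient descent carries the iterate to x[0] near +/-1
```

The toy run mirrors the paper's key claim: the iterate leaves the small-gradient region after one negative-curvature step and never needs a curvature computation again until termination.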

Learn how to manipulate smartphone behavior with common hyperlinks.

### CogX London 2018, The Festival of all things AI, exclusive KDnuggets Discount

CogX 2018 (11-12 June, London) will be the most important AI event in Europe. Get early bird tickets for only £599 (reduced from £1,799) with code KDN15 (valid December 2017).

### Four short links: 15 December 2017

Machine Teaching, Accuracy Trumps Bias, Fairness in ML, and Quantum Game

1. Machine Teaching: A New Paradigm for Building Machine Learning Systems -- While machine learning focuses on creating new algorithms and improving the accuracy of "learners," the machine teaching discipline focuses on the efficacy of the "teachers." Machine teaching as a discipline is a paradigm shift that follows and extends principles of software engineering and programming languages. We put a strong emphasis on the teacher and the teacher's interaction with data, as well as crucial components such as techniques and design principles of interaction and visualization.
2. Accuracy Dominates Bias and Self-Fulfilling Prophecy -- three conclusions: (1) Although errors, biases, and self-fulfilling prophecies in person perception are real, reliable, and occasionally quite powerful, on average, they tend to be weak, fragile, and fleeting. (2) Perceptions of individuals and groups tend to be at least moderately, and often highly, accurate. (3) Conclusions based on the research on error, bias, and self-fulfilling prophecies routinely greatly overstate their power and pervasiveness, and consistently ignore evidence of accuracy, agreement, and rationality in social perception.
3. Fairness in Machine Learning: Lessons from Political Philosophy -- Questions of discrimination, egalitarianism, and justice are of significant interest to moral and political philosophers, who have expended significant efforts in formalizing and defending these central concepts. It is therefore unsurprising that attempts to formalize "fairness" in machine learning contain echoes of these old philosophical debates. This paper draws on existing work in moral and political philosophy in order to elucidate emerging debates about fair machine learning.
4. Quantum Game -- open source game play with photons, superposition, and entanglement. In your browser! With true quantum mechanics underneath!

### Tracking ballet dancer movements

Research group Euphrates experimented with lines and a ballet dancer’s movements in Ballet Rotoscope:

By the way, rotoscoping is an old technique used by animators to capture movement. Pictures or video are taken and lines are traced for use in different contexts. [via @Rainmaker1973]


### CfP: LVA/ICA 2018, 14th International Conference on Latent Variable Analysis and Signal Separation, July 2-6, 2018, University of Surrey, Guildford, UK (http://cvssp.org/events/lva-ica-2018)

Mark just sent me the following:

Dear Igor,
The forthcoming LVA/ICA 2018 international conference on Latent Variable Analysis and Signal Separation may be of interest to many Nuit Blanche readers, particularly those working on sparse coding or dictionary learning for source separation. The submission deadline is approaching! Please see below for the latest Call for Papers.
Best wishes, Mark
Here it is:

====================================
= LVA/ICA 2018 - CALL FOR PAPERS ==
14th International Conference on Latent Variable Analysis and Signal Separation
July 2-6, 2018
University of Surrey, Guildford, UK
Paper submission deadline: January 15, 2018
====================================

The International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA 2018, is an interdisciplinary forum where researchers and practitioners can experience a broad range of exciting theories and applications involving signal processing, applied statistics, machine learning, linear and multilinear algebra, numerical analysis and optimization, and other areas targeting Latent Variable Analysis problems.
We are pleased to invite you to submit research papers to the 14th LVA/ICA which will be held at the University of Surrey, Guildford, UK, from the 2nd to the 6th of July, 2018. The conference is organized by the Centre for Vision, Speech and Signal Processing (CVSSP); and the Institute of Sound Recording (IoSR).
The proceedings will be published in Springer-Verlag's Lecture Notes in Computer Science (LNCS).

== Keynote Speakers ==
- Orly Alter
Scientific Computing & Imaging Institute and Huntsman Cancer Institute, University of Utah, USA
- Andrzej Cichocki
Brain Science Institute, RIKEN, Japan
- Tuomas Virtanen
Laboratory of Signal Processing
Tampere University of Technology, Finland

== Topics ==
Prospective authors are invited to submit original papers (8-10 pages in LNCS format) in areas related to latent variable analysis, independent component analysis and signal separation, including but not limited to:
- Theory:
* sparse coding, dictionary learning
* statistical and probabilistic modeling
* detection, estimation and performance criteria and bounds
* causality measures
* learning theory
* convex/nonconvex optimization tools
* sketching and censoring for large scale data
- Models:
* general linear or nonlinear models of signals and data
* discrete, continuous, flat, or hierarchical models
* multilinear models
* time-varying, instantaneous, convolutive, noiseless, noisy,
over-complete, or under-complete mixtures
* Low-rank models, graph models, online models
- Algorithms:
* estimation, separation, identification, detection, blind and
semi-blind methods, non-negative matrix factorization, tensor
* feature selection
* time-frequency and wavelet based analysis
* complexity analysis
* Non-conventional signals (e.g. graph signals, quantum sources)
- Applications:
* speech and audio separation, recognition, dereverberation and
denoising
* auditory scene analysis
* image segmentation, separation, fusion, classification, texture
analysis
* biomedical signal analysis, imaging, genomic data analysis,
brain-computer interface
- Emerging related topics:
* sparse learning
* deep learning
* social networks
* data mining
* artificial intelligence
* objective and subjective performance evaluation

== Venue ==
LVA/ICA 2018 will be held at the University of Surrey, Guildford, in the South East of England, UK. The university is a ten minute walk away from the town centre, which offers a vibrant blend of entertainment, culture and history. Guildford is 40 minutes from London by train, and convenient for both London Heathrow and London Gatwick airports.

== Conference Chairs ==
- General Chairs:
Mark Plumbley - University of Surrey, UK
Russell Mason - University of Surrey, UK
- Program Chairs
Sharon Gannot - Bar-Ilan University, Israel
Yannick Deville - Université Paul Sabatier Toulouse 3, France

== Important Dates ==
- Paper submission deadline: January 15, 2018
- Notification of acceptance: March 19, 2018
- Camera ready submission: April 16, 2018
- Summer School: July 2, 2018
- Conference: July 3-6, 2018

== Website ==
For further information, including how to submit, please visit:

We look forward to your participation,
The LVA/ICA 2018 Organizing Committee
===============================
--
Prof Mark D Plumbley
Professor of Signal Processing
Centre for Vision, Speech and Signal Processing (CVSSP)
University of Surrey, Guildford, Surrey, GU2 7XH, UK


### Regulations.

Doug Mills, reporting for The New York Times:

Echoing his days as a real estate developer with the flair of a groundbreaking, Mr. Trump used an oversize pair of scissors to cut a ribbon his staff had set up in front of two piles of paper, representing government regulations in 1960 (20,000 pages, he said), and today — a pile that was about six feet tall (said to be 185,000 pages).

Interpret as you like.


### Document worth reading: “Understanding Deep Learning Generalization by Maximum Entropy”

Deep learning achieves remarkable generalization capability with an overwhelming number of model parameters. Theoretical understanding of deep learning generalization has received attention recently yet remains not fully explored. This paper attempts to provide an alternative understanding from the perspective of maximum entropy. We first derive two feature conditions under which softmax regression strictly applies the maximum entropy principle. A DNN is then regarded as approximating these feature conditions with multilayer feature learning, and is proved to be a recursive solution to the maximum entropy principle. The connection between DNNs and maximum entropy explains why typical designs such as shortcuts and regularization improve model generalization, and provides guidance for future model development. Understanding Deep Learning Generalization by Maximum Entropy
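The softmax/maximum-entropy connection the paper starts from can be checked numerically: among all distributions with the same expected score, the softmax (Gibbs) distribution has the largest entropy. A brute-force grid search over the probability simplex illustrates this (an illustrative check only, not the paper's derivation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
target = p @ z                      # expected score under the softmax

# grid over the 3-simplex; keep distributions matching the expected score
best = 0.0
step = 0.01
for a in np.arange(step, 1.0, step):
    for b in np.arange(step, 1.0 - a, step):
        q = np.array([a, b, 1.0 - a - b])
        if abs(q @ z - target) < 1e-3:
            best = max(best, entropy(q))

# no matched distribution beats the softmax's entropy
# (up to the grid and constraint tolerance)
```

This is the sense in which softmax is not an arbitrary choice of classifier head: it is the maximum-entropy distribution consistent with the score constraints.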

### Inter-operate with ‘MQTT’ Message Brokers With R (a.k.a. Live! BBC! Subtitles!)

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Most of us see the internet through the lens of browsers and apps on our laptops, desktops, watches, TVs and mobile devices. These displays are showing us — for the most part — content designed for human consumption. Sure, apps handle API interactions, but even most of that communication happens over ports 80 or 443. But, there are lots of ports out there; 0:65535, in fact (at least TCP-wise). And, all of them have some kind of data, and most of that is still targeted to something for us.

What if I told you the machines are also talking to each other using a thin/efficient protocol that allows one, tiny sensor to talk to hundreds — if not thousands — of systems without even a drop of silicon-laced sweat? How can a mere, constrained sensor do that? Well, it doesn’t do it alone. Many of them share their data over a fairly new protocol dubbed MQTT (Message Queuing Telemetry Transport).

An MQTT broker watches for devices to publish data under various topics and then also watches for other systems to subscribe to said topics and handles the rest of the interchange. The protocol is lightweight enough that fairly low-powered (CPU- and literal electric-use-wise) devices can easily send their data chunks up to a broker, and the entire protocol is robust enough to support a plethora of connections and an equal plethora of types of data.
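The publish/subscribe flow a broker mediates can be sketched as a toy in-memory dispatcher (illustrative only; a real MQTT broker adds QoS levels, retained messages, topic wildcards, and a compact binary wire protocol):

```python
class MiniBroker:
    """Toy stand-in for an MQTT broker: routes published payloads
    to every subscriber of the matching topic."""

    def __init__(self):
        self.subscribers = {}          # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        for cb in self.subscribers.get(topic, []):
            cb(topic, payload)

broker = MiniBroker()
received = []
broker.subscribe("sensors/temp", lambda t, p: received.append((t, p)))
broker.publish("sensors/temp", b'{"temp": 68}')   # delivered to the subscriber
broker.publish("sensors/humidity", b"42")         # no subscribers; dropped
```

The important property is the decoupling: the sensor publishing to a topic never knows how many subscribers exist, which is what lets one tiny device fan out to thousands of consumers.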

## Why am I telling you all this?

Devices that publish to MQTT brokers tend to be in the spectrum of what folks sadly call the “Internet of Things”. It’s a terrible, ambiguous name, but it’s all over the media and most folks have some idea what it means. In the context of MQTT, you can think of it as, say, a single temperature sensor publishing its data to an MQTT broker so many other things — including programs written by humans to capture, log and analyze that data — can receive it. This is starting to sound like something that might be right up R’s alley.

There are also potential use-cases where an online data processing system might want to publish data to many clients without said clients having to poll a poor, single-threaded R server constantly.

Having MQTT connectivity for R could be really interesting.

And, now we have the beginnings of said connectivity with the mqtt package.

## Another Package? Really?

Yes, really.

Besides the huge potential for having a direct R-bridge to the MQTT world, I’m work-interested in MQTT since we’ve found over 35,000 of them on the default, plaintext port for MQTT (1883) alone:

There are enough of them that I don’t even need to show a base map.

Some of these servers require authentication and others aren’t doing much of anything. But, there are a number of them hosted by corporations and individuals that are exposing real data. OwnTracks seems to be one of the more popular self-/badly-hosted ones.

Then, there are others — like test.mosquitto.org — which deliberately run open MQTT servers for “testing”. There definitely is testing going on there, but there are also real services using it as a production broker. The mqtt package is based on the mosquitto C library, so it’s only fitting that we show a few examples from its own test site here.

For now, there’s really one function: topic_subscribe(). Eventually, R will be able to publish to a broker and do more robust data collection operations (say, to make a live MQTT dashboard in Shiny). The topic_subscribe() function is an all-in one tool that enables you to:

• connect to a broker
• subscribe to a topic
• pass in R callback functions which will be executed on connect, disconnect and when new messages come in

That’s plenty of functionality to do some fun things.

## What’s the ~~frequency~~ temperature, Kenneth?

The mosquitto test server has one topic — /outbox/crouton-demo/temperature — which is a fake temperature sensor that just sends data periodically so you have something to test with. Let’s capture 50 samples and plot them.

Since we’re using a callback we have to use the tricksy <<- operator to store/update variables outside the callback function. And, we should pre-allocate space for said data to avoid needlessly growing objects. Here’s a complete code-block:

library(mqtt) # devtools::install_github("hrbrmstr/mqtt")
library(jsonlite)
library(hrbrthemes)
library(tidyverse)

i <- 0          # initialize our counter
max_recs <- 50  # max number of readings to get
res <- vector("list", max_recs) # pre-allocate space for the readings

# our callback function: parse and store each payload as it arrives
temp_cb <- function(id, topic, payload, qos, retain) {
  i <<- i + 1 # update the counter
  res[[i]] <<- fromJSON(readBin(payload, "character"))
  return(if (i == max_recs) "quit" else "go") # need to send at least ""; "quit" == done
}

topic_subscribe(
  topic = "/outbox/crouton-demo/temperature",
  message_callback = temp_cb
)

# each reading looks like this:
# {"update": {"labels":[4631],"series":[[68]]}}
res %>%
  map(unlist) %>%
  map_df(as.list) %>%
  ggplot(aes(update.labels, update.series)) +
  geom_line() +
  geom_point() +
  labs(x = "Reading", y = "Temp (F)", title = "Temperature via MQTT") +
  theme_ipsum_rc(grid = "XY")

We setup temp_cb() to be our callback and topic_subscribe() ensures that the underlying mosquitto library will call it every time a new message is published to that topic. The chart really shows how synthetic the data is.
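For comparison, flattening one of those crouton-style JSON payloads outside R is just as simple (a Python sketch, assuming the same message shape shown in the comment above):

```python
import json

def parse_reading(payload: bytes):
    """Flatten a {"update": {"labels": [...], "series": [[...]]}} reading
    into a (label, temperature) pair."""
    msg = json.loads(payload)
    return msg["update"]["labels"][0], msg["update"]["series"][0][0]

parse_reading(b'{"update": {"labels":[4631],"series":[[68]]}}')  # -> (4631, 68)
```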

## Subtitles from the Edge

Temperature sensors are just the sort of thing that MQTT was designed for. But, we don’t need to be stodgy about our use of MQTT.

Just about a year ago from this post, the BBC launched live subtitles for iPlayer. Residents of the Colonies may not know what iPlayer is, but it’s the “app” that lets UK citizens watch BBC programmes on glowing rectangles that aren’t proper tellys. Live subtitles are hard to produce well (and get right) and the BBC making the effort to do so also on their digital platform is quite commendable. We U.S. folks will likely be charged $0.99 for each set of digital subtitles now that net neutrality is gone.

Now, some clever person(s) wired up some of these live subtitles to MQTT topics. We can wire up our own code in R to read them live:

bbc_callback <- function(id, topic, payload, qos, retain) {
  cat(crayon::green(readBin(payload, "character")), "\n", sep="")
  return("") # ctrl-c will terminate
}

mqtt::topic_subscribe(
  topic = "bbc/subtitles/bbc_news24/raw",
  connection_callback = mqtt::mqtt_silent_connection_callback,
  message_callback = bbc_callback
)

In this case, control-c terminates things (cleanly). You could easily modify the above code to have a bot that monitors for certain keywords then sends windowed chunks of subtitled text to some other system (Slack, database, etc). Or, create an online tidy text analysis workflow from them.

## FIN

The code is on GitHub and all input/contributions are welcome and encouraged. Some necessary TBDs are authentication & encryption. But, how would you like the API to look for using it, say, in Shiny apps? What should publishing look like? What helper functions would be useful (ones to slice & dice topic names or another to convert raw message text more safely)? Should there be an R MQTT “DSL”? Lots of things to ponder and so many sites to “test”!

## P.S.

In case you are concerned about the unusually boring R package name, I wanted to use RIoT (lower-cased, of course) but riot is, alas, already taken.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
### Book Memo: “Anticipatory Systems”

Philosophical, Mathematical, and Methodological Foundations

Robert Rosen was not only a biologist, he was also a brilliant mathematician whose extraordinary contributions to theoretical biology were tremendous. Founding, with this book, the area of Anticipatory Systems Theory is a remarkable outcome of his work in theoretical biology. This second edition of his book Anticipatory Systems has been carefully revised and edited, and includes an Introduction by Judith Rosen. It has also been expanded with a set of Prolegomena by Dr. Mihai Nadin, who offers an historical survey of this fast growing field since the original work was published. There is also some exciting new work, in the form of an additional chapter on the Ontology of Anticipation, by Dr. John Kineman. An addendum – with autobiographical reminiscences by Robert Rosen himself, and a short story by Judith Rosen about her father – adds a personal touch. This work, now available again, serves as the guiding foundations for the growing field of Anticipatory Systems and, indeed, any area of science that deals with living organisms in some way, including the study of Life and Mind. It will also be of interest to graduate students and researchers in the field of Systems Science.

### Best Masters in Data Science and Analytics – Asia and Australia Edition

The fourth edition of our comprehensive, unbiased survey on graduate degrees in Data Science and Analytics from around the world.

### If you did not already know

Value Prediction Network (VPN)
This paper proposes a novel deep reinforcement learning (RL) architecture, called Value Prediction Network (VPN), which integrates model-free and model-based RL methods into a single neural network. In contrast to typical model-based RL methods, VPN learns a dynamics model whose abstract states are trained to make option-conditional predictions of future values (discounted sum of rewards) rather than of future observations. Our experimental results show that VPN has several advantages over both model-free and model-based baselines in a stochastic environment where careful planning is required but building an accurate observation-prediction model is difficult. Furthermore, VPN outperforms Deep Q-Network (DQN) on several Atari games even with short-lookahead planning, demonstrating its potential as a new way of learning a good state representation. …

Discrete Event Simulation (DES)
In the field of simulation, a discrete-event simulation (DES) models the operation of a system as a discrete sequence of events in time. Each event occurs at a particular instant in time and marks a change of state in the system. Between consecutive events, no change in the system is assumed to occur; thus the simulation can directly jump in time from one event to the next. This contrasts with continuous simulation, in which the simulation continuously tracks the system dynamics over time. Instead of being event-based, this is called an activity-based simulation; time is broken up into small time slices and the system state is updated according to the set of activities happening in the time slice. Because discrete-event simulations do not have to simulate every time slice, they can typically run much faster than the corresponding continuous simulation. Another alternative to event-based simulation is process-based simulation. In this approach, each activity in a system corresponds to a separate process, where a process is typically simulated by a thread in the simulation program. In this case, the discrete events, which are generated by threads, would cause other threads to sleep, wake, and update the system state. A more recent method is the three-phased approach to discrete event simulation (Pidd, 1998). In this approach, the first phase is to jump to the next chronological event. The second phase is to execute all events that unconditionally occur at that time (these are called B-events). The third phase is to execute all events that conditionally occur at that time (these are called C-events). The three-phase approach is a refinement of the event-based approach in which simultaneous events are ordered so as to make the most efficient use of computer resources. The three-phase approach is used by a number of commercial simulation software packages, but from the user’s point of view, the specifics of the underlying simulation method are generally hidden. …

Adaptive Learning for Multi-Agent Navigation (ALAN)
In multi-agent navigation, agents need to move towards their goal locations while avoiding collisions with other agents and static obstacles, often without communication with each other. Existing methods compute motions that are optimal locally but do not account for the aggregated motions of all agents, producing inefficient global behavior especially when agents move in a crowded space. In this work, we develop methods to allow agents to dynamically adapt their behavior to their local conditions. We accomplish this by formulating the multi-agent navigation problem as an action-selection problem, and propose an approach, ALAN, that allows agents to compute time-efficient and collision-free motions. ALAN is highly scalable because each agent makes its own decisions on how to move using a set of velocities optimized for a variety of navigation tasks. Experimental results show that the agents using ALAN, in general, reach their destinations faster than using ORCA, a state-of-the-art collision avoidance framework, the Social Forces model for pedestrian navigation, and a Predictive collision avoidance model. …

### Build Great Conversational Bots Using Azure Bot Service & LUIS (Both Services Now Generally Available)

Re-posted from the Microsoft Azure blog.

Conversational AI, or making human and computer interactions more natural, has been a goal of computer scientists for a long time. In support of that longstanding quest, we are excited to announce the general availability of two key Microsoft Azure services that streamline the creation of interactive conversational bots, namely the Azure Bot Service and the Language Understanding Intelligent Service (LUIS). The Azure Bot Service helps developers create conversational interfaces on multiple channels. LUIS helps developers create customized natural interactions on any platform for any type of application, including bots. With these two services now generally available on Azure, developers can easily build custom models that naturally interpret the intentions of users who converse with their bots.

We are also introducing new capabilities in each service. Azure Bot Service is now available in more regions, offers premium channels to communicate better with users, and provides advanced customization capabilities. LUIS now has an updated user interface, is available in more regions as well, and helps developers create substantially richer conversational experiences in their apps. More detailed information about the new features of Azure Bot Service and LUIS can be obtained here.

LUIS, in fact, is just one part of Microsoft’s Cognitive Services, a collection of intelligent APIs that enables systems to see, hear, speak, understand and interpret our world using natural modes of communication.
Many Cognitive Services offer a high degree of customizability and businesses can tailor them for their unique AI needs. These capabilities allow bots to be trained on the vocabulary of any domain, such as understanding what items customers are ordering from a fast food restaurant menu, to cite one example. Over 760,000 developers from 60 countries are using our Cognitive Services to add intelligent capabilities to their applications. Over 240,000 developers have signed up to use the Azure Bot Service, using it for everything they need to build intelligent software agents. Furthermore, thousands of real world customers have developed and deployed intelligent apps using these services, customers such as Molson Coors, UPS, Dixons Carphone, Equadex, Human Interact, Sabre and many more. To take the UPS example, they have been able to improve customer service and increase the efficiency of their IT staff using the UPS Bot, a sophisticated agent that allows customers to interact via text and voice and get the information they need about shipments, rates, UPS locations and more.  “Within five weeks, we had developed a chatbot prototype with the Microsoft bot technology. Our Chief Information and Engineering Officer loved it and asked that we get a version into production in just two months…and that’s just what we did.” – Kumar Athreya, Senior Applications Development Manager of Shipping Systems, UPS. You can learn more about these exciting announcements at our original blog post here. Then get started on these technologies and build yourself a great conversationalist! ML Blog Team Continue Reading… ### This is a brave post and everyone in statistics should read it This post by Kristian Lum is incredibly brave. It points out some awful behavior by people in our field and should be required reading for everyone. 
It took a lot of courage for Kristian to post this, but we believe her, think this is a serious and critical issue for our field, and will not tolerate this kind of behavior among our colleagues. Her post has already inspired important discussions among the faculty at Johns Hopkins Biostatistics and is an important contribution to making sure our field is welcoming for everyone. Like many others, we'll be waiting to see the response of the ISBA. The ASA is also creating a task force on sexual harassment. We'll be looking for more details on how to get involved in these efforts as we all think about what we can do about this critical issue. Continue Reading…

## December 14, 2017

### Get Network insights in Excel with NodeXL

NodeXL, the network overview, discovery, and exploration add-in for the familiar Microsoft Office Excel (TM) spreadsheet, brings network functions within the reach of people who are more comfortable making pie charts than writing code. See what NodeXL finds in the KDnuggets network and download NodeXL Pro for your analyses. Continue Reading…

### Best Data Science, Machine Learning Courses from Udemy, only $10/$15 until Dec 21

Holiday Dev & IT sale on the best courses from Udemy, including Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $10/$15 until Dec 21, 2017. Continue Reading…

### Data Science in 30 Minutes: A Conversation with Gregory Piatetsky-Shapiro, President of KDnuggets, Jan 11

KDnuggets founder Gregory Piatetsky-Shapiro joins Michael Li, CEO and founder of The Data Incubator, on Jan 11 at 2:30 pm PT / 5:30 pm ET for their monthly webinar series, Data Science in 30 Minutes. Gregory will discuss his career, from Data Mining to Data Science, and examine current trends in the field.
Continue Reading…

### how we position and what we compare

When visualizing data, one piece of advice I often give is to consider what you want your audience to be able to compare, and align those things to a common baseline and put them as close together as possible. This makes the comparison easy. If we step back and consider this more generally, the way we organize our data has implications for what our audience can more (or less) easily do with the data and what they are able to easily (or not so easily) compare. I was working with a client recently when this came into play. The task was to visualize funnel data for a number of cohorts. For each cohort, there were a number of funnel stages, or “gates,” where accounts could fall out: targeted, engaged, pitched, and adopted. Each of these stages represents some portion of those accounts that made it through the previous stage. In this case, the client wanted to compare all of this across a handful of cohorts and regions. Here is an anonymized version of the original graph: There are some things I like about this visual. Everything is titled and labeled. So, while it takes a bit of time to orient and figure out what I’m looking at, the words are all there so that I can eventually figure this out, helping to make the data accessible. But when I step back and think about what I can easily do with the current arrangement of the data, there are a number of limitations. Let’s consider the relative levels of work it takes to make various comparisons within this set of graphs. The easiest comparison for me to make is looking at a given region within a given cohort and focusing on the relative stages of the funnel. For example, if we start at the top left, I can easily compare for the Q1 Cohort in North America the purple vs. blue vs. orange vs. green bar. This is because they are both (1) aligned to a common baseline and (2) close in proximity (directly next to each other).
The next most straightforward comparison I can make is for a given stage in the funnel, I can compare across the various regions for a given cohort. So again, starting at the top left, I can compare within the Q1 Cohort the first purple bar (Targeted in North America) scanning right to the next purple bar (Targeted in EMEA), and so on. They are still aligned to a common baseline, but in this case they aren’t right next to each other (I’m inclined to take my index finger and trace along to help with this comparison). This is a little harder than the first comparison described above, but still possible. The next comparison I can make—and this one is quite a bit more difficult—is a step in the funnel for a given region across cohorts. Again, starting at the top left, I can take that initial purple bar (Targeted in North America) and now scan downwards to compare to that same point for the Q2 cohort and the Q3 cohort. This is harder, because these bars are not aligned to a common baseline and they are also not next to each other. I can see that the bottom leftmost purple bar is bigger than the ones above it. But if I need to have a sense of how much bigger, that’s hard for me to wrap my head around. The numbers are there via the y-axis to make it possible, but it means I'm having to remember numbers and perhaps do a bit of math as I scan across the bars, which is simply more work. And if we step back and think about it… comparisons across cohorts… this is actually potentially one of the most important comparisons that we’d like to be able to make! Visualizing and arranging our data differently could make this easier. Perhaps it’s just me (and this really could be the case), but when I think of cohort analysis, it actually reminds me of my days in banking (a former life) and decay curves, and when I think of “curves,” it makes me think of lines, which makes me want to draw some lines over these bars… Actually, let’s try that. 
Here’s what it looks like if I draw lines over the bars in the first graph (Q1 cohort): While I’m at it, I might as well draw lines across the other graphs, too: And now that we have the lines, we don’t need the bars… The bars would have likely been too much to put into a single graph. But now that I’ve replaced what was previously four bars with a single line (turning the 16 bars in each graph into 4 lines, or 48 bars into 12 lines across the three graphs), I can potentially put all of those into a single graph. It would look like this: While it’s nice to have everything in a single graph, those lines on their own don’t make much sense. Next, I’ll add the requisite details: axis labels and titles so we know what we’re looking at. Note that I didn’t have space to write out “Targeted,” “Engaged,” “Pitched,” and “Adopted” for every single data point. Instead, I chose to use just the first letter of each of these along the x-axis, and then I have a legend of sorts below the region that lists out what each of these letters means. This may not be a perfect solution, but every decision when we visualize data involves tradeoffs, and I’ve decided I’m ok with the tradeoffs here. You’ll perhaps notice here that I haven’t labeled the various cohorts yet. With this view, I could focus on one at a time (calling out either via text or my spoken narrative if talking through this live to make it clear what we are focusing on). For example, maybe first I want to set the stage and focus on the Q1 cohort and how it looked across the various funnel stages and regions: I could then do the same for the Q2 cohort (lower across the board: Is this expected? What drove this? My voiceover could lend commentary to raise or answer these questions): Then finally, I could do the same for the Q3 cohort (ah, now our metrics have recovered from their lows in the Q2 cohort and are now even higher than Q1, did we do something specific to achieve this?
Looks like we targeted a higher proportion of the overall cohort, and it’s interesting to see how that impacted the downstream funnel stages): Note with this view, I could also focus on a given region at a time. For example, it might be interesting to note that these metrics are lower across all cohorts in North America compared to the other regions: Or the spread in APAC across cohorts might be noteworthy, as it’s the largest variance across cohorts compared to the other regions: This piece-by-piece emphasis could work well in a live presentation. But in the case where this is for a report or presentation that will be sent out where we’d likely have a single version of the graph (vs. the multiple iterations that can work well in a live setting so you can focus your audience on what you’re talking about as you discuss the various details), I’d venture to guess that the most recent cohort (Q3) is perhaps the most relevant, so let’s bring our focus back to that: Within the Q3 cohort, we may consider emphasizing one or a couple of data points. Data markers and labels are one way to draw attention and signal importance. If I put them everywhere, we’ll quickly end up with a cluttered mess. But if I’m strategic about which I show, I can help guide my audience towards specific comparisons within the data. For example, if the ultimate success metric is what proportion of accounts have adopted whatever it is we’re tracking (I’ve anonymized that detail away here), I might emphasize just those data points for the most recent cohort: Given the spatial separation between regions, I don’t necessarily have to introduce color here. 
But if I want to include some text to lend additional context about what’s going on in each region and what’s driving it, I could introduce color into the graph and then use that same color schematic for my annotations, tying those together visually: Let’s take a quick look at the before-and-after: Any time you create a visual, take a step back and think about what you want to allow your audience to do with the data. What should they be able to most easily compare? The design choices you make—how you visualize and arrange the data—can make those comparisons easy or difficult. Aim to make it easy. The Excel file with the above visuals can be downloaded here. I should perhaps mention a hack I used to achieve this overall layout: each cohort is a single line graph in Excel, where I’ve formatted it so there is no connecting line between the Adopted point for one region and the Targeted point in the following region. (It may be brute force, but it works!) Continue Reading… ### Incident Report: Jupyter services down update: December 14, 20:45 UTC, all services should be restored and back up. On December 13, at 22:10 UTC (4:10pm EST), a large number of Jupyter-provided services stopped responding. This included, but was not limited to https://nbviewer.jupyter.org, https://try.jupyter.org (powered by tmpnb) and https://cdn.jupyter.org. We quickly narrowed this down to an issue with our hosting provider and have been working with them to resolve the issue as fast as possible. When outages happen, the Jupyter Status page should show which services are affected and we publish updates there. ### How are Jupyter services hosted? To understand the cause of the outage, we need to understand how the Jupyter services are hosted and maintained. As Jupyter is an open organization which is mostly maintained by volunteers, we do not have a dev-ops team assigned to maintaining our infrastructure. 
Even with full-time developers hired through universities or companies, the time spent fixing infrastructure is taken from nights and weekends. These developers are often stretched thin and cannot be available 24/7. Most of our cloud infrastructure is donated to us by companies like CloudFlare, Rackspace, Fastly, Google, and Microsoft. Donating resources can be challenging, both technically and legally. In this particular case, Rackspace graciously created a special account for Jupyter that handles invoices on our behalf, thereby making resources free to the project. Following a hiccup, this Jupyter account was suspended and all services became unavailable as a result.

### Temporary resolution

As nbviewer is one of the most used services provided by Jupyter, we’ve moved it to one of our personal accounts at another cloud provider. Fastly was set up to load-balance on the yet-to-come-back-up instances as well as this newly created instance, so all should be fine now. The other services (tmpnb, mails@jupyter.org, cdn.jupyter.org, …) will still be unavailable or highly degraded until a permanent solution is found, or the services are restarted. try.jupyter.org will likely redirect to a repo on https://mybinder.org in the meantime so people can still try out Jupyter.

### Low bus factor

The outage of all these services lasted a significant time (more than 18 hours), which disrupted many of you who rely on these services. We understand that this is hardly acceptable, and we hope you’ll indulge us, as these services are provided for free and without ads. One of the factors leading to the slow reestablishment of service was a relatively low bus factor, with only one and a half of our developers knowing how to deploy and maintain these services. Documentation and access to credentials were also limited. This is one of the challenges in a distributed team like Jupyter where contributors self-organize.
It is easy to forget that new code is not the only way to contribute, and that infrastructure and maintenance are crucial. We also rely too heavily on a single vendor (in this case Rackspace), and while we are happy with Rackspace and have no reason to move to another provider, we should have a plan to restore critical services, even temporarily, in case of failure. A couple of months ago, the subject was brought to our attention, and we developed a plan to move many of our deployments to Kubernetes (which is provider agnostic). We underestimated the probability that we would need an emergency plan this early.

### How can you help

Jupyter is mainly governed by the community all around the world. Contributing is not limited to writing code! We need members with knowledge in multiple languages, in design, dev-ops, etc. Whether you are an expert, or still learning, we would like you to get involved. Thanks everyone for your patience and the kind words when you reached out to us after discovering the services were down. Incident Report: Jupyter services down was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story. Continue Reading…

### Learn from Google Brain, DeepMind, Facebook & other AI experts, KDnuggets offer

RE•WORK interviews leading minds in the field to discuss the impact and progression of AI on business and in society. The complimentary white paper 'Should You Be Using AI In Your Business?' is now available to download. Save 20% on globally renowned AI and Deep Learning summits with code KDNUGGETS. Continue Reading…

### CloudStream: Data Scientist

Seeking a candidate who will work in the Search and Data Science department to help develop and grow data science capabilities. The day-to-day job will involve managing hybrid teams of data scientists, search analysts and behavioral economists.
Continue Reading…

### Harassment in the Statistics field

Statistician Kristian Lum described her experiences with harassment as a graduate student at stat conferences. She held back on talking about it for many of the same reasons others have, but then there was a shift and she began warning colleagues.

> I started doing this because I heard that S (for the second time to my knowledge) had taken advantage of a junior person who had had too much to drink. This time, his act had been witnessed first-hand by several professors at the conference. Since then, I have heard one professor who witnessed the incident openly lament that he’ll have to find a way to delicately advise his female students on “how not to get raped by S” so as not to lose promising students.

What the hell? Unacceptable. Continue Reading…

### How many images do you need to train a neural network?

Photo by Glenn Scott. Today I got an email with a question I’ve heard many times – “How many images do I need to train my classifier?” In the early days I would reply with the technically most correct, but also useless, answer of “it depends”, but over the last couple of years I’ve realized that just having a very approximate rule of thumb is useful, so here it is for posterity: You need 1,000 representative images for each class. Like all models, this rule is wrong but sometimes useful. In the rest of this post I’ll cover where it came from, why it’s wrong, and what it’s still good for. The origin of the 1,000-image magic number comes from the original ImageNet classification challenge, where the dataset had 1,000 categories, each with a bit less than 1,000 images (most I looked at had around seven or eight hundred). This was good enough to train the early generations of image classifiers like AlexNet, and so proves that around 1,000 images is enough. Can you get away with less though?
Anecdotally, based on my experience, you can in some cases, but once you get into the low hundreds it seems to get trickier to train a model from scratch. The biggest exception is when you’re using transfer learning on an already-trained model. Because you’re using a network that has already seen a lot of images and learned to distinguish between the classes, you can usually teach it new classes in the same domain with as few as ten or twenty examples. What does “in the same domain” mean? It’s a lot easier to teach a network that’s been trained on photos of real-world objects (like ImageNet) to recognize other objects, but taking that same network and asking it to categorize completely different types of images like x-rays, faces, or satellite photos is likely to be less successful, and will at least require a lot more training images. Another key point is that “representative” modifier in my rule of thumb. That’s there because the quality of the images is important, not just the quantity. What’s crucial is that the training images are as close as possible to the inputs that the model will see when it’s deployed. When I first tried to run a model trained with ImageNet on a robot I didn’t see great results, and it turned out that was because the robot’s camera had a lot of fisheye distortion, and the objects weren’t well-framed in the viewfinder. ImageNet consists of photos taken from the web, so they’re usually well-framed and without much distortion. Once I retrained my network with images that were taken by the robot itself, the results got a lot better. The same applies to almost any application: a smaller number of training images taken in the same environment where the model will be deployed will produce better end results than a larger number of less representative images. Andreas just reminded me that augmentations are important too. You can augment the training data by randomly cropping, rotating, brightening, or warping the original images.
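Those augmentations are straightforward to sketch with plain NumPy; the helper below applies a random flip, crop, and brightness shift to a single image. The function name, the ~10% crop margin, and the ±30 brightness range are invented, untuned values for illustration, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    """Randomly flip, crop, and brighten one H x W x C uint8 image.
    A toy sketch of the augmentations described above."""
    out = image
    if rng.random() < 0.5:                       # random horizontal flip
        out = out[:, ::-1, :]
    h, w = out.shape[:2]
    dy = int(rng.integers(0, h // 10 + 1))       # crop up to ~10% per side
    dx = int(rng.integers(0, w // 10 + 1))
    out = out[dy:h - dy, dx:w - dx]
    shift = int(rng.integers(-30, 31))           # random brightness shift
    out = np.clip(out.astype(np.int16) + shift, 0, 255)
    return out.astype(np.uint8)

img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
batch = [augment(img, rng) for _ in range(8)]    # 8 distorted copies of one image
```

Each call produces a slightly different distorted copy, which is how a few hundred originals can stand in for a much larger training set.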
TensorFlow for Poets controls this with command-line flags like `flip_left_to_right` and `random_scale`. This effectively increases the size of your training set, and is standard for most ImageNet-style training pipelines. It can also be very useful for helping transfer learning on smaller sets of images. In my experience, distorted copies are not worth quite as much as new original images when it comes to overall accuracy, but if you only have a few images it’s a great way to boost the results, and it will reduce the overall number of images you need. The real answer is to try for yourself, so if you have fewer images than the rule suggests don’t let it stop you, but I hope this rule of thumb will give you a good starting point for planning your approach at least. Continue Reading…

### Magister Dixit

“The Web is so vast … you need to extend categorization and make sense of the content and have a Web ordered for you … One of the key pieces is you have to understand and decide what the Ontology of entities is. Meaning how things are named and how are they organized into hierarchies … By mapping people’s search habits you pull all their content together and have a feed of information that is the web ordered for you.” Marissa Mayer (January 24, 2013) Continue Reading…

### Practical applications of reinforcement learning in industry

An overview of commercial and industrial applications of reinforcement learning. The flurry of headlines surrounding AlphaGo Zero (the most recent version of DeepMind’s AI system for playing Go) means interest in reinforcement learning (RL) is bound to increase. Next to deep learning, RL is among the most followed topics in AI. For most companies, RL is something to investigate and evaluate, but few organizations have identified use cases where RL may play a role. As we enter 2018, I want to briefly describe areas where RL has been applied.
RL is confusingly used to refer to both a set of problems and a set of techniques, so let’s first settle on what RL will mean for the rest of this post. Generally speaking, the goal in RL is learning how to map observations and measurements to a set of actions while trying to maximize some long-term reward. This usually involves applications where an agent interacts with an environment while trying to learn optimal sequences of decisions. In fact, many of the initial applications of RL are in areas where automating sequential decision-making has long been sought. RL poses a different set of challenges from traditional online learning, in that you often have some combination of delayed feedback, sparse rewards, and (most importantly) the agents in question are often able to affect the environments with which they interact. Deep learning as a machine learning technique is beginning to be used by companies on a variety of machine learning applications. RL hasn’t quite found its way into many companies, and my goal is to sketch out some of the areas where applications are appearing. Before I do so, let me start off by listing some of the challenges facing RL in the enterprise. As Andrew Ng noted in his keynote at our AI Conference in San Francisco, RL requires a lot of data, and as such, it has often been associated with domains where simulated data is available (gameplay, robotics). It also isn’t easy to take results from research papers and implement them in applications. Reproducing research results can be challenging even for RL researchers, let alone regular data scientists (see this recent paper and this OpenAI blog post). As machine learning gets deployed in mission-critical situations, reproducibility and the ability to estimate error become essential. So, at least for now, RL may not be ideal for mission-critical applications that require continuous control. The hype notwithstanding, there are already interesting applications and products that rely on RL.
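To make the "map observations to actions, maximizing long-term reward" framing concrete, here is a minimal tabular Q-learning loop on a toy four-state corridor. The environment, reward, and hyperparameters are all invented for illustration; real applications would use function approximation and far more data.

```python
import numpy as np

# Toy 4-state corridor: actions 0 = left, 1 = right; reward 1 at the right end.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.3       # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

def step(s, a):
    """One environment transition: next state, reward, done flag."""
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

for _ in range(200):                     # episodes
    s = 0
    for _ in range(100):                 # cap episode length
        # epsilon-greedy: explore randomly, or when the Q row is still flat
        explore = rng.random() < eps or np.ptp(Q[s]) == 0
        a = int(rng.integers(n_actions)) if explore else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Q-learning update: move Q[s, a] toward reward + discounted best next value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
        if done:
            break

policy = np.argmax(Q, axis=1)            # greedy policy after training
```

After training, the greedy policy moves right toward the reward, which is exactly the "optimal sequence of decisions" the agent was never told directly; note the delayed, sparse reward the paragraph above mentions.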
There are many settings involving personalization, or the automation of well-defined tasks, that would benefit from the sequential decision-making that RL can help automate (or at least, where RL can augment a human expert). The key for companies is to start with simple use cases that fit this profile rather than overly complicated problems that “require AI.” To make things more concrete, let me highlight some of the key application domains where RL is beginning to appear.

## Robotics and industrial automation

Applications of RL in high-dimensional control problems, like robotics, have been the subject of research (in academia and industry), and startups are beginning to use RL to build products for industrial robotics. Industrial automation is another promising area. It appears that RL technologies from DeepMind helped Google significantly reduce energy consumption (HVAC) in its own data centers. Startups have noticed there is a large market for automation solutions. Bonsai is one of several startups building tools to enable companies to use RL and other techniques for industrial applications. A common example is the use of AI for tuning machines and equipment where expert human operators are currently being used. With industrial systems in mind, Bonsai recently listed the following criteria for when RL might be useful to consider:

• You’re using simulations because your system or process is too complex (or too physically hazardous) for teaching machines through trial and error.
• You’re dealing with large state spaces.
• You’re seeking to augment human analysts and domain experts by optimizing operational efficiency and providing decision support.

## Data science and machine learning

Machine learning libraries have gotten easier to use, but choosing a proper model or model architecture can still be challenging for data scientists.
With deep learning becoming a technique used by data scientists and machine learning engineers, tools that can help people identify and tune neural network architectures are active areas of research. Several research groups have proposed using RL to make the process of designing neural network architectures more accessible (MetaQNN from MIT and Net2Net operations). AutoML from Google uses RL to produce state-of-the-art machine-generated neural network architectures for computer vision and language modeling. Looking beyond tools that simplify the creation of machine learning models, there are some who think that RL will prove useful in helping software engineers write computer programs.

## Education and training

Online platforms are beginning to experiment with using machine learning to create personalized experiences. Several researchers are investigating the use of RL and other machine learning methods in tutoring systems and personalized learning. The use of RL can lead to training systems that provide custom instruction and materials tuned to the needs of individual students. A group of researchers is developing RL algorithms and statistical methods that require less data for use in future tutoring systems.

## Health and medicine

The RL setup of an agent interacting with an environment and receiving feedback based on actions taken shares similarities with the problem of learning treatment policies in the medical sciences. In fact, many RL applications in health care mostly pertain to finding optimal treatment policies. Recent papers cite applications of RL to the usage of medical equipment, medication dosing, and two-stage clinical trials.

## Text, speech, and dialog systems

Companies collect a lot of text, and good tools that can help unlock unstructured text will find users.
Earlier this year, AI researchers at Salesforce used deep RL for abstractive text summarization (a technique for automatically generating summaries from text, based on content “abstracted” from some original text document). This could be an area where RL-based tools gain new users, as many companies are in need of better text mining solutions. RL is also being used to allow dialog systems (i.e., chatbots) to learn from user interactions and thus improve over time (many enterprise chatbots currently rely on decision trees). This is an active area of research and VC investment: see Semantic Machines and VocalIQ (acquired by Apple).

## Media and advertising

Microsoft recently described an internal system called Decision Service that has since been made available on Azure. This paper describes applications of Decision Service to content recommendation and advertising. Decision Service more generally targets machine learning products that suffer from failure modes including “feedback loops and bias, distributed data collection, changes in the environment, and weak monitoring and debugging.” Other applications of RL include cross-channel marketing optimization and real-time bidding systems for online display advertising.

## Finance

Having started my career as a lead quant in a hedge fund, it didn’t surprise me that few finance companies are willing to talk on record. Generally speaking, I came across quants and traders who were evaluating deep learning and RL but hadn’t found sufficient reason to use the tools beyond small pilots. While potential applications in finance are described in research papers, few companies describe software in production. One exception is a system used for trade execution at JPMorgan Chase: a Financial Times article described an RL-based system for optimal trade execution. The system (dubbed “LOXM”) is being used to execute trading orders at maximum speed and at the best possible price.
As with any new technique or technology, the key to using RL is to understand its strengths and weaknesses, and then find simple use cases on which to try it. Resist the hype around AI; rather, consider RL a useful machine learning technique, albeit one that is best suited for a specific class of problems. We are just beginning to see RL in enterprise applications. Along with continued research into algorithms, many software tools (libraries, simulators, distributed computation frameworks like Ray, SaaS) are beginning to appear. But it’s fair to say that few of these tools come with examples aimed at users interested in industry applications. There are, however, already a few startups incorporating RL into their products. So, before you know it, you might soon be benefiting from developments in RL and related techniques. Related resources: Continue reading Practical applications of reinforcement learning in industry.

Continue Reading…

### Getting started with seplyr

A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data, I think the seplyr package can be a valuable tool. For how and why, please check out our new introductory article. Note: now that wrapr version 1.0.2 is up on CRAN, all of the examples can be re-written without quotes using the qae() operator (“quote assignment expression”). For example:

```r
library("seplyr")
#> Loading required package: wrapr

packageVersion("wrapr")
#> [1] '1.0.2'

plan <- partition_mutate_se(
  qae(name   := tolower(name),
      height := height + 0.5,
      height := floor(height),
      mass   := mass + 0.5,
      mass   := floor(mass)))
print(plan)
#> $group00001
#>            name          height            mass
#> "tolower(name)"  "height + 0.5"    "mass + 0.5"
#>
#> $group00002
#>          height            mass
#> "floor(height)"   "floor(mass)"
```

Continue Reading…

### Pipes in R Tutorial For Beginners

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

You might have already seen or used the pipe operator when you're working with packages such as dplyr, magrittr,… But do you know where pipes and the famous %>% operator come from, what they exactly are, or how, when and why you should use them? Can you also come up with some alternatives? This tutorial will give you an introduction to pipes in R. Are you interested in learning more about manipulating data in R with dplyr? Take a look at DataCamp's Data Manipulation in R with dplyr course.

## Pipe Operator in R: Introduction

To understand what the pipe operator in R is and what you can do with it, it's necessary to consider the full picture and learn the history behind it. Questions such as "where does this weird combination of symbols come from and why was it made like this?" might be on top of your mind. You'll discover the answers to these and more questions in this section. You can look at the history from three perspectives: from a mathematical point of view, from a holistic point of view of programming languages, and from the point of view of the R language itself. You'll cover all three in what follows!

### History of the Pipe Operator in R

#### Mathematical History

If you have two functions, let's say $f : B → C$ and $g : A → B$, you can chain these functions together by taking the output of one function and inserting it into the next. In short, "chaining" means that you pass an intermediate result onto the next function, but you'll see more about that later. For example, you can say $f(g(x))$: $g(x)$ serves as an input for $f()$, while $x$, of course, serves as input to $g()$. If you want to note this down, you use the notation $f ◦ g$, which reads as "f follows g".
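This composition can be sketched directly in R. The compose() helper below is illustrative only (it is not a base R function), but it shows exactly what $f ◦ g$ means:

```r
# A hypothetical compose() helper: compose(f, g) builds "f follows g",
# i.e. the function that maps x to f(g(x))
compose <- function(f, g) function(x) f(g(x))

h <- compose(sqrt, abs)   # h(x) = sqrt(abs(x))
h(-16)
# [1] 4
```

A pipeline such as x %>% g %>% f computes the same value as compose(f, g)(x); the pipe simply writes the composition in reading order.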
Alternatively, you can represent this visually (Image Credit: James Balamuta, "Piping Data").

#### Pipe Operators in Other Programming Languages

As mentioned in the introduction to this section, this operator is not new in programming: in the Shell or Terminal, you can pass the output of one command to the next with the pipe character |. Similarly, F# has a forward pipe operator, which will prove to be important later on! Lastly, it's also good to know that Haskell contains many piping operations that are derived from the Shell or Terminal.

#### Pipes in R

Now that you have seen some history of the pipe operator in other programming languages, it's time to focus on R. The history of this operator in R starts, according to this fantastic blog post written by Adolfo Álvarez, on January 17th, 2012, when an anonymous user asked the following question in this Stack Overflow post:

How can you implement F#'s forward pipe operator in R? The operator makes it possible to easily chain a sequence of calculations. For example, when you have an input data and want to call functions foo and bar in sequence, you can write data |> foo |> bar.

The answer came from Ben Bolker, professor at McMaster University, who replied:

I don't know how well it would hold up to any real use, but this seems (?) to do what you want, at least for single-argument functions …

```r
"%>%" <- function(x, f) do.call(f, list(x))

pi %>% sin
# [1] 1.224647e-16
pi %>% sin %>% cos
# [1] 1
cos(sin(pi))
# [1] 1
```

About nine months later, Hadley Wickham started the dplyr package on GitHub. You might now know Hadley, Chief Scientist at RStudio, as the author of many popular R packages (such as this last package!) and as the instructor for DataCamp's Writing Functions in R course. Be that as it may, it wasn't until 2013 that the first pipe, %.%, appeared in this package.
As Adolfo Álvarez rightfully mentions in his blog post, the function was named chain(), and its purpose was to simplify the notation for applying several functions to a single data frame in R. The %.% pipe would not be around for long, as Stefan Bache proposed an alternative on the 29th of December 2013 that included the operator as you might now know it:

```r
iris %>%
  subset(Sepal.Length > 5) %>%
  aggregate(. ~ Species, ., mean)
```

Bache continued to work with this pipe operation, and at the end of 2013, the magrittr package came into being. In the meantime, Hadley Wickham continued to work on dplyr, and in April 2014, the %.% operator got replaced with the one that you now know, %>%. Later that year, Kun Ren published the pipeR package on GitHub, which incorporated a different pipe operator, %>>%, designed to add more flexibility to the piping process. However, it's safe to say that %>% is now established in the R language, especially with the recent popularity of the Tidyverse.

### What Is It?

Knowing the history is one thing, but that still doesn't give you an idea of what F#'s forward pipe operator is, nor what it actually does in R. In F#, the pipe-forward operator |> is syntactic sugar for chained method calls. Or, stated more simply, it lets you pass an intermediate result onto the next function. Remember that "chaining" means that you invoke multiple method calls. As each method returns an object, you can allow the calls to be chained together in a single statement, without needing variables to store the intermediate results. In R, the pipe operator is, as you have already seen, %>%. If you're not familiar with F#, you can think of this operator as being similar to the + in a ggplot2 statement. Its function is very similar to that of the F# operator: it takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a "THEN".
Take, for example, the following code chunk and read it aloud:

```r
iris %>%
  subset(Sepal.Length > 5) %>%
  aggregate(. ~ Species, ., mean)
```

You're right: the code chunk above translates to something like "you take the iris data, then you subset the data, and then you aggregate the data". This is one of the most powerful things about the Tidyverse. In fact, having a standardized chain of processing actions is called "a pipeline". Making pipelines for a data format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in a ggplot2-friendly format, for example.

### Why Use It?

R is a functional language, which means that your code often contains a lot of parentheses, ( and ). When you have complex code, this often means that you will have to nest those parentheses together. This makes your R code hard to read and understand. Here's where %>% comes to the rescue! Take a look at the following example, which is a typical example of nested code:

```r
# Initialize x
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Compute the logarithm of x, return suitably lagged and iterated differences,
# compute the exponential function and round the result
round(exp(diff(log(x))), 1)
# [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1
```

With the help of %>%, you can rewrite the above code as follows:

```r
# Import magrittr
library(magrittr)

# Perform the same computations on x as above
x %>% log() %>% diff() %>% exp() %>% round(1)
```

Does this seem difficult to you? No worries! You'll learn more about how to go about this later on in this tutorial. Note that you need to import the magrittr library to get the above code to work. That's because the pipe operator is, as you read above, part of the magrittr library and is, since 2014, also a part of dplyr.
If you forget to import the library, you'll get an error like Error in eval(expr, envir, enclos): could not find function "%>%". Also note that it isn't a formal requirement to add the parentheses after log, diff and exp, but, within the R community, some will use them to increase the readability of the code. In short, here are four reasons why you should be using pipes in R:

• You'll structure the sequence of your data operations from left to right, as opposed to from inside and out;
• You'll avoid nested function calls;
• You'll minimize the need for local variables and function definitions; And
• You'll make it easy to add steps anywhere in the sequence of operations.

These reasons are taken from the magrittr documentation itself. Implicitly, you see the arguments of readability and flexibility returning.

### Additional Pipes

Even though %>% is the (main) pipe operator of the magrittr package, there are a couple of other operators that you should know and that are part of the same package:

• The compound assignment operator %<>%;

```r
# Initialize x
x <- rnorm(100)

# Update value of x and assign it to x
x %<>% abs %>% sort
```

• The tee operator %T>%;

```r
rnorm(200) %>%
  matrix(ncol = 2) %T>%
  plot %>%
  colSums
```

Note that it's good to know for now that the above code chunk is actually a shortcut for:

```r
rnorm(200) %>%
  matrix(ncol = 2) %T>%
  { plot(.); . } %>%
  colSums
```

But you'll see more about that later on!

• The exposition pipe operator %$%.

```r
data.frame(z = rnorm(100)) %$%
  ts.plot(z)
```

Of course, these three operators work slightly differently than the main %>% operator. You'll see more about their functionalities and their usage later on in this tutorial! Note that, even though you'll most often see the magrittr pipes, you might also encounter other pipes as you go along! Some examples are wrapr's dot arrow pipe %.>%, its "to dot" pipe %>.%, or the Bizarro pipe ->.;.
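Of these, the Bizarro pipe deserves a quick aside, because it is not really an operator at all. The sketch below shows that it is plain base R: each step right-assigns its result into a throwaway variable literally named `.`, and the steps are chained with semicolons:

```r
# The "Bizarro pipe": right-assignment to a variable named `.`,
# which the next step then reads
c(1, 4, 9) ->.; sqrt(.) ->.; sum(.)
# [1] 6
```

A side effect of this trick is that a variable called `.` is left behind in your workspace, something the real magrittr pipe avoids.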
## How to Use Pipes in R

Now that you know how the %>% operator originated, what it actually is and why you should use it, it's time for you to discover how you can actually use it to your advantage. You will see that there are quite a few ways in which you can use it!

### Basic Piping

Before you go into the more advanced usages of the operator, it's good to first take a look at the most basic examples that use the operator. In essence, you'll see that there are 3 rules that you can follow when you're first starting out:

f(x) can be rewritten as x %>% f

In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function(). Take a look at the following, more practical example to understand how these two are equivalent:

```r
# Compute the logarithm of x
log(x)

# Compute the logarithm of x
x %>% log()
```

f(x, y) can be rewritten as x %>% f(y)

Of course, there are a lot of functions that don't just take one argument, but multiple. This is the case here: you see that the function takes two arguments, x and y. Similar to what you have seen in the first example, you can rewrite the function by following the structure argument1 %>% function(argument2), where argument1 is piped in as the first argument and argument2 stays inside the function call. This all seems quite theoretical.
Let's take a look at a more practical example:

```r
# Round pi
round(pi, 6)

# Round pi
pi %>% round(6)
```

x %>% f %>% g %>% h can be rewritten as h(g(f(x)))

This might seem complex, but it isn't quite like that when you look at a real-life R example:

```r
# Import babynames data
library(babynames)
# Import dplyr library
library(dplyr)

# Load the data
data(babynames)

# Count how many young boys with the name "Taylor" are born
sum(select(filter(babynames, sex == "M", name == "Taylor"), n))

# Do the same but now with %>%
babynames %>%
  filter(sex == "M", name == "Taylor") %>%
  select(n) %>%
  sum
```

Note how you work from the inside out when you rewrite the nested code: you first put in the babynames, then you use %>% to first filter() the data. After that, you'll select n and lastly, you'll sum() everything. Remember also that you already saw another example of such nested code converted to more readable code at the beginning of this tutorial, where you used the log(), diff(), exp() and round() functions to perform calculations on x.

### Functions that Use the Current Environment

Unfortunately, there are some exceptions to the more general rules outlined in the previous section. Let's take a look at some of them here. Consider this example, where you use the assign() function to assign the value 10 to the variable x.

```r
# Assign 10 to x
assign("x", 10)

# Assign 100 to x
"x" %>% assign(100)

# Return x
x
# [1] 10
```

You see that the second call with the assign() function, in combination with the pipe, doesn't work properly. The value of x is not updated. Why is this? That's because the function assigns the new value 100 to a temporary environment used by %>%.
So, if you want to use assign() with the pipe, you must be explicit about the environment:

```r
# Define your environment
env <- environment()

# Add the environment to assign()
"x" %>% assign(100, envir = env)

# Return x
x
# [1] 100
```

### Functions with Lazy Evaluation

Arguments within functions are only computed when the function uses them in R. This means that no arguments are computed before you call your function! That also means that the pipe computes each element of the function in turn. One place where this is a problem is tryCatch(), which lets you capture and handle errors, as in this example:

```r
tryCatch(stop("!"), error = function(e) "An error")
# [1] "An error"

stop("!") %>% tryCatch(error = function(e) "An error")
# Error in eval(expr, envir, enclos): !
# Traceback:
# 1. stop("!") %>% tryCatch(error = function(e) "An error")
# 2. eval(lhs, parent, parent)
# 3. eval(expr, envir, enclos)
# 4. stop("!")
```

You'll see that the nested way of writing down this line of code works perfectly, while the piped alternative returns an error. Other functions with the same behavior are try(), suppressMessages(), and suppressWarnings() in base R.

### Argument Placeholder

There are also instances where you can use the pipe operator as an argument placeholder. Take a look at the following examples:

f(x, y) can be rewritten as y %>% f(x, .)

In some cases, you won't want the piped value to land in the function call at the first position, which has been the case in every example that you have seen up until now. Reconsider this line of code:

```r
pi %>% round(6)
```

If you rewrite this line of code, pi will be the first argument in your round() function. But what if you want to replace the second, third, … argument and use that one as the magrittr placeholder in your function call? Take a look at this example, where the value is actually at the third position in the function call:

```r
"Ceci n'est pas une pipe" %>% gsub("une", "un", .)
```
This outputs 'Ceci n\'est pas un pipe'.

f(y, z = x) can be rewritten as x %>% f(y, z = .)

Likewise, you might want to make the value of a specific argument within your function call the magrittr placeholder. Consider the following line of code:

```r
6 %>% round(pi, digits = .)
```

### Re-using the Placeholder for Attributes

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in nested expressions, magrittr will still apply the first-argument rule. The reason is that in most cases this produces cleaner code. Here are some general "rules" that you can take into account when you're working with argument placeholders in nested function calls:

f(x, y = nrow(x), z = ncol(x)) can be rewritten as x %>% f(y = nrow(.), z = ncol(.))

```r
# Initialize a matrix ma
ma <- matrix(1:12, 3, 4)

# Return the maximum of the values inputted
max(ma, nrow(ma), ncol(ma))
# [1] 12

# Return the maximum of the values inputted
ma %>% max(nrow(.), ncol(.))
# [1] 12
```

The behavior can be overruled by enclosing the right-hand side in braces:

f(y = nrow(x), z = ncol(x)) can be rewritten as x %>% {f(y = nrow(.), z = ncol(.))}

```r
# Only return the maximum of the nrow(ma) and ncol(ma) input values
ma %>% {max(nrow(.), ncol(.))}
# [1] 4
```

To conclude, also take a look at the following example, where you might want to adjust the workings of the argument placeholder in the nested function call:

```r
# The function that you want to rewrite
paste(1:5, letters[1:5])

# The nested function call with dot placeholder
1:5 %>% paste(., letters[.])
# [1] "1 a" "2 b" "3 c" "4 d" "5 e"
```

You see that if the placeholder is only used in a nested function call, the magrittr placeholder will also be placed as the first argument!
If you want to avoid this from happening, you can use the curly brackets { and }:

```r
# The nested function call with dot placeholder and curly brackets
1:5 %>% { paste(letters[.]) }
# [1] "a" "b" "c" "d" "e"

# Rewrite the above function call
paste(letters[1:5])
# [1] "a" "b" "c" "d" "e"
```

### Building Unary Functions

Unary functions are functions that take one argument. Any pipeline that you might make that consists of a dot ., followed by functions and chained together with %>%, can be used later if you want to apply it to values. Take a look at the following example of such a pipeline:

```r
. %>% cos %>% sin
```

This pipeline would take some input, after which both the cos() and sin() functions would be applied to it. But you're not there yet! If you want this pipeline to do exactly what you have just read, you need to assign it first to a variable, f for example. After that, you can re-use it later to perform the operations that are contained within the pipeline on other values.

```r
# Unary function
f <- . %>% cos %>% sin

f
# structure(function (value) freduce(value, `_function_list`),
#   class = c("fseq", "function"))
```

Remember also that you could put parentheses after the cos() and sin() functions in the line of code if you want to improve readability. Consider the same example with parentheses: . %>% cos() %>% sin(). You see, building functions in magrittr is very similar to building functions with base R! If you're not sure how similar they actually are, check out the line above and compare it with the next line of code; both lines have the same result!

```r
# is equivalent to
f <- function(.) sin(cos(.))

f
# function (.) sin(cos(.))
```

### Compound Assignment Pipe Operations

There are situations where you want to overwrite the value of the left-hand side, just like in the example right below. Intuitively, you will use the assignment operator <- to do this.
```r
# Load in the Iris data
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"),
                 header = FALSE)

# Add column names to the Iris data
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length <- iris$Sepal.Length %>% sqrt()
```

However, there is a compound assignment pipe operator, which allows you to use a shorthand notation to assign the result of your pipeline immediately to the left-hand side:

```r
# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length %<>% sqrt

# Return Sepal.Length
iris$Sepal.Length
```

Note that the compound assignment operator %<>% needs to be the first pipe operator in the chain for this to work. This is completely in line with what you just read about the operator being a shorthand notation for a longer notation with repetition, where you use the regular <- assignment operator. As a result, this operator will assign the result of a pipeline rather than returning it.

### Tee Operations with the Tee Operator

The tee operator works exactly like %>%, but it returns the left-hand side value rather than the potential result of the right-hand side operations. This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file. In other words, functions like plot() typically don't return anything. That means that, after calling plot(), for example, your pipeline would end. However, in the following example, the tee operator %T>% allows you to continue your pipeline even after you have used plot():

```r
set.seed(123)
rnorm(200) %>%
  matrix(ncol = 2) %T>%
  plot %>%
  colSums
```

### Exposing Data Variables with the Exposition Operator

When you're working with R, you'll find that many functions take a data argument.
Consider, for example, the lm() function or the with() function. These functions are useful in a pipeline where your data is first processed and then passed into the function. For functions that don't have a data argument, such as the cor() function, it's still handy if you can expose the variables in the data. That's where the %$% operator comes in. Consider the following example:

```r
iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)
# [1] 0.3366969
```

With the help of %$%, you make sure that Sepal.Length and Sepal.Width are exposed to cor(). Likewise, you see that the data in the data.frame() function is passed to ts.plot() to plot several time series on a common plot:

```r
data.frame(z = rnorm(100)) %$%
  ts.plot(z)
```

## dplyr and magrittr

In the introduction to this tutorial, you already learned that the development of dplyr and magrittr occurred around the same time, namely around 2013-2014. And, as you have read, the magrittr package is also part of the Tidyverse. In this section, you will discover how exciting it can be when you combine both packages in your R code. For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely "select", "filter", "arrange", "mutate" and "summarize". If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.
Take an example of some traditional code that makes use of these dplyr functions:

```r
library(hflights)

grouped_flights <- group_by(hflights, Year, Month, DayofMonth)
flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay)
summarized_flights <- summarise(flights_data,
                                arr = mean(ArrDelay, na.rm = TRUE),
                                dep = mean(DepDelay, na.rm = TRUE))
final_result <- filter(summarized_flights, arr > 30 | dep > 30)

final_result
```

| Year | Month | DayofMonth | arr      | dep      |
|------|-------|------------|----------|----------|
| 2011 | 2     | 4          | 44.08088 | 47.17216 |
| 2011 | 3     | 3          | 35.12898 | 38.20064 |
| 2011 | 3     | 14         | 46.63830 | 36.13657 |
| 2011 | 4     | 4          | 38.71651 | 27.94915 |
| 2011 | 4     | 25         | 37.79845 | 22.25574 |
| 2011 | 5     | 12         | 69.52046 | 64.52039 |
| 2011 | 5     | 20         | 37.02857 | 26.55090 |
| 2011 | 6     | 22         | 65.51852 | 62.30979 |
| 2011 | 7     | 29         | 29.55755 | 31.86944 |
| 2011 | 9     | 29         | 39.19649 | 32.49528 |
| 2011 | 10    | 9          | 61.90172 | 59.52586 |
| 2011 | 11    | 15         | 43.68134 | 39.23333 |
| 2011 | 12    | 29         | 26.30096 | 30.78855 |
| 2011 | 12    | 31         | 46.48465 | 54.17137 |

When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:

```r
hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
  summarise(arr = mean(ArrDelay, na.rm = TRUE),
            dep = mean(DepDelay, na.rm = TRUE)) %>%
  filter(arr > 30 | dep > 30)
```

Both code chunks are fairly long, but you could argue that the second code chunk is clearer if you want to follow along through all of the operations. With the creation of intermediate variables in the first code chunk, you could possibly lose the "flow" of the code. By using %>%, you gain a clearer overview of the operations being performed on the data! In short, dplyr and magrittr are your dream team for manipulating data in R!

## RStudio Keyboard Shortcuts for Pipes

Adding all these pipes to your R code can be a challenging task! To make your life easier, John Mount, co-founder and Principal Consultant at Win-Vector, LLC and DataCamp instructor, has released a package with some RStudio add-ins that allow you to create keyboard shortcuts for pipes in R.
Addins are actually R functions with a bit of special registration metadata. An example of a simple addin can, for example, be a function that inserts a commonly used snippet of text, but addins can also get very complex! With these addins, you'll be able to execute R functions interactively from within the RStudio IDE, either by using keyboard shortcuts or by going through the Addins menu. Note that this package is actually a fork of RStudio's original add-in package, which you can find here. Be careful though: support for addins is available only within the most recent release of RStudio! If you want to know more about how you can install these RStudio addins, check out this page. You can download the add-ins and keyboard shortcuts here.

## When Not To Use the Pipe Operator in R

In the above, you have seen that pipes are definitely something that you should be using when you're programming with R. More specifically, you have seen this by covering some cases in which pipes prove to be very useful! However, there are some situations, outlined by Hadley Wickham in "R for Data Science", in which you are best off avoiding them:

• Your pipes are longer than (say) ten steps. In cases like these, it's better to create intermediate objects with meaningful names. It will not only be easier for you to debug your code, but you'll also understand your code better, and it'll be easier for others to understand your code.
• You have multiple inputs or outputs. If you aren't transforming one primary object, but two or more objects are combined together, it's better not to use the pipe.
• You are starting to think about a directed graph with a complex dependency structure. Pipes are fundamentally linear, and expressing complex relationships with them will only result in complex code that is hard to read and understand.
• You're doing internal package development. Using pipes in internal package development is a no-go, as it makes the code harder to debug!
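To make the first point concrete, here is the pipeline from earlier in this tutorial broken into intermediate objects with meaningful names (the variable names are purely illustrative):

```r
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Instead of x %>% log() %>% diff() %>% exp() %>% round(1), name each step:
logged        <- log(x)             # log of each value
differences   <- diff(logged)       # lagged differences of the logs
growth_ratios <- exp(differences)   # back-transform: element-wise ratios
round(growth_ratios, 1)
# [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1
```

Each intermediate can now be inspected or tested on its own, which is exactly what you want once a pipeline grows past roughly ten steps.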
For more reflections on this topic, check out this Stack Overflow discussion. Other situations that appear in that discussion are loops, package dependencies, argument order and readability. In short, you could summarize it all as follows: keep in mind the two things that make this construct so great, namely readability and flexibility. As soon as one of these two big advantages is compromised, you might consider some alternatives to the pipes.

## Alternatives to Pipes in R

After all that you have read, you might also be interested in some alternatives that exist in the R programming language. Some of the solutions that you have seen in this tutorial were the following:

• Create intermediate variables with meaningful names. Instead of chaining all operations together and outputting one single result, break up the chain and make sure you save intermediate results in separate variables. Be careful with the naming of these variables: the goal should always be to make your code as understandable as possible!
• Nest your code so that you read it from the inside out. One of the possible objections that you could have against pipes is that they go against the "flow" that you have been accustomed to with base R. The solution is then to stick with nesting your code! But what to do if you don't like pipes but also think nesting can be quite confusing? The solution here can be to use tabs to highlight the hierarchy.
• …

Do you have more suggestions? Make sure to let me know. Drop me a tweet @willems_karlijn

## Conclusion

You have covered a lot of ground in this tutorial: you have seen where %>% comes from, what it exactly is, why you should use it and how you should use it. You've seen that the dplyr and magrittr packages work wonderfully together and that there are even more operators out there! Lastly, you have also seen some cases in which you shouldn't use it when you're programming in R, and what alternatives you can use in such cases.
If you're interested in learning more about the Tidyverse, consider DataCamp's Introduction to the Tidyverse course. To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

Continue Reading…

### CSjob: Multimedia / Research Scientist or Principal Research Scientist - Signal Processing, MERL, Massachusetts, USA

Petros just sent me the following:

Dear Igor, I hope you are doing well. We are excited to have a new opening in the Computational Sensing Team at MERL. I would appreciate it if you can post this on your blog, or otherwise disseminate as you see fit, and encourage anyone you think might be a good candidate to apply. Posting and application link is also here: http://www.merl.com/employment/employment.php#MM29 Thanks! Petros

Sure Petros, here is the job ad:

MM29 - Multimedia / Research Scientist or Principal Research Scientist - Signal Processing

MERL's Computational Sensing Team is seeking an exceptional researcher in the area of signal processing, with particular emphasis on signal acquisition and active sensing technologies. Applicants are expected to hold a Ph.D. degree in Electrical Engineering, Computer Science, or a closely related field. The successful candidate will have an extensive signal processing background and familiarity with related techniques, such as compressive sensing and convex optimization. Specific experience with wave propagation or PDE-constrained inverse problems, or with signal acquisition via ultrasonic, radio, optical or other sensing or imaging modalities, is a plus.
Applicants must have a strong publication record in any of these or related areas, demonstrating novel research achievements. As a member of our team, the successful candidate will conduct original research that aims to advance state-of-the-art solutions in the field, with opportunities to work on both fundamental and application-motivated problems. Your work will involve initiating new projects with long-term research goals and leading research efforts. MERL is one of the most academically-oriented industrial research labs in the world, and the ideal environment to thrive as a leader in signal processing. MERL strongly supports, encourages, and values academic activities such as publishing and presenting research results at top conferences, collaborating with university professors and students, organizing workshops and challenges, and generally maintaining an influential presence in the scientific community. Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there ! Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin. Continue Reading… ### An introduction to seplyr by John Mount, Win-Vector LLC seplyr is an R package that supplies improved standard evaluation interfaces for many common data wrangling tasks. The core of seplyr is a re-skinning of dplyr's functionality to seplyr conventions (similar to how stringr re-skins the implementing package stringi). ## Standard Evaluation and Non-Standard Evaluation "Standard evaluation" is the name we are using for the value oriented calling convention found in many programming languages. 
The idea is: functions are only allowed to look at the values of their arguments and not how those values arise (i.e., they cannot look at source code or variable names). This evaluation principle allows one to transform, optimize, and reason about code. It is what lets us say the following two snippets of code are equivalent. • x <- 4; sqrt(x) • x <- 4; sqrt(4) The mantra is: "variables can be replaced with their values," a property known as referential transparency. "Non-standard evaluation" is the name used for code that more aggressively inspects its environment. It is often used for harmless tasks such as conveniently setting axis labels on plots. For example, notice the following two plots have different y-axis labels (despite plotting identical values). plot(x = 1:3)  plot(x = c(1,2,3))  ## dplyr and seplyr The dplyr authors appear to strongly prefer a non-standard evaluation interface. Many in the dplyr community have come to think a package such as dplyr requires a non-standard interface. seplyr started as an experiment to show this is not actually the case. Syntactically the packages are deliberately similar. We can take a dplyr pipeline: suppressPackageStartupMessages(library("dplyr")) starwars %>% select(name, height, mass) %>% arrange(desc(height)) %>% head() ## # A tibble: 6 x 3 ## name height mass ## <chr> <int> <dbl> ## 1 Yarael Poof 264 NA ## 2 Tarfful 234 136 ## 3 Lama Su 229 88 ## 4 Chewbacca 228 112 ## 5 Roos Tarpals 224 82 ## 6 Grievous 216 159  And re-write it in seplyr notation: library("seplyr") starwars %.>% select_se(., c("name", "height", "mass")) %.>% arrange_se(., "desc(height)") %.>% head(.) 
## # A tibble: 6 x 3 ## name height mass ## <chr> <int> <dbl> ## 1 Yarael Poof 264 NA ## 2 Tarfful 234 136 ## 3 Lama Su 229 88 ## 4 Chewbacca 228 112 ## 5 Roos Tarpals 224 82 ## 6 Grievous 216 159  For the common dplyr-verbs (excluding mutate(), which we will discuss next) all the non-standard evaluation is saving us is a few quote marks and array designations (and we have ways of getting rid of the need for quote marks). In exchange for this small benefit the non-standard evaluation is needlessly hard to program over. For instance in the seplyr pipeline it is easy to accept the list of columns from an outside source as a simple array of names. Until you introduce a substitution system such as rlang or wrapr::let() (which we recommend over rlang and which publicly pre-dates the public release of rlang) you have some difficulty writing re-usable programs that use the dplyr verbs over "to be specified later" column names. We are presumably not the only ones who considered this a limitation: seplyr is an attempt to make programming a primary concern by making the value-oriented (standard) interfaces the primary interfaces. ## mutate() The earlier "standard evaluation costs just a few quotes" argument becomes a bit strained when we talk about the dplyr::mutate() operator. It doesn't seem worth the effort unless you get something more in return. In seplyr 0.5.0 we introduced "the something more": planning over and optimizing dplyr::mutate() sequences. A seplyr mutate looks like the following: starwars %.>% select_se(., c("name", "height", "mass")) %.>% mutate_se(., c( "height" := "height + 1", "mass" := "mass + 1", "height" := "height + 2", "mass" := "mass + 2", "height" := "height + 3", "mass" := "mass + 3" )) %.>% arrange_se(., "name") %.>% head(.) 
## # A tibble: 6 x 3 ## name height mass ## <chr> <dbl> <dbl> ## 1 Ackbar 186 89 ## 2 Adi Gallia 190 56 ## 3 Anakin Skywalker 194 90 ## 4 Arvel Crynyd NA NA ## 5 Ayla Secura 184 61 ## 6 Bail Prestor Organa 197 NA  seplyr::mutate_se() always uses ":=" to denote assignment (dplyr::mutate() prefers "=" for assignment, except in cases where ":=" is required). The advantage is: once we go to the trouble to capture the mutate expressions we can treat them as data and apply procedures to them. For example we can re-group and optimize the mutate assignments. plan <- partition_mutate_se( c("name" := "tolower(name)", "height" := "height + 0.5", "height" := "floor(height)", "mass" := "mass + 0.5", "mass" := "floor(mass)")) print(plan) ## $group00001
##            name          height            mass
## "tolower(name)"  "height + 0.5"    "mass + 0.5"
##
## $group00002 ## height mass ## "floor(height)" "floor(mass)"  Notice seplyr::partition_mutate_se() re-ordered and re-grouped the assignments so that: • In each group each value used is independent of values produced in other assignments. • All dependencies between assignments are respected by the group order. The "safe block" assignments can then be used in a pipeline: starwars %.>% select_se(., c("name", "height", "mass")) %.>% mutate_seb(., plan) %.>% arrange_se(., "name") %.>% head(.) ## # A tibble: 6 x 3 ## name height mass ## <chr> <dbl> <dbl> ## 1 ackbar 180 83 ## 2 adi gallia 184 50 ## 3 anakin skywalker 188 84 ## 4 arvel crynyd NA NA ## 5 ayla secura 178 55 ## 6 bail prestor organa 191 NA  This may not seem like much. However, when using dplyr with a SQL database (such as PostgreSQL or even Sparklyr) keeping the number of dependencies in a block low is critical for correct calculation (which is why I recommend keeping dependencies low). Furthermore, on Sparklyr sequences of mutates are simulated by nesting of SQL statements, so you must also keep the number of mutates at a moderate level (i.e., you want a minimal number of blocks or groups). ## Machine Generated Code Because we are representing mutate assignments as user manipulable data we can also enjoy the benefit of machine generated code. seplyr 0.5.* uses this opportunity to introduce a simple function named if_else_device(). This device uses R's ifelse() statement (which conditionally chooses values in a vectorized form) to implement a more powerful block-if/else statement (which conditionally simultaneously controls blocks of values and assignments; SAS has such a feature). For example: suppose we want to NA-out one of height or mass for each row of the starwars data uniformly at random. This can be written naturally using the if_else_device. 
if_else_device( testexpr = "runif(n())>=0.5", thenexprs = "height" := "NA", elseexprs = "mass" := "NA") ## ifebtest_30etsitqqutk ## "runif(n())>=0.5" ## height ## "ifelse( ifebtest_30etsitqqutk, NA, height)" ## mass ## "ifelse( !( ifebtest_30etsitqqutk ), NA, mass)"  Notice the if_else_device translates the user code into a sequence of dplyr::mutate() expressions (using only the weaker operator ifelse()). Obviously the user could perform this translation, but if_else_device automates the record keeping and can even be nested. Also many such steps can be chained together and broken into a minimal sequence of blocks by partition_mutate_se() (not forcing a new dplyr::mutate() step for each if-block encountered). When we combine the device with the partitioner we get performant database-safe code where the number of blocks is only the level of variable dependence (and not the possibly much larger number of initial value uses that a straightforward non-reordering split would give; note: seplyr::mutate_se() 0.5.1 and later incorporate the partition_mutate_se() in mutate_se()). starwars %.>% select_se(., c("name", "height", "mass")) %.>% mutate_se(., if_else_device( testexpr = "runif(n())>=0.5", thenexprs = "height" := "NA", elseexprs = "mass" := "NA")) %.>% arrange_se(., "name") %.>% head(.) ## # A tibble: 6 x 4 ## name height mass ifebtest_wwr9k0bq4v04 ## <chr> <int> <dbl> <lgl> ## 1 Ackbar NA 83 TRUE ## 2 Adi Gallia 184 NA FALSE ## 3 Anakin Skywalker NA 84 TRUE ## 4 Arvel Crynyd NA NA TRUE ## 5 Ayla Secura 178 NA FALSE ## 6 Bail Prestor Organa 191 NA FALSE  ## Conclusion The value oriented notation is a bit clunkier, but this is offset by its greater flexibility in terms of composition and working parametrically. Our group has been using seplyr::if_else_device() and seplyr::partition_mutate_se() to greatly simplify porting powerful SAS procedures to R/Sparklyr/Apache Spark clusters. 
This is new code, but we are striving to supply sufficient initial documentation and examples. 
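For quick reference, the planning step shown earlier runs on its own as well; this sketch assumes seplyr (0.5 or later) is installed:

```r
library("seplyr")  # also provides the := pair operator used below

# Capture mutate expressions as data, then let partition_mutate_se()
# group them so that no group reads a value produced by another
# assignment within that same group.
plan <- partition_mutate_se(c(
  "name"   := "tolower(name)",
  "height" := "height + 0.5",
  "height" := "floor(height)",
  "mass"   := "mass + 0.5",
  "mass"   := "floor(mass)"
))
print(plan)  # two groups, as in the $group00001 / $group00002 output above
```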
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Continue Reading… ### How Big Data and New Technologies Are Changing Aging Big data and new technologies are changing the healthcare industry and the aging process as we know it, and for now, that seems to be a move in the right direction. Continue Reading… ### The Architecture of the Next CERN Accelerator Logging Service This is a community guest blog from Jakub Wozniak, a software engineer and project technical lead at the CERN physics laboratory, further expounding and complementing his keynote at Spark Summit EU in Dublin. CERN is a physics laboratory founded in 1954, focused on research, technology, and education in the domain of fundamental physics and the Standard Model, where its accelerators serve as giant microscopes that allow discovering the particularities of the basic building blocks of matter. Funded by 22 member states, with approximately 2,500 employees and over 10,000 active users coming from all over the world, CERN is home to the largest and most powerful particle accelerator in the world – the Large Hadron Collider (LHC). To date it is the most complex scientific instrument ever built, and it enabled the discovery of the Higgs boson (the particle predicted by the Standard Model) in 2012, as announced by the ATLAS and CMS experiments. These complex, interconnected particle accelerators collectively generate massive amounts of data every day. In this blog, I want to share our architecture for the logging service and how we collect and process data at massive scale with Apache Spark. But first, a bit of background on the family of CERN accelerators. 
## Particle Accelerators Chain at CERN The LHC accelerator itself consists of a 27-kilometer ring of superconducting magnets with a number of accelerating cavities to increase the energy of the particles circulating inside its rings. But CERN itself is not only the LHC. In reality, it is a complex of interconnected particle accelerators arranged in a chain with each successive machine able to further increase the energy of the particles. The beam production starts with a simple hydrogen bottle at the Linac 2, the first accelerator in the chain that accelerates protons to the energy of 50 MeV. The beams later on get injected into the Proton Synchrotron Booster where the proton bunches are formed and accelerated further to 1.4 GeV. The next accelerator in the chain is called the Proton Synchrotron which forms the final shape of the beam bunches and kicks the beam up to 25 GeV. The particles are later sent to the Super Proton Synchrotron where they are accelerated to 450 GeV, from which they are injected using two transfer lines into two pipes of the LHC. This is where the beams go in opposite directions to be collided at the experimental sites. It takes around 25 minutes to fill the LHC with the desired particle bunches and accelerate them to their final energy of 6.5 TeV. The energy per proton corresponds to approximately that of a flying mosquito; however, the total accumulated energy coming from all the 116e09 protons per bunch and 2800 bunches in the beam gives an enormous energy, equivalent to that of an aircraft carrier cruising at 5 knots. Huge detectors in the experimental sites observe the collisions producing around 1PB of events per second that is filtered to around 30-50 PB of usable physics data per year. 
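The mosquito and aircraft-carrier comparisons above are easy to sanity-check with a few lines of R; the ~100,000-tonne carrier mass below is an illustrative assumption:

```r
# Stored energy of one LHC beam, from the figures quoted above.
ev_to_j     <- 1.602e-19                 # joules per electron-volt
e_proton    <- 6.5e12 * ev_to_j          # 6.5 TeV per proton, ~1e-6 J
                                         # (about a flying mosquito)
beam_energy <- e_proton * 116e9 * 2800   # protons per bunch x bunches
beam_energy                              # ~3.4e8 J per beam

# Kinetic energy of a ~100,000-tonne carrier at 5 knots (~2.6 m/s):
0.5 * 1e8 * 2.6^2                        # ~3.4e8 J -- same order
```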
## Particles’ Journey, Devices, and Data Collectors While the LHC is surely CERN’s flagship experiment, the laboratory engineers, accelerator operators, and beam physicists work hard to deliver various types of beams to other experiments located at the smaller accelerators around the complex. Before the users (i.e. experimental physicists) can analyze the data from the collisions, an enormous effort goes into making the whole complex work in a highly synchronized and coordinated manner. The typical accelerator is composed of a multitude of devices, starting with various types of magnets: dipoles to bend the particles, quadrupoles to focus them, and kickers to inject or extract beam to and from other accelerators. The acceleration of the beam is carried out using radio-frequency cavities, and precise beam diagnostics instrumentation is used to measure the characteristics of the produced particle bunches. There are also many other systems involved in the beam production like orbit feedbacks, timing, synchronization, interlocks, security, radiation protection, etc. For high-energy accelerators like the LHC, cryogenics plays a very important role as well. All of them will of course also need vacuum, electricity, and ventilation. In order to control all of the related devices, Operations teams require Control System(s) to monitor and operate the machine. At CERN even the smallest accelerators are usually composed of thousands of devices with hundreds of different properties. All those devices are programmed with various settings and produce a number of observable outputs. Such output values need to be presented to the operators to help them understand the current state of the machine and allow them to respond to the events that happen every second in the accelerator chain. The data can be used in the form of online monitoring or offline queries. 
Software applications are used for everyday operations and present the current state of the devices, alarms, failures, beam properties, etc. Offline queries are required to perform various studies on controls data aimed at improving machine performance and beam quality, providing new beam types, and designing new experiments or even future accelerators. ## Data Analytics and Storage Requirements per Day Until now all of the acquired Controls data has been stored in a system based on two Oracle databases, which is called the “CERN Accelerator Logging Service” (CALS). The system subscribes to 20,000 different devices and logs data for some 1.5 million different signals in total. It has around 1000 users all over CERN, which generate 5 million queries per day (coming mainly from automated applications). CALS stores 71 billion records/day that occupy around 2TB / day of unfiltered data. Since this amount is quite significant for storage, heavy filtering is actively applied and 95% of data is filtered out after 3 months. That leaves around 1PB of important data stored long-term since 2003. ## Limitations and Latencies of the Old System For a system that has been in production for a long time (development started in 2001), some design principles that looked very good a decade ago do show some signs of ageing, especially under current data loads. For instance, the Oracle DB is difficult to scale horizontally and it is not a particularly performant solution for Big Data analysis when it comes to data structures different from simple scalars. One of the biggest problems is that in order to do the analysis, one has to extract the data and this might be a lengthy process. For some analysis use cases, it can take half a day to extract a day's worth of data. Moreover, the data rates are not likely to go down, or even stay constant. 
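The quoted rates are internally consistent, as a quick back-of-the-envelope check shows:

```r
# Figures quoted above for the Oracle-based CALS system.
records_per_day  <- 71e9    # records logged per day
bytes_per_day    <- 2e12    # ~2 TB/day of unfiltered data
bytes_per_record <- bytes_per_day / records_per_day
bytes_per_record            # ~28 bytes per record on average

gb_per_day  <- 900          # long-term storage rate during "run 2"
pb_per_year <- gb_per_day * 365 / 1e6
pb_per_year                 # ~0.33 PB accumulating long-term per year
```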
The CERN Council has approved the High-Luminosity LHC (HL-LHC) project to upgrade the LHC to produce much higher luminosities (luminosity is a measure of the rate of collisions and is a figure of merit for accelerators that collide particles, such as the LHC). This will lead to much higher data-taking frequencies (from 1 Hz to 100 Hz), much bigger vector data, and a desire for limited filtering, which is inevitably linked with increased equipment testing and operational tuning during the early stages of this project. Even bigger accelerators are being actively discussed, like the Future Circular Collider (FCC), with a tunnel design of approximately 100 km in circumference stretching between the Jura Mountains and the Alps and passing under Lake Geneva. Future challenges aside, the current Oracle-based CALS system faced a challenging reality from the beginning of 2014 when the LHC entered the so-called “run 2” phase (see diagram below) following 2 years of planned maintenance. In the 3 years of LHC operation that followed, the data logging rates increased from a stable flat rate of 150 GB/day observed in “run 1” to a linear growth currently reaching 900 GB/day stored long-term. Suddenly, the system was confronted with a situation it was completely unprepared for. ## The Next Generation Scalable Big Data Architecture with Apache Spark These problems forced the responsible team to look into the domain of Big Data solutions, and a new project was started called “Next CALS” (NXCALS). An initial feasibility study aimed at selecting the right tools for the job at hand from the far too rich Apache Hadoop ecosystem. After 3 months of prototyping with various tools and techniques, Apache Spark was selected (preferred over Apache Impala and Oracle) as the best tool for extraction and analysis of Controls data, backed by Apache HBase and Apache Parquet file-based storage in Hadoop. 
For the visualization, it was hard to ignore the emerging adoption of Python and Jupyter notebooks that was happening at CERN and other institutes involved with data science and scientific computing. This study truly set the scene, showing directions for how the Controls data could be presented to its users. The scalability of the new system relies on the CERN on-premise cloud services based on OpenStack, with 250,000 cores available. After 18 months of development, the new NXCALS system architecture comprises distributed Apache Kafka brokers pumping the data to Hadoop, with Apache Spark enhanced with NXCALS’ DataSource API presenting the data to its clients. ## Final Thoughts A lot more is coming: visualization is taking shape through a new project at CERN called the Service for Web-based Analysis (SWAN), based on Jupyter notebooks and Python, forming a truly Unified Software Platform—something akin to a Unified Analytics Platform—for interactive data analysis in the cloud with Apache Spark as a first-class citizen. The potential in this synergy is high, and the first version of NXCALS on SWAN will be available as early as Q1 2018, helping CERN scientists in their daily analysis work. ## Read More To read more about CERN projects, I recommend the following resources: -- Try Databricks for free. Get started today. The post The Architecture of the Next CERN Accelerator Logging Service appeared first on Databricks. Continue Reading… ### We need to stop sacrificing women on the altar of deeply mediocre men (ISBA edition) (This is not Andrew. I would ask you not to speculate in the comments who S is, this is not a great venue for that.) Kristian Lum just published an essay about her experiences being sexually assaulted at statistics conferences. You should read the whole thing because it’s important, but here’s a sample paragraph. I debated saying something about him at the time, but who would have cared? 
It was a story passed down among female graduate students in my circles that when one woman graduate student was groped at a party by a professor and reported it to a senior female professor, she was told that if she wanted to stay in the field, she’d just have to get used to it. On many occasions, I have been smacked on the butt at conferences. No one ever seemed to think it was a problem. I knew it would be even more difficult to get people to find S’s behavior problematic since he is employed by a large tech company and his participation in academic conferences, I have heard, often comes with sponsorship money. I have a friend who had essentially the same experience with the same man. Although it’s heartening to see that some senior people in ISBA cut his name from the nominations in the recent elections, there is no formal mechanism for the society to respond to these types of allegations or this type of behaviour. Why is that important? I’m just on my way back from O’Bayes, which is the cliquiest of the Bayesian conferences (all of which are pretty cliquey). You can set your watch by who gives the tutorials and who’s on the scientific committee. They’ve all known each other forever and are friends. To report this type of behaviour in an environment where the people you should tell are likely friends with the assaulter is an extreme act of bravery. ISBA (and all of its sections) needs to work to help people come forward and support them formally as well as informally. And keep the creeps out. (Quick clarification: In the second last paragraph I really do not want to suggest that the senior people in ISBA would not act on reports about their friend. And, as the article said, they did in one case. But I know that because I know these people reasonably well, which is not a luxury most have. ISBA needs to do more to signal its willingness and openness to dealing with this problem.) 
More important edit: Kerrie Mengersen, ISBA president and all-round wonderful person, has just weighed in officially. I look forward to seeing what ISBA does and I look forward to never seeing “S” in person again. Continue Reading… ### R in the Windows Subsystem for Linux R has been available for Windows since the very beginning, but if you have a Windows machine and want to use R within a Linux ecosystem, that's easy to do with the new Fall Creators Update (version 1709). If you need access to the gcc toolchain for building R packages, or simply prefer the bash environment, it's easy to get things up and running. Once you have things set up, you can launch a bash shell and run R at the terminal like you would in any Linux system. And that's because this is a Linux system: the Windows Subsystem for Linux is a complete Linux distribution running within Windows. This page provides the details on installing Linux on Windows, but here are the basic steps you need and how to get the latest version of R up and running within it. First, enable the Windows Subsystem for Linux option. Go to Control Panel > Programs > Turn Windows Features on or off (or just type "Windows Features" into the search box), and select the "Windows Subsystem for Linux" option. You'll need to reboot, just this once. Next, you'll need to install your preferred distribution of Linux from the Microsoft Store. If you search for "Linux" in the store, you'll find an entry "Run Linux on Windows" which will provide you with the available distributions. I'm using "Ubuntu", which as of this writing is Ubuntu 16.04 (Xenial Xerus). Once that's installed you can launch Ubuntu from the Start menu (just like any other app) to open a new bash shell window. The first time you launch, it will take a few minutes to install various components, and you'll also need to create a username and password. This is your Linux username, different from your Windows username. 
You'll automatically log in when you launch new Ubuntu sessions, but make sure you remember the password — you'll need it later. From here you can go ahead and install R, but if you use the default Ubuntu repository you'll get an old version of R (R 3.2.3, from 2015). You probably want the latest version of R, so add CRAN as a new package repository for Ubuntu. You'll need to run these three commands as root, so enter the password you created above if requested:

echo "deb http://cloud.r-project.org/bin/linux/ubuntu xenial/" | sudo tee -a /etc/apt/sources.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update

(Don't be surprised by the message key E084DAB9: public key "Michael Rutter <marutter@gmail.com>" imported. That's the key used to sign the R packages in this repository.) Now you're all set to install the latest version of R, which can be done with:

sudo apt-get install r-base

And that's it! (Once all the dependencies install, anyway, which can take a while the first time.) Now you're all ready to run R from the Linux command line. Note that you can access files on your Windows system from R — you'll find them at /mnt/c/Users/<your-Windows-username>. This FAQ on the WSL provides other useful tips, and for complete details refer to the Windows Subsystem for Linux documentation. Update Dec 14: Jeroen Ooms shares that this works for running RStudio Server and OpenCPU Server as well. Continue Reading… ### Comparing smooths in factor-smooth interactions II (This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers) In a previous post I looked at an approach for computing the differences between smooths estimated as part of a factor-smooth interaction using s()’s by argument. When a common-or-garden factor variable is passed to by, gam() estimates a separate smooth for each level of the by factor. 
Using the Xp matrix approach, we previously saw that we can post-process the model to generate estimates for pairwise differences of smooths. However, the by variable approach of estimating a separate smooth for each level of the factor may be quite inefficient in terms of degrees of freedom used by the model. This is especially so in situations where the estimated curves are quite similar but wiggly; why estimate many separate wiggly smooths when one, plus some simple difference smooths, will do the job just as well? In this post I look at an alternative to estimating separate smooths, using an ordered factor for the by variable. When an ordered factor is passed to by, mgcv does something quite different to the model I described previously, although the end results should be similar. What mgcv does in the ordered factor case is to fit L − 1 difference smooths, where l = 1, …, L indexes the levels of the factor and L is the number of levels. These smooths model the difference between the smooth estimated for the reference level and the lth level of the factor. Additionally, the by variable smooth doesn’t itself estimate the smoother for the reference level, so we are required to add a second smooth to the model that estimates that particular smooth. In pseudo code, for an ordered factor of, our model would be something like

model <- gam(y ~ of + s(x) + s(x, by = of), data = df)

As with any by factor smooth we are required to include a parametric term for the factor because the individual smooths are centered for identifiability reasons. The first s(x) in the model is the smooth effect of x on the reference level of the ordered factor of. The second smoother, s(x, by = of), is the set of L − 1 difference smooths, which model the smooth differences between the reference level smoother and those of the individual levels (excluding the reference one). 
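Before moving on, the reference-plus-differences idea can be sketched with plain numbers. This is a pure-Python illustration with made-up values, not mgcv output: the reference quantity plays the role of s(x), and each difference plays the role of one of the by smooths.

```python
# Sketch of the reference-plus-differences decomposition: with treatment-style
# coding, the reference level is estimated directly and every other level is
# represented as a difference from it. All values below are hypothetical.
groups = {"non-eroded": [10.0, 12.0, 11.0],   # reference level
          "eroded":     [15.0, 17.0, 16.0],
          "thin":       [ 8.0,  9.0, 10.0]}

means = {g: sum(v) / len(v) for g, v in groups.items()}

intercept = means["non-eroded"]                 # estimate for the reference level
differences = {g: m - intercept                 # analogous to the difference smooths
               for g, m in means.items() if g != "non-eroded"}

print(intercept)     # 11.0
print(differences)   # {'eroded': 5.0, 'thin': -2.0}
```

The same decomposition appears in the ANOVA analogy discussed next: the intercept is the reference-level mean and the remaining coefficients are differences from it.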
Note that this model still estimates a separate smoother for each level of the ordered factor, it just does it in a different way. The smoother for the reference level is estimated via contribution from s(x) only, whilst the smoothers for the other levels are formed from the additive combination of s(x) and the relevant difference smoother from the set created by s(x, by = of). This is analogous to the situation we have when estimating an ANOVA using the default contrasts and lm(); the intercept is then an estimate of the mean response for the reference level of the factor, and the remaining model coefficients estimate the differences between the mean response of the reference level and that of the other factor levels. This ordered-factor-smooth interaction is most directly applicable to situations where you have a reference category and you are interested in differences between that category and the other levels. If you are interested in pair-wise comparison of smooths you could use the ordered factor approach — it may be more parsimonious than estimating separate smoothers for each level — but you will still need to post-process the results in a manner similar to that described in the previous post¹. To illustrate the ordered factor difference smooths, I’ll reuse the example from the Geochimica paper I wrote with my colleagues at UCL, Neil Rose, Handong Yang, and Simon Turner (Rose et al., 2012), and which formed the basis for the previous post. Neil, Handong, and Simon had collected sediment cores from several Scottish lochs and measured metal concentrations, especially of lead (Pb) and mercury (Hg), in sediment slices covering the last 200 years. The aim of the study was to investigate sediment profiles of these metals in three regions of Scotland; north east, north west, and south west. A pair of lochs in each region was selected, one in a catchment with visibly eroding peat/soil, and the other in a catchment without erosion. 
The different regions represented variations in historical deposition levels, whilst the hypothesis was that cores from eroded and non-eroded catchments would show differential responses to reductions in emissions of Pb and Hg to the atmosphere. The difference, it was hypothesized, was that the eroding soil acts as a secondary source of pollutants to the lake. You can read more about it in the paper — if you’re interested but don’t have access to the journal, send me an email and I’ll pass on a pdf. Below I make use of the following packages:

• readr
• dplyr
• ggplot2, and
• mgcv

You’ll more than likely have these installed, but if you get errors about missing packages when you run the code chunk below, install any missing packages and run the chunk again. Next, load the data set and convert the SiteCode variable to a factor. This is a subset of the data used in Rose et al. (2012) — the Hg concentrations in the sediments for just three of the lochs are included here in the interests of simplicity. The data set contains 5 variables:

• SiteCode is a factor indexing the three lochs, with levels CHNA, FION, and NODH,
• Date is a numeric variable of sediment age per sample,
• SoilType and Region are additional factors for the (natural) experimental design, and
• Hg is the response variable of interest, and contains the Hg concentration of each sediment sample.

Neil gave me permission to make these data available openly should you want to try this approach out for yourself. If you make use of the data for other purposes, please cite the source publication (Rose et al., 2012) and recognize the contribution of the data creators: Handong Yang, Simon Turner, and Neil Rose. To proceed, we need to create an ordered factor. Here I’m going to use the SoilType variable as that is easier to relate to conditions of the soil (rather than the Site Code I used in the previous post). 
I set the non-eroded level to be the reference and as such the GAM will estimate a full smooth for that level and then smooth differences between the non-eroded site and each of the eroded and thin-soil sites. The ordered-factor GAM is fitted to the three lochs using the following, and the resulting smooths can be drawn using the plot() method. The smooth in the top left is the reference smooth trend for the non-eroded site. The other two smooths are the difference smooths between the non-eroded and eroded sites (top right). It is immediately clear that the difference between the non-eroded and eroded sites is not significant under this model. The estimated difference is linear, which suggests the trend in the eroded site is stronger than the one estimated for the non-eroded site. However, this difference is not so large as to be an identifiably different trend. The difference smooth for the thin soil site is considerably different to that estimated for the non-eroded site; the principal difference being the much reduced trend in the thin soil site, as indicated by the difference smooth acting in opposition to the estimated trend for the non-eroded site. A nice feature of the ordered factor approach is that inference on these differences can be performed formally and directly using the summary() output of the estimated GAM. The impression we formed about the differences in trends is reinforced with actual test statistics; this is a clear advantage of the ordered-factor approach if your problem suits this different-from-reference situation. One feature to note: because we used an ordered factor, the parametric term for oSoilType uses polynomial contrasts — the .L and .Q refer to the linear and quadratic terms used to represent the factor. This makes it less easy to identify differences in mean Hg concentration. 
If you want to retain that readily interpreted parameterisation, use the SoilType factor for the parametric part: Now the output in the parametric terms section is easier to interpret yet we retain the behavior of the reference smooth plus difference smooths part of the fitted GAM. ### References Rose, N. L., Yang, H., Turner, S. D., and Simpson, G. L. (2012). An assessment of the mechanisms for the transfer of lead and mercury from atmospherically contaminated organic soils to lake sediments with particular reference to Scotland, UK. Geochimica et Cosmochimica Acta 82, 113–135. doi:10.1016/j.gca.2010.12.026. 1. Except now you need to be sure to include the right set of basis functions that correspond to the pair of levels you want to compare. You can’t do that with the function I included in that post; it requires something a bit more sophisticated, but the principles are the same. Continue Reading… ### GPU-accelerated TensorFlow on Kubernetes A unified methodology for scheduling workflows, managing data, and offloading to GPUs. Many workflows that utilize TensorFlow need GPUs to efficiently train models on image or video data. Yet, these same workflows typically also involve multi-stage data pre-processing and post-processing, which might not need to run on GPUs. 
This mix of processing stages, illustrated in Figure 1, results in data science teams running things requiring CPUs in one system while trying to manage GPU resources separately by yelling across the office: “Hey, is anyone using the GPU machine?” A unified methodology is desperately needed for scheduling multi-stage workflows, managing data, and offloading certain portions of the workflows to GPUs. Pairing Kubernetes with TensorFlow enables a very elegant and easy-to-manage solution for these types of workflows. For example, Figure 1 shows three different data pipeline stages. Pre-processing runs on CPUs, model training on GPUs, and model inference again on CPUs. One would need to deploy and maintain each of these stages on a potentially shared set of computational resources (e.g., cloud instances), and that’s what Kubernetes does best. Each of the stages can be containerized via Docker and declaratively deployed on a cluster of machines via Kubernetes (see Figure 2). Along with this scheduling and deployment, you can utilize other open source tooling in the Kubernetes ecosystem, such as Pachyderm, to make sure you get the right data to the right TensorFlow code on the right type of nodes (i.e., CPU or GPU nodes). Pachyderm serves as a containerized data pipelining layer on top of Kubernetes. With this tool, you can subscribe your TensorFlow processing stages to particular versioned collections of data (backed by an object store) and automatically gather output, which can be fed to other stages running on the same or different nodes. Moreover, sending certain stages of our workflow, such as model training, to a GPU is as simple as telling Pachyderm via a JSON specification that a stage needs a GPU. Pachyderm will then work with Kubernetes under the hood to schedule that stage on an instance having a GPU. Sound good? Well then, get yourself up and running with TensorFlow + GPUs on Kubernetes as follows: 1. 
Deploy Kubernetes to the cloud of your choice or on-premises. Certain cloud providers, such as Google Cloud Platform, even have one-click Kubernetes deploys with tools like Google Kubernetes Engine (GKE). 2. Add one or more GPU instances to your Kubernetes cluster. This may involve creating a new node pool if you are using GKE or a new instance group if you are using kops. In any event, you will need to update your cluster and then install GPU drivers on those GPU nodes. By way of example, you could add a new node pool to your alpha GKE cluster (which takes advantage of the latest GPU features) via gcloud: $ gcloud alpha container node-pools create gpu-pool --accelerator type=nvidia-tesla-k80,count=1 --machine-type <your-chosen-machine-type> --num-nodes 1 --zone us-east1-c --image-type UBUNTU --cluster <your-gke-cluster-name>


Then you could ssh into one of those GPU nodes and add the NVIDIA GPU drivers:

$ curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
$ sudo -s

$ dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
$ apt-get update && apt-get install cuda -y

3. Deploy Pachyderm on your Kubernetes cluster to manage data pipelining and your collections of input/output data.
4. Docker-ize the various stages of your workflow.
5. Deploy your data pipeline by referencing your Docker images and the commands in JSON specifications. If you need a GPU, just utilize the resource requests and limits provided by Kubernetes to grab one. Note, you may need to add the path to your GPU drivers to the LD_LIBRARY_PATH environment variable in the container as further discussed here.
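As a rough sketch, requesting a GPU through Kubernetes resource limits can look like the pod spec below. The pod name, image, and command are placeholders, and the exact GPU resource name depends on your Kubernetes version and device plugin:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-train                             # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:1.4.0-gpu   # example GPU-enabled image
    command: ["python", "/train.py"]         # hypothetical training script
    resources:
      limits:
        nvidia.com/gpu: 1   # ask the scheduler for a node with one free GPU
```

In a Pachyderm pipeline you would express the same requirement in the pipeline's JSON specification rather than writing a pod spec by hand; Pachyderm translates it into a Kubernetes request like the one above.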

This post is part of a collaboration between O'Reilly and TensorFlow. See our statement of editorial independence.

Continue reading GPU-accelerated TensorFlow on Kubernetes.

### Xavier Amatriain’s Machine Learning and Artificial Intelligence Year-end Roundup

So much has happened in the world of AI that it is hard to fit in a couple of paragraphs. Here is my attempt.

### Summertime Analytics: Predicting E. Coli and West Nile Virus

Gene Leynes (Senior Data Scientist) and Nick Lucius (Advanced Analytics) from the City of Chicago discussed two predictive analytics projects that forecast potential risk involved with E. coli in Lake Michigan and West Nile virus from mosquitoes.

# Session Summary

At a recent Data Science PopUp, Gene Leynes and Nick Lucius, from the City of Chicago Advanced Analytics Team, provided insight into the data collection, analysis, and models involved with two predictive analytics projects.

Session highlights include:

• Use of a generalized linear mixed-effects model incorporating season and regional bias. The model predicted 78% of West Nile virus positives one week in advance, and its positive predictions were correct 65% of the time.
• The team discovered that a linear model was not going to work for predicting E. coli in Lake Michigan because there was no clear correlation. They implemented rapid testing of volatile beaches, strategic selection of predictor beaches, and use of a “k-means clustering algorithm in order to cluster the beaches into different groups and then use that information to decide our predictors for the model.”
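The clustering idea in the second bullet can be illustrated with a minimal one-dimensional k-means (Lloyd's algorithm) sketch. The beach readings below are invented and the city's actual implementation differs; this only shows the mechanic of grouping beaches by similar E. coli behavior:

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm in one dimension (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)          # pick k starting centers
    for _ in range(iters):
        # Assign each value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center as the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical mean E. coli readings for a handful of beaches: two low-count
# beaches and a group of volatile, high-count beaches.
readings = [35.0, 40.0, 38.0, 120.0, 130.0, 125.0]
centers, clusters = kmeans_1d(readings, k=2)
```

With well-separated readings like these, the algorithm settles on one low-count cluster and one high-count cluster; the cluster assignments could then inform which beaches serve as predictors for the others.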

Interested in the West Nile project? Review the West Nile project on Github and download the West Nile data. How about the E. Coli project? Then review the Clear Water project on Github and download the data on the data portal. For more insights from the presentation, review the video and the presentation transcript.

# Presentation Transcript

Hello. I'm Gene Leynes.

I'm a data scientist at the city of Chicago. We're going to get started with the next presentation. We've had a nice mix of technical and more high-level talks. Today I'm going to talk about a couple of projects that we've been working on at the city that were debuted this summer. At any given time, we have quite a few different initiatives going on in the city of Chicago. We're always doing different data science projects to optimize operations and try to make more from less. We're just like every other company. We deal with shrinking budgets, and we're always trying to find ways to do things smarter to extend our resources further. Within our data science team, we have DBAs, and we have business intelligence professionals. Nick and I are representative of the advanced analytics team, and we report directly to the Chief Data Officer Tom Schenk, who, if you do anything with data and you live in Chicago, you've probably seen Tom because he's omnipresent. The two projects we're going to talk about today are the West Nile virus predictions and the water quality predictions.

I'm going to start off with a slide of-- this is horse brain tissue from a horse that died from West Nile virus and… this is the enemy, this is the mosquito. This is how humans, horses, babies, old people…this is how we get West Nile.

This was a really interesting project. I learned a lot working on it. I'm not a big fan of pestilence, but at the same time it was a pretty enjoyable project. It was really surprising and interesting.

Who'd have a guess where West Nile is in the United States or where it started? Well, what were some states you'd think of? Florida, California, North Carolina, that's a good one. It is a little bit in these places. It started in Queens. It's actually a very urban problem. It's a very unusual sort of thing. It came to Queens in 1999, spread very quickly throughout the United States. And by 2001 or 2002, it was already in Illinois. And we are the fifth-most contaged-- I don't know the right word for that-- state in the Union. We're actually right there in the top as far as West Nile virus cases. So, that was very surprising to me.

And the thing-- there's some good news about that, though. Even though it's everywhere, it's not usually that bad. Most of the people who are infected with West Nile don't even know that they ever were infected. About 80% of people show absolutely no symptoms at all. And of the 20% of the people who do show symptoms, they have flu-like symptoms, and they rarely even go to the hospital or to the emergency room. Of those people, 1% have the severe symptoms that are neural-invasive diseases, and this is where it gets bad. This is where people become paralyzed and have chronic pain and die. But it's really a pretty small number.

This year, for example, we had, I think, final count-- and they're still coming in; there is about a 60-day lag between incubation and the testing, 30 days for incubation, 30 days for testing. The numbers are actually still coming in, but there's only about, I think it was three or four cases throughout Illinois this whole year. But the weird thing, and the reason that it's actually a pretty important public health issue, is because outbreaks can happen anywhere, and they're very unpredictable.

For example, one year in Colorado, there were something like 2,500 cases. I don't know why. I don't think Colorado knows why. But it's very important to get ahead of these cases and to take action to reduce the West Nile spread and to reduce the mosquito populations.

The other important thing to know is that not every mosquito transmits West Nile virus; it's Culex restuans and Culex pipiens. These are not the nuisance mosquitoes that are normally biting you at backyard barbecues. A lot of people in Chicago-- it's funny… the south side of Chicago really wants us to spray for mosquitoes, and the north side really doesn't want us to spray because they want their beehives, and they're more naturalist focused. You've kind of got this dichotomy of like who wants to be sprayed and who doesn't. It doesn't actually matter because we're not spraying for the things that bother you, for the most part anyway.

This is kind of important to understand. This is the life cycle of the disease, basically. West Nile is mostly transmitted between mosquitoes and birds. Birds migrate and pass it around the country. That's why it spread so quickly throughout the United States. The mosquitoes transmit it from bird to bird. The mosquitoes that infect the birds actually don't prefer humans. The human and horse cases are pretty much just spillover.

So enough of being a downer (I'm going to use the same language that I used from the Chi Hack Night. I'm sorry if you were there but that's what I'm going to do). I'd like to talk about what we do to prevent West Nile virus at the city of Chicago; we really do three things.

The first thing is we larvicide stormwater drains and it's an unbelievable number. I still kind of can't believe it. I'd have to see it to believe it. But we larvicide 150,000 storm drains around the city of Chicago. And they get interns or whoever to drop pellets. [LAUGHTER] So they drop these pellets into the catch basins in the storm drain, and this basically just prevents the mosquitoes from breeding in the storm water.

The second thing we do is we do DNA testing. We have these gravid traps that have this specially chemically formulated sugar water that attracts just the mosquito species that we want. There's a little fan, because mosquitoes are terrible fliers, and it just blows them up into the net. We catch the mosquitoes alive. We have-- I should know the number…. it's like 40 or so traps around the city. We harvest these, shake them out, and grind up the mosquitoes in batches of no more than 50, because over 50 the West Nile DNA would become too diluted to measure. We grind up the mosquitoes in batches of less than 50 and test for West Nile using DNA tests at the CDC lab on the west side.

Then the third thing that we do, if West Nile is present in a particular region where we're testing (we have almost complete city coverage), if we see it two weeks in a row, we spray for it. The whole point of this project was to reduce that time from two weeks to one week to really nip it in the bud when we do have these problems and immediately knock down the mosquito population. So this is what the data looks like. This is not a fancy ESRI map… This is just my map. [LAUGHTER]

So the-- oh, and also, this would be a good time. I should point out we use a lot of open-source tools. A lot of our projects, including this project, are on GitHub. The data itself is on the open-data portal. The data that I used in the model is exactly the same data that you can use, at least as far as the test results go.

I actually did have some secret data that we couldn't make public. That's the precise trap location because we can't make that publicly available. These are approximate trap locations because we don't want people tampering with the traps. But you can get pretty close. We have the lat-longs of the approximate trap locations for all the traps published on the open-data portal. I guess I was wrong in my numbers. There's about 60. I wrote these slides, but I forgot because it's been awhile. We have about 60 of these traps located throughout the city collecting these mosquitoes. They're collected usually once a week, maybe twice a week. We publish the data in terms of the actual lab results on the data portal.

This is what the shape of the mosquito season looks like in the city of Chicago. The blue line represents the mean number of mosquitoes. Wait, let me just read this. Actually, I think it's the total. Anyway, no, this is the total number of mosquitoes captured per trap. Then the orange line is the average number that are infected with West Nile. You can see in May, we started doing the collecting. We don't put all the traps out yet because there's never any West Nile in May. June, it starts to pick up. We start to get our very first positive results. July, it's ramping up. August is the really heavy month. And by the end of the year or by October, it's gone. There's a little bit there, but it's really gone by the end of October.

The other thing that has been really reinforced by working on this project with me personally is that we can really see the effects of climate change. A lot of these vector-borne illnesses, and by vector, I mean mosquito vector or tick vectors, are really increasing throughout the United States in different climates. The seasons are getting longer because we're missing the really cold winters that kill off the vectors. This is a problem that's going to continue to be a problem or continue to get worse. But sorry about that. But that's certainly something that was reinforced with me.

So, the guy before me-- I'm sorry-- could probably explain the model better than me. We used a generalized mixed-effect model and it does use Bayesian optimization to correct for the bias of the shape of the season, as well as the bias on a trap-by-trap basis. Some traps are just more likely to have positive results. The other variables that we feed into the model include things like weather and whether or not the trap had positive results last week, as well as the cumulative number of results that the trap has had throughout the season. We try to get an idea of whether or not it's a bad season, what happened last week, what the weather is, and then incorporate the overall shape of the season.
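As a rough structural sketch of what "fixed effects plus per-trap and per-season bias" means here (this is not the city's actual model; every coefficient and offset below is invented), a prediction combines the shared covariate effects with trap-level and week-level adjustments:

```python
import math

def sigmoid(z):
    """Map a linear predictor to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fixed-effect coefficients (made up for illustration).
COEF = {"intercept": -2.0, "temp": 0.08,
        "positive_last_week": 1.2, "cum_positives": 0.15}
# Hypothetical random-effect-style offsets: some traps and some weeks of the
# season are simply more likely to test positive.
TRAP_BIAS = {"T001": 0.4, "T002": -0.3}
WEEK_BIAS = {26: -0.5, 31: 0.7, 35: 0.9}

def predict(trap, week, temp, pos_last_week, cum_pos):
    """Return an illustrative probability of a positive trap result."""
    z = (COEF["intercept"]
         + COEF["temp"] * temp
         + COEF["positive_last_week"] * pos_last_week
         + COEF["cum_positives"] * cum_pos
         + TRAP_BIAS.get(trap, 0.0)    # trap-by-trap bias
         + WEEK_BIAS.get(week, 0.0))   # shape-of-season bias
    return sigmoid(z)

# A warm mid-season week at a trap that was positive last week.
p = predict("T001", week=31, temp=28.0, pos_last_week=1, cum_pos=3)
alert = p > 0.39   # the cutoff discussed below
```

The point of the sketch is only the structure: shared covariates (weather, last week's result, season-to-date positives) plus bias terms for the trap and the week, squashed into a 0-to-1 score.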

We tried a lot of different models. We tried gradient boosted [models] / GBMs, we tried random forests, and we actually-- to back up again, this whole thing started with a Kaggle competition that used really sophisticated Bayesian models to calculate the entire season. It's funny. The results from the Kaggle competition weren't particularly useful for us because they won the competition, but it was tuned for each season. And you don't know the season until after the season happens. So, they weren't cheating, but it wasn't something that we could immediately just take off the shelf and use for predictions for next week. And by the way, in all these data science problems, the hardest thing for me is usually figuring out -- it's easy to model stuff -- it's hard to figure out how to project it out for your t-plus-one time step, and how to put things back in your model so that you're making a prediction for next week. This is the thing that's not in a textbook, and this is the thing where the rubber meets the road, where it's always the tricky part. The outcome from our model was a number between 0 and 1. Let me think. For the most part, the results were around 0.14. I think that was the average. We chose a cutoff of 0.39. Anything over 0.39 we said “this is a positive”. With that cutoff we were able to predict 78% of the true positives, and of our positives we were correct 65% of the time. I don't have the F-score handy. You can certainly find it in the GitHub page. It really is all there. Some of the more machine learning people might enjoy seeing some of those statistics and seeing an old-fashioned confusion matrix, but this is something that makes it a lot easier to communicate to the public and to management and to epidemiologists and to other people within the city of Chicago.
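The cutoff logic can be sketched as follows. The scores and labels below are invented (the real evaluation is in the project's GitHub repository); the sketch just shows how a 0.39 threshold turns scores into alerts and how the two quoted percentages are computed:

```python
# Illustrative only: turn model scores into alerts with a 0.39 cutoff, then
# compute recall (share of true positives caught, like the quoted 78%) and
# precision (share of alerts that were correct, like the quoted 65%).
scores = [0.10, 0.45, 0.55, 0.20, 0.70, 0.35, 0.41, 0.05]   # made-up scores
actual = [0,    1,    1,    0,    1,    1,    0,    0   ]   # 1 = WNV present
CUTOFF = 0.39

predicted = [1 if s > CUTOFF else 0 for s in scores]

tp = sum(p and a for p, a in zip(predicted, actual))            # true positives
fp = sum(p and not a for p, a in zip(predicted, actual))        # false alarms
fn = sum((not p) and a for p, a in zip(predicted, actual))      # missed cases

recall = tp / (tp + fn)
precision = tp / (tp + fp)
```

Lowering the cutoff catches more of the true positives (higher recall) at the cost of more false alarms (lower precision), which is exactly the trade-off behind choosing 0.39.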

Once we have these predictions we put them into our situational awareness program. I'm going to give you an idea of what that looks like. We have this thing where you can find the data, and you select the data set that you want. This is a preloaded query and we basically said anything that's over 0.39, color it red. Anything that's under, color it green. And this, for one particular week in the middle of the summer, was what our map looked like. And these are some of the details of what actually happened in that trap. And underneath, I don't think I show this part, but there's a little thing down here where you can look at the raw data and download that if you want it. So unfortunately, this particular data set, because of the trap locations, is only available internally in Windy Grid, which is our situational awareness program. But we also have something called Open Grid, which has most of our other data sets. I think it does have the West Nile test results, but it doesn't have the predictions per se because the predictions also have the top secret locations.

I hope this gives you a sense of one of our projects. And I'm going to let Nick take over and tell you about another pretty cool project that we've been doing. Thanks.

Thanks, everyone.

I'm looking forward to telling you about another project here in the city of Chicago. It's involving similar conditions, E. coli in the lake water. So again, we're looking at a pathogen, something that can cause illness, and a way to use predictive analytics in order to help let people know about the problem and try to mitigate the diseases. I'm really sorry about this slide. I thought about not including it, given the time of the year. But it's really impossible to tell the story of E. coli and the Chicago beaches without this because this is why this project exists.

Chicago's beaches are a very large source of enjoyment for residents and visitors during the summers. People go to swim. People go to have picnics. People go to ride their bikes. It's such an amazing amenity that we have. So a little bit more about it-- I didn't realize this when I started the project, but over 20 million people each year visit Chicago beaches. I think that's just an astounding number, given that there's not even three million residents in the city of Chicago. I think this is also a timely number, given that news just came out, I think, yesterday or the day before that Chicago's hitting all-time tourism records. Over 55 million people visit the city of Chicago on an annual basis. And so, yeah, that's amazing. 20 million of them hit the beaches.

Now, each year at the 27 Chicago beaches that we have, there's about 150 water quality exceedances. What that means is that the bacteria, the E. coli that's in the water, hit a level that research has shown that people can get sick if they go swimming when the water's at that level. To put that 150 number into context, there's about 2,000 beach days each year. When you take the number of beaches, you multiply that by the number of swimming days that are out there, there's around 2,000. So it's really a handful of times that this happens. But when it does happen, it's really important to get notifications to the public, accurate notifications, so that people can make a decision about whether they want to swim. The beaches don't close. Normally, when this happens, there's an advisory that's sent out. If a person has a weak immune system or is a child or is elderly, they can make that decision whether to go into the water.

Now, finally, the state of beach technology for actually checking the water quality, it's like this. There's these traditional culture tests, where they actually grow the bacteria in a Petri dish and come back after about 18 hours and check how many bacteria grew. These are slow. These are slow tests. And with the rate that E. coli count changes during a day, once you get those results, it no longer reflects what's going on in the water at that time.

Because of the 18-hour lag time, models built with those test results have just never really notified people accurately. I know it can be disconcerting. But in prior years, if you ever went to the beach and saw an advisory that said, "hey, there's a problem here," it was telling you about yesterday. And there might not have been a problem at the beach that day.

Another, brand-new way of testing for E. coli on beaches is rapid DNA tests. These have just been researched and developed over the last few years. This past summer was the first year in Chicago where they were used at every beach. And a lot of municipalities around the world are looking at what Chicago is doing with rapid testing, to see whether they can take it to their communities and use it for beach monitoring.

But the one main weakness of these rapid DNA testing methods is that they're quite expensive. It's mostly the machinery that it takes. These samples are picked up along the lake each morning and driven to UIC, where they're put into the machines to run the tests. And while that's happening over the summer, it's taking up all the capacity in the regional area to do these kinds of tests at all. If somebody wanted to say, "hey, I want to do one of these tests as well at 11:00 AM" on any day during the summer, well, you've got to wait. So that's the kind of supply and demand that's out there right now.

All this motivates using predictive modeling to be able to get a cost-effective, accurate read-out on what's the water quality at any beach at any given time. Predictive models have the capacity to prevent illnesses and, bottom line, save governments millions of dollars while notifying the public of whether or not they should go into the water.

Now, this project that I'm going to tell you about, it's really interesting how it came to be. It actually came from Chi Hack Night, where a group of people noticed that beach water quality data was online on the data portal, thought that maybe they could make a predictive model, and approached the city. The city of Chicago worked with these developers, these data scientists at Chi Hack Night-- I actually see some in the crowd here tonight-- on a volunteer basis to develop the model, and also worked with students from DePaul University to develop data visualizations and do some model refining. I was actually one of the volunteers. I did not work for the city of Chicago at the time this was going on at Chi Hack Night. So it was really cool. I got to be a volunteer at Chi Hack Night, work on my data science skills, and then afterwards end up working at the city of Chicago as a data scientist. It's been an awesome experience.

I'll tell you a little bit about the model now. The originally developed model used water sensors. That's what's on the top left. There was a water sensor in the water that was reading out things like water cloudiness, wave height, and water temperature on an ongoing basis, and sending it to the city of Chicago's data portal. The team also used weather sensor data. And it used the results of the E. coli tests from prior days, from the day before, from the week before, to power the model.

Then there were a lot of one-off data sets that were interesting, like when the locks were opened and sewage water was put into Lake Michigan, which, in case you don't know, actually happens a few times a year, usually when there's a huge rainstorm and the whole Chicago area is flooded. They'll open up those floodgates and let it out into the lake. In those instances, the beaches are closed immediately until it dissipates. But the thought was there might be some effects in the following days that could lead to a better model.

Then down on the bottom here, what you're looking at is the E. coli level for a single beach during a year. I think it shows you how rare an event a bad water quality day is. This particular beach in 2015 had only one single bad day, and there's really not a lot of warning. It comes out of nowhere. It goes away right away. These rare, anomalous events can be very difficult to predict, and that's what the team noticed right away.

For all the modeling that the volunteers put their hours into, the conclusion at the end was that no matter how much environmental data we might get our hands on, we can't seem to pin down what causes E. coli spikes, and so what in nature you can use to predict them. Accuracy rates in the models just never got over a certain threshold. It was a frustrating experience. But what that work did, and what all those individuals' contributions did, was get a discussion going around other ways we might be able to look at the problem.

We ended up developing a new way to model beach water quality, which shows some great promise so far. The way it works is, instead of using these environmental variables to try to figure out what's going on in the water from what's going on in the air, the idea is this: pay for a few of those expensive tests today, and then use those tests to predict what's going on at the other beaches in the area today. It really becomes more of a missing-value problem. You've got some beaches that you're testing, and some beaches that you're inferring. And there are regional effects in Lake Michigan, where some beaches do tend to move with other beaches, so you can predict one beach's E. coli level from another's. But it's still not something you can use a simple linear model for. There's no clear linear correlation.

What we've done is we've used a k-means clustering algorithm to cluster the beaches into different groups, and then used that information to decide the predictors for the model. What we do is we say, OK, these are the beaches that we're going to pay to test. Then these are the beaches that we're going to predict.
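The clustering step described here can be sketched roughly as follows. This is an illustrative Python sketch, not the project's actual R code: the beach names and readings are made up, and testing the first beach in each cluster is an arbitrary stand-in for whatever selection rule the real model uses.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical historical E. coli readings: rows are days, columns are beaches.
# (All names and numbers here are invented for illustration.)
rng = np.random.default_rng(0)
readings = rng.lognormal(mean=3.0, sigma=1.0, size=(200, 6))
beaches = ["Beach A", "Beach B", "Beach C", "Beach D", "Beach E", "Beach F"]

# Cluster the beaches (the columns) by the similarity of their reading
# histories, so beaches whose E. coli levels tend to move together end up
# in the same cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(readings.T)

# Pay for a rapid test at one beach per cluster; the rest of the cluster
# would be predicted from that day's measured value.
tested = {}
for cluster in range(3):
    members = [b for b, lab in zip(beaches, labels) if lab == cluster]
    tested[cluster] = members[0]  # arbitrary representative
    print(f"cluster {cluster}: test {members[0]}, predict {members[1:]}")
```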

Now, for the finer details on that, I'll refer you to a paper that's forthcoming. There's a draft version on our GitHub page already. It really goes into the details of all the modeling. So, I won't go into that here.

What I do want to do is show a public website that we made that allows people to learn about the model and about this project, but also to create their own model. Because one of the key choices in this model is which beaches you use, the ones you choose to test. So a person can go on here and pick-- they see a list of beaches. You can choose whatever beaches you want and say, OK, I'm going to test these ones, build a model.

In the background of this Shiny app, there's an R script running that's going to build the model. Maybe I should have done a demo on this right before we got started. I think I just need to reload it. Let me choose a couple of beaches again here. In the background, it's going to build and validate a model. You'll be able to see how your model did and put it up against the city of Chicago's model.

Let me just give it a few seconds here to get going. There's this dotted line showing the true-positive rate of the city of Chicago's model. Then that bar shows the one that I just made, which really doesn't do as well-- it got a 15% true-positive rate versus the city's model here, which was at about 38%. People can go and try to build their own, and see if they can come up with a beach combination that might be useful.

You can also mess with the false-positive rate down here. I increased the false-positive rate quite a bit, which means that now the model is going to issue advisories when there's really no problem at the lake. But it gets a better true-positive rate because of that. When we went to evaluate the model, we decided to put it into production. Because, like Gene was saying, one of the hardest parts that we face is the operationalization of a model, getting it to actually work on real data in real time.
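The trade-off being demonstrated, accepting a higher false-positive rate to gain a higher true-positive rate, falls out of where you set the advisory threshold on the model's predictions. A minimal sketch with simulated numbers (the 235 CFU/100 mL cutoff is the commonly cited EPA recreational-water advisory level for E. coli; everything else here is made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated beach-days: whether the water truly exceeded the standard, and
# the model's predicted E. coli level for that day.
truth = rng.random(1000) < 0.07  # ~7% of beach-days are real exceedances
predicted = np.where(truth,
                     rng.lognormal(5.0, 1.0, 1000),   # exceedance days read higher
                     rng.lognormal(3.5, 1.0, 1000))   # normal days read lower

def rates(threshold):
    """TPR and FPR if we issue an advisory whenever predicted >= threshold."""
    advisory = predicted >= threshold
    tpr = (advisory & truth).sum() / truth.sum()
    fpr = (advisory & ~truth).sum() / (~truth).sum()
    return tpr, fpr

# Lowering the threshold issues more advisories: the true-positive rate
# rises, but the false-positive rate rises with it.
for t in (400, 235, 100):
    tpr, fpr = rates(t)
    print(f"threshold {t:>3}: TPR={tpr:.2f}  FPR={fpr:.2f}")
```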

In 2017, this past summer, even though the city of Chicago was rapid testing every single beach, we created a model that selected a few beaches and then predicted the other beaches, so that we could see how it did. With this hybrid method, we saw about 60 additional days on which the public would have been notified by our process, whereas with the old model, the public would not have known about the problem. A lot of those do come from the fact that our process would actually do rapid tests. You get some wins that way. But the model itself does a better job, too.

The predictive model itself, stacked up against the prior predictive model, is about three times as accurate. I have a slide for that: three times the accuracy of the prior model. What we've done is, like I said, we've got a paper to publish the model. I've gone and talked with water quality experts, the people who are actually doing the science on a regular basis, to show them, to make sure that everything makes sense to them, and, hopefully, to get this into the hands of beach water quality monitors wherever they may need it. So that's the end of the Clear Water and West Nile virus parts of the presentation.

But we just wanted to tell you a little bit more about the city of Chicago and what we do. These projects actually help fulfill some core pledges in what the city calls its tech plan. The mayor issued the tech plan right at the beginning of his term. And it pledges to the city of Chicago, among other things, that we will work with civic technology innovators to develop creative solutions to city challenges. This project did exactly that. Not only did we work together, but individual volunteers as a group ended up donating 1,000 hours of their time to this. It was such a great thing to see people on Tuesday nights-- everybody's got jobs, everybody's got busy lives-- sitting together in a room here in this building, just working on this, trying to figure out a way to make the city better for everybody. And I've got proof, in case you didn't want to take my word for it. We looked at the history on GitHub and put this together so that you can see when people were actually working. I don't know what's going on here on Saturday at 2:00 in the morning. But on Tuesday nights, you can see a big uptick in the work being done. When you look at the whole history of this project, the Chi Hack Night volunteers really, really did it. Another pledge in the city's plan that both of these projects meet is to leverage data and new technology to make government more efficient, effective, and open.

And finally, I'll mention something else about Gene's project, which was really cool. He put up Open Grid, the city's situational awareness mapping platform, and ran a demo for you. His project actually enhanced that platform and drove new development for it. Some of the maps that were built for the West Nile project actually got into Open Grid, so not only Chicago residents, but any other municipality that uses Open Grid, will now have these capabilities built in. So that's it for us. If you have any questions, I'm sure Gene can come up, and we can answer anything you might have.

The post Summertime Analytics: Predicting E. Coli and West Nile Virus appeared first on Data Science Blog by Domino.

### The Night Riders

Gilbert Chin writes:

After reading this piece [“How One 19-Year-Old Illinois Man Is Distorting National Polling Averages,” by Nate Cohn] and this Nature news story [“Seeing deadly mutations in a new light,” by Erika Hayden], I wonder if you might consider blogging about how this appears to be the same issue in two different disciplines.

I said, sure, I’d blog on this. Actually I wrote about the Cohn article right after it appeared. Then I finally got around to reading the Hayden article, which is about an effort in epidemiology to set up and analyze a huge database on diseases and genetic mutations. The traditional approach is to look at one disease at a time and try to find genes associated with that disease, but by putting all the information together, more can be learned. This makes sense to me. The key idea is that the “cases” for one disease can be considered as the “controls” for lots of others, so all of a sudden you’re using your data a lot more efficiently.

I think multilevel modeling is the way to analyze such data. When you have just one disease, there’s the whole challenge of choosing a prior distribution. But with thousands of diseases, you have the internal replication that makes the problem much more direct to address.
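One possible way to formalize that internal replication (my sketch, not anything from the article): let $y_{ij}$ count carriers of mutation $i$ who have disease $j$, and partially pool the per-disease effects:

```latex
\begin{align*}
y_{ij} &\sim \mathrm{Binomial}(n_{ij},\, p_{ij}) \\
\mathrm{logit}(p_{ij}) &= \alpha_j + \beta_j x_i \\
(\alpha_j, \beta_j) &\sim \mathrm{N}(\mu, \Sigma), \quad j = 1, \dots, J
\end{align*}
```

Here $x_i$ is a hypothetical mutation-level predictor, and the shared hyperparameters $(\mu, \Sigma)$ are what let thousands of diseases inform each other, replacing the hand-chosen prior you would need for a single disease.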

As is often the case, though, I’m just talking here, having never done such an analysis myself on this sort of problem. I’d love to collaborate with an expert sometime and see how it goes.

Finally, I don’t really see the connection to that earlier polling story—except that multilevel regression and poststratification should be useful in both cases.

The post The Night Riders appeared first on Statistical Modeling, Causal Inference, and Social Science.

### How to Generate FiveThirtyEight Graphs in Python

In this post, we'll help you. Using Python's matplotlib and pandas, we'll see that it's rather easy to replicate the core parts of any FiveThirtyEight (FTE) visualization.
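As a quick preview of the approach: matplotlib actually ships a built-in "fivethirtyeight" style sheet that gets you most of the look in one line. A minimal sketch (the data are made up; the Agg backend is only so the script runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# matplotlib includes a style sheet modeled on FiveThirtyEight's charts.
plt.style.use("fivethirtyeight")

# A tiny made-up series just to show the styling.
df = pd.DataFrame({"year": [2014, 2015, 2016, 2017],
                   "value": [10.0, 14.0, 9.0, 17.0]}).set_index("year")

ax = df.plot(legend=False)
ax.set_title("FiveThirtyEight-styled line chart")
plt.savefig("fte_demo.png")
```

From there, replicating a specific FTE chart is mostly a matter of tweaking titles, axis labels, and annotations on top of the style sheet.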