# My Data Science Blogs

## April 21, 2018

### If you did not already know

Pachinko Allocation Model (PAM)
In machine learning and natural language processing, the pachinko allocation model (PAM) is a topic model. Topic models are a suite of algorithms to uncover the hidden thematic structure of a collection of documents. The algorithm improves upon earlier topic models such as latent Dirichlet allocation (LDA) by modeling correlations between topics in addition to the word correlations which constitute topics. PAM provides more flexibility and greater expressive power than latent Dirichlet allocation. While first described and implemented in the context of natural language processing, the algorithm may have applications in other fields such as bioinformatics. The model is named for pachinko machines – a game popular in Japan, in which metal balls bounce down around a complex collection of pins until they land in various bins at the bottom.
http://…/pam-icml06.pdf
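PAM itself is rarely packaged in mainstream libraries, but the LDA baseline it improves on is easy to try. A minimal sketch with scikit-learn (the toy corpus and parameter choices are purely illustrative; real topic models need far more documents):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus.
docs = [
    "the ball bounces down the pins",
    "metal balls land in bins",
    "topic models uncover hidden structure",
    "documents share correlated topics",
]

# Bag-of-words counts, then a 2-topic LDA fit.
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row is a per-document topic mixture summing to 1.
doc_topics = lda.transform(X)
print(doc_topics.shape)  # (4, 2)
```

PAM replaces LDA's single Dirichlet over topics with a DAG of Dirichlets, which is what lets it capture correlations *between* topics rather than only within them.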

Wiener Process
In mathematics, the Wiener process is a continuous-time stochastic process named in honor of Norbert Wiener. It is often called standard Brownian motion, after Robert Brown. It is one of the best known Lévy processes (càdlàg stochastic processes with stationary independent increments) and occurs frequently in pure and applied mathematics, economics, quantitative finance, and physics. The Wiener process plays an important role both in pure and applied mathematics. In pure mathematics, the Wiener process gave rise to the study of continuous time martingales. It is a key process in terms of which more complicated stochastic processes can be described. As such, it plays a vital role in stochastic calculus, diffusion processes and even potential theory. It is the driving process of Schramm-Loewner evolution. In applied mathematics, the Wiener process is used to represent the integral of a Gaussian white noise process, and so is useful as a model of noise in electronics engineering, instrument errors in filtering theory and unknown forces in control theory. The Wiener process has applications throughout the mathematical sciences. In physics it is used to study Brownian motion, the diffusion of minute particles suspended in fluid, and other types of diffusion via the Fokker-Planck and Langevin equations. It also forms the basis for the rigorous path integral formulation of quantum mechanics (by the Feynman-Kac formula, a solution to the Schrödinger equation can be represented in terms of the Wiener process) and the study of eternal inflation in physical cosmology. It is also prominent in the mathematical theory of finance, in particular the Black-Scholes option pricing model. …
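The defining property, independent Gaussian increments whose variance equals the time step, translates directly into a simulation. A minimal NumPy sketch (step count and seed are arbitrary):

```python
import numpy as np

def simulate_wiener(n_steps=1000, T=1.0, seed=0):
    """Simulate a standard Wiener process W on [0, T].

    Increments W(t+dt) - W(t) are independent N(0, dt) draws,
    so the path is a cumulative sum of Gaussian steps with W(0) = 0.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    increments = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    W = np.concatenate([[0.0], np.cumsum(increments)])
    t = np.linspace(0.0, T, n_steps + 1)
    return t, W

t, W = simulate_wiener()
print(W[0], W[-1])  # starts at 0; endpoint is a N(0, T) draw
```

Averaging over many simulated paths, the endpoint W(T) has mean 0 and variance T, which is a quick sanity check on any implementation.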

Deep Transfer Network (DTN)
In recent years, deep learning models have become increasingly popular for intelligent condition monitoring, diagnosis, and prognostics of mechanical systems and structures. Previous studies, however, rest on a major default assumption: that the training and testing data are drawn from the same feature distribution. Unfortunately, this assumption is mostly invalid in real applications, which limits the applicability of traditional diagnosis approaches. Inspired by the idea of transfer learning, which leverages knowledge learnt from rich labeled data in a source domain to facilitate diagnosing a new but similar target task, this paper proposes a new intelligent fault diagnosis framework, the deep transfer network (DTN), which generalizes deep learning models to the domain adaptation scenario. By extending marginal distribution adaptation (MDA) to joint distribution adaptation (JDA), the proposed framework can exploit the discriminative structures associated with the labeled source-domain data to adapt the conditional distribution of the unlabeled target data, and thus guarantee a more accurate distribution matching. Extensive empirical evaluations on three fault datasets validate the applicability and practicability of DTN, which achieves state-of-the-art transfer results across diverse operating conditions, fault severities and fault types. …

Title says it all: Some datasets for teaching data science


### What is “blogging”? Is it different from “writing”?

Thomas Basbøll wrote:

To blog is not to write in a particular style, or publish in a particular form. Rather, blogging is an experience that is structured by a particular functionality. . . . What makes it a blog is a structural coordination of the blogger and the audience. . . .

Blogging, in my experience, reduces writing to the short-term effects you have on your readers and they have on you. You try to have an immediate, essentially real-time impact on the discourse, which makes it much more like speech than writing. . . .

You can’t define “writing” simply by way of “written communication”. It is possible to write a tweet in the formal sense I want to insist on and some writers have in fact tried to do this. But most tweets and a great many emails are much more like speech than like writing. Think of the way we end an email chain when we’re arranging a meeting with a short message sent from our phone: “OK. See you then. / T.” I don’t want to call that writing. It’s speech in another medium. . . .

I responded:

I like a lot of what you’re saying here, and I think these sort of distinctions are valuable. I’ll put this post on the reading list for my class on communication.

There’s one place, though, where I think you overstate your point.

You write, “Blogging, in my experience, reduces writing to the short-term effects you have on your readers and they have on you.” I can’t argue with your experience, of course, but . . . blogging does some other things too:

1. Blogging is permanent (at least on the scale of years or a decade or so; I could well imagine that the software will start to fall apart and much of my blogging will be lost in the future). So when I blog, it’s not just to have a conversation now, it can also be to lay down a marker. Often I’ll blog about an article I’ve been given, just to avoid forgetting it and to have the article there in a searchable form. Other times I’ll post something knowing that I’ll be referring back to it in the future.

2. A related point: blogging creates a sort of community memory, so that, for example, on my blog I can talk about Weick and Weggy and pizzagate, and air rage and himmicanes and ages ending in 9, and even the good stuff like multilevel modeling and Stan and the birthday model, and readers know what I’m talking about—or even if they don’t know, they have a sense that there is an ongoing conversation, a density to the discussion, in the same way that a good novel will give the sense that the characters have depth and that much is happening offstage. Indeed, a while after the Monkey Cage moved to the Washington Post, our editors told me that my posts were too “bloggy” in that they were presupposing some continuity that was inappropriate for a newspaper feature.

3. And, just responding to the “short-term effects” thing: the blogging here is mostly on a six-month delay, so the effects don’t have to be short term. (Regular readers will recall that “long term” = 3 days or possibly 5 minutes!)

4. Finally, to get back to the issue of different forms of communication (in your view, blogging is “much more like speech than writing”): A blog post, or even a blog comment (such as this one), can be “written” in the sense of being structured and arranged. One thing I like to tell students is that writing is non-algorithmic: despite what one might think based on naive theories of communication, you can’t in general just write down your thoughts, or write down what you did today. Part of this is that, as the linguists say, ideas don’t generally exist in the absence of language: writing down an idea helps to form it. And part of it is that language and story have some internal logic (see here and search on Ramona), I guess related to the sound of the words and related to the idea that we are often trying to convey notions of cause and effect while reporting discrete events.

5. How do you characterize chatty journalism, such as George Orwell’s “As I please” columns? This is not a trick question. They would seem to fall somewhere in between what you’re calling “writing” and “blogging.”

I think our goal here in this discussion is not to come up with some sort of perfect categorization, or to argue about whether blogging is “really” writing, or the relative literary merits of book writing and journalism, but rather to lay out some connections between goals, methods, audiences, and media of communication. When framed that way, I guess there’s probably been a lot written on this sort of thing, but I’m ignorant of any relevant literature.

Ummm, I like this comment. I think I’ll blog it so it won’t get forgotten. Next open spot is mid-Apr.

And I followed up with one more thing, which I thought about after clicking Publish:

One thing that blogging does not seem to supply for me is “closure.” For example, I hope you will follow up on the above discussion, and maybe some others can contribute too, and . . . we can write an article or book, really nailing down the idea. Somehow a blog post, no matter how definitive, never quite seems to get there. And it’s not just the content, it really does seem to be the form, or maybe I should say the placement, of the post. For example, last year I wrote What has happened down here is the winds have changed, which was one of the most successful posts I’ve ever written, both in terms of content (I like what I wrote, and I developed many of the ideas while writing the post) and in reception (it was widely discussed and in an overwhelmingly positive way). Still, I’d feel better, somehow, if it were “published” somewhere in a more formal way—even if the content were completely unchanged. I’m not quite sure how much of this is pure old-fashionedness on my part and how much it has to do with the idea that a mutable scrolling html document inherently has less of a definitive feel than an article in some clearly-defined place. I could reformat that particular post as pdf and put it on my webpage as an unpublished article but that wouldn’t quite do the trick either. And of course one good reason for keeping it as a blog post is that people can read and contribute to the comment thread.

Which forms of writing seem definitive and which don’t? For example, when I publish an article in the Journal of the American Statistical Association, it seems real. If I publish in a more obscure journal, not so much. If I publish something in the New York Times or Slate, it gets many more readers, but it still seems temporary, or unfinished, in the same way as a blog post.

For the other direction, I think of published book reviews as definitive, but others don’t. One of my favorite books by Alfred Kazin is a collection published in 1959, mostly of book reviews. They vary in quality, but that’s fine, as it’s also interesting to see some of his misguided (in my view) and off-the-cuff thoughts. I love old book reviews. So a few years ago when I encountered Kazin’s son, I asked if there was any interest in publishing Alfred’s unpublished book reviews, or at least supplying an online repository. The son said no, and what struck me was not just that there are no plans to publish a hypothetical book that would maybe sell a couple hundred copies (I have no idea) but that he didn’t even seem to be sad about this, that his dad’s words would remain uncollected. But I guess that makes sense if you take the perspective that the book reviews were mostly just practice work and it was the completed books and longer essays that were real.

There’s also the question of how important it is to have “closure.” It feels important to me to have some aspect of a project wrapped up and done, that’s for sure. But in many settings I think the feeling of closure is a bad thing. Closure can be counterproductive to the research enterprise. Think of all the examples of junk science I’ve discussed on the blog over the years. Just about every one of these examples is associated with a published research paper that is seriously, perhaps hopelessly, flawed, but for which the authors and journal editors go to great lengths to avoid acknowledging error. They seem to value closure too much: the paper is published and it seems unfair to them for outsiders to go and criticize, to re-litigate the publication decision, as it were. My impression is that these authors and editors have an attitude similar to that of a baseball team that won a game, and then a careful view of the videotape made it clear that someone missed a tag at second base in the fifth inning. The game’s already over, it doesn’t get replayed! Science is different (at least I think it should be) in that it’s about getting closer to the truth, not about winning or losing. Anyway, that’s a bit of a digression, but the point about closure is relevant, I think, to discussions of different forms of writing.

### R Packages worth a look

Making Optimal Matching Size-Scalable Using Optimal Calipers (bigmatch)
Implements optimal matching with near-fine balance in large observational studies with the use of optimal calipers to get a sparse network. The caliper is optimal in the sense that it is as small as possible such that a matching exists. Glover, F. (1967). <DOI:10.1002/nav.3800140304>. Katriel, I. (2008). <DOI:10.1287/ijoc.1070.0232>. Rosenbaum, P.R. (1989). <DOI:10.1080/01621459.1989.10478868>. Yang, D., Small, D. S., Silber, J. H., and Rosenbaum, P. R. (2012). <DOI:10.1111/j.1541-0420.2011.01691.x>.
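The notion of an “optimal caliper” — the smallest caliper under which a complete matching still exists — can be illustrated with a search over candidate calipers. A simplified one-dimensional sketch with SciPy (the scores and the assignment-based feasibility test are illustrative; bigmatch itself uses a more refined sparse-network algorithm):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def smallest_feasible_caliper(treated, control):
    """Smallest caliper c such that every treated unit can be matched to a
    distinct control with |treated - control| <= c (1-D scores)."""
    d = np.abs(treated[:, None] - control[None, :])

    def feasible(c):
        # Cost 1 for pairs outside the caliper; a zero-cost assignment of
        # all treated units exists iff the caliper admits a full matching.
        cost = (d > c).astype(float)
        rows, cols = linear_sum_assignment(cost)
        return cost[rows, cols].sum() == 0

    # The optimal caliper is one of the pairwise distances: binary-search it.
    candidates = np.unique(d)
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(candidates[mid]):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]

treated = np.array([0.2, 0.5, 0.9])
control = np.array([0.1, 0.15, 0.55, 0.8])
print(round(float(smallest_feasible_caliper(treated, control)), 6))  # 0.1
```

Here the unit at 0.9 forces the caliper up to 0.1 (its nearest control is 0.8), even though the other units have much closer matches — exactly the "as small as possible such that a matching exists" criterion.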

Open Population Capture-Recapture (openCR)
Functions for the analysis of capture-recapture data from animal populations subject to turnover. The models extend Schwarz and Arnason (1996) <DOI:10.2307/2533048> and Borchers and Efford (2008) <DOI:10.1111/j.1541-0420.2007.00927.x>, and may be non-spatial or spatial. The parameterisation of recruitment is flexible (options include population growth rate and per capita recruitment). Spatially explicit analyses may assume home-range centres are fixed or allow dispersal between sampling sessions.

Nearest-Neighbor Analysis (nna)
Calculates spatial pattern analysis using a T-square sample procedure. This method is based on two measures ‘x’ and ‘y’. ‘x’ – Distance from the random point to the nearest individual. ‘y’ – Distance from individual to its nearest neighbor. This is a methodology commonly used in phytosociology or marine benthos ecology to analyze the species’ distribution (random, uniform or clumped patterns). Ludwig & Reynolds (1988, ISBN:0471832359).
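The two distance measures are straightforward to compute. A simplified NumPy sketch (random positions stand in for field data; the single index below is a simplification of the full T-square statistics in Ludwig & Reynolds, not the package's own output):

```python
import numpy as np

rng = np.random.default_rng(1)
individuals = rng.uniform(0, 100, size=(200, 2))  # mapped organism positions
sample_pts = rng.uniform(0, 100, size=(30, 2))    # random sampling points

def nearest_dist(from_pts, to_pts, exclude_self=False):
    """Distance from each point in from_pts to its nearest point in to_pts."""
    d = np.linalg.norm(from_pts[:, None, :] - to_pts[None, :, :], axis=2)
    if exclude_self:
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

# 'x': random point to nearest individual; 'y': individual to nearest neighbor.
x = nearest_dist(sample_pts, individuals)
y = nearest_dist(individuals, individuals, exclude_self=True)[:30]  # pair 30 (simplified)

# For a random (Poisson) pattern, x and y share a distribution, so this
# ratio sits near 0.5; clumped patterns push it above 0.5.
C = np.sum(x**2) / (np.sum(x**2) + np.sum(y**2))
print(round(C, 2))
```

The appeal of T-square sampling in field ecology is that it only needs these two distances per sampling point, not a full map of every individual.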

Process Map Token Replay Animation (processanimateR)
Token replay animation for process maps created with ‘processmapR’ by using SVG animations (‘SMIL’) and the ‘htmlwidget’ package.

Recency, Frequency and Monetary Value Analysis (rfm)
Tools for RFM (recency, frequency and monetary value) analysis. Generate RFM score from both transaction and customer level data. Visualize the relationship between recency, frequency and monetary value using heatmap, histograms, bar charts and scatter plots. Includes a ‘shiny’ app for interactive segmentation. References: i. Blattberg R.C., Kim BD., Neslin S.A (2008) <doi:10.1007/978-0-387-72579-6_12>.
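The same scoring idea is easy to reproduce outside R. A minimal pandas sketch (the column names, reference date, and four score bins are illustrative choices, not the rfm package's API):

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase.
tx = pd.DataFrame({
    "customer": ["a", "a", "b", "c", "c", "c", "d"],
    "date": pd.to_datetime(
        ["2018-01-05", "2018-03-01", "2018-02-10",
         "2018-01-20", "2018-02-15", "2018-03-30", "2017-12-01"]),
    "amount": [20.0, 35.0, 15.0, 50.0, 60.0, 40.0, 10.0],
})
now = pd.Timestamp("2018-04-01")

# Collapse transactions to one RFM row per customer.
rfm = tx.groupby("customer").agg(
    recency=("date", lambda d: (now - d.max()).days),
    frequency=("date", "size"),
    monetary=("amount", "sum"),
)

# Rank-based 1-4 scores; low recency (a recent purchase) scores high.
rfm["r_score"] = pd.qcut(rfm["recency"].rank(method="first"), 4, labels=[4, 3, 2, 1])
rfm["f_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4])
rfm["m_score"] = pd.qcut(rfm["monetary"].rank(method="first"), 4, labels=[1, 2, 3, 4])
print(rfm)
```

Customer "c" (three recent, high-value purchases) ends up with top recency and frequency scores, which is the kind of segmentation the heatmaps and bar charts in the package then visualize.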

### Distilled News

Explore the key concepts in object detection and learn how they are implemented in SSD and Faster R-CNN, which are available in the TensorFlow Detection API.
In the previous article, we gained an understanding of the main Kafka components and how Kafka consumers work. Now, we’ll see how these contribute to the ability of Kafka to provide extreme scalability for streaming write and read workloads.
The tough thing about learning data science is remembering all the syntax. While at Dataquest we advocate getting used to consulting the Python documentation, sometimes it’s nice to have a handy reference, so we’ve put together this cheat sheet to help you out!
In this post, I talk about performance through an efficient algorithm I developed for finding closest points on a map. This algorithm uses both concepts from mathematics and algorithmics.
Two years ago, Wes McKinney and Hadley Wickham got together to discuss some of the systems challenges facing the Python and R communities. Data science teams inevitably work with multiple languages and systems, so it’s critical that data flow seamlessly and efficiently between these environments. Wes and Hadley wanted to explore opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems. This discussion led to the creation of the feather file format, a very fast on-disk format for storing data frames that can be read and written to by multiple languages.
Data visualizations can reveal trends and patterns that are not otherwise obvious from the raw data or summary statistics. While visualizing low-dimensional data is relatively straightforward (for example, plotting the change in a variable over time as (x,y) coordinates on a graph), it is not always obvious how to visualize high-dimensional datasets in a similarly intuitive way. Here we present HyperTools, a Python toolbox for visualizing and manipulating large, high-dimensional datasets. Our primary approach is to use dimensionality reduction techniques (Pearson, 1901; Tipping & Bishop, 1999) to embed high-dimensional datasets in a lower-dimensional space, and plot the data using a simple (yet powerful) API with many options for data manipulation [e.g. hyperalignment (Haxby et al., 2011), clustering, normalizing, etc.] and plot styling. The toolbox is designed around the notion of data trajectories and point clouds. Just as the position of an object moving through space can be visualized as a 3D trajectory, HyperTools uses dimensionality reduction algorithms to create similar 2D and 3D trajectories for time series of high-dimensional observations. The trajectories may be plotted as interactive static plots or visualized as animations. These same dimensionality reduction and alignment algorithms can also reveal structure in static datasets (e.g. collections of observations or attributes). We present several examples showcasing how using our toolbox to explore data through trajectories and low-dimensional embeddings can reveal deep insights into datasets across a wide variety of domains.
Titus is a container management platform that provides scalable and reliable container execution and cloud-native integration with Amazon AWS. Titus was built internally at Netflix and is used in production to power Netflix streaming, recommendation, and content systems.
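The core HyperTools idea described above — projecting a high-dimensional time series onto a low-dimensional trajectory — can be approximated with plain PCA. A minimal scikit-learn sketch (the synthetic data, dimensions, and noise level are all illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional time series: 500 observations in 50 dims,
# generated so the signal lives on a low-dimensional trajectory.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 500)
latent = np.column_stack([np.sin(t), np.cos(t), t / (4 * np.pi)])
mixing = rng.normal(size=(3, 50))
data = latent @ mixing + 0.05 * rng.normal(size=(500, 50))

# Reduce to a 3-D trajectory; consecutive rows trace the path over time.
embedding = PCA(n_components=3).fit_transform(data)
print(embedding.shape)  # (500, 3)
```

Plotting `embedding` as a connected line recovers the helical trajectory hidden in the 50-dimensional observations, which is exactly the kind of structure such embeddings make visible.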

### Book Memo: “Introduction to Agent-Based Economics”

 Introduction to Agent-Based Economics describes the principal elements of agent-based computational economics (ACE). It illustrates ACE’s theoretical foundations, which are rooted in the application of the concept of complexity to the social sciences, and it depicts its growth and development from a non-linear out-of-equilibrium approach to a state-of-the-art agent-based macroeconomics. The book helps readers gain a better understanding of the limits and perspectives of the ACE models and their capacity to reproduce economic phenomena and empirical patterns.

### Book Memo: “Fractional and Multivariable Calculus”

This textbook presents a rigorous approach to multivariable calculus in the context of model building and optimization problems. This comprehensive overview is based on lectures given at five SERC Schools from 2008 to 2012 and covers a broad range of topics that will enable readers to understand and create deterministic and nondeterministic models. Researchers, advanced undergraduate, and graduate students in mathematics, statistics, physics, engineering, and biological sciences will find this book to be a valuable resource for finding appropriate models to describe real-life situations. The first chapter begins with an introduction to fractional calculus, moving on to discuss fractional integrals, fractional derivatives, fractional differential equations and their solutions. Multivariable calculus is covered in the second chapter, which introduces the fundamentals of multivariable calculus (multivariable functions, limits and continuity, differentiability, directional derivatives and expansions of multivariable functions). Illustrative examples, input-output processes, optimal recovery of functions and approximations are given; each section lists an ample number of exercises to heighten understanding of the material. Chapter three discusses deterministic/mathematical and optimization models evolving from differential equations, difference equations, algebraic models, power function models, input-output models and pathway models. Fractional integral and derivative models are examined. Chapter four covers non-deterministic/stochastic models: the random walk model, branching process model, birth and death process model, time series models, and regression-type models are examined. The fifth chapter covers optimal design. General linear models from a statistical point of view are introduced; the Gauss-Markov theorem, quadratic forms, and generalized inverses of matrices are covered. Pathway, symmetric, and asymmetric models are covered in chapter six; the concepts are illustrated with graphs.

### Enter the OVIC low-power challenge!

Photo by Pete

I’m a big believer in the power of benchmarks to help innovators compete and collaborate together. It’s hard to imagine deep learning taking off in the way it did without ImageNet, and I’ve learned so much from the Kaggle community as teams work to come up with the best solutions. It’s surprisingly hard to create good benchmarks though, as I’ve learned in the Kaggle competitions I’ve run. Most of engineering is about tradeoffs, and when you specify just a single metric you end up with solutions that ignore other costs you might care about. It made sense in the early days of the ImageNet challenge to focus only on accuracy because that was by far the biggest problem that blocked potential users from deploying computer vision technology. If the models don’t work well enough with infinite resources, then nothing else matters.

Now that deep learning can produce models that are accurate enough for many applications, we’re facing a different set of challenges. We need models that are fast and small enough to run on mobile and embedded platforms, and now that the maximum achievable accuracy is so high, we’re often able to trade some of it off to fit the resource constraints. Models like SqueezeNet, MobileNet, and recently MobileNet v2 have emerged that offer the ability to pick the best accuracy you can get given particular memory and latency constraints. These are extremely useful solutions for many applications, and I’d like to see research in this area continue to flourish, but because the models all involve trade-offs it’s not possible to evaluate them with a single metric. It’s also tricky to measure some of the properties we care about, like latency and memory usage, because they’re tied to particular hardware and software implementations. For example, some of the early NASNet models had very low numbers of floating-point operations, but it turned out because of the model structure and software implementations they didn’t translate into as low latency as we’d expected in practice.

All this means it’s a lot of work to propose a useful benchmark in this area, but I’m very pleased to say that Bo Chen, Jeff Gilbert, Andrew Howard, Achille Brighton, and the rest of the Mobile Vision team have put in the effort to launch the On-device Visual Intelligence Challenge for CVPR. This includes a complete suite of software for measuring accuracy and latency on known devices, and I’m hoping it will encourage a lot of innovative new model architectures that will translate into practical advances for application developers. One of the exciting features of this competition is that there are a lot of ways to produce an impressive entry, even if it doesn’t win the main 30ms-on-a-Pixel-phone challenge, because the state of the art is a curve not a point. For example, I’d love a model that gave me 40% top-one accuracy in well under a millisecond, since that would probably translate well to even smaller devices and would still be extremely useful. You can read more about the rules here, and I look forward to seeing your creative entries!

### What’s new on arXiv

In recent years, deep learning models have become increasingly popular for intelligent condition monitoring, diagnosis, and prognostics of mechanical systems and structures. Previous studies, however, rest on a major default assumption: that the training and testing data are drawn from the same feature distribution. Unfortunately, this assumption is mostly invalid in real applications, which limits the applicability of traditional diagnosis approaches. Inspired by the idea of transfer learning, which leverages knowledge learnt from rich labeled data in a source domain to facilitate diagnosing a new but similar target task, this paper proposes a new intelligent fault diagnosis framework, the deep transfer network (DTN), which generalizes deep learning models to the domain adaptation scenario. By extending marginal distribution adaptation (MDA) to joint distribution adaptation (JDA), the proposed framework can exploit the discriminative structures associated with the labeled source-domain data to adapt the conditional distribution of the unlabeled target data, and thus guarantee a more accurate distribution matching. Extensive empirical evaluations on three fault datasets validate the applicability and practicability of DTN, which achieves state-of-the-art transfer results across diverse operating conditions, fault severities and fault types.
An increasing need to run Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory encourages studies of efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and NASNet-A. However, all these models depend heavily on depthwise separable convolution, which lacks an efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On the ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves 0.6% higher accuracy (71.3% vs. 70.7%) and 11% lower computational cost than MobileNet, the state-of-the-art efficient architecture. Meanwhile, PeleeNet is only 66% of the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with the Single Shot MultiBox Detector (SSD) method and optimizing the architecture for speed. Our proposed detection system, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on the MS COCO dataset at a speed of 17.1 FPS on iPhone 6s and 23.6 FPS on iPhone 8. The result on COCO outperforms YOLOv2 given its higher precision, 13.6 times lower computational cost and 11.3 times smaller model size. The code and models are open-sourced.
Recent years have witnessed significant progress in deep Reinforcement Learning (RL). Empowered with large-scale neural networks, carefully designed architectures, novel training algorithms and massively parallel computing devices, researchers are able to attack many challenging RL problems. However, in machine learning, more training power comes with a potential risk of more overfitting. As deep RL techniques are applied to critical problems such as healthcare and finance, it is important to understand the generalization behavior of the trained agents. In this paper, we conduct a systematic study of standard RL agents and find that they could overfit in various ways. Moreover, overfitting could happen “robustly”: commonly used techniques in RL that add stochasticity do not necessarily prevent or detect overfitting. In particular, the same agents and learning algorithms could have drastically different test performance, even when all of them achieve optimal rewards during training. The observations call for more principled and careful evaluation protocols in RL. We conclude with a general discussion on overfitting in RL and a study of generalization behavior from the perspective of inductive bias.
Dynamic Ensemble Selection (DES) techniques aim to select locally competent classifiers for the classification of each new test sample. Most DES techniques estimate the competence of classifiers using a given criterion over the region of competence of the test sample (its nearest neighbors in the validation set). The K-Nearest Oracles Eliminate (KNORA-E) DES selects all classifiers that correctly classify all samples in the region of competence of the test sample, if such a classifier exists; otherwise, it removes from the region of competence the sample that is furthest from the test sample, and the process repeats. When the region of competence has samples of different classes, KNORA-E can reduce the region of competence in such a way that only samples of a single class remain, leading to the selection of locally incompetent classifiers that classify all samples in the region of competence as being from the same class. In this paper, we propose two DES techniques: K-Nearest Oracles Borderline (KNORA-B) and K-Nearest Oracles Borderline Imbalanced (KNORA-BI). KNORA-B is a DES technique based on KNORA-E that reduces the region of competence but maintains at least one sample from each class that is in the original region of competence. KNORA-BI is a variation of KNORA-B for imbalanced datasets that reduces the region of competence but maintains at least one minority-class sample if there is any in the original region of competence. Experiments are conducted comparing the proposed techniques with 19 DES techniques from the literature using 40 datasets. The results show that the proposed techniques achieve interesting results, with KNORA-BI outperforming state-of-the-art techniques.
Independent samples from an unknown probability distribution $\mathbf{p}$ on a domain of size $k$ are distributed across $n$ players, with each player holding one sample. Each player can communicate $\ell$ bits to a central referee in a simultaneous message passing (SMP) model of communication to help the referee infer a property of the unknown $\mathbf{p}$. When $\ell\geq\log k$ bits, the problem reduces to the well-studied collocated case where all the samples are available in one place. In this work, we focus on the communication-starved setting of $\ell < \log k$, in which the landscape may change drastically. We propose a general formulation for inference problems in this distributed setting, and instantiate it to two prototypical inference questions: learning and uniformity testing.
We present a brief introduction to the theory of operator limits of random matrices to non-experts. Several open problems and conjectures are given. Connections to statistics, integrable systems, orthogonal polynomials, and more, are discussed.
In this work, we propose Adversarial Complementary Learning (ACoL) to automatically localize integral objects of semantic interest with weak supervision. We first mathematically prove that class localization maps can be obtained by directly selecting the class-specific feature maps of the last convolutional layer, which paves a simple way to identify object regions. We then present a simple network architecture including two parallel classifiers for object localization. Specifically, we leverage one classification branch to dynamically localize some discriminative object regions during the forward pass. Although it is usually responsive to sparse parts of the target objects, this classifier can drive the counterpart classifier to discover new and complementary object regions by erasing its discovered regions from the feature maps. With such adversarial learning, the two parallel classifiers are forced to leverage complementary object regions for classification and can finally generate integral object localization together. The merits of ACoL are mainly two-fold: 1) it can be trained in an end-to-end manner; 2) dynamic erasing enables the counterpart classifier to discover complementary object regions more effectively. We demonstrate the superiority of our ACoL approach in a variety of experiments. In particular, the Top-1 localization error rate on the ILSVRC dataset is 45.14%, which is the new state of the art.
A key problem in deep multi-attribute learning is to effectively discover the inter-attribute correlation structures. Typically, conventional deep multi-attribute learning approaches follow the pipeline of manually designing the network architecture based on task-specific expert prior knowledge and careful network tuning, which makes them inflexible in the varied, complicated scenarios that arise in practice. To address this problem, we propose an efficient greedy neural architecture search approach (GNAS) to automatically discover the optimal tree-like deep architecture for multi-attribute learning. In a greedy manner, GNAS divides the optimization of the global architecture into the optimization of individual connections, step by step. By iteratively updating the local architectures, the global tree-like architecture converges to a state where the bottom layers are shared across relevant attributes and the branches in the top layers encode more attribute-specific features. Experiments on three benchmark multi-attribute datasets show the effectiveness and compactness of the neural architectures derived by GNAS, and also demonstrate the efficiency of GNAS in searching neural architectures.
A game-theoretic framework for time-inconsistent stopping problems where the time-inconsistency is due to the consideration of a non-linear function of an expected reward is developed. A class of mixed strategy stopping times that allows the agents in the game to choose the intensity function of a Cox process is introduced. A subgame perfect Nash equilibrium is defined. The equilibrium is characterized and other results with different necessary and sufficient conditions for equilibrium are proven. This includes a smooth fit result. A mean-variance problem and a variance problem are studied as examples. The state process is a general one-dimensional Itô diffusion.
Coherence plays a critical role in producing a high-quality summary from a document. In recent years, neural extractive summarization has become increasingly attractive. However, most such models ignore the coherence of summaries when extracting sentences. As an effort towards extracting coherent summaries, we propose a neural coherence model to capture cross-sentence semantic and syntactic coherence patterns. The proposed neural coherence model obviates the need for feature engineering and can be trained in an end-to-end fashion using unlabeled data. Empirical results show that the proposed neural coherence model can efficiently capture cross-sentence coherence patterns. Using the combined output of the neural coherence model and the ROUGE package as the reward, we design a reinforcement learning method to train a proposed neural extractive summarizer named the Reinforced Neural Extractive Summarization (RNES) model. The RNES model learns to optimize coherence and informative importance of the summary simultaneously. Experimental results show that the proposed RNES outperforms existing baselines and achieves state-of-the-art performance in terms of ROUGE on the CNN/Daily Mail dataset. The qualitative evaluation indicates that summaries produced by RNES are more coherent and readable.
Fueled by massive amounts of data, models produced by machine-learning (ML) algorithms, especially deep neural networks, are being used in diverse domains where trustworthiness is a concern, including automotive systems, finance, health care, natural language processing, and malware detection. Of particular concern is the use of ML algorithms in cyber-physical systems (CPS), such as self-driving cars and aviation, where an adversary can cause serious consequences. However, existing approaches to generating adversarial examples and devising robust ML algorithms mostly ignore the semantics and context of the overall system containing the ML component. For example, in an autonomous vehicle using deep learning for perception, not every adversarial example for the neural network might lead to a harmful consequence. Moreover, one may want to prioritize the search for adversarial examples towards those that significantly modify the desired semantics of the overall system. Along the same lines, existing algorithms for constructing robust ML algorithms ignore the specification of the overall system. In this paper, we argue that the semantics and specification of the overall system has a crucial role to play in this line of research. We present preliminary research results that support this claim.
The output of Convolutional Neural Networks (CNN) has been shown to be discontinuous, which can make a CNN image classifier vulnerable to small, well-tuned artificial perturbations. That is, images modified by adding such perturbations (i.e. adversarial perturbations), which make little difference to human eyes, can completely alter the CNN classification results. In this paper, we propose a practical attack using differential evolution (DE) for generating effective adversarial perturbations. We comprehensively evaluate the effectiveness of different types of DEs for conducting the attack on different network structures. The proposed method is a black-box attack which only requires the output feedback of the target CNN systems. The results show that under strict constraints which simultaneously control the number of pixels changed and overall perturbation strength, the attack can achieve 72.29%, 78.24% and 61.28% non-targeted attack success rates, with 88.68%, 99.85% and 73.07% confidence on average, on three common types of CNNs. The attack only requires modifying 5 pixels, with 20.44, 14.76 and 22.98 pixel values of distortion. Thus, the results show that current DNNs are also vulnerable to such simple black-box attacks, even under very limited attack conditions.
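The abstract above names differential evolution as the optimizer behind the attack. The paper's exact attack setup is not reproduced here, but the core DE/rand/1/bin loop is easy to sketch on a toy objective (all names and parameter choices below are ours, not the paper's):

```python
import random

def differential_evolution(objective, bounds, pop_size=20, mutation=0.8,
                           crossover=0.9, generations=100, seed=1):
    """Minimal DE/rand/1/bin minimizer. `bounds` is a list of (lo, hi) per dimension."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the target.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantee at least one mutated gene
            trial = []
            for d in range(dim):
                if rng.random() < crossover or d == j_rand:
                    # Mutation: base vector plus scaled difference of two others.
                    v = pop[a][d] + mutation * (pop[b][d] - pop[c][d])
                else:
                    v = pop[i][d]
                lo, hi = bounds[d]
                trial.append(min(max(v, lo), hi))  # clip to the search box
            s = objective(trial)
            if s < scores[i]:  # greedy one-to-one selection
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Toy stand-in for "classifier confidence under a few-pixel perturbation".
best_x, best_f = differential_evolution(lambda x: sum(v * v for v in x),
                                        bounds=[(-5.0, 5.0)] * 3)
```

In the attack setting, each candidate vector would encode a handful of pixel positions and values, and the objective would be the target network's confidence in the true class; DE's population-based, gradient-free search is what makes the attack black-box.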
Automatic detection of anomalies in space- and time-varying measurements is an important tool in several fields, e.g., fraud detection, climate analysis, or healthcare monitoring. We present an algorithm for detecting anomalous regions in multivariate spatio-temporal time-series, which allows for spotting the interesting parts in large amounts of data, including video and text data. In contrast to existing techniques for detecting isolated anomalous data points, we propose the ‘Maximally Divergent Intervals’ (MDI) framework for unsupervised detection of coherent spatial regions and time intervals characterized by a high Kullback-Leibler divergence compared with all other data given. In this regard, we define an unbiased Kullback-Leibler divergence that allows for ranking regions of different size, and show how to enable the algorithm to run on large-scale data sets in reasonable time using an interval proposal technique. Experiments on both synthetic and real data from various domains, such as climate analysis, video surveillance, and text forensics, demonstrate that our method is widely applicable and a valuable tool for finding interesting events in different types of data.
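The MDI framework itself relies on an unbiased divergence and interval proposals for scale; its core idea, though, can be sketched in one dimension with a brute-force scan and plain Gaussian KL divergence (a toy sketch under those simplifications; all names are ours):

```python
import math

def gaussian_kl(m0, v0, m1, v1):
    """KL(N(m0, v0) || N(m1, v1)) for univariate Gaussians."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, max(v, 1e-9)  # guard against zero variance

def most_divergent_interval(series, min_len=3, max_len=10):
    """Brute-force scan: return the (start, end) interval whose fitted Gaussian
    diverges most (in KL) from the Gaussian fitted to all remaining points."""
    best, best_kl = None, -1.0
    for start in range(len(series)):
        for end in range(start + min_len, min(start + max_len, len(series)) + 1):
            inside = series[start:end]
            outside = series[:start] + series[end:]
            if len(outside) < 2:
                continue
            kl = gaussian_kl(*mean_var(inside), *mean_var(outside))
            if kl > best_kl:
                best, best_kl = (start, end), kl
    return best, best_kl

# A flat signal with an anomalous bump at positions 20-24.
data = [0.0] * 20 + [5.0, 5.5, 6.0, 5.5, 5.0] + [0.0] * 20
```

The O(n·max_len) scan above is exactly what the paper's interval proposal technique is designed to avoid on large data, and the plain KL is what its unbiased variant corrects for intervals of different length.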
A natural way of handling imbalanced data is to attempt to equalise the class frequencies and train the classifier of choice on balanced data. For two-class imbalanced problems, the classification success is typically measured by the geometric mean (GM) of the true positive and true negative rates. Here we prove that GM can be improved upon by instance selection, and give the theoretical conditions for such an improvement. We demonstrate that GM is non-monotonic with respect to the number of retained instances, which discourages systematic instance selection. We also show that balancing the distribution frequencies is inferior to a direct maximisation of GM. To verify our theoretical findings, we carried out an experimental study of 12 instance selection methods for imbalanced data, using 66 standard benchmark data sets. The results reveal possible room for new instance selection methods for imbalanced data.
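For readers unfamiliar with the GM criterion used above, it is simply the geometric mean of per-class recalls computed from a confusion matrix. A minimal sketch (the example counts are ours, for illustration):

```python
import math

def geometric_mean_score(tp, fn, tn, fp):
    """Geometric mean (GM) of the true positive and true negative rates."""
    tpr = tp / (tp + fn)  # recall on the (minority) positive class
    tnr = tn / (tn + fp)  # recall on the (majority) negative class
    return math.sqrt(tpr * tnr)

# A classifier that mostly predicts the majority class: accuracy looks high
# (990/1100 = 0.90), but GM exposes the poor minority-class recall.
gm = geometric_mean_score(tp=10, fn=90, tn=980, fp=20)  # sqrt(0.1 * 0.98) ~ 0.31
```

Because GM collapses to zero whenever either class is entirely missed, it is a natural success measure for two-class imbalanced problems, which is why the paper analyzes instance selection directly in terms of it.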

## April 20, 2018

### Document worth reading: “On Cognitive Preferences and the Interpretability of Rule-based Models”

It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption, and recapitulate evidence for and against this postulate. We also report the results of an evaluation in a crowd-sourcing study, which does not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then continue to review criteria for interpretability from the psychological literature, evaluate some of them, and briefly discuss their potential use in machine learning. On Cognitive Preferences and the Interpretability of Rule-based Models

### How to build deep learning models with SAS

SAS® supports the creation of deep neural network models. Examples of these models include convolutional neural networks, recurrent neural networks, feedforward neural networks and autoencoder neural networks. Let’s examine in more detail how SAS creates deep learning models using SAS® Visual Data Mining and Machine Learning.

### Deep learning models with SAS Cloud Analytic Services

SAS Visual Data Mining and Machine Learning takes advantage of SAS Cloud Analytic Services (CAS) to perform what are referred to as CAS actions. You use CAS actions to load data, transform data, compute statistics, perform analytics and create output. Each action is configured by specifying a set of input parameters. Running a CAS action processes the action’s parameters and data, which creates an action result. CAS actions are grouped into CAS action sets.

Deep neural net models are trained and scored using the actions in the deepLearn CAS action set. This action set consists of several actions that support end-to-end preprocessing, development and deployment of deep neural network models. It gives users the flexibility to describe their own model as a directed acyclic graph (DAG) to define the initial deep net structure. There are also actions that support adding and removing layers from the network structure.

Appropriate model descriptions and parameters are needed to build deep learning models. We first need to define the network topology as a DAG and use this model description to train the parameters of the deep net models.

### A deeper dive into the deepLearn SAS CAS Action Set

An overview of the steps involved in training deep neural network models, using the deepLearn CAS action set, is as follows:

1. Create an empty deep learning model.
• The BuildModel() CAS action in the deepLearn action set creates an empty deep learning model in the form of a CASTable object.
• Users can choose from DNN, RNN or CNN network types to build the respective initial network.
2. Add layers to the model.
• This can be implemented using the addLayer() CAS action.
• This CAS action provides the flexibility to add various types of layers, such as input, convolutional, pooling, fully connected, residual or output layers, as desired.
• The specified layers are then added to the model table.
• Each new layer has a unique identifier name associated with it.
• This action also makes it possible to randomly crop/flip the input layer when images are given as inputs.
3. Remove layers from the model.
• This is carried out using the removeLayer() CAS action.
• By specifying the layer name, layers can be removed from the model table.
4. Perform hyperparameter autotuning.
• The dlTune() action helps tune the optimization parameters needed for training the model.
• Tunable parameters include the learning rate, dropout, mini-batch size, gradient noise and others.
• For tuning, we must specify the lower and upper bounds of the range within which we expect the optimal value of each parameter to lie.
• An initial model weights table (in the form of a CASTable) must be specified to initialize the model.
• An exhaustive search over the specified weights table is then performed on the same data multiple times to determine the optimal parameter values.
• The resulting model weights with the best validation fit error are stored in a CAS table object.
5. Train the neural net model.
• The dlTrain() action trains the specified deep learning model for classification or regression tasks.
• The user supplies the initial model table that was built, the best model weights table stored during hyperparameter tuning, and the predictor and response variables; the action then trains the neural net model.
• Trained models such as DNNs can be stored as an ASTORE binary object to be deployed in the SAS Event Stream Processing engine for real-time online scoring of data.
6. Score the model.
• The dlScore() action uses the trained model to score new data sets.
• Scoring uses the trained model information from the ASTORE binary object to predict against the new data set.
7. Export the model.
• The dlExportModel() action exports trained neural net models to other formats.
• ASTORE is the current binary format supported by CAS.
8. Import the model weights table.
• The dlImportModelWeights() action imports model weights information (initially specified as a CAS table object) from external sources.
• The currently supported format is HDF5.

As advances in deep learning are made, SAS will also continue to advance its deepLearn CAS action set.

This blog post is based on a SAS white paper, "How to Do Deep Learning With SAS: An introduction to deep learning neural networks and a guide to building deep learning models using SAS."

The post How to build deep learning models with SAS appeared first on Subconscious Musings.

### If you did not already know

Stacked Deconvolutional Network (SDN)
Recent progress in semantic segmentation has been driven by Fully Convolutional Networks (FCNs), but recovering the spatial resolution lost to repeated downsampling remains a challenge. To address this problem, we propose a Stacked Deconvolutional Network (SDN) for semantic segmentation. In SDN, multiple shallow deconvolutional networks, called SDN units, are stacked one by one to integrate contextual information and guarantee fine recovery of localization information. Meanwhile, inter-unit and intra-unit connections are designed to assist network training and enhance feature fusion, since the connections improve the flow of information and gradient propagation throughout the network. Besides, hierarchical supervision is applied during the upsampling process of each SDN unit, which guarantees the discrimination of feature representations and benefits network optimization. We carry out comprehensive experiments and achieve new state-of-the-art results on three datasets: PASCAL VOC 2012, CamVid and GATECH. In particular, our best model without CRF post-processing achieves an intersection-over-union score of 86.6% on the test set. …

Data Version Control (DVC)
DVC makes your data science projects reproducible by automatically building a data dependency graph (DAG). Your code and its dependencies can easily be shared via Git, and data through cloud storage (AWS S3, GCP), in a single DVC environment. …

Damerau-Levenshtein Distance
In information theory and computer science, the Damerau-Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau-Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other. The Damerau-Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions). In his seminal paper, Damerau stated that these four operations correspond to more than 80% of all human misspellings. Damerau’s paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau-Levenshtein distance has also seen uses in biology to measure the variation between protein sequences. …
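The distance described above is straightforward to compute with dynamic programming. The sketch below implements the common restricted variant (also called optimal string alignment), which counts transpositions of adjacent characters alongside the three classical edits:

```python
def damerau_levenshtein(a, b):
    """Restricted (optimal string alignment) Damerau-Levenshtein distance:
    insertions, deletions, substitutions, and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # deleting i characters from a
    for j in range(len(b) + 1):
        d[0][j] = j  # inserting j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

For example, "ca" and "ac" are one transposition apart, so their Damerau-Levenshtein distance is 1, whereas the classical Levenshtein distance would be 2.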

### Introducing the Anaconda Data Science Certification Program

This program gives data scientists a way to verify their proficiency, and organizations an independent standard for qualifying current and prospective data science experts. Register now!

### R/Finance 2018 Registration

This year marks the 10th anniversary of the R/Finance Conference!  As in prior years, we expect more than 250 attendees from around the world. R users from industry, academia, and government will be joining 50+ presenters covering all areas of finance with R.  The conference will take place on June 1st and 2nd at UIC in Chicago.

You can find registration information on the conference website, or you can go directly to the Cvent registration page.

Note that registration fees will increase by 50% at the end of early registration on May 21, 2018.

We are very excited about keynote presentations by JJ Allaire, Li Deng, and Norm Matloff.  The conference agenda (currently) includes 18 full presentations and 33 shorter “lightning talks”.  As in previous years, several (optional) pre-conference seminars are offered on Friday morning.  We’re still working on the agenda, but we have another great lineup of speakers this year!

There is also an (optional) conference dinner at Wyndham Grand Chicago Riverfront in the 39th Floor Penthouse Ballroom and Terrace.  Situated directly on the riverfront, it is a perfect venue to continue conversations while dining and drinking.

We would like to thank our 2018 Sponsors for their continued support, which enables us to host such an exciting conference:

On behalf of the committee and sponsors, we look forward to seeing you in Chicago!

Gib Bassett, Peter Carl, Dirk Eddelbuettel, Brian Peterson, Dale Rosenthal, Jeffrey Ryan, Joshua Ulrich

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Magister Dixit

“On a scale less grand, but probably more common, data analytics projects reach into all business units. Employees throughout these units must interact with the data science team. If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business. This lack of understanding is much more damaging in data science projects than in other technical projects, because the data science is supporting improved decision-making. This requires a close interaction between the data scientists and the business people responsible for decision-making. Firms where the business people do not understand what the data scientists are doing are at a substantial disadvantage, because they waste time and effort or, worse, because they ultimately make wrong decisions.” Foster Provost & Tom Fawcett ( 2013 )

### Why Deep Learning is perfect for NLP (Natural Language Processing)

Deep learning brings multiple benefits in learning multiple levels of representation of natural language. Here we will cover the motivation of using deep learning and distributed representation for NLP, word embeddings and several methods to perform word embeddings, and applications.

### Painless ODBC + dplyr Connections to Amazon Athena and Apache Drill with R & odbc

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my metis package which connects R up to Amazon Athena via RJDBC. I’m used to JDBC and have to deal with Java separately from R so I’m also comfortable with Java, JDBC and keeping R working with Java. I notified the #rstats Twitterverse about it and it started this thread (click on the embed to go to it — and, yes, this means Twitter is tracking you via this post unless you’ve blocked their JavaScript):

If you do scroll through the thread you’ll see @hadleywickham suggested using the odbc package with the ODBC driver for Athena.

I, and others, have noted that ODBC on macOS (and — for me, at least — Linux) never really played well together for us. Given that I’m familiar with JDBC, I just gravitated towards using it after trying it out with raw Java and it worked fine in R.

Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver and installed it and wired up an odbc + dplyr connection almost instantly:

```r
library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(),
  driver = "Simba Athena ODBC Driver",
  Schema = "redacted",
  AwsRegion = "us-east-1",
  AuthenticationType = "Default Credentials",
  S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

some_tbl <- tbl(con, "redacted")
```

Apologies for the redaction and lack of output but we’ve removed the default example databases from our work Athena environment and I’m not near my personal systems, so a more complete example will have to wait until later.

The TLDR is that I can now use 100% dplyr idioms with Athena versus adding them one by one to the RJDBC driver I made for metis. The metis package will still be around to support JDBC on systems that do have issues with ODBC and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).

The downside is that I’m now even more likely to run up the AWS bill.

I also maintain the sergeant package which provides REST API and REST query access to Apache Drill along with a REST API DBI driver and an RJDBC interface for Drill. I remember trying to get the MapR ODBC client working with R a few years ago so I made the package (which was also a great learning experience).

I noticed there was a very recent MapR Drill ODBC driver released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.

```r
library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib",
  ConnectionType = "Zookeeper",
  AuthenticationType = "No Authentication",
  ZKClusterID = "CLUSTERID",
  ZkQuorum = "HOST:2181",
  AdvancedProperties = "ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;"
) -> drill_con
```

```r
(employee <- tbl(drill_con, sql("SELECT * FROM cp.`employee.json`")))
## # Source:   SQL [?? x 16]
## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL]
##    employee_id   full_name    first_name last_name position_id   position_title   store_id
##  1 1             Sheri Nowmer Sheri      Nowmer    1             President        0
##  2 2             Derrick Whe… Derrick    Whelply   2             VP Country Mana… 0
##  3 4             Michael Spe… Michael    Spence    2             VP Country Mana… 0
##  4 5             Maya Gutier… Maya       Gutierrez 2             VP Country Mana… 0
##  5 6             Roberta Dam… Roberta    Damstra   3             VP Information … 0
##  6 7             Rebecca Kan… Rebecca    Kanagaki  4             VP Human Resour… 0
##  7 8             Kim Brunner  Kim        Brunner   11            Store Manager    9
##  8 9             Brenda Blum… Brenda     Blumberg  11            Store Manager    21
##  9 10            Darren Stanz Darren     Stanz     5             VP Finance       0
## 10 11            Jonathan Mu… Jonathan   Murraiin  11            Store Manager    1
## # ... with more rows, and 9 more variables: department_id, birth_date,
## #   hire_date, salary, supervisor_id, education_level,
## #   marital_status, gender, management_role
```

```r
count(employee, position_title, sort = TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   Drill 01.13.0000[@Apache Drill Server/DRILL]
## # Ordered by: desc(n)
##    position_title            n
##  1 Store Temporary Checker   268
##  2 Store Temporary Stocker   264
##  3 Store Permanent Checker   226
##  4 Store Permanent Stocker   222
##  5 Store Shift Supervisor    52
##  6 Store Permanent Butcher   32
##  7 Store Manager             24
##  8 Store Assistant Manager   24
##  9 Store Information Systems 16
## 10 HQ Finance and Accounting 8
## # ... with more rows
```

Apart from having to do that sql(…) to make the table connection work, it was pretty painless and I had both Athena and Drill working with dplyr verbs in under ten minutes (total).

You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters, and I’ll be updating my ongoing Using Apache Drill with R e-book to include this information. I will also keep maintaining the existing sergeant package, but will be including some additional methods to provide ODBC usage guidance and potentially other helpers if there are any “gotchas” that arise.

### FIN

The odbc package is super-slick and it’s refreshing to be able to use dplyr verbs with Athena vs gosh-awful SQL. However, for some of our needs the hand-crafted queries will still be necessary as they are far more optimized than what would likely get pieced together via the dplyr verbs. However, those queries can also be put right into sql() with the Athena ODBC driver connection and used via the same dplyr verb magic afterwards.

Today is, indeed, a good day to query!


### Neural Network based Startup Name Generator

How to build a recurrent neural network to generate suggestions for your new company’s name.

### Apple: Commerce Data Scientist – Apple Media Products

Seeking someone with a love for data. This position involves working on very large scale data mining, cleaning, analysis, deep level processing, machine learning or statistic modeling, metrics tracking and evaluation.

### Most banks won’t touch America’s legal pot industry

IT IS often said that markets hate uncertainty. America’s marijuana industry is no exception. Earlier this year Jeff Sessions, the country’s attorney-general, rescinded a set of federal guidelines for marijuana-related businesses operating in states where the drug is legal.

### G Research: Data Scientist – Data Intelligence

Seeking a Senior Data Scientist to join the Data Intelligence Team, to analyse and verify the fidelity and cleanliness of diverse data sources and generate analytics to help determine the usefulness of data sets in the investment process.

### Bacon Bytes for 20-April

As I’m writing this edition of Bacon Bytes it is snowing. in April. Again. Go home Winter, you’re drunk.

This past weekend I was having a discussion with friends about how Facebook tracks you even if you aren’t a member of Facebook. And while not an unusual practice, most people have no idea this is happening. Or allowed. So, if you want Facebook to stop tracking you, all you need to do is join Facebook. This is why Zuckerberg is a billionaire and I’m writing weekly recap posts.

In a sure sign that hell has frozen over, Microsoft released its first application for Linux this week. Microsoft built their own custom Linux kernel for their new IoT security service, Azure Sphere. Although it is not available yet (and I haven’t even heard about a private preview), this announcement is most interesting for two reasons. First, Microsoft built their own Linux distribution, which means Linus has won. No word yet on what his winnings are, but I’m guessing it’s a bunch of Bing rewards points. The second reason is that Amazon announced a similar service at re:Invent last year. So, the two major cloud providers are looking to corner the market on IoT security.

Here’s a good example of the current state of IoT security. Hackers stole data from a casino via a thermometer in a lobby fish tank. Just let that sink in for a moment. That’s why Azure Sphere is so important. Microsoft and Amazon need to protect people from themselves.

Microsoft, along with 33 other companies, signed an anti-cyberattack pledge this week. This announcement was timed with the start of the RSA conference, because that’s the place to let the world know you are committed to securing customers’ data and devices. Unless you are Apple, Google, or (surprisingly) Amazon. Those companies chose not to sign the pledge. I’ve no idea why Amazon would not have signed; perhaps nobody could get a hold of Jeff Bezos in time.

SolarWinds released their annual IT Trends report. The report is the result of a survey of over 800 companies worldwide. The biggest takeaway this year is that the report shows the gap that exists between management and labor. If you have ever sat at your desk and thought your management was a bit crazy, this is the report for you. It shows that a gap exists, and has data to explain why. Essentially, people are too busy keeping the lights on to keep pace with emerging technology. This isn’t a new problem, I’m sure, but it is nice to have data to help understand the gap, and to figure out what actions to take next.

If you’ve ever thought about getting up and walking out of YAUM (Yet Another Useless Meeting), Elon Musk has your back. Musk shared some productivity tips, such as leaving meetings or hanging up the phone when needed. Most of the rules Musk lays out for employees follow the lines of common sense. If your time is better spent elsewhere, don’t stay in a meeting where you no longer add value. Unfortunately, common sense isn’t so common.

The post Bacon Bytes for 20-April appeared first on Thomas LaRock.

### Viacom’s Journey to Improving Viewer Experiences with Real-time Analytics at Scale

With over 4 billion subscribers, Viacom is focused on delivering amazing viewing experiences to their global audiences. Core to this strategy is ensuring petabytes of streaming content is delivered flawlessly through web, mobile and streaming applications. This is critically important during popular live events like the MTV Video Music Awards.

Streaming this much video can strain delivery systems resulting in long load times, mid-stream freezes and other issues. Not only does this create a poor experience, but can also result in lost ad dollars. To combat this, Viacom set out to build a scalable analytics platform capable of processing terabytes of streaming data for real-time insights on the viewer experience.

After evaluating a number of technologies, Viacom found their solution in Amazon S3 and the Databricks Unified Analytics Platform powered by Apache Spark™. The rapid scalability of S3, coupled with the ease and processing power of Databricks, enabled Viacom to rapidly deploy and scale Spark clusters and unify their entire analytics stack – from basic SQL to advanced analytics on large-scale streaming and historical datasets – with a single platform.

To learn more, join our webinar How Viacom Revolutionized Audience Experiences with Real-Time Analytics and AI on Apr 25 at 10:00 am PT.

The webinar will cover:

• Why Viacom chose Databricks, Spark and AWS for scalable real-time insights and AI
• How a unified platform for ad-hoc, batch, and real-time data analytics enabled them to improve content delivery
• What it takes to create a self service analytics platform for business users, analysts, and data scientists

Register to attend this session.

--

The post Viacom’s Journey to Improving Viewer Experiences with Real-time Analytics at Scale appeared first on Databricks.

### G Research: Data Scientist

Seeking a data scientist to help us diversify and extend the capabilities of the Security Data Science team.

### NLP – Building a Question Answering Model

In this blog, I want to cover the main building blocks of a question answering model.

### G Research: Data Science Tooling Expert

Seeking a candidate to identify and research new data science and machine learning technologies and support POCs of these technologies as well as other kinds of underlying infrastructure.

### Packaging Shiny applications: A deep dive

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

(Or, how to write a Shiny app.R file that only contains a single line of code)

##### Mark Sellors, Head of Data Engineering

This post is long overdue. The information contained herein has been built up over years of deploying and hosting Shiny apps, particularly in production environments, and mainly where those Shiny apps are very large and contain a lot of code.

Last year, during some of my conference talks, I told the story of Mango’s early adoption of Shiny and how it wasn’t always an easy path to production for us. In this post I’d like to fill in some of the technical background and provide some information about Shiny app publishing and packaging that is hopefully useful to a wider audience.

I’ve figured out some of this for myself, but the most pivotal piece of information came from Shiny creator Joe Cheng. Joe told me some time ago that all you really need in an app.R file is a function that returns a Shiny application object. When he told me this, I was heavily embedded in the publication side and I didn’t immediately understand the implications.

Over time though I came to understand the power and flexibility that this model provides and, to a large extent, that’s what this post is about.

### What is Shiny?

Hopefully if you’re reading this you already know, but Shiny is a web application framework for R. It allows R users to develop powerful web applications entirely in R without having to understand HTML, CSS and JavaScript. It also allows us to embed the statistical power of R directly into those web applications.

Shiny apps generally consist of either a ui.R and a server.R (containing user interface and server-side logic respectively) or a single app.R which contains both.
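For readers who haven’t seen one, a minimal single-file app.R looks something like this (an illustrative sketch, not taken from any particular app):

```r
# app.R -- a minimal single-file Shiny application (illustrative sketch)
library(shiny)

# UI object: defines the HTML for the user interface
ui <- fluidPage(
  numericInput("n", "Number of observations:", value = 50),
  plotOutput("hist")
)

# Server function: defines the server-side logic that
# handles inputs and produces outputs in response
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Sample histogram")
  })
}

# Constructs and runs the Shiny application object
shinyApp(ui = ui, server = server)
```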

### Why package a Shiny app anyway?

If your app is small enough to fit comfortably in a single file, then packaging your application is unlikely to be worth it. As with any R script though, when it gets too large to be comfortably worked with as a single file, it can be useful to break it up into discrete components.

Publishing a packaged app will be more difficult, but to some extent that will depend on the infrastructure you have available to you.

### Pros of packaging

Packaging is one of the many great features of the R language. Packages are fairly straightforward, quick to create and you can build them with a host of useful features like built-in documentation and unit tests.

They also integrate really nicely into Continuous Integration (CI) pipelines and are supported by tools like Travis. You can also get test coverage reports using things like codecov.io.

They’re also really easy to share. Even if you don’t publish your package to CRAN, you can still share it on GitHub and have people install it with devtools, or build the package and share that around, or publish the package on a CRAN-like system within your organisation’s firewall.

### Cons of packaging

Before you get all excited and start to package your Shiny applications, you should be aware that — depending on your publishing environment — packaging a Shiny application may make it difficult or even impossible to publish to a system like Shiny Server or RStudio Connect, without first unpacking it again.

### A little bit of Mango history

This is where Mango were in the early days of our Shiny use. We had a significant disconnect between our data scientists writing the Shiny apps and the IT team tasked with supporting the infrastructure they used. This was before we’d committed to having an engineering team that could sit in the middle and provide a bridge between the two.

When our data scientists would write apps that got a little large, or that they wanted robust tests and documentation for, they would stick them in packages and send them over to me to publish to our original Shiny Server. The problem was: R packages didn’t really mean anything to me at the time. I knew how to install them, but that was about as far as it went. I knew from the Shiny docs that a Shiny app needs certain files (a server.R and ui.R, or an app.R), but that wasn’t what I got, so I’d send it back to the data science team and tell them that I needed those files or I wouldn’t be able to publish it.

More than once I got back a response along the lines of, “but you just need to load it up and then do runApp()”. But, that’s just not how Shiny Server works. Over time, we’ve evolved a set of best practices around when and how to package a Shiny application.

The first step was taking the leap into understanding Shiny and R packages better. It was here that I started to work in the space between data science and IT.

### How to package a Shiny application

If you’ve seen the simple app you get when you choose to create a new Shiny application in RStudio, you’ll be familiar with the basic structure of a Shiny application. You need to have a UI object and a server function.

If you have a look inside the UI object you’ll see that it contains the HTML that will be used for building your user interface. It’s not everything that will get served to the user when they access the web application — some of that is added by the Shiny framework when it runs the application — but it covers off the elements you’ve defined yourself.

The server function defines the server-side logic that will be executed for your application. This includes code to handle your inputs and produce outputs in response.

The great thing about Shiny is that you can create something awesome quite quickly, but once you’ve mastered the basics, the only limit is your imagination.

For our purposes here, we’re going to stick with the ‘geyser’ application that RStudio gives you when you click to create a new Shiny Web Application. If you open up RStudio, and create a new Shiny app — choosing the single file app.R version — you’ll be able to see what we’re talking about. The small size of the geyser app makes it ideal for further study.

If you look through the code you’ll see that there are essentially three components: the UI object, the server function, and the shinyApp() function that actually runs the app.

Building an R package of just those three components is a case of breaking them out into the constituent parts and inserting them into a blank package structure. We have a version of this up on GitHub that you can check out if you want.

The directory layout of the demo project looks like this:

|-- DESCRIPTION
|-- NAMESPACE
|-- R
|   |-- launchApp.R
|   |-- shinyAppServer.R
|   -- shinyAppUI.R
|-- inst
|   -- shinyApp
|       -- app.R
|-- man
|   |-- launchApp.Rd
|   |-- shinyAppServer.Rd
|   -- shinyAppUI.Rd
-- shinyAppDemo.Rproj
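For completeness, a minimal DESCRIPTION file for a package like this might contain something along these lines (field values are assumed for illustration, not copied from the demo):

```
Package: shinyAppDemo
Title: A Demo of a Packaged Shiny Application
Version: 0.1.0
Description: Demonstrates wrapping a Shiny application in an R package.
Imports:
    shiny
License: MIT + file LICENSE
```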

Once the app has been adapted to sit within the standard R package structure we’re almost done. The UI object and server function don’t really need to be exported, and we’ve just put a really thin wrapper function around shinyApp() — I’ve called it launchApp() — which we’ll actually use to launch the app. If you install the package from GitHub with devtools, you can see it in action.
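The thin wrapper itself can be as simple as the sketch below (names assumed to match the demo package; the roxygen comments are what generate the .Rd files in the ‘man’ directory shown above):

```r
# R/launchApp.R -- thin wrapper around shinyApp() (sketch only)

#' Launch the packaged Shiny application
#'
#' @param ... further arguments passed through to shiny::shinyApp()
#' @return a Shiny application object
#' @export
launchApp <- function(...) {
  # shinyAppUI and shinyAppServer live in R/ alongside this file
  shiny::shinyApp(ui = shinyAppUI, server = shinyAppServer, ...)
}
```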

library(shinyAppDemo)
launchApp()

This will start the Shiny application running locally.

The approach outlined here also works fine with Shiny Modules, either in the same package, or called from a separate package.

And that’s almost it! The only thing remaining is how we might deploy this app to Shiny server (including Shiny Server Pro) or RStudio Connect.

### Publishing your packaged Shiny app

We already know that Shiny Server and RStudio Connect expect either a ui.R and a server.R or an app.R file. We’re running our application out of a package with none of this, so we won’t be able to publish it until we fix this problem.

The solution we’ve arrived at is to create a directory called ‘shinyApp’ inside the inst directory of the package. For those of you who are new to R packaging, the contents of the ‘inst’ directory are essentially ignored during the package build process, so it’s an ideal place to put little extras like this.

The name ‘shinyApp’ was chosen for consistency with Shiny Server, which uses a ‘ShinyApps’ directory if a user is allowed to serve applications from their home directory.

Inside this directory we create a single ‘app.R’ file with the following line in it:

shinyAppDemo::launchApp()

And that really is it. This one file will allow us to publish our packaged application under some circumstances, which we’ll discuss shortly.

Here’s where having a packaged Shiny app can get tricky, so we’re going to talk you through the options and do what we can to point out the pitfalls.

### Shiny Server and Shiny Server Pro

Perhaps surprisingly — given that Shiny Server is the oldest method of Shiny app publication — it’s also the easiest one to use with these sorts of packaged Shiny apps. There are basically two ways to publish on Shiny Server. From your home directory on the server — also known as self-publishing — or publishing from a central location, usually the directory ‘/srv/shiny-server’.

The central benefit of this approach is the ability to update the application just by installing a newer version of the package. Sadly though, it’s not always an easy approach to take.

#### Apps served from home directory (AKA self-publishing)

The first publication method is from a users’ home directory. This is generally used in conjunction with RStudio Server. In the self-publishing model, Shiny Server (and Pro) expect apps to be found in a directory called ‘ShinyApps’, within the users home directory. This means that if we install a Shiny app in a package the final location of the app directory will be inside the installed package, not in the ShinyApps directory. In order to work around this, we create a link from where the app is expected to be, to where it actually is within the installed package structure.

So in the example of our package, we’d do something like this in a terminal session:

# make sure we’re in our home directory
cd
# change into the ShinyApps directory
cd ShinyApps
# create a link from our app directory inside the package
ln -s /home/sellorm/R/x86_64-pc-linux-gnu-library/3.4/shinyAppDemo/shinyApp ./testApp

Note: The path you will find your libraries in will differ from the above. Check by running .libPaths()[1] and then dir(.libPaths()[1]) to see if that’s where your packages are installed.

Once this is done, the app should be available at ‘http://<server>:3838/<username>/testApp’ and can be updated by updating the installed version of the package. Update the package and the changes will be published via Shiny Server straight away.

#### Apps served from a central location (usually /srv/shiny-server)

This is essentially the same as above, but the task of publishing the application generally falls to an administrator of some sort.

Since they would have to transfer files to the server and log in anyway, it shouldn’t be too much of an additional burden to install a package while they’re there. Especially if that makes life easier from then on.

The admin would need to transfer the package to the server, install it and then create a link — just like in the example above — from the expected location, to the installed location.

The great thing with this approach is that when updates are due to be installed the admin only has to update the installed package and not any other files.

### RStudio Connect

Connect is the next-generation Shiny Server. In terms of features and performance, it’s far superior to its predecessor. One of the best features is the ability to push Shiny app code directly from the RStudio IDE. For the vast majority of users, this is a huge productivity boost, since you no longer have to wait for an administrator to publish your app for you.

Since publishing doesn’t require anyone to log into the server directly as part of the publishing process, there aren’t really any straightforward opportunities to install a custom package. This means that, in general, publishing a packaged Shiny application isn’t really possible.

There’s only one real workaround for this situation that I’m aware of. If you have an internal CRAN-like repository for your custom packages, you should be able to use that to update Connect, with a little work.

You’d need to have your dev environment and Connect hooked up to the same repo. The updated app package needs to be available in that repo and installed in your dev environment. Then, for each successive package version you release, you could republish the single-line app.R so that Connect picks up the new version.

Connect uses packrat under the hood, so when you publish the app.R the packrat manifest will also be sent to the server. Connect will use the manifest to decide which packages are required to run your app. If you’re using a custom package this would get picked up and installed or updated during deployment.

### shinyapps.io

It’s not currently possible to publish a packaged application to shinyapps.io. You’d need to make sure your app follows the accepted conventions for creating Shiny apps and uses only files, rather than any custom packages.

### Conclusion

Packaging Shiny apps can be a real productivity boon for you and your team. In situations where you can integrate that process into other processes, such as automatically running your unit tests or publishing automatically, it can also help you adopt devops-style workflows.

However, in some instances, the practice can actually make things worse and really slow you down. It’s essential to understand what the publishing workflow is in your organisation before embarking on any significant Shiny packaging project as this will help steer you towards the best course of action.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### 4 Ways IT Managers Should Expand Their Skills for Career Growth

Whether you’re already working in IT management or still on your way up, you should be sure of one thing: that you’re as comfortable working with people as you are working with tech. As a leader in IT, you’re responsible for much more than technology. You’ll likely spend as much –

The post 4 Ways IT Managers Should Expand Their Skills for Career Growth appeared first on Dataconomy.

### Understanding What is Behind Sentiment Analysis – Part 2

Fine-tuning our sentiment classifier...

### Microsoft Weekly Data Science News for April 20, 2018

Here are the latest articles from Microsoft regarding cloud data science products and updates.


### AI, Machine Learning and Data Science Roundup: April 2018

A monthly roundup of news about Artificial Intelligence, Machine Learning and Data Science. This is an eclectic collection of interesting blog posts, software announcements and data applications I've noted over the past month or so.

## Open Source AI, ML & Data Science News

An interface between R and Python: reticulate.

TensorFlow.js: Browser-based machine learning with WebGL acceleration.

## Industry News

Tensorflow 1.7 supports the TensorRT library for faster computation on NVIDIA GPUs.

RStudio now provides a Tensorflow template in Paperspace for computation with NVIDIA GPUs.

Google Cloud Text-to-Speech provides natural speech in 32 voices and 12 languages.

Amazon Translate is now generally available.

### Microsoft News

ZDNet reviews The Future Computed: "do read it to remind yourself how much preparation is required for the impact of AI".

Azure Sphere: a new chip, Linux-based OS, and cloud services to secure IoT devices.

Improvements for Python developers in the March 2018 release of Visual Studio Code.

A review of the Azure Data Science Virtual Machine with a focus on deep learning with GPUs.

Azure Media Analytics services: motion, face and text detection and semantic tagging for videos.

### Learning resources

Training SqueezeNet in Azure with MXNet and the Data Science Virtual Machine.

Microsoft's Professional Program in AI, now available to the public as an EdX course.

Run Python scripts on demand with Azure Container Instances.

How to train multiple models simultaneously with Azure Batch AI.

Scaling models to Kubernetes clusters with Azure ML Workbench.

A "weird" introduction to Deep Learning, by Favio Vázquez.

Find previous editions of the monthly AI roundup here.

### Carol Nickerson investigates an unfounded claim of “17 replications”

Carol Nickerson sends along this report in which she carefully looks into the claim that the effect of power posing on feelings of power has replicated 17 times. Also relevant to the discussion is this post from a few months ago by Joe Simmons, Leif Nelson, and Uri Simonsohn.

I am writing about this because the claims of replication have been receiving wide publicity, and so, to the extent that these claims are important and worth publicizing, it’s also important to point out their errors. Everyone makes scientific mistakes—myself included—and the fact that some mistakes were made regarding claimed replications is not intended in any way to represent a personal criticism of anyone involved.

### Traits you’ll find in good managers

Work with your manager to get what you need, when you need it.

Continue reading Traits you’ll find in good managers.

### Thinking beyond bots: How AI can drive social impact

A few ways to think differently and integrate innovation and AI into your company's altruistic pursuits.

What do artificial intelligence (AI), invention, and social good have in common? While on the surface they serve very different purposes, at their core, they all require you to do one thing in order to be successful at them: think differently.

Take the act of inventing—in order to develop a great patent, trade secret, or other intellectual property, you need to think outside of the box. Similarly, at the heart of AI is the act of unlocking new capabilities, whether that’s making virtual personal assistants like Alexa more useful, or creating a chatbot that provides a personalized experience to customers. And because of the constantly changing economic and social landscapes, coming up with impactful social good initiatives requires you to constantly approach things through a new lens.

Individually, these fields have seen notable advancements over the past year, including new technologies that are bringing improvements to AI and large companies that are prioritizing giving back. But even more exciting is that we’re seeing more and more business leaders and nonprofits combining AI, innovation, and social good to reach communities in innovative ways, at a scale we’ve never before seen.

There’s no better time than now to explore how your organization approaches your social good efforts. Here are a few ways you can think differently and integrate innovation and AI into your company’s altruistic pursuits.

## Approach social good through the mind of an inventor

As a master inventor at IBM, I’m part of the team responsible for helping the company become the leading recipient of U.S. patents for the last quarter century. While developing patents and intellectual property might not be what you’re setting out to do as part of your humanitarian efforts, the way we approach our jobs as inventors can be applied across all aspects of giving back.

Consider the United Nations’ 17 Sustainable Development Goals, which aim to eradicate things like poverty, hunger, and disease. These are game-changing initiatives that definitely require new ideas. What’s more, the United Nations estimates that we’re $5 trillion short on the resources needed to accomplish these goals. How do we bridge this gap? We need to start thinking differently.

Foundationally, coming up with a great invention means identifying a problem that needs to be solved and coming up with an out-of-the-box idea that’s smart, has the biggest impact, and carries the lowest risk. To do this, we look at which relevant technologies are already at our disposal, so we don’t have to completely reinvent the wheel. We also identify which parts of the solution need a completely new idea created from scratch. Additionally, we look at the issue we’re trying to solve and the current landscape as a whole so we can predict any issues or future problems that may arise, and we try to address them ahead of time in our invention.

The same approach should be applied to social good: identify the problem you want to solve, the tools that already exist to help you solve it, and the resources that need to be created or brought in from outside in order to execute your plan. At the heart of social good, as with most inventions, are the people you’re trying to help.
You need to make sure you’re maximizing the reach of your project while also minimizing any risks that may unintentionally create additional problems for the people you’re trying to help. To do this, you need to be creative in your approach. As an example, this is exactly the approach InvestEd is taking (full disclosure: I am an advisor for InvestEd). They started off by realizing they could commercialize and create social good at the same time by enabling financial education and facilitating microloans for small businesses in emerging markets. Helping these small businesses grow added more value to the small, local communities. And to make their product even better, InvestEd is adding AI capabilities to widen their offerings and provide a more innovative user experience.

## AI: Unlocking new capabilities

To capture the value and create disruptive AI technology for social good, we have to think beyond the typical automation activities of a machine. Take Guiding Eyes, for example, which is using AI to discover the secrets behind successful guide dogs. By taking advantage of natural language processing (NLP) on structured and unstructured data, the system they’re using is trained to find correlations to successful dogs among genetic, health, temperament, and environmental factors, and the technology continues to learn and get better. By using AI, Guiding Eyes has seen a 10% increase in guide dog graduation rates, helping the organization meet the growing demand for guide dogs.

There are many other examples of AI being used for the betterment of society. PAWS, for example, is an organization that uses machine learning to predict where poachers may strike, and Dr. Eric Elster worked with the Walter Reed National Military Medical Center to apply machine learning techniques to improve the treatment of U.S. service members injured in combat.
## Best practices for getting started

These are just a few ideas for how AI can be used for social good; there are still plenty of opportunities out there. The challenge is how to get started, so here are three best practices I’d like to share with people who want to embark on this journey.

1. First, build your understanding of what AI is and is not through great online learning, such as Intro to Artificial Intelligence, led by Peter Norvig and Sebastian Thrun, on Udacity.

2. Second, think differently. AI is a different computing model. Instead of thinking about use cases and scenarios, really focus on the problem you want to solve. Think about the ideal scenario for a solution, and then see if a machine can be trained to do that work. Consider personalized education, particularly reading comprehension (which has been shown to have a tremendous impact on a child’s long-term educational performance across all subjects). With a traditional use-case approach, we would probably try to develop a general framework that would help in a handful of scenarios. Learning Ovations, however, thought about the more ideal scenario. They realized there are too many possible scenarios to program, or even for a general framework to cover. Instead, they’re training AI to assess each child’s performance (across traditional metrics and some new ones) as a tool for educators and parents. In addition, they’re creating an AI-powered recommendation engine based on each individual school’s curriculum, giving educators another tool to create a customized reading program for each student. In short, Learning Ovations thought differently about how to personalize education.

3. Third, set aside preconceived notions. There are things that people are better than machines at doing, but there are also things machines are better than people at doing, some of which may be surprising.
For example, people seem to be more honest in sharing health or financial information with a machine than a person because they don’t worry about being judged. This typically means the machine gets more accurate data to provide recommendations. Thus, recognizing that a machine might be as capable in some areas could unlock whole new capabilities. When it comes to AI, invention, and social good, the possibilities are endless. Technology will only continue to become more advanced, creating new opportunities to fix societal problems related to health, sustainability, conservation, accessibility, and much more. If you’re thinking of jumping into AI for good, just remember the most important rule: think differently. Continue reading Thinking beyond bots: How AI can drive social impact. Continue Reading… ### Four short links: 20 April 2018 Functional Programming, High-Dimensional Data, Games and Datavis, and Container Management 1. Interview with Simon Peyton-Jones -- I had always assumed that the more bleeding-edge changes to the type system, things like type-level functions, generalized algebraic data types (GADTs), higher rank polymorphism, and existential data types, would be picked up and used enthusiastically by Ph.D. students in search of a topic, but not really used much in industry. But in fact, it turns out that people in companies are using some of these still-not-terribly-stable extensions. I think it's because people in companies are writing software that they want to still be able to maintain and modify in five years time. SPJ is the creator of Haskell, and one of the leading thinkers in functional programming. 2. HyperTools -- A Python toolbox for visualizing and manipulating high-dimensional data. Open source. High-dimensional = "a lot of columns in each row". 3. What Videogames Have to Teach Us About Data Visualization -- super-interesting exploration of space, storytelling, structure, and annotations. 4. 
Titus -- Netflix open-sourced their container management platform. There aren't many companies with the scale problems of Amazon, Netflix, Google, etc., so it's always interesting to see what comes out of them. Continue reading Four short links: 20 April 2018. Continue Reading… ### Here’s what you get when you cross dinosaurs and flowers with deep learning Neural networks have shown usefulness with a number of things, but here is an especially practical use case. Chris Rodley used neural networks to create a hybrid of a dinosaur book and a flower book. The world may never be the same again. Continue Reading… ### Videos: Computational Theories of the Brain, Simons Institute for the Theory of Computing Monday, April 16th, 20188:30 am – 8:50 am Coffee and Check-In 8:50 am – 9:00 am Opening Remarks 9:00 am – 9:45 am The Prefrontal Cortex as a Meta-Reinforcement Learning SystemMatthew Botvinick, DeepMind Technologies Limited, London and University College London9:45 am – 10:30 am Working Memory Influences Reinforcement Learning Computations in Brain and BehaviorAnne Collins, UC Berkeley10:30 am – 11:00 am Break 11:00 am – 11:45 am Predictive Coding Models of PerceptionDavid Cox, Harvard University11:45 am – 12:30 pm TBASophie Denève, Ecole Normale Supérieure12:30 pm – 2:30 pm Lunch 2:30 pm – 3:15 pm Towards Biologically Plausible Deep Learning: Early Inference in Energy-Based Models Approximates Back-PropagationAsja Fischer, University of Bonn 3:15 pm – 4:00 pm Neural Circuitry Underling Working Memory in the Dorsolateral Prefrontal CortexVeronica Galvin, Yale University 4:00 pm – 5:00 pm Reception Tuesday, April 17th, 20188:30 am – 9:00 am Coffee and Check-In 9:00 am – 9:45 am TBASurya Ganguli, Stanford University9:45 am – 10:30 am Does the Neocortex Use Grid Cell-Like Mechanisms to Learn the Structure of Objects?Jeff Hawkins, Numenta 10:30 am – 11:00 am Break 11:00 am – 11:45 am Dynamic Neural Network Structures Through Stochastic RewiringRobert Legenstein, Graz 
University of Technology11:45 am – 12:30 pm Backpropagation and Deep Learning in the BrainTimothy Lillicrap, DeepMind Technologies Limited, London12:30 pm – 2:30 pm Lunch 2:30 pm – 3:15 pm An Algorithmic Theory of Brain NetworksNancy Lynch, Massachusetts Institute of Technology3:15 pm – 4:00 pm Networks of Spiking Neurons Learn to Learn and RememberWolfgang Maass, Graz University of Technology4:00 pm – 4:30 pm Break 4:30 pm – 5:30 pm Plenary Discussion: What Is Missing in Current Theories of Brain Computation? Wednesday, April 18th, 20188:30 am – 9:00 am Coffee and Check-In 9:00 am – 9:45 am Functional Triplet Motifs Underlie Accurate Predictions of Single-Trial Responses in Populations of Tuned and Untuned v1 NeuronsJason MacLean, University of Chicago9:45 am – 10:30 am The Sparse Manifold TransformBruno Olshausen, UC Berkeley10:30 am – 11:00 am Break 11:00 am – 11:45 am Playing Newton: Automatic Construction of Phenomenological, Data-Driven Theories and ModelsIlya Nemenman, Emory University11:45 am – 12:30 pm A Functional Classification of Glutamatergic Circuits in Cortex and ThalamusS. Murray Sherman, University of Chicago12:30 pm – 2:30 pm Lunch 2:30 pm – 3:15 pm On the Link Between Energy & Information for the Design of Neuromorphic SystemsNarayan Srinivasa, Eta Compute3:15 pm – 4:00 pm Neural Circuit Representation of Multiple Cognitive Tasks: Clustering and CompositionalityXJ Wang, New York University4:00 pm – 4:30 pm Break 4:30 pm – 5:30 pm Plenary Discussion: How Can One Test/Falsify Current Theories of Brain Computation? 
Thursday, April 19th, 20188:30 am – 9:00 am Coffee and Check-In 9:00 am – 9:45 pm Control of Synaptic Plasticity in Deep Cortical NetworksPieter Roelfsema, University of Amsterdam9:45 am – 10:30 am Computation with AssembliesChristos Papadimitriou, Columbia University10:30 am – 11:00 am Break 11:00 am – 11:45 am Capacity of Neural Networks for Lifelong Learning of Composable TasksLes Valiant, Harvard University11:45 am – 12:30 pm An Integrated Cognitive ArchitectureGreg Wayne, Columbia University Continue Reading… ### Whats new on arXiv This thesis investigates unsupervised time series representation learning for sequence prediction problems, i.e. generating nice-looking input samples given a previous history, for high dimensional input sequences by decoupling the static input representation from the recurrent sequence representation. We introduce three models based on Generative Stochastic Networks (GSN) for unsupervised sequence learning and prediction. Experimental results for these three models are presented on pixels of sequential handwritten digit (MNIST) data, videos of low-resolution bouncing balls, and motion capture data. The main contribution of this thesis is to provide evidence that GSNs are a viable framework to learn useful representations of complex sequential input data, and to suggest a new framework for deep generative models to learn complex sequences by decoupling static input representations from dynamic time dependency representations. This paper presents issues regarding short term electric load forecasting using feedforward and Elman recurrent neural networks. The study cases were developed using measured data representing electrical energy consume from Banat area. There were considered 35 different types of structure for both feedforward and recurrent network cases. For each type of neural network structure were performed many trainings and best solution was selected. 
Forecasting the load on the short term is essential for effective energy consumption management in an open market environment. State-of-the-art forecasting methods using Recurrent Neural Networks (RNN) based on Long-Short Term Memory (LSTM) cells have shown exceptional performance targeting short-horizon forecasts, e.g., given a set of predictor features, forecast a target value for the next few time steps in the future. However, in many applications, the performance of these methods decays as the forecasting horizon extends beyond these few time steps. This paper aims to explore the challenges of long-horizon forecasting using LSTM networks. Here, we illustrate the long-horizon forecasting problem in datasets from neuroscience and energy supply management. We then propose expectation-biasing, an approach motivated by the literature of Dynamic Belief Networks, as a solution to improve long-horizon forecasting using LSTMs. We propose two LSTM architectures along with two methods for expectation biasing that significantly outperform standard practice.

With the increasing demand for large amounts of labeled data, crowdsourcing has been used in many large-scale data mining applications. However, most existing works in crowdsourcing mainly focus on label inference and incentive design. In this paper, we address a different problem of adaptive crowd teaching, which is a sub-area of machine teaching in the context of crowdsourcing. Compared with machines, human beings are extremely good at learning a specific target concept (e.g., classifying the images into given categories) and they can also easily transfer the learned concepts into similar learning tasks. Therefore, a more effective way of utilizing crowdsourcing is by supervising the crowd to label in the form of teaching.
In order to perform the teaching and expertise estimation simultaneously, we propose an adaptive teaching framework named JEDI to construct the personalized optimal teaching set for the crowdsourcing workers. In JEDI teaching, the teacher assumes that each learner has an exponentially decayed memory. Furthermore, it ensures comprehensiveness in the learning process by carefully balancing teaching diversity and the learner’s accurate learning in terms of teaching usefulness. Finally, we validate the effectiveness and efficacy of JEDI teaching in comparison with the state-of-the-art techniques on multiple data sets with both synthetic learners and real crowdsourcing workers.

Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand. This paper addresses this problem directly by explaining deep probabilistic programming languages and indirectly by characterizing their current strengths and weaknesses.

We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages – multimodal encoder, self-expressive layer, and multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same for different spatial fusion-based approaches. In addition to various spatial fusion-based methods, an affinity fusion-based network is also proposed in which the self-expressiveness layer corresponding to different modalities is enforced to be the same.
Extensive experiments on three datasets show that the proposed methods significantly outperform the state-of-the-art multimodal subspace clustering methods.

Associative memory using fast weights is a short-term memory mechanism that substantially improves the memory capacity and time scale of recurrent neural networks (RNNs). As recent studies introduced fast weights only to regular RNNs, it is unknown whether fast weight memory is beneficial to gated RNNs. In this work, we report a significant synergy between long short-term memory (LSTM) networks and fast weight associative memories. We show that this combination, in learning associative retrieval tasks, results in much faster training and lower test error, a performance boost most prominent at high memory task difficulties.

Using information theoretic concepts to understand and explore the inner organization of deep neural networks (DNNs) remains a big challenge. Recently, the concept of an information plane began to shed light on the analysis of multilayer perceptrons (MLPs). We provided an in-depth insight into stacked autoencoders (SAEs) using a novel matrix-based Renyi’s {\alpha}-entropy functional, enabling for the first time the analysis of the dynamics of learning using information flow in real-world scenarios involving complex network architectures and large data. Despite the great potential of these past works, there are several open questions when it comes to applying information theoretic concepts to understand convolutional neural networks (CNNs). These include, for instance, the accurate estimation of information quantities among multiple variables, and the many different training methodologies. By extending the novel matrix-based Renyi’s {\alpha}-entropy functional to a multivariate scenario, this paper presents a systematic method to analyze CNN training using information theory.
Our results validate two fundamental data processing inequalities in CNNs, and also have direct impacts on previous work concerning the training and design of CNNs.

This paper presents the SCvx algorithm, a successive convexification algorithm designed to solve non-convex optimal control problems with global convergence and superlinear convergence-rate guarantees. The proposed algorithm handles nonlinear dynamics and non-convex state and control constraints by linearizing them about the solution of the previous iterate, and solving the resulting convex subproblem to obtain a solution for the current iterate. Additionally, the algorithm incorporates several safeguarding techniques into each convex subproblem, employing virtual controls and virtual buffer zones to avoid artificial infeasibility, and a trust region to avoid artificial unboundedness. The procedure is repeated in succession, thus turning a difficult non-convex optimal control problem into a sequence of numerically tractable convex subproblems. Using fast and reliable Interior Point Method (IPM) solvers, the convex subproblems can be computed quickly, making the SCvx algorithm well suited for real-time applications. Analysis is presented to show that the algorithm converges both globally and superlinearly, guaranteeing the local optimality of the original problem. The superlinear convergence is obtained by exploiting the structure of optimal control problems, showcasing the superior convergence rate that can be obtained by leveraging specific problem properties when compared to generic nonlinear programming methods. Numerical simulations are performed for an illustrative non-convex quad-rotor motion planning example problem, and corresponding results obtained using a Sequential Quadratic Programming (SQP) solver are provided for comparison.
Our results show that the convergence rate of the SCvx algorithm is indeed superlinear, and surpasses that of the SQP-based method by converging in less than half the number of iterations.

Tasks such as social network analysis, human behavior recognition, or modeling biochemical reactions can be solved elegantly by using the probabilistic inference framework. However, standard probabilistic inference algorithms work at a propositional level, and thus cannot capture the symmetries and redundancies that are present in these tasks. Algorithms that exploit those symmetries have been devised in different research fields, for example by the lifted inference, multiple object tracking, and modeling and simulation communities. The common idea, which we call state space abstraction, is to perform inference over compact representations of sets of symmetric states. Although they are concerned with a similar topic, the relationship between these approaches has not been investigated systematically. This survey provides the following contributions. We perform a systematic literature review to outline the state of the art in probabilistic inference methods exploiting symmetries. From an initial set of more than 4,000 papers, we identify 116 relevant papers. Furthermore, we provide new high-level categories that classify the approaches, based on the problem classes the different approaches can solve. Researchers from different fields that are confronted with a state space explosion problem in a probabilistic system can use this classification to identify possible solutions. Finally, based on this conceptualization, we identify potentials for future research, as some relevant application domains are not addressed by current approaches.

We introduce an exact distributed algorithm to train Random Forest models as well as other decision forest models without relying on approximating best split search.
We explain the proposed algorithm and compare it to related approaches for various complexity measures (time, RAM, disk, and network complexity analysis). We report its running performance on artificial and real-world datasets of up to 18 billion examples. This figure is several orders of magnitude larger than datasets tackled in the existing literature. Finally, we empirically show that Random Forest benefits from being trained on more data, even in the case of already gigantic datasets. Given a dataset with 17.3B examples with 82 features (3 numerical, the others categorical with high arity), our implementation trains a tree in 22h.

The cross-domain recommendation technique is an effective way of alleviating the data sparsity in recommender systems by leveraging the knowledge from relevant domains. Transfer learning is a class of algorithms underlying these techniques. In this paper, we propose a novel transfer learning approach for cross-domain recommendation by using neural networks as the base model. We assume that hidden layers in two base networks are connected by cross mappings, leading to the collaborative cross networks (CoNet). CoNet enables dual knowledge transfer across domains by introducing cross connections from one base network to another and vice versa. CoNet is achieved in multi-layer feedforward networks by adding dual connections and joint loss functions, which can be trained efficiently by back-propagation. The proposed model is evaluated on two real-world datasets and it outperforms baseline models by relative improvements of 3.56% in MRR and 8.94% in NDCG, respectively.

Verifying the correctness of Bayesian computation is challenging. This is especially true for complex models that are common in practice, as these require sophisticated model implementations and algorithms.
In this paper we introduce *simulation-based calibration* (SBC), a general procedure for validating inferences from Bayesian algorithms capable of generating posterior samples. This procedure not only identifies inaccurate computation and inconsistencies in model implementations but also provides graphical summaries that can indicate the nature of the problems that arise. We argue that SBC is a critical part of a robust Bayesian workflow, as well as being a useful tool for those developing computational algorithms and statistical software.

We present a novel algorithm for learning the spectral density of large scale networks using stochastic trace estimation and the method of maximum entropy. The complexity of the algorithm is linear in the number of non-zero elements of the matrix, offering a computational advantage over other algorithms. We apply our algorithm to the problem of community detection in large networks. We show state-of-the-art performance on both synthetic and real datasets.

Continue Reading…

### Document worth reading: “Statistical Validity and Consistency of Big Data Analytics: A General Framework”

Informatics and technological advancements have triggered generation of huge volumes of data with varied complexity in their management and analysis. Big Data analytics is the practice of revealing hidden aspects of such data and making inferences from it. Although storage, retrieval and management of Big Data seem possible through efficient algorithm and system development, concern about statistical consistency remains to be addressed in view of its specific characteristics. Since Big Data does not conform to standard analytics, we need proper modification of the existing statistical theory and tools. Here we propose, with illustrations, a general statistical framework and an algorithmic principle for Big Data analytics that ensure statistical accuracy of the conclusions.
The proposed framework has the potential to push forward the advancement of Big Data analytics in the right direction. The partition-repetition approach proposed here is broad enough to encompass all practical data analytic problems. Statistical Validity and Consistency of Big Data Analytics: A General Framework Continue Reading…

### R Packages worth a look

‘Rapidjson’ C++ Header Files (rapidjsonr)
Provides JSON parsing capability through the ‘Rapidjson’ ‘C++’ header-only library.

Prediction Intervals for Random-Effects Meta-Analysis (pimeta)
An implementation of prediction intervals for random-effects meta-analysis: Higgins et al. (2009) <doi:10.1111/j.1467-985X.2008.00552.x>, Partlett and Riley (2017) <doi:10.1002/sim.7140>, and Nagashima et al. (2018) <arXiv:1804.01054>.

Nonparametric Smoothing of Laplacian Graph Spectra (LPGraph)
A nonparametric method to approximate Laplacian graph spectra of a network with ordered vertices. This provides a computationally efficient algorithm for obtaining an accurate and smooth estimate of the graph Laplacian basis. The approximation results can then be used for tasks like change point detection, k-sample testing, and so on. The primary reference is Mukhopadhyay, S. and Wang, K. (2018, Technical Report).

Simulation of Correlated Systems of Equations with Multiple Variable Types (SimRepeat)
Generate correlated systems of statistical equations which represent repeated measurements or clustered data. These systems contain either: a) continuous normal, non-normal, and mixture variables based on the techniques of Headrick and Beasley (2004) <DOI:10.1081/SAC-120028431> or b) continuous (normal, non-normal and mixture), ordinal, and count (regular or zero-inflated, Poisson and Negative Binomial) variables based on the hierarchical linear models (HLM) approach.
Headrick and Beasley’s method for continuous variables calculates the beta (slope) coefficients based on the target correlations between independent variables and between outcomes and independent variables. The package provides functions to calculate the expected correlations between outcomes, between outcomes and error terms, and between outcomes and independent variables, extending Headrick and Beasley’s equations to include mixture variables. These theoretical values can be compared to the simulated correlations. The HLM approach requires specification of the beta coefficients, but permits group and subject-level independent variables, interactions among independent variables, and fixed and random effects, providing more flexibility in the system of equations. Both methods permit simulation of data sets that mimic real-world clinical or genetic data sets (i.e. plasmodes, as in Vaughan et al., 2009, <DOI:10.1016/j.csda.2008.02.032>). The techniques extend those found in the ‘SimMultiCorrData’ and ‘SimCorrMix’ packages. Standard normal variables with an imposed intermediate correlation matrix are transformed to generate the desired distributions. Continuous variables are simulated using either Fleishman’s third-order (<DOI:10.1007/BF02293811>) or Headrick’s fifth-order (<DOI:10.1016/S0167-9473(02)00072-5>) power method transformation (PMT). Simulation occurs at the component level for continuous mixture distributions. These components are transformed into the desired mixture variables using random multinomial variables based on the mixing probabilities. The target correlation matrices are specified in terms of correlations with components of continuous mixture variables. Binary and ordinal variables are simulated by discretizing the normal variables at quantiles defined by the marginal distributions. Count variables are simulated using the inverse CDF method.
There are two simulation pathways for the multi-variable type systems which differ by intermediate correlations involving count variables. Correlation Method 1 adapts Yahav and Shmueli’s 2012 method <DOI:10.1002/asmb.901> and performs best with large count variable means and positive correlations or small means and negative correlations. Correlation Method 2 adapts Barbiero and Ferrari’s 2015 modification of the ‘GenOrd’ package <DOI:10.1002/asmb.2072> and performs best under the opposite scenarios. There are three methods available for correcting non-positive definite correlation matrices. The optional error loop may be used to improve the accuracy of the final correlation matrices. The package also provides functions to check parameter inputs and summarize the simulated systems of equations.

Interface for ‘GraphFrames’ (graphframes)
A ‘sparklyr’ <https://…/> extension that provides an R interface for ‘GraphFrames’ <https://…/>. ‘GraphFrames’ is a package for ‘Apache Spark’ that provides a DataFrame-based API for working with graphs. Functionality includes motif finding and common graph algorithms, such as PageRank and breadth-first search.

Continue Reading…

### Magister Dixit

“We must convey what constitutes data, what it can be used for, and why it’s valuable.” Jake Porway (October 1, 2015)

Continue Reading…

### Monkeying around with Code and Paying it Forward

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

library(tidyverse)
library(monkeylearn)

This is a story (mostly) about how I started contributing to the rOpenSci package monkeylearn. I can’t promise any life flipturning upside down, but there will be a small discussion about git best practices, which is almost as good. The tl;dr here is nothing novel but is something I wish I’d experienced firsthand sooner.
That is, that tinkering with and improving on the code others have written is more rewarding for you and more valuable to others when you contribute it back to the original source. We all write code all the time to graft additional features onto existing tools or reshape output into forms that fit better in our particular pipelines. Chances are, these are improvements our fellow package users could take advantage of. Plus, if they’re integrated into the package source code, then we no longer need our own wrappers and reshapers and speeder-uppers. That means less code and fewer chances of bugs all around. So, tinkering with and improving on the code others have written is more rewarding for you and more valuable to others when you contribute it back to the original source.

### Some Backstory

My first brush with the monkeylearn package was at work one day when I was looking around for an easy way to classify groups of texts using R. I made the very clever first move of Googling “easy way to classify groups of texts using R” and thanks to the magic of what I suppose used to be PageRank I landed upon a GitHub README for a package called monkeylearn. A quick install.packages("monkeylearn") and creation of an API key later it started looking like this package would fit my use case. I loved that it sported only two functions, monkeylearn_classify() and monkeylearn_extract(), which did exactly what they said on the tin. They accept a vector of texts and return a dataframe of classifications or keyword extractions, respectively. For a bit of background, the monkeylearn package hooks into the MonkeyLearn API, which uses natural language processing techniques to take a text input and hand back a vector of outputs (keyword extractions or classifications) along with metadata such as the confidence in the relevance of the classification.
There is a set of built-in “modules” (e.g., retail classifier, profanity extractor) but users can also create their own “custom” modules1 by supplying their own labeled training data. The monkeylearn R package serves as a friendly interface to that API, allowing users to process data using the built-in modules (it doesn’t yet support creating and training of custom modules). In the rOpenSci tradition it’s peer-reviewed and was contributed via the onboarding process. I began using the package to attach classifications to around 70,000 texts. I soon discovered a major stumbling block: I could not send texts to the MonkeyLearn API in batches. This wasn’t because the monkeylearn_classify() and monkeylearn_extract() functions themselves didn’t accept multiple inputs. Instead, it was because they didn’t explicitly relate inputs to outputs. This became a problem because inputs and outputs are not 1:1; if I send a vector of three texts for classification, my output dataframe might be 10 rows long. However, there was no user-friendly way to know for sure2 whether the first two or the first four output rows, for example, belonged to the first input text. Here’s an example of what I mean.

texts <- c(
  "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
  "When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton.",
  "I'm not an ambiturner.
I can't turn left.")

(texts_out <- monkeylearn_classify(texts, verbose = FALSE))

## # A tibble: 11 x 4
##    category_id probability label                text_md5
##          <int>       <dbl> <chr>                <chr>
##  1    18314767      0.0620 Books                af55421029d7236ca6ecbb281…
##  2    18314954      0.0470 Mystery & Suspense   af55421029d7236ca6ecbb281…
##  3    18314957      0.102  Police Procedural    af55421029d7236ca6ecbb281…
##  4    18313210      0.0820 Party & Occasions    602f1ab2654b88f5c7f5c90e4…
##  5    18313231      0.176  "Party Supplies "    602f1ab2654b88f5c7f5c90e4…
##  6    18313235      0.134  Party Decorations    602f1ab2654b88f5c7f5c90e4…
##  7    18313236      0.406  Decorations          602f1ab2654b88f5c7f5c90e4…
##  8    18314767      0.0630 Books                bdb9881250321ce8abecacd4d…
##  9    18314870      0.0460 Literature & Fiction bdb9881250321ce8abecacd4d…
## 10    18314876      0.0400 Mystery & Suspense   bdb9881250321ce8abecacd4d…
## 11    18314878      0.289  Suspense             bdb9881250321ce8abecacd4d…

So we can see we’ve now got classifications for the texts we fed in as input. The MD5 hash can be used to disambiguate which outputs correspond to which inputs in some cases (see Maëlle’s fantastic Guardian Experience post!). This works great if you either don’t care about classifying your inputs independently of one another or you know that your inputs will never contain empty strings or other values that won’t be sent to the API. In my case, though, my inputs were independent of one another and also could not be counted on to be well-formed. I determined that each had to be classified separately so that I could guarantee a 1:1 match between input and output.

### Initial Workaround

My first approach to this problem was to simply treat each text as a separate call. I wrapped monkeylearn_classify() in a function that would send a vector of texts and return a dataframe relating the input in one column to the output in the others.
Here is a simplified version of it, sans the error handling and other bells and whistles:

initial_workaround <- function(df, col, verbose = FALSE) {
  quo_col <- enquo(col)
  out <- df %>%
    mutate(tags = NA_character_)
  for (i in 1:nrow(df)) {
    this_text <- df %>%
      select(!!quo_col) %>%
      slice(i) %>%
      as_vector()
    this_classification <- monkeylearn_classify(this_text, verbose = verbose) %>%
      select(-text_md5) %>%
      list()
    out[i, ]$tags <- this_classification
  }
  return(out)
}


Since initial_workaround() takes a dataframe as input rather than a vector, let’s turn our sample into a tibble before feeding it in.

texts_df <- tibble(texts)


And now we’ll run the workaround:

initial_out <- initial_workaround(texts_df, texts)

initial_out

## # A tibble: 3 x 2
##   texts                                                           tags
##   <chr>                                                           <list>
## 1 It is a truth universally acknowledged, that a single man in p… <tibble…
## 2 When Mr. Bilbo Baggins of Bag End announced that he would shor… <tibble…
## 3 I'm not an ambiturner. I can't turn left.                       <tibble…
We see that this retains the 1:1 relationship between input and output, but still allows the output list-col to be unnested.

(initial_out %>% unnest())

## # A tibble: 11 x 4
##    texts                                   category_id probability label
##    <chr>                                         <int>       <dbl> <chr>
##  1 It is a truth universally acknowledged…    18314767      0.0620 Books
##  2 It is a truth universally acknowledged…    18314954      0.0470 Myster…
##  3 It is a truth universally acknowledged…    18314957      0.102  Police…
##  4 When Mr. Bilbo Baggins of Bag End anno…    18313210      0.0820 Party …
##  5 When Mr. Bilbo Baggins of Bag End anno…    18313231      0.176  "Party…
##  6 When Mr. Bilbo Baggins of Bag End anno…    18313235      0.134  Party …
##  7 When Mr. Bilbo Baggins of Bag End anno…    18313236      0.406  Decora…
##  8 I'm not an ambiturner. I can't turn le…    18314767      0.0630 Books
##  9 I'm not an ambiturner. I can't turn le…    18314870      0.0460 Litera…
## 10 I'm not an ambiturner. I can't turn le…    18314876      0.0400 Myster…
## 11 I'm not an ambiturner. I can't turn le…    18314878      0.289  Suspen…

But, the catch: this approach was quite slow. The real bottleneck here isn’t the for loop; it’s that this requires a round trip to the MonkeyLearn API for each individual text. For just these three meager texts, let’s see how long initial_workaround() takes to finish.

(benchmark <- system.time(initial_workaround(texts_df, texts)))

##    user  system elapsed
##   0.036   0.001  15.609

It was clear that if classifying 3 inputs was going to take 15.6 seconds, even classifying my relatively small data was going to take a looong time, like on the order of 4 days, just for the first batch of data. I updated the function to write each row out to an RDS file after it was classified inside the loop (with an addition along the lines of write_rds(out[i, ], glue::glue("some_directory/{i}.rds"))) so that I wouldn’t have to rely on the function successfully finishing execution in one run. Still, I didn’t like my options.
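That write-an-RDS-per-row idea generalizes into a simple checkpoint-and-resume pattern. Here is a minimal sketch of it, under my own assumptions: classify_row() and the "checkpoints" directory are illustrative stand-ins, and base saveRDS()/readRDS() substitute for the write_rds()/glue() calls mentioned above.

```r
# Checkpoint-and-resume sketch: classify one row at a time, saving each result
# to disk immediately so an interrupted run costs at most one row of work.
# classify_row and the "checkpoints" directory are stand-ins, not monkeylearn code.
classify_with_checkpoints <- function(df, classify_row, dir = "checkpoints") {
  dir.create(dir, showWarnings = FALSE)
  for (i in seq_len(nrow(df))) {
    path <- file.path(dir, paste0(i, ".rds"))
    if (file.exists(path)) next  # this row survived an earlier, interrupted run
    saveRDS(classify_row(df[i, , drop = FALSE]), path)
  }
  # Reassemble the saved per-row results in input order
  do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
    readRDS(file.path(dir, paste0(i, ".rds")))
  }))
}
```

On a rerun after a crash, the file.exists() check skips every row that already has a saved result, so only the remaining rows would hit the API.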
This classification job was intended to be run every night, and with an unknown amount of input text data coming in every day, I didn’t want it to run for more than 24 hours one day and either a) prevent the next night’s job from running or b) necessitate spinning up a second server to handle the next night’s data.

### Diving In

Now I’m starting to think I’m just about at the point where I have to start making myself useful. I’d seen in the package docs and on the MonkeyLearn FAQ that batching up to 200 texts was possible2. So, I decide to first look into the mechanics of how text batching is done in the monkeylearn package. Was the MonkeyLearn API returning JSON that didn’t relate each individual input and output? I sort of doubted it. You’d think that an API that was sent a JSON “array” of inputs would send back a hierarchical array to match. My hunch was that either the package was concatenating the input before shooting it off to the API (which would save the user API queries) or rowbinding the output after it was returned. (The rowbinding itself would be fine if each input could somehow be related to its one or many outputs.) So I fork the package repo and set about rummaging through the source code. Blissfully, everything is nicely commented and the code is quite readable. I step through monkeylearn_classify() in the debugger and narrow in on a call to what looks like a utility function: monkeylearn_parse(). I find it in utils.R. The lines in monkeylearn_parse() that matter for our purposes are:

text <- httr::content(output, as = "text", encoding = "UTF-8")
temp <- jsonlite::fromJSON(text)
if (length(temp$result[[1]]) != 0) {
  results <- do.call("rbind", temp$result)
  results$text_md5 <- unlist(mapply(rep,
    vapply(X = request_text, FUN = digest::digest, FUN.VALUE = character(1),
           USE.NAMES = FALSE, algo = "md5"),
    unlist(vapply(temp$result, nrow, FUN.VALUE = 0)),
    SIMPLIFY = FALSE))
}

So this is where the rowbinding happens – after the fromJSON() call!
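To see what that md5 bookkeeping buys you, here is a small self-contained sketch of the same scheme. The labels and per-input row counts are invented, but digest::digest(..., algo = "md5") is the same hashing call the snippet above uses.

```r
library(digest)

request_text   <- c("first input text", "second input text")
rows_per_input <- c(2, 1)  # pretend the API returned 2 rows, then 1 row (invented)

# One md5 per input, repeated once per corresponding output row,
# mirroring the rep()/vapply() combination in monkeylearn_parse()
md5s <- vapply(request_text, digest, character(1),
               algo = "md5", USE.NAMES = FALSE)

results <- data.frame(
  label            = c("Books", "Suspense", "Decorations"),  # invented labels
  text_md5         = rep(md5s, times = rows_per_input),
  stringsAsFactors = FALSE
)

# Same hash means same input, so splitting on the hash recovers the grouping
groups <- split(results, results$text_md5)
```

Note that nothing here ties a hash back to input order, and identical inputs hash to the same value, which is why the hash only disambiguates outputs “in some cases.”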
This is good news because it means that the MonkeyLearn API is sending differentiated outputs back in a nested JSON object. The package converts this to a list with fromJSON() and only then is the rbinding applied. That’s why the text_md5 hash is generated during this step: to be able to group outputs that all correspond to a single input (same hash means same input). I set about copy-pasting monkeylearn_parse() and did a bit of surgery on it, emerging with monkeylearn_parse_each(). monkeylearn_parse_each() skips the rbinding and retains the list structure of each output, which means that its output can be turned into a nested tibble with each row corresponding to one input. That nested tibble can then be related to each corresponding element of the input vector. All that remained was to create a new enclosing analog to monkeylearn_classify() that could use monkeylearn_parse_each().

### Thinking PR thoughts

At this point, I thought that such a function might be useful to some other people using the package, so I started building it with an eye toward making a pull request. Since I’d found it useful to be able to pass in an input dataframe in initial_workaround(), I figured I’d retain that feature of the function. I wanted users to still be able to pass in a bare column name, but the package seemed to be light on tidyverse functions unless there was no alternative, so I un-tidyeval’d the function (using deparse(substitute()) instead of a quosure) and gave it the imaginative name…monkeylearn_classify_df(). The rest of the original code was so airtight I didn’t have to change much more to get it working. A nice side effect of my plumbing through the guts of the package was that I caught a couple of minor bugs (things like the remnants of a for loop remaining in what had been revamped into a while loop) and noticed where there could be some quick wins for improving the package.
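The nested, one-row-per-input shape that monkeylearn_parse_each() makes possible can be illustrated with plain tidyverse tools. The outputs list below is an invented stand-in for per-input parsed API results, not real monkeylearn output:

```r
library(tibble)
library(tidyr)

inputs <- c("text one", "text two")

# Invented stand-in for per-input parsed results (in the package these would
# come from parsing the nested JSON element by element instead of rbinding it)
outputs <- list(
  tibble(label = c("Books", "Suspense"), probability = c(0.60, 0.29)),
  tibble(label = "Decorations",          probability = 0.41)
)

# One row per input with a list-column of results: a guaranteed 1:1 match
nested <- tibble(texts = inputs, res = outputs)

# unnest() recovers the long, rowbound form whenever you want it
long <- unnest(nested, res)
```

Because the list-column keeps each input's results together, no hash is needed to relate inputs to outputs.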
After a few more checks I wrote up the description for the pull request, which outlined the issue and the solution (though I probably should have first opened an issue, waited for a response, and then submitted a PR referencing the issue, as Mara Averick suggests in her excellent guide to contributing to the tidyverse). I checked the list of package contributors to see if I knew anyone. Far and away the main contributor was Maëlle Salmon! I’d heard of her through the magic of #rstats Twitter and the R-Ladies Global Slack. A minute or two after submitting it I headed over to Slack to give her a heads up that a PR would be heading her way. In what I would come to know as her usual cheerful, perpetually-on-top-of-it form, Maëlle had already seen it and liked the idea for the new function.

### Continuing Work

To make a short story shorter, Maëlle asked me if I’d like to create the extractor counterpart to monkeylearn_classify_df() and become an author on the package with push access to the repo. I said yes, of course, and so we began to strategize over Slack about tradeoffs like which package dependencies we were okay with taking on, whether to go the tidyeval or base route, what the best naming conventions for the new functions should be, etc. On the naming front, we decided to gracefully deprecate monkeylearn_classify() and monkeylearn_extract(), as the newer functions could cover all of the functionality that the older ones did. I don’t know much about cache invalidation, but the naming problem was hard as usual. We settled on naming their counterparts monkey_classify() (which replaced the original monkeylearn_classify_df()) and monkey_extract().

### gitflow

Early on in the process we started talking git conventions. Rather than both working off a development branch, I floated a structure that we typically follow at my company, where each ticket (or in this case, GitHub Issue) becomes its own branch off of dev. For instance, issue #33 becomes branch T33 (T for ticket).
Each of these feature branches comes off of dev (unless it’s a hotfix) and is merged back into dev and deleted once it passes all the necessary checks. This approach, I am told, stems from the “gitflow” philosophy, which, as far as I understand it, is one of many ways to structure a git workflow that mostly doesn’t end in tears.

Like most git strategies, the idea here is to make pull requests as bite-sized as possible; in this case, a PR can only be as big as the issue it’s named from. An added benefit, for me at least, is that this keeps me from wandering off into other parts of the code without first documenting the point in a separate issue and then creating a branch. At most one person is assigned to each ticket/issue, which minimizes merge conflicts. You also leave a nice paper trail, because the branch name directly references the issue front and center in its title. This means you don’t have to explicitly name the issue in the commit or rely on GitHub’s (albeit awesome) keyword-based issue-closing system. Finally, since the system is so tied to issues themselves, it encourages very frequent communication between collaborators. Since the issue must necessarily be made before the branch and the accompanying changes to the code, the other contributors have a chance to weigh in on the issue or the approach suggested in its comments before any code is written. In our case, it’s certainly made frequent communication the path of least resistance.

While this branch- and PR-naming convention isn’t particular to gitflow (to my knowledge), it did spark a short conversation on Twitter that I think is useful to have. Thomas Lin Pedersen makes a good point on the topic: I prefer named PRs as it gives a quick overview over opened PRs.
While cross referencing with open issues is possible it is very tedious when you try to get an overview — Thomas Lin Pedersen (@thomasp85) March 6, 2018

This insight got me thinking that the best approach might be to explicitly name the issue number and give a description in the branch name, like a slug of sorts. I started using a branch syntax like T31-fix-bug-i-just-created, which has worked out well for Maëlle and me thus far, making the history a bit more readable.

## Main Improvements

As I mentioned, the package was so good to begin with that it was difficult to find ways to improve it. Most of the subsequent work I did on monkeylearn was to improve the new monkey_ functions. The original monkeylearn_ functions discarded inputs, such as empty strings, that could not be sent to the API. We now retain those empty inputs and return NAs in the response columns for that row. This means that the output always has the same dimensions as the input. We return an unnested dataframe by default, as the original functions did, but allow the output to be nested if the unnest flag is set to FALSE. The functions also got more informative messages about which batches are currently being processed and which texts those batches correspond to.

text_w_empties <- c(
  "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
  "When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton.",
  "",
  "I'm not an ambiturner. I can't turn left.",
  " ")

(empties_out <- monkey_classify(text_w_empties, classifier_id = "cl_5icAVzKR",
                                texts_per_req = 2, unnest = TRUE))

## The following indices were empty strings and could not be sent to the API: 3, 5. They will still be included in the output.
## Processing batch 1 of 2 batches: texts 1 to 2
## Processing batch 2 of 2 batches: texts 2 to 3
## # A tibble: 8 x 4
##   req                                      category_id probability label
##   <chr>                                          <int>       <dbl> <chr>
## 1 It is a truth universally acknowledged,…       64708       0.289 Society
## 2 It is a truth universally acknowledged,…       64711       0.490 Relati…
## 3 When Mr. Bilbo Baggins of Bag End annou…       64708       0.348 Society
## 4 When Mr. Bilbo Baggins of Bag End annou…       64713       0.724 Specia…
## 5 ""                                                NA      NA     <NA>
## 6 I'm not an ambiturner. I can't turn lef…       64708       0.125 Society
## 7 I'm not an ambiturner. I can't turn lef…       64710       0.377 Parent…
## 8 " "                                               NA      NA     <NA>

So even though the empty string inputs, like the 3rd and 5th, aren’t sent to the API, we can see that they’re still included in the output dataframe and assigned the same column names as all of the other outputs. That means that even if unnest is set to FALSE, the output can still be unnested with tidyr::unnest() after the fact.

If a dataframe is supplied, there is now a .keep_all option which allows all columns of the input to be retained, not just the column that contains the text to be classified. This makes the monkey_ functions work even more like a mutate(); rather than returning an object that has to be joined onto the original input, we do that association for the user.
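The shape-preserving behavior described above amounts to something like the following in base R. This is only a sketch with faked classifications, not the package's code; the column names and values are invented.

```r
texts <- c("a real text", "", "another text", " ")
sendable <- nzchar(trimws(texts))   # empty and whitespace-only inputs stay behind

# Pretend the API classified only the sendable texts:
api_res <- data.frame(label       = c("Society", "Parenting"),
                      probability = c(0.30, 0.40),
                      stringsAsFactors = FALSE)

# Pre-fill every row with NA, then slot results into the sendable rows,
# so the output always has as many rows as the input:
out <- data.frame(req = texts,
                  label = NA_character_, probability = NA_real_,
                  stringsAsFactors = FALSE)
out[sendable, c("label", "probability")] <- api_res
```

Because the NA rows carry the same columns as real responses, the result can still be nested or unnested uniformly afterward.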
sw <- dplyr::starwars %>%
  dplyr::select(name, height) %>%
  dplyr::sample_n(length(text_w_empties))

df <- tibble::tibble(text = text_w_empties) %>%
  dplyr::bind_cols(sw)

df %>% monkey_classify(text, classifier_id = "cl_5icAVzKR", unnest = FALSE,
                       .keep_all = TRUE, verbose = FALSE)

## # A tibble: 5 x 4
##   name   height text                                       res
##   <chr>   <int> <chr>                                      <list>
## 1 Ackbar    180 It is a truth universally acknowledged, t… <data.…
## # … with 4 more rows (Shaak Ti, Luke Skywalker, Lama Su, Shmi Skywalker)

We see that the input column, text, is sandwiched between the other columns of the original dataframe (the starwars ones) and the output column res. The hope is that all of this serves to improve the data safety and user experience of the package.

## Developing functions in tandem

Something I’ve been thinking about while working on the twin functions monkey_extract() and monkey_classify() is what the best practice is for developing very similar functions in sync with one another. These two functions are different enough to have different default values (for example, monkey_extract() has a default extractor_id while monkey_classify() has a default classifier_id) but are so similar in other regards as to be almost embarrassingly parallel. What I’ve been turning over in my head is how in sync these functions should be during development. As soon as you make a change to one function, should you immediately make the same change to the other? Or is it better to work on one function at a time and, at certain checkpoints, batch these changes over to the other function in a big copy-paste job? I’ve been tending toward the latter, but it’s seemed a little dangerous to me.
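A function factory is one way to keep such twins in sync: generate both functions from a single template, so a fix lands in both at once. A minimal sketch follows; all names are hypothetical and this is not the package's code.

```r
# Minimal function-factory sketch (hypothetical): both twins are produced
# by one maker function, so the shared logic lives in exactly one place.
make_monkey_fun <- function(default_id) {
  function(input, id = default_id) {
    # ...shared request/parse logic would live here...
    sprintf("calling model %s on %d input(s)", id, length(input))
  }
}

monkey_classify2 <- make_monkey_fun("cl_default")
monkey_extract2  <- make_monkey_fun("ex_default")
```

The tradeoff is that a factory adds a layer of indirection, which can make the exported functions harder to read and document.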
Since there are only two functions to worry about here, creating a function factory to handle them seemed like overkill, but it might technically be the best practice. I’d love to hear people’s thoughts on how they navigate this facet of package development.

## Last Thoughts

My work on the monkeylearn package so far has been rewarding, to say the least. It’s inspired me to be not just a consumer of open source but more of an active contributor. Some wise words from Maëlle on this front:

You too can become a contributor to an rOpenSci package! Have a look at the issues tracker of your favorite rOpenSci package(s), e.g. rgbif. Browse issues suitable for beginners over many rOpenSci repos thanks to Lucy D’Agostino McGowan’s contributr Shiny app. Always first ask in a new or existing issue whether your contribution would be welcome, plan a bit with the maintainer, and then have fun! We’d be happy to have you.

Maëlle’s been a fantastic mentor, providing guidance in at least four languages – English, French, R, and emoji – despite the time difference. When it comes to monkeylearn, the hope is to keep improving the core package features, add some more niceties, and look into building out an R-centric way for users to create and train their own custom modules on MonkeyLearn. On y va!

Custom, to a point: as of this writing, the two types of classifier models you can create use either Naive Bayes or Support Vector Machines, though you can specify other parameters such as use_stemmer and strip_stopwords. Custom extractor modules are coming soon.

That MD5 hash almost provided the solution; each row of the output gets a hash that corresponds to a single input row, so it seemed like the hash was meant to be used to map inputs to outputs.
Provided that I knew that all of my inputs were non-empty strings (empty strings are filtered out before they can be sent to the API) and that all of them could be classified, I could have nested the output based on its MD5 sum and mapped the indices of the inputs and the outputs 1:1. The trouble was that I knew my input data would be changing, and I wasn’t convinced that all of my inputs would receive well-formed responses from the API. If some of the texts couldn’t receive a corresponding set of classifications, such a nested output would have fewer rows than the input vector’s length, and there would be no way to tell which input corresponded to which nested output.

Keywords in commits don’t automatically close issues until they’re merged into master, and since we were working off of dev for quite a long time, relying on keywords to automatically close issues would have left our open issues list not accurately reflecting the issues we actually still had to address. It would be cool for GitHub to allow flags, so that maybe “fixes #33 –dev” could close issue #33 when the PR with that phrase in the commit was merged into dev.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Announcing CGPfunctions 0.3 – April 20, 2018

(This article was first published on Chuck Powell, and kindly contributed to R-bloggers)

As I continue to learn and grow in using R, I have been trying to develop
the habit of being more formal in documenting and maintaining the
various functions and pieces of code I write. It’s not that I think they
are major inventions, but they are useful, and I like having them stored
in one place where I can keep track of them. So I started building them as a
package, and even publishing them to CRAN, for any of you who might find
them of interest as well.

## Overview

A package of functions that I find useful for teaching
statistics as well as actually practicing the art. They are typically
not “new” methods but rather wrappers around base R or other
packages, and concepts I’m trying to master. It currently contains:

• Plot2WayANOVA, which, as the name implies, conducts a 2-way ANOVA and
plots the results using ggplot2
• PlotXTabs, which, as the name implies, plots cross-tabulated
variables using ggplot2
• neweta, a helper function that appends the results of a
Type II eta squared calculation onto a classic ANOVA table
• Mode, which finds the modal value in a vector of data
• SeeDist, which wraps around ggplot2 to provide visualizations of
univariate data
• OurConf, a simulation function that helps you learn about
confidence intervals
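As a flavor of the simplest of these, here is the standard base-R idiom for finding a modal value. This is a generic sketch, not necessarily CGPfunctions' actual implementation of Mode.

```r
# Most frequent value in a vector: tabulate matches against the unique
# values and pick the value with the highest count (ties go to the value
# seen first).
mode_value <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
```

Unlike stats::quantile() or mean(), base R has no built-in mode function, which is why small teaching packages often carry one.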

## Installation

# Install from CRAN
install.packages("CGPfunctions")

# Or the development version from GitHub
# install.packages("devtools")
devtools::install_github("ibecav/CGPfunctions")


## Credits

Many thanks to Dani Navarro and the book Learning Statistics with R,
whose etaSquared function was the genesis of neweta.

“He who gives up safety for speed deserves neither.”
(via)

#### A shoutout to some other packages I find essential.

• stringr, for strings.
• lubridate, for date/times.
• forcats, for factors.
• haven, for SPSS, SAS and Stata files.
• readxl, for .xls and .xlsx files.
• modelr, for modelling within a pipeline.
• broom, for turning models into tidy data.
• ggplot2, for data visualisation.
• dplyr, for data manipulation.
• tidyr, for data tidying.
• purrr, for functional programming.
• tibble, for tibbles, a modern re-imagining of data frames.

## Leaving Feedback

If you like CGPfunctions, please consider leaving feedback here.

## Contributing

Contributions in the form of feedback, comments, code, and bug reports
are most welcome. How to contribute:

• Issues, bug reports, and wish lists: File a GitHub issue.
• Contact the maintainer ibecav at gmail.com by email.


### eRum Competition Winners

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

The results of the eRum competition are in! Before we announce the winners we would like to thank everyone who entered. It has been a pleasure to look at all of the ideas on show.

### The Main Competition

The winner of the main competition is Lukasz Janiszewski. Lukasz provided a fantastic visualisation of the locations of each R user/ladies group and all R conferences. You can see his app here. If you want to view his code, you are able to do so in this GitHub repo. The code is contained in the directory erum_jr and the data preparation can be seen in budap.R.

Lukasz made 3 csv files containing the information about the R user groups, R-Ladies groups, and R conferences. With the help of an R-bloggers post, he was able to add geospatial information to those csv files. Finally, he scraped each meetup page for information on the R-Ladies groups. Using all of this information, he was able to make an informative, visually appealing dashboard with shiny.

Lukasz will now be jetting off to Budapest, to eRum 2018!

### The Secondary Competition

The winner of the secondary competition is Jenny Snape. Jenny provided an excellent script to parse the current .Rmd files and extract the conference and group urls & locations. The script can be found in this GitHub gist. Jenny has written a few words to summarise her script…

“The files on github can be read into R as character vectors (where each line is an element of the vector) using the R readLines() function.

From this character vector, we need to extract the country, the group name, and the url. This can be done by recognising that each line containing a country starts with ‘##’ and each line containing a group name and url starts with ‘*’. Therefore we can use these ‘tags’ to cycle through each element of the character vector and pull out vectors containing the countries, the cities, and the urls of the R groups. These vectors can then be cleaned and joined together into a data frame.

I wrote these steps into a function that accepted each R group character vector as an input and returned the final data frame. As one of the data sets contained just R Ladies groups, I fed this in as an argument and returned it as a column in the final data frame in order to differentiate between the different group types. I also returned a variable based on the character vector input in order to differentiate between the different world continents.

Running this function on each of the character vectors creates separate data sets which can then be all joined together. This creates a final dataset containing all the information on each R group: the type of group, the url, the city and the region."
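The approach Jenny describes can be sketched in a few lines of base R. The sample lines below are invented for illustration; her actual script (linked above) handles the real .Rmd files.

```r
# Invented sample of the markdown being parsed: countries start with "## ",
# group entries with "* ".
lines <- c("## France",
           "* Paris R user group",
           "* Lyon R user group",
           "## Germany",
           "* Berlin R user group")

is_country <- startsWith(lines, "## ")
is_group   <- startsWith(lines, "* ")

# Carry the most recent country heading down to each group line:
country_for <- sub("^## ", "", lines)[is_country][cumsum(is_country)]

groups <- data.frame(country = country_for[is_group],
                     group   = sub("^\\* ", "", lines[is_group]),
                     stringsAsFactors = FALSE)
```

The cumsum() trick is the key step: it assigns every line the index of the last country heading seen, which is what lets the flat text be reshaped into a tidy data frame.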

As well as this, Jenny provided us with a fantastic shiny dashboard, summarising the data.

Jenny has now received a free copy of Efficient R Programming!

Once again, thank you to all who entered and well done to our winners, Lukasz and Jenny!

### What next?

We’re in the process of converting Jenny’s & Lukasz’s hard work into a nice dashboard that will be magically updated via our list of useR groups and conferences. It should be ready in a few days.


## April 19, 2018

### If you did not already know

Online Multiple Kernel Classification (OMKC)
Online learning and kernel learning are two active research topics in machine learning. Although each of them has been studied extensively, there is a limited effort in addressing the intersecting research. In this paper, we introduce a new research problem, termed Online Multiple Kernel Learning (OMKL), that aims to learn a kernel based prediction function from a pool of predefined kernels in an online learning fashion. OMKL is generally more challenging than typical online learning because both the kernel classifiers and their linear combination weights must be learned simultaneously. In this work, we consider two setups for OMKL, i.e. combining binary predictions or real-valued outputs from multiple kernel classifiers, and we propose both deterministic and stochastic approaches in the two setups for OMKL. The deterministic approach updates all kernel classifiers for every misclassified example, while the stochastic approach randomly chooses a classifier(s) for updating according to some sampling strategies. Mistake bounds are derived for all the proposed OMKL algorithms. …

Deep Generalized Canonical Correlation Analysis (DGCCA)
We present Deep Generalized Canonical Correlation Analysis (DGCCA) — a method for learning nonlinear transformations of arbitrarily many views of data, such that the resulting transformations are maximally informative of each other. While methods for nonlinear two-view representation learning (Deep CCA, (Andrew et al., 2013)) and linear many-view representation learning (Generalized CCA (Horst, 1961)) exist, DGCCA is the first CCA-style multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. We present the DGCCA formulation as well as an efficient stochastic optimization algorithm for solving it. We learn DGCCA representations on two distinct datasets for three downstream tasks: phonetic transcription from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. We find that DGCCA representations soundly beat existing methods at phonetic transcription and hashtag recommendation, and in general perform no worse than standard linear many-view techniques. …

Distributed Computing
Distributed computing is a field of computer science that studies distributed systems. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. There are many alternatives for the message passing mechanism, including RPC-like connectors and message queues. A goal and challenge pursued by some computer scientists and practitioners in distributed systems is location transparency; however, this goal has fallen out of favour in industry, as distributed systems are different from conventional non-distributed systems, and the differences, such as network partitions, partial system failures, and partial upgrades, cannot simply be ‘papered over’ by attempts at ‘transparency’ – see CAP theorem. Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers, which communicate with each other by message passing. …

### Pastagate!

[relevant picture]

In a news article, “Pasta Is Good For You, Say Scientists Funded By Big Pasta,” Stephanie Lee writes:

The headlines were a fettuccine fanatic’s dream. “Eating Pasta Linked to Weight Loss in New Study,” Newsweek reported this month, racking up more than 22,500 Facebook likes, shares, and comments. The happy news also went viral on the Independent, the New York Daily News, and Business Insider.

What those and many other stories failed to note, however, was that three of the scientists behind the study in question had financial conflicts as tangled as a bowl of spaghetti, including ties to the world’s largest pasta company, the Barilla Group. . . .

They should get together with Big Oregano.

P.S. Our work has many government and corporate sponsors. Make of this what you will.

The post Pastagate! appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Distilled News

“A picture speaks a thousand words” is one of the most commonly used phrases. But a graph speaks so much more than that. A visual representation of data, in the form of graphs, helps us gain actionable insights and make better data driven decisions based on them. But to truly understand what graphs are and why they are used, we will need to understand a concept known as Graph Theory. Understanding this concept makes us better programmers. But if you have tried to understand this concept before, you’ll have come across tons of formulae and dry theoretical concepts. This is why we decided to write this blog post. We have explained the concepts and then provided illustrations so you can follow along and intuitively understand how the functions are performing. This is a detailed post, because we believe that providing a proper explanation of this concept is a much preferred option over succinct definitions. In this article, we will look at what graphs are, their applications and a bit of history about them. We’ll also cover some Graph Theory concepts and then take up a case study using python to cement our understanding. Ready? Let’s dive into it.
One of the most common problem data science professionals face is to avoid overfitting. Have you come across a situation where your model performed exceptionally well on train data, but was not able to predict test data. Or you were on the top of a competition in public leaderboard, only to fall hundreds of places in the final rankings? Well – this is the article for you! Avoiding overfitting can single-handedly improve our model’s performance. In this article, we will understand the concept of overfitting and how regularization helps in overcoming the same problem. We will then look at a few different regularization techniques and take a case study in python to further solidify these concepts.
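To make the regularization idea in the summary above concrete, here is closed-form ridge regression in a few lines of base R. This is a generic illustration with simulated data, not code from the article itself.

```r
# Ridge regression: penalizing squared coefficient size shrinks the fit
# toward zero, trading a little bias for less variance (less overfitting).
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                              # standardize predictors
  solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, y))
}

set.seed(42)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)
y <- X %*% c(2, -1, 0.5, 0) + rnorm(50)

b0  <- ridge_coef(X, y, lambda = 0)           # ordinary least squares
b10 <- ridge_coef(X, y, lambda = 10)          # coefficients shrunk toward zero
```

Increasing lambda strictly decreases the L2 norm of the coefficient vector, which is exactly the mechanism that tames overfit models.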
Deep Learning, based on deep neural nets is launching a thousand ventures but leaving tens of thousands behind. Transfer Learning (TL), a method of reusing previously trained deep neural nets promises to make these applications available to everyone, even those with very little labeled data.
An analysis of more than 400 use cases across 19 industries and nine business functions highlights the broad use and significant economic potential of advanced AI techniques.
Artificial Intelligence is on the rise! Not as a machine rebellion against human creators in the distant future, but as a growing modern trend of using machine-based predictions and decision-making in informational technologies. AI hype is everywhere: self-driving cars, smart image processing (e.g. Prisma), and communication domain use like conversational AI a.k.a. chatbots. The chatbot industry is expanding fast, yet the technologies are still young. Conversational bots used to be rather vacant like the old school text-based game “I smell a Wumpus”, but now they evolved into a top quality business tool. Chatbots offer a new type of simple and friendly interface imperative for browsing information and receiving services. IT experts and industry giants including Google, Microsoft, and Facebook agree that this technology will play a huge role in the future. To enjoy the marvels of Conversational Artificial Intelligence tools (or chatbots, if you are into brevity things), you must master the basics and understand the typical stack. In this article, we will discuss all kinds of instruments you can gear up with, how they are similar and at the same time different from each other, as well as their ups and downs. But before we hop on the journey of discovering these, let’s get into the deeper understanding of the chatbots and their topology.
First part on a full discussion on how to do Distributed Deep Learning with Apache Spark. This part: What is Spark, basics on Spark+DL and a little more.
Time series prediction (forecasting) has experienced dramatic improvements in predictive accuracy as a result of the data science machine learning and deep learning evolution. As these ML/DL tools have evolved, businesses and financial institutions are now able to forecast better by applying these new technologies to solve old problems. In this article, we showcase the use of a special type of Deep Learning model called an LSTM (Long Short-Term Memory), which is useful for problems involving sequences with autocorrelation. We analyze a famous historical data set called “sunspots” (a sunspot is a solar phenomenon wherein a dark spot forms on the surface of the sun). We’ll show you how you can use an LSTM model to predict sunspots ten years into the future.
Do you want to try out a new version of Apache Spark without waiting around for the entire release process? Does running alpha-quality software sound like fun? Does setting up a test cluster sound like work? This is the blog post for you, my friend! We will help you deploy code that hasn’t even been reviewed yet (if that is the adventure you seek). If you’re a little cautious, reading this might sound like a bad idea, and often it is, but it can be a great way to ensure that a PR really fixes your bug, or the new proposed Spark release doesn’t break anything you depend on (and if it does, you can raise the alarm). This post will help you try out new (2.3.0+) and custom versions of Spark on Google/Azure with Kubernetes. Just don’t run this in production without a backup and a very fancy support contract for when things go sideways.
In the age of Artificial Intelligence systems, developing solutions that don’t sound plastic or artificial is an area where a lot of innovation is happening. While Natural Language Processing (NLP) is primarily focused on consuming natural language text and making sense of it, Natural Language Generation (NLG) is a niche area within NLP for generating human-like text rather than machine-generated text.

### 5 best practices for delivering design critiques

Real critique helps teams strengthen their designs, products, and services.

Continue reading 5 best practices for delivering design critiques.

### By how much does AVX-512 slow down your CPU? A first experiment.

Intel is finally making available processors that support the fancy AVX-512 instruction sets and that can fit nicely in a common server rack. So I went to Dell and ordered a server with a Skylake-X microarchitecture: an Intel Xeon W-2104 CPU @ 3.20GHz.

This processor supports several interesting AVX-512 instruction sets. They are made of very powerful instructions that can manipulate 512-bit vectors.

On the Internet, the word is that using AVX-512 in your application is going to slow down your whole server, so you should just give up and never use AVX-512 instructions:

If you do not require AVX-512 for some specific high-performance tasks, I suggest you disable AVX-512 execution on your server or desktop, (…)

Table 15-16 in Intel’s optimization manual describes the impact of the various instructions you use on “Turbo Boost” (one of Intel’s frequency scaling technology). The type of instructions you use determines the “license” you are in. If you avoid AVX-512 and heavy AVX2 instructions (floating-point instructions and multiplications), you get the best boost. If you use light AVX-512 instructions or heavy AVX2 instructions, you get less of a boost… and you get the worst results with heavy AVX-512 instructions.

Intel sends us to a sheet of frequencies. Unfortunately, a quick look did not give me anything on my particular processor (Intel Xeon W-2104).

Intel is not being very clear:

Workloads that execute Intel AVX-512 instructions as a large proportion of their whole instruction count can gain performance compared to Intel AVX2 instructions, even though they may operate at a lower frequency. It is not always easy to predict whether a program’s performance will improve from building it to target Intel AVX-512 instructions.

What I am most interested in is the theory that people seem to have: that if you use AVX-512 sparingly, it will bring down the performance of your whole program. How could I check this theory?

I picked up a benchmark program that computes the Mandelbrot set. Then, using AVX-512 intrinsics, I added AVX-512 instructions to the program at select places. These instructions do nothing to contribute to the solution, but they cannot be trivially optimized away by the compiler. I used both light and heavy AVX-512 instructions. There are few enough of them so that the overhead is negligible… but if they slowed down the processor in a significant manner, we should be able to measure a difference.

The results?

| mode | running time (average over 10 runs) |
|---|---|
| no AVX-512 | 1.48 s |
| light AVX-512 | 1.48 s |
| heavy AVX-512 | 1.48 s |

Using spurious AVX-512 instructions made no difference to the running time in my tests. I don’t doubt that the frequency throttling is real, as it is described by Intel and widely reported, but I could not measure it.

This suggests that, maybe, it is less likely to be an issue than is often reported, at least on the type of processors I have. Or else I made a mistake in my tests.

In any case, we need reproducible simple tests. Do you have one?

### Leverage the Power of Data-Literacy

Optimizing your business for AI success is the only way to leverage its growing power; data-literacy represents the foundation of that optimization.

### Book Memo: “Optimization in Engineering: Models and Algorithms”

This textbook covers the fundamentals of optimization, including linear, mixed-integer linear, nonlinear, and dynamic optimization techniques, with a clear engineering focus. It carefully describes classical optimization models and algorithms using an engineering problem-solving perspective, and emphasizes modeling issues using many real-world examples related to a variety of application areas. Providing an appropriate blend of practical applications and optimization theory makes the text useful to both practitioners and students, and gives the reader a good sense of the power of optimization and the potential difficulties in applying optimization to modeling real-world systems. The book is intended for undergraduate and graduate-level teaching in industrial engineering and other engineering specialties. It is also of use to industry practitioners, due to the inclusion of real-world applications, opening the door to advanced courses on both modeling and algorithm development within the industrial engineering and operations research fields.

### Book Memo: “Tensor Numerical Methods in Scientific Computing”

 This book presents an introduction to modern tensor-structured numerical methods in scientific computing. In recent years, these methods have been shown to provide a powerful tool for efficient computations in higher dimensions, thus overcoming the so-called “curse of dimensionality”, a problem that encompasses various phenomena that arise when analyzing and organizing data in high-dimensional spaces.

### Two day workshop: Flexible programming of MCMC and other methods for hierarchical and Bayesian models

(This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

We’ll be giving a two day workshop at the 43rd Annual Summer Institute of Applied Statistics at Brigham Young University (BYU) in Utah, June 19-20, 2018.

Abstract is below, and registration and logistics information can be found here.

This workshop provides a hands-on introduction to using, programming, and sharing Bayesian and hierarchical modeling algorithms using NIMBLE (r-nimble.org). In addition to learning the NIMBLE system, users will develop hands-on experience with various computational methods. NIMBLE is an R-based system that allows one to fit models specified using BUGS/JAGS syntax but with much more flexibility in defining the statistical model and the algorithm to be used on the model. Users operate from within R, but NIMBLE generates C++ code for models and algorithms for fast computation. I will open with an overview of creating a hierarchical model and fitting the model using a basic MCMC, similarly to how one can use WinBUGS, JAGS, and Stan. I will then discuss how NIMBLE allows the user to modify the MCMC – changing samplers and specifying blocking of parameters. Next I will show how to extend the BUGS syntax with user-defined distributions and functions that provide flexibility in specifying a statistical model of interest. With this background we can then explore the NIMBLE programming system, which allows one to write new algorithms not already provided by NIMBLE, including new MCMC samplers, using a subset of the R language. I will then provide examples of non-MCMC algorithms that have been programmed in NIMBLE and how algorithms can be combined together, using the example of a particle filter embedded within an MCMC. We will see new functionality in NIMBLE that allows one to fit Bayesian nonparametric models and spatial models. I will close with a discussion of how NIMBLE enables sharing of new methods and reproducibility of research. The workshop will include a number of breakout periods for participants to use and program MCMC and other methods, either on example problems or problems provided by participants. In addition, participants will see NIMBLE’s flexibility in action in several real problems.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Let’s Admit It: We’re a Long Way from Using “Real Intelligence” in AI

With the growth of AI systems and unstructured data, there is a need for an independent means of data curation, evaluation and measurement of output that does not depend on the natural language constructs of AI and creates a comparative method of how the data is processed.

### Hackathon – Hack the city 2018

(This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

Hello, R-Users!

Have you ever joined a hackathon? If so, you surely know how fun and stimulating these events are. If not… now’s your chance!

Quantide is collaborating with Hack the City 2018, the 4th edition of Southern Switzerland’s hackathon. Aside from being partners, we’re actively part of it: we have members in the mentors’ group and on the technical commission of the jury. We are also giving out a special mention for the best data science project developed with open source technology, preferably using R as the programming language. It’s the first time that R makes its entry into this hackathon: if you work with R, here’s your chance!

Hack the City is a great occasion to show off your talent and your ideas, in a team or as an individual. Programmers, graphic designers, project designers and makers collaborate to build something in just 48 hours. You can also join the competition from your own home, as long as at least one of your teammates is in Lugano! At the end of the hackathon, a jury will give out great prizes to the best projects: 5000, 2500 and 1000 CHF for first, second and third place, respectively. There are other special mentions and scholarships, as we have mentioned before.

The last-minute tickets are available now, so if you want to grab this opportunity, you should do it fast! Hack the City 2018 will take place in Lugano, from 27th to 29th April. Hope to see you there!

The post Hackathon – Hack the city 2018 appeared first on Quantide – R training & consulting.


### Europeans remain welcoming to immigrants

For those who believe that migration can, if managed properly, make a country materially and culturally richer, recent developments in European politics have been worrying.

### Postdoc opportunity at AstraZeneca in Cambridge, England, in Bayesian Machine Learning using Stan!

Predicting drug toxicity with Bayesian machine learning models

We’re currently looking for talented scientists to join our innovative academic-style Postdoc programme. From our centre in Cambridge, UK, you’ll be in a global pharmaceutical environment, contributing to live projects right from the start. You’ll take part in a comprehensive training programme, including a focus on drug discovery and development, be given access to our existing Postdoctoral research, and be encouraged to pursue your own independent research. It’s a newly expanding programme spanning a range of therapeutic areas across a wide range of disciplines. . . .

You will be part of the Quantitative Biology group and develop comprehensive Bayesian machine learning models for predicting drug toxicity in liver, heart, and other organs. This includes predicting the mechanism as well as the probability of toxicity by incorporating scientific knowledge into the prediction problem, such as known causal relationships and known toxicity mechanisms. Bayesian models will be used to account for uncertainty in the inputs and propagate this uncertainty into the predictions. In addition, you will promote the use of Bayesian methods across safety pharmacology and biology more generally. You are also expected to present your findings at key conferences and in leading publications.

This project is in collaboration with Prof. Andrew Gelman at Columbia University, and Dr Stanley Lazic at AstraZeneca.

### Umpire strike zone changes to finish games earlier

When watching baseball on television, we get the benefit of seeing whether a pitch entered the strike zone or not. Umpires go by eye, and, intentional or not, they tend toward calls that finish a game rather than extend it through more extra innings. Michael Lopez, Brian Mills, and Gus Wezerek for FiveThirtyEight:

The left panel shows the comparative rate of strike calls when, in the bottom of an inning in extras, the batting team is positioned to win — defined as having a runner on base in a tie game — relative to those rates in situations when there’s no runner on base in a tie game. When the home team has a baserunner, umps call more balls, thus setting up more favorable counts for home-team hitters, creating more trouble for the pitcher, and giving the home team more chances to end the game.

I doubt the shift is on purpose, but it’s interesting to see the calls go that way regardless. Also, as a non-baseball viewer: why isn’t there any replay in baseball yet?

### Deploying Deep Learning Models on Kubernetes with GPUs

This post is authored by Mathew Salvaris and Fidan Boylu Uz, Senior Data Scientists at Microsoft.

One of the major challenges that data scientists often face is closing the gap between training a deep learning model and deploying it at production scale. Training of these models is a resource intensive task that requires a lot of computational power and is typically done using GPUs. The resource requirement is less of a problem for deployment since inference tends not to pose as heavy a computational burden as training. However, for inference, other goals also become pertinent such as maximizing throughput and minimizing latency. When inference speed is a bottleneck, GPUs show considerable performance gains over CPUs. Coupled with containerized applications and container orchestrators like Kubernetes, it is now possible to go from training to deployment with GPUs faster and more easily while satisfying latency and throughput goals for production grade deployments.

In this tutorial, we provide step-by-step instructions to go from loading a pre-trained Convolutional Neural Network model to creating a containerized web application that is hosted on a Kubernetes cluster with GPUs on Azure Container Service (AKS). AKS makes it quick and easy to deploy and manage containerized applications without much expertise in managing a Kubernetes environment. It eliminates the complexity and operational overhead of maintaining the cluster by provisioning, upgrading, and scaling resources on demand, without taking the applications offline. AKS reduces the cost and complexity of using a Kubernetes cluster by managing the master nodes, for which the user does not incur a cost. Azure Container Service has been available for a while, and a similar approach was provided in a previous tutorial to deploy a deep learning framework on a Marathon cluster with CPUs. In this tutorial, we focus on two of the most popular deep learning frameworks and provide the step-by-step instructions to deploy pre-trained models on a Kubernetes cluster with GPUs.

The tutorial is organized in two parts, one for each deep learning framework, specifically TensorFlow and Keras with TensorFlow backend. Under each framework, there are several notebooks that can be executed to perform the following steps:

• Develop the model that will be used in the application.
• Develop the API module that will initialize the model and make predictions.
• Create Docker image of the application with Flask and Nginx.
• Test the application locally.
• Create an AKS cluster with GPUs and deploy the web app.
• Test the web app hosted on AKS.
• Perform speed tests to understand latency of the web app.

Below, you will find short descriptions of the steps above.

Develop the Model

As the first step of the tutorial, we load the pre-trained ResNet152 model, pre-process an example image to the required format and call the model to find the top predictions. The code developed in this step will be used in the next step when we develop the API module that initializes the model and makes predictions.

Develop the API

In this step, we develop the API that will call the model. This driver module initializes the model, transforms the input so that it is in the appropriate format and defines the scoring method that will produce the predictions. The API will expect the input to be in JSON format. Once a request is received, the API will convert the JSON encoded request into the image format. The first function of the API loads the model and returns a scoring function. The second function processes the images and uses the first function to score them.
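The load-once/score-many pattern described above can be sketched in a few lines of Python. The function names and the base64-over-JSON payload format here are illustrative assumptions, not the tutorial's exact code:

```python
import base64
import json

def make_scorer(model):
    """Wrap a loaded model in a closure that decodes a JSON request of the
    form {"image": "<base64 bytes>"} and returns the model's predictions.
    The model is loaded once; the returned function is called per request."""
    def score(request_body):
        payload = json.loads(request_body)
        image_bytes = base64.b64decode(payload["image"])
        return model(image_bytes)
    return score

# Stand-in "model" for illustration only: it just reports the payload size.
scorer = make_scorer(lambda img: {"n_bytes": len(img)})
request = json.dumps({"image": base64.b64encode(b"\x00" * 224).decode()})
result = scorer(request)  # {"n_bytes": 224}
```

In a real deployment, the stand-in lambda would be replaced by the preprocessing plus ResNet152 forward pass, so that model initialization happens once at startup rather than per request.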

Create Docker Image

In this step, we create the Docker image that has three main parts, the web application, the pretrained model and the driver module for executing the model based on the requests made to the web application. The Docker image is based on a Nvidia image to which we only add the necessary Python dependencies and install the deep learning framework to keep the image as lightweight as possible. The Flask web app will be running on the default port 80 which is exposed on the docker image and Nginx is used to create a proxy from port 80 to port 5000. Once the container is built, we push it to a public Docker hub account for AKS cluster to pull it in later steps.

Test the Application Locally

In this step, we test our Docker image by pulling it and running it locally. This step is especially important for making sure the image performs as expected before we go through the entire process of deploying to AKS. It substantially reduces debugging time by checking whether we can send requests to the Docker container and receive predictions back properly.

Create an AKS Cluster and Deploy

In this step, we use Azure CLI to login to Azure, create a resource group for AKS and create the cluster. We create an AKS cluster with 1 node using Standard NC6 series with 1 GPU. After the AKS cluster is created, we connect to the cluster and deploy the application by defining the Kubernetes manifest where we provide the image name, map port 80 and specify Nvidia library locations. We set the number of Kubernetes replicas to 1 which can later be scaled up to meet certain throughput requirements (the latter is out of scope for this tutorial). Kubernetes also has a dashboard that can simply be accessed through a web browser.

Test the Web App

In this step, we test the web application that is deployed on AKS to quickly check if it can produce predictions against images that are sent to the service.

Perform Speed Tests

In this step, we use the deployed service to measure the average response time by sending 100 asynchronous requests with only four concurrent requests at any time. These types of tests are particularly important for deployments with low latency requirements, to make sure the cluster is scaled to meet the demand. The results of the tests suggest that the average response times are less than a second for both frameworks, with TensorFlow (~20 images/sec) being much faster than its Keras (~12 images/sec) counterpart on a single K80 GPU.
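A speed test of this shape can be approximated with a small harness that keeps a bounded number of requests in flight. This is a sketch, not the tutorial's code: it uses a thread pool for the concurrency cap, and `send_request` stands in for an HTTP call to the deployed scoring endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def average_latency(send_request, n_requests=100, max_concurrent=4):
    """Issue n_requests calls to send_request with at most max_concurrent
    in flight, returning the mean per-request wall-clock latency in seconds."""
    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return time.perf_counter() - start
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        latencies = list(pool.map(timed_call, range(n_requests)))
    return sum(latencies) / len(latencies)

# Example with a dummy "request" that sleeps 10 ms in place of an HTTP call:
mean_latency = average_latency(lambda: time.sleep(0.01), n_requests=20)
```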

As a last step, to delete the AKS and free up the Azure resources, we use the commands provided at the end of the notebook where AKS was created.

We hope you give this tutorial a try! Reach out to us with any comments or questions below.

Mathew & Fidan

## Acknowledgements

We would like to thank William Buchwalter for helping us craft the Kubernetes manifest files, Daniel Grecoe for testing the throughput of the models and lastly Danielle Dean for the useful discussions and proofreading of the blog post.

### A Recession Before 2020 Is Likely; On the Distribution of Time Between Recessions

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

I recently saw a Reddit thread in r/PoliticalDiscussion asking the question “If the economy is still booming in 2020, how should the Democratic Party address this?” This gets to an issue that’s been on my mind since at least 2016, maybe even 2014: when will the current period of economic growth end?

For some context, the Great Recession, as economists colloquially call the recession that began in 2007 and was punctuated by the 2008 financial crisis, officially ended in June 2009; it was then that the economy resumed growth. As of this writing, that was about eight years, ten months ago. The longest previous period between recessions was the time between the early 1990s recession and the early 2000s recession that coincided with the collapse of the dot-com bubble; that period lasted ten years and is the only one longer than the present period between recessions.

There is growing optimism in the economy, most noticeably amongst consumers, and we are finally seeing wages increase in the United States after years of stagnation. Donald Trump and Republicans point to the economy as a reason to vote for Republicans in November (and yet Donald Trump is still historically unpopular and Democrats have a strong chance of capturing the House, and a fair chance at the Senate). Followers of the American economy are starting to ask, “How long can this last?”

In 2016, I was thinking about this issue in relation to the election. I wanted Hillary Clinton to win, but at the same time I feared that a Clinton win would be a short-term gain, long-term loss for Democrats. One reason why is I believe there’s a strong chance of a recession within the next few years.

The 2008 financial crisis was a dramatic event, yet the Dodd-Frank reforms and other policy responses, in my opinion, did not go far enough to address the problems unearthed by the financial crisis. Too-big-to-fail institutions are now a part of law (though the policy jargon is systemically important financial institution, or SIFI). In fact, the scandal surrounding HSBC’s support of money laundering and the Justice Department’s weak response suggested bankers may be too-big-to-jail! Many of the financial products and practices that caused the financial crisis are still legal; the fundamentals that produced the crisis have not changed. Barack Obama and the Democrats (and the Republicans, certainly) failed to break the political back of the bankers.

While I did not think Bernie Sanders’ reforms would necessarily make the American economy better, I thought he would put the fear of God back into the financial sector, and that alone could help keep risky behavior in check. Donald Trump, for all his populist rhetoric, has not demonstrated he’s going to put that fear in them. In fact, the Republicans passed a bill that’s a gift to corporations and top earners. The legacy of the 2008 financial crisis is that the financial sector can make grossly risky bets in the good “get government off our back!” times, but will have their losses covered by taxpayers in the “we need government help!” times. Recessions and financial crises are a part of the process of expropriating taxpayers. (I wrote other articles about this topic: see this article and this article, as well as this paper I wrote for an undergraduate class.)

Given all this, there’s good reason to believe that nothing has changed about the American economy that would change the likelihood of a financial crisis. Since it has been so long since the last one, it’s time to start expecting one, and whoever holds the Presidency will be blamed.

Right now that’s Donald Trump and the Republicans. And I don’t need to tell you that given Trump’s popularity in good economic times is historically low, a recession before the 2020 election would lead to a Republican rout, with few survivors.

And in a Census year, too!

So what is the probability of a recession? The rest of this article will focus on finding a statistical model for the duration between recessions and using that model to estimate the probability of a recession.

A recent article in the magazine Significance entitled “The Weibull distribution” describes the Weibull distribution, a common and expressive probability distribution (and one I recently taught in my statistics class). This distribution is used to model many phenomena, including survival times: the time until a system fails, or how long a patient diagnosed with a disease survives. Time until recession sounds like a “survival time”, so perhaps the Weibull distribution can be used to model it.
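For reference, with shape $\alpha > 0$ and scale $\beta > 0$ (the same parameterization used by R’s dweibull and pweibull), the Weibull density and survival function are

$f(t) = \frac{\alpha}{\beta}\left(\frac{t}{\beta}\right)^{\alpha - 1} e^{-(t/\beta)^{\alpha}}, \qquad S(t) = P(T > t) = e^{-(t/\beta)^{\alpha}}, \qquad t \geq 0.$

A shape $\alpha > 1$ corresponds to an increasing hazard rate: the longer an expansion has lasted, the more likely a recession becomes in the next instant.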

First, I’m going to be doing some bootstrapping, so here’s the seed for replicability:

set.seed(4182018)


The dataset below, obtained from this Wikipedia article, contains the time between recessions in the United States. I look only at recessions since the Great Depression, considering this to be the “modern” economic era for the United States. The sample size is necessarily small, at 13 observations.

recessions <- c( 4+ 2/12,  6+ 8/12,  3+ 1/12,  3+ 9/12,  3+ 3/12,  2+ 0/12,
                 8+10/12,  3+ 0/12,  4+10/12,  1+ 0/12,  7+ 8/12, 10+ 0/12,
                 6+ 1/12)

hist(recessions)


plot(density(recessions))


The fitdistrplus package allows for estimating the parameters of statistical distributions using the usual statistical techniques. (I found this J. Stat. Soft. article useful for learning about the package.) I load it below and look at an initial plot to get a sense of appropriate distributions.

suppressPackageStartupMessages(library(fitdistrplus))

descdist(recessions, boot = 1000)


## summary statistics
## ------
## min:  1   max:  10
## median:  4.166667
## mean:  4.948718
## estimated sd:  2.71943
## estimated skewness:  0.51865
## estimated kurtosis:  2.349399


The recessions dataset is platykurtic though right-skewed, a surprising result. However, that’s not enough to deter me from attempting to use the Weibull distribution to model time between recessions. (I should mention that I am essentially assuming that the times between recessions since the Great Depression are independent and identically distributed. This is not obvious or uncontroversial, but I doubt this could be credibly disproven or that assuming dependence would improve the model.) Let’s fit parameters.

fw <- fitdist(recessions, "weibull")
summary(fw)

## Fitting of the distribution ' weibull ' by maximum likelihood
## Parameters :
##       estimate Std. Error
## shape 2.001576  0.4393137
## scale 5.597367  0.8179352
## Loglikelihood:  -30.12135   AIC:  64.2427   BIC:  65.3726
## Correlation matrix:
##           shape     scale
## shape 1.0000000 0.3172753
## scale 0.3172753 1.0000000


plot(seq(0, 15, length.out = 1000), dweibull(seq(0, 15, length.out = 1000),
shape = fw$estimate["shape"], scale = fw$estimate["scale"]),
col = "blue", type = "l", xlab = "Duration", ylab = "Density",
main = "Weibull distribution applied to recession duration")
lines(density(recessions))


plot(fw)


The plots above suggest the fitted Weibull distribution describes the observed distribution well; the Q-Q plot, P-P plot, and the estimated density function all fit well with a Weibull distribution. I also compared the AIC value of the fitted Weibull distribution to those of two other close candidates, the gamma and log-normal distributions; the Weibull distribution provides the best fit according to the AIC criterion, being twice as reasonable as the log-normal distribution, although only slightly better than the gamma distribution (which is not surprising, given that the two distributions are similar). Given the interpretations that come with the Weibull distribution and the statistical evidence, I believe it provides the better fit and should be used.

Based on the form of the distribution and the estimated parameters we can find a point estimate for the probability of a recession both before the 2018 midterm election and before the 2020 presidential election. That is, if $T$ is the time between recessions, we can estimate

$P(T \leq t + t_0 | T > t_0)$
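In terms of the Weibull survival function $S(t) = e^{-(t/\beta)^{\alpha}}$ with the fitted shape $\alpha$ and scale $\beta$, this conditional probability has a closed form, which is what the ratio of pweibull upper tails in the function below computes:

$P(T \leq t + t_0 \mid T > t_0) = 1 - \frac{S(t_0 + t)}{S(t_0)} = 1 - \exp\!\left[\left(\frac{t_0}{\beta}\right)^{\alpha} - \left(\frac{t_0 + t}{\beta}\right)^{\alpha}\right]$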

alpha <- fw$estimate["shape"]
beta <- fw$estimate["scale"]

recession_prob_wei <- function(delta, passed, shape, scale) {
# Computes the probability of a recession within the next delta years given
# passed years
#
# args:
#   delta: a number representing time to next recession
#   passed: a number representing time since last recession
#   shape: the shape parameter of the Weibull distribution
#   scale: the scale parameter of the Weibull distribution

if (delta < 0 | passed < 0) {
stop("Both delta and passed must be non-negative")
}

return(1 - pweibull(passed + delta, shape = shape, scale = scale,
lower.tail = FALSE) /
pweibull(passed, shape = shape, scale = scale, lower.tail = FALSE))
}


# Recession prob. before 2018 election point estimate
recession_prob_wei(6/12, 8+10/12, shape = alpha, scale = beta)

## [1] 0.252013


# Before 2020 election
recession_prob_wei(2+6/12, 8+10/12, shape = alpha, scale = beta)

## [1] 0.8005031


Judging by the point estimates, there’s a 25% chance of a recession before the 2018 midterm election and an 80% chance of a recession before the 2020 election.
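As a quick cross-check outside of R, the same point estimates can be reproduced in Python from the closed-form Weibull survival function $S(t) = e^{-(t/\beta)^{\alpha}}$, plugging in the fitted shape and scale reported by fitdist above:

```python
import math

def recession_prob(delta, passed, shape, scale):
    """P(recession within `delta` years | `passed` years since the last one)
    under a Weibull(shape, scale) model: 1 - S(passed + delta) / S(passed)."""
    if delta < 0 or passed < 0:
        raise ValueError("delta and passed must be non-negative")
    # The survival-function ratio collapses to a single exponential.
    return 1 - math.exp((passed / scale) ** shape
                        - ((passed + delta) / scale) ** shape)

shape, scale = 2.001576, 5.597367  # maximum-likelihood fit from R's fitdist()
p_2018 = recession_prob(6/12, 8 + 10/12, shape, scale)      # ~0.2520
p_2020 = recession_prob(2 + 6/12, 8 + 10/12, shape, scale)  # ~0.8005
```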

The code below finds bootstrapped 95% confidence intervals for these numbers.

suppressPackageStartupMessages(library(boot))
recession_prob_wei_bootci <- function(data, delta, passed, conf = .95,
R = 1000) {
# Computes bootstrapped CI for the probability a recession will occur before
# a certain time given some time has passed
#
# args:
#   data: A numeric vector containing recession data
#   delta: A nonnegative real number representing maximum time till recession
#   passed: A nonnegative real number representing time since last recession
#   conf: A real number between 0 and 1; the confidence level
#   R: A positive integer for the number of bootstrap replicates
bootobj <- boot(recessions, R = R, statistic = function(data, indices) {
d <- data[indices]
params <- fitdist(d, "weibull")$estimate
return(recession_prob_wei(delta, passed, shape = params["shape"],
scale = params["scale"]))
})
boot.ci(bootobj, type = "perc", conf = conf)
}

# Bootstrapped 95% CI for probability of recession before 2018 election
recession_prob_wei_bootci(recessions, 6/12, 8+10/12, R = 10000)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootobj, conf = conf, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   ( 0.1691,  0.6174 )
## Calculations and Intervals on Original Scale


# Bootstrapped 95% CI for probability of recession before 2020 election
recession_prob_wei_bootci(recessions, 2+6/12, 8+10/12, R = 10000)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootobj, conf = conf, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   ( 0.6299,  0.9974 )
## Calculations and Intervals on Original Scale

These CIs suggest that while the probability of a recession before the 2018 midterm is very uncertain (it could plausibly be anywhere between 17% and 62%), my hunch about 2020 has validity; even the lower bound of that CI suggests a recession before 2020 is likely, and the upper bound is near-certainty.

How bad could it be? That’s hard to say. However, these odds make the Republican tax bill and its trillion-dollar deficits look even more irresponsible; that money will be needed to deal with a potential recession’s fallout.

As bad as 2018 looks for Republicans, it could look like a cakewalk compared to 2020.

(And despite the seemingly jubilant tone, this suggests I may have trouble finding a job in the upcoming years.)

I have created a video course published by Packt Publishing entitled Data Acquisition and Manipulation with Python, the second volume in a four-volume set of video courses entitled Taming Data with Python; Excelling as a Data Analyst. This course covers more advanced Pandas topics such as reading in datasets in different formats and from databases, aggregation, and data wrangling. The course then transitions to cover getting data in “messy” formats from Web documents via web scraping. The course covers web scraping using BeautifulSoup, Selenium, and Scrapy. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.