# My Data Science Blogs

## March 17, 2018

### Document worth reading: “A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications”

Graphs are an important data representation appearing in a wide diversity of real-world scenarios. Effective graph analytics provides users a deeper understanding of what is behind the data, and thus can benefit many useful applications such as node classification, node recommendation, and link prediction. However, most graph analytics methods suffer from high computation and space costs. Graph embedding is an effective yet efficient way to solve the graph analytics problem: it converts the graph data into a low-dimensional space in which the graph structural information and graph properties are maximally preserved. In this survey, we conduct a comprehensive review of the literature on graph embedding. We first introduce the formal definition of graph embedding as well as the related concepts. After that, we propose two taxonomies of graph embedding, corresponding to the challenges that exist in different graph embedding problem settings and to how the existing work addresses these challenges in its solutions. Finally, we summarize the applications that graph embedding enables and suggest four promising future research directions in terms of computation efficiency, problem settings, techniques, and application scenarios.

### Book Memo: “Hybrid Intelligence for Social Networks”

This book explains aspects of social networks, ranging from the development and application of new artificial intelligence and computational intelligence techniques for social networks to understanding the impact of social networks. Chapters 1 and 2 deal with the basic strategies towards social networks, such as mining text from such networks and applying social network metrics using a hybrid approach; Chapters 3 to 8 focus on the prime research areas in social networks: community detection, influence maximization and opinion mining. Chapters 9 to 13 concentrate on studying the impact and use of social networks in society, primarily in education, commerce, and crowdsourcing. The contributions provide a multidimensional approach, and the book will serve graduate students and researchers as a reference in computer science, electronics engineering, communications, and information technology.

### Book Memo: “Data Science Landscape”

*Towards Research Standards and Protocols*. This edited volume deals with different contours of data science, with special reference to data management for the research innovation landscape. Data is becoming pervasive in all spheres of human, economic and development activity. In this context, it is important to take stock of what is being done in the data management area and to begin to prioritize, consider and formulate the adoption of a formal data management system, including citation protocols, for use by research communities in different disciplines, while also addressing various technical research issues. The volume thus focuses on some of these issues, drawing typical examples from various domains.

### What is not but could be if

And if I can remain there I will say – Baby Dee

Obviously this is a blog that loves the tabloids. But as we all know, the best stories are the ones that confirm your own prior beliefs (because those must be true). So I’m focussing on this article in Science that talks about how STEM undergraduate programmes in the US lose gay and bisexual students. This leaky pipeline narrative (that diversity is smaller the further you go in a field because minorities drop out earlier) is pretty common when you talk about diversity in STEM. But this article says that there are now numbers! So let’s have a look…

#### And when you’re up there in the cold, hopin’ that your knot will hold and swingin’ in the snow…

From the article:

The new study looked at a 2015 survey of 4162 college seniors at 78 U.S. institutions, roughly 8% of whom identified as LGBQ (the study focused on sexual identity and did not consider transgender status). All of the students had declared an intention to major in STEM 4 years earlier. Overall, 71% of heterosexual students and 64% of LGBQ students stayed in STEM. But looking at men and women separately uncovered more complexity. After controlling for things like high school grades and participation in undergraduate research, the study revealed that heterosexual men were 17% more likely to stay in STEM than their LGBQ male counterparts. The reverse was true for women: LGBQ women were 18% more likely than heterosexual women to stay in STEM.

Ok. There’s a lot going on here. First things first, let’s say a big hello to Simpson’s paradox! Although LGBQ people have a lower attainment rate in STEM, it’s driven by men going down and women going up. I think the thing that we can read straight off this is that there are “base rate” problems happening all over the place. (Note that the effect is similar across the two groups and in opposite directions, yet the combined total is fairly strongly aligned with the male effect.) We are also talking about a drop out of around 120 of the 333 LGBQ students in the survey. So the estimate will be noisy.
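The base-rate point is easy to see with a toy calculation. The numbers below are hypothetical, not the study's actual counts; they just show how a pooled gap can track the male effect when men dominate the sample, even though the two within-gender effects point in opposite directions:

```python
# Toy Simpson's-paradox-style aggregation (hypothetical counts, not the
# study's data): men dominate the sample, so the pooled retention gap
# follows the male effect even though the female effect is reversed.

def retention(stayed, total):
    return stayed / total

# (stayed, total) for each hypothetical subgroup
het_men    = (700, 1000)
lgbq_men   = (55, 100)
het_women  = (300, 500)
lgbq_women = (39, 50)

# Within-gender gaps: positive means heterosexual students stay more often.
men_gap   = retention(*het_men) - retention(*lgbq_men)      # het men ahead
women_gap = retention(*het_women) - retention(*lgbq_women)  # LGBQ women ahead

# Pooled rates ignore gender entirely.
het_all  = retention(het_men[0] + het_women[0], het_men[1] + het_women[1])
lgbq_all = retention(lgbq_men[0] + lgbq_women[0], lgbq_men[1] + lgbq_women[1])

print(men_gap, women_gap, het_all - lgbq_all)
```

With these made-up counts the male gap is positive, the female gap is negative, and the pooled gap is positive: the aggregate hides the reversal among women.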

I’m less worried about forking paths–I don’t think it’s unreasonable to expect the experience to differ across gender. Why? Well there is a well known problem with gender diversity in STEM.  Given that gay women are potentially affected by two different leaky pipelines, it sort of makes sense that the interaction between gender and LGBQ status would be important.

The actual article does better–it’s all done with multilevel logistic regression, which seems like an appropriate tool. There are p-values everywhere, but that’s just life. I struggled to work out from the paper exactly what the model was (sometimes my eyes just glaze over…), but it seems to have been done fairly well.

As with anything however (see also Gayface), the study is only as generalizable as the data set. The survey seems fairly large, but I’d worry about non-response. And, if I’m honest with you, me at 18 would’ve filled out that survey as straight, so there are also some problems there.

#### My father’s affection for his crowbar collection was Freudian to say the least

So a very shallow read of the paper makes it seem like the stats is good enough. But what if it’s not? Does that really matter?

This is one of those effects that’s anecdotally expected to be true. But more importantly, a lot of the proposed fixes are the types of low-cost interventions that don’t really need to work very well to be “value for money”.

For instance, it’s suggested that STEM departments work to make LGBT+ visibility more prominent (have visible, active inclusion policies). They suggest that people teaching pay attention to diversity in their teaching material.

The common suggestion for the last point is to pay special attention to work by women and under-represented groups in your teaching. This is never a bad thing, but if you’re teaching something very old (like the central limit theorem or differentiation), there’s only so much you can do. The thing that we all have a lot more control over is our examples and exercises. It is a no-cost activity to replace, for example, “Bob and Alice” with “Barbra and Alice” or “Bob and Alex”.

This type of low-impact diversity work signals to students that they are in a welcoming environment. Sometimes this is enough.

A similar example (but further up the pipeline) is that when you’re interviewing PhD students, postdocs, researchers, or faculty, don’t ask the men if they have a wife. Swapping to a gender neutral catch-all (partner) is super-easy. Moreover, it doesn’t force a person who is not in an opposite gender relationship to throw themselves a little pride parade (or, worse, to let the assumption fly because they’re uncertain if the mini-pride parade is a good idea in this context). Partner is a gender-neutral term. They is a gender-neutral pronoun. They’re not hard to use.

These environmental changes are important. In the end, if you value science you need to value diversity. Losing women, racial and ethnic minorities, LGBT+ people, disabled people, and other minorities really means that you are making your talent pool more shallow. A deeper pool leads to better science and creating a welcoming, positive environment is a serious step towards deepening the pool.

#### In defence of half-arsed activism

Making a welcoming environment doesn’t fix STEM’s diversity problem. There is a lot more work to be done. Moreover, the ideas in the paragraph above may do very little to improve the problem. They are also fairly quiet solutions–no one knows you’re doing these things on purpose. That is, they are half-arsed activism.

The thing is, as much as it’s lovely to have someone loudly on my side when I need it, I mostly just want to feel welcome where I am. So this type of work is actually really important. No one will ever give you a medal, but that doesn’t make it less appreciated.

The other thing to remember is that sometimes half-arsed activism is all that’s left to you. If you’re a student, or a TA, or a colleague, you can’t singlehandedly change your work environment. More than that, if a well-intentioned-but-loud intervention isn’t carefully thought through it may well make things worse. (For example, a proposal at a previous workplace to ensure that all female students (about 400 of them) have a female faculty mentor (about 7 of them) would’ve put a completely infeasible burden on the female faculty members.)

So don’t discount low-key, low-cost, potentially high-value interventions. They may not make things perfect, but they can make things better and maybe even “good enough”.

The post What is not but could be if appeared first on Statistical Modeling, Causal Inference, and Social Science.

### If you did not already know

Optimal Matching Analysis (OMA)
Optimal matching is a sequence analysis method used in social science to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort), classical tools (such as cluster analysis) can be used. The method was tailored to the social sciences from a technique originally introduced to study molecular biology (protein or genetic) sequences. Optimal matching uses the Needleman-Wunsch algorithm.
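The dynamic program behind optimal matching can be sketched in a few lines. This is a minimal illustration of the Needleman-Wunsch-style recurrence, not production OMA code; the indel and substitution costs are illustrative (real applications calibrate them to the domain):

```python
# Minimal Needleman-Wunsch-style distance between two state sequences,
# a sketch of the dynamic program behind optimal matching analysis.
# Costs are illustrative; real OMA work calibrates them carefully.

def om_distance(a, b, indel=1.0, sub=2.0):
    """Minimum edit cost to turn sequence a into sequence b."""
    n, m = len(a), len(b)
    # dp[i][j] = cost of aligning a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel
    for j in range(1, m + 1):
        dp[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # substitute / match
                           dp[i - 1][j] + indel,     # delete from a
                           dp[i][j - 1] + indel)     # insert into a
    return dp[n][m]

# Two hypothetical employment trajectories (E = employed, U = unemployed, S = school)
print(om_distance("EEUES", "EEUUS"))
```

Once pairwise distances are computed for a cohort, they feed straight into clustering, as the entry notes.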

Apache Flink
Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink includes several APIs for creating applications that use the Flink engine:
1. DataSet API for static data embedded in Java, Scala, and Python,
2. DataStream API for unbounded streams embedded in Java and Scala, and
3. Table API with a SQL-like expression language embedded in Java and Scala.
Flink also bundles libraries for domain-specific use cases:
1. Machine Learning library, and
2. Gelly, a graph processing API and library.
You can integrate Flink easily with other well-known open source systems both for data input and output as well as deployment.
“Gelly”

Disciplined Convex Optimization
An object-oriented modeling language for disciplined convex programming (DCP). It allows the user to formulate convex optimization problems in a natural way following mathematical convention and DCP rules. The system analyzes the problem, verifies its convexity, converts it into a canonical form, and hands it off to an appropriate solver to obtain the solution.
“Disciplined Convex Programming”
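The convexity verification in DCP works compositionally: the curvature of an expression is derived from the curvatures of its parts via a fixed set of rules (a sum of convex expressions is convex, a convex nondecreasing function of a convex expression is convex, and so on). Below is a toy sketch of that rule propagation; the names and rules here are simplified illustrations, not the actual API of any DCP package (real systems also track sign and monotonicity):

```python
# Toy sketch of DCP-style curvature propagation. Simplified rules only;
# not the API of an actual DCP modeling package.

AFFINE, CONVEX, CONCAVE, UNKNOWN = "affine", "convex", "concave", "unknown"

def add_curvature(a, b):
    """Curvature of a sum: affine is neutral; convex + concave can't be certified."""
    if a == AFFINE:
        return b
    if b == AFFINE:
        return a
    return a if a == b else UNKNOWN

def compose_convex_increasing(inner):
    """Convex nondecreasing outer function (e.g. max(x, 0)) of an expression."""
    return CONVEX if inner in (AFFINE, CONVEX) else UNKNOWN

# square(x) is convex and 3*x + 1 is affine, so their sum verifies as convex...
print(add_curvature(CONVEX, AFFINE))

# ...but a convex increasing function of a concave expression cannot be certified.
print(compose_convex_increasing(CONCAVE))
```

A problem whose objective and constraints all verify under such rules can then be canonicalized and handed to a solver, which is the workflow the entry describes.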

### R Tip: Use stringsAsFactors = FALSE

R tip: use stringsAsFactors = FALSE.

R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.

Sigmund Freud, it is often claimed, said: “Sometimes a cigar is just a cigar.”

To avoid problems delay re-encoding of strings by using stringsAsFactors = FALSE when creating data.frames.

Example:

```r
d <- data.frame(label = rep("tbd", 5))
d$label[[2]] <- "north"
#> Warning in `[[<-.factor`(`*tmp*`, 2, value = structure(c(1L, NA, 1L, 1L, :
#> invalid factor level, NA generated
print(d)
#>   label
#> 1   tbd
#> 2  <NA>
#> 3   tbd
#> 4   tbd
#> 5   tbd
```

Notice our new value was not copied in! The fix is easy: use stringsAsFactors = FALSE.

```r
d <- data.frame(label = rep("tbd", 5), stringsAsFactors = FALSE)
d$label[[2]] <- "north"
print(d)
#>   label
#> 1   tbd
#> 2 north
#> 3   tbd
#> 4   tbd
#> 5   tbd
```


As is often the case: base R works okay in default mode and works very well if you judiciously change a few defaults. There is much less need to whole-hog replace R functionality than some claim.

Note: the above pattern of pre-building a data.frame and filling values by addressing row/column index sets is a very effective (and under appreciated) way to build up data (often easier and quicker than binding rows or columns).
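For readers working in Python rather than R, a similar preallocate-and-fill pattern works with pandas (a sketch, assuming pandas is available; plain object columns keep strings as strings, so there is no factor-style surprise):

```python
import pandas as pd

# Pre-build a frame of placeholders, then fill cells by row/column index,
# mirroring the R pattern above. String columns stay plain strings.
d = pd.DataFrame({"label": ["tbd"] * 5})
d.loc[1, "label"] = "north"  # 0-based row index
print(d["label"].tolist())
```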

### What We Talk About When We Talk About Bias

Shira Mitchell wrote:

I gave a talk today at Mathematica about NHST in low power settings (Type M/S errors). It was fun and the discussion was great.

One thing that came up is bias from doing some kind of regularization/shrinkage/partial-pooling versus selection bias (confounding, nonrandom samples, etc). One difference (I think?) is that the first kind of bias decreases with sample size, but the latter won’t. Though I’m not sure how comforting that is in small-sample settings. I’ve read this post which emphasizes that unbiased estimates don’t actually exist, but I’m not sure how relevant this is.

I replied that the error is to think that an “unbiased” estimate is a good thing. See p.94 of BDA.

And then Shira shot back:

I think what is confusing to folks is when you use unbiasedness as a principle here, for example here:

Ahhhh, good point! I was being sloppy. One difficulty is that in classical statistics, there are two similar-sounding but different concepts, unbiased estimation and unbiased prediction. For Bayesian inference we talk about calibration, which is yet another way that an estimate can be correct on average.

The point of my above-linked BDA excerpt is that, in some settings, unbiased estimation is not just a nice idea that can’t be done in practice or can be improved in some ways; rather it’s an actively bad idea that leads to terrible estimates. The key is that classical unbiased estimation requires E(theta.hat|theta) = theta for any theta, and, given that some outlying regions of theta are highly unlikely, the unbiased estimate has to be a contortionist in order to get things right for those values.

But in certain settings the idea of unbiasedness is relevant, as in the linked post above where we discuss the problems of selection bias. And, indeed, type M and type S errors are defined with respect to the true parameter values. The key difference is that we’re estimating these errors—these biases—conditional on reasonable values of the underlying parameters. We’re not interested in these biases conditional on unreasonable values of theta.

Subtle point, worth thinking about carefully. Bias is important, but only conditional on reasonable values of theta.
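To make this concrete, here is a small stdlib-only simulation; the prior and noise scales are illustrative, not taken from the post. When true effects are drawn from a tight range of reasonable values, shrinking the classically unbiased estimate toward zero reduces mean squared error, conditional on those reasonable thetas:

```python
import random

# Illustrative simulation of "unbiased isn't automatically good": true
# effects come from a tight prior (the reasonable values of theta), the
# raw estimate is unbiased but noisy, and the posterior-mean shrinkage
# estimate has much lower mean squared error. Numbers are made up.
random.seed(1)
prior_sd, noise_sd, n = 0.5, 1.0, 20000

# Posterior-mean shrinkage factor for a normal-prior, normal-noise model.
shrink = prior_sd**2 / (prior_sd**2 + noise_sd**2)  # 0.2 here

mse_raw = mse_shrunk = 0.0
for _ in range(n):
    theta = random.gauss(0.0, prior_sd)      # a reasonable true effect
    y = theta + random.gauss(0.0, noise_sd)  # unbiased estimate of theta
    mse_raw    += (y - theta) ** 2
    mse_shrunk += (shrink * y - theta) ** 2

print(mse_raw / n, mse_shrunk / n)
```

The raw estimator's MSE sits near the noise variance, while the shrunken estimator does far better on average over the plausible thetas, which is exactly the conditional-on-reasonable-values sense of "good" discussed above.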

P.S. Thanks to Jaime Ashander for the above picture.

The post What We Talk About When We Talk About Bias appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Distilled News

R 3.4.4 has been released, and binaries for Windows, Mac, and Linux are now available for download on CRAN. This update (codenamed ‘Someone to Lean On’, likely a Peanuts reference, though I couldn’t find which one with a quick search) is a minor bugfix release, and shouldn’t cause any compatibility issues with scripts or packages written for prior versions of R in the 3.4.x series. This update improves automatic timezone detection on some systems, and adds fixes for some unusual corner cases in the statistics library. For a complete list of the changes, check the NEWS file for R 3.4.4 or follow the link below.
The final part of the introduction to NumPy. In this second part, we are going to see a few functions for creating specific arrays, and then computation between two arrays. You can find the first part of the NumPy introduction here.
Three years ago we launched Chicisimo; our goal was to offer automated outfit advice. Today, with over 4 million women on the app, we want to share how our data and machine learning approach helped us grow. It’s been chaotic, but it is now under control.
The brain has evolved over a long time, from very simple worm brains 500 million years ago to a diversity of modern structures today. The human brain, for example, can accomplish a wide variety of activities, many of them effortlessly — telling whether a visual scene contains animals or buildings feels trivial to us, for example. To perform activities like these, artificial neural networks require careful design by experts over years of difficult research, and typically address one specific task, such as to find what’s in a photograph, to call a genetic variant, or to help diagnose a disease. Ideally, one would want to have an automated method to generate the right architecture for any given task. One approach to generate these architectures is through the use of evolutionary algorithms. Traditional research into neuro-evolution of topologies (e.g. Stanley and Miikkulainen 2002) has laid the foundations that allow us to apply these algorithms at scale today, and many groups are working on the subject, including OpenAI, Uber Labs, Sentient Labs and DeepMind. Of course, the Google Brain team has been thinking about AutoML too. In addition to learning-based approaches (e.g. reinforcement learning), we wondered if we could use our computational resources to programmatically evolve image classifiers at unprecedented scale. Can we achieve solutions with minimal expert participation? How good can today’s artificially-evolved neural networks be? We address these questions through two papers.
Google CoLaboratory is Google’s latest contribution to AI, wherein users can code in Python using a Chrome browser in a Jupyter-like environment. In this article I have shared a method, and code, to create a simple binary text classifier using Scikit Learn within Google CoLaboratory environment.
Simulation studies are used in a wide range of areas, from risk management to epidemiology, and of course in statistics. The MonteCarlo package provides tools to automate the design of this kind of simulation study in R. The user only has to specify the random experiment he or she wants to conduct and the number of replications; the rest is handled by the package. So far, the main tool to analyze the results was to look at LaTeX tables generated using the MakeTable() function. Now, the new package version 1.0.5 contains the function MakeFrame(), which allows representing the simulation results as a data frame. This makes it very easy to visualize the results using standard tools such as dplyr and ggplot2. Here, I will demonstrate some of these concepts for a simple example that could be part of an introductory statistics course: the comparison of the mean and the median as estimators for the expected value. For an introduction to the MonteCarlo package click here or see the package vignette.
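The same mean-versus-median experiment is easy to sketch outside the MonteCarlo package; the post itself uses R, so this is just a plain-Python analogue with illustrative settings. For normal data the sample mean should win (lower mean squared error) because the sample median has higher variance:

```python
import random
import statistics

# Plain-Python version of the example: compare the sample mean and the
# sample median as estimators of a normal distribution's expected value.
# Settings are illustrative; the post's version uses the MonteCarlo R package.
random.seed(42)
n_rep, n_obs = 2000, 25

se_mean = se_median = 0.0
for _ in range(n_rep):
    x = [random.gauss(0.0, 1.0) for _ in range(n_obs)]
    se_mean   += statistics.fmean(x) ** 2    # squared error vs true mean 0
    se_median += statistics.median(x) ** 2

print(se_mean / n_rep, se_median / n_rep)
```

The mean's MSE comes out near 1/25, below the median's, matching the standard textbook result for normal data.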
The fast.ai library is a collection of supplementary wrappers for a host of popular machine learning libraries, designed to remove the necessity of writing your own functions to take care of some repetitive tasks in a machine learning workflow.
PyCon.DE is where Pythonistas in Germany can meet to learn about new and upcoming Python libraries, tools, software and data science. We welcome Python enthusiasts, programmers and data scientists from around the world to join us in Karlsruhe this year.
We expect 400 participants for PyCon.DE 2018 Karlsruhe. The conference will last 3 days and include about 60 talks, tutorials and hands on sessions. Python is a programming language which has found application and friends in many areas. Due to its popularity in science, Python has experienced a meteoric rise in the data science community over the past few years. At the conference, we expect a broad and interesting mix of Pythonistas including roles such as:
• Software Developer
• Data Scientist
• Technology Enthusiast

### What’s new on arXiv

Learning a Bayesian network (BN) from data can be useful for decision-making or discovering causal relationships. However, traditional methods often fail in modern applications, which exhibit a larger number of observed variables than data points. The resulting uncertainty about the underlying network as well as the desire to incorporate prior information recommend a Bayesian approach to learning the BN, but the highly combinatorial structure of BNs poses a striking challenge for inference. The current state-of-the-art methods such as order MCMC are faster than previous methods but prevent the use of many natural structural priors and still have running time exponential in the maximum indegree of the true directed acyclic graph (DAG) of the BN. We here propose an alternative posterior approximation based on the observation that, if we incorporate empirical conditional independence tests, we can focus on a high-probability DAG associated with each order of the vertices. We show that our method allows the desired flexibility in prior specification, removes timing dependence on the maximum indegree and yields provably good posterior approximations; in addition, we show that it achieves superior accuracy, scalability, and sampler mixing on several datasets.
Event management in sensor networks is a multidisciplinary field involving several steps across the processing chain. In this paper, we discuss the major steps that should be performed in real- or near real-time event handling including event detection, correlation, prediction and filtering. First, we discuss existing univariate and multivariate change detection schemes for the online event detection over sensor data. Next, we propose an online event correlation scheme that intends to unveil the internal dynamics that govern the operation of a system and are responsible for the generation of various types of events. We show that representation of event dependencies can be accommodated within a probabilistic temporal knowledge representation framework that allows the formulation of rules. We also address the important issue of identifying outdated dependencies among events by setting up a time-dependent framework for filtering the extracted rules over time. The proposed theory is applied on the maritime domain and is validated through extensive experimentation with real sensor streams originating from large-scale sensor networks deployed in ships.
We present data-dependent learning bounds for the general scenario of non-stationary non-mixing stochastic processes. Our learning guarantees are expressed in terms of a data-dependent measure of sequential complexity and a discrepancy measure that can be estimated from data under some mild assumptions. We also provide a novel analysis of a stable time series forecasting algorithm using this new notion of discrepancy that we introduce. We use our learning bounds to devise new algorithms for non-stationary time series forecasting for which we report some preliminary experimental results.
Scientific fields such as insider-threat detection and highway-safety planning often lack sufficient amounts of time-series data to estimate statistical models for the purpose of scientific discovery. Moreover, the available limited data are quite noisy. This presents a major challenge when estimating time-series models that are robust to overfitting and have well-calibrated uncertainty estimates. Most of the current literature in these fields involve visualizing the time-series for noticeable structure and hard coding them into pre-specified parametric functions. This approach is associated with two limitations. First, given that such trends may not be easily noticeable in small data, it is difficult to explicitly incorporate expressive structure into the models during formulation. Second, it is difficult to know $\textit{a priori}$ the most appropriate functional form to use. To address these limitations, a nonparametric Bayesian approach was proposed to implicitly capture hidden structure from time series having limited data. The proposed model, a Gaussian process with a spectral mixture kernel, precludes the need to pre-specify a functional form and hard code trends, is robust to overfitting and has well-calibrated uncertainty estimates.
We introduce SentEval, a toolkit for evaluating the quality of universal sentence representations. SentEval encompasses a variety of tasks, including binary and multi-class classification, natural language inference and sentence similarity. The set of tasks was selected based on what appears to be the community consensus regarding the appropriate evaluations for universal sentence representations. The toolkit comes with scripts to download and preprocess datasets, and an easy interface to evaluate sentence encoders. The aim is to provide a fairer, less cumbersome and more centralized way for evaluating sentence representations.
In this study, we propose advancing all-neural speech recognition by directly incorporating attention modeling within the Connectionist Temporal Classification (CTC) framework. In particular, we derive new context vectors using time convolution features to model attention as part of the CTC network. To further improve attention modeling, we utilize content information extracted from a network representing an implicit language model. Finally, we introduce vector based attention weights that are applied on context vectors across both time and their individual components. We evaluate our system on a 3400-hour Microsoft Cortana voice assistant task and demonstrate that our proposed model consistently outperforms the baseline model, achieving about 20% relative reduction in word error rates.
We present a formulation of deep learning that aims at producing a large margin classifier. The notion of margin, minimum distance to a decision boundary, has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with a preset feature representation; and conventional margin methods for neural networks only enforce margin at the output layer. Such methods are therefore not well suited for deep networks. In this work, we propose a novel loss function to impose a margin on any chosen set of layers of a deep network (including input and hidden layers). Our formulation allows choosing any norm on the metric measuring the margin. We demonstrate that the decision boundary obtained by our loss has nice properties compared to standard classification loss functions. Specifically, we show improved empirical results on the MNIST, CIFAR-10 and ImageNet datasets on multiple tasks: generalization from small training sets, corrupted labels, and robustness against adversarial perturbations. The resulting loss is general and complementary to existing data augmentation (such as random/adversarial input transform) and regularization techniques (such as weight decay, dropout, and batch norm).
Variational inference relies on flexible approximate posterior distributions. Normalizing flows provide a general recipe to construct flexible variational posteriors. We introduce Sylvester normalizing flows, which can be seen as a generalization of planar flows. Sylvester normalizing flows remove the well-known single-unit bottleneck from planar flows, making a single transformation much more flexible. We compare the performance of Sylvester normalizing flows against planar flows and inverse autoregressive flows and demonstrate that they compare favorably on several datasets.
Word vectors require significant amounts of memory and storage, posing issues to resource limited devices like mobile phones and GPUs. We show that high quality quantized word vectors using 1-2 bits per parameter can be learned by introducing a quantization function into Word2Vec. We furthermore show that training with the quantization function acts as a regularizer. We train word vectors on English Wikipedia (2017) and evaluate them on standard word similarity and analogy tasks and on question answering (SQuAD). Our quantized word vectors not only take 8-16x less space than full precision (32 bit) word vectors but also outperform them on word similarity tasks and question answering.
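The 1-bit case of the idea above can be sketched simply. This is an illustration of sign-based quantization with one shared scale, not the paper's exact quantization function: each weight keeps only its sign, so a d-dimensional vector costs d bits plus one float instead of 32d bits:

```python
# Sketch of 1-bit vector quantization (illustrative; not the paper's exact
# function): keep each weight's sign plus one shared scale factor.

def quantize_1bit(vec):
    scale = sum(abs(w) for w in vec) / len(vec)   # mean absolute value
    signs = [1.0 if w >= 0 else -1.0 for w in vec]
    return scale, signs

def dequantize(scale, signs):
    return [scale * s for s in signs]

v = [0.5, -1.5, 2.0, -0.2]
scale, signs = quantize_1bit(v)
print(scale, dequantize(scale, signs))
```

Learning such vectors directly (rather than quantizing after training) is the step the abstract describes as building the quantization function into Word2Vec.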
This paper reports on modern approaches in Information Extraction (IE) and its two main sub-tasks of Named Entity Recognition (NER) and Relation Extraction (RE). Basic concepts and the most recent approaches in this area are reviewed, which mainly include Machine Learning (ML) based approaches and the more recent trend to Deep Learning (DL) based methods.
Self-replication is a key aspect of biological life that has been largely overlooked in Artificial Intelligence systems. Here we describe how to build and train self-replicating neural networks. The network replicates itself by learning to output its own weights. The network is designed using a loss function that can be optimized with either gradient-based or non-gradient-based methods. We also describe a method we call regeneration to train the network without explicit optimization, by injecting the network with predictions of its own parameters. The best solution for a self-replicating network was found by alternating between regeneration and optimization steps. Finally, we describe a design for a self-replicating neural network that can solve an auxiliary task such as MNIST image classification. We observe that there is a trade-off between the network’s ability to classify images and its ability to replicate, but training is biased towards increasing its specialization at image classification at the expense of replication. This is analogous to the trade-off between reproduction and other tasks observed in nature. We suggest that a self-replication mechanism for artificial intelligence is useful because it introduces the possibility of continual improvement through natural selection.

### Book Memo: “Machine Learning Techniques for Online Social Networks”

The book covers tools in the study of online social networks such as machine learning techniques, clustering, and deep learning. A variety of theoretical aspects, application domains, and case studies for analyzing social network data are covered. The aim is to provide new perspectives on utilizing machine learning and related scientific methods and techniques for social network analysis. Machine Learning Techniques for Online Social Networks will appeal to researchers and students in these fields.

### Book Memo: “Computerized Adaptive and Multistage Testing with R”

*Using Packages catR and mstR*. The goal of this guide and manual is to provide a practical and brief overview of the theory on computerized adaptive testing (CAT) and multistage testing (MST), and to illustrate the methodologies and applications using the open source language R and several data examples. Implementation relies on the R packages catR and mstR, which have already been or are being developed by the first author (with the team) and which include some of the newest research algorithms on the topic. The book covers many topics along with the R code: the basics of R, a theoretical overview of CAT and MST, CAT designs, CAT assembly methodologies, CAT simulations, the catR package, CAT applications, MST designs, IRT-based MST methodologies, tree-based MST methodologies, the mstR package, and MST applications. CAT has been used in many large-scale assessments over recent decades, and MST has become very popular in recent years. The open source language R has also become one of the most useful tools for applications in almost all fields, including business and education. Though very useful and popular, R is a difficult language to learn, with a steep learning curve. Given the obvious need for, but complex implementation of, CAT and MST, it is very difficult for users to simulate or implement CAT and MST. Until this manual, there has been no book enabling users to design and use CAT and MST easily and without expense, i.e., by using the free R software. All examples and illustrations are generated using predefined scripts in the R language, available for free download from the book’s website.

### Announcing the winners of the Facebook Communications & Networking research awards

We are pleased to announce the winners of the Facebook Communications & Networking research awards. Continued research and innovation are key to building next-generation communications and networking systems. By sponsoring research and collaborating across a wide range of networking research areas, we expect to share new insights with the broader networking community.

The Facebook Communications & Networking award winners and their topic areas are:

Network Control Plane Verification at Scale
David Walker, Princeton University

End-to-End Transport for Multi-User Video QoE Optimization

Integrating IPv6 Segment Routing and Modern Transport Protocols
Olivier Bonaventure, Université catholique de Louvain, Louvain-la-Neuve

Automated Repair and Verification of Firewalls
Ruzica Piskac, Yale University

High Performance Server Packet Processing
Thomas Anderson, University of Washington

Scaling Distributed Storage with Programmable Switches
Xin Jin, Johns Hopkins University

Navigating the Latency-Quality Tradeoff in Personalized Live Video Streaming
Rashmi Vinayak, Carnegie Mellon University

### Magister Dixit

“What makes a good metric?
Here are some rules of thumb for what makes a good metric-a number that will drive the changes you’re looking for.
A good metric is comparative.
Being able to compare a metric to other time periods, groups of users, or competitors helps you understand which way things are moving. “Increased conversion from last week” is more meaningful than “2% conversion”.
A good metric is understandable.
If people can’t remember it and discuss it, it’s much harder to turn a change in the data into a change in the culture.
A good metric is a ratio or a rate.
Accountants and financial analysts have several ratios they look at to understand, at a glance, the fundamental health of a company. You need some, too.
There are several reasons ratios tend to be the best metrics:
1 Ratios are easier to act on. Think about driving a car. Distance travelled is informational. But speed-distance per hour-is something you can act on, because it tells you about your current state, and whether you need to go faster or slower to get to your destination on time.
2 Ratios are inherently comparative. If you compare a daily metric to the same metric over a month, you’ll see whether you’re looking at a sudden spike or a long-term trend. In a car, speed is one metric, but speed right now over average speed this hour shows you a lot about whether you’re accelerating or slowing down.
3 Ratios are also good for comparing factors that are somehow opposed, or for which there’s an inherent tension. In a car, this might be distance covered divided by traffic tickets. The faster you drive, the more distance you cover-but the more tickets you get. This ratio might suggest whether or not you should be breaking the speed limit. A good metric changes the way you behave. This is by far the most important criterion for a metric: what will you do differently based on changes in the metric?
1 “Accounting” metrics like daily sales revenue, when entered into your spreadsheet, need to make your predictions more accurate. These metrics form the basis of Lean Startup’s innovation accounting, showing you how close you are to an ideal model and whether your actual results are converging on your business plan.
2 “Experimental” metrics, like the results of a test, help you to optimize the product, pricing, or market. Changes in these metrics will significantly change your behavior. Agree on what that change will be before you collect the data: if the pink website generates more revenue than the alternative, you’re going pink; if more than half your respondents say they won’t pay for a feature, don’t build it; if your curated MVP doesn’t increase order size by 30%, try something else. Drawing a line in the sand is a great way to enforce a disciplined approach. A good metric changes the way you behave precisely because it’s aligned to your goals of keeping users, encouraging word of mouth, acquiring customers efficiently, or generating revenue. If you want to choose the right metrics, you need to keep five things in mind:
1 Qualitative versus quantitative metrics
Qualitative metrics are unstructured, anecdotal, revealing, and hard to aggregate; quantitative metrics involve numbers and statistics, and provide hard numbers but less insight.
2 Vanity versus actionable metrics
Vanity metrics might make you feel good, but they don’t change how you act. Actionable metrics change your behavior by helping you pick a course of action.
3 Exploratory versus reporting metrics
Exploratory metrics are speculative and try to find unknown insights to give you the upper hand, while reporting metrics keep you abreast of normal, managerial, day-to-day operations.
4 Leading versus lagging metrics
Leading metrics give you a predictive understanding of the future; lagging metrics explain the past. Leading metrics are better because you still have time to act on them-the horse hasn’t left the barn yet.
5 Correlated versus causal metrics
If two metrics change together, they’re correlated, but if one metric causes another metric to change, they’re causal. If you find a causal relationship between something you want (like revenue) and something you can control (like which ad you show), then you can change the future
Analysts look at specific metrics that drive the business, called key performance indicators (KPIs). Every industry has KPIs-if you’re a restaurant owner, it’s the number of covers (tables) in a night; if you’re an investor, it’s the return on an investment; if you’re a media website, it’s ad clicks; and so on.”
Alistair Croll, Benjamin Yoskovitz ( 2013 )

## March 16, 2018

### Bob’s talk at Berkeley, Thursday 22 March, 3 pm

It’s at the Institute for Data Science at Berkeley.

And here’s the abstract:

I’ll provide an end-to-end example of using R and Stan to carry out full Bayesian inference for a simple set of repeated binary trial data: Efron and Morris’s classic baseball batting data, with multiple players observed for many at bats; clinical trial, educational testing, and manufacturing quality control problems have the same flavor.

We will consider three models that provide complete pooling (every player is the same), no pooling (every player is independent), and partial pooling (every player is to some degree like every other player). Hierarchical models allow the degree of similarity to be jointly modeled with individual effects, tightening estimates and sharpening predictions compared to the no pooling and complete pooling models. They also outperform empirical Bayes and max marginal likelihood predictively, both of which rely on point estimates of hierarchical parameters (aka “mixed effects”). I’ll show how to fit observed data to make predictions for future observations, estimate event probabilities, and carry out (multiple) comparisons such as ranking. I’ll explain how hierarchical modeling mitigates the multiple comparison problem by partial pooling (and I’ll tie it into rookie of the year effects and sophomore slumps). Along the way, I will show how to evaluate models predictively, preferring those that are well calibrated and make sharp predictions. I’ll also show how to evaluate model fit to data with posterior predictive checks and Bayesian p-values.
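The three pooling regimes can be illustrated numerically. Below is a minimal sketch in Python (the talk itself uses R and Stan); the hit counts are the classic Efron and Morris first-45-at-bats data, and the shrinkage weight `kappa` is an arbitrary stand-in for the hierarchical parameter that full Bayesian inference would estimate from the data:

```python
# Sketch: complete, no, and partial pooling for repeated binary trials.
# Hit counts are the Efron & Morris data: 18 players, 45 at-bats each.
# kappa is an assumed "prior sample size"; full Bayes would estimate it.
hits = [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7]
n = 45

overall = sum(hits) / (len(hits) * n)        # complete pooling: one shared rate
no_pool = [h / n for h in hits]              # no pooling: independent rates

kappa = 20                                   # arbitrary shrinkage strength
partial = [(h + kappa * overall) / (n + kappa) for h in hits]
# Each partially pooled estimate lies between the player's own rate and
# the overall rate, shrinking the extreme players' rates the most.
```

The point the sketch makes is the one in the abstract: partial pooling tightens each player's estimate toward the group, with the degree of similarity (here the fixed `kappa`) jointly modeled in the hierarchical version.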

### R Packages worth a look

Toolbox for Model Selection and Combinations for the Forecasting Purposes (greybox)
Implements model selection and combinations via information criteria based on the values of partial correlations. This allows, for example, solving ‘fat regression’ problems, where the number of variables is much larger than the number of observations. This is driven by the research on information criteria, which is well discussed in Burnham & Anderson (2002) <doi:10.1007/b97636>, and currently developed further by Ivan Svetunkov and Yves Sagaert (working paper in progress). Models developed in the package are tailored specifically for forecasting purposes, so there are several methods for producing forecasts from these models and visualising them.
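As a generic illustration of selection by information criteria (a simplified stand-in in Python, not greybox's actual partial-correlation-based algorithm), one can score candidate regressions with AIC and keep the model with the lower value:

```python
import math

def ols_rss(x, y):
    # One-predictor OLS via closed-form slope/intercept; returns the
    # residual sum of squares of the fitted line.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def aic(rss, n, k):
    # Gaussian-likelihood AIC up to a constant: n*log(RSS/n) + 2k,
    # where k counts fitted parameters (slope, intercept, variance).
    return n * math.log(rss / n) + 2 * k

# Toy data: y is driven by x1, while x2 is unrelated noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
y  = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

n = len(y)
aic1 = aic(ols_rss(x1, y), n, k=3)
aic2 = aic(ols_rss(x2, y), n, k=3)
# The informative predictor x1 yields the lower AIC and is selected.
```

The criterion penalizes parameter count, which is what makes it usable in the ‘fat regression’ setting the package targets, where exhaustive unpenalized fitting would always favor more variables.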

High-Dimensional Regression with Measurement Error (hdme)
Penalized regression for generalized linear models for measurement error problems (aka. errors-in-variables). The package contains a version of the lasso (L1-penalization) which corrects for measurement error (Sorensen et al. (2015) <doi:10.5705/ss.2013.180>). It also contains an implementation of the Generalized Matrix Uncertainty Selector, a version of the (Generalized) Dantzig Selector for the case of measurement error (Sorensen et al. (2018) <doi:10.1080/10618600.2018.1425626>).

Interface to the Corpus Query Protocol (rcqp)
Implements Corpus Query Protocol functions based on the CWB software. It relies on CWB (GPL v2), PCRE (BSD licence), and glib2 (LGPL).

Create the Best Train for Classification Models (OptimClassifier)
Searching for patterns and performing binary classification in economic and financial data is a large field of research, and in a large part of such data the target variable is binary. Many methodologies are in use today; this package collects the most popular ones and compares different configuration options for Linear Models (LM), Generalized Linear Models (GLM), Linear Mixed Models (LMM), Discriminant Analysis (DA), Classification And Regression Trees (CART), Neural Networks (NN) and Support Vector Machines (SVM).

Effortlessly Read Any Rectangular Data (readit)
Providing just one primary function, ‘readit’ uses a set of reasonable heuristics to apply the appropriate reader function to the given file path. As long as the data file has an extension, and the data is (or can be coerced to be) rectangular, readit() can probably read it.

### Because it's Friday: Email a tree

The City of Melbourne has collected data on the more than 70,000 trees in the urban forest of this Australian metropolis. The data include the species, the health status of the tree and its life expectancy, all shown on a lovely map.

As you can see from the image above, each tree also has a unique email address. The idea is that citizens can report problems with trees, like disease or a fallen limb. But as the Atlantic reported in 2015, the addresses have also been used to write charming letters to the trees. For example, this email to a Golden Elm:

21 May 2015

I’m so sorry you're going to die soon. It makes me sad when trucks damage your low hanging branches. Are you as tired of all this construction work as we are?

Sometimes the trees even reply, like this Willow Leaf Peppermint:

29 Jan 2015

Hello Mr Willow Leaf Peppermint, or should I say Mrs Willow Leaf Peppermint?

Do trees have genders?

I hope you've had some nice sun today.

Regards

L

30 Jan 2015

Hello

I am not a Mr or a Mrs, as I have what's called perfect flowers that include both genders in my flower structure, the term for this is Monoicous. Some trees species have only male or female flowers on individual plants and therefore do have genders, the term for this is Dioecious. Some other trees have male flowers and female flowers on the same tree. It is all very confusing and quite amazing how diverse and complex trees can be.

Kind regards,

Mr and Mrs Willow Leaf Peppermint (same Tree)

You can find a few more letters in this news.com.au article as well.

That's all from us for this week. Hope you have a great weekend (perhaps amongst the trees?) and we'll be back with more next week.

### RcppClassicExamples 0.1.2

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Per a CRAN email sent to 300+ maintainers, this package (just like many others) was asked to please register its S3 method. So we did, and also overhauled a few other packaging standards which have changed since the previous uploads in December of 2012 (!!).

No new code or features. Full details below. And as a reminder, don’t use the old RcppClassic — use Rcpp instead.

#### Changes in version 0.1.2 (2018-03-15)

• Registered S3 print method [per CRAN request]

• Added src/init.c with registration and updated all .Call usages taking advantage of it

• Updated http references to https

• Updated DESCRIPTION conventions

Thanks to CRANberries, you can also look at a diff to the previous release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RDieHarder 0.1.4

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Per a CRAN email sent to 300+ maintainers, this package (just like many others) was asked to please register its S3 method. So we did, and also overhauled a few other packaging standards which have changed since the last upload in 2014.

No NEWS.Rd file to take a summary from, but the top of the ChangeLog has details.

Thanks to CRANberries, you can also look at a diff to the previous release.



### Document worth reading: “Towards Deep Learning Models Resistant to Adversarial Attacks”

Recent work has demonstrated that neural networks are vulnerable to adversarial examples, i.e., inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete, general guarantee to provide. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. This suggests that adversarially resistant deep learning models might be within our reach after all. Towards Deep Learning Models Resistant to Adversarial Attacks
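As a concrete illustration of the kind of attack being defended against, here is a minimal sketch (in Python, on a toy logistic model with assumed weights) of the fast gradient sign method, a simpler one-step attack than the multi-step PGD adversary the paper builds its robust-optimization framework around:

```python
import math

# FGSM sketch on a toy logistic classifier. The weights and input are
# assumed values for illustration only.
w = [2.0, -1.0, 0.5]          # fixed model weights
x = [1.0, 1.0, 1.0]           # a clean input
y = 1                         # its true label

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(v):
    # Cross-entropy loss of the logistic model at input v with label y.
    p = sigmoid(sum(wi * vi for wi, vi in zip(w, v)))
    return -math.log(p) if y == 1 else -math.log(1 - p)

# For a logistic model the gradient of the loss w.r.t. the input
# is (p - y) * w.
p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
grad_x = [(p - y) * wi for wi in w]

# FGSM step: perturb each coordinate by eps in the sign of the gradient,
# staying inside an L-infinity ball of radius eps around x.
eps = 0.25
x_adv = [xi + eps * math.copysign(1.0, gi) for xi, gi in zip(x, grad_x)]
# The perturbation is nearly invisible per-coordinate, yet the loss on
# x_adv is strictly higher than on x.
```

The survey's robust-optimization view replaces the clean loss with the worst case over such a ball, which is what yields the training and attack methods it calls universal.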

### Science and Technology links (March 16th, 2018)

1. From the beginning of the 20th century to 2010, the life expectancy at birth for females in the United States increased by more than 32 years. The 3 major causes of death for females in 1900 were pneumonia and influenza, tuberculosis, and enteritis and diarrhea. In 2010, the 3 major causes were heart disease, all cancers, and stroke.
2. It looks like Dwarf stars could be orbited by habitable planets.
3. More evidence that intelligence is genetic.
4. Sugar and bread are killing you: Dietary Carbohydrates Impair Healthspan and Promote Mortality (in Cell).
5. It turns out that people in organized crime are probably saner than you’d expect: “we were able to determine that in the sample analysed there was not one subject with a psychotic personality”.
6. If you made it to Pluto and could somehow survive, would there be enough light to read? More than enough according to Cook.
7. China has reduced fine particulates in the air by a third in four years.
8. You can cure blindness (in mice) using small wires: “artificial photoreceptors based on gold nanoparticle-decorated titania nanowire arrays restored visual responses in the blind mice with degenerated photoreceptors”. (In Nature.)
9. According to Nature, a science doctorate has high value in the UK and Canadian job markets. It sounds true to me. However, you should simply not expect to automatically become a professor: “Nearly 30% of those with full- or part-time jobs ended up in academia.”
10. Brenda Milner is a professor at McGill University who will turn 100 this summer. She is still an active professor with an ongoing publication record. Here is what the New York Times wrote about her last year:

“People think because I’m 98 years old I must be emerita,” she said. “Well, not at all. I’m still nosy, you know, curious.” (…) Dr. Milner continues working, because she sees no reason not to. Neither McGill nor the affiliated Montreal Neurological Institute and Hospital has asked her to step aside. She has funding: In 2014 she won three prominent achievement awards, which came with money for research.

11. Mozilla has released an open source speech recognition model “so that anyone can develop compelling speech experiences” (via Leonid Boytsov).
12. One of my favorite authors, Brian Martin, has published a new book: Vaccination panic in Australia. We all know that vaccination can be an effective public health policy. So you think that it is crazy to question vaccination policies? Not so fast. Brian explains carefully that there is room for reasonable disagreement on how exactly vaccination is to be used. But most importantly, the book reviews how authorities proceed to suppress dissent, even reasonable well-founded dissent. The book can be freely accessed online.
13. As we get older, our muscles tend to disappear. This condition is called sarcopenia, a term coined in 1988. It is still unclear what causes it, but there is now evidence that it has to do with the disappearance of nerves. Even if we did nothing to cure cancer and heart disease, simply keeping the muscles of older people strong would make a huge difference. Sadly, we have barely begun to consider doing something about it.

### New Book: Credit risk analytics, The R Companion

Credit Risk Analytics in R will enable you to build credit risk models from start to finish. With access to real credit data on the accompanying website, you will master a wide range of applications.

### Microsoft Weekly Data Science News for March 16, 2018

Here are the latest articles from Microsoft regarding cloud data science products and updates.


### If you did not already know

Genetic Programming for Reinforcement Learning (GPRL)
The search for interpretable reinforcement learning policies is of high academic and industrial interest. Especially for industrial systems, domain experts are more likely to deploy autonomously learned controllers if they are understandable and convenient to evaluate. Basic algebraic equations are supposed to meet these requirements, as long as they are restricted to an adequate complexity. Here we introduce the genetic programming for reinforcement learning (GPRL) approach based on model-based batch reinforcement learning and genetic programming, which autonomously learns policy equations from pre-existing default state-action trajectory samples. GPRL is compared to a straight-forward method which utilizes genetic programming for symbolic regression, yielding policies imitating an existing well-performing, but non-interpretable policy. Experiments on three reinforcement learning benchmarks, i.e., mountain car, cart-pole balancing, and industrial benchmark, demonstrate the superiority of our GPRL approach compared to the symbolic regression method. GPRL is capable of producing well-performing interpretable reinforcement learning policies from pre-existing default trajectory data. …

Adaptive Robust Control
In this paper we propose a new methodology for solving an uncertain stochastic Markovian control problem in discrete time. We call the proposed methodology the adaptive robust control. We demonstrate that the uncertain control problem under consideration can be solved in terms of the associated adaptive robust Bellman equation. The success of our approach is to a great extent owed to the recursive methodology for the construction of relevant confidence regions. We illustrate our methodology by considering an optimal portfolio allocation problem, and we compare results obtained using the adaptive robust control method with some other existing methods. …

Active Function Cross-Entropy Clustering (afCEC)
Active function cross-entropy clustering partitions the n-dimensional data into the clusters by finding the parameters of the mixed generalized multivariate normal distribution, that optimally approximates the scattering of the data in the n-dimensional space, whose density function is of the form: p_1*N(mi_1,^sigma_1,sigma_1,f_1)+…+p_k*N(mi_k,^sigma_k,sigma_k,f_k). The above-mentioned generalization is performed by introducing so called ‘f-adapted Gaussian densities’ (i.e. the ordinary Gaussian densities adapted by the ‘active function’). Additionally, the active function cross-entropy clustering performs the automatic reduction of the unnecessary clusters. For more information please refer to P. Spurek, J. Tabor, K.Byrski, ‘Active function Cross-Entropy Clustering’ (2017) <doi:10.1016/j.eswa.2016.12.011>. …

### Speeding up Metropolis-Hastings with Rcpp

(This article was first published on R – Stable Markets, and kindly contributed to R-bloggers)

Previous posts in this series on MCMC samplers for Bayesian inference (in order of publication): Bayesian Simple Linear Regression with Gibbs Sampling in R Blocked Gibbs Sampling in R for Bayesian Multiple Linear Regression Metropolis-in-Gibbs Sampling and Runtime Analysis with Profviz The code for all of these posts can be found in my BayesianTutorials GitHub … Continue reading Speeding up Metropolis-Hastings with Rcpp
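For readers new to the series, the sampler being sped up is the classic random-walk Metropolis-Hastings loop; a minimal sketch (in Python here, while the post itself compares R and Rcpp implementations):

```python
import math
import random

def metropolis_hastings(log_density, x0, n_steps, step_size=1.0, seed=42):
    """Random-walk Metropolis-Hastings sketch: propose a Gaussian step,
    accept with probability min(1, density ratio)."""
    rng = random.Random(seed)
    x, samples = x0, []
    lp = log_density(x)
    for _ in range(n_steps):
        prop = x + rng.gauss(0.0, step_size)
        lp_prop = log_density(prop)
        # Accept/reject on the log scale to avoid underflow.
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

# Target: a standard normal, via its log-density up to a constant.
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
mean = sum(draws) / len(draws)  # should be near 0
```

The inner loop is exactly the part that dominates runtime in R, which is why moving it to compiled code with Rcpp pays off.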


### Your free 70-page guide to a career in data science

To help you become a data scientist, we put together a guide with answers to: How do you break into the profession? What skills do you need to become a data scientist? Where are the best data science jobs?

### Will young Russians make for a new Russia?

As part of our multimedia project on young Russians, we broke down the country’s views on a range of issues by age cohort, drawing on extensive data kindly provided by the Levada Centre, an independent pollster.

### 5 Things You Need to Know about Big Data

We take a look at five things you need to know about Big Data.

### Contents

#### The Shape of a Pixel

What is the shape of a pixel? At various times, I have treated a pixel as a square (often), a point (sometimes), or a rectangle (occasionally). I recall back in grad school doing some homework where we were treating pixels as hexagons.

As I have worked through the last few posts on computing Feret diameters, though, I have started to entertain the possible usefulness of considering pixels to be circles. (See 29-Sep-2017, 24-Oct-2017, and 20-Feb-2018.) Let me try to explain why.

Here's a binary image with a single foreground blob (or "object," or "connected component.")

bw = imread('Martha''s Vineyard (30x20).png');
imshow(bw)


Most of the time, we think of image pixels as being squares with unit area.

pixelgrid


We can use find to get the $x$- and $y$-coordinates of the pixel centers, and then we can use convhull to find their convex hull. As an optimization that I think will often reduce execution time and memory, I'm going to preprocess the input binary image here by calling bwperim. I'm not going to show that step everywhere in this example, though.

[y,x] = find(bwperim(bw));
hold on
plot(x,y,'.')
hold off
title('Pixel centers')
h = convhull(x,y);
x_hull = x(h);
y_hull = y(h);
hold on
hull_line = plot(x_hull,y_hull,'r*','MarkerSize',12);
hold off
title('Pixel centers and convex hull vertices')


Notice that there are some chains of three or more collinear convex hull vertices.

xlim([21.5 32.5])
ylim([9.5 15.5])
title('Colinear convex hull vertices')


In some of the other processing steps related to Feret diameter measurements, collinear convex hull vertices can cause problems. We can eliminate these vertices directly in the call to convhull using the 'Simplify' parameter.

h = convhull(x,y,'Simplify',true);
x_hull = x(h);
y_hull = y(h);
delete(hull_line);
hold on
plot(x_hull,y_hull,'r*','MarkerSize',12)
hold off
title('Colinear hull vertices removed')

imshow(bw)
hold on
plot(x_hull,y_hull,'r-*','LineWidth',2,'MarkerSize',12)
hold off
title('A Blob''s Convex Hull and Its Vertices')


Notice, though, that there are white bits showing outside the red convex hull polygon. That's because we are only using the pixel centers.

#### Weaknesses of Using the Pixel Centers

Consider a simpler binary object, one that has only one row.

bw2 = false(5,15);
bw2(3,5:10) = true;
imshow(bw2)
pixelgrid
[y,x] = find(bw2);


The function convhull doesn't even work on collinear points.

try
hull = convhull(x,y,'Simplify',true);
catch e
fprintf('Error message from convhull: "%s"\n', e.message);
end

Error message from convhull: "Error computing the convex hull. The points may be collinear."


But even if it did return an answer, the answer would be a degenerate polygon with length 5 (even though the number of foreground pixels is 6) and zero area.

hold on
plot(x,y,'r-*','MarkerSize',12,'LineWidth',2)
hold off
title('Degenerate convex hull polygon')


We can solve this degeneracy problem by using square pixels.

#### Square Pixels

In the computation of the convex hull above, we treated each pixel as a point. We can, instead, treat each pixel as a square by computing the convex hull of all the corners of every pixel. Here's one way to perform that computation.

offsets = [ ...
0.5  -0.5
0.5   0.5
-0.5  -0.5
-0.5   0.5 ]';

offsets = reshape(offsets,1,2,[]);

P = [x y];
Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

h = convhull(S,'Simplify',true);
x_hull = S(h,1);
y_hull = S(h,2);

imshow(bw2)
pixelgrid
hold on
plot(x_hull,y_hull,'r-*','MarkerSize',12,'LineWidth',2)
hold off
title('Convex hull of square pixels')


This result looks good at first glance. However, it loses some of its appeal when you consider the implications for computing the maximum Feret diameter.

points = [x_hull y_hull];
[d,end_points] = maxFeretDiameter(points,antipodalPairs(points))
hold on
plot(end_points(:,1),end_points(:,2),'k','LineWidth',3)
hold off
title('The maximum Feret diameter is not horizontal')

d =

6.0828

end_points =

10.5000    2.5000
4.5000    3.5000



The maximum Feret diameter of this horizontal segment is 6.0828 ($\sqrt{37}$) instead of 6, and the corresponding orientation in degrees is:

atan2d(1,6)

ans =

9.4623
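This corner-to-corner effect is easy to verify without the hull machinery, since the maximum Feret diameter equals the maximum pairwise distance over the pixel corners; a brute-force sketch (in Python, using the same one-row blob) reproduces the $\sqrt{37}$ value:

```python
import math

# Brute-force check of the square-pixel max Feret diameter for the
# one-row blob above (pixel centers at x = 5..10, y = 3). For a small
# example we can skip the convex hull and take the maximum pairwise
# distance over all pixel corners directly.
centers = [(x, 3) for x in range(5, 11)]
corner_offsets = [(0.5, 0.5), (0.5, -0.5), (-0.5, 0.5), (-0.5, -0.5)]
corners = [(cx + dx, cy + dy)
           for cx, cy in centers
           for dx, dy in corner_offsets]

d = max(math.dist(p, q) for p in corners for q in corners)
# d == sqrt(37) = 6.0828..., not 6: the diameter tilts corner-to-corner.
```

The quadratic scan stands in for the antipodal-pairs rotating-calipers step, which gives the same answer in less time on large hulls.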



Another worthy attempt is to use diamond pixels.

#### Diamond Pixels

Instead of using the four corners of each pixel, let's try using the middle of each pixel edge. Once we define the offsets, the code is exactly the same as for square pixels.

offsets = [ ...
0.5   0.0
0.0   0.5
-0.5   0.0
0.0  -0.5 ]';

offsets = reshape(offsets,1,2,[]);

P = [x y];
Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

h = convhull(S,'Simplify',true);
x_hull = S(h,1);
y_hull = S(h,2);

imshow(bw2)
pixelgrid
hold on
plot(x_hull,y_hull,'r-*','MarkerSize',12,'LineWidth',2)
hold off
title('Convex hull of diamond pixels')


Now the max Feret diameter result looks better for the horizontal row of pixels.

points = [x_hull y_hull];
[d,end_points] = maxFeretDiameter(points,antipodalPairs(points))
hold on
plot(end_points(:,1),end_points(:,2),'k','LineWidth',3)
hold off

d =

6

end_points =

10.5000    3.0000
4.5000    3.0000



Hold on, though. Consider a square blob.

bw3 = false(9,9);
bw3(3:7,3:7) = true;
imshow(bw3)
pixelgrid
[y,x] = find(bw3);
P = [x y];
Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

h = convhull(S,'Simplify',true);
x_hull = S(h,1);
y_hull = S(h,2);

hold on
plot(x_hull,y_hull,'r-*','MarkerSize',12,'LineWidth',2)
points = [x_hull y_hull];
[d,end_points] = maxFeretDiameter(points,antipodalPairs(points))
plot(end_points(:,1),end_points(:,2),'k','LineWidth',3)
hold off
title('The max Feret diameter is not at 45 degrees')

d =

6.4031

end_points =

7.5000    3.0000
2.5000    7.0000



We'd like to see the max Feret diameter oriented at 45 degrees, and clearly we don't.

#### Circular Pixels

OK, I'm going to make one more attempt. I'm going to treat each pixel as approximately a circle. I'm going to approximate a circle using 24 points that are spaced at 15-degree intervals along the circumference.

thetad = 0:15:345;
offsets = 0.5 * [cosd(thetad) ; sind(thetad)];
offsets = reshape(offsets,1,2,[]);
Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

h = convhull(S,'Simplify',true);
x_hull = S(h,1);
y_hull = S(h,2);

imshow(bw3)
pixelgrid
hold on
plot(x_hull,y_hull,'r-*','MarkerSize',12,'LineWidth',2)
points = [x_hull y_hull];
[d,end_points] = maxFeretDiameter(points,antipodalPairs(points))
plot(end_points(:,1),end_points(:,2),'k','LineWidth',3)
axis on
hold off

d =

6.6569

end_points =

7.3536    7.3536
2.6464    2.6464



Now the max Feret diameter orientation is what we would naturally expect, which is $\pm 45^{\circ}$. The orientation would also be as expected for a horizontal or vertical segment of pixels.

Still, a circular approximation might not always give exactly what a user might expect. Let's go back to the Martha's Vineyard blob that I started with. I wrote a function called pixelHull that can compute the convex hull of binary image pixels in a variety of different ways. The call pixelHull(bw,24) computes the pixel hull using a 24-point circle approximation.

Here's the maximum Feret diameter using that approximation.

imshow(bw)
V = pixelHull(bw,24);
hold on
plot(V(:,1),V(:,2),'r-','LineWidth',2,'MarkerSize',12)
[d,end_points] = maxFeretDiameter(V,antipodalPairs(V));
plot(end_points(:,1),end_points(:,2),'m','LineWidth',3)
axis on
pixelgrid
hold off


I think many people might expect the maximum Feret diameter to go corner-to-corner in this case, but it doesn't exactly do that.

xlim([22.07 31.92])
ylim([8.63 15.20])


You have to use square pixels to get corner-to-corner.

imshow(bw)
V = pixelHull(bw,'square');
hold on
plot(V(:,1),V(:,2),'r-','LineWidth',2,'MarkerSize',12)
[d,end_points] = maxFeretDiameter(V,antipodalPairs(V));
plot(end_points(:,1),end_points(:,2),'m','LineWidth',3)
axis on
pixelgrid
hold off

xlim([22.07 31.92])
ylim([8.63 15.20])


After all this, I'm still not completely certain which shape assumption will generally work best. My only firm conclusion is that the point approximation is the worst choice. The degeneracies associated with point pixels are just too troublesome.

If you have an opinion, please share it in the comments. (Note: A comment that says, "Steve, you're totally overthinking this" would be totally legit.)

The rest of the post contains functions used by the code above.

function V = pixelHull(P,type)

if nargin < 2
type = 24;
end

if islogical(P)
P = bwperim(P);
[i,j] = find(P);
P = [j i];
end

if strcmp(type,'square')
offsets = [ ...
0.5  -0.5
0.5   0.5
-0.5   0.5
-0.5  -0.5 ];

elseif strcmp(type,'diamond')
offsets = [ ...
0.5  0
0    0.5
-0.5  0
0   -0.5 ];

else
% type is the number of angles for sampling a circle of diameter 1.
thetad = (0:(type-1)) * (360/type);
offsets = 0.5 * [cosd(thetad)' sind(thetad)'];

end

offsets = offsets';
offsets = reshape(offsets,1,2,[]);

Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

k = convhull(S,'Simplify',true);
V = S(k,:);
end


Get the MATLAB code

Published with MATLAB® R2017b

### Take Care If Trying the RPostgres Package

Take care if trying the new RPostgres database connection package. By default it returns some non-standard types that code developed against other database drivers may not expect, and may not be ready to defend against.

Danger, Will Robinson!

## Trying the new package

One can try the newer RPostgres as a drop-in replacement for the usual RPostgreSQL.

That starts out okay. We can connect to the database and pull a summary about remote data into R.

db <- DBI::dbConnect(
RPostgres::Postgres(),
host = 'localhost',
port = 5432,
user = 'johnmount',
password = '')
## Warning: multiple methods tables found for 'dbQuoteLiteral'
d <- DBI::dbGetQuery(
db,
"SELECT COUNT(1) FROM pg_catalog.pg_tables")
print(d)
##   count
## 1   177
ntables <- d$count[[1]]
print(ntables)
## integer64
## [1] 177

The result at first looks okay.

class(ntables)
## [1] "integer64"
typeof(ntables)
## [1] "double"
ntables + 1L
## integer64
## [1] 178
ntables + 1
## integer64
## [1] 178
is.numeric(ntables)
## [1] TRUE

But it is only okay, until it is not.

pmax(1L, ntables)
## [1] 8.744962e-322
pmin(1L, ntables)
## [1] 1
ifelse(TRUE, ntables, ntables)
## [1] 8.744962e-322
for(ni in ntables) { print(ni) }
## [1] 8.744962e-322
unclass(ntables)
## [1] 8.744962e-322

If your code, or any package code you are using, performs any of the above calculations, your results will be corrupt and wrong. It is quite likely that any code written before December 2017 (RPostgres's first CRAN distribution) was not written with the RPostgres "integer64 for all of my friends" design decision in mind.

Also note, RPostgres does not currently appear to write integer64 back to the database.

DBI::dbWriteTable(db, "d", d,
                  temporary = TRUE,
                  overwrite = TRUE)
DBI::dbGetQuery(db, "
  SELECT column_name, data_type, numeric_precision, numeric_precision_radix, udt_name
  FROM information_schema.columns
  WHERE table_name = 'd'
  ")
##   column_name data_type numeric_precision numeric_precision_radix udt_name
## 1       count      real                24                       2   float4
DBI::dbDisconnect(db)

## The work-around

The work-around is: add the argument bigint = "numeric" to your dbConnect() call. This is mentioned in the manual, but it is not the default and is not called out in the package description or README.

Or, of course, you could use RPostgreSQL.

Continue Reading…

### Quick Feature Engineering with Dates Using fast.ai

The fast.ai library is a collection of supplementary wrappers for a host of popular machine learning libraries, designed to remove the necessity of writing your own functions to take care of some repetitive tasks in a machine learning workflow.
Continue Reading…

### University College Dublin: Postdoc Research Fellow

Seeking a temporary Post-doctoral Research Fellow in the UCD School of Computer Science for a project on analysing activity/fitness data, working with a team of researchers at the Insight Centre for Data Analytics.

Continue Reading…

### How to get started in data science?

Continue Reading…

### Gaydar and the fallacy of objective measurement

Greggor Mattson, Dan Simpson, and I wrote this paper, which begins:

Recent media coverage of studies about “gaydar,” the supposed ability to detect another’s sexual orientation through visual cues, reveals problems in which the ideals of scientific precision strip the context from intrinsically social phenomena. This fallacy of objective measurement, as we term it, leads to nonsensical claims based on the predictive accuracy of statistical significance. We interrogate these gaydar studies’ assumption that there is some sort of pure biological measure of perception of sexual orientation. Instead, we argue that the concept of gaydar inherently exists within a social context and that this should be recognized when studying it. We use this case as an example of a more general concern about illusory precision in the measurement of social phenomena, and suggest statistical strategies to address common problems.

There’s a funny backstory to this one. I was going through my files a few months ago and came across an unpublished paper of mine from 2012, “The fallacy of objective measurement: The case of gaydar,” which I didn’t even remember ever writing! A completed article, never submitted anywhere, just sitting in my files. How can that happen? I must be getting old.

Anyway, I liked the paper—it addresses some issues of measurement that we’ve been talking about a lot lately. In particular, “the fallacy of objective measurement”: researchers took a rich real-world phenomenon and abstracted it so much that they removed its most interesting content.
“Gaydar” existed within a social context—a world in which gays were an invisible minority, hiding in plain sight and seeking to be inconspicuous to the general population while communicating with others of their subgroup. How can it make sense to boil this down to the shapes of faces? Stripping a phenomenon of its social context, normalizing a base rate to 50%, and seeking an on-off decision: all of these can give the feel of scientific objectivity—but the very steps taken to ensure objectivity can remove social context and relevance.

We had some gaydar discussion (also here) on the blog recently and this motivated me to freshen up the gaydar paper, with the collaboration of Mattson and Simpson. I also recently met Michal Kosinski, the coauthor of one of the articles under discussion, and that was helpful too.

Continue Reading…

### Web Scraping with Python: Illustration with CIA World Factbook

In this article, we show how to use Python libraries and HTML parsing to extract useful information from a website and answer some important analytics questions afterwards.

Continue Reading…

### What Are Beacons, and How Are They Used in IoT Projects?

All new technologies are becoming a part of our environment, but many of them remain unnoticed or incomprehensible. For many people, beacons are one of these mysterious items. Many IoT applications in large industries – such as retail and warehousing – use beacons every day, but these small devices go unnoticed. Although the…

The post What Are Beacons, and How Are They Used in IoT Projects? appeared first on Dataconomy.

Continue Reading…

### JPMorgan: Data Scientist, Payments & Liquidity

Seeking a Data Scientist with modeling expertise and implementation experience to serve as a thought partner to key business leaders and clients to generate hypotheses and insights.

Continue Reading…

### Apple: Big Data Engineer

Seeking extraordinary engineers to help take our environment to the next level.
You'll have the opportunity to solve challenging big data engineering problems across a broad range of Apple manufacturing services.

Continue Reading…

### Simple maths of a fairer USS deal

(This article was first published on R – Let's Look at the Figures, and kindly contributed to R-bloggers)

This will be my last post for a while (I promise!). After today I'll be taking a rest from all this, until at least the start of April. Hopefully all this USS stuff will be resolved by then, though!

In yesterday's post I showed a graph, followed by some comments to suggest that future USS proposals with a flatter (or even increasing) "percent lost" curve would be fairer (and, as I argued earlier in my Robin Hood post, more affordable at the same time).

It's now clear to me that my suggestion seemed a bit cryptic to many (maybe most!) who read it yesterday. So here I will try to show more specifically how to achieve a flat curve. (This is not because I think flat is optimal. It's mainly because it's easy to explain. As already mentioned, it might not be a bad idea if the curve were actually to increase a bit as salary levels increase; that would allow those with higher salaries to feel happy that they are doing their bit towards the sustainable future of USS.)

## Flattening the curve

The graph below is the same as yesterday's, but with a flat (blue, dashed) line drawn at the level of 4% lost across all salary levels. I drew the line at 4% here just as an example, to illustrate the calculation. The actual level needed — i.e., the "affordable" level for universities — would need to be determined by negotiation; but the maths is essentially the same, whatever the level (within reason).

Let's suppose we want to adjust the USS contribution and benefits parameters to achieve just such a flat "percent lost" curve, at the 4% level. How is that done?
I will assume here the same adjustable parameters that UUK and UCU appear to have in mind, namely:

• employee contribution rate E (as a percentage of salary — currently 8; was 8.7 in the 12 March proposal; was 8 in the January proposal)

• threshold salary T, over which defined benefit (DB) pension entitlement ceases (currently £55.55k; was £42k in the 12 March proposal; and was £0 in the January proposal)

• accrual rate A, in the DB pension, expressed here in percentage points (currently 100/75; was 100/85 in the 12 March proposal; not relevant to the January proposal)

• employer contribution rate (%) to the defined contribution (DC) part of USS pension. Let's allow different rates $C_1$ and $C_2$ for, respectively, salaries between T and £55.55k, and salaries over £55.55k. (Currently $C_1$ is irrelevant, and $C_2$ is 13 (max); these were both set at 12 in the 12 March proposal; and were both 13.25 in the January proposal.)

I will assume also, as all the recent proposals do, that the 1% USS match possibility is lost to all members.

Then, to get to 4% lost across the board, we need simply to solve the following linear equations. (To see where these came from, please see this earlier post.)

For salary up to T:

$(E - 8) + 19(100/75 - A) + 1 = 4.$

For salary between T and £55.55k:

$-8 + 19(100/75) - C_1 + 1 = 4.$

For salary over £55.55k:

$13 - C_2 = 4.$

Solving those last two equations is simple, and results in

$C_1 = 14.33, \qquad C_2 = 9.$

The first equation above clearly allows more freedom: it's just one equation, with two unknowns, so there are many solutions available. Three example solutions, still based on the illustrative 4% loss level across all salary levels, are:

$E = 8, \qquad A = 1.175 = 100/85.1$

$E = 8.7, \qquad A = 1.21 = 100/82.6$

$E = 11, \qquad A = 100/75.$

At the end here I'll give code in R to do the above calculation quite generally, i.e., for any desired percentage loss level. First let me just make a few remarks relating to all this.
## Remarks

### Choice of threshold

Note that the value of T does not enter into the above calculation. Clearly there will be (negotiable) interplay between T and the required percentage loss, though, for a given level of affordability.

### Choice of $C_2$

Much depends on the value of $C_2$. The calculation above gives the value of $C_2$ needed for a flat "percent lost" curve, at any given level for the percent lost (which was 4% in the example above).

To achieve an increasing "percent lost" curve, we could simply reduce the value of $C_2$ further than the answer given by the above calculation. Alternatively, as suggested in my earlier Robin Hood post, USS could apply a lower value of $C_2$ only for salaries above some higher threshold — i.e., in much the same spirit as progressive taxation of income.

Just as with income tax, it would be important not to set $C_2$ too small, otherwise the highest-paid members would quite likely want to leave USS. There is clearly a delicate balance to be struck at the top end of the salary spectrum. But it is clear that if the higher-paid were to sacrifice at least as much as everyone else, in proportion to their salary, then that would allow the overall level of "percent lost" to be appreciably reduced, which would benefit the vast majority of USS members.

### Determination of the overall "percent lost"

Everything written here constitutes a methodology to help with finding a good solution. As mentioned at the top here, the actual solution — and in particular, the actual level of USS member pain (if any) deemed to be necessary to keep USS afloat — will be a matter for negotiation. The maths here can help inform that negotiation, though.
## Code for solving the above equations

## Function to compute the USS parameters needed for a
## flat "percent lost" curve
##
## Function arguments are:
##   loss: in percentage points, the constant loss desired
##   E: employee contribution, in percentage points
##   A: the DB accrual rate
##
## Exactly one of E and A must be specified (ie, not NULL).
##
## Example calls:
##   flatcurve(4.0, A = 100/75)
##   flatcurve(2.0, E = 10.5)
##   flatcurve(1.0, A = 100/75)  # status quo, just 1% "match" lost

flatcurve <- function(loss, E = NULL, A = NULL){

    if (is.null(E) && is.null(A)) {
        stop("E and A can't both be NULL")}

    if (!is.null(E) && !is.null(A)) {
        stop("one of {E, A} must be NULL")}

    c1 <- 19 * (100/75) - (7 + loss)
    c2 <- 13 - loss

    if (is.null(E)) {
        E <- 7 + loss - (19 * (100/75 - A))
    }

    if (is.null(A)) {
        A <- (E - 7 - loss + (19 * 100/75)) / 19
    }

    return(list(loss_percent = loss,
                employee_contribution_percent = E,
                accrual_reciprocal = 100/A,
                DC_employer_rate_below_55.55k = c1,
                DC_employer_rate_above_55.55k = c2))
}

The above function will run in base R. Here are three examples of its use (copied from an interactive session in R):

###  Specify 4% loss level,
###  still using the current USS DB accrual rate

> flatcurve(4.0, A = 100/75)
$loss_percent
[1] 4

$employee_contribution_percent
[1] 11

$accrual_reciprocal
[1] 75

$DC_employer_rate_below_55.55k
[1] 14.33333

$DC_employer_rate_above_55.55k
[1] 9

#------------------------------------------------------------
###  This time for a smaller (2%) loss,
###  with specified employee contribution

> flatcurve(2.0, E = 10.5)
$loss_percent
[1] 2

$employee_contribution_percent
[1] 10.5

$accrual_reciprocal
[1] 70.80745

$DC_employer_rate_below_55.55k
[1] 16.33333

$DC_employer_rate_above_55.55k
[1] 11

#------------------------------------------------------------
###  Finally, my personal favourite:
###  --- status quo with just the "match" lost

> flatcurve(1, A = 100/75)
$loss_percent
[1] 1

$employee_contribution_percent
[1] 8

$accrual_reciprocal
[1] 75

$DC_employer_rate_below_55.55k
[1] 17.33333

$DC_employer_rate_above_55.55k
[1] 12
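For readers who don't use R, the same calculation is easy to port. The following is my own Python sketch of the flatcurve() logic above (the function name and result fields mirror the R version), and it reproduces the solved values from the equations: $C_1 = 14.33$, $C_2 = 9$, and $E = 11$ when $A = 100/75$ at the 4% loss level.

```python
# Python port (a sketch) of the flatcurve() function above.
def flatcurve(loss, E=None, A=None):
    # Exactly one of E and A must be supplied.
    if (E is None) == (A is None):
        raise ValueError("exactly one of E and A must be given")
    c1 = 19 * (100 / 75) - (7 + loss)   # DC employer rate, T to £55.55k
    c2 = 13 - loss                      # DC employer rate, over £55.55k
    if E is None:
        E = 7 + loss - 19 * (100 / 75 - A)
    else:
        A = (E - 7 - loss + 19 * 100 / 75) / 19
    return {
        "loss_percent": loss,
        "employee_contribution_percent": E,
        "accrual_reciprocal": 100 / A,
        "DC_employer_rate_below_55.55k": c1,
        "DC_employer_rate_above_55.55k": c2,
    }

print(flatcurve(4.0, A=100/75))
```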

To cite this entry:
Firth, D (2018). Simple maths of a fairer USS deal. Weblog entry at URL https://statgeek.net/2018/03/16/simple-maths-of-a-fairer-uss-deal/

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Four short links: 16 March 2018

Longevity, Partner Violence, Leaking Secrets, and Fallacy of Objective Measurement

1. Longevity FAQ (Laura Deming) -- I run Longevity Fund. I spend a lot of time thinking about what could increase healthy human lifespan. This is my overview of the field for beginners.
2. Intimate Partner Violence -- What we’ve discovered in our research is that digital abuse of intimate partners is both more mundane and more complicated than we might think. [...] [I]ntimate partner violence upends the way we typically think about how to protect digital privacy and security. You should read this because we all need to get a lot more aware of the ways in which the tools we make might be used to hurt others.
3. The Secret Sharer -- Machine learning models based on neural networks and deep learning are being rapidly adopted for many purposes. What those models learn, and what they may share, is a significant concern when the training data may contain secrets and the models are public—e.g., when a model helps users compose text messages using models trained on all users’ messages. [...] [W]e show that unintended memorization occurs early, is not due to overfitting, and is a persistent issue across different types of models, hyperparameters, and training strategies.
4. Gaydar and the Fallacy of Objective Measurement -- By taking gaydar into the lab, these research teams have taken the creative adaptation of an oppressed community of atomized members and turned gaydar into an essentialist story of “gender atypicality,” a topic that is related to, but distinctly different from, sexual orientation.

### Machine learning to estimate when bus and bike lanes blocked

Frustrated with vehicles blocking bus and bike lanes, Alex Bell applied some statistical methods to estimate the extent.

Sarah Maslin Nir for The New York Times:

Now Mr. Bell is trying another tack — the 30-year-old computer scientist who lives in Harlem has created a prototype of a machine-learning algorithm that studies footage from a traffic camera and tracks precisely how often bike lanes are obstructed by delivery trucks, parked cars and waiting cabs, among other scofflaws. It is a piece of data that transportation advocates said is missing in the largely anecdotal discussion of how well the city’s bus and bike lanes do or do not work.

### Some advice from a referee

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

Back in 2007 I wrote a Matlab package for estimating regime switching models. I was just starting to learn to code, and this project was my way of doing it. After publishing it on FEX (the Matlab File Exchange site) I got so many repeated questions by email that I eventually realized it would be easier to write a manual for people to read. Some time and effort would be spent writing it, but less time replying to the same questions in my inbox.

This manual about the code became, by far, my most cited paper on Google Scholar. It is not even published; it is just a permanent working paper. When attending conferences and seminars, I was always surprised to hear that people knew me as the Matlab regime switching guy.

Moving forward a few years, I switched from Matlab to R, and I continue to invest a lot of time writing papers about packages and publishing them in standard scientific journals. You can see a list of those here. I can testify to the greater contribution and impact of research papers about code, and I strongly believe they will become more popular in the years to come. The new generation of researchers is far more aware of code than the previous one. In that sense, nothing beats R and CRAN for the diversity and depth of packages.

On this subject, I frequently review papers on the same topic, and I see common mistakes that researchers make when writing their papers. Here are some tips for those who wish to pursue such a publication:

• A problem must be clearly stated: Every paper is a solution to a problem. This is also true for a paper about code. Identify it and make it painfully clear how the code solves it. In other words, do your homework.

• The paper is NOT an extended manual: Don’t write a paper simply showing its functions. We have that from CRAN (or other repository).

• Make sure you know what's available: How did people do it before? Is there a competing package? How does your code improve on it?

• A bibliometric study is mandatory: Same as the previous point. Looking at previously published research papers, can you find out how they handled the problem your code solves?

• Not everyone uses R, so make it easy for people to use your software: Make sure you keep the code simple and accessible. Explain what R is and why one should use it. Case in point: not everyone knows what a tibble is.

• Think about your example of usage: You should always add a reproducible example of usage. This is what everyone will try! Make sure it is a simple example, not too deep in the literature. Something everyone can understand. Your code should also be accessible and reproducible.

It is a lot of work to publish a research paper about code, but it is all worth it! The impact is much greater than that of a standard research paper, and your academic career will certainly move forward with it.


### Distilled News

Deep Learning at scale is disrupting many industries by creating chatbots and bots never seen before. On the other hand, a person just starting out in Deep Learning would read about the basics of neural networks and their various architectures, like CNNs and RNNs. But there seems to be a big jump from the simple concepts to industrial applications of Deep Learning. Concepts such as batch normalization, dropout, and attention are almost a requirement for building deep learning applications. In this article, we will cover two important concepts used in current state-of-the-art applications in speech recognition and natural language processing, namely sequence-to-sequence modelling and attention models. Just to give you a sneak peek of the potential of these two techniques: Baidu's AI system uses them to clone your voice, replicating a person's voice from just three seconds of training audio. You can check out some audio samples provided by Baidu's research team, which consist of original and synthesized voices.
• FastPhotoStyle
• Handwriting Synthesis
• ENAS PyTorch
• Sign Language
In my first article on this topic (see here) I introduced some of the complex stochastic processes used by Wall Street data scientists, using a simple approach that can be understood by people with no statistics background other than a first course such as Stats 101. I defined and illustrated the continuous Brownian motion (the mother of all these stochastic processes) using approximations by discrete random walks, simply re-scaling the X-axis and the Y-axis appropriately, and making the time increments (on the X-axis) smaller and smaller, so that the limiting process is time-continuous. This was done without using any complicated mathematics such as measure theory or filtrations. Here I am going one step further, introducing the integral and derivative of such processes, using rudimentary mathematics. All the articles that I have found on this subject are full of complicated equations and formulas; that is not the case here. Not only do I explain this material in simple English, but I also provide pictures to show what an integrated Brownian motion looks like (I could not find such illustrations in the literature) and how to compute its variance, and I focus on applications, especially to number theory, Fintech, and cryptography problems. Along the way, I discuss moving averages in a theoretical but basic framework (again with pictures), discussing what the optimal window should be for these (time-continuous or discrete) time series.
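The random-walk construction described in that excerpt is easy to sketch in a few lines of Python (my own toy version, not the article's code): take n steps of ±1 scaled by 1/sqrt(n) to approximate Brownian motion on [0, 1], then take a cumulative Riemann sum to approximate the integrated process. In theory Var[W(1)] = 1 and Var of the integrated process at time 1 is 1/3.

```python
# Random-walk approximation of Brownian motion and its integral.
import random

def brownian_path(n, rng):
    """n steps of +-1 scaled by sqrt(1/n), so Var[W(1)] is exactly 1."""
    dt = 1.0 / n
    w, path = 0.0, [0.0]
    for _ in range(n):
        w += rng.choice((-1.0, 1.0)) * dt**0.5
        path.append(w)
    return path

def integrate(path, n):
    """Riemann-sum approximation of the integrated process."""
    dt = 1.0 / n
    total, out = 0.0, [0.0]
    for w in path[1:]:
        total += w * dt
        out.append(total)
    return out

n = 10_000
w = brownian_path(n, random.Random(42))
iw = integrate(w, n)
# A single path only hints at the theory; averaging many paths recovers
# the variances 1 and 1/3 at time 1.
print(w[-1], iw[-1])
```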
Data classification is a central data-mining technique used for sorting data, understanding data, and performing outcome predictions. In this small blog we will use the library Smile, which includes many methods for supervised and unsupervised data classification. We will write a small Python-like program using Jython to build a complex multilayer perceptron neural network for data classification. It will have a large number of inputs, several outputs, and can easily be extended to cases with many hidden layers. We will write a few lines of Jython code (most of our coding will deal with how to prepare an interface for reading data, rather than with neural network programming).
Numpy is a math library for Python. It enables us to do computation (on arrays, matrices, tensors, etc.) efficiently and effectively. In this article, I'm just going to introduce you to the basics of what is mostly required for machine learning and data science (and deep learning!).
Markov chains are a fairly common, and relatively simple, way to statistically model random processes. They have been used in many different domains, ranging from text generation to financial modeling. A popular example is r/SubredditSimulator, which uses Markov chains to automate the creation of content for an entire subreddit. Overall, Markov Chains are conceptually quite intuitive, and are very accessible in that they can be implemented without the use of any advanced statistical or mathematical concepts. They are a great way to start learning about probabilistic modeling and data science techniques.
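As the excerpt above says, a Markov chain text generator needs no advanced machinery at all. Here is a tiny word-level sketch in plain Python (entirely my own toy example, in the spirit of r/SubredditSimulator): each word maps to the list of words observed to follow it, and generation just samples from that list.

```python
# Minimal word-level Markov chain text generator.
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from `start`, stopping early at a dead end."""
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat the cat ate the fish"
chain = build_chain(corpus)
print(generate(chain, "the", 8, random.Random(0)))
```

Sampling from the raw follower lists automatically weights transitions by their observed frequency, which is all a first-order Markov model needs.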
In A Beginner’s Guide to Data Engineering – Part I, I explained that an organization’s analytics capability is built layer upon layer. From collecting raw data and building data warehouses to applying Machine Learning, we saw why data engineering plays a critical role in all of these areas. One of any data engineer’s most highly sought-after skills is the ability to design, build, and maintain data warehouses. I defined what data warehousing is and discussed its three common building blocks – Extract, Transform, and Load – which is where the name ETL comes from. For those who are new to ETL processes, I introduced a few popular open source frameworks built by companies like LinkedIn, Pinterest, and Spotify, and highlighted Airbnb’s own open-sourced tool, Airflow. Finally, I argued that data scientists can learn data engineering much more effectively with the SQL-based ETL paradigm.
We discussed using and deploying deep learning at scale. This is an empirical era for machine learning, and, as I noted in an earlier article, as successful as deep learning has been, our level of understanding of why it works so well is still lacking. In practice, machine learning engineers need to explore and experiment using different architectures and hyperparameters before they settle on a model that works for their specific use case. Training a single model usually involves big (labeled) data and big models; as such, exploring the space of possible model architectures and parameters can take days, weeks, or even months. Talwalkar has spent the last few years grappling with this problem as an academic researcher and as an entrepreneur. In this episode, he describes some of his related work on hyperparameter tuning, systems, and more.

### Gradients explode - Deep Networks are shallow - ResNet explained

So last night at the Paris Machine Learning meetup, we had the good folks from Snips making an announcement on the release/open-sourcing of their natural language understanding code. Joseph also mentioned that after much architecture search, a simple CRF model, a single-layer model, did as well as other commercial models. It's NLP, so the representability issue has already been parsed. In a different corner of the galaxy, the following paper seems to suggest that ResNets, while rendering these deep networks effectively shallower, do not solve the gradient explosion problem.

Abstract: Whereas it is believed that techniques such as Adam, batch normalization and, more recently, SeLU nonlinearities “solve” the exploding gradient problem, we show that this is not the case and that in a range of popular MLP architectures, exploding gradients exist and that they limit the depth to which networks can be effectively trained, both in theory and in practice. We explain why exploding gradients occur and highlight the *collapsing domain problem*, which can arise in architectures that avoid exploding gradients. ResNets have significantly lower gradients and thus can circumvent the exploding gradient problem, enabling the effective training of much deeper networks, which we show is a consequence of a surprising mathematical property. By noticing that *any neural network is a residual network*, we devise the *residual trick*, which reveals that introducing skip connections simplifies the network mathematically, and that this simplicity may be the major cause for their success.
TL;DR: We show that in contrast to popular wisdom, the exploding gradient problem has not been solved and that it limits the depth to which MLPs can be effectively trained. We show why gradients explode and how ResNet handles them.

In this work we propose a novel interpretation of residual networks showing that they can be seen as a collection of many paths of differing length. Moreover, residual networks seem to enable very deep networks by leveraging only the short paths during training. To support this observation, we rewrite residual networks as an explicit collection of paths. Unlike traditional models, paths through residual networks vary in length. Further, a lesion study reveals that these paths show ensemble-like behavior in the sense that they do not strongly depend on each other. Finally, and most surprising, most paths are shorter than one might expect, and only the short paths are needed during training, as longer paths do not contribute any gradient. For example, most of the gradient in a residual network with 110 layers comes from paths that are only 10-34 layers deep. Our results reveal one of the key characteristics that seem to enable the training of very deep networks: Residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks.
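The short-path intuition in these abstracts can be made concrete with a toy numerical illustration (my own, not from either paper): for a residual map y = x + f(x), the derivative is 1 + f'(x), so the identity path always contributes 1, whereas composing plain contractive layers shrinks the gradient toward zero.

```python
# Finite-difference check: skip connections keep gradients from vanishing.
def plain(x, layers):           # y = f_L(...f_1(x)...)
    for f in layers:
        x = f(x)
    return x

def residual(x, layers):        # y = x + f(x) at each layer
    for f in layers:
        x = x + f(x)
    return x

def grad(fn, x, eps=1e-6):      # central finite difference
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

# 20 identical "layers" that each scale by 0.1 (a contractive map).
layers = [lambda x: 0.1 * x] * 20

g_plain = grad(lambda x: plain(x, layers), 1.0)    # 0.1**20: vanishes
g_res = grad(lambda x: residual(x, layers), 1.0)   # 1.1**20 ~ 6.7: survives
print(g_plain, g_res)
```

The plain stack's gradient is 0.1^20, numerically indistinguishable from zero, while the residual stack's gradient stays well above 1, echoing the papers' point that the short (skip) paths carry the gradient.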

Deep Residual Learning for Image Recognition by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

### Whats new on arXiv

The irreversibility of trajectories in stochastic dynamical systems is linked to the structure of their causal representation in terms of Bayesian networks. We consider stochastic maps resulting from a time discretization with interval $\tau$ of signal-response models, and we find an integral fluctuation theorem that sets the backward transfer entropy as a lower bound to the conditional entropy production. We apply this to a linear signal-response model providing analytical solutions, and to a nonlinear model of receptor-ligand systems. We show that the observational time $\tau$ has to be fine-tuned for an efficient detection of the irreversibility in time series.
Tensor train is a hierarchical tensor network structure that helps alleviate the curse of dimensionality by parameterizing large-scale multidimensional data via a network of low-rank tensors. Associated with such a construction is a notion of a tensor train subspace, and in this paper we propose a TT-PCA algorithm for estimating this structured subspace from given data. By maintaining the low-rank tensor structure, TT-PCA is more robust to noise compared with PCA or Tucker-PCA. This is borne out numerically by testing the proposed approach on the Extended YaleFace Dataset B.
Fractal AI is a theory for general artificial intelligence. It allows us to derive new mathematical tools that constitute the foundations for a new kind of stochastic calculus, by modelling information using cellular automaton-like structures instead of smooth functions. In the included repository we present a new Agent, derived from the first principles of the theory, which is capable of solving Atari games several orders of magnitude more efficiently than other similar techniques, like Monte Carlo Tree Search. The code provided shows how it is now possible to beat some of the current state-of-the-art benchmarks on Atari games, without previous learning and using fewer than 1000 samples to calculate each action, where standard MCTS uses 3 million samples. Among other things, Fractal AI makes it possible to generate a huge database of top-performing examples with very little computation, transforming Reinforcement Learning into a supervised problem. The algorithm presented is capable of solving the exploration vs. exploitation dilemma in both the discrete and continuous cases, while maintaining control over any aspect of the Agent's behavior. More generally, the techniques presented here have direct applications to other areas such as non-equilibrium thermodynamics, chemistry, quantum physics, economics, information theory, and non-linear control theory.
In this work, we propose a new language modeling paradigm that has the ability to perform both prediction and moderation of information flow at multiple granularities: neural lattice language models. These models construct a lattice of possible paths through a sentence and marginalize across this lattice to calculate sequence probabilities or optimize parameters. This approach allows us to seamlessly incorporate linguistic intuitions – including polysemy and existence of multi-word lexical items – into our language model. Experiments on multiple language modeling tasks show that English neural lattice language models that utilize polysemous embeddings are able to improve perplexity by 9.95% relative to a word-level baseline, and that a Chinese model that handles multi-character tokens is able to improve perplexity by 20.94% relative to a character-level baseline.
Social and behavioral interventions are a critical tool for governments and communities to tackle deep-rooted societal challenges such as homelessness, disease, and poverty. However, real-world interventions are almost always plagued by limited resources and limited data, which creates a computational challenge: how can we use algorithmic techniques to enhance the targeting and delivery of social and behavioral interventions? The goal of my thesis is to provide a unified study of such questions, collectively considered under the name ‘algorithmic social intervention’. This proposal introduces algorithmic social intervention as a distinct area with characteristic technical challenges, presents my published research in the context of these challenges, and outlines open problems for future work. A common technical theme is decision making under uncertainty: how can we find actions which will impact a social system in desirable ways under limitations of knowledge and resources? The primary application area for my work thus far is public health, e.g. HIV or tuberculosis prevention. For instance, I have developed a series of algorithms which optimize social network interventions for HIV prevention. Two of these algorithms have been pilot-tested in collaboration with LA-area service providers for homeless youth, with preliminary results showing substantial improvement over status-quo approaches. My work also spans other topics in infectious disease prevention and underlying algorithmic questions in robust and risk-aware submodular optimization.
Retrieving the most similar objects in a large-scale database for a given query is a fundamental building block in many application domains, ranging from web searches, visual, cross media, and document retrievals. State-of-the-art approaches have mainly focused on capturing the underlying geometry of the data manifolds. Graph-based approaches, in particular, define various diffusion processes on weighted data graphs. Despite success, these approaches rely on fixed-weight graphs, making ranking sensitive to the input affinity matrix. In this study, we propose a new ranking algorithm that simultaneously learns the data affinity matrix and the ranking scores. The proposed optimization formulation assigns adaptive neighbors to each point in the data based on the local connectivity, and the smoothness constraint assigns similar ranking scores to similar data points. We develop a novel and efficient algorithm to solve the optimization problem. Evaluations using synthetic and real datasets suggest that the proposed algorithm can outperform the existing methods.
Paucity of large curated hand-labeled training data for every domain of interest forms a major bottleneck in the deployment of machine learning models in computer vision and other fields. Recent work (Data Programming) has shown how distant supervision signals in the form of labeling functions can be used to obtain labels for given data in near-constant time. In this work, we present Adversarial Data Programming (ADP), an adversarial methodology to generate data as well as curated aggregated labels, given a set of weak labeling functions. We validated our method on the MNIST, Fashion MNIST, CIFAR 10 and SVHN datasets, and it outperformed many state-of-the-art models. We conducted extensive experiments to study its usefulness, and showed how the proposed ADP framework can be used for transfer learning as well as multi-task learning, where data from two domains are generated simultaneously using the framework along with the label information. Our future work will involve understanding the theoretical implications of this new framework from a game-theoretic perspective, as well as exploring the performance of the method on more complex datasets.
Recently, deep learning based clustering methods have been shown to be superior to traditional ones by jointly conducting representation learning and clustering. These methods rely on the assumptions that the number of clusters is known, and that there is one single partition over the data and all attributes define that partition. However, in real-world applications, prior knowledge of the number of clusters is usually unavailable and there are multiple ways to partition the data based on subsets of attributes. To resolve these issues, we propose the latent tree variational autoencoder (LTVAE), which simultaneously performs representation learning and multidimensional clustering. LTVAE learns latent embeddings from data, discovers multi-facet clustering structures based on subsets of latent features, and automatically determines the number of clusters in each facet. Experiments show that the proposed method achieves state-of-the-art clustering performance and reveals reasonable multi-facet structures of the data.
Machine learning algorithms use error function minimization to fit a large set of parameters in a preexisting model. However, error minimization eventually leads to a memorization of the training dataset, losing the ability to generalize to other datasets. To achieve generalization something else is needed, for example a regularization method or stopping the training when error in a validation dataset is minimal. Here we propose a different approach to learning and generalization that is parameter-free, fully discrete and that does not use function minimization. We use the training data to find an algebraic representation with minimal size and maximal freedom, explicitly expressed as a product of irreducible components. This algebraic representation is shown to directly generalize, giving high accuracy in test data, more so the smaller the representation. We prove that the number of generalizing representations can be very large and the algebra only needs to find one. We also derive and test a relationship between compression and error rate. We give results for a simple problem solved step by step, hand-written character recognition, and the Queens Completion problem as an example of unsupervised learning. As an alternative to statistical learning, “algebraic learning” may offer advantages in combining bottom-up and top-down information, formal concept derivation from data and large-scale parallelization.
We propose NAMA (Newton-type Alternating Minimization Algorithm) for solving structured nonsmooth convex optimization problems where the sum of two functions is to be minimized, one being strongly convex and the other composed with a linear mapping. The proposed algorithm is a line-search method over a continuous, real-valued, exact penalty function for the corresponding dual problem, which is computed by evaluating the augmented Lagrangian at the primal points obtained by alternating minimizations. As a consequence, NAMA relies on exactly the same computations as the classical alternating minimization algorithm (AMA), also known as the dual proximal gradient method. Under standard assumptions the proposed algorithm possesses strong convergence properties, while under mild additional assumptions the asymptotic convergence is superlinear, provided that the search directions are chosen according to quasi-Newton formulas. Due to its simplicity, the proposed method is well suited for embedded applications and large-scale problems. Experiments show that using limited-memory directions in NAMA greatly improves the convergence speed over AMA and its accelerated variant.

### Book Memo: “Developing Bots with Microsoft Bots Framework”

 Create Intelligent Bots using MS Bot Framework and Azure Cognitive Services. Develop Intelligent Bots using Microsoft Bot Framework (C# and Node.js), Visual Studio Enterprise & Code, Microsoft Azure and Cognitive Services. This book shows you how to develop great Bots, publish them to Azure and register them with the Bot portal so that customers can connect and communicate using popular communication channels like Skype, Slack, Web and Facebook. You’ll also learn how to build intelligence into Bots using Azure Cognitive Services like LUIS, OCR, Speech to Text and Web Search. Bots are the new face of user experience. Conversational user interfaces provide many options to make the user experience richer, innovative and engaging, with email, text, buttons or voice as the medium for communication. Modern line-of-business applications can be replaced by, or associated with, Intelligent Bots that can use data/history combined with Machine Intelligence to make the user experience inclusive and exciting.

### Book Memo: “Artificial Intelligence and Games”

 This is the first textbook dedicated to explaining how artificial intelligence (AI) techniques can be used in and for games. After introductory chapters that explain the background and key techniques in AI and games, the authors explain how to use AI to play games, to generate content for games and to model players. The book will be suitable for undergraduate and graduate courses in games, artificial intelligence, design, human-computer interaction, and computational intelligence, and also for self-study by industrial game developers and practitioners. The authors have developed a website (http://www.gameaibook.org) that complements the material covered in the book with up-to-date exercises, lecture slides and reading.

### R Packages worth a look

Miscellaneous Functions (miscF)
Various functions for random number generation, density estimation, classification, curve fitting, and spatial data analysis.

Random Graph Clustering (mixer)
Estimates the parameters, the clusters, as well as the number of clusters of a (binary) stochastic block model (J.-J Daudin, F. Picard, S. Robin (2008) <doi:10.1007/s11222-007-9046-7>).

Simulate Dynamic Networks using Exponential Random Graph Models (ERGM) Family (dnr)
Functions are provided to fit temporal lag models to dynamic networks. The models are built on top of the exponential random graph model (ERGM) framework. There are functions for simulating or forecasting networks at future time points. See “Stable Multiple Time Step Simulation/Prediction from Lagged Dynamic Network Regression Models”, Mallik and Almquist (2017, under review).

Analysis of Repeatability and Reproducibility Studies with Ordinal Measurements (ordinalRR)
Implements Bayesian data analyses of balanced repeatability and reproducibility studies with ordinal measurements. Model fitting is based on MCMC posterior sampling with ‘rjags’. Function ordinalRR() directly carries out the model fitting, and this function has the flexibility to allow the user to specify key aspects of the model, e.g., fixed versus random effects. Functions for preprocessing data and for the numerical and graphical display of a fitted model are also provided. There are also functions for displaying the model at fixed (user-specified) parameters and for simulating a hypothetical data set at a fixed (user-specified) set of parameters for a random-effects rater population. For additional technical details, refer to Culp, Ryan, Chen, and Hamada (2018) and cite this Technometrics paper when referencing any aspect of this work. The demo of this package reproduces results from the Technometrics paper.

Fits a linear combination of predictors by maximizing a smooth approximation to the estimated covariate-adjusted area under the receiver operating characteristic curve (AUC) for a discrete covariate. (Meisner, A, Parikh, CR, and Kerr, KF (2017) <http://…/>.)

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

At rOpenSci, our R package peer review process relies on the hard work of many volunteer reviewers. These community members donate their time and expertise to improving the quality of rOpenSci packages and helping drive best practices into scientific software.

Our open review process, where reviews and reviewers are public, means that one benefit for reviewers is that they can get credit for their reviews. We want reviewers to see as much benefit as possible, and for their contributions to be recorded as part of the intellectual trail of academic work, so we have been working to make reviews more visible and discoverable.

That is why we are very excited about a tiny change in yesterday’s release of R 3.4.4.

If you are running R 3.4.3 and type utils:::MARC_relator_db_codes_used_with_R into the console, you get this:

> utils:::MARC_relator_db_codes_used_with_R
[1] "aut" "com" "ctr" "ctb" "cph" "cre" "dtc" "fnd" "ths" "trl"


Under 3.4.4, you get this:

> utils:::MARC_relator_db_codes_used_with_R
[1] "aut" "com" "ctr" "ctb" "cph" "cre" "dtc" "fnd" "rev" "ths" "trl"


What’s that little "rev" that shows up, third from right? It’s the official inclusion of “Reviewer” as an R package author role!

These three-letter codes come from the MARC (Machine-Readable Cataloging) terms vocabulary, a standard set of authorship types originally created for some of the first computerized library systems. R uses these codes to distinguish between different types of package authors. You may be familiar with some of these terms that show up in DESCRIPTION files, like so:

Authors@R: person("Scott", "Chamberlain", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-1444-9135"))


Here aut and cre stand for “Author” and “Creator”, indicating that Scott is the original and major creator of a package. You may have also seen ctb (Contributor) or cph (Copyright Holder1).

Standard descriptors like this are important because they allow for information about authorship to be machine-readable and credit for authors’ work to be cataloged and transferred. When metadata about R packages is displayed in help files or on websites, it’s clear the role everyone has played. Such metadata is also critical to transitive credit, the important task of tracking contributions through chains of dependencies so as to provide recognition to software developers and data providers that the traditional citation system often misses.

While there are many more2 MARC relator terms, R allows only a small set that makes sense in the context of software packages. These are found in utils:::MARC_relator_db_codes_used_with_R. Codes outside this set won't pass R CMD check and are not allowed on CRAN.
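If you want the human-readable meaning behind each permitted code, you can cross-reference the full relator table that ships with utils. A minimal sketch (both objects are unexported internals, so their structure may vary slightly between R versions):

```r
## Look up the MARC terms behind the codes R accepts, e.g. "aut" -> "Author".
## Both objects are unexported internals of utils, hence the ':::'.
db <- utils:::MARC_relator_db
allowed <- utils:::MARC_relator_db_codes_used_with_R
db[db$code %in% allowed, ]
```

This prints one row per allowed code alongside its full MARC term and description.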

We believe peer reviewers make an important contribution to the quality of published software. That's why last year we requested that R-core add "rev" (Reviewer) to the list of allowed contributor types. And lo and behold, Kurt Hornik made the change on our behalf3. It is now in the release version of R.

Since CRAN uses the development version of R to check and build packages, the option has actually been available on CRAN for a while. A trickle of authors have already been acknowledging peer reviewers this way in their package DESCRIPTION files.

We hope to see adoption of reviewer acknowledgement in package metadata beyond rOpenSci. It can be adopted by authors who submit to JSS, JOSS, or any journal or process where reviewers make significant comments on software code or documentation. For non-R software, we’re working on including reviewers in codemeta, a cross-language software metadata standard.

A few notes about how this development relates specifically to rOpenSci’s peer-review process:

• First, it is 100% the choice of package authors to decide whether reviewers made a sufficient contribution to be included in Authors in this way. While we promote this option in general, we’ll never ask an author to specifically include a reviewer. Like a manuscript’s acknowledgements section, the Author section is under developer control. It is also up to reviewers whether they want to be included, so package authors should ask reviewers first.

• Second, rOpenSci editors should not be listed under Authors. "edt" (Editor) is not a valid R authorship role, and we are a step too far removed to be included. But we are flattered by those who have asked.

• Finally, if you do include reviewers in this way, we think it’s best practice to include information linking back to the review, like so:

person("Bea", "Hernández", role = "rev",
comment = "Bea reviewed the package for rOpenSci, see
https://github.com/ropensci/onboarding/issues/116")


We are very excited about this development and how it can improve incentives for peer review. Thanks to R-core for getting aboard with this, and the early adopters who tested it!

Sincerely,

c(
  person("Noam", "Ross", role = c("aut", "cre", "lbt")),
  person("Maëlle", "Salmon", role = c("rev", "med"),
         comment = "Comments to improve structure of the introduction"),
  person("Karthik", "Ram", role = c("rev", "elt"),
         comment = "Fixed a small typo"),
  person("Scott", "Chamberlain", role = c("rev", "sce"),
         comment = "Agrees with Maëlle about the intro.")
)


1. I can’t get through this post without mentioning that Her Majesty the Queen in Right of Canada, as represented by the Minister of Natural Resources Canada, is cph on eight CRAN packages.
2. Found here or as a handy data frame with descriptions in utils:::MARC_relator_db
3. R-core also added "fnd" (Funder) in R 3.4.3.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

## March 15, 2018

### Magister Dixit

“Big data should complement small data, not replace them.” Rob Kitchin

### Teaching Computers to be Fair

My newest Bloomberg View piece just came out:

#### How to Teach a Computer What ‘Fair’ Means

##### If we’re going to rely on algorithms, we’ll have to figure it out.

To read all of my Bloomberg View pieces, go here.

### Apple: Data Scientist

Seeking an outstanding data scientist who is interested in building and maintaining analytical solutions that have direct and measurable impact to Apple.

### Apple: Commerce Data Scientist – Apple Media Products

Seeking a talented, experienced Applied Researcher/Data Scientist to work on high visibility projects that affect millions of customers globally.

### Document worth reading: “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”

For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

### We Need To Talk About Dashboards

Hey everyone, gather ’round. We need to talk about dashboards.

For C-level executives, dashboard reports are essential. Executives don't have time to review the details behind every decision they make; they just want to consume a report with red, yellow, and green that helps them make decisions for the day. But the need for such dashboards is just as real for the cubicle-dwelling system administrators, who need dashboards to help them understand where to focus their efforts each day to keep operations running.

I’m here today to tell you that your dashboards are a failure.

## Why Dashboards Fail

More than 90% of the data in the world has been created in the past two years.

Don’t take my word for it, though. I’m just citing a statement made in this article. Published more than four years ago, that article is the cited source in presentations on how data continues to grow at an exponential rate.

I believe we continue to create, and curate, data at an accelerated pace with each passing year. Today we have access to more data than ever before. Everyone I meet will say they manage more data today than a year ago.

The data explosion has given rise to a never-ending marketing lexicon. The first one I remember being used widely was data warehouse. That was soon followed by a data mart. Today we also have a data factory and a data lake, which is a nice feature to have next to our data estate, built with data bricks.

With so much data available, information is cheap. Today it is easy to get data about anything. We are drowning in data, inundated with metrics with every step of our day.

The trouble with such easy access to data is this: When information is so cheap, attention becomes expensive.

## I’m Looking Through You

Here’s an experiment for you to try. Watch this video and count the number of passes between the people in white shirts:

It’s an old study, and you may have seen it before. If you haven’t seen it before, let me know if you are surprised by the results.

This is part of the problem with dashboards: they are being read by humans. And humans, as it turns out, can have difficulty determining what is important. The experiment helps to show that there is an area of our visual cortex that determines what is important and filters out everything else. In other words, we gain a lot of data when we focus our attention, but we can miss a gorilla staring back at us.

Focusing is a great thing for us humans, and this experiment helps to show why multi-tasking is something we shouldn’t be doing. Dashboards are meant to provide that focus. We don’t want to spend the time examining all the data streams.

## Spot the Difference

Here’s another experiment for you. Remember those “spot the difference” games? Here’s why your brain is so bad at them.

When we look at a dashboard we don’t take in everything that we see. Our brains don’t bother logging details about something that is not important. Just like the gorilla. Of course, once we see it, we don’t forget it.

Dashboards that contain an overload of information require more focus, which means less information is being consumed. This is not the desired outcome.

## Dashboards are a Horrible Way to Communicate

The trouble with such dashboards is that they are a horrible way to communicate.

Dashboards need data in order to exist. Good dashboards are able to communicate the story the data is trying to tell. But the data contains the details necessary for that story, and those details are often left behind. Summaries, aggregations, and averages blur the details from our view. Offering users the ability to drill through to get the details is a workaround, but the whole point of a dashboard is to avoid having to review the details. Remember, it is better for us humans to be able to focus.

A common example I often use to explain when dashboards aren't useful involves disk space usage. Let's say that a disk is at 90% of capacity, and the dashboard shows a big red circle for this metric. The trouble now is that you are missing important details. A 1 TB disk at 90% full is a different situation than a 10 TB disk at 90% full. You also need to know how full the disk was yesterday, what the growth trend has been over time, and when the disk will be completely full.
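To make that concrete, here is a minimal sketch, with invented numbers, of the calculation a red circle hides: given a few days of free-space readings, estimate how many days remain before the disk fills.

```r
## Estimate days until a disk is full from daily free-space readings (GB).
## The readings below are made up for illustration.
days_until_full <- function(free_gb) {
  daily_change <- mean(diff(free_gb))  # average GB gained or lost per day
  if (daily_change >= 0) return(Inf)   # free space isn't shrinking
  tail(free_gb, 1) / -daily_change     # days until free space reaches zero
}

free_gb <- c(150, 140, 130, 120, 110, 100)  # six daily readings
days_until_full(free_gb)                    # → 10
```

The same red circle appears for both a 1 TB and a 10 TB disk at 90%, but a trend calculation like this one separates "full in ten days" from "full in ten months".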

While those details might help you figure out what steps to take next, they do little for your end user. This dashboard reporting a disk at 90% has little meaning to the end user that only wants to be able to get their work done for the day.

## Summary

Dashboards are not new, they’ve been around for years. It’s the ease with which they are created and consumed that has driven demand. You get a dashboard, and you get a dashboard, and everyone gets a dashboard. The phrase “pin it to your dashboard” has become common for users of tools such as PowerBI.

But with so much data coming across our desks each day, we need the data to communicate with everyone in a way they can understand.

Saying your disk is 90% full is not nearly as effective as saying that you only have space for three more Netflix movie downloads. That’s a story anyone can understand. Even simple things like bar charts do a better job of communicating the story the data is trying to tell. And I have yet to meet a manager who doesn’t understand a bar chart.
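As a toy version of that translation (the 7 GB per download is an assumed figure, purely for illustration):

```r
## Turn raw free space into a story a non-technical user understands.
## 7 GB per movie download is an assumed, illustrative number.
movies_remaining <- function(free_gb, movie_gb = 7) {
  floor(free_gb / movie_gb)
}
movies_remaining(25)  # → 3 more movie downloads
```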

Those of us that work in IT are always asking for more. We want more space, more memory, more CPU, more bandwidth.

It’s time we also ask for more ways for our data to tell a story that everyone can understand.

And don’t get me started on pie charts.

The post We Need To Talk About Dashboards appeared first on Thomas LaRock.

### R 3.4.4 released

R 3.4.4 has been released, and binaries for Windows, Mac, and Linux are now available for download on CRAN. This update (codenamed "Someone to Lean On" — likely a Peanuts reference, though I couldn't find which one with a quick search) is a minor bugfix release, and shouldn't cause any compatibility issues with scripts or packages written for prior versions of R in the 3.4.x series.

This update improves automatic timezone detection on some systems, and adds fixes for some unusual corner cases in the statistics library. For a complete list of the changes, check the NEWS file for R 3.4.4 or follow the link below.

R-announce mailing list: R 3.4.4 is released


### JupyterCon 2018, NYC August 21–25

Discover how data-driven organizations are using Jupyter to analyze data, share insights, and foster practices for dynamic, reproducible data science.

I’m grateful to join Fernando Pérez and Brian Granger as a program co-chair for JupyterCon 2018. Project Jupyter, NumFOCUS, and O’Reilly Media will present the second annual JupyterCon in New York City on August 21–25.

Timing for this event couldn’t be better. The human side of data science, machine learning/AI, and scientific computing is more important than ever. This is seen in the broad adoption of data-driven decision making in human organizations of all kinds, the increasing importance of human centered design in tools for working with data, the urgency for better data insights in the face of complex socioeconomic conditions worldwide, as well as dialogue about the social issues these technologies bring to the fore: collaboration, security, ethics, data privacy, transparency, propaganda, etc.

To paraphrase our co-chairs, Brian Granger:

“Jupyter is where humans and data science intersect”

and Fernando Pérez:

“The better the technology, the more important that human judgement becomes”

Consequently, we’ll explore three main themes at JupyterCon:

• Interactive computing with data at scale: the technical best practices and organizational challenges of supporting interactive computing in companies, universities, research collaborations, etc. (JupyterHub)
• Extensible user interfaces for data science, machine learning/AI, and scientific computing (JupyterLab)
• Computational communication: taking the artifacts of interactive computing and communicating them to different audiences

A meta-theme which ties these together is extensible software architecture for interactive computing with data. Jupyter is built on a set of flexible, extensible, and reusable building blocks which can be combined and assembled to address a wide range of use cases. These building blocks are expressed through the various open protocols, APIs, and standards of Jupyter.

The Jupyter community has much to discuss and share this year. For example, success stories such as the data science program at UC Berkeley illustrate the power of JupyterHub deployments at scale, in education, research, and industry. As universities and enterprise firms learn to handle the technical challenges of rolling out hands-on, interactive computing at scale, a cohort of organizational challenges comes to the fore: practices regarding collaboration, security, compliance, data privacy, ethics, etc. These points are especially poignant in verticals such as healthcare, finance and education, where the handling of sensitive data is rightly constrained by ethical and legal requirements (HIPAA, FERPA, etc.). Overall, this dialogue is extremely relevant — it is happening at the intersection of contemporary political and social issues, industry concerns, new laws (GDPR), the evolution of computation, plus good storytelling and communication in general — as we’ll explore with practitioners throughout the conference.

The recent beta release of JupyterLab embodies the meta-theme of extensible software architecture for interactive computing with data. While many people think of Jupyter as a “notebook,” that’s merely one building block needed for interactive computing with data. Other building blocks include terminals, file browsers, LaTeX, markdown, rich outputs, text editors, and renderers/viewers for different data formats. JupyterLab is the next-generation user interface for Project Jupyter, and provides these different building blocks in a flexible, configurable, customizable environment. This opens the door for Jupyter users to build custom workflows, and also for organizations to extend JupyterLab with their own custom functionality.

Thousands of organizations require data infrastructure for reporting, sharing data insights, reproducing results of analytics, etc. Recent business studies estimate that more than half of all companies globally are precluded from adopting AI technologies due to a lack of digital infrastructure — often because their efforts toward data and reporting infrastructure are buried in technical debt. So much of that infrastructure was built from scratch, even when organizations needed essentially the same building blocks. JupyterLab’s primary goal is to make it routine to build highly customized, interactive computing platforms, while supporting more than 90 different popular programming environments.

A third major theme builds on top of the other two: computational communication. For data and code to be useful for humans, who need to make decisions, they have to be embedded into a narrative — a story — that can be communicated to others. Examples of this pattern include: data journalism, reproducible research and open science, computational narratives, open data in society and government, citizen science, and really any area of scientific research (physics, zoology, chemistry, astronomy, etc.), plus the range of economics, finance, and econometric forecasting.

Another growing segment of use cases involves Jupyter as a “last-mile” layer for leveraging AI resources in the cloud. This becomes especially important in light of new hardware emerging for AI needs, vying with competing demand from online gaming, virtual reality, cryptocurrency mining, etc.

Please take the following as personal opinion, observations, and perspectives: We’ve reached a point where hardware appears to be evolving more rapidly than software, while software appears to be evolving more rapidly than effective process. At O’Reilly Media we work to map the emerging themes in industry, in a process nicknamed “radar”. This perspective about hardware is a theme I’ve been mapping, and meanwhile comparing notes with industry experts. A few data points to consider: Jeff Dean’s talk at NIPS 2017, “Machine Learning for Systems and Systems for Machine Learning”, about comparisons of CPUs/GPUs/TPUs, and how AI is transforming the design of computer hardware; The Case for Learned Index Structures, also from Google, about the impact of “branch vs. multiply” costs on decades of database theory; the podcast interview “Scaling machine learning” with Reza Zadeh about the critical importance of hardware/software interfaces in AI apps; the video interview that Wes McKinney and I recorded at JupyterCon 2017 about how Apache Arrow presents a much different take on how to leverage hardware and distributed resources.

The notion that “hardware > software > process” contradicts the past 15–20 years of software engineering practice. It’s an inversion of the general assumptions we make. In response, industry will need to rework approaches for building software within the context of AI — which was articulated succinctly by Lenny Pruss from Amplify Partners in “Infrastructure 3.0: Building blocks for the AI revolution”. In this light, Jupyter provides an abstraction layer — a kind of buffer to help “future proof” — for complex use cases in NLP, machine learning, and related work. We’re seeing this from most of the public cloud vendors, who are also leaders in AI (Google, Amazon, Microsoft, IBM, etc.), and who will be represented at the conference in August.

Our program at JupyterCon will feature expert speakers across all of these themes. However, to me, that’s merely the tip of the iceberg. So much of the real value that I get from conferences happens in the proverbial “Hallway Track”, where you run into people who are riffing off news they’ve just learned in a session — perhaps in line with your thinking, perhaps in a completely different direction. Those conversations have space to flourish when people get immersed in the community, the issues, the possibilities.

It’ll be a busy week. We’ll have two days of training courses: intensive, hands-on coding, lots of interaction with expert instructors. Training will overlap with one day of tutorials: led by experts, generally larger than the training courses, though more detailed than session talks, featuring lots of Q&A.

Then we’ll have two days of keynotes and session talks, expo hall, lunches and sponsored breaks, plus Project Jupyter sponsored events. Events include Jupyter User Testing, author signings, “Meet the Experts” office hours, demos in the vendor expo hall — plus related meetups in the evenings. Last year the Poster Session was one of the biggest surprises to me: it was difficult to move through the room; the walkways were packed with people asking presenters questions about their projects.

This year we’ll introduce a Business Summit, similar to the popular summits at Strata Data Conference and The AI Conf. This will include high-level presentations on the most promising and important developments in Jupyter for executives and decision-makers. Brian Granger and I will be hosting the Business Summit, along with Joel Horwitz of IBM. One interesting data point: among the regional events, we’ve seen much more engagement this year from enterprise and government than we’d expected, more emphasis on business use cases and new product launches. The ecosystem is growing, and will be represented well at JupyterCon!

We will also feature an Education Track in the main conference, expanding on the well-attended Education Birds-of-a-Feather and related talks during JupyterCon 2017. Use of Jupyter in education has grown rapidly across many contexts: middle/high-school, universities, corporate training, and online courses. Lorena Barba and Robert Talbert will be organizing this track.

Following our schedule of conference talks, the week wraps up with a community sprint day on Saturday. You can work side-by-side with leaders and contributors in the Jupyter ecosystem to implement that feature you’ve always wanted, fix bugs, work on design, write documentation, test software, or dive deep into the internals of something in the Jupyter ecosystem. Be sure to bring your laptop.

Note that we believe true innovation depends on hearing from, and listening to, people with a variety of perspectives. Please read our Diversity Statement for more details. Also, we’re committed to creating a safe and productive environment for everyone at all of our events. Please read our Code of Conduct. Last year we were able to work with the community plus matching donations to provide several Diversity & Inclusion scholarships, as well as more than a dozen student scholarships. Looking forward to building on that this year!

That’s a sample of what’s coming up for JupyterCon in NYC this August. Meanwhile, we’ll be helping present and sponsor regional community events to help build momentum for the conference.

We look forward to many opportunities to showcase new work and ideas, to meet each other, to learn about the architecture of the project itself, and to contribute to the future of Jupyter.

Sign-up for email updates on the JupyterCon web site. See you there!

[kudos to Brian Granger for help developing and editing this article]

JupyterCon 2018, NYC August 21–25 was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

### Data and Analytics Cutting-Edge Events in San Francisco and Chicago

Join us in San Francisco or Chicago this spring for the next chapters of the world-renowned Data & Analytics Innovation events, bringing together the top minds in Big Data and Analytics across industries.

### R Packages worth a look

Incrementally Build Complex Plots using Natural Semantics (wheatmap)
Builds complex plots, heatmaps in particular, using natural semantics. Bigger plots can be assembled using directives such as ‘LeftOf’, ‘RightOf’, ‘TopOf’, and ‘Beneath’, among others. Other features include clustering, dendrograms and integration with ‘ggplot2’ generated grid objects. This package is particularly designed for bioinformaticians to assemble complex plots for publication.

Generalized Integration Model (gim)
Implements the generalized integration model, which integrates individual-level data and summary statistics under a generalized linear model framework. It supports continuous and binary outcomes, modeled by linear and logistic regression, respectively.

Pathway Enrichment Analysis Utilizing Active Subnetworks (pathfindR)
Pathway enrichment analysis enables researchers to uncover mechanisms underlying a phenotype. pathfindR is a tool for pathway enrichment analysis utilizing active subnetworks. It identifies active subnetworks in a protein-protein interaction network using a user-provided list of genes, then performs pathway enrichment analyses on the identified subnetworks. pathfindR also offers functionality to cluster enriched pathways and identify representative pathways. The method is described in detail in Ulgen E, Ozisik O, Sezerman OU. 2018. pathfindR: An R Package for Pathway Enrichment Analysis Utilizing Active Subnetworks. bioRxiv. <doi:10.1101/272450>.
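The enrichment step at the heart of tools like pathfindR is typically an over-representation test: given a background of N genes, a pathway containing K of them, and an input list of n genes of which k fall in the pathway, how surprising is an overlap of k or more? As a language-agnostic sketch (in Python rather than R, and covering only the enrichment test, not the active-subnetwork search), this can be computed with the hypergeometric distribution; the function name below is illustrative, not part of the pathfindR API:

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """Hypergeometric over-representation test.

    N: total genes in the background set
    K: genes annotated to the pathway
    n: genes in the user's input list
    k: overlap between the input list and the pathway
    Returns P(X >= k), the probability of seeing an overlap at
    least this large by chance when drawing n genes from N.
    """
    denom = comb(N, n)
    # Sum the upper tail of the hypergeometric distribution.
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / denom

# Example: 5 of 5 input genes land in a 5-gene pathway,
# out of a 10-gene background -- a highly unlikely overlap.
p = enrichment_pvalue(N=10, K=5, n=5, k=5)
```

In practice the resulting p-values would be corrected for multiple testing across pathways (e.g. Bonferroni or Benjamini-Hochberg), which pathfindR also handles.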

A ‘Java’ Platform Integration for ‘R’ with Programming Languages ‘Groovy’, ‘JavaScript’, ‘JRuby’ (‘Ruby’), ‘Jython’ (‘Python’), and ‘Kotlin’ (jsr223)
Provides a high-level integration for the ‘Java’ platform that makes ‘Java’ objects easy to use from within ‘R’; provides a unified interface to integrate ‘R’ with several programming languages; and features extensive data exchange between ‘R’ and ‘Java’. The ‘jsr223’-supported programming languages include ‘Groovy’, ‘JavaScript’, ‘JRuby’ (‘Ruby’), ‘Jython’ (‘Python’), and ‘Kotlin’. Any of these languages can use and extend ‘Java’ classes in natural syntax. Furthermore, solutions developed in any of the ‘jsr223’-supported languages are also accessible to ‘R’ developers. The ‘jsr223’ package also features callbacks, script compiling, and string interpolation. In all, ‘jsr223’ significantly extends the computing capabilities of the ‘R’ software environment.

Parsimonious Gaussian Mixture Models (pgmm)
Carries out model-based clustering or classification using parsimonious Gaussian mixture models. See McNicholas and Murphy (2008) <doi:10.1007/s11222-008-9056-0>, McNicholas (2010) <doi:10.1016/j.jspi.2009.11.006>, and McNicholas and Murphy (2010) <doi:10.1093/bioinformatics/btq498>.
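Model-based clustering of this kind fits a Gaussian mixture by expectation-maximization; pgmm's contribution is constraining the covariance structure of multivariate mixtures to keep parameter counts small. As a minimal sketch of the underlying EM fitting (in Python rather than R, for the unconstrained univariate two-component case only, so it does not demonstrate pgmm's parsimonious covariance models themselves):

```python
import math
import random

def em_gmm_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (weights, means, variances) for the two components.
    """
    # Initialize: means at the data extremes, equal weights,
    # and the pooled variance for both components.
    mu = [min(xs), max(xs)]
    overall = sum(xs) / len(xs)
    var = [sum((x - overall) ** 2 for x in xs) / len(xs)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                    / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2
                         for r, x in zip(resp, xs)) / nj
    return w, mu, var

# Synthetic data: two well-separated clusters.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(8.0, 1.0) for _ in range(200)])
weights, means, variances = em_gmm_1d(data)
```

On this data the recovered means land near 0 and 8 with roughly equal weights. In d dimensions an unconstrained mixture needs O(d^2) covariance parameters per component, which is exactly the growth the parsimonious models in pgmm are designed to tame.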