My Data Science Blogs

March 17, 2018

Document worth reading: “A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications”

A graph is an important data representation that appears in a wide diversity of real-world scenarios. Effective graph analytics provides users with a deeper understanding of what is behind the data, and thus can benefit many useful applications such as node classification, node recommendation, link prediction, etc. However, most graph analytics methods suffer from high computation and space costs. Graph embedding is an effective yet efficient way to solve the graph analytics problem. It converts the graph data into a low-dimensional space in which the graph structural information and graph properties are maximally preserved. In this survey, we conduct a comprehensive review of the literature on graph embedding. We first introduce the formal definition of graph embedding as well as related concepts. After that, we propose two taxonomies of graph embedding which correspond to what challenges exist in different graph embedding problem settings and how the existing work addresses these challenges in its solutions. Finally, we summarize the applications that graph embedding enables and suggest four promising future research directions in terms of computation efficiency, problem settings, techniques and application scenarios. A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications


Book Memo: “Hybrid Intelligence for Social Networks”

This book explains aspects of social networks, varying from development and application of new artificial intelligence and computational intelligence techniques for social networks to understanding the impact of social networks. Chapters 1 and 2 deal with the basic strategies towards social networks such as mining text from such networks and applying social network metrics using a hybrid approach; Chapters 3 to 8 focus on the prime research areas in social networks: community detection, influence maximization and opinion mining. Chapters 9 to 13 concentrate on studying the impact and use of social networks in society, primarily in education, commerce, and crowdsourcing. The contributions provide a multidimensional approach, and the book will serve graduate students and researchers as a reference in computer science, electronics engineering, communications, and information technology.


Book Memo: “Data Science Landscape”

Towards Research Standards and Protocols
The edited volume deals with different contours of data science with special reference to data management for the research innovation landscape. Data is becoming pervasive in all spheres of human, economic and development activity. In this context, it is important to take stock of what is being done in the data management area and to begin to prioritize, consider and formulate the adoption of a formal data management system, including citation protocols for use by research communities in different disciplines, and also to address various technical research issues. The volume thus focuses on some of these issues, drawing typical examples from various domains.


What is not but could be if

And if I can remain there I will say – Baby Dee

Obviously this is a blog that loves the tabloids. But as we all know, the best stories are the ones that confirm your own prior beliefs (because those must be true). So I’m focussing on this article in Science that talks about how STEM undergraduate programmes in the US lose gay and bisexual students. This leaky pipeline narrative (that diversity is smaller the further you go in a field because minorities drop out earlier) is pretty common when you talk about diversity in STEM. But this article says that there are now numbers! So let’s have a look…

And when you’re up there in the cold, hopin’ that your knot will hold and swingin’ in the snow…

From the article:

The new study looked at a 2015 survey of 4162 college seniors at 78 U.S. institutions, roughly 8% of whom identified as LGBQ (the study focused on sexual identity and did not consider transgender status). All of the students had declared an intention to major in STEM 4 years earlier. Overall, 71% of heterosexual students and 64% of LGBQ students stayed in STEM. But looking at men and women separately uncovered more complexity. After controlling for things like high school grades and participation in undergraduate research, the study revealed that heterosexual men were 17% more likely to stay in STEM than their LGBQ male counterparts. The reverse was true for women: LGBQ women were 18% more likely than heterosexual women to stay in STEM.

Ok. There’s a lot going on here. First things first, let’s say a big hello to Simpson’s paradox! Although LGBQ people have a lower attainment rate in STEM, it’s driven by men going down and women going up. I think the thing that we can read straight off this is that there are “base rate” problems happening all over the place. (Note that the effect is similar across the two groups and in opposite directions, yet the combined total is fairly strongly aligned with the male effect.) We are also talking about a dropout of around 120 of the 333 LGBQ students in the survey. So the estimate will be noisy.
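To make the base-rate point concrete, here is a toy calculation. The cell counts below are invented (the article reports only the rates), but they are chosen so that LGBQ women do better within their gender, LGBQ men do worse, and the pooled comparison still lands at the reported aggregate numbers:

```python
# Hypothetical cell counts (stayed, total) -- NOT the study's actual data,
# just numbers that reproduce the reported aggregate pattern.
het_men,  lgbq_men = (1480, 2000), (120, 200)    # 74% vs 60%: men drop
het_wom,  lgbq_wom = (1220, 1829), (93, 133)     # ~67% vs ~70%: women gain

def pooled(a, b):
    return (a[0] + b[0]) / (a[1] + b[1])

# Within each gender the signs differ, yet the pooled comparison
# tracks the male effect -- a Simpson's-paradox-style aggregation.
print(round(pooled(het_men, het_wom), 2))    # ~0.71, as reported
print(round(pooled(lgbq_men, lgbq_wom), 2))  # 0.64, as reported
```

The pooled LGBQ figure is dominated by the larger male gap in these made-up counts, which is exactly the kind of base-rate effect that makes the aggregate misleading.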

I’m less worried about forking paths–I don’t think it’s unreasonable to expect the experience to differ across gender. Why? Well there is a well known problem with gender diversity in STEM.  Given that gay women are potentially affected by two different leaky pipelines, it sort of makes sense that the interaction between gender and LGBQ status would be important.

The actual article does better–it’s all done with multilevel logistic regression, which seems like an appropriate tool. There are p-values everywhere, but that’s just life. I struggled to work out from the paper exactly what the model was (sometimes my eyes just glaze over…), but it seems to have been done fairly well.

As with anything however (see also Gayface), the study is only as generalizable as the data set. The survey seems fairly large, but I’d worry about non-response. And, if I’m honest with you, me at 18 would’ve filled out that survey as straight, so there are also some problems there.

My father’s affection for his crowbar collection was Freudian to say the least

So a very shallow read of the paper makes it seem like the stats is good enough. But what if it’s not? Does that really matter?

This is one of those effects that’s anecdotally expected to be true. But more importantly, a lot of the proposed fixes are the types of low-cost interventions that don’t really need to work very well to be “value for money”.

For instance, it’s suggested that STEM departments work to make LGBT+ visibility more prominent (have visible, active inclusion policies). They suggest that people teaching pay attention to diversity in their teaching material.

The common suggestion for the last point is to pay special attention to work by women and under-represented groups in your teaching. This is never a bad thing, but if you’re teaching something very old (like the central limit theorem or differentiation), there’s only so much you can do. The thing that we all have a lot more control over is our examples and exercises. It is a no-cost activity to replace, for example, “Bob and Alice” with “Barbra and Alice” or “Bob and Alex”.

This type of low-impact diversity work signals to students that they are in a welcoming environment. Sometimes this is enough.

A similar example (but further up the pipeline) is that when you’re interviewing PhD students, postdocs, researchers, or faculty, don’t ask the men if they have a wife. Swapping to a gender neutral catch-all (partner) is super-easy. Moreover, it doesn’t force a person who is not in an opposite gender relationship to throw themselves a little pride parade (or, worse, to let the assumption fly because they’re uncertain if the mini-pride parade is a good idea in this context). Partner is a gender-neutral term. They is a gender-neutral pronoun. They’re not hard to use.

These environmental changes are important. In the end, if you value science you need to value diversity. Losing women, racial and ethnic minorities, LGBT+ people, disabled people, and other minorities really means that you are making your talent pool more shallow. A deeper pool leads to better science and creating a welcoming, positive environment is a serious step towards deepening the pool.

In defence of half-arsed activism

Making a welcoming environment doesn’t fix STEM’s diversity problem. There is a lot more work to be done. Moreover, the ideas in the paragraph above may do very little to improve the problem. They are also fairly quiet solutions–no one knows you’re doing these things on purpose. That is, they are half-arsed activism.

The thing is, as much as it’s lovely to have someone loudly on my side when I need it, I mostly just want to feel welcome where I am. So this type of work is actually really important. No one will ever give you a medal, but that doesn’t make it less appreciated.

The other thing to remember is that sometimes half-arsed activism is all that’s left to you. If you’re a student, or a TA, or a colleague, you can’t singlehandedly change your work environment. More than that, if a well-intentioned-but-loud intervention isn’t carefully thought through it may well make things worse. (For example, a proposal at a previous workplace to ensure that all female students (about 400 of them) have a female faculty mentor (about 7 of them) would’ve put a completely infeasible burden on the female faculty members.)

So don’t discount low-key, low-cost, potentially high-value interventions. They may not make things perfect, but they can make things better and maybe even “good enough”.

The post What is not but could be if appeared first on Statistical Modeling, Causal Inference, and Social Science.


If you did not already know

Optimal Matching Analysis (OMA) google
Optimal matching is a sequence analysis method used in social science to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort), classical tools (such as cluster analysis) can be used. The method was tailored to the social sciences from a technique originally introduced to study molecular biology (protein or genetic) sequences. Optimal matching uses the Needleman-Wunsch algorithm.
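The core dynamic program can be sketched in a few lines of Python. The unit costs below are a simplifying assumption for illustration; real OMA studies supply substantively motivated substitution and indel cost matrices:

```python
def om_distance(a, b, sub_cost=1, indel_cost=1):
    """Needleman-Wunsch-style alignment cost between two state sequences.

    Unit costs are an illustrative assumption; optimal matching analyses
    normally use domain-specific costs per pair of states.
    """
    n, m = len(a), len(b)
    # dp[i][j]: minimal cost of aligning the first i states of a
    # with the first j states of b
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost),
                dp[i - 1][j] + indel_cost,  # drop a state from a
                dp[i][j - 1] + indel_cost,  # insert a state from b
            )
    return dp[n][m]

# Two careers coded year-by-year, E = employed, U = unemployed:
print(om_distance("EEUUE", "EUUEE"))
```

The resulting pairwise distance matrix over a cohort is what then feeds downstream tools such as cluster analysis.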

Apache Flink google
Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink includes several APIs for creating applications that use the Flink engine:
1. DataSet API for static data embedded in Java, Scala, and Python,
2. DataStream API for unbounded streams embedded in Java and Scala, and
3. Table API with a SQL-like expression language embedded in Java and Scala.
Flink also bundles libraries for domain-specific use cases:
1. Machine Learning library, and
2. Gelly, a graph processing API and library.
Flink integrates easily with other well-known open source systems, both for data input and output and for deployment.

Disciplined Convex Optimization google
An object-oriented modeling language for disciplined convex programming (DCP). It allows the user to formulate convex optimization problems in a natural way following mathematical convention and DCP rules. The system analyzes the problem, verifies its convexity, converts it into a canonical form, and hands it off to an appropriate solver to obtain the solution.
“Disciplined Convex Programming”
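The “rules” part of DCP can be illustrated with a toy curvature calculus. This is only a sketch of the composition rules; it is not the API of CVX or any other DCP system:

```python
CONVEX, CONCAVE, AFFINE = "convex", "concave", "affine"

def add_curvature(c1, c2):
    """Curvature of a sum under the DCP composition rules."""
    if AFFINE in (c1, c2):        # affine + anything keeps the other curvature
        return c1 if c2 == AFFINE else c2
    if c1 == c2:                  # convex+convex, concave+concave
        return c1
    raise ValueError("convex + concave cannot be verified by DCP rules")

def negate_curvature(c):
    """Negation flips convex and concave; affine stays affine."""
    return {CONVEX: CONCAVE, CONCAVE: CONVEX, AFFINE: AFFINE}[c]

# square(x) is convex and 2x + 1 is affine, so their sum is convex:
print(add_curvature(CONVEX, AFFINE))
# sqrt(x) + sqrt(y) is concave, so its negation is convex:
print(negate_curvature(add_curvature(CONCAVE, CONCAVE)))
```

A real DCP analyzer applies rules like these recursively over the whole expression tree to verify convexity before canonicalizing the problem and handing it to a solver.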


R Tip: Use stringsAsFactors = FALSE

R tip: use stringsAsFactors = FALSE.

R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.


Sigmund Freud, it is often claimed, said: “Sometimes a cigar is just a cigar.”

To avoid problems delay re-encoding of strings by using stringsAsFactors = FALSE when creating data.frames.


d <- data.frame(label = rep("tbd", 5))

d$label[[2]] <- "north"
#> Warning in `[[<-.factor`(`*tmp*`, 2, value = structure(c(1L, NA, 1L, 1L, :
#> invalid factor level, NA generated

print(d)
#>   label
#> 1   tbd
#> 2  <NA>
#> 3   tbd
#> 4   tbd
#> 5   tbd

Notice our new value was not copied in!

The fix is easy: use stringsAsFactors = FALSE.

d <- data.frame(label = rep("tbd", 5),
                stringsAsFactors = FALSE)

d$label[[2]] <- "north"

print(d)
#>   label
#> 1   tbd
#> 2 north
#> 3   tbd
#> 4   tbd
#> 5   tbd

As is often the case: base R works okay in default mode and works very well if you judiciously change a few defaults. There is much less need to whole-hog replace R functionality than some claim.

Note: the above pattern of pre-building a data.frame and filling in values by addressing row/column index sets is a very effective (and underappreciated) way to build up data (often easier and quicker than binding rows or columns).


What We Talk About When We Talk About Bias

Shira Mitchell wrote:

I gave a talk today at Mathematica about NHST in low power settings (Type M/S errors). It was fun and the discussion was great.

One thing that came up is bias from doing some kind of regularization/shrinkage/partial-pooling versus selection bias (confounding, nonrandom samples, etc). One difference (I think?) is that the first kind of bias decreases with sample size, but the latter won’t. Though I’m not sure how comforting that is in small-sample settings. I’ve read this post which emphasizes that unbiased estimates don’t actually exist, but I’m not sure how relevant this is.

I replied that the error is to think that an “unbiased” estimate is a good thing. See p.94 of BDA.

And then Shira shot back:

I think what is confusing to folks is when you use unbiasedness as a principle here, for example here:

Ahhhh, good point! I was being sloppy. One difficulty is that in classical statistics, there are two similar-sounding but different concepts, unbiased estimation and unbiased prediction. For Bayesian inference we talk about calibration, which is yet another way that an estimate can be correct on average.

The point of my above-linked BDA excerpt is that, in some settings, unbiased estimation is not just a nice idea that can’t be done in practice or can be improved in some ways; rather it’s an actively bad idea that leads to terrible estimates. The key is that classical unbiased estimation requires E(theta.hat|theta) = theta for any theta, and, given that some outlying regions of theta are highly unlikely, the unbiased estimate has to be a contortionist in order to get things right for those values.
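A quick simulation (mine, not from BDA or the post) makes this concrete: when the reasonable values of theta are known to be small, a deliberately biased shrinkage estimate beats the unbiased one on average squared error:

```python
import random

random.seed(1)
tau = 0.5                        # scale of plausible theta values
shrink = tau**2 / (tau**2 + 1)   # Bayes shrinkage factor for unit noise
n = 100_000

se_raw = se_shrunk = 0.0
for _ in range(n):
    theta = random.gauss(0, tau)             # a "reasonable" parameter value
    y = random.gauss(theta, 1)               # one noisy observation
    se_raw += (y - theta) ** 2               # unbiased estimate: y itself
    se_shrunk += (shrink * y - theta) ** 2   # estimate biased toward zero

mse_raw, mse_shrunk = se_raw / n, se_shrunk / n
print(round(mse_raw, 2), round(mse_shrunk, 2))  # roughly 1.0 vs 0.2
```

Conditional on any single theta the shrunken estimate is biased, but averaged over the range of reasonable theta values it is far more accurate, which is exactly the sense in which unbiasedness can be an actively bad target.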

But in certain settings the idea of unbiasedness is relevant, as in the linked post above where we discuss the problems of selection bias. And, indeed, type M and type S errors are defined with respect to the true parameter values. The key difference is that we’re estimating these errors—these biases—conditional on reasonable values of the underlying parameters. We’re not interested in these biases conditional on unreasonable values of theta.

Subtle point, worth thinking about carefully. Bias is important, but only conditional on reasonable values of theta.

P.S. Thanks to Jaime Ashander for the above picture.

The post What We Talk About When We Talk About Bias appeared first on Statistical Modeling, Causal Inference, and Social Science.


Distilled News

R 3.4.4 released

R 3.4.4 has been released, and binaries for Windows, Mac, and Linux are now available for download on CRAN. This update (codenamed ‘Someone to Lean On’ — likely a Peanuts reference, though I couldn’t find which one with a quick search) is a minor bugfix release, and shouldn’t cause any compatibility issues with scripts or packages written for prior versions of R in the 3.4.x series. This update improves automatic timezone detection on some systems, and adds fixes for some unusual corner cases in the statistics library. For a complete list of the changes, check the NEWS file for R 3.4.4 or follow the link below.

Introduction to Numpy – Part II

The final part of the introduction to Numpy. In this second part, we are going to see a few functions for creating specific arrays, and then computation between two arrays. You can find the first part of the Numpy introduction here.

How we grew from 0 to 4 million women on our fashion app, with a vertical machine learning approach

Three years ago we launched Chicisimo; our goal was to offer automated outfit advice. Today, with over 4 million women on the app, we want to share how our data and machine learning approach helped us grow. It’s been chaotic, but it is now under control.

Using Evolutionary AutoML to Discover Neural Network Architectures

The brain has evolved over a long time, from very simple worm brains 500 million years ago to a diversity of modern structures today. The human brain, for example, can accomplish a wide variety of activities, many of them effortlessly — telling whether a visual scene contains animals or buildings feels trivial to us, for example. To perform activities like these, artificial neural networks require careful design by experts over years of difficult research, and typically address one specific task, such as to find what’s in a photograph, to call a genetic variant, or to help diagnose a disease. Ideally, one would want to have an automated method to generate the right architecture for any given task. One approach to generate these architectures is through the use of evolutionary algorithms. Traditional research into neuro-evolution of topologies (e.g. Stanley and Miikkulainen 2002) has laid the foundations that allow us to apply these algorithms at scale today, and many groups are working on the subject, including OpenAI, Uber Labs, Sentient Labs and DeepMind. Of course, the Google Brain team has been thinking about AutoML too. In addition to learning-based approaches (e.g. reinforcement learning), we wondered if we could use our computational resources to programmatically evolve image classifiers at unprecedented scale. Can we achieve solutions with minimal expert participation? How good can today’s artificially-evolved neural networks be? We address these questions through two papers.
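The evolutionary loop itself is simple to sketch. The toy below evolves bit-strings toward all-ones rather than neural architectures, and the age-out step is my stand-in for the regularized-evolution idea, not the papers' exact procedure:

```python
import random

random.seed(0)

def mutate(genome):
    """Flip one random bit -- the architecture-mutation stand-in."""
    i = random.randrange(len(genome))
    return genome[:i] + [1 - genome[i]] + genome[i + 1:]

# Population of candidate "architectures"; fitness = number of ones.
population = [[random.randint(0, 1) for _ in range(16)] for _ in range(8)]
for _ in range(200):
    parent = max(random.sample(population, 3), key=sum)  # tournament select
    population.append(mutate(parent))                    # add the child
    population.pop(0)                                    # age out the oldest

print(sum(max(population, key=sum)))  # fitness climbs toward 16
```

In the real systems the genome encodes a network architecture and "fitness" is validation accuracy after training, which is where the unprecedented computational scale comes in.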

Creating a simple text classifier using Google CoLaboratory

Google CoLaboratory is Google’s latest contribution to AI, wherein users can code in Python using a Chrome browser in a Jupyter-like environment. In this article I have shared a method, and code, to create a simple binary text classifier using Scikit Learn within Google CoLaboratory environment.

Visualizing MonteCarlo Simulation Results: Mean vs Median

Simulation studies are used in a wide range of areas, from risk management to epidemiology, and of course in statistics. The MonteCarlo package provides tools to automate the design of this kind of simulation study in R. The user only has to specify the random experiment he or she wants to conduct and the number of replications; the rest is handled by the package. So far, the main tool for analyzing the results was to look at LaTeX tables generated using the MakeTable() function. Now, the new package version 1.0.5 contains the function MakeFrame(), which makes it possible to represent the simulation results as a data frame. This makes it very easy to visualize the results using standard tools such as dplyr and ggplot2. Here, I will demonstrate some of these concepts for a simple example that could be part of an introductory statistics course: the comparison of the mean and the median as estimators for the expected value. For an introduction to the MonteCarlo package click here or consult the package vignette.
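The mean-versus-median comparison can also be sketched in a few lines of standard-library Python (the post itself does this with the R MonteCarlo package and ggplot2):

```python
import random
import statistics

random.seed(42)

def simulate(n_obs, n_reps):
    """Monte Carlo MSE of mean and median as estimators of E[X] = 0."""
    mse_mean = mse_median = 0.0
    for _ in range(n_reps):
        sample = [random.gauss(0, 1) for _ in range(n_obs)]
        mse_mean += statistics.mean(sample) ** 2
        mse_median += statistics.median(sample) ** 2
    return mse_mean / n_reps, mse_median / n_reps

mse_mean, mse_median = simulate(n_obs=25, n_reps=2000)
# For normal data the mean is the more efficient estimator
# (the asymptotic MSE ratio approaches pi/2 in the median's disfavor).
print(round(mse_mean, 3), round(mse_median, 3))
```

Each replication draws a fresh sample and records both estimates' squared errors, which is exactly the random experiment the MonteCarlo package automates and tabulates for you.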

Quick Feature Engineering with Dates Using

The library is a collection of supplementary wrappers for a host of popular machine learning libraries, designed to remove the necessity of writing your own functions to take care of some repetitive tasks in a machine learning workflow.

PyCon.DE 2018 & PyData Karlsruhe; October 24 – 27

PyCon.DE is where Pythonistas in Germany can meet to learn about new and upcoming Python libraries, tools, software and data science. We welcome Python enthusiasts, programmers and data scientists from around the world to join us in Karlsruhe this year.
We expect 400 participants for PyCon.DE 2018 Karlsruhe. The conference will last 3 days and include about 60 talks, tutorials and hands-on sessions. Python is a programming language which has found application and friends in many areas. Due to its popularity in science, Python has experienced a meteoric rise in the data science community over the past few years. At the conference, we expect a broad and interesting mix of Pythonistas including roles such as:
• Software Developer
• Data Scientist
• System Administrator
• Academic Scientist
• Technology Enthusiast


What’s new on arXiv

Minimal I-MAP MCMC for Scalable Structure Discovery in Causal DAG Models

Learning a Bayesian network (BN) from data can be useful for decision-making or discovering causal relationships. However, traditional methods often fail in modern applications, which exhibit a larger number of observed variables than data points. The resulting uncertainty about the underlying network as well as the desire to incorporate prior information recommend a Bayesian approach to learning the BN, but the highly combinatorial structure of BNs poses a striking challenge for inference. The current state-of-the-art methods such as order MCMC are faster than previous methods but prevent the use of many natural structural priors and still have running time exponential in the maximum indegree of the true directed acyclic graph (DAG) of the BN. We here propose an alternative posterior approximation based on the observation that, if we incorporate empirical conditional independence tests, we can focus on a high-probability DAG associated with each order of the vertices. We show that our method allows the desired flexibility in prior specification, removes timing dependence on the maximum indegree and yields provably good posterior approximations; in addition, we show that it achieves superior accuracy, scalability, and sampler mixing on several datasets.

Event Correlation and Forecasting over Multivariate Streaming Sensor Data

Event management in sensor networks is a multidisciplinary field involving several steps across the processing chain. In this paper, we discuss the major steps that should be performed in real- or near real-time event handling including event detection, correlation, prediction and filtering. First, we discuss existing univariate and multivariate change detection schemes for the online event detection over sensor data. Next, we propose an online event correlation scheme that intends to unveil the internal dynamics that govern the operation of a system and are responsible for the generation of various types of events. We show that representation of event dependencies can be accommodated within a probabilistic temporal knowledge representation framework that allows the formulation of rules. We also address the important issue of identifying outdated dependencies among events by setting up a time-dependent framework for filtering the extracted rules over time. The proposed theory is applied on the maritime domain and is validated through extensive experimentation with real sensor streams originating from large-scale sensor networks deployed in ships.

Theory and Algorithms for Forecasting Time Series

We present data-dependent learning bounds for the general scenario of non-stationary non-mixing stochastic processes. Our learning guarantees are expressed in terms of a data-dependent measure of sequential complexity and a discrepancy measure that can be estimated from data under some mild assumptions. We also provide a novel analysis of a stable time series forecasting algorithm using this new notion of discrepancy. We use our learning bounds to devise new algorithms for non-stationary time series forecasting for which we report some preliminary experimental results.

Capturing Structure Implicitly from Time-Series having Limited Data

Scientific fields such as insider-threat detection and highway-safety planning often lack sufficient amounts of time-series data to estimate statistical models for the purpose of scientific discovery. Moreover, the available limited data are quite noisy. This presents a major challenge when estimating time-series models that are robust to overfitting and have well-calibrated uncertainty estimates. Most of the current literature in these fields involve visualizing the time-series for noticeable structure and hard coding them into pre-specified parametric functions. This approach is associated with two limitations. First, given that such trends may not be easily noticeable in small data, it is difficult to explicitly incorporate expressive structure into the models during formulation. Second, it is difficult to know a priori the most appropriate functional form to use. To address these limitations, a nonparametric Bayesian approach was proposed to implicitly capture hidden structure from time series having limited data. The proposed model, a Gaussian process with a spectral mixture kernel, precludes the need to pre-specify a functional form and hard code trends, is robust to overfitting and has well-calibrated uncertainty estimates.

SentEval: An Evaluation Toolkit for Universal Sentence Representations

We introduce SentEval, a toolkit for evaluating the quality of universal sentence representations. SentEval encompasses a variety of tasks, including binary and multi-class classification, natural language inference and sentence similarity. The set of tasks was selected based on what appears to be the community consensus regarding the appropriate evaluations for universal sentence representations. The toolkit comes with scripts to download and preprocess datasets, and an easy interface to evaluate sentence encoders. The aim is to provide a fairer, less cumbersome and more centralized way for evaluating sentence representations.

Advancing Connectionist Temporal Classification With Attention Modeling

In this study, we propose advancing all-neural speech recognition by directly incorporating attention modeling within the Connectionist Temporal Classification (CTC) framework. In particular, we derive new context vectors using time convolution features to model attention as part of the CTC network. To further improve attention modeling, we utilize content information extracted from a network representing an implicit language model. Finally, we introduce vector based attention weights that are applied on context vectors across both time and their individual components. We evaluate our system on a 3400-hour Microsoft Cortana voice assistant task and demonstrate that our proposed model consistently outperforms the baseline model achieving about 20% relative reduction in word error rates.

Large Margin Deep Networks for Classification

We present a formulation of deep learning that aims at producing a large margin classifier. The notion of margin, minimum distance to a decision boundary, has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with a preset feature representation; and conventional margin methods for neural networks only enforce margin at the output layer. Such methods are therefore not well suited for deep networks. In this work, we propose a novel loss function to impose a margin on any chosen set of layers of a deep network (including input and hidden layers). Our formulation allows choosing any norm on the metric measuring the margin. We demonstrate that the decision boundary obtained by our loss has nice properties compared to standard classification loss functions. Specifically, we show improved empirical results on the MNIST, CIFAR-10 and ImageNet datasets on multiple tasks: generalization from small training sets, corrupted labels, and robustness against adversarial perturbations. The resulting loss is general and complementary to existing data augmentation (such as random/adversarial input transform) and regularization techniques (such as weight decay, dropout, and batch norm).

Sylvester Normalizing Flows for Variational Inference

Variational inference relies on flexible approximate posterior distributions. Normalizing flows provide a general recipe to construct flexible variational posteriors. We introduce Sylvester normalizing flows, which can be seen as a generalization of planar flows. Sylvester normalizing flows remove the well-known single-unit bottleneck from planar flows, making a single transformation much more flexible. We compare the performance of Sylvester normalizing flows against planar flows and inverse autoregressive flows and demonstrate that they compare favorably on several datasets.

Word2Bits – Quantized Word Vectors

Word vectors require significant amounts of memory and storage, posing issues to resource limited devices like mobile phones and GPUs. We show that high quality quantized word vectors using 1-2 bits per parameter can be learned by introducing a quantization function into Word2Vec. We furthermore show that training with the quantization function acts as a regularizer. We train word vectors on English Wikipedia (2017) and evaluate them on standard word similarity and analogy tasks and on question answering (SQuAD). Our quantized word vectors not only take 8-16x less space than full precision (32 bit) word vectors but also outperform them on word similarity tasks and question answering.
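The flavor of such a quantization function can be sketched as follows; the specific level values here are an illustrative assumption, not lifted from the paper:

```python
def quantize(x, bits=1):
    """Snap a weight to one of 2**bits levels (illustrative values)."""
    if bits == 1:
        return 1/3 if x >= 0 else -1/3        # sign plus fixed magnitude
    if bits == 2:
        if x >= 0.25:
            return 0.375
        if x >= 0:
            return 0.125
        if x >= -0.25:
            return -0.125
        return -0.375
    raise ValueError("only 1- and 2-bit levels sketched here")

# A 1-bit vector needs only one bit per dimension at serialization time,
# which is where the large space savings over 32-bit floats come from.
weights = [0.8, -0.05, 0.12, -0.9]
print([quantize(w, bits=1) for w in weights])
```

In the paper the quantization is applied inside training itself (rather than after the fact), which is what lets it act as a regularizer.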

A Study of Recent Contributions on Information Extraction

This paper reports on modern approaches in Information Extraction (IE) and its two main sub-tasks of Named Entity Recognition (NER) and Relation Extraction (RE). Basic concepts and the most recent approaches in this area are reviewed, which mainly include Machine Learning (ML) based approaches and the more recent trend to Deep Learning (DL) based methods.

Neural Network Quine

Self-replication is a key aspect of biological life that has been largely overlooked in Artificial Intelligence systems. Here we describe how to build and train self-replicating neural networks. The network replicates itself by learning to output its own weights. The network is designed using a loss function that can be optimized with either gradient-based or non-gradient-based methods. We also describe a method we call regeneration to train the network without explicit optimization, by injecting the network with predictions of its own parameters. The best solution for a self-replicating network was found by alternating between regeneration and optimization steps. Finally, we describe a design for a self-replicating neural network that can solve an auxiliary task such as MNIST image classification. We observe that there is a trade-off between the network’s ability to classify images and its ability to replicate, but training is biased towards increasing its specialization at image classification at the expense of replication. This is analogous to the trade-off between reproduction and other tasks observed in nature. We suggest that a self-replication mechanism for artificial intelligence is useful because it introduces the possibility of continual improvement through natural selection.

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent

In this paper, we present GossipGraD – a gossip communication protocol based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall communication complexity from {\Theta}(log(p)) for p compute nodes in well-studied SGD to O(1), 2) model diffusion such that compute nodes exchange their updates (gradients) indirectly after every log(p) steps, 3) rotation of communication partners for facilitating direct diffusion of gradients, 4) asynchronous distributed shuffle of samples during the feedforward phase in SGD to prevent over-fitting, 5) asynchronous communication of gradients for further reducing the communication cost of SGD and GossipGraD. We implement GossipGraD for GPU and CPU clusters and use NVIDIA GPUs (Pascal P100) connected with InfiniBand, and Intel Knights Landing (KNL) connected with Aries network. We evaluate GossipGraD using the well-studied ImageNet-1K dataset (~250GB), and widely studied neural network topologies such as GoogLeNet and ResNet50 (current winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)). Our performance evaluation using both KNL and Pascal GPUs indicates that GossipGraD can achieve perfect efficiency for these datasets and their associated neural network topologies. Specifically, for ResNet50, GossipGraD is able to achieve ~100% compute efficiency using 128 NVIDIA Pascal P100 GPUs – while matching the top-1 classification accuracy published in literature.
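
The interplay of features 2) and 3) can be illustrated with a hypercube-style partner rotation, sketched below. This is an illustration of the general gossip idea (O(1) partners per step, full diffusion after log(p) steps), not GossipGraD's exact schedule.

```python
import math

def gossip_partner(rank: int, step: int, p: int) -> int:
    # each node talks to exactly one partner per step, and partners
    # rotate through the log2(p) hypercube dimensions
    dim = step % int(math.log2(p))
    return rank ^ (1 << dim)

# simulate diffusion: each node starts knowing only its own update
p = 8
known = [{r} for r in range(p)]
for step in range(int(math.log2(p))):
    merged = [set(s) for s in known]
    for r in range(p):
        merged[r] |= known[gossip_partner(r, step, p)]
    known = merged
```

After log2(p) rounds every node has (indirectly) seen every other node's update, even though each round costs only a single pairwise exchange per node.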

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Review of Multi-Agent Algorithms for Collective Behavior: a Structural Taxonomy
Subexponential-Time and FPT Algorithms for Embedded Flat Clustered Planarity
Cake-Cutting with Different Entitlements: How Many Cuts are Needed?
Computer-aided diagnosis of lung carcinoma using deep learning – a pilot study
SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping
Targeted change detection in remote sensing images
Removing Skill Bias from Gaming Statistics
Braess paradox in a network with stochastic dynamics and fixed strategies
CLT for supercritical branching processes with heavy-tailed branching law
Improving Object Counting with Heatmap Regulation
Challenges in Discriminating Profanity from Hate Speech
Logical Gates via Gliders Collisions
Computational complexity of the avalanche problem on one dimensional Kadanoff sandpiles
A Distributed Architecture for Edge Service Orchestration with Guarantees
Max-Min Greedy Matching
Probing the non-Debye low frequency excitations in glasses through random pinning
Sequential and exact formulae for the subdifferential of nonconvex integral functionals
Unpaired Image Captioning by Language Pivoting
Lowering the Upper Bounds of the Cost of Robust Distributed Controllers Beyond Quadratic Invariance
Self-Supervised Monocular Image Depth Learning and Confidence Estimation
Low coherence unit norm tight frames
Evaluation of Dense 3D Reconstruction from 2D Face Images in the Wild
Hovering stochastic oscillations in self-organized critical systems
Tutte Invariants for Alternating Dimaps
Context-Aware Mixed Reality: A Framework for Ubiquitous Interaction
A Game-Theoretic Framework for the Virtual Machines Migration Timing Problem
Testing the homogeneity of risk differences with sparse count data
Geometric duality and parametric duality for multiple objective linear programs are equivalent
A Simple and Effective Approach to the Story Cloze Test
Object Detection in Video with Spatiotemporal Sampling Networks
Advancing Acoustic-to-Word CTC Model
Achieving Human Parity on Automatic Chinese to English News Translation
Improving GANs Using Optimal Transport
Global Stabilization for Causally Consistent Partial Replication
Facelet-Bank for Fast Portrait Manipulation
The Penetration Effect of Connected Automated Vehicles in Urban Traffic: An Energy Impact Study
$\texttt{A2BCD}$: An Asynchronous Accelerated Block Coordinate Descent Algorithm With Optimal Complexity
On the Underspread/Overspread Classification of Random Processes
Micky: A Cheaper Alternative for Selecting Cloud Instances
Deep Adaptive Attention for Joint Facial Action Unit Detection and Face Alignment
Variational Message Passing with Structured Inference Networks
On the insufficiency of existing momentum schemes for Stochastic Optimization
Deriving constant coefficient linear recurrences for enumerating standard Young tableaux of periodic shape
Tuning a random field mechanism in a frustrated magnet
Reconstructing Gaussian sources by spatial sampling
Hydrodynamic Limit of the Inhomogeneous $\ell$-TASEP with Open Boundaries: Derivation and Solution
Generalized Proximal Smoothing for Phase Retrieval
Demyanov-Ryabova conjecture is false
A generalized projection-based scheme for solving convex constrained optimization problems
Fast End-to-End Trainable Guided Filter
Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate
Measure-valued branching processes associated with Neumann nonlinear semiflows
Optimal Weight Allocation of Dynamic Distribution Networks and Positive Semi-definiteness of Signed Laplacians
Resource Allocation in NOMA based Fog Radio Access Networks
Graph codes and local systems
Existence of (Markovian) solutions to martingale problems associated with Lévy-type operators
LEGO: Learning Edge with Geometry all at Once by Watching Videos
Relaxed Locally Correctable Codes in Computationally Bounded Channels
Laws of large numbers for Hayashi-Yoshida-type functionals
Kolmogorov equations associated to the stochastic 2D Euler equations
HFL-RC System at SemEval-2018 Task 11: Hybrid Multi-Aspects Model for Commonsense Reading Comprehension
Fast Subspace Clustering Based on the Kronecker Product
Does agricultural subsidies foster Italian southern farms? A Spatial Quantile Regression Approach
Mm-wave specific challenges in designing 5G transceiver architectures and air-interfaces
Structure Regularized Neural Network for Entity Relation Classification for Chinese Literature Text
Conditional Model Selection in Mixed-Effects Models with cAIC4
Performance and Impairment Modelling for Hardware Components in Millimetre-wave Transceivers
Motion optimization and parameter identification for a human and lower-back exoskeleton model
Spectral radii of asymptotic mappings and the convergence speed of the standard fixed point algorithm
The Hot Hand in Professional Darts
Training of Convolutional Networks on Multiple Heterogeneous Datasets for Street Scene Semantic Segmentation
Efficient First-order Methods for Convex Minimization: a Constructive Approach
Harmonic functions of random walks in a semigroup via ladder heights
Some insights into the behaviour of millimetre wave spectrum on key 5G cellular KPIs
Achieving Spatial Scalability for Coded Caching over Wireless Networks
Combinatorial analogs of topological zeta functions
i-HUMO: An Interactive Human and Machine Cooperation Framework for Entity Resolution with Quality Guarantees
Approximating Max-Cut under Graph-MSO Constraints
Exploring Linear Relationship in Feature Map Subspace for ConvNets Compression
Diverse M-Best Solutions by Dynamic Programming
Rearrangement with Nonprehensile Manipulation Using Deep Reinforcement Learning
What Catches the Eye? Visualizing and Understanding Deep Saliency Models
FDD Massive MIMO via UL/DL Channel Covariance Extrapolation and Active Channel Sparsification
A two-stage method for estimating the association between blood pressure variability and cardiovascular disease: An application using the ARIC Study
Salient Region Segmentation
Unmanned Aerial Vehicle Assisted Cellular Communication
PAC-Reasoning in Relational Domains
$\mathfrak{q}$-crystal structure on primed tableaux and on signed unimodal factorizations of reduced words of type $B$
Gaussian Processes Over Graphs
Minimax optimal rates for Mondrian trees and forests
Aggregated Sparse Attention for Steering Angle Prediction
Temporal Human Action Segmentation via Dynamic Clustering
Hierarchical Species Sampling Models
RUSSE’2018: A Shared Task on Word Sense Induction for the Russian Language
Deep architectures for learning context-dependent ranking functions
Hypergraph Saturation Irregularities
A geometric model for the module category of a gentle algebra
Stability analysis by dynamic dissipation inequalities: On merging frequency-domain techniques with time-domain conditions
Hyperbolic Geometry and Amplituhedra in 1+2 dimensions
On a General Dynamic Programming Approach for Decentralized Stochastic Control
OFDM-Autoencoder for End-to-End Learning of Communications Systems
2D Reconstruction of Small Intestine’s Interior Wall
RUSSE: The First Workshop on Russian Semantic Similarity
Dynamic Approximate Matchings with an Optimal Recourse Bound
Local Spectral Graph Convolution for Point Set Feature Learning
Enriching Frame Representations with Distributionally Induced Senses
A policy iteration algorithm for nonzero-sum stochastic impulse games
Statistical harmonization and uncertainty assessment in the comparison of satellite and radiosonde climate variables
Asymptotic theory for longitudinal data with missing responses adjusted by inverse probability weights
Effective Connectivity from Single Trial fMRI Data by Sampling Biologically Plausible Models
Joint Turbo Receiver for LDPC-Coded MIMO Systems Based on Semi-definite Relaxation
Pseudo Mask Augmented Object Detection
Learned Iterative Decoding for Lossy Image Compression Systems
The complexity of comparing multiply-labelled trees by extending phylogenetic-tree metrics
Distributed Data Vending on Blockchain
Virtual CNN Branching: Efficient Feature Ensemble for Person Re-Identification
Deep Structure Inference Network for Facial Action Unit Recognition
Strategies to facilitate access to detailed geocoding information using synthetic data
Maxiset point of view for signal detection in inverse problems
The Laplace transform of the lognormal distribution
Identifiability of dynamical networks with partial node measurements
Coulomb-gas electrostatics controls large fluctuations of the KPZ equation
Blow-up results for space-time fractional stochastic partial differential equations
Contrasting information theoretic decompositions of modulatory and arithmetic interactions in neural information processing systems
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

Continue Reading…


Read More

Book Memo: “Machine Learning Techniques for Online Social Networks”

The book covers tools in the study of online social networks such as machine learning techniques, clustering, and deep learning. A variety of theoretical aspects, application domains, and case studies for analyzing social network data are covered. The aim is to provide new perspectives on utilizing machine learning and related scientific methods and techniques for social network analysis. Machine Learning Techniques for Online Social Networks will appeal to researchers and students in these fields.

Continue Reading…


Read More

Book Memo: “Computerized Adaptive and Multistage Testing with R”

Using Packages catR and mstR
The goal of this guide and manual is to provide a practical and brief overview of the theory on computerized adaptive testing (CAT) and multistage testing (MST) and to illustrate the methodologies and applications using the open-source R language and several data examples. Implementation relies on the R packages catR and mstR that have been already or are being developed by the first author (with the team) and that include some of the newest research algorithms on the topic. The book covers many topics along with the R code: the basics of R, a theoretical overview of CAT and MST, CAT designs, CAT assembly methodologies, CAT simulations, the catR package, CAT applications, MST designs, IRT-based MST methodologies, tree-based MST methodologies, the mstR package, and MST applications. CAT has been used in many large-scale assessments over recent decades, and MST has become very popular in recent years. The open-source R language has also become one of the most useful tools for applications in almost all fields, including business and education. Though very useful and popular, R has a steep learning curve. Given the clear need for CAT and MST, and the complexity of implementing them, it is very difficult for users to simulate or implement CAT and MST. Until this manual, there has been no book showing users how to design and use CAT and MST easily and without expense, i.e., by using the free R software. All examples and illustrations are generated using predefined scripts in R, available for free download from the book’s website.

Continue Reading…


Read More

Announcing the winners of the Facebook Communications & Networking research awards

We are pleased to announce the winners of the Facebook Communications & Networking research awards. Continued research and innovation is key to building next generation communications and networking systems. By sponsoring research and collaborating across a wide range of networking research areas we expect to share new insights with the broader networking community.

The Facebook Communications & Networking award winners and their topic areas are:

Network Control Plane Verification at Scale
David Walker, Princeton University

End-to-End Transport for Multi-User Video QoE Optimization
Mohammad Alizadeh, Massachusetts Institute of Technology

Integrating IPv6 Segment Routing and Modern Transport Protocols
Olivier Bonaventure, Université catholique de Louvain, Louvain-la-Neuve

Automated Repair and Verification of Firewalls
Ruzica Piskac, Yale University

High Performance Server Packet Processing
Thomas Anderson, University of Washington

Scaling Distributed Storage with Programmable Switches
Xin Jin, Johns Hopkins University

Navigating the Latency-Quality Tradeoff in Personalized Live Video Streaming
Rashmi Vinayak, Carnegie Mellon University

Continue Reading…


Read More

Magister Dixit

“What makes a good metric?
Here are some rules of thumb for what makes a good metric: a number that will drive the changes you’re looking for.
A good metric is comparative.
Being able to compare a metric to other time periods, groups of users, or competitors helps you understand which way things are moving. “Increased conversion from last week” is more meaningful than “2% conversion”.
A good metric is understandable.
If people can’t remember it and discuss it, it’s much harder to turn a change in the data into a change in the culture.
A good metric is a ratio or a rate.
Accountants and financial analysts have several ratios they look at to understand, at a glance, the fundamental health of a company. You need some, too.
There are several reasons ratios tend to be the best metrics:
1 Ratios are easier to act on. Think about driving a car. Distance travelled is informational. But speed (distance per hour) is something you can act on, because it tells you about your current state, and whether you need to go faster or slower to get to your destination on time.
2 Ratios are inherently comparative. If you compare a daily metric to the same metric over a month, you’ll see whether you’re looking at a sudden spike or a long-term trend. In a car, speed is one metric, but speed right now over average speed this hour shows you a lot about whether you’re accelerating or slowing down.
3 Ratios are also good for comparing factors that are somehow opposed, or for which there’s an inherent tension. In a car, this might be distance covered divided by traffic tickets. The faster you drive, the more distance you cover, but the more tickets you get. This ratio might suggest whether or not you should be breaking the speed limit.
A good metric changes the way you behave.
This is by far the most important criterion for a metric: what will you do differently based on changes in the metric?
1 “Accounting” metrics like daily sales revenue, when entered into your spreadsheet, need to make your predictions more accurate. These metrics form the basis of Lean Startup’s innovation accounting, showing you how close you are to an ideal model and whether your actual results are converging on your business plan.
2 “Experimental” metrics, like the results of a test, help you to optimize the product, pricing, or market. Changes in these metrics will significantly change your behavior. Agree on what that change will be before you collect the data: if the pink website generates more revenue than the alternative, you’re going pink; if more than half your respondents say they won’t pay for a feature, don’t build it; if your curated MVP doesn’t increase order size by 30%, try something else. Drawing a line in the sand is a great way to enforce a disciplined approach.
A good metric changes the way you behave precisely because it’s aligned to your goals of keeping users, encouraging word of mouth, acquiring customers efficiently, or generating revenue.
If you want to choose the right metrics, you need to keep five things in mind:
1 Qualitative versus quantitative metrics
Qualitative metrics are unstructured, anecdotal, revealing, and hard to aggregate; quantitative metrics involve numbers and statistics, and provide hard numbers but less insight.
2 Vanity versus actionable metrics
Vanity metrics might make you feel good, but they don’t change how you act. Actionable metrics change your behavior by helping you pick a course of action.
3 Exploratory versus reporting metrics
Exploratory metrics are speculative and try to find unknown insights to give you the upper hand, while reporting metrics keep you abreast of normal, managerial, day-to-day operations.
4 Leading versus lagging metrics
Leading metrics give you a predictive understanding of the future; lagging metrics explain the past. Leading metrics are better because you still have time to act on them: the horse hasn’t left the barn yet.
5 Correlated versus causal metrics
If two metrics change together, they’re correlated, but if one metric causes another metric to change, they’re causal. If you find a causal relationship between something you want (like revenue) and something you can control (like which ad you show), then you can change the future
Analysts look at specific metrics that drive the business, called key performance indicators (KPIs). Every industry has KPIs-if you’re a restaurant owner, it’s the number of covers (tables) in a night; if you’re an investor, it’s the return on an investment; if you’re a media website, it’s ad clicks; and so on.”
Alistair Croll, Benjamin Yoskovitz ( 2013 )

Continue Reading…


Read More

March 16, 2018

Bob’s talk at Berkeley, Thursday 22 March, 3 pm

It’s at the Institute for Data Science at Berkeley.

And here’s the abstract:

I’ll provide an end-to-end example of using R and Stan to carry out full Bayesian inference for a simple set of repeated binary trial data: Efron and Morris’s classic baseball batting data, with multiple players observed for many at bats; clinical trial, educational testing, and manufacturing quality control problems have the same flavor.

We will consider three models that provide complete pooling (every player is the same), no pooling (every player is independent), and partial pooling (every player is to some degree like every other player). Hierarchical models allow the degree of similarity to be jointly modeled with individual effects, tightening estimates and sharpening predictions compared to the no pooling and complete pooling models. They also outperform empirical Bayes and max marginal likelihood predictively, both of which rely on point estimates of hierarchical parameters (aka “mixed effects”). I’ll show how to fit observed data to make predictions for future observations, estimate event probabilities, and carry out (multiple) comparisons such as ranking. I’ll explain how hierarchical modeling mitigates the multiple comparison problem by partial pooling (and I’ll tie it into rookie of the year effects and sophomore slumps). Along the way, I will show how to evaluate models predictively, preferring those that are well calibrated and make sharp predictions. I’ll also show how to evaluate model fit to data with posterior predictive checks and Bayesian p-values.
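
The three pooling regimes can be sketched numerically. The snippet below uses simulated data and a fixed, hypothetical between-player standard deviation; the hierarchical models in the talk estimate that quantity jointly with the player effects, which is the whole point.

```python
import numpy as np

rng = np.random.default_rng(2)
# simulated batting data: 5 players with true abilities near 0.27
true_ability = rng.normal(0.27, 0.02, size=5)
at_bats = np.array([45, 45, 45, 45, 45])
hits = rng.binomial(at_bats, true_ability)

no_pool = hits / at_bats                     # every player independent
complete_pool = hits.sum() / at_bats.sum()   # every player identical

# partial pooling: shrink each player toward the pooled rate; the amount
# of shrinkage depends on the (here fixed) between-player variance tau^2
# relative to each player's sampling variance
tau2 = 0.02 ** 2
sigma2 = complete_pool * (1 - complete_pool) / at_bats
weight = tau2 / (tau2 + sigma2)
partial_pool = weight * no_pool + (1 - weight) * complete_pool
```

Each partially pooled estimate lands between the player's raw rate and the league rate, with noisier players pulled harder toward the pool, which is exactly the tightening of estimates described above.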

The post Bob’s talk at Berkeley, Thursday 22 March, 3 pm appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

R Packages worth a look

Toolbox for Model Selection and Combinations for the Forecasting Purposes (greybox)
Implements model selection and combinations via information criteria based on the values of partial correlations. This allows, for example, solving ‘fat regression’ problems, where the number of variables is much larger than the number of observations. The approach is driven by research on information criteria, well discussed in Burnham & Anderson (2002) <doi:10.1007/b97636> and currently developed further by Ivan Svetunkov and Yves Sagaert (working paper in progress). Models developed in the package are tailored specifically for forecasting, so the package also provides several methods for producing forecasts from these models and visualising them.
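
To illustrate the underlying idea, here is a plain exhaustive-AIC subset search over OLS fits, sketched in Python; greybox's criteria are based on partial correlations and are designed to scale to the fat-regression case where exhaustive search is infeasible.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 6
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(n)  # only x0, x1 matter

def aic(subset):
    # OLS on the chosen columns; AIC = n*log(RSS/n) + 2k (a common form)
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + 2 * Xs.shape[1]

subsets = [s for k in range(p + 1) for s in itertools.combinations(range(p), k)]
best = min(subsets, key=aic)
```

With strong true effects, the AIC-minimizing subset recovers the relevant predictors while penalizing the irrelevant ones.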

High-Dimensional Regression with Measurement Error (hdme)
Penalized regression for generalized linear models for measurement error problems (aka errors-in-variables). The package contains a version of the lasso (L1-penalization) which corrects for measurement error (Sorensen et al. (2015) <doi:10.5705/ss.2013.180>). It also contains an implementation of the Generalized Matrix Uncertainty Selector, a version of the (Generalized) Dantzig Selector for the case of measurement error (Sorensen et al. (2018) <doi:10.1080/10618600.2018.1425626>).

Interface to the Corpus Query Protocol (rcqp)
Implements Corpus Query Protocol functions based on the CWB software. It relies on CWB (GPL v2), PCRE (BSD licence), and glib2 (LGPL).

Create the Best Train for Classification Models (OptimClassifier)
Pattern searching and binary classification in economic and financial data is a large field of research, and in much of this data the target variable is binary. Many methodologies are in use today; this package collects the most popular and compares different configuration options for Linear Models (LM), Generalized Linear Models (GLM), Linear Mixed Models (LMM), Discriminant Analysis (DA), Classification And Regression Trees (CART), Neural Networks (NN) and Support Vector Machines (SVM).

Effortlessly Read Any Rectangular Data (readit)
Providing just one primary function, ‘readit’ uses a set of reasonable heuristics to apply the appropriate reader function to the given file path. As long as the data file has an extension, and the data is (or can be coerced to be) rectangular, readit() can probably read it.
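
The dispatch idea can be sketched in a few lines of Python. This is a hypothetical miniature of the concept, not readit's actual (R) implementation, which also applies heuristics and coercion to rectangular form.

```python
import csv
import io
import json
import pathlib

# map each file extension to a reader that returns rectangular-ish data
READERS = {
    ".json": lambda text: json.loads(text),
    ".csv": lambda text: list(csv.reader(io.StringIO(text))),
}

def read_any(path: str, text: str):
    # pick the reader from the extension alone, as readit does
    ext = pathlib.Path(path).suffix.lower()
    if ext not in READERS:
        raise ValueError(f"no reader registered for {ext!r}")
    return READERS[ext](text)
```

For example, `read_any("data.csv", "a,b\n1,2")` returns the rows `[['a', 'b'], ['1', '2']]`; adding support for a new format is just one more entry in the dispatch table.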

Continue Reading…


Read More

Because it's Friday: Email a tree

The City of Melbourne has collected data on the more than 70,000 trees in the urban forest of this Australian metropolis. The data include the species, the health status of the tree and its life expectancy, all shown on a lovely map.


As you can see from the image above, each tree also has a unique email address. The idea is that citizens can report problems with trees, like disease or a fallen limb. But as The Atlantic reported in 2015, the addresses have also been used to write charming letters to the trees. For example, this email to a Golden Elm:

21 May 2015

I’m so sorry you're going to die soon. It makes me sad when trucks damage your low hanging branches. Are you as tired of all this construction work as we are?

Sometimes the trees even reply, like this Willow Leaf Peppermint:

29 Jan 2015

Hello Mr Willow Leaf Peppermint, or should I say Mrs Willow Leaf Peppermint?

Do trees have genders?

I hope you've had some nice sun today.



30 Jan 2015


I am not a Mr or a Mrs, as I have what's called perfect flowers that include both genders in my flower structure, the term for this is Monoicous. Some trees species have only male or female flowers on individual plants and therefore do have genders, the term for this is Dioecious. Some other trees have male flowers and female flowers on the same tree. It is all very confusing and quite amazing how diverse and complex trees can be.

Kind regards,

Mr and Mrs Willow Leaf Peppermint (same Tree)

You can find a few more letters in this article as well.

 That's all from us for this week. Hope you have a great weekend (perhaps amongst the trees?) and we'll be back with more next week.

Continue Reading…


Read More

RcppClassicExamples 0.1.2

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Per a CRAN email sent to 300+ maintainers, this package (just like many others) was asked to please register its S3 method. So we did, and also overhauled a few other packaging standards which have changed since the previous uploads in December of 2012 (!!).

No new code or features. Full details below. And as a reminder, don’t use the old RcppClassic — use Rcpp instead.

Changes in version 0.1.2 (2018-03-15)

  • Registered S3 print method [per CRAN request]

  • Added src/init.c with registration and updated all .Call usages taking advantage of it

  • Updated http references to https

  • Updated DESCRIPTION conventions

Thanks to CRANberries, you can also look at a diff to the previous release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

RDieHarder 0.1.4

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Per a CRAN email sent to 300+ maintainers, this package (just like many others) was asked to please register its S3 method. So we did, and also overhauled a few other packaging standards which have changed since the last upload in 2014.

No NEWS.Rd file to take a summary from, but the top of the ChangeLog has details.

Thanks to CRANberries, you can also look at a diff to the previous release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Continue Reading…


Read More

Document worth reading: “Towards Deep Learning Models Resistant to Adversarial Attacks”

Recent work has demonstrated that neural networks are vulnerable to adversarial examples, i.e., inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete, general guarantee to provide. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. This suggests that adversarially resistant deep learning models might be within our reach after all. Towards Deep Learning Models Resistant to Adversarial Attacks
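
The robust-optimization view treats training as a saddle-point problem whose inner maximization finds a worst-case perturbation of each input inside a small eps-ball. Below is a minimal numpy sketch of that inner step on a toy logistic model, using a single fast-gradient-sign step; the paper's stronger adversary iterates this as projected gradient descent (PGD).

```python
import numpy as np

# a toy logistic-regression "network" (a stand-in for a deep model)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, -0.1, 0.8])
y = 1.0

def loss(x):
    # binary cross-entropy of the logistic model at input x
    p = 1 / (1 + np.exp(-(w @ x)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# inner maximization, approximated by one fast-gradient-sign step:
# move each input coordinate by eps in the direction that increases loss
eps = 0.1
grad_x = (1 / (1 + np.exp(-(w @ x))) - y) * w   # d loss / d x for logistic loss
x_adv = x + eps * np.sign(grad_x)
```

The perturbed input `x_adv` is indistinguishable from `x` up to eps per coordinate, yet its loss is strictly larger; adversarial training then minimizes the loss at such worst-case points instead of the clean ones.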

Continue Reading…


Read More

Science and Technology links (March 16th, 2018)

  1. From the beginning of the 20th century to 2010, the life expectancy at birth for females in the United States increased by more than 32 years. The 3 major causes of death for females in 1900 were pneumonia and influenza, tuberculosis, and enteritis and diarrhea. In 2010, the 3 major causes were heart disease, all cancers, and stroke.
  2. It looks like Dwarf stars could be orbited by habitable planets.
  3. More evidence that intelligence is genetic.
  4. Sugar and bread are killing you: Dietary Carbohydrates Impair Healthspan and Promote Mortality (in Cell).
  5. It turns out that people in organized crime are probably saner than you’d expect: “we were able to determine that in the sample analysed there was not one subject with a psychotic personality”.
  6. If you made it to Pluto and could somehow survive, would there be enough light to read? More than enough according to Cook.
  7. China has reduced fine particulates in the air by a third in four years.
  8. You can cure blindness (in mice) using small wires: “artificial photoreceptors based on gold nanoparticle-decorated titania nanowire arrays restored visual responses in the blind mice with degenerated photoreceptors”. (In Nature.)
  9. According to Nature, a science doctorate has high value in the UK and Canadian job markets. It sounds true to me. However, you should simply not expect to automatically become a professor: “Nearly 30% of those with full- or part-time jobs ended up in academia.”
  10. Brenda Milner is a professor at McGill University who is going to turn 100 this summer. She is still an active professor with an ongoing publication record. Here is what The New York Times wrote about her last year:

    “People think because I’m 98 years old I must be emerita,” she said. “Well, not at all. I’m still nosy, you know, curious.” (…) Dr. Milner continues working, because she sees no reason not to. Neither McGill nor the affiliated Montreal Neurological Institute and Hospital has asked her to step aside. She has funding: In 2014 she won three prominent achievement awards, which came with money for research.

  11. Mozilla has released an open source speech recognition model “so that anyone can develop compelling speech experiences” (via Leonid Boytsov).
  12. One of my favorite authors, Brian Martin, has published a new book: Vaccination panic in Australia. We all know that vaccination can be an effective public health policy. So you think that it is crazy to question vaccination policies? Not so fast. Brian explains carefully that there is room for reasonable disagreement on how exactly vaccination is to be used. But most importantly, the book reviews how authorities proceed to suppress dissent, even reasonable well-founded dissent. The book can be freely accessed online.
  13. As we get older, our muscles tend to disappear. It is a condition called sarcopenia, a term coined in 1988. It is still unclear what causes it, but there is now evidence that it has to do with the disappearance of nerves. Even if we did nothing to cure cancer and heart disease, simply keeping the muscles of older people strong would make a huge difference. Sadly, we have barely begun to consider maybe doing something about it.

Continue Reading…


Read More

New Book: Credit risk analytics, The R Companion

Credit risk analytics in R will enable you to build credit risk models from start to finish. With access to real credit data on the accompanying website, you will master a wide range of applications.

Continue Reading…


Read More

Microsoft Weekly Data Science News for March 16, 2018

Here are the latest articles from Microsoft regarding cloud data science products and updates.


Continue Reading…


Read More

If you did not already know

Genetic Programming for Reinforcement Learning (GPRL) google
The search for interpretable reinforcement learning policies is of high academic and industrial interest. Especially for industrial systems, domain experts are more likely to deploy autonomously learned controllers if they are understandable and convenient to evaluate. Basic algebraic equations are supposed to meet these requirements, as long as they are restricted to an adequate complexity. Here we introduce the genetic programming for reinforcement learning (GPRL) approach based on model-based batch reinforcement learning and genetic programming, which autonomously learns policy equations from pre-existing default state-action trajectory samples. GPRL is compared to a straight-forward method which utilizes genetic programming for symbolic regression, yielding policies imitating an existing well-performing, but non-interpretable policy. Experiments on three reinforcement learning benchmarks, i.e., mountain car, cart-pole balancing, and industrial benchmark, demonstrate the superiority of our GPRL approach compared to the symbolic regression method. GPRL is capable of producing well-performing interpretable reinforcement learning policies from pre-existing default trajectory data. …

Adaptive Robust Control google
In this paper we propose a new methodology for solving an uncertain stochastic Markovian control problem in discrete time. We call the proposed methodology the adaptive robust control. We demonstrate that the uncertain control problem under consideration can be solved in terms of associated adaptive robust Bellman equation. The success of our approach is to a great extent owed to the recursive methodology for construction of relevant confidence regions. We illustrate our methodology by considering an optimal portfolio allocation problem, and we compare results obtained using the adaptive robust control method with some other existing methods. …

Active Function Cross-Entropy Clustering (afCEC) google
Active function cross-entropy clustering partitions the n-dimensional data into the clusters by finding the parameters of the mixed generalized multivariate normal distribution, that optimally approximates the scattering of the data in the n-dimensional space, whose density function is of the form: p_1*N(mi_1,^sigma_1,sigma_1,f_1)+…+p_k*N(mi_k,^sigma_k,sigma_k,f_k). The above-mentioned generalization is performed by introducing so called ‘f-adapted Gaussian densities’ (i.e. the ordinary Gaussian densities adapted by the ‘active function’). Additionally, the active function cross-entropy clustering performs the automatic reduction of the unnecessary clusters. For more information please refer to P. Spurek, J. Tabor, K.Byrski, ‘Active function Cross-Entropy Clustering’ (2017) <doi:10.1016/j.eswa.2016.12.011>. …

Continue Reading…


Read More

Speeding up Metropolis-Hastings with Rcpp

(This article was first published on R – Stable Markets, and kindly contributed to R-bloggers)

Previous posts in this series on MCMC samplers for Bayesian inference (in order of publication): Bayesian Simple Linear Regression with Gibbs Sampling in R Blocked Gibbs Sampling in R for Bayesian Multiple Linear Regression Metropolis-in-Gibbs Sampling and Runtime Analysis with Profviz The code for all of these posts can be found in my BayesianTutorials GitHub … Continue reading Speeding up Metropolis-Hastings with Rcpp

To leave a comment for the author, please follow the link and comment on their blog: R – Stable Markets. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Your free 70-page guide to a career in data science

To help you become a Data Scientist, we put together a guide with answers to: How do you break into the profession? What skills do you need to become a data scientist? Where are the best data science jobs?

Continue Reading…


Read More

Will young Russians make for a new Russia?

As part of our multimedia project on young Russians, we broke down the country’s views on a range of issues by age cohort, drawing on extensive data kindly provided by the Levada Centre, an independent pollster.

Continue Reading…


Read More

5 Things You Need to Know about Big Data

We take a look at five things you need to know about Big Data.

Continue Reading…


Read More

What is the Shape of a Pixel?


The Shape of a Pixel

What is the shape of a pixel? At various times, I have treated a pixel as a square (often), a point (sometimes), or a rectangle (occasionally). I recall back in grad school doing some homework where we treated pixels as hexagons.

As I have worked through the last few posts on computing Feret diameters, though, I have started to entertain the possible usefulness of considering pixels to be circles. (See 29-Sep-2017, 24-Oct-2017, and 20-Feb-2018.) Let me try to explain why.

Here's a binary image with a single foreground blob (or "object," or "connected component.")

bw = imread('Martha''s Vineyard (30x20).png');

Most of the time, we think of image pixels as being squares with unit area.


We can use find to get the $x$- and $y$-coordinates of the pixel centers, and then we can use convhull to find their convex hull. As an optimization that I think will often reduce execution time and memory, I'm going to preprocess the input binary image here by calling bwperim. I'm not going to show that step everywhere in this example, though.

[y,x] = find(bwperim(bw));
hold on
hold off
title('Pixel centers')
h = convhull(x,y);
x_hull = x(h);
y_hull = y(h);
hold on
hull_line = plot(x_hull,y_hull,'r*','MarkerSize',12);
hold off
title('Pixel centers and convex hull vertices')
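For readers who want to experiment outside MATLAB, the same pixel-center hull computation can be sketched in pure Python with Andrew's monotone-chain algorithm (the `convex_hull` helper and the tiny L-shaped test mask below are my own illustration, not code from this post):

```python
def convex_hull(points):
    """Andrew's monotone-chain convex hull of 2-D points.

    Returns the hull vertices in counter-clockwise order.  The
    non-strict `<= 0` turn test drops collinear points on hull edges,
    playing the same role as convhull's 'Simplify' option.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); 0 means collinear
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# (x, y) centers of the foreground pixels of a small L-shaped blob
mask = [(1, 1), (2, 1), (3, 1), (1, 2), (1, 3)]
print(convex_hull(mask))   # [(1, 1), (3, 1), (1, 3)] -- only the corners survive
```

As with convhull on pixel centers, this hull passes through centers only, so parts of each boundary pixel fall outside it.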

Notice that there are some chains of three or more colinear convex hull vertices.

xlim([21.5 32.5])
ylim([9.5 15.5])
title('Colinear convex hull vertices')

In some of the other processing steps related to Feret diameter measurements, colinear convex hull vertices can cause problems. We can eliminate these vertices directly in the call to convhull using the 'Simplify' parameter.

h = convhull(x,y,'Simplify',true);
x_hull = x(h);
y_hull = y(h);
hold on
hold off
title('Colinear hull vertices removed')
hold on
hold off
title('A Blob''s Convex Hull and Its Vertices')

Notice, though, that there are white bits showing outside the red convex hull polygon. That's because we are only using the pixel centers.

Weaknesses of Using the Pixel Centers

Consider a simpler binary object, one that has only one row.

bw2 = false(5,15);
bw2(3,5:10) = true;
[y,x] = find(bw2);

The function convhull doesn't even work on colinear points.

try
    hull = convhull(x,y,'Simplify',true);
catch e
    fprintf('Error message from convhull: "%s"\n', e.message);
end
Error message from convhull: "Error computing the convex hull. The points may be collinear."

But even if it did return an answer, the answer would be a degenerate polygon with length 5 (even though the number of foreground pixels is 6) and zero area.

hold on
hold off
title('Degenerate convex hull polygon')

We can solve this degeneracy problem by using square pixels.

Square Pixels

In the computation of the convex hull above, we treated each pixel as a point. We can, instead, treat each pixel as a square by computing the convex hull of all the corners of every pixel. Here's one way to perform that computation.

offsets = [ ...
     0.5  -0.5
     0.5   0.5
    -0.5  -0.5
    -0.5   0.5 ]';

offsets = reshape(offsets,1,2,[]);

P = [x y];
Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

h = convhull(S,'Simplify',true);
x_hull = S(h,1);
y_hull = S(h,2);
hold on
hold off
title('Convex hull of square pixels')

This result looks good at first glance. However, it loses some of its appeal when you consider the implications for computing the maximum Feret diameter.

points = [x_hull y_hull];
[d,end_points] = maxFeretDiameter(points,antipodalPairs(points))
hold on
hold off
title('The maximum Feret diameter is not horizontal')
d =

    6.0828

end_points =

   10.5000    2.5000
    4.5000    3.5000

The maximum Feret distance of this horizontal segment is 6.0828 ($\sqrt{37}$) instead of 6, and the corresponding orientation in degrees is:

ans =


instead of 0.
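That $\sqrt{37}$ figure is easy to verify by brute force. Here is a small, self-contained Python check (my own sketch, not the post's code) that expands each pixel of the one-row blob into its four unit-square corners and takes the largest pairwise distance:

```python
from itertools import combinations
from math import atan2, degrees, dist, sqrt

# Centers of the six-pixel horizontal row, bw2(3, 5:10) in the post
centers = [(x, 3.0) for x in range(5, 11)]

# Expand each center into the four corners of its unit square
square = [(0.5, 0.5), (0.5, -0.5), (-0.5, 0.5), (-0.5, -0.5)]
corners = [(x + dx, y + dy) for x, y in centers for dx, dy in square]

# Brute-force maximum Feret diameter: the largest pairwise distance
p, q = max(combinations(corners, 2), key=lambda pq: dist(*pq))
d = dist(p, q)
print(d)   # ~6.0828 (= sqrt(37)), not 6
print(degrees(atan2(q[1] - p[1], q[0] - p[0])) % 180)   # tilted ~9.5 deg off horizontal
```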

Another worthy attempt is to use diamond pixels.

Diamond Pixels

Instead of using the four corners of each pixel, let's try using the middle of each pixel edge. Once we define the offsets, the code is exactly the same as for square pixels.

offsets = [ ...
     0.5   0.0
     0.0   0.5
    -0.5   0.0
     0.0  -0.5 ]';

offsets = reshape(offsets,1,2,[]);

P = [x y];
Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

h = convhull(S,'Simplify',true);
x_hull = S(h,1);
y_hull = S(h,2);
hold on
hold off
title('Convex hull of diamond pixels')

Now the max Feret diameter result looks better for the horizontal row of pixels.

points = [x_hull y_hull];
[d,end_points] = maxFeretDiameter(points,antipodalPairs(points))
hold on
hold off
d =

     6

end_points =

   10.5000    3.0000
    4.5000    3.0000
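The exact value of 6 here is again easy to confirm by brute force; this short Python sketch (mine, not the post's) uses the edge-midpoint offsets:

```python
from itertools import combinations
from math import dist

# The same six-pixel horizontal row, with 'diamond' (edge-midpoint) offsets
centers = [(x, 3.0) for x in range(5, 11)]
diamond = [(0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]
pts = [(x + dx, y + dy) for x, y in centers for dx, dy in diamond]

d = max(dist(p, q) for p, q in combinations(pts, 2))
print(d)   # exactly 6.0, from (4.5, 3.0) to (10.5, 3.0) -- horizontal, as expected
```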

Hold on, though. Consider a square blob.

bw3 = false(9,9);
bw3(3:7,3:7) = true;
[y,x] = find(bw3);
P = [x y];
Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

h = convhull(S,'Simplify',true);
x_hull = S(h,1);
y_hull = S(h,2);
hold on
points = [x_hull y_hull];
[d,end_points] = maxFeretDiameter(points,antipodalPairs(points))
hold off
title('The max Feret diameter is not at 45 degrees')
d =

    6.4031

end_points =

    7.5000    3.0000
    2.5000    7.0000

We'd like to see the max Feret diameter oriented at 45 degrees, and clearly we don't.

Circular Pixels

OK, I'm going to make one more attempt. I'm going to treat each pixel as approximately a circle. I'm going to approximate a circle using 24 points that are spaced at 15-degree intervals along the circumference.

thetad = 0:15:345;
offsets = 0.5*[cosd(thetad) ; sind(thetad)];
offsets = reshape(offsets,1,2,[]);
Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

h = convhull(S,'Simplify',true);
x_hull = S(h,1);
y_hull = S(h,2);

hold on
points = [x_hull y_hull];
[d,end_points] = maxFeretDiameter(points,antipodalPairs(points))
axis on
hold off
d =

    6.6569

end_points =

    7.3536    7.3536
    2.6464    2.6464

Now the max Feret diameter orientation is what we would naturally expect, which is $\pm 45^{\circ}$. The orientation would also be as expected for a horizontal or vertical segment of pixels.
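To see why, here is the same brute-force check (again my own sketch, not the post's code) with 24-point circular offsets on the 5-by-5 square blob; the largest pairwise distance now lies along a diagonal:

```python
from itertools import combinations
from math import atan2, cos, degrees, dist, radians, sin

# Centers of the 5x5 square blob, bw3(3:7, 3:7) in the post
centers = [(x, y) for x in range(3, 8) for y in range(3, 8)]

# 24 points at 15-degree spacing on a circle of diameter 1
circle = [(0.5 * cos(radians(t)), 0.5 * sin(radians(t))) for t in range(0, 360, 15)]
pts = [(x + dx, y + dy) for x, y in centers for dx, dy in circle]

p, q = max(combinations(pts, 2), key=lambda pq: dist(*pq))
d = dist(p, q)
angle = degrees(atan2(q[1] - p[1], q[0] - p[0])) % 180
print(round(d, 4), round(angle, 1))   # ~6.6569 at 45 (or the mirror-image 135) degrees
```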

Still, a circular approximation might not always give exactly what a user might expect. Let's go back to the Martha's Vineyard blob that I started with. I wrote a function called pixelHull that can compute the convex hull of binary image pixels in a variety of different ways. The call pixelHull(bw,24) computes the pixel hull using a 24-point circle approximation.

Here's the maximum Feret diameter using that approximation.

V = pixelHull(bw,24);
hold on
[d,end_points] = maxFeretDiameter(V,antipodalPairs(V));
axis on
hold off

I think many people might expect the maximum Feret diameter to go corner-to-corner in this case, but it doesn't exactly do that.

xlim([22.07 31.92])
ylim([8.63 15.20])

You have to use square pixels to get corner-to-corner.

V = pixelHull(bw,'square');
hold on
[d,end_points] = maxFeretDiameter(V,antipodalPairs(V));
axis on
hold off
xlim([22.07 31.92])
ylim([8.63 15.20])

After all this, I'm still not completely certain which shape assumption will generally work best. My only firm conclusion is that the point approximation is the worst choice. The degeneracies associated with point pixels are just too troublesome.

If you have an opinion, please share it in the comments. (Note: A comment that says, "Steve, you're totally overthinking this" would be totally legit.)

The rest of the post contains functions used by the code above.

function V = pixelHull(P,type)

if nargin < 2
    type = 24;
end

if islogical(P)
    P = bwperim(P);
    [i,j] = find(P);
    P = [j i];
end

if strcmp(type,'square')
    offsets = [ ...
         0.5  -0.5
         0.5   0.5
        -0.5   0.5
        -0.5  -0.5 ];

elseif strcmp(type,'diamond')
    offsets = [ ...
         0.5  0
         0    0.5
        -0.5  0
         0   -0.5 ];

else
    % type is number of angles for sampling a circle of diameter 1.
    thetad = linspace(0,360,type+1)';
    thetad(end) = [];

    offsets = 0.5*[cosd(thetad) sind(thetad)];
end

offsets = offsets';
offsets = reshape(offsets,1,2,[]);

Q = P + offsets;
R = permute(Q,[1 3 2]);
S = reshape(R,[],2);

k = convhull(S,'Simplify',true);
V = S(k,:);

Get the MATLAB code

Published with MATLAB® R2017b

Continue Reading…


Read More

Take Care If Trying the RPostgres Package

Take care if trying the new RPostgres database connection package. By default it returns some non-standard types that code developed against other database drivers may not expect, and may not be ready to defend against.


Danger, Will Robinson!

Trying the new package

One can try the newer RPostgres as a drop-in replacement for the usual RPostgreSQL.

That starts out okay. We can connect to the database and pull a summary about remote data to R.

db <- DBI::dbConnect(
  RPostgres::Postgres(),
  host = 'localhost',
  port = 5432,
  user = 'johnmount',
  password = '')
## Warning: multiple methods tables found for 'dbQuoteLiteral'
d <- DBI::dbGetQuery(
  db,
  "SELECT COUNT(1) FROM pg_catalog.pg_tables")
##   count
## 1   177
ntables <- d$count[[1]]
## integer64
## [1] 177

The result at first looks okay.

## [1] "integer64"
## [1] "double"
ntables + 1L
## integer64
## [1] 178
ntables + 1
## integer64
## [1] 178
## [1] TRUE

But it is only okay, until it is not.

pmax(1L, ntables)
## [1] 8.744962e-322
pmin(1L, ntables)
## [1] 1
ifelse(TRUE, ntables, ntables)
## [1] 8.744962e-322
for(ni in ntables) {
  print(ni)
}
## [1] 8.744962e-322
## [1] 8.744962e-322

If your code, or any package code you are using, performs any of the above calculations, your results will be corrupt and wrong. It is quite likely that any code written before December 2017 (RPostgres's first CRAN distribution) was not written with the RPostgres "integer64 for all of my friends" design decision in mind.
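The strange 8.744962e-322 is not random garbage: it is exactly what you get when the 64-bit integer 177 is reinterpreted, bit for bit, as an IEEE 754 double. A quick Python illustration of the mechanism (Python is used here only to show the bit-level effect; the bug itself lives on the R side):

```python
import struct

# Pack 177 as a little-endian 64-bit integer, then reinterpret the
# same 8 bytes as a double: out comes a tiny subnormal number.
bits = struct.pack('<q', 177)
value = struct.unpack('<d', bits)[0]
print(f'{value:.6e}')               # 8.744962e-322, the corrupt value seen above
assert value == 177 * 2.0 ** -1074  # i.e. 177 steps above zero in double precision
```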

Also note, RPostgres does not currently appear to write integer64 back to the database.

DBI::dbWriteTable(db, "d", d, 
                  temporary = TRUE, 
                  overwrite = TRUE)
DBI::dbGetQuery(db, "
   SELECT column_name, data_type, numeric_precision,
          numeric_precision_radix, udt_name
   FROM information_schema.columns
   WHERE table_name = 'd'")
##   column_name data_type numeric_precision numeric_precision_radix udt_name
## 1       count      real                24                       2   float4

The work-around

The work-around is: add the argument bigint = "numeric" to your dbConnect() call. This is mentioned in the manual, but it is not the default and is not called out in the package description or README. Or, of course, you could use RPostgreSQL.

Continue Reading…


Read More

Quick Feature Engineering with Dates Using

The library is a collection of supplementary wrappers for a host of popular machine learning libraries, designed to remove the necessity of writing your own functions to take care of some repetitive tasks in a machine learning workflow.

Continue Reading…


Read More

University College Dublin: Postdoc Research Fellow

Seeking a temporary Post-doctoral Research Fellow in the UCD School of Computer Science for a project on analysing activity/fitness data, working with a team of researchers at the Insight Centre for Data Analytics.

Continue Reading…


Read More

How to get started in data science?


Continue Reading…


Read More

Gaydar and the fallacy of objective measurement

Greggor Mattson, Dan Simpson, and I wrote this paper, which begins:

Recent media coverage of studies about “gaydar,” the supposed ability to detect another’s sexual orientation through visual cues, reveal problems in which the ideals of scientific precision strip the context from intrinsically social phenomena. This fallacy of objective measurement, as we term it, leads to nonsensical claims based on the predictive accuracy of statistical significance. We interrogate these gaydar studies’ assumption that there is some sort of pure biological measure of perception of sexual orientation. Instead, we argue that the concept of gaydar inherently exists within a social context and that this should be recognized when studying it. We use this case as an example of a more general concern about illusory precision in the measurement of social phenomena, and suggest statistical strategies to address common problems.

There’s a funny backstory to this one.

I was going through my files a few months ago and came across an unpublished paper of mine from 2012, “The fallacy of objective measurement: The case of gaydar,” which I didn’t even remember ever writing! A completed article, never submitted anywhere, just sitting in my files.

How can that happen? I must be getting old.

Anyway, I liked the paper—it addresses some issues of measurement that we’ve been talking about a lot lately. In particular, “the fallacy of objective measurement”: researchers took a rich real-world phenomenon and abstracted it so much that they removed its most interesting content. “Gaydar” existed within a social context—a world in which gays were an invisible minority, hiding in plain sight and seeking to be inconspicuous to the general population while communicating with others of their subgroup. How can it make sense to boil this down to the shapes of faces?

Stripping a phenomenon of its social context, normalizing a base rate to 50%, and seeking an on-off decision: all of these can give the feel of scientific objectivity—but the very steps taken to ensure objectivity can remove social context and relevance.

We had some gaydar discussion (also here) on the blog recently and this motivated me to freshen up the gaydar paper, with the collaboration of Mattson and Simpson. I also recently met Michal Kosinski, the coauthor of one of the articles under discussion, and that was helpful too.

The post Gaydar and the fallacy of objective measurement appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Web Scraping with Python: Illustration with CIA World Factbook

In this article, we show how to use Python libraries and HTML parsing to extract useful information from a website and answer some important analytics questions afterwards.

Continue Reading…


Read More

What Are Beacons, and How Are They Used in IoT Projects?

All new technologies are becoming a part of our environment, but many of them remain unnoticed or incomprehensible. For many people, beacons are one of these mysterious items. Many IoT applications in large industries – such as retail and warehousing – use beacons every day, but these small devices go unnoticed. Although the

The post What Are Beacons, and How Are They Used in IoT Projects? appeared first on Dataconomy.

Continue Reading…


Read More

JPMorgan: Data Scientist, Payments & Liquidity

Seeking a Data Scientist with modeling expertise and implementation experience to serve as a thought partner to key business leaders and clients to generate hypotheses and insights.

Continue Reading…


Read More

Apple: Big Data Engineer

Seeking extraordinary engineers to help take our environment to the next level. You'll have the opportunity to solve challenging big data engineering problems across a broad range of Apple manufacturing services.

Continue Reading…


Read More

Simple maths of a fairer USS deal

(This article was first published on R – Let's Look at the Figures, and kindly contributed to R-bloggers)

This will be my last post for a while (I promise!).  After today I’ll be taking a rest from all this, until at least the start of April.  Hopefully all this USS stuff will be resolved by then, though!

In yesterday’s post I showed a graph, followed by some comments to suggest that future USS proposals with a flatter (or even increasing) “percent lost” curve would be fairer (and, as I argued earlier in my Robin Hood post, more affordable at the same time).

It’s now clear to me that my suggestion seemed a bit cryptic to many (maybe most!) who read it yesterday.  So here I will try to show more specifically how to achieve a flat curve.  (This is not because I think flat is optimal.  It’s mainly because it’s easy to explain.  As already mentioned, it might not be a bad idea if the curve was actually to increase a bit as salary levels increase; that would allow those with higher salaries to feel happy that they are doing their bit towards the sustainable future of USS.)

Flattening the curve

The graph below is the same as yesterday’s but with a flat (blue, dashed) line drawn at the level of 4% lost across all salary levels.

I drew the line at 4% here just as an example, to illustrate the calculation.  The actual level needed — i.e, the “affordable” level for universities —  would need to be determined by negotiation; but the maths is essentially the same, whatever the level (within reason).

Let’s suppose we want to adjust the USS contribution and benefits parameters to achieve just such a flat “percent lost” curve, at the 4% level.  How is that done?

I will assume here the same adjustable parameters that UUK and UCU appear to have in mind, namely:

  • employee contribution rate E (as percentage of salary — currently 8; was 8.7 in the 12 March proposal; was 8 in the January proposal)
  • threshold salary T, over which defined benefit (DB) pension entitlement ceases (which is currently £55.55k; was £42k in the 12 March proposal; and was £0 in the January proposal)
  • accrual rate A, in the DB pension.  Expressed here in percentage points (currently 100/75; was 100/85 in the 12 March proposal; and not relevant to the January proposal).
  • employer contribution rate (%) to the defined contribution (DC) part of USS pension.  Let’s allow different rates C_1 and C_2 for, respectively, salaries between T and £55.55k, and salaries over £55.55k. (Currently C_1 is irrelevant, and C_2 is 13 (max); these were both set at 12 in the 12th March proposal; and were both 13.25 in the January proposal.)

I will assume also, as all the recent proposals do, that the 1% USS match possibility is lost to all members.

Then, to get to 4% lost across the board, we need simply to solve the following linear equations.  (To see where these came from, please see this earlier post.)

For salary up to T:

 (E - 8) + 19(100/75 - A) + 1 = 4.

For salary between T and £55.55k:

  -8 + 19(100/75) - C_1 + 1 = 4.

For salary over £55.55k:

 13 - C_2 = 4.

Solving those last two equations is simple, and results in

 C_1 = 14.33, \qquad C_2 = 9.

The first equation above clearly allows more freedom: it’s just one equation, with two unknowns, so there are many solutions available.  Three example solutions, still based the illustrative 4% loss level across all salary levels, are:

 E=8, \qquad A = 1.175 = 100/85.1

 E = 8.7, \qquad A = 1.21 = 100/82.6

 E = 11, \qquad A = 100/75.
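These numbers are easy to sanity-check by substituting them back into the equations. The following short Python transcription (mine, not part of the original post) does exactly that at the 4% level:

```python
loss = 4.0   # target "percent lost", as in the example above

# Salary between T and 55.55k:  -8 + 19*(100/75) - C1 + 1 = loss
C1 = 19 * (100 / 75) - 7 - loss
# Salary over 55.55k:  13 - C2 = loss
C2 = 13 - loss
print(round(C1, 2), C2)   # 14.33 9.0, matching the solution in the text

# Salary up to T:  (E - 8) + 19*(100/75 - A) + 1 = loss
def percent_lost(E, A):
    return (E - 8) + 19 * (100 / 75 - A) + 1

# Each example (E, A) pair recovers (approximately) the 4% loss
for E, A in [(8, 100 / 85.1), (8.7, 100 / 82.6), (11, 100 / 75)]:
    print(round(percent_lost(E, A), 2))
```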

At the end here I’ll give code in R to do the above calculation quite generally, i.e., for any desired percentage loss level.  First let me just make a few remarks relating to all this.


Choice of threshold

Note that the value of T does not enter into the above calculation.  Clearly there will be (negotiable) interplay between T and the required percentage loss, though, for a given level of affordability.

Choice of C_2

Much depends on the value of C_2.

The calculation above gives the value of C_2 needed for a flat “percent lost” curve, at any given level for the percent lost (which was 4% in the example above).

To achieve an increasing “percent lost” curve, we could simply reduce the value of C_2 further than the answer given by the above calculation.  Alternatively, as suggested in my earlier Robin Hood post, USS could apply a lower value of C_2 only for salaries above some higher threshold — i.e., in much the same spirit as progressive taxation of income.

Just as with income tax, it would be important not to set C_2 too small, otherwise the highest-paid members would quite likely want to leave USS.  There is clearly a delicate balance to be struck, at the top end of the salary spectrum.

But it is clear that if the higher-paid were to sacrifice at least as much as everyone else, in proportion to their salary, then that would allow the overall level of “percent lost” to be appreciably reduced, which would benefit the vast majority of USS members.

Determination of the overall “percent lost”

Everything written here constitutes a methodology to help with finding a good solution.  As mentioned at the top here, the actual solution — and in particular, the actual level of USS member pain (if any) deemed to be necessary to keep USS afloat — will be a matter for negotiation.  The maths here can help inform that negotiation, though.

Code for solving the above equations

## Function to compute the USS parameters needed for a
## flat "percent lost" curve
## Function arguments are:
## loss: in percentage points, the constant loss desired
## E: employee contribution, in percentage points
## A: the DB accrual rate
## Exactly one of E and A must be specified (ie, not NULL).
## Example calls:
## flatcurve(4.0, A = 100/75)
## flatcurve(2.0, E = 10.5)
## flatcurve(1.0, A = 100/75)  # status quo, just 1% "match" lost

flatcurve <- function(loss, E = NULL, A = NULL){

    if (is.null(E) && is.null(A)) {
        stop("E and A can't both be NULL")}
    if (!is.null(E) && !is.null(A)) {
        stop("one of {E, A} must be NULL")}

    c1 <- 19 * (100/75) - (7 + loss)
    c2 <- 13 - loss

    if (is.null(E)) {
        E <- 7 + loss - (19 * (100/75 - A))
    }

    if (is.null(A)) {
        A <- (E - 7 - loss + (19 * 100/75)) / 19
    }

    return(list(loss_percent = loss,
                employee_contribution_percent = E,
                accrual_reciprocal = 100/A,
                DC_employer_rate_below_55.55k = c1,
                DC_employer_rate_above_55.55k = c2))
}

The above function will run in base R.

Here are three examples of its use (copied from an interactive session in R):

###  Specify 4% loss level, 
###  still using the current USS DB accrual rate

> flatcurve(4.0, A = 100/75)
$loss_percent
[1] 4

$employee_contribution_percent
[1] 11

$accrual_reciprocal
[1] 75

$DC_employer_rate_below_55.55k
[1] 14.33333

$DC_employer_rate_above_55.55k
[1] 9

###  This time for a smaller (2%) loss, 
###  with specified employee contribution

> flatcurve(2.0, E = 10.5)
$loss_percent
[1] 2

$employee_contribution_percent
[1] 10.5

$accrual_reciprocal
[1] 70.80745

$DC_employer_rate_below_55.55k
[1] 16.33333

$DC_employer_rate_above_55.55k
[1] 11

### Finally, my personal favourite:
### --- status quo with just the "match" lost

> flatcurve(1, A = 100/75)
$loss_percent
[1] 1

$employee_contribution_percent
[1] 8

$accrual_reciprocal
[1] 75

$DC_employer_rate_below_55.55k
[1] 17.33333

$DC_employer_rate_above_55.55k
[1] 12

© David Firth, March 2018

To cite this entry:
Firth, D (2018). Simple maths of a fairer USS deal. Weblog entry at URL

To leave a comment for the author, please follow the link and comment on their blog: R – Let's Look at the Figures.

Continue Reading…


Read More

Four short links: 16 March 2018

Longevity, Partner Violence, Leaking Secrets, and Fallacy of Objective Measurement

  1. Longevity FAQ (Laura Deming) -- I run Longevity Fund. I spend a lot of time thinking about what could increase healthy human lifespan. This is my overview of the field for beginners.
  2. Intimate Partner Violence -- What we’ve discovered in our research is that digital abuse of intimate partners is both more mundane and more complicated than we might think. [...] [I]ntimate partner violence upends the way we typically think about how to protect digital privacy and security. You should read this because we all need to get a lot more aware of the ways in which the tools we make might be used to hurt others.
  3. The Secret Sharer -- Machine learning models based on neural networks and deep learning are being rapidly adopted for many purposes. What those models learn, and what they may share, is a significant concern when the training data may contain secrets and the models are public—e.g., when a model helps users compose text messages using models trained on all users’ messages. [...] [W]e show that unintended memorization occurs early, is not due to overfitting, and is a persistent issue across different types of models, hyperparameters, and training strategies.
  4. Gaydar and the Fallacy of Objective Measurement -- By taking gaydar into the lab, these research teams have taken the creative adaptation of an oppressed community of atomized members and turned gaydar into an essentialist story of “gender atypicality,” a topic that is related to, but distinctly different from, sexual orientation.

Continue reading Four short links: 16 March 2018.

Continue Reading…


Read More

Machine learning to estimate when bus and bike lanes blocked

Frustrated with vehicles blocking bus and bike lanes, Alex Bell applied some statistical methods to estimate the extent.

Sarah Maslin Nir for The New York Times:

Now Mr. Bell is trying another tack — the 30-year-old computer scientist who lives in Harlem has created a prototype of a machine-learning algorithm that studies footage from a traffic camera and tracks precisely how often bike lanes are obstructed by delivery trucks, parked cars and waiting cabs, among other scofflaws. It is a piece of data that transportation advocates said is missing in the largely anecdotal discussion of how well the city’s bus and bike lanes do or do not work.


Continue Reading…


Read More

Writing papers about packages

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

Some advice from a referee –

Back in 2007 I wrote a Matlab package for estimating regime switching models. I was just starting to learn to code, and this project was my way of doing it. After publishing it in FEX (the Matlab file exchange site) I got so many repeated questions by email that I eventually realized it would be easier to write a manual for people to read. Some time and effort would be spent writing it, but less time replying to the same questions in my inbox.

This manual about the code became, by far, my most cited paper in Google Scholar. It is not even published, just a permanent working paper. When attending conferences and seminars, I was always surprised to hear that people knew me as the matlab regime switching guy.

Moving forward a few years, I switched from Matlab to R, and I continue to invest a lot of time writing papers about packages and publishing them in standard scientific journals. You can see a list of those here. I can testify to the greater contribution and impact of research papers about code. I strongly believe they will become more popular in the years to come. The new generation of researchers is far more aware of code than the previous one. In that sense, nothing beats R and CRAN for the diversity and depth of packages.

On this subject, I frequently review papers on the same topic, and I see common mistakes that researchers make when writing their papers. Here are some tips for those who wish to pursue such a publication:

  • A problem must be clearly stated: Every paper is a solution to a problem. This is also true for a paper about code. Identify it and make it painfully clear how the code solves it. In other words, do your homework.

  • The paper is NOT an extended manual: Don’t write a paper simply showing its functions. We have that from CRAN (or other repository).

  • Make sure you know what’s available: How did people do it before? Is there a competing package? How does your code improve on it?

  • A bibliometric study is mandatory: Same as the previous point. Looking at previously published research papers, can you find out how they handled the problem your code solves?

  • Not everyone uses R, so make it easier for people to use your software: Make sure you keep the code simple and accessible. Explain what R is and why one should use it. Case in point: not everyone knows what a tibble is.

  • Think about your example of usage: You should always include a reproducible usage example. This is what everyone will try! Make sure it is a simple example, not too deep in the literature, something everyone can understand. Your code should also be accessible and reproducible.

It is a lot of work to publish a research paper about code, but it is all worth it! The impact is much greater than that of a standard research paper, and your academic career will certainly move forward with it.

To leave a comment for the author, please follow the link and comment on their blog: Marcelo S. Perlin. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Distilled News

Essentials of Deep Learning – Sequence to Sequence modelling with Attention (using python)

Deep Learning at scale is disrupting many industries by creating chatbots and bots never seen before. On the other hand, a person just starting out in Deep Learning would read about the basics of neural networks and their various architectures, like CNNs and RNNs. But there seems to be a big jump from the simple concepts to industrial applications of Deep Learning. Concepts such as Batch Normalization, Dropout and Attention are almost a requirement for building deep learning applications. In this article, we will cover two important concepts used in current state-of-the-art applications in Speech Recognition and Natural Language Processing – viz. Sequence to Sequence modelling and Attention models. Just to give you a sneak peek of the potential of these two techniques – Baidu’s AI system uses them to clone your voice. It replicates a person’s voice from just three seconds of training audio. You can check out some audio samples provided by Baidu’s research team which consist of original and synthesized voices.

Top 5 Data Science & Machine Learning Repositories on GitHub in Feb 2018

• FastPhotoStyle
• Twitter Scraper
• Handwriting Synthesis
• ENAS PyTorch
• Sign Language

A Simple Introduction to Complex Stochastic Processes – Part 2

In my first article on this topic (see here) I introduced some of the complex stochastic processes used by Wall Street data scientists, using a simple approach that can be understood by people with no statistics background other than a first course such as stats 101. I defined and illustrated the continuous Brownian motion (the mother of all these stochastic processes) using approximations by discrete random walks, simply re-scaling the X-axis and the Y-axis appropriately, and making time increments (the X-axis) smaller and smaller, so that the limiting process is a time-continuous one. This was done without using any complicated mathematics such as measure theory or filtrations. Here I am going one step further, introducing the integral and derivative of such processes, using rudimentary mathematics. All the articles that I’ve found on this subject are full of complicated equations and formulas. That is not the case here. Not only do I explain this material in simple English, but I also provide pictures to show what an integrated Brownian motion looks like (I could not find such illustrations in the literature) and how to compute its variance, and I focus on applications, especially to number theory, Fintech and cryptography problems. Along the way, I discuss moving averages in a theoretical but basic framework (again with pictures), discussing what the optimal window should be for these (time-continuous or discrete) time series.
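The random-walk approximation described above is easy to sketch in code. The following is a minimal illustration (not the article's own code): a walk with `n` steps of size `1/n` and increments of ±sqrt(1/n), whose cumulative sum converges to a standard Brownian motion on [0, 1]; its running integral gives an integrated Brownian motion.

```python
import numpy as np

def brownian_path(n_steps: int, seed: int = 0) -> np.ndarray:
    """Approximate a standard Brownian motion on [0, 1] by a rescaled
    discrete random walk with +/- sqrt(dt) increments."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    increments = rng.choice([-1.0, 1.0], size=n_steps) * np.sqrt(dt)
    return np.concatenate([[0.0], np.cumsum(increments)])

n = 10_000
path = brownian_path(n)
# The "integrated Brownian motion" is the running integral of the path,
# approximated here by a cumulative sum times the time step.
integrated = np.cumsum(path) * (1.0 / n)
```

Making `n` larger makes the discrete path visually indistinguishable from a time-continuous one, which is exactly the limiting argument the article relies on.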

Neural network classification of data using Smile

Data classification is the central data-mining technique used for sorting data, understanding data and performing outcome predictions. In this small blog we will use a library called Smile that includes many methods for supervised and unsupervised data classification. We will write small Python-like code using Jython to build a complex Multilayer Perceptron Neural Network for data classification. It will have a large number of inputs and several outputs, and can be easily extended to cases with many hidden layers. We will write a few lines of Jython code (most of our coding will deal with how to prepare an interface for reading data, rather than with neural network programming).

Introduction to Numpy – Part I

Numpy is a math library for Python. It enables us to do computation (on arrays, matrices, tensors, etc.) efficiently and effectively. In this article, I’m just going to introduce you to the basics of what is mostly required for Machine Learning and Data Science (and Deep Learning!).
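A few lines are enough to show the kind of operations such an introduction typically starts from (this is a generic sketch, not code from the article): array creation, elementwise arithmetic, matrix products and axis-wise reductions.

```python
import numpy as np

# Array creation and basic shapes
a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.ones((2, 2))

elementwise = a * b            # elementwise product, same shape as a
matmul = a @ b                 # matrix product
column_means = a.mean(axis=0)  # reduce over rows: one mean per column
```

These vectorized operations run in compiled code, which is why NumPy is so much faster than looping over Python lists.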

Introduction to Markov Chains

Markov chains are a fairly common, and relatively simple, way to statistically model random processes. They have been used in many different domains, ranging from text generation to financial modeling. A popular example is r/SubredditSimulator, which uses Markov chains to automate the creation of content for an entire subreddit. Overall, Markov Chains are conceptually quite intuitive, and are very accessible in that they can be implemented without the use of any advanced statistical or mathematical concepts. They are a great way to start learning about probabilistic modeling and data science techniques.
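As the blurb notes, Markov chains can be implemented without any advanced machinery. Here is a minimal sketch with a hypothetical two-state weather chain (the states and probabilities are illustrative, not from the article): the next state depends only on the current one.

```python
import random

# Hypothetical transition probabilities: P(next state | current state)
transitions = {
    "sunny": [("sunny", 0.8), ("rainy", 0.2)],
    "rainy": [("sunny", 0.4), ("rainy", 0.6)],
}

def simulate(start: str, n_steps: int, seed: int = 42) -> list:
    """Sample a trajectory of the chain, one transition per step."""
    rng = random.Random(seed)
    state, trajectory = start, [start]
    for _ in range(n_steps):
        states, probs = zip(*transitions[state])
        state = rng.choices(states, weights=probs)[0]
        trajectory.append(state)
    return trajectory

weather = simulate("sunny", 10)
```

Text generators like r/SubredditSimulator use exactly this structure, with words or tokens as states and transition probabilities estimated from a corpus.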

A Beginner’s Guide to Data Engineering – Part II

In A Beginner’s Guide to Data Engineering – Part I, I explained that an organization’s analytics capability is built layer upon layer. From collecting raw data and building data warehouses to applying Machine Learning, we saw why data engineering plays a critical role in all of these areas. One of any data engineer’s most highly sought-after skills is the ability to design, build, and maintain data warehouses. I defined what data warehousing is and discussed its three common building blocks – Extract, Transform, and Load, where the name ETL comes from. For those who are new to ETL processes, I introduced a few popular open source frameworks built by companies like LinkedIn, Pinterest, Spotify, and highlighted Airbnb’s own open-sourced tool Airflow. Finally, I argued that data scientists can learn data engineering much more effectively with the SQL-based ETL paradigm.

How to train and deploy deep learning at scale

We discussed using and deploying deep learning at scale. This is an empirical era for machine learning, and, as I noted in an earlier article, as successful as deep learning has been, our level of understanding of why it works so well is still lacking. In practice, machine learning engineers need to explore and experiment using different architectures and hyperparameters before they settle on a model that works for their specific use case. Training a single model usually involves big (labeled) data and big models; as such, exploring the space of possible model architectures and parameters can take days, weeks, or even months. Talwalkar has spent the last few years grappling with this problem as an academic researcher and as an entrepreneur. In this episode, he describes some of his related work on hyperparameter tuning, systems, and more.

Continue Reading…


Read More

Gradients explode - Deep Networks are shallow - ResNet explained

So last night at the Paris Machine Learning meetup, we had the good folks from Snips making an announcement on the release/open sourcing of their Natural language Understanding code. Joseph also mentioned that after many architectures search, a simple CRF model, a single layer model, did as well as other commercial models. It's NLP so the representability issue has already been parsed. In a different corner of the galaxy, the following paper seems to suggest that ResNets, while rendering these deep networks effectively shallower, do not solve the gradient explosion problem. 

Abstract: Whereas it is believed that techniques such as Adam, batch normalization and, more recently, SeLU nonlinearities ``solve'' the exploding gradient problem, we show that this is not the case and that in a range of popular MLP architectures, exploding gradients exist and that they limit the depth to which networks can be effectively trained, both in theory and in practice. We explain why exploding gradients occur and highlight the {\it collapsing domain problem}, which can arise in architectures that avoid exploding gradients. ResNets have significantly lower gradients and thus can circumvent the exploding gradient problem, enabling the effective training of much deeper networks, which we show is a consequence of a surprising mathematical property. By noticing that {\it any neural network is a residual network}, we devise the {\it residual trick}, which reveals that introducing skip connections simplifies the network mathematically, and that this simplicity may be the major cause for their success.
TL;DR: We show that in contrast to popular wisdom, the exploding gradient problem has not been solved and that it limits the depth to which MLPs can be effectively trained. We show why gradients explode and how ResNet handles them.

In this work we propose a novel interpretation of residual networks showing that they can be seen as a collection of many paths of differing length. Moreover, residual networks seem to enable very deep networks by leveraging only the short paths during training. To support this observation, we rewrite residual networks as an explicit collection of paths. Unlike traditional models, paths through residual networks vary in length. Further, a lesion study reveals that these paths show ensemble-like behavior in the sense that they do not strongly depend on each other. Finally, and most surprisingly, most paths are shorter than one might expect, and only the short paths are needed during training, as longer paths do not contribute any gradient. For example, most of the gradient in a residual network with 110 layers comes from paths that are only 10-34 layers deep. Our results reveal one of the key characteristics that seem to enable the training of very deep networks: residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks.

Deep Residual Learning for Image Recognition by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
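The skip connection these abstracts revolve around is simple to state. Below is a minimal NumPy sketch (not the paper's implementation): a residual block computes x + f(x) instead of f(x), so the identity mapping is the default and the gradient can flow through the addition unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))  # a tiny illustrative weight matrix

def residual_block(x: np.ndarray) -> np.ndarray:
    """Output x + f(x): the block only learns the residual f."""
    f = np.maximum(W @ x, 0.0)  # f(x) = ReLU(Wx), a toy stand-in for conv layers
    return x + f                # the skip connection adds the input back

x = np.ones(4)
y = residual_block(x)
```

Stacking such blocks is what produces the "collection of paths" view: at each block the signal may pass through f or skip it, so an n-block network contains 2^n paths of varying depth.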

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

Continue Reading…


Read More

Whats new on arXiv

A fluctuation theorem for time-series of signal-response models with the backward transfer entropy

The irreversibility of trajectories in stochastic dynamical systems is linked to the structure of their causal representation in terms of Bayesian networks. We consider stochastic maps resulting from a time discretization with interval τ of signal-response models, and we find an integral fluctuation theorem that sets the backward transfer entropy as a lower bound to the conditional entropy production. We apply this to a linear signal-response model providing analytical solutions, and to a nonlinear model of receptor-ligand systems. We show that the observational time τ has to be fine-tuned for an efficient detection of the irreversibility in time-series.

Principal Component Analysis with Tensor Train Subspace

Tensor train is a hierarchical tensor network structure that helps alleviate the curse of dimensionality by parameterizing large-scale multidimensional data via a network of low-rank tensors. Associated with such a construction is the notion of a Tensor Train subspace, and in this paper we propose a TT-PCA algorithm for estimating this structured subspace from the given data. By maintaining the low-rank tensor structure, TT-PCA is more robust to noise compared with PCA or Tucker-PCA. This is borne out numerically by testing the proposed approach on the Extended YaleFace Dataset B.

Fractal AI: A fragile theory of intelligence

Fractal AI is a theory for general artificial intelligence. It allows one to derive new mathematical tools that constitute the foundations for a new kind of stochastic calculus, by modelling information using cellular automaton-like structures instead of smooth functions. In the repository included we are presenting a new Agent, derived from the first principles of the theory, which is capable of solving Atari games several orders of magnitude more efficiently than other similar techniques, like Monte Carlo Tree Search. The code provided shows how it is now possible to beat some of the current state-of-the-art benchmarks on Atari games, without previous learning and using fewer than 1000 samples to calculate each of the actions, where standard MCTS uses 3 million samples. Among other things, Fractal AI makes it possible to generate a huge database of top-performing examples with very little computation, transforming Reinforcement Learning into a supervised problem. The algorithm presented is capable of solving the exploration vs exploitation dilemma in both the discrete and continuous cases, while maintaining control over any aspect of the behavior of the Agent. From a general approach, the new techniques presented here have direct applications to other areas such as: non-equilibrium thermodynamics, chemistry, quantum physics, economics, information theory, and non-linear control theory.

Neural Lattice Language Models

In this work, we propose a new language modeling paradigm that has the ability to perform both prediction and moderation of information flow at multiple granularities: neural lattice language models. These models construct a lattice of possible paths through a sentence and marginalize across this lattice to calculate sequence probabilities or optimize parameters. This approach allows us to seamlessly incorporate linguistic intuitions – including polysemy and existence of multi-word lexical items – into our language model. Experiments on multiple language modeling tasks show that English neural lattice language models that utilize polysemous embeddings are able to improve perplexity by 9.95% relative to a word-level baseline, and that a Chinese model that handles multi-character tokens is able to improve perplexity by 20.94% relative to a character-level baseline.

Algorithmic Social Intervention

Social and behavioral interventions are a critical tool for governments and communities to tackle deep-rooted societal challenges such as homelessness, disease, and poverty. However, real-world interventions are almost always plagued by limited resources and limited data, which creates a computational challenge: how can we use algorithmic techniques to enhance the targeting and delivery of social and behavioral interventions? The goal of my thesis is to provide a unified study of such questions, collectively considered under the name ‘algorithmic social intervention’. This proposal introduces algorithmic social intervention as a distinct area with characteristic technical challenges, presents my published research in the context of these challenges, and outlines open problems for future work. A common technical theme is decision making under uncertainty: how can we find actions which will impact a social system in desirable ways under limitations of knowledge and resources? The primary application area for my work thus far is public health, e.g. HIV or tuberculosis prevention. For instance, I have developed a series of algorithms which optimize social network interventions for HIV prevention. Two of these algorithms have been pilot-tested in collaboration with LA-area service providers for homeless youth, with preliminary results showing substantial improvement over status-quo approaches. My work also spans other topics in infectious disease prevention and underlying algorithmic questions in robust and risk-aware submodular optimization.

Ranking with Adaptive Neighbors

Retrieving the most similar objects in a large-scale database for a given query is a fundamental building block in many application domains, ranging from web searches, visual, cross media, and document retrievals. State-of-the-art approaches have mainly focused on capturing the underlying geometry of the data manifolds. Graph-based approaches, in particular, define various diffusion processes on weighted data graphs. Despite success, these approaches rely on fixed-weight graphs, making ranking sensitive to the input affinity matrix. In this study, we propose a new ranking algorithm that simultaneously learns the data affinity matrix and the ranking scores. The proposed optimization formulation assigns adaptive neighbors to each point in the data based on the local connectivity, and the smoothness constraint assigns similar ranking scores to similar data points. We develop a novel and efficient algorithm to solve the optimization problem. Evaluations using synthetic and real datasets suggest that the proposed algorithm can outperform the existing methods.

Adversarial Data Programming: Using GANs to Relax the Bottleneck of Curated Labeled Data

Paucity of large curated hand-labeled training data for every domain-of-interest forms a major bottleneck in the deployment of machine learning models in computer vision and other fields. Recent work (Data Programming) has shown how distant supervision signals in the form of labeling functions can be used to obtain labels for given data in near-constant time. In this work, we present Adversarial Data Programming (ADP), an adversarial methodology to generate data as well as curated aggregated labels, given a set of weak labeling functions. We validated our method on the MNIST, Fashion MNIST, CIFAR 10 and SVHN datasets, and it outperformed many state-of-the-art models. We conducted extensive experiments to study its usefulness, and showed how the proposed ADP framework can be used for transfer learning as well as multi-task learning, where data from two domains are generated simultaneously using the framework along with the label information. Our future work will involve understanding the theoretical implications of this new framework from a game-theoretic perspective, as well as exploring the performance of the method on more complex datasets.

Latent Tree Variational Autoencoder for Joint Representation Learning and Multidimensional Clustering

Recently, deep learning based clustering methods have been shown superior to traditional ones by jointly conducting representation learning and clustering. These methods rely on the assumptions that the number of clusters is known, and that there is one single partition over the data and all attributes define that partition. However, in real-world applications, prior knowledge of the number of clusters is usually unavailable and there are multiple ways to partition the data based on subsets of attributes. To resolve these issues, we propose the latent tree variational autoencoder (LTVAE), which simultaneously performs representation learning and multidimensional clustering. LTVAE learns latent embeddings from data, discovers multi-facet clustering structures based on subsets of latent features, and automatically determines the number of clusters in each facet. Experiments show that the proposed method achieves state-of-the-art clustering performance and reveals reasonable multi-facet structures of the data.

Algebraic Machine Learning

Machine learning algorithms use error function minimization to fit a large set of parameters in a preexisting model. However, error minimization eventually leads to a memorization of the training dataset, losing the ability to generalize to other datasets. To achieve generalization something else is needed, for example a regularization method or stopping the training when error in a validation dataset is minimal. Here we propose a different approach to learning and generalization that is parameter-free, fully discrete and that does not use function minimization. We use the training data to find an algebraic representation with minimal size and maximal freedom, explicitly expressed as a product of irreducible components. This algebraic representation is shown to directly generalize, giving high accuracy in test data, more so the smaller the representation. We prove that the number of generalizing representations can be very large and the algebra only needs to find one. We also derive and test a relationship between compression and error rate. We give results for a simple problem solved step by step, hand-written character recognition, and the Queens Completion problem as an example of unsupervised learning. As an alternative to statistical learning, ‘algebraic learning’ may offer advantages in combining bottom-up and top-down information, formal concept derivation from data and large-scale parallelization.

Newton-type Alternating Minimization Algorithm for Convex Optimization

We propose NAMA (Newton-type Alternating Minimization Algorithm) for solving structured nonsmooth convex optimization problems where the sum of two functions is to be minimized, one being strongly convex and the other composed with a linear mapping. The proposed algorithm is a line-search method over a continuous, real-valued, exact penalty function for the corresponding dual problem, which is computed by evaluating the augmented Lagrangian at the primal points obtained by alternating minimizations. As a consequence, NAMA relies on exactly the same computations as the classical alternating minimization algorithm (AMA), also known as the dual proximal gradient method. Under standard assumptions the proposed algorithm possesses strong convergence properties, while under mild additional assumptions the asymptotic convergence is superlinear, provided that the search directions are chosen according to quasi-Newton formulas. Due to its simplicity, the proposed method is well suited for embedded applications and large-scale problems. Experiments show that using limited-memory directions in NAMA greatly improves the convergence speed over AMA and its accelerated variant.

LCANet: End-to-End Lipreading with Cascaded Attention-CTC
Inference on a Distribution from Noisy Draws
On the Algebra in Boole’s Laws of Thought
Closure Operators and Spam Resistance for PageRank
Conditional Activation for Diverse Neurons in Heterogeneous Networks
A Probabilistic Disease Progression Model for Predicting Future Clinical Outcome
Decentralised Learning in Systems with Many, Many Strategic Agents
Spin-glass–like aging in colloidal and granular glasses
Limiting probabilities for vertices of a given rank in rooted trees
Variational zero-inflated Gaussian processes with sparse kernels
Controlled Islanding via Weak Submodularity
Learning to Explore with Meta-Policy Gradient
Analysis of Nonautonomous Adversarial Systems
Monochromatic loose paths in multicolored $k$-uniform cliques
Investigating the Effect of Music and Lyrics on Spoken-Word Recognition
Discussion on Bayesian Cluster Analysis: Point Estimation and Credible Balls by Sara Wade and Zoubin Ghahramani
Hot-Stuff the Linear, Optimal-Resilience, One-Message BFT Devil
A Multi-Modal Approach to Infer Image Affect
Development of Safety Performance Functions: Incorporating Unobserved Heterogeneity and Functional Form Analysis
Revisiting Salient Object Detection: Simultaneous Detection, Ranking, and Subitizing of Multiple Salient Objects
Block Diagonally Dominant Positive Definite Sub-optimal Filters and Smoothers
The $\mathbb{Z}_2$-genus of Kuratowski minors
Smoothing Spline Growth Curves With Covariates
Symbol-level precoding is symbol-perturbed ZF when energy Efficiency is sought
Noisy Adaptive Group Testing: Bounds and Algorithms
Model-Agnostic Private Learning via Stability
Robustness to incorrect priors in partially observed stochastic control
Bucket Renormalization for Approximate Inference
PT-Spike: A Precise-Time-Dependent Single Spike Neuromorphic Architecture with Efficient Supervised Learning
Uplift Modeling from Separate Labels
MT-Spike: A Multilayer Time-based Spiking Neuromorphic Architecture with Temporal Error Backpropagation
Topology guaranteed segmentation of the human retina from OCT using convolutional neural networks
Linear Quadratic Optimal Control and Stabilization for Discrete-time Markov Jump Linear Systems
Defensive Collaborative Multi-task Training – Defending against Adversarial Attack towards Deep Neural Networks
Damped Newton’s Method on Riemannian Manifolds
Feature Selection and Model Comparison on Microsoft Learning-to-Rank Data Sets
Signal Processing and Piecewise Convex Estimation
Feature extraction without learning in an analog Spatial Pooler memristive-CMOS circuit design of Hierarchical Temporal Memory
Neuron inspired data encoding memristive multi-level memory cell
Nearly defect-free dynamical models of disordered solids: The case of amorphous silicon
Network Coding for Real-time Wireless Communication for Automation
Bernstein type inequalities for self-normalized martingales with applications
The 2017 AIBIRDS Competition
Multiplicative Updates for Elastic Net Regularized Convolutional NMF Under $β$-Divergence
How to evaluate sentiment classifiers for Twitter time-ordered data?
A curious class of Hankel determinants
Fast generalised linear models by database sampling and one-step polishing
1D Mott variable-range hopping with external field
A generalization of the steepest-edge rule and its number of simplex iterations for a nondegenerate LP
xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
Multi-objective Analysis of MAP-Elites Performance
Localization due to topological stochastic disorder in active networks
Can Autism be Catered with Artificial Intelligence-Assisted Intervention Technology? A Literature Review
Approximative Theorem of Incomplete Riemann-Stieltjes Sum of Stochastic Integral
Determinantal elliptic Selberg integrals
Interlocking permutations
Higher order concentration in presence of Poincaré-type inequalities
Spatio-temporal Deep De-aliasing for Prospective Assessment of Real-time Ventricular Volumes
EdgeStereo: A Context Integrated Residual Pyramid Network for Stereo Matching
The Value of Reactive Power for Voltage Control in Lossy Networks
Enhancing Favorable Propagation in Cell-Free Massive MIMO Through Spatial User Grouping
Combining Multi-level Contexts of Superpixel using Convolutional Neural Networks to perform Natural Scene Labeling
The complete enumeration of 4-polytopes and 3-spheres with nine vertices
Building Sparse Deep Feedforward Networks using Tree Receptive Fields
LivDet 2017 Fingerprint Liveness Detection Competition 2017
Deep Image Demosaicking using a Cascade of Convolutional Residual Denoising Networks
MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge
A complex network framework to model cognition: unveiling correlation structures from connectivity
Stochastic Dynamic Utilities and Inter-Temporal Preferences
On the connectivity threshold for colorings of random graphs and hypergraphs
Identifiability of Undirected Dynamical Networks: a Graph-Theoretic Approach
The skeleton of the UIPT, seen from infinity
A mean-field game model for homogeneous flocking
Addressing the Challenges in Federating Edge Resources
Lovász extension and graph cut
Face-MagNet: Magnifying Feature Maps to Detect Small Faces
Products and Projective Limits of Continuous Valuations on $T_0$ Spaces
Learning to Play General Video-Games via an Object Embedding Network
All Graphs are S^n-Synchronizing; What About St(p, n)?
Rotation-Sensitive Regression for Oriented Scene Text Detection
On the Ambiguity of Registration Uncertainty
Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning
Constant delay algorithms for regular document spanners
Secure SWIPT for Directional Modulation Aided AF Relaying Networks
A Unified View of False Discovery Rate Control: Reconciliation of Bayesian and Frequentist Approaches
Domain Adaptation on Graphs by Learning Aligned Graph Bases
On the Security of Some Compact Keys for McEliece Scheme
A quantitative analysis of the 2017 Honduran election and the argument used to defend its outcome
Joint Modelling of Location, Scale and Skewness Parameters of the Skew Laplace Normal Distribution
A generalization of Croot-Lev -Pach’s Lemma and a new upper bound for the size of difference sets in polynomial rings
Euler-Lagrangian approach to 3D stochastic Euler equations
Efficient Realization of Givens Rotation through Algorithm-Architecture Co-design for Acceleration of QR Factorization
Predicting Oral Disintegrating Tablet Formulations by Neural Network Techniques
Measurement-based adaptation protocol with quantum reinforcement learning
Illumination-aware Faster R-CNN for Robust Multispectral Pedestrian Detection
Optimal Bounds for Johnson-Lindenstrauss Transformations
FEVER: a large-scale dataset for Fact Extraction and VERification
Approximating Generalized Network Design under (Dis)economies of Scale with Applications to Energy Efficiency
Complex activity patterns generated by short-term synaptic plasticity
Rigid reflections and Kac–Moody algebras
Greedy can also beat pure dynamic programming
Familywise error control in multi-armed response-adaptive trials
Towards Monocular Digital Elevation Model (DEM) Estimation by Convolutional Neural Networks – Application on Synthetic Aperture Radar Images
LSH Microbatches for Stochastic Gradients: Value in Rearrangement
Computational Techniques for the Analysis of Small Signals in High-Statistics Neutrino Oscillation Experiments
On the Universal Approximation Property and Equivalence of Stochastic Computing-based Neural Networks and Binary Neural Networks
Constructing Imperfect Recall Abstractions to Solve Large Extensive-Form Games
Shift-invert diagonalization of large many-body localizing spin chains
$H$-colouring $P_t$-free graphs in subexponential time
Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning
Image Colorization with Generative Adversarial Networks
Approximate Query Matching for Image Retrieval
Imitation Learning with Concurrent Actions in 3D Games
Additive quantile regression for clustered data with an application to children’s physical activity
Optimal energy decay for the wave-heat system on a rectangular domain
Averaging Weights Leads to Wider Optima and Better Generalization
Maximum likelihood drift estimation for a threshold diffusion
On trend and its derivatives estimation in repeated time series with subordinated long-range dependent errors
Generalised Structural CNNs (SCNNs) for time series data with arbitrary graph-topologies
Totally Ordered Measured Trees and Splitting Trees with Infinite Variation II: Prolific Skeleton Decomposition

Continue Reading…


Read More

Book Memo: “Developing Bots with Microsoft Bots Framework”

Create Intelligent Bots using MS Bot Framework and Azure Cognitive Services
Develop Intelligent Bots using Microsoft Bot framework (C# and Node.js), Visual Studio Enterprise & Code, Microsoft Azure and Cognitive Services. This book shows you how to develop great Bots, publish to Azure and register with the Bot portal so that customers can connect and communicate using popular communication channels like Skype, Slack, Web and Facebook. You’ll also learn how to build intelligence into Bots using Azure Cognitive Services like LUIS, OCR, Speech to Text and Web Search. Bots are the new face of user experience. Conversational User Interfaces provide many options to make the user experience richer, more innovative and engaging, with email, text, buttons or voice as the medium for communication. Modern line-of-business applications can be replaced or augmented with Intelligent Bots that use data/history combined with Machine Intelligence to make the user experience inclusive and exciting.

Continue Reading…


Read More

Book Memo: “Artificial Intelligence and Games”

This is the first textbook dedicated to explaining how artificial intelligence (AI) techniques can be used in and for games. After introductory chapters that explain the background and key techniques in AI and games, the authors explain how to use AI to play games, to generate content for games and to model players. The book will be suitable for undergraduate and graduate courses in games, artificial intelligence, design, human-computer interaction, and computational intelligence, and also for self-study by industrial game developers and practitioners. The authors have developed a website that complements the material covered in the book with up-to-date exercises, lecture slides and further reading.

Continue Reading…


Read More

R Packages worth a look

Miscellaneous Functions (miscF)
Various functions for random number generation, density estimation, classification, curve fitting, and spatial data analysis.

Random Graph Clustering (mixer)
Estimates the parameters, the clusters, as well as the number of clusters of a (binary) stochastic block model (J.-J Daudin, F. Picard, S. Robin (2008) <doi:10.1007/s11222-007-9046-7>).

Simulate Dynamic Networks using Exponential Random Graph Models (ERGM) Family (dnr)
Functions are provided to fit temporal lag models to dynamic networks. The models are built on top of the exponential random graph model (ERGM) framework. There are functions for simulating or forecasting networks at future time points. See Mallik and Almquist (2017, under review), “Stable Multiple Time Step Simulation/Prediction from Lagged Dynamic Network Regression Models”.

Analysis of Repeatability and Reproducibility Studies with Ordinal Measurements (ordinalRR)
Implements Bayesian data analyses of balanced repeatability and reproducibility studies with ordinal measurements. Model fitting is based on MCMC posterior sampling with ‘rjags’. Function ordinalRR() directly carries out the model fitting, and this function has the flexibility to allow the user to specify key aspects of the model, e.g., fixed versus random effects. Functions for preprocessing data and for the numerical and graphical display of a fitted model are also provided. There are also functions for displaying the model at fixed (user-specified) parameters and for simulating a hypothetical data set at a fixed (user-specified) set of parameters for a random-effects rater population. For additional technical details, refer to Culp, Ryan, Chen, and Hamada (2018) and cite this Technometrics paper when referencing any aspect of this work. The demo of this package reproduces results from the Technometrics paper.

Maximizing the Adjusted AUC (maxadjAUC)
Fits a linear combination of predictors by maximizing a smooth approximation to the estimated covariate-adjusted area under the receiver operating characteristic curve (AUC) for a discrete covariate. (Meisner, A, Parikh, CR, and Kerr, KF (2017) <http://…/>.)

Continue Reading…


Read More

Thanking Your Reviewers: Gratitude through Semantic Metadata

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

At rOpenSci, our R package peer review process relies on the hard work of many volunteer reviewers. These community members donate their time and expertise to improving the quality of rOpenSci packages and helping drive best practices into scientific software.

Our open review process, where reviews and reviewers are public, means that one benefit for reviewers is that they can get credit for their reviews. We want reviewers to see as much benefit as possible, and for their contributions to be recorded as part of the intellectual trail of academic work, so we have been working at making reviews visible and discoverable.

That is why we are very excited about a tiny change in yesterday’s release of R 3.4.4.

If you are running R 3.4.3, and type utils:::MARC_relator_db_codes_used_with_R into the console, you get this:

> utils:::MARC_relator_db_codes_used_with_R
 [1] "aut" "com" "ctr" "ctb" "cph" "cre" "dtc" "fnd" "ths" "trl"

Under 3.4.4, you get this:

> utils:::MARC_relator_db_codes_used_with_R
 [1] "aut" "com" "ctr" "ctb" "cph" "cre" "dtc" "fnd" "rev" "ths" "trl"

What’s that little "rev" that shows up, third from right? It’s the official inclusion of “Reviewer” as an R package author role! 🎉

These three-letter codes come from the MARC (Machine-Readable Cataloging) terms vocabulary, a standard set of authorship types originally created for some of the first computerized library systems. R uses these codes to distinguish between different types of package authors. You may be familiar with some of these terms that show up in DESCRIPTION files, like so:

Authors@R: person("Scott", "Chamberlain", role = c("aut", "cre"), 
       comment = c(ORCID = "0000-0003-1444-9135"))

Here aut and cre stand for “Author” and “Creator”, indicating that Scott is the original and major creator of a package. You may have also seen ctb (Contributor) or cph (Copyright Holder1).

Standard descriptors like this are important because they allow information about authorship to be machine-readable and credit for authors’ work to be cataloged and transferred. When metadata about R packages is displayed in help files or on websites, it’s clear what role everyone has played. Such metadata is also critical to transitive credit, the important task of tracking contributions through chains of dependencies so as to provide recognition to software developers and data providers that the traditional citation system often misses.
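To make that machine-readability concrete, here is a minimal sketch, using only base R’s person objects, of how role metadata can be consumed programmatically. The names are illustrative, not real package authors:

```r
# Person objects carry roles as structured data, so tools can filter and
# report on them without parsing free text.
authors <- c(
  person("Ada", "Example", role = c("aut", "cre")),
  person("Bea", "Reviewer", role = "rev")
)

# Pick out everyone credited as a reviewer.
is_rev <- vapply(unclass(authors), function(p) "rev" %in% p$role, logical(1))
format(authors[is_rev])
```

This is the same mechanism a citation or credit-reporting tool would use on the parsed Authors@R field of a DESCRIPTION file.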

While there are many more2 MARC relator terms, R only allows a small set that makes sense in the context of software packages. These are found in utils:::MARC_relator_db_codes_used_with_R. Codes outside this set won’t pass R CMD check and are not allowed on CRAN.
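For reference, the human-readable descriptions behind those three-letter codes can be looked up in the relator table that ships with utils. Both objects are unexported internals, so treat this as a sketch that may break in a future R release; it assumes the table has code and term columns, as it does in R 3.4.x:

```r
# Map R's allowed author codes to their human-readable MARC terms.
codes <- utils:::MARC_relator_db_codes_used_with_R
db <- utils:::MARC_relator_db
db[db$code %in% codes, c("code", "term")]

# Quick check: does this R version already accept "rev"?
"rev" %in% codes
```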

We believe peer reviewers make an important contribution to the quality of published software. That’s why last year we requested that R-core add "rev" (Reviewer) to the list of allowed contributor types. And lo and behold, Kurt Hornik made the change on our behalf3. It is now in the release version of R.

Since CRAN uses the development version of R to check and build packages, the option has actually been available on CRAN for a while. A trickle of authors have already been acknowledging peer reviewers in this way in their package DESCRIPTION files.

We hope to see adoption of reviewer acknowledgement in package metadata beyond rOpenSci. It can be adopted by authors who submit to JSS, JOSS, or any journal or process where reviewers make significant comments on software code or documentation. For non-R software, we’re working on including reviewers in codemeta, a cross-language software metadata standard.

A few notes about how this development relates specifically to rOpenSci’s peer-review process:

  • First, it is 100% the choice of package authors to decide whether reviewers made a sufficient contribution to be included in Authors in this way. While we promote this option in general, we’ll never ask an author to specifically include a reviewer. Like a manuscript’s acknowledgements section, the Author section is under developer control. It is also up to reviewers whether they want to be included, so package authors should ask reviewers first.

  • Second, rOpenSci editors should not be listed under Authors. "edt" (Editor) is not a valid R authorship role, and we are a step too far removed to be included. But we are flattered by those who have asked.

  • Finally, if you do include reviewers in this way, we think it’s best practice to include information linking back to the review, like so:

    person("Bea", "Hernández", role = "rev",
           comment = "Bea reviewed the package for rOpenSci, see ...")

We are very excited about this development and how it can improve incentives for peer review. Thanks to R-core for getting aboard with this, and the early adopters who tested it!


    person("Noam", "Ross", role = c("aut", "cre", "lbt")),
    person("Maëlle", "Salmon", role = c("rev", "med"),
           comment = "Comments to improve structure of the introduction"),
    person("Karthik", "Ram", role = c("rev", "elt"),
           comment = "Fixed a small typo"),
    person("Scott", "Chamberlain", role = c("rev", "sce"),
           comment = "Agrees with Maëlle about the intro.")

  1. I can’t get through this post without mentioning that Her Majesty the Queen in Right of Canada, as represented by the Minister of Natural Resources Canada, is cph on eight CRAN packages. 👑
  2. Found here or as a handy data frame with descriptions in utils:::MARC_relator_db
  3. R-core also added "fnd" (Funder) in R 3.4.3.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

March 15, 2018

Magister Dixit

“Big data should complement small data, not replace them.” Rob Kitchin

Continue Reading…


Read More

Teaching Computers to be Fair

My newest Bloomberg View piece just came out:

How to Teach a Computer What ‘Fair’ Means

If we’re going to rely on algorithms, we’ll have to figure it out.


To read all of my Bloomberg View pieces, go here.

Continue Reading…


Read More

Apple: Data Scientist

Seeking an outstanding data scientist who is interested in building and maintaining analytical solutions that have direct and measurable impact to Apple.

Continue Reading…


Read More

Apple: Commerce Data Scientist – Apple Media Products

Seeking a talented, experienced Applied Researcher/Data Scientist to work on high visibility projects that affect millions of customers globally.

Continue Reading…


Read More

Document worth reading: “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”

For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Continue Reading…


Read More

We Need To Talk About Dashboards

Hey everyone, gather ’round. We need to talk about dashboards.

For C-level executives, dashboard reports are essential. Executives don’t have time to review details for every decision they make; they just want to consume a report that has red, yellow, and green to help them make decisions for the day. But the need for such dashboards is also true for cubicle-dwelling system administrators. They also need dashboards to help them understand where to focus their efforts daily in order to keep operations running.

I’m here today to tell you that your dashboards are a failure.


Why Dashboards Fail

More than 90% of the data in the world has been created in the past two years.

Don’t take my word for it, though. I’m just citing a statement made in this article. Published more than four years ago, that article is the cited source in presentations on how data continues to grow at an exponential rate.

I believe we continue to create, and curate, data at an accelerated pace with each passing year. Today we have access to more data than ever before. Everyone I meet will say they manage more data today than a year ago.

The data explosion has given rise to a never-ending marketing lexicon. The first one I remember being used widely was data warehouse. That was soon followed by a data mart. Today we also have a data factory and a data lake, which is a nice feature to have next to our data estate, built with data bricks.

With so much data available, information is cheap. Today it is easy to get data about anything. We are drowning in data, inundated with metrics with every step of our day.

The trouble with such easy access to data is this: When information is so cheap, attention becomes expensive.


I’m Looking Through You

Here’s an experiment for you to try. Watch this video and count the number of passes between the people in white shirts:

It’s an old study, and you may have seen it before. If you haven’t seen it before let me know if you are surprised by the results.

This is part of the problem with dashboards: they are being read by humans. And humans, as it turns out, can have difficulty determining what is important. The experiment helps to show how there is an area of our visual cortex that determines what is important and filters out everything else. In other words, we gain a lot of data when we focus our attention, but we can miss a gorilla staring back at us.

Focusing is a great thing for us humans, and this experiment helps to show why multi-tasking is something we shouldn’t be doing. Dashboards are meant to provide that focus. We don’t want to spend the time examining all the data streams.


Spot the Difference

Here’s another experiment for you. Remember those “spot the difference” games? Here’s why your brain is so bad at them.

When we look at a dashboard we don’t take in everything that we see. Our brains don’t bother logging details about something that is not important. Just like the gorilla. Of course, once we see it, we don’t forget it.

Dashboards that contain an overload of information require more focus, which means less information is being consumed. This is not the desired outcome.


Dashboards are a Horrible Way to Communicate

The trouble with such dashboards is that they are a horrible way to communicate.

Dashboards need data in order to exist. Good dashboards are able to communicate the story the data is trying to tell. But the data contains the details necessary for that story, and those details are often left behind. Summaries, aggregations, and averages blur the details from our view. Offering users the ability to drill-through to get the details is a workaround, but the whole point of a dashboard is to avoid having to review the details. Remember, it is better for us humans to be able to focus.

A common example I often use to explain when dashboards aren’t useful involves disk space usage. Let’s say that a disk is at 90% of capacity, and the dashboard shows a big red circle for this metric. The trouble now is that you are missing important details. A 1TB disk at 90% is a different situation than a 10TB disk at 90% full. You also need to know how full the disk was yesterday, what the growth trend has been over time, and when the disk will be completely full.

While those details might help you figure out what steps to take next, they do little for your end user. This dashboard reporting a disk at 90% has little meaning to the end user that only wants to be able to get their work done for the day.
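The arithmetic behind that disk example is easy to sketch; the sizes and growth rate below are invented for illustration:

```r
# Same "90% full" alert, very different headroom once size and growth
# are taken into account (all numbers are hypothetical).
disks <- data.frame(
  size_gb  = c(1000, 10000),   # a 1 TB disk and a 10 TB disk
  used_pct = c(90, 90),
  growth_gb_per_day = c(5, 5)  # assume roughly linear growth
)
disks$free_gb <- disks$size_gb * (1 - disks$used_pct / 100)
disks$days_until_full <- disks$free_gb / disks$growth_gb_per_day
disks[, c("size_gb", "free_gb", "days_until_full")]
# The 1 TB disk has 100 GB (about 20 days) of headroom; the 10 TB disk
# has 1000 GB (about 200 days). Both show the same red circle.
```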



Dashboards are not new; they’ve been around for years. It’s the ease with which they are created and consumed that has driven demand. You get a dashboard, and you get a dashboard, and everyone gets a dashboard. The phrase “pin it to your dashboard” has become common for users of tools such as PowerBI.

But with so much data coming across our desk each day we need the data to communicate with everyone in a way they can understand.

Saying your disk is 90% full is not nearly as effective as saying that you only have space for three more Netflix movie downloads. That’s a story that anyone can understand. Even simple things like bar charts do a better job communicating the story that data is trying to tell. And I have yet to meet a manager that doesn’t understand a bar chart.
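As a minimal illustration of that point, base R alone can turn the same metric into a trend anyone can read at a glance. The weekly figures here are made up:

```r
# A week of (invented) disk-usage readings: in a bar chart the upward
# trend is obvious, where a single "90%" badge hides it.
usage_pct <- c(Mon = 78, Tue = 81, Wed = 84, Thu = 86, Fri = 90)
barplot(usage_pct,
        ylim = c(0, 100),
        ylab = "Disk usage (%)",
        main = "Disk usage this week")
abline(h = 90, lty = 2)  # the alert threshold
```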

Those of us that work in IT are always asking for more. We want more space, more memory, more CPU, more bandwidth.

It’s time we also ask for more ways for our data to tell a story that everyone can understand.

And don’t get me started on pie charts.

The post We Need To Talk About Dashboards appeared first on Thomas LaRock.

Continue Reading…


Read More

R 3.4.4 released

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R 3.4.4 has been released, and binaries for Windows, Mac and Linux are now available for download on CRAN. This update (codenamed "Someone to Lean On", likely a Peanuts reference, though I couldn't find which one with a quick search) is a minor bugfix release, and shouldn't cause any compatibility issues with scripts or packages written for prior versions of R in the 3.4.x series.

This update improves automatic timezone detection on some systems, and fixes some unusual corner cases in the statistics library. For a complete list of the changes, check the NEWS file for R 3.4.4 or follow the link below.
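If a script depends on one of these fixes, a simple version guard makes the requirement explicit; this is a generic sketch, not something the release itself mandates:

```r
# Compare the running R version against the release that carries the fix.
if (getRversion() < "3.4.4") {
  warning("This script expects R 3.4.4 or later for the timezone fixes.")
}
getRversion()
```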

R-announce mailing list: R 3.4.4 is released

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Continue Reading…


Read More

JupyterCon 2018, NYC August 21–25

Discover how data-driven organizations are using Jupyter to analyze data, share insights, and foster practices for dynamic, reproducible data science.

I’m grateful to join Fernando Pérez and Brian Granger as a program co-chair for JupyterCon 2018. Project Jupyter, NumFOCUS, and O’Reilly Media will present the second annual JupyterCon in New York City on August 21–25.

Timing for this event couldn’t be better. The human side of data science, machine learning/AI, and scientific computing is more important than ever. This is seen in the broad adoption of data-driven decision making in human organizations of all kinds, the increasing importance of human centered design in tools for working with data, the urgency for better data insights in the face of complex socioeconomic conditions worldwide, as well as dialogue about the social issues these technologies bring to the fore: collaboration, security, ethics, data privacy, transparency, propaganda, etc.

To paraphrase our co-chairs, Brian Granger:

“Jupyter is where humans and data science intersect”

and Fernando Perez:

“The better the technology, the more important that human judgement becomes”

Consequently, we’ll explore three main themes at JupyterCon:

  • Interactive computing with data at scale: the technical best practices and organizational challenges of supporting interactive computing in companies, universities, research collaborations, etc. (JupyterHub)
  • Extensible user interfaces for data science, machine learning/AI, and scientific computing (JupyterLab)
  • Computational communication: taking the artifacts of interactive computing and communicating them to different audiences

A meta-theme which ties these together is extensible software architecture for interactive computing with data. Jupyter is built on a set of flexible, extensible, and re-usable building blocks which can be combined and assembled to address a wide range of usage cases. These building blocks are expressed through the various open protocols, APIs, and standards of Jupyter.

The Jupyter community has much to discuss and share this year. For example, success stories such as the data science program at UC Berkeley illustrate the power of JupyterHub deployments at scale in education, research and industry. As universities and enterprise firms learn to handle the technical challenges of rolling out hands-on, interactive computing at scale, a cohort of organizational challenges comes to the fore: practices regarding collaboration, security, compliance, data privacy, ethics, etc. These points are especially poignant in verticals such as healthcare, finance and education, where the handling of sensitive data is rightly constrained by ethical and legal requirements (HIPAA, FERPA, etc.). Overall, this dialogue is extremely relevant — it is happening at the intersection of contemporary political and social issues, industry concerns, new laws (GDPR), the evolution of computation, plus good storytelling and communication in general — as we’ll explore with practitioners throughout the conference.

Recent beta release of JupyterLab embodies the meta-theme of extensible software architecture for interactive computing with data. While many people think of Jupyter as a “notebook,” that’s merely one building block needed for interactive computing with data. Other building blocks include terminals, file browsers, LaTeX, markdown, rich outputs, text editors, and renderers/viewers for different data formats. JupyterLab is the next-generation user interface for Project Jupyter, and provides these different building blocks in a flexible, configurable, customizable environment. This opens the door for Jupyter users to build custom workflows, and also for organizations to extend JupyterLab with their own custom functionality.

Thousands of organizations require data infrastructure for reporting, sharing data insights, reproducing results of analytics, etc. Recent business studies estimate that more than half of all companies globally are precluded from adopting AI technologies due to a lack of digital infrastructure — often because their efforts toward data and reporting infrastructure are buried in technical debt. So much of that infrastructure was built from scratch, even when organizations needed essentially the same building blocks. JupyterLab’s primary goal is to make it routine to build highly customized, interactive computing platforms, while supporting more than 90 different popular programming environments.

Screenshot from the JupyterLab beta release. Image used with permission from Project Jupyter contributors.

A third major theme builds on top of the other two: computational communication. For data and code to be useful for humans, who need to make decisions, they have to be embedded into a narrative — a story — that can be communicated to others. Examples of this pattern include: data journalism, reproducible research and open science, computational narratives, open data in society and government, citizen science, and really any area of scientific research (physics, zoology, chemistry, astronomy, etc.), plus the range of economics, finance, and econometric forecasting.

Another growing segment of use cases involves Jupyter as a “last-mile” layer for leveraging AI resources in the cloud. This becomes especially important in light of new hardware emerging for AI needs, vying with competing demand from online gaming, virtual reality, cryptocurrency mining, etc.

Please take the following as personal opinion, observations, perspectives: We’ve reached a point where hardware appears to be evolving more rapidly than software, while software appears to be evolving more rapidly than effective process. At O’Reilly Media we work to map the emerging themes in industry, in a process nicknamed “radar”. This perspective about hardware is a theme I’ve been mapping, and meanwhile comparing notes with industry experts. A few data points to consider: Jeff Dean’s talk at NIPS 2017, “Machine Learning for Systems and Systems for Machine Learning” about comparisons of CPUs/GPUs/TPUs, and how AI is transforming the design of computer hardware; The Case for Learned Index Structures, also from Google, about the impact of “branch vs. multiple” costs on decades of database theory; this podcast interview “Scaling machine learning” with Reza Zadeh about the critical importance of hardware/software interfaces in AI apps; the video interview that Wes McKinney and I recorded at JupyterCon 2017 about how Apache Arrow presents a much different take on how to leverage hardware and distributed resources.

The notion that “hardware > software > process” contradicts the past 15–20 years of software engineering practice. It’s an inversion of the general assumptions we make. In response, industry will need to rework approaches for building software within the context of AI — which was articulated succinctly by Lenny Pruss from Amplify Partners in “Infrastructure 3.0: Building blocks for the AI revolution”. In this light, Jupyter provides an abstraction layer — a kind of buffer to help “future proof” — for complex use cases in NLP, machine learning, and related work. We’re seeing this from most of the public cloud vendors, who are also leaders in AI, Google, Amazon, Microsoft, IBM, etc., and who will be represented at the conference in August.

Our program at JupyterCon will feature expert speakers across all of these themes. However, to me, that’s merely the tip of the iceberg. So much of the real value that I get from conferences happens in the proverbial “Hallway Track”, where you run into people who are riffing off news they’ve just learned in a session — perhaps in line with your thinking, perhaps in a completely different direction. Those conversations have space to flourish when people get immersed in the community, the issues, the possibilities.

It’ll be a busy week. We’ll have two days of training courses: intensive, hands-on coding, lots of interaction with expert instructors. Training will overlap with one day of tutorials: led by experts, generally larger than training courses though more detailed than session talks, featuring lots of Q&A.

Then we’ll have two days of keynotes and session talks, expo hall, lunches and sponsored breaks, plus Project Jupyter sponsored events. Events include Jupyter User Testing, author signings, “Meet the Experts” office hours, demos in the vendor expo hall — plus related meetups in the evenings. Last year the Poster Session was one of the biggest surprises to me: it was difficult to move through the room, walkways were packed with people asking presenters questions about their projects.

This year we’ll introduce a Business Summit, similar to the popular summits at Strata Data Conference and The AI Conf. This will include high-level presentations on the most promising and important developments in Jupyter for executives and decision-makers. Brian Granger and I will be hosting the Business Summit, along with Joel Horwitz of IBM. One interesting data point: among the regional events, we’ve seen much more engagement this year from enterprise and government than we’d expected, more emphasis on business use cases and new product launches. The ecosystem is growing, and will be represented well at JupyterCon!

We will also feature an Education Track in the main conference, expanding on the well-attended Education Birds-of-a-Feather and related talks during JupyterCon 2017. Use of Jupyter in education has grown rapidly across many contexts: middle/high-school, universities, corporate training, and online courses. Lorena Barba and Robert Talbert will be organizing this track.

Following our schedule of conference talks, the week wraps up with a community sprint day on Saturday. You can work side-by-side with leaders and contributors in the Jupyter ecosystem to implement that feature you’ve always wanted, fix bugs, work on design, write documentation, test software, or dive deep into the internals of something in the Jupyter ecosystem. Be sure to bring your laptop.

Note that we believe true innovation depends on hearing from, and listening to, people with a variety of perspectives. Please read our Diversity Statement for more details. Also, we’re committed to creating a safe and productive environment for everyone at all of our events. Please read our Code of Conduct. Last year we were able to work with the community plus matching donations to provide several Diversity & Inclusion scholarships, as well as more than dozen student scholarships. Looking forward to building on that this year!

That’s a sample of what’s coming up for JupyterCon in NYC this August. Meanwhile, we’ll be helping present and sponsor regional community events to help build momentum for the conference.

We look forward to many opportunities to showcase new work and ideas, to meet each other, to learn about the architecture of the project itself, and to contribute to the future of Jupyter.

Sign-up for email updates on the JupyterCon web site. See you there!

[kudos to Brian Granger for help developing and editing this article]

JupyterCon 2018, NYC August 21–25 was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Continue Reading…


Read More

Data and Analytics Cutting-Edge Events in San Francisco and Chicago

Join us in San Francisco or Chicago this spring for the next chapters of the world-renowned Data & Analytics Innovation events, bringing together the top minds in Big Data and Analytics across industries.

Continue Reading…


Read More

R Packages worth a look

Incrementally Build Complex Plots using Natural Semantics (wheatmap)
Builds complex plots, heatmaps in particular, using natural semantics. Bigger plots can be assembled using directives such as ‘LeftOf’, ‘RightOf’, ‘TopOf’, and ‘Beneath’. Other features include clustering, dendrograms, and integration with ‘ggplot2’-generated grid objects. This package is particularly designed for bioinformaticians assembling complex plots for publication.
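The positional semantics can be sketched as below. This is a hypothetical example based only on the directive names in the description above; `WHeatmap()` and the exact way directives are combined are assumptions that should be checked against the package documentation.

```r
# Sketch of assembling a composite heatmap with wheatmap.
# WHeatmap() and the directive call syntax are assumptions
# inferred from the package description, not verified API.
library(wheatmap)

m1 <- matrix(rnorm(100), nrow = 10)
m2 <- matrix(rnorm(100), nrow = 10)

# Place the second heatmap to the right of the first,
# combining sub-plots with natural positional directives.
WHeatmap(m1, name = "a") +
  WHeatmap(m2, RightOf("a"))
```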

Generalized Integration Model (gim)
Implements the generalized integration model, which integrates individual-level data and summary statistics under a generalized linear model framework. Continuous and binary outcomes are supported, modeled by linear and logistic regression, respectively.
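A hypothetical sketch of such an integration follows. The argument names (`formula`, `family`, `data`, `model`, `nsample`) and the structure of the external-summary object are assumptions based on the description, and should be verified against the package reference manual.

```r
# Hypothetical gim call: combine individual-level data with
# summary statistics from an external study. Argument names
# below are assumptions, not verified against the docs.
library(gim)

# Individual-level data with a binary outcome.
dat <- data.frame(y  = rbinom(200, 1, 0.5),
                  x1 = rnorm(200),
                  x2 = rnorm(200))

# External summary statistics: the reduced model formula the
# external study fit, plus its reported coefficient for x1.
ext <- list(list(form = "y ~ x1",
                 info = data.frame(var = "x1", bet = 0.3)))

# Integrate both sources under a logistic regression model,
# stating the external study's sample size.
fit <- gim(y ~ x1 + x2, family = "binomial",
           data = dat, model = ext, nsample = matrix(5000))
summary(fit)
```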

Pathway Enrichment Analysis Utilizing Active Subnetworks (pathfindR)
Pathway enrichment analysis enables researchers to uncover mechanisms underlying a phenotype. pathfindR is a tool for pathway enrichment analysis utilizing active subnetworks. It identifies active subnetworks in a protein-protein interaction network using a user-provided list of genes, then performs pathway enrichment analysis on the identified subnetworks. pathfindR also offers functionality to cluster enriched pathways and identify representative pathways. The method is described in detail in Ulgen E, Ozisik O, Sezerman OU. 2018. pathfindR: An R Package for Pathway Enrichment Analysis Utilizing Active Subnetworks. bioRxiv. <doi:10.1101/272450>.
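A minimal sketch of a pathfindR run is below. The input format (gene symbol, change value, adjusted p-value columns) and the `run_pathfindR()` entry point follow our reading of the package; the exact column names are assumptions to check against the vignette.

```r
# Sketch of a pathfindR enrichment run. Input column names and
# the clustering helper's name are assumptions based on the
# package description, not verified API.
library(pathfindR)

# User-provided gene list with fold changes and p-values.
input_df <- data.frame(
  Gene_symbol = c("TP53", "BRCA1", "EGFR"),
  logFC       = c(1.8, -2.1, 1.2),
  adj_p_val   = c(0.001, 0.003, 0.010)
)

# Identify active subnetworks in the PPI network, then run
# pathway enrichment on each identified subnetwork.
output_df <- run_pathfindR(input_df)
head(output_df)
```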

A ‘Java’ Platform Integration for ‘R’ with Programming Languages ‘Groovy’, ‘JavaScript’, ‘JRuby’ (‘Ruby’), ‘Jython’ (‘Python’), and ‘Kotlin’ (jsr223)
Provides a high-level integration for the ‘Java’ platform that makes ‘Java’ objects easy to use from within ‘R’; provides a unified interface to integrate ‘R’ with several programming languages; and features extensive data exchange between ‘R’ and ‘Java’. The ‘jsr223’-supported programming languages include ‘Groovy’, ‘JavaScript’, ‘JRuby’ (‘Ruby’), ‘Jython’ (‘Python’), and ‘Kotlin’. Any of these languages can use and extend ‘Java’ classes in natural syntax. Furthermore, solutions developed in any of the ‘jsr223’-supported languages are also accessible to ‘R’ developers. The ‘jsr223’ package also features callbacks, script compiling, and string interpolation. In all, ‘jsr223’ significantly extends the computing capabilities of the ‘R’ software environment.
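As an illustration, the script-engine workflow might look like the sketch below. The `ScriptEngine$new()` constructor and the `%~%` evaluation operator reflect our reading of the package interface and should be confirmed against the jsr223 documentation.

```r
# Sketch of evaluating JavaScript on the JVM from R via jsr223.
# The constructor and %~% operator shown here are assumptions
# based on the package description.
library(jsr223)

# Create a script engine for one of the supported languages.
engine <- ScriptEngine$new("JavaScript")

# Evaluate a script; the resulting value is returned to R,
# illustrating the data exchange between R and the JVM.
engine %~% "var x = 6 * 7; x"

engine$terminate()
```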

Parsimonious Gaussian Mixture Models (pgmm)
Carries out model-based clustering or classification using parsimonious Gaussian mixture models. See McNicholas and Murphy (2008) <doi:10.1007/s11222-008-9056-0>, McNicholas (2010) <doi:10.1016/j.jspi.2009.11.006>, and McNicholas and Murphy (2010) <doi:10.1093/bioinformatics/btq498>.
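A sketch of a clustering run with pgmm follows. The `pgmmEM()` function and its `rG`/`rq` arguments (ranges for the number of groups and latent factors) are our reading of the package interface; check the reference manual before relying on them.

```r
# Sketch of model-based clustering with pgmm. pgmmEM() and the
# rG/rq argument names are assumptions based on our reading of
# the package, not verified API.
library(pgmm)

# Example dataset we assume ships with pgmm; first two columns
# are taken to be labels, the rest chemical measurements.
data(coffee)
x <- scale(coffee[, -c(1, 2)])

# Fit parsimonious Gaussian mixtures over 2-3 groups and
# 1-2 latent factors, selecting the best model.
fit <- pgmmEM(x, rG = 2:3, rq = 1:2)

# Maximum a posteriori cluster memberships.
table(fit$map)
```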

Continue Reading…


Read More

Asian and European cities compete for the title of most expensive city

SINGAPORE remains the most expensive city in the world for the fifth year running, according to the latest findings of the Worldwide Cost of Living Survey from The Economist Intelligence Unit.

Continue Reading…


Read More

Thanks for reading!