# My Data Science Blogs

## September 15, 2019

### Whats new on arXiv

Language models are generally trained on data spanning a wide range of topics (e.g., news, reviews, fiction), but they might be applied to an a priori unknown target distribution (e.g., restaurant reviews). In this paper, we first show that training on text outside the test distribution can degrade test performance when using standard maximum likelihood (MLE) training. To remedy this without the knowledge of the test distribution, we propose an approach which trains a model that performs well over a wide range of potential test distributions. In particular, we derive a new distributionally robust optimization (DRO) procedure which minimizes the loss of the model over the worst-case mixture of topics with sufficient overlap with the training distribution. Our approach, called topic conditional value at risk (topic CVaR), obtains a 5.5 point perplexity reduction over MLE when the language models are trained on a mixture of Yelp reviews and news and tested only on reviews.
Deep learning has permeated through many aspects of computing/processing systems in recent years. While distributed training architectures/frameworks are adopted for training large deep learning models quickly, there has not been a systematic study of the communication bottlenecks of these architectures and their effects on the computation cycle time and scalability. In order to analyze this problem for synchronous Stochastic Gradient Descent (SGD) training of deep learning models, we developed a performance model of computation time and communication latency under three different system architectures: Parameter Server (PS), peer-to-peer (P2P), and Ring allreduce (RA). To complement and corroborate our analytical models with quantitative results, we evaluated the computation and communication performance of these system architectures of the systems via experiments performed with Tensorflow and Horovod frameworks. We found that the system architecture has a very significant effect on the performance of training. RA-based systems achieve scalable performance as they successfully decouple network usage from the number of workers in the system. In contrast, 1PS systems suffer from low performance due to network congestion at the parameter server side. While P2P systems fare better than 1PS systems, they still suffer from significant network bottleneck. Finally, RA systems also excel by virtue of overlapping computation time and communication time, which PS and P2P architectures fail to achieve.
We present a systematic framework for the Nesterov’s accelerated gradient flows in the spaces of probabilities embedded with information metrics. Here two metrics are considered, including both the Fisher-Rao metric and the Wasserstein-$2$ metric. For the Wasserstein-$2$ metric case, we prove the convergence properties of the accelerated gradient flows, and introduce their formulations in Gaussian families. Furthermore, we propose a practical discrete-time algorithm in particle implementations with an adaptive restart technique. We formulate a novel bandwidth selection method, which learns the Wasserstein-$2$ gradient direction from Brownian-motion samples. Experimental results including Bayesian inference show the strength of the current method compared with the state-of-the-art.
This paper proposes a new meta-learning method — named HARMLESS (HAwkes Relational Meta LEarning method for Short Sequences) for learning heterogeneous point process models from short event sequence data along with a relational network. Specifically, we propose a hierarchical Bayesian mixture Hawkes process model, which naturally incorporates the relational information among sequences into point process modeling. Compared with existing methods, our model can capture the underlying mixed-community patterns of the relational network, which simultaneously encourages knowledge sharing among sequences and facilitates adaptive learning for each individual sequence. We further propose an efficient stochastic variational meta expectation maximization algorithm that can scale to large problems. Numerical experiments on both synthetic and real data show that HARMLESS outperforms existing methods in terms of predicting the future events.
A quantum generalization of Natural Gradient Descent is presented as part of a general-purpose optimization framework for variational quantum circuits. The optimization dynamics is interpreted as moving in the steepest descent direction with respect to the Quantum Information Geometry, corresponding to the real part of the Quantum Geometric Tensor (QGT), also known as the Fubini-Study metric tensor. An efficient algorithm is presented for computing a block-diagonal approximation to the Fubini-Study metric tensor for parametrized quantum circuits, which may be of independent interest.
This study concerns the issue of high dimensional outliers which are challenging to distinguish from inliers due to the special structure of high dimensional space. We introduce a new notion of high dimensional outliers that embraces various types and provides deep insights into understanding the behavior of these outliers based on several asymptotic regimes. Our study of geometrical properties of high dimensional outliers reveals an interesting transition phenomenon of outliers from near the surface of a high dimensional sphere to being distant from the sphere. Also, we study the PCA subspace consistency when data contain a limited number of outliers.
The problem of verifying whether a textual hypothesis holds the truth based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies are mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc), while verification under structured evidence, such as tables, graphs, and databases, remains unexplored. This paper specifically aims to study the fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called \textsc{TabFact} with 16k Wikipedia tables as evidence for 118k human-annotated natural language statements, which are labeled as either {\tt ENTAILED} or {\tt REFUTED}. \textsc{TabFact} is more challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. LPA parses statements into LISP-like programs and executes them against the tables to obtain the returned binary value. Both methods achieve similar accuracy but yet far from human performance. We also perform comprehensive analysis and demonstrate great future opportunities. The data and code of the dataset are provided in \url{https://…/Table-Fact-Checking}.
Deep neural networks (DNNs) are shown to be promising solutions in many challenging artificial intelligence tasks, including object recognition, natural language processing, and even unmanned driving. A DNN model, generally based on statistical summarization of in-house training data, aims to predict correct output given an input encountered in the wild. In general, 100% precision is therefore impossible due to its probabilistic nature. For DNN practitioners, it is very hard, if not impossible, to figure out whether the low precision of a DNN model is an inevitable result, or caused by defects such as bad network design or improper training process. This paper aims at addressing this challenging problem. We approach with a careful categorization of the root causes of low precision. We find that the internal data flow footprints of a DNN model can provide insights to locate the root cause effectively. We then develop a tool, namely, DeepMorph (DNN Tomography) to analyze the root cause, which can instantly guide a DNN developer to improve the model. Case studies on four popular datasets show the effectiveness of DeepMorph.
Researchers often use artificial data to assess the performance of new econometric methods. In many cases the data generating processes used in these Monte Carlo studies do not resemble real data sets and instead reflect many arbitrary decisions made by the researchers. As a result potential users of the methods are rarely persuaded by these simulations that the new methods are as attractive as the simulations make them out to be. We discuss the use of Wasserstein Generative Adversarial Networks (WGANs) as a method for systematically generating artificial data that mimic closely any given real data set without the researcher having many degrees of freedom. We apply the methods to compare in three different settings twelve different estimators for average treatment effects under unconfoundedness. We conclude in this example that (i) there is not one estimator that outperforms the others in all three settings, and (ii) that systematic simulation studies can be helpful for selecting among competing methods.
We conduct a sequential social learning experiment where subjects guess a hidden state after observing private signals and the guesses of a subset of their predecessors. A network determines the observable predecessors, and we compare subjects’ accuracy on sparse and dense networks. Later agents’ accuracy gains from social learning are twice as large in the sparse treatment compared to the dense treatment. Models of naive inference where agents ignore correlation between observations predict this comparative static in network density, while the result is difficult to reconcile with rational-learning models.
Convolutional neural networks require numerous data for training. Considering the difficulties in data collection and labeling in some specific tasks, existing approaches generally use models pre-trained on a large source domain (e.g. ImageNet), and then fine-tune them on these tasks. However, the datasets from source domain are simply discarded in the fine-tuning process. We argue that the source datasets could be better utilized and benefit fine-tuning. This paper firstly introduces the concept of general discrimination to describe ability of a network to distinguish untrained patterns, and then experimentally demonstrates that general discrimination could potentially enhance the total discrimination ability on target domain. Furthermore, we propose a novel and light-weighted method, namely soft fine-tuning. Unlike traditional fine-tuning which directly replaces optimization objective by a loss function on the target domain, soft fine-tuning effectively keeps general discrimination by holding the previous loss and removes it softly. By doing so, soft fine-tuning improves the robustness of the network to data bias, and meanwhile accelerates the convergence. We evaluate our approach on several visual recognition tasks. Extensive experimental results support that soft fine-tuning provides consistent improvement on all evaluated tasks, and outperforms the state-of-the-art significantly. Codes will be made available to the public.
Based on the theories of sliced inverse regression (SIR) and reproducing kernel Hilbert space (RKHS), a new approach RDSIR (RKHS-based Double SIR) to nonlinear dimension reduction for survival data is proposed and discussed. An isometrically isomorphism is constructed based on RKHS property, then the nonlinear function in the RKHS can be represented by the inner product of two elements which reside in the isomorphic feature space. Due to the censorship of survival data, double slicing is used to estimate weight function or conditional survival function to adjust for the censoring bias. The sufficient dimension reduction (SDR) subspace is estimated by a generalized eigen-decomposition problem. Our method is computationally efficient with fast calculation speed and small computational burden. The asymptotic property and the convergence rate of the estimator are also discussed based on the perturbation theory. Finally, we illustrate the performance of RDSIR on simulated and real data to confirm that RDSIR is comparable with linear SDR method. The most important is that RDSIR can also extract nonlinearity in survival data effectively.
We present a novel class of convolutional neural networks (CNNs) for set functions, i.e., data indexed with the powerset of a finite set. The convolutions are derived as linear, shift-equivariant functions for various notions of shifts on set functions. The framework is fundamentally different from graph convolutions based on the Laplacian, as it provides not one but several basic shifts, one for each element in the ground set. Prototypical experiments with several set function classification tasks on synthetic datasets and on datasets derived from real-world hypergraphs demonstrate the potential of our new powerset CNNs.
Although Seq2Seq models for table-to-text generation have achieved remarkable progress, modeling table representation in one dimension is inadequate. This is because (1) the table consists of multiple rows and columns, which means that encoding a table should not depend only on one dimensional sequence or set of records and (2) most of the tables are time series data (e.g. NBA game data, stock market data), which means that the description of the current table may be affected by its historical data. To address aforementioned problems, not only do we model each table cell considering other records in the same row, we also enrich table’s representation by modeling each table cell in context of other cells in the same column or with historical (time dimension) data respectively. In addition, we develop a table cell fusion gate to combine representations from row, column and time dimension into one dense vector according to the saliency of each dimension’s representation. We evaluated our methods on ROTOWIRE, a benchmark dataset of NBA basketball games. Both automatic and human evaluation results demonstrate the effectiveness of our model with improvement of 2.66 in BLEU over the strong baseline and outperformance of state-of-the-art model.
The rapid advancement of artificial intelligence (AI) is changing our lives in many ways. One application domain is data science. New techniques in automating the creation of AI, known as AutoAI or AutoML, aim to automate the work practices of data scientists. AutoAI systems are capable of autonomously ingesting and pre-processing data, engineering new features, and creating and scoring models based on a target objectives (e.g. accuracy or run-time efficiency). Though not yet widely adopted, we are interested in understanding how AutoAI will impact the practice of data science. We conducted interviews with 20 data scientists who work at a large, multinational technology company and practice data science in various business settings. Our goal is to understand their current work practices and how these practices might change with AutoAI. Reactions were mixed: while informants expressed concerns about the trend of automating their jobs, they also strongly felt it was inevitable. Despite these concerns, they remained optimistic about their future job security due to a view that the future of data science work will be a collaboration between humans and AI systems, in which both automation and human expertise are indispensable.
We describe a detailed analysis of a sample of large benchmark of commonsense reasoning problems that has been automatically obtained from WordNet, SUMO and their mapping. The objective is to provide a better assessment of the quality of both the benchmark and the involved knowledge resources for advanced commonsense reasoning tasks. By means of this analysis, we are able to detect some knowledge misalignments, mapping errors and lack of knowledge and resources. Our final objective is the extraction of some guidelines towards a better exploitation of this commonsense knowledge framework by the improvement of the included resources.
Automated systems for detecting deformation in satellite InSAR imagery could be used to develop a global monitoring system for volcanic and urban environments. Here we explore the limits of a CNN for detecting slow, sustained deformations in wrapped interferograms. Using synthetic data, we estimate a detection threshold of 3.9cm for deformation signals alone, and 6.3cm when atmospheric artefacts are considered. Over-wrapping reduces this to 1.8cm and 5.0cm respectively as more fringes are generated without altering SNR. We test the approach on timeseries of cumulative deformation from Campi Flegrei and Dallol, where over-wrapping improves classication performance by up to 15%. We propose a mean-filtering method for combining results of different wrap parameters to flag deformation. At Campi Flegrei, deformation of 8.5cm/yr was detected after 60days and at Dallol, deformation of 3.5cm/yr was detected after 310 days. This corresponds to cumulative displacements of 3 cm and 4 cm consistent with estimates based on synthetic data.
Opinion summarization is the task of automatically generating summaries for a set of opinions about a specific target (e.g., a movie or a product). Since the number of input documents can be prohibitively large, neural network-based methods sacrifice end-to-end elegance and follow a two-stage approach where an extractive model first pre-selects a subset of salient opinions and an abstractive model creates the summary while conditioning on the extracted subset. However, the extractive stage leads to information loss and inflexible generation capability. In this paper we propose a summarization framework that eliminates the need to pre-select salient content. We view opinion summarization as an instance of multi-source transduction, and make use of all input documents by condensing them into multiple dense vectors which serve as input to an abstractive model. Beyond producing more informative summaries, we demonstrate that our approach allows to take user preferences into account based on a simple zero-shot customization technique. Experimental results show that our model improves the state of the art on the Rotten Tomatoes dataset by a wide margin and generates customized summaries effectively.
We recently outlined the vision of ‘Learning Everywhere’ which captures the possibility and impact of how learning methods and traditional HPC methods can be coupled together. A primary driver of such coupling is the promise that Machine Learning (ML) will give major performance improvements for traditional HPC simulations. Motivated by this potential, the ML around HPC class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, which is part of the Learning Everywhere series, we discuss ‘how’ learning methods and HPC simulations are being integrated to enhance effective performance of computations. This paper identifies several modes — substitution, assimilation, and control, in which learning methods integrate with HPC simulations and provide representative applications in each mode. This paper discusses some open research questions and we hope will motivate and clear the ground for MLaroundHPC benchmarks.
We describe a set of best practices for the young field of neural architecture search (NAS), which lead to the best practices checklist for NAS available at http://…/nas_checklist.pdf.
Spiking neural networks (SNNs) are a promising candidate for biologically-inspired and energy efficient computation. However, their simulation is notoriously time consuming, and may be seen as a bottleneck in developing competitive training methods with potential deployment on neuromorphic hardware platforms. To address this issue, we provide an implementation of mini-batch processing applied to clock-based SNN simulation, leading to drastically increased data throughput. To our knowledge, this is the first general-purpose implementation of mini-batch processing in a spiking neural networks simulator, which works with arbitrary neuron and synapse models. We demonstrate nearly constant-time scaling with batch size on a simulation setup (up to GPU memory limits), and showcase the effectiveness of large batch sizes in two SNN application domains, resulting in $\approx$880X and $\approx$24X reductions in wall-clock time respectively. Different parameter reduction techniques are shown to produce different learning outcomes in a simulation of networks trained with spike-timing-dependent plasticity. Machine learning practitioners and biological modelers alike may benefit from the drastically reduced simulation time and increased iteration speed this method enables. Code to reproduce the benchmarks and experimental findings in this paper can be found at https://…/snn-minibatch.

### “Boston Globe Columnist Suspended During Investigation Of Marathon Bombing Stories That Don’t Add Up”

I came across this news article by Samer Kalaf and it made me think of some problems we’ve been seeing in recent years involving cargo-cult science.

Here’s the story:

The Boston Globe has placed columnist Kevin Cullen on “administrative leave” while it conducts a review of his work, after WEEI radio host Kirk Minihane scrutinized Cullen’s April 14 column about the five-year anniversary of the Boston Marathon bombings, and found several inconsistencies. . . .

Here’s an excerpt of the column:

I happened upon a house fire recently, in Mattapan, and the smell reminded me of Boylston Street five years ago, when so many lost their lives and their limbs and their sense of security.

I can smell Patriots Day, 2013. I can hear it. God, can I hear it, whenever multiple fire engines or ambulances are racing to a scene.

I can taste it, when I’m around a campfire and embers create a certain sensation.

I can see it, when I bump into survivors, which happens with more regularity than I could ever have imagined. And I can touch it, when I grab those survivors’ hands or their shoulders.

Cullen, who was part of the paper’s 2003 Pulitzer-winning Spotlight team that broke the stories on the Catholic Church sex abuse scandal, had established in this column, and in prior reporting, that he was present for the bombings. . . .

But Cullen wasn’t really there. And his stories had lots of details that sounded good but were actually made up. Including, horrifyingly enough, made-up stories about a little girl who was missing her leg.

OK, so far, same old story. Mike Barnicle, Janet Cooke, Stephen Glass, . . . and now one more reporter who prefers to make things up than to do actual reporting. For one thing, making stuff up is easier; for another, if you make things up, you can make the story work better, as you’re not constrained by pesky details.

What’s the point of writing about this, then? What’s the connection to statistical modeling, causal inference, and social science?

Here’s the point:

1. What’s the reason for journalism? To convey information, to give readers a different window into reality. To give a sense of what it was like to be there, for those who were not there. Or to help people who were there, to remember.

2. What does good journalism look like? It’s typically emotionally stirring and convincingly specific.

And here’s the problem.

The reason for journalism is 1, but some journalists decide to take a shortcut and go straight to the form of good journalism, that is, 2.

Indeed, I suspect that many journalists think that 2 is the goal, and that 1 is just some old-fashioned traditional attitude.

Now, to connect to statistical modeling, causal inference, and social science . . . let’s think about science:

1. What’s the reason for science? To learn about reality, to learn new facts, to encompass facts into existing and new theories, to find flaws in our models of the world.

2. And what does good science look like? It typically has an air of rigor.

And here’s the problem.

The reason for science is 1, but some scientists decide to take a shortcut and go straight to the form of good science, that is, 2.

The problem is not scientists don’t care about the goal of learning about reality; the problem is that they think that if they follow various formal expressions of science (randomized experiments, p-values, peer review, publication in journals, association with authority figures, etc.) that they’ll get the discovery for free.

It’s a natural mistake, given statistical training with its focus on randomization and p-values, an attitude that statistical methods can yield effective certainty from noisy data (true for Las Vegas casinos where the probability model is known; not so true for messy real-world science experiments), and scientific training that’s focused on getting papers published.

Summary

What struck me about the above-quoted Boston Globe article (“I happened upon a house fire recently . . . I can smell Patriots Day, 2013. I can hear it. God, can I hear it . . . I can taste it . . .”) was how it looks like good journalism. Not great journalism—it’s too clichéd and trope-y for that—but what’s generally considered good reporting, the kind that sometimes wins awards.

Similarly, if you look at a bunch of the fatally flawed articles we’ve seen in science journals in the past few years, they look like solid science. It’s only when you examine the details that you start seeing all the problems, and these papers disintegrate like a sock whose thread has been pulled.

Ok, yeah yeah sure, you’re saying: Once again I’m reminded of bad science. Who cares? I care, because bad science Greshams good science in so many ways: in scientists’ decision of what to work on and publish (why do a slow careful study if you can get a better publication with something flashy?), in who gets promoted and honored and who decides to quit the field in disgust (not always, but sometimes), and in what gets publicized. The above Boston marathon story struck me because it had that same flavor.

P.S. Tomorrow’s post: Harking, Sharking, Tharking.

### pinp 0.0.9: Real Fix and Polish

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Another pinp package release! pinp allows for snazzier one or two column Markdown-based pdf vignettes, and is now used by a few packages. A screenshot of the package vignette can be seen below. Additional screenshots are at the pinp page.

This release comes exactly one week (i.e. the minimal time to not earn a NOTE) after the hot-fix release 0.0.8 which addressed breakage on CRAN tickled by changed in TeX Live. After updating the PNAS style LaTeX macros, and avoiding the issue with an (older) custom copy of titlesec, we now have the real fix, thanks to the eagle-eyed attention of Javier Bezos. The error, as so often, was simple and ours: we had left a stray \makeatother in pinp.cls where it may have been in hiding for a while. A very big Thank You! to Javier for spotting it, to Norbert for all his help and to James for double-checking on PNAS.

The good news in all of this is that the package is now in better shape than ever. The newer PNAS style works really well, and I went over a few of our extensions (such as papersize support for a4 as well as letter), direct on/off off a Draft watermark, a custom subtitle and more—and they all work consistently. So happy vignette or paper writing!

The NEWS entry for this release follows.

#### Changes in pinp version 0.0.9 (2019-09-15)

• The processing error first addressed in release 0.0.8 is now fixed by removing one stray command; many thanks to Javier Bezos.

• The hotfix of also installing titlesec.sty has been reverted.

• Processing of the ‘papersize’ and ‘watermark’ options was updated.

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Magister Dixit

“Big Data is not about volume, size or velocity of data – neither of which are easily translated into financial results for most businesses. It is about the integration of external sources of information and unstructured data into a company’s IT infrastructure and business processes.” Gregory Yankelovich ( October 20, 2014 )

### Finding out why

In many scientific contexts, different investigators experiment with or observe different variables with data from a domain in which the distinct variable sets might well be related. This sort of fragmentation sometimes occurs in molecular biology, whether in studies of RNA expression or studies of protein interaction, and it is common in the social sciences. Models are built on the diverse data sets, but combining them can provide a more unified account of the causal processes in the domain. On the other hand, this problem is made challenging by the fact that a variable in one data set may influence variables in another although neither data set contains all of the variables involved. Several authors have proposed using conditional independence properties of fragmentary (marginal) data collections to form unified causal explanations when it is assumed that the data have a common causal explanation but cannot be merged to form a unified dataset. These methods typically return a large number of alternative causal models. The first part of the thesis shows that marginal datasets contain extra information that can be used to reduce the number of possible models, in some cases yielding a unique model.
This paper treats gender bias latent in word embeddings. Previous mitigation attempts rely on the operationalisation of gender bias as a projection over a linear subspace. An alternative approach is Counterfactual Data Augmentation (CDA), in which a corpus is duplicated and augmented to remove bias, e.g. by swapping all inherently-gendered words in the copy. We perform an empirical comparison of these approaches on the English Gigaword and Wikipedia, and find that whilst both successfully reduce direct bias and perform well in tasks which quantify embedding quality, CDA variants outperform projection-based methods at the task of drawing non-biased gender analogies by an average of 19% across both corpora. We propose two improvements to CDA: Counterfactual Data Substitution (CDS), a variant of CDA in which potentially biased text is randomly substituted to avoid duplication, and the Names Intervention, a novel name-pairing technique that vastly increases the number of words being treated. CDA/S with the Names Intervention is the only approach which is able to mitigate indirect gender bias: following debiasing, previously biased words are significantly less clustered according to gender (cluster purity is reduced by 49%), thus improving on the state-of-the-art for bias mitigation.
We describe a method that predicts, from a single RGB image, a depth map that describes the scene when a masked object is removed – we call this ‘counterfactual depth’ that models hidden scene geometry together with the observations. Our method works for the same reason that scene completion works: the spatial structure of objects is simple. But we offer a much higher resolution representation of space than current scene completion methods, as we operate at pixel-level precision and do not rely on a voxel representation. Furthermore, we do not require RGBD inputs. Our method uses a standard encoder-decoder architecture, and with a decoder modified to accept an object mask. We describe a small evaluation dataset that we have collected, which allows inference about what factors affect reconstruction most strongly. Using this dataset, we show that our depth predictions for masked objects are better than other baselines.
This paper proposes a novel cost-reflective and computationally efficient method for allocating distribution network costs to residential customers. First, the method estimates the growth in peak demand with a 50% probability of exceedance (50POE) and the associated network augmentation costs using a probabilistic long-run marginal cost computation based on the Turvey perturbation method. Second, it allocates these costs to customers on a cost-causal basis using the Shapley value solution concept. To overcome the intractability of the exact Shapley value computation for real-world applications, we implement a fast, scalable and efficient clustering technique based on customers’ peak demand contribution, which drastically reduces the Shapley value computation time. Using customer load traces from an Australian smart grid trial (Solar Home Electricity Data), we demonstrate the efficacy of our method by comparing it with established energy- and peak demand-based cost allocation approaches.
Where performance comparison of healthcare providers is of interest, characteristics of both patients and the health condition of interest must be balanced across providers for a fair comparison. This is unlikely to be feasible within observational data, as patient population characteristics may vary geographically and patient care may vary by characteristics of the health condition. We simulated data for patients and providers, based on a previously utilized real-world dataset, and separately considered both binary and continuous covariate-effects at the upper level. Multilevel latent class (MLC) modelling is proposed to partition a prediction focus at the patient level (accommodating casemix) and a causal inference focus at the provider level. The MLC model recovered a range of simulated Trust-level effects. Median recovered values were almost identical to simulated values for the binary Trust-level covariate, and we observed successful recovery of the continuous Trust-level covariate with at least 3 latent Trust classes. Credible intervals widen as the error variance increases. The MLC approach successfully partitioned modelling for prediction and for causal inference, addressing the potential conflict between these two distinct analytical strategies. This improves upon strategies which only adjust for differential selection. Patient-level variation and measurement uncertainty are accommodated within the latent classes.
We consider the following communication scenario. An encoder causally observes the Wiener process and decides when and what to transmit about it. A decoder makes real-time estimation of the process using causally received codewords. We determine the causal encoding and decoding policies that jointly minimize the mean-square estimation error, under the long-term communication rate constraint of $R$ bits per second. We show that an optimal encoding policy can be implemented as a causal sampling policy followed by a causal compressing policy. We prove that the optimal encoding policy samples the Wiener process once the innovation passes either $\sqrt{\frac{1}{R}}$ or $-\sqrt{\frac{1}{R}}$, and compresses the sign of the innovation (SOI) using a 1-bit codeword. The SOI coding scheme achieves the operational distortion-rate function, which is equal to $D^{\mathrm{op}}(R)=\frac{1}{6R}$. Surprisingly, this is significantly better than the distortion-rate tradeoff achieved in the limit of infinite delay by the best non-causal code. This is because the SOI coding scheme leverages the free timing information supplied by the zero-delay channel between the encoder and the decoder. The key to unlock that gain is the event-triggered nature of the SOI sampling policy. In contrast, the distortion-rate tradeoffs achieved with deterministic sampling policies are much worse: we prove that the causal informational distortion-rate function in that scenario is as high as $D_{\mathrm{DET}}(R) = \frac{5}{6R}$. It is achieved by the uniform sampling policy with the sampling interval $\frac{1}{R}$. In either case, the optimal strategy is to sample the process as fast as possible and to transmit 1-bit codewords to the decoder without delay.
Experimental data bases are typically very large and high dimensional. To learn from them requires to recognize important features (a pattern), often present at scales different to that of the recorded data. Following the experience collected in statistical mechanics and thermodynamics, the process of recognizing the pattern (the learning process) can be seen as a dissipative time evolution driven by entropy. This is the way thermodynamics enters machine learning. Learning to handle free surface liquids serves as an illustration.

Python Library: causal-tree-learn

Python implementation of causal trees with validation

### Develop Performance Benchmark with GRNN

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

It has been mentioned in https://github.com/statcompute/GRnnet that GRNN is an ideal approach employed to develop performance benchmarks for a variety of risk models. People might wonder what the purpose of performance benchmarks is and why we would even need one at all. Sometimes, a model developer had to answer questions about how well the model would perform even before completing the model. Likewise, a model validator also wondered whether the model being validated has a reasonable performance given the data used and the effort spent. As a result, the performance benchmark, which could be built with the same data sample but an alternative methodology, is called for to address aforementioned questions.

While the performance benchmark can take various forms, including but not limited to business expectations, industry practices, or vendor products, a model-based approach should possess following characteristics:

– Quick prototype with reasonable efforts
– Comparable baseline with acceptable outcomes
– Flexible framework without strict assumptions
– Practical application to broad domains

With both empirical and conceptual advantages, GRNN is able to accommodate each of above-mentioned requirements and thus can be considered an appropriate candidate that might potentially be employed to develop performance benchmarks for a wide variety of models.

Below is an example illustrating how to use GRNN to develop a benchmark model for the logistic regression shown in https://statcompute.wordpress.com/2019/05/04/why-use-weight-of-evidence/. The function grnn.margin() was also employed to explore the marginal effect of each attribute in a GRNN.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Document worth reading: “The History of Digital Spam”

Spam!: that’s what Lorrie Faith Cranor and Brian LaMacchia exclaimed in the title of a popular call-to-action article that appeared twenty years ago on Communications of the ACM. And yet, despite the tremendous efforts of the research community over the last two decades to mitigate this problem, the sense of urgency remains unchanged, as emerging technologies have brought new dangerous forms of digital spam under the spotlight. Furthermore, when spam is carried out with the intent to deceive or influence at scale, it can alter the very fabric of society and our behavior. In this article, I will briefly review the history of digital spam: starting from its quintessential incarnation, spam emails, to modern-days forms of spam affecting the Web and social media, the survey will close by depicting future risks associated with spam and abuse of new technologies, including Artificial Intelligence (e.g., Digital Humans). After providing a taxonomy of spam, and its most popular applications emerged throughout the last two decades, I will review technological and regulatory approaches proposed in the literature, and suggest some possible solutions to tackle this ubiquitous digital epidemic moving forward. The History of Digital Spam

### Posts

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Distilled News

The biggest issue facing machine learning (ML) isn’t whether we will discover better algorithms (we probably will), whether we’ll create a general AI (we probably won’t), or whether we’ll be able to deal with a flood of smart fakes (that’s a long-term, escalating battle). The biggest issue is how we’ll put ML systems into production. Getting an experiment to work on a laptop, even an experiment that runs ‘in the cloud,’ is one thing. Putting that experiment into production is another matter. Production has to deal with reality, and reality doesn’t live on our laptops. Most of our understanding of ‘production’ has come from the web world and learning how to run ecommerce and social media applications at scale. The latest advances in web operations – containerization and container orchestration – make it easier to package applications that can be deployed reliably and maintained consistently. It’s still not easy, but the tools are there. That’s a good start. ML applications differ from traditional software in two important ways. First, they’re not deterministic. Second, the application’s behavior isn’t determined by the code, but by the data used for training. These two differences are closely related.
Association is a powerful data analysis technique that appears frequently in data mining literature. An association rule is an implication of the form X?Y where X is a set of antecedent items and Y is the consequent item. An example association rule of a supermarket database is 80% of the people who buy diapers and baby powder also buy baby oil. The analysis of association rules is used in a variety of ways, including merchandise stocking, insurance fraud investigation, and climate prediction. For years scientists and engineers have developed many visualization techniques to support the analyses of association rules. Many of the visualizations, however, have come up short in dealing with large amounts of rules or rules with multiple antecedents. This limitation results in serious challenges for analysts who need to understand the association information of large databases.
This article will probably be of most interest to individuals just starting out in deep learning or people who are relatively new to leveraging PyTorch. It is a summary of my experience with recent attempts to modify the torchvision package’s CNNs that have been pre-trained on data from Imagenet. The aim of making a multiple architecture classifier a little easier to program.
Lately, I have been consolidating my experiences of working in different ML projects. I will tell this story from the lens of my recent project. Our task was to classify certain phrases into categories – A multiclass single label problem.
This article describes the technique that forecasts the market behavior. The second part demonstrates the application of the approach in a trading strategy. The market data is a sequence called time series. Usually, researchers use only price data (or asset returns) to create a model that forecasts the next price value, movement direction, or other output. I think the better way is to use more data for that. The idea is try to combine versatile market conditions (volatility, volumes, price changes, and etc.)
It’s hard to select the right measure of accuracy for a given problem. Having a standardised approach is what every data scientist should do. Plan of this article
• Motivation
• First consideration
• Must-know measures of accuracy for ML models
• An approach to use these measures to select the right ML model for your problem
Note I focus on binary classification problems in this article, but the approach would be similar with multi classification and regression problems.
Natural language processing (NLP) is a sub-field of machine learning (ML) that deals with natural language, often in the form of text, which is itself composed of smaller units like words and characters. Dealing with text data is problematic, since our computers, scripts and machine learning models can’t read and understand text in any human sense. When I read the word ‘cat’, many different associations are invoked – it’s a small furry animal that’s cute, eats fish, my landlord doesn’t allow, etc. But these linguistic associations are a result of quite complex neurological computations honed over millions of years of evolution, whereas our ML models must start from scratch with no pre-built understanding of word meaning.
The curse of dimensionality is a very crucial problem while dealing with real-life datasets which are generally higher dimensional data .As the dimensionality of the feature space increases, the number of configurations can grow exponentially and thus the number of configurations covered by an observation decreases. In such a scenario, Principal Component Analysis plays a major part in efficiently reducing the dimensionality of the data yet retaining as much as possible of the variation present in the data set. Let us give a very brief introduction to Principal component analysis before delving into the actual problem.
Much of classical machine learning (ML) focuses on utilizing available data to make more accurate predictions. More recently, researchers have considered other important objectives, such as how to design algorithms to be small, efficient, and robust. With these goals in mind, a natural research objective is the design of a system on top of neural networks that efficiently stores information encoded within – in other words, a mechanism to compute a succinct summary (a ‘sketch’) of how a complex deep network processes its inputs. Sketching is a rich field of study that dates back to the foundational work of Alon, Matias, and Szegedy, which can enable neural networks to efficiently summarize information about their inputs. For example: Imagine stepping into a room and briefly viewing the objects within. Modern machine learning is excellent at answering immediate questions, known at training time, about this scene: ‘Is there a cat? How big is said cat?’ Now, suppose we view this room every day over the course of a year. People can reminisce about the times they saw the room: ‘How often did the room contain a cat? Was it usually morning or night when we saw the room?’. However, can one design systems that are also capable of efficiently answering such memory-based questions even if they are unknown at training time? In ‘Recursive Sketches for Modular Deep Learning’, recently presented at ICML 2019, we explore how to succinctly summarize how a machine learning model understands its input. We do this by augmenting an existing (already trained) machine learning model with ‘sketches’ of its computation, using them to efficiently answer memory-based questions – for example, image-to-image-similarity and summary statistics – despite the fact that they take up much less memory than storing the entire original computation.
Scikit-Learn is known for its easily understandable API for Python users, and MLR became an alternative to the popular Caret package with a larger suite of available algorithms and an easy way of tuning hyperparameters. These two packages are somewhat in competition due to the debate where many people involved in analytics turn to Python for machine learning and R for statistical analysis. One of the reasons for a preference in using Python could be that current R packages for machine learning are provided via other packages that contain the algorithm. The packages are called through MLR but still require extra installation. Even external feature selection libraries are needed and they will have other external dependencies that need to be satisfied as well. Scikit-Learn is dubbed as a unified API to a number of machine learning algorithms that do not require the user to call anymore libraries. This by no means discredits R. R is still a major component in the data science world regardless of what an online poll would say. Anyone with a background in Statistics and or Mathematics will know why you should use R (regardless of whether they use it themselves they recognize the appeal). Now we will take a look at how a user would go through a typical machine learning workflow. We will proceed with Logistic Regression in Scikit-Learn, and Decision Tree in MLR.
Recommendation engines are everywhere now. Almost any app you use incorporates some sort of recommendation system to either push new content or drive sales. Recommendation engines are Netflix telling you what you should watch next, the ads on Facebook pushing products that you just happened to look at once, or even Slack suggesting which organization channels you should join. The advent of big data and machine learning has made recommendation engines one of the most directly applicable aspects of Data Science.
BERT (Devlin et al., 2018) is a method of pre-training language representations, meaning that we train a general-purpose ‘language understanding’ model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

## September 14, 2019

### If you did not already know

I-Optimality
The generalized linear model plays an important role in statistical analysis and the related design issues are undoubtedly challenging. The state-of-the-art works mostly apply to design criteria on the estimates of regression coefficients. It is of importance to study optimal designs for generalized linear models, especially on the prediction aspects. In this work, we propose a prediction-oriented design criterion, I-optimality, and develop an efficient sequential algorithm of constructing I-optimal designs for generalized linear models. Through establishing the General Equivalence Theorem of the I-optimality for generalized linear models, we obtain an insightful understanding for the proposed algorithm on how to sequentially choose the support points and update the weights of support points of the design. The proposed algorithm is computationally efficient with guaranteed convergence property. Numerical examples are conducted to evaluate the feasibility and computational efficiency of the proposed algorithm. …

Regularized Determinantal Point Process (R-DPP)
Given a fixed $n\times d$ matrix $\mathbf{X}$, where $n\gg d$, we study the complexity of sampling from a distribution over all subsets of rows where the probability of a subset is proportional to the squared volume of the parallelopiped spanned by the rows (a.k.a. a determinantal point process). In this task, it is important to minimize the preprocessing cost of the procedure (performed once) as well as the sampling cost (performed repeatedly). To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time $O(\text{number-of-non-zeros}(\mathbf{X})\cdot\log n)+\text{poly}(d)$, and (2) a sampling step which runs in $\text{poly}(d)$ time, independent of the number of rows $n$. We achieve this by introducing a new regularized determinantal point process (R-DPP), which serves as an intermediate distribution in the sampling procedure by reducing the number of rows from $n$ to $\text{poly}(d)$. Crucially, this intermediate distribution does not distort the probabilities of the target sample. Our key novelty in defining the R-DPP is the use of a Poisson random variable for controlling the probabilities of different subset sizes, leading to new determinantal formulas such as the normalization constant for this distribution. Our algorithm has applications in many diverse areas where determinantal point processes have been used, such as machine learning, stochastic optimization, data summarization and low-rank matrix reconstruction. …

Feature-Label Memory Network
Deep learning typically requires training a very capable architecture using large datasets. However, many important learning problems demand an ability to draw valid inferences from small size datasets, and such problems pose a particular challenge for deep learning. In this regard, various researches on ‘meta-learning’ are being actively conducted. Recent work has suggested a Memory Augmented Neural Network (MANN) for meta-learning. MANN is an implementation of a Neural Turing Machine (NTM) with the ability to rapidly assimilate new data in its memory, and use this data to make accurate predictions. In models such as MANN, the input data samples and their appropriate labels from previous step are bound together in the same memory locations. This often leads to memory interference when performing a task as these models have to retrieve a feature of an input from a certain memory location and read only the label information bound to that location. In this paper, we tried to address this issue by presenting a more robust MANN. We revisited the idea of meta-learning and proposed a new memory augmented neural network by explicitly splitting the external memory into feature and label memories. The feature memory is used to store the features of input data samples and the label memory stores their labels. Hence, when predicting the label of a given input, our model uses its feature memory unit as a reference to extract the stored feature of the input, and based on that feature, it retrieves the label information of the input from the label memory unit. In order for the network to function in this framework, a new memory-writingmodule to encode label information into the label memory in accordance with the meta-learning task structure is designed. Here, we demonstrate that our model outperforms MANN by a large margin in supervised one-shot classification tasks using Omniglot and MNIST datasets. …

Decision Tree Based Missing Value Imputation Technique (DMI)
Decision tree based Missing value Imputation technique’ (DMI) makes use of an EM algorithm and a decision tree (DT) algorithm. …

### Facebook Research at Interspeech 2019

The post Facebook Research at Interspeech 2019 appeared first on Facebook Research.

### Cartoon: Unsupervised Machine Learning?

New KDnuggets Cartoon looks at one of the hottest directions in Machine Learning and asks "Can Machine Learning be too unsupervised?"

### Science and Technology links (September 14th 2019)

1. Streaming music makes up 80% of the revenue of the music industry. Revenue is up 18% for the first six months of 2019. This follows a record year in 2018 when the music industry reached its highest revenues in 10 years. Though it should be good news for musicians, it is also the case that record labels often take the lion share of the revenues.
2. We have seen massive progress in the last five years in artificial intelligence. Yet we do not see obvious signs of economic gains from this progress.
3. A widely cited study contradicting the existence of free will was fatally flawed.
4. A common diabetes drug might spur neurogenesis and brain repair in the brain of (female) mice.
5. Facebook is working on creating real-time models of your face for use in virtual reality.
6. A simple and inexpensive eye test might be sufficient to detect early signs of Alzheimer’s.

### Speeding up independent binary searches by interleaving them

Given a long list of sorted values, how do you find the location of a particular value? A simple strategy is to first look at the middle of the list. If your value is larger than the middle value, look at the last half of the list, if not look at the first half of the list. Then repeat with select half, looking again in the middle. This algorithm is called a binary search. One of the first algorithms that computer science students learn is the binary search. I suspect that many people figure it out as a kid.

It is hard to drastically improve on the binary search if you only need to do one.

But what if you need to execute multiple binary searches, over distinct lists? Maybe surprisingly, in such cases, you can multiply the speed.

Let us first reason about how a binary search works. The processor needs to retrieve the middle value and compare it with your target value. Before the middle value is loaded, the processor cannot tell whether it will need to access the last or first half of the list. It might speculate. Most processors have branch predictors. In the case of a binary search, the branch is hard to predict so we might expect that the branch predictor will get it wrong half the time. Still: when it speculates properly, it is a net win. Importantly, we are using the fact that processors can do multiple memory requests at once.

What else could the processor do? If there are many binary searches to do, it might initiate the second one. And then maybe initiate the third one and so forth. This might be a lot more beneficial than speculating wastefully on a single binary search.

How can it start the second or third search before it finishes the current one? Again, due to speculation. If it is far along in the first search to predict its end, it might see the next search coming and start it even though it is still waiting for data regarding the current binary search. This should be especially easy if your sorted arrays have a predictable size.

There is a limit to how many instructions your processor might reorder in this manner. Let us say, for example, that the limit is 200 instructions. If each binary search takes 100 instructions, and that this value is reliable (maybe because the arrays have fixed sizes), then your processor might be able do up to two binary searches at once. So it might go twice as fast. But it probably cannot easily go much further (to 3 or 4).

But the programmer can help. We can manually tell the processor initiate right away four binary searches:

given arrays a1, a2, a3, a4
given target t1, t2, t3, t4

compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4

cut a1 in two based on above comparison
cut a2 in two based on above comparison
cut a3 in two based on above comparison
cut a4 in two based on above comparison

compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4

...


You can go far higher than 4 interleaved searches. You can do 8, 16, 32… I suspect that there is no practical need to go beyond 32.

How well does it work? Let us take 1024 sorted arrays containing a total of 64 million integers. In each array, I want to do one and just one binary search. Between each test, I access all of data in all of the arrays a couple of times to fool the cache.

By default, if you code a binary search, the resulting assembly will be made of comparisons and jumps. Hence your processor will execute this code in a speculative manner. At least with GNU GCC, we can write the C/C++ code in such a way that the branches are implemented as “conditional move” instructions which prevents the processor from speculating.

My results on a recent Intel processor (Cannon Lake) with GNU GCC 8 are as follows:

algorithm time per search relative speed
1-wise (independent) 2100 cycles/search 1.0 ×
1-wise (speculative) 900 cycles/search 2.3 ×
1-wise (branchless) 1100 cycles/search 2.0 ×
2-wise (branchless) 800 cycles/search 2.5 ×
4-wise (branchless) 600 cycles/search 3.5 ×
8-wise (branchless) 350 cycles/search 6.0 ×
16-wise (branchless) 280 cycles/search 7.5 ×
32-wise (branchless) 230 cycles/search 9 ×

So we are able to go 3 to 4 times faster, on what is effectively a memory bound problem, by interleaving 32 binary searches together. Interleaving merely 16 searches might also be preferable on some systems.

But why is this not higher than 4 times faster? Surely the processor can issue more than 4 memory loads at once?

That is because the 1-wise search, even without speculation, already benefits from the fact that we are streaming multiple binary searches, so that more than one is ongoing at any one time. Indeed, I can prevent the processor from usefully executing more than one search either by inserting memory fences between each search, or by making the target of one search dependent on the index found in the previous search. When I do so the time goes up to 2100 cycles/array which is approximately 9 times longer than 32-wise approach. The exact ratio (9) varies depending on the exact processor: I get a factor of 7 on the older Skylake architecture and a factor of 5 on an ARM Skylark processor.

Implementation-wise, I code my algorithms in pure C/C++. There is no need for fancy, esoteric instructions. The condition move instructions are pretty much standard and old at this point. Sadly, I only know how to convince one compiler (GNU GCC) to reliably produce conditional move instructions. And I have no clue how to control how Java, Swift, Rust or any other language deals with branches.

Could you do better than my implementation? Maybe but there are arguments suggesting that you can’t beat it by much in general. Each data access is done using fewer than 10 instructions in my implementation, which is far below the number of cycles and small compared to the size of the instruction buffers, so finding ways to reduce the instruction count should not help. Furthermore, it looks like I am already nearly maxing out the amount of memory-level parallelism at least on some hardware (9 on Cannon Lake, 7 on Skylake, 5 on Skylark). On Cannon Lake, however, you should be able to do better.

Credit: This work is the result of a collaboration with Travis Downs and Nathan Kurz, though all of the mistakes are mine.

### Twitter “Account Analysis” in R

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This past week @propublica linked to a really spiffy resource for getting an overview of a Twitter user’s profile and activity called accountanalysis. It has a beautiful interface that works as well on mobile as it does in a real browser. It also is fully interactive and supports cross-filtering (zoom in on the timeline and the other graphs change). It’s especially great if you’re not a coder, but if you are, @kearneymw’s {rtweet} can get you all this info and more, putting the power of R behind data frames full of tweet inanity.

While we covered quite a bit of {rtweet} ground in the 21 Recipes book, summarizing an account to the degree that accountanalysis does is not in there. To rectify this oversight, I threw together a static clone of accountanalysis that can make standalone HTML reports like this one.

It’s a fully parameterized R markdown document, meaning you can run it as just a function call (or change the parameter and knit it by hand):

rmarkdown::render(
input = "account-analysis.Rmd",
params = list(
),
output_file = "~/Documents/propublica-analysis.html"
)


It will also, by default, save a date-stamped copy of the user info and retrieved timeline into the directory you generate the report from (add a prefix path to the save portion in the Rmd to store it in a better place).

With all the data available, you can dig in and extract all the information you want/need.

### FIN

You can get the Rmd at your favorite social coding service:

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Whats new on arXiv

Current cross-modal retrieval systems are evaluated using R@K measure which does not leverage semantic relationships rather strictly follows the manually marked image text query pairs. Therefore, current systems do not generalize well for the unseen data in the wild. To handle this, we propose a new measure, SemanticMap, to evaluate the performance of cross-modal systems. Our proposed measure evaluates the semantic similarity between the image and text representations in the latent embedding space. We also propose a novel cross-modal retrieval system using a single stream network for bidirectional retrieval. The proposed system is based on a deep neural network trained using extended center loss, minimizing the distance of image and text descriptions in the latent space from the class centers. In our system, the text descriptions are also encoded as images which enabled us to use a single stream network for both text and images. To the best of our knowledge, our work is the first of its kind in terms of employing a single stream network for cross-modal retrieval systems. The proposed system is evaluated on two publicly available datasets including MSCOCO and Flickr30K and has shown comparable results to the current state-of-the-art methods.
Knowing when a graphical model is perfect to a distribution is essential in order to relate separation in the graph to conditional independence in the distribution, and this is particularly important when performing inference from data. When the model is perfect, there is a one-to-one correspondence between conditional independence statements in the distribution and separation statements in the graph. Previous work has shown that almost all models based on linear directed acyclic graphs as well as Gaussian chain graphs are perfect, the latter of which subsumes Gaussian graphical models (i.e., the undirected Gaussian models) as a special case. However, the complexity of chain graph models leads to a proof of this result which is indirect and mired by the complications of parameterizing this general class. In this paper, we directly approach the problem of perfectness for the Gaussian graphical models, and provide a new proof, via a more transparent parametrization, that almost all such models are perfect. Our approach is based on, and substantially extends, a construction of Ln\v{e}ni\v{c}ka and Mat\’u\v{s} showing the existence of a perfect Gaussian distribution for any graph.
Link prediction is an important way to complete knowledge graphs (KGs), while embedding-based methods, effective for link prediction in KGs, perform poorly on relations that only have a few associative triples. In this work, we propose a Meta Relational Learning (MetaR) framework to do the common but challenging few-shot link prediction in KGs, namely predicting new triples about a relation by only observing a few associative triples. We solve few-shot link prediction by focusing on transferring relation-specific meta information to make model learn the most important knowledge and learn faster, corresponding to relation meta and gradient meta respectively in MetaR. Empirically, our model achieves state-of-the-art results on few-shot link prediction KG benchmarks.
When a robot acquires new information, ideally it would immediately be capable of using that information to understand its environment. While deep neural networks are now widely used by robots for inferring semantic information, conventional neural networks suffer from catastrophic forgetting when they are incrementally updated, with new knowledge overwriting established representations. While a variety of approaches have been developed that attempt to mitigate catastrophic forgetting in the incremental batch learning scenario, in which an agent learns a large collection of labeled samples at once, streaming learning has been much less studied in the robotics and deep learning communities. In streaming learning, an agent learns instances one-by-one and can be tested at any time. Here, we revisit streaming linear discriminant analysis, which has been widely used in the data mining research community. By combining streaming linear discriminant analysis with deep learning, we are able to outperform both incremental batch learning and streaming learning algorithms on both ImageNet-1K and CORe50.
Development sets are impractical to obtain for real low-resource languages, since using all available data for training is often more effective. However, development sets are widely used in research papers that purport to deal with low-resource natural language processing (NLP). Here, we aim to answer the following questions: Does using a development set for early stopping in the low-resource setting influence results as compared to a more realistic alternative, where the number of training epochs is tuned on development languages? And does it lead to overestimation or underestimation of performance? We repeat multiple experiments from recent work on neural models for low-resource NLP and compare results for models obtained by training with and without development sets. On average over languages, absolute accuracy differs by up to 1.4%. However, for some languages and tasks, differences are as big as 18.0% accuracy. Our results highlight the importance of realistic experimental setups in the publication of low-resource NLP research results.
Mathematical morphology is a theory and technique to collect features like geometric and topological structures in digital images. Given a target image, determining suitable morphological operations and structuring elements is a cumbersome and time-consuming task. In this paper, a morphological neural network is proposed to address this problem. Serving as a nonlinear feature extracting layer in deep learning frameworks, the efficiency of the proposed morphological layer is confirmed analytically and empirically. With a known target, a single-filter morphological layer learns the structuring element correctly, and an adaptive layer can automatically select appropriate morphological operations. For practical applications, the proposed morphological neural networks are tested on several classification datasets related to shape or geometric image features, and the experimental results have confirmed the high computational efficiency and high accuracy.
This paper is the preprint of an invited commentary on Lake et al’s Behavioral and Brain Sciences article titled ‘Building machines that learn and think like people’. Lake et al’s paper offers a timely critique on the recent accomplishments in artificial intelligence from the vantage point of human intelligence, and provides insightful suggestions about research directions for building more human-like intelligence. Since we agree with most of the points raised in that paper, we will offer a few points that are complementary.
A recent trend observed in traditionally challenging fields such as computer vision and natural language processing has been the significant performance gains shown by deep learning (DL). In many different research fields, DL models have been evolving rapidly and become ubiquitous. Despite researchers’ excitement, unfortunately, most software developers are not DL experts and oftentimes have a difficult time following the booming DL research outputs. As a result, it usually takes a significant amount of time for the latest superior DL models to prevail in industry. This issue is further exacerbated by the common use of sundry incompatible DL programming frameworks, such as Tensorflow, PyTorch, Theano, etc. To address this issue, we propose a system, called Model Asset Exchange (MAX), that avails developers of easy access to state-of-the-art DL models. Regardless of the underlying DL programming frameworks, it provides an open source Python library (called the MAX framework) that wraps DL models and unifies programming interfaces with our standardized RESTful APIs. These RESTful APIs enable developers to exploit the wrapped DL models for inference tasks without the need to fully understand different DL programming frameworks. Using MAX, we have wrapped and open-sourced more than 30 state-of-the-art DL models from various research fields, including computer vision, natural language processing and signal processing, etc. In the end, we selectively demonstrate two web applications that are built on top of MAX, as well as the process of adding a DL model to MAX.
Abstractive summarization approaches based on Reinforcement Learning (RL) have recently been proposed to overcome classical likelihood maximization. RL enables to consider complex, possibly non-differentiable, metrics that globally assess the quality and relevance of the generated outputs. ROUGE, the most used summarization metric, is known to suffer from bias towards lexical similarity as well as from suboptimal accounting for fluency and readability of the generated abstracts. We thus explore and propose alternative evaluation measures: the reported human-evaluation analysis shows that the proposed metrics, based on Question Answering, favorably compares to ROUGE — with the additional property of not requiring reference summaries. Training a RL-based model on these metrics leads to improvements (both in terms of human or automated metrics) over current approaches that use ROUGE as a reward.
A key element of any machine learning algorithm is the use of a function that measures the dis/similarity between data points. Given a task, such a function can be optimized with a metric learning algorithm. Although this research field has received a lot of attention during the past decade, very few approaches have focused on learning a metric in an imbalanced scenario where the number of positive examples is much smaller than the negatives. Here, we address this challenging task by designing a new Mahalanobis metric learning algorithm (IML) which deals with class imbalance. The empirical study performed shows the efficiency of IML.
Knowledge distillation (KD) is a very popular method for model size reduction. Recently, the technique is exploited for quantized deep neural networks (QDNNs) training as a way to restore the performance sacrificed by word-length reduction. KD, however, employs additional hyper-parameters, such as temperature, coefficient, and the size of teacher network for QDNN training. We analyze the effect of these hyper-parameters for QDNN optimization with KD. We find that these hyper-parameters are inter-related, and also introduce a simple and effective technique that reduces \textit{coefficient} during training. With KD employing the proposed hyper-parameters, we achieve the test accuracy of 92.7% and 67.0% on Resnet20 with 2-bit ternary weights for CIFAR-10 and CIFAR-100 data sets, respectively.
In recent years, there has been a growing interest in identifying anomalous structure within multivariate data streams. We consider the problem of detecting collective anomalies, corresponding to intervals where one or more of the data streams behaves anomalously. We first develop a test for a single collective anomaly that has power to simultaneously detect anomalies that are either rare, that is affecting few data streams, or common. We then show how to detect multiple anomalies in a way that is computationally efficient but avoids the approximations inherent in binary segmentation-like approaches. This approach, which we call MVCAPA, is shown to consistently estimate the number and location of the collective anomalies, a property that has not previously been shown for competing methods. MVCAPA can be made robust to point anomalies and can allow for the anomalies to be imperfectly aligned. We show the practical usefulness of allowing for imperfect alignments through a resulting increase in power to detect regions of copy number variation.
Recent developments within deep learning are relevant for nonlinear system identification problems. In this paper, we establish connections between the deep learning and the system identification communities. It has recently been shown that convolutional architectures are at least as capable as recurrent architectures when it comes to sequence modeling tasks. Inspired by these results we explore the explicit relationships between the recently proposed temporal convolutional network (TCN) and two classic system identification model structures; Volterra series and block-oriented models. We end the paper with an experimental study where we provide results on two real-world problems, the well-known Silverbox dataset and a newer dataset originating from ground vibration experiments on an F-16 fighter aircraft.
One of the major challenges in training deep architectures for predictive tasks is the scarcity and cost of labeled training data. Active Learning (AL) is one way of addressing this challenge. In stream-based AL, observations are continuously made available to the learner that have to decide whether to request a label or to make a prediction. The goal is to reduce the request rate while at the same time maximize prediction performance. In previous research, reinforcement learning has been used for learning the AL request/prediction strategy. In our work, we propose to equip a reinforcement learning process with memory augmented neural networks, to enhance the one-shot capabilities. Moreover, we introduce Class Margin Sampling (CMS) as an extension of the standard margin sampling to the reinforcement learning setting. This strategy aims to reduce training time and improve sample efficiency in the training process. We evaluate the proposed method on a classification task using empirical accuracy of label predictions and percentage of label requests. The results indicates that the proposed method, by making use of the memory augmented networks and CMS in the training process, outperforms existing baselines.
Many advances in Natural Language Processing have been based upon more expressive models for how inputs interact with the context in which they occur. Recurrent networks, which have enjoyed a modicum of success, still lack the generalization and systematicity ultimately required for modelling language. In this work, we propose an extension to the venerable Long Short-Term Memory in the form of mutual gating of the current input and the previous output. This mechanism affords the modelling of a richer space of interactions between inputs and their context. Equivalently, our model can be viewed as making the transition function given by the LSTM context-dependent. Experiments demonstrate markedly improved generalization on language modelling in the range of 3-4 perplexity points on Penn Treebank and Wikitext-2, and 0.01-0.05 bpc on four character-based datasets. We establish a new state of the art on all datasets with the exception of Enwik8, where we close a large gap between the LSTM and Transformer models.
We introduce a new Procrustes-type method called matching component analysis to isolate components in data for transfer learning. Our theoretical results describe the sample complexity of this method, and we demonstrate through numerical experiments that our approach is indeed well suited for transfer learning.
Deep learning algorithms often require solving a highly non-linear and nonconvex unconstrained optimization problem. Methods for solving optimization problems in large-scale machine learning, such as deep learning and deep reinforcement learning (RL), are generally restricted to the class of first-order algorithms, like stochastic gradient descent (SGD). While SGD iterates are inexpensive to compute, they have slow theoretical convergence rates. Furthermore, they require exhaustive trial-and-error to fine-tune many learning parameters. Using second-order curvature information to find search directions can help with more robust convergence for non-convex optimization problems. However, computing Hessian matrices for large-scale problems is not computationally practical. Alternatively, quasi-Newton methods construct an approximate of the Hessian matrix to build a quadratic model of the objective function. Quasi-Newton methods, like SGD, require only first-order gradient information, but they can result in superlinear convergence, which makes them attractive alternatives to SGD. The limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) approach is one of the most popular quasi-Newton methods that construct positive definite Hessian approximations. In this chapter, we propose efficient optimization methods based on L-BFGS quasi-Newton methods using line search and trust-region strategies. Our methods bridge the disparity between first- and second-order methods by using gradient information to calculate low-rank updates to Hessian approximations. We provide formal convergence analysis of these methods as well as empirical results on deep learning applications, such as image classification tasks and deep reinforcement learning on a set of ATARI 2600 video games. Our results show a robust convergence with preferred generalization characteristics as well as fast training time.
Differential privacy (DP) provides formal guarantees that the output of a database query does not reveal too much information about any individual present in the database. While many differentially private algorithms have been proposed in the scientific literature, there are only a few end-to-end implementations of differentially private query engines. Crucially, existing systems assume that each individual is associated with at most one database record, which is unrealistic in practice. We propose a generic and scalable method to perform differentially private aggregations on databases, even when individuals can each be associated with arbitrarily many rows. We express this method as an operator in relational algebra, and implement it in an SQL engine. To validate this system, we test the utility of typical queries on industry benchmarks, and verify its correctness with a stochastic test framework we developed. We highlight the promises and pitfalls learned when deploying such a system in practice, and we publish its core components as open-source software.

### Document worth reading: “High-Performance Support Vector Machines and Its Applications”

The support vector machines (SVM) algorithm is a popular classification technique in data mining and machine learning. In this paper, we propose a distributed SVM algorithm and demonstrate its use in a number of applications. The algorithm is named high-performance support vector machines (HPSVM). The major contribution of HPSVM is two-fold. First, HPSVM provides a new way to distribute computations to the machines in the cloud without shuffling the data. Second, HPSVM minimizes the inter-machine communications in order to maximize the performance. We apply HPSVM to some real-world classification problems and compare it with the state-of-the-art SVM technique implemented in R on several public data sets. HPSVM achieves similar or better results. High-Performance Support Vector Machines and Its Applications

### If you did not already know

NestDNN
Mobile vision systems such as smartphones, drones, and augmented-reality headsets are revolutionizing our lives. These systems usually run multiple applications concurrently and their available resources at runtime are dynamic due to events such as starting new applications, closing existing applications, and application priority changes. In this paper, we present NestDNN, a framework that takes the dynamics of runtime resources into account to enable resource-aware multi-tenant on-device deep learning for mobile vision systems. NestDNN enables each deep learning model to offer flexible resource-accuracy trade-offs. At runtime, it dynamically selects the optimal resource-accuracy trade-off for each deep learning model to fit the model’s resource demand to the system’s available runtime resources. In doing so, NestDNN efficiently utilizes the limited resources in mobile vision systems to jointly maximize the performance of all the concurrently running applications. Our experiments show that compared to the resource-agnostic status quo approach, NestDNN achieves as much as 4.2% increase in inference accuracy, 2.0x increase in video frame processing rate and 1.7x reduction on energy consumption. …

Attention Branch Network (ABN)
Visual explanation enables human to understand the decision making of Deep Convolutional Neural Network (CNN), but it is insufficient to contribute the performance improvement. In this paper, we focus on the attention map for visual explanation, which represents high response value as the important region in image recognition. This region significantly improves the performance of CNN by introducing an attention mechanism that focuses on a specific region in an image. In this work, we propose Attention Branch Network (ABN), which extends the top-down visual explanation model by introducing a branch structure with an attention mechanism. ABN can be applicable to several image recognition tasks by introducing a branch for attention mechanism and is trainable for the visual explanation and image recognition in end-to-end manner. We evaluate ABN on several image recognition tasks such as image classification, fine-grained recognition, and multiple facial attributes recognition. Experimental results show that ABN can outperform the accuracy of baseline models on these image recognition tasks while generating an attention map for visual explanation.
Embedding Human Knowledge in Deep Neural Network via Attention Map

Payoff Dynamical Model (PDM)
We consider that at every instant each member of a population, which we refer to as an agent, selects one strategy out of a finite set. The agents are nondescript, and their strategy choices are described by the so-called population state vector, whose entries are the portions of the population selecting each strategy. Likewise, each entry constituting the so-called payoff vector is the reward attributed to a strategy. We consider that a general finite-dimensional nonlinear dynamical system, denoted as payoff dynamical model (PDM), describes a mechanism that determines the payoff as a causal map of the population state. A bounded-rationality protocol, inspired primarily on evolutionary biology principles, governs how each agent revises its strategy repeatedly based on complete or partial knowledge of the population state and payoff. The population is protocol-homogeneous but is otherwise strategy-heterogeneous considering that the agents are allowed to select distinct strategies concurrently. A stochastic mechanism determines the instants when agents revise their strategies, but we consider that the population is large enough that, with high probability, the population state can be approximated with arbitrary accuracy uniformly over any finite horizon by a so-called (deterministic) mean population state. We propose an approach that takes advantage of passivity principles to obtain sufficient conditions determining, for a given protocol and PDM, when the mean population state is guaranteed to converge to a meaningful set of equilibria, which could be either an appropriately defined extension of Nash’s for the PDM or a perturbed version of it. By generalizing and unifying previous work, our framework also provides a foundation for future work. …

GP-DRF
Deep Gaussian processes (DGP) have appealing Bayesian properties, can handle variable-sized data, and learn deep features. Their limitation is that they do not scale well with the size of the data. Existing approaches address this using a deep random feature (DRF) expansion model, which makes inference tractable by approximating DGPs. However, DRF is not suitable for variable-sized input data such as trees, graphs, and sequences. We introduce the GP-DRF, a novel Bayesian model with an input layer of GPs, followed by DRF layers. The key advantage is that the combination of GP and DRF leads to a tractable model that can both handle a variable-sized input as well as learn deep long-range dependency structures of the data. We provide a novel efficient method to simultaneously infer the posterior of GP’s latent vectors and infer the posterior of DRF’s internal weights and random frequencies. Our experiments show that GP-DRF outperforms the standard GP model and DRF model across many datasets. Furthermore, they demonstrate that GP-DRF enables improved uncertainty quantification compared to GP and DRF alone, with respect to a Bhattacharyya distance assessment. Source code is available at https://…/GP_DRF.

### Magister Dixit

“Statistics is one of the most important skills required by a data scientist.” G S Praneeth Reddy ( 12.11.2018 )

### Distilled News

Concepts, applications, and examples with Python. Graphs are becoming central to machine learning these days, whether you’d like to understand the structure of a social network by predicting potential connections, detecting fraud, understand customer’s behavior of a car rental service or making real-time recommendations for example.
Random Forests are one of the most popular machine learning models used by data scientists today. How they are actually implemented and the variety of use cases they can be applied to are often overlooked. While this article will focus on the inner workings of Random Forests, we’ll start off by exploring the main problems this model solves.
Using a deep learning technique to fight overfitting for a simple linear regression model. I was testing an example from scikit-learn site, that demonstrates the problems of underfitting and overfitting and how we can use linear regression with polynomial features to approximate nonlinear functions, according to the article. Below is a modified version of this code.
So, I decided to collect 5 cases when data science is used for nothing but fun. Let’s get started, shall we?
1. Game Of Throne Deaths in Season 8, Data Science’s Angle
2. Predicting the outcome of sports game: Syracuse over Michigan State
3. Taylor Swift detector developed with Swift
4. Game of Wines – the ML and data science based detector of wine quality
5. Who’s killing the Academy Awards Game? – predicted by data science
Whether you’re a city planner, a small business owner, or a software developer, gaining useful insights from data can help make services work better and answer important questions. But, without strong privacy protections, you risk losing the trust of your citizens, customers, and users. Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual’s data to be distinguished or re-identified. This type of analysis can be implemented in a wide variety of ways and for many different purposes. For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner. Today, we’re rolling out the open-source version of the differential privacy library that helps power some of Google’s core products. To make the library easy for developers to use, we’re focusing on features that can be particularly difficult to execute from scratch, like automatically calculating bounds on user contributions. It is now freely available to any organization or developer that wants to use it.
This post aims to introduce you to animating ggplot2 visualisations in r using the gganimate package by Thomas Lin Pedersen. The post will visualise the theoretical winnings I would’ve had, had I followed the simple model to predict (or tip as it’s known in Australia) winners in the AFL that I explained in this post. The data used in the analysis was collected from the AFL Tables website as part of a larger series I wrote about on AFL crowds.
Applications for machine learning and deep learning have become increasingly accessible. For example, Keras provides APIs with TensorFlow backend that enable users to build neural networks without being fluent with TensorFlow. Despite the ease of building and testing models, deep learning has suffered from a lack of interpretability; deep learning models are considered black boxes to many users. In a talk at ODSC West in 2018, Pramit Choudhary explained the importance of model evaluation and interpretability in deep learning and some cutting edge techniques for addressing it.
Document Classification: The task of assigning labels to large bodies of text. In this case the task is to classify news articles into different labels, such as sport or politics. The data set used wasn’t ideally suited for deep learning, having only low thousands of examples, but this is far from an unrealistic case outside larger firms.
Copenhagen is the capital and most populous city of Denmark and capital sits on the coastal islands of Zealand and Amager. It’s linked to Malmo in southern Sweden by the Oresund Bridge. Indre By, the city’s historic centre, contains Frederiksstaden, an 18th-century rococo district, home to the royal family’s Amalienborg Palace. Nearby is Christiansborg Palace and the Renaissance-era Rosenborg Castle, surrounded by gardens and home to the crown jewels.
Machine learning models include the step of preprocessing or feature engineering before the data is actually trainable. Feature Engineering includes normalizing and scaling data, encoding categorical values as numerical values, forming vocabularies, and binning of continuous numerical values. Distributed frameworks like Google Cloud Dataflow or Apache Spark are often well known for applying large scale data preprocessing. To remove the inconsistency between training and serving ML models from different environments Google has come up with tf.Transform, a library for TensorFlow that ensures consistency of the feature engineering steps during model training and serving.
A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science. Clarity about these matters should benefit not just philosophy of science, but also statistical practice. At best, the inductivist view has encouraged researchers to fit and compare models without checking them; at worst, theorists have actively discouraged practitioners from performing model checking because it does not fit into their framework.
DeepPrivacy is a fully automatic anonymization technique for images. This repository contains the source code for the paper ‘DeepPrivacy: A Generative Adversarial Network for Face Anonymization’, published at ISVC 2019.

### RSwitch 1.5.0 Release Now Also Corrals RStudio Server Connections

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RSwitch is a macOS menubar application that works on macOS 10.14+ and provides handy shortcuts for developing with R on macOS. Version 1.5.0 brings a reorganized menu system and the ability to manage and make connections to RStudio Server instances. Here’s a quick peek at the new setup:

All books, links, and other reference resources are under a single submenu system:

If there’s a resource you’d like added, follow the links on the main RSwitch site to file PRs where you’re most comfortable.

You can also setup automatic checks and notifications for when new RStudio Dailies are available (you can still always check manually and this check feature is off by default):

But, the biggest new feature is the ability to manage and launch RStudio Server connections right from RSwitch:

Click to view slideshow.

These RStudio Server browser connections are kept separate from your internet browsing and are one menu selection away. RSwitch also remembers the size and position of your RStudio Server session windows, so everything should be where you want/need/expect. This is somewhat of an experimental feature so definitely file issues if you run into any problems or would like things to work differently.

### FIN

Kick the tyres, file issues or requests and, if so inclined, let me know how you’re liking RSwitch!

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job. Also, moving beyond naive falsificationism

Sandro Ambuehl writes:

I’ve been following your blog and the discussion of replications and replicability across different fields daily, for years. I’m an experimental economist. The following question arose from a discussion I recently had with Anna Dreber, George Loewenstein, and others.

You’ve previously written about the importance of sound theories (and the dangers of anything-goes theories), and I was wondering whether there’s any formal treatment of that, or any empirical evidence on whether empirical investigations based on precise theories that simultaneously test multiple predictions are more likely to replicate than those without theoretical underpinnings, or those that test only isolated predictions.

Specifically: Many of the proposed solutions to the replicability issue (such as preregistration) seem to implicitly assume one-dimensional hypotheses such as “Does X increase Y?” In experimental economics, by contrast, we often test theories. The value of a theory is precisely that it makes multiple predictions. (In economics, theories that explain just one single phenomenon, or make one single prediction are generally viewed as useless and are highly discouraged.) Theories typically also specify how its various predictions relate to each other, often even regarding magnitudes. They are formulated as mathematical models, and their predictions are correspondingly precise. Let’s call a within-subjects experiment that tests a set of predictions of a theory a “multi-dimensional experiment”.

My conjecture is that all the statistical skulduggery that leads to non-replicable results is much harder to do in a theory-based, multi-dimensional experiment. If so, multi-dimensional experiment should lead to better replicability even absent safeguards such as preregistration.

The intuition is the following. Suppose an unscrupulous researcher attempts to “prove” a single prediction that X increases Y. He can do that by selectively excluding subjects with low X and high Y (or high X and low Y) from the sample. Compare that to a researcher who attempts to “prove”, in a within-subject experiment, that X increases Y and A increases B. The latter researcher must exclude many more subjects until his “preferred” sample includes only subjects that conform to the joint hypothesis. The exclusions become harder to justify, and more subjects must be run.

A similar intuition applies to the case of an unscrupulous researcher who tries to “prove” a hypothesis by messing with the measurements of variables (e.g. by using log(X) instead of X). Here, an example is a theory that predicts that X increases both Y and Z. Suppose the researcher finds a Null if he regresses X on Y, but finds a positive correlation between f(X) on Y for some selected transformation f. If the researcher only “tested” the relation between X and Y (a one-dimensional experiment), the researcher could now declare “success”. In a multi-dimensional experiment, however, the researcher will have to dig for an f that doesn’t only generate a positive correlation between f(X) and Y, but also between f(X) and Z, which is harder. A similar point applies if the researcher measures X in different ways (e.g. through a variety of related survey questions) and attempts to select the measurement that best helps “prove” the hypothesis. (Moreover, such a theory would typically also specify something like “If X increases Y by magnitude alpha, then it should increase Z by magnitude beta”. The relation between Y and Z would then present an additional prediction to be tested, yet again increasing the difficulty of “proving” the result through nefarious manipulations.)

So if there is any formal treatment relating to the above intuitions, or any empirical evidence on what kind of research tends to be more or less likely to replicate (depending on factors other than preregistration), I would much appreciate if you could point me to it.

I have two answers for you.

First, some colleagues and I recently published a preregistered replication of one of our own studies; see here. This might be interesting to you because our original study did not test a single thing, so our evaluation was necessarily holistic. In our case, the study was descriptive, not theoretically-motivated, so it’s not quite what you’re talking about—but it’s like your study in that the outcomes of interest were complex and multidimensional.

This was one of the problems I’ve had with recent mass replication studies, that they treat a scientific paper as if it has a single conclusion, even though real papers—theoretically-based or not—typically have many conclusions.

My second response is that I fear you are being too optimistic. Yes, when a theory makes multiple predictions, it may be difficulty to select data to make all the predictions work out. But on the other hand you have many degrees of freedom with which to declare success.

This has been one of my problems with a lot of social science research. Just about any pattern in data can be given a theoretical explanation, and just about any pattern in data can be said to be the result of a theoretical prediction. Remember that claim that women were three times more likely to wear red or pink clothing during a certain time of the month? The authors of that study did a replication which failed–but they declared it a success after adding an interaction with outdoor air temperature. Or there was this political science study where the data went in the opposite direction of the preregistration but were retroactively declared to be consistent with the theory. It’s my impression that a lot of economics is like this too: If it goes the wrong way, the result can be explained. That’s fine—it’s one reason why economics is often a useful framework for modeling the world—but I think the idea that statistical studies and p-values and replication are some sort of testing ground for models, the idea that economists are a group of hard-headed Popperians, regularly subjecting their theories to the hard test of reality—I’m skeptical of that take. I think it’s much more that individual economists, and schools of economists, are devoted to their theories and only rarely abandon them on their own. That is, I have a much more Kuhnian take on the whole process. Or, to put it another way, I try to be Popperian in my own research, I think that’s the ideal, but I think the Kuhnian model better describes the general process of science. Or, to put it another way, I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job.

Ambuehl responded:

Anna did have a similar reaction to you—and I think that reaction depends much on what passes as a “theory”. For instance, you won’t find anything in a social psychology textbook that an economic theorist would call a “theory”. You’re certainly right about the issues pertaining to hand-wavy ex-post explanations as with the clothes and ovulation study, or “anything-goes theories” such as the Himicanes that might well have turned out the other way.

By contrast, the theories I had in mind when asking the question are mathematically formulated theories that precisely specify their domain of applicability. An example of the kind of theory I have in mind would be Expected Utility theory, tested in countless papers, e.g. here). Another example of such a theory is the Shannon model of choice under limited attention (tested, e.g., here). These theories are in an entirely different ballpark than vague ideas like, e.g., self-perception theory or social comparison theory that are so loosely specified that one cannot even begin to test them unless one is willing to make assumptions on each of the countless researcher degrees of freedom they leave open.

In fact, economic theorists tend to regard the following characteristics virtues, or even necessities, of any model: precision (can be tested without requiring additional assumptions), parsimony (and hence, makes it hard to explain “uncomfortable” results by interactions etc.), generality (in the sense that they make multiple predictions, across several domains). And they very much frown upon ex post theorizing, ad-hoc assumptions, and imprecision. For theories that satisfy these properties, it would seem much harder to fudge empirical research in a way that doesn’t replicate, wouldn’t it? (Whether the community will accept the results or not seems orthogonal to the question of replicability, no?)

Finally, to the extent that theories in the form of precise, mathematical models are often based on wide bodies of empirical research (economic theorists often try to capture “stylized facts”), wouldn’t one also expect higher rates of replicability because such theories essentially correspond to well-informed priors?

So my overall point is, doesn’t (good) theory have a potentially important role to play regarding replicability? (Many current suggestions for solving the replication crisis, in particular formulaic ones such as pre-registration, or p<0.005, don't seem to recognize those potential benefits of sound theory.)

I replied:

Well, sure, but expected utility theory is flat-out false. Much has been written on the way that utilities only exist after the choices are given. This can even be seen in simple classroom demonstrations, as in section 5 of this paper from 1998. No statistics are needed at all to demonstrate the problems with that theory!

Amdahl responded with some examples of more sophisticated, but still testable, theories such as reference-dependent preferences, various theories of decision making under ambiguity, and perception-based theories, and I responded with my view that all these theories are either vague enough to be adaptable to any data or precise enough to be evidently false with no data collection needed. This was what Lakatos noted: any theory is either so brittle that it can be destroyed by collecting enough data, or flexible enough to fit anything. This does not mean we can’t do science, it just means we have to move beyond naive falsificationism.

P.S. Tomorrow’s post: “Boston Globe Columnist Suspended During Investigation Of Marathon Bombing Stories That Don’t Add Up.”

### What’s going on on PyPI

Scanning all new published packages on PyPI I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, being followed or inspire you in some way.

pychamp
Data Science Tools. pychamp is a data science tool intended to ease data science practices.
Modules:
• configparser : It can be used for parsing configuration file format json and ini.
• connection : It can be used for creating connection URL, connection engine and for executing any type of sql queries.
• features_selection : It can be used for selecting features using Backward Elimination, VIF and Features Importance.
• sampling : It can be used for different types of sampling operations such as SMOTE, SMOTENC and ADASYN for both categorical and numerical features.
• stats : It can be used for Confidence and Prediction Interval, IQR outlier removal and Summary Statistics.
• viz : It can be used for visualization.
• net : It can be used for sending mail currently.
• model : Different types of regression, classification and clustering models can be used.
• eda : It handles data types and missing values.

pyExplore
Python Package for exploratory data analysis in Data Science

PYSNN
PYthon Simple Neural Network – PYSNN is python3 lib for machine learning

The Python document processor

rskit
Toolkit for recommender systems. Toolkit for building recommender systems
• Provide CLI interface for running recommendation algorithms
• Contains abstractions you can leverage to build custom recommenders

topicnet
TopicNet is a module for topic modelling using ARTM algorithm

atom-ml
ATOM is an AutoML package. ATOM is a python package for exploration of ML problems. With just a few lines of code, you can compare the performance of multiple machine learning models on a given dataset, providing a quick insight on which algorithms performs best for the task at hand. Furthermore, ATOM contains a variety of plotting functions to help you analyze the models’ performances.

catch22
CAnonical Time-series Features, see description and license on GitHub.

causal-tree-learn
Python implementation of causal trees with validation

ems-dataflow-testframework
Framework helping testing Google Cloud Dataflows. This framework aims to help test Google Cloud Platform dataflows in an end-to-end way.

FAT-Forensics
A Python Toolbox for Algorithmic Fairness, Accountability and Transparency

oba-sparql
OBA Sparql Manager

sciwing
Modern Scientific Document Processing Framework. SciWING is a modern framework from WING-NUS to facilitate Scientific Document Processing. It is built on PyTorch and believes in modularity from ground up and easy to use interface. SciWING includes many pre-trained models for fundamental tasks in Scientific Document Processing for practitioners.

sko
Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Algorithm in Python

gaus-binom-distributions
Gaussian & Binomial Distributions

### What’s going on on PyPI

Scanning all new published packages on PyPI I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, being followed or inspire you in some way.

pychamp
Data Science Tools. pychamp is a data science tool intended to ease data science practices.
Modules:
• configparser : It can be used for parsing configuration file format json and ini.
• connection : It can be used for creating connection URL, connection engine and for executing any type of sql queries.
• features_selection : It can be used for selecting features using Backward Elimination, VIF and Features Importance.
• sampling : It can be used for different types of sampling operations such as SMOTE, SMOTENC and ADASYN for both categorical and numerical features.
• stats : It can be used for Confidence and Prediction Interval, IQR outlier removal and Summary Statistics.
• viz : It can be used for visualization.
• net : It can be used for sending mail currently.
• model : Different types of regression, classification and clustering models can be used.
• eda : It handles data types and missing values.

pyExplore
Python Package for exploratory data analysis in Data Science

PYSNN
PYthon Simple Neural Network – PYSNN is python3 lib for machine learning

The Python document processor

rskit
Toolkit for recommender systems. Toolkit for building recommender systems
• Provide CLI interface for running recommendation algorithms
• Contains abstractions you can leverage to build custom recommenders

topicnet
TopicNet is a module for topic modelling using ARTM algorithm

atom-ml
ATOM is an AutoML package. ATOM is a python package for exploration of ML problems. With just a few lines of code, you can compare the performance of multiple machine learning models on a given dataset, providing a quick insight on which algorithms performs best for the task at hand. Furthermore, ATOM contains a variety of plotting functions to help you analyze the models’ performances.

catch22
CAnonical Time-series Features, see description and license on GitHub.

causal-tree-learn
Python implementation of causal trees with validation

ems-dataflow-testframework
Framework helping testing Google Cloud Dataflows. This framework aims to help test Google Cloud Platform dataflows in an end-to-end way.

FAT-Forensics
A Python Toolbox for Algorithmic Fairness, Accountability and Transparency

oba-sparql
OBA Sparql Manager

sciwing
Modern Scientific Document Processing Framework. SciWING is a modern framework from WING-NUS to facilitate Scientific Document Processing. It is built on PyTorch and believes in modularity from ground up and easy to use interface. SciWING includes many pre-trained models for fundamental tasks in Scientific Document Processing for practitioners.

sko
Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Algorithm in Python

gaus-binom-distributions
Gaussian & Binomial Distributions

### If you did not already know

Apache Avro
Apache Avro is a data serialization system. …

Feudal Multi-agent Hierarchies (FMH)
We investigate how reinforcement learning agents can learn to cooperate. Drawing inspiration from human societies, in which successful coordination of many individuals is often facilitated by hierarchical organisation, we introduce Feudal Multi-agent Hierarchies (FMH). In this framework, a ‘manager’ agent, which is tasked with maximising the environmentally-determined reward function, learns to communicate subgoals to multiple, simultaneously-operating, ‘worker’ agents. Workers, which are rewarded for achieving managerial subgoals, take concurrent actions in the world. We outline the structure of FMH and demonstrate its potential for decentralised learning and control. We find that, given an adequate set of subgoals from which to choose, FMH performs, and particularly scales, substantially better than cooperative approaches that use a shared reward function. …

Generative adversarial networks (GANs) learn to mimic training data that represents the underlying true data distribution. However, GANs suffer when the training data lacks quantity or diversity and therefore cannot represent the underlying distribution well. To improve the performance of GANs trained on under-represented training data distributions, this paper proposes KG-GAN (Knowledge-Guided Generative Adversarial Network) to fuse domain knowledge with the GAN framework. KG-GAN trains two generators; one learns from data while the other learns from knowledge. To achieve KG-GAN, domain knowledge is formulated as a constraint function to guide the learning of the second generator. We validate our framework on two tasks: fine-grained image generation and hair recoloring. Experimental results demonstrate the effectiveness of KG-GAN. …

Relay
Frameworks for writing, compiling, and optimizing deep learning (DL) models have recently enabled progress in areas like computer vision and natural language processing. Extending these frameworks to accommodate the rapidly diversifying landscape of DL models and hardware platforms presents challenging tradeoffs between expressiveness, composability, and portability. We present Relay, a new intermediate representation (IR) and compiler framework for DL models. The functional, statically-typed Relay IR unifies and generalizes existing DL IRs and can express state-of-the-art models. Relay’s expressive IR required careful design of the type system, automatic differentiation, and optimizations. Relay’s extensible compiler can eliminate abstraction overhead and target new hardware platforms. The design insights from Relay can be applied to existing frameworks to develop IRs that support extension without compromising on expressivity, composibility, and portability. Our evaluation demonstrates that the Relay prototype can already provide competitive performance for a broad class of models running on CPUs, GPUs, and FPGAs. …

## September 13, 2019

### ttdo 0.0.3: New package

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A new package of mine arrived on CRAN yesterday, having been uploaded a few days prior on the weekend. It extends the most excellent (and very minimal / zero depends) unit testing package tinytest by Mark van der Loo with the very clever and well-done diffobj package by Brodie Gaslam. Mark also tweeted about it.

The package was written to address a fairly specific need. In teaching STAT 430 at Illinois, I am relying on the powerful PrairieLearn system (developed there) to provides tests, quizzes or homework. Alton and I have put together an autograder for R (which is work in progress, more on that maybe another day), and that uses this package to provides colorized differences between supplied and expected answers in case of an incorrect answer.

Now, the aspect of providing colorized diffs when tests do not evalute to TRUE is both simple and general enough. As our approach works rather well, I decided to offer the package on CRAN as well. The small screenshot gives a simple idea, the README.md contains a larger screenshoot.

The initial NEWS entries follow below.

#### Changes in ttdo version 0.0.2 (2019-08-31)

• Updated defaults for format and mode to use the same options used by diffobj along with fallbacks.

#### Changes in ttdo version 0.0.1 (2019-08-26)

• Initial version, with thanks to both Mark and Brodie.

Please use the GitHub repo and its issues for any questions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Distilled News

Word embeddings – the mapping of words into numerical vector spaces – has proved to be an incredibly important method for natural language processing (NLP) tasks in recent years, enabling various machine learning models that rely on vector representation as input to enjoy richer representations of text input. These representation preserve more semantic and syntactic information on words, leading to improved performance in almost every imaginable NLP task.
BERT is changing the NLP landscape and making chatbots much smarter by enabling computers to better understand speech and respond intelligently in real-time. What Makes BERT so Amazing?
• BERT is a contextual model.
• BERT enables transfer learning.
• BERT can be fine-tuned cheaply and quickly.
In the previous story (part A) we discussed the structure and three main building blocks of a Neural Network. This story will take you through the elements which really make a useful force and separate them from rest of the Machine Learning Algorithms. Previously we discussed about Units/Neurons, Weights/Parameters & Biases today we will discuss – Hyper-Parameters
Graphs are applicable to many real-world datasets such as social networks, citation networks, chemical graphs, etc. The growing interest in graph-structured data increases the number of researches in graph neural networks. Variational autoencoders (VAEs) embodied the success of variational Bayesian methods in deep learning and have inspired a wide range of ongoing researches. Variational graph autoencoder (VGAE) applies the idea of VAE on graph-structured data, which significantly improves predictive performance on a number of citation network datasets such as Cora and Citesser. I searched on the internet and have yet to see a detailed tutorial on VGAE. In this article, I will briefly talk about traditional autoencoders and variational autoencoders. Furthermore, I will discuss the idea of applying VAE to graph-structured data (VGAE).
When we create our machine learning models, a common task that falls on us is how to tune them. People end up taking different manual approaches. Some of them work, and some don’t, and a lot of time is spent in anticipation and running the code again and again. So that brings us to the quintessential question: Can we automate this process?
R dogma is that for loops are bad because they are slow but this is not the case in C++. I had never programmed a line of C++ as of last week but my beloved firstborn started university last week and is enrolled in a C++ intro course, so I thought I would try to learn some and see if it would speed up Passing Bablok regression.
What it is really like to develop a model for a real-world business case. Have you ever taken part in a Kaggle competition? If you are studying, or have studied machine learning it is fairly likely that at some point you will have entered one. It is definitely a great way to put your model building skills into practice and I spent quite a bit of time on Kaggle when I was studying.
On Wednesday, the Allen Institute for Artificial Intelligence, a prominent lab in Seattle, unveiled a new system that passed the test with room to spare. It correctly answered more than 90 percent of the questions on an eighth-grade science test and more than 80 percent on a 12th-grade exam.
How do new scientific disciplines get started? For Iyad Rahwan, a computational social scientist with self-described ‘maverick’ tendencies, it happened on a sunny afternoon in Cambridge, Massachusetts, in October 2017. Rahwan and Manuel Cebrian, a colleague from the MIT Media Lab, were sitting in Harvard Yard discussing how to best describe their preferred brand of multidisciplinary research. The rapid rise of artificial intelligence technology had generated new questions about the relationship between people and machines, which they had set out to explore. Rahwan, for example, had been exploring the question of ethical behavior for a self-driving car – should it swerve to avoid an oncoming SUV, even if it means hitting a cyclist? – in his Moral Machine experiment.
Based on some recent conversations, I realized that text preprocessing is a severely overlooked topic. A few people I spoke to mentioned inconsistent results from their NLP applications only to realize that they were not preprocessing their text or were using the wrong kind of text preprocessing for their project. With that in mind, I thought of shedding some light around what text preprocessing really is, the different techniques of text preprocessing and a way to estimate how much preprocessing you may need. For those interested, I’ve also made some text preprocessing code snippets in python for you to try. Now, let’s get started!
We are excited to introduce Neural Structured Learning in TensorFlow, an easy-to-use framework that both novice and advanced developers can use for training neural networks with structured signals. Neural Structured Learning (NSL) can be applied to construct accurate and robust models for vision, language understanding, and prediction in general.
• Chinese billionaire’s face identified as jaywalker
• Uber self-driving car kills a pedestrian
• IBM Watson comes up short in healthcare
• Amazon AI recruiting tool is gender biased
• DeepFakes reveals AI’s unseemly side
• Google Photo confuses skier and mountain
• LG robot Cloi gets stagefright at its unveiling
• Boston Dynamics robot blooper
• AI World Cup 2018 predictions almost all wrong
• Startup claims to predict IQ from faces
As the deputy director for technology, data, and analysis at the San Francisco County Transportation Authority, Castiglione spends his days manipulating models of the Bay Area and its 7 million residents. From wide-sweeping ridership and traffic data to deep dives into personal travel choices via surveys, his models are able to estimate the number of people who will disembark at a specific train platform at a certain time of day and predict how that might change if a new housing development is built nearby, or if train-frequency is increased. The models are exceedingly complex, because people are so complex. ‘Think about the travel choices you’ve made in the last week, or the last year,’ Castiglione says. ‘How do you time your trips? What tradeoffs do you make? What modes of transportation do you use? How do those choices change from day to day?’ He has the deep voice of an NPR host and the demeanor of a patient professor. ‘The models are complex but highly rational,’ he says.
In my previous article, I introduced the idea behind the classification algorithm Support Vector Machine. Here, I’m going to show you a practical application in Python of what I’ve been explaining, and I will do so by using the well-known Iris dataset. Following the same structure of that article, I will first deal on linearly separable data, then I will move towards no-linearly separable data, so that you can appreciate the power of SVM which lie in the so-called Kernel Trick.
Whether we’re predicting water levels, queue lengths or bike rentals, at HAL24K we do a lot of regression, with everything from random forests to recurrent neural networks. And as good as our models are, we know they can never be perfect. Therefore, whenever we provide our customers with predictions, we also like to include a set of confidence intervals: what range around the prediction will the actual value fall within, with (e.g.) 80% confidence?

### Tell Me a Story: How to Generate Textual Explanations for Predictive Models

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

TL;DR: If you are going to explain predictions for a black box model you should combine statistical charts with natural language descriptions. This combination is more powerful than SHAP/LIME/PDP/Break Down charts alone. During this summer Adam Izdebski implemented this feature for explanations generated in R with DALEX library. How he did it? Find out here:

Long version:
Amazing things were created during summer internships at MI2DataLab this year. One of them is the generator of natural language descriptions for DALEX explainers developed by Adam Izdebski.

Is text better than charts for explanations?
Packages from DrWhy.AI toolbox generate lots of graphical explanations for predictive models. Available statistical charts allow to better understand how a model is working in general (global perspective) or for a specific prediction (local perspective).
Yet for domain experts without training in mathematics or computer science, graphical explanations may be insufficient. Charts are great for exploration and discovery, but for explanations they introduce some ambiguity. Have I read everything? Maybe I missed something?
To address this problem we introduced the describe() function, which
automatically generates textual explanations for predictive models. Right now these natural language descriptions are implemented in R packages by ingredients and iBreakDown.

Insufficient interpretability
Domain experts without formal training in mathematics or computer science can often find statistical explanations as hard to interpret. There are various reasons for this. First of all, explanations are often displayed as complex plots without instructions. Often there is no clear narration or interpretation visible. Plots are using different scales mixing probabilities with relative changes in model’s prediction. The order of variables may be also misleading. See for example a Break-Down plot.

The figure displays the prediction generated with Random Forest that a selected passenger survived the Titanic sinking. The model’s average response on titanic data set (intercept) is equal to 0.324. The model predicts that the selected passenger survived with probability 0.639. It also displays all the variables that have contributed to that prediction. Once the plot is described it is easy to interpret as it posses a very clear graphical layout.
However, interpreting it for the first time may be tricky.

Properties of a good description
Effective communication and argumentation is a difficult craft. For that reason, we refer to winning debates strategies as for guiding in generating persuasive textual explanations. First of all, any description should be
intelligible and persuasive. We achieve this by using:

• Fixed structure: Effective communication requires a rigid structure. Thus we generate descriptions from a fixed template, that always includes a proper introduction, argumentation and conclusion part. This makes the description more predictable, hence intelligible.
• Situation recognition: In order to make a description more trustworthy, we begin generating the text by identifying one of the scenarios, that we are dealing with. Currently, the following scenarios are available:
• The model prediction is significantly higher than the average model prediction. In this case, the description should convince the reader why the prediction is higher than the average.
• The model prediction is significantly lower than the average model prediction. In this case, the description should convince the reader why the prediction is lower than the average.
• The model prediction is close to the average. In this case the description should convince the reader that either: variables are contradicting each other or variables are insignificant.

Identifying what should be justified, is a crucial step for generating persuasive descriptions.

Description’s template for persuasive argumentation
As noted before, to achieve clarity we generate descriptions with three separate components: an introduction, an argumentation part, and a summary.

An introduction should provide a claim. It is a basic point that an arguer wishes to make. In our case, it is the model’s prediction. Displaying the additional information about the predictions’ distribution helps to place it in a context — is it low, high or close to the average.

An argumentation part should provide evidence and reason, which connects the evidence to the claim. In normal settings this will work like that: This particular passenger survived the catastrophe (claim) because it was a child (evidence no. 1) and children were evacuated from the ship in the first order as in the phrase women and children first. (reason no. 1) What is more, the children were traveling in the 1-st class (evidence no. 2) and first-class passengers had the best cabins, which were close to the rescue boats. (reason no. 2).

The tricky part is that we are not able to make up a reason automatically, as it is a matter of context and interpretation. However what we can do is highlight the main evidence, that made the model produce the claim. If a model is making its’ predictions for the right reason, evidences should make much sense and it should be easy for the reader to make a story and connect the evidence to the claim. If the model is displaying evidence, that makes not much sense, it also should be a clear signal, that the model may not be trustworthy.

A summary is just the rest of the justification. It states that other pieces of evidence are with less importance, thus they may be omitted. A good rule of thumb is displaying three most important evidence, not to make the picture too complex. We can refer to the above scheme as to creating relational arguments as in winning debates guides.

Implementation
The logic described above is implemented in ingredients and iBreakDown packages.

For generating a description we should pass the explanation generated by ceteris_paribus() or break_down() or shap() to the describe() function.

describe(break_down_explanation)
# Random Forest predicts, that the prediction for the selected instance is 0.639 which is higher than the average.
# The most important variables that increase the prediction are gender, fare.
# The most important variable that decrease the prediction is class.
# Other variables are with less importance. The contribution of all other variables is -0.063 .

There are various parameters that control the display of the description making it more flexible, thus suited for more applications. They include:

• generating a short version of descriptions,
• displaying predictions’ distribution details,
• generating more detailed argumentation.

While explanations generated by iBreakDown are feature attribution explanations that aim at providing interpretable reasons for the model’s prediction, explanations generated by ingredients are rather speculative. In fine, they explain how the model’s prediction would change if we perturb the instance being explained. For example, ceteris_paribus() explanation explores how would the prediction change if we change the values of a single feature while keeping the other features unchanged.

describe(ceteris_paribus_explanation, variables = “age”)
# For the selected instance Random Forest predicts that , prediction is equal to 0.699.
# The highest prediction occurs for (age = 16), while the lowest for (age = 74).
# Breakpoint is identified at (age = 40).
# Average model responses are *lower* for variable values *higher* than breakpoint.

Applications and future work

Generating natural language explanations is a sensitive task, as the interpretability always depends on the end user’s cognition. For this reason, experiments should be designed to assess the usefulness of the descriptions being generated. Furthermore, more vocabulary flexibility could be added, to make the descriptions more human alike. Lastly, descriptions could be integrated with a chatbot that would explain predictions interactively, using the framework described here. Also, better discretization techniques can be used for generating better continuous ceteris paribus and aggregated profiles textual explanations.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Magister Dixit

“Disributed computation generally is hard, because it adds an additional layer of complexity and communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster.” Zygmunt Z. ( 2015-04-27 )

### NIMBLE short course at Bayes Comp 2020 conference

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We’ll be giving a short course on NIMBLE on January 7, 2020 at the Bayes Comp 2020 conference being held January 7-10 in Gainesville, Florida, USA.

Bayes Comp is a popular biennial ISBA-sponsored conference focused on computational methods/algorithms/technologies for Bayesian inference.

The short course focuses on programming algorithms in NIMBLE and is titled:

“Developing, modifying, and sharing Bayesian algorithms (MCMC samplers, SMC, and more) using the NIMBLE platform in R.”

More details on the conference and the short course are available at the conference website.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Many Heads Are Better Than One: The Case For Ensemble Learning

While ensembling techniques are notoriously hard to set up, operate, and explain, with the latest modeling, explainability and monitoring tools, they can produce more accurate and stable predictions. And better predictions can be better for business.

### AutoML: BigML Automated Machine Learning

Last year, BigML launched the OptiML resource for Automatic Model Optimization. Without a doubt, it has marked a milestone in our Machine Learning platform. Since then, many users have included OptiML in their Machine Learning toolboxes. However, some users are asking us to go further than model selection, so today we’re presenting BigML’s AutoML, an […]

### Initializing an empty list

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Problem

How do I initialize an empty list for use in a for-loop or function?

### Context

Sometimes I’m writing a for-loop (I know, I know, don’t use for-loops, but sometimes it’s just easier. I’m a little less good at apply functions than I’d like to be) and I know I’ll need to store the output in a list. Once in a while, the new list will be the same size as an existing one, but more often, I just need to start from scratch, knowing only the number of elements I want to include.

This isn’t a totally alien thing to need to do––it’s pretty familiar if you’re used to initializing empty vectors before for-loops. There’s a whole other debate to be had about whether or not it’s acceptable to start with a truly empty vector and append to it on every iteration of the loop or whether you should always know the length beforehand, but I’ll just focus on the latter case for now.

Anyway, initializing a vector of a given length is easy enough; I usually do it like this:

> desired_length <- 10 # or whatever length you want
> empty_vec <- rep(NA, desired_length)

I couldn’t immediately figure out how to replicate this for a list, though. The solution turns out to be relatively simple, but it’s just different enough that I can never seem to remember the syntax. This post is more for my records than anything, then.

### Solution

Initializing an empty list turns out to have an added benefit over my rep(NA) method for vectors; namely, the list ends up actually empty, not filled with NA’s. Confusingly, the function to use is “vector,” not “list.”

> desired_length <- 10 # or whatever length you want
> empty_list <- vector(mode = "list", length = desired_length)

> str(empty_list)
List of 10
$: NULL$ : NULL
$: NULL$ : NULL
$: NULL$ : NULL
$: NULL$ : NULL
$: NULL$ : NULL

### Outcome

Voilà, an empty list. No restrictions on the data type or structure of the individual list elements. Specify the length easily. Useful for loops, primarily, but may have other applications I haven’t come across yet.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Version Control for Data Science: Tracking Machine Learning Models and Datasets

I am a Git god, why do I need another version control system for Machine Learning Projects?

### Deterministic thinking (“dichotomania”): a problem in how we think, not just in how we act

This has come up before:

And it came up again recently.

Epidemiologist Sander Greenland has written about “dichotomania: the compulsion to replace quantities with dichotomies (‘black-and-white thinking’), even when such dichotomization is unnecessary and misleading for inference.”

I’d avoid the misleadingly clinically-sounding term “compulsion,” and I’d similarly prefer a word that doesn’t include the pejorative suffix “mania,” hence I’d rather just speak of “deterministic thinking” or “discrete thinking”—but I agree with Greenland’s general point that this tendency to prematurely collapse the wave function contributes to many problems in statistics and science.

Often when the problem of deterministic thinking comes up in discussion, I hear people explain it away, arguing that decisions have to be made (FDA drug trials are often brought up here), or that all rules are essentially deterministic (the idea that confidence intervals are interpreted as whether they include zero), or that this is a problem with incentives or publication bias, or that, sure, everyone knows that thinking of hypotheses as “true” or “false” is wrong, and that statistical significance and other summaries are just convenient shorthands for expressions of uncertainty that are well understood.

But I’d argue, with Eric Loken, that inappropriate discretization is not just a problem with statistical practice; it’s also a problem with how people think, that the idea of things being on or off is “actually the internal working model for a lot of otherwise smart scientists and researchers.”

This came up in some of the recent discussions on abandoning statistical significance, and I want to use this space to emphasize one more time the problem of inappropriate discrete modeling.

The issue arose in my 2011 paper, Causality and Statistical Learning:

More generally, anything that plausibly could have an effect will not have an effect that is exactly zero. I can respect that some social scientists find it useful to frame their research in terms of conditional independence and the testing of null effects, but I don’t generally find this approach helpful—and I certainly don’t believe that it is necessary to think in terms of conditional independence in order to study causality. Without structural zeros, it is impossible to identify graphical structural equation models.

The most common exceptions to this rule, as I see it, are independences from design (as in a designed or natural experiment) or effects that are zero based on a plausible scientific hypothesis (as might arise, e.g., in genetics, where genes on different chromosomes might have essentially independent effects), or in a study of ESP. In such settings I can see the value of testing a null hypothesis of zero effect, either for its own sake or to rule out the possibility of a conditional correlation that is supposed not to be there.

Another sort of exception to the “no true zeros” rule comes from in- formation restriction: a person’s decision should not be affected by knowledge that he or she does not have. For example, a consumer interested in buying apples cares about the total price he pays, not about how much of that goes to the seller and how much goes to the government in the form of taxes. So the restriction is that the utility depends on prices, not on the share of that going to taxes. That is the type of restriction that can help identify demand functions in economics.

I realize, however, that my perspective that there are no true zeros (information restrictions aside) is a minority view among social scientists and perhaps among people in general, on the evidence of [cognitive scientist Steven] Sloman’s book [Causal Models: How People Think About the World and Its Alternatives]. For example, from chapter 2: “A good politician will know who is motivated by greed and who is motivated by larger principles in order to discern how to solicit each one’s vote when it is needed” (p. 17). I can well believe that people think in this way but I don’t buy it: just about everyone is motivated by greed and by larger principles. This sort of discrete thinking doesn’t seem to me to be at all realistic about how people behave—although it might very well be a good model about how people characterize others.

In the next chapter, Sloman writes, “No matter how many times A and B occur together, mere co-occurrence cannot reveal whether A causes B, or B causes A, or something else causes both” (p. 25; emphasis added). Again, I am bothered by this sort of discrete thinking. I will return in a moment with an example, but just to speak generally, if A could cause B, and B could cause A, then I would think that, yes, they could cause each other. And if something else could cause them both, I imagine that could be happening along with the causation of A on B and of B on A.

To continue:

Let’s put this another way. Sloman’s book is called, “Causal Models: How People Think About the World and Its Alternatives,” and it makes sense to me that people think about the world discretely. My point, which I think aligns with those of Loken and Greenland, is that this discrete model of the world is typically inaccurate and misleading, so it’s worth fighting this tendency in ourselves. The point of the above example is that Sloman, who’s writing about how people think, is himself slipping into this error.

One more time

The above is just an example. My point is not to argue with Sloman—his book stands on its own, and his ideas can be valuable even if (or especially if!) his perspective on discrete modeling is different from mine. My point is that discrete thinking is not simply something people do because they have to make decisions, nor is it something that people do just because they have some incentive to take a stance of certainty. So when we’re talking about the problems of deterministic thinking, or premature collapse of the “wave function” of inferential uncertainty, we really are talking about a failure to incorporate enough of a continuous view of the world in our mental model.

P.S. Tomorrow’s post: I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job. Also, moving beyond naive falsificationism.

### #FunDataFriday – The magick package in R

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## WHAT IS IT?

Magick is a package available in R that allows you to very easily work with images and animations. Jeroen Ooms is the genius behind the package.

## WHY IS IT AWESOME?

Example graph with an animation. Produced using the magick package in R.

It’s, awesome because it comes in handy every time that I want to work with images or animations within R. It allows for image and animation manipulation, creation and combination.

The best part is that it’s so easy! I’ve put minimal effort into understanding this package and I’ve been able to manipulate images and gifs, add them to my graphs and also create custom gifs and animated graphs.

## HOW TO GET STARTED?

The package vignette has a great primer which I used to get started. I have also posted a tutorial showing how to create the graph above. Try the examples. Next, try bringing an image or animation into a graph of your own. Or, try saving your data visualization series into an animated format!

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Should you use "dot notation" or "bracket notation" with pandas?

If you've ever used the pandas library in Python, you probably know that there are two ways to select a Series (meaning a column) from a DataFrame:

# dot notation
df.col_name

# bracket notation
df['col_name']


Which method should you use? I'll make the case for each, and then you can decide...

## Why use bracket notation?

The case for bracket notation is simple: It always works.

Here are the specific cases in which you must use bracket notation, because dot notation would fail:

# column name includes a space
df['col name']

# column name matches a DataFrame method
df['count']

# column name matches a Python keyword
df['class']

# column name is stored in a variable
var = 'col_name'
df[var]

# column name is an integer
df[0]

# new column is created through assignment
df['new'] = 0


In other words, bracket notation always works, whereas dot notation only works under certain circumstances. That's a pretty compelling case for bracket notation!

As stated in the Zen of Python:

There should be one-- and preferably only one --obvious way to do it.

## Why use dot notation?

If you've watched any of my pandas videos, you may have noticed that I use dot notation. Here are four reasons why:

### Reason 1: Dot notation is easier to type

Dot notation is three fewer characters to type than bracket notation. And in terms of finger movement, typing a single period is much more convenient than typing brackets and quotes.

This might sound like a trivial reason, but if you're selecting columns dozens (or hundreds) of times a day, it makes a real difference!

### Reason 2: Dot notation is easier to read

Most of my pandas code is a made up of chains of selections and methods. By using dot notation, my code is mostly adorned with periods and parentheses (plus an occasional quotation mark):

# dot notation
df.col_one.sum()
df.col_one.isna().sum()
df.groupby('col_two').col_one.sum()


If you instead use bracket notation, your code is adorned with periods and parentheses plus lots of brackets and quotation marks:

# bracket notation
df['col_one'].sum()
df['col_one'].isna().sum()
df.groupby('col_two')['col_one'].sum()


I find the dot notation code easier to read, as well as more aesthetically pleasing.

### Reason 3: Dot notation is easier to remember

With dot notation, every component in a chain is separated by a period on both sides. For example, this line of code has 4 components, and thus there are 3 periods separating the individual components:

# dot notation
df.groupby('col_two').col_one.sum()


If you instead use bracket notation, some of your components are separated by periods, and some are not:

# bracket notation
df.groupby('col_two')['col_one'].sum()


With bracket notation, I often forget whether there's supposed to be a period before ['col_one'], after ['col_one'], or both before and after ['col_one'].

With dot notation, it's easier for me to remember the correct syntax.

### Reason 4: Dot notation limits the usage of brackets

Brackets can be used for many purposes in pandas:

df[['col_one', 'col_two']]
df.iloc[4, 2]
df.loc['row_label', 'col_one':'col_three']
df.col_one['row_label']
df[(df.col_one > 5) & (df.col_two == 'value')]


If you also use bracket notation for Series selection, you end up with even more brackets in your code:

df['col_one']['row_label']
df[(df['col_one'] > 5) & (df['col_two'] == 'value')]


As you use more brackets, each bracket becomes slightly more ambiguous as to its purpose, imposing a higher mental burden on the person reading the code. By using dot notation for Series selection, you reduce bracket usage to only the essential cases.

## Conclusion

If you prefer bracket notation, then you can use it all of the time! However, you still have to be familiar with dot notation in order to read other people's code.

If you prefer dot notation, then you can use it most of the time, as long as you are diligent about renaming columns when they contains spaces or collide with DataFrame methods. However, you still have to use bracket notation when creating new columns.

Which do you prefer? Let me know in the comments below, or vote in the Twitter poll:

### The State of Transfer Learning in NLP

This post expands on the NAACL 2019 tutorial on Transfer Learning in NLP organized by Matthew Peters, Swabha Swayamdipta, Thomas Wolf, and Sebastian Ruder. This post highlights key insights and takeaways and provides updates based on recent work.

### Taking a year to explain computer things

I’ve been working on explaining computer things I’m learning on this blog for 6 years. I wrote one of my first posts, what does a shell even do? on Sept 30, 2013. Since then, I’ve written 11 zines, 370,000 words on this blog, and given 20 or so talks. So it seems like I like explaining things a lot.

### tl;dr: I’m going to work on explaining computer things for a year

Here’s the exciting news: I left my job a month ago and my plan is to spend the next year working on explaining computer things!

As for why I’m doing this – I was talking through some reasons with my friend Mat last night and he said “well, sometimes there are things you just feel compelled to do”. I think that’s all there is to it :)

### what does “explain computer things” mean?

I’m planning to:

1. write some more zines (maybe I can write 10 zines in a year? we’ll see! I want to tackle both general-interest and slightly more niche topics, we’ll see what happens).
2. work on some more interactive ways to learn things. I learn things best by trying things out and breaking them, so I want to see if I can facilitate that a little bit for other people. I started a project around this in May which has been on the backburner for a bit but which I’m excited about. Hopefully I’ll release it soon and then you can try it out and tell me what you think!

I say “a year” because I think I have at least a year’s worth of ideas and I can’t predict how I’ll feel after doing this for a year.

I started a corporation almost exactly a year ago, and I’m planning to keep running my explaining-things efforts as a business. This business has been making more than I made in my first programming job (that is, definitely enough money to live on!), which has been really surprising and great (thank you!).

• I’m not planning to hire employees or anything, it’ll just be me and some (awesome) freelancers. The biggest change I have in mind is that I’m hoping to find a freelance editor to help me with editing.
• I also don’t have any specific plans for world domination or to work 80-hour weeks. I’m just going to make zines & things that explain computer concepts and sell them on the internet, like I’ve been doing.
• No commissions or consulting work, just building ideas I have

It’s been pretty interesting to learn more about running a small business and so far I like it more than I thought I would. (except for taxes, which I like exactly as much as I thought I would)

### that’s all!

I’m excited to keep making explanations of computer things and to have more time to do it. This blog might change a bit away from “here’s what I’m learning at work these days” and towards “here are attempts at explaining things that I mostly already know”. It’ll be different! We’ll see how it goes!

### Sleep Schedule, From the Inconsistent Teenage Years to Retirement

From the teenage years to college to adulthood through retirement, sleep is all over the place at first but then converges towards consistency. Read More

### Reproducing the kidney cancer example from BDA

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is an attempt at reproducing the analysis of Section 2.7 of Bayesian Data Analysis, 3rd edition (Gelman et al.), on kidney cancer rates in the USA in the 1980s. I have done my best to clean the data from the original. Andrew wrote a blog post to “disillusion [us] about the reproducibility of textbook analysis”, in which he refers to this example. This might then be an attempt at reillusionment…

The cleaner data are on GitHub, as is the RMarkDown of this analysis.

library(usmap)
library(ggplot2)

d = read.csv("KidneyCancerClean.csv", skip=4)


In the data, the columns dc and dc.2 correspond (I think) to the death counts due to kidney cancer in each county of the USA, respectively in 1908-84 and 1985-89. The columns pop and pop.2 are some measure of the population in the counties. It is not clear to me what the other columns represent.

## Simple model

Let $n_j$ be the population on county $j$, and $K_j$ the number of kidney cancer deaths in that county between 1980 and 1989. A simple model is $K_j\sim Poisson(\theta_j n_j)$ where $\theta_j$ is the unknown parameter of interest, representing the incidence of kidney cancer in that county. The maximum likelihood estimator is $\hat\theta_j=\frac{K_j}{n_j}$.

d$dct = d$dc + d$dc.2 d$popm = (d$pop + d$pop.2) / 2
d$thetahat = d$dct / d$popm  In particular, the original question is to understand these two maps, which show the counties in the first and last decile for kidney cancer deaths. q = quantile(d$thetahat, c(.1, .9))
d$cancerlow = d$thetahat <= q[1] d$cancerhigh = d$thetahat >= q[2]
plot_usmap("counties", data=d, values="cancerhigh") +
scale_fill_discrete(h.start = 200,
name = "Large rate of kidney cancer deaths")


plot_usmap("counties", data=d, values="cancerlow") +
scale_fill_discrete(h.start = 200,
name = "Low rate of kidney cancer deaths")


These maps are suprising, because the counties with the highest kidney cancer death rate, and those with the lowest, are somewhat similar: mostly counties in the middle of the map.

(Also, note that the data for Alaska are missing. You can hide Alaska on the maps by adding the parameter include = statepop$full[-2] to calls to plot_usmap.) The reason for this pattern (as explained in BDA3) is that these are counties with a low population. Indeed, a typical value for $\hat\theta_j$ is around $0.0001$. Take a county with a population of 1000. It is likely to have no kidney cancer deaths, giving $\hat\theta_j=0$ and putting it in the first decile. But if it happens to have a single death, the estimated rate jumps to $\hat\theta_j=0.001$ (10 times the average rate), putting it in the last decile. This is hinted at in this histogram of the $(\theta_j)$: ggplot(data=d, aes(d$thetahat)) +
geom_histogram(bins=30, fill="lightblue") +
labs(x="Estimated kidney cancer death rate (maximum likelihood)",
y="Number of counties") +
xlim(c(-1e-5, 5e-4))


## Bayesian approach

If you have ever followed a Bayesian modelling course, you are probably screaming that this calls for a hierarchical model. I agree (and I’m pretty sure the authors of BDA do as well), but here is a more basic Bayesian approach. Take a common $\Gamma(\alpha, \beta)$ distribution for all the $(\theta_j)$; I’ll go for $\alpha=15$ and $\beta = 200\ 000$, which is slightly vaguer than the prior used in BDA. Obviously, you should try various values of the prior parameters to check their influence.

The prior is conjugate, so the posterior is $\theta_j|K_j \sim \Gamma(\alpha + K_j, \beta + n_j)$. For small counties, the posterior will be extremely close to the prior; for larger counties, the likelihood will take over.

It is usually a shame to use only point estimates, but here it will be sufficient: let us compute the posterior mean of $\theta_j$. Because the prior has a strong impact on counties with low population, the histogram looks very different:

alpha = 15
beta = 2e5
d$thetabayes = (alpha + d$dct) / (beta + d$pop) ggplot(data=d, aes(d$thetabayes)) +
geom_histogram(bins=30, fill="lightblue") +
labs(x="Estimated kidney cancer death rate (posterior mean)",
y="Number of counties") +
xlim(c(-1e-5, 5e-4))


And the maps of counties in the first and last decile are now much easier to distinguish; for instance, Florida and New England are heavily represented in the last decile. The counties represented here are mostly populated counties: these are counties for which we have reason to believe that they are on the lower or higher end for kidney cancer death rates.

qb = quantile(d$thetabayes, c(.1, .9)) d$bayeslow = d$thetabayes <= qb[1] d$bayeshigh = d$thetabayes >= qb[2] plot_usmap("counties", data=d, values="bayeslow") + scale_fill_discrete( h.start = 200, name = "Low kidney cancer death rate (Bayesian inference)")  plot_usmap("counties", data=d, values="bayeshigh") + scale_fill_discrete( h.start = 200, name = "High kidney cancer death rate (Bayesian inference)")  An important caveat: I am not an expert on cancer rates (and I expect some of the vocabulary I used is ill-chosen), nor do I claim that the data here are correct (from what I understand, many adjustments need to be made, but they are not detailed in BDA, which explains why the maps are slightly different). I am merely posting this as a reproducible example where the naïve frequentist and Bayesian estimators differ appreciably, because they handle sample size in different ways. I have found this example to be useful in introductory Bayesian courses, as the difference is easy to grasp for students who are new to Bayesian inference. To leave a comment for the author, please follow the link and comment on their blog: R – Robin Ryder's blog. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. Continue Reading… ### If you did not already know semopy Structural equation modelling (SEM) is a multivariate statistical technique for estimating complex relationships between observed and latent variables. Although numerous SEM packages exist, each of them has limitations. Some packages are not free or open-source; the most popular package not having this disadvantage is$\textbf{lavaan}$, but it is written in R language, which is behind current mainstream tendencies that make it harder to be incorporated into developmental pipelines (i.e. bioinformatical ones). Thus we developed the Python package$\textbf{semopy}$to satisfy those criteria. The paper provides detailed examples of package usage and explains it’s inner clockworks. Moreover, we developed the unique generator of SEM models to extensively test SEM packages and demonstrated that$\textbf{semopy}$significantly outperforms$\textbf{lavaan}$in execution time and accuracy. … Causaltoolbox Estimating heterogeneous treatment effects has become extremely important in many fields and often life changing decisions for individuals are based on these estimates, for example choosing a medical treatment for a patient. In the recent years, a variety of techniques for estimating heterogeneous treatment effects, each making subtly different assumptions, have been suggested. Unfortunately, there are no compelling approaches that allow identification of the procedure that has assumptions that hew closest to the process generating the data set under study and researchers often select just one estimator. This approach risks making inferences based on incorrect assumptions and gives the experimenter too much scope for p-hacking. A single estimator will also tend to overlook patterns other estimators would have picked up. We believe that the conclusion of many published papers might change had a different estimator been chosen and we suggest that practitioners should evaluate many estimators and assess their similarity when investigating heterogeneous treatment effects. We demonstrate this by applying 32 different estimation procedures to an emulated observational data set; this analysis shows that different estimation procedures may give starkly different estimates. We also provide an extensible \texttt{R} package which makes it straightforward for practitioners to apply our analysis to their data. … Gradient Episodic Memory One major obstacle towards AI is the poor ability of models to solve new problems quicker, and without forgetting previously acquired knowledge. To better understand this issue, we study the problem of continual learning, where the model observes, once and one by one, examples concerning a sequence of tasks. First, we propose a set of metrics to evaluate models learning over a continuum of data. These metrics characterize models not only by their test accuracy, but also in terms of their ability to transfer knowledge across tasks. Second, we propose a model for continual learning, called Gradient Episodic Memory (GEM) that alleviates forgetting, while allowing beneficial transfer of knowledge to previous tasks. Our experiments on variants of the MNIST and CIFAR-100 datasets demonstrate the strong performance of GEM when compared to the state-of-the-art. … AlignFlow Given unpaired data from multiple domains, a key challenge is to efficiently exploit these data sources for modeling a target domain. Variants of this problem have been studied in many contexts, such as cross-domain translation and domain adaptation. We propose AlignFlow, a generative modeling framework for learning from multiple domains via normalizing flows. The use of normalizing flows in AlignFlow allows for a) flexibility in specifying learning objectives via adversarial training, maximum likelihood estimation, or a hybrid of the two methods; and b) exact inference of the shared latent factors across domains at test time. We derive theoretical results for the conditions under which AlignFlow guarantees marginal consistency for the different learning objectives. Furthermore, we show that AlignFlow guarantees exact cycle consistency in mapping datapoints from one domain to another. Empirically, AlignFlow can be used for data-efficient density estimation given multiple data sources and shows significant improvements over relevant baselines on unsupervised domain adaptation. … Continue Reading… ### Magister Dixit “Portability of code and environment is one of the challenge every data scientist faces. The code can be framework dependent or it can be machine dependent. The end result – A model that works like a charm on one machine might not do so on another.” AVBytes ( 23.04.2018 ) Continue Reading… ### Getting on the meet-up bandwagon – our first meet up event [This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. My company Draper and Dash have tasked me with organising a wider meet-up event for anyone who is interested in AI / ML in healthcare. This wider working group consists of people from different sectors, however they are interested in how we can apply AI / ML methods in their organisations. ## Why did we choose meet-up Meet-up seems like a great way to engage like minded people from across different communities. ## What is our first event Our first event is detailed below: This will focus on how we use supervised ML methods in our BI solutions and this adds to our BI stack. ## Join now So, this community is gathering members – we have over 90 now. If you are interested, and are based in the London, UK area, then please follow our events – as we plan to have a whole lot more in the future – with guest speakers as well. To leave a comment for the author, please follow the link and comment on their blog: R Blogs – Hutsons-hacks. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. Continue Reading… ### Document worth reading: “Deconstructing Blockchains: A Comprehensive Survey on Consensus, Membership and Structure” It is no exaggeration to say that since the introduction of Bitcoin, blockchains have become a disruptive technology that has shaken the world. However, the rising popularity of the paradigm has led to a flurry of proposals addressing variations and/or trying to solve problems stemming from the initial specification. This added considerable complexity to the current blockchain ecosystems, amplified by the absence of detail in many accompanying blockchain whitepapers. Through this paper, we set out to explain blockchains in a simple way, taming that complexity through the deconstruction of the blockchain into three simple, critical components common to all known systems: membership selection, consensus mechanism and structure. We propose an evaluation framework with insight into system models, desired properties and analysis criteria, using the decoupled components as criteria. We use this framework to provide clear and intuitive overviews of the design principles behind the analyzed systems and the properties achieved. We hope our effort will help clarifying the current state of blockchain proposals and provide directions to the analysis of future proposals. Deconstructing Blockchains: A Comprehensive Survey on Consensus, Membership and Structure Continue Reading… ### Regex Problem? Here’s an R package that will write Regex for you [This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. REGEX is that thing that scares everyone almost all the time. Hence, finding some alternative is always very helpful and peaceful too. Here’s a nice R package thst helps us do REGEX without knowing REGEX. ### REGEX This is the REGEX pattern to test the validity of a URL: ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$

A typical regular expression contains — Characters ( http ) and Meta Characters ([]). The combination of these two form a meaningful regular expression for a particular task.
So, What’s the problem?

Remembering the way in which characters and meta-characters are combined to create a meaningful regex is itself a tedious task which sometimes becomes a bigger task than the actual problem of NLP which is the larger goal.

### Solution at Hand

Some good soul on this planet has created an open-source Javascript library JSVerbalExpressions to make Regex creation easy. Then some other good soul (Tyler Littlefield) ported the javascript library to R— RVerbalExpressions. This is the beauty of the open source world.

### Installation

is available on RVerbalExpressions Github so you can use devtools or remotes to install it from Github.

# install.packages("devtools")
devtools::install_github("VerbalExpressions/RVerbalExpressions")


### Pseudo-Problem

Let’s create a pseudo-problem that we’d like to solve with regex through which we can understand this package to programmatically create regex.

A simpler one perhaps, We’ve got multiple text like and we’d like to extract the names from it. Here’s our input and output look like:


strings = c('123Problem233','233Solution434','223Tim Apple444')

Problem, Solution, Time Apple


Once we solve this, we’ll move forward with slightly complicated problems.

### Pseudo-Code

Before we code, it’s always good to write-out a pseudo-code on a napkin or even a paper if you’ve got. That is, We want to extract names (which is composition of alphabets) except numbers (which is digits). We build a regex for one-line and then we iterate it for all the elements in our vector.

Like any other R package, we can load RVerbalExpressions with library() function.

library(RVerbalExpressions)


## Constructing the Expression

### Extract Strings

Like many other modern-day R packages, RVerbalExpressions support %>% pipe operator for better simplicity and readability of the code. But for this problem of extracting strings that are present between the numbers, we can simply use one function that is rx_alpha() to say that we need alphabets from the given string.

expr =  rx_alpha()

stringr::str_extract_all(strings,expr)

[[1]]
[1] "P" "r" "o" "b" "l" "e" "m"

[[2]]
[1] "S" "o" "l" "u" "t" "i" "o" "n"

[[3]]
[1] "T" "i" "m" "A" "p" "p" "l" "e"*


### Extract Numbers

Similar to the text that we extracted, Extracting Numbers again is very English as we’ve to use the function rx_digit() to say that we need numbers from the given text.

expr =  rx_digit()

stringr::str_extract_all(strings,expr)
[[1]]
[1] "1" "2" "3" "2" "3" "3"

[[2]]
[1] "2" "3" "3" "4" "3" "4"

[[3]]
[1] "2" "2" "3" "4" "4" "4"


### Another Constructor to extract the name as a word

Here, we can use the function rx_word() to match it as word (rather than letters).

expr =  rx_alpha()  %>%  rx_word() %>% rx_alpha()

stringr::str_extract_all(strings,expr)

[[1]]
[1] "Problem"

[[2]]
[1] "Solution"

[[3]]
[1] "Tim" "Apple"


### Expression

What if we want to use the expression somewhere else or simply we need the regex expression. It’s simple because the expression is what we’ve constructed and printing what we constructed would reveal the relevant regex pattern.

expr

"[A-z]\\w+[A-z]"


### Summary

Thus, we managed to build a regex pattern without knowing regex. Simply put, we programmatically generated a regex pattern using R (that doesn’t require the high-level knowledge of regex patterns) and accomplished a tiny task that we took up to demonstrate the potential. For more of Regex, Check out this course. The entire code is available here.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Laguerre-Samuelson Inequality

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Chebychev’s Theorem gives bounds on how spread out a probability distribution can be from the mean, in terms of the standard deviation. More precisely, if $$X$$ is a random variable with mean $$\mu$$ and standard deviation $$\sigma$$, then $P(|X – \mu| \ge k \sigma) \le \frac {1}{k^2}.$

When I was an undergraduate at Texas, I noticed that this has the following implication. Suppose $$x_1,\ldots,x_n$$ is any finite set of numbers. Create a random variable $$X$$ so that $$P(X = x_i) = 1/n$$ for all $$1\le i \le n$$. Then, the mean of $$X$$ is $$\mu = \overline{x}$$, and the standard deviation of $$X$$ is $$s_X = s \sqrt{\frac{n-1}{n}}$$, where $$s$$ is the sample distribution. By Chebychev’s Theorem with $$k = n$$, $P(|X – \overline{x}| \ge \sqrt{n} s_X) \le \frac{1}{n}$ Now, since the probability of each $$x_i$$ is $$\frac 1n$$, this tells us that all of the points $$x_i$$ are within $$\sqrt{n} s_X$$ if the mean of the $$x_i$$.

I totally didn’t believe that was true when I was 20 years old, so I spent (way too much) time computing examples. What I noticed was that not only did it seem to be true, but I couldn’t even find examples where $$\sqrt{n} s_X$$ was even necessary.

Fast forward 30 years later, and I decided to revisit this problem. Let’s consider the case when $$n = 4$$ and do some simulations. I will take a bunch of random samples of size 4, compute $$\mu$$ and $$s_X$$ and the maximum distance from $$\mu$$ to any of the $$x_i$$. Then, we’ll compare it to $$s_X$$ to see how far the farthest point away from the mean is.

set.seed(9132019)
vec <- 1:4
probs <- rep(1/4, 4)
mu <- mean(vec)
sd <- sqrt(sum(probs * (vec - mu)^2))
best <- max(abs(vec - mu))/sd
best_vec <- vec
for(i in 1:100000) {
vec <- rnorm(4)
mu <- mean(vec)
sd <- sd(vec) * sqrt(3/4)
if(max(abs(vec - mu))/sd > best) {
best <- max(abs(vec - mu))/sd
best_vec <- vec
}
}
best_vec
## [1] -0.1465015 -0.1494259  1.5503518 -0.1461181
best
## [1] 1.732048

We see a couple of things. First, we see that the extreme case is when all of the values are equal except for one which is different, and second, that all of the data is within 1.732 standard deviations of the mean, when Chebychev would’ve predicted that all of the data is within 2 standard deviations of the mean. What gives? Laguerre-Samuelson, that’s what!

Theorem Let $$x_1,\ldots,x_n$$ be any $$n$$ real numbers with mean $$\mu$$ and standard deviation $$s$$ (with $$n$$ in the downstairs). Then, all $$n$$ numbers lie within the interval $$[\mu – \sqrt{n-1}s, \mu + \sqrt{n-1}s]$$.

As hinted at above, this theorem can be considered a sharp version of Chebychev in the case that the rv $$X$$ consists of finitely many equally likely values.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Fitting ‘complex’ mixed models with ‘nlme’: Example #2

[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# A repeated split-plot experiment with heteroscedastic errors

Let’s imagine a field experiment, where different genotypes of khorasan wheat are to be compared under different nitrogen (N) fertilisation systems. Genotypes require bigger plots, with respect to fertilisation treatments and, therefore, the most convenient choice would be to lay-out the experiment as a split-plot, in a randomised complete block design. Genotypes would be randomly allocated to main plots, while fertilisation systems would be randomly allocated to sub-plots. As usual in agricultural research, the experiment should be repeated in different years, in order to explore the environmental variability of results.

What could we expect from such an experiment?

Please, look at the dataset ‘kamut.csv’, which is available on github. It provides the results for a split-plot experiment with 15 genotypes and 2 N fertilisation treatments, laid-out in three blocks and repeated in four years (360 observations, in all).

The dataset has five columns, the ‘Year’, the ‘Genotype’, the fertilisation level (‘N’), the ‘Block’ and the response variable, i.e. ‘Yield’. The fifteen genotypes are coded by using the letters from A to O, while the levels of the other independent variables are coded by using numbers. The following snippets loads the file and recodes the numerical independent variables into factors.

rm(list=ls())
dataset$Block <- factor(dataset$Block)
dataset$Year <- factor(dataset$Year)
dataset$N <- factor(dataset$N)
##   Year Genotype N Block Yield
## 1 2004        A 1     1 2.235
## 2 2004        A 1     2 2.605
## 3 2004        A 1     3 2.323
## 4 2004        A 2     1 3.766
## 5 2004        A 2     2 4.094
## 6 2004        A 2     3 3.902

Additionally, it may be useful to code some ‘helper’ factors, to represent the blocks (within years) and the main-plots. The first factors (‘YearBlock’) has 12 levels (4 years and 3 blocks per year) and the second factor (‘MainPlot’) has 180 levels (4 years, 3 blocks per year and 15 genotypes per block).

dataset$YearBlock <- with(dataset, factor(Year:Block)) dataset$MainPlot <- with(dataset, factor(Year:Block:Genotype))

For the analyses, we will make use of the ‘plyr’ (Wickham, 2011), ‘car’ (Fox and Weisberg, 2011) and ‘nlme’ (Pinheiro et al., 2018) packages, which we load now.

library(plyr)
library(car)
library(nlme)

It is always useful to start by separately considering the results for each year. This gives us a feel for what happened in all experiments. What model do we have to fit to single-year split-plot data? In order to avoid mathematical notation, I will follow the notation proposed by Piepho (2003), by using the names of variables, as reported in the dataset. The treatment model for this split-plot design is:

Yield ~ Genotype * N

All treatment effects are fixed. The block model, referencing all grouping structures, is:

Yield ~ Block + Block:MainPlot + Block:MainPlot:Subplot

The first element references the blocks, while the second element references the main-plots, to which the genotypes are randomly allocated (randomisation unit). The third element references the sub-plots, to which N treatments are randomly allocated (another randomisation unit); this latter element corresponds to the residual error and, therefore, it is fitted by default and needs not be explicitly included in the model. Main-plot and sub-plot effects need to be random, as they reference randomisation units (Piepho, 2003). The nature of the block effect is still under debate (Dixon, 2016), but I’ll take it as random (do not worry: I will also show how we can take it as fixed).

Coding a split-plot model in ‘lme’ is rather simple:

lme(Yield ~ Genotype * N, random = ~1|Block/MainPlot

where the notation ‘Block/MainPlot’ is totally equivalent to ‘Block + Block:MainPlot’. Instead of manually fitting this model four times (one per year), we can ask R to do so by using the ‘ddply()’ function in the ‘plyr’ package. In the code below, I used this technique to retrieve the residual variance for each experiment.

lmmFits <- ddply(dataset, c("Year"),
function(df) summary( lme(Yield ~ Genotype * N,
random = ~1|Block/MainPlot,
data = df))\$sigma^2 )
lmmFits
##   Year          V1
## 1 2004 0.052761644
## 2 2005 0.001423833
## 3 2006 0.776028791
## 4 2007 0.817594477

We see great differences! The residual variance in 2005 is more that 500 times smaller than that observed in 2007. Clearly, if we pool the data and make an ANOVA, when we pool the data, we violate the homoscedasticity assumption. In general, this problem has an obvious solution: we can model the variance-covariance matrix of observations, allowing a different variance per year. In R, this is only possible by using the ‘lme()’ function (unless we want to use the ‘asreml-R’ package, which is not freeware, unfortunately). The question is: how do we code such a model?

First of all, let’s derive a correct mixed model. The treatment model is:

Yield ~ Genotype * N

We have mentioned that the genotype and N effects are likely to be taken as fixed. The block model is:

 ~ Year + Year/Block + Year:Block:MainPlot + Year:Block:MainPlot:Subplot

The second element in the block model references the blocks within years, the second element references the main-plots, while the third element references the sub-plots and, as before, it is not needed. The year effect is likely to interact with both the treatment effects, so we need to add the following effects:

 ~ Year + Year:Genotype + Year:N + Year:Genotype:N

which is equivalent to writing:

 ~ Year*Genotype*N

The year effect can be taken as either as random or as fixed. In this post, we will show both approaches

# Year effect is fixed

If we take the year effect as fixed and the block effect as random, we see that the random effects are nested (blocks within years and main-plots within blocks and within years). The function ‘lme()’ is specifically tailored to deal with nested random effects and, therefore, fitting the above model is rather easy. In the first snippet we fit a homoscedastic model:

modMix1 <- lme(Yield ~ Year * Genotype * N,
random = ~1|YearBlock/MainPlot,
data = dataset)

We could also fit this model with the ‘lme4’ package and the ‘lmer()’; however, we are not happy with this, because we have seen clear signs of heteroscedastic within-year errors. Thus, let’s account for such an heteroscedasticity, by using the ‘weights()’ argument and the ‘varIdent()’ variance structure:

modMix2 <- lme(Yield ~ Year * Genotype * N,
random = ~1|YearBlock/MainPlot,
data = dataset,
weights = varIdent(form = ~1|Year))
AIC(modMix1, modMix2)
##          df      AIC
## modMix1 123 856.6704
## modMix2 126 575.1967

Based on the Akaike Information Criterion, we see that the second model is better than the first one, which supports the idea of heteroscedastic residuals. From this moment on, the analyses proceeds as usual, e.g. by testing for fixed effects and comparing means, as necessary. Just a few words about testing for fixed effects: Wald F tests can be obtained by using the ‘anova()’ function, although I usually avoid this with ‘lme’ objects, as there is no reliable approximation to degrees of freedom. With ‘lme’ objects, I suggest using the ‘Anova()’ function in the ‘car’ package, which shows the results of Wald chi square tests.

Anova(modMix2)
## Analysis of Deviance Table (Type II tests)
##
## Response: Yield
##                    Chisq Df Pr(>Chisq)
## Year              51.072  3  4.722e-11 ***
## Genotype         543.499 14  < 2.2e-16 ***
## N               2289.523  1  < 2.2e-16 ***
## Year:Genotype    123.847 42  5.281e-10 ***
## Year:N            21.695  3  7.549e-05 ***
## Genotype:N      1356.179 14  < 2.2e-16 ***
## Year:Genotype:N  224.477 42  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One further aspect: do you prefer fixed blocks? Then you can fit the following model.

modMix4 <- lme(Yield ~ Year * Genotype * N + Year:Block,
random = ~1|MainPlot,
data = dataset,
weights = varIdent(form = ~1|Year))

# Year effect is random

If we’d rather take the year effect as random, all the interactions therein are random as well (Year:Genotype, Year:N and Year:Genotype:N). Similarly, the block (within years) effect needs to be random. Therefore, we have several crossed random effects, which are not straightforward to code with ‘lme()’. First, I will show the code, second, I will comment it.

modMix5 <- lme(Yield ~ Genotype * N,
random = list(Year = pdIdent(~1),
Year = pdIdent(~Block - 1),
Year = pdIdent(~MainPlot - 1),
Year = pdIdent(~Genotype - 1),
Year = pdIdent(~N - 1),
Genotype = pdIdent(~N - 1)),
data=dataset,
weights = varIdent(form = ~1|Year))

We see that random effects are coded using a named list; each component of this list is a pdMat object with name equal to a grouping factor. For example, the component ‘Year = pdIdent(~ 1)’ represents a random year effect, while ‘Year = pdIdent(~ Block – 1)’ represents a random year effect for each level of Block, i.e. a random ‘year x block’ interaction. This latter variance component is the same for all blocks (‘varIdent’), i.e. there is homoscedastic at this level.

It is important to remember that the grouping factors in the list are treated as nested; however, the grouping factor is only one (‘Year’), so that the nesting is irrelevant. The only exception is the genotype, which is regarded as nested within the year. As the consequence, the component ‘Genotype = pdIdent(~N – 1)’, specifies a random year:genotype effect for each level of N treatment, i.e. a random year:genotype:N interaction.

I agree, this is not straightforward to understand! If necessary, take a look at the good book of Gałecki and Burzykowski (2013). When fitting the above model, be patient; convergence may take a few seconds. I’d only like to reinforce the idea that, in case you need to test for fixed effects, you should not rely on the ‘anova()’ function, but you should prefer Wald chi square tests in the ‘Anova()’ function in the ‘car’ package.

Anova(modMix5, type = 2)
## Analysis of Deviance Table (Type II tests)
##
## Response: Yield
##              Chisq Df Pr(>Chisq)
## Genotype   68.6430 14  3.395e-09 ***
## N           2.4682  1     0.1162
## Genotype:N 14.1153 14     0.4412
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Another note: coding random effects as a named list is always possible. For example ‘modMix2’ can also be coded as:

modMix2b <- lme(Yield ~ Year * Genotype * N,
random = list(YearBlock = ~ 1, MainPlot = ~ 1),
data = dataset,
weights = varIdent(form = ~1|Year))

Or, also as:

modMix2c <- lme(Yield ~ Year * Genotype * N,
random = list(YearBlock = pdIdent(~ 1), MainPlot = pdIdent(~ 1)),
data = dataset,
weights = varIdent(form = ~1|Year))

Hope this is useful! Have fun with it.

Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)

# References

1. Fox J. and Weisberg S. (2011). An {R} Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion
2. Gałecki, A., Burzykowski, T., 2013. Linear mixed-effects models using R: a step-by-step approach. Springer, Berlin.
3. Piepho, H.-P., Büchse, A., Emrich, K., 2003. A Hitchhiker’s Guide to Mixed Models for Randomized Experiments. Journal of Agronomy and Crop Science 189, 310–322.
4. Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team (2018). nlme: Linear and Nonlinear Mixed Effects Models_. R package version 3.1-137, https://CRAN.R-project.org/package=nlme>.
5. Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. URL: http://www.jstatsoft.org/v40/i01/.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### In Britain, command of a foreign language is still à la mode

But a decline in languages in schools and limits on migration stifle supply

## September 12, 2019

### Social Network Visualization with R

[This article was first published on R programming – Journey of Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this month’s we are going to look at data analysis and visualization of social networks using R programming.

## Friendster Networks Mapping

Friendster was a yesteryear social media network, something akin to Facebook. I’ve never used it but it is one of those easily available datasets where you have a list of users and all their connections. So it is easy to create a viz and look at whose networks are strong and whose are weak, or even the bridge between multiple networks.

The dataset and code files are added on the Projects Page here , under “social network viz”.

For this analysis, we will be using the following library packages:

• visNetwork
• geomnet
• igraph

## Steps:

1. Load the datafiles. The list of users is given in the file named “nodes” as each user is a node in the graph. The connection list is given in the file named “edges” as a 1-to-1 mapping. So if user Miranda has 10 friends, there would be 10 records for Miranda in the “edges” file, one for each friend. The friendster datafile has been anonymized, so there are numbers (id) rather than names.
2. Convert the dataframes into a very specific format. We do some prepwork so that we can directly use the graph visualization functions.
3. Create a graph object. This will also help to create clusters. Since the dataset is anonymized it might seem irrelevant, but imagine this in your own social network. You might have one cluster of friends who are from your school, another bunch from your office, one set who are cousins and family members and some random folks. Creating a graph object allows us to look at where those clusters lie automatically.
4. Visualize using functions specific to graph objects. The first function is visNetwork() which generates an interactive color coded cluster graph. When you click on any of th nodes (colored circles), it will highlight all the connections radiating from the node. (In the image below, I have highlighted node for user 17.
5. You can also use the same function with a bunch of different parameters, as shown below:
visNetwork(nodeset, links2, width = "100%") %>%
visIgraphLayout() %>%
visNodes(
shape = "dot",
color = list(
background = "#0085AF",
border = "#013848",
highlight = "#FF8000"
),
shadow = list(enabled = TRUE, size = 10)
) %>%
visEdges(
color = list(color = "#0085AF", highlight = "#C62F4B")
) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T)) %>%
visLayout(randomSeed = 11)

In the image below you can see the 3 colored clusters and the central (light blue) node. The connections in blue are the ones that do not have a lot of direct connections. The yellow and red clusters are tigher, indicating they have internal connections with each other. (similar to a bunch of classmates who all know each other)

That’s it. Again the code is available on the Projects Page.

## Code Extensions

Feel free to play around with the code. One extensions of this idea would be to download Facebook or LinkedIn data (premium account needed) and create similar visualizations.

Or if you have a list of airports and routes, you could create something like this as a flight network map, to know the minimum number of hops between 2 destinations and alternative routes.

You could also do a counter to see which nodes have the most number of friends and increase the size of the circle. This would make it easier to view which nodes are the most well-connected.

Of course, do not be over-mesmerized by the data. In real-life, the strength of the relationship also matters. This is hard to quantify or collect, even though its easy to depict once you have the data in hand/ For example, I have a 1000 connections who I’ve met at conferences or random events. If I needed a job, most may not really be useful. But my friend Sarah has only 300 but super-loyal friends who literally found her a job in 2 days when she had to move back to her hometown to take care of a sick parent.

With that thought, do take a look at the code and have fun coding!

The post Social Network Visualization with R appeared first on Journey of Analytics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Distilled News

Let’s explore the challenges involved in building a backend system to store and retrieve high-dimensional data vectors, typical to modern systems that use ‘artificial intelligence’ – image recognition, text comprehension, document search, music recommendations, …
A reproducible R / Python approach to getting up and running quickly on GCloud with GPUs in Tensorflow.
One of the critical problems of ‘identification’, be it NLP – speech/text or solving an image puzzle from pieces like a jigsaw, is to understand the words, or pieces of data and the context. The words or pieces individually don’t give any meaning and tying them together gives an idea about the context. Now the data itself has some patterns, which is broadly classified as sequential or time-series data and non-time series data, which is largely non-sequential or arbitrary. Sentiment analysis of text reports, documents and journals, novels & classics follow time series pattern, in the sense, the words itself follow a precedence as governed by the grammar and the language dictionary. So are the stock-price prediction problems which has a precedent of the previous time period predictions and socio-economic conditions.
Calendar heatmaps are a neglected, but valuable, way of representing time series data. Their chief advantage is in allowing the viewer to visually process trends in categorical or continuous data over a period of time, while relating these values to their month, week, and weekday context – something that simple line plots do not efficiently allow for. If you are displaying data on staffing levels, stock returns (as we will do here), on-time performance for transit systems, or any other one dimensional data, a calendar heatmap can do wonders for helping your stakeholders note patterns in the interaction between those variables and their calendar context. In this post, I will use stock data in the form of daily closing prices for the SPY – SPDR S&P 500 ETF, the most popular exchange traded fund in the world. ETF’s are growing in popularity, so much so that there’s even a podcast devoted entirely to them. For the purposes of this blog post, it’s not necessary to have any familiarity with ETF’s or stocks in general. Some knowledge of tidyverse packages and basic R will be helpful, though.
As a Scientist, it’s incredibly satisfying to be given the freedom to experiment by applying new research and rapidly prototyping. This satisfaction can be sustained quite well in a lab environment but can diminish quickly in a corporate environment. This is because of the underlying commercial value motive which science is driven by in a business setting – if it doesn’t add business value to employees or customers, there’s no place for it! Business value, however, goes beyond just being a nifty experiment which shows potential value to employees or customers. In the context of Machine Learning models, the only [business] valuable models, are models in Production! In this blog post, I will take you through the journey which my team and I went through in taking Machine Learning models to Production and some important lessons learnt along the way.
Adversarial examples are a large obstacle for a variety of machine learning systems to overcome. Their existence shows the tendency of models to rely on unreliable features to maximize performance, which if perturbed, can cause misclassifications with potentially catastrophic consequences. The informal definition of an adversarial example is an input that has been modified in a way that is imperceptible to humans, but is misclassified by a machine learning system whereas the original input was correctly classified.
My boring days of deploying Machine Learning and how I cope.
The motivation for an updated analysis: The first publication of Parsing text for emotion terms: analysis & visualization Using R published in May 2017 used the function get_sentiments(‘nrc’) that was made available in the tidytext package. Very recently, the nrc lexicon was dropped from the tidytext package and hence the R codes in the original publication failed to run. The NRC emotion terms are also available in the lexicon package.
In the previous four posts I have used multiple linear regression, decision trees, random forest, gradient boosting, and support vector machine to predict MPG for 2019 vehicles. It was determined that svm produced the best model. In this post I am going to use the neuralnet package to fit a neural network to the cars_19 dataset.
This overview covers the basics of Kubernetes: what it is and what you need to keep in mind before applying it within your organization. The information in this piece is curated from material available on the O’Reilly online learning platform and from interviews with Kubernetes experts.
Data is the new oil.’, ‘Our company needs to become more efficient.’, ‘Can we optimize this process?’, ‘Our processes are too complicated.’ – sentences you have heard very often and maybe cannot hear anymore. It is understandable but there are some actual real world benefits that stem from the technologies and discussions behind the super trend of (Big) Data. One of the emerging technologies in this field is in more ways than one directly linked to the sentences above. It is process mining. Maybe you have heard of it. Maybe you have not. Harvard Business Review thinks ‘[…] you should be exploring process mining’.
Hands-on transfer learning using a pretrained transformer in PyTorch. This is Part 3 of a series on fine-grained sentiment analysis in Python. Parts 1 and 2 covered the analysis and explanation of six different classification methods on the Stanford Sentiment Treebank fine-grained (SST-5) dataset. In this post, we’ll look at how to improve on past results by building a transformer-based model and applying transfer learning, a powerful method that has been dominating NLP task leaderboards lately.
Enable data scientists to make scaling and production-ready ML products.
Using standardized datasets is great for benchmarking new models/pipelines or for competitions. But for me at least a lot of fun of data science comes when you get to apply things to a project of your own choosing. One of the key parts of this process is building a dataset. So there are a lot of ways to build image datasets. For certain things I have legitimately just taken screenshots like when I was sick and built a facial recognition dataset using season 4 of the Flash and annotated it with labelimg. Another route I have taken is downloading a bunch of images by hand and just display images and label them in an excel spreadsheet… For certain projects you might just have to take a bunch of pictures with your phone as was the case when I made my dice counter. These days I have figured out a few more tricks which make this processes a bit easier and am working on improving things along the way.
The new open source framework that brings multi-task learning to conversational agents. Neural conversation systems and disciplines such as natural language processing(NLP) have seen significant advancements over the last few years. However, most of the current NLP stacks are designed for simple dialogs based on one or two sentences. Structuring more sophisticated conversations that factor in aspects such as personalities or context remains an open challenge. Recently, Microsoft Research unveiled IceCAPS, an open source framework for advanced conversation modeling.

### What’s going on on PyPI

Scanning all new published packages on PyPI I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, being followed or inspire you in some way.

scaaml
Side Channel Attack Assisted with Machine Learning

sklite
Use Scikit Learn models in Flutter. Easily transpile scikit-learn models to native Dart code aimed at Flutter. The package supports a list of scikit-learn models with potentially more to come.

slalom-dataops
Slalom GGP libary for DataOps automation

tiggatts-watson-gsingh
Watson TTS Implementation. This Module is designed to convert Text to Speech format. It will generate Wav file for any Text-String passed to the module.

torchwood

type-classifier
ML classifier package

vivid
Enterprise Machine-Learning and Predictive Analytics. Vivid Code is a pioneering software framework for next generation data analysis applications, that interconnects collaborative data science with automated machine learning. Based on the **Cloud-Assisted Meta programming** (CAMP) paradigm, the framework allows the usage of Currently Best Fitting (CBF) algorithms. Before code interpretation / compilation the concrete algorithms, that implement the CBF specifications, are automatically chosen from local and public catalog servers, that host and deploy the concrete algorithms. Thereby the specification is constituted by a unique algorithm category, a data domain and a metric, which substantiates the meaning of *Best Fitting* within the respective algorithm- and data context. An example is the average prediction accuracy within a fixed set of gold standard samples of the data domain (e.g. latin handwriting samples, spoken word samples, TCGA gene expression data, etc.).

alyeska
Alyeska /al-ee-EHS-kah/ n. A Data Pipeline Toolkit

clause

compsyn
Python package to explore the color of language. compsyn is a package which provides a novel methodlogy to explore relationships between words and abstract concepts through color. The work rose through a collaboration between the contributors at the Santa Fe Institute’s Complex System Summer School 2019.

dezero
deep learning framework from zero

kubeflow-fairing
Kubeflow Fairing Python SDK. Python SDK for Kubeflow Fairing components.

md2ipynb
Markdown to Jupyter Notebook converter.

ml-pipelines
Python client for ML Pipelines. Python client for the BitGN Machine Learning Pipelines project.

morris-counter
Memory-efficient probabilistic counter namely Morris Counter

### Whats new on arXiv

Structure learning methods for covariance and concentration graphs are often validated on synthetic models, usually obtained by randomly generating: (i) an undirected graph, and (ii) a compatible symmetric positive definite (SPD) matrix. In order to ensure positive definiteness in (ii), a dominant diagonal is usually imposed. In this work we investigate different methods to generate random symmetric positive definite matrices with undirected graphical constraints. We show that if the graph is chordal is possible to sample uniformly from the set of correlation matrices compatible with the graph, while for general undirected graphs we rely on a partial orthogonalization method.
Experimental data bases are typically very large and high dimensional. To learn from them requires to recognize important features (a pattern), often present at scales different to that of the recorded data. Following the experience collected in statistical mechanics and thermodynamics, the process of recognizing the pattern (the learning process) can be seen as a dissipative time evolution driven by entropy. This is the way thermodynamics enters machine learning. Learning to handle free surface liquids serves as an illustration.
Integrating information from heterogeneous data sources is one of the fundamental problems facing any enterprise. Recently, it has been shown that deep learning based techniques such as embeddings are a promising approach for data integration problems. Prior efforts directly use pre-trained embeddings or simplistically adapt techniques from natural language processing to obtain relational embeddings. In this work, we propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational data. We make three major contributions. First, we describe a compact graph-based representation that allows the specification of a rich set of relationships inherent in relational world. Second, we propose how to derive sentences from such graph that effectively describe the similarity across elements (tokens, attributes, rows) across the two datasets. The embeddings are learned based on such sentences. Finally, we propose a diverse collection of criteria to evaluate relational embeddings and perform extensive set of experiments validating them. Our experiments show that our system, EmbDI, produces meaningful results for data integration tasks and our embeddings improve the result quality for existing state of the art methods.
Sufficient conditions for the existence of efficient algorithms are established by introducing the concept of contractility for continuous optimization. Then all the possible continuous problems are divided into three categories: contractile in logarithmic time, contractile in polynomial time, or noncontractile. For the first two, we propose an efficient contracting algorithm to find the set of all global minimizers with a theoretical guarantee of linear convergence; for the last one, we discuss possible troubles caused by using the proposed algorithm.
Allen’s Interval Algebra constitutes a framework for reasoning about temporal information in a qualitative manner. In particular, it uses intervals, i.e., pairs of endpoints, on the timeline to represent entities corresponding to actions, events, or tasks, and binary relations such as precedes and overlaps to encode the possible configurations between those entities. Allen’s calculus has found its way in many academic and industrial applications that involve, most commonly, planning and scheduling, temporal databases, and healthcare. In this paper, we present a novel encoding of Interval Algebra using answer-set programming (ASP) extended by difference constraints, i.e., the fragment abbreviated as ASP(DL), and demonstrate its performance via a preliminary experimental evaluation. Although our ASP encoding is presented in the case of Allen’s calculus for the sake of clarity, we suggest that analogous encodings can be devised for other point-based calculi, too.
Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However, most studies consider credit card transactions as isolated events and not as a sequence of transactions. In this framework, we model a sequence of credit card transactions from three different perspectives, namely (i) The sequence contains or doesn’t contain a fraud (ii) The sequence is obtained by fixing the card-holder or the payment terminal (iii) It is a sequence of spent amount or of elapsed time between the current and previous transactions. Combinations of the three binary perspectives give eight sets of sequences from the (training) set of transactions. Each one of these sequences is modelled with a Hidden Markov Model (HMM). Each HMM associates a likelihood to a transaction given its sequence of previous transactions. These likelihoods are used as additional features in a Random Forest classifier for fraud detection. Our multiple perspectives HMM-based approach offers automated feature engineering to model temporal correlations so as to improve the effectiveness of the classification task and allows for an increase in the detection of fraudulent transactions when combined with the state of the art expert based feature engineering strategy for credit card fraud detection. In extension to previous works, we show that this approach goes beyond ecommerce transactions and provides a robust feature engineering over different datasets, hyperparameters and classifiers. Moreover, we compare strategies to deal with structural missing values.
We propose LaserTagger – a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model, which combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on English text on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LaserTagger achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines with a large number of training examples, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
Several fundamental tasks in data science rely on computing an extremal eigenspace of size $r \ll n$, where $n$ is the underlying problem dimension. For example, spectral clustering and PCA both require the computation of the leading $r$-dimensional subspace. Often, this process is repeated over time due to the possible temporal nature of the data; e.g., graphs representing relations in a social network may change over time, and feature vectors may be added, removed or updated in a dataset. Therefore, it is important to efficiently carry out the computations involved to keep up with frequent changes in the underlying data and also to dynamically determine a reasonable size for the subspace of interest. We present a complete computational pipeline for efficiently updating spectral embeddings in a variety of contexts. Our basic approach is to ‘seed’ iterative methods for eigenproblems with the most recent subspace estimate to significantly reduce the computations involved, in contrast with a na\’ive approach which recomputes the subspace of interest from scratch at every step. In this setting, we provide various bounds on the number of iterations common eigensolvers need to perform in order to update the extremal eigenspace to a sufficient tolerance. We also incorporate a criterion for determining the size of the subspace based on successive eigenvalue ratios. We demonstrate the merits of our approach on the tasks of spectral clustering of temporally evolving graphs and PCA of an incrementally updated data matrix.
Online Analytical Processing (OLAP) comprises tools and algorithms that allow querying multidimensional databases. It is based on the multidimensional model, where data can be seen as a cube such that each cell contains one or more measures that can be aggregated along dimensions. In a Big Data scenario, traditional data warehousing and OLAP operations are clearly not sufficient to address current data analysis requirements, for example, social network analysis. Furthermore, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation. Nevertheless, there is not much work on the problem of taking OLAP analysis to the graph data model. This paper proposes a formal multidimensional model for graph analysis, that considers the basic graph data, and also background information in the form of dimension hierarchies. The graphs in this model are node- and edge-labelled directed multi-hypergraphs, called graphoids, which can be defined at several different levels of granularity using the dimensions associated with them. Operations analogous to the ones used in typical OLAP over cubes are defined over graphoids. The paper presents a formal definition of the graphoid model for OLAP, proves that the typical OLAP operations on cubes can be expressed over the graphoid model, and shows that the classic data cube model is a particular case of the graphoid data model. Finally, a case study supports the claim that, for many kinds of OLAP-like analysis on graphs, the graphoid model works better than the typical relational OLAP alternative, and for the classic OLAP queries, it remains competitive.
A Natural Language Interface (NLI) facilitates users to pose queries to retrieve information from a database without using any artificial language such as the Structured Query Language (SQL). Several applications in various domains including healthcare, customer support and search engines, require elaborating structured data having information on text. Moreover, many issues have been explored including configuration complexity, processing of intensive algorithms, and popularity of relational databases, due to which translating natural language to database query has become a secondary area of investigation. The emerging trend of querying systems and speech-enabled interfaces revived natural language to database queries research area., The last survey published on this topic was six years ago in 2013. To best of our knowledge, there is no recent study found which discusses the current state of the art translations frameworks for natural language for structured and non-structured query languages. In this paper, we have reviewed 47 frameworks from 2008 to 2018. Out of 47, 35 were closely relevant to our work. SQL based frameworks have been categorized as statistical, symbolic and connectionist approaches. Whereas, NoSQL based frameworks have been categorized as semantic matching and pattern matching. These frameworks are then reviewed based on their supporting language, scheme of their heuristic rule, interoperability support, dataset scope and their overall performance score. The findings stated that 70% of the work in natural language to database querying has been carried out for SQL, and NoSQL share 15%, 10% and 5% of languages like SPAROL, CYPHER and GREMLIN respectively. It has also been observed that most of the frameworks support English language only.
This study proposes a Neural Attentive Bag-of-Entities model, which is a neural network model that performs text classification using entities in a knowledge base. Entities provide unambiguous and relevant semantic signals that are beneficial for capturing semantics in texts. We combine simple high-recall entity detection based on a dictionary, to detect entities in a document, with a novel neural attention mechanism that enables the model to focus on a small number of unambiguous and relevant entities. We tested the effectiveness of our model using two standard text classification datasets (i.e., the 20 Newsgroups and R8 datasets) and a popular factoid question answering dataset based on a trivia quiz game. As a result, our model achieved state-of-the-art results on all datasets. The source code of the proposed model will be available online at https://…/wikipedia2vec.
The uncertainties in future Bitcoin price make it difficult to accurately predict the price of Bitcoin. Accurately predicting the price for Bitcoin is therefore important for decision-making process of investors and market players in the cryptocurrency market. Using historical data from 01/01/2012 to 16/08/2019, machine learning techniques (Generalized linear model via penalized maximum likelihood, random forest, support vector regression with linear kernel, and stacking ensemble) were used to forecast the price of Bitcoin. The prediction models employed key and high dimensional technical indicators as the predictors. The performance of these techniques were evaluated using mean absolute percentage error (MAPE), root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R-squared). The performance metrics revealed that the stacking ensemble model with two base learner (random forest and generalized linear model via penalized maximum likelihood) and support vector regression with linear kernel as meta-learner was the optimal model for forecasting Bitcoin price. The MAPE, RMSE, MAE, and R-squared values for the stacking ensemble model were 0.0191%, 15.5331 USD, 124.5508 USD, and 0.9967 respectively. These values show a high degree of reliability in predicting the price of Bitcoin using the stacking ensemble model. Accurately predicting the future price of Bitcoin will yield significant returns for investors and market players in the cryptocurrency market.
Deep learning (DL) research yields accuracy and product improvements from both model architecture changes and scale: larger data sets and models, and more computation. For hardware design, it is difficult to predict DL model changes. However, recent prior work shows that as dataset sizes grow, DL model accuracy and model size grow predictably. This paper leverages the prior work to project the dataset and model size growth required to advance DL accuracy beyond human-level, to frontier targets defined by machine learning experts. Datasets will need to grow $33$$971 \times$, while models will need to grow $6.6$$456\times$ to achieve target accuracies. We further characterize and project the computational requirements to train these applications at scale. Our characterization reveals an important segmentation of DL training challenges for recurrent neural networks (RNNs) that contrasts with prior studies of deep convolutional networks. RNNs will have comparatively moderate operational intensities and very large memory footprint requirements. In contrast to emerging accelerator designs, large-scale RNN training characteristics suggest designs with significantly larger memory capacity and on-chip caches.
Accelerating research in the emerging field of deep graph learning requires new tools. Such systems should support graph as the core abstraction and take care to maintain both forward (i.e. supporting new research ideas) and backward (i.e. integration with existing components) compatibility. In this paper, we present Deep Graph Library (DGL). DGL enables arbitrary message handling and mutation operators, flexible propagation rules, and is framework agnostic so as to leverage high-performance tensor, autograd operations, and other feature extraction modules already available in existing frameworks. DGL carefully handles the sparse and irregular graph structure, deals with graphs big and small which may change dynamically, fuses operations, and performs auto-batching, all to take advantages of modern hardware. DGL has been tested on a variety of models, including but not limited to the popular Graph Neural Networks (GNN) and its variants, with promising speed, memory footprint and scalability.
Recently, consistency-based methods have achieved state-of-the-art results in semi-supervised learning (SSL). These methods always involve two roles, an explicit or implicit teacher model and a student model, and penalize predictions under different perturbations by a consistency constraint. However, the weights of these two roles are tightly coupled since the teacher is essentially an exponential moving average (EMA) of the student. In this work, we show that the coupled EMA teacher causes a performance bottleneck. To address this problem, we introduce Dual Student, which replaces the teacher with another student. We also define a novel concept, stable sample, following which a stabilization constraint is designed for our structure to be trainable. Further, we discuss two variants of our method, which produce even higher performance. Extensive experiments show that our method improves the classification performance significantly on several main SSL benchmarks. Specifically, it reduces the error rate of the 13-layer CNN from 16.84% to 12.39% on CIFAR-10 with 1k labels and from 34.10% to 31.56% on CIFAR-100 with 10k labels. In addition, our method also achieves a clear improvement in domain adaptation.
We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation. Using this approach, training and prediction in these networks require only constant memory, regardless of the effective ‘depth’ of the network. We demonstrate how DEQs can be applied to two state-of-the-art deep sequence models: self-attention transformers and trellis networks. On large-scale language modeling tasks, such as the WikiText-103 benchmark, we show that DEQs 1) often improve performance over these state-of-the-art models (for similar parameter counts); 2) have similar computational requirements as existing models; and 3) vastly reduce memory consumption (often the bottleneck for training large sequence models), demonstrating an up-to 88% memory reduction in our experiments. The code is available at https://github. com/locuslab/deq .
We study the limit computability of finding a global optimum of a continuous function. We give a short proof to show that the problem of checking whether a point is a global minimum is not limit computable. Thereby showing the same for the problem of finding a global minimum. In the next part, we give an algorithm that converges to the global minima when a lower bound on the size of the basin of attraction of the global minima is known. We prove the convergence of this algorithm and provide some numerical experiments.
Dimensionality reduction on Riemannian manifolds is challenging due to the complex nonlinear data structures. While probabilistic principal geodesic analysis~(PPGA) has been proposed to generalize conventional principal component analysis (PCA) onto manifolds, its effectiveness is limited to data with a single modality. In this paper, we present a novel Gaussian latent variable model that provides a unique way to integrate multiple PGA models into a maximum-likelihood framework. This leads to a well-defined mixture model of probabilistic principal geodesic analysis (MPPGA) on sub-populations, where parameters of the principal subspaces are automatically estimated by employing an Expectation Maximization algorithm. We further develop a mixture Bayesian PGA (MBPGA) model that automatically reduces data dimensionality by suppressing irrelevant principal geodesics. We demonstrate the advantages of our model in the contexts of clustering and statistical shape analysis, using synthetic sphere data, real corpus callosum, and mandible data from human brain magnetic resonance~(MR) and CT images.
This thesis focuses on process mining on event data where such a normative specification is absent and, as a result, the event data is less structured. The thesis puts special emphasis on one application domain that fits this description: the analysis of smart home data where sequences of daily activities are recorded. In this thesis we propose a set of techniques to analyze such data, which can be grouped into two categories of techniques. The first category of methods focuses on preprocessing event logs in order to enable process discovery techniques to extract insights into unstructured event data. In this category we have developed the following techniques: – An unsupervised approach to refine event labels based on the time at which the event took place, allowing for example to distinguish recorded eating events into breakfast, lunch, and dinner. – An approach to detect and filter from event logs so-called chaotic activities, which are activities that cause process discovery methods to overgeneralize. – A supervised approach to abstract low-level events into more high-level events, where we show that there exist situations where process discovery approaches overgeneralize on the low-level event data but are able to find precise models on the high-level event data. The second category focuses on mining local process models, i.e., collections of process model patterns that each describe some frequent pattern, in contrast to the single global process model that is obtained with existing process discovery techniques. Several techniques are introduced in the area of local process model mining, including a basic method, fast but approximate heuristic methods, and constraint-based techniques.
Neural networks enjoy widespread use, but many aspects of their training, representation, and operation are poorly understood. In particular, our view into the training process is limited, with a single scalar loss being the most common viewport into this high-dimensional, dynamic process. We propose a new window into training called Loss Change Allocation (LCA), in which credit for changes to the network loss is conservatively partitioned to the parameters. This measurement is accomplished by decomposing the components of an approximate path integral along the training trajectory using a Runge-Kutta integrator. This rich view shows which parameters are responsible for decreasing or increasing the loss during training, or which parameters ‘help’ or ‘hurt’ the network’s learning, respectively. LCA may be summed over training iterations and/or over neurons, channels, or layers for increasingly coarse views. This new measurement device produces several insights into training. (1) We find that barely over 50% of parameters help during any given iteration. (2) Some entire layers hurt overall, moving on average against the training gradient, a phenomenon we hypothesize may be due to phase lag in an oscillatory training process. (3) Finally, increments in learning proceed in a synchronized manner across layers, often peaking on identical iterations.
Everyone makes mistakes. So do human annotators when curating labels for named entity recognition (NER). Such label mistakes might hurt model training and interfere model comparison. In this study, we dive deep into one of the widely-adopted NER benchmark datasets, CoNLL03 NER. We are able to identify label mistakes in about 5.38% test sentences, which is a significant ratio considering that the state-of-the-art test F1 score is already around 93%. Therefore, we manually correct these label mistakes and form a cleaner test set. Our re-evaluation of popular models on this corrected test set leads to more accurate assessments, compared to those on the original test set. More importantly, we propose a simple yet effective framework, CrossWeigh, to handle label mistakes during NER model training. Specifically, it partitions the training data into several folds and train independent NER models to identify potential mistakes in each fold. Then it adjusts the weights of training data accordingly to train the final NER model. Extensive experiments demonstrate significant improvements of plugging various NER models into our proposed framework on three datasets. All implementations and corrected test set are available at our Github repo: https://…/CrossWeigh.
Word embeddings have demonstrated strong performance on NLP tasks. However, lack of interpretability and the unsupervised nature of word embeddings have limited their use within computational social science and digital humanities. We propose the use of informative priors to create interpretable and domain-informed dimensions for probabilistic word embeddings. Experimental results show that sensible priors can capture latent semantic concepts better than or on-par with the current state of the art, while retaining the simplicity and generalizability of using priors.
A key ingredient in branch and bound (B&B) solvers for mixed-integer programming (MIP) is the selection of branching variables since poor or arbitrary selection can affect the size of the resulting search trees by orders of magnitude. A recent article by Le Bodic and Nemhauser [Mathematical Programming, (2017)] investigated variable selection rules by developing a theoretical model of B&B trees from which they developed some new, effective scoring functions for MIP solvers. In their work, Le Bodic and Nemhauser left several open theoretical problems, solutions to which could guide the future design of variable selection rules. In this article, we first solve many of these open theoretical problems. We then implement an improved version of the model-based branching rules in SCIP 6.0, a modern open-source MIP solver, in which we observe an 11% geometric average time and node reduction on instances of the MIPLIB 2017 Benchmark Set that require large B&B trees.
Since the recent advent of deep reinforcement learning for game play and simulated robotic control, a multitude of new algorithms have flourished. Most are model-free algorithms which can be categorized into three families: deep Q-learning, policy gradients, and Q-value policy gradients. These have developed along separate lines of research, such that few, if any, code bases incorporate all three kinds. Yet these algorithms share a great depth of common deep reinforcement learning machinery. We are pleased to share rlpyt, which implements all three algorithm families on top of a shared, optimized infrastructure, in a single repository. It contains modular implementations of many common deep RL algorithms in Python using PyTorch, a leading deep learning library. rlpyt is designed as a high-throughput code base for small- to medium-scale research in deep RL. This white paper summarizes its features, algorithms implemented, and relation to prior work, and concludes with detailed implementation and usage notes. rlpyt is available at https://…/rlpyt.

### Distilled News

Combinatorics is that field of mathematics primarily concerned with counting elements from one or more sets. It can help us count the number of orders in which something can happen. In this article, I’m going to dwell on three different types of techniques:
• permutations
• dispositions
• combinations
Thanks to tools like Azure Databricks, we can build simple data pipelines in the cloud and use Spark to get some comprehensive insights into our data with relative ease. Combining this with the Apache Spark connector for Cosmos DB, we can leverage the power of Azure Cosmos DB to gain and store some incredible insights into our data. It’s been a while since I’ve written a post on Databricks and since I’ve been working with Cosmos DB quite a bit over the past few months, I’d thought I’d write a simple tutorial on how you can use Azure Blob Storage, Azure Databricks and Cosmos DB to build a straightforward data pipeline that does some simple transformations on our source data. I’m also going to throw a bit of Azure Key Vault into the mix to show you how simple it can be to protect vital secrets in Databricks such as Storage account keys and Cosmos DB endpoints! This blog post is mainly aimed at beginners. Ideally you would have some idea of what each component is and you’d have some understanding of Python.
Creating a great machine learning system is an art. There are a lot of things to consider while building a great machine learning system. But often it happens that we as data scientists only worry about certain parts of the project. Most of the time that happens to be modeling, but in reality, the success or failure of a Machine Learning project depends on a lot of other factors.
1. Problem Definition
2. Data
2. Data
4. Features
5. Modeling
6. Experimentation
Bayesian Linear Models are often presented as introductory material for those seeking to learn probabilistic programming, coming in with an existing understanding in frequentist statistical learning models. I believe this is effective because it allows one to scaffold new knowledge on top of existing knowledge, and even fit something that was already understood – perhaps as just one tool among many – into a wider and more theoretically satisfying framework. The relationship between the prior distribution of parameters chosen in a Bayesian linear model and the penalty term in regularized least-squares regression is already well known. Despite this, I feel that I was able to come to a more visceral and intuitive understanding of this equivalence by empirically examining the effect of tweaking hyperparameters of each model. I hope my small experiment can do the same for you, and be supplemental to the existing proofs that are available.
1. All unique
2. Anagrams
3. Memory
4. Byte size
5. Print a string N times
6. Capitalize first letters
7. Chunk
8. Compact
9. Count by
10. Chained comparison
11. Comma-separated
12. Count vowels
13. Decapitalize
14. Flatten
15. Difference
16. Difference by
17. Chained function call
18. Has duplicates
19. Merge two dictionaries
20. Convert two lists into a dictionary
21. Use enumerate
22. Time spent
22. Time spent
24. Most frequent
25. Palindrome
26. Calculator without if-else
27. Shuffle
29. Swap values
30. Get default value for missing keys
It is an unfortunate fact that many data scientists do not know how to write production-quality code.
Production-quality code is code that is:
• Free from errors;
• Robust to exceptions;
• Efficient;
• Well documented; and
• Reproducible.
Producing it is not rocket science.
A How-to for Non-Parametric Power Analyses, p-values, Confidence Intervals, Checking for Bias. This post will enable you to do a power analysis, calculate p-values, get confidence intervals, and check for bias in your design without making any assumptions (non-parametrically). This post will enable you to do a power analysis, calculate p-values, get confidence intervals, and check for bias in your design without making any assumptions (non-parametrically).
This post will discuss the Universal Transformer, which combines the original Transformer model with a technique called Adaptive Computation Time. The main innovation of Universal Transformers is to apply the Transformer components a different number of times for each symbol.
Feature selection, also known as variable selection, is a powerful idea, with major implications for your machine learning workflow. Why would you ever need it? Well, how do you like to reduce your number of features 10x? Or if doing NLP, even 1000x. What about besides smaller feature space, resulting in faster training and inference, also to have an observable improvement in accuracy, or whatever metric you use for your models? If that doesn’t grab your attention, I don’t know what does. Don’t believe me? This literally happened to me a couple of days ago at work. So, this is a 2 part blog post where I’m going to explain, and show, how to do automated feature selection in Python, so that you can level up your ML game. Only filter methods will be presented because they are more generic and less compute hungry than wrapper methods, while embedded feature selection methods being, well, embedded in the model, aren’t as flexible as filter methods.
Gen Z, the generation that comes after Millennials, are graduating college and entering the workforce. Growing up as digital natives, they know machine intelligence will scale in their generation as workers in the labor force. Those headlines about ‘robots’ coming for our jobs? Well, Gen Z have an inkling. They are actually going to live it. The eldest of Gen Z are only about 24 now, in 2019 and the majority of them are still students. Gen Z is right to have the sentiment that’s in between uncertainty and actual fear regarding their future of work.
Build an image classifier web application from scratch, without need for GPU or credit card. This article describes how to build a deep learning web based application for image classification, without need for GPU or credit card! Even though there are plenty of articles describing this stuff, I couldn’t find a complete guide describing all the steps from data collection to deployment and some details were hard to find (e.g. how to clone a Github repo on Google Drive).
A sample-efficient meta reinforcement learning method. Meta reinforcement learning could be particularly challenging because the agent has to not only adapt to the new incoming data but also find an efficient way to explore the new environment. Current meta-RL algorithms rely heavily on on-policy experience, which limits their sample efficiency. Worse still, most of them lack mechanisms to reason about task uncertainty when adapting to a new task, limiting their effectiveness in sparse reward problems. We discuss a meta-RL algorithm that attempts to address these challenges. In a nutshell, the algorithm, namely Probabilistic Embeddings for Actor-critic RL(PEARL) proposed by Rakelly & Zhou et al. in ICLR 2019, is comprised of two parts: It learns a probabilistic latent context that sufficiently describes a task; conditioned on that latent context, an off-policy RL algorithm learns to take actions. In this framework, the probabilistic latent context serves as the belief state of the current task. By conditioning the RL algorithm on the latent context, we expect the RL algorithm to learn to distinguish different tasks. Moreover, this disentangles task inference from action making, which, as we will see later, makes an off-policy algorithm applicable to meta-learning.
Chatbots, virtual assistants, augmented analytic systems typically receive user queries such as ‘Find me an action movie by Steven Spielberg’. The system should correctly detect the intent ‘find_movie’ while filling the slots ‘genre’ with value ‘action’ and ‘directed_by’ with value ‘Steven Spielberg’. This is a Natural Language Understanding (NLU) task kown as Intent Classification & Slot Filling. State-of-the-art performance is typically obtained using recurrent neural network (RNN) based approaches, as well as by leveraging an encoder-decoder architecture with sequence-to-sequence models. In this article we demonstrate hands-on strategies for improving the performance even further by adding Attention mechanism.
Why econometrics should be part of your skills. As a Data scientist with a master’s degree in econometrics, I took some time to understand the subtleties that make machine learning a different discipline from econometrics. I would like to talk to you about these subtleties that are not obvious at first sight and that made me wonder all along my journey.
Batch Normalization, and the zoo of related normalization strategies that have grown up around it, have played an interesting array of roles in recent deep learning research: as a wunderkind optimization trick, a focal point for discussions about theoretical rigor and, importantly, but somewhat more in the sidelines, as a flexible and broadly successful avenue for injecting conditioning information into models. Conditional renormalization started humbly enough, as a clever trick for training more flexible style transfer models, but over the years this originally-simple trick has grown in complexity and conceptual scope. I kept seeing new variants of this strategy pop up, not just on the edges of the literature, but in its most central and novel advances: from the winner of 2017’s ImageNet competition to 2018’s most impressive generative image model. The more I saw it, the more I wanted to tell the story of this simple idea I’d watched grow and evolve from a one-off trick to a broadly applicable way of integrating new information in a low-complexity way.

### R Packages worth a look

Memory-Efficient, Visualize-Enhanced, Parallel-Accelerated GWAS Tool (rMVP)
A memory-efficient, visualize-enhanced, parallel-accelerated Genome-Wide Association Study (GWAS) tool. It can
(1) effectively process large data,
(2) rapidly evaluate population structure,
(3) efficiently estimate variance components several algorithms,
(4) implement parallel-accelerated association tests of markers three methods,
(5) globally efficient design on GWAS process computing,
(6) enhance visualization of related information.
‘rMVP’ contains three models GLM (Alkes Price (2006) <DOI:10.1038/ng1847>), MLM (Jianming Yu (2006) <DOI:10.1038/ng1702>)
and FarmCPU (Xiaolei Liu (2016) <doi:10.1371/journal.pgen.1005767>); variance components estimation methods EMMAX
(Hyunmin Kang (2008) <DOI:10.1534/genetics.107.080101>;), FaSTLMM (method: Christoph Lippert (2011) <DOI:10.1038/nmeth.1681>,
R implementation from ‘GAPIT2’: You Tang and Xiaolei Liu (2016) <DOI:10.1371/journal.pone.0107684> and
‘SUPER’: Qishan Wang and Feng Tian (2014) <DOI:10.1371/journal.pone.0107684>), and HE regression
(Xiang Zhou (2017) <DOI:10.1214/17-AOAS1052>).

Import data from ‘Excel’ binary (.xlsb) workbooks into R.

Algebraic and Statistical Functions for Genetics (miraculix)
This is a collection of fast tools for application in quantitative genetics. For instance, the SNP matrix can be stored in a minimum of memory and the calculation of the genomic relationship matrix is based on a rapid algorithm. It also contains the window scanning approach by Kabluchko and Spodarev (2009), <doi:10.1239/aap/1240319575> to detect anomalous genomic areas <doi:10.1186/s12864-018-5009-y>. Furthermore, the package is used in the Modular Breeding Program Simulator (MoBPS, <https://…/MoBPS>, <http://…/> ). The tools are based on SIMD (Single Instruction Multiple Data, <https://…/SIMD> ) and OMP (Open Multi-Processing, <https://…/OpenMP> ).

Access Landscape Evaporative Response Index Raster Data (leri)
Finds and downloads Landscape Evaporative Response Index (LERI) data, then reads the data into ‘R’ using the ‘raster’ package. The LERI product measures anomalies in actual evapotranspiration, to support drought monitoring and early warning systems. More info on LERI is available at <https://…/>.

Boltzmann Bayes Learner (bbl)
Supervised learning using Boltzmann Bayes model inference, which extends naive Bayes model to include interactions. Enables classification of data into multiple response groups based on a large number of discrete predictors that can take factor values of heterogeneous levels. Either pseudo-likelihood and mean field inference can be used with L2 regularization, cross-validation, and prediction on new data. Woo et al. (2016) <doi:10.1186/s12864-016-2871-3>.

### Document worth reading: “A survey on Adversarial Attacks and Defenses in Text”

Deep neural networks (DNNs) have shown an inherent vulnerability to adversarial examples which are maliciously crafted on real examples by attackers, aiming at making target DNNs misbehave. The threats of adversarial examples are widely existed in image, voice, speech, and text recognition and classification. Inspired by the previous work, researches on adversarial attacks and defenses in text domain develop rapidly. To the best of our knowledge, this article presents a comprehensive review on adversarial examples in text. We analyze the advantages and shortcomings of recent adversarial examples generation methods and elaborate the efficiency and limitations on countermeasures. Finally, we discuss the challenges in adversarial texts and provide a research direction of this aspect. A survey on Adversarial Attacks and Defenses in Text