# My Data Science Blogs

## June 27, 2019

### An Overview of Outlier Detection Methods from PyOD – Part 1

PyOD is an outlier detection package that provides a comprehensive API supporting multiple techniques. This post is Part 1 of an overview of techniques that can be used to detect anomalies in data.
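
The post centers on PyOD's API, but the core idea behind one of the simplest detectors it covers, the k-nearest-neighbor method, can be sketched in plain NumPy (a minimal illustration of the technique, not PyOD's actual implementation):

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by its distance to its k-th nearest neighbor.

    Larger scores suggest outliers. This is a plain-NumPy sketch of the
    kNN detector idea; PyOD wraps the same idea behind a
    scikit-learn-style fit/predict API.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1))
    # After sorting each row, column 0 is the self-distance (0),
    # so column k is the distance to the k-th nearest neighbor.
    return np.sort(dists, axis=1)[:, k]

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(50, 2))
outlier = np.array([[8.0, 8.0]])
scores = knn_outlier_scores(np.vstack([inliers, outlier]), k=3)
print(scores.argmax())  # → 50, the index of the planted outlier
```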

### One simple chart: Who is interested in Spark NLP?

As we close in on its two-year anniversary, Spark NLP is proving itself a viable option for enterprise use.

In July 2016, I broached the idea for an NLP library aimed at Apache Spark users to my friend David Talby. A little over a year later, Talby and his collaborators announced the release of Spark NLP. They described the motivation behind the project in their announcement post and in this accompanying podcast that Talby and I recorded, as well as in this recent post comparing popular open source NLP libraries. [Full disclosure: I’m an advisor to Databricks, the startup founded by the team that originated Apache Spark.]

As we close in on the two-year anniversary of the project, I asked Talby where interest in the project has come from, and he graciously shared geo-demographic data of visitors to the project’s homepage:

Of the thousands of visitors to the site, 44% are from the Americas, 24% from Asia-Pacific, and 22% are based in the EMEA region.

Many of these site visitors are turning into users of the project. In our recent survey AI Adoption in the Enterprise, quite a few respondents signalled that they were giving Spark NLP a try. The project also garnered top prize—based on a tally of votes cast by Strata Data Conference attendees—in the open source category at the Strata Data awards in March.

There are many other excellent open source NLP libraries with significant numbers of users—spaCy, OpenNLP, Stanford CoreNLP, NLTK—but when the project started, there seemed to be an opportunity for a library that appealed to users who already had Spark clusters (and needed a scalable solution). While the project started out targeting Apache Spark users, it has evolved to provide simple APIs that get things done in a few lines of code and fully hide Spark under the hood. The library’s Python API now has the most users. Installing Spark NLP is a one-line operation using pip or conda for Python, or a single package pull in Java or Scala using Maven, sbt, or spark-packages. The library’s documentation has also grown, and there are public online examples for common tasks like sentiment analysis, named entity recognition, and spell checking. Improvements in documentation, ease of use, and a production-ready implementation of key deep learning models, combined with speed, scalability, and accuracy, have made Spark NLP a viable option for enterprises needing an NLP library.
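
For reference, the one-line installs read roughly as follows (package names and coordinates as published around the time of writing; treat the version numbers as illustrative and check the project's documentation):

```shell
# Python, via pip or conda
pip install spark-nlp
# or
conda install -c johnsnowlabs spark-nlp

# Scala/Java, pulled via spark-packages (version is illustrative)
spark-shell --packages JohnSnowLabs:spark-nlp:2.0.0
```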

For more on Spark NLP, join Talby and his fellow instructors for a three-hour tutorial, Natural language understanding at scale with Spark NLP, at the Strata Data Conference in New York City, September 23-26, 2019.

Continue reading One simple chart: Who is interested in Spark NLP?.

### Four short links: 27 June 2019

Security Mnemonics, Evidence Might Work, Misinformation Inoculation, and Spoofing Presidential Alerts

1. STRIDE -- mnemonic for remembering the different types of threats: Spoofing of user identity; Tampering; Repudiation; Information disclosure (privacy breach or data leak); Denial of service (DoS); Elevation of privilege. Use when you're asking yourself, "what could possibly go wrong?" There's probably a parallel "how things can be misused" mnemonic like Nazis, Anti-Vaxx, Spam, Threats, and Your Ex- Follows You.
2. Backfire Effect is Mostly a Myth (Nieman Lab) -- some evidence that giving people evidence that shows they're wrong can change their mind. Perhaps you no longer have to be careful to whom you show this story. Full Fact research manager Amy Sippett reviewed seven studies that have explored the backfire effect and found that “cases where backfire effects were found tended to be particularly contentious topics, or where the factual claim being asked about was ambiguous.” The studies where a backfire effect was not found also tended to be larger than the studies where it was found. Full Fact cautions that most of the research on the backfire effect has been done in the U.S., and “we still need more evidence to understand how fact-checking content can be most effective.”
3. Bad News -- a browser game by Cambridge University researchers that seems to inoculate users against misinformation. We conducted a large-scale evaluation of the game with N = 15,000 participants in a pre-post gameplay design. We provide initial evidence that people’s ability to spot and resist misinformation improves after gameplay, irrespective of education, age, political ideology, and cognitive style. (via Cambridge University)
4. Spoofing Presidential Alerts -- Their research showed that four low-cost USRP or bladeRF TX-capable software defined radios (SDRs) with 1 watt output power each, combined with open source LTE base station software, could be used to send a fake Presidential Alert to a stadium of 50,000 people (note that this was only simulated—real-world tests were performed responsibly in a controlled environment). The attack works by creating a fake and malicious LTE cell tower on the SDR that nearby cell phones connect to. Once connected, an alert can easily be crafted and sent to all connected phones. There is no way to verify that an alert is legitimate. The article itself is paywalled, though Sci-Hub knows how to reach it.

### What's new on arXiv

Inferring new facts from existing knowledge graphs (KG) with explainable reasoning processes is a significant problem and has received much attention recently. However, few studies have focused on relation types unseen in the original KG, given only one or a few instances for training. To bridge this gap, we propose CogKR for one-shot KG reasoning. The one-shot relational learning problem is tackled through two modules: the summary module summarizes the underlying relationship of the given instances, based on which the reasoning module infers the correct answers. Motivated by the dual process theory in cognitive science, in the reasoning module, a cognitive graph is built by iteratively coordinating retrieval (System 1, collecting relevant evidence intuitively) and reasoning (System 2, conducting relational reasoning over collected information). The structural information offered by the cognitive graph enables our model to aggregate pieces of evidence from multiple reasoning paths and explain the reasoning process graphically. Experiments show that CogKR substantially outperforms previous state-of-the-art models on one-shot KG reasoning benchmarks, with relative improvements of 24.3%-29.7% on MRR. The source code is available at https://…/CogKR.
Natural languages are complexly structured entities. They exhibit characterising regularities that can be exploited to link them to one another. In this work, I compare two morphological aspects of languages: Written Patterns and Sentence Structure. I show how languages spontaneously group by similarity in both analyses and derive an average language distance. Finally, exploiting Sentence Structure, I developed an Artificial Neural Network capable of distinguishing languages, suggesting that not only word roots but also grammatical sentence structure is a characterising trait that alone suffices to identify them.
Many massive data sets are assembled through collections of information from a large number of individuals in a population. The analysis of such data, especially in the aspect of individualized inferences and solutions, has the potential to create significant value for practical applications. Traditionally, inference for an individual in the data set relies either solely on the information of that individual or on summarizing the information about the whole population. However, with the availability of big data, we have the opportunity, as well as a unique challenge, to make a more effective individualized inference that takes into consideration both the population information and the individual discrepancy. To deal with the possible heterogeneity within the population while providing effective and credible inferences for individuals in a data set, this article develops a new approach called individualized group learning (iGroup). The iGroup approach uses local nonparametric techniques to generate an individualized group by pooling other entities in the population that share similar characteristics with the target individual. Three general cases of iGroup are discussed, and their asymptotic performances are investigated. Both theoretical results and empirical simulations reveal that, by applying iGroup, the performance of statistical inference at the individual level is ensured and can be substantially improved over inference based on either solely individual information or entire population information. The method has a broad range of applications. Two examples in financial statistics and maritime anomaly detection are presented.
With the support of blockchain systems, cryptocurrency has changed the world of virtual assets. Digital games, especially those with massive multi-player scenarios, will be significantly impacted by this novel technology. However, there are insufficient academic studies on this topic. In this work, we fill the gap by surveying state-of-the-art blockchain games. We discuss blockchain integration for games and then categorize existing blockchain games by their genres and technical platforms. Moreover, by analyzing the industrial trend with a statistical approach, we envision the future of blockchain games from technological and commercial perspectives.
We consider the topic modeling problem for large datasets. For this problem, Latent Dirichlet Allocation (LDA) with a collapsed Gibbs sampler optimization is the state-of-the-art approach in terms of topic quality. However, LDA is a slow approach, and running it on large datasets is impractical even with modern hardware. In this paper we propose to fit topics directly to the co-occurrence data of the corpus. In particular, we introduce an extension of a mixture model, the Full Dependence Mixture (FDM), which arises naturally as a model of a second moment under general generative assumptions on the data. While there is some previous work on topic modeling using second moments, we develop a direct stochastic optimization procedure for fitting an FDM with a single Kullback-Leibler objective. While moment methods in general have the benefit that an iteration no longer needs to scale with the size of the corpus, our approach also allows us to leverage standard optimizers and GPUs for the problem of topic modeling. We evaluate the approach on synthetic and semi-synthetic data, as well as on the SOTU and NeurIPS Papers corpora, and show that the approach outperforms LDA, where LDA is run on both full and sub-sampled data.
The reliance on deep learning algorithms has grown significantly in recent years. Yet, these models are highly vulnerable to adversarial attacks, which introduce visually imperceptible perturbations into testing data to induce misclassifications. The literature has proposed several methods to combat such adversarial attacks, but each method either fails at high perturbation values, requires excessive computing power, or both. This letter proposes a computationally efficient method for defending against the Fast Gradient Sign (FGS) adversarial attack by simultaneously denoising and compressing data. Specifically, our proposed defense relies on training a fully connected multi-layer Denoising Autoencoder (DAE) and using its encoder as a defense against the adversarial attack. Our results show that using this dimensionality reduction scheme is not only highly effective in mitigating the effect of the FGS attack in multiple threat models, but it also provides a 2.43x speedup in comparison to defense strategies providing similar robustness against the same attack.
The majority of modern deep learning models are able to interpolate the data: the empirical loss can be driven near zero on all samples simultaneously. In this work, we explicitly exploit this interpolation property for the design of a new optimization algorithm for deep learning. Specifically, we use it to compute an adaptive learning-rate given a stochastic gradient direction. This results in the Adaptive Learning-rates for Interpolation with Gradients (ALI-G) algorithm. ALI-G retains the advantages of SGD, which are low computational cost and provable convergence in the convex setting. But unlike SGD, the learning-rate of ALI-G can be computed inexpensively in closed-form and does not require a manual schedule. We provide a detailed analysis of ALI-G in the stochastic convex setting with explicit convergence rates. In order to obtain good empirical performance in deep learning, we extend the algorithm to use a maximal learning-rate, which gives a single hyper-parameter to tune. We show that employing such a maximal learning-rate has an intuitive proximal interpretation and preserves all convergence guarantees. We provide experiments on a variety of architectures and tasks: (i) learning a differentiable neural computer; (ii) training a wide residual network on the SVHN data set; (iii) training a Bi-LSTM on the SNLI data set; and (iv) training wide residual networks and densely connected networks on the CIFAR data sets. We empirically show that ALI-G outperforms adaptive gradient methods such as Adam, and provides comparable performance with SGD, although SGD benefits from manual learning rate schedules. We release PyTorch and Tensorflow implementations of ALI-G as standalone optimizers that can be used as a drop-in replacement in existing code (code available at https://…/ali-g ).
Scientific computations or measurements may result in huge volumes of data. Often these can be thought of as representing a real-valued function on a high-dimensional domain, and can be conceptually arranged in the format of a tensor of high degree in some truncated or lossy compressed format. We look at some common post-processing tasks which are not obvious in the compressed format, as such huge data sets cannot be stored in their entirety, and the value of an element is not readily accessible through simple look-up. The tasks we consider are finding the location of the maximum or minimum, or the minimum and maximum of a function of the data, or finding the indices of all elements in some interval — i.e. level sets, the number of elements with a value in such a level set, the probability of an element being in a particular level set, and the mean and variance of the total collection. The algorithms to be described are fixed point iterations of particular functions of the tensor, which will then exhibit the desired result. For this, the data is considered as an element of a high degree tensor space, although in an abstract sense, the algorithms are independent of the representation of the data as a tensor. All that we require is that the data can be considered as an element of an associative, commutative algebra with an inner product. Such an algebra is isomorphic to a commutative sub-algebra of the usual matrix algebra, allowing the use of matrix algorithms to accomplish the mentioned tasks. We allow the actual computational representation to be a lossy compression, and we allow the algebra operations to be performed in an approximate fashion, so as to maintain a high compression level. One such example which we address explicitly is the representation of data as a tensor with compression in the form of a low-rank representation.
Fine-grained entity typing is a difficult task that suffers from noisy samples extracted via distant supervision. Thousands of manually annotated samples can achieve greater performance than millions of samples generated by previous distant supervision methods. However, it’s hard for human beings to differentiate and memorize thousands of types, making large-scale human labeling hardly possible. In this paper, we introduce a Knowledge-Constraint Typing Annotation Tool (KCAT), which is efficient for fine-grained entity typing annotation. KCAT reduces the set of candidate types to an acceptable range for human beings through entity linking and provides a Multi-step Typing scheme to revise the entity linking result. Moreover, KCAT provides an efficient Annotator Client to accelerate the annotation process and a comprehensive Manager Module to analyse crowdsourcing annotations. Experiments show that KCAT can significantly improve annotation efficiency, with time consumption increasing only slowly as the size of the type set expands.
This paper focuses on the end-to-end abstractive summarization of a single product review without supervision. We assume that a review can be described as a discourse tree, in which the summary is the root, and the child sentences explain their parent in detail. By recursively estimating a parent from its children, our model learns the latent discourse tree without an external parser and generates a concise summary. We also introduce an architecture that ranks the importance of each sentence on the tree to support summary generation focusing on the main review point. The experimental results demonstrate that our model is competitive with or outperforms other unsupervised approaches. In particular, for relatively long reviews, it achieves a competitive or better performance than supervised models. The induced tree shows that the child sentences provide additional information about their parent, and the generated summary abstracts the entire review.
Multilingual writers and speakers often alternate between two languages in a single discourse, a practice called ‘code-switching’. Existing sentiment detection methods are usually trained on sentiment-labeled monolingual text. Manually labeled code-switched text, especially involving minority languages, is extremely rare. Consequently, the best monolingual methods perform relatively poorly on code-switched text. We present an effective technique for synthesizing labeled code-switched text from labeled monolingual text, which is more readily available. The idea is to replace carefully selected subtrees of constituency parses of sentences in the resource-rich language with suitable token spans selected from automatic translations to the resource-poor language. By augmenting scarce human-labeled code-switched text with plentiful synthetic code-switched text, we achieve significant improvements in sentiment labeling accuracy (1.5%, 5.11%, 7.20%) for three different language pairs (English-Hindi, English-Spanish and English-Bengali). We also get significant gains for hate speech detection: 4% improvement using only synthetic text and 6% if augmented with real text.
This paper describes a C++ library that compiles neural network models at runtime into machine code that performs inference. This approach in general promises to achieve the best performance possible since it is able to integrate statically known properties of the network directly into the code. In our experiments on the NAO V6 platform, it outperforms existing implementations significantly on small networks, while being inferior on large networks. The library was already part of the B-Human code release 2018, but has been extended since and is now available as a standalone version that can be integrated into any C++14 code base.
Noise modeling lies at the heart of many image processing tasks. However, existing deep learning methods for noise modeling generally require clean and noisy image pairs for model training; these image pairs are difficult to obtain in many realistic scenarios. To ameliorate this problem, we propose a self-consistent GAN (SCGAN) that can directly extract noise maps from noisy images, thus enabling unsupervised noise modeling. In particular, the SCGAN introduces three novel self-consistent constraints that are complementary to one another, viz.: the noise model should produce a zero response over a clean input; the noise model should return the same output when fed with a specific pure noise input; and the noise model also should re-extract a pure noise map if the map is added to a clean image. These three constraints are simple yet effective. They jointly facilitate unsupervised learning of a noise model for various noise types. To demonstrate its wide applicability, we deploy the SCGAN on three image processing tasks including blind image denoising, rain streak removal, and noisy image super-resolution. The results demonstrate the effectiveness and superiority of our method over the state-of-the-art methods on a variety of benchmark datasets, even though the noise types vary significantly and paired clean images are not available.
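
The three self-consistency constraints can be written down directly as reconstruction losses. The following sketch checks them against a toy noise extractor (purely illustrative; the paper trains a GAN, not this closed-form function):

```python
import numpy as np

def self_consistency_losses(extract, clean, noise):
    """The three SCGAN-style constraints, as plain reconstruction losses.

    `extract` is any noise-extractor function; the constraints say:
      1. zero response on a clean input,
      2. identity response on a pure-noise input,
      3. re-extraction of the noise from a noisy image.
    This is an illustrative sketch, not the paper's adversarial training loop.
    """
    l1 = np.mean(extract(clean) ** 2)                    # f(clean) ≈ 0
    l2 = np.mean((extract(noise) - noise) ** 2)          # f(noise) ≈ noise
    l3 = np.mean((extract(clean + noise) - noise) ** 2)  # f(clean+noise) ≈ noise
    return l1, l2, l3

rng = np.random.default_rng(0)
clean = np.ones((8, 8))             # toy "clean image": constant
noise = rng.normal(0, 0.1, (8, 8))
noise -= noise.mean()               # zero-mean noise

# A toy extractor: subtract the image mean. It satisfies all three
# constraints exactly for constant clean images and zero-mean noise.
extract = lambda x: x - x.mean()
losses = self_consistency_losses(extract, clean, noise)
print([round(l, 8) for l in losses])  # → [0.0, 0.0, 0.0]
```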
Building performance discrepancies between design and operation are one of the causes that lead many new designs to fail to achieve their goals and objectives. One of the main factors contributing to the discrepancy is occupant behavior. Occupants responding to a new design are influenced by several factors. Existing building performance models (BPMs) ignore or only partially address these factors (called contextual factors). To potentially reduce the discrepancies and improve the prediction accuracy of BPMs, this paper proposes a computational framework for learning mixture models by using Generative Adversarial Networks (GANs), which appropriately combines existing BPMs with knowledge of occupant behaviors under the contextual factors of new designs. Immersive virtual environment (IVE) experiments are used to acquire data on such behaviors. Performance targets are used to guide the appropriate combination of existing BPMs with knowledge of occupant behaviors. The resulting model is called an augmented BPM. Two different experiments related to occupant lighting behaviors are presented as a case study. The results reveal that augmented BPMs significantly outperformed existing BPMs with respect to achieving specified performance targets. The case study confirms the potential of the computational framework for improving the prediction accuracy of BPMs during design.
The scale of Internet-connected systems has increased considerably, and these systems are being exposed to cyber attacks more than ever. The complexity and dynamics of cyber attacks require protecting mechanisms to be responsive, adaptive, and large-scale. Machine learning, or more specifically deep reinforcement learning (DRL), methods have been proposed widely to address these issues. By incorporating deep learning into traditional RL, DRL is highly capable of solving complex, dynamic, and especially high-dimensional cyber defense problems. This paper presents a survey of DRL approaches developed for cyber security. We touch on different vital aspects, including DRL-based security methods for cyber-physical systems, autonomous intrusion detection techniques, and multi-agent DRL-based game theory simulations for defense strategies against cyber attacks. Extensive discussions and future research directions on DRL-based cyber security are also given. We expect that this comprehensive review provides the foundations for and facilitates future studies on exploring the potential of emerging DRL to cope with increasingly complex cyber security problems.
Hierarchical Reinforcement Learning is a promising approach to long-horizon decision-making problems with sparse rewards. Unfortunately, most methods still decouple the lower-level skill acquisition process and the training of a higher level that controls the skills in a new task. Treating the skills as fixed can lead to significant sub-optimality in the transfer setting. In this work, we propose a novel algorithm to discover a set of skills, and continuously adapt them along with the higher level even when training on a new task. Our main contributions are two-fold. First, we derive a new hierarchical policy gradient, as well as an unbiased latent-dependent baseline. We introduce Hierarchical Proximal Policy Optimization (HiPPO), an on-policy method to efficiently train all levels of the hierarchy simultaneously. Second, we propose a method of training time-abstractions that improves the robustness of the obtained skills to environment changes. Code and results are available at sites.google.com/view/hippo-rl .
Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a “dog” can be seen, heard, and felt). We hypothesize that a powerful representation is one that models view-invariant factors. Based on this hypothesis, we investigate a contrastive coding scheme, in which a representation is learned that aims to maximize mutual information between different views but is otherwise compact. Our approach scales to any number of views, and is view-agnostic. The resulting learned representations perform above the state of the art for downstream tasks such as object classification, compared to formulations based on predictive learning or single view reconstruction, and improve as more views are added. Code and reference implementations are released on our project page: http://…/.
Training deep generative models with maximum likelihood remains a challenge. The typical workaround is to use variational inference (VI) and maximize a lower bound to the log marginal likelihood of the data. Variational auto-encoders (VAEs) adopt this approach. They further amortize the cost of inference by using a recognition network to parameterize the variational family. Amortized VI scales approximate posterior inference in deep generative models to large datasets. However, it introduces an amortization gap and leads to approximate posteriors of reduced expressivity due to the problem known as posterior collapse. In this paper, we consider expectation maximization (EM) as a paradigm for fitting deep generative models. Unlike VI, EM directly maximizes the log marginal likelihood of the data. We rediscover the importance weighted auto-encoder (IWAE) as an instance of EM and propose a new EM-based algorithm for fitting deep generative models called reweighted expectation maximization (REM). REM learns better generative models than the IWAE by decoupling the learning dynamics of the generative model and the recognition network using a separate expressive proposal found by moment matching. We compared REM to the VAE and the IWAE on several density estimation benchmarks and found that it leads to significantly better performance as measured by log-likelihood.
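
The importance-weighted bound that connects the IWAE to this EM view can be sketched on a toy Gaussian model (all densities and parameter choices here are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: p(z) = N(0,1), p(x|z) = N(z,1); proposal q(z|x) = N(x/2, 1).
def log_p_xz(x, z):
    return -0.5 * (z**2 + (x - z)**2) - np.log(2 * np.pi)

def log_q(z, x):
    return -0.5 * (z - x / 2)**2 - 0.5 * np.log(2 * np.pi)

x = 1.5
k = 64
z = rng.normal(x / 2, 1.0, size=k)            # k samples from the proposal q
log_w = log_p_xz(x, z) - log_q(z, x)          # importance log-weights

elbo = log_w.mean()                            # k = 1 bound, averaged
iwae = np.logaddexp.reduce(log_w) - np.log(k)  # log of the mean weight
print(iwae >= elbo)  # → True: the IWAE bound is tighter (Jensen's inequality)
```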
We present a novel deep zero-shot learning (ZSL) model for inferencing human-object-interaction with verb-object (VO) query. While the previous ZSL approaches only use the semantic/textual information to be fed into the query stream, we seek to incorporate and embed the semantics into the visual representation stream as well. Our approach is powered by Semantics-to-Space (S2S) architecture where semantics derived from the residing objects are embedded into a spatial space. This architecture allows the co-capturing of the semantic attributes of the human and the objects along with their location/size/silhouette information. As this is the first attempt to address the zero-shot human-object-interaction inferencing with VO query, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs tailored for testing ZSL approaches with VO query. Experimental evaluations show that our approach not only outperforms the state-of-the-art, but also shows the capability of consistently improving performance regardless of which ZSL baseline architecture is used.
Few-shot learning is a challenging problem where the system is required to achieve generalization from only a few examples. Meta-learning tackles the problem by learning prior knowledge shared across a distribution of tasks, which is then used to quickly adapt to unseen tasks. The model-agnostic meta-learning (MAML) algorithm formulates prior knowledge as a common initialization across tasks. However, forcibly sharing an initialization brings about conflicts between tasks and thus compromises the quality of the initialization. In this work, by observing that the extent of compromise differs among tasks and between layers of a neural network, we propose a new initialization idea that employs task-dependent layer-wise attenuation, which we call selective forgetting. The proposed attenuation scheme dynamically controls how much of prior knowledge each layer will exploit for a given task. The experimental results demonstrate that the proposed method mitigates the conflicts and provides outstanding performance as a result. We further show that the proposed method, named L2F, can be applied to and improve other state-of-the-art MAML-based frameworks, illustrating its generalizability.

### Magister Dixit

“The hype about the possibilities and possible applications of artificial intelligence (AI) seems currently unlimited. The AI procedures and solutions are praised as true panaceas. However, when viewed soberly, they are just another tool in the toolbox of IT experts.” Marc Botha (21.01.2019 18:07)

### Degrees of Freedom Analysis of Unrolled Neural Networks

** Nuit Blanche is now on Twitter: @NuitBlog **

Studying the great convergence!

Unrolled neural networks emerged recently as an effective model for learning inverse maps appearing in image restoration tasks. However, their generalization risk (i.e., test mean-squared-error) and its link to network design and train sample size remain mysterious. Leveraging Stein's Unbiased Risk Estimator (SURE), this paper analyzes the generalization risk with its bias and variance components for recurrent unrolled networks. We particularly investigate the degrees-of-freedom (DOF) component of SURE, the trace of the end-to-end network Jacobian, to quantify the prediction variance. We prove that DOF is well-approximated by the weighted *path sparsity* of the network under incoherence conditions on the trained weights. Empirically, we examine the SURE components as a function of train sample size for both recurrent and non-recurrent (with many more parameters) unrolled networks. Our key observations indicate that: 1) DOF increases with train sample size and converges to the generalization risk for both recurrent and non-recurrent schemes; 2) the recurrent network converges significantly faster (with fewer train samples) compared with the non-recurrent scheme, hence recurrence serves as a regularization for low sample size regimes.
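
The DOF term, the trace of the end-to-end Jacobian, is typically estimated rather than computed exactly. Below is a sketch of a randomized finite-difference (Hutchinson-style) trace estimator, sanity-checked on a linear map whose trace is known (an illustration of the quantity being analyzed, not the paper's procedure):

```python
import numpy as np

def estimate_dof(f, y, n_probes=5000, eps=1e-4, seed=0):
    """Monte Carlo estimate of the degrees of freedom tr(df/dy),
    via finite differences along random probe directions
    (a Hutchinson-style trace estimator, commonly paired with SURE)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        v = rng.standard_normal(y.shape)
        total += v @ (f(y + eps * v) - f(y)) / eps
    return total / n_probes

# Sanity check on a linear "network" f(y) = A @ y, where tr(J) = tr(A).
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 10))
f = lambda y: A @ y
y = rng.standard_normal(10)
est = estimate_dof(f, y)
print(abs(est - np.trace(A)))  # small: the estimate is close to tr(A)
```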

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

### R Packages worth a look

Leaner Style Sheets (rless)
Converts LESS to CSS. It uses the V8 engine to run the LESS parser. Functions are provided for converting LESS text, files, or folders.

Orchestrate and Exchange Data with ‘EtherCalc’ Instances (ethercalc)
The ‘EtherCalc’ (https://ethercalc.net ) web application is a multi-user, collaborative spreadsheet t …

Dimensionality Assessment Using Minimum Rank Factor Analysis (EFA.MRFA)
Performs parallel analysis (Timmerman & Lorenzo-Seva, 2011 <doi:10.1037/a0023353>) and hull method (Lorenzo-Seva, Timmerman, & Kiers, 201 …

Finds the Archetypal Analysis of a Data Frame (archetypal)
Performs archetypal analysis by using Convex Hull approximation under a full control of all algorithmic parameters. It contains functions useful for fi …

### Document worth reading: “Evolutionary Algorithms”

Evolutionary algorithms (EAs) are population-based metaheuristics, originally inspired by aspects of natural evolution. Modern varieties incorporate a broad mixture of search mechanisms, and tend to blend inspiration from nature with pragmatic engineering concerns; however, all EAs essentially operate by maintaining a population of potential solutions and in some way artificially ‘evolving’ that population over time. Particularly well-known categories of EAs include genetic algorithms (GAs), Genetic Programming (GP), and Evolution Strategies (ES). EAs have proven very successful in practical applications, particularly those requiring solutions to combinatorial problems. EAs are highly flexible and can be configured to address any optimization task, without the requirements for reformulation and/or simplification that would be needed for other techniques. However, this flexibility goes hand in hand with a cost: the tailoring of an EA’s configuration and parameters, so as to provide robust performance for a given class of tasks, is often a complex and time-consuming process. This tailoring process is one of the many ongoing research areas associated with EAs. Evolutionary Algorithms

### Distilled News

Extracting Data from Multiple Data Sources. If you’re in tech, chances are you have data to manage. Data grows fast and gets more complex and harder to manage as your company scales. Management wants to extract insights from the data they have, but they do not have the technical skills to do that. They then hire you, the friendly neighbourhood data scientist/engineer/analyst or whatever title you want to call yourself nowadays (it does not matter, you’ll be doing all the work anyway) to do exactly that. You soon realise that in order to provide insights, you need some kind of visualisation to explain your findings and monitor them over time. For the data to be up to date, you need to extract, transform, and load it into your preferred database from multiple data sources at a fixed time interval (hourly, daily, weekly, monthly). Companies have workflow management systems for tasks like these. They enable us to create and schedule jobs internally. Everyone has their own preference, but I would say many of the old WMSs tend to be inefficient, hard to scale, lack a good UI, and have no retry strategy, not to mention terrible logs that make troubleshooting and debugging a nightmare. It causes unnecessary waste of effort and time. Here is an example of a traditional crontab used to schedule jobs.
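A hypothetical crontab entry of the kind described above (the script path and schedule are made up for illustration) might look like:

```
# m h dom mon dow  command
30 2 * * * /opt/etl/run_daily_load.sh >> /var/log/etl.log 2>&1
```

The five fields to the left of the command form the schedule (minute, hour, day of month, month, day of week); everything beyond scheduling, such as retries, logging, and alerting, is left to the script itself, which is exactly the limitation being criticized here.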
There are different loss functions available for different objectives. In this mini blog, I will take you through some of the most frequently used loss functions, with a set of examples. This blog is designed with the Keras and TensorFlow frameworks in mind.
TensorFlow inspires developers to experiment with their exciting AI ideas in almost any domain that comes to mind. There are three well-known factors in the ML community that make a good deep neural network model do magical things:
• Model Architecture
• High quality training data
• Sufficient Compute Capacity
When we recently held a 24h hackathon aimed at helping an NGO fight human trafficking, the composition of the competing teams could not have been more different: There were a number of teams with a heavy background in analytics consulting. And then there was my team: One data engineer, and three designers. The goal of the event was to generate insights from data provided by the client. The game was on. The client brief was the nail. Each team had its hammer. It might have been hard to see what such a design heavy group had to do with what seemed like an obvious data analytics challenge, but we did not despair. Instead, the challenge allowed us to illustrate some valuable lessons about how combining data- and design-driven approaches can generate unique insights that previously had been overlooked. The key to getting there was in looking for the stories that came to the surface at the intersection of analytics and ethnography.
Football (soccer for Americans) is by far the most popular sport in the world. With over 4 billion fans worldwide, football has proven to transcend generations, geo-political rivalries and even war conflicts. That passion has transferred into the video game space, in which games like FIFA regularly rank among the most popular video games worldwide. Despite its popularity, football is one of the games that have proven resilient to artificial intelligence (AI) techniques. The complexity of environments like FIFA is often a nightmare for AI algorithms. Recently, researchers from the Google Brain team open sourced Google Research Football, a new environment that leverages reinforcement learning to teach AI agents how to master the most popular sport in the world. The principles behind Google Research Football were outlined in a research paper that accompanied the release.
Foreword: The following post is intended to be an introduction with some math. Although it includes math, I’m by no means an expert. I struggled to learn the basics of probabilistic programming, and hopefully this helps someone else make a little more sense out of the world. If you find any errors, please drop a comment and help us learn. Cheers! In data science we are often interested in understanding how data was generated, mainly because it allows us to answer questions about new and/or incomplete observations. Given an (often noisy) dataset, though, there are an infinite number of possible model structures that could have generated the observed data. But not all model structures are made equal; certain model structures are more realistic than others based on our assumptions of the world. It is up to the modeler to choose the one that best describes the world they are observing.
‘Correlation does not imply causation’. We hear this sentence over and over again. But what does it actually mean? This small analysis uncovers the topic with the help of R and simple regressions, focusing on how alcohol impacts health.
In the last few years, the machine-learning community has blossomed, applying the technology to challenges like food security and health care.
Every single successful data science project revolves around finding accurate correlations between the input and target variables. However, more often than not, we overlook how crucial correlation analysis is. It is recommended to perform correlation analysis before and after the data gathering and transformation phases of a data science project. This article focuses on the important role correlations play in data science projects and concentrates on real-world FinTech examples. Lastly, it explains how we can model correlations the right way.
This repository contains examples of popular machine learning algorithms implemented in Python, with the mathematics behind them explained. Each algorithm has an interactive Jupyter Notebook demo that lets you play with training data and algorithm configurations and immediately see the results, charts and predictions right in your browser. In most cases the explanations are based on this great machine learning course by Andrew Ng. The purpose of this repository is not to implement machine learning algorithms with 3rd-party library one-liners, but rather to practice implementing these algorithms from scratch and gain a better understanding of the mathematics behind each algorithm. That’s why all the algorithm implementations are called ‘homemade’ and are not intended for production use.
In the past couple of years, big companies have hurried to publish their machine-learning-based time series libraries. For example, Facebook released Prophet, Amazon released Gluon Time Series, Microsoft released Time Series Insights and Google released TensorFlow time series. This popularity shows that machine-learning-based time series prediction is in high demand. This article introduces the recently released TensorFlow time series library from Google, which uses probabilistic models to describe time series. In my 7 years’ experience in algorithmic trading, I’ve established the habit of analysing the inner workings of time series libraries. So I dug into the source code of this library to understand the TensorFlow team’s take on time series modelling.
The ability to answer factoid questions is a key feature of any dialogue system. Formally speaking, giving an answer based on a document collection covering a wide range of topics is called open-domain question answering (ODQA). The ODQA task combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer span within those articles). An ODQA system can be used in many applications. Chatbots apply ODQA to answer user requests, while business-oriented Natural Language Processing (NLP) solutions leverage ODQA to answer questions based on internal corporate documentation. The picture below shows a typical dialogue with an ODQA system.
MCMC is a pretty hard topic to wrap your head around, but examples do help a lot. Last time I wrote an article explaining MCMC methods intuitively. In that article, I showed how MCMC chains could be used to simulate from a random variable whose distribution is partially known, i.e. we don’t know the normalizing constant. I also described how MCMC can be used to solve problems with a large state space, but didn’t give an example. I will provide some real-world use cases in this post. If you don’t really appreciate MCMC by now, I hope I will be able to pique your interest by the end of this blog post. This post is about understanding MCMC methods with the help of some computer science problems.
End-to-end (E2E) learning refers to training a possibly complex learning system as a single model (typically a deep neural network) that represents the complete target system, bypassing the intermediate layers usually present in traditional pipeline designs.

## June 26, 2019

### Blackman-Tukey Spectral Estimator in R

Blackman-Tukey Spectral Estimator in R!

There are two definitions of the power spectral density (PSD). Both definitions are mathematically nearly identical and define a function that describes the distribution of power over the frequency components in our data set. The periodogram PSD estimator is based on the first definition of the PSD (see the periodogram post). The Blackman-Tukey spectral estimator (BTSE) is based on the second definition, which says: find the PSD by calculating the Fourier transform of the autocorrelation sequence (ACS). In R this definition is written as

PSD <- function(rxx) {
  fft(rxx)
}

where fft is the R implementation of the fast Fourier transform, rxx is the autocorrelation sequence (ACS), and the k’th element of the ACS is rxx[k] = E[x[0]x[k]] for k from -infinity to +infinity, where E is the expectation operator. The xx in rxx[k] is a reminder that r is the correlation of x with itself. The rxx[k]s are sometimes called lags. The ACS has the property that rxx[-k]=rxx[k]*, where * is the complex conjugate. In this post we will only use real numbers, so I’ll drop the * from here forward.

So, to find the PSD we just calculate rxx and take its fft! Unfortunately, in practice we cannot do this. Calculating the expected value requires the probability density function (PDF) of x, which we don’t know, and an infinite amount of data, which we don’t have. So we can’t calculate the PSD: we’re doomed!

No, we are not doomed. We can’t calculate the PSD, but we can estimate it! We can derive an estimator for the PSD from the definition of the PSD. First, we replace rxx with an estimate of rxx: the expected value, which gives the exact rxx, is replaced with an average, which gives us an estimate of rxx. The E[x[0]x[k]] is replaced with (1/N)(x[1]x[1+k]+x[2]x[2+k]+…+x[N-1-k]x[N-1]+x[N-k]x[N]), where N is the number of data samples. For example, if k=0, then rxx[0]=(1/N)*sum(x*x). In R code the estimate is written as

lagEstimate <- function(x, k, N=length(x)) {
  (1/N)*sum(x[1:(N-k)]*x[(k+1):N])
}

If we had an infinite amount of data (N=infinity), we could use lagEstimate to estimate the entire infinite ACS. Unfortunately we don’t have an infinite amount of data, and even if we did, it wouldn’t fit into a computer. So we can only estimate a finite number of ACS elements. The function below calculates lags 0 to kMax.

Lags <- function(x, kMax) {
  sapply(0:kMax, lagEstimate, x=x)
}

Before we can try these functions out we need data. In this case the data came from a random process with the PSD plotted in the figure below. The x-axis is normalized frequency (frequency divided by the sampling rate). So, if the sampling rate were 1000 Hz, you could multiply the normalized frequency by 1000 Hz and the frequency axis would read 0 Hz to 1000 Hz. The y-axis is in dB (10log10(amplitude)). You can see six large sharp peaks in the plot and a gradual dip towards 0 Hz and then back up. Some of the peaks are close together and will be hard to resolve.

The data produced by the random process is plotted below. This is the data we will use through this post.

Let’s calculate the ACS up to the 5th lag using the data.

Lags(x,kMax=5)
## [1]  6.095786 -1.368336  3.341608  1.738122 -1.737459  3.651765

A kMax of 5 gives us 6 lags: {r[0], r[1], r[2], r[3], r[4], r[5]}. These 6 lags are not an ACS, but are part of an ACS.

We used Lags to estimate the positive lags up to kMax, but the ACS is an even sequence: r[-k]=r[k] for all k. So, let’s write a function to make a sequence consisting of lags from r[-kMax] to r[kMax]. This is a windowed ACS; values outside of +/- kMax are replaced with 0. Where it won’t cause confusion, I’ll refer to the windowed ACS as the ACS.

acsWindowed <- function(x, kMax, Nzero=0) {
  rHalf <- c(Lags(x, kMax), rep(0, Nzero))
  c(rev(rHalf[2:length(rHalf)]), rHalf)
}

Let’s try this function out.

acsW <- acsWindowed(x,9)

In the figure below you can see the r[0] lag, the maximum, is plotted in the middle of the plot.

The ACS in the figure above is how the ACS is usually plotted in textbooks. In textbooks the sum in the Fourier transform ranges from -N/2 to (N-1)/2, so the r[0] lag should be in the center of the plot. In R the sum in the Fourier transform ranges from 1 to N, so the 0’th lag has to be first. We could just build the sequence in R form, but it is often handy to start in textbook form and switch to R form. We can write a function to make switching from textbook form to R form easy.

Textbook2R <- function(x, N=length(x), foldN=ceiling(N/2)) {
  c(x[foldN:N], x[1:(foldN-1)])
}

Notice in the figure below the maximum lag r[0], is plotted at the beginning.

Let’s imagine we had an infinite amount of data and used it to estimate an infinite number of ACS lags. Let’s call that sequence rAll. We make a windowed ACS by setting rW=rAll*W, where W=1 for our 9 lags and 0 everywhere else. W is called the rectangular window because, as you can see in the plot below, its plot looks like a rectangle. By default, when we estimate a finite number of lags we are using a rectangular window.

W <- c(rep(0,9),rep(1,9),rep(0,9))

The reason we cannot use a rectangular window is that its Fourier transform is not always positive. As you can see in the plot below, there are several values below zero, indicated with the dotted line. The Re() function removes some small imaginary numbers due to numerical error; some imaginary dust we have to sweep up.

FFT_W <- Re(fft(Textbook2R(W)))

Even though the fft of the ACS rAll is positive, the product of rAll and a rectangular window might not be positive! The Bartlett window is a simple window whose fft is positive.

BartlettWindow <- function(N, n=seq(0, (N-1))) {
  1 - abs((n-(N-1)/2) / ((N-1)/2))
}

Wb <- BartlettWindow(19)

As you can see in the plot below the Fourier transform of the Bartlett window is positive.

WbFft <- Re(fft(Textbook2R(Wb)))

## Calculating the BTSE with R

Now that we can estimate the ACS and window our estimate, we are ready to estimate the PSD of our data. The BTSE is written as

Btse <- function(rHat, Wb) {
  Re(fft(rHat*Wb))
}

Note the Re() is correcting for numerical error.

In the first example we use a 19 point ACS lag sequence.

rHat   <- Textbook2R(acsWindowed(x, kMax=9))
Wb     <- Textbook2R(BartlettWindow(length(rHat)))
Pbtse9 <- Btse(rHat, Wb)

In the figure below is the BTSE calculated with a maximum lag of 9. The dotted lines indicate the locations of the peaks in the PSD we are trying to estimate. With a maximum lag of only 9 the estimate is poor.

We calculate a new estimate with a maximum lag of 18.

rHat    <- Textbook2R(acsWindowed(x, kMax=18))
Wb      <- Textbook2R(BartlettWindow(length(rHat)))
Pbtse18 <- Btse(rHat, Wb)

This estimate, made with a maximum lag of 18, is better, but the peaks around 0.4 and 0.6 are still not resolved. We need to increase the maximum lag further.

Finally we increase the maximum lag to 65 and recalculate the estimate.

rHat    <- Textbook2R(acsWindowed(x, kMax=65))
Wb      <- Textbook2R(BartlettWindow(length(rHat)))
Pbtse65 <- Btse(rHat, Wb)

This final estimate is very good. All six peaks are resolved, and the locations of our estimated peaks are very close to the true peak locations.

## Final Thoughts

Could we use 500 lags in the BTSE? In this case we could, since we have a lot of data, but the higher lags are estimated with less data and therefore have more variance. Using the high-variance lags will produce a higher-variance estimate.

Are there other ways to improve the BTSE besides using more lags? Yes! There are a few other ways. For instance, we could zero-pad the lags: add zeros to the end of our lag sequence. This makes the fft in the BTSE estimator evaluate the estimate at more frequencies, so we can see more detail in the estimated PSD.
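As a sketch of the zero-padding idea (written in Python rather than R for illustration, with a hand-rolled DFT and made-up lag values), padding the lag sequence with zeros before the transform evaluates the same underlying spectrum on a denser frequency grid:

```python
import cmath

def dft(x):
    # direct discrete Fourier transform (the same quantity R's fft computes)
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

lags = [6.1, -1.4, 3.3, 1.7]      # made-up lag estimates r[0..3], for illustration only
padded = lags + [0.0] * 12        # zero-pad the lag sequence out to 16 points

spec_coarse = [v.real for v in dft(lags)]    # 4 frequency bins
spec_fine   = [v.real for v in dft(padded)]  # 16 bins over the same spectrum
```

Zero padding adds no new information; it only interpolates the PSD estimate between the frequency bins we already had (note, for instance, that the DC bin is the sum of the lags in both cases).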

Also keep in mind there are other PSD estimation methods that do better on other PSD features. For instance, if you were more interested in finding deep nulls rather than peaks, a moving-average PSD estimator would be better.


### Book Memo: “The Experience-Centric Organization”

 How to Win Through Customer Experience Is your organization prepared for the next paradigm of customer experience? Or will you be left behind? This practical book will make you a winner in a market driven by experience and enable you to develop desirable offerings and standout service and attract loyal customers. Author Simon Clatworthy shows you how to transform into an organization that aligns your customers’ experiential journey with platforms, organizational structures, and strategic alliances. Rather than treat customer experience as an add-on to product and service design, you’ll discover how experience-centricity can drive the whole organization.

### Whats new on arXiv

Gap-Measure Tests with Applications to Data Integrity Verification

In this paper we propose and examine gap statistics for assessing uniform distribution hypotheses. We provide examples relevant to data integrity testing for which max-gap statistics provide greater sensitivity than chi-square ($\chi^2$), thus allowing the new test to be used in place of or as a complement to $\chi^2$ testing for purposes of distinguishing a larger class of deviations from uniformity. We establish that the proposed max-gap test has the same sequential and parallel computational complexity as $\chi^2$ and thus is applicable for Big Data analytics and integrity verification.

Learning Representations by Maximizing Mutual Information Across Views

We propose an approach to self-supervised representation learning based on maximizing mutual information between features extracted from multiple views of a shared context. For example, a context could be an image from ImageNet, and multiple views of the context could be generated by repeatedly applying data augmentation to the image. Following this approach, we develop a new model which maximizes mutual information between features extracted at multiple scales from independently-augmented copies of each input. Our model significantly outperforms prior work on the tasks we consider. Most notably, it achieves over 60% accuracy on ImageNet using the standard linear evaluation protocol. This improves on prior results by over 4% (absolute). On Places205, using the representations learned on ImageNet, our model achieves 50% accuracy. This improves on prior results by 2% (absolute). When we extend our model to use mixture-based representations, segmentation behaviour emerges as a natural side-effect.

Learning Interpretable Shapelets for Time Series Classification through Adversarial Regularization

Times series classification can be successfully tackled by jointly learning a shapelet-based representation of the series in the dataset and classifying the series according to this representation. However, although the learned shapelets are discriminative, they are not always similar to pieces of a real series in the dataset. This makes it difficult to interpret the decision, i.e. difficult to analyze if there are particular behaviors in a series that triggered the decision. In this paper, we make use of a simple convolutional network to tackle the time series classification task and we introduce an adversarial regularization to constrain the model to learn more interpretable shapelets. Our classification results on all the usual time series benchmarks are comparable with the results obtained by similar state-of-the-art algorithms but our adversarially regularized method learns shapelets that are, by design, interpretable.

Big-Data Clustering: K-Means or K-Indicators?

The K-means algorithm is arguably the most popular data clustering method, commonly applied to processed datasets in some ‘feature spaces’, as is in spectral clustering. Highly sensitive to initializations, however, K-means encounters a scalability bottleneck with respect to the number of clusters K as this number grows in big data applications. In this work, we promote a closely related model called K-indicators model and construct an efficient, semi-convex-relaxation algorithm that requires no randomized initializations. We present extensive empirical results to show advantages of the new algorithm when K is large. In particular, using the new algorithm to start the K-means algorithm, without any replication, can significantly outperform the standard K-means with a large number of currently state-of-the-art random replications.

Mining YouTube – A dataset for learning fine-grained action concepts from webly supervised video data

Action recognition is so far mainly focusing on the problem of classification of hand selected preclipped actions and reaching impressive results in this field. But with the performance even ceiling on current datasets, it also appears that the next steps in the field will have to go beyond this fully supervised classification. One way to overcome those problems is to move towards less restricted scenarios. In this context we present a large-scale real-world dataset designed to evaluate learning techniques for human action recognition beyond hand-crafted datasets. To this end we put the process of collecting data on its feet again and start with the annotation of a test set of 250 cooking videos. The training data is then gathered by searching for the respective annotated classes within the subtitles of freely available videos. The uniqueness of the dataset is attributed to the fact that the whole process of collecting the data and training does not involve any human intervention. To address the problem of semantic inconsistencies that arise with this kind of training data, we further propose a semantical hierarchical structure for the mined classes.

Terminal Brain Damage: Exposing the Graceless Degradation in Deep Neural Networks Under Hardware Fault Attacks

Deep neural networks (DNNs) have been shown to tolerate ‘brain damage’: cumulative changes to the network’s parameters (e.g., pruning, numerical perturbations) typically result in a graceful degradation of classification accuracy. However, the limits of this natural resilience are not well understood in the presence of small adversarial changes to the DNN parameters’ underlying memory representation, such as bit-flips that may be induced by hardware fault attacks. We study the effects of bitwise corruptions on 19 DNN models—six architectures on three image classification tasks—and we show that most models have at least one parameter that, after a specific bit-flip in their bitwise representation, causes an accuracy loss of over 90%. We employ simple heuristics to efficiently identify the parameters likely to be vulnerable. We estimate that 40-50% of the parameters in a model might lead to an accuracy drop greater than 10% when individually subjected to such single-bit perturbations. To demonstrate how an adversary could take advantage of this vulnerability, we study the impact of an exemplary hardware fault attack, Rowhammer, on DNNs. Specifically, we show that a Rowhammer enabled attacker co-located in the same physical machine can inflict significant accuracy drops (up to 99%) even with single bit-flip corruptions and no knowledge of the model. Our results expose the limits of DNNs’ resilience against parameter perturbations induced by real-world fault attacks. We conclude by discussing possible mitigations and future research directions towards fault attack-resilient DNNs.

NodeDrop: A Condition for Reducing Network Size without Effect on Output

Determining an appropriate number of features for each layer in a neural network is an important and difficult task. This task is especially important in applications on systems with limited memory or processing power. Many current approaches to reduce network size either utilize iterative procedures, which can extend training time significantly, or require very careful tuning of algorithm parameters to achieve reasonable results. In this paper we propose NodeDrop, a new method for eliminating features in a network. With NodeDrop, we define a condition to identify and guarantee which nodes carry no information, and then use regularization to encourage nodes to meet this condition. We find that NodeDrop drastically reduces the number of features in a network while maintaining high performance, reducing the number of parameters by a factor of 114x for a VGG like network on CIFAR10 without a drop in accuracy.

A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation

Action recognition has become a rapidly developing research field within the last decade. But with the increasing demand for large scale data, the need of hand annotated data for the training becomes more and more impractical. One way to avoid frame-based human annotation is the use of action order information to learn the respective action classes. In this context, we propose a hierarchical approach to address the problem of weakly supervised learning of human actions from ordered action labels by structuring recognition in a coarse-to-fine manner. Given a set of videos and an ordered list of the occurring actions, the task is to infer start and end frames of the related action classes within the video and to train the respective action classifiers without any need for hand labeled frame boundaries. We address this problem by combining a framewise RNN model with a coarse probabilistic inference. This combination allows for the temporal alignment of long sequences and thus, for an iterative training of both elements. While this system alone already generates good results, we show that the performance can be further improved by approximating the number of subactions to the characteristics of the different action classes as well as by the introduction of a regularizing length prior. The proposed system is evaluated on two benchmark datasets, the Breakfast and the Hollywood extended dataset, showing a competitive performance on various weak learning tasks such as temporal action segmentation and action alignment.

Correctness Verification of Neural Networks

We present the first verification that a neural network produces a correct output within a specified tolerance for every input of interest. We define correctness relative to a specification which identifies 1) a state space consisting of all relevant states of the world and 2) an observation process that produces neural network inputs from the states of the world. Tiling the state and input spaces with a finite number of tiles, obtaining ground truth bounds from the state tiles and network output bounds from the input tiles, then comparing the ground truth and network output bounds delivers an upper bound on the network output error for any input of interest. Results from a case study highlight the ability of our technique to deliver tight error bounds for all inputs of interest and show how the error bounds vary over the state and input spaces.

Transforming Complex Sentences into a Semantic Hierarchy

We present an approach for recursively splitting and rephrasing complex English sentences into a novel semantic hierarchy of simplified sentences, with each of them presenting a more regular structure that may facilitate a wide variety of artificial intelligence tasks, such as machine translation (MT) or information extraction (IE). Using a set of hand-crafted transformation rules, input sentences are recursively transformed into a two-layered hierarchical representation in the form of core sentences and accompanying contexts that are linked via rhetorical relations. In this way, the semantic relationship of the decomposed constituents is preserved in the output, maintaining its interpretability for downstream applications. Both a thorough manual analysis and automatic evaluation across three datasets from two different domains demonstrate that the proposed syntactic simplification approach outperforms the state of the art in structural text simplification. Moreover, an extrinsic evaluation shows that when applying our framework as a preprocessing step the performance of state-of-the-art Open IE systems can be improved by up to 346% in precision and 52% in recall. To enable reproducible research, all code is provided online.

Neural networks grown and self-organized by noise

Living neural networks emerge through a process of growth and self-organization that begins with a single cell and results in a brain, an organized and functional computational device. Artificial neural networks, however, rely on human-designed, hand-programmed architectures for their remarkable performance. Can we develop artificial computational devices that can grow and self-organize without human intervention? In this paper, we propose a biologically inspired developmental algorithm that can ‘grow’ a functional, layered neural network from a single initial cell. The algorithm organizes inter-layer connections to construct a convolutional pooling layer, a key constituent of convolutional neural networks (CNN’s). Our approach is inspired by the mechanisms employed by the early visual system to wire the retina to the lateral geniculate nucleus (LGN), days before animals open their eyes. The key ingredients for robust self-organization are an emergent spontaneous spatiotemporal activity wave in the first layer and a local learning rule in the second layer that ‘learns’ the underlying activity pattern in the first layer. The algorithm is adaptable to a wide-range of input-layer geometries, robust to malfunctioning units in the first layer, and so can be used to successfully grow and self-organize pooling architectures of different pool-sizes and shapes. The algorithm provides a primitive procedure for constructing layered neural networks through growth and self-organization. Broadly, our work shows that biologically inspired developmental algorithms can be applied to autonomously grow functional ‘brains’ in-silico.

Episodic Memory in Lifelong Language Learning

We introduce a lifelong language learning setup where a model needs to learn from a stream of text examples without any dataset identifier. We propose an episodic memory model that performs sparse experience replay and local adaptation to mitigate catastrophic forgetting in this setup. Experiments on text classification and question answering demonstrate the complementary benefits of sparse experience replay and local adaptation to allow the model to continuously learn from new datasets. We also show that the space complexity of the episodic memory module can be reduced significantly (~50-90%) by randomly choosing which examples to store in memory with a minimal decrease in performance. We consider an episodic memory component as a crucial building block of general linguistic intelligence and see our model as a first step in that direction.
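
The random write policy described above can be sketched with classic reservoir sampling, which keeps a uniform random subset of an unbounded stream in fixed memory. This is a generic illustration of the idea, not the paper's implementation; the class name and API are invented for the sketch:

```python
import random

class EpisodicMemory:
    """Fixed-capacity memory holding a uniform random sample of a stream
    (reservoir sampling) - the 'randomly choose which examples to store'
    policy for sparse experience replay."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def write(self, example):
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            # Keep the new example with probability capacity / n_seen,
            # evicting a uniformly chosen stored example.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, k):
        # Draw a mini-batch for sparse experience replay.
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

memory = EpisodicMemory(capacity=100)
for i in range(10_000):          # stream of 10k training examples
    memory.write(i)
replay_batch = memory.sample(16)
```

After the loop, the buffer holds 100 examples that are (provably) a uniform sample of all 10,000 seen so far, which is what lets the memory footprint shrink by 50-90% with little performance loss.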

We propose a dyadic Item Response Theory (dIRT) model for measuring interactions of pairs of individuals when the responses to items represent the actions (or behaviors, perceptions, etc.) of each individual (actor) made within the context of a dyad formed with another individual (partner). Examples of its use include the assessment of collaborative problem solving, or the evaluation of intra-team dynamics. The dIRT model generalizes both Item Response Theory (IRT) models for measurement and the Social Relations Model (SRM) for dyadic data. The responses of an actor when paired with a partner are modeled as a function of not only the actor’s inclination to act and the partner’s tendency to elicit that action, but also the unique relationship of the pair, represented by two directional, possibly correlated, interaction latent variables. Generalizations are discussed, such as accommodating triads or larger groups. Estimation is performed using Markov-chain Monte Carlo implemented in Stan, making it straightforward to extend the dIRT model in various ways. Specifically, we show how the basic dIRT model can be extended to accommodate latent regressions, multilevel settings with cluster-level random effects, as well as joint modeling of dyadic data and a distal outcome. A simulation study demonstrates that estimation performs well. We apply our proposed approach to speed-dating data and find new evidence of pairwise interactions between participants, describing a mutual attraction that is inadequately characterized by individual properties alone.

MEMe: An Accurate Maximum Entropy Method for Efficient Approximations in Large-Scale Machine Learning

Efficient approximation lies at the heart of large-scale machine learning problems. In this paper, we propose a novel, robust maximum entropy algorithm, which is capable of dealing with hundreds of moments and allows for computationally efficient approximations. We showcase the usefulness of the proposed method, its equivalence to constrained Bayesian variational inference and demonstrate its superiority over existing approaches in two applications, namely, fast log determinant estimation and information-theoretic Bayesian optimisation.

Random Path Selection for Incremental Learning

Incremental lifelong learning is a central challenge on the path to the long-standing goal of Artificial General Intelligence. In real-life settings, learning tasks arrive in a sequence and machine learning models must continually learn to increment already acquired knowledge. Existing incremental learning approaches fall well below the state-of-the-art cumulative models that use all training classes at once. In this paper, we propose a random path selection algorithm, called RPSnet, that progressively chooses optimal paths for the new tasks while encouraging parameter sharing and reuse. Our approach avoids the overhead introduced by computationally expensive evolutionary and reinforcement learning based path selection strategies while achieving considerable performance gains. As an added novelty, the proposed model integrates knowledge distillation and retrospection along with the path selection strategy to overcome catastrophic forgetting. In order to maintain an equilibrium between previous and newly acquired knowledge, we propose a simple controller to dynamically balance the model plasticity. Through extensive experiments, we demonstrate that the proposed method surpasses state-of-the-art performance on incremental learning and, by utilizing parallel computation, can run in constant time with nearly the same efficiency as a conventional deep convolutional neural network.

A Case for Backward Compatibility for Human-AI Teams

AI systems are being deployed to support human decision making in high-stakes domains. In many cases, the human and AI form a team, in which the human makes decisions after reviewing the AI’s inferences. A successful partnership requires that the human develops insights into the performance of the AI system, including its failures. We study the influence of updates to an AI system in this setting. While updates can increase the AI’s predictive performance, they may also lead to changes that are at odds with the user’s prior experiences and confidence in the AI’s inferences, thereby hurting overall team performance. We introduce the notion of the compatibility of an AI update with prior user experience and present methods for studying the role of compatibility in human-AI teams. Empirical results on three high-stakes domains show that current machine learning algorithms do not produce compatible updates. We propose a re-training objective to improve the compatibility of an update by penalizing new errors. The objective offers full leverage of the performance/compatibility tradeoff, enabling more compatible yet accurate updates.

Towards Fair and Decentralized Privacy-Preserving Deep Learning with Blockchain

In collaborative deep learning, current learning frameworks follow either a centralized architecture or a distributed architecture. While a centralized architecture deploys a central server to train a global model over the massive amount of joint data from all parties, a distributed architecture aggregates parameter updates from the participating parties’ local model training via a parameter server. These two server-based architectures present security and robustness vulnerabilities such as a single point of failure, a single point of breach, privacy leakage, and lack of fairness. To address these problems, we design, implement, and evaluate a purely decentralized privacy-preserving deep learning framework, called DPPDL. DPPDL makes the first investigation of the research problem of fairness in collaborative deep learning, and simultaneously provides fairness and privacy through two novel algorithms: initial benchmarking and privacy-preserving collaborative deep learning. During initial benchmarking, each party trains a local Differentially Private Generative Adversarial Network (DPGAN) and publishes the generated privacy-preserving artificial samples for the other parties to label; the quality of these labels is then used to initialize a local credibility list for the other parties. The local credibility list reflects how much one party contributes to another, and it is used and updated during collaborative learning to ensure fairness. To protect gradient transactions during privacy-preserving collaborative deep learning, we further put forward a three-layer onion-style encryption scheme. We experimentally demonstrate, on benchmark image datasets, that accuracy, privacy and fairness in collaborative deep learning can be effectively addressed at the same time by our proposed DPPDL framework. Moreover, DPPDL provides a viable solution to detect and isolate the cheating party in the system.

Back Attention Knowledge Transfer for Low-resource Named Entity Recognition

In recent years, great success has been achieved in the field of natural language processing (NLP), thanks in part to the considerable amount of annotated resources. For named entity recognition (NER), most languages do not have such an abundance of labeled data, so performance in those languages is comparatively lower. To improve the performance, we propose a general approach called Back Attention Network (BAN). BAN uses a translation system to translate sentences from other languages into English and utilizes a pre-trained English NER model to extract task-specific information. After that, BAN applies a new mechanism named back attention knowledge transfer to improve the semantic representation, which aids generation of the final result. Experiments on three different language datasets indicate that our approach outperforms other state-of-the-art methods.

Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs

The recent proliferation of knowledge graphs (KGs) coupled with incomplete or partial information, in the form of missing relations (links) between entities, has fueled a lot of research on knowledge base completion (also known as relation prediction). Several recent works suggest that convolutional neural network (CNN) based models generate richer and more expressive feature embeddings and hence also perform well on relation prediction. However, we observe that these KG embeddings treat triples independently and thus fail to cover the complex and hidden information that is inherently implicit in the local neighborhood surrounding a triple. To this end, our paper proposes a novel attention-based feature embedding that captures both entity and relation features in any given entity’s neighborhood. Additionally, we also encapsulate relation clusters and multi-hop relations in our model. Our empirical study offers insights into the efficacy of our attention-based model, and we show marked performance gains in comparison to state-of-the-art methods on all datasets.

Robust Mean Estimation with the Bayesian Median of Means

The sample mean is often used to aggregate different unbiased estimates of a parameter, producing a final estimate that is unbiased but possibly high-variance. This paper introduces the Bayesian median of means, an aggregation rule that roughly interpolates between the sample mean and median, resulting in estimates with much smaller variance at the expense of bias. While the procedure is non-parametric, its squared bias is asymptotically negligible relative to the variance, similar to maximum likelihood estimators. The Bayesian median of means is consistent, and concentration bounds for the estimator’s bias and $L_1$ error are derived, as well as a fast non-randomized approximating algorithm. The performance of both the exact and the approximate procedures matches that of the sample mean in low-variance settings and is much better in high-variance scenarios. Empirical performance is examined in real and simulated data, and in applications such as importance sampling, cross-validation and bagging.
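
As a point of reference for the interpolation described above, here is the classic (non-Bayesian) median-of-means estimator, one endpoint of that interpolation. A minimal sketch; the function name and block count are illustrative choices:

```python
import random
import statistics

def median_of_means(samples, n_blocks):
    """Classic median-of-means: split the samples into blocks, average each
    block, and return the median of the block means. Robust to heavy tails
    at the cost of some bias."""
    block_size = len(samples) // n_blocks
    means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]
    return statistics.median(means)

rng = random.Random(42)
# Heavy-tailed data: mostly N(0, 1) with occasional large outliers.
data = [rng.gauss(0, 1) if rng.random() < 0.99 else rng.gauss(0, 50)
        for _ in range(10_000)]
print(statistics.fmean(data))       # plain sample mean: high variance
print(median_of_means(data, 20))    # much more stable around the true mean 0
```

The median step discards the influence of the few blocks contaminated by outliers, which is exactly the variance-for-bias trade the abstract describes.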

Attributed Graph Clustering via Adaptive Graph Convolution

Attributed graph clustering is challenging as it requires joint modelling of graph structures and node attributes. Recent progress on graph convolutional networks has proved that graph convolution is effective in combining structural and content information, and several recent methods based on it have achieved promising clustering performance on some real attributed networks. However, there is limited understanding of how graph convolution affects clustering performance and how to properly use it to optimize performance for different graphs. Existing methods essentially use graph convolution of a fixed and low order that only takes into account neighbours within a few hops of each node, which underutilizes node relations and ignores the diversity of graphs. In this paper, we propose an adaptive graph convolution method for attributed graph clustering that exploits high-order graph convolution to capture global cluster structure and adaptively selects the appropriate order for different graphs. We establish the validity of our method by theoretical analysis and extensive experiments on benchmark datasets. Empirical results show that our method compares favourably with state-of-the-art methods.

Progressive Self-Supervised Attention Learning for Aspect-Level Sentiment Analysis

In aspect-level sentiment classification (ASC), it is prevalent to equip dominant neural models with attention mechanisms, for the sake of acquiring the importance of each context word on the given aspect. However, such a mechanism tends to excessively focus on a few frequent words with sentiment polarities, while ignoring infrequent ones. In this paper, we propose a progressive self-supervised attention learning approach for neural ASC models, which automatically mines useful attention supervision information from a training corpus to refine attention mechanisms. Specifically, we iteratively conduct sentiment predictions on all training instances. Particularly, at each iteration, the context word with the maximum attention weight is extracted as the one with active/misleading influence on the correct/incorrect prediction of every instance, and then the word itself is masked for subsequent iterations. Finally, we augment the conventional training objective with a regularization term, which enables ASC models to continue equally focusing on the extracted active context words while decreasing weights of those misleading ones. Experimental results on multiple datasets show that our proposed approach yields better attention mechanisms, leading to substantial improvements over the two state-of-the-art neural ASC models. Source code and trained models are available at https://…/PSSAttention.

Toward Building Conversational Recommender Systems: A Contextual Bandit Approach

Contextual bandit algorithms have gained increasing popularity in recommender systems because they can learn to adapt recommendations by making an exploration-exploitation trade-off. Recommender systems equipped with traditional contextual bandit algorithms are usually trained with behavioral feedback (e.g., clicks) from users on items. The learning speed can be slow because behavioral feedback by nature does not carry sufficient information, so extensive exploration has to be performed. To address the problem, we propose conversational recommendation, in which the system occasionally asks the user questions about her interests. We first generalize contextual bandits to leverage not only behavioral feedback (arm-level feedback) but also verbal feedback (users’ interest in categories, topics, etc.). We then propose a new UCB-based algorithm and theoretically prove that it can indeed reduce the amount of exploration needed in learning. We also design several strategies for asking questions to further optimize the speed of learning. Experiments on synthetic data, Yelp data, and news recommendation data from Toutiao demonstrate the efficacy of the proposed algorithm.
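
For readers unfamiliar with the arm-level baseline being generalized, a standard UCB1 loop looks like the following. This is the textbook algorithm, not the paper's conversational UCB variant, and all names are illustrative:

```python
import math
import random

def ucb1(n_arms, pull, horizon):
    """Textbook UCB1: play each arm once, then always pick the arm
    maximizing empirical mean + sqrt(2 ln t / n_pulls). The bonus term
    is the 'exploration' that conversational feedback aims to reduce."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1          # initialization: try every arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total_reward += r
    return counts, total_reward

rng = random.Random(1)
true_means = [0.2, 0.5, 0.8]     # hypothetical click-through rates
counts, reward = ucb1(
    3, lambda a: 1.0 if rng.random() < true_means[a] else 0.0, 5_000
)
# The best arm (index 2) ends up with the vast majority of pulls.
```

The paper's conversational setting augments this loop with occasional category-level questions whose answers shrink the uncertainty bonus faster than clicks alone.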

An Efficient Graph Convolutional Network Technique for the Travelling Salesman Problem

This paper introduces a new learning-based approach for approximately solving the Travelling Salesman Problem on 2D Euclidean graphs. We use deep Graph Convolutional Networks to build efficient TSP graph representations and output tours in a non-autoregressive manner via highly parallelized beam search. Our approach outperforms all recently proposed autoregressive deep learning techniques in terms of solution quality, inference speed and sample efficiency for problem instances of fixed graph sizes. In particular, we reduce the average optimality gap from 0.52% to 0.01% for 50 nodes, and from 2.26% to 1.39% for 100 nodes. Finally, despite improving upon other learning-based approaches for TSP, our approach falls short of standard Operations Research solvers.

An interpretable machine learning framework for modelling human decision behavior

Machine learning has recently been widely adopted to address managerial decision-making problems. However, there is a trade-off between performance and interpretability. Full-complexity models (such as neural network-based models) are non-traceable black boxes, whereas classic interpretable models (such as logistic regression) are usually simplified with lower accuracy. This trade-off limits the application of state-of-the-art machine learning models in management problems, which require high prediction performance as well as an understanding of individual attributes’ contributions to the model outcome. Multiple criteria decision aiding (MCDA) is a family of interpretable approaches to depicting the rationale of human decision behavior, but it is limited by strong assumptions (e.g., preference independence). In this paper, we propose an interpretable machine learning approach, namely Neural Network-based Multiple Criteria Decision Aiding (NN-MCDA), which combines an additive MCDA model and a fully-connected multilayer perceptron (MLP) to achieve good performance while preserving a certain degree of interpretability. NN-MCDA has a linear component (in an additive form of a set of polynomial functions) to capture the detailed relationship between individual attributes and the prediction, and a nonlinear component (in a standard MLP form) to capture the high-order interactions between attributes and their complex nonlinear transformations. We demonstrate the effectiveness of NN-MCDA with extensive simulation studies and two real-world datasets. To the best of our knowledge, this research is the first to enhance the interpretability of machine learning models with MCDA techniques. The proposed framework also sheds light on how to use machine learning techniques to free MCDA from strong assumptions.

Universal Boosting Variational Inference

Boosting variational inference (BVI) approximates an intractable probability density by iteratively building up a mixture of simple component distributions one at a time, using techniques from sparse convex optimization to provide both computational scalability and approximation error guarantees. But the guarantees have strong conditions that do not often hold in practice, resulting in degenerate component optimization problems; and we show that the ad-hoc regularization used to prevent degeneracy in practice can cause BVI to fail in unintuitive ways. We thus develop universal boosting variational inference (UBVI), a BVI scheme that exploits the simple geometry of probability densities under the Hellinger metric to prevent the degeneracy of other gradient-based BVI methods, avoid difficult joint optimizations of both component and weight, and simplify fully-corrective weight optimizations. We show that for any target density and any mixture component family, the output of UBVI converges to the best possible approximation in the mixture family, even when the mixture family is misspecified. We develop a scalable implementation based on exponential family mixture components and standard stochastic optimization techniques. Finally, we discuss statistical benefits of the Hellinger distance as a variational objective through bounds on posterior probability, moment, and importance sampling errors. Experiments on multiple datasets and models show that UBVI provides reliable, accurate posterior approximations.

Kinetic Market Model: An Evolutionary Algorithm

This research proposes the econophysics kinetic market model as an instance of an evolutionary algorithm. The immediate result of this proposal is a new replacement rule for family competition genetic algorithms. It also represents a starting point for adding evolvable entities to kinetic market models.
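
The kinetic market models referred to here can be illustrated with a minimal random-exchange simulation in the style of standard econophysics models: two random agents repeatedly pool and re-split their wealth, conserving the total. A generic sketch of the underlying dynamics, not the paper's evolutionary-algorithm mapping:

```python
import random

def kinetic_exchange(n_agents=1000, steps=100_000, seed=0):
    """Minimal kinetic market model: at each step, two random agents pool
    their wealth and split it by a uniform random fraction. Total wealth
    is conserved while the distribution relaxes toward an exponential
    (Boltzmann-Gibbs) form."""
    rng = random.Random(seed)
    wealth = [1.0] * n_agents        # everyone starts with equal wealth
    for _ in range(steps):
        i, j = rng.randrange(n_agents), rng.randrange(n_agents)
        if i == j:
            continue
        pool = wealth[i] + wealth[j]
        eps = rng.random()           # random split fraction
        wealth[i], wealth[j] = eps * pool, (1 - eps) * pool
    return wealth

w = kinetic_exchange()
# Total wealth stays at n_agents even as individual holdings spread out.
```

The evolutionary-algorithm reading treats each pairwise exchange as a competition step, which is the bridge the paper formalizes.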

A Novel Hyperparameter-free Approach to Decision Tree Construction that Avoids Overfitting by Design

Decision trees are an extremely popular machine learning technique. Unfortunately, overfitting in decision trees still remains an open issue that sometimes prevents achieving good performance. In this work, we present a novel approach for the construction of decision trees that avoids overfitting by design, without losing accuracy. A distinctive feature of our algorithm is that it requires neither the optimization of any hyperparameters nor the use of regularization techniques, thus significantly reducing the decision tree training time. Moreover, our algorithm produces much smaller and shallower trees than traditional algorithms, facilitating the interpretability of the resulting models.

The Extended Dawid-Skene Model: Fusing Information from Multiple Data Schemas

While label fusion from multiple noisy annotations is a well understood concept in data wrangling (tackled for example by the Dawid-Skene (DS) model), we consider the extended problem of carrying out learning when the labels themselves are not consistently annotated with the same schema. We show that even if annotators use disparate, albeit related, label-sets, we can still draw inferences for the underlying full label-set. We propose the Inter-Schema AdapteR (ISAR) to translate the fully-specified label-set to the one used by each annotator, enabling learning under such heterogeneous schemas, without the need to re-annotate the data. We apply our method to a mouse behavioural dataset, achieving significant gains (compared with DS) in out-of-sample log-likelihood (-3.40 to -2.39) and F1-score (0.785 to 0.864).
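
The Dawid-Skene baseline that ISAR extends can be sketched as a small EM loop: alternately estimate per-item label posteriors and per-annotator confusion matrices. This is a textbook-style toy version with a uniform class prior and ad-hoc smoothing, not the extended multi-schema model from the paper:

```python
from collections import Counter

def dawid_skene(annotations, n_classes, n_iters=20):
    """Minimal Dawid-Skene EM for fusing noisy labels.
    annotations: dict item -> {annotator: observed label}.
    Returns a per-item posterior over the true labels."""
    items = list(annotations)
    annotators = {a for labs in annotations.values() for a in labs}
    # Initialize label posteriors from per-item majority vote.
    post = {}
    for it in items:
        counts = Counter(annotations[it].values())
        total = sum(counts.values())
        post[it] = [counts.get(c, 0) / total for c in range(n_classes)]
    for _ in range(n_iters):
        # M-step: estimate each annotator's confusion matrix (with smoothing).
        conf = {a: [[1e-2] * n_classes for _ in range(n_classes)]
                for a in annotators}
        for it in items:
            for a, lab in annotations[it].items():
                for t in range(n_classes):
                    conf[a][t][lab] += post[it][t]
        for a in annotators:
            for t in range(n_classes):
                s = sum(conf[a][t])
                conf[a][t] = [v / s for v in conf[a][t]]
        # E-step: recompute label posteriors (uniform prior for brevity).
        for it in items:
            probs = [1.0] * n_classes
            for a, lab in annotations[it].items():
                for t in range(n_classes):
                    probs[t] *= conf[a][t][lab]
            z = sum(probs)
            post[it] = [p / z for p in probs]
    return post

votes = {
    "x1": {"ann1": 0, "ann2": 0, "ann3": 1},
    "x2": {"ann1": 1, "ann2": 1, "ann3": 1},
    "x3": {"ann1": 0, "ann2": 1, "ann3": 1},
}
fused = dawid_skene(votes, n_classes=2)
```

ISAR's contribution is the extra translation layer that maps a full label-set into each annotator's own schema before this kind of fusion, so disparate label-sets need not be re-annotated.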

Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts

Emotion cause extraction (ECE), the task aimed at extracting the potential causes behind certain emotions in text, has gained much attention in recent years due to its wide applications. However, it suffers from two shortcomings: 1) the emotion must be annotated before cause extraction in ECE, which greatly limits its applications in real-world scenarios; 2) the way of first annotating the emotion and then extracting the cause ignores the fact that they are mutually indicative. In this work, we propose a new task: emotion-cause pair extraction (ECPE), which aims to extract the potential pairs of emotions and corresponding causes in a document. We propose a 2-step approach to address this new ECPE task, which first performs individual emotion extraction and cause extraction via multi-task learning, and then conducts emotion-cause pairing and filtering. The experimental results on a benchmark emotion cause corpus prove the feasibility of the ECPE task as well as the effectiveness of our approach.

Privacy-preserving Crowd-guided AI Decision-making in Ethical Dilemmas

With the rapid development of artificial intelligence (AI), ethical issues surrounding AI have attracted increasing attention. In particular, autonomous vehicles may face moral dilemmas in accident scenarios, such as staying the course and hurting pedestrians, or swerving and hurting passengers. To investigate such ethical dilemmas, recent studies have adopted preference aggregation, in which each voter expresses her/his preferences over decisions for the possible ethical dilemma scenarios, and a centralized system aggregates these preferences to obtain the winning decision. Although a useful methodology for building ethical AI systems, such an approach can potentially violate the privacy of voters, since moral preferences are sensitive information and their disclosure can be exploited by malicious parties. In this paper, we report a first-of-its-kind privacy-preserving crowd-guided AI decision-making approach in ethical dilemmas. We adopt the notion of differential privacy to quantify privacy and consider four granularities of privacy protection by taking voter-/record-level privacy protection and centralized/distributed perturbation into account, resulting in four approaches: VLCP, RLCP, VLDP, and RLDP. Moreover, we propose different algorithms to achieve these privacy protection granularities, while retaining the accuracy of the learned moral preference model. Specifically, VLCP and RLCP are implemented with the data aggregator setting a universal privacy parameter and perturbing the averaged moral preference to protect the privacy of voters’ data. VLDP and RLDP are implemented in such a way that each voter perturbs her/his local moral preference with a personalized privacy parameter. Extensive experiments on both synthetic and real data demonstrate that the proposed approach can achieve high accuracy of preference aggregation while protecting individual voters’ privacy.
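
The centralized-perturbation idea (as in VLCP/RLCP, where the aggregator perturbs an averaged preference) can be illustrated with the standard Laplace mechanism. A generic epsilon-differential-privacy sketch with invented names and data, not the paper's exact algorithms:

```python
import math
import random

def dp_average(values, epsilon, lower, upper, seed=0):
    """Centralized Laplace mechanism: clip each voter's value to
    [lower, upper], average, and add Laplace noise calibrated to the
    average's sensitivity (upper - lower) / n, giving epsilon-DP."""
    rng = random.Random(seed)
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    true_avg = sum(clipped) / n
    sensitivity = (upper - lower) / n   # one voter can shift the mean this much
    scale = sensitivity / epsilon
    u = rng.random() - 0.5              # inverse-CDF Laplace sample
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_avg + noise

# 1000 hypothetical voters' preference scores in [0, 1].
prefs = [0.9, 0.7, 0.8, 0.95, 0.6] * 200
released = dp_average(prefs, epsilon=0.5, lower=0.0, upper=1.0)
```

With many voters the noise scale shrinks as 1/n, which is why the centralized variants can keep aggregation accuracy high while still bounding what the output reveals about any single voter.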

The Secrets of Machine Learning: Ten Things You Wish You Had Known Earlier to be More Effective at Data Analysis

Despite the widespread usage of machine learning throughout organizations, there are some key principles that are commonly missed. In particular:

1. There are at least four main families for supervised learning: logical modeling methods, linear combination methods, case-based reasoning methods, and iterative summarization methods.
2. For many application domains, almost all machine learning methods perform similarly (with some caveats). Deep learning methods, which are the leading technique for computer vision problems, do not maintain an edge over other methods for most problems (and there are reasons why).
3. Neural networks are hard to train, and weird stuff often happens when you try to train them.
4. If you don’t use an interpretable model, you can make bad mistakes.
5. Explanations can be misleading and you can’t trust them.
6. You can pretty much always find an accurate-yet-interpretable model, even for deep neural networks.
7. Special properties such as decision making or robustness must be built in; they don’t happen on their own.
8. Causal inference is different from prediction (correlation is not causation).
9. There is a method to the madness of deep neural architectures, but not always.
10. It is a myth that artificial intelligence can do anything.

CCMI : Classifier based Conditional Mutual Information Estimation

Conditional Mutual Information (CMI) is a measure of conditional dependence between random variables X and Y, given another random variable Z. It can be used to quantify conditional dependence among variables in many data-driven inference problems such as graphical models, causal learning, feature selection and time-series analysis. While k-nearest neighbor (kNN) based estimators as well as kernel-based methods have been widely used for CMI estimation, they suffer severely from the curse of dimensionality. In this paper, we leverage advances in classifiers and generative models to design methods for CMI estimation. Specifically, we introduce an estimator for KL-Divergence based on the likelihood ratio, obtained by training a classifier to distinguish the observed joint distribution from the product distribution. We then show how to construct several CMI estimators using this basic divergence estimator by drawing ideas from conditional generative models. We demonstrate that the estimates from our proposed approaches do not degrade in performance with increasing dimension and obtain significant improvement over the widely used KSG estimator. Finally, as an application of accurate CMI estimation, we use our best estimator for conditional independence testing and outperform the state-of-the-art tester on both simulated and real datasets.
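
The core likelihood-ratio trick can be sketched end to end in a toy setting: label joint samples 1 and permuted (product-distribution) samples 0, fit a tiny logistic regression, and average the learned log-odds over the joint samples; the log-odds approximates log p(x,y)/p(x)p(y), whose expectation is the mutual information. Everything here, from the hand-picked features to the training loop, is an illustrative simplification of the idea, not the paper's CCMI estimator:

```python
import math
import random

def mi_classifier_estimate(pairs, n_epochs=120, lr=0.1, seed=0):
    """Classifier-based MI sketch: distinguish joint samples (x, y) from
    'product' samples (x, shuffled y) with a tiny logistic regression on
    features (x, y, x*y), then average the log-odds over joint samples."""
    rng = random.Random(seed)
    ys = [y for _, y in pairs]
    rng.shuffle(ys)                      # destroys the dependence
    data = [((x, y, x * y), 1.0) for (x, y) in pairs] + \
           [((x, y2, x * y2), 0.0) for (x, _), y2 in zip(pairs, ys)]
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(n_epochs):            # plain batch gradient descent
        gw, gb = [0.0, 0.0, 0.0], 0.0
        for feats, label in data:
            z = sum(wi * f for wi, f in zip(w, feats)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for k in range(3):
                gw[k] += err * feats[k]
            gb += err
        for k in range(3):
            w[k] -= lr * gw[k] / len(data)
        b -= lr * gb / len(data)
    logits = [sum(wi * f for wi, f in zip(w, (x, y, x * y))) + b
              for x, y in pairs]
    return sum(logits) / len(logits)

rng = random.Random(1)
# Strongly dependent pair: y is a noisy copy of x.
dep = [(x := rng.gauss(0, 1), x + 0.1 * rng.gauss(0, 1)) for _ in range(1000)]
# Independent pair for comparison.
ind = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
dep_est = mi_classifier_estimate(dep)    # clearly above zero
ind_est = mi_classifier_estimate(ind)    # close to zero
```

The paper builds CMI estimators by composing this divergence estimator with conditional generative models, and swaps the toy classifier for a high-capacity one so the estimate survives high dimensions.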

### Document worth reading: “Big Data Meet Cyber-Physical Systems: A Panoramic Survey”

The world is witnessing an unprecedented growth of cyber-physical systems (CPS), which are foreseen to revolutionize our world by creating new services and applications in a variety of sectors such as environmental monitoring, mobile-health systems, intelligent transportation systems and so on. The information and communication technology (ICT) sector is experiencing significant growth in data traffic, driven by the widespread usage of smartphones, tablets and video streaming, along with the significant growth in sensor deployments anticipated in the near future. This is expected to markedly increase the growth rate of raw sensed data. In this paper, we present the CPS taxonomy by providing a broad overview of data collection, storage, access, processing and analysis. Compared with other survey papers, this is the first panoramic survey on big data for CPS, where our objective is to provide a panoramic summary of different CPS aspects. Furthermore, CPS require cybersecurity to protect them against malicious attacks and unauthorized intrusion, which becomes a challenge with the enormous amount of data that is continuously being generated in the network. Thus, we also provide an overview of the different security solutions proposed for CPS big data storage, access and analytics. We also discuss big data meeting green challenges in the context of CPS. Big Data Meet Cyber-Physical Systems: A Panoramic Survey

### What’s new on arXiv

Since 1950, when Alan Turing proposed what has since come to be called the Turing test, the ability of a machine to pass this test has established itself as the primary hallmark of general AI. To pass the test, a machine would have to be able to engage in dialogue in such a way that a human interrogator could not distinguish its behaviour from that of a human being. AI researchers have attempted to build machines that could meet this requirement, but they have so far failed. To pass the test, a machine would have to meet two conditions: (i) react appropriately to the variance in human dialogue and (ii) display a human-like personality and intentions. We argue, first, that it is impossible, for mathematical reasons, to program a machine that can master the enormously complex and constantly evolving pattern of variance which human dialogues contain; and second, that we do not know how to make machines that possess personality and intentions of the sort we find in humans. Since a Turing machine cannot master human dialogue behaviour, we conclude that a Turing machine also cannot possess what is called “general” Artificial Intelligence. We do, however, acknowledge the potential of Turing machines to master dialogue behaviour in highly restricted contexts, where what is called “narrow” AI can still be of considerable utility.

Open Neural Network Exchange (ONNX) is an open format to represent AI models and is supported by many machine learning frameworks. While ONNX defines unified and portable computation operators across various frameworks, the conformance tests for those operators are insufficient, which makes it difficult to verify whether an operator’s behavior in an ONNX backend implementation complies with the ONNX standard. In this paper, we present the first automatic unit test generator, named Sionnx, for verifying the compliance of ONNX implementations. First, we propose a compact yet complete set of rules to describe the operator’s attributes and the properties of its operands. Second, we design an Operator Specification Language (OSL) to provide a high-level description of the operator’s syntax. Finally, through this easy-to-use specification language, we are able to build a full testing specification which leverages LLVM TableGen to automatically generate unit tests for ONNX operators with much larger coverage. Sionnx is lightweight and flexible enough to support cross-framework verification. The Sionnx framework is open-sourced in the github repository (https://…/Sionnx).

Few-shot learning and self-supervised learning address different facets of the same problem: how to train a model with little or no labeled data. Few-shot learning aims for optimization methods and models that can learn efficiently to recognize patterns in the low data regime. Self-supervised learning focuses instead on unlabeled data and looks into it for the supervisory signal to feed high capacity deep neural networks. In this work we exploit the complementarity of these two domains and propose an approach for improving few-shot learning through self-supervision. We use self-supervision as an auxiliary task in a few-shot learning pipeline, enabling feature extractors to learn richer and more transferable visual representations while still using few annotated samples. Through self-supervision, our approach can be naturally extended towards using diverse unlabeled data from other datasets in the few-shot setting. We report consistent improvements across an array of architectures, datasets and self-supervision techniques.

Time series are ubiquitous in real world problems and computing the distance between two time series is often required in several learning tasks. Computing similarity between time series while ignoring variations in speed, or warping, is often encountered, and dynamic time warping (DTW) is the state of the art. However, DTW is not applicable in algorithms which require kernels or vectors. In this paper, we propose a mechanism named WaRTEm to generate vector embeddings of time series such that distance measures in the embedding space exhibit resilience to warping. Therefore, WaRTEm is more widely applicable than DTW. WaRTEm is based on a twin auto-encoder architecture and a training strategy involving warping operators for generating warping resilient embeddings for time series datasets. We evaluated the performance of WaRTEm and observed more than $20\%$ improvement over DTW in multiple real-world datasets.
Multi-hop reading comprehension requires the model to explore and connect relevant information from multiple sentences/documents in order to answer the question about the context. To achieve this, we propose an interpretable 3-module system called Explore-Propose-Assemble reader (EPAr). First, the Document Explorer iteratively selects relevant documents and represents divergent reasoning chains in a tree structure so as to allow assimilating information from all chains. The Answer Proposer then proposes an answer from every root-to-leaf path in the reasoning tree. Finally, the Evidence Assembler extracts a key sentence containing the proposed answer from every path and combines them to predict the final answer. Intuitively, EPAr approximates the coarse-to-fine-grained comprehension behavior of human readers when facing multiple long documents. We jointly optimize our 3 modules by minimizing the sum of losses from each stage conditioned on the previous stage’s output. On two multi-hop reading comprehension datasets WikiHop and MedHop, our EPAr model achieves significant improvements over the baseline and competitive results compared to the state-of-the-art model. We also present multiple reasoning-chain-recovery tests and ablation studies to demonstrate our system’s ability to perform interpretable and accurate reasoning.
In this paper, we introduce a new extension of Singular Spectrum Analysis (SSA) called functional SSA to analyze functional time series. The new methodology is developed by integrating ideas from functional data analysis and univariate SSA. We explore the advantages of functional SSA in terms of simulation results and with an application to call center data. We compare the proposed approach with Multivariate SSA (MSSA) and Functional Principal Component Analysis (FPCA). The results suggest that further improvement to MSSA is possible and the new method provides an attractive alternative to the novel extensions of FPCA for correlated functions. We have also developed an efficient and user-friendly R package and a shiny web application to allow interactive exploration of the results.
By bringing together code, text, and examples, Jupyter notebooks have become one of the most popular means to produce scientific results in a productive and reproducible way. As many notebook authors are experts in their scientific fields, but laymen with respect to software engineering, one may ask questions about the quality of notebooks and their code. In a preliminary study, we experimentally demonstrate that Jupyter notebooks are inundated with poor quality code, e.g., code not respecting recommended coding practices, or containing unused variables and deprecated functions. Considering the educational nature of Jupyter notebooks, these poor coding practices, as well as the lack of quality control, might be propagated into the next generation of developers. Hence, we argue that there is a strong need to programmatically analyze Jupyter notebooks, calling on our community to pay more attention to the reliability of Jupyter notebooks.
This thesis presents new methods for unsupervised learning of distributed representations of words and entities from text and knowledge bases. The first algorithm presented in the thesis is a multi-view algorithm for learning representations of words called Multiview Latent Semantic Analysis (MVLSA). By incorporating up to 46 different types of co-occurrence statistics for the same vocabulary of English words, I show that MVLSA outperforms other state-of-the-art word embedding models. Next, I focus on learning entity representations for search and recommendation and present the second method of this thesis, Neural Variational Set Expansion (NVSE). NVSE is also an unsupervised learning method, but it is based on the Variational Autoencoder framework. Evaluations with human annotators show that NVSE can facilitate better search and recommendation of information gathered from noisy, automatic annotation of unstructured natural language corpora. Finally, I move from unstructured data and focus on structured knowledge graphs. I present novel approaches for learning embeddings of vertices and edges in a knowledge graph that obey logical constraints.
We introduce Gluon Time Series (GluonTS)\footnote{\url{https://gluon-ts.mxnet.io}}, a library for deep-learning-based time series modeling. GluonTS simplifies the development of and experimentation with time series models for common tasks such as forecasting or anomaly detection. It provides all necessary components and tools that scientists need for quickly building new models, for efficiently running and analyzing experiments and for evaluating model accuracy.
A series of recent works suggest that deep neural networks (DNNs), of fixed depth, are equivalent to certain Gaussian Processes (NNGP/NTK) in the highly over-parameterized regime (width or number-of-channels going to infinity). Other works suggest that this limit is relevant for real-world DNNs. These results invite further study into the generalization properties of Gaussian Processes of the NNGP and NTK type. Here we make several contributions along this line. First, we develop a formalism, based on field theory tools, for calculating learning curves perturbatively in one over the dataset size. For the case of NNGPs, this formalism naturally extends to finite width corrections. Second, in cases where one can diagonalize the covariance-function of the NNGP/NTK, we provide analytic expressions for the asymptotic learning curves of any given target function. These go beyond the standard equivalence kernel results. Last, we provide closed analytic expressions for the eigenvalues of NNGP/NTK kernels of depth 2 fully-connected ReLU networks. For datasets on the hypersphere, the eigenfunctions of such kernels, at any depth, are hyperspherical harmonics. A simple coherent picture emerges wherein fully-connected DNNs have a strong entropic bias towards functions which are low order polynomials of the input.
We present the first comprehensive study on automatic knowledge base construction for two prevalent commonsense knowledge graphs: ATOMIC (Sap et al., 2019) and ConceptNet (Speer et al., 2017). Contrary to many conventional KBs that store knowledge with canonical templates, commonsense KBs only store loosely structured open-text descriptions of knowledge. We posit that an important step toward automatic commonsense completion is the development of generative models of commonsense knowledge, and propose COMmonsEnse Transformers (COMET) that learn to generate rich and diverse commonsense descriptions in natural language. Despite the challenges of commonsense modeling, our investigation reveals promising results when implicit knowledge from deep pre-trained language models is transferred to generate explicit knowledge in commonsense knowledge graphs. Empirical results demonstrate that COMET is able to generate novel knowledge that humans rate as high quality, with up to 77.5% (ATOMIC) and 91.7% (ConceptNet) precision at top 1, which approaches human performance for these resources. Our findings suggest that using generative commonsense models for automatic commonsense KB completion could soon be a plausible alternative to extractive methods.
We present pairwise metrics of fairness for ranking and regression models that form analogues of statistical fairness notions such as equal opportunity or equal accuracy, as well as statistical parity. Our pairwise formulation supports both discrete protected groups, and continuous protected attributes. We show that the resulting training problems can be efficiently and effectively solved using constrained optimization and robust optimization techniques based on two player game algorithms developed for fair classification. Experiments illustrate the broad applicability and trade-offs of these methods.
In many applications, such as classification of images or videos, it is of interest to develop a framework for tensor data instead of ad-hoc ways of transforming data to vectors, due to computational and under-sampling issues. In this paper, we study canonical correlation analysis by extending the framework of two-dimensional analysis (Lee and Choi, 2007) to tensor-valued data. Instead of adopting the iterative algorithm provided in Lee and Choi (2007), we propose an efficient algorithm, called the higher-order power method, which is commonly used in tensor decomposition and more efficient for large-scale settings. Moreover, we carefully examine the theoretical properties of our algorithm and establish a local convergence property via the theory of Lojasiewicz’s inequalities. Our results fill a missing, but crucial, part in the literature on tensor data. For practical applications, we further develop (a) an inexact updating scheme which allows us to use the state-of-the-art stochastic gradient descent algorithm, (b) an effective initialization scheme which alleviates the problem of local optima in non-convex optimization, and (c) an extension for extracting several canonical components. Empirical analyses on challenging data including gene expression, air pollution indexes in Taiwan, and electricity demand in Australia show the effectiveness and efficiency of the proposed methodology.
The dynamic time scan forecasting method relies on the premise that the most important pattern in a time series precedes the forecasting window, i.e., the last observed values. Thus, a scan procedure is applied to identify similar patterns, or best matches, throughout the time series. As opposed to the Euclidean distance, or any fixed distance function, a similarity function is dynamically estimated in order to match previous values to the last observed values. Goodness-of-fit statistics are used to find the best matches. Using the respective similarity functions, the observed values following the best matches are used to create a forecasting pattern, as well as forecasting intervals. Remarkably, the proposed method outperformed statistical and machine learning approaches in a real case wind speed forecasting problem.
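The scan-and-map idea above can be sketched in a few lines of Python with NumPy. This is an illustrative reconstruction, not the authors' exact method: for each historical window we fit an affine similarity a·window + b to the last w observations by least squares, keep the best fit, and apply that map to the values that followed the best match.

```python
import numpy as np

def time_scan_forecast(y, w, h):
    """Sketch of a dynamic-time-scan-style forecast (details assumed).
    y: 1-D history; w: matching-window length; h: forecast horizon."""
    y = np.asarray(y, dtype=float)
    target = y[-w:]                       # the last observed values
    best_sse, best_fit = np.inf, None
    for s in range(len(y) - w - h):
        win = y[s:s + w]
        X = np.column_stack([win, np.ones(w)])        # design matrix for a, b
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        sse = float(((X @ coef - target) ** 2).sum())  # goodness of fit
        if sse < best_sse:
            best_sse, best_fit = sse, (coef, s)
    (a, b), s = best_fit
    # Map the h values that followed the best match through the fitted similarity.
    return a * y[s + w:s + w + h] + b
```

On a roughly periodic series, the best match tends to fall one period back, so the forecast replays (a rescaled copy of) the following cycle.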
Deep linear networks do not have expressive power, but they are mathematically tractable. In our work, we found an architecture in which they are expressive. This paper presents Linear Distillation Learning (LDL), a simple remedy to improve the performance of linear networks through distillation. In deep learning models, distillation often allows a smaller/shallower network to mimic a larger model much more accurately, while a network of the same size trained on the one-hot targets cannot achieve comparable results to the cumbersome model. In our method, we train students to distill the teacher separately for each class in the dataset. The most striking result to emerge from the data is that neural networks without activation functions can achieve high classification scores on a small amount of data on the MNIST and Omniglot datasets. Due to their tractability, linear networks can be used to explain some phenomena observed experimentally in deep non-linear networks. The suggested approach could become a simple and practical instrument while further studies in the field of linear networks and distillation are yet to be undertaken.
Space partitioning methods such as random forests and the Mondrian process are powerful machine learning methods for multi-dimensional and relational data, and are based on recursively cutting a domain. The flexibility of these methods is often limited by the requirement that the cuts be axis aligned. The Ostomachion process and the self-consistent binary space partitioning-tree process were recently introduced as generalizations of the Mondrian process for space partitioning with non-axis aligned cuts in the two dimensional plane. Motivated by the need for a multi-dimensional partitioning tree with non-axis aligned cuts, we propose the Random Tessellation Process (RTP), a framework that includes the Mondrian process and the binary space partitioning-tree process as special cases. We derive a sequential Monte Carlo algorithm for inference, and provide random forest methods. Our process is self-consistent and can relax axis-aligned constraints, allowing complex inter-dimensional dependence to be captured. We present a simulation study, and analyse gene expression data of brain tissue, showing improved accuracies over other methods.
Manifold optimization is ubiquitous in computational and applied mathematics, statistics, engineering, machine learning, physics, and chemistry. One of the main challenges is usually the non-convexity of the manifold constraints. By utilizing the geometry of the manifold, a large class of constrained optimization problems can be viewed as unconstrained optimization problems on a manifold. From this perspective, intrinsic structures, optimality conditions and numerical algorithms for manifold optimization are investigated. Some recent progress on the theoretical results of manifold optimization is also presented.
We investigate the sets of joint probability distributions that maximize the average multi-information over a collection of margins. These functionals serve as proxies for maximizing the multi-information of a set of variables or the mutual information of two subsets of variables, at a lower computation and estimation complexity. We describe the maximizers and their relations to the maximizers of the multi-information and the mutual information.
Spatio-temporal event data is ubiquitous in various applications, such as social media, crime events, and electronic health records. Spatio-temporal point processes offer a versatile framework for modeling such event data, as they can jointly capture spatial and temporal dependency. A key question is to estimate the generative model for such point processes, which enables subsequent machine learning tasks. Existing works mainly focus on parametric models for the conditional intensity function, such as the widely used multi-dimensional Hawkes processes. However, parametric models tend to lack flexibility in tackling real data. On the other hand, non-parametric models for spatio-temporal point processes tend to be less interpretable. We introduce a novel and flexible semi-parametric spatio-temporal point process model, by combining spatial statistical models based on heterogeneous Gaussian mixture diffusion kernels, whose parameters are represented using neural networks. We learn the model using a reinforcement learning framework, where the reward function is defined via the maximum mean discrepancy (MMD) between the empirical processes generated by the model and the real data. Experiments based on real data show the superior performance of our method relative to the state-of-the-art.

### What’s new with MLflow? On-Demand Webinar and FAQs now available!

On June 6th, our team hosted a live webinar—Managing the Complete Machine Learning Lifecycle: What’s new with MLflow—with Clemens Mewald, Director of Product Management at Databricks.

Machine learning development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models.

To solve for these challenges, last June, we unveiled MLflow, an open source platform to manage the complete machine learning lifecycle. Most recently, we announced the General Availability of Managed MLflow on Databricks and the MLflow 1.0 Release.

In this webinar, we reviewed new and existing MLflow capabilities that allow you to:

• Keep track of experiment runs and results across frameworks.
• Execute projects remotely on a Databricks cluster, and quickly reproduce your runs.
• Quickly productionize models using Databricks production jobs, Docker containers, Azure ML, or Amazon SageMaker.

We demonstrated these concepts using notebooks and tutorials from our public documentation so that you can practice at your own pace. If you’d like free access to the Databricks Unified Analytics Platform to try our notebooks, you can sign up for a free trial here.

Toward the end, we held a Q&A and below are the questions and answers.

Q: Apart from the trouble of all the set-up, are there any missing features or disadvantages of using MLflow on-premises rather than in the cloud on Databricks?

Databricks is very committed to the open source community. Our founders are the original creators of Apache Spark™ – a widely adopted open source unified analytics engine – and our company still actively maintains and contributes to the open source Spark code. Similarly, for both Delta Lake and MLflow, we’re equally committed to helping the open source community benefit from these products, as well as providing an out-of-the-box managed version of them.

When we think about features to provide in the open source or the managed version of Delta Lake or MLflow, we don’t think about whether we should hold back a feature from one version or another. We think about what additional features we can provide that only make sense in a hosted and managed version for enterprise users. Therefore, the benefits you get from Managed MLflow on Databricks are that you don’t need to worry about the setup or managing the servers, plus all the integrations with the Databricks Unified Analytics Platform that make it work seamlessly with the rest of the workflow. Visit http://databricks.com/mlflow to learn more.

Q: Does MLflow 1.0 support Windows?

Yes, we added support to run the MLflow client on Windows. Please see our release notes here.

Q: Does MLflow complement or compete with TensorFlow?

It’s a perfect complement. You can train TensorFlow models and log the metrics and models with MLflow.

Q: How many different metrics can we track using MLflow? Are there any restrictions imposed on it?

MLflow doesn’t impose any limits on the number of metrics you can track. The only limitations are in the backend that is used to store those metrics.

Q: How do you parallelize model training with MLflow?

MLflow is agnostic to the ML framework you use to train the model. If you use TensorFlow or PyTorch, you can distribute your training jobs with, for example, HorovodRunner, and use MLflow to log your experiments, runs, and models.

Q: Is there a way to bulk extract the MLflow info to perform operational analytics (e.g., how many training runs there were in the last quarter, how many people are training models, etc.)?

We are working on a way to more easily extract the MLflow tracking metadata into a format that you can do data science with, e.g. into a pandas dataframe.

Q: Is it possible to train and build an MLflow model on one platform (e.g., Databricks using TensorFlow with PySpark) and then reuse that MLflow model on another platform (for example in R using RStudio) to score any input?

The MLflow Model format and abstraction allows using any MLflow model from anywhere you can load it. E.g., you can use the Python function flavor to call the model from any Python library, or the R function flavor to call it as an R function. MLflow doesn’t rewrite the models into a new format, but you can always expose an MLflow model as a REST endpoint and then call it in a language agnostic way.

Q: To serve a model, what are the options to deploy outside of Databricks, e.g., SageMaker? Do you have any plans to support deployment as AWS Lambdas?

We provide several ways you can deploy MLflow models, including Amazon SageMaker, Microsoft Azure ML, Docker containers, Spark UDF and more… See this page for a list. To give one example of how to use MLflow models with AWS Lambda, you can use the Python function flavor, which enables you to call the model from anywhere you can call a Python function.

Q: Can MLflow be used with python programs outside of Databricks?

Yes, MLflow is an open source product and can be found on GitHub and PyPI.

Q: What is the pricing model for Databricks?

Q: How do you see MLflow evolving in relation to Airflow?

We are looking into ways to support multi-step workflows. One way we could do this is by using Airflow. We haven’t made these decisions yet.

Q: Any suggestions for deploying multi-step models, for example an ensemble of several base models?

Right now you can deploy those as MLflow models by writing code to ensemble the other models, e.g., similar to how the multi-step workflow example is implemented.

Q: Does MLflow provide a framework to do feature engineering on data?

Not specifically, but you can use any other framework together with MLflow.

To get started with MLflow, follow the instructions at mlflow.org or check out the release code on GitHub. We’ve also recently created a Slack channel for MLflow for real-time questions, and you can follow @MLflowOrg on Twitter. We are excited to hear your feedback!

--

The post What’s new with MLflow? On-Demand Webinar and FAQs now available! appeared first on Databricks.

### Top KDnuggets Tweets, Jun 19 – 25: Learn how to efficiently handle large amounts of data using #Pandas; The biggest mistake while learning #Python for #datascience

Also: Data Science Jobs Report 2019; Harvard CS109 #DataScience Course, Resources #Free and Online; Google launches TensorFlow; Mastering SQL for Data Science

### If you did not already know

zoNNscan
The training of deep neural network classifiers results in decision boundaries whose geometry is still not well understood. This is in direct relation with classification problems such as so-called adversarial examples. We introduce zoNNscan, an index that is intended to inform on the boundary uncertainty (in terms of the presence of other classes) around one given input datapoint. It is based on confidence entropy, and is implemented through sampling in the multidimensional ball surrounding that input. We detail the zoNNscan index, give an algorithm for approximating it, and finally illustrate its benefits on four applications, including two important problems for the adoption of deep networks in critical systems: adversarial examples and corner case inputs. We highlight that zoNNscan exhibits significantly higher values for those two problem classes than for standard inputs. …
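The entry above describes the index as an entropy of model confidences sampled in a ball around the input. A sketch of that idea in Python/NumPy (the exact formula and names here are assumptions, not the paper's implementation):

```python
import numpy as np

def zonnscan(predict_proba, x, radius=0.1, n=2000, seed=0):
    """Sketch of a zoNNscan-style index (assumed form): mean normalized
    Shannon entropy of the classifier's outputs over points sampled
    uniformly in the ball of the given radius around input x."""
    rng = np.random.default_rng(seed)
    d = len(x)
    dirs = rng.normal(size=(n, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random directions
    r = radius * rng.random(n) ** (1.0 / d)               # uniform-in-ball radii
    pts = np.asarray(x) + dirs * r[:, None]
    p = np.clip(predict_proba(pts), 1e-12, 1.0)           # (n, n_classes)
    ent = -(p * np.log(p)).sum(axis=1) / np.log(p.shape[1])
    return float(ent.mean())
```

Deep inside a class region the index is near 0 (the model is uniformly confident around x); near a decision boundary, where several classes are present in the ball, it approaches 1.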

Compositional Coding for Collaborative Filtering
Efficiency is crucial to online recommender systems. Representing users and items as binary vectors for Collaborative Filtering (CF) can achieve fast user-item affinity computation in the Hamming space, and in recent years we have witnessed an emerging research effort in exploiting binary hashing techniques for CF methods. However, CF with binary codes naturally suffers from low accuracy due to the limited representation capability of each bit, which impedes it from modeling the complex structure of the data. In this work, we attempt to improve efficiency without hurting model performance by utilizing both the accuracy of real-valued vectors and the efficiency of binary codes to represent users/items. In particular, we propose the Compositional Coding for Collaborative Filtering (CCCF) framework, which not only gains better recommendation efficiency than the state-of-the-art binarized CF approaches but also achieves even higher accuracy than the real-valued CF method. Specifically, CCCF innovatively represents each user/item with a set of binary vectors, which are associated with a sparse real-valued weight vector. Each value of the weight vector encodes the importance of the corresponding binary vector to the user/item. The continuous weight vectors greatly enhance the representation capability of binary codes, and their sparsity guarantees processing speed. Furthermore, an integer weight approximation scheme is proposed to further accelerate the speed. Based on the CCCF framework, we design an efficient discrete optimization algorithm to learn its parameters. Extensive experiments on three real-world datasets show that our method outperforms the state-of-the-art binarized CF methods (and even achieves better performance than the real-valued CF method) by a large margin in terms of both recommendation accuracy and efficiency. …

Coarse-to-Fine Network (C2F Net)
Deep neural networks have seen tremendous success for different modalities of data including images, videos, and speech. This success has led to their deployment in mobile and embedded systems for real-time applications. However, making repeated inferences using deep networks on embedded systems poses significant challenges due to constrained resources (e.g., energy and computing power). To address these challenges, we develop a principled co-design approach. Building on prior work, we develop a formalism referred to as Coarse-to-Fine Networks (C2F Nets) that allow us to employ classifiers of varying complexity to make predictions. We propose a principled optimization algorithm to automatically configure C2F Nets for a specified trade-off between accuracy and energy consumption for inference. The key idea is to select a classifier on-the-fly whose complexity is proportional to the hardness of the input example: simple classifiers for easy inputs and complex classifiers for hard inputs. We perform comprehensive experimental evaluation using four different C2F Net architectures on multiple real-world image classification tasks. Our results show that optimized C2F Net can reduce the Energy Delay Product (EDP) by 27 to 60 percent with no loss in accuracy when compared to the baseline solution, where all predictions are made using the most complex classifier in C2F Net. …

Warping Resilient Time Series Embedding
Time series are ubiquitous in real world problems and computing the distance between two time series is often required in several learning tasks. Computing similarity between time series while ignoring variations in speed, or warping, is often encountered, and dynamic time warping (DTW) is the state of the art. However, DTW is not applicable in algorithms which require kernels or vectors. In this paper, we propose a mechanism named WaRTEm to generate vector embeddings of time series such that distance measures in the embedding space exhibit resilience to warping. Therefore, WaRTEm is more widely applicable than DTW. WaRTEm is based on a twin auto-encoder architecture and a training strategy involving warping operators for generating warping resilient embeddings for time series datasets. We evaluated the performance of WaRTEm and observed more than $20\%$ improvement over DTW in multiple real-world datasets. …
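For context, the DTW baseline that WaRTEm is compared against can be sketched in a few lines of Python: the classic O(nm) dynamic program with an absolute-difference local cost (an illustration, not the authors' code):

```python
def dtw(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of match / insertion / deletion
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0: the repeated 2 is absorbed by warping. This is exactly the invariance WaRTEm aims to carry over into a vector embedding, where ordinary Euclidean distance (and hence kernels) can be used.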

### Book Review: Good to Go, by Christine Aschwanden

This is a book review. It is by Phil Price. It is not by Andrew.

The book is Good To Go: What the athlete in all of us can learn from the strange science of recovery. By Christine Aschwanden, published by W.W. Norton and Company. The publisher offered a copy to Andrew to review, and Andrew offered it to me as this blog’s unofficial sports correspondent.

tldr: This book argues persuasively that when it comes to optimizing the recovery portion of the exercise-recover-exercise cycle, nobody knows nuthin’ and most people who claim to know sumthin’ are wrong. It’s easy to read and has some nice anecdotes. Worth reading if you have a special interest in the subject, otherwise not. Full review follows.

The book is about ‘recovery’. In the context of the book, recovery is what you do between bouts of exercise; or, if you prefer, exercise is what you do between periods of recovery. The book has great blurbs. “A tour de force of great science journalism”, writes Nate Silver (!). “…a definitive tour through a bewildering jungle of scientific and pseudoscientific claims…”, writes David Epstein. “…Aschwanden makes the mind-boggling world of sports recovery a hilarious adventure”, says Olympic gold medal skier Jessie Diggins. With blurbs like these I was expecting a lot…although once I realized Aschwanden works at FiveThirtyEight, I downweighted the Silver blurb appropriately. Even so, I expected too much: the book is fine but ultimately rather unsatisfying. It is fairly interesting and sometimes amusing, but there’s only so much any author can do with the subject given the current state of knowledge, which is this: other than getting enough sleep and eating enough calories, nobody knows for sure what helps athletes recover between events or training sessions better than just living a normal life. The book is mostly just 300 pages of elucidating and amplifying that disappointing state of knowledge.




## Singular Value Decomposition

Now let us look at Singular Value Decomposition in R:

A <- matrix(data = 1:36, nrow = 6, byrow = TRUE)
A

[,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24
[5,]   25   26   27   28   29   30
[6,]   31   32   33   34   35   36


y <- svd(x = A)
y

$d
[1] 1.272064e+02 4.952580e+00 1.068280e-14 3.258502e-15 9.240498e-16
[6] 6.865073e-16

$u
[,1]        [,2]       [,3]        [,4]        [,5]
[1,] -0.06954892 -0.72039744  0.6716423 -0.11924367  0.08965916
[2,] -0.18479698 -0.51096788 -0.6087484 -0.06762569  0.44007566
[3,] -0.30004504 -0.30153832 -0.3722328 -0.09266448 -0.41109295
[4,] -0.41529310 -0.09210875  0.0313011  0.21692481 -0.67511264
[5,] -0.53054116  0.11732081  0.1308779  0.71086492  0.37490569
[6,] -0.64578922  0.32675037  0.1471598 -0.64825589  0.18156508
[,6]
[1,] -0.05319067
[2,]  0.36871061
[3,] -0.70915885
[4,]  0.56145739
[5,] -0.20432734
[6,]  0.03650885

$v
           [,1]        [,2]        [,3]        [,4]       [,5]       [,6]
[1,] -0.3650545  0.62493577  0.54215504  0.08199306 -0.1033873 -0.4060131
[2,] -0.3819249  0.38648609 -0.23874067 -0.40371901  0.5758949  0.3913066
[3,] -0.3987952  0.14803642 -0.75665994  0.29137287 -0.2722608 -0.2957858
[4,] -0.4156655 -0.09041326  0.14938782 -0.18121587 -0.6652579  0.5668542
[5,] -0.4325358 -0.32886294  0.21539167  0.69322385  0.3606549  0.2184882
[6,] -0.4494062 -0.56731262  0.08846608 -0.48165491  0.1043562 -0.4748500


all.equal(y$u %*% diag(y$d) %*% t(y$v), A)

[1] TRUE
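The same check can be written in Python with NumPy (a sketch, assuming NumPy is available; note that R's `matrix(byrow = TRUE)` corresponds to NumPy's default row-major reshape, and that NumPy returns V already transposed):

```python
import numpy as np

A = np.arange(1, 37).reshape(6, 6)           # same 6x6 matrix as above
U, d, Vt = np.linalg.svd(A)                  # A = U @ diag(d) @ Vt
assert np.allclose(U @ np.diag(d) @ Vt, A)   # reconstruction holds
```

As in the R output, only the first two singular values are meaningfully non-zero: the matrix has rank 2.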



## Moore-Penrose Pseudoinverse

Now let us look at Moore-Penrose Pseudoinverse in R:

A <- matrix(data = 1:25, nrow = 5)
A

[,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25


library(MASS)

B <- ginv(A)
B

[,1]  [,2]         [,3]   [,4]   [,5]
[1,] -0.152 -0.08 -8.00000e-03  0.064  0.136
[2,] -0.096 -0.05 -4.00000e-03  0.042  0.088
[3,] -0.040 -0.02 -9.97466e-18  0.020  0.040
[4,]  0.016  0.01  4.00000e-03 -0.002 -0.008
[5,]  0.072  0.04  8.00000e-03 -0.024 -0.056


y <- svd(A)
all.equal(y$v %*% ginv(diag(y$d)) %*% t(y$u), B)

[1] TRUE


## Trace Matrix in R

Lastly, let us look at the trace of a matrix in R:

A <- diag(x = 1:10)
A

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    0    0    0    0    0    0    0    0     0
 [2,]    0    2    0    0    0    0    0    0    0     0
 [3,]    0    0    3    0    0    0    0    0    0     0
 [4,]    0    0    0    4    0    0    0    0    0     0
 [5,]    0    0    0    0    5    0    0    0    0     0
 [6,]    0    0    0    0    0    6    0    0    0     0
 [7,]    0    0    0    0    0    0    7    0    0     0
 [8,]    0    0    0    0    0    0    0    8    0     0
 [9,]    0    0    0    0    0    0    0    0    9     0
[10,]    0    0    0    0    0    0    0    0    0    10


library(psych)
tr(A)

[1] 55


We can also code a function that computes the Frobenius norm via the trace:

alternativeFrobeniusNorm <- function(A) {
  sqrt(tr(t(A) %*% A))
}
alternativeFrobeniusNorm(A)

[1] 19.62142


frobeniusNorm(A)

[1] 19.62142


The trace is invariant under transposition:

all.equal(tr(A), tr(t(A)))

[1] TRUE


Now let us verify the cyclic property of the trace. First, a diagonal matrix with diagonal 1:5:

A <- diag(x = 1:5)
A

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    2    0    0    0
[3,]    0    0    3    0    0
[4,]    0    0    0    4    0
[5,]    0    0    0    0    5


Next, a diagonal matrix with diagonal 6:10:

B <- diag(x = 6:10)
B

     [,1] [,2] [,3] [,4] [,5]
[1,]    6    0    0    0    0
[2,]    0    7    0    0    0
[3,]    0    0    8    0    0
[4,]    0    0    0    9    0
[5,]    0    0    0    0   10


And a diagonal matrix with diagonal 11:15:

C <- diag(x = 11:15)
C

     [,1] [,2] [,3] [,4] [,5]
[1,]   11    0    0    0    0
[2,]    0   12    0    0    0
[3,]    0    0   13    0    0
[4,]    0    0    0   14    0
[5,]    0    0    0    0   15


The trace is unchanged under cyclic permutations of a matrix product:

all.equal(tr(A %*% B %*% C), tr(C %*% A %*% B))

[1] TRUE


all.equal(tr(C %*% A %*% B), tr(B %*% C %*% A))

[1] TRUE


To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.
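A NumPy sketch of the same two facts from this section (pseudoinverse via the SVD, and the cyclic property of the trace), assuming NumPy is available:

```python
import numpy as np

# Pseudoinverse from the SVD: A+ = V D+ U^T, inverting only the
# non-negligible singular values (this A has rank 2).
A = np.arange(1, 26).reshape(5, 5, order="F")   # column-major, like R's matrix()
U, d, Vt = np.linalg.svd(A)
tol = 1e-10 * d.max()
d_plus = np.array([1.0 / s if s > tol else 0.0 for s in d])
A_plus = Vt.T @ np.diag(d_plus) @ U.T
assert np.allclose(A_plus, np.linalg.pinv(A, rcond=1e-10))

# Cyclic property of the trace: tr(MBC) = tr(CMB) = tr(BCM).
M = np.diag(np.arange(1, 6))
B, C = np.diag(np.arange(6, 11)), np.diag(np.arange(11, 16))
t = np.trace(M @ B @ C)
assert np.isclose(t, np.trace(C @ M @ B))
assert np.isclose(t, np.trace(B @ C @ M))
```

The small-singular-value cutoff matters: inverting the numerically zero singular values of a rank-deficient matrix would blow up, which is why both this sketch and `ginv`/`pinv` truncate them.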
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Continue Reading…

### How does Stan work? A reading list.

Bob writes, to someone who is doing work on the Stan language:

The basic execution structure of Stan is in the JSS paper (by Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell) and in the reference manual. The details of autodiff are in the arXiv paper (by Bob Carpenter, Matt Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt). These are sort of background for what we’re trying to do.

If you haven’t read Maria Gorinova’s MS thesis and POPL paper (with Andrew Gordon and Charles Sutton), you should probably start there.

Radford Neal’s intro to HMC is nice, as is the one in David MacKay’s book. Michael Betancourt’s papers are the thing to read to understand HMC deeply—he just wrote another brain bender on geometric autodiff (all on arXiv). Starting with the one on hierarchical models would be good, as it explains the necessity of reparameterizations.

Also, I recommend our JEBS paper (with Daniel Lee and Jiqiang Guo), as it presents Stan from a user’s rather than a developer’s perspective.

And, for more general background on Bayesian data analysis, we recommend Statistical Rethinking by Richard McElreath and BDA3.

Continue Reading…

### Quick Hit: Above the Fold; Hard wrapping text at ‘n’ characters

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Despite being on holiday, I’m getting in a bit of non-work R coding since the fam has a greater ability to sleep late than I do.
Apart from other things, I’ve been working on a PR into {lutz}, a package by @andyteucher that turns lat/lng pairs into timezone strings. The package is super neat and has two modes: “fast” (originally based on a {V8}-backed version of @darkskyapp’s tzlookup javascript module) and “accurate” using R’s amazing spatial ops. I ported the javascript algorithm to C++/Rcpp and have been tweaking the bit of package helper code that fetches the upstream data, extracts the embedded string tree and corresponding timezones array, and turns both into something C++ can use.

Originally I just made a header file with the same long lines, but that’s icky and fairly bad form, especially given that C++ will combine adjacent string literals for you. The stringi::stri_wrap() function can easily take care of wrapping the time zone array elements for us, but I also needed the ability to hard-wrap the encoded string tree at a fixed width. There are lots of ways to do that; here are three of them:

```
library(Rcpp)
library(stringi)
library(tidyverse)
library(microbenchmark)

sourceCpp(code = "
#include <Rcpp.h>

// [[Rcpp::export]]
std::vector< std::string > fold_cpp(const std::string& input, int width) {

  unsigned long sz = input.length() / width;

  std::vector< std::string > out;
  out.reserve(sz); // shld make this more efficient

  for (unsigned long idx = 0; idx < sz; idx++) {
    out.push_back(input.substr(idx * width, width));
  }

  if ((input.length() % width) != 0) out.push_back(input.substr(width * sz));

  return(out);

}
")

fold_tidy <- function(input, width) {
  map_chr(
    seq(1, nchar(input), width),
    ~stri_sub(input, .x, length = width)
  )
}

fold_base <- function(input, width) {
  vapply(
    seq(1, nchar(input), width),
    function(idx) substr(input, idx, idx + width - 1),
    FUN.VALUE = character(1)
  )
}
```

(If you know of a package that has this type of function, def leave a note in the comments.)

Each one does the same thing: move n sequences of width characters into a new slot in a character vector.
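As another point of comparison (purely a sketch of mine, not one of the post's benchmarked implementations, and the name `fold_substring` is hypothetical): base R's `substring()` accepts vectors of start and stop positions, so the same fold can be written without an explicit loop or apply call.

```r
# Vectorized fold via substring(), which recycles over start/stop vectors
fold_substring <- function(input, width) {
  starts <- seq(1L, nchar(input), by = width)
  substring(input, starts, pmin(starts + width - 1L, nchar(input)))
}

src <- paste0(c(rep("a", 30), rep("b", 30), rep("c", 4)), collapse = "")
fold_substring(src, 30)
# [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" "cccc"
```

The `pmin()` clamps the final stop position so the trailing partial chunk is kept, mirroring the remainder handling in the C++ version.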
Let’s see what they do with this toy long string example:

```
(src <- paste0(c(rep("a", 30), rep("b", 30), rep("c", 4)), collapse = ""))
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccc"

for (n in c(1, 7, 30, 40)) {
  print(fold_base(src, n))
  print(fold_tidy(src, n))
  print(fold_cpp(src, n))
  cat("\n")
}
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##
## [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
## [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"
## [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
## [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"
## [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
## [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"
##
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"
##
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"
```

So, we know they all work, which means we can take a look at which one is faster. Let’s compare folding at various widths:

```
map_df(c(1, 3, 5, 7, 10, 20, 30, 40, 70), ~{
  microbenchmark(
    base = fold_base(src, .x),
    tidy = fold_tidy(src, .x),
    cpp  = fold_cpp(src, .x)
  ) %>%
    mutate(width = .x) %>%
    as_tibble()
}) %>%
  mutate(
    width = factor(width, levels = sort(unique(width)), ordered = TRUE)
  ) -> bench_df

ggplot(bench_df, aes(expr, time)) +
  ggbeeswarm::geom_quasirandom(
    aes(group = width, fill = width),
    groupOnX = TRUE, shape = 21, color = "white",
    size = 3, stroke = 0.125, alpha = 1/4
  ) +
  scale_y_comma(trans = "log10", position = "right") +
  coord_flip() +
  guides(
    fill = guide_legend(override.aes = list(alpha = 1))
  ) +
  labs(
    x = NULL, y = "Time (nanoseconds)", fill = "Split width:",
    title = "Performance comparison between 'fold' implementations"
  ) +
  theme_ft_rc(grid = "X") +
  theme(legend.position = "top")

ggplot(bench_df, aes(width, time)) +
  ggbeeswarm::geom_quasirandom(
    aes(group = expr, fill = expr),
    groupOnX = TRUE, shape = 21, color = "white",
    size = 3, stroke = 0.125, alpha = 1/4
  ) +
  scale_x_discrete(
    labels = c(1, 3, 5, 7, 10, 20, 30, 40, "Split/fold width: 70")
  ) +
  scale_y_comma(trans = "log10", position = "right") +
  scale_fill_ft() +
  coord_flip() +
  guides(
    fill = guide_legend(override.aes = list(alpha = 1))
  ) +
  labs(
    x = NULL, y = "Time (nanoseconds)", fill = NULL,
    title = "Performance comparison between 'fold' implementations"
  ) +
  theme_ft_rc(grid = "X") +
  theme(legend.position = "top")
```

The Rcpp version is both faster and more consistent than the other two implementations (though they all get faster as the number of string subsetting operations decreases); but they’re all pretty fast. For an infrequently run process, it might be better to use the base R version purely for simplicity.
Despite that fact, I used the Rcpp version to hard-wrap the string tree long line.

FIN

If you have need to “fold” like this, how do you currently implement your solution? Found a bug or a better way after looking at the code? Drop a note in the comments so you can help others find an optimal solution to their own ‘fold’ing problems.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Continue Reading…

### Can random forest provide insights into how yeast grows?

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

I’m not saying this is a good idea, but bear with me. A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient? The random forest was not working as expected: not one of the nutrients was correctly classified.
I pointed out that with only one case for each outcome, this was to be expected: as the random forest algorithm samples a proportion of the rows, no correct predictions are likely in this case. As sometimes happens, the question was promptly deleted, which was unfortunate, as we could have explored the problem further.

A little web searching revealed that the dataset in question is quite well known. It’s published in an article titled Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast and has been transformed into a “tidy” format for use in tutorials, here and here. As it turns out, there are 6 cases (rows) for each outcome (limiting nutrient), as experiments were performed at 6 different growth rates. Whilst random forests are good for “large p, small n” problems, a dataset of ~5,500 variables by 36 observations may be pushing the limit somewhat. But you know – let’s just try it anyway. As ever, code and a report for this blog post can be found at Github.

First, we obtain the tidied Brauer dataset. But in fact we want to “untidy” it again (make wide from long), because for random forest we want observations (n) as rows and variables (p – both predicted and predictors) as columns.

```
library(tidyverse)
library(randomForest)
library(randomForestExplainer)

brauer2007_tidy <- read_csv("https://4va.github.io/biodatasci/data/brauer2007_tidy.csv")
```

I’ll show the code for running a random forest without much explanation. I’ll assume you have a basic understanding of how the process works. That is: rows are sampled from the data and used to build many decision trees where variables predict either a continuous outcome variable (regression) or a categorical outcome (classification). The trees are then averaged (for regression values) or the majority vote is taken (for classification) to generate predictions. Individual predictors have an “importance”, which is essentially some measure of how much worse the model would be were they not included. Here we go then.
A couple of notes. First, setting a seed for random sampling would not usually be done; it’s for reproducibility here. Second, unless you are specifically using the model to predict outcomes on unseen data, there’s no real need for splitting the data into test and training sets – the procedure is already performing a bootstrap by virtue of the out-of-bag error estimation.

```
brauer2007_tidy_rf1 <- brauer2007_tidy %>%
  mutate(systematic_name = gsub("-", "minus", systematic_name),
         nutrient = factor(nutrient)) %>%
  select(systematic_name, nutrient, rate, expression) %>%
  spread(systematic_name, expression, fill = 0) %>%
  randomForest(nutrient ~ ., data = ., localImp = TRUE, importance = TRUE)
```

The model seems to have performed quite well:

```
Call:
 randomForest(formula = nutrient ~ ., data = ., localImp = TRUE, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 74

        OOB estimate of error rate: 5.56%
Confusion matrix:
          Ammonia Glucose Leucine Phosphate Sulfate Uracil class.error
Ammonia         6       0       0         0       0      0   0.0000000
Glucose         0       6       0         0       0      0   0.0000000
Leucine         0       1       5         0       0      0   0.1666667
Phosphate       0       0       0         6       0      0   0.0000000
Sulfate         0       0       0         0       6      0   0.0000000
Uracil          0       1       0         0       0      5   0.1666667
```

Let’s look at the expression of the top 20 genes (by variable importance), with respect to growth rate and limiting nutrient.

```
brauer2007_tidy %>%
  filter(systematic_name %in% important_variables(brauer2007_tidy_rf1, k = 20)) %>%
  ggplot(aes(rate, expression)) +
  geom_line(aes(color = nutrient)) +
  facet_wrap(~systematic_name, ncol = 5) +
  scale_color_brewer(palette = "Set2")
```

Several of those look promising. Let’s select a gene where expression seems to be affected by each of the 6 limiting nutrients and search online resources such as the Saccharomyces Genome Database to see what’s known.
| systematic_name | bp | nutrient | search_results |
|---|---|---|---|
| YHR208W | branched chain family amino acid biosynthesis* | leucine | Pathways – leucine biosynthesis |
| YKL216W | ‘de novo’ pyrimidine base biosynthesis | uracil | URA1 – null mutant requires uracil |
| YLL055W | biological process unknown | sulfate | Cysteine transporter; null mutant absent utilization of sulfur source |
| YLR108C | biological process unknown | phosphate | |
| YOR348C | proline catabolism* | ammonia | Proline permease; repressed in ammonia-grown cells |
| YOR374W | ethanol metabolism | glucose | Aldehyde dehydrogenase; expression is glucose-repressed |

I’d say the Brauer data “makes sense” for five of those genes. Little is known about the sixth (YLR108C, affected by phosphate limitation).

In summary: normally, a study like this would start with the genes – identify those that are differentially expressed, and then think about the conditions under which differential expression was observed. Here, the process is reversed in a sense: we view the experimental condition as an outcome rather than a parameter, and ask whether it can be “predicted” by other observations. So whilst not my first choice of method for this kind of study, and despite limited outcome data, random forest does seem to be generating some insights into which genes are affected by nutrient limitation. And at the end of the day: if a method provides insights, isn’t that what data science is for?

To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate.

Continue Reading…

### Why do we need AWS SageMaker?
Today, there are several platforms available in the industry that help software developers, data scientists, and even non-specialists develop and deploy machine learning models in very little time. Continue Reading…

### Four short links: 26 June 2019

Ethics and OKRs, Rewriting Binaries, Diversity of Implementation, and Uber's Metrics Systems

1. Ethical Principles and OKRs -- Your KPIs can’t conflict with your principles if you don’t have principles. (So, start by defining your principles, then consider your principles before optimizing a KPI, monitor user experience to see if you're compromising your principles, and repeat.) (via Peter Skomoroch)
2. RetroWrite -- Retrofitting compiler passes through binary rewriting. Paper: The ideal solution for binary security analysis would be a static rewriter that can intelligently add the required instrumentation as if it were inserted at compile time. Such instrumentation requires an analysis to statically disambiguate between references and scalars, a problem known to be undecidable in the general case. We show that recovering this information is possible in practice for the most common class of software and libraries: 64-bit, position-independent code. (via Mathias Payer)
3. Re: A libc in LLVM -- A very thoughtful post from a libc maintainer about the risks if Google implements an LLVM libc. Avoiding monoculture preserves the motivation for consensus-based standards processes rather than single-party control (see also: Chrome and what it's done to the web), and the motivation for people writing software to write to the standards rather than to a particular implementation.
4. M3 and M3DB -- M3, a metrics platform, and M3DB, a distributed time series database, were developed at Uber out of necessity. After using what was available as open source and finding we were unable to use them at our scale due to issues with their reliability, cost, and operationally intensive nature, we built our own metrics platform piece by piece.
We used our experience to help us build a native distributed time series database, a highly dynamic and performant aggregation service, query engine, and other supporting infrastructure. Continue reading Four short links: 26 June 2019. Continue Reading…

### KDnuggets™ News 19:n24, Jun 26: Understand Cloud Services; Pandas Tips & Tricks; Master Data Preparation w/ Python

Happy summer! This week on KDnuggets: Understanding Cloud Data Services; How to select rows and columns in Pandas using [ ], .loc, iloc, .at and .iat; 7 Steps to Mastering Data Preparation for Machine Learning with Python; Examining the Transformer Architecture: The OpenAI GPT-2 Controversy; Data Literacy: Using the Socratic Method; and much more! Continue Reading…

### AI and machine learning will require retraining your entire organization

To successfully integrate AI and machine learning technologies, companies need to take a more holistic approach toward training their workforce. In our recent surveys AI Adoption in the Enterprise and Machine Learning Adoption in the Enterprise, we found growing interest in AI technologies among companies across a variety of industries and geographic locations. Our findings align with other surveys and studies—in fact, a recent study by the World Intellectual Property Organization (WIPO) found that the surge in research in AI and machine learning (ML) has been accompanied by even stronger growth in AI-related patent applications. Patents are one sign that companies are beginning to take these technologies very seriously.
When we asked what held back their adoption of AI technologies, respondents cited a few reasons, including some that pertained to culture, organization, and skills:

• [23%] Company culture does not yet recognize needs for AI
• [18%] Lack of skilled people / difficulty hiring the required roles
• [17%] Difficulties in identifying appropriate business use cases

Implementing and incorporating AI and machine learning technologies will require retraining across an organization, not just technical teams. Recall that the rise of big data and data science necessitated a certain amount of retraining across an entire organization: technologists and analysts needed to familiarize themselves with new tools and architectures, but business experts and managers also needed to reorient their workflows to adjust to data-driven processes and data-intensive systems. AI and machine learning will require a similar holistic approach to training. Here are a few reasons why:

• As noted from our survey, identifying appropriate business use cases remains an ongoing challenge. Domain experts and business owners need to develop an understanding of these technologies in order to be able to highlight areas where they are likely to make an impact within a company.
• Members of an organization will need to understand—even at a high level—the current state of AI and ML technologies so they know the strengths and limitations of these new tools. For instance, in the case of robotic process automation (RPA), it’s really the people closest to tasks (“bottoms up”) who can best identify areas where it is most suitable.
• AI and machine learning depend on data (usually labeled training data for machine learning models), and in many instances, a certain amount of domain knowledge will be needed to assemble high-quality data.
• Machine learning and AI involve end-to-end pipelines, so development/testing/integration will often cut across technical roles and technical teams.
• AI and machine learning applications and solutions often interact with (or augment) users and domain experts, so UX/design remains critical.
• Security, privacy, ethics, and other risk and compliance issues will increasingly require that companies set up cross-functional teams when they build AI and machine learning systems and products.

At our upcoming Artificial Intelligence conferences in San Jose and London, we have assembled a roster of two-day training sessions, tutorials, and presentations to help individuals (across job roles and functions) sharpen their skills and understanding of AI and machine learning. We return to San Jose with a two-day Business Summit designed specifically for executives, business leaders, and strategists. This Business Summit includes a popular two-day training—AI for Managers—and tutorials—Bringing AI into the enterprise and Design Thinking for AI—along with 12 executive briefings designed to provide in-depth overviews of important topics in AI. We are also debuting a new half-day tutorial taught by Ira Cohen (Product management in the Machine Learning era), which, given the growing importance of AI and ML, is one that every manager should consider attending. We will also have our usual strong slate of technical training, tutorials, and talks. Here are some two-day training sessions and tutorials that I am excited about:

AI and ML are going to impact and permeate most aspects of a company’s operations, products, and services. To succeed in implementing and incorporating AI and machine learning technologies, companies need to take a more holistic approach toward retraining their workforces. This will be an ongoing endeavor as research results continue to be translated into practical systems that companies can use. Individuals will need to continue to learn new skills as technologies continue to evolve and because many areas of AI and ML are increasingly becoming democratized.
Related training and tutorial links: Continue Reading…

### What’s new on arXiv

Recent developments in deep reinforcement learning are concerned with creating decision-making agents which can perform well in various complex domains. A particular approach which has received increasing attention is multi-agent reinforcement learning, in which multiple agents learn concurrently to coordinate their actions. In such multi-agent environments, additional learning problems arise due to the continually changing decision-making policies of agents. This paper surveys recent works that address the non-stationarity problem in multi-agent deep reinforcement learning. The surveyed methods range from modifications in the training procedure, such as centralized training, to learning representations of the opponent’s policy, meta-learning, communication, and decentralized learning. The survey concludes with a list of open problems and possible lines of future research.

One of the most rapidly evolving and dynamic business sectors is the IT domain, where it is difficult to find experienced, skilled and qualified employees. Specialists are essential for developing and implementing new ideas into products. The human resources (HR) department plays a major role in the recruitment of qualified employees by assessing their skills, using different HR metrics, and selecting the best candidates for a specific job. Most recruiters are not qualified to evaluate IT specialists. In order to decrease the gap between the HR department and IT specialists, we designed, implemented and tested an Expert System for Recruiting IT specialists – ESRIT. The expert system uses text mining, natural language processing, and classification algorithms to extract relevant information from resumes by using a knowledge base that stores the relevant key skills and phrases. The recruiter is looking for the same abilities and certificates, trying to place the best applicant into a specific position.
The article presents a developing picture of the major IT skills that will be required in 2014, and it argues for the choice of the IT abilities domain.

With the fast development of various positioning techniques such as the Global Positioning System (GPS), mobile devices and remote sensing, spatio-temporal data has become increasingly available nowadays. Mining valuable knowledge from spatio-temporal data is critically important to many real-world applications, including human mobility understanding, smart transportation, urban planning, public safety, health care and environmental management. As the number, volume and resolution of spatio-temporal datasets increase rapidly, traditional data mining methods, especially statistics-based methods for dealing with such data, are becoming overwhelmed. Recently, with the advances of deep learning techniques, deep learning models such as the convolutional neural network (CNN) and recurrent neural network (RNN) have enjoyed considerable success in various machine learning tasks due to their powerful hierarchical feature learning ability in both spatial and temporal domains, and have been widely applied in various spatio-temporal data mining (STDM) tasks such as predictive learning, representation learning, anomaly detection and classification. In this paper, we provide a comprehensive survey on recent progress in applying deep learning techniques for STDM. We first categorize the types of spatio-temporal data and briefly introduce the popular deep learning models that are used in STDM. Then a framework is introduced to show a general pipeline of the utilization of deep learning models for STDM. Next we classify existing literature based on the types of ST data, the data mining tasks, and the deep learning models, followed by the applications of deep learning for STDM in different domains including transportation, climate science, human mobility, location-based social networks, crime analysis, and neuroscience.
Finally, we conclude with the limitations of current research and point out future research directions.

We propose a comprehensive end-to-end pipeline for a Twitter hashtag recommendation system, including data collection, a supervised training setting and a zero-shot training setting. In the supervised training setting, we have proposed and compared the performance of various deep learning architectures, namely the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and Transformer Network. However, it is not feasible to collect data for all possible hashtag labels and train a classifier model on them. To overcome this limitation, we propose a Zero Shot Learning (ZSL) paradigm for predicting unseen hashtag labels by learning the relationship between the semantic space of tweets and the embedding space of hashtag labels. We evaluated various state-of-the-art ZSL methods like Convex combination of Semantic Embedding (ConSE), Embarrassingly Simple Zero-Shot Learning (ESZSL) and Deep Embedding Model for Zero-Shot Learning (DEM-ZSL) for the hashtag recommendation task. We demonstrate the effectiveness and scalability of ZSL methods for the recommendation of unseen hashtags. To the best of our knowledge, this is the first quantitative evaluation of ZSL methods to date for unseen hashtag recommendation from tweet text.

State-of-the-art segmentation methods rely on very deep networks that are not always easy to train without very large training datasets and tend to be relatively slow to run on standard GPUs. In this paper, we introduce a novel recurrent U-Net architecture that preserves the compactness of the original U-Net while substantially increasing its performance, to the point where it outperforms the state of the art on several benchmarks. We demonstrate its effectiveness for several tasks, including hand segmentation, retina vessel segmentation, and road segmentation. We also introduce a large-scale dataset for hand segmentation.
The development of machine learning is promoting the search for fast and stable minimization algorithms. To this end, we suggest a change in the current gradient descent methods that should speed up the motion in flat regions and slow it down in steep directions of the function to minimize. It is based on a ‘power gradient’, in which each component of the gradient is replaced by its sign-preserving $H$-th power, with $0 < H < 1$. We test three modern gradient descent methods fed by such a variant and by standard gradients, finding the new version to achieve significantly better performances for the Nesterov accelerated gradient and AMSGrad. We also propose an effective new take on the ADAM algorithm, which includes power gradients with varying $H$.

In this work, we propose to quantize all parts of standard classification networks and replace the activation-weight multiply step with a simple table-based lookup. This approach results in networks that are free of floating-point operations and free of multiplications, suitable for direct FPGA and ASIC implementations. It also provides us with two simple measures of per-layer and network-wide compactness, as well as insight into the distribution characteristics of activation output and weight values. We run controlled studies across different quantization schemes, both fixed and adaptive and, within the set of adaptive approaches, both parametric and model-free. We implement our approach to quantization with minimal, localized changes to the training process, allowing us to benefit from advances in training continuous-valued network architectures. We apply our approach successfully to AlexNet, ResNet, and MobileNet. We show results that are within 1.6% of the reported, non-quantized performance on MobileNet using only 40 entries in our table. This performance gap narrows to zero when we allow tables with 320 entries. Our results give the best accuracies among multiply-free networks.
Say that a graph $G$ has property $\mathcal{K}$ if the size of its maximum matching is equal to the order of a minimal vertex cover. We study the following process. Set $N := \binom{n}{2}$ and let $e_1, e_2, \dots, e_{N}$ be a uniformly random ordering of the edges of $K_n$, with $n$ an even integer. Let $G_0$ be the empty graph on $n$ vertices. For $m \geq 0$, $G_{m+1}$ is obtained from $G_m$ by adding the edge $e_{m+1}$ exactly if $G_m \cup \{ e_{m+1}\}$ has property $\mathcal{K}$. We analyse the behaviour of this process, focusing mainly on two questions: What can be said about the structure of $G_N$, and for which $m$ will $G_m$ contain a perfect matching?

Learning node embeddings that capture a node’s position within the broader graph structure is crucial for many prediction tasks on graphs. However, existing Graph Neural Network (GNN) architectures have limited power in capturing the position/location of a given node with respect to all other nodes of the graph. Here we propose Position-aware Graph Neural Networks (P-GNNs), a new class of GNNs for computing position-aware node embeddings. P-GNN first samples sets of anchor nodes, computes the distance of a given target node to each anchor-set, and then learns a non-linear distance-weighted aggregation scheme over the anchor-sets. This way P-GNNs can capture the positions/locations of nodes with respect to the anchor nodes. P-GNNs have several advantages: they are inductive, scalable, and can incorporate node feature information. We apply P-GNNs to multiple prediction tasks including link prediction and community detection. We show that P-GNNs consistently outperform state-of-the-art GNNs, with up to 66% improvement in terms of the ROC AUC score.

Accurate load forecasting has always been one of the indispensable parts of the operation and planning of power systems.
Among different time horizons of forecasting, while short-term load forecasting (STLF) and long-term load forecasting (LTLF) have respectively benefited from accurate predictors and probabilistic forecasting, medium-term load forecasting (MTLF) demands more attention due to its vital role in power system operation and planning, such as optimal scheduling of generation units, robust planning programs for customer service, and economic supply. In this study, a hybrid method, composed of Support Vector Regression (SVR) and Symbiotic Organism Search Optimization (SOSO), is proposed for MTLF. In the proposed forecasting model, SVR is the main part of the forecasting algorithm, while SOSO is embedded into it to optimize the parameters of SVR. In addition, a minimum redundancy-maximum relevance feature selection algorithm is used in the preprocessing of input data. The proposed method is tested on the EUNITE competition dataset to demonstrate its proper performance. Furthermore, it is compared with some previous works to show the eligibility of our method.

Stochastic gradient descent~(SGD) and its variants, including some accelerated variants, have become popular for training in machine learning. However, in all existing SGD and its variants, the sample size in each iteration~(epoch) of training is the same as the size of the full training set. In this paper, we propose a new method, called \underline{ada}ptive \underline{s}ample \underline{s}election~(ADASS), for training acceleration. During different epochs of training, ADASS only needs to visit different training subsets which are adaptively selected from the full training set according to the Lipschitz constants of the loss functions on samples. It means that in ADASS the sample size in each epoch of training can be smaller than the size of the full training set, by discarding some samples. ADASS can be seamlessly integrated with existing optimization methods, such as SGD and momentum SGD, for training acceleration.
Theoretical results show that the learning accuracy of ADASS is comparable to that of its counterparts with the full training set. Furthermore, empirical results on both shallow models and deep models also show that ADASS can accelerate the training process of existing methods without sacrificing accuracy.

We study a multi-modal route planning scenario consisting of a public transit network and a transfer graph representing a secondary transportation mode (e.g., walking or taxis). The objective is to compute all journeys that are Pareto-optimal with respect to arrival time and the number of required transfers. While various existing algorithms can efficiently compute optimal journeys in either a pure public transit network or a pure transfer graph, combining the two increases running times significantly. As a result, even walking between stops is typically limited by a maximal duration or distance, or by requiring the transfer graph to be transitively closed. To overcome these shortcomings, we propose a novel preprocessing technique called ULTRA (UnLimited TRAnsfers): Given a complete transfer graph (without any limitations, representing an arbitrary non-schedule-based mode of transportation), we compute a small number of transfer shortcuts that are provably sufficient for computing all Pareto-optimal journeys. We demonstrate the practicality of our approach by showing that these transfer shortcuts can be integrated into a variety of state-of-the-art public transit algorithms, establishing the ULTRA-Query algorithm family. Our extensive experimental evaluation shows that ULTRA is able to improve these algorithms from limited to unlimited transfers without sacrificing query speed, yielding the fastest known algorithms for multi-modal routing. This is true not just for walking, but also for other transfer modes such as cycling or driving.

We recently introduced a formalism for the modeling of temporal networks, which we call stream graphs.
It emphasizes the streaming nature of data and allows rigorous definitions of many important concepts generalizing classical graphs. This includes in particular size, density, clique, neighborhood, degree, clustering coefficient, and transitivity. In this contribution, we show that, like graphs, stream graphs may be extended to cope with bipartite structures, with node and link weights, or with link directions. We review the main bipartite, weighted or directed graph concepts proposed in the literature, we generalize them to the cases of bipartite, weighted, or directed stream graphs, and we show that the obtained concepts are consistent with graph and stream graph ones. This provides a formal ground for an accurate modeling of the many temporal networks that have one or several of these features.

Generative adversarial networks have been very successful in generative modeling; however, they remain relatively hard to optimize compared to standard deep neural networks. In this paper, we try to gain insight into the optimization of GANs by looking at the game vector field resulting from the concatenation of the gradients of both players. Based on this point of view, we propose visualization techniques that allow us to make the following empirical observations. First, the training of GANs suffers from rotational behavior around locally stable stationary points, which, as we show, corresponds to the presence of imaginary components in the eigenvalues of the Jacobian of the game. Second, GAN training seems to converge to a stable stationary point which is a saddle point for the generator loss, not a minimum, while still achieving excellent performance. This counter-intuitive yet persistent observation questions whether we actually need a Nash equilibrium to get good performance in GANs.
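The link between rotation and imaginary eigenvalues of the game Jacobian can be seen in a toy example; this is our own minimal sketch using the standard bilinear game f(x, y) = x·y, not the paper's experimental setup.

```python
import numpy as np

# Toy bilinear game: player x minimizes f(x, y) = x * y, player y maximizes it.
# The "game vector field" concatenates the gradient of each player's own loss:
# v(x, y) = (d f / d x, d(-f) / d y) = (y, -x).
def game_field(x, y):
    return np.array([y, -x])

# Jacobian of v at the stationary point (0, 0).
J = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
eigvals = np.linalg.eigvals(J)
# Eigenvalues are purely imaginary (+1j and -1j): simultaneous gradient
# steps rotate around the equilibrium instead of converging straight to it.
```

Following the field for a few explicit-Euler steps traces a spiral around (0, 0), which is exactly the rotational training behavior the visualization techniques above are designed to expose.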
Integer programming (IP) is a general optimization framework widely applicable to a variety of unstructured and structured problems arising in, e.g., scheduling, production planning, and graph optimization. As IP models many provably hard-to-solve problems, modern IP solvers rely on many heuristics. These heuristics are usually human-designed, and naturally prone to suboptimality. The goal of this work is to show that the performance of those solvers can be greatly enhanced using reinforcement learning (RL). In particular, we investigate a specific methodology for solving IPs, known as the Cutting Plane Method. This method is employed as a subroutine by all modern IP solvers. We present a deep RL formulation, network architecture, and algorithms for intelligent adaptive selection of cutting planes (aka cuts). Across a wide range of IP tasks, we show that the trained RL agent significantly outperforms human-designed heuristics, and effectively generalizes to 10X larger instances and across IP problem classes. The trained agent is also demonstrated to benefit the popular downstream application of cutting plane methods in the Branch-and-Cut algorithm, which is the backbone of state-of-the-art commercial IP solvers.

The convolutional neural network is a very important model in deep learning. It can help avoid the exploding/vanishing gradient problem and improve the generalizability of a neural network if the singular values of the Jacobian of a layer are bounded around $1$ in the training process. We propose a new penalty function for a convolutional kernel that keeps the singular values of the corresponding transformation matrix bounded around $1$. We show how to carry out gradient-type methods. The penalty is about the transformation matrix corresponding to a kernel, not directly about the kernel, which is different from results in existing papers. This provides a new regularization method for the weights of convolutional layers.
Other penalty functions for a kernel can be devised following this idea in the future.

When the data are stored in a distributed manner, direct application of traditional statistical inference procedures is often prohibitive due to communication cost and privacy concerns. This paper develops and investigates two Communication-Efficient Accurate Statistical Estimators (CEASE), implemented through iterative algorithms for distributed optimization. In each iteration, node machines carry out computation in parallel and communicate with the central processor, which then broadcasts aggregated information to node machines for new updates. The algorithms adapt to the similarity among loss functions on node machines, and converge rapidly when each node machine has a large enough sample size. Moreover, they do not require good initialization and enjoy linear convergence guarantees under general conditions. The contraction rate of optimization errors is presented explicitly, with dependence on the local sample size unveiled. In addition, the improved statistical accuracy per iteration is derived. By regarding the proposed method as a multi-step statistical estimator, we show that statistical efficiency can be achieved in finite steps in typical statistical applications. In addition, we give the conditions under which the one-step CEASE estimator is statistically efficient. Extensive numerical experiments on both synthetic and real data validate the theoretical results and demonstrate the superior performance of our algorithms.

Scalar-on-function linear models are commonly used to regress a scalar response on functional predictors. However, functional models are more difficult to estimate and interpret than traditional linear models, and may be unnecessarily complex for a data application. Hypothesis testing can be used to guide model selection by determining if a functional predictor is necessary.
Using a mixed effects representation with penalized splines and variance component tests, we propose a framework for testing functional linear models with responses from exponential family distributions. The proposed method can accommodate dense and sparse functional data, and can be used to test functional predictors both for no effect and for the form of the effect. We show via simulation study that the proposed method achieves the nominal level and has high power, and we demonstrate its utility with two data applications.

Statistical machine translation (SMT) is a fast-growing sub-field of computational linguistics. Until now, the most popular automatic metric to measure the quality of SMT is the BiLingual Evaluation Understudy (BLEU) score. Lately, SMT along with the BLEU metric has been applied to a Software Engineering task named code migration. (In)Validating the use of the BLEU score could advance the research and development of SMT-based code migration tools. Unfortunately, there is no study to approve or disapprove the use of the BLEU score for source code. In this paper, we conducted an empirical study on the BLEU score to (in)validate its suitability for the code migration task, given its potential inability to reflect the semantics of source code. In our work, we use human judgment as the ground truth to measure the semantic correctness of the migrated code. Our empirical study demonstrates that BLEU does not reflect translation quality due to its weak correlation with the semantic correctness of translated code. We provide counter-examples to show that BLEU is ineffective in comparing the translation quality between SMT-based models. Due to BLEU’s ineffectiveness for the code migration task, we propose an alternative metric, RUBY, which considers lexical, syntactical, and semantic representations of source code. We verified that RUBY achieves a higher correlation coefficient with the semantic correctness of migrated code, 0.775 in comparison with 0.583 for the BLEU score.
We also confirmed the effectiveness of RUBY in reflecting the changes in translation quality of SMT-based translation models. With its advantages, RUBY can be used to evaluate SMT-based code migration models.

Many applications for classification methods require not only high accuracy but also reliable estimation of predictive uncertainty. However, while many current classification frameworks, in particular deep neural network architectures, provide very good results in terms of accuracy, they tend to underestimate their predictive uncertainty. In this paper, we propose a method that corrects the confidence output of a general classifier such that it approaches the true probability of classifying correctly. This classifier calibration is, in contrast to existing approaches, based on a non-parametric representation using a latent Gaussian process and specifically designed for multi-class classification. It can be applied to any classification method that outputs confidence estimates and is not limited to neural networks. We also provide a theoretical analysis regarding the over- and underconfidence of a classifier and its relationship to calibration. In experiments we show the universally strong performance of our method across different classifiers and benchmark data sets, in contrast to existing classifier calibration techniques.

Similarity measures are used extensively in machine learning and data science algorithms. The newly proposed graph Relative Hausdorff (RH) distance is a lightweight yet nuanced similarity measure for quantifying the closeness of two graphs. In this work we study the effectiveness of RH distance as a tool for detecting anomalies in time-evolving graph sequences. We apply RH to cyber data with given red team events, as well as to synthetically generated sequences of graphs with planted attacks.
In our experiments, the performance of RH distance is at times comparable, and sometimes superior, to graph edit distance in detecting anomalous phenomena. Our results suggest that in appropriate contexts, RH distance has advantages over more computationally intensive similarity measures.

We propose an efficient transfer learning method for adapting an ImageNet pre-trained Convolutional Neural Network (CNN) to a fine-grained image classification task. Conventional transfer learning methods typically face the trade-off between training time and accuracy. By adding an ‘attention module’ to each convolutional filter of the pre-trained network, we are able to rank and adjust the importance of each convolutional signal in an end-to-end pipeline. In this report, we show our method can adapt a pre-trained ResNet50 for a fine-grained transfer learning task within a few epochs and achieve accuracy above conventional transfer learning methods and close to models trained from scratch. Our model also offers interpretable results, because the rank of the convolutional signals shows which convolution channels are utilized and amplified to achieve better classification results, as well as which signals should be treated as noise for the specific transfer learning task and could be pruned to reduce model size.

Recent advances in Neural Variational Inference allowed for a renaissance in latent variable models in a variety of domains involving high-dimensional data. While traditional variational methods derive an analytical approximation for the intractable distribution over the latent variables, here we construct an inference network conditioned on the symbolic representation of entities and relation types in the Knowledge Graph, to provide the variational distributions. The new framework results in a highly scalable method.
Under a Bernoulli sampling framework, we provide an alternative justification for commonly used techniques in large-scale stochastic variational inference, which drastically reduce training time at the cost of an additional approximation to the variational lower bound. We introduce two models from this highly scalable probabilistic framework, namely the Latent Information and Latent Fact models, for reasoning over knowledge graph-based representations. Our Latent Information and Latent Fact models improve upon baseline performance under certain conditions. We use the learnt embedding variance to estimate predictive uncertainty during link prediction, and discuss the quality of these learnt uncertainty estimates. Our source code and datasets are publicly available online at https://…/Neural-Variational-Knowledge-Graphs.

Recently, deep learning models have played more and more important roles in content recommender systems. However, although the performance of recommendations is greatly improved, the ‘Matthew effect’ becomes increasingly evident. While head content gets more and more popular, much competitive long-tail content struggles to achieve timely exposure because it lacks behavioral features. This issue has badly impacted the quality and diversity of recommendations. To solve this problem, look-alike algorithms are a good choice for extending the audience of high-quality long-tail content. But the traditional look-alike models, which are widely used in online advertising, are not suitable for recommender systems because of the strict requirements on both real-time performance and effectiveness. This paper introduces a real-time attention based look-alike model (RALM) for recommender systems, which tackles the conflict between real-time performance and effectiveness. RALM realizes real-time look-alike audience extension benefiting from seeds-to-user similarity prediction and improves effectiveness through optimizing user representation learning and look-alike learning modeling.
For user representation learning, we propose a novel neural network structure named the attention merge layer to replace the concatenation layer, which significantly improves the expressive ability of multi-field feature learning. On the other hand, considering the various members of the seeds, we design a global attention unit and a local attention unit to learn robust and adaptive seed representations with respect to a certain target user. Finally, we introduce a seed clustering mechanism which not only reduces the time complexity of attention unit prediction but also minimizes the loss of seed information. According to our experiments, RALM shows superior effectiveness and performance compared to popular look-alike models.

To protect user privacy and meet legal regulations, federated (machine) learning has been attracting vast interest in recent years. The key principle of federated learning is training a machine learning model without needing to know each user’s personal raw private data. In this paper, we propose a secure matrix factorization framework under the federated learning setting, called FedMF. First, we design a user-level distributed matrix factorization framework where the model can be learned when each user only uploads the gradient information (instead of the raw preference data) to the server. While gradient information seems secure, we prove that it could still leak users’ raw data. To this end, we enhance the distributed matrix factorization framework with homomorphic encryption. We implement a prototype of FedMF and test it with a real movie rating dataset. Results verify the feasibility of FedMF. We also discuss the challenges of applying FedMF in practice for future research.

Graphs are an increasingly popular way to model real-world entities and relationships between them, ranging from social networks to data lineage graphs and biological datasets.
Queries over these large graphs often involve expensive subgraph traversals and complex analytical computations. These real-world graphs are often substantially more structured than a generic vertex-and-edge model would suggest, but this insight has remained mostly unexplored by existing graph engines for graph query optimization purposes. Therefore, in this work, we focus on leveraging structural properties of graphs and queries to automatically derive materialized graph views that can dramatically speed up query evaluation. We present KASKADE, the first graph query optimization framework to exploit materialized graph views for query optimization purposes. KASKADE employs a novel constraint-based view enumeration technique that mines constraints from query workloads and graph schemas, and injects them during view enumeration to significantly reduce the search space of views to be considered. Moreover, it introduces a graph view size estimator to pick the most beneficial views to materialize given a query set and to select the best query evaluation plan given a set of materialized views. We evaluate its performance over real-world graphs, including the provenance graph that we maintain at Microsoft to enable auditing, service analytics, and advanced system optimizations. Our results show that KASKADE substantially reduces the effective graph size and yields significant performance speedups (up to 50X), in some cases making otherwise intractable queries possible. Continue Reading…

### Document worth reading: “Artificial Intelligence and its Role in Near Future”

AI technology has a long history and is actively and constantly changing and growing. It focuses on intelligent agents: devices that perceive the environment and take actions to maximize their chances of achieving their goals. In this paper, we will explain the modern AI basics and various representative applications of AI.
In the context of the modern digitalized world, AI is the property of machines, computer programs, and systems to perform the intellectual and creative functions of a person, independently find ways to solve problems, and be able to draw conclusions and make decisions. Most artificial intelligence systems have the ability to learn, which allows them to improve their performance over time. Recent research on AI tools, including machine learning, deep learning and predictive analytics, is intended to increase planning, learning, reasoning, thinking and action-taking abilities. On this basis, the proposed research explores how human intelligence differs from artificial intelligence. Moreover, we critically analyze what the AI of today is capable of doing, why it still cannot reach human intelligence, and what open challenges exist for AI to reach and outperform the human level of intelligence. Furthermore, it will explore future predictions for artificial intelligence, based on which potential solutions will be recommended for the coming decades. Artificial Intelligence and its Role in Near Future Continue Reading…

### Magister Dixit

“You should embrace the Bayesian approach.” Kamil Bartocha (26 Apr 2015) Continue Reading…

### Magister Dixit

“The uninspected (inevitably) deteriorates.” Dwight David Eisenhower Continue Reading…

### NAS-Bench-101: Towards Reproducible Neural Architecture Search - implementation -

Recent advances in neural architecture search (NAS) demand tremendous computational resources, which makes it difficult to reproduce experiments and imposes a barrier to entry for researchers without access to large-scale computation. We aim to ameliorate these problems by introducing NAS-Bench-101, the first public architecture dataset for NAS research.
To build NAS-Bench-101, we carefully constructed a compact, yet expressive, search space, exploiting graph isomorphisms to identify 423k unique convolutional architectures. We trained and evaluated all of these architectures multiple times on CIFAR-10 and compiled the results into a large dataset of over 5 million trained models. This allows researchers to evaluate the quality of a diverse range of models in milliseconds by querying the pre-computed dataset. We demonstrate its utility by analyzing the dataset as a whole and by benchmarking a range of architecture optimization algorithms. Data and code for NAS-Bench-101 are here: https://github.com/google-research/nasbench Continue Reading…

### Document worth reading: “Characterizing HCI Research in China: Streams, Methodologies and Future Directions”

Human-computer Interaction (HCI) is an interdisciplinary research field involving multiple disciplines, such as computer science, psychology, social science and design. It studies the interaction between users and computers in order to better design technologies and solve real-life problems. This position paper characterizes HCI research in China by comparing it with international HCI research traditions. We discuss the current streams and methodologies of Chinese HCI research. We then propose future HCI research directions, such as including emergent users who have less access to technology and addressing cultural dimensions, in order to provide better technical solutions and support.
Characterizing HCI Research in China: Streams, Methodologies and Future Directions Continue Reading…

### R Packages worth a look

Easy Computation of Marketing Metrics with Different Analysis Axis (mmetrics)
Provides a mechanism for easy computation of marketing metrics. By default in this package, metrics for digital marketing (e.g. CTR (Click Through Rate …

D3 Dynamic Cluster Visualizations (klustR)
Used to create dynamic, interactive ‘D3.js’ based parallel coordinates and principal component plots in ‘R’. The plots make visualizing k-means or othe …

Store and Retrieve Data.frames in a Git Repository (git2rdata)
Make versioning of data.frame easy and efficient using git repositories.

Uplift Model Evaluation with Plots and Metrics (uplifteval)
Provides a variety of plots and metrics to evaluate uplift models including the ‘R uplift’ package’s Qini metric and Qini plot, a port of the ‘python p …

Continue Reading…

### Whats new on arXiv

CNNs, RNNs, GCNs, and CapsNets have shown significant insights in representation learning and are widely used in various text mining tasks such as large-scale multi-label text classification. However, most existing deep models for multi-label text classification consider either the non-consecutive and long-distance semantics or the sequential semantics, but how to consider both coherently is less studied. In addition, most existing methods treat output labels as independent, ignoring the hierarchical relations among them, which leads to loss of useful semantic information. In this paper, we propose a novel hierarchical taxonomy-aware and attentional graph capsule recurrent CNNs framework for large-scale multi-label text classification. Specifically, we first propose to model each document as a word-order-preserving graph-of-words and normalize it as a corresponding words-matrix representation, which preserves the non-consecutive, long-distance and local sequential semantics.
Then the words-matrix is input to the proposed attentional graph capsule recurrent CNNs for more effectively learning the semantic features. To leverage the hierarchical relations among the class labels, we propose a hierarchical taxonomy embedding method to learn their representations, and define a novel weighted margin loss by incorporating the label representation similarity. Extensive evaluations on three datasets show that our model significantly improves the performance of large-scale multi-label text classification in comparison with state-of-the-art approaches.

We present a document-grounded matching network (DGMN) for response selection that can power a knowledge-aware retrieval-based chatbot system. The challenges of building such a model lie in how to ground conversation contexts with background documents and how to recognize important information in the documents for matching. To overcome the challenges, DGMN fuses information in a document and a context into representations of each other, and dynamically determines whether grounding is necessary and the importance of different parts of the document and the context through hierarchical interaction with a response at the matching step. Empirical studies on two public data sets indicate that DGMN can significantly improve upon state-of-the-art methods and at the same time enjoys good interpretability.

The main purpose of incremental learning is to learn new knowledge while not forgetting the knowledge which has been learned before. At present, the main challenge in this area is catastrophic forgetting, namely that a network loses performance on old tasks after training on new tasks. In this paper, we introduce an ensemble method of incremental classifiers to alleviate this problem, which is based on the cosine distance between the output feature and a pre-defined center, and lets each task be preserved in a different network. During training, we make use of PEDCC-Loss to train the CNN network.
At test time, the prediction is determined by the cosine distance between the network's latent features and the pre-defined centers. The experimental results on EMNIST and CIFAR100 show that our method outperforms the recent LwF method, which uses knowledge distillation, and the iCaRL method, which keeps some old samples while training on new tasks. The method can achieve the goal of not forgetting old knowledge while training on new classes, and better solves the problem of catastrophic forgetting.

We focus on the problem of streaming recommender systems and explore novel collaborative filtering algorithms to handle the data dynamicity and complexity in a streaming manner. Although deep neural networks have demonstrated their effectiveness on recommendation tasks, there is a lack of exploration of integrating probabilistic models and deep architectures under streaming recommendation settings. Conjoining the complementary advantages of probabilistic models and deep neural networks could enhance both model effectiveness and the understanding of inference uncertainties. To bridge the gap, in this paper, we propose a Coupled Variational Recurrent Collaborative Filtering (CVRCF) framework based on the idea of Deep Bayesian Learning to handle the streaming recommendation problem. The framework jointly combines stochastic processes and deep factorization models under a Bayesian paradigm to model the generation and evolution of users’ preferences and items’ popularities. To ensure efficient optimization and streaming update, we further propose a sequential variational inference algorithm based on a cross variational recurrent neural network structure. Experimental results on three benchmark datasets demonstrate that the proposed framework performs favorably against the state-of-the-art methods in terms of both temporal dependency modeling and predictive accuracy. The learned latent variables also provide visualized interpretations for the evolution of temporal dynamics.
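The cosine-distance prediction rule used at test time by the incremental classifier described above can be sketched as follows; the function and the example centers are our own illustration, not code from the paper.

```python
import numpy as np

def predict_by_cosine(feature, centers):
    """Pick the class whose pre-defined center has the smallest cosine
    distance (i.e., largest cosine similarity) to the latent feature."""
    f = feature / np.linalg.norm(feature)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return int(np.argmax(c @ f))  # index of the most similar center

# Three illustrative pre-defined class centers (one per row); the feature
# below aligns most strongly with the center of class 1.
centers = np.eye(3)
print(predict_by_cosine(np.array([0.1, 0.9, 0.2]), centers))  # 1
```

Because each task's classes get their own fixed centers, the same rule can route a feature through the ensemble of per-task networks described in the abstract.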
How an agent can act optimally in stochastic, partially observable domains is a challenging problem. The standard approach to this issue is to first learn the domain model and then find the (near-)optimal policy based on the learned model. However, learning the model offline often requires storing the entire training data and cannot utilize the data generated in the planning phase. Furthermore, current research usually assumes the learned model is accurate or presupposes knowledge of the nature of the unobservable part of the world. In this paper, for systems with discrete settings, with the benefits of Predictive State Representations~(PSRs), a model-based planning approach is proposed where the learning and planning phases can both be executed online and no prior knowledge of the underlying system is required. Experimental results show that, compared to the state-of-the-art approaches, our algorithm achieved a high level of performance with no prior knowledge provided, along with the theoretical advantages of PSRs. Source code is available at https://…/PSR-MCTS-Online.

The field of deep learning is experiencing a trend towards producing reproducible research. Nevertheless, it is still often a frustrating experience to reproduce scientific results. This is especially true in the machine learning community, where it is considered acceptable to have black boxes in your experiments. We present DeepDIVA, a framework designed to facilitate easy experimentation and its reproduction. This framework allows researchers to share their experiments with others, while providing functionality that allows for easy experimentation, such as: boilerplate code, experiment management, hyper-parameter optimization, verification of data integrity, and visualization of data and results. Additionally, the code of DeepDIVA is well-documented and supported by several tutorials that allow a new user to quickly familiarize themselves with the framework.
Existing models for extractive summarization are usually trained from scratch with a cross-entropy loss, which does not explicitly capture the global context at the document level. In this paper, we aim to improve this task by introducing three auxiliary pre-training tasks that learn to capture the document-level context in a self-supervised fashion. Experiments on the widely used CNN/DM dataset validate the effectiveness of the proposed auxiliary tasks. Furthermore, we show that after pre-training, a clean model with simple building blocks is able to outperform previous carefully designed state-of-the-art models. Long session-based recommender systems have attracted much attention recently. Each user may create hundreds of click behaviors in a short time. To learn long-session item dependencies, previous sequential recommendation models resort either to data augmentation or to a left-to-right autoregressive training approach. While effective, an obvious drawback is that future user behaviors are always missing during training. In this paper, we claim that users’ future action signals can be exploited to boost recommendation quality. To model both past and future contexts, we investigate three augmentation techniques from both the data and model perspectives. Moreover, we carefully design two general neural network architectures: a pre-trained two-way neural network model and a deep contextualized model trained on a text gap-filling task. Experiments on four real-world datasets show that our proposed two-way neural network models can achieve competitive or even much better results. Empirical evidence confirms that modeling both past and future context is a promising way to offer better recommendation accuracy. Generative Adversarial Networks (GANs) learn to model data distributions through two unsupervised neural networks, each minimizing the objective function maximized by the other. 
We relate this game-theoretic strategy to earlier neural networks playing unsupervised minimax games. (i) GANs can be formulated as a special case of Adversarial Curiosity (1990) based on a minimax duel between two networks, one generating data through its probabilistic actions, the other predicting consequences thereof. (ii) We correct a previously published claim that Predictability Minimization (PM, 1990s) is not based on a minimax game. PM models data distributions through a neural encoder that maximizes the objective function minimized by a neural predictor of the code components. In this paper, the authors analyzed 50,000 keyword results collected from the localized Polish Google search engine. We propose a taxonomy for snippets displayed in search results: regular, rich, news, featured, and entity-type snippets. We observed some correlations between overlapping snippets for the same keywords. Results show that commercial keywords tend not to produce results with rich or entity-type snippets, whereas keywords that do yield such snippets are not commercial in nature. We found that a significant number of snippets are scholarly articles and rich-card carousels. We close with our conclusions and research limitations. Aspect-level sentiment classification aims to distinguish the sentiment polarities over one or more aspect terms in a sentence. Existing approaches mostly model different aspects in one sentence independently, ignoring the sentiment dependencies between different aspects. However, we find that such dependency information between different aspects can bring additional valuable information. In this paper, we propose a novel aspect-level sentiment classification model based on graph convolutional networks (GCN) which can effectively capture the sentiment dependencies between multiple aspects in one sentence. 
Our model first introduces a bidirectional attention mechanism with position encoding to model aspect-specific representations between each aspect and its context words, then employs a GCN over the attention mechanism to capture the sentiment dependencies between different aspects in one sentence. We evaluate the proposed approach on the SemEval 2014 datasets. Experiments show that our model outperforms the state-of-the-art methods. We also conduct experiments to evaluate the effectiveness of the GCN module, which indicate that the dependencies between different aspects are highly helpful in aspect-level sentiment classification. This paper presents a novel method which simultaneously learns the number of filters and network features repeatedly over multiple epochs. We propose a novel pruning loss that explicitly enforces the optimizer to focus on promising candidate filters while suppressing the contributions of less relevant ones. Meanwhile, we further propose to enforce diversity between filters; this diversity-based regularization term improves the trade-off between model size and accuracy. It turns out that the interplay between architecture and feature optimization improves the final compressed models, and the proposed method compares favorably to existing methods, in terms of both model size and accuracy, for a wide range of applications including image classification, image compression, and audio classification. Deep convolutional neural networks trained for image object categorization have shown remarkable similarities with representations found across the primate ventral visual stream. Yet, artificial and biological networks still exhibit important differences. Here we investigate one such property: increasing invariance to identity-preserving image transformations found along the ventral stream. 
Despite theoretical evidence that invariance should emerge naturally from the optimization process, we present empirical evidence that the activations of convolutional neural networks trained for object categorization are not robust to identity-preserving image transformations commonly used in data augmentation. As a solution, we propose data augmentation invariance, an unsupervised learning objective which improves the robustness of the learned representations by promoting similarity between the activations of augmented image samples. Our results show that this approach is a simple yet effective and efficient (10% increase in training time) way of increasing the invariance of the models while obtaining similar categorization performance. In response to the demand for higher computational power, the number of computing nodes in high-performance computers (HPC) is increasing rapidly. Exascale HPC systems are expected to arrive by 2020. With the drastic increase in the number of HPC system components, a sudden increase in the number of failures is expected, which consequently poses a threat to the continuous operation of HPC systems. Detecting failures as early as possible and, ideally, predicting them is a necessary step toward avoiding interruptions in HPC system operation. Anomaly detection is a well-known general-purpose approach for failure detection in computing systems. The majority of existing methods are designed for specific architectures, require adjustments to the computing systems’ hardware and software, need excessive information, or pose a threat to users’ and systems’ privacy. This work proposes a node failure detection mechanism based on a vicinity-based statistical anomaly detection approach using passively collected and anonymized system log entries. Application of the proposed approach on system logs collected over 8 months indicates an anomaly detection precision between 62% and 81%. 
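A vicinity-based statistical check of the kind just described can be sketched in a few lines. This is my own invented simplification, not the paper's method: compare each node's log-event rate with its neighbors' rates and flag large deviations; the counts, the 3-sigma threshold, and the notion of "vicinity" here are all illustrative assumptions.

```r
# counts of log entries per node in a time window; a node is flagged as anomalous
# when its count deviates from its vicinity's mean by more than k standard deviations
is_anomalous <- function(node_count, vicinity_counts, k = 3) {
  mu <- mean(vicinity_counts)
  sdev <- sd(vicinity_counts)
  abs(node_count - mu) > k * sdev
}

neighbors <- c(102, 98, 105, 97, 101, 99)  # healthy nodes log at similar rates
is_anomalous(100, neighbors)  # FALSE: within the normal range
is_anomalous(400, neighbors)  # TRUE: a failing node flooding the log
```

A real deployment would of course need to choose the window size, the vicinity definition, and the threshold carefully; this only shows the shape of the test.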
We present a novel method and analysis to train generative adversarial networks (GANs) in a stable manner. As shown in recent analyses, training is often undermined by the probability distribution of the data being zero on neighborhoods of the data space. We notice that the distributions of real and generated data should match even when they undergo the same filtering. Therefore, to address the limited-support problem, we propose to train GANs using different filtered versions of the real and generated data distributions. In this way, filtering does not prevent the exact matching of the data distribution, while helping training by extending the support of both distributions. As the filtering, we consider adding samples from an arbitrary distribution to the data, which corresponds to a convolution of the data distribution with the arbitrary one. We also propose to learn the generation of these samples so as to challenge the discriminator in the adversarial training. We show that our approach results in a stable and well-behaved training of even the original minimax GAN formulation. Moreover, our technique can be incorporated in most modern GAN formulations and leads to a consistent improvement on several common datasets. We study the problem of estimating the covariance matrix of a high-dimensional distribution when a small constant fraction of the samples can be arbitrarily corrupted. Recent work gave the first polynomial-time algorithms for this problem with near-optimal error guarantees for several natural structured distributions. Our main contribution is to develop faster algorithms for this problem whose running time nearly matches that of computing the empirical covariance. 
Given $N = \tilde{\Omega}(d^2/\epsilon^2)$ samples from a $d$-dimensional Gaussian distribution, an $\epsilon$-fraction of which may be arbitrarily corrupted, our algorithm runs in time $\tilde{O}(d^{3.26})/\mathrm{poly}(\epsilon)$ and approximates the unknown covariance matrix to optimal error up to a logarithmic factor. Previous robust algorithms with comparable error guarantees all have runtimes $\tilde{\Omega}(d^{2 \omega})$ when $\epsilon = \Omega(1)$, where $\omega$ is the exponent of matrix multiplication. We also provide evidence that improving the running time of our algorithm may require new algorithmic techniques. Data often have to be moved between servers and clients during the inference phase. For instance, modern virtual assistants collect data on mobile devices and send the data to remote servers for analysis. A related scenario is that clients have to access and download large amounts of data stored on servers in order to apply machine learning models. Depending on the available bandwidth, this data transfer can be a serious bottleneck, which can significantly limit the application of machine learning models. In this work, we propose a simple yet effective framework for selecting the parts of the input data needed for the subsequent application of a given neural network. Both the masks and the neural network are trained simultaneously such that good model performance is achieved while, at the same time, only a minimal amount of data is selected by the masks. During the inference phase, only the parts selected by the masks have to be transferred between the server and the client. Our experimental evaluation indicates that it is, for certain learning tasks, possible to significantly reduce the amount of data needed to be transferred without much affecting model performance. 
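The transfer-saving mask idea can be sketched abstractly. This is my own toy illustration, not the paper's framework: a fixed binary mask stands in for the learned one, and the client transmits only the masked-in feature values.

```r
# Only the features selected by the mask cross the network; the receiving side
# reassembles a full-length vector, with zeros where features were dropped.
apply_mask <- function(x, mask) x[mask]       # what the client actually sends
reconstruct <- function(values, mask, d) {    # what the server reassembles
  out <- numeric(d)
  out[mask] <- values
  out
}

x <- c(0.3, 2.5, 0.0, 1.7, 0.1)
mask <- c(FALSE, TRUE, FALSE, TRUE, FALSE)    # stands in for a learned mask keeping 2 of 5 features
sent <- apply_mask(x, mask)
length(sent) / length(x)                      # fraction of data transferred: 0.4
```

In the paper the mask is trained jointly with the network, so the features kept are exactly those the model needs; the sketch only shows the bandwidth accounting.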
Least-mean-squares (LMS) solvers such as Linear/Ridge/Lasso Regression, SVD, and Elastic-Net not only solve fundamental machine learning problems, but are also the building blocks of a variety of other methods, such as decision trees and matrix factorizations. We suggest an algorithm that gets a finite set of $n$ $d$-dimensional real vectors and returns a weighted subset of $d+1$ vectors whose sum is \emph{exactly} the same. The proof of Caratheodory’s Theorem (1907) computes such a subset in $O(n^2d^2)$ time and is thus not used in practice. Our algorithm computes this subset in $O(nd)$ time, using $O(\log n)$ calls to Caratheodory’s construction on small but ‘smart’ subsets. This is based on a novel paradigm of fusion between different data summarization techniques, known as sketches and coresets. As an example application, we show how it can be used to boost the performance of existing LMS solvers, such as those in the scikit-learn library, by up to 100x. Generalization to streaming and distributed (big) data is trivial. Extensive experimental results and complete open source code are also provided. Effective modeling of electronic health records (EHR) is rapidly becoming an important topic in both academia and industry. A recent study showed that utilizing the graphical structure underlying EHR data (e.g. the relationship between diagnoses and treatments) improves the performance of prediction tasks such as heart failure diagnosis prediction. However, EHR data do not always contain complete structure information. Moreover, when it comes to claims data, structure information is completely unavailable to begin with. Under such circumstances, can we still do better than just treating EHR data as a flat-structured bag of features? In this paper, we study the possibility of utilizing the implicit structure of EHR by using the Transformer for prediction tasks on EHR data. 
Specifically, we argue that the Transformer is a suitable model to learn the hidden EHR structure, and propose the Graph Convolutional Transformer, which uses data statistics to guide the structure learning process. Our model empirically demonstrated superior prediction performance to previous approaches on both synthetic data and publicly available EHR data on encounter-based prediction tasks such as graph reconstruction and readmission prediction, indicating that it can serve as an effective general-purpose representation learning algorithm for EHR data. There are many surprising and perhaps counter-intuitive properties of the optimization of deep neural networks. We propose and experimentally verify a unified phenomenological model of the loss landscape that incorporates many of them. High dimensionality plays a key role in our model. Our core idea is to model the loss landscape as a set of high-dimensional \emph{wedges} that together form a large-scale, inter-connected structure towards which optimization is drawn. We first show that hyperparameter choices, such as learning rate, network width, and $L_2$ regularization, affect the path the optimizer takes through the landscape in similar ways, influencing the large-scale curvature of the regions the optimizer explores. We then predict and demonstrate new counter-intuitive properties of the loss landscape: we show the existence of low-loss subspaces connecting a set (not only a pair) of solutions, and verify it experimentally. Finally, we analyze recently popular ensembling techniques for deep networks in the light of our model. How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? 
In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that ‘translationese’ is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.

### A Model and Simulation of Emotion Dynamics

(This article was first published on R on Will Hipson, and kindly contributed to R-bloggers)

Emotion dynamics is the study of how emotions change over time. Sometimes our feelings are quite stable, but at other times capricious. Measuring and predicting these patterns for different people is something of a Holy Grail for emotion researchers. In particular, some researchers aspire to discover mathematical laws that capture the complexity of our inner emotional experiences – much like physicists divining the laws that govern objects in the natural environment. These discoveries would revolutionize our understanding of our everyday feelings and of when our emotions can go awry. 
This series of blog posts, which I kicked off earlier this month with a simulation of emotions during basketball games, is inspired by researchers like Peter Kuppens and Tom Hollenstein (to name a few) who have collected and analyzed reams of intensive self-reports on people’s feelings from one moment to the next. My approach is to reverse engineer these insights and generate models that simulate emotions evolving over time – like this:

## Affective State Space

We start with the affective state space – the theoretical landscape on which our conscious feelings roam free. This space is represented as two-dimensional, although we acknowledge that this fails to capture all aspects of conscious feeling. The first dimension, represented along the x-axis, is valence, and this refers to how unpleasant vs. pleasant we feel. The second dimension, represented along the y-axis, is arousal. Somewhat less intuitive, arousal refers to how deactivated/sluggish/sleepy vs. activated/energized/awake we feel. At any time, our emotional state can be defined in terms of valence and arousal. So if you’re feeling stressed, you would be low in valence and high in arousal. Let’s say you’re serene and calm; then you would be high in valence and low in arousal. Most of the time, we feel moderately high valence and moderate arousal (i.e., content), but if you’re the type of person who is chronically stressed, this would be different. This is all well and good when we think about how we’re feeling right now, but it’s also worth considering how our emotions are changing. On a regular day, our emotions undergo minor fluctuations – sometimes in response to minor hassles or victories, and sometimes for no discernible reason. In this small paragraph, I’ve laid out a number of parameters, all of which vary between different people:

• Attractor: Our typical emotional state. At any given moment, our feelings are pulled toward this state. Some people tend to be happier, whereas others are less happy. 
• Stability: How emotionally stable one is. Some people are more emotionally stable than others. Even in the face of adversity, an emotionally stable person keeps their cool.

• Dispersion: The range of our emotional landscape. Some people experience intense highs and lows, whereas others persist somewhere in the middle.

We’ll keep all of this in mind for the simulation. We’ll start with a fairly simple simulation with 100 hypothetical people. We’ll need the following packages.

library(psych)
library(tidyverse)
library(sn)

And then we’ll create a function that performs the simulation. Note that each person i has their own attractor, recovery rate, stability, and dispersion. For now we’ll just model random fluctuations in emotions, a sort of Brownian motion. You can imagine our little simulatons (fun name for the hypothetical people in the simulation) sitting around on an average day doing nothing in particular.

simulate_affect <- function(n = 2, time = 250, negative_event_time = NULL) {
dt <- data.frame(matrix(nrow = time, ncol = 1))
colnames(dt) <- "time"
dt$time <- 1:time

valence <- data.frame(matrix(nrow = time, ncol = 0))
arousal <- data.frame(matrix(nrow = time, ncol = 0))

for(i in 1:n) {
attractor_v <- rnorm(1, mean = 3.35, sd = .75)
instability_v <- sample(3:12, 1, replace = TRUE, prob = c(.18, .22, .18, .15, .8, .6, .5, .4, .2, .1))
dispersion_v <- abs(rsn(1, xi = .15, omega = .02, alpha = -6) * instability_v) #rsn simulates a skewed distribution.
if(!is.null(negative_event_time)) {
recovery_rate <- sample(1:50, 1, replace = TRUE) + negative_event_time
negative_event <- (dt$time %in% negative_event_time:recovery_rate) * seq.int(50, 1, -1)
} else {
negative_event <- 0
}
valence[[i]] <- ksmooth(x = dt$time,
y = (negative_event * -.10) + arima.sim(list(order = c(1, 0, 0),
ar = .50),
n = time),
bandwidth = time/instability_v, kernel = "normal")$y * dispersion_v + attractor_v
# instability is modelled in the bandwidth term of ksmooth, such that higher instability results in a higher bandwidth (greater fluctuation)
# dispersion scales the white noise (arima) term, such that there are higher peaks and troughs at higher dispersion
attractor_a <- rnorm(1, mean = .50, sd = .75) + sqrt(instability_v)
# the arousal attractor depends on instability, because high instability is associated with higher-arousal states
instability_a <- instability_v + sample(-1:1, 1, replace = TRUE)
dispersion_a <- abs(rsn(1, xi = .15, omega = .02, alpha = -6) * instability_a)
arousal[[i]] <- ksmooth(x = dt$time,
y = (negative_event * .075) + arima.sim(list(order = c(1, 0, 0),
ar = .50),
n = time),
bandwidth = time/instability_a, kernel = "normal")$y * dispersion_a + attractor_a
}
valence[valence > 6] <- 6
valence[valence < 0] <- 0
arousal[arousal > 6] <- 6
arousal[arousal < 0] <- 0
colnames(valence) <- paste0("valence_", 1:n)
colnames(arousal) <- paste0("arousal_", 1:n)
dt <- cbind(dt, valence, arousal)
return(dt)
}

set.seed(190625)

emotions <- simulate_affect(n = 100, time = 300)

emotions %>%
select(valence_1, arousal_1) %>%
head()

## valence_1 arousal_1
## 1 1.328024 5.380643
## 2 1.365657 5.385633
## 3 1.401849 5.390470
## 4 1.436284 5.395051
## 5 1.468765 5.399162
## 6 1.499062 5.402752

So we see the first six rows of participant 1’s valence and arousal. But if we want to plot these across multiple simulatons, we need to wrangle the data into long form. We’ll also compute some measures of within-person deviation. The Root Mean Square Successive Difference (RMSSD) takes into account gradual shifts in the mean. Those who are more emotionally unstable will have a higher RMSSD. For the two dimensions (valence and arousal) we’ll just compute the mean RMSSD.

emotions_long <- emotions %>%
gather(key, value, -time) %>%
separate(key, into = c("dimension", "person"), sep = "_") %>%
spread(dimension, value) %>%
group_by(person) %>%
mutate(rmssd_v = rmssd(valence),
rmssd_a = rmssd(arousal),
rmssd_total = mean(rmssd_v + rmssd_a)) %>%
ungroup()

Let’s see what this looks like for valence and arousal individually.

emotions_long %>%
ggplot(aes(x = time, y = valence, group = person, color = rmssd_v)) +
geom_line(size = .75, alpha = .75) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_long$rmssd_v)) +
labs(x = "Time",
y = "Valence",
color = "Instability",
title = "Simulated Valence Scores over Time for 100 People") +
theme_minimal(base_size = 16)

emotions_long %>%
ggplot(aes(x = time, y = arousal, group = person, color = rmssd_a)) +
geom_line(size = .75, alpha = .75) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_long$rmssd_a)) +
labs(x = "Time",
y = "Arousal",
color = "Instability",
title = "Simulated Arousal Scores over Time for 100 People") +
theme_minimal(base_size = 16)

We see that some lines are fairly flat and others fluctuate more widely. More importantly, most people are somewhere in the middle. We can get a sense of one simulated person’s affective state space as well. The goal here is to mimic the kinds of models shown in Kuppens, Oravecz, and Tuerlinckx (2010):

emotions_long %>%
filter(person %in% sample(1:100, 6, replace = FALSE)) %>%
ggplot(aes(x = valence, y = arousal, group = person)) +
geom_path(size = .75) +
scale_x_continuous(limits = c(0, 6)) +
scale_y_continuous(limits = c(0, 6)) +
labs(x = "Valence",
y = "Arousal",
title = "Affective State Space for Six Randomly Simulated People") +
facet_wrap(~person) +
theme_minimal(base_size = 18) +
theme(plot.title = element_text(size = 18, hjust = .5))

## Animating the Affective State Space

To really appreciate what’s going on, we need to animate this over time. I’ll add some labels to the affective state space so that it’s easier to interpret what one might be feeling at that time. I’ll also add color to show which individuals are more unstable according to RMSSD. 
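Since the plots colour each person by RMSSD, it may help to see how little machinery the statistic involves. A minimal base-R version (psych's rmssd, loaded earlier via library(psych), should agree with this up to its defaults):

```r
# RMSSD: the square root of the mean of squared successive (lag-1) differences
rmssd_manual <- function(x) sqrt(mean(diff(x)^2))

flat  <- rep(3, 10)                   # perfectly stable person
jumpy <- c(3.0, 3.2, 2.9, 3.1, 3.0)  # small moment-to-moment swings
rmssd_manual(flat)   # 0: no successive change at all
rmssd_manual(jumpy)  # about 0.21
```

Unlike a plain standard deviation, RMSSD only reacts to moment-to-moment change, so a slow drift in the mean barely moves it, which is exactly why it works as an instability measure here.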
library(gganimate)

p <- emotions_long %>%
ggplot(aes(x = valence, y = arousal, color = rmssd_total)) +
annotate("text", x = c(1.5, 4.5, 1.5, 4.5), y = c(1.5, 1.5, 4.5, 4.5), label = c("Gloomy", "Calm", "Anxious", "Happy"),
size = 10, alpha = .50) +
annotate("rect", xmin = 0, xmax = 3, ymin = 0, ymax = 3, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 3, xmax = 6, ymin = 0, ymax = 3, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 0, xmax = 3, ymin = 3, ymax = 6, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 3, xmax = 6, ymin = 3, ymax = 6, alpha = 0.25, color = "black", fill = "white") +
geom_point(size = 3.5) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_long$rmssd_total)) +
scale_x_continuous(limits = c(0, 6)) +
scale_y_continuous(limits = c(0, 6)) +
labs(x = "Valence",
y = "Arousal",
color = "Instability",
title = 'Time: {round(frame_time)}') +
transition_time(time) +
theme_minimal(base_size = 18)

ani_p <- animate(p, nframes = 320, end_pause = 20, fps = 16, width = 550, height = 500)

ani_p

## There’s a Storm Coming…

Our simulation does a pretty good job of emulating the natural ebb and flow of emotions, but we know that emotions can be far more volatile. Let’s subject our simulation to a negative event. Perhaps all 100 simulatons co-authored a paper that just got rejected. In the function simulate_affect, there’s an optional argument negative_event_time that causes a negative event to occur at the specified time. For this, we need to consider one more emotion dynamics parameter:

• Recovery rate: How quickly one recovers from an emotional event. If something bad happens, how long does it take to return to the attractor? You can see how I’ve modelled this parameter in the function above.
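Stripped of the smoothing and scaling, the event logic inside simulate_affect boils down to a decaying pulse that switches on between the event and the recovery time. Here is a simplified standalone version (the fixed 30-step recovery window is illustrative; the function draws the offset at random from 1:50):

```r
time_steps <- 1:300
negative_event_time <- 150
recovery_time <- negative_event_time + 30  # simulate_affect samples this offset from 1:50

# a pulse that starts at 50 and decays to 1 across the recovery window, 0 elsewhere
window <- negative_event_time:recovery_time
negative_event <- numeric(length(time_steps))
negative_event[window] <- seq(50, 1, length.out = length(window))

# in the function, valence gets negative_event * -.10 added (a downward push)
# and arousal gets negative_event * .075 (an upward push) before smoothing
range(negative_event[window])  # from 1 up to 50
```

The pulse shape is why the reaction is sharpest at the event itself and fades smoothly as each simulaton approaches its recovery time.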

So we’ll run the simulation with a negative event arising at t = 150. The negative event will cause a downward spike in valence and an upward spike in arousal.

emotions_event <- simulate_affect(n = 100, time = 300, negative_event_time = 150)

emotions_event_long <- emotions_event %>%
gather(key, value, -time) %>%
separate(key, into = c("dimension", "person"), sep = "_") %>%
spread(dimension, value) %>%
group_by(person) %>%
mutate(rmssd_v = rmssd(valence),
rmssd_a = rmssd(arousal),
rmssd_total = mean(rmssd_v + rmssd_a)) %>%
ungroup()

emotions_event_long %>%
ggplot(aes(x = time, y = valence, group = person, color = rmssd_v)) +
geom_line(size = .75, alpha = .75) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_event_long$rmssd_v)) +
labs(x = "Time",
y = "Valence",
color = "Instability",
title = "Simulated Valence Scores over Time for 100 People") +
theme_minimal(base_size = 16)

emotions_event_long %>%
ggplot(aes(x = time, y = arousal, group = person, color = rmssd_a)) +
geom_line(size = .75, alpha = .75) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_event_long$rmssd_a)) +
labs(x = "Time",
y = "Arousal",
color = "Instability",
title = "Simulated Arousal Scores over Time for 100 People") +
theme_minimal(base_size = 16)

It’s pretty clear that something bad happened. Of course, some of our simulatons are unflappable, but most experienced a drop in valence and spike in arousal that we might identify as anxiety. Again, let’s visualize this evolving over time. Pay close attention to when the timer hits 150.

p2 <- emotions_event_long %>%
ggplot(aes(x = valence, y = arousal, color = rmssd_total)) +
annotate("text", x = c(1.5, 4.5, 1.5, 4.5), y = c(1.5, 1.5, 4.5, 4.5), label = c("Gloomy", "Calm", "Anxious", "Happy"),
size = 10, alpha = .50) +
annotate("rect", xmin = 0, xmax = 3, ymin = 0, ymax = 3, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 3, xmax = 6, ymin = 0, ymax = 3, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 0, xmax = 3, ymin = 3, ymax = 6, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 3, xmax = 6, ymin = 3, ymax = 6, alpha = 0.25, color = "black", fill = "white") +
geom_point(size = 3.5) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_event_long$rmssd_total)) +
scale_x_continuous(limits = c(0, 6)) +
scale_y_continuous(limits = c(0, 6)) +
labs(x = "Valence",
y = "Arousal",
color = "Instability",
title = 'Time: {round(frame_time)}') +
transition_time(time) +
theme_minimal(base_size = 18)

ani_p2 <- animate(p2, nframes = 320, end_pause = 20, fps = 16, width = 550, height = 500)

ani_p2

The overall picture is that some are more emotionally resilient than others. As of now, all the simulatons return to their baseline attractor, but we would realistically expect some to stay stressed or gloomy following bad news. In the coming months I’ll be looking into how to incorporate emotion regulation into the simulation. For example, maybe some of the simulatons use better coping strategies than others? I’m also interested in incorporating appraisal mechanisms that allow for different reactions depending on the type of emotional stimulus.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Word2Vec Text Mining & Parallelization in R. Join MünsteR for our next meetup!

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

In our next MünsteR R-user group meetup on Tuesday, July 9th, 2019, we will have two exciting talks about Word2Vec Text Mining & Parallelization in R!

You can RSVP here: https://www.meetup.com/de-DE/Munster-R-Users-Group/events/262236134/

Maren Reuter from viadee AG will give an introduction into the functionality and use of the Word2Vec algorithm in R.

Text data in its raw form cannot be used as input for machine learning algorithms. Therefore, an information extraction method is required to process plain text into an appropriate representation. By exploiting the semantic and syntactic structure of the text data, the importance of a word can be defined and represented as a vector in a vector space; that is, the vector can be seen as a numerical "importance" value. There are two predominant approaches to representing words as vectors: either by using word frequency (n-grams), or by using a prediction model to estimate the relatedness of words. The Word2Vec algorithm by Mikolov et al. belongs to the latter. This talk will show how the algorithm works and how it can be used in practice.
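As a toy illustration of the "words as vectors" idea, relatedness between words is typically measured with cosine similarity. The three-dimensional vectors below are invented for the example; a real model such as Word2Vec would learn vectors with hundreds of dimensions from a corpus:

```r
# cosine similarity: the cosine of the angle between two vectors (1 = same direction)
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

vecs <- list(
  king  = c(0.9, 0.8, 0.1),  # hypothetical embeddings, not learned values
  queen = c(0.8, 0.9, 0.1),
  apple = c(0.1, 0.1, 0.9)
)
cosine_sim(vecs$king, vecs$queen)  # high: related words point in similar directions
cosine_sim(vecs$king, vecs$apple)  # low: unrelated words
```

Cosine similarity ignores vector length and compares only direction, which is why it is the standard choice for comparing learned word embeddings.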

Maren Reuter is an IT-Consultant at viadee AG and part of the company’s Artificial Intelligence research group. She got her Master’s degree in Information Systems at the University of Münster with a focus in Data Analytics. In her Master thesis she dealt with text mining techniques to predict maintenance tasks in agile software projects. For this purpose, she used the Word2Vec algorithm to build a word vector representation model.


### Corporate America is growing more dissatisfied with Donald Trump

Earnings calls suggest that company executives are fed up with Mr Trump’s tariffs and trade wars

## June 25, 2019

### Geelong and the curse of the bye

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

This week we return to Australian Rules Football, the R package fitzRoy and some statistics to ask – why can’t Geelong win after a bye?

(with apologies to long-time readers who used to come for the science)

Code and a report for this blog post are available on GitHub.

First, some background. In 2011 the AFL expanded from 16 to 17 teams with the addition of the Gold Coast Suns. In the same year, a bye round (a week where some teams don’t play) was reintroduced to the competition. For the purposes of this discussion, we are interested only in bye rounds since 2011, and during the regular home/away season.

You will often hear footy fans claim – sometimes with very little evidence – that “we don’t go well after the bye.” For one team, this is certainly true. That team is Geelong, who have not won a game in the round following a bye since Round 7 in 2011.

Is this unusual? If so, does the available game data suggest any reason?

We start as ever with the excellent fitzRoy package and use get_match_results() to – well, get the match results.

Next, we can use some tidyverse magic to obtain all games in the round immediately before and after a bye. This looks long and complicated, so here’s a version with annotations in the comments to explain what’s going on:

results_bye <- results %>%
  # choose the desired columns
  select(Season, Round, Date, Venue, Home.Team, Away.Team, Margin) %>%
  # create one column for teams, another to indicate whether home or away
  gather(Status, Team, -Season, -Round, -Margin, -Date, -Venue) %>%
  # filter for 2011 onwards and only home/away games
  filter(Season > 2010, grepl("^R", Round)) %>%
  # create a column with the number of each round
  separate(Round, into = c("prefix", "suffix"), sep = 1) %>%
  mutate(suffix = as.numeric(suffix)) %>%
  # for each team's games in a season find games
  # the week before and after a bye
  arrange(Season, Team, suffix) %>%
  group_by(Season, Team) %>%
  mutate(bye = case_when(
    suffix - lead(suffix) == -2 ~ "before",
    suffix - lag(suffix) == 2 ~ "after",
    TRUE ~ as.character(suffix)
  ),
  # margins are with respect to home team so negate them if away
  Margin = ifelse(Status == "Away.Team", -Margin, Margin)) %>%
  ungroup() %>%
  # filter for the pre- and post-bye games
  filter(bye %in% c("before", "after")) %>%
  # calculate result
  mutate(Result = case_when(
    Margin > 0 ~ "W",
    Margin < 0 ~ "L",
    TRUE ~ "D"
  )) %>%
  # recreate the Round column
  unite(Round, prefix, suffix, sep = "")


Let’s confirm that Geelong have not won after a bye in a long time:

results_bye %>%
  filter(Team == "Geelong", bye == "after")

Season Round Date Venue Margin Status Team bye Result
2011 R7 2011-05-07 Kardinia Park 66 Home.Team Geelong after W
2011 R23 2011-08-27 Kardinia Park -13 Home.Team Geelong after L
2012 R13 2012-06-22 S.C.G. -6 Away.Team Geelong after L
2013 R13 2013-06-23 Gabba -5 Away.Team Geelong after L
2014 R9 2014-05-17 Subiaco -32 Away.Team Geelong after L
2016 R16 2016-07-08 Kardinia Park -38 Home.Team Geelong after L
2017 R13 2017-06-15 Subiaco -13 Away.Team Geelong after L
2018 R15 2018-06-29 Docklands -2 Away.Team Geelong after L
2019 R14 2019-06-22 Adelaide Oval -11 Away.Team Geelong after L

How does that compare with other teams?

We see all combinations: teams that seem to win more often after a bye, teams that win less often, and teams for which a bye makes no difference. However, Geelong certainly has the worst post-bye win/loss record.

We can ask: is the win/loss count in pre-bye games significantly different from that in post-bye games? One approach is to construct 2×2 contingency tables and perform Fisher’s exact test.
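To see what fisher.test is doing under the hood, Geelong’s table can be checked by hand. A stdlib-only Python sketch, with counts read off the tables in this post (7 wins and 2 losses before the bye, 1 win and 8 losses after):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one."""
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    denom = comb(n, row1)

    def prob(k):  # probability that cell (1,1) equals k, margins fixed
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Geelong: (wins, losses) before the bye vs (wins, losses) after
p = fisher_exact_2x2(7, 2, 1, 8)
print(round(p, 5))  # → 0.01522, matching the p.value reported for Geelong
```

Note that R’s fisher.test reports a conditional maximum-likelihood odds ratio rather than the simple sample odds ratio, but the two-sided p-value agrees with this direct calculation.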

With some more tidyverse magic we can nest the data for each team, generate the tests and summarise the results. This approach is explained very nicely in “Running a model on separate groups” over at Simon Jackson’s blog.

Only Geelong has p < 0.05, suggesting that there is something interesting about the win/loss count after the bye. We’ll just show the first 5 teams here.

results_bye %>%
  count(Team, bye, Result) %>%
  nest(-Team) %>%
  mutate(data = map(data, . %>% spread(Result, n) %>% select(2:3)),
         fisher = map(data, fisher.test),
         summary = map(fisher, tidy)) %>%
  select(Team, summary) %>%
  unnest() %>%
  select(-method, -alternative) %>%
  arrange(p.value) %>%
  pander(split.table = Inf)

Team estimate p.value conf.low conf.high
Geelong 21.4 0.01522 1.533 1396
Sydney 5.43 0.1698 0.6027 79.83
North Melbourne 0.1736 0.2941 0.002835 2.438
Richmond 3.68 0.3469 0.4059 43.34
Collingwood 3.719 0.3498 0.4048 53.81

We can extend the previous visualisation by further breaking down games into home and away:

Now we see that of Geelong’s 8 post-bye losses, 6 were away games. Port Adelaide have a similar record. Then again, Brisbane have not won an away game before the bye, but you don’t hear anyone talking about Brisbane “not going well before the bye”.

When we look at those 6 away post-bye losses, one was in Melbourne – which in terms of travel distance is not very far from Geelong. The other five were “genuine” away games in Sydney, Brisbane, Adelaide and Perth (2).

Season Round Date Venue Margin Status Team bye Result
2012 R13 2012-06-22 S.C.G. -6 Away.Team Geelong after L
2013 R13 2013-06-23 Gabba -5 Away.Team Geelong after L
2014 R9 2014-05-17 Subiaco -32 Away.Team Geelong after L
2017 R13 2017-06-15 Subiaco -13 Away.Team Geelong after L
2018 R15 2018-06-29 Docklands -2 Away.Team Geelong after L
2019 R14 2019-06-22 Adelaide Oval -11 Away.Team Geelong after L

In addition, three of the losses were against a side also coming off the bye, but playing at home.

Season Round Date Venue Margin Status Team bye Result
2012 R13 2012-06-22 S.C.G. -6 Away.Team Geelong after L
2014 R9 2014-05-17 Subiaco -32 Away.Team Geelong after L
2017 R13 2017-06-15 Subiaco -13 Away.Team Geelong after L

What about away games before the bye? One loss in Melbourne, four wins in Melbourne and one win in Sydney, versus the GWS Giants who at that time were a new and struggling team.

Season Round Date Venue Margin Status Team bye Result
2011 R5 2011-04-26 M.C.G. 19 Away.Team Geelong before W
2011 R21 2011-08-14 Football Park 11 Away.Team Geelong before W
2012 R11 2012-06-08 Docklands 12 Away.Team Geelong before W
2013 R11 2013-06-08 Sydney Showground 59 Away.Team Geelong before W
2016 R14 2016-06-25 Docklands -3 Away.Team Geelong before L
2019 R12 2019-06-07 M.C.G. 67 Away.Team Geelong before W

Our last question: for games after a bye, what was the expected result? By expected we mean “according to the bookmakers”. We can join the match results with historical betting data, assign the expected result (win or loss) to Geelong according to their odds, then compare expected versus actual results. This reveals that six of the eight post-bye losses were unexpected – not surprising as Geelong has been a strong team in the period from 2011 to now.

bye Result Expected n
after L L 2
after L W 6
after W W 1
before L L 1
before L W 1
before W L 1
before W W 6
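The expected-versus-actual cross-tabulation above boils down to counting (bye, Result, Expected) triples after the join. A minimal Python sketch with made-up rows (the post joins real match results to historical betting odds; these rows are stand-ins):

```python
from collections import Counter

# Hypothetical joined rows; "Expected" is "W" when the bookmakers
# had Geelong as favourites for that game.
games = [
    {"bye": "after",  "Result": "L", "Expected": "W"},
    {"bye": "after",  "Result": "L", "Expected": "W"},
    {"bye": "after",  "Result": "W", "Expected": "W"},
    {"bye": "before", "Result": "W", "Expected": "W"},
]
counts = Counter((g["bye"], g["Result"], g["Expected"]) for g in games)
for (bye, result, expected), n in sorted(counts.items()):
    print(bye, result, expected, n)
# after L W 2
# after W W 1
# before W W 1
```

Rows where Result differs from Expected are the "unexpected" outcomes discussed above.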

In summary
Historically, Geelong do seem more prone to losing after a bye round than other teams, and those losses have been unexpected in terms of betting odds.

However, a large proportion of their post-bye losses have been interstate away games, versus strong opponents. Away games before the bye have been either in Melbourne, or versus weaker opponents.

Scheduling may therefore have played a role in Geelong’s post-bye win/loss record.


### Let’s get it right

AI systems are being deployed to support human decision making in high-stakes domains. In many cases, the human and AI form a team, in which the human makes decisions after reviewing the AI’s inferences. A successful partnership requires that the human develops insights into the performance of the AI system, including its failures. We study the influence of updates to an AI system in this setting. While updates can increase the AI’s predictive performance, they may also lead to changes that are at odds with the user’s prior experiences and confidence in the AI’s inferences, thereby hurting the overall team performance. We introduce the notion of the compatibility of an AI update with prior user experience and present methods for studying the role of compatibility in human-AI teams. Empirical results on three high-stakes domains show that current machine learning algorithms do not produce compatible updates. We propose a re-training objective to improve the compatibility of an update by penalizing new errors. The objective offers full leverage over the performance/compatibility tradeoff, enabling more compatible yet accurate updates.
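One way to make the compatibility notion concrete (this is my own illustrative definition, not necessarily the paper’s exact formulation) is the fraction of examples the old model got right that the updated model still gets right:

```python
def compatibility(old_preds, new_preds, labels):
    """Fraction of examples the old model classified correctly that the
    updated model also classifies correctly (1.0 = fully compatible)."""
    kept = total = 0
    for o, n, y in zip(old_preds, new_preds, labels):
        if o == y:            # the old model was right here...
            total += 1
            kept += (n == y)  # ...did the update preserve that?
    return kept / total if total else 1.0

labels    = [0, 1, 1, 0, 1, 0]
old_preds = [0, 1, 0, 0, 1, 1]  # 4 of 6 correct
new_preds = [0, 1, 1, 1, 1, 0]  # 5 of 6 correct, but one new error
print(compatibility(old_preds, new_preds, labels))  # → 0.75
```

The toy update is more accurate overall yet only 75% compatible: it breaks one case the user had learned to trust, which is exactly the tradeoff the abstract describes.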
How much can we infer about a person’s looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural videos of people speaking from the Internet/YouTube. During training, our model learns audiovisual, voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. Our reconstructions, obtained directly from audio, reveal the correlations between faces and voices. We evaluate and numerically quantify how, and in what manner, our Speech2Face reconstructions from audio resemble the true face images of the speakers.
Tech companies should ensure their ethics boards are guided by universal human rights and resist bad faith arguments about diversity and free speech. Who should be on the ethics board of a tech company that’s in the business of artificial intelligence (A.I.)? Given the attention to the devastating failure of Google’s proposed Advanced Technology External Advisory Council (ATEAC) earlier this year, which was announced and then canceled within a week, it’s crucial to get to the bottom of this question. Google, for one, admitted it’s ‘going back to the drawing board.’
Everyone talks about diversity, but progress is slow in the real world of business and IT. Sometimes humans talk a good game, but don’t follow through on their promises. Or, it might just be that many of us are well-meaning but unable to achieve a goal because of practical, real-world limitations. A couple of days per week I receive at least one public relations pitch for an article idea about how to improve diversity in an IT team. Go to any tech conference and it’s a good bet that someone will be preaching the need for inclusive hiring and team building, and those sessions are well attended. It’s been clearly established that diversity/inclusiveness is critical to ensuring that artificial intelligence and machine learning applications aren’t plagued by gender, racial, geographic, or other bias. GIGO — garbage in, garbage out — is as relevant today as it was 30 years ago.
A 3-part series exploring ethics issues in Artificial Intelligence. In Part 1, I explored the what and why of ethics of AI. Part 2 looked at the ethics of what AI is. In Part 3, I wrap up the series with an exploration of the ethics of what AI does and what AI impacts. Across the topics, be it safety or civil rights, be it effects on human behavior or risk of malicious use, a common theme emerges – The need to revisit the role of technologists in dealing with the effects of what they build; going beyond broad tenets like ‘avoid harm’ and ‘respect privacy’, to establishing cause-and-effect, and identifying the vantage points we uniquely hold OR don’t. I had a sense early on that Part 3 would be tough to do justice to as a sub-10 minute post. But 3-parts was just right and over 10 mins too long; so bear with me while I try to prove intuition wrong and fail! Let us explore.
I’m categorically ambivalent on the subject of technological advances/intrusion into our everyday lives. That includes AI and higher machine learning. On the one hand, AI is used and will be used for brilliant life-changing developments in endeavors like medicine, generative design, scientific research, and a lot of what is to come of which we really have no idea.
It’s certainly easy to tap that suggestion bubble, but is it the right thing to do – for humanity?
Have you ever had a moment when you realized something you thought was so obviously true, suddenly became so obviously false, and in that instant your whole understanding of the world changed? I think of the day I realized my parents were actually fallible human beings. Up until then, part of my subconscious was convinced they were unquestionable demigods. It wasn’t until my 30’s that I started to really question the validity of my opinions and recognize that I was biased towards their viewpoints – both on the world and about me. I gave more weight to their ideas on politics, religion, morals, and even my own capabilities and characteristics than I did to the opinions or facts presented by my experiences. That was the moment I began making it a point to think harder. To examine why I believed something and attempt to form less parentally biased opinions of my own.

### Inspiration from a waterfall of pie charts: illustrating hierarchies

Reader Antonio R. forwarded a tweet about the following "waterfall of pie charts" to me:

Maarten Lamberts loved these charts (source: here).

I am immediately attracted to the visual thinking behind this chart. The data are presented in a hierarchy with three levels. The levels are nested in the sense that the pieces in each pie chart add up to 100%. From the first level to the second, the category of freshwater is sub-divided into three parts. From the second level to the third, the "others" subgroup under freshwater is sub-divided into five further categories.

The designer faces a twofold challenge: presenting the proportions at each level, and integrating the three levels into one graphic. The second challenge is harder to master.

The solution here is quite ingenious. A waterfall/waterdrop metaphor is used to link each layer to the one below. It visually conveys the hierarchical structure.

***

There remains a little problem: a confusion between part and whole. The link between levels should be that one part of the upper level becomes the whole of the lower level. Because of the color scheme, it appears that the part above does not account for the entirety of the pie below. For example, water in lakes is plotted on both the second and third layers, while water in soil suddenly enters the diagram at the third level even though it should be part of the "drop" from the second layer.

***

I started playing around with various related forms. I like the concept of linking the layers and want to retain it. Here is one graphic inspired by the waterfall pies from above:

### Distilled News

Dialog is a domain-specific language for creating works of interactive fiction. It is heavily inspired by Inform 7 (Graham Nelson et al. 2006) and Prolog (Alain Colmerauer et al. 1972). An optimizing compiler, dialogc, translates high-level Dialog code into Z-code, a platform-independent runtime format originally created by Infocom in 1979.
A machine learning algorithm (such as classification, clustering or regression) uses a training dataset to determine weight factors that can be applied to unseen data for predictive purposes. Before implementing a machine learning algorithm, it is necessary to select only relevant features in the training dataset. The process of transforming a dataset in order to select only relevant features necessary for training is called dimensionality reduction. Dimensionality reduction is important because of three main reasons:
• Prevents Overfitting: A high-dimensional dataset having too many features can sometimes lead to overfitting (model captures both real and random effects).
• Simplicity: An over-complex model having too many features can be hard to interpret especially when features are correlated with each other.
• Computational Efficiency: A model trained on a lower-dimensional dataset is computationally efficient (execution of algorithm requires less computational time).
Dimensionality reduction therefore plays a crucial role in data preprocessing.
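As a minimal illustration of the idea (a simple filter-style feature selection, only one of many dimensionality reduction techniques), we can drop features whose variance falls below a threshold, since a near-constant feature carries almost no information for a model:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Rows are samples, columns are features; the middle feature is almost
# constant, so it carries essentially no information.
X = [
    [1.0, 5.01, 10.0],
    [2.0, 5.00, 20.0],
    [3.0, 4.99, 30.0],
]
columns = list(zip(*X))
keep = [j for j, col in enumerate(columns) if variance(col) > 0.01]
X_reduced = [[row[j] for j in keep] for row in X]
print(keep)  # → [0, 2]: the near-constant middle feature is dropped
```

Techniques like PCA go further by combining correlated features into fewer components, but the goal is the same: a lower-dimensional training set.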
You have to write SQL queries to query data from a relational database. Sometimes, you even have to write complex queries to do that. Wouldn’t it be amazing if you could use a chatbot to retrieve data from a database using simple English? That’s what this tutorial is all about.
How can organizations and individuals promote Data Literacy? Data literacy is all about critical thinking, so the time-tested method of Socratic questioning can stimulate high-level engagement with data.
Data-centric thinking is rapidly becoming vital to the way we work, communicate and understand in the 21st century. This has led to a proliferation of tools for novices that help them operate on data to clean, process, aggregate, and visualize it. Unfortunately, these tools have been designed to support users rather than learners that are trying to develop strong data literacy. This paper outlines a basic definition of data literacy and uses it to analyze the tools in this space. Based on this analysis, we propose a set of pedagogical design principles to guide the development of tools and activities that help learners build data literacy. We outline a rationale for these tools to be strongly focused, well guided, very inviting, and highly expandable. Based on these principles, we offer an example of a tool and accompanying activity that we created. Reviewing the tool as a case study, we outline design decisions that align it with our pedagogy. Discussing the activity that we led in academic classroom settings with undergraduate and graduate students, we show how the sketches students created while using the tool reflect their adeptness with key data literacy skills based on our definition. With these early results in mind, we suggest that to better support the growing number of people learning to read and speak with data, tool designers and educators must design from the start with these strong pedagogical principles in mind.
Imagine an organization where the marketing department speaks French, the product designers speak German, the analytics team speaks Spanish and no one speaks a second language. Even if the organization was designed with digital in mind, communicating business value and why specific technologies matter would be impossible. That’s essentially how a data-driven business functions when there is no data literacy. If no one outside the department understands what is being said, it doesn’t matter if data and analytics offers immense business value and is a required component of digital business.
While prediction errors (PE) have been established to drive learning through adaptation of internal models, the role of model-compliant events in predictive processing is less clear. Checkpoints (CP) were recently introduced as points in time where expected sensory input resolved ambiguity regarding the validity of the internal model. Conceivably, these events serve as on-line reference points for model evaluation, particularly in uncertain contexts. Evidence from fMRI has shown functional similarities of CP and PE to be independent of event-related surprise, raising the important question of how these event classes relate to one another. Consequently, the aim of the present study was to characterise the functional relationship of checkpoints and prediction errors in a serial pattern detection task using electroencephalography (EEG). Specifically, we first hypothesised a joint P3b component of both event classes to index recourse to the internal model (compared to non-informative standards, STD). Second, we assumed the mismatch signal of PE to be reflected in an N400 component when compared to CP. Event-related findings supported these hypotheses. We suggest that while model adaptation is instigated by prediction errors, checkpoints are similarly used for model evaluation. Intriguingly, behavioural subgroup analyses showed that the exploitation of potentially informative reference points may depend on initial cue learning: Strict reliance on cue-based predictions may result in less attentive processing of these reference points, thus impeding upregulation of response gain that would prompt flexible model adaptation. Overall, present results highlight the role of checkpoints as model-compliant, informative reference points and stimulate important research questions about their processing as a function of learning and uncertainty.
In the statistical literature, many indicators are known for measuring the degree of polarization in ordinal data. Typically, many of the widely used measures of distributional variability are defined as a function of a reference point which, in some sense, could be considered representative of the entire population. This function indicates how much all the values differ from the point that is considered ‘typical’. Of all measures of variability, the variance is a well-known example that uses the mean as a reference point. However, mean-based measures depend on the scale applied to the categories (Allison & Foster, 2004) and are highly sensitive to outliers. An alternative approach is to compare the distribution of an ordinal variable with that of maximum dispersion, that is, the two-point extreme distribution (i.e. a distribution in which half of the population is concentrated in the lowest category and half in the top category). Using this procedure, three measures of variation for ordinal categorical data have been suggested: the Linear Order of Variation – LOV (Berry & Mielke, 1992), the Index Order of Variation – IOV (Leik, 1966) and the COV (Kvalseth, Coefficients of variations for nominal and ordinal catego…). All these indices are based on the cumulative relative frequency distribution (CDF), since this contains all the distributional information of any ordinal variable (Blair & Lacy, 1996). Consequently, none of these measures relies on ordinal assumptions about distances between categories.
Imagine…imagine that you have been challenged to play Steph Curry, the greatest 3-point shooter in the history of the National Basketball Association, in a game of 1×1. Yea, a pretty predictable outcome for 99.9999999% of us. But now imagine that Steph Curry has to wear a suit of knight’s armor as part of that 1×1 game. The added weight, the obstructed vision, and the lack of flexibility, agility and mobility would probably allow even the average basketball player to beat him. Welcome to today’s technology architecture challenge!
Data science has been around since mankind first did experiments and recorded data. It is only since the advent of big and heterogenous data that the term ‘Data Science’ was coined. With such a long and varied history, the field should benefit from the great diversity of perspectives that are brought by practitioners from different fields. My own path started with signal analysis. I was building high speed interferometric photon counting systems, where my ‘data science’ was dominated by signal to noise and information encoding. The key aspect of this was that data science was applied to extend or modify our knowledge (understanding) of the physical system. Later my data science efforts focused on stochastic dynamical systems. While the techniques and tools employed were different than those used in signal analysis, the objective remained the same, to extend or modify our knowledge of a system.
Typically, industry machine learning projects aren’t based on a fixed, preexisting reference dataset like MNIST. A lot of effort goes into procuring and cleaning training data. As these tasks are highly project-specific and can’t be generalized, they are rarely talked about and receive no media attention. The same is true for the post-modelling steps: How do you bring your model into production? How will the model outputs create actual business value? And by the way, shouldn’t you have been thinking about these questions beforehand? While the model serving workflows are somewhat transferable, monetization strategies are usually specific and not made public. With these considerations, we can paint a more accurate picture.
One of the most frustrating aspects of data science can be the constant switching between different tools while working. You can be editing some code in a Jupyter Notebook, installing a new tool on the command line, and editing a function in an IDE, all while working on the same task. Sometimes it is nice to find ways of doing more things in the same piece of software. In the following post, I am going to list some of the best tools I have found for doing data science on the command line. It turns out that many more tasks can be completed via simple terminal commands than I first thought, and I wanted to share some of those here.
Artificial Neural Networks are inspired by the nature of our brain. Similarly Genetic Algorithms are inspired by the nature of evolution. In this article I propose a new type of neural network to assist in training: Genetic Neural Networks. These neural networks hold properties like fitness, and use a Genetic Algorithm to train randomly generated weights. Genetic optimization occurs prior to any form of backpropagation to give any type of gradient descent a better starting point. This project can be found on my GitHub here, with an explanation in the snippets below.
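The core idea, evolving weights by selection, crossover and mutation before any gradient descent runs, can be sketched in plain Python. This is a toy genetic search fitting y = 2x + 1, not the author’s implementation:

```python
import random

random.seed(42)

# Toy problem: evolve two weights so that w[0]*x + w[1] fits y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(-5, 6)]

def fitness(w):
    # Negative squared error: higher fitness means a better fit.
    return -sum((w[0] * x + w[1] - y) ** 2 for x, y in data)

def evolve(pop_size=48, generations=200, sigma=0.3):
    pop = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 4]          # selection: keep the fittest quarter
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = [random.choice(genes) for genes in zip(a, b)]  # crossover
            child = [g + random.gauss(0, sigma) for g in child]    # mutation
            children.append(child)
        pop = parents + children                # elitism: parents survive unchanged
    return max(pop, key=fitness)

best = evolve()
print([round(w, 2) for w in best])  # should land close to [2, 1]
```

In the hybrid scheme the article proposes, a population like this would seed the network’s initial weights, and backpropagation would then take over from that better starting point.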
The main purpose of data is to deliver useful information that can be used to make important decisions. In the early days of computing, such useful information could be queried from direct data such as that stored in databases. If the information queried for was not available in those data stores, then the system would not be able to respond to the users’ queries. At present, if we consider the massive amount of data on the web and the exponentially growing number of web users, it is hard to predict and store exactly what these users will query for. Simply stated, we cannot manually hard-code the responses to each and every expected question from the users. So what is the solution? The best answer is to extract knowledge from data, refine and recompose it into a knowledge graph, which we can use to answer queries. However, knowledge extraction has proven to be a non-trivial problem. Hence, we use a statistical relational learning methodology that considers past experiences and learns new relationships or similarities between facts in order to extract such knowledge. Probabilistic soft logic (PSL) is one such statistical relational learning framework that is used for building knowledge graphs.
Probabilistic soft logic (PSL) is a machine learning framework for developing probabilistic models. PSL models are easy to use and fast. You can define models using a straightforward logical syntax and solve them with fast convex optimization. PSL has produced state-of-the-art results in many areas spanning natural language processing, social-network analysis, knowledge graphs, recommender system, and computational biology. The PSL framework is available as an Apache-licensed, open source project on GitHub with an active user group for support.
Mindful strategies to recognize, and stop, that chatbot from getting too close. In the age of chatbot proliferation, we often wonder who we are talking to. When we chat with customer service representatives online, we expect the first-line customer service representatives to be chatbots. We tell them about our problems. Chatbots help us find a solution. The ‘problem and solution’ nature of customer service presents a classic problem best solved by artificial intelligence. Taking this one step further, what if you are a twenty-something, career-driven person who has literally no time during your busy, packed working day to check in with your significant other? You’ve worked hard to earn your significant other’s love. You want to juggle your relationship with your career. But there’s just no time.
In this notebook I use PySpark, Keras, and Elephas python libraries to build an end-to-end deep learning pipeline that runs on Spark. Spark is an open-source distributed analytics engine that can process large amounts of data with tremendous speed. PySpark is simply the python API for Spark that allows you to use an easy programming language, like python, and leverage the power of Apache Spark.

### AnnoNLP conference on data coding for natural language processing

This workshop should be really interesting:

Silviu Paun and Dirk Hovy are co-organizing it. They’re very organized and know this area as well as anyone. I’m on the program committee, but won’t be able to attend.

I really like the problem of crowdsourcing. Especially for machine learning data curation. It’s a fantastic problem that admits of really nice Bayesian hierarchical models (no surprise to this blog’s audience!).

The rest of this note’s a bit more personal, but I’d very much like to see others adopting similar plans for the future for data curation and application.

The past

Crowdsourcing is near and dear to my heart as it’s the first serious Bayesian modeling problem I worked on. Breck Baldwin and I were working on crowdsourcing for applied natural language processing in the mid 2000s. I couldn’t quite figure out a Bayesian model for it by myself, so I asked Andrew if he could help. He invited me to the “playroom” (a salon-like meeting he used to run every week at Columbia), where he and Jennifer Hill helped me formulate a crowdsourcing model.

As Andrew likes to say, every good model was invented decades ago for psychometrics, and this one’s no different. Phil Dawid had formulated exactly the same model (without the hierarchical component) back in 1979, estimating parameters with EM (itself only published in 1977). The key idea is treating the crowdsourced data like any other noisy measurement. Once you do that, it’s just down to details.

Part of my original motivation for developing Stan was to have a robust way to fit these models. Hamiltonian Monte Carlo (HMC) only handles continuous parameters, so like in Dawid’s application of EM, I had to marginalize out the discrete parameters. This marginalization’s the key to getting these models to sample effectively. Sampling discrete parameters that can be marginalized is a mug’s game.
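The marginalization trick is easy to illustrate outside Stan. For a discrete latent variable z with K values, log p(y) = log Σ_k p(z = k) p(y | z = k), evaluated stably via log-sum-exp. A toy Python sketch with a single item and three annotator votes (an invented example, not the full Dawid and Skene model):

```python
from math import log, exp

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + log(sum(exp(x - m) for x in xs))

# Toy setup: an item's true label z is 0 or 1 with prior (0.6, 0.4);
# three annotators, each correct with probability 0.8, voted [0, 0, 1].
log_prior = [log(0.6), log(0.4)]
votes = [0, 0, 1]

def log_lik(z):
    # log p(votes | z) for annotators with 80% accuracy
    return sum(log(0.8) if v == z else log(0.2) for v in votes)

# Marginalize the discrete z out: log p(votes) = log sum_z p(z) p(votes | z)
log_marginal = log_sum_exp([log_prior[z] + log_lik(z) for z in (0, 1)])
print(round(exp(log_marginal), 4))  # → 0.0896
```

Because the discrete z never appears as a sampled parameter, HMC only has to deal with the continuous quantities (here the accuracies and prior), which is exactly why the marginalized form samples so much better.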

The present

Coming full circle, I co-authored a paper with Silviu and Dirk recently, Comparing Bayesian models of annotation, that reformulated and evaluated a bunch of these models in Stan.

Editorial Aside: Every field should move to journals like TACL. Free to publish, fully open access, and roughly one month turnaround to first decision. You have to experience journals like this in action to believe it’s possible.

The future

I want to see these general techniques applied to creating probabilistic corpora, to online adaptive training data (aka active learning), to joint corpus inference and model training (a la Raykar et al.’s models), and to evaluation.

P.S. Cultural consensus theory

I’m not the only one who recreated Dawid and Skene’s model. It’s everywhere these days.

I recently discovered an entire literature dating back decades on cultural consensus theory, which uses very similar models (I’m pretty sure either Lauren Kennedy or Duco Veen pointed me to the literature). Its authors go deeper into the philosophical underpinnings of the notion of consensus driving these models (basically, the underlying truth of which you are taking noisy measurements). One neat innovation in the cultural consensus theory literature is a mixture model of the truth: you can assume multiple subcultures are coding the data with different standards. I’d thought of mixture models of coders (say, experts, Mechanical Turkers, and undergrads), but not of the truth.

In yet another small-world phenomenon, right after I discovered cultural consensus theory, I attended a cello concert organized through Groupmuse by a social scientist at NYU whom I’d originally met through a mutual friend of Andrew’s. He introduced the cellist, Iona Batchelder, and added as an aside that she was the daughter of well-known social scientists. Not just any social scientists: the developers of cultural consensus theory!

### New Resource for Learning Choroplethr

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Last week I had the honor of giving a one-hour talk about Choroplethr at a private company.

When this company reached out to me about speaking there, I originally planned to give the same talk I gave at the CDC two years ago.

But as I reviewed the CDC talk, I realized two things:

1. My ability to teach and explain Choroplethr has improved a lot since then.
2. Choroplethr has changed in the last few years.

Because of this, I decided to rewrite the talk from scratch.

I wanted to share this new and improved resource with a wider audience, so I just recorded myself giving this new talk. You can view the talk below.

I hope this helps you get the most out of Choroplethr!

Interested in having me give this talk at your company? Contact me and let me know!
