# My Data Science Blogs

## October 15, 2019

### Diving into Leadership to Build Push-Button Code

“Hi everyone, I’m Sarah! I’m a Research Data Scientist at the Alan Turing Institute and I’m also an operator of mybinder.org. It’s really cool seeing how many people here are interested in BinderHub!”

And it is cool. Really cool! But also a bit scary as a room full of Research Software Engineers (each of them much further on in their careers than I am) suddenly turn to me, eager for the knowledge I was surely about to impart to them.

But let’s rewind a bit.

### Joining the Binder community

It's May 2019 and I'm attending the Research Software Reactor sprint jointly hosted by Microsoft and Imperial College London. This is a 3-day hackathon where researchers from different areas (though all with a computing background) come together to collaboratively build cloud-based resources on Microsoft's Azure platform.

As for me, I’m only 6 months into my role at the Turing, having graduated from my PhD in Astrophysics at the start of the year. Also, my role as a “Binder Operator” is barely 2 months old… which is why I’m starting to feel nervous about the amount of interest in the room! This also happens to be the second hackathon I’ve attended ever, so the adrenaline is high.

So, what is this “Binder” that we’re all so excited about?

Binder is an amazing resource that can host reproducible and interactive code in a web browser. Great… what does that mean? It means that if I have some scripts or notebooks in a repository (like this one) and I describe the packages in a configuration file (such as requirements.txt), then I can go to mybinder.org, copy the URL of the repository into the form and hit launch. This will begin a series of events culminating in my notebook appearing in a browser window with all of the packages installed, and the code will just run.
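To make that launch step concrete: for GitHub-hosted repositories, mybinder.org encodes the target in a `/v2/gh/<user>/<repo>/<ref>` URL, so the form is essentially building a link like the one below. This is just a sketch of the URL pattern; the user and repository names are made up.

```python
def binder_launch_url(user: str, repo: str, ref: str = "HEAD") -> str:
    """Build the mybinder.org launch URL for a GitHub repository."""
    return f"https://mybinder.org/v2/gh/{user}/{repo}/{ref}"

# A hypothetical repository with a requirements.txt at its root:
print(binder_launch_url("octocat", "my-notebooks", "main"))
# https://mybinder.org/v2/gh/octocat/my-notebooks/main
```

Opening a link like that is what triggers the build-and-launch sequence described above.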

Sounds like magic, right? You can even combine Binder with Jupyter Books to create interactive documents! Below is a comic explaining how a scientist may use Binder.

BinderHub is the technology powering Binder (or where the magic happens). It is a multi-user server that can create a custom, user-specified computing environment and make it accessible via a URL. It utilizes different tools to make this possible and can be deployed onto either a cloud provider or an on-premise compute cluster. These tools are depicted in the below illustration.

Since BinderHub is a cloud-neutral technology (mybinder.org itself runs on Google Cloud and OVH.com), I previously worked on refining the documentation around deploying BinderHub on Azure and ran workshops teaching people how to use Binder and deploy BinderHub. It was this work that led to me being invited to join the Binder team. Being a maintainer for an open-source project involves checking if the service is still running smoothly and being an active, friendly member of the community — one who can answer questions when they arise.

Now that we’ve covered a bit of background, let’s get back to my slightly uncomfortable moment at the sprint.

### Day 1: An unexpected leadership position

Everybody in the room is now trying to whittle down which projects we should work on — all of which are awesome! The magic of BinderHub is that it covers so many aspects of software engineering tools and practices, such as continuous integration/deployment and Kubernetes. The BinderHub idea is quickly amassing a lot of satellite projects!

“So Sarah, is it OK if I name you leader of ‘Team BinderHub’?” My second uncomfortable moment. Gerard Gorman, Senior Lecturer in the Department of Earth Sciences at Imperial and co-organizer of the sprint, has just elected me leader of my first ever hack project. If I’m honest, I’d intended to spend the sprint ironing out some niggles with a colleague. But I’ve always been a “learn by doing” person so I agree and desperately begin wracking my brain for a project to give my team for the next 3 days.

After a bit of shuffling (where for a while it seemed like most of the attendees would be joining ‘Team BinderHub’!), I finally assemble a team. My teammates are: Tania Allard, a Microsoft Developer Advocate; Tim Greaves and Diego Alonso Álvarez, Research Software Engineers at Imperial; and Gerard himself.

The project I decide to bring to the table is one that I’d started but had stalled. The process of deploying a BinderHub can be quite long (usually an hour) and consists mainly of typing stuff into the command line. So a couple of months prior, I’d begun designing a set of scripts that would install the relevant tools, deploy a BinderHub on Azure and connect it to a Docker Hub account, retrieve various pieces of information (IP addresses, logs, etc.) and then remove the BinderHub from the cloud. However, my bash scripting and operating system knowledge is not especially strong. Within the team, we have a bit of a discussion around the Deploy to Azure feature — a button that can be copy-pasted into the README of a GitHub project and facilitates a one-click deployment of the project to Azure. It seems like the perfect hack is to combine my scripts with the deploy button and make deploying BinderHub as easy as possible.

Tim is a bash scripting, containerizing pro and has a little experience with the Azure “blue button” from the OKPy project. I ask him to check my setup script that installs the required Command Line Interfaces (CLIs) and explain that I’d like it to be generalized for different operating systems.

Gerard is a BinderHub enthusiast and is keen to get one deployed as a teaching resource at Imperial. As a starter task, I ask him to read up on how the blue button works.

Diego has never heard of Binder or BinderHub before and is quite confused as to what the rest of the team are so excited about! I suggest that he works through my Zero to BinderHub workshop in order to bring him up to speed with the concepts. This is also the perfect opportunity to get feedback on my workshop if parts are not clear! I encourage him to open an issue describing any problems he comes across or extra information he’d like to see.

And just like that, I find I’ve delegated myself out of a job!

Now Imposter Syndrome is beginning to creep up on me. Not only is it taking four extra people to pull together my work and make it usable, but I am also unsure how I would assist them in these tasks I’d set them. Especially Tim, as the complexity of what I wanted the bash scripts to achieve was the reason this project stalled in the first place.

Except there’s one very important aspect of any software project I’ve not mentioned yet: Documentation!

Since starting my position at the Turing, I've come to fully appreciate how fundamental good documentation is to the success of a project. Clearly-written instructions covering the purpose of the project, how to install/run the code, and what kind of inputs/outputs to expect make it much easier for a new person to quickly get to grips with the project. And as a result, it's far more likely to be referenced and reused. While my team begin familiarizing themselves with the infrastructure and goals of my project, I begin an overhaul of the documentation whilst keeping an eye on the repository to manage incoming issues and pull requests.

Take that, Imposter Syndrome!

### Day 2: The team makes progress

It’s the second day of the sprint and ‘Team BinderHub’ gather again, our enthusiasm not diminished yet!

I very quickly set goals for the day. I want the button to work by the end of the day as I’m hoping the final day can be used to work on an idea Tania has to use Azure’s DevOps Pipelines to automatically update the deployed BinderHub as new commits come into the host repository. This would be a very handy feature for those (like me!) maintaining BinderHubs at their own institutions. Anything to automate and reduce the number of commands we have to type!

I ask Tim to begin working on the deploy script itself, to add tests where necessary and tidy up some of the parsing of the variables. The next steps are to build a Dockerfile that runs the deploy script and an ARM (Azure Resource Manager) template that controls the form that the blue button links to.

Again, I’m managing the documentation, keeping up with changes we make to the code-base and functionality. Similarly, I keep watch over the repository to manage incoming pull requests and merge conflicts and also make a start on the amendments to the BinderHub workshop Diego has compiled.

My Imposter Syndrome is definitely subsiding as I begin to feel more like I understand the skills of my team and we are all making the best use of our time.

### Day 3: Document, document, document

For the third and final day of the sprint we are in a new venue: the Microsoft Reactor. This is a workspace in the Shoreditch area of London where we are offered free pizza and cookies and a DJ to provide the soundtrack to our code. (OK, less an actual DJ, more of a software engineer with Spotify Premium. 😜) What I will say though, Microsoft’s beverage-making facilities have nothing on the Turing’s iPad coffee maker! 😉

Our goals for the last day? Consolidate and document! Ideally get the button working if possible, but the top priority is to make it as easy as possible for the team (or someone new!) to come along to the repository and finish what we have started.

We decide that there isn’t enough time left or infrastructure in place to implement the Azure DevOps Pipeline for automatic upgrades; instead, Tania begins working on a tutorial so we can implement it later.

Tim and I end up in a bit of GitHub hell as we realize that the code to create the Kubernetes cluster is in the setup script, not the deploy script. We refactor some parts so that all of the resources are deployed from the deploy script, and this also means that the Dockerfile only needs to find one script to execute. However, this refactoring causes a complicated merge conflict and some of the bug fixes Tim implemented disappear in a squash merge. (This sounded difficult enough to resolve that I’ve been put off learning about squash merges since!) As team leader, I try to orchestrate whose pull request should be merged first so that all the right code ends up in master.

### What did we learn and do?

By the end of the sprint, we don’t quite have a working button deployment, but we do have:

• a set of streamlined scripts for auto-deployment based on the contents of a JSON file,
• an ARM template and Dockerfile that will provide the backend to the blue button after some further debugging,
• a plan to move the project forwards after the sprint.
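The first bullet's idea (one JSON file drives the whole deployment) can be sketched as a loader that refuses to start without the settings it needs. The key names below are illustrative, not the repo's actual schema.

```python
import json

# Hypothetical required settings for a BinderHub deployment; the real
# binderhub-deploy repo defines its own schema.
REQUIRED = {"resource_group", "binderhub_name", "docker_username"}

def load_config(text: str) -> dict:
    """Parse the deployment config and fail fast if settings are missing."""
    config = json.loads(text)
    missing = REQUIRED - config.keys()
    if missing:
        raise ValueError(f"config is missing keys: {sorted(missing)}")
    return config

sample = ('{"resource_group": "binder-rg", '
          '"binderhub_name": "my-binderhub", '
          '"docker_username": "me"}')
print(load_config(sample)["binderhub_name"])  # my-binderhub
```

Validating up front like this is what lets a long-running deploy script fail in seconds rather than halfway through creating cloud resources.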

So what did I learn from those three days?

Coding with other people is fun!

Paired programming is something we try to achieve at the Turing, but it can be difficult to do depending on the project and the time constraints of those working on it.

This was the first time I'd experienced true collaboration: I had an idea that I wasn't sure how to implement, and someone with the skills helped me shape it and realize it. It gave me a sense of community and belonging. I hope I've forged connections with 'Team BinderHub' that will last the duration of our careers and that we can work together again.

There’s a role for everyone in a team, and every role is equally important

Whether it’s reviewing code or writing documentation, these tasks are as important as (arguably, more important than) the code itself. Code that does what you think it’s doing and has been explained well will have a much longer lifespan than code that is difficult to follow and poorly documented, no matter how clever it is.

I found my strength as a leader

You might remember at the beginning of this blog post I mentioned that I’d never led a hack project before, so I also learned a lot about my leadership capabilities.

I think I did well. I identified the strengths of each of my team members and gave them a task suited to them whilst letting them explore new concepts, offering my own insight and opinions when required.

I also think that it was a good choice for me to not be too involved in the coding aspect of this project. Managing the flow into the repository and updating the documentation as new code came in meant that I managed to maintain an overall perspective of the project and could switch gears as questions came in from different areas. I don’t think I could have maintained such a view if I’d been buried in code, and I can always learn from the scripts that we’ve developed at a later point.

Hackathons are not where projects end

While the first 80% of a project to get the infrastructure in place can be achieved in a short amount of time like a hackathon, the last 20% is hard and often takes longer. But this extra effort is necessary for the software to be taken up by others.

‘Team BinderHub’ continued working on this project over Slack and GitHub to finally make our version 1 release on June 11th — almost 3 weeks after the end of the sprint!

If you’d like to try the button to deploy your own BinderHub (or contribute a new feature!), the repo can be found here 👉 github.com/alan-turing-institute/binderhub-deploy. Look for the button below!

Thank You! 💖

I’d like to thank a few people who made this possible:

• Gerard, Imperial College, Tania and Lee Stott from Microsoft for organising such an inspiring event;
• ‘Team BinderHub’ for coming along on this wild ride with me;
• The Binder Team for accepting me into the community and giving me the space to develop such projects;
• and The Turing Way team who introduced me to Binder and the value of community (and documentation!).

Note: this is cross-posted with the Turing Institute blog.

Diving into Leadership to Build Push-Button Code was originally published in the Jupyter Blog on Medium.

### Automated Data Governance 101

The way we control our data isn’t working. Data is as vulnerable as ever. Download this white paper, which outlines lessons about how data science and governance programs can, if implemented properly, reinforce each other’s objectives.

### What’s going on on PyPI

Scanning all newly published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here those that might be worth a look, worth following, or that might inspire you in some way.

bsfc
Bayesian Spectral Fitting Code (BSFC)

dbscan1d
dbscan1d is a package for DBSCAN on 1D arrays. scikit-learn’s implementation (https://…/sklearn.cluster.DBSCAN.html ) does not have a special case for 1D, where calculating the full distance matrix is wasteful. It is much better to simply sort the input array and perform efficient bisects to find the closest points. Here are the results of running the simple profile script included with the package. In every case DBSCAN1D is much faster than scikit-learn’s implementation.
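The sort-and-bisect idea is simple enough to sketch. This is an illustrative re-implementation of the technique, not the package's actual code:

```python
from bisect import bisect_left, bisect_right

def dbscan_1d(points, eps, min_samples):
    """Minimal 1-D DBSCAN sketch: sort once, find eps-neighborhoods by bisection.

    Returns labels aligned with `points` (-1 = noise), following DBSCAN's usual
    convention: a point is "core" if its eps-neighborhood (itself included)
    holds at least min_samples points; a non-core point within eps of a core
    point joins that core's cluster.
    """
    n = len(points)
    order = sorted(range(n), key=points.__getitem__)
    xs = [points[i] for i in order]

    # Pass 1: core points via two bisections each -- no distance matrix needed.
    core = [j for j, x in enumerate(xs)
            if bisect_right(xs, x + eps) - bisect_left(xs, x - eps) >= min_samples]

    # Pass 2: consecutive core points more than eps apart start a new cluster.
    labels_sorted = [-1] * n
    cluster = -1
    prev = None
    for j in core:
        if prev is None or xs[j] - xs[prev] > eps:
            cluster += 1
        labels_sorted[j] = cluster
        prev = j

    # Pass 3: border points adopt the cluster of the nearest core within eps.
    core_xs = [xs[j] for j in core]
    for j, x in enumerate(xs):
        if labels_sorted[j] != -1 or not core:
            continue
        i = bisect_left(core_xs, x)
        candidates = [c for c in (i - 1, i) if 0 <= c < len(core)]
        best = min(candidates, key=lambda c: abs(core_xs[c] - x), default=None)
        if best is not None and abs(core_xs[best] - x) <= eps:
            labels_sorted[j] = labels_sorted[core[best]]

    # Undo the sort so labels line up with the input order.
    labels = [0] * n
    for pos, idx in enumerate(order):
        labels[idx] = labels_sorted[pos]
    return labels

print(dbscan_1d([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0], eps=0.15, min_samples=2))
# [0, 0, 0, 1, 1, 1, -1]
```

Each query is O(log n) after an O(n log n) sort, versus the O(n²) distance matrix of the general-purpose implementation.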

kinlin
WIP: PyTorch based framework

Learning environments for Multi-Agent Connected Autonomous Driving (MACAD) with OpenAI Gym compatible interfaces

pyTokenizer
A streaming tokenizer.

syllogio-identifyPropositions
Identify natural language propositions in a written argument.

TeaML
Automated Modeling in Financial Domain. TeaML is a simple and design friendly automatic modeling learning framework.

timeseriesql
A Pythonic query language for time series data

trendln
Support and Resistance Trend lines Calculator for Financial Analysis

deepstack
DeepStack: Ensembling Keras Deep Learning Models into the next Performance Level

### Study retracted after finding a mistaken recoding of the data

A study found that a hospital program significantly reduced the number of hospitalizations and emergency department visits. Great. But then the researchers realized that the data was recoded incorrectly, and the program actually increased hospitalizations and emergency department visits. Not so great.

The identified programming error was in a file used for preparation of the analytic data sets for statistical analysis and occurred while the variable referring to the study “arm” (ie, group) assignment was recoded. The purpose of the recoding was to change the randomization assignment variable format of “1, 2” to a binary format of “0, 1.” However, the assignment was made incorrectly and resulted in a reversed coding of the study groups. Even though the data analyst created and conducted some test analysis programs, they were of the type that did not show any labeling of the arm categories, only the “arm” variable in a regression, for example.
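A hypothetical reconstruction of the bug described in that notice, with a labeled cross-check of the kind that would have caught it (all names and values here are illustrative, not from the study):

```python
# The trial's "arm" variable is coded 1 (control) and 2 (intervention)
# and must be recoded to a binary 0/1 format.
arms = [1, 2, 2, 1, 2]                        # 1 = control, 2 = intervention

buggy   = [1 if a == 1 else 0 for a in arms]  # maps control -> 1: groups reversed!
correct = [a - 1 for a in arms]               # control -> 0, intervention -> 1

# A test that only looks at an unlabeled "arm" coefficient in a regression
# cannot see the swap; a check that pairs the new codes with labels can.
labels = {0: "control", 1: "intervention"}
for code, a in zip(correct, arms):
    assert labels[code] == ("control" if a == 1 else "intervention")
```

The reversed coding produces a perfectly valid-looking 0/1 variable, which is why the analysts' unlabeled test programs did not flag it.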

Here’s the original, now-retracted study. And here’s the revised one.

Data can be tricky and could lead to unintended consequences if you don’t handle it correctly. Be careful out there.


### Free R/datascience Extract: Evaluating a Classification Model with a Spam Filter

We are excited to share a free extract of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019: Evaluating a Classification Model with a Spam Filter.

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.

This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design).

(Note: Nina Zumel leads on the course design, which is the heavy lifting; John Mount just got tasked with delivering it.)

Zumel, Mount, Practical Data Science with R, 2nd Edition is coming out in print very soon. Here is a discount code to help you get a good deal on the book:

Take 37% off Practical Data Science with R, Second Edition by entering fcczumel3 into the discount code box at checkout at manning.com.

### Using DC/OS to Accelerate Data Science in the Enterprise

Follow this step-by-step tutorial using Tensorflow to setup a DC/OS Data Science Engine as a PaaS for enabling distributed multi-node, multi-GPU model training.

### A heart full of hatred: 8 schools edition

No; I was all horns and thorns
Sprung out fully formed, knock-kneed and upright
Joanna Newsom

Far be it from me to be accused of liking things. Let me, instead, present a corner of my hateful heart. (That is to say that I’m supposed to be doing a really complicated thing right now and I don’t want to so I’m going to scream into a void for a little while.)

The object of my ire: The 8-Schools problem.

Now, for those of you who aren’t familiar with the 8-schools problem, I suggest reading any paper by anyone who’s worked with Andrew (or has read BDA). It’s a classic.

So why hate on a classic?

Well, let me tell you. As you can well imagine, it’s because of a walrus.

I do not hate walruses (I only hate geese and alpacas: they both know what they did), but I do love metaphors. And while sometimes a walrus is just a walrus, in this case it definitely isn’t.

The walrus in question is the Horniman Walrus (please click the link to see my smooth boy!). The Horniman walrus is a mistake that you can see, for a modest fee, at a museum in South London.

The story goes like this: Back in the late 19th century someone killed a walrus, skinned it, hopefully did some other things to it, and sent it back to England to be stuffed and mounted. Now, it was the late 19th century and it turned out that the English taxidermist maybe didn’t know what a walrus looked like. (The museum’s website claims that “only a few people had ever seen a live walrus” at this point in history which, even for a museum, is really [expletive removed] white.)

But hey. He had sawdust. He had glue. He had other things that are needed to stuff and mount a dead animal. So he took his dead animal and his tools, introduced them to each other and proudly displayed the results.

(Are you seeing the metaphor?)

Now, of course, this didn’t go well. Walruses, if you’ve never seen one, are huge creatures with loose skin folds. The taxidermist did not know this and so he stuffed the walrus full leading to a photoshop disaster of a walrus. Smooth like a beachball. A glorious mistake. And a genuine tourist attraction.

So this is my first problem. Using a problem like 8 schools as a default test for algorithms has a tendency to lead to over-stuffed algorithms that are tailored to specific models. This is not a new problem. You could easily call it the NeurIPS Problem (aka how many more ways do you want to over-fit MNIST?). (Yes, I know NeurIPS has other problems as well. I’m focussing on this one.)

A different version of this problem is a complaint I remember from back in my former life when I cared about supercomputers. This was before the whole “maybe you can use big computers on data” revolution. In these dark times, the benchmarks that mattered were the speed at which you could multiply two massive dense matrices, and the speed at which you could do a dense LU decomposition of a massive matrix. Arguably neither of these things were even then the key use of high-performance computers, but as the metrics became goals, supercomputer architectures emerged that could only be used to their full capacity on very specialized problems that had enough arithmetic intensity to make use of the entire machine. (NB: This is quite possibly still true, although HPC has diversified from just talking about Cray-style architectures)

So my problem, I guess, is with benchmark problems in general.

A few other specific things:

Why so small? 8 Schools has 8 observations, which is not very many observations. We have moved beyond the point where we need to include the data in a table in the paper.

Why so meta? The weirdest thing about the 8 Schools problem is that it has the form

$y_j\mid\mu_j\sim N(\mu_j,\sigma_j)$
$\mu_j\mid\mu,\tau\sim N(\mu,\tau)$

with appropriate priors on $\mu$ and $\tau$. The thing here is that the observation standard deviations $\sigma_j$ are known. Why? Because this is basically a meta-analysis. So 8-schools is a very specialized version of a Gaussian multilevel model. By fixing the observation standard deviation, the model has a much nicer posterior than the equivalent model with an unknown observation standard deviation. Hence, 8-schools doesn’t even test an algorithm on an ordinary linear mixed model.
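For reference, the model above can be written down as a log-posterior in a few lines. The y and sigma values are the classic published ones; note that $\sigma_j$ enters purely as known data, which is exactly the point. This sketch uses the non-centered parameterization ($\mu_j = \mu + \tau\theta_j$) and, as an assumption for illustration, flat priors on $\mu$ and on $\tau > 0$.

```python
import math

y     = [28.0,  8.0, -3.0,  7.0, -1.0,  1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0,  9.0, 11.0, 10.0, 18.0]

def log_posterior(mu, log_tau, theta):
    """Unnormalized log-posterior of 8-schools, parameterized by log(tau)."""
    tau = math.exp(log_tau)
    lp = log_tau  # Jacobian of the tau -> log(tau) change of variables
    for y_j, s_j, t_j in zip(y, sigma, theta):
        lp += -0.5 * t_j ** 2                                   # theta_j ~ N(0, 1)
        mu_j = mu + tau * t_j
        lp += -0.5 * ((y_j - mu_j) / s_j) ** 2 - math.log(s_j)  # y_j ~ N(mu_j, sigma_j)
    return lp
```

The tell-tale funnel lives in the joint geometry of `log_tau` and `theta`; with $\sigma_j$ fixed, everything else is Gaussian and well-behaved.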

But it has a funnel! So does Radford Neal’s funnel distribution (in more than 17 dimensions). Sample from that instead.

But it’s real data! Firstly, no it isn’t. You grabbed it out of a book. Secondly, the idea that testing inference algorithms on real data is somehow better than systematically testing on simulated data is just wrong. We’re supposed to be statisticians so let me ask you this: How does an algorithm’s success on real data set A generalize to the set of all possible data sets? (Hint: It doesn’t.)

So, in conclusion, I am really really really sick of seeing the 8-schools data set.

Postscript: There’s something I want to clarify here: I am not saying that empirical results are not useful for evaluating inference algorithms. I’m saying that it’s only useful if the computational experiments are clear. Experiments using well-designed simulated data are unbelievably important. Real data sets are not.

Why? Because real data sets are not indicative of data that you come across in practice. This is because of selection bias! Real data sets that are used to demonstrate algorithms come in two types:

1. Data that is smooth and lovely (like 8-Schools or an over-stuffed walrus)
2. Data that is pointy and unexpected (like StackLoss, which famously has an almost singular design matrix, or this excellent photo a friend of mine once took)

Basically this means that if you have any experience with a problem at all, you can find a data set that makes your method look good, or that demonstrates a flaw in a competing method, or that makes your method look bad. But this choice is opaque to people who are not experts in the problem at hand. Well-designed computational experiments on the other hand are clear in their aims (e.g. this data has almost co-linear covariates, or this data has outliers, or this data should be totally pooled).
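For example, a simulated design with almost co-linear covariates states its failure mode up front (a self-contained sketch, not tied to any particular study):

```python
import random

# Simulate data whose difficulty is declared in advance: two covariates
# that are nearly identical, giving an almost-singular design matrix.
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [v + random.gauss(0, 1e-3) for v in x1]                 # x2 is x1 plus tiny noise
y = [3 * a - 2 * b + random.gauss(0, 0.1) for a, b in zip(x1, x2)]

def corr(a, b):
    """Pearson correlation, computed directly."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

print(corr(x1, x2))  # prints a value very close to 1
```

Anyone reading the experiment knows exactly which pathology the algorithm is being tested against; no expertise in a famous data set is required.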

Simulated data is clearer, more realistic, and more honest when evaluating or even demonstrating algorithms.

### Top 7 Things I Learned on my Data Science Masters

Even though I’m still in my studies, here’s a list of the most important things I’ve learned (as of yet).

### What’s new on arXiv

We introduce a novel approach to incorporate syntax into natural language inference (NLI) models. Our method uses contextual token-level vector representations from a pretrained dependency parser. Like other contextual embedders, our method is broadly applicable to any neural model. We experiment with four strong NLI models (decomposable attention model, ESIM, BERT, and MT-DNN), and show consistent benefit to accuracy across three NLI benchmarks.
In the world of big data, many people find it difficult to access the information they need quickly and accurately. In order to overcome this, research on the system that recommends information accurately to users is continuously conducted. Collaborative Filtering is one of the famous algorithms among the most used in the industry. However, collaborative filtering is difficult to use in online systems because user recommendation is highly volatile in recommendation quality and requires computation using large matrices. To overcome this problem, this paper proposes a method similar to database queries and a clustering method (Contents Navigator) originating from a complex network.
Epistemic Logic Programs (ELPs), an extension of Answer Set Programming (ASP) with epistemic operators, have received renewed attention from the research community in recent years. Classically, evaluating an ELP yields a set of world views, with each being a set of answer sets. In this paper, we propose an alternative definition of world views that represents them as three-valued assignments, where each atom can be either always true, always false, or neither. Based on several examples, we show that this definition is natural and intuitive. We also investigate relevant computational properties of these new semantics, and explore how other notions, like strong equivalence, are affected.
In medical decision making, we have to choose among several expensive diagnostic tests such that the certainty about a patient’s health is maximized while remaining within the bounds of resources like time and money. The expected increase in certainty in the patient’s condition due to performing a test is called the value of information (VoI) for that test. In general, VoI relates to acquiring additional information to improve decision-making based on probabilistic reasoning in an uncertain system. This paper presents a framework for acquiring information based on VoI in uncertain systems modeled as Probabilistic Logic Programs (PLPs). Optimal decision-making in uncertain systems modeled as PLPs have already been studied before. But, acquiring additional information to further improve the results of making the optimal decision has remained open in this context. We model decision-making in an uncertain system with a PLP and a set of top-level queries, with a set of utility measures over the distributions of these queries. The PLP is annotated with a set of atoms labeled as ‘observable’; in the medical diagnosis example, the observable atoms will be results of diagnostic tests. Each observable atom has an associated cost. This setting of optimally selecting observations based on VoI is more general than that considered by any prior work. Given a limited budget, optimally choosing observable atoms based on VoI is intractable in general. We give a greedy algorithm for constructing a ‘conditional plan’ of observations: a schedule where the selection of what atom to observe next depends on earlier observations. We show that, preempting the algorithm anytime before completion provides a usable result, the result improves over time, and, in the absence of a well-defined budget, converges to the optimal solution.
In this work we design a narrative understanding tool Text2ALM. This tool uses an action language ALM to perform inferences on complex interactions of events described in narratives. The methodology used to implement the Text2ALM system was originally outlined by Lierler, Inclezan, and Gelfond (2017) via a manual process of converting a narrative to an ALM model. It relies on a conglomeration of resources and techniques from two distinct fields of artificial intelligence, namely, natural language processing and knowledge representation and reasoning. The effectiveness of system Text2ALM is measured by its ability to correctly answer questions from the bAbI tasks published by Facebook Research in 2015. This tool matched or exceeded the performance of state-of-the-art machine learning methods in six of the seven tested tasks. We also illustrate that the Text2ALM approach generalizes to a broader spectrum of narratives.
By incorporating the methods of Answer Set Programming (ASP) and Markov Logic Networks (MLN), LPMLN becomes a powerful tool for non-monotonic, inconsistent and uncertain knowledge representation and reasoning. To facilitate the applications and extend the understandings of LPMLN, we investigate the strong equivalences between LPMLN programs in this paper, which is regarded as an important property in the field of logic programming. In the field of ASP, two programs P and Q are strongly equivalent, iff for any ASP program R, the programs P and Q extended by R have the same stable models. In other words, an ASP program can be replaced by one of its strong equivalent without considering its context, which helps us to simplify logic programs, enhance inference engines, construct human-friendly knowledge bases etc. Since LPMLN is a combination of ASP and MLN, the notions of strong equivalences in LPMLN is quite different from that in ASP. Firstly, we present the notions of p-strong and w-strong equivalences between LPMLN programs. Secondly, we present a characterization of the notions by generalizing the SE-model approach in ASP. Finally, we show the use of strong equivalences in simplifying LPMLN programs, and present a sufficient and necessary syntactic condition that guarantees the strong equivalence between a single LPMLN rule and the empty program.
LPMLN is a probabilistic extension of answer set programs with the weight scheme adapted from Markov Logic. We study the concept of strong equivalence in LPMLN, which is a useful mathematical tool for simplifying a part of an LPMLN program without looking at the rest of it. We show that the verification of strong equivalence in LPMLN can be reduced to equivalence checking in classical logic via a reduct and choice rules as well as to equivalence checking under the ‘soft’ logic of here-and-there. The result allows us to leverage an answer set solver for LPMLN strong equivalence checking. The study also suggests us a few reformulations of the LPMLN semantics using choice rules, the logic of here-and-there, and classical logic.
In the past, the semantic issues raised by the non-monotonic nature of aggregates often prevented their use in the recursive statements of logic programs and deductive databases. However, the recently introduced notion of Pre-mappability (PreM) has shown that, in key applications of interest, aggregates can be used in recursion to optimize the perfect-model semantics of aggregate-stratified programs. Therefore we can preserve the declarative formal semantics of such programs while achieving a highly efficient operational semantics that is conducive to scalable implementations on parallel and distributed platforms. In this paper, we show that with PreM, a wide spectrum of classical algorithms of practical interest, ranging from graph analytics and dynamic programming based optimization problems to data mining and machine learning applications can be concisely expressed in declarative languages by using aggregates in recursion. Our examples are also used to show that PreM can be checked using simple techniques and templatized verification strategies. A wide range of advanced BigData applications can now be expressed declaratively in logic-based languages, including Datalog, Prolog, and even SQL, while enabling their execution with superior performance and scalability.
In artificial intelligence, multi-agent systems constitute an interesting typology of society modeling, and in this regard have vast fields of application, which extend to the human sciences. Logic is often used to model such kinds of systems because it is easier to verify than other approaches, and it provides explainability and potential validation. In this paper we define a time module suitable for adding time to many logic representations of agents.
We present a fast and scalable algorithm to induce non-monotonic logic programs from statistical learning models. We reduce the problem of search for best clauses to instances of the High-Utility Itemset Mining (HUIM) problem. In the HUIM problem, feature values and their importance are treated as transactions and utilities respectively. We make use of TreeExplainer, a fast and scalable implementation of the Explainable AI tool SHAP, to extract locally important features and their weights from ensemble tree models. Our experiments with UCI standard benchmarks suggest a significant improvement in terms of classification evaluation metrics and running time of the training algorithm compared to ALEPH, a state-of-the-art Inductive Logic Programming (ILP) system.
Our research is focused on making a human-like question answering system which can answer rationally. The distinguishing characteristic of our approach is that it will use automated common sense reasoning to truly ‘understand’ dialogues, allowing it to converse like a human. Humans often make many assumptions during conversations. We infer facts not told explicitly by using our common sense. Incorporating commonsense knowledge in a question answering system will simply make it more robust.
In recent years, mobile devices have seen increasing development, with stronger computation capability and larger storage. Some computation-intensive machine learning and deep learning tasks can now be run on mobile devices. To take advantage of the resources available on mobile devices and preserve users’ privacy, the idea of mobile distributed machine learning is proposed. It uses local hardware resources and local data to solve machine learning sub-problems on mobile devices, and only uploads computation results instead of original data to contribute to the optimization of the global model. This architecture can not only relieve the computation and storage burden on servers, but also protect users’ sensitive information. Another benefit is the bandwidth reduction, as various kinds of local data can now participate in the training process without being uploaded to the server. In this paper, we provide a comprehensive survey on recent studies of mobile distributed machine learning. We survey a number of widely-used mobile distributed machine learning methods. We also present an in-depth discussion on the challenges and future directions in this area. We believe that this survey can provide a clear overview of mobile distributed machine learning and offer guidelines on applying mobile distributed machine learning to real applications.
In this paper, we consider the problem of accelerating the numerical simulation of time dependent problems by time domain decomposition. The available algorithms enabling such decompositions present severe efficiency limitations and are an obstacle for the solution of large scale and high dimensional problems. Our main contribution is the improvement of the parallel efficiency of the parareal in time method. The parareal method is based on combining predictions made by a numerically inexpensive solver (with coarse physics and/or coarse resolution) with corrections coming from an expensive solver (with high-fidelity physics and high resolution). At convergence, the parareal algorithm provides a solution that has the fine solver’s high-fidelity physics and high resolution. In the classical version of parareal, the fine solver has a fixed high accuracy, which is the major obstacle to achieving a competitive parallel efficiency. In this paper, we develop an adaptive variant of the algorithm that overcomes this obstacle. Thanks to this, the only remaining factor impacting performance becomes the cost of the coarse solver. We show both theoretically and in a numerical example that the parallel efficiency becomes very competitive when the cost of the coarse solver is small.
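For reference, the coarse/fine combination the abstract describes is the standard parareal update; in the usual notation (assumed here, not taken from the paper), with $\mathcal{G}$ the coarse solver, $\mathcal{F}$ the fine solver, and $U^n_k$ the solution at time slice $n$ after iteration $k$:

```latex
% Parareal iteration: coarse prediction on the current iterate,
% corrected by the fine-minus-coarse difference from the previous iterate
U^{n+1}_{k+1} = \mathcal{G}\!\left(U^{n}_{k+1}\right)
              + \mathcal{F}\!\left(U^{n}_{k}\right)
              - \mathcal{G}\!\left(U^{n}_{k}\right)
```

At convergence $U^n_{k+1} = U^n_k$, the coarse terms cancel, and the fine solver's solution is recovered, which is why the iterate inherits the fine solver's accuracy.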
Data are often represented as graphs. Many common tasks in data science are based on distances between entities. While some data science methodologies natively take graphs as their input, there are many more that take their input in vectorial form. In this survey we discuss the fundamental problem of mapping graphs to vectors, and its relation with mathematical programming. We discuss applications, solution methods, dimensional reduction techniques and some of their limits. We then present an application of some of these ideas to neural networks, showing that distance geometry techniques can give competitive performance with respect to more traditional graph-to-vector mappings.
For diagnostic inference under uncertainty, Bayesian networks are investigated. The method is based on an adequate uniform representation of the necessary knowledge. This includes both generic and experience-based specific knowledge, which is stored in a knowledge base. For knowledge processing, a combination of the problem-solving methods of concept-based and case-based reasoning is used. Concept-based reasoning is used for diagnosis, therapy and medication recommendation, and the evaluation of generic knowledge. Exceptions in the form of specific patient cases are processed by case-based reasoning. In addition, the use of Bayesian networks makes it possible to deal with uncertainty, fuzziness and incompleteness. Thus, the valid general concepts can be issued according to their probability. To this end, various inference mechanisms are introduced and subsequently evaluated within the context of a developed prototype. Tests are employed to assess the network’s classification of diagnoses.
We have elsewhere reviewed proposals to reform terminology and improve interpretations of conventional statistics by emphasizing logical and information concepts over probability concepts. We here give detailed reasons and methods for reinterpreting statistics (including but not limited to) P-values and interval estimates in unconditional terms, which describe compatibility of observations with an entire set of analysis assumptions, rather than just a narrow target hypothesis. Such reinterpretations help avoid overconfident inferences whenever there is uncertainty about the assumptions used to derive and compute the statistical results. Examples of such assumptions include not only standard statistical modeling assumptions, but also assumptions about absence of systematic errors, protocol violations, and data corruption. Unconditional descriptions introduce uncertainty about such assumptions directly into statistical presentations of results, rather than leaving that only to the informal discussion that ensues. We thus view unconditional description as a vital component of good statistical training and presentation.

### Parsing SDA Pages


SDA is a suite of software developed at Berkeley for the web-based analysis of survey data. The Berkeley SDA archive (http://sda.berkeley.edu) lets you run various kinds of analyses on a number of public datasets, such as the General Social Survey. It also provides consistently-formatted HTML versions of the codebooks for the surveys it hosts. This is very convenient! For the gssr package, I wanted to include material from the codebooks as tibbles or data frames that would be accessible inside an R session. Processing the official codebook from its native PDF state into a data frame is, though technically possible, a rather off-putting prospect. But SDA has done most of the work already by making the pages available in HTML. I scraped the codebook pages from them instead. This post contains the code I used to do that.

Although I haven’t looked in detail, it seems that SDA has almost identical codebooks for the other surveys it hosts, so this code could be adapted for use with them. There will be some differences—e.g. the GSS has a “Text of this Question” field along with marginal summaries of the variable for each question in the survey, while the ANES seems to lack that field. But it seems clear that the HTML/CSS structure that SDA outputs is basically the same across datasets.

Here’s the code for the GSS documentation. There’s a GitHub repository that will allow you to reproduce what you see here. I’ve included the html files in the repository so you don’t have to scrape the SDA site. Do not try to slurp up the content of the SDA site in a way that is rude to their server.

## Libraries


library(tidyverse)
library(rvest)


## Scrape the GSS codebook from SDA

This next code chunk shows how we got the codebook data, but it is not evaluated (we set eval = FALSE), because we only need to do it once. We use sprintf() to generate a series of numbers with leading zeros, of the form 001, 002, 003, and so on. The 261 is hard-coded for this particular directory, but we should really grab the directory listing, evaluate how many files it lists (of the sort we want), and then use that number instead.


## Generate vector of doc page urls
urls <- paste0("https://sda.berkeley.edu/D3/GSS18/Doc/",
               "hcbk", sprintf('%0.4d', 1:261), ".htm")

## Grab the codebook pages one at a time
doc_pages <- urls %>%
  map(~ {
    message(glue::glue("* parsing: {.x}"))
    Sys.sleep(5) # try to be polite
    safely(read_html)(.x) # capture each page, or the error, as $result / $error
  })
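As a quick sanity check of the `sprintf()` zero-padding pattern used to build the urls, a minimal self-contained example:

```r
## '%0.4d' pads integers with zeros to a width of four digits
sprintf('%0.4d', c(1, 42, 261))
## [1] "0001" "0042" "0261"
```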


## Save the scraped webpages locally

There’s a gotcha with objects like doc_pages: they cannot be straightforwardly saved to R’s native data format with save(). The XML files are stored with external pointers to their content and cannot be “serialized” in a way that saves their content properly. If you try, when you load() the saved object you will get complaints about missing pointers. So instead, we’ll unspool our list and save each page individually. Then if we want to rerun this analysis without crawling everything again, we will load them in from our local saved versions using read_html().

Again, this code chunk is shown but not run, as we only do it once.


## Get a list containing every codebook webpage,
## Drop the safely() error codes from the initial scrape (after we've checked them),
## and also drop any NULL entries
page_list <- map(doc_pages, "result") %>%
  compact()

## Make a vector of clean file names of the form "raw/001.htm"
## One for every page we grabbed. Same order as the page_list.
## We use sprintf to get numbers of the form 001, 002, 003 etc.
fnames <- paste0("raw/",
                 sprintf('%0.4d', seq_along(page_list)),
                 ".htm")

## Walk the elements of the page list and the file names to
## save each HTML file under its respective local file name
walk2(page_list, fnames, ~ write_xml(.x, file = .y))


The walk() and walk2() functions are very handy for processing batches of items when you want to produce a “side-effect” of the function you’re mapping, such as a plot or (in this case) a saved file.
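A tiny self-contained illustration of `walk2()` being used purely for its side effect, here writing a batch of files to a temporary directory:

```r
library(purrr)

## Write each string to its matching file name; walk2() returns its
## input invisibly, so the writes are the whole point
items  <- c("alpha", "beta")
fnames <- file.path(tempdir(), c("a.txt", "b.txt"))
walk2(items, fnames, ~ writeLines(.x, .y))

readLines(fnames[1])
## [1] "alpha"
```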

## Read in the pages from the local directory

Using the local data we’ve saved, we read in a list of all the web pages. Our goal is to get them into a tractable format (a tibble or data frame). From there we can write some functions to, e.g., query the codebook directly from the console, or alternatively produce the codebook in a format suitable for integrating into the R help system via a package.


## The names of all the files we just created
local_urls <- fs::dir_ls("raw/")

## Read all the pages back in, from local storage
doc_pages <- local_urls %>%
  map(~ {
    safely(read_html)(.x) # re-read each saved page, capturing any errors
  })

## Are there any errors?
doc_pages %>% map("error") %>%
  flatten_dfr()


## # A tibble: 0 x 0


## quick look at first five items in the list
summary(doc_pages)[1:5,]


##              Length Class  Mode
## raw/0001.htm 2      -none- list
## raw/0002.htm 2      -none- list
## raw/0003.htm 2      -none- list
## raw/0004.htm 2      -none- list
## raw/0005.htm 2      -none- list


## Quick look inside the first record
doc_pages[[1]]


## $result
## {html_document}
## [1] <head> ...
## [2] <body> ... General Social Survey 1972-2018 Cumula ...
##
## $error
## NULL


## Parse the pages

Next, we parse every webpage to extract a row for every variable. There are multiple variables per page.

### Functions


## Page of variables to list of variables and their info,
parse_page <- function(x){
html_nodes(x, ".dflt") %>%
map(~ html_nodes(.x, ".noborder")) %>%
map(~ html_table(.x))
}

## Length of each list element
## Standard GSS Qs will have 4 elements
## IDs, recodes, and other things will have 3
get_lengths <- function(x){
map(x, length)
}

get_names <- function(x){
map(x, names)
}

## Variable short names and descriptions
get_var_ids <- function(x){
x %>% map_dfr(1) %>%
select(id = X1, description = X3) %>%
as_tibble()
}

## Question Text
get_text <- function(x, y){
if(y[[1]] == 3) {
return(NA_character_)
} else {
stringr::str_trim(x[[2]])
}
}

## Question Marginals
get_marginals <- function(x, y){
if(y[[1]] == 3) {
tmp <- x[[2]]
} else {
tmp <- x[[3]]
}

if(ncol(tmp) == 2) {
as_tibble(tmp) %>%
select(cases = X1, range = X2)
} else {
tmp <- as_tibble(tmp[, colSums(is.na(tmp)) != nrow(tmp)]) %>%
janitor::clean_names()
tmp$value <- as.character(tmp$value)
tmp
}
}


## Question Properties
get_props <- function(x, y){
if(y[[1]] == 3) {
tmp <- x[[3]]
colnames(tmp) <- c("property", "value")
tmp <- as_tibble(tmp)
tmp$property <- stringr::str_remove(tmp$property, ":")
tmp
} else {
tmp <- x[[4]]
colnames(tmp) <- c("property", "value")
tmp <- as_tibble(tmp)
tmp$property <- stringr::str_remove(tmp$property, ":")
tmp
}
}

## Take the functions above and process a page to a tibble of cleaned records

process_page <- function(x){
page <- parse_page(x)
q_vars <- get_var_ids(page)
lens <- get_lengths(page)
keys <- q_vars$id

q_text <- map2_chr(page, lens, ~ get_text(.x, .y))
q_text <- stringr::str_trim(q_text)
q_text <- stringr::str_remove_all(q_text, "\n")
q_text <- tibble(id = keys, q_text = q_text)
q_text <- q_text %>% mutate(q_text = replace_na(q_text, "None"))

q_marginals <- map2(page, lens, ~ get_marginals(.x, .y)) %>%
  set_names(keys)
q_marginals <- map2(q_marginals, keys, ~ add_id(.x, .y))

q_props <- map2(page, lens, ~ get_props(.x, .y)) %>%
  set_names(keys)
q_props <- map2(q_props, keys, ~ add_id(.x, .y))

q_tbl <- q_vars %>%
  add_column(properties = q_props) %>%
  add_column(marginals = q_marginals) %>%
  left_join(q_text) %>%
  rename(text = q_text)

q_tbl
}

## add_id: attach the variable id to a sub-table (a minimal version,
## assumed from how it is used above and in the output below)
add_id <- function(x, y){
  x$id <- y
  x
}


## Make the tibble

Parse the GSS variables into a tibble, with list columns for the marginals and the variable properties.


gss_doc <- doc_pages %>%
  map("result") %>% # Get just the webpages
  compact() %>%
  map(process_page) %>%
  bind_rows()


## Look at the outcome


gss_doc


## # A tibble: 6,144 x 5
##    id     description      properties marginals text
##  1 CASEID YEAR + Responde…
##  2 YEAR   GSS year for th…
##  3 ID     Respondent ID n…
##  4 AGE    Age of responde…
##  5 SEX    Respondents sex
##  6 RACE   Race of respond…
##  7 RACEC… What Is R's rac…
##  8 RACEC… What Is R's rac…
##  9 RACEC… What Is R's rac…
## 10 HISPA… Hispanic specif…
## # … with 6,134 more rows


gss_doc$id <- tolower(gss_doc$id)

gss_doc %>% filter(id == "race") %>%
  select(text)


## # A tibble: 1 x 1
##   text
## 1 24. What race do you consider yourself?


gss_doc %>% filter(id == "race") %>%
  select(marginals) %>%
  unnest(cols = c(marginals))


## # A tibble: 4 x 5
##   percent n      value label id
## 1    80.3 52,033 1     WHITE RACE
## 2    14.2 9,187  2     BLACK RACE
## 3     5.5 3,594  3     OTHER RACE
## 4   100   64,814       Total RACE


gss_doc %>% filter(id == "sex") %>%
  select(text)


## # A tibble: 1 x 1
##   text
## 1 23. Code respondent's sex


gss_doc %>% filter(id == "sex") %>%
  select(marginals) %>%
  unnest(cols = c(marginals))


## # A tibble: 3 x 5
##   percent n      value label  id
## 1    44.1 28,614 1     MALE   SEX
## 2    55.9 36,200 2     FEMALE SEX
## 3   100   64,814       Total  SEX


gss_doc %>% filter(id == "fefam") %>%
  select(text)


## # A tibble: 1 x 1
##   text
## 1 252. Now I'm going to read several more statements. As I read each one, …


gss_doc %>% filter(id == "fefam") %>%
  select(properties) %>%
  unnest(cols = c(properties))


## # A tibble: 3 x 3
##   property           value   id
## 1 Data type          numeric FEFAM
## 2 Missing-data codes 0,8,9   FEFAM
## 3 Record/column      1/1114  FEFAM


## Save the data object as efficiently as we can

Shown here but not run.


save(gss_doc, file = "data/gss_doc.rda", compress = "xz")
# tools::checkRdaFiles("data")


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

### How to think scientifically about scientists’ proposals for fixing science

I kinda like this little article which I wrote a couple years ago while on the train from the airport. It will appear in the journal Socius. Here’s how it begins:

Science is in crisis. Any doubt about this status has surely been dispelled by the loud assurances to the contrary by various authority figures who are deeply invested in the current system and have written things such as, “Psychology is not in crisis, contrary to popular rumor . . . Crisis or no crisis, the field develops consensus about the most valuable insights . . . National panels will convene and caution scientists, reviewers, and editors to uphold standards.” (Fiske, Schacter, and Taylor, 2016).
When leaders go to that much trouble to insist there is no problem, it’s only natural for outsiders to worry. The present article is being written for a sociology journal, which is appropriate for two reasons. First, sociology includes the study of institutions and communities; modern science is both an institution and a community, and as such it would be of interest to me as a citizen and a political scientist, even beyond my direct involvement as a practicing researcher. Second, sociology has a tradition of questioning; it is a field from whose luminaries I hope never to hear platitudes such as “Crisis or no crisis, the field develops consensus about the most valuable insights.”

Sociology, like statistics and political science, is inherently accepting of uncertainty and variation. Following Karl Popper, Thomas Kuhn, Imre Lakatos, and Deborah Mayo, we cheerfully build our theories as tall and broad as we can, in the full awareness that reality will knock them down. We know that one of the key purposes of data analysis is to “kill our darlings,” and we also know that the more specific we make our models, the more we learn from their rejection. Structured modeling and thick description go together.

Just as we learn in a local way from our modeling failures, we can learn more globally from crises in entire subfields of science. When I say that the replication crisis is also an opportunity, this is more than a fortune-cookie cliche; it is also a recognition that when a group of people make a series of bad decisions, this motivates a search for what went wrong in their decision-making process.

A full discussion of the crisis in science would include three parts:

1. Evidence that science is indeed in crisis: at the very least, a series of examples of prominent products of mainstream science that were seriously flawed but still strongly promoted by the scientific community, and some evidence or at least speculation that such problems are prevalent enough to be worth our concern.

2. A discussion of what has gone wrong in the ideas and methods of scientific inquiry and in the process by which scientific claims are promoted and disseminated within the community and the larger society. This discussion could include specific concerns about statistical methods such as null hypothesis significance testing, and also institutional issues such as the increasing pressure on researchers to publish large numbers of articles.

3. Proposed solutions, which again range from research methods (for example, the suggestion to perform within-person, rather than between-person, comparisons wherever possible), to rules such as preregistration of hypotheses, to changes in the system of scientific publication and credit.

I and others have written enough on topics 1 and 2, and since this article has been solicited for a collection on Fixing Science, I’ll restrict my attention to topic 3: what to do about the problem?

I then continue:

If you’ve gone to the trouble to pick up (or click on) this volume in the first place, you’ve probably already seen, somewhere or another, most of the ideas I could possibly propose on how science should be fixed. My focus here will not be on the suggestions themselves but rather on what our reasons are for thinking these proposed innovations might be good ideas. The unfortunate paradox is that the very aspects of “junk science” that we so properly criticize—the reliance on indirect, highly variable measurements from nonrepresentative samples, open-ended data analysis, followed up by grandiose conclusions and emphatic policy recommendations drawn from questionable data—all seem to occur when we suggest our own improvements to the system. All our carefully-held principles seem to evaporate when our emotions get engaged. . . .

After some discussion of potential solutions, I conclude:

The foregoing review is intended to be thought provoking, but not nihilistic. One of the most important statistical lessons from the recent replication crisis is that certainty or even near-certainty is harder to come by than most of us had imagined. We need to make some decisions in any case, and as the saying goes, deciding to do nothing is itself a decision. Just as an anxious job-interview candidate might well decide to chill out with some deep breaths, full-body stretches, and a power pose, those of us within the scientific community have to make use of whatever ideas are nearby, in order to make the micro-decisions that, in the aggregate, drive much of the direction of science. And, when considering larger ideas, proposals for educational requirements or recommendations for new default statistical or research methods or reorganizations of the publishing system, we need to recognize that our decisions will necessarily rely much more on logic and theory than on direct empirical evidence. This suggests in turn that our reasoning be transparent and openly connected to the goals and theories that motivate and guide our attempts toward fixing science.

It’s fun, writing an article like this from first principles, with no position to defend, just trying to think things through.
### Research Guide for Video Frame Interpolation with Deep Learning

In this research guide, we’ll look at deep learning papers aimed at synthesizing video frames within an existing video.

### Super Solutions for Shiny Apps #4 of 5: Using R6 Classes

[This article was first published on r – Appsilon Data Science | End-to-End Data Science Solutions, and kindly contributed to R-bloggers].

## TL;DR

Why use object-oriented programming in Shiny applications? It’ll help organize the code in your application!

## Organize Your Shiny Code with Object-Oriented Programming

Classes are used widely in all R programming — usually the S3 ones. Even if you’ve never heard of them, as an R user you’re surely familiar with object classes like data.frame. The magic of classes makes the same functions (like print or plot) operate differently for specific cases. This system also lets you define methods and attributes dedicated to an object — thus basing your code around objects, not functions.

How can this be useful in a Shiny app? Well, for advanced apps it helps you organize your code. This is often how a Shiny app should work anyway — usually the user is interacting with some specific data structure that requires e.g. exploration, modification, saving, exporting, etc. What is more, Shiny apps contain many additional pieces required for production-ready applications, which work well when organized in a class system. Let us go through this functionality.

You may wonder what objects can be stored in the app. Well, there might be ‘settings’ with private fields of configuration and public methods to get or modify them. There might be a ‘user data’ class with all of the values specific to the logged-in user and methods to read them.
The data probably comes from some database, the connection to which may be another class, as well as the log builder. And some specific spatial dataset used on a new screen may require unusual functions to operate on it. And so on… The point is this: once your app grows with new functionalities, you will end up with tons of functions organized between multiple ‘utils-something’ scripts and a bunch of lists with stored current values.

Using object-oriented programming (OOP) in a Shiny app allows developers to organize the code into packs of object-attributes-functions. It introduces a clear system for getting the current state of a piece of functionality, the operations that can be performed with it, the auxiliary functions, and the initial state of the object (which can depend on the particular user’s access rights — very useful!).

At Appsilon we recommend using the R6 class system. What is R6? To make a long story short, it is a modern, fast and simple implementation of object-oriented programming (OOP) in R. To learn more about the system, check Hadley’s Advanced R book.

How do we recommend organizing the code with the class system introduced? Keep each class in a separate script with a name similar to the class name, and store them all together in a folder separated from other code, e.g. the ui definitions of modules (folder structure can be problematic when developing a Shiny app as a package — doing so has pros and cons; at Appsilon we do not package our apps, as this gives us more freedom in building them). Once the code is imported (either in a package or separately with e.g. the modules::use function), the class needs to be initialized in the global.R script with the new() method (automatically available for all classes) and attached to some object.
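A minimal sketch of the pattern described above; the class name, fields, and methods here are hypothetical illustrations, not taken from the Appsilon post:

```r
library(R6)

## A hypothetical 'Settings' class: configuration kept in a private
## field, with public methods to read and modify it
Settings <- R6Class("Settings",
  public = list(
    initialize = function(config = list()) {
      private$config <- config
    },
    get = function(key) {
      private$config[[key]]
    },
    set = function(key, value) {
      private$config[[key]] <- value
      invisible(self) # return self so calls can be chained
    }
  ),
  private = list(
    config = NULL
  )
)

## Typically initialized once in global.R and used across the app
app_settings <- Settings$new(config = list(theme = "dark"))
app_settings$get("theme")
## [1] "dark"
```

Because the state lives inside the object, any server module that receives `app_settings` sees the same configuration, rather than each module reading from scattered global lists.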
The class, its attributes and methods can then be used everywhere in the code — and you’ve gained a clear overview of the purpose of each method call, what object it modifies, and where to find the details of that particular functionality.

It might not be obvious from the beginning which classes will be used in the code — Shiny apps often start as a simple prototype and finally become an advanced production solution. Just don’t be afraid, and always keep in mind that classes are a good way to organize your code once your app goes big!

Thanks for reading! Questions? Comments? Add them below and/or find me on Twitter @dubelmarcin. You can also check out the other posts in the “Super Solutions for Shiny Architecture Series”: #1 Using Session Data, #2 Javascript is your friend, and #3 Softcoding Constants in the App. And don’t forget to sign up for our newsletter!

To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science | End-to-End Data Science Solutions.

### Simulating data with Bayesian networks

[This article was first published on R – Daniel Oehm | Gradient Descending, and kindly contributed to R-bloggers].

Bayesian networks are really useful for many applications, and one of those is to simulate new data. Bayes nets represent data as a probabilistic graph, and from this structure it is then easy to simulate new data. This post will demonstrate how to do this with bnlearn.
## Fit a Bayesian network

Before simulating new data we need a model to simulate data from. Using the same Australian Institute of Sport dataset from my previous post on Bayesian networks we’ll set up a simple model. For convenience I’ll subset the data to 6 variables.


library(DAAG)
library(tidyverse)

# ais data from the DAAG package
ais_sub <- ais %>% dplyr::select(sex, sport, pcBfat, hg, rcc, hc)


The variables sex and sport are pretty straightforward. The remaining four are

• pcBfat – percent of body fat
• hg – hemoglobin concentration
• rcc – red cell count
• hc – hematocrit percentage

I’ve allowed the data to learn the structure of the network, bar one arc, sport to percentage of body fat. The details are not shown here, but check out the post above on how to fit the structure algorithmically (also I suggest heading to the bnlearn doco, which has great examples of a number of networks that can be downloaded). The structure is defined by the string and converted to a bn class object.


library(visNetwork)
library(bnlearn)

# set structure
bn_struct <- model2network("[sex][sport|sex][hg|sex][pcBfat|sex:sport][hc|hg:pcBfat][rcc|hc]")

bn_struct


##
##   Random/Generated Bayesian network
##
##   model:
##    [sex][hg|sex][sport|sex][pcBfat|sex:sport][hc|hg:pcBfat][rcc|hc]
##   nodes:                                 6
##   arcs:                                  7
##     undirected arcs:                     0
##     directed arcs:                       7
##   average markov blanket size:           2.67
##   average neighbourhood size:            2.33
##   average branching factor:              1.17
##
##   generation algorithm:                  Empty


# plot network - code for function at end of post
plot.network(bn_struct)


Now that we have set the structure of the model, it is fit to the data with bn.fit using maximum likelihood estimation.


bn_mod <- bn.fit(bn_struct, data = ais_sub, method = "mle")


The output is quite detailed, so it’s worth running bn_mod to view the conditional probability tables and Gaussian distributions.

## Simulate data

New data is simulated from a Bayes net by first sampling from each of the root nodes, in this case sex.
The children are then sampled conditional on their parent(s) (e.g. sport | sex and hg | sex) until data for all nodes has been drawn. The numbers on the nodes below indicate the sequence in which the data is simulated, noting that rcc is the terminal node.

From this point it’s easy to simulate new data using rbn. Here we simulate a dataset the same size as the original, but you can simulate as many rows as needed.


ais_sim <- rbn(bn_mod, 202)
head(ais_sim)

##         hc       hg    pcBfat      rcc sex   sport
## 1 38.00754 12.43565 13.772499 3.917082   f    Swim
## 2 45.54250 15.79388 13.586402 4.824458   m   Field
## 3 49.87429 17.31542  5.308675 5.814398   m  T_400m
## 4 49.05707 17.19443  9.230973 5.337367   m  W_Polo
## 5 37.66307 12.99088 13.685909 4.020170   f     Gym
## 6 42.33715 14.62894 15.147165 4.440556   f Netball


Done. We now have a fully synthetic dataset which retains the properties of the original data. And it only took a few lines of code.

An important property of generating synthetic data is that it doesn’t use real data to do so, meaning any predictors need to be simulated first (my post on the synthpop package explains this in more detail). This property is retained here, since the data is generated sequentially as per the structure of the network. Also, when using synthpop the order in which the variables are simulated needs to be set. The order can alter the accuracy of the simulated dataset, so it’s important to spend the time to get it right. For a Bayesian network the order is determined by the structure, so in effect this step is already done.

## Compare original and simulated datasets

The original and simulated datasets are compared in three ways: 1) observing the distributions of the variables, 2) comparing the output from various models, and 3) comparing conditional probability queries. The third test is more of a sanity check. If the data is generated from the original Bayes net, then a new one fit on the simulated data should be approximately the same.
The more rows we generate, the closer the parameters will be to the original values.

The variable distributions are very close to the original, with only a small amount of variation, mostly observed in sport. Red cell count may have a slight bi-modal distribution, but for the most part it's a good fit. This amount of variation is reasonable since there are only 202 simulated observations. Simulating more rows will give a closer fit, but there are often practical considerations for retaining the same size dataset.

For the second check, two linear models are fit to the original and simulated data to predict hematocrit levels with sex, hemoglobin concentration, percentage of body fat and red cell count as predictors. Sport was left out of the model since it was not a strong predictor of hc and only increased the error.

```r
# fit models
glm_og <- glm(hc ~ hg + rcc + pcBfat + sex, data = ais_sub)
glm_sim <- glm(hc ~ hg + rcc + pcBfat + sex, data = ais_sim)

# print summary
summary(glm_og)
##
## Call:
## glm(formula = hc ~ hg + rcc + pcBfat + sex, data = ais_sub)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.9218  -0.4868   0.0523   0.5470   2.8983
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.15656    1.04109   4.953 1.57e-06 ***
## hg           1.64639    0.11520  14.291  < 2e-16 ***
## rcc          3.04366    0.31812   9.568  < 2e-16 ***
## pcBfat      -0.02271    0.01498  -1.517    0.131
## sexm        -0.20182    0.23103  -0.874    0.383
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.888242)
##
##     Null deviance: 2696.92  on 201  degrees of freedom
## Residual deviance:  174.98  on 197  degrees of freedom
## AIC: 556.25
##
## Number of Fisher Scoring iterations: 2

summary(glm_sim)
##
## Call:
## glm(formula = hc ~ hg + rcc + pcBfat + sex, data = ais_sim)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2117  -0.6232   0.0089   0.5910   2.0073
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.318057   0.953131   5.580  7.9e-08 ***
## hg           1.633765   0.107235  15.235  < 2e-16 ***
## rcc          2.905111   0.299397   9.703  < 2e-16 ***
## pcBfat      -0.001506   0.014768  -0.102    0.919
## sexm         0.319853   0.220904   1.448    0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7520993)
##
##     Null deviance: 3014.20  on 201  degrees of freedom
## Residual deviance:  148.16  on 197  degrees of freedom
## AIC: 522.64
##
## Number of Fisher Scoring iterations: 2
```

The coefficients and test statistics of the models are very similar, so both datasets lead to the same conclusions. Percent of body fat is the least accurate, but still leads to the same conclusion. In practice you should fit more models to assess the quality of the simulated data.

As mentioned, the third check is more of a sanity check, but it is also a good demonstration of the process. By fitting the same structure to the simulated data we expect to estimate the same parameters and calculate very similar conditional probabilities. Here we simulate 20,000 observations to better estimate the parameters. The conditional probability query is for the athlete's red cell count given the sport they compete in, i.e. what is the probability the athlete's red cell count is greater than a given value, where the values are the 33rd and 66th percentiles?
```r
library(progress)

# fit bayes net with the same structure using simulated data
ais_sim <- rbn(bn_mod, 2e4)
bn_mod_sim <- bn.fit(bn_struct, ais_sim, method = "mle")

# sport - mcpquery function at end of post - it generalises bnlearn::cpquery
# rcc is continuous so we're calculating the probability rcc > some value,
# here the 33rd and 66th percentiles
# the function replaces xx and yy with actual values and evaluates the text
given <- "sport == 'xx'"
event <- "rcc > yy"
vals <- as.matrix(expand.grid(sport = unique(ais_sub$sport), rcc = quantile(ais_sub$rcc, c(0.33, 0.66))))

orig <- mcpquery(bn_mod, values = vals, token = c("xx", "yy"), event, given, n = 1e6) %>%
  spread(rcc, cp)
## P( rcc > yy | sport == 'xx' )

sim <- mcpquery(bn_mod_sim, values = vals, token = c("xx", "yy"), event, given, n = 1e6) %>%
  spread(rcc, cp)
## P( rcc > yy | sport == 'xx' )

# join for display
left_join(orig, sim, by = "sport", suffix = c("_orig", "_sim"))
## # A tibble: 10 x 5
##    sport   4.4533_orig 4.9466_orig 4.4533_sim 4.9466_sim
##  1 B_Ball        0.687      0.310       0.682     0.302
##  2 Field         0.764      0.386       0.768     0.384
##  3 Gym           0.477      0.0658      0.475     0.0717
##  4 Netball       0.444      0.0584      0.442     0.0575
##  5 Row           0.652      0.270       0.654     0.271
##  6 Swim          0.752      0.370       0.758     0.376
##  7 T_400m        0.767      0.388       0.769     0.386
##  8 T_Sprnt       0.822      0.449       0.826     0.445
##  9 Tennis        0.640      0.251       0.639     0.249
## 10 W_Polo        0.945      0.571       0.944     0.566
```

The conditional probabilities from the simulated data are very close to the original, as expected. Now we can be confident that our simulated data can be used as an alternative to the original.

## Impute data

Another useful property of Bayes nets is the ability to impute missing values. This is easily done using impute. We'll remove 25% of the observations from variables hg and hc, and allow the Bayes net to impute them.
```r
ais_miss <- ais_sub
miss_id <- sample(1:nrow(ais_sub), 50)
ais_miss[miss_id, c("hg", "hc")] <- NA

# table counting the NA values in hg and hc
apply(ais_miss[, c("hg", "hc")], 2, function(x) table(is.na(x)))
##        hg  hc
## FALSE 152 152
## TRUE   50  50
```

The table confirms there are 50 missing observations from the hemoglobin and hematocrit variables. Now impute using Bayesian likelihood weighting.

```r
ais_imp <- impute(bn_mod, data = ais_miss, method = "bayes-lw")
```

Plotting the imputed against the true values shows the Bayes net imputed the missing values quite well. I've only tested and shown two variables, but the others would perform similarly for the subset of data I have chosen. This data is normally distributed so I expected it to work well; however, if your data has more complex relationships you'll need to be more rigorous with defining the structure.

## Code bits

```r
# plot network function
plot.network <- function(structure, ht = "400px", cols = "darkturquoise", labels = nodes(structure)) {
  if (is.null(labels)) labels <- rep("", length(nodes(structure)))
  nodes <- data.frame(id = nodes(structure),
                      label = labels,
                      color = cols,
                      shadow = TRUE)
  edges <- data.frame(from = structure$arcs[, 1],
                      to = structure$arcs[, 2],
                      arrows = "to",
                      smooth = FALSE,
                      shadow = TRUE,
                      color = "black")
  return(visNetwork(nodes, edges, height = ht, width = "100%"))
}

# conditional probability query functions
mcpquery <- function(mod, values, token, event, given, ns = 1e4) {
  cat("P(", event, "|", given, ")\n")
  UseMethod("mcpquery", values)
}

# numeric
mcpquery.numeric <- function(mod, values, token, event, given, ns = 1e4) {
  y <- rep(NA, length(values))
  pb <- progress_bar$new(
    format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )",
    clear = FALSE, total = length(values))
  for (k in 1:length(values)) {
    givenk <- gsub(token, values[k], given)
    eventk <- gsub(token, values[k], event)
    pb$tick(token = list(k = k, n = length(values), event = eventk, given = givenk))
    y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
  }
  return(tibble(values = values, cp = y) %>% arrange(desc(cp)))
}

# character
mcpquery.character <- function(mod, values, token, event, given, ns = 1e4) {
  y <- rep(NA, length(values))
  pb <- progress_bar$new(
    format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )",
    clear = FALSE, total = length(values))
  for (k in 1:length(values)) {
    givenk <- gsub(token, values[k], given)
    eventk <- gsub(token, values[k], event)
    pb$tick(token = list(k = k, n = length(values), event = eventk, given = givenk))
    y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
  }
  return(tibble(values = values, cp = y) %>% arrange(desc(cp)))
}

# matrix
mcpquery.matrix <- function(mod, values, token, event, given, ns = 1e4) {
  n <- nrow(values)
  y <- rep(NA, n)
  pb <- progress_bar$new(
    format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )",
    clear = FALSE, total = n)
  for (k in 1:n) {
    givenk <- given
    eventk <- event
    for (j in 1:ncol(values)) {
      givenk <- gsub(token[j], values[k, j], givenk)
      eventk <- gsub(token[j], values[k, j], eventk)
    }
    pb$tick(token = list(k = k, n = n, event = eventk, given = givenk))
    y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
  }
  out <- as_tibble(values) %>% bind_cols(tibble(cp = y))
  colnames(out)[1:ncol(values)] <- colnames(values)
  return(out)
}

# compare the synthetic and original data frames
df <- ais_sub %>%
  mutate(type = "orig") %>%
  bind_rows(rbn(bn_mod, 202) %>% mutate(type = "sim"))

gg_list <- list()
grp_var <- "type"
vars <- colnames(df)[colnames(df) != grp_var]
for (k in 1:length(vars)) {
  var_k <- vars[k]
  gg_list[[k]] <- ggplot(df, aes_string(x = var_k, fill = grp_var, col = grp_var))
  if (is.numeric(df[[var_k]])) {
    gg_list[[k]] <- gg_list[[k]] + geom_density(alpha = 0.85, size = 0)
  } else {
    gg_list[[k]] <- gg_list[[k]] + geom_bar(position = "dodge")
  }
  gg_list[[k]] <- gg_list[[k]] +
    theme(axis.text.x = element_text(angle = 90),
          axis.title.x = element_blank()) +
    labs(title = var_k)
}
```

The post Simulating data with Bayesian networks appeared first on Daniel Oehm | Gradient Descending.

To leave a comment for the author, please follow the link and comment on their blog: R – Daniel Oehm | Gradient Descending. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### JAMA retraction after miscoding – new Finalfit function to check recoding

[This article was first published on R – DataSurg, and kindly contributed to R-bloggers.]

Riinu and I are sitting in Frankfurt airport discussing the paper retracted in JAMA this week. During analysis, the treatment variable coded [1,2] was recoded in error to [1,0].
The results of the analysis were therefore reversed. The lung-disease self-management program actually resulted in more attendances at hospital, rather than fewer as had been originally reported.

## Recode check

Checking of recoding is such an important part of data cleaning – we emphasise this a lot in HealthyR courses – but of course mistakes happen. Our standard approach is this:

```r
library(finalfit)
colon_s %>%
  mutate(
    sex.factor2 = forcats::fct_recode(sex.factor,
      "F" = "Male",
      "M" = "Female")
  ) %>%
  count(sex.factor, sex.factor2)
# A tibble: 2 x 3
#   sex.factor sex.factor2     n
# 1 Female     M             445
# 2 Male       F             484
```

The miscode should be obvious.

## check_recode()

However, mistakes may still happen and be missed. So we've bashed out a useful function that can be applied to your whole dataset. This is not to replace careful checking, but may catch something that has been missed. The function takes a data frame or tibble and fuzzy matches variable names. It produces crosstables similar to above for all matched variables. So if you have coded something from sex to sex.factor it will be matched. The match is hungry, so it is more likely to match unrelated variables than to miss similar variables. But if you recode death to mortality it won't be matched. Here's a walk through.

```r
# Install
devtools::install_github('ewenharrison/finalfit')
library(finalfit)
library(dplyr)

# Recode example
colon_s_small = colon_s %>%
  select(-id, -rx, -rx.factor) %>%
  mutate(
    age.factor2 = forcats::fct_collapse(age.factor,
      "<60 years" = c("<40 years", "40-59 years")),
    sex.factor2 = forcats::fct_recode(sex.factor,
      # Intentional miscode
      "F" = "Male",
      "M" = "Female")
  )

# Check
colon_s_small %>%
  check_recode()

$index
# A tibble: 3 x 2
  var1        var2
1 sex.factor  sex.factor2
2 age.factor  age.factor2
3 sex.factor2 age.factor2

$counts
$counts[[1]]
# A tibble: 2 x 3
  sex.factor sex.factor2     n
1 Female     M             445
2 Male       F             484

$counts[[2]]
# A tibble: 3 x 3
  age.factor  age.factor2     n
1 <40 years   <60 years      70
2 40-59 years <60 years     344
3 60+ years   60+ years     515

$counts[[3]]
# A tibble: 4 x 3
  sex.factor2 age.factor2     n
1 M           <60 years     204
2 M           60+ years     241
3 F           <60 years     210
4 F           60+ years     274
```
As can be seen, the output takes the form of a list of length 2. The first element is an index of matched variables. The second is the crosstables as tibbles for each matched variable combination. sex.factor2 can be seen to be miscoded. sex.factor2 and age.factor2 have been matched, but should be ignored.

Numerics are not included by default. To do so:

```r
out = colon_s_small %>%
  select(-extent, -extent.factor, -time, -time.years) %>% # choose to exclude variables
  check_recode(include_numerics = TRUE)
out
# Output not printed for space
```

## Miscoding in survival::colon dataset?

When doing this just today, we noticed something strange in our example dataset, survival::colon.

The variable node4 should be a binary recode of nodes greater than 4. But as can be seen, something is not right!

We’re interested in any explanations those working with this dataset might have.

```r
# Select a tibble and expand
out$counts[[9]] %>%
  print(n = Inf)
# Compressed output shown
# A tibble: 32 x 3
#    nodes node4     n
#  1     0     0     2
#  2     1     0   269
#  3     1     1     5
#  4     2     0   194
#  5     3     0   124
#  6     3     1     1
#  7     4     0    81
#  8     4     1     3
#  9     5     0     1
# 10     5     1    45
# … with 22 more rows
```

There we are then: a function that may be useful in detecting miscoding. So useful, in fact, that we have immediately found probable miscoding in a standard R dataset.

### Finding free science books from Springer

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers.]

Today the biggest book fair in the world starts again in Frankfurt, Germany. I thought this might be a good opportunity to do you some good! Springer is one of the most renowned scientific publishing companies in the world. Normally their books are quite expensive, but Open Access is a megatrend in the publishing business too. If you want to use R in a little fun project to find the latest additions of open access books to their program, read on!

The idea is to create an R script which you can run from time to time to see whether there are new titles available. So we need some place to store the retrieved data in a persistent manner: a database! For our purposes here most database systems would be total overkill, but there is one great solution available: the amazing RSQLite package (on CRAN). This package brings its own lightweight database with it, no need to install any additional software!
And it is fully SQL compatible (SQL stands for Structured Query Language, the industry standard of relational database management systems), like any decent database software. So you only have to install the RSQLite package and then load the DBI package (for database interface). To render the output table in an appealing form we will use the htmlTable package (on CRAN). Have a look at the following fully documented code, which should (hopefully) be quite clear:

```r
library(DBI)
library(htmlTable)

# initial search for English books from 2019
springer_initial <- read.csv("https://link.springer.com/search/csv?facet-content-type=%22Book%22&previous-end-year=2019&date-facet-mode=in&facet-language=%22En%22&showAll=false&query=&facet-end-year=2019&previous-start-year=2019&facet-start-year=2019", encoding = "UTF-8")

# current search for English books from 2020 - has to be updated in the following years!
springer_search <- read.csv("https://link.springer.com/search/csv?previous-end-year=2020&facet-content-type=%22Book%22&date-facet-mode=in&previous-start-year=2020&facet-language=%22En%22&showAll=false&query=&facet-start-year=2020&facet-end-year=2020", encoding = "UTF-8")

# open database connection
springer_db <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")

# initialize database
if (!dbExistsTable(springer_db, "search")) {
  dbWriteTable(springer_db, "search", springer_initial)
}

# read current search table, replace it with new search and compare both
springer_search_old <- dbReadTable(springer_db, "search")
dbRemoveTable(springer_db, "search")
dbWriteTable(springer_db, "search", springer_search)
new_books <- setdiff(springer_search_old, dbReadTable(springer_db, "search"))
if (nrow(new_books) > 0) htmlTable(new_books[c("Item.Title", "Authors", "URL")])
```

[showing only a subset of the more than 200 (!) free titles in 2019]

| | Item.Title | Authors | URL |
|---|---|---|---|
| 47 | Disrupting Finance | Theo Lynn, Prof. John G. Mooney, Dr. Pierangelo Rosati, Prof. Mark Cummins | http://link.springer.com/book/10.1007/978-3-030-02330-0 |
| 84 | Understanding Statistics and Experimental Design | Prof. Dr. Michael H. Herzog, Prof. Dr. Gregory Francis, Ph.D. Aaron Clarke | http://link.springer.com/book/10.1007/978-3-030-03499-3 |
| 85 | Information-Consciousness-Reality | Dr. James B. Glattfelder | http://link.springer.com/book/10.1007/978-3-030-03633-1 |
| 133 | Modelling our Changing World | Dr. Jennifer L. Castle, Prof. Dr. David F. Hendry | http://link.springer.com/book/10.1007/978-3-030-21432-6 |
| 147 | Fundamentals of Clinical Data Science | Dr. Pieter Kubben, Michel Dumontier, Prof. Dr. Andre Dekker | http://link.springer.com/book/10.1007/978-3-319-99713-1 |
| 169 | Reality Lost | Vincent F. Hendricks, Mads Vestergaard | http://link.springer.com/book/10.1007/978-3-030-00813-0 |
| 172 | The Brownian Motion | Prof. Dr. Andreas Löffler, Prof. Dr. Lutz Kruschwitz | http://link.springer.com/book/10.1007/978-3-030-20103-6 |
| 186 | Automated Machine Learning | Prof. Dr. Frank Hutter, Lars Kotthoff, Ph.D. Joaquin Vanschoren | http://link.springer.com/book/10.1007/978-3-030-05318-5 |
| 209 | Lithium-Ion Batteries | Beta Writer | http://link.springer.com/book/10.1007/978-3-030-16800-1 |

```r
# close database connection
dbDisconnect(springer_db)
```

And you thought Christmas was yet to come, right! As an aside, the last entry is an especially interesting case: it is the first machine-generated research book! The "author" Beta Writer was developed in a joint effort between Springer and researchers from Goethe University, Frankfurt. The book is a cross-corpora auto-summarization of current texts from SpringerLink, organized by means of a similarity-based clustering routine in coherent chapters and sections. It automatically condenses a large set of papers into a reasonably short book. More technical details of this fascinating endeavor, with the potential to revolutionize scientific publishing, can be found in the preface of the book.
By clicking on a link you will be taken directly to the respective book page, where you can download the PDF and in most cases also an EPUB file (bonus tip: in most cases you can also download a free version of the book for your Kindle on amazon.com). To get clickable links you need to render an HTML markdown document; otherwise, if you run the script directly in RStudio, you will have to copy and paste the links into your browser. You just have to run the script from time to time to see what is new!

If you want to customize the data retrieved from link.springer.com, have a look at their search interface: you can customize your search by changing the values in the blue boxes. To get the URL which you can paste into the read.csv function above, right click on the button with the down arrow at the upper right corner (marked by the blue arrow) and choose "Copy link address" in the context menu.

In case you want to completely reset the database you can use the following function (with care):

```r
# function for resetting the springer database
reset_springer_db <- function() {
  springer_db <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
  dbRemoveTable(springer_db, "search")
  dbDisconnect(springer_db)
}
```

One small thing: although I tried my very best, there still seems to be an issue with the encoding... some special characters, like the German umlauts äöüÄÖÜ, are just not rendered. If you have a solution for me, please leave it in the comments and I will add it to the post (or perhaps even write a post on the issues of encoding in R, RStudio and Windows).
### Merge MLP And CNN in Keras

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers.]

In the post (https://statcompute.wordpress.com/2017/01/08/an-example-of-merge-layer-in-keras), it was shown how to build a merge-layer DNN by using the Keras Sequential model. In the example below, I tried to sketch a merge-layer DNN with the Keras functional API in both R and Python. In particular, the merge-layer DNN is the average of a multilayer perceptron network and a 1D convolutional network, just for fun and curiosity. Since the purpose of this exercise is to explore the network structure and the use case of the Keras API, I didn't bother to mess around with parameters.

### Distilled News

Who hasn't come across the need to apply the cross validation technique while the dataset in hand is imbalanced with regard to the number of instances per target class value? The question here is: do we apply it properly? The purpose of this article is to show a way to use balancing methods in cross-validation without forcing balancing on the CV test folds, and thus to get more realistic CV evaluation results.

Fully self-driving passenger cars are not 'just around the corner'. Elon Musk claims that Teslas will have a 'full self-driving' capability by the end of 2020.
In particular, he says that Tesla's hardware is already ready for autonomous driving, and what is left is just an update to their current software, which many brilliant scientists are working on. The first instinct for us humans as drivers is probably to look in front of us and decide where the car should move: in which direction, between which lines, etc. As every autonomous vehicle comes with a camera in front, one very important task is to decide the boundaries between which the car should move. For humans, we draw lines on the roads. Now we will teach an autonomous vehicle to see these lines. I promise it will be fun!

Before implementing any tech-related initiative, specialists must answer many whys and hows: What might be the impact of this solution? How do we know which tech stack is optimal for solving this problem? Can we afford this experiment? What is the predicted payback period? Answers to such questions help companies decide whether building a certain solution is worth the effort. But that's not always how it goes: Netflix spent $1 million on recommendation engine improvements it never used.
Github and Microsoft have released a dataset of search queries for code and annotated results to advance the development of search engines that can locate specific code. The dataset includes 99 search queries and ten likely results per query, which experts annotated for their relevance to the query for a total of 4,000 annotations. The search results contain a total of six million functions from open-source code across six programming languages, including Python and Java.
This is the last post of this series looking at explaining model predictions at a global level. We started this series by explaining predictions using white box models such as logistic regression and decision trees. Next, we did model-specific post hoc evaluation on black box models, specifically random forest and XGBoost.
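A widely used model-agnostic, global post hoc method is permutation feature importance. As a rough sketch of the idea (not the series' actual code; the toy model and data below are invented for illustration), you shuffle one feature's column and measure how much the score drops:

```python
import random

def permutation_importance(model, X, y, n_features, metric, rng=random.Random(0)):
    """Global importance of each feature: how much does the score drop
    when that feature's column is shuffled across rows?"""
    base = metric([model(row) for row in X], y)
    importances = []
    for f in range(n_features):
        col = [row[f] for row in X]
        rng.shuffle(col)
        # rebuild the dataset with only feature f permuted
        X_perm = [row[:f] + [v] + row[f + 1:] for row, v in zip(X, col)]
        importances.append(base - metric([model(row) for row in X_perm], y))
    return importances

def accuracy(preds, y):
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# toy "black box": predicts using only feature 0; feature 1 is ignored
model = lambda row: int(row[0] > 0.5)
X = [[0.1, 0.9], [0.9, 0.2], [0.2, 0.8], [0.8, 0.1]] * 10
y = [int(row[0] > 0.5) for row in X]
imp = permutation_importance(model, X, y, 2, accuracy)
```

Shuffling feature 0 degrades accuracy, while shuffling the unused feature 1 leaves the score untouched, which is exactly the global story the method tells.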
Matrix Factorization is a very powerful algorithm with many valuable use cases in multiple industries. Some of the well-known applications of Matrix Factorization are the Netflix 1-million-dollar prize for movie recommendation and Amazon's online application (a much-refined version) for recommending books to readers. Although it is a very popular method, it has some limitations. Our focus in the current paper is to understand latent factors and how to make them more prescriptive. The algorithm works well when we have products in the set already rated by a few users, or when every user has rated a few products. However, if we include new products that have not been rated by any user, or new users whose preferences for the products are not known, it will be difficult to create recommendations for the unrated products and the new users. There are challenges in correlating the underlying features of the products with users' preferences based on key features. The paper below addresses the challenge of interpreting latent factors related to the product and user features. Once we have clarity on the product and user features, we can move towards the prescriptive journey. It will help in designing future products with more popular features, or in finding a market for a particular product.
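To make the latent-factor idea concrete (a toy sketch of the basic algorithm, not the paper's method; the rating data and hyperparameters below are made up), the factorization can be fit with plain stochastic gradient descent on the observed ratings:

```python
import random

def matrix_factorization(ratings, n_users, n_items, k=2, steps=3000, lr=0.01, reg=0.02):
    """Factorize a sparse rating list [(user, item, rating)] into latent-factor
    matrices U (n_users x k) and V (n_items x k) so U[u] . V[i] approximates r."""
    rng = random.Random(0)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                # gradient step with L2 regularization, using the old values
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

# toy data: 3 users, 3 items, ratings on a 1-5 scale
ratings = [(0, 0, 5), (0, 1, 1), (1, 0, 1), (1, 2, 5), (2, 1, 1), (2, 2, 1), (2, 0, 5)]
U, V = matrix_factorization(ratings, 3, 3)

# predicted rating for an observed pair (user 0, item 0, true rating 5)
pred = sum(U[0][f] * V[0][f] for f in range(2))
```

The cold-start limitation described above falls straight out of this sketch: a user or item with no entries in `ratings` never receives a gradient update, so its factors stay at the random initialization.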
Data is often dirty and messy. Sometimes, it doesn’t even come in the right form for quick analysis and visualization. While Tableau (and Prep) had several tools to deal with numeric, categorical, and even spatial data, one consistent missing piece was handling unstructured text data. Not anymore. In the latest edition of Tableau Prep (2019.3), released just a few weeks ago, Prep now natively supports custom R or Python scripts. By using TabPy, one can use the entire suite of R and Python libraries on any datasets without having to leave the Tableau universe. From running machine learning models to calling geospatial APIs, the possibilities are pretty much endless. In this article, we take a crack at something new and exciting: applying natural language processing techniques to unstructured text data using Tableau Prep and Python. Before we dive into that, we start with an in-depth guide on how to set everything up.
As an industrial engineer, it might sound strange using programming languages and software to perform tasks and conduct analyses. ‘Leave that to computer science engineers and data analysts’, you might be thinking. However, recent trends in the industry are demanding the workforce to be able to manage, transform, understand and interpret data for better decision making and for obtaining business insights. R, an open-source and free software environment, represents an amazing tool that can be used by industrial engineers for multiple purposes. With this article, I would like to encourage industrial engineers to start learning and using programming languages and software and to break the paradigm that they are only meant to be used by computer science engineers and data analysts. You will see that it is not difficult at all and will experience the advantages and benefits they can bring to your professional career. Let’s take a look!
In time series analysis, one is typically interested in two things; anomalies and trends. For example, a physician examines an EKG (electrocardiogram – heartbeat reading) for anomalous events that indicate at-risk patients. An individual working within retail needs to understand what items sell and when they sell (seasonality) to increase profits. One method to find anomalies and trends within a time series is to perform a similarity join. Essentially, you compare snippets of the time series against itself by computing the distance between each pair of snippets. While it takes minimal effort to implement a naive algorithm using nested loops, it may take months or years to receive an answer for a moderately sized time series using this approach. Taking advantage of the Matrix Profile algorithms drastically reduces the computation time.
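The naive nested-loop approach mentioned above can be written in a few lines (a brute-force sketch for intuition only, not the optimized Matrix Profile algorithms; the toy series is invented):

```python
import math

def znorm_dist(a, b):
    # z-normalized Euclidean distance between two equal-length snippets
    def znorm(x):
        m = sum(x) / len(x)
        s = math.sqrt(sum((v - m) ** 2 for v in x) / len(x)) or 1.0
        return [(v - m) / s for v in x]
    a, b = znorm(a), znorm(b)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def matrix_profile_naive(ts, m):
    """For each length-m snippet, the distance to its nearest non-trivial match."""
    n = len(ts) - m + 1
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < m:   # exclusion zone: skip trivial self-matches
                continue
            best = min(best, znorm_dist(ts[i:i + m], ts[j:j + m]))
        profile.append(best)
    return profile

# a motif (0, 1, 2, 1) repeated at both ends produces near-zero profile values there
ts = [0, 1, 2, 1, 5, 9, 4, 7, 0, 1, 2, 1]
mp = matrix_profile_naive(ts, 4)
```

Low points in the profile mark repeated structure (trends, motifs) and high points mark snippets unlike anything else in the series (anomalies); the quadratic pair-wise loop is exactly what the Matrix Profile algorithms replace.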
Last week I saw an excellent talk by a researcher at Fast Forward Labs that caused me to think about the instance of machine learning called ‘active learning’. Active learning refers to a number of strategies for dealing with incompletely labeled data, particularly identifying which points to manually label. Most of the use cases people think about when they hear the term ‘machine learning’ involve so-called ‘supervised learning’, meaning that they require data that has a labelled target variable to train on. If a bank wants to build a model that will predict if a particular transaction is fraudulent based on certain characteristics, it needs to have training data that it can show the computer containing known cases of fraudulent and non-fraudulent transactions. If a machine-vision engineer wants to teach a car’s onboard computer to recognize stop signs, it needs to present the computer with some clearly labeled examples of images with and without stop signs. There are unsupervised techniques which don’t reference a specifically labeled target variable – algorithms that try to group points together or that search for anomalies for instance – but predictive models almost always require reference to a target variable and therefore require that your dataset already have that variable accounted for. After all, how else would you validate the model?
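One of the simplest active-learning strategies is uncertainty sampling: ask a human to label the unlabeled points the current model is least sure about. A minimal sketch (the scores below are hypothetical model outputs, not from any particular library):

```python
def uncertainty_sample(probs, k):
    """Pick the k unlabeled points whose predicted positive-class probability
    is closest to 0.5, i.e. where the model is least certain."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

# model scores for 6 unlabeled transactions (e.g. predicted fraud probability)
scores = [0.02, 0.48, 0.97, 0.55, 0.91, 0.10]
to_label = uncertainty_sample(scores, 2)
print(to_label)  # → [1, 3]
```

Points scored near 0 or 1 are cheap for the model to handle on its own; the labeling budget goes to the ambiguous middle, which is where a new label changes the decision boundary most.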
Taking a machine learning project from Python to Scala. In a previous post, I showed how to take a raw dataset of home sales and apply feature engineering techniques in Python with pandas. This allowed us to produce and improve predictions on home sale prices using scikit-learn machine learning models. But what happens when you want to take this sort of project to production, and instead of 10,000 data points perhaps there are tens or hundreds of gigabytes of data to train on? In this context, it is worth moving away from Python and scikit-learn toward a framework that can handle Big Data.
One of the most path-breaking developments in the field of NLP was marked by the release (considered to be the ImageNet moment for NLP) of BERT – a revolutionary NLP model that is superlative when compared with traditional NLP models. It has also inspired many recent NLP architectures, training approaches and language models, such as Google's TransformerXL, OpenAI's GPT-2, ERNIE2.0, XLNet, RoBERTa, etc. Let's deep dive into understanding BERT and its potential to transform NLP.
Supervised Models
• Classic Neural Networks (Multilayer Perceptrons)
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks (RNNs)
Unsupervised Models
• Self-Organizing Maps (SOMs)
• Boltzmann Machines
• AutoEncoders
During a holiday in beautiful France, my family and I played a lot of SET, a simple and elegant card game. The goal is to find specific combinations of cards before others find them. While playing the game we often stared at the cards wondering if there was another SET that we just didn't see. This started a fun personal side project where I apply machine learning to find SET combinations.
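In SET each card has four attributes, and three cards form a SET exactly when every attribute is either all the same or all different across the three cards. The rule-checking half of such a project (separate from the computer-vision half) is easy to sketch; the card encodings below are made up:

```python
from itertools import combinations

def is_set(cards):
    """Three cards form a SET iff each attribute is all-same or all-different."""
    return all(len({card[i] for card in cards}) in (1, 3)
               for i in range(len(cards[0])))

def find_sets(cards):
    # brute force over all trios on the table
    return [trio for trio in combinations(cards, 3) if is_set(trio)]

# cards as (number, color, shading, shape) tuples
cards = [
    (1, "red",    "solid",   "oval"),
    (2, "green",  "striped", "oval"),
    (3, "purple", "open",    "oval"),   # forms a SET with the two above
    (1, "red",    "open",    "squiggle"),
]
print(len(find_sets(cards)))  # → 1
```

With 12 cards on the table there are only 220 trios to test, so the brute-force check is instant; the machine-learning effort goes into recognizing the cards, not into searching for SETs.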
In deep learning we have the concept of loss, which tells us how poorly the model is performing at that instant. We need to use this loss to train our network so that it performs better. Essentially, we take the loss and try to minimize it, because a lower loss means our model will perform better. The process of minimizing (or maximizing) any mathematical expression is called optimization, and we now need to see how we can use these optimization methods for neural networks. In a neural network, there are many weights between each pair of layers. Each and every weight in the network affects the output in some way, because they are all directly or indirectly connected to the output.
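The loop described here, compute the loss, then nudge each weight downhill to reduce it, can be sketched with plain gradient descent on a one-weight model. The data, learning rate, and iteration count are illustrative values, not from the original post.

```python
# Gradient descent on a single weight: minimize mean squared error for y = w * x.
def loss(w, xs, ys):
    """Mean squared error of the one-weight model."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w, xs, ys):
    """Derivative dL/dw of the mean squared error."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # true relationship: y = 2x
w, lr = 0.0, 0.05
for _ in range(200):
    w -= lr * grad(w, xs, ys)                # step downhill on the loss surface

print(round(w, 3))  # -> 2.0: the weight converges to the loss minimum
```

The same idea, applied simultaneously to millions of weights via backpropagation, is what optimizers like SGD and Adam do inside a neural network.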
AdaNet provides a framework that could automatically produce a high-quality model given an arbitrary set of features and a model search space. In addition, it builds ensembles from productionized TensorFlow models to reduce churn, reuse domain knowledge, and conform with business and explainability requirements. The framework is capable of handling datasets containing thousands to billions of examples, in a distributed environment.
The Mahalanobis-Taguchi System (MTS) is a diagnosis and forecasting method employing Mahalanobis Distance (MD) and Taguchi’s Robust Engineering in a multidimensional system. In MTS, MD is used to construct a continuous measurement scale that discriminates observations and measures the level of abnormality of observations compared to a group of normal observations. MTS can therefore handle the class imbalance problem. In addition, MTS is unique in its robustness: it assesses variability among all levels of observations (noise) and evaluates which features contribute significantly to the multidimensional system, by means of a simple yet robust technique using Orthogonal Arrays (OA) and Signal-to-Noise Ratios (SNR). However, compared with classic multivariate methods, MTS has a weaker theoretical basis. In order to promote the development and improvement of MTS theory, this paper reviews the literature on developing and improving MTS theory. The survey presents and analyzes the research results in terms of MD, SNR, Mahalanobis Space (MS), feature selection, threshold, multi-class MTS, and comparison with other methods. Finally, a detailed analysis of possible future research directions for developing and improving MTS is proposed.
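As a rough illustration of the measurement scale MTS builds on, the squared Mahalanobis distance of an observation from a group of "normal" observations can be computed as below. The two-feature data and the test points are made up for illustration; this is only the MD ingredient, not the full MTS with OA/SNR feature screening.

```python
# Squared Mahalanobis distance from a reference ("normal") group, two features,
# with the 2x2 covariance matrix inverted by hand. All data are illustrative.
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def mahalanobis_sq(point, f1, f2):
    """Squared Mahalanobis distance of `point` from the group (f1, f2)."""
    s11, s22, s12 = cov(f1, f1), cov(f2, f2), cov(f1, f2)
    det = s11 * s22 - s12 ** 2
    inv11, inv22, inv12 = s22 / det, s11 / det, -s12 / det  # inverse covariance
    d1, d2 = point[0] - mean(f1), point[1] - mean(f2)
    return d1 * d1 * inv11 + 2 * d1 * d2 * inv12 + d2 * d2 * inv22

normal_f1 = [1.0, 2.0, 3.0, 4.0, 5.0]
normal_f2 = [2.1, 3.9, 6.2, 8.0, 9.8]
print(mahalanobis_sq((3.0, 6.0), normal_f1, normal_f2))  # near the centre: small
print(mahalanobis_sq((1.0, 9.0), normal_f1, normal_f2))  # off the pattern: large
```

Because the distance accounts for the correlation between the features, a point can be abnormal even when each coordinate alone is within the normal range, which is what makes MD useful as an abnormality scale.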

### R Packages worth a look

Machine Learning and Judgement (mnj)
Performs FlexBoost in R. FlexBoost is a newly proposed algorithm, based on AdaBoost, that adjusts adaptive loss functions. Other machine learning algorithms (e.g. Support Vector Machines) will also be added. For more details on FlexBoost see Jeon, Y. S., Yang, D. H., & Lim, D. J. (2019) <doi:10.1109/access.2019.2938356>.

Manage optional data for your package. The data can be hosted anywhere, and you have to give a Uniform Resource Locator (URL) for each file. File integrity checks are supported. This is useful for package authors who need to ship more than the 5 megabytes of data currently allowed by the Comprehensive R Archive Network (CRAN).

Utilities for Fundamental Geo-Spatial Data (fgdr)
Reads and parses Fundamental Geo-Spatial Data (FGD) XML files downloaded from the providing site (<https://…/menu.php>), converting the JPGIS format files provided by FGD so that they can be handled as R spatial objects such as ‘sf’, ‘raster’, or ‘stars’. Supports FGD version 4.1, and accepts fundamental items and digital elevation models.

Tree Structured Hidden Markov Model (treeHMM)
Used for inference, prediction, and parameter learning for tree-structured Hidden Markov Models. The package proposes a new architecture of Hidden Markov Model (HMM), known as the Tree Structured HMM, which can be used in various applications involving graphs, trees, etc.

Read and Write ‘Excel’ Sheets into and from List of Data Frames (xlsx2dfs)
Reading and writing sheets of a single ‘Excel’ file into and from a list of data frames. Eases I/O of tabular data in bioinformatics while keeping it in a human-readable format.

### What’s going on on PyPI

Scanning all newly published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here those that might be worth a look, worth following, or that might inspire you in some way.

tempeh
Machine Learning Performance Testing Framework

torch3d
Datasets and network architectures for 3D deep learning in PyTorch

trains-agent
Trains-Agent DevOps for deep learning (DevOps for TRAINS)

tree-lstm
PyTorch tree-LSTM package. This repository contains a PyTorch implementation of ‘Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks’ (https://…/1503.00075). It contains two types of tree-LSTM (Child-Sum and N-ary), was tested with Python 3.6 and PyTorch 1.3.0, and internally uses dgl 0.4.0. This repository referenced https://…/tree_lstm.py

trelawney
Generic interpretability package. Trelawney is a general interpretability package that aims at providing a common API for most modern interpretability methods, to shed light on sklearn-compatible models (support for Keras and XGBoost is tested).

arus
Activity Recognition with Ubiquitous Sensing

cma-es
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) implemented with TensorFlow v2. CMA-ES is an evolutionary algorithm for difficult non-linear, non-convex black-box optimisation problems in continuous domains. It is considered state-of-the-art in evolutionary computation and has been adopted as one of the standard tools for continuous optimisation in many (probably hundreds of) research labs and industrial environments around the world.

mozilla-fldp
Federated Learning Experimentation

musco-pytorch
MUSCO: Multi-Stage COmpression of Neural Networks

waymo-od-tf2-0
Waymo Open Dataset libraries.

### What are Your Use Cases for rOpenSci Tools and Resources?

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

We want to know how you use rOpenSci packages and resources so we can give them, their developers, and your examples more visibility.

It’s valuable to both users and developers of a package to see how it has been used “in the wild”. This goes a long way to encouraging people to keep up development knowing there are others who appreciate and build on their work. This also helps people imagine how they might use a package to address their research question, and provides some code to give them a head-start.

We have a dedicated Use Cases category in our public discussion forum and we encourage you to browse and add to it. As of September 2019, there are > 50 use cases, covering > 30 packages. More than half of these have been added by community members. These complement the > 800 citations of our tools in published works.

### Examples

• visdat [1], skimr [2], assertr [3]: Exploring and understanding a new data set. Sharla Gelfand used these packages to demonstrate how she approached a data set on public transit delays. She shared her code, slide deck, and an explanatory blog post. [Use case].

• nasapower [4], rnaturalearth [5]: Can rainfall be a useful predictor of epidemic risk across temporal and spatial scales? Emerson Del Ponte worked with Adam Sparks to access rainfall data for a specific time period of soybean growth over multiple seasons in two states in Brazil. It’s being used to explore associations between rainfall and patterns of dispersal of a soybean fungal disease. Adam shared the code he used to access the data and create the maps for the presentation Emerson shared [Use case].

• rOpenSci package development guide used in teaching and manuscript review [6]. Tiffany Timbers, her teaching assistants, and students used our guide in teaching and evaluation in their Collaborative Software Development course in the University of British Columbia’s Master of Data Science program [Use case]. Hao Ye used the guide in his review of a manuscript that included an R package [Use case].

Browse the use cases for applications in academia, industry, government, or “just for fun”, with examples on biodiversity, ecology, text processing, bibliometrics, workflows and reproducibility, weather, public health, bicycle networks, agronomy, epidemiology, surveys, seafood mislabelling, tweets about fires, and others!

If your use case is published on the web somewhere, please tell us about it in the forum. Here’s the template we created to guide you.

When you post your use case in the forum we’ll tweet about it from @rOpenSci and we might feature it, with attribution, in our biweekly newsletter.

Developers and users will thank you!

1. Nicholas Tierney (2017). “visdat: Visualising Whole Data Frames.” JOSS, 2(16), 355. doi: 10.21105/joss.00355 (URL: http://doi.org/10.21105/joss.00355).
2. Elin Waring, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu and Shannon Ellis (2019). skimr: Compact and Flexible Summaries of Data. R package version 1.0.7. https://CRAN.R-project.org/package=skimr
3. Tony Fischetti (2019). assertr: Assertive Programming for R Analysis Pipelines. R package version 2.6. https://CRAN.R-project.org/package=assertr
4. Adam Sparks (2018). nasapower: A NASA POWER Global Meteorology, Surface Solar Energy and Climatology Data Client for R. Journal of Open Source Software, 3(30), 1035, https://doi.org/10.21105/joss.01035
5. Andy South (2017). rnaturalearth: World Map Data from Natural Earth. R package version 0.1.0. https://CRAN.R-project.org/package=rnaturalearth
6. rOpenSci, Brooke Anderson, Scott Chamberlain, Anna Krystalli, Lincoln Mullen, Karthik Ram, Noam Ross, Maëlle Salmon, Melina Vidoni. (2019, October 2). ropensci/dev_guide: Third release (Version 0.3.0). Zenodo. https://doi.org/10.5281/zenodo.2553043


### Selection bias, death, and dying


I am collaborating with a number of folks who think a lot about palliative or supportive care for people who are facing end-stage disease, such as advanced dementia, cancer, COPD, or congestive heart failure. A major concern for this population (which really includes just about everyone at some point) is the quality of life at the end of life and what kind of experiences, including interactions with the health care system, they have (and don’t have) before death.

A key challenge for researchers is figuring out how to analyze events that occur just before death. For example, it is not unusual to consider hospitalization in the week or month before death as a poor outcome. Indeed, here is a paper in the Journal of Palliative Medicine that describes an association between homecare nursing and reduced hospitalizations in the week before death. While there is no denying the strength of the association, it is less clear how much of it is causal.

In particular, there is the possibility of selection bias that may result when considering only patients who have died. In this post, I want to describe the concept of selection bias and simulate data that mimics the process of end-stage disease, in order to explore how these issues might play out when we are actually evaluating the causal effect of an exposure or randomized intervention.

### Selection bias

The term “selection bias” is used by different researchers to refer to different concepts (see this article by Haneuse or this one by Hernán et al for really nice discussions of these issues). The terminology doesn’t matter as much as understanding the underlying data generating processes that distinguish the different ideas.

The key issue is to understand what is being selected. In one case, the exposure or intervention is the focus. And in the second case, it is how the patients or subjects are selected into the study more generally that induces the bias. The first selection process is typically referred to by epidemiologists as confounding bias (though it is also called treatment-selection bias), while the second is actually selection bias.

When I’ve written about these issues before (for example, see here), I’ve described how DAGs can be useful to illuminate the potential biases. Below, I have drawn a diagram to represent a simple case of selection bias. Say we are interested in measuring the causal relationship between income and blood pressure in some population in which the two are actually not causally related. If people with higher income are more likely to visit a doctor, and if people with higher blood pressure are also more likely to visit a doctor, the underlying causal relationship might be well represented by the DAG on the left in the figure below.

Let’s say we recruit participants for our study right outside of a medical facility. Choosing this location (as opposed to, say, a shopping mall where the causal model on the left would not be relevant), we are inducing a relationship between income and blood pressure. This can be seen in the DAG on the right, where we have effectively “controlled” for medical facility access in our selection process. The induced statistical relationship can be described in this way: if someone is at the medical center and they have relatively low income, they are more likely to have relatively high blood pressure. Conversely, if someone is there and they have relatively low blood pressure, they are more likely to have relatively high income. Based on this logic, we would expect to see a negative relationship between income and blood pressure in our study sample drawn from patients visiting a medical facility.

To explore by simulation, we can generate a large population of individuals with uncorrelated income and blood pressure. Selection will be a function of both:

```r
library(data.table)

n <- 5000
set.seed(748347)

# income and blood pressure are generated independently: no true association
income <- rnorm(n)
bp <- rnorm(n)

# selection into the study depends on both income and blood pressure
logitSelect <- -2 + 2 * income + 2 * bp
pSelect <- 1 / (1 + exp(-logitSelect))
select <- rbinom(n, 1, pSelect)

dPop <- data.table(income, bp, select)
dSample <- dPop[select == 1]
```

The plot on the left below shows the overall population of 5000; there is no obvious relationship between income and blood pressure. The group that was recruited at the medical facility and enrolled in the study (a subset of the original population) is shown in purple in the plot on the right. In this subset, there does indeed appear to be a relationship between the two characteristics. An estimate of the association based on the sample would be biased (we know the true association is zero); that bias is due to how we selected participants into the study.

### Hospitalization before death

In the next simulation, let’s consider a somewhat more complex process, though with the same underlying structure and similar bias as the simpler case above. The next DAG (below) shows three different time periods. In this case there is an indicator of homecare by a nurse $$N_1$$, $$N_2$$, and $$N_3$$. (In this particular example, an individual has home nursing care in all three periods or they don’t have any home nursing care in any period. This is not a requirement.) In each period, each patient has an underlying time-dependent health status, ranging from $$1$$ (healthiest) to $$4$$ (sickest). In this simulated study, the underlying health status $$U_1$$, $$U_2$$, and $$U_3$$ are considered latent (i.e. unmeasured). The progression of health status is governed by a Markov process that is independent of any kind of treatment. (See here and here for a more detailed description of how this is done using simstudy.)

The probability of hospitalization is solely a function of the underlying health status, and nothing else. (I could make hospitalization a function of palliative care as well, but this simplifies matters; in both cases the estimates will be biased – you can try it for yourself.)

Death is a function of underlying health status and palliative care. While it does not seem to be the case in practice, I am assuming that less aggressive care results in shorter survival times. And the sicker the patient is in a particular period, the greater the risk of dying in that period. (There should be lines between death in the various periods and all subsequent measures, but I have eliminated them for clarity’s sake.)

The code to generate the data starts with the definitions: first, I define an initial health state $$S_0$$ that can range from 1 to 3 and the transition matrix $$P$$ for the Markov process. Next, I define the hospitalization and death outcomes:

```r
library(simstudy)

# initial health state S0 can be 1, 2, or 3
bDef <- defData(varname = "S0", formula = "0.4;0.4;0.2",
                dist = "categorical")

# transition matrix for the Markov process governing health status
P <- t(matrix(c(0.7, 0.2, 0.1, 0.0,
                0.0, 0.4, 0.4, 0.2,
                0.0, 0.0, 0.6, 0.4,
                0.0, 0.0, 0.0, 1.0),
              nrow = 4))

# hospitalization depends only on latent health status u;
# death depends on u and on the home nursing intervention
pDef <- defDataAdd(varname = "hospital", formula = "-2 + u",
                   dist = "binary", link = "logit")
pDef <- defDataAdd(pDef, varname = "death",
                   formula = "-2 + u + homenurse * 1.5",
                   dist = "binary", link = "logit")
```

The data generation process randomizes individuals to home nursing care (or care as usual) in the first period, and creates all of the health status measures and outcomes. The last step removes any data for an individual that was generated after they died. (The function trimData is new and only available in simstudy 0.1.15, which is not yet up on CRAN but can be downloaded from GitHub.)

```r
set.seed(272872)

dd <- genData(10000, bDef)
dd <- trtAssign(dd, grpName = "homenurse")  # randomize to home nursing care

# generate latent health status u over 4 periods via the Markov chain
dp <- addMarkov(dd, transMat = P,
                chainLen = 4, id = "id",
                pername = "seq", start0lab = "S0",
                varname = "u")

# add the hospitalization and death outcomes defined above
dp <- addColumns(pDef, dp)

# remove records generated after an individual's death
dp <- trimData(dp, seqvar = "seq", eventvar = "death")
```

#### A short follow-up period

If we have a relatively short follow-up period in our randomized trial of supportive care at home (homenurse), only a portion of the sample will die; as a result, we can only compare hospitalization before death for a subset of the sample. By selecting on death, we will induce a relationship between the intervention and the outcome where none truly exists. Inspecting the DAGs below, it is apparent that this is a classic case of selection bias. Since we cannot control for the unmeasured health status $$U$$, hospitalization and death are associated. And, since treatment and death are causally related, by selecting on death we are in the same situation as we were in the first example.

```r
d1 <- dp[seq == 1]
```

If we consider only those who died in the first period, we will be including 61% of the sample:

```r
d1[, mean(death)]
## [1] 0.6109
```

To get a sense of the bias, I am considering three models. The first model estimates the effect of the intervention on hospitalization only for those who died in the first period; we expect this estimate to be negatively biased. In the second model, we use the same subset of patients who died, but adjust for underlying health status; the intervention coefficient should be close to zero. Finally, we estimate a model for everyone in period 1, regardless of whether they died. Again, we expect the effect size to be close to 0.

```r
fit1 <- glm(hospital ~ homenurse, data = d1[death == 1],
            family = "binomial")
fit2 <- glm(hospital ~ homenurse + u, data = d1[death == 1],
            family = "binomial")
fit3 <- glm(hospital ~ homenurse, data = d1,
            family = "binomial")

library(stargazer)

stargazer(fit1, fit2, fit3, type = "text",
          column.labels = c("died", "died - adjusted", "all"),
          omit.stat = "all", omit.table.layout = "-asn")
```
```
## =============================================
##                   Dependent variable:
##           -----------------------------------
##                        hospital
##             died    died - adjusted    all
##              (1)          (2)          (3)
##            (0.053)      (0.057)      (0.040)
##
## u                      1.017***
##                         (0.039)
##
## Constant   0.108**     -2.005***    -0.149***
##            (0.042)      (0.092)      (0.028)
##
## =============================================
```

While these are estimates from a single data set (I should really run a more extensive experiment based on many different data sets), the estimates do seem to support our expectations. Indeed, if we cannot measure the underlying health status, the estimate of the intervention effect on hospitalization prior to death is biased; we would conclude that supportive care reduces the probability of hospitalization before death when we know (based on the data generation process used here) that it does not.

#### Extended follow-up

We might think that if we could follow everyone up until death (and hence not select on death), the bias would be eliminated. However, this is not the case. The treatment effect is essentially an average of the effect over all time periods, and we know that for each time period the effect estimate is biased due to selection. Averaging across biased estimates yields a biased estimate.

This issue is closely related to a general issue for causal survival analysis. It has been pointed out that it is not possible to estimate a causal treatment effect using hazard rates, as we do when we use Cox proportional hazards models. This is true even if treatment has been randomized and the two treatment arms are initially balanced with respect to underlying health status. The challenge is that after the first set of deaths, the treatment groups will no longer be balanced with respect to health status; some people survived because of the intervention, others because they were generally healthier. At each point in the survival analysis, the model for risk of death is conditioning on (i.e. selecting on) those who did not die. So, there is built-in selection bias in the modeling. If you are interested in reading more about these issues, I recommend taking a look at these papers by Hernán and by Aalen et al.

Now, back to the simulation. In this case, we analyze everyone who has died within 4 periods, which is about 97% of the initial sample, virtually everyone.

```r
dDied <- dp[death == 1]
nrow(dDied) / nrow(d1)
## [1] 0.9658
```

The effect estimate based on this data set is only unbiased when we are able to control for underlying health status. Otherwise, extending follow-up does not help remove any bias.

```r
fit4 <- glm(hospital ~ homenurse, data = dDied, family = "binomial")
fit5 <- glm(hospital ~ homenurse + u, data = dDied, family = "binomial")

stargazer(fit4, fit5, type = "text",
          omit.stat = "all", omit.table.layout = "-asn")
```
```
## ==============================
##           Dependent variable:
##           --------------------
##                 hospital
##              (1)        (2)
##            (0.041)    (0.045)
##
## u                    1.020***
##                       (0.028)
##
## Constant   0.296***  -2.028***
##            (0.030)    (0.070)
##
## ==============================
```

In the future, I hope to explore alternative ways to analyze these types of questions. In the case of survival analysis, models that do not condition on death have been proposed to get at causal estimates. This may not be possible when the outcome of interest (health care before death) is defined by conditioning on death. We may actually need to frame the question slightly differently to be able to get an unbiased estimate.

References:

Seow, H., Sutradhar, R., McGrail, K., Fassbender, K., Pataky, R., Lawson, B., Sussman, J., Burge, F. and Barbera, L., 2016. End-of-life cancer care: temporal association between homecare nursing and hospitalizations. Journal of palliative medicine, 19(3), pp.263-270.

Haneuse, S., 2016. Distinguishing selection bias and confounding bias in comparative effectiveness research. Medical care, 54(4), p.e23.

Hernán, M.A., Hernández-Díaz, S. and Robins, J.M., 2004. A structural approach to selection bias. Epidemiology, 15(5), pp.615-625.

Hernán, M.A., 2010. The hazards of hazard ratios. Epidemiology, 21(1), p.13.

Aalen, O.O., Cook, R.J. and Røysland, K., 2015. Does Cox analysis of a randomized survival study yield a causal treatment effect? Lifetime Data Analysis, 21(4), pp.579-593.


## October 14, 2019

### Rachel Tanur Memorial Prize for Visual Sociology

Judith Tanur writes:

The Rachel Tanur Memorial Prize for Visual Sociology recognizes students in the social sciences who incorporate visual analysis in their work. The contest is open worldwide to undergraduate and graduate students (majoring in any social science). It is named for Rachel Dorothy Tanur (1958–2002), an urban planner and lawyer who cared deeply about people and their lives and was an acute observer of living conditions and human relationships.

The 2020 Rachel Tanur Memorial Prize is now open for applications. Entries for the 2020 competition must be received by January 22, 2020. Winners will be notified by March 30, 2020. Up to three cash prizes will be awarded at the IV International Sociological Association (ISA) Forum of Sociology, “Challenges of the 21st Century: Democracy, Environment, Inequalities, Intersectionality,” to be held in Porto Alegre, Brazil on July 14-18, 2020. Attendance at the forum is not a requirement but is encouraged. Prizes, supported by the Mark Family Foundation, will be awarded by the Research Committee on Visual Sociology of the ISA. The first prize will be $2,500, the second $1,500, and the third $500. The prize is awarded biennially. For more information and to apply, please go to racheltanurmemorialprize.org.

### Document worth reading: “An Introduction to Fuzzy and Annotated Semantic Web Languages”

We present the state of the art in representing and reasoning with fuzzy knowledge in Semantic Web Languages such as triple languages RDF/RDFS, conceptual languages of the OWL 2 family and rule languages. We further show how one may generalise them to so-called annotation domains, that cover also e.g. temporal and provenance extensions. An Introduction to Fuzzy and Annotated Semantic Web Languages

### Using the Color Thresholder app again

Back in the summer I had another chance to use the Color Thresholder, a very nice app that’s in the Image Processing Toolbox. I happened to come across a question on MATLAB Answers: someone was looking for a way to segment the yellow region in an image. I proposed using the Color Thresholder and the L*a*b* color space, and I showed a screen shot to demonstrate how to apply it to this problem. I first wrote about the Color Thresholder back in 2014, shortly after the app first appeared (“Color Thresholder App in R2014a”).
In that post, I played around with a picture of candy lying on my desk. Besides figuring out how to segment all the different candy colors, I also discovered how to characterize the color of my desk. The Color Thresholder is a very useful tool. Give it a look!

Published with MATLAB® R2019b

### Announcing the winners of the 2019 Testing and Verification research awards

In May, we launched the second Testing and Verification request for proposals (TAV RFP) in response to the high quality of submissions of last year’s TAV RFP. The winners of the 2018 TAV RFP were invited to a one-day workshop at the 2018 Facebook TAV Symposium to present their winning proposals and network with symposium attendees. Following the success of last year’s model, the 2019 TAV RFP winners are also invited to this year’s TAV Symposium, which will take place in Facebook London on November 20 and 21. For those interested in attending, registration is now open.

For this year’s TAV RFP, we were interested in proposals tackling any topic in testing and verification with the potential to have a profound impact on the tech sector, based on advances in the theory and practice of testing and verification. In particular, we welcomed proposals that tackled test flakiness and pay-as-you-go verification. For more details on these topics, as well as proposal requirements, eligibility, and timing, visit the 2019 TAV RFP page. The winners have been selected and are listed below.

“We are excited that, once again, we received over 100 proposals, and of such high quality,” says Mark Harman, Research Scientist and Facebook TAV co-chair. “We are really excited to see this research develop, and to see the deployment of advanced research in industry.”

“Despite doubling the funding available this year, there were many excellent proposals that we were sadly unable to fund this round. Such was the quality of the proposals we received,” says Mark. “We are very grateful to the scientific research community for engaging so enthusiastically with this call for proposals.”

#### Research award winners

Principal investigators are listed first unless otherwise noted.

• A scalable infrastructure for fuzzy-driven root causing of flaky tests: Filomena Ferrucci (University of Salerno), Pasquale Salza (University of Zurich), and Valerio Terragni (Università della Svizzera italiana)
• Action-based test carving: Gordon Fraser (University of Passau) and Alessio Gambi (University of Passau)
• Cyclic proofs of program incorrectness: James Brotherston (University College London)
• Detecting flaky test failures of system user interactive tests: Mike Papadakis (University of Luxembourg), Renaud Rwemalika (University of Luxembourg), and Yves Le Traon (University of Luxembourg)
• Moka: Improving app testing with automated mocking: Mattia Fazzini (University of Minnesota), Alessandra Gorla (IMDEA Software Institute), and Alessandro Orso (Georgia Tech)
• Null pointer dereferences in the wild: Cindy Rubio Gonzalez (University of California, Davis)
• Probabilistic testing in the presence of flaky tests: Weihang Wang (State University of New York — University at Buffalo)
• Search-based inducement and repair of latent test flakiness: Philip McMinn (University of Sheffield)
• Static prediction of test flakiness: Breno Alexandro Ferreira de Miranda (Federal University of Pernambuco, Brazil), Antonia Bertolino (Istituto di Scienza e Tecnologie dell’Informazione – CNR), Emilio Cruciani (Gran Sasso Science Institute), and Roberto Verdecchia (Gran Sasso Science Institute)
• Test oracle inference — supervised learning over execution traces: Ajitha Rajan (University of Edinburgh)

To view our currently open research awards and to subscribe to our email list, visit our Research Awards page.

The post Announcing the winners of the 2019 Testing and Verification research awards appeared first on Facebook Research.
Continue Reading… ### Finding out why It is often of interest in observational studies to measure the causal effect of a treatment on time-to-event outcomes. In a medical setting, observational studies commonly involve patients who initiate medication therapy and others who do not, and the goal is to infer the effect of medication therapy on time until recovery, a pre-defined level of improvement, or some other time-to-event outcome. A difficulty with such studies is that the notion of a medication initiation time does not exist in the control group. We propose an approach to infer causal effects of an intervention in longitudinal observational studies when the time of treatment assignment is only observed for treated units and where treatment is given by indication. We present a framework for conceptualizing an underlying randomized experiment in this setting based on separating the process that governs the time of study arm assignment from the mechanism that determines the assignment. Our approach involves inferring the missing times of assignment followed by estimating treatment effects. This approach allows us to incorporate uncertainty about the missing times of study arm assignment, which induces uncertainty in both the selection of the control group and the measurement of time-to-event outcomes for these controls. We demonstrate our approach to study the effects on mortality of inappropriately prescribing phosphodiesterase type 5 inhibitors (PDE5Is), a medication contraindicated for groups 2 and 3 pulmonary hypertension, using administrative data from the Veterans Affairs (VA) health care system. We develop a causal inference approach to estimate the number of adverse health events prevented by large-scale air quality regulations via changes in exposure to multiple pollutants. This approach is motivated by regulations that impact pollution levels in all areas within their purview. 
We introduce a causal estimand called the Total Events Avoided (TEA) by the regulation, defined as the difference in the expected number of health events under the no-regulation pollution exposures and the observed number of health events under the with-regulation pollution exposures. We propose a matching method and a machine learning method that leverage high-resolution, population-level pollution and health data to estimate the TEA. Our approach improves upon traditional methods for regulation health impact analyses by clarifying the causal identifying assumptions, utilizing population-level data, minimizing parametric assumptions, and considering the impacts of multiple pollutants simultaneously. To reduce model-dependence, the TEA estimate captures health impacts only for units in the data whose anticipated no-regulation features are within the support of the observed with-regulation data, thereby providing a conservative but data-driven assessment to complement traditional parametric approaches. We apply these methods to investigate the health impacts of the 1990 Clean Air Act Amendments in the US Medicare population.

Tracking by Deblatting stands for solving an inverse problem of deblurring and image matting for tracking motion-blurred objects. We propose non-causal Tracking by Deblatting which estimates continuous, complete and accurate object trajectories. Energy minimization by dynamic programming is used to detect abrupt changes of motion, called bounces. High-order polynomials are fitted to segments, which are parts of the trajectory separated by bounces. The output is a continuous trajectory function which assigns location for every real-valued time stamp from zero to the number of frames. Additionally, we show that from the trajectory function precise physical calculations are possible, such as radius, gravity or sub-frame object velocity. Velocity estimation is compared to the high-speed camera measurements and radars.
Results show high performance of the proposed method in terms of Trajectory-IoU, recall and velocity estimation.

In this paper, we focus on the separability of classes with the cross-entropy loss function for classification problems by theoretically analyzing the intra-class distance and inter-class distance (i.e. the distance between any two points belonging to the same class and different classes, respectively) in the feature space, i.e. the space of representations learnt by neural networks. Specifically, we consider an arbitrary network architecture having a fully connected final layer with Softmax activation and trained using the cross-entropy loss. We derive expressions for the value and the distribution of the squared L2 norm of the product of a network dependent matrix and a random intra-class and inter-class distance vector (i.e. the vector between any two points belonging to the same class and different classes), respectively, in the learnt feature space (or the transformation of the original data) just before Softmax activation, as a function of the cross-entropy loss value. The main result of our analysis is the derivation of a lower bound for the probability with which the inter-class distance is more than the intra-class distance in this feature space, as a function of the loss value. We do so by leveraging some empirical statistical observations with mild assumptions and sound theoretical analysis. As per intuition, the probability with which the inter-class distance is more than the intra-class distance decreases as the loss value increases, i.e. the classes are better separated when the loss value is low. To the best of our knowledge, this is the first work of theoretical nature trying to explain the separability of classes in the feature space learnt by neural networks trained with the cross-entropy loss function.

How to answer the centuries-old question: which one came first, the chicken or the egg?
Is it the chicken that came out first then hatched the egg? Or, the other way around? This is a particularly challenging question, as it relates to the origin and the direction of causal inference in social science.

Models of coupled oscillators are used to describe a wide variety of phenomena in neuroimaging. These models typically rest on the premise that oscillator dynamics do not evolve beyond their respective limit cycles, and hence that interactions can be described purely in terms of phase differences. Whilst mathematically convenient, the restrictive nature of phase-only models can limit their explanatory power. We therefore propose a generalisation of dynamic causal modelling that incorporates both phase and amplitude. This allows for the separate quantifications of phase and amplitude contributions to the connectivity between neural regions. We establish, using model-generated data and simulations of coupled pendula, that phase-only models perform well only under weak coupling conditions. We also show that, despite their higher complexity, phase-amplitude models can describe strongly coupled systems more effectively than their phase-only counterparts. We relate our findings to four metrics commonly used in neuroimaging: the Kuramoto order parameter, cross-correlation, phase-lag index, and spectral entropy. We find that, with the exception of spectral entropy, the phase-amplitude model is able to capture all metrics more effectively than the phase-only model. We then demonstrate, using local field potential recordings in rodents and functional magnetic resonance imaging in macaque monkeys, that amplitudes in oscillator models play an important role in describing neural dynamics in anaesthetised brain states.

Paper: Quantifying Causation

I introduce an information-theoretic measure of causation, capturing how much a quantum system influences the evolution of another system.
The measure discriminates among different causal relations that generate same-looking data, with no information about the quantum channel. In particular, it determines whether correlation implies causation, and when causation manifests without correlation. In the classical scenario, the quantity evaluates the strength of causal links between random variables. Also, the measure is generalized to identify and rank concurrent sources of causal influence in many-body dynamics, enabling the reconstruction of causal patterns in complex networks.

In this paper we present web-liver, a rule-based system for decision support in the medical domain, focusing on its application in a liver transplantation unit for implementing policies for donor-patient matching. The rule-based system is built on top of an interpreter for logic programs with partial functions, called lppf, that extends the paradigm of Answer Set Programming (ASP) adding two main features: (1) the inclusion of partial functions and (2) the computation of causal explanations for the obtained solutions. The final goal of web-liver is assisting the medical experts in the design of new donor-patient matching policies that take into account not only the patient severity but also the transplantation utility. As an example, we illustrate the tool behaviour with a set of rules that implement the utility index called SOFT.

Continue Reading…

### If you did not already know

Local Orthogonal Decomposition
Inverted file and asymmetric distance computation (IVFADC) have been successfully applied to approximate nearest neighbor search and subsequently maximum inner product search. In such a framework, vector quantization is used for coarse partitioning while product quantization is used for quantizing residuals. In the original IVFADC as well as all of its variants, after residuals are computed, the second product quantization step is completely independent of the first vector quantization step.
In this work, we seek to exploit the connection between these two steps when we perform non-exhaustive search. More specifically, we decompose a residual vector locally into two orthogonal components and perform uniform quantization and multiscale quantization to each component respectively. The proposed method, called local orthogonal decomposition, combined with multiscale quantization consistently achieves higher recall than previous methods under the same bitrates. We conduct comprehensive experiments on large scale datasets as well as detailed ablation tests, demonstrating the effectiveness of our method. …

Quantitative Discourse Analysis
Quantitative Discourse Analysis is basically looking at patterns in language. …

Fast and Accurate Timing Error Prediction Framework (FATE)
Deep neural networks (DNN) are increasingly being accelerated on application-specific hardware such as the Google TPU designed especially for deep learning. Timing speculation is a promising approach to further increase the energy efficiency of DNN accelerators. Architectural exploration for timing speculation requires detailed gate-level timing simulations that can be time-consuming for large DNNs that execute millions of multiply-and-accumulate (MAC) operations. In this paper we propose FATE, a new methodology for fast and accurate timing simulations of DNN accelerators like the Google TPU. FATE proposes two novel ideas: (i) DelayNet, a DNN based timing model for MAC units; and (ii) a statistical sampling methodology that reduces the number of MAC operations for which timing simulations are performed. We show that FATE results in an 8× to 58× speed-up in timing simulations, while introducing less than 2% error in classification accuracy estimates. We demonstrate the use of FATE by comparing a conventional DNN accelerator that uses 2’s complement (2C) arithmetic with an alternative implementation that uses signed magnitude representations (SMR).
We show that the SMR implementation provides 18% more energy savings for the same classification accuracy than 2C, a result that might be of independent interest. …

MetaForest
A requirement of classic meta-analysis is that the studies being aggregated are conceptually similar, and ideally, close replications. However, in many fields, there is substantial heterogeneity between studies on the same topic. Similar research questions are studied in different laboratories, using different methods, instruments, and samples. Classic meta-analysis lacks the power to assess more than a handful of univariate moderators, or to investigate interactions between moderators, and non-linear effects. MetaForest, by contrast, has substantial power to explore heterogeneity in meta-analysis. It can identify important moderators from a larger set of potential candidates, even with as little as 20 studies (Van Lissa, in preparation). This is an appealing quality, because many meta-analyses have small sample sizes. Moreover, MetaForest yields a measure of variable importance which can be used to identify important moderators, and offers partial prediction plots to explore the shape of the marginal relationship between moderators and effect size. …

Continue Reading…

### Training, validation and testing for supervised machine learning models

Validating and testing our supervised machine learning models is essential to ensuring that they generalize well. SAS Viya makes it easy to train, validate, and test our machine learning models.

## Training, validation and test data sets

Training data are used to fit each model. Training a model involves using an algorithm to determine model parameters (e.g., weights) or other logic to map inputs (independent variables) to a target (dependent variable). Model fitting can also include input variable (feature) selection. Models are trained by minimizing an error function.
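Minimizing an error function can be made concrete with the simplest possible case. The sketch below (plain Python with made-up data, not SAS Viya code) fits a one-input least squares line using the closed-form estimates that minimize the sum of squared errors:

```python
# One-variable least squares: choose b0, b1 to minimize
# sum((y - (b0 + b1*x))**2). Closed-form solution:
#   b1 = cov(x, y) / var(x),  b0 = mean(y) - b1 * mean(x)

def ols_fit(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical, perfectly linear data: y = 5 - x
x = [0, 1, 2, 3, 4]
y = [5, 4, 3, 2, 1]
b0, b1 = ols_fit(x, y)
print(b0, b1)  # → 5.0 -1.0
```

With noisy real data the fit is no longer exact, but the same two formulas still give the error-minimizing parameters.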
For illustration purposes, let’s say we have a very simple ordinary least squares regression model with one input (independent variable, x) and one output (dependent variable, y). Perhaps our input variable is how many hours of training a dog or cat has received, and the output variable is the combined total of how many fingers or limbs we will lose in a single encounter with the animal. In ordinary least squares regression, the parameters are estimated so as to minimize the sum of the squared errors between the observed data and the model predictions. This is illustrated below, where the predicted y = β0 + β1x. The y variable is on the vertical axis and the x variable is on the horizontal axis.

Validation data are used with each model developed in training, and the prediction errors are calculated. Depending on the model and the software, these prediction errors can be used to decide:

• When to terminate the selection process
• What effects (e.g., inputs, interactions, etc.) to include as the selection process proceeds, and/or
• Which model to select

The validation errors calculated vary from model to model and may be things such as the average squared error (ASE), mean squared error (MSE), error sum of squares (SSE), the negative log-likelihood, etc. The validation ASE is often used in VDMML.

Note: “Average Squared Error and Mean Squared Error might appear similar. But they describe two completely different measures, where each is appropriate only for specific models. In linear models, statisticians routinely use the mean squared error (MSE) as the main measure of fit. The MSE is the sum of squared errors (SSE) divided by the degrees of freedom for error. (DFE is the number of cases [observations] less the number of weights in the model.) This process yields an unbiased estimate of the population noise variance under the usual assumptions. For neural networks and decision trees, there is no known unbiased estimator.
Furthermore, the DFE is often negative for neural networks. There exist approximations for the effective degrees of freedom, but these are often prohibitively expensive and are based on assumptions that might not hold. Hence, the MSE is not nearly as useful for neural networks as it is for linear models. One common solution is to divide the SSE by the number of cases N, not the DFE. This quantity, SSE/N, is referred to as the average squared error (ASE). The MSE is not a useful estimator of the generalization error, even in linear models, unless the number of cases is much larger than the number of weights.” – From Enterprise Miner documentation.

The validation data may be used several times to build the final model.

Test data is a hold-out sample that is used to assess the final selected model and estimate its prediction error. Test data are not used until after the model building and selection process is complete. Test data tell you how well your model will generalize, i.e., how well your model performs on new data. By new data I mean data that have not been involved in the model building or the model selection process in any way. The test data should never be used for fitting the model, for deciding what effects to include, or for selecting from among candidate models.

In addition, be careful of any leakage of information from the test data set into the other data sets. For example, if you take a mean of all of the data to impute missing values, do that separately for each of the three data sets (training, validation, and test). Otherwise, you will leak information from one data set to another.

## Partitioning the Data

Simple random sampling: The observations selected for the subset of the data are randomly selected, i.e., each observation has an equal probability of being chosen.

Stratified sampling: The data are divided into segments or “strata,” and then observations are randomly selected from each stratum.
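A stratified split of this kind can also be sketched outside of SAS. Below is a minimal Python example (standard library only, with hypothetical lion/tiger records) that draws the same fraction from each stratum:

```python
import random

def stratified_sample(rows, stratum_key, fraction, seed=42):
    """Randomly sample `fraction` of the rows within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(stratum_key(row), []).append(row)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80 lions and 20 tigers
population = [{"species": "lion", "id": i} for i in range(80)] + \
             [{"species": "tiger", "id": i} for i in range(20)]

# 10% of each stratum: 8 lions and 2 tigers
sample = stratified_sample(population, lambda r: r["species"], 0.10)
print(len(sample))  # → 10
```

Because sampling happens within each stratum, the species proportions of the sample match those of the population by construction.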
For example, for large cat training, you might want to stratify your sample by lions and tigers to ensure that your training, validation, and test data sets all include the proportion of lions and tigers that exist in the total population. As shown below, in this case we take a random sample of 10% of the tigers and a random sample of 10% of the lions.

Oversampling: Oversampling is commonly used when you are looking at rare events, such as cancer cases. You may have a data set with 10,000 people and only 100 cancer cases. You may wish to oversample the cancer cases to get, for example, a 90:10 ratio of non-cancer to cancer cases. You could use 100% of the cancer cases (all 100) and a 9% random sample of the non-cancer cases (900).

## How partitioning is accomplished

### In VA|VS|VDMML Visual Interface

Partitioning became easier in the VA|VS|VDMML visual interface in 17w47 (December 2017) with VA|VS|VDMML 8.2 on Viya 3.3. To partition data, click the icon to the right of the data set name, and select Add partition data item, as shown below. You may select either simple random sampling or stratified sampling and type the training percentage. You may also select a random number seed if you wish.

You may choose:
• two partitions (training and validation) or
• three partitions (training, validation, and test)

They will be assigned numbers as follows:
• 0 = Validation
• 1 = Training
• 2 = Test

Note to anyone dealing with old versions of the software: In the earlier (17w12, March 2017) release of VA/VS/VDMML 8.1 on Viya 3.2 you needed to have a binary partition indicator variable in your data set with your data already partitioned if you wished to partition the data into Training and Validation data sets. For example, the partition indicator variable could be called “Partition” and could have two values, “Training” and “Validation.” Or it could be called Training and the two values could be 0 and 1.
It doesn’t matter what the name of the variable was, but it had to have only two values. As a data preparation step you had to set the partition indicator variable, indicating which value indicated Training observations and which value indicated Validation observations. In 8.1, if you did not have a binary partition variable already in your data set, you could not partition automatically on the fly.

There is no rule of thumb for deciding how much of your data set to apportion into training, validation, and test data. This should be decided by a data scientist or statistician familiar with the data. A common split is 50% for training, 25% for validation, and 25% for testing.

Cross-validation is another method, but it is not done automatically with the visual interface. For more information on cross-validation methods see Funda Gunes’s blog on cross-validation in Enterprise Miner or the documentation on the SAS Viya CROSSVALIDATION statement.

### Using SAS Viya procedures

PROC PARTITION lets you divide data into training, validation, and test. In addition, most VS and VDMML supervised learning models have a PARTITION statement. (PROC FACTMAC is an exception.) Code for a PARTITION statement within a PROC FOREST could look something like the code below. This code would partition the full data set into 50% training data, 25% validation data, and 25% test data.

ALERT: If you have the same number of compute nodes, specifying the SEED= option in the PARTITION statement lets you repeatedly create the same partition data tables. However, changing the number of compute nodes changes the initial distribution of data, resulting in different partition data tables.

You can exercise more control over the partitioning of the input data table if you have created a partition variable in your data set. Then you must designate:

1. The variable in the input data table as the partition variable (e.g., “role,” in the example below)
2.
A formatted value of that variable for each role (e.g., ‘TRAIN’, ‘VAL’, and ‘TEST’ as shown below)

Example code:

As I mentioned earlier, you can commonly use your validation data set to determine which model to select, or even when to stop the model selection process. For example, this is how you would do that within the PROC REGSELECT, SELECTION statement, METHOD= option:

• Stopping the Selection Process. Specify STOP=VALIDATE. “At step k of the selection process, the best candidate effect to enter or leave the current model is determined. Here, ‘best candidate’ means the effect that gives the best value of the SELECT= criterion; this criterion does not need to be based on the validation data. The Validation ASE for the model with this candidate effect added or removed is computed. If this Validation ASE is greater than the Validation ASE for the model at step k, then the selection process terminates at step k.”
• Choosing the Model. Specify CHOOSE=VALIDATE. Then “the Validation ASE will be computed for your models at each step of the selection process. The smallest model at any step that yields the smallest Validation ASE will be selected.” – Quoted from PROC REGSELECT documentation.

## Summary

SAS Viya makes it easy to train, validate, and test our machine learning models. This is essential to ensuring that our models not only fit our existing data, but that they will perform well on new data. Now you know everything you need to know about training cats and dogs. Go ahead and try this at home. If you are as smart as the dolphins pictured below, you might even be able to train a human!

## References and more information

The post Training, validation and testing for supervised machine learning models appeared first on The SAS Data Science Blog.
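The CHOOSE=VALIDATE idea quoted above, picking the step whose model has the smallest validation ASE, can be sketched generically. The Python snippet below (hypothetical error values, not SAS code) illustrates the selection rule:

```python
def choose_by_validation(steps):
    """Given (step, train_ase, valid_ase) tuples from a selection
    process, return the step whose model has the smallest
    validation ASE."""
    return min(steps, key=lambda s: s[2])[0]

# Hypothetical selection path: training ASE keeps falling, but
# validation ASE bottoms out at step 3 and then rises (overfitting).
steps = [
    (1, 0.90, 0.95),
    (2, 0.60, 0.70),
    (3, 0.40, 0.55),
    (4, 0.25, 0.62),
    (5, 0.15, 0.75),
]
print(choose_by_validation(steps))  # → 3
```

Note that the training error alone would have favored step 5; using the validation error is what protects against picking an overfit model.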
Continue Reading…

### Magister Dixit

“If you’re not doing some things that are crazy, then you’re doing the wrong things.” Larry Page (2013)

Continue Reading…

### Three Things to Know About Reinforcement Learning

As an engineer, scientist, or researcher, you may want to take advantage of this new and growing technology, but where do you start? The best place to begin is to understand what the concept is, how to implement it, and whether it’s the right approach for a given problem.

Continue Reading…

### Using SQL Server Performance Objects

SQL Server performance objects are found inside the Performance Monitor tool, also known as perfmon. If you are using Performance Monitor to gather resource metrics for SQL Server, then you are familiar with a screen such as this one: You can see I have navigated to the SQL Server Plan Cache counter, selected Cache Hit Ratio, and an instance of Extended Stored Procedures. You will also note in the lower left I have enabled “Show description”. This results in the text at the bottom, “Ratio between cache hits and lookups”. That text refers to the counter itself, and not to the instance. If I toggle to another instance, such as SQL plans, the text doesn’t change. I have an idea what SQL plans means, but I’m also smart enough to know I don’t know everything. So, where would one find information about the instances? Those details can be found here: Use SQL Server Objects. From there we can go to the SQL Server, Plan Cache Object page, where we will find the following details: That’s a lot more detail than I was expecting! Now I know exactly what this counter will consider to be a plan. These details provide more context for the metric, helping users understand what they are measuring. Having SQL Server performance objects documented is important, and you should review them. Otherwise you run the risk of collecting the wrong metrics.

The post Using SQL Server Performance Objects appeared first on Thomas LaRock.
Continue Reading…

### Choosing a Machine Learning Model

Selecting the perfect machine learning model is part art and part science. Learn how to review multiple models and pick the best in both competitive and real-world applications.

Continue Reading…

### Automatic data types checking in predictive models

[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers].

The problem: We have data, and we need to create models (xgboost, random forest, regression, etc.). Each one of them has its constraints regarding data types. Many strange errors appear when we are creating models just because of data format. The new version of funModeling 1.9.3 (Oct 2019) aims to provide quick and clean assistance on this.

Cover photo by: @franjacquier_

## tl;dr; code

Based on some messy data, we want to run a random forest, so before getting some weird errors, we can check…

Example 1:

```r
# install.packages("funModeling")
library(funModeling)
library(tidyverse)

# Load data
data=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')

# Call the function:
integ_mod_1=data_integrity_model(data = data, model_name = "randomForest")

# Any errors?
integ_mod_1

## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant
```

Regardless of the "one unique value" warning, the other errors need to be solved in order to create a random forest. Algorithms have their own data type restrictions and their own error messages, which makes execution a hard debugging task. data_integrity_model will alert you about such errors with a common error message.

## Introduction

data_integrity_model is built on top of the data_integrity function.
We talked about it in the post: Fast data exploration for predictive modeling. It checks:

• NA
• Data types (allow non-numeric? allow character?)
• High cardinality
• One unique value

## Supported models

It takes the metadata from a table that is pre-loaded with funModeling:

```r
head(metadata_models)

## # A tibble: 6 x 6
##   name         allow_NA max_unique allow_factor allow_character only_numeric
## 1 randomForest FALSE            53 TRUE         FALSE           FALSE
## 2 xgboost      TRUE           Inf  FALSE        FALSE           TRUE
## 3 num_no_na    FALSE          Inf  FALSE        FALSE           TRUE
## 4 no_na        FALSE          Inf  TRUE         TRUE            TRUE
## 5 kmeans       FALSE          Inf  TRUE         TRUE            TRUE
## 6 hclust       FALSE          Inf  TRUE         TRUE            TRUE
```

The idea is that anyone can add the most popular models or some configuration that is not there. There are some redundancies, but the purpose is to focus on the model, not the needed metadata. This way we don’t have to think “no NA in random forest”; we just write randomForest.

Some custom configurations:

• no_na: no NA variables.
• num_no_na: numeric with no NA (for example, useful when doing deep learning).

## Embed in a data flow in production

Many people ask typical questions when interviewing candidates. I like these ones: "How do you deal with new data?" or "What considerations do you have when you do a deploy?"

Based on our first example:

```r
integ_mod_1

## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant
```

We can check:

```r
integ_mod_1$data_ok
```

```r
## [1] FALSE
```

data_ok is a flag useful for stopping a process (by raising an error) if anything goes wrong.
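The kinds of checks data_integrity_model performs can be sketched in a generic way. Below is a minimal Python analogue (hypothetical code, not the funModeling implementation) that flags NAs, character columns, and one-unique-value columns:

```python
def integrity_report(columns, allow_na=False, allow_character=False):
    """columns: dict mapping column name -> list of values (None = NA).
    Returns a dict mapping each problem to the offending columns."""
    report = {"na_detected": [], "character_detected": [], "one_unique_value": []}
    for name, values in columns.items():
        non_na = [v for v in values if v is not None]
        if not allow_na and len(non_na) < len(values):
            report["na_detected"].append(name)
        if not allow_character and any(isinstance(v, str) for v in non_na):
            report["character_detected"].append(name)
        if len(set(non_na)) == 1:
            report["one_unique_value"].append(name)
    return report

# Hypothetical data, loosely mirroring the messy_data example above
data = {
    "gender": ["F", None, "M"],   # has an NA and is character
    "age": [63, 67, 37],          # clean numeric column
    "constant": [1, 1, 1],        # one unique value
}
report = integrity_report(data)
data_ok = all(len(cols) == 0 for cols in report.values())
print(data_ok)  # → False
```

As in funModeling, the single boolean flag makes the check easy to wire into a production pipeline: raise an error when it is False, otherwise continue.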

## More examples

Example 2:

On the mtcars data frame, check if there is any variable with NA:

```r
di2=data_integrity_model(data = mtcars, model_name = "no_na")

# Check:
di2

## ✔ Data model integrity ok!
```


Good to go?

```r
di2$data_ok

## [1] TRUE
```

Example 3:

```r
data_integrity_model(data = heart_disease, model_name = "pca")

## ✖ {NA detected} num_vessels_flour, thal
## ✖ {Non-numeric detected} gender, chest_pain, fasting_blood_sugar, resting_electro, thal, exter_angina, has_heart_disease
```

Example 4:

```r
data_integrity_model(data = iris, model_name = "kmeans")

## ✖ {Non-numeric detected} Species
```

## Any suggestions?

If you come across any cases which aren’t covered here, you are welcome to contribute: funModeling’s github. How about time series? I took them as numeric with no NA (model_name = "num_no_na"). You can add any new model by updating the table metadata_models. And that’s it. In case you want to understand more about data types and quality, you can check the Data Science Live Book. Have data fun! You can find me at: Linkedin & Twitter.

To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog.

Continue Reading…

### Using Neural Networks to Design Neural Networks: The Definitive Guide to Understand Neural Architecture Search

A recent survey outlined the main neural architecture search methods used to automate the design of deep learning systems.

Continue Reading…

### Why is my validation loss lower than my training loss?

In this tutorial, you will learn the three primary reasons your validation loss may be lower than your training loss when training your own custom deep neural networks.

I first became interested in studying machine learning and neural networks in late high school. Back then there weren’t many accessible machine learning libraries — and there certainly was no scikit-learn.
Every school day at 2:35 PM I would leave high school, hop on the bus home, and within 15 minutes I would be in front of my laptop, studying machine learning, and attempting to implement various algorithms by hand. I rarely stopped for a break, more than occasionally skipping dinner just so I could keep working and studying late into the night. During these late-night sessions I would hand-implement models and optimization algorithms (and in Java of all languages; I was learning Java at the time as well). And since they were hand-implemented ML algorithms by a budding high school programmer with only a single calculus course under his belt, my implementations were undoubtedly prone to bugs.

I remember one night in particular. The time was 1:30 AM. I was tired. I was hungry (since I skipped dinner). And I was anxious about my Spanish test the next day, which I most certainly had not studied for. I was attempting to train a simple feedforward neural network to classify image contents based on basic color channel statistics (i.e., mean and standard deviation). My network was training… but I was running into a very strange phenomenon: My validation loss was lower than my training loss! How could that possibly be?

• Did I accidentally switch the plot labels for training and validation loss? Potentially. I didn’t have a plotting library like matplotlib, so my loss logs were being piped to a CSV file and then plotted in Excel. Definitely prone to human error.
• Was there a bug in my code? Almost certainly. I was teaching myself Java and machine learning at the same time — there were definitely bugs of some sort in that code.
• Was I just so tired that my brain couldn’t comprehend it? Also very likely. I wasn’t sleeping much during that time of my life and could have very easily missed something obvious.

But, as it turns out, it was none of the above cases — my validation loss was legitimately lower than my training loss.
It took me until my junior year of college, when I took my first formal machine learning course, to finally understand why validation loss can be lower than training loss. And a few months ago, brilliant author Aurélien Geron posted a tweet thread that concisely explains why you may encounter validation loss being lower than training loss.

I was inspired by Aurélien’s excellent explanation and wanted to share it here with my own commentary and code, ensuring that no students (like me many years ago) have to scratch their heads and wonder “Why is my validation loss lower than my training loss?!”.

To learn the three primary reasons your validation loss may be lower than your training loss, just keep reading! Looking for the source code to this post? Jump right to the downloads section.

## Why is my validation loss lower than my training loss?

In the first part of this tutorial, we’ll discuss the concept of “loss” in a neural network, including what loss represents and why we measure it. From there we’ll implement a basic CNN and training script, followed by running a few experiments using our freshly implemented CNN (which will result in our validation loss being lower than our training loss). Given our results, I’ll then explain the three primary reasons your validation loss may be lower than your training loss.

### What is “loss” when training a neural network?

Figure 1: What is the “loss” in the context of machine/deep learning? And why is my validation loss lower than my training loss? (image source)

At the most basic level, a loss function quantifies how “good” or “bad” a given predictor is at classifying the input data points in a dataset. The smaller the loss, the better a job the classifier is at modeling the relationship between the input data and the output targets. That said, there is a point where we can overfit our model — by modeling the training data too closely, our model loses the ability to generalize. We, therefore, seek to: 1.
Drive our loss down, thereby improving our model accuracy. 2. Do so as fast as possible and with as little hyperparameter updates/experiments. 3. All without overfitting our network and modeling the training data too closely. It’s a balancing act and our choice of loss function and model optimizer can dramatically impact the quality, accuracy, and generalizability of our final model. Typical loss functions (also called “objective functions” or “scoring functions”) include: • Binary cross-entropy • Categorical cross-entropy • Sparse categorical cross-entropy • Mean Squared Error (MSE) • Mean Absolute Error (MAE) • Standard Hinge • Squared Hinge A full review of loss functions is outside the scope of this post, but for the time being, just understand that for most tasks: • Loss measures the “goodness” of your model • The smaller the loss, the better • But you need to be careful not to overfit To learn more about the role of loss functions when training your own custom neural networks, be sure to: Additionally, if you would like a complete, step-by-step guide on the role of loss functions in machine learning/neural networks, make sure you read Deep Learning for Computer Vision with Python where I explain parameterized learning and loss methods in detail (including code and experiments). ### Project structure Go ahead and use the “Downloads” section of this post to download the source code. From there, inspect the project/directory structure via the tree command: $ tree --dirsfirst
.
├── pyimagesearch
│   ├── __init__.py
│   └── minivggnet.py
├── fashion_mnist.py
├── plot_shift.py
└── training.pickle

1 directory, 5 files
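Before digging into the project files, the "loss" concept from the previous section can be made concrete. Below is a minimal NumPy sketch (a standalone illustration, not part of the project files above) of categorical cross-entropy, the loss we'll later pass to model.compile:

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    # mean negative log-likelihood of the true class for each sample
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# two samples, three classes; labels are one-hot encoded
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype="float32")

confident = np.array([[0.90, 0.05, 0.05], [0.10, 0.80, 0.10]])  # mostly right
unsure    = np.array([[0.40, 0.30, 0.30], [0.40, 0.30, 0.30]])  # nearly uniform

print(categorical_crossentropy(y_true, confident))  # low loss
print(categorical_crossentropy(y_true, unsure))     # higher loss
```

The more probability mass the predictor places on the correct class, the smaller the loss, which is exactly the quantity our training script will drive down.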

Today we'll be using a smaller version of VGGNet called MiniVGGNet. The pyimagesearch module includes this CNN.

Our fashion_mnist.py script trains MiniVGGNet on the Fashion MNIST dataset. I've written about training MiniVGGNet on Fashion MNIST in a previous blog post, so today we won't go into too much detail.

Today's training script generates a training.pickle file of the training accuracy/loss history. Inside the Reason #2 section below, we'll use plot_shift.py to shift the training loss plot half an epoch to demonstrate that the time at which loss is measured plays a role when validation loss is lower than training loss.

Let's dive into the three reasons now to answer the question, "Why is my validation loss lower than my training loss?".

### Reason #1: Regularization applied during training, but not during validation/testing

Figure 2: Aurélien answers the question: "Ever wonder why validation loss > training loss?" on his Twitter feed (image source).

The first reason is that regularization is applied during training but not during validation/testing.

When training a deep neural network we often apply regularization to help our model:

1. Obtain higher validation/testing accuracy
2. And ideally, to generalize better to the data outside the validation and testing sets

Regularization methods often sacrifice training accuracy to improve validation/testing accuracy — in some cases that can lead to your validation loss being lower than your training loss.

Secondly, keep in mind that regularization methods such as dropout are not applied at validation/testing time.

As Aurélien shows in Figure 2, factoring regularization into the validation loss (e.g., applying dropout at validation/testing time) can make your training/validation loss curves look more similar.
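This train/test asymmetry is easy to demonstrate in isolation. The sketch below (standalone NumPy, not code from this post's project) implements standard "inverted" dropout, which perturbs activations during training but leaves them untouched at validation/test time:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(x, rate, training):
    # dropout only perturbs activations during training;
    # at validation/test time it is the identity function
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    # "inverted" scaling keeps the expected activation unchanged
    return x * mask / (1.0 - rate)

activations = np.ones((4, 3))
train_out = dropout(activations, rate=0.5, training=True)   # noisy: zeros and 2.0s
test_out  = dropout(activations, rate=0.5, training=False)  # unchanged
```

Because the training forward pass is handicapped by the noise while the validation pass is not, the training loss is measured under harder conditions than the validation loss.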

### Reason #2: Training loss is measured during each epoch while validation loss is measured after each epoch

Figure 3: Reason #2 for validation loss sometimes being less than training loss has to do with when the measurement is taken (image source).

The second reason you may see validation loss lower than training loss is due to how the loss values are measured and reported:

1. Training loss is measured during each epoch
2. While validation loss is measured after each epoch

Your training loss is continually reported over the course of an entire epoch; however, validation metrics are computed over the validation set only once the current training epoch is completed.

This implies that, on average, training losses are measured half an epoch earlier.

If you shift the training losses half an epoch to the left, you'll see that the gaps between the training and validation loss values are much smaller.
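A toy sketch (hypothetical numbers, not from the experiment below) makes the timing effect concrete: if the loss decreases steadily within an epoch, the reported training loss, being an average over the whole epoch, reflects the weaker mid-epoch model, while validation loss reflects the fully updated end-of-epoch model:

```python
import numpy as np

# hypothetical per-batch training losses within one epoch, steadily decreasing
batch_losses = np.linspace(1.0, 0.6, num=100)

reported_train_loss = batch_losses.mean()  # averaged over the epoch -> 0.8
end_of_epoch_loss = batch_losses[-1]       # what the model achieves at epoch end -> 0.6

# validation loss is computed on the end-of-epoch model, so even a validation
# set of comparable difficulty can report a lower number than the training average
```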

For an example of this behavior in action, read the following section.

#### Implementing our training script

We’ll be implementing a simple Python script to train a small VGG-like network (called MiniVGGNet) on the Fashion MNIST dataset. During training, we’ll save our training and validation losses to disk. We’ll then create a separate Python script to compare both the unshifted and shifted loss plots.

Let’s get started by implementing the training script:

# import the necessary packages
from pyimagesearch.minivggnet import MiniVGGNet
from sklearn.metrics import classification_report
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.utils import to_categorical
import argparse
import pickle

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("--history", required=True,
	help="path to output training history file")
args = vars(ap.parse_args())

Lines 2-8 import our required packages, modules, classes, and functions. Namely, we import MiniVGGNet (our CNN), fashion_mnist (our dataset), and pickle (ensuring that we can serialize our training history for a separate script to handle plotting).

Our only command line argument, --history, points to the separate .pickle file which will soon contain our training history (Lines 11-14).

We then initialize a few hyperparameters, namely our number of epochs to train for, initial learning rate, and batch size:

# initialize the number of epochs to train for, base learning rate,
# and batch size
NUM_EPOCHS = 25
INIT_LR = 1e-2
BS = 32

We then proceed to load and preprocess our Fashion MNIST data:

# grab the Fashion MNIST dataset (if this is your first time running
# this script, the dataset will be automatically downloaded)
((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()

# we are using "channels last" ordering, so the design matrix shape
# should be: num_samples x rows x columns x depth
trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
testX = testX.reshape((testX.shape[0], 28, 28, 1))

# scale data to the range of [0, 1]
trainX = trainX.astype("float32") / 255.0
testX = testX.astype("float32") / 255.0

# one-hot encode the training and testing labels
trainY = to_categorical(trainY, 10)
testY = to_categorical(testY, 10)

# initialize the label names
labelNames = ["top", "trouser", "pullover", "dress", "coat",
"sandal", "shirt", "sneaker", "bag", "ankle boot"]

Lines 25-34 load and preprocess the training/validation data.

Lines 37 and 38 binarize our class labels, while Lines 41 and 42 list out the human-readable class label names for classification report purposes later.
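For readers new to one-hot encoding, the to_categorical call on Lines 37 and 38 is equivalent to the following minimal NumPy sketch (standalone, for illustration only):

```python
import numpy as np

def one_hot(labels, num_classes):
    # each integer label becomes a row with a single 1 at that index
    out = np.zeros((len(labels), num_classes), dtype="float32")
    out[np.arange(len(labels)), labels] = 1.0
    return out

print(one_hot([0, 2, 1], 3))
# each row sums to 1, with the 1 in the column given by the label
```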

From here we have everything we need to compile and train our MiniVGGNet model on the Fashion MNIST data:

# initialize the optimizer and model
print("[INFO] compiling model...")
opt = SGD(lr=INIT_LR, momentum=0.9, decay=INIT_LR / NUM_EPOCHS)
model = MiniVGGNet.build(width=28, height=28, depth=1, classes=10)
model.compile(loss="categorical_crossentropy", optimizer=opt,
metrics=["accuracy"])

# train the network
print("[INFO] training model...")
H = model.fit(trainX, trainY,
validation_data=(testX, testY),
batch_size=BS, epochs=NUM_EPOCHS)

Lines 46-49 initialize and compile the MiniVGGNet model.

Lines 53-55 then fit/train the model.

From here we will evaluate our model and serialize our training history:

# make predictions on the test set and show a nicely formatted
# classification report
preds = model.predict(testX)
print("[INFO] evaluating network...")
print(classification_report(testY.argmax(axis=1), preds.argmax(axis=1),
target_names=labelNames))

# serialize the training history to disk
print("[INFO] serializing training history...")
f = open(args["history"], "wb")
f.write(pickle.dumps(H.history))
f.close()

Lines 59-62 make predictions on the test set and print a classification report to the terminal.

Lines 66-68 serialize our training accuracy/loss history to a .pickle file. We'll use the training history in a separate Python script to plot the loss curves, including one plot showing a one-half epoch shift.
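The serialize/deserialize round trip can be sketched in isolation with a dummy history dict standing in for H.history (the filename training_demo.pickle below is a throwaway choice for this sketch; the real scripts use the --history/--input command line arguments):

```python
import pickle

# a dummy stand-in for the H.history dict produced by model.fit
history = {"loss": [0.54, 0.33], "val_loss": [0.32, 0.27]}

# serialize to disk exactly as the training script does
with open("training_demo.pickle", "wb") as f:
    f.write(pickle.dumps(history))

# ...and load it back, as the plotting script will
with open("training_demo.pickle", "rb") as f:
    restored = pickle.loads(f.read())
```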

From there, open up a terminal and execute the following command:

$ python fashion_mnist.py --history training.pickle
[INFO] loading Fashion MNIST...
[INFO] compiling model...
[INFO] training model...
Train on 60000 samples, validate on 10000 samples
Epoch 1/25
60000/60000 [==============================] - 200s 3ms/sample - loss: 0.5433 - accuracy: 0.8181 - val_loss: 0.3281 - val_accuracy: 0.8815
Epoch 2/25
60000/60000 [==============================] - 194s 3ms/sample - loss: 0.3396 - accuracy: 0.8780 - val_loss: 0.2726 - val_accuracy: 0.9006
Epoch 3/25
60000/60000 [==============================] - 193s 3ms/sample - loss: 0.2941 - accuracy: 0.8943 - val_loss: 0.2722 - val_accuracy: 0.8970
Epoch 4/25
60000/60000 [==============================] - 193s 3ms/sample - loss: 0.2717 - accuracy: 0.9017 - val_loss: 0.2334 - val_accuracy: 0.9144
Epoch 5/25
60000/60000 [==============================] - 194s 3ms/sample - loss: 0.2534 - accuracy: 0.9086 - val_loss: 0.2245 - val_accuracy: 0.9194
...
Epoch 21/25
60000/60000 [==============================] - 195s 3ms/sample - loss: 0.1797 - accuracy: 0.9340 - val_loss: 0.1879 - val_accuracy: 0.9324
Epoch 22/25
60000/60000 [==============================] - 194s 3ms/sample - loss: 0.1814 - accuracy: 0.9342 - val_loss: 0.1901 - val_accuracy: 0.9313
Epoch 23/25
60000/60000 [==============================] - 193s 3ms/sample - loss: 0.1766 - accuracy: 0.9351 - val_loss: 0.1866 - val_accuracy: 0.9320
Epoch 24/25
60000/60000 [==============================] - 193s 3ms/sample - loss: 0.1770 - accuracy: 0.9347 - val_loss: 0.1845 - val_accuracy: 0.9337
Epoch 25/25
60000/60000 [==============================] - 194s 3ms/sample - loss: 0.1734 - accuracy: 0.9372 - val_loss: 0.1871 - val_accuracy: 0.9312
[INFO] evaluating network...
              precision    recall  f1-score   support

         top       0.87      0.91      0.89      1000
     trouser       1.00      0.99      0.99      1000
    pullover       0.91      0.91      0.91      1000
       dress       0.93      0.93      0.93      1000
        coat       0.87      0.93      0.90      1000
      sandal       0.98      0.98      0.98      1000
       shirt       0.83      0.74      0.78      1000
     sneaker       0.95      0.98      0.97      1000
         bag       0.99      0.99      0.99      1000
  ankle boot       0.99      0.95      0.97      1000

    accuracy                           0.93     10000
   macro avg       0.93      0.93      0.93     10000
weighted avg       0.93      0.93      0.93     10000

[INFO] serializing training history...

Checking the contents of your working directory, you should have a file named training.pickle; this file contains our training history logs.

$ ls *.pickle
training.pickle

In the next section we’ll learn how to plot these values and shift our training information a half epoch to the left, thereby making our training/validation loss curves look more similar.

#### Shifting our training loss values

Our plot_shift.py script is used to plot the training history output from fashion_mnist.py. Using this script we can investigate how shifting our training loss a half epoch to the left can make our training/validation plots look more similar.

Open up the plot_shift.py file and insert the following code:

# import the necessary packages
import matplotlib.pyplot as plt
import numpy as np
import argparse
import pickle

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("--input", required=True,
	help="path to input training history file")
args = vars(ap.parse_args())

Lines 2-5 import matplotlib (for plotting), NumPy (for a simple array creation operation), argparse (command line arguments), and pickle (to load our serialized training history).

Lines 8-11 parse the --input command line argument, which points to our .pickle training history file on disk.

# load the training history
H = pickle.loads(open(args["input"], "rb").read())
# determine the total number of epochs used for training, then
# initialize the figure
epochs = np.arange(0, len(H["loss"]))
plt.style.use("ggplot")
(fig, axs) = plt.subplots(2, 1)

Line 14 loads our serialized training history .pickle file using the --input command line argument.

Line 18 makes space for our x-axis, which spans from zero to the number of epochs in the training history.

Lines 19 and 20 set up our plot figure to be two stacked plots in the same image:

• The top plot will contain loss curves as-is.
• The bottom plot, on the other hand, will include a shift for the training loss (but not for the validation loss). The training loss will be shifted half an epoch to the left just as in Aurélien’s tweet. We’ll then be able to observe if the plots line up more closely.

Let’s generate our top plot:

# plot the *unshifted* training and validation loss
axs[0].plot(epochs, H["loss"], label="train_loss")
axs[0].plot(epochs, H["val_loss"], label="val_loss")
axs[0].set_title("Unshifted Loss Plot")
axs[0].set_xlabel("Epoch #")
axs[0].set_ylabel("Loss")
axs[0].legend()

And then draw our bottom plot:

# plot the *shifted* training and validation loss
axs[1].plot(epochs - 0.5, H["loss"], label="train_loss")
axs[1].plot(epochs, H["val_loss"], label="val_loss")
axs[1].set_title("Shifted Loss Plot")
axs[1].set_xlabel("Epoch #")
axs[1].set_ylabel("Loss")
axs[1].legend()

# show the plots
plt.tight_layout()
plt.show()

Notice on Line 32 that the training loss is shifted 0.5 epochs to the left; this shift is the heart of this example.

Let’s now analyze our training/validation plots.

Open up a terminal and execute the following command:

$ python plot_shift.py --input training.pickle

Figure 4: Shifting the training loss plot half an epoch to the left yields more similar plots. Clearly the time of measurement answers the question, "Why is my validation loss lower than training loss?".

As you can observe, shifting the training loss values a half epoch to the left (bottom) makes the training/validation curves much more similar versus the unshifted (top) plot.

### Reason #3: The validation set may be easier than the training set (or there may be leaks)

Figure 5: Consider how your validation set was acquired/generated. Common mistakes could lead to validation loss being less than training loss. (image source)

The final most common reason for validation loss being lower than your training loss is the data distribution itself.

Consider how your validation set was acquired:

• Can you guarantee that the validation set was sampled from the same distribution as the training set?
• Are you certain that the validation examples are just as challenging as your training images?
• Can you assure there was no "data leakage" (i.e., training samples getting accidentally mixed in with validation/testing samples)?
• Are you confident your code created the training, validation, and testing splits properly?

Every single deep learning practitioner has made the above mistakes at least once in their career. Yes, it is embarrassing when it happens, but that's the point: it does happen, so take the time now to investigate your code.

### BONUS: Are you training hard enough?

Figure 6: If you are wondering why your validation loss is lower than your training loss, perhaps you aren't "training hard enough".

One aspect that Aurélien didn't touch on in his tweets is the concept of "training hard enough". When training a deep neural network, our biggest concern is nearly always overfitting, and in order to combat overfitting, we introduce regularization techniques (discussed in Reason #1 above).
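One of those regularization techniques, L2 weight decay, is easy to isolate in a minimal NumPy sketch (illustrative values only, not from this post's experiments); notice how a stronger decay pulls the weights toward zero faster in a single gradient step:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01, weight_decay=0.0):
    # L2 weight decay adds weight_decay * w to the data gradient,
    # shrinking the weights toward zero at every update
    return w - lr * (grad + weight_decay * w)

w = np.full(3, 2.0)
grad = np.zeros(3)  # a zero data gradient isolates the decay term

strong  = sgd_step(w, grad, weight_decay=1.0)  # weights shrink noticeably
relaxed = sgd_step(w, grad, weight_decay=0.1)  # weights keep more magnitude
```

Dialing the decay strength up or down is exactly the kind of regularization knob discussed below: too strong and the model is held back during training, too weak and overfitting creeps in.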
We apply regularization in the form of:

• Dropout
• L2 weight decay
• Reducing model capacity (i.e., a more shallow model)

We also tend to be a bit more conservative with our learning rate to ensure our model doesn't overshoot areas of lower loss in the loss landscape.

That's all fine and good, but sometimes we end up over-regularizing our models. If you've gone through all three reasons for validation loss being lower than training loss detailed above, you may have over-regularized your model. Start to relax your regularization constraints by:

• Lowering your L2 weight decay strength.
• Reducing the amount of dropout you're applying.
• Increasing your model capacity (i.e., making it deeper).

You should also try training with a larger learning rate, as you may have become too conservative with it.

## Do you have unanswered deep learning questions?

You can train your first neural network in minutes… with just a few lines of Python.

But if you are just getting started, you may have questions such as today's: "Why is my validation loss lower than training loss?"

Similar questions can stump you for weeks, maybe months. You might find yourself searching online for answers only to be disappointed in the explanations. Or maybe you posted your burning question to Stack Overflow or Quora and you are still hearing crickets.

It doesn't have to be like that. What you need is a comprehensive book to jumpstart your education. Discover and study deep learning the right way in my book: Deep Learning for Computer Vision with Python.

Inside the book, you'll find self-study tutorials and end-to-end projects on topics like:

• Convolutional Neural Networks
• Object Detection via Faster R-CNNs and SSDs
• Generative Adversarial Networks (GANs)
• Emotion/Facial Expression Recognition
• Best practices, tips, and rules of thumb
• …and much more!

Using the knowledge gained by reading this book you'll finally be able to bring deep learning to your own projects.
What's more is that you'll learn the "art" of training neural networks, answering questions such as:

1. Which deep learning CNN architecture is right for my task at hand?
2. How can I spot underfitting and overfitting, either after or during training?
3. What is the most effective way to set my initial learning rate and to use a learning rate decay scheduler to improve accuracy?
4. Which deep learning optimizer is the best one for the job, and how do I evaluate new, state-of-the-art optimizers as they are published?
5. How do I apply regularization techniques effectively, ensuring that I am not over-regularizing my model?

You'll find the answers to all of these questions inside my deep learning book. Customers of mine attest that this is the best deep learning education you'll find online; inside the book you'll find:

• Super practical walkthroughs that present solutions to actual, real-world image classification, object detection, and segmentation problems.
• Hands-on tutorials (with lots of code) that not only show you the algorithms behind deep learning for computer vision but their implementations as well.
• A no-nonsense teaching style that is guaranteed to help you master deep learning for image understanding and visual recognition.

So why wait? Fumbling around the internet looking for answers to your questions, only to find sub-par resources, is costing you time that you can't get back. The value you receive by reading my book is far, far greater than the price you pay and will ensure you receive a positive return on your investment of time and finances, I guarantee that.

To learn more about the book, and grab the table of contents + free sample chapters, just click here!

## Summary

Today's tutorial was heavily inspired by the following tweet thread from author Aurélien Geron.
Inside the thread, Aurélien expertly and concisely explained the three reasons your validation loss may be lower than your training loss when training a deep neural network:

1. Reason #1: Regularization is applied during training, but not during validation/testing. If you add in the regularization loss during validation/testing, your loss values and curves will look more similar.
2. Reason #2: Training loss is measured during each epoch while validation loss is measured after each epoch. On average, the training loss is measured half an epoch earlier. If you shift your training loss curve a half epoch to the left, your losses will align a bit better.
3. Reason #3: Your validation set may be easier than your training set, or there is a leak in your data/bug in your code. Make sure your validation set is reasonably large and is sampled from the same distribution (and difficulty) as your training set.
4. BONUS: You may be over-regularizing your model. Try reducing your regularization constraints, including increasing your model capacity (i.e., making it deeper with more parameters), reducing dropout, reducing L2 weight decay strength, etc.

Hopefully, this helps clear up any confusion on why your validation loss may be lower than your training loss!

It was certainly a head-scratcher for me when I first started studying machine learning and neural networks, and it took me until mid-college to understand exactly why that happens. None of the explanations back then were as clear and concise as Aurélien's.

I hope you enjoyed today's tutorial! To download the source code (and be notified when future tutorials are published here on PyImageSearch), just enter your email address in the form below!

## Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I'll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.
Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I'll send you the code immediately!

The post Why is my validation loss lower than my training loss? appeared first on PyImageSearch.

Continue Reading…

### Poetry corner

Ray Could Write
Statistics Be
What has happened down here is the winds have changed
Spin
The Paper of My Enemy Has Been Retracted
Imaginary gardens with real data
A parable regarding changing standards on the presentation of statistical evidence
Laplace Calling

Thanks to W. B. Yeats, Young Tiger, Randy Newman, W. H. Auden, Clive James, Marianne Moore, Dr. Seuss, and the Clash for doing all the hard work on these.

Continue Reading…

### Top Stories, Oct 7-13: 10 Free Top Notch Natural Language Processing Courses; The Last SQL Guide for Data Analysis You'll Ever Need

Also: Activation maps for deep learning models in a few lines of code; The 4 Quadrants of Data Science Skills and 7 Principles for Creating a Viral Data Visualization; OpenAI Tried to Train AI Agents to Play Hide-And-Seek but Instead They Were Shocked by What They Learned; 10 Great Python Resources for Aspiring Data Scientists

Continue Reading…

### An Overview of Density Estimation

Density estimation is estimating the probability density function of the population from the sample. This post examines and compares a number of approaches to density estimation.

Continue Reading…

### FiveThirtyEight launches new NBA metric for predictions

FiveThirtyEight has been predicting NBA games for a few years now, based on a variant of Elo ratings, which in turn have roots in ranking chess players. But for this season, they have a new metric to predict with called RAPTOR, or Robust Algorithm (using) Player Tracking (and) On/Off Ratings:

NBA teams highly value floor spacing, defense and shot creation, and they place relatively little value on traditional big-man skills.
RAPTOR likewise values these things — not because we made any deliberate attempt to design the system that way but because the importance of those skills emerges naturally from the data. RAPTOR thinks ball-dominant players such as James Harden and Steph Curry are phenomenally good. It highly values two-way wings such as Kawhi Leonard and Paul George. It can have a love-hate relationship with centers, who are sometimes overvalued in other statistical systems. But it appreciates modern centers such as Nikola Jokić and Joel Embiid, as well as defensive stalwarts like Rudy Gobert.

I've mostly ignored sports-related predictions ever since the Golden State Warriors lost in the 2016 finals. There was a high probability that they would win it all, but they did not. That's when I realized the predictions would only lead to a neutral confirmation or severe disappointment, but never happiness. I'm sure this new metric will be different.

Continue Reading…

### Document worth reading: "Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization"

Overfitting is one of the most critical challenges in deep neural networks, and there are various types of regularization methods to improve generalization performance. Injecting noises to hidden units during training, e.g., dropout, is known as a successful regularizer, but it is still not clear enough why such training techniques work well in practice and how we can maximize their benefit in the presence of two conflicting objectives — optimizing to true data distribution and preventing overfitting by regularization. This paper addresses the above issues by 1) interpreting that the conventional training methods with regularization by noise injection optimize the lower bound of the true objective and 2) proposing a technique to achieve a tighter lower bound using multiple noise samples per training example in a stochastic gradient descent iteration.
We demonstrate the effectiveness of our idea in several computer vision applications.

Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization

Continue Reading…

### Assess Variable Importance In GRNN

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Technically speaking, there is no need to evaluate variable importance or to perform variable selection when training a GRNN. It is also a consensus that a neural network is a black-box model, and it is not an easy task to assess variable importance in a neural network. However, from a practical perspective, it is helpful to understand the individual contribution of each predictor to the overall goodness-of-fit of a GRNN. For instance, the variable importance can help us make up a beautiful business story to decorate our model. In addition, dropping variables with trivial contributions also helps us come up with a more parsimonious model as well as improve computational efficiency.

In the YAGeR project (https://github.com/statcompute/yager), two functions have been added for assessing variable importance in a GRNN. While the grnn.x_imp() function (https://github.com/statcompute/yager/blob/master/code/grnn.x_imp.R) provides the importance assessment of a single variable, the grnn.imp() function (https://github.com/statcompute/yager/blob/master/code/grnn.imp.R) can give us a full picture of the variable importance for all variables in the GRNN. The returned value "imp1" is calculated as the decrease in AUC with all values of the variable of interest set equal to its mean, and "imp2" is calculated as the decrease in AUC with the variable of interest dropped completely.
The variable with a higher decrease in AUC is deemed more important. Below is an example demonstrating how to assess variable importance in a GRNN.

As shown in the output, there are three variables making no contribution to the AUC statistic. It is also noted that dropping the three unimportant variables from the GRNN can actually increase the AUC in the hold-out sample. What's more, the marginal effects of the variables remaining in the GRNN make more sense now, with all of them showing nice monotonic relationships, in particular "tot_open_tr".

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Continue Reading…

### What's new on arXiv

Transfer learning aims to learn robust classifiers for the target domain by leveraging knowledge from a source domain. Since the source and the target domains are usually from different distributions, existing methods mainly focus on adapting the cross-domain marginal or conditional distributions. However, in real applications, the marginal and conditional distributions usually have different contributions to the domain discrepancy. Existing methods fail to quantitatively evaluate the different importance of these two distributions, which will result in unsatisfactory transfer performance. In this paper, we propose a novel concept called Dynamic Distribution Adaptation (DDA), which is capable of quantitatively evaluating the relative importance of each distribution. DDA can be easily incorporated into the framework of structural risk minimization to solve transfer learning problems.
On the basis of DDA, we propose two novel learning algorithms: (1) Manifold Dynamic Distribution Adaptation (MDDA) for traditional transfer learning, and (2) Dynamic Distribution Adaptation Network (DDAN) for deep transfer learning. Extensive experiments demonstrate that MDDA and DDAN significantly improve the transfer learning performance and set up a strong baseline over the latest deep and adversarial methods on digits recognition, sentiment analysis, and image classification. More importantly, it is shown that marginal and conditional distributions have different contributions to the domain divergence, and our DDA is able to provide a good quantitative evaluation of their relative importance, which leads to better performance. We believe this observation can be helpful for future research in transfer learning.

Spectral Clustering (SC) is a prominent data clustering technique of recent times which has attracted much attention from researchers. It is a highly data-driven method and makes no strict assumptions on the structure of the data to be clustered. One of the central pieces of spectral clustering is the construction of an affinity matrix based on a similarity measure between data points. The way the similarity measure is defined between data points has a direct impact on the performance of the SC technique. Several attempts have been made in the direction of strengthening the pairwise similarity measure to enhance spectral clustering. In this work, we have defined a novel affinity measure by employing the concept of non-conformity used in the Conformal Prediction (CP) framework. The non-conformity based affinity captures the relationship between neighborhoods of data points and has the power to generalize the notion of contextual similarity. We have shown that this formulation of affinity measure gives good results and compares well with the state of the art methods.

Automatic question generation is an important problem in natural language processing.
In this paper we propose a novel adaptive copying recurrent neural network model to tackle the problem of question generation from sentences and paragraphs. The proposed model adds a copying mechanism component onto a bidirectional LSTM architecture to generate more suitable questions adaptively from the input data. Our experimental results show the proposed model can outperform the state-of-the-art question generation methods in terms of BLEU and ROUGE evaluation scores.

Federated Machine Learning (FML) creates an ecosystem for multiple parties to collaborate on building models while protecting data privacy for the participants. A measure of the contribution of each party in FML enables fair credit allocation. In this paper we develop simple but powerful techniques to fairly calculate the contributions of multiple parties in FML, in the context of both horizontal FML and vertical FML. For horizontal FML we use the deletion method to calculate the grouped instance influence. For vertical FML we use Shapley values to calculate the grouped feature importance. Our methods open the door for research in model contribution and credit allocation in the context of federated machine learning.

Pre-trained language representation models, such as BERT, capture a general language representation from large-scale corpora, but lack domain-specific knowledge. When reading a domain text, experts make inferences with relevant knowledge. For machines to achieve this capability, we propose a knowledge-enabled language representation model (K-BERT) with knowledge graphs (KGs), in which triples are injected into the sentences as domain knowledge. However, too much knowledge incorporation may divert the sentence from its correct meaning, which is called the knowledge noise (KN) issue. To overcome KN, K-BERT introduces a soft-position and visible matrix to limit the impact of knowledge.
K-BERT can easily inject domain knowledge into the models simply by being equipped with a KG, without pre-training by itself, because it is capable of loading model parameters from the pre-trained BERT. Our investigation reveals promising results in twelve NLP tasks. Especially in domain-specific tasks (including finance, law, and medicine), K-BERT significantly outperforms BERT, which demonstrates that K-BERT is an excellent choice for solving knowledge-driven problems that require experts.

The widespread use of ML-based decision making in domains with high societal impact such as recidivism, job hiring and loan credit has raised a lot of concerns regarding potential discrimination. In particular, in certain cases it has been observed that ML algorithms can provide different decisions based on sensitive attributes such as gender or race and therefore can lead to discrimination. Although several fairness-aware ML approaches have been proposed, their focus has been largely on preserving the overall classification accuracy while improving fairness in predictions for both protected and non-protected groups (defined based on the sensitive attribute(s)). The overall accuracy, however, is not a good indicator of performance in case of class imbalance, as it is biased towards the majority class. As we will see in our experiments, many of the fairness-related datasets suffer from class imbalance and therefore, tackling fairness also requires tackling the imbalance problem. To this end, we propose AdaFair, a fairness-aware classifier based on AdaBoost that further updates the weights of the instances in each boosting round taking into account a cumulative notion of fairness based upon all current ensemble members, while explicitly tackling class imbalance by optimizing the number of ensemble members for balanced classification error.
Our experiments show that our approach can achieve parity in true positive and true negative rates for both protected and non-protected groups, while it significantly outperforms existing fairness-aware methods by up to 25% in terms of balanced error.

We propose K-TanH, a novel, highly accurate, hardware-efficient approximation of the popular activation function Tanh for deep learning. K-TanH consists of a sequence of parameterized bit/integer operations, such as masking, shift and add/subtract (no floating point operation needed), where the parameters are stored in a very small look-up table. The design of K-TanH is flexible enough to deal with multiple numerical formats, such as FP32 and BFloat16. High quality approximations to other activation functions, e.g., Swish and GELU, can be derived from K-TanH. We provide an RTL design for K-TanH to demonstrate its area/power/performance efficacy. It is more accurate than existing piecewise approximations for Tanh. For example, K-TanH achieves $\sim 5\times$ speed up and $> 6\times$ reduction in maximum approximation error over a software implementation of Hard TanH. Experimental results for low-precision BFloat16 training of the language translation model GNMT on WMT16 data sets with approximate Tanh and Sigmoid obtained via K-TanH achieve similar accuracy and convergence as training with exact Tanh and Sigmoid.

Distributed word embeddings have yielded state-of-the-art performance in many NLP tasks, mainly due to their success in capturing useful semantic information. These representations assign only a single vector to each word whereas a large number of words are polysemous (i.e., have multiple meanings). In this work, we approach this critical problem in lexical semantics, namely that of representing various senses of polysemous words in vector spaces. We propose a topic modeling based skip-gram approach for learning multi-prototype word embeddings.
We also introduce a method to prune the embeddings determined by the probabilistic representation of the word in each topic. We use our embeddings to show that they can capture the context and word similarity strongly and outperform various state-of-the-art implementations.

Reinforcement Learning (RL) algorithms usually assume their environment to be a Markov Decision Process (MDP). Additionally, they do not try to identify specific features of environments which could help them perform better. Here, we present a few key meta-features of environments: delayed rewards, specific reward sequences, sparsity of rewards, and stochasticity of environments, which may violate the MDP assumptions and adapting to which should help RL agents perform better. While it is very time consuming to run RL algorithms on standard benchmarks, we define a parameterised collection of fast-to-run toy benchmarks in OpenAI Gym by varying these meta-features. Despite their toy nature and low compute requirements, we show that these benchmarks present substantial difficulties to current RL algorithms. Furthermore, since we can generate environments with a desired value for each of the meta-features, we have fine-grained control over the environments’ difficulty and also have the ground truth available for evaluating algorithms. We believe that devising algorithms that can detect such meta-features of environments and adapt to them will be key to creating robust RL algorithms that work in a variety of different real-world problems.

We introduce SpERT, an attention model for span-based joint entity and relation extraction. Our approach employs the pre-trained Transformer network BERT as its core. We use BERT embeddings as shared inputs for a light-weight reasoning, which features entity recognition and filtering, as well as relation classification with a localized, marker-free context representation.
The model is trained on strong within-sentence negative samples, which are efficiently extracted in a single BERT pass. These aspects facilitate a search over all spans in the sentence. In ablation studies, we demonstrate the benefits of pre-training, strong negative sampling and localized context. Our model outperforms prior work by up to 5% F1 score on several datasets for joint entity and relation extraction.

Combinatorial optimization problems for clustering are known to be NP-hard. Most optimization methods are not able to find the global optimum solution for all datasets. To solve this problem, we propose a global optimal path-based clustering (GOPC) algorithm in this paper. The GOPC algorithm is based on two facts: (1) medoids have the minimum degree in their clusters; (2) the minimax distance between two objects in one cluster is smaller than the minimax distance between objects in different clusters. Extensive experiments are conducted on synthetic and real-world datasets to evaluate the performance of the GOPC algorithm. The results on synthetic datasets show that the GOPC algorithm can recognize all kinds of clusters regardless of their shapes, sizes, or densities. Experimental results on real-world datasets demonstrate the effectiveness and efficiency of the GOPC algorithm. In addition, the GOPC algorithm needs only one parameter, i.e., the number of clusters, which can be estimated by the decision graph. The advantages mentioned above make GOPC a good candidate as a general clustering algorithm. Codes are available at https://…/Clustering.

Sequential vision-to-language or visual storytelling has recently been one of the areas of focus in computer vision and language modeling domains. Though existing models generate narratives that read subjectively well, there could be cases when these models miss out on generating stories that account for and address all prospective human and animal characters in the image sequences.
Considering this scenario, we propose a model that implicitly learns relationships between provided characters and thereby generates stories with respective characters in scope. We use the VIST dataset for this purpose and report numerous statistics on the dataset. Eventually, we describe the model, explain the experiment and discuss our current status and future work.

We present sktime — a new scikit-learn compatible Python library with a unified interface for machine learning with time series. Time series data gives rise to various distinct but closely related learning tasks, such as forecasting and time series classification, many of which can be solved by reducing them to related simpler tasks. We discuss the main rationale for creating a unified interface, including reduction, as well as the design of sktime’s core API, supported by a clear overview of common time series tasks and reduction approaches.

Recently, generating adversarial examples has become an important means of measuring the robustness of a deep learning model. Adversarial examples help us identify the susceptibilities of the model and further counter those vulnerabilities by applying adversarial training techniques. In the natural language domain, small perturbations in the form of misspellings or paraphrases can drastically change the semantics of the text. We propose a reinforcement learning based approach towards generating adversarial examples in black-box settings. We demonstrate that our method is able to fool well-trained models for (a) the IMDB sentiment classification task and (b) the AG’s news corpus news categorization task with significantly high success rates. We find that the adversarial examples generated are semantics-preserving perturbations to the original text.

In this work we present Ludwig, a flexible, extensible and easy to use toolbox which allows users to train deep learning models and use them for obtaining predictions without writing code.
Ludwig implements a novel approach to deep learning model building based on two main abstractions: data types and declarative configuration files. The data type abstraction allows for easier code and sub-model reuse, and the standardized interfaces imposed by this abstraction allow for encapsulation and make the code easy to extend. Declarative model definition configuration files enable inexperienced users to obtain effective models and increase the productivity of expert users. Alongside these two innovations, Ludwig introduces a general modularized deep learning architecture called Encoder-Combiner-Decoder that can be instantiated to perform a vast number of machine learning tasks. These innovations make it possible for engineers, scientists from other fields and, in general, a much broader audience to adopt deep learning models for their tasks, concretely helping its democratization.

The ability to understand and work with numbers (numeracy) is critical for many complex reasoning tasks. Currently, most NLP models treat numbers in text in the same way as other tokens—they embed them as distributed vectors. Is this enough to capture numeracy? We begin by investigating the numerical reasoning capabilities of a state-of-the-art question answering model on the DROP dataset. We find this model excels on questions that require numerical reasoning, i.e., it already captures numeracy. To understand how this capability emerges, we probe token embedding methods (e.g., BERT, GloVe) on synthetic list maximum, number decoding, and addition tasks. A surprising degree of numeracy is naturally present in standard embeddings. For example, GloVe and word2vec accurately encode magnitude for numbers up to 1,000. Furthermore, character-level embeddings are even more precise—ELMo captures numeracy the best of all pre-trained methods—but BERT, which uses sub-word units, is less exact.
We introduce Block Sparse Canonical Correlation Analysis, which estimates multiple pairs of canonical directions (together a ‘block’) at once, resulting in significantly improved orthogonality of the sparse directions which, we demonstrate, translates to more interpretable solutions. Our approach builds on the sparse CCA method of (Solari, Brown, and Bickel 2019) in that we also express the bi-convex objective of our block formulation as a concave minimization problem over an orthogonal k-frame in a unit Euclidean ball, which in turn, due to concavity of the objective, is shrunk to a Stiefel manifold, which is optimized via a gradient descent algorithm. Our simulations show that our method outperforms existing sCCA algorithms and implementations in terms of computational cost and stability, mainly due to the drastic shrinkage of our search space, and the correlation within and orthogonality between pairs of estimated canonical covariates. Finally, we apply our method, available as an R package called BLOCCS, to multi-omic data on Lung Squamous Cell Carcinoma (LUSC) obtained via The Cancer Genome Atlas, and demonstrate its capability in capturing meaningful biological associations relevant to the hypothesis under study rather than spurious dominant variations.

Few-shot learning (FSL) for action recognition is a challenging task of recognizing novel action categories which are represented by few instances in the training data. In a more generalized FSL setting (G-FSL), both seen as well as novel action categories need to be recognized. Conventional classifiers suffer due to inadequate data in the FSL setting and inherent bias towards seen action categories in the G-FSL setting. In this paper, we address this problem by proposing a novel ProtoGAN framework which synthesizes additional examples for novel categories by conditioning a conditional generative adversarial network with class prototype vectors.
These class prototype vectors are learnt using a Class Prototype Transfer Network (CPTN) from examples of seen categories. Our synthesized examples for a novel class are semantically similar to real examples belonging to that class and are used to train a model exhibiting better generalization towards novel classes. We support our claim by performing extensive experiments on three datasets: UCF101, HMDB51 and Olympic-Sports. To the best of our knowledge, we are the first to report results for G-FSL and provide a strong benchmark for future research. We also outperform the state-of-the-art method in FSL for all the aforementioned datasets.

A new approach to sparse Canonical Correlation Analysis (sCCA) is proposed with the aim of discovering interpretable associations in very high-dimensional multi-view problems, i.e. observations of multiple sets of variables on the same subjects. Inspired by the sparse PCA approach of Journee et al. (2010), we also show that the sparse CCA formulation, while non-convex, is equivalent to a maximization program of a convex objective over a compact set, for which we propose a first-order gradient method. This result helps us reduce the search space drastically to the boundaries of the set. Consequently, we propose a two-step algorithm, where we first infer the sparsity pattern of the canonical directions using our fast algorithm, then we shrink each view, i.e. observations of a set of covariates, to contain observations on the sets of covariates selected in the previous step, and compute their canonical directions via any CCA algorithm. We also introduce Directed Sparse CCA, which is able to find associations which are aligned with a specified experiment design, and Multi-View sCCA, which is used to discover associations between multiple sets of covariates.
Our simulations establish the superior convergence properties and computational efficiency of our algorithm as well as its accuracy in terms of the canonical correlation and its ability to recover the supports of the canonical directions. We study the associations between metabolomics, transcriptomics and microbiomics in a multi-omic study using MuLe, an R package that implements our approach, in order to form hypotheses on mechanisms of adaptation of Drosophila melanogaster to high doses of environmental toxicants, specifically atrazine, a commonly used herbicide.

Many studies collect functional data from multiple subjects that have both multilevel and multivariate structures. An example of such data comes from popular neuroscience experiments where participants’ brain activity is recorded using modalities such as EEG and summarized as power within multiple time-varying frequency bands within multiple electrodes, or brain regions. Summarizing the joint variation across multiple frequency bands for both whole-brain variability between subjects, as well as location-variation within subjects, can help to explain neural reactions to stimuli. This article introduces a novel approach to conducting interpretable principal components analysis on multilevel multivariate functional data that decomposes total variation into subject-level and replicate-within-subject-level (i.e. electrode-level) variation, and provides interpretable components that can be both sparse among variates (e.g. frequency bands) and have localized support over time within each frequency band. The sparsity and localization of components is achieved by solving an innovative rank-one based convex optimization problem with block Frobenius and matrix $L_1$-norm based penalties.
The method is used to analyze data from a study to better understand reactions to emotional information in individuals with histories of trauma and the symptom of dissociation, revealing new neurophysiological insights into how subject- and electrode-level brain activity are associated with these phenomena.

Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data parallel algorithms due to its high performance in homogeneous environments. However, its performance is bounded by the slowest worker among all workers, and is significantly slower in heterogeneous situations. AD-PSGD, a newly proposed synchronization method which provides numerically fast convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible to get the best of both worlds – designing a distributed training method that has both the high performance of All-Reduce in homogeneous environments and the good heterogeneity tolerance of AD-PSGD? In this paper, we propose Ripples, a high-performance heterogeneity-aware asynchronous decentralized training approach. We achieve the above goal with intensive synchronization optimization, emphasizing the interplay between algorithm and system implementation. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, that allows a large group of workers to synchronize quickly. To reduce synchronization conflict, we propose static group scheduling in homogeneous environments and simple techniques (Group Buffer and Group Division) to avoid conflicts with slightly reduced randomness. Our experiments show that in a homogeneous environment, Ripples is 1.1 times faster than the state-of-the-art implementation of All-Reduce, 5.1 times faster than Parameter Server and 4.3 times faster than AD-PSGD. In a heterogeneous setting, Ripples shows 2 times speedup over All-Reduce, and still obtains 3 times speedup over the Parameter Server baseline.
Recent work in unsupervised language modeling demonstrates that training large neural language models advances the state of the art in Natural Language Processing applications. However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors. In this work, we implement a simple, efficient intra-layer model parallel approach that enables training state-of-the-art transformer language models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging an 8.3 billion parameter transformer language model using 512 GPUs, making it the largest transformer model ever trained at 24 times the size of BERT and 5.6 times the size of GPT-2. We sustain up to 15.1 PetaFLOPs per second across the entire application with 76% scaling efficiency, compared to a strong single processor baseline that sustains 39 TeraFLOPs per second, which is 30% of peak FLOPs. The model is trained on 174GB of text, requiring 12 ZettaFLOPs over 9.2 days to converge. Transferring this language model achieves state of the art (SOTA) results on the WikiText103 (10.8 compared to SOTA perplexity of 16.4) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. We release training and evaluation code, as well as the weights of our smaller portable model, for reproducibility.

Deep neural networks (DNNs) have achieved unprecedented success in numerous machine learning tasks in various domains. However, the existence of adversarial examples raises our concerns in adopting deep learning for safety-critical applications.
As a result, we have witnessed increasing interest in studying attack and defense mechanisms for DNN models on different data types, such as images, graphs and text. Thus, it is necessary to provide a systematic and comprehensive overview of the main threats of attacks and the success of corresponding countermeasures. In this survey, we review the state-of-the-art algorithms for generating adversarial examples and the countermeasures against adversarial examples, for the three most popular data types: images, graphs and text.

In recent years, the softmax model and its fast approximations have become the de-facto loss functions for deep neural networks when dealing with multi-class prediction. This loss has been extended to language modeling and recommendation, two fields that fall into the framework of learning from Positive and Unlabeled data. In this paper, we stress the different drawbacks of the current family of softmax losses and sampling schemes when applied in a Positive and Unlabeled learning setup. We propose both a Relaxed Softmax loss (RS) and a new negative sampling scheme based on a Boltzmann formulation. We show that the new training objective is better suited for the tasks of density estimation, item similarity and next-event prediction by driving uplifts in performance on textual and recommendation datasets against classical softmax.

Fair machine learning has become a significant research topic with broad societal impact. However, most fair learning methods require direct access to personal demographic data, which is increasingly restricted to use for protecting user privacy (e.g. by the EU General Data Protection Regulation). In this paper, we propose a distributed fair learning framework for protecting the privacy of demographic data. We assume this data is privately held by a third party, which can communicate with the data center (responsible for model development) without revealing the demographic information.
We propose a principled approach to designing fair learning methods under this framework, exemplify four methods and show they consistently outperform their existing counterparts in both fairness and accuracy across three real-world data sets. We theoretically analyze the framework, and prove it can learn models with high fairness or high accuracy, with their trade-offs balanced by a threshold variable.

Ensemble models comprising deep Convolutional Neural Networks (CNNs) have shown significant improvements in model generalization but at the cost of large computation and memory requirements. In this paper, we present a framework for learning compact CNN models with improved classification performance and model generalization. For this, we propose a CNN architecture of a compact student model with parallel branches which are trained using ground truth labels and information from high capacity teacher networks in an ensemble learning fashion. Our framework provides two main benefits: i) Distilling knowledge from different teachers into the student network promotes heterogeneity in feature learning at different branches of the student network and enables the network to learn diverse solutions to the target problem. ii) Coupling the branches of the student network through ensembling encourages collaboration and improves the quality of the final predictions by reducing variance in the network outputs. Experiments on the well established CIFAR-10 and CIFAR-100 datasets show that our Ensemble Knowledge Distillation (EKD) improves classification accuracy and model generalization especially in situations with limited training data. Experiments also show that our EKD based compact networks outperform state-of-the-art knowledge distillation based methods in terms of mean accuracy on the test datasets.

We develop a projected least squares estimator for the change point parameter in a high dimensional time series model with a potential change point.
Importantly, we work under the setup where the jump size may be near the boundary of the region of detectability. The proposed methodology yields an optimal rate of convergence despite the high dimensionality of the assumed model and a potentially diminishing jump size. The limiting distribution of this estimate is derived, thereby allowing construction of a confidence interval for the location of the change point. A secondary near-optimal estimate is proposed which is required for the implementation of the optimal projected least squares estimate. The pre-step estimation procedure is designed to also agnostically detect the case where no change point exists, thereby removing the need to pretest for the existence of a change point for the implementation of the inference methodology. Our results are presented under a general positive definite spatial dependence setup, assuming no special structure on this dependence. The proposed methodology is designed to be highly scalable, and applicable to very large data. Theoretical results regarding detection and estimation consistency and the limiting distribution are numerically supported via Monte Carlo simulations.

Continue Reading…

### Magister Dixit

“Moving data doesn’t come for free.” Adrian Colyer (May 25, 2017)

Continue Reading…

### R Packages worth a look

Phase Plane Analysis of One- and Two-Dimensional Autonomous ODE Systems (phaseR)
Performs a qualitative analysis of one- and two-dimensional autonomous ordinary differential equation systems, using phase plane methods. Programs are available to identify and classify equilibrium points, plot the direction field, and plot trajectories for multiple initial conditions. In the one-dimensional case, a program is also available to plot the phase portrait, whilst in the two-dimensional case, programs are additionally available to plot nullclines and stable/unstable manifolds of saddle points. Many example systems are provided for the user.
Further details can be found in Grayling (2014) <doi:10.32614/RJ-2014-023>.

Distributions for Ecological Models in ‘nimble’ (nimbleEcology)
Common ecological distributions for ‘nimble’ models in the form of nimbleFunction objects. Includes Cormack-Jolly-Seber, occupancy, dynamic occupancy, hidden Markov, and dynamic hidden Markov models. (Jolly (1965) <doi:10.2307/2333826>, Seber (1965) <doi:10.2307/2333827>, Turek et al. (2016) <doi:10.1007/s10651-016-0353-z>).

Bayesian Estimation of Bivariate Volatility Model (BayesBEKK)
The Multivariate Generalized Autoregressive Conditional Heteroskedasticity (MGARCH) models are used for modelling volatile multivariate data sets. In this package a variant of MGARCH called BEKK (Baba, Engle, Kraft, Kroner), proposed by Engle and Kroner (1995) <http://…/3532933>, has been used to estimate bivariate time series data using a Bayesian technique.

Data Transformation or Simulation with Empirical Covariance Matrix (simTargetCov)
Transforms or simulates data with a target empirical covariance matrix supplied by the user. The method to obtain the data with the target empirical covariance matrix is described in Section 5.1 of Christidis, Van Aelst and Zamar (2019) <arXiv:1812.05678>.

Graphical and Numerical Checks for Mode-Finding Routines (optimCheck)
Tools for checking that the output of an optimization algorithm is indeed at a local mode of the objective function. This is accomplished graphically by calculating all one-dimensional ‘projection plots’ of the objective function, i.e., varying each input variable one at a time with all other elements of the potential solution being fixed. The numerical values in these plots can be readily extracted for the purpose of automated and systematic unit-testing of optimization routines.

Continue Reading…

### Governments subsidise fossil fuels to the tune of 427bn a year

Doing away with such handouts is devilishly difficult

### Who is winning the race for Westminster?

See which parties British voters plan to support

### Making a background color gradient in ggplot2


I was recently making some arrangements for the 2020 eclipse in South America, which of course got me thinking of the day we were lucky enough to have a path of totality come to us.

We have a weather station that records local temperature every 5 minutes, so after the eclipse I was able to plot the temperature change over the eclipse as we experienced it at our house. Here is an example of a basic plot I made at the time.

Looking at this now with new eyes, I see it might be nice to replace the gray rectangle with one that goes from light to dark to light as the eclipse progresses to totality and then back. I’ll show how I tackled making a gradient color background in this post.

I’ll load ggplot2 for plotting and dplyr for data manipulation.

library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3

# The dataset

My weather station records the temperature in °Fahrenheit every 5 minutes. I downloaded the data from 6 AM to 12 PM local time and cleaned it up a bit. The date-times and temperature are in a dataset I named temp. You can download this below if you’d like to play around with these data.

Here are the first six lines of this temperature dataset.

head(temp)
# # A tibble: 6 x 2
#   datetime            tempf
#   <dttm>              <dbl>
# 1 2017-08-21 06:00:00  54.9
# 2 2017-08-21 06:05:00  54.9
# 3 2017-08-21 06:10:00  54.9
# 4 2017-08-21 06:15:00  54.9
# 5 2017-08-21 06:20:00  54.9
# 6 2017-08-21 06:25:00  54.8
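If you don’t want to download the file, a rough stand-in with the same two-column structure can be simulated. The temperature values below are invented (a smooth curve, not the real measurements), and the Pacific time zone is my assumption based on the PDT in the plot subtitle:

```r
# Hypothetical stand-in for the real weather-station data:
# one reading every 5 minutes from 6 AM to 12 PM.
datetime = seq(as.POSIXct("2017-08-21 06:00:00", tz = "America/Los_Angeles"),
               as.POSIXct("2017-08-21 12:00:00", tz = "America/Los_Angeles"),
               by = "5 min")
temp = data.frame(datetime = datetime,
                  tempf = 55 + 15 * sin(seq(0, pi, length.out = length(datetime))))
head(temp)
```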

I also stored the start and end times of the eclipse and totality in data.frames, which I pulled for my location from this website.

If following along at home, make sure your time zones match for all the date-time variables or, from personal experience, you’ll run into problems.
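One way to sidestep that is to pin the time zone explicitly in every `as.POSIXct()` call. Here "America/Los_Angeles" is an assumption on my part, based on the PDT in the plot subtitle:

```r
# Setting tz explicitly keeps all date-times comparable
# ("America/Los_Angeles" assumed from the PDT in the subtitle)
start = as.POSIXct("2017-08-21 09:05:10", tz = "America/Los_Angeles")
end   = as.POSIXct("2017-08-21 11:37:19", tz = "America/Los_Angeles")

difftime(end, start, units = "secs") # 9129 secs, i.e. 2 h 32 min 9 s
```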

eclipse = data.frame(start = as.POSIXct("2017-08-21 09:05:10"),
                     end = as.POSIXct("2017-08-21 11:37:19") )

totality = data.frame(start = as.POSIXct("2017-08-21 10:16:55"),
                      end = as.POSIXct("2017-08-21 10:18:52") )
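As a quick sanity check, these stored times can be compared against the durations in the plot caption (eclipse: 2 hours 32 minutes 9 seconds; totality: 1 minute 57 seconds). The data frames are restated so the chunk runs on its own:

```r
eclipse = data.frame(start = as.POSIXct("2017-08-21 09:05:10"),
                     end   = as.POSIXct("2017-08-21 11:37:19"))
totality = data.frame(start = as.POSIXct("2017-08-21 10:16:55"),
                      end   = as.POSIXct("2017-08-21 10:18:52"))

# 2 h 32 min 9 s = 9129 s; 1 min 57 s = 117 s
difftime(eclipse$end, eclipse$start, units = "secs")   # 9129 secs
difftime(totality$end, totality$start, units = "secs") # 117 secs
```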

# Initial plot

I decided to make a plot of the temperature change during the eclipse only.

To keep the temperature line looking continuous, even though it’s taken every 5 minutes, I subset the data to times close to but outside the start and end of the eclipse.

plottemp = filter(temp, between(datetime,
                                as.POSIXct("2017-08-21 09:00:00"),
                                as.POSIXct("2017-08-21 12:00:00") ) )

I then zoomed the plot to only include times encompassed by the eclipse with coord_cartesian(). I removed the x axis expansion in scale_x_datetime().

Since the plot covers only about two and a half hours, I made breaks on the x axis every 15 minutes.

ggplot(plottemp) +
    geom_line( aes(datetime, tempf), size = 1 ) +
    scale_x_datetime( date_breaks = "15 min",
                      date_labels = "%H:%M",
                      expand = c(0, 0) ) +
    coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
    labs(y = expression( Temperature~(degree*F) ),
         x = NULL,
         title = "Temperature during 2017-08-21 solar eclipse",
         subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
         caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds") +
    scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9,
                                           name = expression( Temperature~(degree*C) ),
                                           breaks = seq(16, 24, by = 1) ) ) +
    theme_bw(base_size = 14) +
    theme(panel.grid = element_blank() )
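As a quick sanity check on the °F-to-°C transformation used in sec_axis() above:

```r
# The same conversion sec_axis() applies to the secondary axis labels
f_to_c <- function(f) (f - 32) * 5 / 9

f_to_c(54.9)        # the first recorded temperature, roughly 12.7 degrees C
f_to_c(c(32, 212))  # freezing and boiling points: 0 and 100
```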

I wanted the background of the plot to go from light to dark back to light through time. This means a color gradient should go from left to right across the plot.

Since the gradient will be based on time, I figured I could add a vertical line with geom_segment() for every second of the eclipse and color each segment based on how far that time was from totality.

## Make a variable for the color mapping

The first step I took was to make a variable with a row for every second of the eclipse, since I wanted a segment drawn for each second. I used seq.POSIXt() for this.

color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = "1 sec") )

Then came some hard thinking. How would I make a continuous variable to map to color?

While I don’t have an actual measurement of light throughout the eclipse, I can show the general idea of a light change with color by using a linear change in color from the start of the eclipse to totality and then another linear change in color from totality to the end of the eclipse.

My first idea for creating a variable was to use information on the current time vs totality start/end. After subtracting the times before totality from totality start and subtracting totality end from times after totality, I realized that the amount of time before totality wasn’t actually the same as the amount of time after totality. Back to the drawing board.

Since I was making a linear change in color, I realized I could make a sequence of values before totality and after totality that covered the same range but had a different total number of values. This would account for the difference in the length of time before and after totality. I ended up making a sequence going from 100 to 0 for times before totality and a sequence from 0 to 100 after totality. Times during totality were assigned a 0.

Here’s one way to get these sequences, using base::replace(). My dataset is in order by time, which is key to this working correctly.

color_dat = mutate(color_dat,
                   color = 0,
                   color = replace(color,
                                   time < totality$start,
                                   seq(100, 0, length.out = sum(time < totality$start) ) ),
                   color = replace(color,
                                   time > totality$end,
                                   seq(0, 100, length.out = sum(time > totality$end) ) ) )
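To see what those two replace() calls do, here is the same pattern on a toy vector standing in for nine “seconds”: three before totality, two during, and four after.

```r
x <- rep(0, 9)
before <- 1:3  # positions before totality
after  <- 6:9  # positions after totality

# Ramp down to 0 before totality, ramp back up to 100 after it
x <- replace(x, before, seq(100, 0, length.out = length(before)))
x <- replace(x, after,  seq(0, 100, length.out = length(after)))

x
# 100 50 0 0 0 0 33.3 66.7 100 (to one decimal place)
```

Both ramps cover the same 0–100 range but with different numbers of steps, which is exactly how the uneven before/after durations are handled.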

## Adding one geom_segment() per second

Once I had my color variable I was ready to plot the segments along the x axis. Since the segments needed to span the full height of the plot, I set y and yend to -Inf and Inf, respectively.

I put this layer first to use it as a background that the temperature line was plotted on top of.

g1 = ggplot(plottemp) +
    geom_segment(data = color_dat,
                 aes(x = time, xend = time,
                     y = -Inf, yend = Inf, color = color),
                 show.legend = FALSE) +
    geom_line( aes(datetime, tempf), size = 1 ) +
    scale_x_datetime( date_breaks = "15 min",
                      date_labels = "%H:%M",
                      expand = c(0, 0) ) +
    coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
    labs(y = expression( Temperature~(degree*F) ),
         x = NULL,
         title = "Temperature during 2017-08-21 solar eclipse",
         subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
         caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds") +
    scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9,
                                           name = expression( Temperature~(degree*C) ),
                                           breaks = seq(16, 24, by = 1) ) ) +
    theme_bw(base_size = 14) +
    theme(panel.grid = element_blank() )

g1

## Switching to a gray scale

The default blue color scheme for the segments actually works OK, but I was picturing going from white to dark. I picked gray colors with grDevices::gray.colors() in scale_color_gradient(). In gray.colors(), 0 is black and 1 is white. I didn’t want the colors to go all the way to black, since that would make the temperature line impossible to see during totality. And, of course, it’s not actually pitch black during totality.

g1 + scale_color_gradient(low = gray.colors(1, 0.25),
                          high = gray.colors(1, 1) )

# Using segments to make a gradient rectangle

I can use this same approach on only a portion of the x axis to give the appearance of a rectangle with gradient fill. Here’s an example using times outside the eclipse.

g2 = ggplot(temp) +
    geom_segment(data = color_dat,
                 aes(x = time, xend = time,
                     y = -Inf, yend = Inf, color = color),
                 show.legend = FALSE) +
    geom_line( aes(datetime, tempf), size = 1 ) +
    scale_x_datetime( date_breaks = "1 hour",
                      date_labels = "%H:%M",
                      expand = c(0, 0) ) +
    labs(y = expression( Temperature~(degree*F) ),
         x = NULL,
         title = "Temperature during 2017-08-21 solar eclipse",
         subtitle = expression(italic("Sapsucker Farm, Dallas, OR, USA") ),
         caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds") +
    scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9,
                                           name = expression( Temperature~(degree*C) ),
                                           breaks = seq(12, 24, by = 2) ) ) +
    scale_color_gradient(low = gray.colors(1, 0.25),
                         high = gray.colors(1, 1) ) +
    theme_bw(base_size = 14) +
    theme(panel.grid.major.x = element_blank(),
          panel.grid.minor = element_blank() )

g2

## Bonus: annotations with curved arrows

This second plot gives me a chance to try out Cédric Scherer’s very cool curved annotation arrow idea for the first time.

g2 = g2 +
    annotate("text", x = as.POSIXct("2017-08-21 08:00"),
             y = 74,
             label = "Partial eclipse begins\n09:05:10 PDT",
             color = "grey24") +
    annotate("text", x = as.POSIXct("2017-08-21 09:00"),
             y = 57,
             label = "Totality begins\n10:16:55 PDT",
             color = "grey24")
g2

I’ll make a data.frame for the arrow start/end positions. I’m skipping all the work it took to get the positions where I wanted them, which is always an iterative process for me.

arrows = data.frame(x1 = as.POSIXct( c("2017-08-21 08:35",
                                       "2017-08-21 09:34") ),
                    x2 = c(eclipse$start, totality$start),
                    y1 = c(74, 57.5),
                    y2 = c(72.5, 60) )

I added arrows with geom_curve(). I changed the size of the arrowhead and made it closed in arrow(). I also thought the arrows looked better with a little less curvature.

g2 +
    geom_curve(data = arrows,
               aes(x = x1, xend = x2,
                   y = y1, yend = y2),
               arrow = arrow(length = unit(0.075, "inches"),
                             type = "closed"),
               curvature = 0.25)

# Other ways to make a gradient color background

Based on a bunch of internet searches, it looks like a gradient background in ggplot2 generally takes some work. There are some nice examples out there on how to use rasterGrob() and annotation_custom() to add background gradients, such as in this Stack Overflow question. I haven’t researched how to make this go from light to dark and back to light for the uneven time scale like in my example.

I’ve also seen approaches involving dataset expansion and drawing many filled rectangles or using rasters, which is similar to what I did with geom_segment().
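For reference, here is a minimal sketch of the grid::rasterGrob() plus ggplot2::annotation_custom() approach on toy data. It draws a single left-to-right, light-to-dark gradient; the light-dark-light pattern over an uneven time scale would take more work.

```r
library(ggplot2)
library(grid)

# A 1-row matrix of colours, stretched and interpolated across the whole panel
grad <- rasterGrob(matrix(gray.colors(255, start = 1, end = 0.25), nrow = 1),
                   width = unit(1, "npc"), height = unit(1, "npc"),
                   interpolate = TRUE)

# Toy data just to have something to plot over the gradient
ggplot(data.frame(x = 1:10, y = rnorm(10)), aes(x, y)) +
    annotation_custom(grad, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) +
    geom_line()
```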

# Eclipses!

Before actually experiencing totality, it seemed to me like the difference between a 99% and a 100% eclipse wasn’t a big deal. I mean, those numbers are pretty darn close.

I was very wrong.

If you are ever lucky enough to be near a path of totality, definitely try to get there, even if it’s a little more trouble than the 99.9% partial eclipse. It’s an amazing experience.

Here’s the code without all the discussion. Copy and paste the code below, or download an R script of uncommented code from here.

library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3

eclipse = data.frame(start = as.POSIXct("2017-08-21 09:05:10"),
end = as.POSIXct("2017-08-21 11:37:19") )

totality = data.frame(start = as.POSIXct("2017-08-21 10:16:55"),
end = as.POSIXct("2017-08-21 10:18:52") )

plottemp = filter(temp, between(datetime,
as.POSIXct("2017-08-21 09:00:00"),
as.POSIXct("2017-08-21 12:00:00") ) )
ggplot(plottemp) +
geom_line( aes(datetime, tempf), size = 1 ) +
scale_x_datetime( date_breaks = "15 min",
date_labels = "%H:%M",
expand = c(0, 0) ) +
coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
labs(y = expression( Temperature~(degree*F) ),
x = NULL,
title = "Temperature during 2017-08-21 solar eclipse",
subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
) +
scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
name =  expression( Temperature~(degree*C)),
breaks = seq(16, 24, by = 1)) ) +
theme_bw(base_size = 14) +
theme(panel.grid = element_blank() )
color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = "1 sec") )
color_dat = mutate(color_dat,
color = 0,
color = replace(color,
time < totality$start, seq(100, 0, length.out = sum(time < totality$start) ) ),
color = replace(color,
time > totality$end, seq(0, 100, length.out = sum(time > totality$end) ) ) )
g1 = ggplot(plottemp) +
geom_segment(data = color_dat,
aes(x = time, xend = time,
y = -Inf, yend = Inf, color = color),
show.legend = FALSE) +
geom_line( aes(datetime, tempf), size = 1 ) +
scale_x_datetime( date_breaks = "15 min",
date_labels = "%H:%M",
expand = c(0, 0) ) +
coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
labs(y = expression( Temperature~(degree*F) ),
x = NULL,
title = "Temperature during 2017-08-21 solar eclipse",
subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
) +
scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
name =  expression( Temperature~(degree*C)),
breaks = seq(16, 24, by = 1)) ) +
theme_bw(base_size = 14) +
theme(panel.grid = element_blank() )

g1

g1 + scale_color_gradient(low = gray.colors(1, 0.25),
high = gray.colors(1, 1) )
g2 = ggplot(temp) +
geom_segment(data = color_dat,
aes(x = time, xend = time,
y = -Inf, yend = Inf, color = color),
show.legend = FALSE) +
geom_line( aes(datetime, tempf), size = 1 ) +
scale_x_datetime( date_breaks = "1 hour",
date_labels = "%H:%M",
expand = c(0, 0) ) +
labs(y = expression( Temperature~(degree*F) ),
x = NULL,
title = "Temperature during 2017-08-21 solar eclipse",
subtitle = expression(italic("Sapsucker Farm, Dallas, OR, USA") ),
caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
) +
scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
name =  expression( Temperature~(degree*C)),
breaks = seq(12, 24, by = 2)) ) +
scale_color_gradient(low = gray.colors(1, 0.25),
high = gray.colors(1, 1) ) +
theme_bw(base_size = 14) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank() )

g2
g2 = g2 +
annotate("text", x = as.POSIXct("2017-08-21 08:00"),
y = 74,
label = "Partial eclipse begins\n09:05:10 PDT",
color = "grey24") +
annotate("text", x = as.POSIXct("2017-08-21 09:00"),
y = 57,
label = "Totality begins\n10:16:55 PDT",
color = "grey24")
g2

arrows = data.frame(x1 = as.POSIXct( c("2017-08-21 08:35",
"2017-08-21 09:34") ),
x2 = c(eclipse$start, totality$start),
y1 = c(74, 57.5),
y2 = c(72.5, 60) )
g2 +
geom_curve(data = arrows,
aes(x = x1, xend = x2,
y = y1, yend = y2),
arrow = arrow(length = unit(0.075, "inches"),
type = "closed"),
curvature = 0.25)

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## October 13, 2019

### What’s going on on PyPI

Scanning all newly published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here those that might be worth a look, worth following, or that might inspire you in some way.

GN
A hierarchical community detection algorithm by Girvan and Newman. A Girvan–Newman step is defined as a couple of successive edge removals such that a new community occurs.

meditorch
A PyTorch package for biomedical image processing

metriculous
Very unstable library containing utilities to measure and visualize statistical properties of machine learning models.

podworld
2D partially observable dynamic world for RL experiments. PodWorld is an OpenAI Gym (https://gym.openai.com) environment for reinforcement learning (http://…/the-book-2nd.html) experimentation. PodWorld is specifically designed to be partially observable and dynamic (hence the abbreviation P.O.D.). We emphasize these two attributes to force agents to learn spatial as well as temporal representations that must go beyond simple memorization. In addition, all entities in PodWorld must obey the laws of physics, allowing for a long tail of emergent observations that may not appear in games designed with arbitrary hand-crafted rules for human entertainment. PodWorld is designed to be highly customizable as well as hackable, with a fairly simple and minimal code base. PodWorld is designed to be fast (>500 FPS on typical laptops) without needing a GPU, and runs cross-platform in headless mode or with rendering.

sensplit
Splits a dataset (in Pandas dataframe format) to train/test sets.

simple-gpu-scheduler
A simple scheduler for running commands on multiple GPUs, one command per GPU. Following the KISS principle (https://…/KISS_principle), this script simply accepts commands via stdin and executes each on a specific GPU by setting the CUDA_VISIBLE_DEVICES environment variable.

simple-s3

sparkle-hypothesis
Use the power of Hypothesis property-based testing in PySpark tests. Data-heavy tests benefit from Hypothesis generating your data and designing your tests. Sparkle-hypothesis makes it easy to use Hypothesis strategies to generate dataframes.

sstk
A systematic strategy toolkit.

TeaML
Automated modeling in the financial domain. TeaML is a simple and design-friendly automatic modeling framework. It can model automatically from beginning to end and, at the end, will also output a report about the model.

### Did Russia Use Manafort’s Polling Data in 2016 Election?


Introduction:

On August 2, 2016, then–Trump campaign manager Paul Manafort gave polling data to Konstantin Kilimnik, a Russian widely assumed to be a spy. Before then, Manafort had ordered his protégé, Rick Gates, to share polling data with Kilimnik, which Gates did periodically starting in April or May. The Mueller Report stated it did not know why Manafort was insistent on sharing this information or whether the Russians used it to further Trump’s cause (p. 130; see here for my summary of Mueller Report V1).

One theory says that Manafort wanted to show the good work he was doing to Kilimnik’s boss, a Russian oligarch named Deripaska, to whom Manafort owed money. A more sinister hypothesis is that Manafort knew the information would be valuable in the hands of Russians trying to interfere with the election.

This post will analyze whether the Russians used the polling data, irrespective of Manafort’s intent. I looked at Russian Facebook ads uncovered by the House Intelligence Committee and tried to identify any changes in messaging after August 2nd. I conclude with a guess about what the polling data contained.

The House Intelligence Committee released thousands of Russian advertisements placed by the Internet Research Agency. There have been several analyses of these advertisements that discuss their effectiveness; one good one is by Spangher et al. However, I couldn’t find any that showed topics of advertisements over time.

I focused the analysis on data from 2016, which includes the period of Manafort coming into the position of campaign manager and the election itself in November. Overall, there are 1,858 Facebook ads captured in this dataset. Below is a time series plot of the number of advertisements per day for 2016.

There are periods of high activity in May / June and in October right before the election.

Change After August 2nd?

Each advertisement has metadata and text associated with it, including date, text, target population, etc. To look for changes through time, in particular around August 2nd, I tried some topic modeling and text clustering. I couldn’t find any changes or trends using an unsupervised approach.

Instead I built a predictive model with the response being a binary variable, before/after August 2nd, and the explanatory variables being text features from each ad (over 1,200 words). I then computed variable importance on these words to see which were most predictive. Below I plotted the number of ads containing the important words divided by the number of advertisements on a particular day, to get a normalized percentage.
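The post doesn’t say which classifier was used. As a purely hypothetical sketch of the general approach — a binary before/after response, word-count features, and a ranking of the most predictive words — here is a base-R version using logistic regression on made-up toy data (the word list and counts are invented for illustration):

```r
set.seed(1)
# Toy stand-in for the real ad data: word counts per ad plus a before/after label
words <- c("freedom", "rights", "police", "vote", "history")
X <- as.data.frame(matrix(rpois(200 * 5, lambda = 1), ncol = 5,
                          dimnames = list(NULL, words)))
after <- rep(c(0, 1), each = 100)  # response: was the ad run after 8/2?

# Logistic regression as a stand-in classifier; |z value| as a crude importance score
fit <- glm(after ~ ., data = cbind(X, after), family = binomial)
z <- summary(fit)$coefficients[-1, "z value"]
sort(abs(z), decreasing = TRUE)
```

On real data one would use a document-term matrix and a model with a proper importance measure, but the shape of the analysis is the same.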

The blue line is when Manafort made initial contact with Kilimnik and the red line is the August 2nd meeting. There do appear to be large increases in the words associated with African American civil rights topics after 8/2. Specifically, these words were not in the advertisement texts themselves but were in the ‘people who liked’ description. That is, if you liked ‘Martin Luther King’ on your profile, then a particular ad would target you.

Another way to look at this information is to see the proportion of these words used before and after 8/2.

The above plot shows the number of times a word appeared before and after 8/2, and P(date > 8/2 | word). For instance the word 1954, signifying the beginning of civil rights, occurred 4 times before and 376 times after 8/2, which means that just under 99% of its appearances happened after that date. This suggests there was a change in the IRA advertisements: they focused more on targeting people interested in African American civil rights issues.
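The arithmetic behind that “1954” example:

```r
# Occurrences of the word "1954" before and after August 2nd
before <- 4
after  <- 376

# P(date > 8/2 | word): share of occurrences falling after 8/2
after / (before + after)
# 0.9894737, i.e. just under 99%
```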

Conclusions / Discussions

I’m guessing that the contents of the polling data were something related to African Americans, and how those with an interest in the civil rights movement are more susceptible to negative ads.

Do I think the evidence presented here is strong enough to believe the Russians used the polling data? Meh, not really. For a few reasons:

• All the words found here were also used a few times before 8/2
• Gates gave information on a continuous basis. If the Russians used this data, I assume they would have incorporated it as it arrived, and there would not be a discrete change at 8/2
• I only did this for one date. Perhaps if I ran this analysis for arbitrary dates, I would find other words associated with other dates

I’m not saying that they didn’t use the polling data, but I don’t think the evidence here is strong enough to say that they did. At a minimum I think the IRA and the Russians adapted ads to target different populations at different points in time. This shows they are sophisticated and probably learn from previous results.

Code


### If you did not already know

Policy Learning based on Completely Behavior Cloning (PLCBC)
Direct policy search is one of the most important algorithm of reinforcement learning. However, learning from scratch needs a large amount of experience data and can be easily prone to poor local optima. In addition to that, a partially trained policy tends to perform dangerous action to agent and environment. In order to overcome these challenges, this paper proposed a policy initialization algorithm called Policy Learning based on Completely Behavior Cloning (PLCBC). PLCBC first transforms the Model Predictive Control (MPC) controller into a piecewise affine (PWA) function using multi-parametric programming, and uses a neural network to express this function. By this way, PLCBC can completely clone the MPC controller without any performance loss, and is totally training-free. The experiments show that this initialization strategy can help agent learn at the high reward state region, and converge faster and better. …

PredRNN++
We present PredRNN++, an improved recurrent network for video predictive learning. In pursuit of a greater spatiotemporal modeling capability, our approach increases the transition depth between adjacent states by leveraging a novel recurrent unit, which is named Causal LSTM for re-organizing the spatial and temporal memories in a cascaded mechanism. However, there is still a dilemma in video predictive learning: increasingly deep-in-time models have been designed for capturing complex variations, while introducing more difficulties in the gradient back-propagation. To alleviate this undesirable effect, we propose a Gradient Highway architecture, which provides alternative shorter routes for gradient flows from outputs back to long-range inputs. This architecture works seamlessly with causal LSTMs, enabling PredRNN++ to capture short-term and long-term dependencies adaptively. We assess our model on both synthetic and real video datasets, showing its ability to ease the vanishing gradient problem and yield state-of-the-art prediction results even in a difficult objects occlusion scenario. …

ReSIFT
This paper presents a full-reference image quality estimator based on SIFT descriptor matching over reliability-weighted feature maps. Reliability assignment includes a smoothing operation, a transformation to perceptual color domain, a local normalization stage, and a spectral residual computation with global normalization. The proposed method ReSIFT is tested on the LIVE and the LIVE Multiply Distorted databases and compared with 11 state-of-the-art full-reference quality estimators. In terms of the Pearson and the Spearman correlation, ReSIFT is the best performing quality estimator in the overall databases. Moreover, ReSIFT is the best performing quality estimator in at least one distortion group in compression, noise, and blur category. …

Feature Generation by Convolutional Neural Network (FGCNN)
Click-Through Rate prediction is an important task in recommender systems, which aims to estimate the probability of a user to click on a given item. Recently, many deep models have been proposed to learn low-order and high-order feature interactions from original features. However, since useful interactions are always sparse, it is difficult for DNN to learn them effectively under a large number of parameters. In real scenarios, artificial features are able to improve the performance of deep models (such as Wide & Deep Learning), but feature engineering is expensive and requires domain knowledge, making it impractical in different scenarios. Therefore, it is necessary to augment feature space automatically. In this paper, We propose a novel Feature Generation by Convolutional Neural Network (FGCNN) model with two components: Feature Generation and Deep Classifier. Feature Generation leverages the strength of CNN to generate local patterns and recombine them to generate new features. Deep Classifier adopts the structure of IPNN to learn interactions from the augmented feature space. Experimental results on three large-scale datasets show that FGCNN significantly outperforms nine state-of-the-art models. Moreover, when applying some state-of-the-art models as Deep Classifier, better performance is always achieved, showing the great compatibility of our FGCNN model. This work explores a novel direction for CTR predictions: it is quite useful to reduce the learning difficulties of DNN by automatically identifying important features. …

### Using R: Animal model with simulated data

[This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Last week’s post just happened to use MCMCglmm as an example of an R package that can get confused by tibble-style data frames. To make that example, I simulated some pedigree and trait data. Just for fun, let’s look at the simulation code, and use MCMCglmm and AnimalINLA to get heritability estimates.

First, here is some AlphaSimR code that creates a small random-mating population and collects trait and pedigree data:

library(AlphaSimR)

## Founder population
FOUNDERPOP <- runMacs(nInd = 100,
                      nChr = 20,
                      inbred = FALSE,
                      species = "GENERIC")

## Simulation parameters
SIMPARAM <- SimParam$new(FOUNDERPOP)
SIMPARAM$addTraitA(nQtlPerChr = 100,
                   mean = 100,
                   var = 10)
SIMPARAM$setGender("yes_sys")
SIMPARAM$setVarE(h2 = 0.3)

## Random mating for 9 more generations
generations <- vector(mode = "list", length = 10)
generations[[1]] <- newPop(FOUNDERPOP,
                           simParam = SIMPARAM)

for (gen in 2:10) {

    generations[[gen]] <- randCross(generations[[gen - 1]],
                                    nCrosses = 10,
                                    nProgeny = 10,
                                    simParam = SIMPARAM)

}

## Put them all together
combined <- Reduce(c, generations)

## Extract phenotypes
pheno <- data.frame(animal = combined@id,
                    pheno = combined@pheno[, 1])

## Extract pedigree
ped <- data.frame(id = combined@id,
                  dam = combined@mother,
                  sire = combined@father)
ped$dam[ped$dam == 0] <- NA
ped$sire[ped$sire == 0] <- NA

## Write out the files
write.csv(pheno,
          file = "sim_pheno.csv",
          row.names = FALSE,
          quote = FALSE)

write.csv(ped,
          file = "sim_ped.csv",
          row.names = FALSE,
          quote = FALSE)


In turn, we:

1. Set up a founder population with AlphaSimR’s generic livestock-like population history, and 20 chromosomes.
2. Choose simulation parameters: we have an organism with separate sexes, a quantitative trait with an additive polygenic architecture, and we want an environmental variance to give us a heritability of 0.3.
3. We store away the founders as the first generation, then run a loop to give us nine additional generations of random mating.
4. Combine the resulting generations into one population.
5. Extract phenotypes and pedigree into their own data frames.
6. Optionally, save the latter data frames to files (for the last post).

Now that we have some data, we can fit a quantitative genetic pedigree model ("animal model") to estimate genetic parameters. We’re going to try two methods to fit it: Markov chain Monte Carlo and (the unfortunately named) integrated nested Laplace approximation. MCMC explores the posterior distribution by sampling; I’m not sure where I heard it described as "exploring a mountain by random teleportation". INLA makes approximations to the posterior that can be integrated numerically; I guess it’s more like building a sculpture of the mountain.

First, a Gaussian animal model in MCMCglmm:

library(MCMCglmm)

## Gamma priors for variances
prior_gamma <- list(R = list(V = 1, nu = 1),
                    G = list(G1 = list(V = 1, nu = 1)))

## The model below uses a scaled response, so create it first
## (scaling choice assumed; the original scaled variable isn't shown above)
pheno$scaled <- as.numeric(scale(pheno$pheno))

## Fit the model
model_mcmc <- MCMCglmm(scaled ~ 1,
                       random = ~ animal,
                       family = "gaussian",
                       prior = prior_gamma,
                       pedigree = ped,
                       data = pheno,
                       nitt = 100000,
                       burnin = 10000,
                       thin = 10)

## Calculate heritability from variance components
h2_mcmc_object <- model_mcmc$VCV[, "animal"] /
    (model_mcmc$VCV[, "animal"] + model_mcmc$VCV[, "units"])

## Summarise results from that posterior
h2_mcmc <- data.frame(mean = mean(h2_mcmc_object),
                      lower = quantile(h2_mcmc_object, 0.025),
                      upper = quantile(h2_mcmc_object, 0.975),
                      method = "MCMC",
                      stringsAsFactors = FALSE)


And here is a similar animal model in AnimalINLA:

library(AnimalINLA)

## Format pedigree to AnimalINLA's tastes
ped_inla <- ped
ped_inla$id <- as.numeric(ped_inla$id)
ped_inla$dam <- as.numeric(ped_inla$dam)
ped_inla$dam[is.na(ped_inla$dam)] <- 0
ped_inla$sire <- as.numeric(ped_inla$sire)
ped_inla$sire[is.na(ped_inla$sire)] <- 0

## Turn into a relationship matrix
A_inv <- compute.Ainverse(ped_inla)

## Fit the model
model_inla <- animal.inla(response = scaled,
                          genetic = "animal",
                          Ainverse = A_inv,
                          type.data = "gaussian",
                          data = pheno,
                          verbose = TRUE)

## Pull out summaries from the model object
summary_inla <- summary(model_inla)

## Summarise results
h2_inla <- data.frame(mean = summary_inla$summary.hyperparam["Heritability", "mean"],
                      lower = summary_inla$summary.hyperparam["Heritability", "0.025quant"],
                      upper = summary_inla$summary.hyperparam["Heritability", "0.975quant"],
                      method = "INLA",
                      stringsAsFactors = FALSE)


If we wrap this all in a loop, we can see how the estimation methods do on replicate data (full script on GitHub). Here are estimates and intervals from ten replicates (black dots show the actual heritability in the first generation):

As you can see, the MCMC and INLA estimates agree pretty well and mostly hit the mark. In the one replicate dataset where they falter, they falter together.


### Rename Columns | R

[This article was first published on Data Science Using R – FinderDing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Often the data you’re working with has abstract column names, such as (x1, x2, x3…). Typically, the first step I take when renaming columns with R is opening my web browser.

The dataset cars contains data from the 1920s on “Speed and Stopping Distances of Cars”. There are only 2 columns, shown below.

colnames(datasets::cars)
[1] "speed" "dist" 

If we wanted to rename the column “dist” to make it easier to know what the data is/means we can do so in a few different ways.

## Using dplyr:

cars %>%
    rename("Stopping Distance (ft)" = dist) %>%
    colnames()

[1] "speed"                  "Stopping Distance (ft)"

cars %>%
    rename("Stopping Distance (ft)" = dist, "Speed (mph)" = speed) %>%
    colnames()

[1] "Speed (mph)"            "Stopping Distance (ft)"

## Using base R:

colnames(cars)[2] <- "Stopping Distance (ft)"
colnames(cars)

[1] "speed"                  "Stopping Distance (ft)"

colnames(cars)[1:2] <- c("Speed (mph)", "Stopping Distance (ft)")
colnames(cars)

[1] "Speed (mph)"            "Stopping Distance (ft)"

## Using grep():

colnames(cars)[grep("dist", colnames(cars))] <- "Stopping Distance (ft)"
colnames(cars)

[1] "speed"                  "Stopping Distance (ft)"


### Distilled News

A time series is a series of data points ordered in time. Time series add an explicit order dependence between observations: a time dimension. In a normal machine learning dataset, the dataset is a collection of observations that are treated equally when the future is being predicted. In a time series, the order of observations provides a source of additional information that should be analyzed and used in the prediction process. Time series are typically assumed to be generated at regularly spaced intervals of time (e.g. daily temperature), and so are called regular time series. But the data in a time series doesn’t have to come at regular time intervals; in that case it is called an irregular time series. In an irregular time series the data follows a temporal sequence, but the measurements might not occur at regular time intervals. For example, the data might be generated as a burst or with varying time intervals [1]. Account deposits or withdrawals from an ATM are examples of an irregular time series. Time series can have one or more variables that change over time. If there is only one variable varying over time, we call it a univariate time series. If there is more than one variable, it is called a multivariate time series; for example, a tri-axial accelerometer has three acceleration variables, one for each axis (x, y, z), that vary simultaneously over time.
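The regular/irregular distinction is easy to see in code; a small sketch with base R date-times (the timestamps are invented for illustration):

```r
set.seed(42)
start <- as.POSIXct("2019-10-13 00:00:00", tz = "UTC")

# Regular series: observations every hour
regular <- seq(start, by = "1 hour", length.out = 5)

# Irregular series: exponentially distributed gaps, like ATM withdrawals
irregular <- start + cumsum(round(rexp(4, rate = 1 / 3600)))

diff(regular)    # constant gaps: 1 hour each
diff(irregular)  # varying gaps
```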
An end to end guide on how to reduce the number of Features in a Dataset with practical examples in Python.
One of the primary challenges for the realization of near-term quantum computers has to do with their most basic constituent: the qubit. Qubits can interact with anything in close proximity that carries energy close to their own – stray photons (i.e., unwanted electromagnetic fields), phonons (mechanical oscillations of the quantum device), or quantum defects (irregularities in the substrate of the chip formed during manufacturing) – which can unpredictably change the state of the qubits themselves. Further complicating matters, there are numerous challenges posed by the tools used to control qubits. Manipulating and reading out qubits is performed via classical controls: analog signals in the form of electromagnetic fields coupled to a physical substrate in which the qubit is embedded, e.g., superconducting circuits. Imperfections in these control electronics (giving rise to white noise), interference from external sources of radiation, and fluctuations in digital-to-analog converters, introduce even more stochastic errors that degrade the performance of quantum circuits. These practical issues impact the fidelity of the computation and thus limit the applications of near-term quantum devices. To improve the computational capacity of quantum computers, and to pave the road towards large-scale quantum computation, it is necessary to first build physical models that accurately describe these experimental problems.
CS 189 is the Machine Learning course at UC Berkeley. We have created this comprehensive course guide in order to share our knowledge with students and the general public, and hopefully to draw the interest of students from other universities to Berkeley’s Machine Learning curriculum. This guide was started by CS 189 TAs Soroush Nasiriany and Garrett Thomas in Fall 2017, with the assistance of William Wang and Alex Yang. We owe gratitude to Professors Anant Sahai, Stella Yu, and Jennifer Listgarten, as this book is heavily inspired by their lectures. In addition, we are indebted to Professor Jonathan Shewchuk for his machine learning notes, from which we drew inspiration. The latest version of this document can be found either at http://www.eecs189.org or http://snasiriany.me/cs189/. Please report any mistakes to the staff, and contact the authors if you wish to redistribute this document.
The modern business leader’s new responsibility in a brave new world ruled by data. As Data Science moves along the hype cycle and matures as a business function, so do the challenges that face the discipline. Over the last few years, the problem statement for data science has moved from ‘we waste 80% of our time preparing data’, via ‘production deployment is the most difficult part of data science’, to ‘lack of measurable business impact’.
If everyone had the time and desire to go to college and get an AI degree, you very likely wouldn’t be reading this blog. AI works in mysterious ways, but these five AI principles ought to help you avoid errors when dealing with this tech. A quick rundown of this post for the AI acolyte on the go:
1. Evaluate AI systems on unseen data
2. More data leads to better models
3. An ounce of clean data is worth a pound of dirty data
5. AI isn’t magic
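Principle 1, evaluating on unseen data, can be sketched in a few lines with scikit-learn; this is a minimal illustration on a toy dataset, not code from the original post:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out data the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Training accuracy is optimistic; the held-out score is the honest estimate
print(f"train accuracy: {model.score(X_train, y_train):.2f}")
print(f"test accuracy:  {model.score(X_test, y_test):.2f}")
```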
Learn how to transform and load (ETL) a data pipeline from scratch using R and SQLite to gather tweets in real-time and store them for future analyses.
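The post itself works in R; purely as an illustration of the same extract-transform-load shape, here is a minimal sketch using Python’s standard-library sqlite3 (the tweet fields are hypothetical stand-ins for what the Twitter API would return):

```python
import sqlite3

# Extract: in a real pipeline this batch would come from the Twitter API
raw_tweets = [
    {"id": 1, "text": "Learning #rstats today!", "lang": "en"},
    {"id": 2, "text": "Bonjour le monde", "lang": "fr"},
]

# Transform: keep only English tweets and normalize the text
rows = [(t["id"], t["text"].lower()) for t in raw_tweets if t["lang"] == "en"]

# Load: append into a SQLite table for future analyses
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # 1
```

Run on a schedule (cron, Task Scheduler), this append-only pattern accumulates tweets over time; `INSERT OR IGNORE` keeps re-fetched tweets from being stored twice.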
Last year, in a post, I discussed how to merge levels of factor variables using combinatorial techniques (it was for my STT5100 course, and trees are not in the syllabus), with an extension on trees at the end of the post.
When you consider the benefits of data science, you might guess that it has played a role in your daily life. After all, it affects not only what you do online, but what you do offline. Companies are using massive amounts of data to create better ads, produce tailored recommendations, and, in the case of retail stores, stock shelves. It’s also shaping how, and who, we love. Here’s how data impacts us daily.
Python can be lots of fun. It’s not a difficult task to re-invent some built-in function that you don’t know exists in the first place, but why would you want to do that? Today we’ll take a look at three of those functions which I use more or less on a daily basis, but was unaware of for a good portion of my data science career.
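The blurb doesn’t reveal which three functions the post covers, but the phenomenon is easy to demonstrate with two classic re-inventions (illustrative examples, not necessarily the ones in the post):

```python
from collections import Counter
from itertools import chain

words = ["spam", "egg", "spam", "bacon", "spam"]

# A hand-rolled tally loop...
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

# ...is already a batteries-included tool:
assert counts == Counter(words)

# Likewise, flattening nested lists by hand...
nested = [[1, 2], [3], [4, 5]]
flat = []
for sub in nested:
    flat.extend(sub)

# ...is itertools.chain.from_iterable in one line:
assert flat == list(chain.from_iterable(nested))
```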
In this series of six posts we will leave the basics of prediction aside and look at a more handcrafted aspect of data science and machine learning: interpreting, gaining insights, and understanding what goes on within algorithms. This is not a trivial matter, and it is far from exhausted. We’ll start with the simplest statistical models and work up to the most sophisticated ensemble and deep learning models, trying to figure out the whys of a prediction.
PyTorch has become one of the most popular deep learning frameworks on the market and certainly a favorite of the research community when it comes to experimentation. As a reference, PyTorch citations in papers on arXiv grew 194 percent in the first half of 2019 alone, as noted by O’Reilly. For years, Facebook has based its deep learning work on a combination of PyTorch and Caffe2 and has put a lot of resources into supporting the PyTorch stack and developer community. Yesterday, Facebook released the latest version of PyTorch, which showcases some state-of-the-art deep learning capabilities. There have been plenty of articles covering the launch of PyTorch 1.3, so instead I would like to focus on some of the new projects accompanying the new release of the deep learning framework. Arguably, the most impressive capability of PyTorch is how quickly it has been able to incorporate implementations of new research techniques. Not surprisingly, the artificial intelligence (AI) research community has started adopting PyTorch as one of the preferred stacks to experiment with new deep learning methods. The new release of PyTorch continues this trend by adding some impressive open source projects surrounding the core stack.
An end-to-end guide on how to reduce a dataset’s dimensionality using feature extraction techniques such as PCA, ICA, LDA, LLE, t-SNE and AE.
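As a taste of the first technique on that list, here is a minimal PCA sketch with scikit-learn on synthetic data (the dataset and 95% variance threshold are illustrative choices, not from the article):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 samples, 10 correlated features: the data really lives in ~3 dimensions
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

# Keep only as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95).fit(X)
X_reduced = pca.transform(X)

print(X.shape, "->", X_reduced.shape)  # far fewer columns, little variance lost
```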
How to cope with AI and start becoming a part of it. If you open a news site today, you are almost sure to be met with an article about AI, robotics, quantum computing, genetic engineering, autonomous vehicles, natural language processing, or another technology from the box called ‘the fourth industrial revolution’. Rating these technologies makes no sense, as they all have a staggering potential to change our world forever. Artificial intelligence, however, is already surging into all of the other technologies: facilitating the mastery of big data, pattern recognition, and prediction is an inherent quality of AI, and it is frequently applied to support ground-breaking discoveries in the other fields. I once heard a driving instructor compare holding your hands on the steering wheel to holding a gun, because of how dangerous driving a car is. AI is dangerous too, and we need to face its dark side, not only revel in the glorious benefits it brings us. Anything else would be reckless driving.
Natural language processing (NLP) is becoming the most ubiquitous application in the modern deep learning ecosystem. From support in popular deep learning frameworks to APIs in cloud runtimes such as Google Cloud, Azure, AWS or Bluemix, NLP is an omnipresent component of deep learning platforms. Despite the incredible progress, building NLP applications at scale remains incredibly challenging, often surfacing strong friction between the possibilities of research/experimentation and the realities of model serving/deployment. Operating some of the biggest conversational environments on the market, Facebook has been facing the challenges of building NLP applications at scale for years. Recently, the Facebook engineering team open sourced the first version of PyText, a PyTorch-based framework for building faster and more efficient NLP solutions. The ultimate goal of PyText is to provide a simpler experience for the end-to-end implementation of NLP workflows. To achieve that, PyText needs to address some of the existing friction points in NLP workflows. Among those friction points, the most troublesome is the mismatch between the experimentation and model-serving stages in the lifecycle of an NLP application.
Data science and software development are two very different fields. Trying to use the Agile methodology in the same way as you would on a software project for a data science project doesn’t really work. When it comes to data science, there tends to be a lot of investigation, exploration, testing, and tuning. In data science, you deal with unknown data which can lead to an unknown result. Software development, on the other hand, has structured data with known results; the programmers already know what they want to build (although their clients may not).
In the past seven projects, we implemented the same project using different classification algorithms, namely Logistic Regression, KNN, SVM, Kernel SVM, Naive Bayes, Decision Tree, and Random Forest. The reason I wrote a separate article for each was to explain the intuition behind each algorithm.

### When presenting a new method, talk about its failure modes.

A coauthor writes:

I really like the paper [we are writing] as it is. My only criticism of it perhaps would be that we present this great new method and discuss all of its merits, but we do not really discuss when it fails / what its downsides are. Are there any cases where the traditional analyses or some other analysis are more appropriate? Should we say that in the main body of the paper?

Good point! I’m gonna add a section to the paper called Failure Modes or something like that, to explore where our method makes things worse.

And, no, this is not the same as the traditional “Threats to Validity” section. The Threats to Validity section, like the Robustness Checks section, is typically a joke, in that its purpose is usually not to explore potential problems but rather to rule out potential objections.

Now, don’t get me wrong, I love our recommended method and so does my coauthor. It’s actually hard for us to think of examples where our approach would be worse than what people were doing before. But I’ll think about it. I agree we should write something about failure modes.

I love my collaborators. They’re just great.

### What’s going on on PyPI

Scanning all newly published packages on PyPI, I find that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, worth following, or which might inspire you in some way.

spacy-transformers
spaCy pipelines for pre-trained BERT and other transformers

stackerpy
Model Stacking for scikit-learn models for Machine Learning (including blending)

tabnet
Tensorflow 2.0 implementation of TabNet in any configuration. A Tensorflow 2.0 port of the paper (https://…/1908.07442), whose original codebase is available at https://…/tabnet.

TeaML
Automated modeling in the financial domain. TeaML is a simple, design-friendly automated modeling framework.

traintorch
Package for live visualization of metrics during training of a machine learning model

trax
Trax

VoicePy
A Python Library to interface with Alexa, Dialogflow, and Google Actions.

aiPool
Framework for Advanced Statistics and Data Sciences

cprior
Fast Bayesian A/B and multivariate testing

elasticdl
A Kubernetes-native Deep Learning Framework

### easyMTS: My First R Package (Story, and Results)


This weekend I decided to create my first R package… it’s here!

Although I’ve been using R for 15 years, developing a package has been the one thing slightly out of reach for me. Now that I’ve been through the process once, with a package that’s not completely done (but at least has a firm foundation, and is usable to some degree), I can give you some advice:

• Make sure you know R Markdown before you begin.
• Some experience with Git and Github will be useful. Lots of experience will be very, very useful.
• Write the functions that will go into your package in a file that you can source into another R program and use. If your programs work when you run the code this way, you will have averted many problems early.

The process I used to make this happen was:

I hope you enjoy following along with my process, and that it helps you write packages too. If I can do it, so can you!

The post easyMTS: My First R Package (Story, and Results) appeared first on Quality and Innovation.


### easyMTS R Package: Quick Solver for Mahalanobis-Taguchi System (MTS)


A new R package in development. Please cite if you use it.

The post easyMTS R Package: Quick Solver for Mahalanobis-Taguchi System (MTS) appeared first on Quality and Innovation.