# My Data Science Blogs

## January 20, 2018

### Distilled News

DataScience: Elevate is a full-day event dedicated to data science best practices. Register today to hear from experts at Uber, Facebook, Salesforce, and more. DataScience: Elevate provides a closer look at how today’s top companies use machine learning and artificial intelligence to do better business. Free to attend, this multi-city event features presentations, panels, and networking sessions designed to elevate data science work and connect you with the companies that are driving change in enterprise data science.
Imagine a world where machines understand what you want and how you are feeling when you call customer care: if you are unhappy about something, you speak to a person quickly. If you are looking for specific information, you may not need to talk to a person (unless you want to!). This is going to be the new order of the world, and you can already see it happening to a good degree. Check out the highlights of 2017 in the data science industry. You can see the breakthroughs that deep learning brought to fields that were difficult to tackle before. One such field that deep learning has the potential to help with is audio/speech processing, especially due to its unstructured nature and vast impact. So for the curious ones out there, I have compiled a list of tasks that are worth getting your hands dirty with when starting out in audio processing. I’m sure there will be a few more breakthroughs from deep learning in time to come. The article is structured to explain each task and its importance. For each task there is also a research paper that goes into the details, along with a case study that will help you get started in solving it. So let’s get cracking!
We’ve compiled a list of the hottest events and conferences from the world of Data Science, Machine Learning and Artificial Intelligence happening in 2018. Below are all the links you need to get yourself to these great events!
In this article, we have outlined some of the Scala libraries that can be very useful when performing major data science tasks. They have proved to be highly helpful and effective for achieving the best results.
How are you monitoring your Python applications? Take the short survey – the results will be published on KDnuggets and you will get all the details.
Propensity scores are an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible.
TensorFlow 1.4 was released a few weeks ago with an implementation of gradient boosting called TensorFlow Boosted Trees (TFBT). Unfortunately, the paper does not include any benchmarks, so I ran some against XGBoost. For many Kaggle-style data mining problems, XGBoost has been the go-to solution since its release in 2014. It’s probably as close to an out-of-the-box machine learning algorithm as you can get today, as it gracefully handles un-normalized or missing data, while being accurate and fast to train.
In a previous post, I outlined emerging applications of reinforcement learning (RL) in industry. I began by listing a few challenges facing anyone wanting to apply RL, including the need for large amounts of data, and the difficulty of reproducing research results and deriving the error estimates needed for mission-critical applications. Nevertheless, the success of RL in certain domains has been the subject of much media coverage. This has sparked interest, and companies are beginning to explore some of the use cases and applications I described in my earlier post. Many tasks and professions, including software development, are poised to incorporate some forms of AI-powered automation. In this post, I’ll describe how RISE Lab’s Ray platform continues to mature and evolve just as companies are examining use cases for RL. Assuming one has identified suitable use cases, how does one get started with RL? Most companies that are thinking of using RL for pilot projects will want to take advantage of existing libraries.
Any programming environment should be optimized for its task, and not all tasks are alike. For example, if you are exploring uncharted mountain ranges, the portability of a tent is essential. However, when building a house to weather hurricanes, investing in a strong foundation is important. Similarly, when beginning a new data science programming project, it is prudent to assess how much effort should be put into ensuring the code is reproducible. Note that it is certainly possible to go back later and “shore up” the reproducibility of a project where it is weak. This is often the case when an “ad-hoc” project becomes an important production analysis. However, the first step in starting a project is to make a decision regarding the trade-off between the amount of time to set up the project and the probability that the project will need to be reproducible in arbitrary environments.
Another simple yet powerful technique we can pair with pipelines to improve performance is grid search, which attempts to optimize model hyperparameter combinations.

### Document worth reading: “An Overview on Data Representation Learning: From Traditional Feature Learning to Recent Deep Learning”

For about 100 years, many representation learning approaches have been proposed to learn the intrinsic structure of data, including linear and nonlinear ones, supervised and unsupervised ones. In particular, deep architectures have been widely applied to representation learning in recent years and have delivered top results in many tasks, such as image classification, object detection and speech recognition. In this paper, we review the development of data representation learning methods. Specifically, we investigate both traditional feature learning algorithms and state-of-the-art deep learning models. The history of data representation learning is introduced, while available resources (e.g. online course, tutorial and book information) and toolboxes are provided. Finally, we conclude this paper with remarks and some interesting research directions on data representation learning.

## January 19, 2018

### Because it's Friday: Principles and Values

Most companies publish mission and vision statements, and some also publish a detailed list of principles that underlie the company ethos. But what makes a good collection of principles, and does writing them down really matter? At the recent Monktoberfest conference, Bryan Cantrill argued that yes, they do matter, mostly by way of some really egregious counterexamples.

That's all from the blog for this week. We'll be back on Monday — have a great weekend!

### Book Memo: “Advances in Hybridization of Intelligent Methods”

Models, Systems and Applications. This book presents recent research on the hybridization of intelligent methods, which refers to combining methods to solve complex problems. It discusses hybrid approaches covering different areas of intelligent methods and technologies, such as neural networks, swarm intelligence, machine learning, reinforcement learning, deep learning, agent-based approaches, knowledge-based systems and image processing. The book includes extended and revised versions of invited papers presented at the 6th International Workshop on Combinations of Intelligent Methods and Applications (CIMA 2016), held in The Hague, Holland, in August 2016. The book is intended for researchers and practitioners from academia and industry interested in using hybrid methods for solving complex problems.

Google has recently released a Jupyter Notebook platform called Google Colaboratory. You can run Python code in a browser, share results, and save your code for later. It currently does not support R code.

### The Friday #rstats PuzzleR : 2018-01-19

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Peter Meissner (@marvin_dpr) released crossword.r to CRAN today. It’s a spiffy package that makes it dead simple to generate crossword puzzles.

He also made a super spiffy javascript library to pair with it, which can turn crossword model output into an interactive puzzle.

I thought I’d combine those two creations with a way to highlight new/updated packages from the previous week, cool/useful packages in general, and some R functions that might come in handy. Think of it as a weekly way to get some R information while having a bit of fun!

This was a quick, rough creation and I’ll be changing the styles a bit for next Friday’s release, but Peter’s package is so easy to use that I have absolutely no excuse to not keep this a regular feature of the blog.

I’ll release a static, ggplot2 solution to each puzzle the following Monday. If you solve it before then, tweet a screenshot of your solution with the tags #rstats #puzzler and I’ll pick the first time-stamped one to highlight the following week.

I’ll also get a GitHub setup for suggestions/contributions to this effort + to hold the puzzle data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### If you did not already know

Robust Multiple Signal Classification (MUSIC)
In this paper, we introduce a new framework for robust multiple signal classification (MUSIC). The proposed framework, called robust measure-transformed (MT) MUSIC, is based on applying a transform to the probability distribution of the received signals, i.e., transformation of the probability measure defined on the observation space. In robust MT-MUSIC, the sample covariance is replaced by the empirical MT-covariance. By judicious choice of the transform we show that: 1) the resulting empirical MT-covariance is B-robust, with bounded influence function that takes negligible values for large norm outliers, and 2) under the assumption of spherically contoured noise distribution, the noise subspace can be determined from the eigendecomposition of the MT-covariance. Furthermore, we derive a new robust measure-transformed minimum description length (MDL) criterion for estimating the number of signals, and extend the MT-MUSIC framework to the case of coherent signals. The proposed approach is illustrated in simulation examples that show its advantages as compared to other robust MUSIC and MDL generalizations. …

Cumulative Gains Model Quality Metric
In developing risk models, developers employ a number of graphical and numerical tools to evaluate the quality of candidate models. These traditionally involve numerous measures including the KS statistic or one of many Area Under the Curve (AUC) methodologies on ROC and cumulative Gains charts. Typical employment of these methodologies involves one of two scenarios. The first is as a tool to evaluate one or more models and ascertain the effectiveness of that model. The second, however, is the inclusion of such a metric in the model building process itself, such as the way Ferri et al. proposed to use Area Under the ROC curve in the splitting criterion of a decision tree. However, these methods fail to address situations involving competing models where one model is not strictly above the other. Nor do they address differing values of end points, as the magnitudes of these typical measures may vary depending on target definition, making standardization difficult. Some of these problems are starting to be addressed. Marcade, Chief Technology Officer of the software vendor KXEN, gives an overview of several metric techniques and proposes a new solution to the problem in data mining techniques. Their software uses two statistics called KI and KR. We will examine the shortfalls he addresses more thoroughly and propose a new metric which can be used as an improvement to the KI and KR statistics. Although useful in a machine learning sense of developing a model, these same issues and solutions apply to evaluating a single model’s performance as related by Siddiqi and Mays with respect to risk scorecards. We will not specifically give examples of each application of the new statistics but rather make the claim that it is useful in most situations where an AUC or model separation statistic (such as KS) is used. …

Probabilistic D-Clustering
We present a new iterative method for probabilistic clustering of data. Given clusters, their centers and the distances of data points from these centers, the probability of cluster membership at any point is assumed inversely proportional to the distance from (the center of) the cluster in question. This assumption is our working principle. The method is a generalization, to several centers, of the Weiszfeld method for solving the Fermat-Weber location problem. At each iteration, the distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for all data points, and the centers are updated as convex combinations of these points, with weights determined by the above principle. Computations stop when the centers stop moving. Progress is monitored by the joint distance function, a measure of distance from all cluster centers, that evolves during the iterations, and captures the data in its low contours. The method is simple, fast (requiring a small number of cheap iterations) and insensitive to outliers. …
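As a rough illustration of the working principle in the abstract above, here is a minimal Python sketch using Euclidean distance. The membership rule (probability inversely proportional to distance) comes straight from the abstract. The specific center-update weight `p**2 / d` is an assumption about what "weights determined by the above principle" means, and the function names, `EPS` guard, and toy data are all hypothetical.

```python
from math import dist

EPS = 1e-12  # guard against division by zero when a point sits on a center


def memberships(point, centers):
    """Cluster probabilities inversely proportional to distance from each center."""
    inverse = [1.0 / (dist(point, c) + EPS) for c in centers]
    total = sum(inverse)
    return [v / total for v in inverse]


def update_centers(points, centers):
    """One iteration: each new center is a convex combination of the data points."""
    probs = [memberships(x, centers) for x in points]
    new_centers = []
    for k, center in enumerate(centers):
        # Assumed weight for point x and cluster k: p_k(x)^2 / d_k(x).
        weights = [p[k] ** 2 / (dist(x, center) + EPS) for p, x in zip(probs, points)]
        total = sum(weights)
        new_centers.append(tuple(
            sum(w * x[j] for w, x in zip(weights, points)) / total
            for j in range(len(center))
        ))
    return new_centers


def d_cluster(points, centers, tol=1e-6, max_iter=200):
    """Repeat the update until the centers stop moving (within tol)."""
    for _ in range(max_iter):
        new_centers = update_centers(points, centers)
        moved = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return centers


# Toy data: two small, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = d_cluster(points, [(2.0, 2.0), (8.0, 8.0)])
print(centers)
```

On this toy data the two centers should settle close to the middle of each group. Because far-away points receive very small weights, an outlier would move a center much less than it would move a plain mean.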

### Tracking America in the age of Trump

DURING his first year as America’s president Donald Trump attempted to redefine what it means to be leader of the free world. He has seen White House staffers come and go; been embroiled in scandal; waged war against “fake news”; and offended friends and foes alike.

### Curb your imposterism, start meta-learning

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

Recently, imposter syndrome has received a lot of attention. Even seasoned programmers admit they suffer from feelings of anxiety and low self-esteem. Some share their personal stories, which can be comforting for those suffering in silence. I focus on a method that helped me grow confidence in recent years. It is a simple, yet very effective way to deal with being overwhelmed by the many things a data scientist can acquaint him or herself with.

#### Two Faces of the Imposter Demon

I think imposterism can be broken into two related entities. The first is feeling you are falling short on a personal level. That is, you think you are not intelligent enough, you think you don’t have perseverance, or any other way of saying you are not up to the job. Most advice for overcoming imposterism focuses on this part. I do not. Rather, I focus on the second foe, the feeling that you don’t know enough. This can be very much related to the feeling of failing on a personal level: you might feel you don’t know enough because you are too slow a learner. However, I think it is helpful to approach it as objectively as possible. The feeling of not knowing enough can be combated more actively. Not by learning as much as you can, but by considering not knowing a choice, rather than an imperfection.

#### You can’t have it all

The field of data science is incredibly broad. It comprises, among many others, getting data out of computer systems, preparing data in databases, principles of distributed computing, building and interpreting statistical models, data visualization, building machine learning pipelines, text analysis, translating business problems into data problems and communicating results to stakeholders. To make matters worse, for each and every topic there are several, if not dozens, of databases, languages, packages and tools. This means that, by definition, no one is going to have mastery of everything the field comprises. And thus there are things you do not and never will know.

#### Learning new stuff

To stay effective you have to keep up with developments within the field. New packages will aid your data preparations, new tools might process data in a faster way and new machine learning models might give superior results. Just to name a few. I think a great deal of imposterism comes from feeling you can’t keep up. There is a constant list in the back of your head with cool new stuff you still have to try out. This is where meta-learning comes into play: actively deciding what you will and will not learn. For my peace of mind it is crucial to decide the things I am not going to do. I keep a log (a Google Sheets document) that has two simple tabs. The first is a collector of stuff I come across in blogs and on Twitter. These are things that look interesting but need a more thorough look. I also add things that I come across in the daily job, such as a certain part of SQL I don’t fully grasp yet. Once in a while I empty the collector, trying to pick up the small stuff right away and moving the larger things either to the second tab or to will-not-do. The second tab holds the larger things I am actually going to learn. When I have time at hand at work or at home, I work on learning the things on the second tab. More about this later.

#### Define Yourself

So you cannot have it all, you have to choose. What can be of good help when choosing is to have a definition of your unique data science profile. Here is mine:

I have thorough knowledge of statistical models and know how to apply them. I am a good R programmer, both in interactive analysis and in software development. I know enough about databases to work effectively with them; if necessary I can do the full data preparation in SQL. I know enough math to understand new models and read textbooks, but I can’t derive and prove new stuff on my own. I have a good understanding of the principles of machine learning and can apply most of the algorithms in practice. My oral and written communication are quite good, which helps me in translating back and forth between data and business problems.

That’s it, focused on what I do well and where I am effective. Some things that are not in there: building a full data pipeline on a Hadoop cluster, telling data stories with d3.js, creating custom algorithms for a business, optimizing a database, effective use of Python, and many more. If someone comes to me with one of these tasks, it is just “Sorry, I am not your guy”.

I used to feel that I had to know everything. For instance, I started to learn Python because I thought a good data scientist should know it as well as R. Eventually, I realized I will never be good at Python, because I will always use R as my bread-and-butter. I know enough Python to cooperate in a project where it is used, but that’s it, and that is how it will remain. Rather, I now spend time and effort on improving what I already do well. This is not because I think R is superior to Python. I just happen to know R, and I am content with knowing R very well at the cost of not having access to all the wonderful work done in Python. I will never learn d3.js, because I don’t know JavaScript and it would take me ages to learn. Rather, I might focus on learning Stan, which is a much better fit for my profile. I think it is both effective and stress-relieving to go deep on the things you are good at and to deliberately choose things you will not learn.

#### The meta-learning

I told you about the collector; now a few more words about the meta-learning tab. It has three simple columns. *What* I am going to learn and *how* I am going to do that are the first two, obvious, columns. The most important, however, is *why* I am going to learn it. For me there are only two valid reasons. Either I am very interested in the topic and I envision enjoying doing it, or it will allow me to do my *current* job more effectively. I stress current there because scrolling through the requirements of job openings is about the perfect way to feed your imposter monster. Focus on what you are doing now and have faith you will pick up new skills if a future job demands it.

Meta-learning gives me focus, relaxation and efficiency. At its core it is defining yourself as a data scientist and deliberately choosing what you are and, more importantly, what you are not going to learn. I have found that doing this with rigor actively fights the imposterism. Now, what works for me might not work for you. Maybe a different system fits you better. However, I think everybody benefits from defining the data scientist he/she is and actively choosing what not to learn.

### Learn Data Science Without a Degree

But how do you learn data science? Let’s take a look at some of the steps you can take to begin your journey into data science without needing a degree, including Springboard’s Data Science Career Track.

### R Packages worth a look

Allows you to retrieve information from the ‘Google Knowledge Graph’ API <https://…/knowledge.html> and process it in R in various forms. The ‘Knowledge Graph Search’ API lets you find entities in the ‘Google Knowledge Graph’. The API uses standard ‘schema.org’ types and is compliant with the ‘JSON-LD’ specification.

Fast Region-Based Association Tests on Summary Statistics (sumFREGAT)
An adaptation of classical region/gene-based association analysis techniques that uses summary statistics (P values and effect sizes) and correlations between genetic variants as input. It is a tool to perform the most common and efficient gene-based tests on the results of genome-wide association (meta-)analyses without having the original genotypes and phenotypes at hand.

Monotonic Association on Zero-Inflated Data (mazeinda)
Methods for calculating and testing the significance of pairwise monotonic association from and based on the work of Pimentel (2009) <doi:10.4135/9781412985291.n2>. Computation of association of vectors from one or multiple sets can be performed in parallel thanks to the packages ‘foreach’ and ‘doMC’.

Computing Envelope Estimators (Renvlp)
Provides a general routine, envMU(), which allows estimation of the M envelope of span(U) given root n consistent estimators of M and U. The routine envMU() does not presume a model. This package implements response envelopes (env()), partial response envelopes (penv()), envelopes in the predictor space (xenv()), heteroscedastic envelopes (henv()), simultaneous envelopes (stenv()), scaled response envelopes (senv()), scaled envelopes in the predictor space (sxenv()), groupwise envelopes (genv()), weighted envelopes (weighted.env(), weighted.penv() and weighted.xenv()), envelopes in logistic regression (logit.env()), and envelopes in Poisson regression (pois.env()). For each of these model-based routines, the package provides inference tools including bootstrap, cross validation, estimation and prediction; hypothesis testing on coefficients is included except for weighted envelopes. Tools for selection of dimension include AIC, BIC and likelihood ratio testing. Background is available at Cook, R. D., Forzani, L. and Su, Z. (2016) <doi:10.1016/j.jmva.2016.05.006>. Optimization is based on a clockwise coordinate descent algorithm.

Model Based Random Forest Analysis (mobForest)
Functions to implement the random forest method for model-based recursive partitioning. The mob() function, developed by Zeileis et al. (2008), within the ‘party’ package, is modified to construct model-based decision trees based on random forests methodology. The main input function mobforest.analysis() takes all input parameters to construct trees and compute out-of-bag errors, predictions, and overall accuracy of the forest. The algorithm performs parallel computation using cluster functions within the ‘parallel’ package.

### President Trump’s first year, through The Economist’s covers

SATURDAY January 20th marks one year since Donald Trump’s inauguration as the 45th President of the United States. Over the intervening months the world has been forced to come to terms with—and repeatedly adjust to—having Mr Trump in the White House. His first 365 days have hurtled by like an out-of-control fairground ride.

### Porn traffic before and after the missile alert in Hawaii

PornHub compared minute-to-minute traffic on their site before and after the missile alert to an average Saturday (okay for work). Right after the alert there was a dip as people rushed for shelter, but not long after the false alarm notice, traffic appears to spike.

Some interpret this as people rushing to porn after learning that a missile was not headed towards their home. Maybe that’s part of the reason, but my guess is that Saturday morning porn consumers woke earlier than usual.

### Edelweiss: Data Scientist

Seeking a Data Scientist for building, validating and deploying machine learning models on unstructured data for various business problems.

### Plot2txt for quantitative image analysis

Plot2txt converts images into text and other representations, helping create semi-structured data from binary, using a combination of machine learning and other algorithms.

### The Trumpets of Lilliput

Gur Huberman pointed me to this paper by George Akerlof and Pascal Michaillat that gives an institutional model for the persistence of false belief. The article begins:

This paper develops a theory of promotion based on evaluations by the already promoted. The already promoted show some favoritism toward candidates for promotion with similar beliefs, just as beetles are more prone to eat the eggs of other species. With such egg-eating bias, false beliefs may not be eliminated by the promotion system. Our main application is to scientific revolutions: when tenured scientists show favoritism toward candidates for tenure with similar beliefs, science may not converge to the true paradigm. We extend the statistical concept of power to science: the power of the tenure test is the probability (absent any bias) of denying tenure to a scientist who adheres to the false paradigm, just as the power of any statistical test is the probability of rejecting a false null hypothesis. . . .
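To make the statistical notion of power in the abstract concrete, here is a minimal sketch that computes the power of a one-sided z-test, that is, the probability of rejecting a false null hypothesis. The function name and the numbers (effect size, standard deviation, sample size) are made up for illustration and have nothing to do with the paper's tenure model.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = 0 when the true mean is `effect`."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)  # rejection threshold on the z scale
    shift = effect * sqrt(n) / sigma          # mean of the z statistic under the alternative
    return 1 - std_normal.cdf(critical - shift)


# Hypothetical setting: true mean 0.5, standard deviation 1, 25 observations.
power = z_test_power(effect=0.5, sigma=1.0, n=25)
print(round(power, 3))
```

When the true effect is zero, the same function returns the significance level `alpha`, which is a quick sanity check on the formula.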

It was interesting to see a mathematical model for the persistence of errors, and I agree that there must be something to their general point that people are motivated to support work that confirms their beliefs and to discredit work that disconfirms their beliefs. We’ve seen a lot of this sort of analysis at the individual level (“motivated reasoning,” etc.) and it makes sense to think of this at an interpersonal or institutional level too.

There were, however, some specific aspects of their model that I found unconvincing, partly on statistical grounds and partly based on my understanding of how social science works within society:

1. Just as I don’t think it is helpful to describe statistical hypotheses as “true” or “false,” I don’t think it’s helpful to describe scientific paradigms as “true” or “false.” Also, I’m no biologist, but I’m skeptical of a statement such as, “With the beetles, the more biologically fit species does not always prevail.” What does it mean to say a species is “more biologically fit”? If they survive and reproduce, they’re fit, no? And if a species’ eggs get eaten before they’re hatched, that reduces the species’s fitness.

In the article, they modify “true” and “false” to “Better” and “Worse,” but I have pretty much the same problem here, which is that different paradigms serve different purposes, so I don’t see how it typically makes sense to speak of one paradigm as giving “a more correct description of the world,” except in some extreme cases. For example, a few years ago I reviewed a pop-science book that was written from a racist paradigm. Is that paradigm “more correct” or “less correct” than a non-racist paradigm? It depends on what questions are being asked, and what non-racist paradigm is being used as a comparison.

Beyond all this—or perhaps explaining my above comments—is my irritation at people who use university professors as soft targets. Silly tenured professors ha ha. Bad science is a real problem but I think it’s ludicrous to attribute that to the tenure system. Suppose there was no such thing as academic tenure, then I have a feeling that social and biomedical science research would be even more fad-driven.

I sent the above comments to the authors, and Akerlof replied:

I think that your point of view and ours are surprisingly on the same track; in fact the paper answers Thomas Kuhn’s question: what makes science so successful. The point is rather subtle and is in the back pages: especially regarding the differences between promotions of scientists and promotion of surgeons who did radical mastectomies.

The post The Trumpets of Lilliput appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Registration and talk proposals open Monday for useR!2018

Registration will open on Monday (January 22) for useR! 2018, the official R user conference to be held in Brisbane, Australia July 10-13. If you haven't been to a useR! conference before, it's a fantastic opportunity to meet and mingle with other R users from around the world, see talks on R packages and applications, and attend tutorials for deep dives on R-related topics. This year's conference will also feature keynotes from Jenny Bryan, Steph De Silva, Heike Hofmann, Thomas Lin Pedersen, Roger Peng and Bill Venables. It's my favourite conference of the year, and I'm particularly looking forward to this one.

This video from last year's conference in Brussels (a sell-out with over 1,100 attendees) will give you a sense of what a useR! conference is like:

The useR! conference is brought to you by the R Foundation and is 100% community-led. That includes the content: the vast majority of talks come directly from R users. If you've written an R package, performed an interesting analysis with R, or simply have something to share of interest to the R community, consider proposing a talk by submitting an abstract. (Abstract submissions are open now.) Most talks are 20 minutes, but you can also propose a 5-minute lightning talk or a poster. If you're not sure what kind of talk you might want to give, check out the program from useR!2017 for inspiration. R-Ladies, which promotes gender diversity in the R community, can also provide guidance on abstracts. Note that all proposals must comply with the conference code of conduct.

Early-bird registrations close on March 15, and while general registration will be open until June my advice is to get in early, as this year's conference is likely to sell out once again. If you want to propose a talk, submissions are due by March 2 (but early submissions have a better chance of being accepted). Follow @user!2018_conf on Twitter for updates about the conference, and click the links below to register or submit an abstract. I look forward to seeing you in Brisbane!

Update Jan 19: Registrations will now open January 22

useR! 2018: Registration; Abstract submission

### Managing Machine Learning Workflows with Scikit-learn Pipelines Part 2: Integrating Grid Search

Another simple yet powerful technique we can pair with pipelines to improve performance is grid search, which attempts to optimize model hyperparameter combinations.
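As a rough sketch of what grid search does, the example below scores every combination of hyperparameters with cross-validation and keeps the best one. In scikit-learn this role is played by `GridSearchCV` wrapped around a `Pipeline`; the version here is a standard-library stand-in, and the toy dataset, the small k-nearest-neighbours classifier, and all function names are hypothetical.

```python
from itertools import product

# Toy 1-D dataset (made up for illustration): label is 1 when x >= 10,
# with two deliberately flipped labels acting as noise.
DATA = [(x, int(x >= 10)) for x in range(20)]
DATA[4] = (4, 1)
DATA[15] = (15, 0)


def knn_predict(train, query, k, weighted):
    """Vote among the k nearest training points, optionally weighting by 1/distance."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = {0: 0.0, 1: 0.0}
    for x, label in nearest:
        votes[label] += 1.0 / (abs(x - query) + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


def loo_accuracy(data, k, weighted):
    """Leave-one-out cross-validated accuracy for one hyperparameter combination."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, x, k, weighted) == label
    return hits / len(data)


def grid_search(data, param_grid):
    """Score every combination in param_grid; return (best_params, best_score, results)."""
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, loo_accuracy(data, **params)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


param_grid = {"k": [1, 3, 5], "weighted": [False, True]}
best_params, best_score, results = grid_search(DATA, param_grid)
print(best_params, best_score)
```

The grid has 3 × 2 = 6 combinations, and each one is scored on held-out points rather than on the training data. The cost grows multiplicatively with every hyperparameter added to the grid, which is the main practical limit of the technique.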

### A lesson from the Charles Armstrong plagiarism scandal: Separation of the judicial and the executive functions

Charles Armstrong is a history professor at Columbia University who, so I’ve heard, has plagiarized and faked references for an award-winning book about Korean history. The violations of the rules of scholarship were so bad that the American Historical Association “reviewed the citation issue after being notified by a member of the concerns some have about the book” and, shortly after that, Armstrong relinquished the award. More background here.

To me, the most interesting part of the story is that Armstrong was essentially forced to give in, especially surprising given how aggressive his original response was, attacking the person whose work he’d stolen.

It’s hard to imagine that Columbia University could’ve made Armstrong return the prize, given that the university gave him a “President’s Global Innovation Fund Grant” many months after the plagiarism story had surfaced.

The reason why, I think, is that the American Historical Association had this independent committee.

And that gets us to the point raised in the title of this post.

Academic and corporate environments are characterized by an executive function with weak to zero legislative or judicial functions. That is, decisions are made based on consequences, with very few rules. Yes, we have lots of little rules and red tape, but no real rules telling the executives what to do.

Evaluating every decision based on consequences seems like it could be a good idea, but it leads to situations where wrongdoers are left in place, as in any given situation it seems like too much trouble to deal with the problem.

An analogy might be with the famous probability-matching problem. Suppose someone shuffles a deck with 100 cards, 70 red and 30 black, and then starts pulling out cards, one at a time, asking you to guess. You’ll maximize your expected number of correct answers by simply guessing Red, Red, Red, Red, Red, etc. In each case, that’s the right guess, but put it together and your guesses are not representative. Similarly, if for each scandal the university makes the locally optimal decision to do nothing, the result is that nothing is ever done.

This analogy is not perfect: I’m not recommending that the university sanction 30% of its profs at random—for one thing, that could be me! But it demonstrates the point that a series of individually reasonable decisions can be unreasonable in aggregate.
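To put numbers on the card-guessing analogy, here is a small sketch comparing the two strategies. It assumes the "matching" guesser says Red with probability 0.7 independently of the card actually drawn; that modelling choice is mine, not the post's.

```python
from fractions import Fraction

RED, BLACK = 70, 30
TOTAL = RED + BLACK
p_red = Fraction(RED, TOTAL)  # share of red cards in the deck

# Strategy 1: always guess Red.
# Every red card is a hit and every black card is a miss.
always_red = Fraction(RED)

# Strategy 2: probability matching (assumed: guess Red with probability 0.7,
# independently of the card). A red card is guessed correctly with
# probability 0.7, a black card with probability 0.3.
matching = RED * p_red + BLACK * (1 - p_red)

print(always_red)  # 70
print(matching)    # 58
```

The representative-looking strategy gets 58 cards right in expectation, against 70 for the strategy that never says Black.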

Anyway, one advantage of a judicial branch—or, more generally, a fact-finding institution that is separate from enforcement and policymaking—is that its members can feel free to look for truth, damn the consequences, because that’s their role.

So, instead of the university weighing the negatives of having a barely-repentant plagiarist on faculty or having the embarrassment of sanctioning a tenured professor, there can be an independent committee of the American Historical Association just judging the evidence.

It’s a lot easier to judge the evidence if you don’t have direct responsibility for what will be done with the evidence. Or, to put it another way, it’s easier to be a judge if you don’t also have to play the roles of jury and executioner.

P.S. I see that Armstrong was recently quoted in Newsweek regarding Korea policy. Maybe they should’ve interviewed the dude he copied from instead. Why not go straight to the original, no?

THE threat of nuclear holocaust, familiar to Americans who grew up during the cold war, is alien to most today. On Saturday January 13th fears of annihilation reemerged.

### Introducing RLlib: A composable and scalable reinforcement learning library

RISE Lab’s Ray platform adds libraries for reinforcement learning and hyperparameter tuning.

In a previous post, I outlined emerging applications of reinforcement learning (RL) in industry. I began by listing a few challenges facing anyone wanting to apply RL, including the need for large amounts of data, and the difficulty of reproducing research results and deriving the error estimates needed for mission-critical applications. Nevertheless, the success of RL in certain domains has been the subject of much media coverage. This has sparked interest, and companies are beginning to explore some of the use cases and applications I described in my earlier post. Many tasks and professions, including software development, are poised to incorporate some forms of AI-powered automation. In this post, I’ll describe how RISE Lab’s Ray platform continues to mature and evolve just as companies are examining use cases for RL.

Assuming one has identified suitable use cases, how does one get started with RL? Most companies that are thinking of using RL for pilot projects will want to take advantage of existing libraries.

There are several open source projects that one can use to get started. From a technical perspective, there are a few things to keep in mind when considering a library for RL:

• Support for existing machine learning libraries. Because RL typically uses gradient-based or evolutionary algorithms to learn and fit policy functions, you will want it to support your favorite library (TensorFlow, Keras, PyTorch, etc.).
• Scalability. RL is computationally intensive, and having the option to run in a distributed fashion becomes important as you begin using it in key applications.
• Composability. RL algorithms typically involve simulations and many other components. You will want a library that lets you reuse components of RL algorithms (such as policy graphs, rollouts), that is compatible with multiple deep learning frameworks, and that provides composable distributed execution primitives (nested parallelism).

## Introducing Ray RLlib

Ray is a distributed execution platform (from UC Berkeley’s RISE Lab) aimed at emerging AI applications, including those that rely on RL. RISE Lab recently released RLlib, a scalable and composable RL library built on top of Ray:

RLlib is designed to support multiple deep learning frameworks (currently TensorFlow and PyTorch) and is accessible through a simple Python API. It currently ships with the following popular RL algorithms (more to follow):

It’s important to note that there is no dominant pattern for computing and composing RL algorithms and components. As such, we need a library that can take advantage of parallelism at multiple levels and physical devices. RLlib is an open source library for the scalable implementation of algorithms that connect the evolving set of components used in RL applications. In particular, RLlib enables rapid development because it makes it easy to build scalable RL algorithms through the reuse and assembly of existing implementations (“parallelism encapsulation”). RLlib also lets developers use neural networks created with several popular deep learning frameworks, and it integrates with popular third-party simulators.

Software for machine learning needs to run efficiently on a variety of hardware configurations, both on-premise and on public clouds. Ray and RLlib are designed to deliver fast training times on a single multi-core node or in a distributed fashion, and these software tools provide efficient performance on heterogeneous hardware (whatever the ratio of CPUs to GPUs might be).

## Examples: Text summarization and AlphaGo Zero

The best way to get started is to apply RL on some of your existing data sets. To that end, a relatively recent application of RL is in text summarization. Here’s a toy example to try—use RLlib to summarize unstructured text (note that this is not a production-grade model):

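```python
# Complete notebook available here: https://goo.gl/n6f43h
# `summarization` and `agent` are assumed to be set up earlier in that notebook.
document = """Insert your sample text here
"""
# Summarize the document with the trained agent and compare lengths.
summary = summarization.summarize(agent, document)
print("Original document length is {}".format(len(document)))
print("Summary length is {}".format(len(summary)))
```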


Text summarization is just one of several possible applications. A recent RISE Lab paper provides other examples, including an implementation of the main algorithm used in AlphaGo Zero in about 70 lines of RLlib pseudocode.

## Hyperparameter tuning with RayTune

Another common example involves model building. Data scientists spend a fair amount of time conducting experiments, many of which involve tuning parameters for their favorite machine learning algorithm. As deep learning (and RL) become more popular, data scientists will need software tools for efficient hyperparameter tuning and other forms of experimentation and simulation. RayTune is a new distributed, hyperparameter search framework for deep learning and RL. It is built on top of Ray and is closely integrated with RLlib. RayTune is based on grid search and uses ideas from early stopping, including the Median Stopping Rule and HyperBand.
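To make the early-stopping idea concrete, here is a plain-Python sketch of one common statement of the Median Stopping Rule: stop a running trial at a given step if its best result so far is worse than the median of the running averages of completed trials at that step. This is not RayTune's API; the function names and the sample learning curves are hypothetical, and higher scores are assumed to be better.

```python
from statistics import median

def running_average(curve, step):
    """Mean of the first `step` reported scores of a trial."""
    return sum(curve[:step]) / step

def should_stop(trial_curve, completed_curves, step):
    """Median stopping check for a running trial (higher score = better)."""
    # Only compare against completed trials that reported at least `step` scores.
    peers = [c for c in completed_curves if len(c) >= step]
    if not peers or len(trial_curve) < step:
        return False
    best_so_far = max(trial_curve[:step])
    threshold = median(running_average(c, step) for c in peers)
    return best_so_far < threshold

# Made-up validation scores per training step for three finished trials.
completed = [
    [0.5, 0.6, 0.7],
    [0.4, 0.6, 0.8],
    [0.3, 0.5, 0.6],
]

print(should_stop([0.2, 0.3], completed, step=2))    # True
print(should_stop([0.45, 0.65], completed, step=2))  # False
```

At step 2 the running averages of the finished trials are 0.55, 0.5 and 0.4, so the median is 0.5; the first trial never reached that level and is cut off, while the second keeps running.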

There is a growing list of open source software tools available to companies wanting to explore deep learning and RL. We are in an empirical era, and we need tools that enable quick experiments in parallel, while letting us take advantage of popular software libraries, algorithms, and components. Ray just added two libraries that will let companies experiment with reinforcement learning and also efficiently search through the space of neural network architectures.

Reinforcement learning applications involve multiple components, each of which presents opportunities for distributed computation. Ray RLlib adopts a programming model that enables the easy composition and reuse of components, and takes advantage of parallelism at multiple levels and physical devices. Over the short term, RISE Lab plans to add more RL algorithms, APIs for integration with online serving, support for multi-agent scenarios, and an expanded set of optimization strategies.


### Four short links: 19 January 2018

Pricing, Windows Emulation, Toxic Tech Culture, and AI Futures

1. Pricing Summary -- quick and informative read. Three-part tariff (3PT): Again, the software has a base platform fee, but the fee is $25,000 because it includes the first 150K events free. Each marginal event costs $0.15. In academic research and theory, the three-part tariff is proven to be best. It provides many different ways for the sales team to negotiate on price and captures the most value.
2. Wine 3.0 Released -- the Windows emulator now runs Photoshop CC 2018! Astonishing work.
3. Getting Free of Toxic Tech Culture (Val Aurora and Susan Wu) -- We didn’t realize how strongly we’d unconsciously adopted this belief that people in tech were better than those who weren’t until we started to imagine ourselves leaving tech and felt a wave of self-judgment and fear. Early on, Valerie realized that she unconsciously thought of literally every single job other than software engineer as “for people who weren’t good enough to be a software engineer” – and that she thought this because other software engineers had been telling her that for her entire career. This.
4. The Future Computed: Artificial Intelligence and its Role in Society -- Microsoft's book on the AI-enabled future. Three chapters: The Future of Artificial Intelligence; Principles, Policies, and Laws for the Responsible Use of AI; and AI and the Future of Jobs and Work.
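The three-part tariff quoted in the first link can be written as a one-line formula: a base fee, an allowance of included events, and a per-event price beyond the allowance. The sketch below uses the figures from the quote; the function name is hypothetical, and amounts are kept in integer cents to avoid floating-point rounding.

```python
BASE_FEE_CENTS = 25_000 * 100   # $25,000 platform fee
INCLUDED_EVENTS = 150_000       # first 150K events are covered by the fee
MARGINAL_CENTS = 15             # $0.15 per additional event

def three_part_tariff_cents(events):
    """Total price in cents for a given number of events."""
    extra_events = max(0, events - INCLUDED_EVENTS)
    return BASE_FEE_CENTS + extra_events * MARGINAL_CENTS

for n in (100_000, 150_000, 250_000):
    print(n, three_part_tariff_cents(n) / 100)
```

A customer with 250,000 events pays the base fee plus 100,000 marginal events at $0.15 each, or $40,000 in total.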

### What can Text Mining tell us about Snapchat’s new update?

Last week, Snapchat unveiled a major redesign of their app that received quite a bit of negative feedback. As a video-sharing platform that has integrated itself into users’ daily lives, Snapchat relies on simplicity and ease of use. So when large numbers of these users begin to express pretty serious frustration about the app’s new design, it’s a big threat to their business.

You can bet that right now Snapchat are analyzing exactly how big a threat this backlash is by monitoring the conversation online. This is a perfect example of businesses leveraging the Voice of their Customer with tools like Natural Language Processing. Businesses that track their product’s reputation online can quantify how serious events like this are and make informed decisions on their next steps. In this blog, we’ll give a couple of examples of how you can dive into online chatter and extract important insights on customer opinion.

This TechCrunch article pointed out that 83% of Google Play Store reviews in the immediate aftermath of the update gave the app one or two stars. But as we mentioned in a blog last week, star rating systems aren’t enough – they don’t tell you why people feel the way they do and most of the time people base their star rating on a lot more than how they felt about a product or service.

To get accurate and in-depth insights, you need to understand exactly what a reviewer is positive or negative about, and to what degree they feel this way. This can only be done effectively with text mining.

So in this short blog, we’re going to use text mining to:

1. Analyze a sample of the Play Store reviews to see what Snapchat users mentioned in reviews posted since the update.
2. Gather and analyze a sample of 1,000 tweets mentioning “Snapchat update” to see if the reaction was similar on social media.

In each of these analyses, we’ll use the AYLIEN Text Analysis API, which comes with a free plan that’s ideal for testing it out on small datasets like the ones we’ll use in this post.

## What did the app reviewers talk about?

As TechCrunch pointed out, 83% of reviews since the update shipped received one or two stars, which gives us a high-level overview of the sentiment shown towards the redesign. But to dig deeper, we need to look into the reviews and see what people were actually talking about in all of these reviews.

As a sample, we gathered the 40 reviews readily available on the Google Play Store and saved them in a spreadsheet. We can analyze what people were talking about in them by using our Text Analysis API’s Entities feature. This feature analyzes a piece of text and extracts the people, places, organizations and things mentioned in it.

One of the types of entities returned to us is a list of keywords. To get a quick look into what the reviewers were talking about in a positive and negative light, we visualized the keywords extracted along with the average sentiment of the reviews they appeared in.

From the 40 reviews, our Text Analysis API extracted 498 unique keywords. Below you can see a visualization of the keywords extracted and the average sentiment of the reviews they appeared in from most positive (1) to most negative (-1).
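The aggregation behind a chart like this is simple to sketch. The snippet below assumes you already have, for each review, a list of extracted keywords and a sentiment score between -1 and 1; it does not call the Text Analysis API, and the sample data and function name are made up for illustration.

```python
from collections import defaultdict

def average_sentiment_by_keyword(reviews):
    """Map each keyword to the mean sentiment of the reviews it appears in."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for keywords, score in reviews:
        # Count each keyword once per review, even if it was extracted twice.
        for keyword in set(keywords):
            totals[keyword] += score
            counts[keyword] += 1
    return {keyword: totals[keyword] / counts[keyword] for keyword in totals}

# Hypothetical (keywords, sentiment score) pairs, one per review.
reviews = [
    (["Bitmoji", "love"], 0.8),
    (["stories", "layout"], -0.6),
    (["stories", "unintuitive"], -1.0),
    (["Bitmoji", "layout"], 0.2),
]

averages = average_sentiment_by_keyword(reviews)
# Sort from most positive to most negative, as in the chart described above.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for keyword, score in ranked:
    print(f"{keyword:12s} {score:+.2f}")
```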

First of all, you’ll notice that keywords like “love” and “great” are high on the chart, while “frustrating” and “terrible” are low on the scale – which is what you’d expect. But if you look at keywords that refer to Snapchat, you’ll see that “Bitmoji” appears high on the chart, while “stories,” “layout,” and “unintuitive” all appear low down the chart, giving an insight into what Snapchat’s users were angry about.

## How did Twitter react to the Snapchat update?

Twitter is such an accurate gauge of what the general public is talking about that the US Geological Survey uses it to monitor for earthquakes – because the speed at which people react to earthquakes on Twitter outpaces even their own seismic data feeds! So if people Tweet about earthquakes during the actual earthquakes, they are absolutely going to Tweet their opinions of Snapchat updates.

To get a snapshot of the Twitter conversation, we gathered 1,000 Tweets that mentioned the update. To gather the Tweets, we ran a search on Twitter using the Twitter Search API (this is really easy – take a look at our beginners’ guide to doing this in Python).

After we gathered our Tweets, we analyzed them with our Sentiment Analysis feature and as you can see, the Tweets were overwhelmingly negative:

[Figure: Sentiment of 1,000 Tweets about Snapchat]

Quantifying the positive, negative, and neutral sentiment shown towards the update on Twitter is useful, but using Text Mining we can go one further and extract the keywords mentioned in every one of these Tweets. To do this, we use the Text Analysis API’s Entities feature.

Disclaimer: this being Twitter, there was quite a bit of opinion expressed in a NSFW manner 😉

[Figure: Most mentioned keywords on Twitter in Tweets about “Snapchat update”]

The number of expletives we identified as keywords reinforces the severity of the opinion expressed towards the update. You can see that “stories” and “story” are two of the few prominently-featured keywords that referred to feature updates while keywords like “awful” and “stupid” are good examples of the most-mentioned keywords in reaction to the update as a whole.

It’s clear that text mining processes like sentiment analysis and entity extraction can provide a detailed overview of public reaction to an event by extracting granular information from product reviews and social media chatter.

If you can think of insights you could extract with text mining about topics that matter to you, our Text Analysis API allows you to analyze 1,000 documents per day free of charge and getting started with our tools couldn’t be easier – click on the image below to sign up.

The post What can Text Mining tell us about Snapchat’s new update? appeared first on AYLIEN.

### On Random Weights for Texture Generation in One Layer Neural Networks

Following up on the use of random projections (which in the context of DNNs is really about NNs with random weights), today we have:

Recent work in the literature has shown experimentally that one can use the lower layers of a trained convolutional neural network (CNN) to model natural textures. More interestingly, it has also been experimentally shown that only one layer with random filters can also model textures although with less variability. In this paper we ask the question as to why one layer CNNs with random filters are so effective in generating textures? We theoretically show that one layer convolutional architectures (without a non-linearity) paired with an energy function used in previous literature, can in fact preserve and modulate frequency coefficients in a manner so that random weights and pretrained weights will generate the same type of images. Based on the results of this analysis we question whether similar properties hold in the case where one uses one convolution layer with a non-linearity. We show that in the case of ReLu non-linearity there are situations where only one input will give the minimum possible energy whereas in the case of no nonlinearity, there are always infinite solutions that will give the minimum possible energy. Thus we can show that in certain situations adding a ReLu non-linearity generates less variable images.


### 501 days of Summer (school)

(This article was first published on Gianluca Baio's blog, and kindly contributed to R-bloggers)

As I anticipated earlier, we’re now ready to open registration for our Summer School in Florence (I was waiting for UCL to set up the registration system and thought it may take much longer than it actually did $-$ so well done UCL!).

We’ll probably have a few changes here and there in the timetable $-$ we’re thinking of introducing some new topics and I think I’ll certainly merge a couple of my intro lectures, to leave some time for those…

Nothing is fixed yet and we’re in the process of deliberating all the changes $-$ but I’ll post as soon as we have a clearer plan for the revised timetable.

Here’s the advert (which I’ve also sent out to some relevant mailing lists).

Summer school: Bayesian methods in health economics
Date: 4-8 June 2018
Venue: CISL Study Center, Florence (Italy)

COURSE ORGANISERS: Gianluca Baio, Chris Jackson, Nicky Welton, Mark Strong, Anna Heath

OVERVIEW:
This summer school is intended to provide an introduction to Bayesian analysis and MCMC methods using R and MCMC sampling software (such as OpenBUGS and JAGS), as applied to cost-effectiveness analysis and typical models used in health economic evaluations. We will present a range of modelling strategies for cost-effectiveness analysis as well as recent methodological developments for the analysis of the value of information.

The course is intended for health economists, statisticians, and decision modellers interested in the practice of Bayesian modelling and will be based on a mixture of lectures and computer practicals, although the emphasis will be on examples of applied analysis: software and code to carry out the analyses will be provided. Participants are encouraged to bring their own laptops for the practicals.

We shall assume a basic knowledge of standard methods in health economics and some familiarity with a range of probability distributions, regression analysis, Markov models and random-effects meta-analysis. However, statistical concepts are reviewed in the context of applied health economic evaluations in the lectures.

The summer school is hosted in the beautiful complex of the Centro Studi Cisl, overlooking and a short distance from Florence (Italy). The registration fees include full board accommodation in the Centro Studi.

More information can be found at the summer school webpage. Registration is available from the UCL Store. For more details or enquiries, email Dr Gianluca Baio.


### R Packages worth a look

Parallel Runs of Reverse Depends (prrd)
Reverse depends for a given package are queued such that multiple workers can run the tests in parallel.

Critical Line Algorithm in Pure R (CLA)
Implements ‘Markowitz’ Critical Line Algorithm (‘CLA’) for classical mean-variance portfolio optimization. Care has been taken for correctness in light of previous buggy implementations.

Extension for ‘R6’ Base Class (r6extended)
Useful methods and data fields to extend the bare bones ‘R6’ class provided by the ‘R6’ package – ls-method, hashes, warning- and message-method, general get-method and a debug-method that assigns self and private to the global environment.

Run Predictions Inside the Database (tidypredict)
It parses a fitted ‘R’ model object, and returns a formula in ‘Tidy Eval’ code that calculates the predictions. It works with several databases back-ends because it leverages ‘dplyr’ and ‘dbplyr’ for the final ‘SQL’ translation of the algorithm. It currently supports lm(), glm() and randomForest() models.

Bayesian Structure Learning in Graphical Models using Birth-Death MCMC (BDgraph)
Provides statistical tools for Bayesian structure learning in undirected graphical models for continuous, discrete, and mixed data. The package implements recent improvements in the Bayesian graphical models literature, including Mohammadi and Wit (2015) <doi:10.1214/14-BA889> and Mohammadi et al. (2017) <doi:10.1111/rssc.12171>. To speed up the computations, the BDMCMC sampling algorithms are implemented in parallel using OpenMP in C++.

### Distilled News

In this article, a few simple applications of Markov chains are discussed as solutions to a few text processing problems. These problems appeared as assignments in a few courses; the descriptions are taken straight from the courses themselves.
Remember that I said last time that Python if statements are similar to how our brain processes conditions in everyday life? That’s true for for loops too. You go through your shopping list until you’ve collected every item on it. The dealer gives a card to each player until everyone has five. The athlete does push-ups until reaching one hundred… Loops everywhere! As for for loops in Python: they are perfect for processing repetitive programming tasks. In this article, I’ll show you everything you need to know about them: the syntax, the logic and best practices too!
This post shows you how to label hundreds of thousands of images in an afternoon. You can use the same approach whether you are labeling images or labeling traditional tabular data (e.g., identifying cyber security attacks or potential part failures).
I’m contemplating the idea of teaching a course on simulation next fall, so I have been exploring various topics that I might include. (If anyone has great ideas either because you have taught such a course or taken one, definitely drop me a note.) Monte Carlo (MC) simulation is an obvious one. I like the idea of talking about importance sampling, because it sheds light on the idea that not all MC simulations are created equally. I thought I’d do a brief blog to share some code I put together that demonstrates MC simulation generally, and shows how importance sampling can be an improvement.
Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.3 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to the latest R (version 3.4.3) and updates the bundled packages (specifically: checkpoint, curl, doParallel, foreach, and iterators) to new versions. MRO is 100% compatible with all R packages. MRO 3.4.3 points to a fixed CRAN snapshot taken on January 1 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint package to access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).
Making deep learning simple and accessible to enterprises: Polyaxon aims to be an enterprise-grade open source platform for building, training, and monitoring large scale deep learning applications. It includes an infrastructure, set of tools, proven algorithms, and industry models to enable your organization to innovate faster. Polyaxon is platform-agnostic with no lock-in. You keep full ownership and control of sensitive data on-premise or in the cloud.
When approaching problems with sequential data, such as natural language tasks, recurrent neural networks (RNNs) typically top the choices. While the temporal nature of RNNs is a natural fit for these problems with text data, convolutional neural networks (CNNs), which are tremendously successful when applied to vision tasks, have also demonstrated efficacy in this space. In our LSTM tutorial, we took an in-depth look at how long short-term memory (LSTM) networks work and used TensorFlow to build a multi-layered LSTM network to model stock market sentiment from social media content. In this post, we will briefly discuss how CNNs are applied to text data while providing some sample TensorFlow code to build a CNN that can perform binary classification tasks similar to our stock market sentiment model.
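For the Monte Carlo item above, here is a small sketch of why importance sampling can beat plain simulation. It is not the code from the linked post. The target (the probability that a standard normal exceeds 4), the shifted-normal proposal, the sample size, and the seed are all illustrative assumptions.

```python
import math
import random

def plain_mc(n, threshold, rng):
    """Estimate P(Z > threshold) by counting hits among standard normal draws."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n

def importance_sampling(n, threshold, rng):
    """Draw from N(threshold, 1) and reweight each hit by the density ratio."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # phi(x) / phi(x - threshold) = exp(-threshold * x + threshold**2 / 2)
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n

threshold = 4.0
n = 20_000
rng = random.Random(0)

exact = 0.5 * math.erfc(threshold / math.sqrt(2))  # true tail probability
mc_estimate = plain_mc(n, threshold, rng)
is_estimate = importance_sampling(n, threshold, rng)

print(f"exact               {exact:.3e}")
print(f"plain Monte Carlo   {mc_estimate:.3e}")
print(f"importance sampling {is_estimate:.3e}")
```

With 20,000 draws the plain estimator expects fewer than one hit, so its answer is usually zero or far off, while roughly half of the proposal draws land in the region of interest and contribute to the weighted estimate.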

### Book Memo: “Stochastic Modelling in Production Planning”

Methods for Improvement and Investigations on Production System Behaviour. Alexander Hübl develops models for production planning and analyzes performance indicators to investigate production system behaviour. He extends existing literature by considering the uncertainty of customer required lead time and processing times as well as by increasing the complexity of multi-machine multi-items production models. Results are on the one hand a decision support system for determining capacity and the further development of the production planning method Conwip. On the other hand, the author develops the JIT intensity and analytically proves the effects of dispatching rules on production lead time.

### The difference between me and you is that I’m not on fire

“Eat what you are while you’re falling apart and it opened a can of worms. The gun’s in my hand and I know it looks bad, but believe me I’m innocent.” – Mclusky

While the next episode of Madam Secretary buffers on terrible hotel internet, I (the other other white meat) thought I’d pop in to say a long, convoluted hello. I’m in New York this week visiting Andrew and the Stan crew (because it’s cold in Toronto and I somehow managed to put all my teaching on Mondays. I’m Garfield without the spray tan.).

So I’m in a hotel on the Upper West Side (or, like, maybe the upper upper west side. I’m in the 100s. Am I in Harlem yet? All I know is that I’m a block from my favourite bar [which, as a side note, Aki does not particularly care for] where I am currently not sitting and writing this because last night I was there reading a book about the rise of the surprisingly multicultural anti-immigration movement in Australia and, after asking what my book was about, some bloke started asking me for my genealogy and “how Australian I am” and really I thought that it was both a bit much and a serious misunderstanding of what someone who is reading a book with headphones on was looking for in a social interaction.) going through the folder of emails I haven’t managed to answer in the last couple of weeks looking for something fun to pass the time.

And I found one. Ravi Shroff from the Department of Applied Statistics, Social Science and Humanities at NYU (side note: applied statistics gets a short shrift in a lot of academic stats departments around the world, which is criminal. So I will always love a department that leads with it in the title. I’ll also say that my impression when I wandered in there for a couple of hours at some point last year was that, on top of everything else, this was an uncommonly friendly group of people. Really, it’s my second favourite statistics department in North America, obviously after Toronto who agreed to throw a man into a volcano every year as part of my startup package after I got really into both that Tori Amos album from 1996 and cultural appropriation. Obviously I’m still processing the trauma of being 11 in 1996 and singularly unable to sacrifice any young men to the volcano goddess.) sent me an email a couple of weeks ago about constructing interpretable decision rules.

(Meta-structural diversion: I started writing this with the new year, new me idea that every blog post wasn’t going to devolve into, say, 500 words on how Medúlla is Björk’s Joanne, but that resolution clearly lasted for less time than my tenure as an Olympic torch relay runner. But if you’ve not learnt to skip the first section of my posts by now, clearly reinforcement learning isn’t for you.)

#### To hell with good intentions

Ravi sent me his paper Simple rules for complex decisions by Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel and Daniel Goldstein and it’s one of those deals where the title really does cover the content.

This is my absolute favourite type of statistics paper: it eschews the bright shiny lights of ultra-modern methodology in favour of the much harder road of taking a collection of standard tools and shaping them into something completely new.

Why do I prefer the latter? Well it’s related to the age old tension between “state-of-the-art” methods and “stuff-people-understand” methods. The latter are obviously preferred as they’re much easier to push into practice. This is in spite of the former being potentially hugely more effective. Practically, you have to balance “black box performance” with “interpretability”. Where you personally land on that Pareto frontier is between you and your volcano goddess.

This paper proposes a simple decision rule for binary classification problems and shows fairly convincingly that it can be almost as effective as much more complicated classifiers.

#### There ain’t no fool in Ferguson

The paper proposes a Select-Regress-and-Round method for constructing decision rules that works as follows:

1. Select a small number $k$ of features $\mathbf{x}$ that will be used to build the classifier
2. Regress: Use a logistic-lasso to estimate the classifier $h(\mathbf{x}) = (\mathbf{x}^T\mathbf{\beta} \geq 0 \text{ ? } 1 \text{ : } 0)$.
3. Round: Choose $M$ possible levels of effect and build weights

$w_j = \text{Round} \left( \frac{M \beta_j}{\max_i|\beta_i|}\right)$.

The new classifier (which chooses between options 1 and 0) selects 1 if

$\sum_{j=1}^k w_j x_j > 0$.

In the paper they use $k=10$ features and $M = 3$ levels.  To interpret this classifier, we can consider each level as a discrete measure of importance.  For example, when we have $M=3$ we have seven levels of importance from “very high negative effect”, through “no effect”, to “very high positive effect”. In particular

• $w_j=0$: The $j$th feature has no effect
• $w_j= \pm 1$: The $j$th feature has a low effect (positive or negative)
• $w_j = \pm 2$: The $j$th feature has a medium effect (positive or negative)
• $w_j = \pm 3$: The $j$th feature has a high effect (positive or negative).

A couple of key things here make this idea work. Firstly, the initial selection phase allows people to “sense check” the initial group of features while also forcing the decision rule to only depend on a small number of features, which greatly improves the ability for people to interpret the rule. The second phase then works out which of those features are actually used (the number of active features can be less than $k$). Finally, the last phase gives a qualitative weight to each feature.
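The rounding and classification steps are simple enough to sketch in a few lines of R. This is a minimal illustration rather than the authors’ code: the coefficient values and feature names below are made up, standing in for whatever the logistic-lasso in step 2 would return.

```r
# Hypothetical coefficients standing in for a fitted logistic-lasso
# (made-up values and feature names, for illustration only).
beta <- c(age = -0.80, prior_fta = 1.20, charge_severity = 0.35, employed = -0.05)
M <- 3  # number of levels, as in the paper

# Round: rescale so the largest |beta| maps to M, then round to an integer.
w <- round(M * beta / max(abs(beta)))

# Classify: choose option 1 when the weighted sum of features is positive.
srr_classify <- function(x, w) as.integer(sum(w * x) > 0)

print(w)
srr_classify(c(1, 1, 0, 1), w)
```

In this toy example the last feature ends up with weight 0, so the number of active features is smaller than the number selected. One caveat if you try this yourself: R’s `round()` uses the “round half to even” convention, so a rescaled coefficient landing exactly on .5 may not round the way you expect.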

This is a transparent way of building a decision rule, as the effect of each feature used to make the decision is clearly specified.  But does it work?

#### She will only bring you happiness

The most surprising thing in this paper is that this very simple strategy for building a decision rule works fairly well. Probably unsurprisingly, complicated, uninterpretable decision rules constructed through random forests typically do work better than this simple decision rule. But the select-regress-round strategy doesn’t do too badly. It might be possible to improve the performance by tweaking the first two steps to allow for some low-order interactions. For binary features, this would allow for classifiers where neither X nor Y is a strong indicator of success, but their co-occurrence (XY) is.

Even without this tweak, the select-regress-round classifier performs about as well as logistic regression and logistic lasso models that use all possible features (see the above figure from the paper), although it performs worse than the random forest. It also doesn’t appear that the rounding process has too much of an effect on the quality of the classifier.

#### This man will not hang

The substantive example in this paper has to do with whether or not a judge decides to grant bail, where the event you’re trying to predict is a failure to appear at trial. The results in this paper suggest that the select-regress-round rule leads to a consistently lower rate of failure compared to the “expert judgment” of the judges.  It also works, on this example, almost as well as a random forest classifier.

There’s some cool methodology stuff in here about how to actually build, train, and evaluate classification rules when, for any particular experimental unit (person getting or not getting bail in this case), you can only observe one of the potential outcomes. This paper uses some ideas from the causal analysis literature to work around that problem.

I guess the real question I have about this type of decision rule for this sort of example is around how these sorts of decision rules would be applied in practice. In particular, would judges be willing to use this type of system? The obvious advantage of implementing it in practice is that it is data driven and, therefore, the decisions are potentially less likely to fall prey to implicit and unconscious biases. The obvious downside is that I am personally more than the sum of my demographic features (or other measurable quantities) and this type of system would treat me like the average person who shares the $k$ features with me.

### Sketchnotes from TWiML&AI #92: Learning State Representations with Yael Niv

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

These are my sketchnotes for Sam Charrington’s podcast This Week in Machine Learning and AI about Learning State Representations with Yael Niv: https://twimlai.com/twiml-talk-92-learning-state-representations-yael-niv/

You can listen to the podcast here.

In this interview Yael and I explore the relationship between neuroscience and machine learning. In particular, we discuss the importance of state representations in human learning, some of her experimental results in this area, and how a better understanding of representation learning can lead to insights into machine learning problems such as reinforcement and transfer learning. Did I mention this was a nerd alert show? I really enjoyed this interview and I know you will too. Be sure to send over any thoughts or feedback via the show notes page. https://twimlai.com/twiml-talk-92-learning-state-representations-yael-niv/

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### General Linear Models The Basics

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

# General Linear Models: The Basics

General linear models are one of the most widely used statistical tools
in the biological sciences. This may be because they are so flexible and
can address many different problems, because they provide useful
outputs about statistical significance AND effect sizes, or just because
they are easy to run in many common statistical packages.

The maths underlying General Linear Models (and Generalized linear
models, which are a related but different class of model) may seem
mysterious to many, but are actually pretty accessible. You would have
learned the basics in high school maths.

We will cover some of those basics here.

## Linear equations

As the name suggests General Linear Models rely on a linear equation,
which in its basic form is simply:

$y_i = \alpha + \beta x_i + \epsilon_i$

The equation for a straight line, with some error added on.

If you aren’t that familiar with mathematical notation, notice a few
things: I used normal characters for variables (i.e. things you measure)
and Greek letters for parameters, which are estimated when you fit the
model to the data.

$y_i$ are your response data; I indexed the y with i to
indicate that there are multiple observations. $x_i$ is
variously known as a covariate, predictor variable or explanatory
variable. α is an intercept that will be estimated. α has the same
units as y (e.g. if y is number of animals, then α is the expected
number of animals when x = 0).

β is a slope parameter that will also be estimated. β is also termed
the effect size because it measures the effect of x on y. β has units
of ‘y per x’. For instance, if x is temperature, then β has units of
number of animals per degree C. β thus measures how much we expect y
to change if x were to increase by 1.

Finally, don’t forget $\epsilon_i$, which is the error.
$\epsilon_i$ will measure the distance between each prediction of
$y_i$ made by the model and the observed value of
$y_i$.

These predictions will simply be calculated as:

$y_i = \alpha + \beta x_i$

(notice I just removed the $\epsilon_i$ from the end). You can
think of the linear predictions as the mean or ‘expected’ value a new
observation $y_i$ would take if we only knew
$x_i$, and also as the ‘line of best fit’.

## Simulating ideal data for a general linear model

Now we know the model, we can generate some idealized data. Hopefully
this will then give you a feel for how we can fit a model to data. Open
up R and we will create these parameters:

n <- 100
beta <- 2.2
alpha <- 30


Where n is the sample size and alpha and beta are as above.

We also need some covariate data, we will just generate a sequence of
n numbers from 0 to 1:

x <- seq(0, 1, length.out = n)


The model’s expectation is thus this straight line:

y_true <- beta * x + alpha
plot(x, y_true)


Because we made the model up, we can say this is the true underlying
relationship. Now we will add error to it and see if we can recover that
relationship with a general linear model.

Let’s generate some error:

sigma <- 2.4
set.seed(42)
error <- rnorm(n, sd = sigma)
y_obs <- y_true + error
plot(x, y_obs)
lines(x, y_true)


Here sigma is our standard deviation, which measures how much the
observations y vary around the true relationship. We then used rnorm
to generate n random normal numbers, that we just add to our predicted
line y_true to simulate observing this relationship.

Congratulations, you just created a (modelled) reality and simulated an
ecologist going out and measuring that reality.

Note the set.seed() command. This just ensures the random number
generator produces the same set of numbers every time it is run in R and
it is good practice to use it (so your code is repeatable). Here is a
great explanation of seed setting and why 42 is so popular.

Also, check out the errors:

hist(error)


Looks like a normal distribution hey? That’s because we generated them
from a normal distribution. That was a handy trick, because the basic
linear model assumes the errors are normally distributed (but not
necessarily the raw data).

Also note that sigma is constant (e.g. it doesn’t get larger as x gets
larger). That is another assumption of basic linear models called
‘homogeneity of variance’.

## Fitting a model

To fit a basic linear model in R we can use the lm() function:

m1 <- lm(y_obs ~ x)


It takes a formula argument, which simply says here that y_obs depends
on (the tilde ~) x. R will do all the number crunching to estimate
the parameters now.

To see what it came up with try:

coef(m1)

## (Intercept)           x
##   30.163713    2.028646


This command tells us the estimate of the intercept ((Intercept)) and
the slope on x under x. Notice they are close to, but not exactly the
same as alpha and beta. So the model has done a pretty decent job of
recovering our original process. The reason the values are not identical
is that we simulated someone going and measuring the real process with
error (that was when we added the normal random numbers).

We can get slightly more details about the model fit like this:

summary(m1)

##
## Call:
## lm(formula = y_obs ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.2467 -1.5884  0.1942  1.5665  5.3433
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  30.1637     0.4985  60.503   <2e-16 ***
## x             2.0286     0.8613   2.355   0.0205 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.511 on 98 degrees of freedom
## Multiple R-squared:  0.05357,    Adjusted R-squared:  0.04391
## F-statistic: 5.547 on 1 and 98 DF,  p-value: 0.0205


I’m not going to go overboard with explaining this output now, but
notice a few key things. With the summary, we get standard errors for
the parameter estimates (which is a measure of how much they might
vary). Also notice the R-squared, which can be handy. Finally, notice
that the Residual standard error is close to the value we used for
sigma, which is because it is an estimate of sigma from our
simulated data.

Your homework is to play around with the model and sampling process. Try
changing alpha, beta, n and sigma, then refit the model and see
what happens.

## Final few points

So did you do the homework? If you did, well done, you just performed a
simple power analysis (in the broad sense).

In a more formal power analysis (which is what you might have come
across previously) you could systematically vary n or beta, simulate
1000 randomised data sets for each setting, and then calculate the
proportion of those 1000 data sets in which your p-value was
‘significant’ (e.g. less than a critical threshold like the
ever-popular 0.05). This number tells you how good you are at
detecting ‘real’ effects.
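A rough sketch of that recipe, reusing the simulation set-up from earlier in this post. The function name `power_sim` and its default arguments are illustrative choices, not part of any package.

```r
set.seed(42)

# Estimate power by simulation: the proportion of simulated data sets
# in which the slope's p-value falls below the critical threshold.
power_sim <- function(n, beta, alpha = 30, sigma = 2.4,
                      nsims = 1000, crit = 0.05) {
  x <- seq(0, 1, length.out = n)
  pvals <- replicate(nsims, {
    y_obs <- alpha + beta * x + rnorm(n, sd = sigma)
    # p-value for the slope on x from the fitted linear model
    summary(lm(y_obs ~ x))$coefficients["x", "Pr(>|t|)"]
  })
  mean(pvals < crit)
}

# Power for the settings used above
power_sim(n = 100, beta = 2.2)

# How power changes with sample size
sapply(c(20, 50, 100, 200), power_sim, beta = 2.2)
```

Setting beta to 0 is a useful sanity check: with no real effect, the proportion of ‘significant’ results should sit close to the threshold itself.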

Here’s a great intro to power analysis in the broad sense: Bolker,
Ecological Models and Data in R.

One more point. Remember we mentioned some ‘assumptions’ above? Well, we
can check those in R quite easily:

plot(m1, 1)


This shows a plot of the residuals (A.K.A. errors) versus the predicted
values. We are looking for ‘heteroskedasticity’ which is a fancy way of
saying the errors aren’t equal across the range of predictions (remember
I said sigma is a constant?).
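To see what a violation looks like, here is a small sketch that simulates data where the error standard deviation grows with x. The error model (`0.5 + 4 * x`) is an arbitrary assumption chosen only to make the pattern obvious.

```r
set.seed(42)
n <- 100
x <- seq(0, 1, length.out = n)

# Assumed error model for illustration: standard deviation increases with x.
sigma_x <- 0.5 + 4 * x
y_het <- 30 + 2.2 * x + rnorm(n, sd = sigma_x)

m_het <- lm(y_het ~ x)
plot(m_het, 1)  # the spread of the residuals is no longer constant

# A crude numeric check: residual spread for small x versus large x.
spread <- tapply(resid(m_het), x > 0.5, sd)
spread
```

Compare this plot with the one from m1, where the residuals should form a band of roughly even width.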

Another good plot:

plot(m1, 2)


Here we are looking for deviations of the points from the line. Points
on the line mean the errors are approximately normally distributed,
which was a key assumption. Points far from the line could indicate the
errors are skewed left or right, too fat in the middle, or too skinny
in the middle. More on that issue here.

## The end

So the basics might belie the true complexity of situations we can
address with General Linear Models and their relatives Generalized
Linear Models. But, just to get you excited, here are a few things you
can do by adding on more terms to the right hand side of the linear
equation:

1. Model multiple, interacting covariates.
2. Include factors as covariates (instead of continuous variables). Got
a factor and a continuous variable? Don’t bother with the old-school
ANCOVA method, just use a linear model.
3. Include a spline to model non-linear effects (that’s a GAM).
4. Account for hierarchies in your sampling, like transects sampled
within sites (that’s a mixed effects model)
5. Account for spatial or temporal dependencies.
6. Model varying error variance (e.g. when the variance increases with the mean).

You can also change the left-hand side, so that it no longer assumes
normality (then that’s a Generalized Linear Model). Or even add
chains of models together to model pathways of cause and effect (that’s
a ‘path analysis’ or ‘structural equation model’)
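A few of these extensions need nothing beyond base R. The sketch below uses made-up simulated data (the variable names and effect sizes are assumptions for illustration) to show how the formula changes; GAMs and mixed effects models need add-on packages and are not shown.

```r
set.seed(42)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
site <- factor(rep(c("a", "b"), each = n / 2))

# Simulated response with an interaction and a site effect (made-up values).
y <- 30 + 2.2 * x1 - 1.5 * x2 + 3 * x1 * x2 +
  ifelse(site == "b", 2, 0) + rnorm(n, sd = 2.4)

# 1. Multiple, interacting covariates: x1 * x2 expands to x1 + x2 + x1:x2.
m_int <- lm(y ~ x1 * x2)

# 2. A factor plus a continuous variable (the 'ANCOVA' situation).
m_fac <- lm(y ~ x1 + site)

# Changing the left-hand side: count data with a Poisson
# Generalized Linear Model.
counts <- rpois(n, lambda = exp(1 + 0.8 * x1))
m_pois <- glm(counts ~ x1, family = poisson)

coef(m_int)
coef(m_fac)
coef(m_pois)
```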

If this taster has left you keen to learn more, then check out any one
of the zillion online courses or books on GLMs with R, or if you can get
to Brisbane, come to our next course (which as of writing was in Feb
2018, but we do them regularly).

Now you know the basics, practice, practice, practice and pretty soon
you will be running General Linear Models behind your back while you
watch your 2 year old, which is what I do for kicks.


### Mapping a list of functions to a list of datasets with a list of columns as arguments

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

This week I had the opportunity to teach R at my workplace, again. This course was the “advanced R” course, and unlike the one I taught at the end of last year, I had one more day (so 3 days in total) where I could show my colleagues the joys of the tidyverse and R.

To finish the section on programming with R, which was the very last section of the whole 3 day course, I wanted to blow their minds; I had already shown them packages from the tidyverse in the previous days, such as dplyr, purrr and stringr, among others. I taught them how to use ggplot2, broom and modelr. They also liked janitor and rio very much. I noticed that it took them a bit more time and effort to digest purrr::map() and purrr::reduce(), but they all seemed to see how powerful these functions were. To finish on a very high note, I showed them the ultimate purrr::map() use case.

Consider the following; imagine you have a situation where you are working on a list of datasets. These datasets might be the same, but for different years, or for different countries, or they might be completely different datasets entirely. If you used rio::import_list() to read them into R, you will have them in a nice list. Let’s consider the following list as an example:

library(tidyverse)
data(mtcars)
data(iris)

data_list = list(mtcars, iris)

I made the choice to have completely different datasets. Now, I would like to map some functions to the columns of these datasets. If I only worked on one, for example on mtcars, I would do something like:

my_summarise_f = function(dataset, cols, funcs){
  dataset %>%
    summarise_at(vars(!!!cols), funs(!!!funcs))
}

And then I would use my function like so:

mtcars %>%
  my_summarise_f(quos(mpg, drat, hp), quos(mean, sd, max))
##   mpg_mean drat_mean  hp_mean   mpg_sd   drat_sd    hp_sd mpg_max drat_max
## 1 20.09062  3.596563 146.6875 6.026948 0.5346787 68.56287    33.9     4.93
##   hp_max
## 1    335

my_summarise_f() takes a dataset, a list of columns and a list of functions as arguments and uses tidy evaluation to apply mean(), sd(), and max() to the columns mpg, drat and hp of mtcars. That’s pretty useful, but not useful enough! Now I want to apply this to the list of datasets I defined above. For this, let’s define the list of columns I want to work on:

cols_mtcars = quos(mpg, drat, hp)
cols_iris = quos(Sepal.Length, Sepal.Width)

cols_list = list(cols_mtcars, cols_iris)

Now, let’s use some purrr magic to apply the functions I want to the columns I have defined in cols_list:

map2(data_list,
     cols_list,
     my_summarise_f, funcs = quos(mean, sd, max))
## [[1]]
##   mpg_mean drat_mean  hp_mean   mpg_sd   drat_sd    hp_sd mpg_max drat_max
## 1 20.09062  3.596563 146.6875 6.026948 0.5346787 68.56287    33.9     4.93
##   hp_max
## 1    335
##
## [[2]]
##   Sepal.Length_mean Sepal.Width_mean Sepal.Length_sd Sepal.Width_sd
## 1          5.843333         3.057333       0.8280661      0.4358663
##   Sepal.Length_max Sepal.Width_max
## 1              7.9             4.4

That’s pretty useful, but not useful enough! I also want to apply different functions to different datasets!

Well, let’s define a list of functions then:

funcs_mtcars = quos(mean, sd, max)
funcs_iris = quos(median, min)

funcs_list = list(funcs_mtcars, funcs_iris)

Because there is no map3(), we need to use pmap():

pmap(
  list(
    dataset = data_list,
    cols = cols_list,
    funcs = funcs_list
  ),
  my_summarise_f)
## [[1]]
##   mpg_mean drat_mean  hp_mean   mpg_sd   drat_sd    hp_sd mpg_max drat_max
## 1 20.09062  3.596563 146.6875 6.026948 0.5346787 68.56287    33.9     4.93
##   hp_max
## 1    335
##
## [[2]]
##   Sepal.Length_median Sepal.Width_median Sepal.Length_min Sepal.Width_min
## 1                 5.8                  3              4.3               2

Now I’m satisfied! Let me tell you, this blew their minds!

To be able to use things like that, I told them to always solve a problem for a single example, and from there, try to generalize their solution using functional programming tools found in purrr.
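For comparison, the same “walk several lists in parallel” pattern can be sketched in base R with Map(). This is not a drop-in replacement for the tidy evaluation version above: `summarise_base` is a hypothetical helper written for this illustration, columns are given as character vectors, and the result is a matrix rather than a one-row data frame.

```r
data(mtcars)
data(iris)

data_list  <- list(mtcars = mtcars, iris = iris)
cols_list  <- list(c("mpg", "drat", "hp"),
                   c("Sepal.Length", "Sepal.Width"))
funcs_list <- list(list(mean = mean, sd = sd, max = max),
                   list(median = median, min = min))

# Hypothetical helper: apply every function to every selected column.
# Returns a matrix with one row per column and one column per function.
summarise_base <- function(dataset, cols, funcs) {
  sapply(funcs, function(f) sapply(dataset[cols], f))
}

# Map() walks the three lists in parallel, much like pmap().
res <- Map(summarise_base, data_list, cols_list, funcs_list)
res
```

The idea is the same as before: get `summarise_base()` working on one dataset first, then let Map() handle the generalisation.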

If you found this blog post useful, you might want to follow me on twitter for blog post updates.


### SatRday in South Africa

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

Jumping Rivers is proud to be sponsoring the upcoming SatRday conference in Cape Town, South Africa on 17th March 2018.

## What is SatRday?

SatRdays are a collection of free/cheap accessible R conferences organised by members of the R community at various locations across the globe. Each SatRday looks to provide talks and/or workshops by R programmers covering the language and its applications and is run as a not-for-profit event. They provide a great place to meet like-minded people, be it researchers, data scientists, developers or enthusiasts, to discuss your passion for R programming.

## SatRday in Cape Town

This year’s SatRday in Cape Town has a collection of workshops on the days running up to the conference on the Saturday. For more detailed information concerning speakers, workshop topics and registration head on over to http://capetown2018.satrdays.org/ .

## Be in it to win it

In addition to sponsoring the conference this year, Jumping Rivers is also giving you the chance to win a free ticket. To be in with a chance just respond to the tweet below:


## January 18, 2018

### The Generalization Mystery: Sharp vs Flat Minima

I set out to write about the following paper I saw people talk about on twitter and reddit:

It's related to this pretty insightful paper:

Inevitably, I started thinking more generally about flat and sharp minima and generalization, so rather than describing these papers in detail, I ended up dumping some thoughts of my own. Feedback and pointers to literature are welcome, as always.

#### Summary of this post

• Flatness of minima is hypothesized to have something to do with generalization in deep nets.
• As Dinh et al (2017) show, flatness is sensitive to reparametrization and thus cannot predict generalization ability alone.
• Li et al (2017) use a form of parameter normalization to make their method more robust to reparametrization and produce some fancy plots comparing deep net architectures.
• While this analysis is now invariant to the particular type of reparametrizations considered by Dinh et al, it may still be sensitive to other types of invariances, so I'm not sure how much to trust these plots and conclusions.
• Then I go back to square one and ask how one could construct indicators of generalization that are invariant by construction, for example by considering ratios of flatness measures.
• Finally, I have a go at developing a local measure of generalization from first principles. The resulting metric depends on the data and statistical properties of gradients calculated from different minibatches.

## Flatness, Generalization and SGD

The loss surface of deep nets tends to have many local minima. Many of these might be equally good in terms of training error, but they may have widely different generalization performance, i.e. a network with minimal training loss might perform very well, or very poorly, on a held-out test set. Interestingly, stochastic gradient descent (SGD) with small batch sizes appears to locate minima with better generalization properties than large-batch SGD. So the big question is: what measurable property of a local minimum can we use to predict generalization properties? And how does this relate to SGD?

There is speculation dating back to at least Hochreiter and Schmidhuber (1997) that the flatness of the minimum is a good measure to look at. However, as Dinh et al (2017) pointed out, flatness is sensitive to reparametrizations of the neural network: we can reparametrize a neural network without changing its outputs while making sharp minima look arbitrarily flat and vice versa. As a consequence the flatness alone cannot explain or predict good generalization.

Li et al (2017) proposed a normalization scheme which scales the space around a minimum in such a way that the apparent flatness in 1D and 2D plots is kind of invariant to the type of reparametrization Dinh et al used. This, they say, allows us to produce more faithful visualizations of the loss surfaces around a minimum. They even use 1D and 2D plots to illustrate differences between different architectures, such as a VGG and a ResNet. I personally do not buy the conclusions of this paper, and it seems the reviewers of the ICLR submission largely agreed on this. The proposed method is weakly motivated and only addresses one possible type of reparametrization.

## Contrastive Flatness measures

Following the thinking by Dinh et al, if generalization is a property which is invariant under reparametrization, the quantity we use to predict generalization should also be invariant. My intuition is that a good way to achieve invariance is to consider the ratio between two quantities - maybe two flatness measures - which are affected by reparametrization in the same way.

One thing I think would make sense to look at is the average flatness of the loss in a single minibatch vs the flatness of the average loss. Why would this make sense? The average loss can be flat around a minimum in different ways: it can be flat because it is the average of flat functions which all look very similar and whose minimum is very close to the same location; or it can be flat because it is the average of many sharp functions with minima at locations scattered around the minimum of the average.

Intuitively, the former solution is more stable with respect to subsampling of data, therefore it should be more favourable from a generalization viewpoint. The latter solution is very sensitive to which particular minibatch we are looking at, so presumably it may give rise to worse generalization.

As a conclusion of this section, I don't think it makes sense to look at only the flatness of the average loss. Looking at how that flatness is affected by subsampling the data somehow feels more key to generalization.

## A local measure of generalization

After Jorge Nocedal's ICLR talk on large-batch SGD, Léon Bottou had a comment which I think hit the nail on the head. The process of sampling minibatches from training data kind of simulates the effect of sampling the training set and the test set from some underlying data distribution. Therefore, you might think of generalization from one minibatch to another as a proxy for how well a method would generalize from a training set to a test set.

How can we use this insight to come up with some sort of measure of generalization based on minibatches, especially along the lines of sharpness or local derivatives?

First of all, let's consider the stochastic process $f(\theta)$ which we obtain by evaluating the loss function on a random minibatch. The randomness comes from subsampling the data. This is a probability distribution over loss functions over $\theta$. I think it's useful to seek an indicator of generalization ability as a local property of this stochastic process at any given $\theta$ value.

Let's pretend for a minute that each draw $f(\theta)$ from this process is convex or at least has a unique global minimum. How would one describe a model's generalization from one minibatch to another in terms of this stochastic process?

Let's draw two functions $f_1(\theta)$ and $f_2(\theta)$ independently (i.e. evaluate the loss on two separate minibatches). I propose that the following would be a meaningful measure:

$$R = f_2 (\operatorname{argmin}_\theta f_1(\theta)) - \min_\theta f_2(\theta)$$

Basically: you care about finding low error according to $f_2$ but all you have access to is $f_1$. You therefore look at what the value of $f_2$ is at the location of the minimum of $f_1$ and compare that to the global minimal value of $f_2$. This is a sort of regret expression, hence my use of $R$ to denote it.

Now, in deep learning the loss functions $f_1$ and $f_2$ are not convex and have many local minima, so this definition is not particularly useful in general. However, it makes sense to calculate this value locally, in a small neighbourhood of a particular parameter value $\theta$. Let's consider fitting a restricted neural network model, where only parameters within a certain $\epsilon$ distance from $\theta$ are allowed. If $\epsilon$ is small enough, we can assume the loss functions have a unique global minimum within this $\epsilon$-ball. Furthermore, if $\epsilon$ is small enough, one can use a first-order Taylor-approximation to $f_1$ and $f_2$ to analytically find approximate minima within the $\epsilon$-ball. To do this, we just need to evaluate the gradient at $\theta$. This is illustrated in the figure below:

The left-hand panel shows an imaginary loss function evaluated on some minibatch $f_1$, restricted to the $\epsilon$-ball around $\theta$. We can assume $\epsilon$ is small enough so $f_1$ is linear within this local region. Unless the gradient is exactly $0$, the minimum will fall on the surface of the $\epsilon$-ball, exactly at $\theta - \epsilon \frac{g_1}{\|g_1\|}$ where $g_1$ is the gradient of $f_1$ at $\theta$. This is shown by the yellow star. On the right-hand panel I show $f_2$. This is also locally linear, but its gradient $g_2$ might be different. The minimum of $f_2$ within the $\epsilon$-ball is at $\theta - \epsilon \frac{g_2}{\|g_2\|}$, shown by the red star. We can consider the regret-type expression as above, by evaluating $f_2$ at the yellow star, and subtracting its value at the red star. This can be expressed as follows (I divided by $\epsilon$):

$$\frac{R(\theta, f_1, f_2)}{\epsilon} \rightarrow - \frac{g_2^\top g_1}{\|g_1\|} + \frac{g_2^\top g_2}{\|g_2\|} = \|g_2\| - \frac{g_2^\top g_1}{\|g_1\|} = \|g_2\|\left(1 - \cos(g_1, g_2)\right)$$

In practice one would consider taking an expectation with respect to the two minibatches to obtain an expression that depends on $\theta$. So, we have just come up with a local measure of generalization ability, which is expressed in terms of expectations involving gradients over different minibatches. The measure is local as it is specific for each value of $\theta$. It is data-dependent in that it depends on the distribution $p_\mathcal{D}$ from which we sample minibatches.
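To make this concrete, here is a minimal R sketch of the measure and a Monte Carlo estimate of its expectation. The toy problem (a least-squares line fit), the batch size and the function names are all assumptions made for illustration; nothing here comes from the papers discussed above.

```r
# Local regret-type measure from two minibatch gradients:
# ||g2|| * (1 - cos(g1, g2)). Assumes neither gradient is exactly zero.
local_regret <- function(g1, g2) {
  norm2 <- function(v) sqrt(sum(v^2))
  cos12 <- sum(g1 * g2) / (norm2(g1) * norm2(g2))
  norm2(g2) * (1 - cos12)
}

# Toy problem (an assumption): theta = (intercept, slope),
# loss = mean squared error on a minibatch.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

grad_mb <- function(theta, idx) {
  r <- theta[1] + theta[2] * x[idx] - y[idx]  # residuals on the minibatch
  c(mean(2 * r), mean(2 * r * x[idx]))        # gradient of the mean squared error
}

# Monte Carlo estimate of the expectation over pairs of minibatches.
expected_regret <- function(theta, batch_size = 32, n_pairs = 500) {
  mean(replicate(n_pairs, {
    g1 <- grad_mb(theta, sample(n, batch_size))
    g2 <- grad_mb(theta, sample(n, batch_size))
    local_regret(g1, g2)
  }))
}

expected_regret(c(0, 0))  # a point far from the least-squares solution
expected_regret(c(1, 2))  # a point near the least-squares solution
```

Two effects compete here: far from the solution the gradients are large but point in similar directions, while near it they are small but their directions are much noisier.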

This measure depends on two things:

• the expected similarity of gradients which come from different minibatches: $1 - \cos(g_1, g_2)$ looks at whether various minibatches of data push $\theta$ in similar directions. In regions where the gradients are sampled from a mostly spherically symmetric distribution, this term would be close to $1$ most of the time.
• the magnitude of gradients $\|g_2\|$. Interestingly, one can express this as $\sqrt{\operatorname{trace}\left(g_2 g_2^\top\right)}$.

When we take the expectation over this, assuming that the $1 - \cos(g_1, g_2)$ term is mostly $1$, we end up with the expression $\mathbb{E}_g \sqrt{\operatorname{trace}\left(g g^\top\right)}$ where the expectation is taken over minibatches. Note that the trace-norm of the empirical Fisher information matrix $\sqrt{ \operatorname{trace} \mathbb{E}_g \left(g g^\top\right)}$ can be used as a measure of flatness of the average loss around minima, so there may be some interesting connections there. However, due to Jensen's inequality the two things are not actually the same.
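To spell out the Jensen step (writing $g$ for a minibatch gradient): because the square root is concave,

$$\mathbb{E}_g \|g\| = \mathbb{E}_g \sqrt{\operatorname{trace}\left(g g^\top\right)} \leq \sqrt{\mathbb{E}_g \operatorname{trace}\left(g g^\top\right)} = \sqrt{\operatorname{trace} \mathbb{E}_g \left(g g^\top\right)}$$

so the expected gradient norm is bounded above by the Fisher-based quantity. The difference between the squares of the two sides is exactly the variance of $\|g\|$ across minibatches, so they coincide only when the gradient norm does not vary from one minibatch to another.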

Update - thanks to reddit user bbsome for pointing this out:

Note that R is not invariant under reparametrization either. The source of this sensitivity is the fact that I considered an $\epsilon$-ball in Euclidean norm around $\theta$. The right way to get rid of this is to consider an $\epsilon$-ball using the symmetrized KL divergence instead of the Euclidean norm, similarly to how natural gradient methods can be derived. If we do this, the formula becomes dependent only on the functions the neural network implements, not on the particular choice of parametrization. I leave it as homework for people to work out how this would change the formulae.

# Summary

This post started out as a paper review, but in the end I didn't find the paper too interesting and instead resorted to sharing ideas about tackling the generalization puzzle a bit differently. It's entirely possible that people have done this analysis before, or that it's completely useless. In any case, I welcome feedback.

The first observation here was that a good indicator may involve not just the flatness of the average loss around the minimum, but a ratio between two flatness indicators. Such metrics may end up invariant under reparametrization by construction.

Taking this idea further I attempted to develop a local indicator of generalization performance which goes beyond flatness. It also includes terms that measure the sensitivity of gradients to data subsampling.

Because data subsampling occurs both in generalization (training vs test set) and in minibatch-SGD, these kinds of measures might shed some light on how SGD enables better generalization.

### JupyterCon 2018: Call For Proposal

Dear fellow Jovyans,

It is with great pleasure that we are opening the Call For Proposals (CFP) for JupyterCon 2018!

Last August, Project Jupyter, the NumFOCUS Foundation, and O’Reilly Media came together to host JupyterCon 2017. For its first offering we attracted over 700 attendees and 23 scholarship recipients across 4 days of talks and tutorials with access to 5 parallel session tracks totaling 11 keynotes, 55 talks, 8 tutorials, and 2 training courses. In addition, Community Day, open to everyone, was held at the end of the conference. The conference also featured 33 poster sessions as a starting point for further discussion and was a huge success. Videos of the event have been made available on Safari Online and YouTube.

### JupyterCon 2018, CFP Open

JupyterCon 2017 was a huge success and we’ve been working hard since then to make JupyterCon 2018 even better. It will be held in New York City in August from Tuesday the 21st to Friday the 24th. We’ll also host an open Community Day on August 25th, which will be open to everyone.

Today we are happy to open the conference website and open the Call For Proposal with submissions due by early March. A couple of changes have been made to the CFP since last year. In particular if your talk is not accepted, you can ask us to automatically consider the proposal for the poster session.

We encourage you to submit a proposal, and reach out to us if you have any questions. We’ll do our best to help you and give you feedback on your proposal.

Like last year, we will have diversity and student scholarships available; further information will be provided on the website. We also encourage you to follow the JupyterCon Twitter account for announcements or corrections.

### Community Day

The final day of JupyterCon 2017 was a blast with a large number of people making their first contribution to the Jupyter codebase, to the documentation, editing the wiki, or deploying it in the cloud. During the conference days, a separate room was also reserved for user testing of different Jupyter software, which proved to be a fantastic source of feedback for User Experience (UX) and driving various Jupyter Tools forward.

We are happy to offer this “Community Day” experience again. At JupyterCon 2017, the Saturday was branded “Sprints” with the connotation of a code-centric experience. While we’re happy to see users coming to “Sprint” on code, we want to let you know that the Community Day will be open to anyone. Whether you are a teacher, coder, researcher, or user of Jupyter, the Community Day will have something for you. The Community Day is not limited to attendees of the main JupyterCon event, and it’s intended to be a “grass-roots” celebration of Jupyter and its community. We hope to see you at JupyterCon 2018.

### Thanks

JupyterCon 2018 would not be possible without O’Reilly Media, NumFOCUS, and our sponsors.

JupyterCon 2018: Call For Proposal was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

### Registration and talk proposals now open for useR!2018

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Registration is now open for useR! 2018, the official R user conference to be held in Brisbane, Australia July 10-13. If you haven't been to a useR! conference before, it's a fantastic opportunity to meet and mingle with other R users from around the world, see talks on R packages and applications, and attend tutorials for deep dives on R-related topics. This year's conference will also feature keynotes from Jenny Bryan, Steph De Silva, Heike Hofmann, Thomas Lin Pedersen, Roger Peng and Bill Venables. It's my favourite conference of the year, and I'm particularly looking forward to this one.

This video from last year's conference in Brussels (a sell-out with over 1,100 attendees) will give you a sense of what a useR! conference is like:

The useR! conference is brought to you by the R Foundation and is 100% community-led. That includes the content: the vast majority of talks come directly from R users. If you've written an R package, performed an interesting analysis with R, or simply have something to share of interest to the R community, consider proposing a talk by submitting an abstract. Most talks are 20 minutes, but you can also propose a 5-minute lightning talk or a poster. If you're not sure what kind of talk you might want to give, check out the program from useR!2017 for inspiration. R-Ladies, which promotes gender diversity in the R community, can also provide guidance on abstracts. Note that all proposals must comply with the conference code of conduct.

Early-bird registrations close on March 15, and while general registration will be open until June my advice is to get in early, as this year's conference is likely to sell out once again. If you want to propose a talk, submissions are due by March 2 (but early submissions have a better chance of being accepted). Follow the links below to register or submit an abstract, and I look forward to seeing you in Brisbane!

useR! 2018: Registration; Abstract submission

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### How to Write a Bootcamp Review that Actually Helps People

Editor's note: This post was written as part of a collaboration with SwitchUp, an online platform for researching and reviewing technology learning programs. Erica Freedman is a Content and Client Services Specialist at SwitchUp.

Data Science is a rapidly growing industry. From university programs to week-long cohorts, it can be difficult to decide where to start. Much like the “Best coffee in all of America,” sign in your local diner’s window, every boot camp or school website tells you they are the best in the game. Based on SwitchUp’s research, “there are currently over 120 in-person bootcamps and hundreds of part-time and online programs available worldwide.” While choice can be good, it can also be daunting.

How can you be sure you’re picking the right program?

For quality control, students have taken to using reviews and ratings from graduates to eliminate the less-than-satisfactory programs saturating the market. Detailed reviews take students beyond marketing materials or publicity, and provide valuable first-hand experience. On-the-ground perspectives are often a deciding factor when students are looking to change careers. They can help students understand the big picture, from the beginning of their research through to a career in tech.

If you are a bootcamp grad (or soon-to-be grad), your perspective can help “pay it forward” to the next cohort of students, and give your school helpful feedback as well. Think back to when you were trying to find the best program possible and write a review from that perspective. What do you wish you had seen or heard before entering a bootcamp?

We suggest the following tips to write a review that is valuable to future students.

## Weigh the Pros and Cons

Even if your bootcamp was the most perfect experience of your life, there is always room for improvement. Do researching students a favor and cover the positive aspects of your program experience while balancing this out with constructive criticism. This feedback not only helps those looking to join the school, but also the school itself.

At SwitchUp, we’ve found that prospective students are most interested in the quality of the curriculum, teaching staff, and job support, so be sure to mention your thoughts on these areas. If your school has multiple campuses then you’ll want to list the campus you attended, as these variables change from campus to campus.

## Talk About Your Complete Experience: Before, During, and After The Bootcamp

Have you ever seen a review that says, “It was great!” or “I hated it.”? Although these are technically reviews, neither is helpful to prospective students. What made the bootcamp great? Was it the teachers? The length of the courses? The location of the campus? There are so many variables to consider when thinking about the application process straight through to a job offer.

As you write a review, include how the program helped you to become immersed in the world of Data Science as well as how it helped you succeed after graduation. For example: Did the pre-work give you a useful introduction to the Data Science industry? Did career services help you ace an interview with your dream company? The complete picture will show future bootcampers how the program can help them both learn to code and meet their career goals.

SwitchUp has interviewed a wide range of bootcamp students. What is your story? Maybe you embarked on a career change into Data Science from a completely different background. Or maybe you took a semester off from college to simply gain skills at a bootcamp. Whatever the case may be, your path will show other students what’s possible.

This perspective is especially helpful if you do not have a Data Science, Coding or Computer Science background, since many bootcamp students come from different fields. Your story will show future students that as long as they are committed, they too can switch to a tech career.

## Where to write your review

Many bootcamp alumni are choosing to leave reviews on sites like Quora and Medium, or on a review site like SwitchUp.

If you are interested in writing a review of Dataquest, check out their SwitchUp reviews page here. Plus, you will automatically be entered to win one of five $100 Amazon gift cards or one $500 Amazon gift card grand prize from SwitchUp once you submit a verified review. This sweepstakes ends in March, so get going!

### Document worth reading: “Fairness in Supervised Learning: An Information Theoretic Approach”

Automated decision making systems are increasingly being used in real-world applications. In these systems for the most part, the decision rules are derived by minimizing the training error on the available historical data. Therefore, if there is a bias related to a sensitive attribute such as gender, race, religion, etc. in the data, say, due to cultural/historical discriminatory practices against a certain demographic, the system could continue discrimination in decisions by including the said bias in its decision rule. We present an information theoretic framework for designing fair predictors from data, which aim to prevent discrimination against a specified sensitive attribute in a supervised learning setting. We use equalized odds as the criterion for discrimination, which demands that the prediction should be independent of the protected attribute conditioned on the actual label. To ensure fairness and generalization simultaneously, we compress the data to an auxiliary variable, which is used for the prediction task. This auxiliary variable is chosen such that it is decontaminated from the discriminatory attribute in the sense of equalized odds. The final predictor is obtained by applying a Bayesian decision rule to the auxiliary variable. Fairness in Supervised Learning: An Information Theoretic Approach

### Are you monitoring your machine learning systems?

How are you monitoring your Python applications? Take the short survey - the results will be published on KDnuggets and you will get all the details.

### Check your prior posterior overlap (PPO) – MCMC wrangling in R made easy with MCMCvis

(This article was first published on R – Lynch Lab, and kindly contributed to R-bloggers)

When fitting a Bayesian model using MCMC (often via JAGS/BUGS/Stan), a number of checks are typically performed to make sure your model is worth interpreting without further manipulation (remember: all models are wrong, some are useful!):

• R-hat (AKA Gelman-Rubin statistic) – used to assess convergence of chains in the model
• Visual assessment of chains – used to assess whether posterior chains mixed well (convergence)
• Visual assessment of posterior distribution shape – used to determine if the posterior distribution is constrained
• Posterior predictive check (predicting data using estimated parameters) – used to make sure that the model can generate the data used in the model

## PPO

One check, however, is often missing: a robust assessment of the degree to which the prior is informing the posterior distribution. Substantial influence of the prior on the posterior may not be apparent through the use of R-hat and visual checks alone. Version 0.9.2 of MCMCvis (now available on CRAN) makes quantifying and plotting the prior posterior overlap (PPO) simple.

MCMCvis is an R package designed to streamline analysis of Bayesian model results derived from MCMC samplers (e.g., JAGS, BUGS, Stan). It can be used to easily visualize, manipulate, and summarize MCMC output. The newest version is full of new features – a full tutorial can be found here.

## An example

To check PPO for a model, we will use the function MCMCtrace. As the function is used to generate trace and density plots, checking for PPO is barely more work than just doing the routine checks that one would ordinarily perform. The function plots trace plots on the left and density plots for both the posterior (black) and prior (red) distributions on the right. The function calculates the percent overlap between the prior and posterior and prints this value on the plot. See ?MCMCtrace in R for details regarding the syntax.

#install package
install.packages('MCMCvis', repos = "http://cran.case.edu")

require(MCMCvis)

data(MCMC_data)

#simulate data from the prior used in your model
#number of iterations should equal the number of draws times the number of chains (although the function will adjust if the correct number of iterations is not specified)
#in JAGS: parameter ~ dnorm(0, 0.001)
#JAGS parameterizes dnorm by precision, so sd = 1 / sqrt(0.001), approximately 32
PR <- rnorm(15000, 0, 32)

#run the function for just beta parameters
MCMCtrace(MCMC_data, params = 'beta', priors = PR, pdf = FALSE)
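To give a feel for what a percent-overlap number measures, the sketch below computes a simple histogram-based overlap between prior and posterior draws in Python. This is only one plausible way to quantify overlap and is not MCMCvis's own implementation; `percent_overlap`, `precision_to_sd`, the bin count, and the simulated posterior are all hypothetical choices made for illustration.

```python
import math
import random


def precision_to_sd(precision):
    """Convert a JAGS/BUGS-style precision into a standard deviation."""
    return 1.0 / math.sqrt(precision)


def percent_overlap(prior, posterior, bins=50):
    """Percent of probability mass shared by two samples, using common bins."""
    lo = min(min(prior), min(posterior))
    hi = max(max(prior), max(posterior))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def proportions(sample):
        counts = [0] * bins
        for v in sample:
            i = min(int((v - lo) / width), bins - 1)  # clamp the top edge
            counts[i] += 1
        return [c / len(sample) for c in counts]

    p, q = proportions(prior), proportions(posterior)
    return 100.0 * sum(min(a, b) for a, b in zip(p, q))


rng = random.Random(1)
sd = precision_to_sd(0.001)                              # about 31.6
prior = [rng.gauss(0.0, sd) for _ in range(15000)]
posterior = [rng.gauss(5.0, 2.0) for _ in range(15000)]  # made-up posterior draws
print(percent_overlap(prior, posterior))
```

A posterior that is much narrower than the prior gives a small overlap, while a posterior that simply reproduces the prior gives an overlap near 100%.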

## Why check?

Checking the PPO has particular utility when trying to determine if the parameters in your model are identifiable. If substantial PPO exists, the prior may simply be dictating the posterior distribution – the data may have little influence on the results. If a small degree of PPO exists, the data was informative enough to overcome the influence of the prior. In the field of ecology, nonidentifiability is a particular concern in some types of mark-recapture models. Gimenez (2009) developed quantitative guidelines to determine when parameters are robustly identifiable using PPO.

While a large degree of PPO is not always a bad thing (e.g., substantial prior knowledge about the system may result in very informative priors used in the model), it is important to know where data was and was not informative for parameter estimation. The degree of PPO that is acceptable for a particular model will depend on a great number of factors, and may be somewhat subjective (but see Gimenez [2009] for a less subjective case). Like other checks, PPO is just one of many tools to be used for model assessment. Finding substantial PPO when unexpected may suggest that further model manipulation is needed. Happy model building!

## Other MCMCvis improvements

Check out the rest of the new package features, including the option to calculate the number of effective samples for each parameter, the ability to take arguments in the form of a ‘regular expression’ for the params argument, and the ability to retain the structure of all parameters in model output (e.g., parameters specified as matrices in the model are summarized as matrices).

Follow Casey Youngflesh on Twitter @caseyyoungflesh. The MCMCvis source code can be found on GitHub.

### If you did not already know

We present a probabilistic model with discrete latent variables that control the computation time in deep learning models such as ResNets and LSTMs. A prior on the latent variables expresses the preference for faster computation. The amount of computation for an input is determined via amortized maximum a posteriori (MAP) inference. MAP inference is performed using a novel stochastic variational optimization method. The recently proposed Adaptive Computation Time mechanism can be seen as an ad-hoc relaxation of this model. We demonstrate training using the general-purpose Concrete relaxation of discrete variables. Evaluation on ResNet shows that our method matches the speed-accuracy trade-off of Adaptive Computation Time, while allowing for evaluation with a simple deterministic procedure that has a lower memory footprint. …

Neo4j

Neuro-Index
The article describes a new data structure called neuro-index. It is an alternative to well-known file indexes. The neuro-index is fundamentally different because it stores weight coefficients in a neural network. It is not a reference type like ‘keyword-position in a file’. …

### Personality Tests Are Failing American Workers

My newest Bloomberg View article just came out:

#### Personality Tests Are Failing American Workers

##### All too often, they filter people out for the wrong reasons.

Read all of my Bloomberg View pieces here.

### We were measuring the speed of Stan incorrectly—it’s faster than we thought in some cases due to antithetical sampling

Aki points out that in cases of antithetical sampling, our effective sample size calculations were unduly truncated above at the number of iterations. It turns out the effective sample size can be greater than the number of iterations if the draws are anticorrelated. And all we really care about for speed is effective sample size per unit time.

NUTS can be antithetical

The desideratum for a sampler Andrew laid out to Matt was to maximize expected squared transition distance. Why? Because that’s going to maximize effective sample size. (I still hadn’t wrapped my head around this when Andrew was laying it out.) Matt figured out how to achieve this goal by building an algorithm that simulated the Hamiltonian forward and backward in time at random, doubling the time at each iteration, and then sampling from the path with a preference for the points visited in the final doubling. This tends to push iterations away from their previous values. In some cases, it can lead to anticorrelated chains.

Removing this preference for the second half of the chains drastically reduces NUTS’s effectiveness. Figuring out how to include it and satisfy detailed balance was one of the really nice contributions in the original NUTS paper (and implementation).

Have you ever seen 4000 as the estimated n_eff in a default Stan run? That’s probably because the true value is greater than 4000 and we truncated it.

The fix is in

What’s even cooler is that the fix is already in the pipeline and it just happens to be Aki’s first C++ contribution. Here it is on GitHub:

Aki’s also done simulations, so the new version is actually better calibrated as far as MCMC standard error goes (posterior standard deviation divided by the square root of the effective sample size).

A simple example

Consider three Markov processes for drawing a binary sequence y[1], y[2], y[3], …, where each y[n] is in { 0, 1 }. Our target is a uniform stationary distribution, for which each sequence element is marginally uniformly distributed,

Pr[y[n] = 0] = 0.5     Pr[y[n] = 1] = 0.5


Process 1: Independent. This Markov process draws each y[n] independently. Whether the previous state is 0 or 1, the next state has a 50-50 chance of being either 0 or 1.

Here are the transition probabilities:

Pr[0 | 1] = 0.5   Pr[1 | 1] = 0.5
Pr[0 | 0] = 0.5   Pr[1 | 0] = 0.5


More formally, these should be written in the form

Pr[y[n + 1] = 0 | y[n] = 1] = 0.5


For this Markov chain, the stationary distribution is uniform. That is, some number of steps after initialization, there’s a probability of 0.5 of being in state 0 and a probability of 0.5 of being in state 1. More formally, there exists an m such that for all n > m,

Pr[y[n] = 1] = 0.5


The process will have an effective sample size exactly equal to the number of iterations because each state in a chain is independent.

Process 2: Correlated. This one makes correlated draws and is more likely to emit sequences of the same symbol.

Pr[0 | 1] = 0.01   Pr[1 | 1] = 0.99
Pr[0 | 0] = 0.99   Pr[1 | 0] = 0.01


Nevertheless, the stationary distribution remains uniform. Chains drawn according to this process will be slow to mix in the sense that they will have long sequences of zeroes and long sequences of ones.

The effective sample size will be much smaller than the number of iterations when drawing chains from this process.

Process 3: Anticorrelated. The final process makes anticorrelated draws. It’s more likely to switch back and forth after every output, so that there will be very few repeating sequences of digits.

Pr[0 | 1] = 0.99   Pr[1 | 1] = 0.01
Pr[0 | 0] = 0.01   Pr[1 | 0] = 0.99


The stationary distribution is still uniform. Chains drawn according to this process will mix very quickly.

With an anticorrelated process, the effective sample size will be greater than the number of iterations.

Visualization

If I had more time, I’d simulate, draw some traceplots, and also show correlation plots at various lags and the rate at which the estimated mean converges. This example’s totally going in the Coursera course I’m doing on MCMC, so I’ll have to work out the visualizations soon.
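In the meantime, here is a rough Python sketch of such a simulation for the three processes above. It assumes the textbook definition of effective sample size, N / (1 + 2 × the sum of the lag autocorrelations), which for a symmetric two-state chain has a closed form. It is not Stan's estimator, and the function names, chain length, and replicate count are hypothetical.

```python
import random


def simulate_chain(switch_prob, n, rng):
    """Draw a binary chain that flips state with probability switch_prob."""
    y = rng.randrange(2)  # start from the uniform stationary distribution
    chain = []
    for _ in range(n):
        chain.append(y)
        if rng.random() < switch_prob:
            y = 1 - y
    return chain


def theoretical_ess_ratio(switch_prob):
    """ESS / N in the long-chain limit.

    The lag-k autocorrelation is lam**k with lam = 1 - 2 * switch_prob,
    so 1 + 2 * sum_k lam**k = (1 + lam) / (1 - lam).
    """
    lam = 1.0 - 2.0 * switch_prob
    return (1.0 - lam) / (1.0 + lam)


def empirical_ess_ratio(switch_prob, n=500, reps=400, seed=0):
    """Estimate ESS / N from the variance of the chain mean across replicates."""
    rng = random.Random(seed)
    means = [sum(simulate_chain(switch_prob, n, rng)) / n for _ in range(reps)]
    grand = sum(means) / reps
    var_of_mean = sum((m - grand) ** 2 for m in means) / (reps - 1)
    # n independent draws would give Var(mean) = 0.25 / n, so ESS = 0.25 / Var(mean).
    return (0.25 / var_of_mean) / n


for name, p in [("independent", 0.5), ("correlated", 0.01), ("anticorrelated", 0.99)]:
    print(name, theoretical_ess_ratio(p), empirical_ess_ratio(p))
```

Under this definition the limiting ratio works out to 1 for the independent process, 1/99 for the correlated one, and 99 for the anticorrelated one. The empirical estimates are noisy and, for finite chains, land somewhat off the limiting values.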

### Online MSc in Applied Data Science, Big Data – part-time, small, private

DSTI's mission is simple: training executive students to become ready-to-go Data Scientists and Big Data Analysts. Check our small private online course programme.

### Magister Dixit

“There are 18 million developers in the world, but only one in a thousand have expertise in artificial intelligence. To a lot of developers, AI is inscrutable and inaccessible. We’re trying to ease the burden.” Mark Hammond ( 2017 )

### An Intuitive Introduction to Generative Adversarial Networks

This article was jointly written by Keshav Dhandhania and Arash Delijani, bios below.

1. A brief review of Deep Learning
2. The image generation problem
3. Key issue in generative tasks
5. Challenges
7. Conclusion

A brief review of Deep Learning

Sketch of a (feed-forward) neural network, with input layer in brown, hidden layers in yellow, and output layer in red.

Let’s begin with a brief overview of deep learning. Above, we have a sketch of a neural network. The neural network is made up of neurons, which are connected to each other using edges. The neurons are organized into layers - we have the hidden layers in the middle, and the input and output layers on the left and right respectively. Each of the edges is weighted, and each neuron performs a weighted sum of values from neurons connected to it by incoming edges, and thereafter applies a nonlinear activation such as sigmoid or ReLU. For example, neurons in the first hidden layer calculate a weighted sum of neurons in the input layer, and then apply the ReLU function. The activation function introduces a nonlinearity which allows the neural network to model complex phenomena (multiple linear layers would be equivalent to a single linear layer).

Given a particular input, we sequentially compute the values outputted by each of the neurons (also called the neurons’ activity). We compute the values layer by layer, going from left to right, using already computed values from the previous layers. This gives us the values for the output layer. Then we define a cost, based on the values in the output layer and the desired output (target value). For example, a possible cost function is the mean-squared error cost function.
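Written out in the symbols used in this section, the mean-squared error is commonly given in the following form. The $1/N$ normalisation and the index $i$ over the $N$ data points are assumptions here; some texts use a factor of $\frac{1}{2}$ instead.

$$J = \frac{1}{N} \sum_{i=1}^{N} \left(h(x_i) - y_i\right)^2$$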

Here, x is the input, h(x) is the output and y is the target. The sum is over the various data points in our dataset.

At each step, our goal is to nudge each of the edge weights by the right amount so as to reduce the cost function as much as possible. We calculate a gradient, which tells us how much to nudge each weight. Once we compute the cost, we compute the gradients using the backpropagation algorithm. The main result of the backpropagation algorithm is that we can exploit the chain rule of differentiation to calculate the gradients of a layer given the gradients of the weights in the layer above it. Hence, we calculate these gradients backwards, i.e. from the output layer to the input layer. Then, we update each of the weights by an amount proportional to the respective gradients (i.e. gradient descent).
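The forward pass, cost, backward pass, and weight update described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a network with one scalar input, one hidden ReLU layer, and one scalar output. The function names, starting weights, data points, and learning rate are all made up for the example.

```python
def forward(params, x):
    """Compute hidden pre-activations z, ReLU activations a, and output h."""
    w1, b1, w2, b2 = params
    z = [w * x + b for w, b in zip(w1, b1)]
    a = [max(0.0, v) for v in z]
    h = sum(w * v for w, v in zip(w2, a)) + b2
    return z, a, h


def mse(params, data):
    """Mean-squared error over (x, y) pairs."""
    return sum((forward(params, x)[2] - y) ** 2 for x, y in data) / len(data)


def backprop(params, data):
    """Gradients of the MSE, computed from the output layer back to the input."""
    w1, b1, w2, b2 = params
    n, k = len(data), len(w1)
    g_w1, g_b1, g_w2, g_b2 = [0.0] * k, [0.0] * k, [0.0] * k, 0.0
    for x, y in data:
        z, a, h = forward(params, x)
        d_h = 2.0 * (h - y) / n          # gradient of the cost w.r.t. the output
        g_b2 += d_h
        for j in range(k):
            g_w2[j] += d_h * a[j]        # output-layer weight
            # Chain rule through the ReLU: the gradient passes only where z > 0.
            d_z = d_h * w2[j] * (1.0 if z[j] > 0.0 else 0.0)
            g_w1[j] += d_z * x           # hidden-layer weight
            g_b1[j] += d_z
    return g_w1, g_b1, g_w2, g_b2


def gradient_step(params, data, lr=0.01):
    """Move every weight against its gradient (plain gradient descent)."""
    w1, b1, w2, b2 = params
    g_w1, g_b1, g_w2, g_b2 = backprop(params, data)
    return (
        [w - lr * g for w, g in zip(w1, g_w1)],
        [b - lr * g for b, g in zip(b1, g_b1)],
        [w - lr * g for w, g in zip(w2, g_w2)],
        b2 - lr * g_b2,
    )


params = ([0.5, -0.3], [0.1, 0.2], [1.0, -2.0], 0.05)  # made-up starting weights
data = [(1.0, 2.0), (-1.0, 0.5), (2.0, 3.0)]           # made-up (x, y) pairs
print(mse(params, data), mse(gradient_step(params, data), data))
```

Note how `backprop` reuses `d_h`, the gradient at the output, to obtain the hidden-layer gradients. That reuse is the sense in which gradients are computed backwards.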

If you would like to read about neural networks and the back-propagation algorithm in more detail, I recommend reading this article by Nikhil Buduma on Deep Learning in a Nutshell.

### The image generation problem

In the image generation problem, we want the machine learning model to generate images. For training, we are given a dataset of images (say 1,000,000 images downloaded from the web). During testing, the model should generate images that look like they belong to the training dataset, but are not actually in the training dataset. That is, we want to generate novel images (in contrast to simply memorizing), but we still want it to capture patterns in the training dataset so that new images look similar to those in the training dataset.

Image generation problem: There is no input, and the desired output is an image.

One thing to note: there is no input in this problem during the testing or prediction phase. Every time we ‘run the model’, we want it to generate (output) a new image. This can be achieved by saying that the input is going to be sampled randomly from a distribution that is easy to sample from (say the uniform distribution or Gaussian distribution).

The crucial issue in a generative task is - what is a good cost function? Let’s say you have two images that are outputted by a machine learning model. How do we decide which one is better, and by how much?

The most common answer to this question in previous approaches has been the distance between the output and its closest neighbor in the training dataset, where the distance is calculated using some predefined distance metric. For example, in the language translation task, we usually have one source sentence, and a small set of (about 5) target sentences, i.e. translations provided by different human translators. When a model generates a translation, we compare the translation to each of the provided targets, and assign it the score based on the target it is closest to (in particular, we use the BLEU score, which is a similarity metric based on how many n-grams match between the two sentences). That works reasonably well for single-sentence translations, but the same approach leads to a significant deterioration in the quality of the cost function when the target is a larger piece of text. For example, our task could be to generate a paragraph-length summary of a given article. This deterioration stems from the inability of the small number of samples to represent the wide range of variation observed in all possible correct answers.
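The idea of scoring an output against its closest reference can be sketched with a simplified n-gram precision. This is deliberately not the full BLEU score (it has no brevity penalty and uses a single n-gram order); the function names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=2):
    """Fraction of the candidate's n-grams that also appear in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def score_against_closest(candidate, references, n=2):
    """Score the candidate by the reference it matches best."""
    return max(ngram_precision(candidate, r, n) for r in references)


references = ["the cat sat on the mat", "a cat was sitting on the mat"]
print(score_against_closest("the cat sat on a mat", references))
```

With only a handful of references, a perfectly good output that happens to be phrased differently from all of them scores poorly, which is the deterioration described above.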

GANs' answer to the above question is: use another neural network! This scorer neural network (called the discriminator) will score how realistic the image outputted by the generator neural network is. These two neural networks have opposing objectives (hence, the word adversarial). The generator network’s objective is to generate fake images that look real, while the discriminator network’s objective is to tell apart fake images from real ones.

This puts generative tasks in a setting similar to the 2-player games in reinforcement learning (such as those of chess, Atari games or Go) where we have a machine learning model improving continuously by playing against itself, starting from scratch. The difference here is that often in games like chess or Go, the roles of the two players are symmetric (although not always). In the GAN setting, the objectives and roles of the two networks are different: one generates fake samples, the other distinguishes real ones from fake ones.

Sketch of Generative Adversarial Network, with the generator network labelled as G and the discriminator network labelled as D.

Above, we have a diagram of a Generative Adversarial Network. The generator network G and discriminator network D are playing a 2-player minimax game. First, to better understand the setup, notice that D’s inputs can be sampled from the training data or the output generated by G: Half the time from one and half the time from the other. To generate samples from G, we sample the latent vector from the Gaussian distribution and then pass it through G. If we are generating a 200 x 200 grayscale image, then G’s output is a 200 x 200 matrix. The objective function is given by the following function, which is essentially the standard log-likelihood for the predictions made by D:
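For reference, the minimax objective from the original GAN paper (Goodfellow et al.) is usually written as follows. The symbols $p_{\text{data}}$ (the training-data distribution) and $p_z$ (the distribution of the latent vector $z$) are notation assumed here rather than defined in this article.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$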

The generator network G is minimizing the objective, i.e. reducing the log-likelihood, or trying to confuse D. It wants D to identify the inputs it receives from G as real whenever samples are drawn from its output. The discriminator network D is maximizing the objective, i.e. increasing the log-likelihood, or trying to distinguish generated samples from real samples. In other words, if G does a good job of confusing D, then it will minimize the objective by increasing D(G(z)) in the second term. If D does its job well, then samples chosen from the training data add to the objective function via the first term (because D(x) would be large), and generated samples add to it via the second term (because D(G(z)) would be small, so 1 - D(G(z)) would be large).

Training proceeds as usual, using random initialization and backpropagation, with the addition that we alternately update the discriminator and the generator and keep the other one fixed. The following is a description of the end-to-end workflow for applying GANs to a particular problem:

1. Decide on the GAN architecture: What is the architecture of G? What is the architecture of D?
2. Train: Alternately update D and G for a fixed number of updates
    1. Update D (freeze G): Half the samples are real, and half are fake.
    2. Update G (freeze D): All samples are generated (note that even though D is frozen, the gradients flow through D)
3. Manually inspect some fake samples. If quality is high enough (or if quality is not improving), then stop. Else repeat step 2.
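The alternating updates in step 2 can be sketched on a deliberately tiny problem. The Python example below assumes one-dimensional data, a linear generator G(z) = a·z + b, and a logistic discriminator D(x) = sigmoid(w·x + c), with gradients written out by hand. These model choices, the function names, and the hyperparameters are illustrative assumptions, not the architectures used in the papers discussed here, and no claim is made that this toy converges.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def D(x, d):
    """Discriminator: estimated probability that x is a real sample."""
    w, c = d
    return sigmoid(w * x + c)


def G(z, g):
    """Generator: maps a latent value z to a fake sample."""
    a, b = g
    return a * z + b


def objective(real, latent, d, g):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    first = sum(math.log(D(x, d)) for x in real) / len(real)
    second = sum(math.log(1.0 - D(G(z, g), d)) for z in latent) / len(latent)
    return first + second


def update_D(real, latent, d, g, lr):
    """One gradient-ascent step on the objective with G frozen."""
    w, c = d
    gw = gc = 0.0
    for x in real:                  # d/dt log sigmoid(t) = 1 - sigmoid(t)
        s = 1.0 - D(x, d)
        gw += s * x / len(real)
        gc += s / len(real)
    for z in latent:                # d/dt log(1 - sigmoid(t)) = -sigmoid(t)
        fake = G(z, g)
        s = -D(fake, d)
        gw += s * fake / len(latent)
        gc += s / len(latent)
    return (w + lr * gw, c + lr * gc)


def update_G(latent, d, g, lr):
    """One gradient-descent step with D frozen; gradients still flow through D."""
    w, _ = d
    a, b = g
    ga = gb = 0.0
    for z in latent:
        s = -D(G(z, g), d) * w      # derivative of log(1 - D(fake)) w.r.t. fake
        ga += s * z / len(latent)
        gb += s / len(latent)
    return (a - lr * ga, b - lr * gb)


def train(steps=200, batch=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    d, g = (0.0, 0.0), (1.0, 0.0)   # arbitrary starting parameters
    for _ in range(steps):
        real = [rng.gauss(3.0, 1.0) for _ in range(batch)]   # toy training data
        latent = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        d = update_D(real, latent, d, g, lr)                 # half real, half fake
        latent = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        g = update_G(latent, d, g, lr)                       # all samples generated
    return d, g
```

Step 3, manual inspection, has no code counterpart here. In this toy setting one could instead compare generated values with the training data by eye.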

When both G and D are feed-forward neural networks, the results we get are as follows (trained on the MNIST dataset).

Results from Goodfellow et al. The rightmost column (in yellow boxes) shows the closest images from the training dataset to the image on its direct left. All other images are generated samples.

Using a more sophisticated architecture for G and D with strided convolutions, the Adam optimizer instead of stochastic gradient descent, and a number of other improvements in architecture, hyperparameters and optimizers (see paper for details), we get the following results:

Results from Alec Radford et. al. Images are of ‘bedrooms’.

### Challenges

If you would like to learn about GANs in much more depth, I suggest checking out the ICCV 2017 tutorials on GANs. There are multiple tutorials, each focusing on a different aspect of GANs, and they are quite recent.

I’d also like to mention the concept of Conditional GANs. Conditional GANs are GANs where the output is conditioned on the input. For example, the task might be to output an image matching the input description. So if the input is “dog”, then the output should be an image of a dog.
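
As a rough illustration of what "conditioned on the input" can mean in practice, one common approach (not necessarily the one used in the papers listed here) is to append an encoding of the condition to the generator's noise vector. The sketch below assumes a hypothetical three-class label set and a one-hot encoding; the names `CLASSES`, `one_hot`, and `generator_input` are made up for this example.

```python
import random

CLASSES = ["cat", "dog", "bird"]  # hypothetical label set


def one_hot(label, classes=CLASSES):
    """Encode a class label as a one-hot list."""
    return [1.0 if c == label else 0.0 for c in classes]


def generator_input(noise, label, classes=CLASSES):
    """Concatenate the noise vector with the encoded condition."""
    return list(noise) + one_hot(label, classes)


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(4)]
print(generator_input(z, "dog"))  # 4 noise values followed by [0.0, 1.0, 0.0]
```

The discriminator would typically receive the same condition alongside each sample, so that it judges whether the sample is both realistic and consistent with the requested label.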

Below are results from some recent research (along with links to those papers).

Results for ‘Text to Image synthesis’ by Reed et al.

Results for Image Super-resolution by Ledig et al.

Results for Image to Image translation by Isola et al.

Generating high resolution ‘celebrity like’ images by Karras et al.

Last but not least, if you would like to do a lot more reading on GANs, check out this list of GAN papers categorized by application and this list of 100+ different GAN variations.

### Conclusion

I hope that this article has helped you understand a new technique in deep learning called Generative Adversarial Networks. They are one of the few successful techniques in unsupervised machine learning, and are quickly revolutionizing our ability to perform generative tasks. Over the last few years, we’ve come across some very impressive results. There is a lot of active research in the field to apply GANs to language tasks, to improve their stability and ease of training, and so on. They are already being applied in industry for a variety of applications ranging from interactive image editing, 3D shape estimation, drug discovery, and semi-supervised learning to robotics. I hope this is just the beginning of your journey into adversarial machine learning.

### Author Bios:

Keshav Dhandhania:
Keshav is a cofounder of Compose Labs (commonlounge.com) and has spoken on GANs at international conferences including DataSciCon.Tech, Atlanta and DataHack Summit, Bengaluru, India. He did his master's in Artificial Intelligence at MIT, and his research focused on natural language processing and, before that, computer vision and recommendation systems.
Arash Delijani:
Arash previously worked on data science at MIT and is the cofounder of Orderly, an SF-based startup using machine learning to help businesses with customer segmentation and feedback analysis.

### Announcing the 2018 Facebook Fellows and Emerging Scholars

Facebook is proud to announce the 2018 Facebook Fellowship and Emerging Scholar award winners. “This year we received over 800 applications from promising and talented PhD students from around the world,” said Sharon Ayalde, Fellowship Program Manager. “We are pleased and excited to award 17 Fellows and 6 Emerging Scholars – a significant increase from last year.” The program, now in its 7th year, has supported over 70 top PhD candidates.

The Facebook Fellowship program is designed to encourage and support promising doctoral students engaged in innovative and relevant research across computer science and engineering. The research topics from this year’s cohort include natural language processing, computer vision, machine learning, commAI, networking and connectivity hardware, economics and computation, distributed systems, and security/privacy.

Launched in 2017, Emerging Scholar Awards are open to first or second year PhD students. The program is specifically designed to support talented students from under-represented minority groups in the technology sector to encourage them to continue their PhD studies, pursue innovative research, and engage with the broader research community.

Congratulations to this year’s talented group of Fellows and Emerging Scholars! We are excited to engage deeper with them, learn more about their research, and support their continued studies.

## 2018 Emerging Scholar Award recipients

### A simple way to set up a SparklyR cluster on Azure

The SparklyR package from RStudio provides a high-level interface to Spark from R. This means you can create R objects that point to data frames stored in the Spark cluster and apply some familiar R paradigms (like dplyr) to the data, all the while leveraging Spark's distributed architecture without having to worry about memory limitations in R. You can also access the distributed machine-learning algorithms included in Spark directly from R functions.

If you don't happen to have a cluster of Spark-enabled machines set up in a nearby well-ventilated closet, you can easily set one up in your favorite cloud service. For Azure, one option is to launch a Spark cluster in HDInsight, which also includes the extensions of Microsoft ML Server. While this service recently had a significant price reduction, it's still more expensive than running a "vanilla" Spark-and-R cluster. If you'd like to take the vanilla route, a new guide details how to set up a Spark cluster on Azure for use with SparklyR.

The guide provides the Azure Distributed Data Engineering Toolkit (AZTK) shell commands to provision a Spark cluster, connect SparklyR to the cluster, and then interact with it via RStudio Server. This includes the ability to launch the cluster with pre-emptable low-priority VMs, a cost-effective option (up to 80% cheaper!) for non-critical workloads. Check out the details at the link below.

Github (Azure): How to use SparklyR on Azure with AZTK

### Deep Learning & Computer Vision in the Microsoft Azure Cloud

This is the first in a multi-part series by guest blogger Adrian Rosebrock. Adrian writes at PyImageSearch.com about computer vision and deep learning using Python, and he recently finished authoring a new book on deep learning for computer vision and image recognition.

Introduction

I had two goals when I set out to write my new book, Deep Learning for Computer Vision with Python. The first was to create a book/self-study program that was accessible to both novices and experienced researchers and practitioners — we start off with the fundamentals of neural networks and machine learning and by the end of the program you’re training state-of-the-art networks on the ImageNet dataset from scratch. My second goal was to provide a book that included:

• Practical walkthroughs that present solutions to actual, real-world deep learning classification problems.
• Hands-on tutorials (with accompanying code) that not only show you the algorithms behind deep learning for computer vision but their implementations as well.
• A no-nonsense teaching style that cuts through all the cruft and helps you on your path to deep learning + computer vision mastery for visual recognition.

Along the way I quickly realized that a stumbling block for many readers is configuring their development environment, especially for those who want to utilize their GPU(s) and train deep neural networks on massive image datasets (such as ImageNet). Of course, some readers may not want to invest in physical hardware and instead utilize the cloud, where it’s easy to spin up and tear down instances. I spent some time researching different cloud-based solutions. Some of them worked well; others either outright didn’t work as claimed or involved too much setup.

When Microsoft reached out to me to take a look at their Data Science Virtual Machine (DSVM), I was incredibly impressed.

The DSVM included TensorFlow, Keras, mxnet, Caffe, and CUDA/cuDNN all out of the box, pre-configured and ready to go. Best of all, I could run the DSVM on a CPU instance (great if you’re just getting started with deep learning) or I could switch to a GPU instance and seamlessly watch my networks train orders of magnitude faster (excellent if you’re a deep learning practitioner looking to train deep neural networks on larger datasets). In the remainder of this post, I’ll be discussing why I chose to use the DSVM, and why I even adapted my entire deep learning book to run on it.

Why I like Microsoft’s Data Science Virtual Machine

Microsoft’s Data Science Virtual Machine (DSVM) runs in the Azure cloud and supports either Windows or Linux (Ubuntu). For nearly all deep learning projects I recommend Linux; however, there are some applications where Windows is appropriate — you can choose either. The list of packages installed in the DSVM is complete and comprehensive (you can find the full list here), but from a deep learning + computer vision perspective you’ll find:

• TensorFlow
• Keras
• mxnet
• Caffe/Caffe2
• Microsoft Cognitive Toolkit
• Torch
• OpenCV
• Jupyter
• CUDA/cuDNN
• Python 3

Again, the complete list is very extensive and is a huge testament to not only the DSVM team for keeping this instance running seamlessly, but also to Microsoft’s desire to have their users utilize and even enjoy working in their environment. As I mentioned above, you can run the DSVM on either a CPU-only instance or an instance with one or more GPUs.

Once you have your DSVM up and running you’ll find many sample Jupyter notebooks for various machine learning, deep learning, and data science projects. These sample Jupyter notebooks will help you get up and running and familiarize yourself with the DSVM. If you prefer not to use Jupyter Notebooks you can also access your DSVM instance via SSH and VNC.

To spin up your first DSVM instance (including a free $200 credit) you’ll want to follow this link: https://azure.microsoft.com/en-us/free/. I recommend reading through the DSVM docs as well. Finally, be sure to read through the Tips, Tricks, and Suggestions section of this post, where I discuss additional advice and hacks you can use to better your experience with the DSVM.

Your First Convolutional Neural Network

I have put together a Jupyter Notebook demonstrating how to train your first Convolutional Neural Network using the following toolset:

• Python (2.7 and 3).
• TensorFlow.
• Keras.

You can find the Jupyter Notebook here:

Note: Make sure (1) you install Jupyter Notebooks on your local system or (2) use a DSVM instance to open the notebook.

Inside the notebook you’ll learn how to train the classic “LeNet” architecture to recognize handwritten digits:

And obtain over 98% classification accuracy after only 20 epochs:

Be sure to take a look at the Jupyter Notebook for a full explanation of the code and training process. I also want to draw attention to the fact that the code associated with this tutorial is the exact same code that I used when writing Deep Learning for Computer Vision with Python, with only two modifications:

1. Using %matplotlib inline to display plots inside the Jupyter Notebook.
2. Swapping out argument parsing for using a built-in Python args dictionary.

There are no other required changes to the code, which again is a huge testament to the DSVM team.

Tips, Tricks, and Suggestions

In this section I detail some additional tips and tricks I found useful when working with the DSVM. Some of these suggestions are specific to my book, Deep Learning for Computer Vision with Python, while others are more general.

Additional Python Packages

I installed both imutils and progressbar2 in the DSVM once it was up and running:

$ sudo /anaconda/envs/py35/bin/pip install imutils
$ sudo /anaconda/envs/py35/bin/pip install progressbar2

The imutils library is a series of convenience functions used to make basic image processing and computer vision operations easier using Python and the OpenCV library. The progressbar2 package is used to make nice progress bars when running tasks that take a long time to complete (such as building and packing an image dataset).

Updating MKL

I ran into a small issue when trying to work with the Intel Math Kernel Library (MKL):

Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.
Process ForkPoolWorker-14:
Process ForkPoolWorker-12:
Process ForkPoolWorker-13:

This was resolved by running the following command to update the mkl package:

$ sudo /anaconda/envs/py35/bin/conda update mkl -n py35

Avoiding Accidental ResourceExhaustedError

When leaving your Jupyter notebook open/running for long periods of time you may run into a ResourceExhaustedError when training your networks.

This can be solved by inserting the following lines in a cell at the end of the notebook:

%%javascript
Jupyter.notebook.session.delete();

Command Line Arguments and Jupyter Notebooks

Many deep learning and machine learning Python scripts require command line arguments…

…but Jupyter notebooks do not have a concept of command line arguments.

So, what do you do?

Inside Deep Learning for Computer Vision with Python I made sure all command line arguments are parsed into a built-in Python dictionary.

This means you can change command line argument parsing code from this:

import argparse

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image")
args = vars(ap.parse_args())

To this:

args = {
    "image": "/path/to/your/input/image.png"
}

Here we have swapped out the command line arguments for a hard-coded dictionary that points to the relevant parameters/file paths. Not all command line arguments can be swapped out so easily, but for all examples in my book I opted to use a Python dictionary to make this a near seamless experience for Jupyter notebook users.
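
If you would rather keep a single code path for scripts and notebooks, one option is to wrap the parser in a function that accepts an explicit argument list. This is only a sketch, and the helper name `get_args` is hypothetical. It relies on the standard behavior of `argparse`: `parse_args` reads `sys.argv` when called with `None` and parses the given list otherwise.

```python
import argparse


def get_args(argv=None):
    """Parse arguments from argv, or from sys.argv when argv is None."""
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True,
                    help="path to input image")
    return vars(ap.parse_args(argv))


# In a notebook cell, pass the arguments explicitly instead of relying
# on the command line:
args = get_args(["--image", "/path/to/your/input/image.png"])
print(args["image"])
```

Either way, the rest of the code only ever sees a plain `args` dictionary, so it does not need to know whether it is running as a script or inside a notebook.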

SSH and Remote Desktop

You can access your Azure DSVM via SSH or remote desktop.

If you opt for remote desktop make sure you install the X2Go Client as discussed in the DSVM docs.

If you’re using macOS, make sure you install XQuartz as well.

Where to Next?

In this post you learned about the Microsoft Data Science Virtual Machine (DSVM). We discussed how the DSVM can be used for deep learning, and in particular, how the code from my book and self-study program, Deep Learning for Computer Vision with Python, can be executed on the DSVM. My goal when writing this book was to make it accessible to both novices and experienced researchers and practitioners, and I have no doubt that the DSVM facilitates this accessibility by removing frustration with deep learning development environment configuration and getting you up and running quickly. If you’re interested in learning more about the Microsoft Data Science Virtual Machine, be sure to click here.

Once you’re up and running with the DSVM take a look at my deep learning book — this self-study program is engineered from the ground up to help you master deep learning for computer vision (you’ll also find more detailed walkthroughs of LeNet and other popular network architectures, including ResNet, SqueezeNet, GoogLeNet, VGGNet, to name a few).

Stay tuned for my next post in this series, where I will share my experience and tips for running more advanced deep learning techniques for computer vision on the Data Science Virtual Machine.

### Visual Aesthetics: Judging photo quality using AI techniques

We built a deep learning system that can automatically analyze and score an image for aesthetic quality with high accuracy. Check the demo and see how your photo measures up!

### BigML Release and Webinar: Operating Thresholds and Organizations!

BigML’s first release of the year is here! Join us on Wednesday, January 31, 2018, at 10:00 AM PT (Portland, Oregon. GMT -08:00) / 07:00 PM CET (Valencia, Spain. GMT +01:00) for a FREE live webinar to discover the latest version of the BigML platform. We will be presenting two new features: operating thresholds for classification models to fine tune the performance of […]

### Data Science in 30 Minutes: Alan Schwarz, Former NYTimes Journalist, on Numbers-Based Journalism

This FREE webinar will be on February 27th at 5:30 PM ET. Register below now, space is limited!

Join The Data Incubator and former NY Times journalist Alan Schwarz for the next installment of our free online webinar series, Data Science in 30 minutes: Numbers-Based Journalism.

Alan Schwarz, former N.Y. Times investigative reporter and Pulitzer finalist, discusses numbers-based journalism that shook industries from the National Football League to Big Pharma. Alan used data analysis to expose the NFL’s cover-up of concussions as well as issues in child psychiatry.

Alan Schwarz is a Pulitzer Prize-nominated journalist best known for his reportage of public health issues for The New York Times. His 130-article series on concussions in sports is roundly credited with revolutionizing the handling of head injuries in professional and youth sports, and was a finalist for the 2011 Pulitzer Prize for Public Service. He followed that work with a series on A.D.H.D. and other psychiatric disorders in children, which also was considered for a Pulitzer and led to his book “A.D.H.D. NATION: Children, Doctors, Big Pharma and the Making of an American Epidemic.”

A recognized expert on the use of mathematics and probability in journalism — statistical analysis formed the backbone of his major series — Mr. Schwarz has lectured at dozens of universities and professional conferences about these subjects, including at the 2015 SAS national convention and a keynote at the Andrew Wiles Mathematical Institute at the University of Oxford. Mr. Schwarz, who holds a bachelor of arts degree in Mathematics from the University of Pennsylvania, was honored by the American Statistical Association in 2013 with its Lifetime Excellence in Statistical Reporting Award and serves on editorial boards of the ASA and the Royal Statistical Society.

Michael Li founded The Data Incubator, a New York-based training program that turns talented PhDs from academia into workplace-ready data scientists and quants. The program is free to Fellows; employers engage with the Incubator as hiring partners.

Previously, he worked as a data scientist (Foursquare), Wall Street quant (D.E. Shaw, J.P. Morgan), and a rocket scientist (NASA). He completed his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall Scholar. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup to focus on what he really loves.

Michael lives in New York, where he enjoys the Opera, rock climbing, and attending geeky data science events.

### Propensity Score Matching in R

Propensity scores are an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible.

### Bitcoin (World Map) Bubbles

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

We’re doing some interesting studies (cybersecurity-wise, not finance-wise) on digital currency networks at work-work and, while I’m loath to create a geo-map from IPv4 geolocation data, we:

• do get (often, woefully inaccurate) latitude & longitude data from our geolocation service (I won’t name-and-shame here); and,
• there are definite geo-aspects to the prevalence of mining nodes — especially Bitcoin; and,
• I have been itching to play with the nascent nord palette in a cartographical context…

so I went on a small diversion to create a bubble plot of geographical Bitcoin node-prevalence.

I tweeted out said image and someone asked if there was code, hence this post.

You’ll be able to read about the methodology we used to capture the Bitcoin node data that underpins the map below later this year. For now, all I can say is that it wasn’t garnered from joining the network-proper.

I’m including the geo-data in the gist, but not the other data elements (you can easily find Bitcoin node data out on the internets from various free APIs and our data is on par with them).

I’m using swatches for the nord palette since I was hand-picking colors, but you should use @jakekaupp’s most excellent nord package if you want to use the various palettes more regularly.

I’ve blathered a bit about nord, so let’s start with that (and include the various other packages we’ll use later on):

library(swatches)
library(ggalt) # devtools::install_github("hrbrmstr/ggalt")
library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes")
library(tidyverse)

show_palette(nord)

It may not be a perfect palette (accounting for all forms of vision issues and other technical details) but it was designed very well (IMO).

The rest is pretty straightforward:

• read in the bitcoin geo-data
• count up by lat/lng
• figure out which colors to use (that took a bit of trial-and-error)
• tweak the rest of the ggplot2 canvas styling (that took a wee bit longer)

I’m using development versions of two packages due to their added functionality not being on CRAN (yet). If you’d rather not use a dev-version of hrbrthemes just use a different ipsum theme vs the new theme_ipsum_tw().

read_csv("bitc.csv") %>%
count(lng, lat, sort = TRUE) -> bubbles_df

world <- map_data("world")
world <- world[world$region != "Antarctica", ]

ggplot() +
  geom_cartogram(
    data = world, map = world, aes(x = long, y = lat, map_id = region),
    color = nord["nord3"], fill = nord["nord0"], size = 0.125
  ) +
  geom_point(
    data = bubbles_df, aes(lng, lat, size = n), fill = nord["nord13"],
    shape = 21, alpha = 2/3, stroke = 0.25, color = "#2b2b2b"
  ) +
  coord_proj("+proj=wintri") +
  scale_size_area(name = "Node count", max_size = 20, labels = scales::comma) +
  labs(
    x = NULL, y = NULL,
    title = "Bitcoin Network Geographic Distribution (all node types)",
    subtitle = "(Using bubbles seemed appropriate for some, odd reason)",
    caption = "Source: Rapid7 Project Sonar"
  ) +
  theme_ipsum_tw(plot_title_size = 24, subtitle_size = 12) +
  theme(plot.title = element_text(color = nord["nord14"], hjust = 0.5)) +
  theme(plot.subtitle = element_text(color = nord["nord14"], hjust = 0.5)) +
  theme(panel.grid = element_blank()) +
  theme(plot.background = element_rect(fill = nord["nord3"], color = nord["nord3"])) +
  theme(panel.background = element_rect(fill = nord["nord3"], color = nord["nord3"])) +
  theme(legend.position = c(0.5, 0.05)) +
  theme(axis.text = element_blank()) +
  theme(legend.title = element_text(color = "white")) +
  theme(legend.text = element_text(color = "white")) +
  theme(legend.key = element_rect(fill = nord["nord3"], color = nord["nord3"])) +
  theme(legend.background = element_rect(fill = nord["nord3"], color = nord["nord3"])) +
  theme(legend.direction = "horizontal")

As noted, the RStudio project associated with this post is in this gist. Also, upon further data-inspection by @jhartftw, we’ve discovered yet-more inconsistencies in the geo-mapping service data (there are way too many nodes in Paris, for example), but the main point of the post was to mostly show and play with the nord palette.
### Return of the Mac

The Economist’s Big Mac index gives a flavour of how far currency values are out of whack. It is based on the idea of purchasing-power parity, which says exchange rates should move towards the level that would make the price of a basket of goods the same everywhere.

### Gradient Boosting in TensorFlow vs XGBoost

For many Kaggle-style data mining problems, XGBoost has been the go-to solution since its release in 2016. It's probably as close to an out-of-the-box machine learning algorithm as you can get today.

### (What’s So Funny ‘Bout) Evidence, Policy, and Understanding

Kevin Lewis asked me what I thought of this article by Oren Cass, “Policy-Based Evidence Making.” That title sounds wrong at first (shouldn’t it be “evidence-based policy making”?), but when you read the article you get the point, which is that Cass argues that so-called evidence-based policy isn’t so evidence-based at all, that what is considered “evidence” in social science and economic policy is often so flexible that it can be taken to support whatever position you want. Hence, policy-based evidence making.

I agree with Cass that the whole “evidence-based policy” thing has been oversold. For an extreme example, see this story of some “evidence-based design” in architecture that could well be little more than a billion-dollar pseudoscientific scam.

More generally, I agree that there are problems with a lot of these studies, both in their design and in their interpretation.
Here’s a story from a few years ago that I’ve discussed a bit; for a slightly more formal treatment of that example, see section 2.1 of this article.

So I’m sympathetic with the points that Cass is making, and I’m glad this article came out; I think it will generally push the discussion in the right direction. But there are two places where I disagree with Cass.

1. First, despite all the problems with controlled experiments, they can still tell us something, as long as they’re not overinterpreted. If we forget about statistical significance and all that crap, a controlled experiment is, well, it’s a controlled experiment, you’re learning about something under controlled conditions, which can be useful. This is a point that has been made many times by my colleague Don Green: yes, controlled experiments have problems with realism, but the same difficulties can arise when trying to generalize observational comparisons to new settings.

To put it another way, recall Bill James’s adage that the alternative to good statistics is not “no statistics,” it’s “bad statistics.” Consider Cass’s article. He goes through lots of legitimate criticism of overinterpretations of results from that Oregon experiment, but then, what does he do? He gives lots of weight to an observational study from Yale that compares across states.

One point that Cass makes very well is that you can’t rely too much on any single study. Any single study is limited in scope, it occurs at a particular time and place and with a particular set of treatments, outcomes, and time horizon. To make decisions, we have to do our best with the studies we have, which sometimes means discarding them completely if they are too noisy. And I think Cass is right that we should take studies more seriously when they are large and occur under realistic conditions.

2. The political slant is just overwhelming. Cass throws in the kitchen sink.
For example, “When Denmark began offering generous maternity leave, so many nurses made use of it that mortality rates in nursing homes skyrocketed.” Whaaaa? Maybe they could hire some more help in those nursing homes? Any policy will have issues when rolling out on a larger scale, but it seems silly to say that therefore it’s not a good idea to evaluate based on what evidence is available.

Then he refers to “This evidence ratchet, in which findings can promote but not undermine a policy, is common.” This makes no sense to me, given that anything can be a policy. The $15 minimum wage is a policy. So is the $5 minimum wage, or for that matter the $0 minimum wage. High taxes on the rich is a policy, low taxes on the rich is a policy, etc.

Also this: “Grappling with such questions is frustrating and unsettling, as the policymaking process should be. It encourages humility and demands that the case for government action clear a high bar.” This may be Cass’s personal view, but it has nothing to do with evidence. He’s basically saying that if the evidence isn’t clear, we should make decisions based on his personal preference for less government spending, which I think means lower taxes on the rich. One could just as well say the opposite: “It encourages humility and demands that the riches of our country be shared more equally.” Or, to give it a different spin: “It encourages humility and demands that we live by the Islamic principles that have stood the test of time.” Or whatever. When evidence is weak, you have to respect uncertainty; it should not be treated as a rationale for sneaking in your own policy preferences as a default.

But I hate to end it there. Overall I liked Cass’s article, and we should be able to get value from it, subtracting the political slant which muddles his legitimate points. The key point, which Cass makes well, is that there is no magic to evidence-based decision making: You can do a controlled experiment and still learn nothing useful. The challenge is where to go next. I do think evidence is important, and I think that, looking forward, our empirical studies of policies should be as realistic as possible, close to the ground, as it were. Easier said than done, perhaps, but we need to do our best, and I think that critiques such as Cass’s are helpful.

### Convolutional neural networks for language tasks

Though they are typically applied to vision problems, convolutional neural networks can be very effective for some language tasks.

When approaching problems with sequential data, such as natural language tasks, recurrent neural networks (RNNs) typically top the choices. While the temporal nature of RNNs is a natural fit for these problems with text data, convolutional neural networks (CNNs), which are tremendously successful when applied to vision tasks, have also demonstrated efficacy in this space.

In our LSTM tutorial, we took an in-depth look at how long short-term memory (LSTM) networks work and used TensorFlow to build a multi-layered LSTM network to model stock market sentiment from social media content. In this post, we will briefly discuss how CNNs are applied to text data while providing some sample TensorFlow code to build a CNN that can perform binary classification tasks similar to our stock market sentiment model.

We see a sample CNN architecture for text classification in Figure 1. First, we start with our input sentence (of length seq_len), represented as a matrix in which the rows are our word vectors and the columns are the dimensions of the distributed word embedding. In computer vision problems, we typically see three input channels for RGB; however, for text we have only a single input channel. When we implement our model in TensorFlow, we first define placeholders for our inputs and then build the embedding matrix and embedding lookup.

import tensorflow as tf  # the snippets here use the TensorFlow 1.x API

# Define Inputs
inputs_ = tf.placeholder(tf.int32, [None, seq_len], name='inputs')
labels_ = tf.placeholder(tf.float32, [None, 1], name='labels')
training_ = tf.placeholder(tf.bool, name='training')

# Define Embeddings
embedding = tf.Variable(tf.random_uniform((vocab_size, embed_size), -1, 1))
embed = tf.nn.embedding_lookup(embedding, inputs_)


Notice how the CNN processes the input as a complete sentence, rather than word by word as we did with the LSTM. For our CNN, we pass a tensor with all word indices in our sentence to our embedding lookup and get back the matrix for our sentence that will be used as the input to our network.

Now that we have our embedded representation of our input sentence, we build our convolutional layers. In our CNN, we will use one-dimensional convolutions, as opposed to the two-dimensional convolutions typically used on vision tasks. Instead of defining a height and a width for our filters, we will only define a height, and the width will always be the embedding dimension. This makes sense intuitively, when compared to how images are represented in CNNs. When we deal with images, each pixel is a unit for analysis, and these pixels exist in both dimensions of our input image. For our sentence, each word is a unit for analysis and is represented by the dimension of our embeddings (the width of our input matrix), so words exist only in the single dimension of our rows.

We can include as many one-dimensional kernels as we like with different sizes. Figure 1 shows a kernel size of two (red box over input) and a kernel size of three (yellow box over input). We also define a uniform number of filters (in the same fashion as we would for a two-dimensional convolutional layer) for each of our layers, which will be the output dimension of our convolution. We apply a relu activation and add a max-over-time pooling to our output that takes the maximum output for each filter of each convolution—resulting in the extraction of a single model feature from each filter.

# Define Convolutional Layers with Max Pooling
convs = []
for filter_size in filter_sizes:
    conv = tf.layers.conv1d(inputs=embed, filters=128, kernel_size=filter_size, activation=tf.nn.relu)
    pool = tf.layers.max_pooling1d(inputs=conv, pool_size=seq_len-filter_size+1, strides=1)
    convs.append(pool)


We can think of these layers as “parallel”—i.e., one convolution layer doesn’t feed into the next, but rather they are all functions on the input that result in a unique output. We concatenate and flatten these outputs to combine the results.
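As a rough illustration of the combine step, the sketch below merges the pooled outputs of three parallel branches for one sentence. The helper names and values are hypothetical, and only three filters per branch are used to keep it readable. With 128 filters per branch, each pooled output would have shape (batch, 1, 128), and three branches would flatten to 384 features per sentence.

```python
def concat_last_axis(branches):
    """Join per-branch outputs of shape (1, n_filters) along the last axis."""
    merged = []
    for branch in branches:
        merged.extend(branch[0])
    return [merged]  # shape (1, total_filters)


def flatten(nested):
    """Drop the length-one pooling axis, leaving a flat feature vector."""
    return [v for row in nested for v in row]


# Hypothetical pooled outputs for one sentence, three filters per branch
pool_k2 = [[0.9, 0.0, 0.3]]  # branch with kernel size 2
pool_k3 = [[0.1, 0.7, 0.0]]  # branch with kernel size 3
pool_k4 = [[0.5, 0.2, 0.8]]  # branch with kernel size 4

pool_concat = concat_last_axis([pool_k2, pool_k3, pool_k4])
pool_flat = flatten(pool_concat)
print(len(pool_flat))  # 9
```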

# Concat Pooling Outputs and Flatten
pool_concat = tf.concat(convs, axis=-1)
pool_flat = tf.layers.flatten(pool_concat)


Next, we build a single fully connected layer with a sigmoid activation to make predictions from our concatenated convolutional outputs. Note that if the problem has more than two classes, we can instead use a tf.nn.softmax activation with one output unit per class. We also include a dropout layer here to regularize our model for better out-of-sample performance.
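The sketch below shows, in plain Python, what dropout followed by a single sigmoid unit computes. The function names, weights, and feature values are hypothetical. It assumes inverted dropout, where `rate` is the fraction of inputs dropped during training and the surviving inputs are scaled by `1 / (1 - rate)`; at inference time the inputs pass through unchanged.

```python
import math
import random


def dropout(values, rate, training, rng):
    """Inverted dropout: drop each value with probability `rate` in training."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Survivors are scaled up so the expected sum stays the same.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dense_sigmoid(values, weights, bias):
    """One fully connected unit followed by a sigmoid."""
    logit = sum(v * w for v, w in zip(values, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


features = [2.0, 0.0, 1.0]   # hypothetical flattened pooling outputs
weights = [0.5, -1.0, 0.25]  # hypothetical learned weights
bias = -1.25

rng = random.Random(0)
kept = dropout(features, rate=0.5, training=False, rng=rng)
prob = dense_sigmoid(kept, weights, bias)
print(prob)  # 0.5, since the logit is exactly 0
```

The output is a probability for the positive class, which matches the single float label per example in the placeholders defined earlier.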

# rate is the fraction of units dropped, so convert from keep_prob
drop = tf.layers.dropout(inputs=pool_flat, rate=1 - keep_prob, training=training_)
dense = tf.layers.dense(inputs=drop, units=1, activation=tf.nn.sigmoid)


Finally, we can wrap this code in a model_fn and pass it to a custom tf.estimator.Estimator, which gives us a simple API for training, evaluating, and making future predictions.

And there we have it: a convolutional neural network architecture for text classification.

As with any model comparison, there are some trade-offs between CNNs and RNNs for text classification. Even though RNNs seem like a more natural choice for language, CNNs have been shown to train up to 5x faster than RNNs and perform well on text where feature detection is important. However, when long-term dependency over the input sequence is an important factor, RNN variants typically outperform CNNs.

Ultimately, language problems in various domains behave differently, so it is important to have multiple techniques in your arsenal. This is just one example of a trend we are seeing in applying techniques successfully across different areas of research. While convolutional neural networks have traditionally been the star of the computer vision world, we are starting to see more breakthroughs in applying them to sequential data.

This post is a collaboration between O'Reilly and TensorFlow. See our statement of editorial independence.

### Build the right thing; build the thing right

How design thinking supports delivering product.

Continue reading Build the right thing; build the thing right.

### Getting started with spatial data in R – EdinbR talk

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

Last night (2018-01-17) I spoke at the EdinbR user group alongside Susan Johnston. Susan talked about writing R packages and you can see her slides here. I gave an introduction to working with spatial data in R. You can see my slides below: