My Data Science Blogs

August 15, 2018

TrackML Challenge



I just won a gold medal for my 9th-place finish in the TrackML challenge hosted on Kaggle.  The challenge was proposed by CERN (Conseil Européen pour la Recherche Nucléaire).  The problem was to reconstruct the trajectories of high-energy particles from the tracks they leave in the detectors used at CERN.  The data we were given was actually a simulation of forthcoming CERN detectors.  Here is how the challenge is described by CERN:

To explore what our universe is made of, scientists at CERN are colliding protons, essentially recreating mini big bangs, and meticulously observing these collisions with intricate silicon detectors.

While orchestrating the collisions and observations is already a massive scientific accomplishment, analyzing the enormous amounts of data produced from the experiments is becoming an overwhelming challenge.

Event rates have already reached hundreds of millions of collisions per second, meaning physicists must sift through tens of petabytes of data per year. And, as the resolution of detectors improves, ever better software is needed for real-time pre-processing and filtering of the most promising events, producing even more data.

To help address this problem, a team of Machine Learning experts and physics scientists working at CERN (the world's largest high energy physics laboratory) has partnered with Kaggle and prestigious sponsors to answer the question: can machine learning assist high energy physics in discovering and characterizing new particles?

Specifically, in this competition, you’re challenged to build an algorithm that quickly reconstructs particle tracks from 3D points left in the silicon detectors.

I used an unsupervised machine learning technique known as clustering; see my detailed writeup.  The key was to preprocess the data so that the clustering algorithm could find the particle tracks more easily.  This is very similar to feature engineering for supervised machine learning.  The main issue was the computational resources required by my solution: about 20 CPU hours per event, and we needed to predict tracks for 125 events.  This would not have been possible without parallelism on multicore machines.  I used an IBM POWER9 machine with 40 cores to compute a submission in less than 3 days.
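For readers curious what "clustering after preprocessing" looks like in practice, here is a minimal, hypothetical sketch, not my actual solution: the column names, the engineered features, and the DBSCAN parameters are all illustrative stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

def cluster_hits(hits: pd.DataFrame) -> np.ndarray:
    """Assign a track id to each hit by clustering preprocessed features.

    `hits` is assumed to have cartesian columns x, y, z (illustrative schema).
    """
    x, y, z = hits['x'].values, hits['y'].values, hits['z'].values
    r = np.sqrt(x**2 + y**2 + z**2)
    # Features chosen so that hits lying on the same ray through the origin
    # get identical values: a crude stand-in for real feature engineering.
    features = np.column_stack([x / r, y / r, z / r])
    return DBSCAN(eps=0.01, min_samples=1).fit_predict(features)

# Toy usage: three hits along one straight "track" land in one cluster.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [1.0, 2.0, 3.0], 'z': [1.0, 2.0, 3.0]})
labels = cluster_hits(toy)
```

The real work is in choosing the transformation (real tracks are helices, not rays) and in re-running the clustering over many candidate transformations, which is where the CPU hours go.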

Many participants also had significant running times; for instance, the second-place finisher says that some events took him up to 3 CPU days.  This makes the achievement of the first-place winner extremely impressive.  Not only did they achieve an amazing detection rate, but their code runs in 8 minutes per event!  I'm not in the same league as them.

Anyway, I'm still happy with my result.  And the cherry on top: this gold medal earned me the Kaggle Competitions Grandmaster title :) I'm the second Kaggler to become a Grandmaster in two categories.








Document worth reading: “Radial Basis Function Approximations: Comparison and Applications”

Approximation of scattered data is often a task in many engineering problems. The Radial Basis Function (RBF) approximation is appropriate for large scattered (unordered) datasets in d-dimensional space. This approach is useful for a higher dimension d>2, because the other methods require the conversion of a scattered dataset to an ordered dataset (i.e. a semi-regular mesh is obtained by using some tessellation techniques), which is computationally expensive. The RBF approximation is non-separable, as it is based on the distance between two points. This method leads to a solution of Linear System of Equations (LSE) Ac=h. In this paper several RBF approximation methods are briefly introduced and a comparison of those is made with respect to the stability and accuracy of computation. The proposed RBF approximation offers lower memory requirements and better quality of approximation.
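To make the linear system Ac=h concrete, here is a small sketch of RBF interpolation of scattered points with a Gaussian kernel; the kernel and shape parameter are illustrative choices, not necessarily those compared in the paper.

```python
import numpy as np

def rbf_fit_predict(centers, values, queries, shape=1.0):
    """Fit an RBF interpolant (Gaussian kernel) by solving A c = h,
    then evaluate it at query points. Works for scattered data in any
    dimension d, since only pairwise distances are used."""
    def kernel(a, b):
        # Pairwise Gaussian radial basis: phi(r) = exp(-(shape * r)^2)
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-(shape ** 2) * d2)
    A = kernel(centers, centers)      # interpolation matrix A
    c = np.linalg.solve(A, values)    # coefficients c in A c = h
    return kernel(queries, centers) @ c

# Interpolation reproduces the data exactly at the scattered centers.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vals = np.array([1.0, 2.0, 3.0])
approx = rbf_fit_predict(pts, vals, pts)
```

The stability questions the paper studies show up here in the conditioning of A, which degrades as points cluster or as the shape parameter shrinks.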


Data Science Portfolio Project: Is Fandango Still Inflating Ratings?

At Dataquest, we strongly advocate portfolio projects as a means of getting your first data science job. In this blog post, we'll walk you through an example portfolio project.

The project is part of our Statistics Fundamentals course, and it assumes some familiarity with:

  • Sampling (simple random sampling, populations, samples, parameters, statistics)
  • Variables
  • Frequency distributions
  • pandas and matplotlib

If you think you need to fill in any gaps before moving forward, we cover the topics above in depth in our Statistics Fundamentals course. This course will also give you more detailed instructions on how to build this project and let you code it in your browser.

To learn more about pandas and matplotlib, check out our Data Scientist Path.

Is Fandango still inflating ratings?

In October 2015, Walt Hickey from FiveThirtyEight published a popular article where he presented strong evidence that suggested Fandango's movie rating system was biased and dishonest. In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.

We'll work with two samples of movie ratings: the data in one sample was collected prior to Hickey's analysis, while the other sample was collected after it. Let's start by reading in the two samples (which are stored as CSV files) and getting familiar with their structure.

import pandas as pd
pd.options.display.max_columns = 100  # Avoid truncating the displayed columns

previous = pd.read_csv('fandango_score_comparison.csv')
after = pd.read_csv('movie_ratings_16_17.csv')

previous.head(3)

FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RT_norm RT_user_norm Metacritic_norm Metacritic_user_nom IMDB_norm RT_norm_round RT_user_norm_round Metacritic_norm_round Metacritic_user_norm_round IMDB_norm_round Metacritic_user_vote_count IMDB_user_vote_count Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 74 86 66 7.1 7.8 5.0 4.5 3.70 4.3 3.30 3.55 3.90 3.5 4.5 3.5 3.5 4.0 1330 271107 14846 0.5
1 Cinderella (2015) 85 80 67 7.5 7.1 5.0 4.5 4.25 4.0 3.35 3.75 3.55 4.5 4.0 3.5 4.0 3.5 249 65709 12640 0.5
2 Ant-Man (2015) 80 90 64 8.1 7.8 5.0 4.5 4.00 4.5 3.20 4.05 3.90 4.0 4.5 3.0 4.0 4.0 627 103660 12055 0.5

after.head(3)

movie year metascore imdb tmeter audience fandango n_metascore n_imdb n_tmeter n_audience nr_metascore nr_imdb nr_tmeter nr_audience
0 10 Cloverfield Lane 2016 76 7.2 90 79 3.5 3.80 3.60 4.5 3.95 4.0 3.5 4.5 4.0
1 13 Hours 2016 48 7.3 50 83 4.5 2.40 3.65 2.5 4.15 2.5 3.5 2.5 4.0
2 A Cure for Wellness 2016 47 6.6 40 47 3.0 2.35 3.30 2.0 2.35 2.5 3.5 2.0 2.5

Below, we isolate only the columns that provide information about Fandango, to make the relevant data more readily available for later use. We'll make copies to avoid any SettingWithCopyWarning later on.

fandango_previous = previous[['FILM', 'Fandango_Stars', 'Fandango_Ratingvalue', 'Fandango_votes',
                              'Fandango_Difference']].copy()
fandango_after = after[['movie', 'year', 'fandango']].copy()

fandango_previous.head(3)

FILM Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 5.0 4.5 14846 0.5
1 Cinderella (2015) 5.0 4.5 12640 0.5
2 Ant-Man (2015) 5.0 4.5 12055 0.5

fandango_after.head(3)

movie year fandango
0 10 Cloverfield Lane 2016 3.5
1 13 Hours 2016 4.5
2 A Cure for Wellness 2016 3.0

Our goal is to determine whether there has been any change in Fandango's rating system after Hickey's analysis. The population of interest for our analysis consists of all the movie ratings stored on Fandango's website, regardless of release year.

Because we want to find out whether the parameters of this population changed after Hickey's analysis, we're interested in sampling the population at two different periods in time — before and after Hickey's analysis — so we can compare the two states.

The data we're working with was sampled at the moments we want: one sample was taken previous to the analysis, and the other after the analysis. We want to describe the population, so we need to make sure that the samples are representative. Otherwise we should expect a large sampling error and, ultimately, wrong conclusions.

From Hickey's article and from the data set repository's documentation, we can see that he used the following sampling criteria:

  • The movie must have had at least 30 fan ratings on Fandango's website at the time of sampling (Aug. 24, 2015).
  • The movie must have had tickets on sale in 2015.

The sampling was clearly not random because not every movie had the same chance to be included in the sample — some movies had no chance at all (those with under 30 fan ratings or without tickets on sale in 2015). It's questionable whether this sample is representative of the entire population we're interested in describing. It seems more likely that it isn't, mostly because this sample is subject to temporal trends — e.g. movies in 2015 might have been outstandingly good or bad compared to other years.

The sampling conditions for our other sample were (as described in the data set repository's documentation):

  • The movie must have been released in 2016 or later.
  • The movie must have had a considerable number of votes and reviews (it's unclear how many from the documentation or from the data).

This second sample is also subject to temporal trends and it's unlikely to be representative of our population of interest.

Both Hickey and the repo author had certain research questions in mind when they sampled the data, and they used a set of criteria to get a sample that would fit their questions. Their sampling method is called purposive sampling (or judgmental/selective/subjective sampling). While these samples were good enough for their research, they don't seem too useful for us.

Changing the goal of our analysis

At this point, we can either collect new data or change the goal of our analysis. Let's choose the latter and place some limitations on our initial goal.

Instead of trying to determine whether there has been any change in Fandango's rating system after Hickey's analysis, our new goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. This new goal should also be a fairly good proxy for our initial goal.

Isolating the samples we need

With this new research goal, we have two populations of interest:

  1. All Fandango's ratings for popular movies released in 2015.
  2. All Fandango's ratings for popular movies released in 2016.

We need to be clear about what counts as popular movies. We'll use Hickey's benchmark of 30 fan ratings and count a movie as popular only if it has 30 fan ratings or more on Fandango's website.

Although one of the sampling criteria in our second sample is movie popularity, the sample doesn't provide information about the number of fan ratings. We should be skeptical once more and ask whether this sample is truly representative and contains popular movies (movies with 30 fan ratings or more).

One quick way to check the representativity of this sample is to randomly sample 10 movies from it and then check the number of fan ratings ourselves on Fandango's website. Ideally, at least 8 of the 10 movies should have 30 fan ratings or more.

fandango_after.sample(10, random_state = 1)
movie year fandango
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5
51 Fantastic Beasts and Where to Find Them 2016 4.5
33 Cell 2016 3.0
59 Genius 2016 3.5
152 Sully 2016 4.5
4 A Hologram for the King 2016 3.0
31 Captain America: Civil War 2016 4.5

Above we used a value of 1 as the random seed. This is good practice because it suggests that we weren't trying out various random seeds just to get a favorable sample.

As of April 2018, these are the fan ratings we found:

Movie Fan ratings
Mechanic: Resurrection 2247
Warcraft 7271
Max Steel 493
Me Before You 5263
Fantastic Beasts and Where to Find Them 13400
Cell 17
Genius 127
Sully 11877
A Hologram for the King 500
Captain America: Civil War 35057

90% of the movies in our sample are popular. This is enough, and we can move forward with a bit more confidence.

Let's also double-check the other data set for popular movies. The documentation clearly states that it contains only movies with at least 30 fan ratings, but it should take only a couple of seconds to double-check here.

sum(fandango_previous['Fandango_votes'] < 30)

If you explore the two data sets, you'll notice that there are movies with a release year other than 2015 or 2016. For our purposes, we need to isolate only the movies released in 2015 and 2016.

Let's start with Hickey's data set and isolate only the movies released in 2015. There's no dedicated column for the release year, but we should be able to extract it from the strings in the FILM column.

fandango_previous.head(2)

FILM Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 5.0 4.5 14846 0.5
1 Cinderella (2015) 5.0 4.5 12640 0.5

fandango_previous['Year'] = fandango_previous['FILM'].str[-5:-1]
fandango_previous.head(2)

FILM Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference Year
0 Avengers: Age of Ultron (2015) 5.0 4.5 14846 0.5 2015
1 Cinderella (2015) 5.0 4.5 12640 0.5 2015

fandango_previous['Year'].value_counts()

2015    129
2014     17
Name: Year, dtype: int64

fandango_2015 = fandango_previous[fandango_previous['Year'] == '2015'].copy()
fandango_2015['Year'].value_counts()

2015    129
Name: Year, dtype: int64

Great, now let's isolate the movies in the other data set.

fandango_after.head(2)

movie year fandango
0 10 Cloverfield Lane 2016 3.5
1 13 Hours 2016 4.5

fandango_after['year'].value_counts()

2016    191
2017     23
Name: year, dtype: int64

fandango_2016 = fandango_after[fandango_after['year'] == 2016].copy()
fandango_2016['year'].value_counts()

2016    191
Name: year, dtype: int64

Comparing distribution shapes for 2015 and 2016

Our aim is to figure out whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. One way to go about this is to analyze and compare the distributions of movie ratings for the two samples.

We'll start with comparing the shape of the two distributions using kernel density plots. We'll use the FiveThirtyEight style for the plots.

import matplotlib.pyplot as plt
from numpy import arange
%matplotlib inline'fivethirtyeight')

fandango_2015['Fandango_Stars'].plot.kde(label = '2015', legend = True, figsize = (8,5.5))
fandango_2016['fandango'].plot.kde(label = '2016', legend = True)

plt.title("Comparing distribution shapes for Fandango's ratings\n(2015 vs 2016)",
          y = 1.07) # the `y` parameter pads the title upward
plt.xlim(0,5) # because ratings start at 0 and end at 5


Two aspects of the figure above are striking:

  • Both distributions are strongly left skewed.
  • The 2016 distribution is slightly shifted to the left relative to the 2015 distribution.

The left skew suggests that movies on Fandango are given mostly high and very high fan ratings. Coupled with the fact that Fandango sells tickets, the high ratings are a bit dubious. It'd be really interesting to investigate this further — ideally in a separate project, since this is quite irrelevant for the current goal of our analysis.

The slight left shift of the 2016 distribution is very interesting for our analysis. It suggests there was indeed a difference between Fandango's ratings for popular movies in 2015 and in 2016, and it also shows the direction of that difference: ratings were slightly lower in 2016.

Comparing relative frequencies

It seems we're following a good thread so far, but we need to analyze more granular information. Let's examine the frequency tables of the two distributions to analyze some numbers. Because the data sets have different numbers of movies, we normalize the tables and show percentages instead.

print('2015' + '\n' + '-' * 16) # Label the output so we don't confuse the two tables

fandango_2015['Fandango_Stars'].value_counts(normalize = True).sort_index() * 100
3.0     8.527132
3.5    17.829457
4.0    28.682171
4.5    37.984496
5.0     6.976744
Name: Fandango_Stars, dtype: float64
print('2016' + '\n' + '-' * 16)
fandango_2016['fandango'].value_counts(normalize = True).sort_index() * 100
2.5     3.141361
3.0     7.329843
3.5    24.083770
4.0    40.314136
4.5    24.607330
5.0     0.523560
Name: fandango, dtype: float64

In 2016, very high ratings (4.5 and 5 stars) had significantly lower percentages compared to 2015. In 2016, under 1% of the movies had a perfect 5-star rating, compared to close to 7% in 2015. Ratings of 4.5 were also more common in 2015: their share was approximately 13 percentage points higher than in 2016.

The minimum rating is also lower in 2016 — 2.5 instead of 3 stars, the minimum of 2015. There clearly is a difference between the two frequency distributions.

For some other ratings, the percentage went up in 2016. There was a greater percentage of movies in 2016 that received 3.5 and 4 stars, compared to 2015. 3.5 and 4.0 are high ratings and this challenges the direction of the change we saw on the kernel density plots.

Determining the direction of the change

Let's take a couple of summary metrics to get a more precise picture about the direction of the change. In what follows, we'll compute the mean, the median, and the mode for both distributions and then use a bar graph to plot the values.

mean_2015 = fandango_2015['Fandango_Stars'].mean()
mean_2016 = fandango_2016['fandango'].mean()

median_2015 = fandango_2015['Fandango_Stars'].median()
median_2016 = fandango_2016['fandango'].median()

mode_2015 = fandango_2015['Fandango_Stars'].mode()[0] # the output of Series.mode() is a bit uncommon
mode_2016 = fandango_2016['fandango'].mode()[0]

summary = pd.DataFrame()
summary['2015'] = [mean_2015, median_2015, mode_2015]
summary['2016'] = [mean_2016, median_2016, mode_2016]
summary.index = ['mean', 'median', 'mode']
summary

            2015      2016
mean    4.085271  3.887435
median  4.000000  4.000000
mode    4.500000  4.000000'fivethirtyeight')
summary['2015'] = '#0066FF', align = 'center', label = '2015', width = .25)
summary['2016'] = '#CC0000', align = 'edge', label = '2016', width = .25,
                          rot = 0, figsize = (8,5))

plt.title('Comparing summary statistics: 2015 vs 2016', y = 1.07)
plt.legend(framealpha = 0, loc = 'upper center')


The mean rating was lower in 2016 by approximately 0.2. This means a drop of almost 5% relative to the mean rating in 2015.

(summary.loc['mean'][0] - summary.loc['mean'][1]) / summary.loc['mean'][0]  # ≈ 0.048, i.e. almost a 5% drop

While the median is the same for both distributions, the mode is lower in 2016 by 0.5. Coupled with what we saw for the mean, the direction of the change we saw on the kernel density plot is confirmed: on average, popular movies released in 2016 were rated slightly lower than popular movies released in 2015.


Conclusion

Our analysis showed that there is indeed a slight difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We also determined that, on average, popular movies released in 2016 were rated lower on Fandango than popular movies released in 2015.

We cannot be completely sure what caused the change, but the chances are very high that it was caused by Fandango fixing its biased rating system after Hickey's analysis.


R Packages worth a look

Estimation, Comparison and Selection of Transformations (trafo)
Estimation, selection and comparison of several families of transformations. The families of transformations included in the package are the following: …

Easily Install and Load the ‘Tidymodels’ Packages (tidymodels)
The tidy modeling ‘verse’ is a collection of packages for modeling and statistical analysis that share the underlying design philosophy, grammar, and da …

Sample Size Calculations for Complex Surveys (samplesize4surveys)
Computes the required sample size for estimation of totals, means and proportions under complex sampling designs.

A Colour List and Colour Metric Based on the ISCC-NBS System of Color Designation (rolocISCCNBS)
A colour list and colour metric based on the ISCC-NBS System of Color Designation for use with the ‘roloc’ package for converting colour specifications …

Compute Directly Standardized Rates, Ratios and Differences (dsr)
A set of functions to compute and compare directly standardized rates, rate differences and ratios. A variety of user defined options for analysis (e.g …


What's new on arXiv

Reconfigurable Inverted Index

Existing approximate nearest neighbor search systems suffer from two fundamental problems that are of practical importance but have not received sufficient attention from the research community. First, although existing systems perform well for the whole database, it is difficult to run a search over a subset of the database. Second, there has been no discussion concerning the performance decrement after many items have been newly added to a system. We develop a reconfigurable inverted index (Rii) to resolve these two issues. Based on the standard IVFADC system, we design a data layout such that items are stored linearly. This enables us to efficiently run a subset search by switching the search method to a linear PQ scan if the size of a subset is small. Owing to the linear layout, the data structure can be dynamically adjusted after new items are added, maintaining the fast speed of the system. Extensive comparisons show that Rii achieves a comparable performance with state-of-the-art systems such as Faiss.
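The subset-search idea (fall back to a linear scan when the requested subset is small) can be sketched as follows; raw vectors stand in for PQ codes and the index branch is a placeholder, so this shows only the control flow, not Rii itself.

```python
import numpy as np

def search(db, query, subset=None, linear_threshold=1024, k=1):
    """Nearest-neighbor search that picks its strategy by subset size.

    For small subsets, a linear scan over just those items is cheapest;
    for large subsets (or the full database), a real system would use an
    inverted index instead. Here both branches scan, for brevity.
    """
    ids = np.arange(len(db)) if subset is None else np.asarray(subset)
    if len(ids) <= linear_threshold:
        cand = ids              # linear scan over the subset only
    else:
        cand = ids              # placeholder for index-based candidate selection
    d = np.linalg.norm(db[cand] - query, axis=1)  # exact distances
    order = np.argsort(d)[:k]
    return cand[order]

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 4))
hit = search(db, db[7], subset=[3, 7, 42])  # restrict search to three items
```

The linear data layout is what makes the small-subset branch cheap: candidate ids index directly into contiguous storage.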

Text classification using capsules

This paper presents an empirical exploration of the use of capsule networks for text classification. While it has been shown that capsule networks are effective for image classification, their validity in the domain of text has not been explored. In this paper, we show that capsule networks indeed have the potential for text classification and that they have several advantages over convolutional neural networks. We further suggest a simple routing method that effectively reduces the computational complexity of dynamic routing. We utilized seven benchmark datasets to demonstrate that capsule networks, along with the proposed routing method, provide comparable results.

Interpretable Time Series Classification using All-Subsequence Learning and Symbolic Representations in Time and Frequency Domains

The time series classification literature has expanded rapidly over the last decade, with many new classification approaches published each year. The research focus has mostly been on improving the accuracy and efficiency of classifiers, while their interpretability has been somewhat neglected. Classifier interpretability has become a critical constraint for many application domains and the introduction of the ‘right to explanation’ GDPR EU legislation in May 2018 is likely to further emphasize the importance of explainable learning algorithms. In this work we analyse the state-of-the-art for time series classification, and propose new algorithms that aim to maintain the classifier accuracy and efficiency, but keep interpretability as a key design constraint. We present new time series classification algorithms that advance the state-of-the-art by implementing the following three key ideas: (1) Multiple resolutions of symbolic approximations: we combine symbolic representations obtained using different parameters; (2) Multiple domain representations: we combine symbolic approximations in time (e.g., SAX) and frequency (e.g., SFA) domains; (3) Efficient navigation of a huge symbolic-words space: we adapt a symbolic sequence classifier named SEQL, to make it work with multiple domain representations (e.g., SAX-SEQL, SFA-SEQL), and use its greedy feature selection strategy to effectively filter the best features for each representation. We show that a multi-resolution multi-domain linear classifier, SAX-SFA-SEQL, achieves a similar accuracy to the state-of-the-art COTE ensemble, and to a recent deep learning method (FCN), but uses a fraction of the time required by either COTE or FCN. We discuss the accuracy, efficiency and interpretability of our proposed algorithms. To further analyse the interpretability aspect of our classifiers, we present a case study on an ecology benchmark.
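As a rough illustration of the symbolic approximations mentioned above, here is a minimal SAX-style transform: z-normalize, average over equal-length segments (PAA), then discretize the segment means with standard-normal quantile breakpoints. The defaults below are the N(0,1) quartiles, giving a 4-letter alphabet; real SAX/SFA implementations handle uneven segment lengths and more.

```python
import numpy as np

def sax(series, n_segments=4, breakpoints=(-0.6745, 0.0, 0.6745)):
    """Symbolic Aggregate approXimation: series -> short letter string.

    1) z-normalize the series,
    2) PAA: take the mean of each equal-length segment,
    3) map each segment mean to a letter via the breakpoints.
    Assumes len(series) is divisible by n_segments.
    """
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()
    paa = x.reshape(n_segments, -1).mean(axis=1)
    symbols = np.searchsorted(breakpoints, paa)   # 0..len(breakpoints)
    return ''.join(chr(ord('a') + s) for s in symbols)

# A low-then-high series becomes a short word: low segments -> 'a', high -> 'd'.
word = sax([1, 1, 2, 2, 8, 8, 9, 9], n_segments=4)
```

Words like this are what a sequence classifier such as SEQL then mines for discriminative subsequences.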

Review of Different Privacy Preserving Techniques in PPDP

Big data is a term used for very large data sets that are difficult to store and process. Analyzing this amount of data can lead to information loss. The main goal of this paper is to share data in a way that preserves privacy while keeping information loss to a minimum. Data from government agencies, university records, medical histories, etc., are very useful for an organization to analyze and to predict trends and patterns, but privacy regulations may prevent the data owner from sharing the data [1]. By analyzing several anonymization algorithms such as k-anonymity, l-diversity and t-closeness, one can achieve privacy at minimum loss. Admittedly, these techniques have some limitations: we need to maintain a trade-off between privacy and information loss. We introduce a novel approach called Differential Privacy.
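Of the anonymization properties mentioned, k-anonymity is the simplest to illustrate: every combination of quasi-identifier values must occur in at least k rows. A minimal check, on a small hypothetical released table (the columns and values are made up for the example):

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every combination of quasi-identifier values
    appears in at least k rows of the table."""
    return int(df.groupby(quasi_identifiers).size().min()) >= k

# Hypothetical released table: generalized zip code and age band are the
# quasi-identifiers; diagnosis is the sensitive attribute.
released = pd.DataFrame({
    'zip':       ['130**', '130**', '148**', '148**'],
    'age_band':  ['20-29', '20-29', '30-39', '30-39'],
    'diagnosis': ['flu', 'cold', 'flu', 'asthma'],
})
ok = is_k_anonymous(released, ['zip', 'age_band'], k=2)
```

l-diversity and t-closeness add conditions on the sensitive attribute within each such group, which is exactly where the privacy/information-loss trade-off appears.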

Detecting deviations from second-order stationarity in locally stationary functional time series

A time-domain test for the assumption of second order stationarity of a functional time series is proposed. The test is based on combining individual cumulative sum tests which are designed to be sensitive to changes in the mean, variance and autocovariance operators, respectively. The combination of their dependent p-values relies on a joint dependent block multiplier bootstrap of the individual test statistics. Conditions under which the proposed combined testing procedure is asymptotically valid under stationarity are provided. A procedure is proposed to automatically choose the block length parameter needed for the construction of the bootstrap. The finite-sample behavior of the proposed test is investigated in Monte Carlo experiments and an illustration on a real data set is provided.
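As a scalar analogue of the individual CUSUM tests the paper combines, here is a minimal change-in-mean CUSUM statistic. This is only a sketch: the paper's tests act on functional data (mean, variance and autocovariance operators) and calibrate via a dependent block multiplier bootstrap, none of which is reproduced here.

```python
import numpy as np

def cusum_stat(x):
    """Maximum absolute CUSUM deviation, scaled by std * sqrt(n).
    Large values suggest the mean changed somewhere in the series."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(1, n + 1)
    s = np.cumsum(x)
    dev = s - (k / n) * s[-1]   # partial sums minus their linear trend
    return np.abs(dev).max() / (x.std() * np.sqrt(n))

stable = np.sin(np.arange(100))                              # no mean shift
shifted = np.concatenate([np.zeros(50), np.ones(50) * 5.0])  # mean jump at t=50
```

The statistic is small for the stationary series and large for the one with a mean shift, which is the behavior the combined test generalizes to functional observations.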

Parikh Matrices for Powers of Words

Certain upper triangular matrices, termed as Parikh matrices, are often used in the combinatorial study of words. Given a word, the Parikh matrix of that word elegantly computes the number of occurrences of certain predefined subwords in that word. In this paper, we compute the Parikh matrix of any word raised to an arbitrary power. Furthermore, we propose canonical decompositions of both Parikh matrices and words into normal forms. Finally, given a Parikh matrix, the relation between its normal form and the normal forms of words in the corresponding M-equivalence class is established.
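For the binary alphabet {a, b}, a Parikh matrix can be computed as a product of elementary upper triangular matrices, one per letter. Because the Parikh mapping is a morphism (the matrix of uv is the product of the matrices of u and v), the matrix of a power w^k is simply the k-th matrix power of the matrix of w, which is the standard property behind computing Parikh matrices of powers of words.

```python
import numpy as np

def parikh_matrix(word):
    """Parikh matrix of a word over {a, b}: upper triangular with
    |w|_a at (0,1), |w|_b at (1,2), and the number of occurrences of
    'ab' as a scattered subword at (0,2)."""
    M = np.eye(3, dtype=int)
    step = {'a': np.array([[1, 1, 0], [0, 1, 0], [0, 0, 1]]),
            'b': np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1]])}
    for ch in word:
        M = M @ step[ch]
    return M

# Morphism property: the matrix of ('ab')^3 equals the cube of the matrix of 'ab'.
w = 'ab'
M3 = parikh_matrix(w * 3)
```

For 'ababab' the (0,2) entry is 6, counting the scattered-subword occurrences of 'ab'.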

A Capsule Network-based Embedding Model for Knowledge Graph Completion and Search Personalization

In this paper, we introduce an embedding model, named CapsE, exploring a capsule network to model relationship triples \textit{(subject, relation, object)}. Our CapsE represents each triple as a 3-column matrix where each column vector represents the embedding of an element in the triple. This 3-column matrix is then fed to a convolution layer where multiple filters are operated to generate different feature maps. These feature maps are used to construct capsules in the first capsule layer. Capsule layers are connected via dynamic routing mechanism. The last capsule layer consists of only one capsule to produce a vector output. The length of this vector output is used to measure the plausibility of the triple. Our proposed CapsE obtains state-of-the-art link prediction results for knowledge graph completion on two benchmark datasets: WN18RR and FB15k-237, and outperforms strong search personalization baselines on SEARCH17 dataset.

Modeling Semantics with Gated Graph Neural Networks for Knowledge Base Question Answering

Most approaches to Knowledge Base Question Answering are based on semantic parsing. In this paper, we address the problem of learning vector representations for complex semantic parses that consist of multiple entities and relations. Previous work largely focused on selecting the correct semantic relations for a question and disregarded the structure of the semantic parse: the connections between entities and the directions of the relations. We propose to use Gated Graph Neural Networks to encode the graph structure of the semantic parse. We show on two data sets that the graph networks outperform all baseline models that do not explicitly model the structure. The error analysis confirms that our approach can successfully process complex semantic parses.

Learning Explanations from Language Data

PatternAttribution is a recent method, introduced in the vision domain, that explains classifications of deep neural networks. We demonstrate that it also generates meaningful interpretations in the language domain.

A Matching Based Theoretical Framework for Estimating Probability of Causation

The concept of Probability of Causation (PC) is critically important in legal contexts and can help in many other domains. While it has been around since 1986, current operationalizations can obtain only the minimum and maximum values of PC, and do not apply for purely observational data. We present a theoretical framework to estimate the distribution of PC from experimental and from purely observational data. We illustrate additional problems of the existing operationalizations and show how our method can be used to address them. We also provide two illustrative examples of how our method is used and how factors like sample size or rarity of events can influence the distribution of PC. We hope this will make the concept of PC more widely usable in practice.

Multi-Task Learning for Sequence Tagging: An Empirical Study

We study three general multi-task learning (MTL) approaches on 11 sequence tagging tasks. Our extensive empirical results show that in about 50% of the cases, jointly learning all 11 tasks improves upon either independent or pairwise learning of the tasks. We also show that pairwise MTL can inform us what tasks can benefit others or what tasks can be benefited if they are learned jointly. In particular, we identify tasks that can always benefit others as well as tasks that can always be harmed by others. Interestingly, one of our MTL approaches yields embeddings of the tasks that reveal the natural clustering of semantic and syntactic tasks. Our inquiries have opened the doors to further utilization of MTL in NLP.

Semiparametric Bayesian causal inference using Gaussian process priors

We develop a semiparametric Bayesian approach for estimating the mean response in a missing data model with binary outcomes and a nonparametrically modelled propensity score. Equivalently we estimate the causal effect of a treatment, correcting nonparametrically for confounding. We show that standard Gaussian process priors satisfy a semiparametric Bernstein-von Mises theorem under smoothness conditions. We further propose a novel propensity score-dependent prior that provides efficient inference under strictly weaker conditions. We also show that it is theoretically preferable to model the covariate distribution with a Dirichlet process or Bayesian bootstrap, rather than modelling the covariate density using a Gaussian process prior.

iNNvestigate neural networks!

In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes. Therefore, it is crucial that domain specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite these arguments neural networks are often treated as black boxes. In the attempt to alleviate this shortcoming many analysis methods were proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. The presented library iNNvestigate addresses this by providing a common interface and out-of-the-box implementation for many analysis methods, including the reference implementation for PatternNet and PatternAttribution as well as for LRP-methods. To demonstrate the versatility of iNNvestigate, we provide an analysis of image classifications for a variety of state-of-the-art neural network architectures.

Estimating Heterogeneous Causal Effects in the Presence of Irregular Assignment Mechanisms

This paper provides a link between causal inference and machine learning techniques – specifically, Classification and Regression Trees (CART) – in observational studies where the receipt of the treatment is not randomized, but the assignment to the treatment can be assumed to be randomized (irregular assignment mechanism). The paper contributes to the growing applied machine learning literature on causal inference, by proposing a modified version of the Causal Tree (CT) algorithm to draw causal inference from an irregular assignment mechanism. The proposed method is developed by merging the CT approach with the instrumental variable framework to causal inference, hence the name Causal Tree with Instrumental Variable (CT-IV). As compared to CT, the main strength of CT-IV is that it can deal more efficiently with the heterogeneity of causal effects, as demonstrated by a series of numerical results obtained on synthetic data. Then, the proposed algorithm is used to evaluate a public policy implemented by the Tuscan Regional Administration (Italy), which aimed at easing the access to credit for small firms. In this context, CT-IV breaks fresh ground for target-based policies, identifying interesting heterogeneous causal effects.

Understanding training and generalization in deep learning by Fourier analysis

Background: It is still an open research area to theoretically understand why Deep Neural Networks (DNNs)—equipped with many more parameters than training data and trained by (stochastic) gradient-based methods—often achieve remarkably low generalization error. Contribution: We study DNN training by Fourier analysis. Our theoretical framework explains: i) DNN with (stochastic) gradient-based methods endows low-frequency components of the target function with a higher priority during the training; ii) Small initialization leads to good generalization ability of DNN while preserving the DNN’s ability of fitting any function. These results are further confirmed by experiments of DNNs fitting the following datasets, i.e., natural images, one-dimensional functions and MNIST dataset.

Simple Root Cause Analysis by Separable Likelihoods

Root Cause Analysis (RCA) for anomalies is challenging because of the trade-off between accuracy and the explanatory friendliness required for industrial applications. In this paper we propose a framework for simple and friendly RCA within the Bayesian regime under certain restrictions imposed on the predictive posterior (namely that the Hessian at the mode is diagonal, here referred to as separability). We show that this assumption is satisfied for important base models, including the Multinomial, Dirichlet-Multinomial and Naive Bayes models. To demonstrate the usefulness of the framework, we embed it into a Bayesian network and validate it on web server error logs (a real-world data set).

Rank-1 Convolutional Neural Network

In this paper, we propose a convolutional neural network (CNN) with 3-D rank-1 filters, each composed as the outer product of 1-D filters. After training, the 3-D rank-1 filters can be decomposed into 1-D filters at test time for fast inference. The reason we train 3-D rank-1 filters in the training stage, instead of consecutive 1-D filters, is that a better gradient flow is obtained with this setting, which makes training possible even in cases where a network with consecutive 1-D filters cannot be trained. The 3-D rank-1 filters are updated by both the gradient flow and the outer product of the 1-D filters in every epoch, where the gradient flow tries to find a solution that minimizes the loss function, while the outer product operation forces the parameters of the filter to live on a rank-1 subspace. Furthermore, we show that convolution with rank-1 filters results in low-rank outputs, constraining the final output of the CNN to also live on a low-dimensional subspace.
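As a numerical illustration of the rank-1 construction (a sketch with arbitrary filter values, not the paper's code): a 3-D rank-1 filter is the outer product of three 1-D filters, which is what permits the decomposition into separate 1-D convolutions at test time.

```python
import numpy as np

# A 3-D rank-1 filter built as the outer product of three 1-D filters
# (filter values are arbitrary; this only illustrates the construction).
a = np.array([1.0, 2.0, 1.0])         # 1-D filter along the first axis
b = np.array([1.0, 0.0, -1.0])        # 1-D filter along the second axis
c = np.array([1.0, 1.0, 1.0])         # 1-D filter along the third axis
f = np.einsum("i,j,k->ijk", a, b, c)  # f[i, j, k] = a[i] * b[j] * c[k]

# Rank-1 check: unfolding f along its first axis yields a rank-1 matrix,
# so convolving with f can be replaced by three consecutive 1-D convolutions.
rank = np.linalg.matrix_rank(f.reshape(3, 9))
```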

Augmenting word2vec with latent Dirichlet allocation within a clinical application
Existential monadic second order convergence law fails on sparse random graphs
Globally Convergent Type-I Anderson Acceleration for Non-Smooth Fixed-Point Iterations
Local Decodability of the Burrows-Wheeler Transform
Robot Safe Interaction System for Intelligent Industrial Co-Robots
Multimodal Differential Network for Visual Question Generation
A simulation study to distinguish prompt photon from $\pi^0$ and beam halo in a granular calorimeter using deep networks
Convex Union Representability and Convex Codes
Various Optimality Criteria for the Prediction of Individual Response Curves
Time-Varying Semidefinite Programs
Language Guided Fashion Image Manipulation with Feature-wise Transformations
Multi-Cell Massive MIMO in LoS
On rigidity of unit-bar frameworks
On configuration spaces and Whitehouse’s lifts of the Eulerian representations
PAC-Battling Bandits with Plackett-Luce: Tradeoff between Sample Complexity and Subset Size
Scene-LSTM: A Model for Human Trajectory Prediction
Saturation numbers for Ramsey-minimal graphs
Off-diagonal ordered Ramsey numbers of matchings
3D Geometry-Aware Semantic Labeling of Outdoor Street Scenes
Confidence penalty, annealing Gaussian noise and zoneout for biLSTM-CRF networks for named entity recognition
Modeling and Simulation of Regenerative Braking Energy in DC Electric Rail Systems
Fooling Polytopes
Dynamic Pricing for Revenue Maximization in Mobile Social Data Market with Network Effects
Optimization of a perturbed sweeping process by constrained discontinuous controls
Faster and More Robust Mesh-based Algorithms for Obstacle k-Nearest Neighbour
A Nonparametric Bayesian Method for Clustering of High-Dimensional Mixed Dataset
A short note on checkerboard colorable twisted duals
Relax, and Accelerate: A Continuous Perspective on ADMM
Optimal control of Markov-modulated multiclass many-server queues
Constructing Non-isomorphic Signless Laplacian Cospectral Graphs
Privacy Preserving and Cost Optimal Mobile Crowdsensing using Smart Contracts on Blockchain
Estimating the Distribution of Random Parameters in a Diffusion Equation Forward Model for a Transdermal Alcohol Biosensor
Rigid colourings of hypergraphs and contiguity
Speeding Up Constrained $k$-Means Through 2-Means
Time Perception Machine: Temporal Point Processes for the When, Where and What of Activity Prediction
Regularizing Neural Machine Translation by Target-bidirectional Agreement
Game Theoretic Analysis for Joint Sponsored and Edge Caching Content Service Market
A Transfer Learning based Feature-Weak-Relevant Method for Image Clustering
Language Style Transfer from Sentences with Arbitrary Unknown Styles
Stretched and compressed exponentials in the relaxation dynamics of a metallic glass-forming melt
Long properly coloured cycles in edge-coloured graphs
Live Video Comment Generation Based on Surrounding Frames and Live Comments
Directed Policy Gradient for Safe Reinforcement Learning with Human Advice
Eigenvectors of Deformed Wigner Random Matrices
Regularity and Sensitivity for McKean-Vlasov Type SPDEs Generated by Stable-like Processes
Improved Recovery of Analysis Sparse Vectors in Presence of Prior Information
Towards Audio to Scene Image Synthesis using Generative Adversarial Network
Enumerating five families of pattern-avoiding inversion sequences; and introducing the powered Catalan numbers
AsySPA: An Exact Asynchronous Algorithm for Convex Optimization Over Digraphs
Methodology for identifying study sites in scientific corpus
DenseRAN for Offline Handwritten Chinese Character Recognition
Parsimonious HMMs for Offline Handwritten Chinese Text Recognition
Network Flows that Solve Least Squares for Linear Equations
Symmetric decompositions and real-rootedness
A Preliminary Study On Emerging Cloud Computing Security Challenges
On grounded L-graphs and their relatives
Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms
Quantization effects and convergence properties of rigid formation control systems with quantized distance measurements
A Forward-Backward Splitting Method for Monotone Inclusions Without Cocoercivity
Automatic Reference-Based Evaluation of Pronoun Translation Misses the Point
On the Shannon entropy of the number of vertices with zero in-degree in randomly oriented hypergraphs
Exponential loss of memory for the 2-dimensional Allen-Cahn equation with small noise
Incremental Non-Rigid Structure-from-Motion with Unknown Focal Length
Stealth Attacks on the Smart Grid
Faster deterministic parameterized algorithm for k-Path
Automatic Plaque Detection in IVOCT Pullbacks Using Convolutional Neural Networks
Rapid Adaptation of Neural Machine Translation to New Languages
Equidistributed statistics on Fishburn matrices and permutations
Passing through a stack $k$ times with reversals
New optimal control problems in density functional theory motivated by photovoltaics
Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging
Separation-type combinatorial invariants for triangulations of manifolds
On criteria for rook equivalence of Ferrers boards
A Reference Architecture for Datacenter Scheduling: Extended Technical Report
Fast Video Shot Transition Localization with Deep Structured Models
Rook and Wilf equivalence of integer partitions
Applications of molecular communications to medicine: a survey
On the Distribution of Range for Tree-Indexed Random Walks
Fully commutative elements of the complex reflection groups
A molecular communications model for drug delivery
Symmetric Dellac configurations
Graph-Based Controller Synthesis for Safety-Constrained, Resilient Systems
BACH: Grand Challenge on Breast Cancer Histology Images
Clustering genomic words in human DNA using peaks and trends of distributions
Generalized Multivariate Extreme Value Models for Explicit Route Choice Sets
The monotonicity properties for the rank of overpartitions
Unsupervised Hard Example Mining from Videos for Improved Object Detection
Visual Sensor Network Reconfiguration with Deep Reinforcement Learning
Automatic Playlist Continuation through a Composition of Collaborative Filters
Global Complexity Analysis of Inexact Successive Quadratic Approximation methods for Regularized Optimization under Mild Assumptions
Fast, Better Training Trick — Random Gradient
Randomized Hamiltonian Monte Carlo as Scaling Limit of the Bouncy Particle Sampler and Dimension-Free Convergence Rates
Evaluation of estimation approaches on the quality and robustness of collision warning system
What is Unique in Individual Gait Patterns? Understanding and Interpreting Deep Learning in Gait Analysis
Precise Performance Analysis of the LASSO under Matrix Uncertainties
Analysing Multiple Epidemic Data Sources
Comparing morphological complexity of Spanish, Otomi and Nahuatl
Noncoherent Multiantenna Receivers for Cognitive Backscatter System with Multiple RF Sources
Generating Paths with WFC
A simple counterexample to the Monge ansatz in multi-marginal optimal transport, convex geometry of the set of Kantorovich plans, and the Frenkel-Kontorova model
Improving Shape Deformation in Unsupervised Image-to-Image Translation
Hidden Fluid Mechanics: A Navier-Stokes Informed Deep Learning Framework for Assimilating Flow Visualization Data
Stable limits for Markov chains via the Principle of Conditioning
Angular-Based Word Meta-Embedding Learning
Vision-Based Preharvest Yield Mapping for Apple Orchards
Disentangled Representation Learning for Text Style Transfer
Estimating the Density of States of Frustrated Spin Systems
REGMAPR – A Recipe for Textual Matching
Interactive Launch of 16,000 Microsoft Windows Instances on a Supercomputer
Functional Large Deviations for Cox Processes and $Cox/G/\infty$ Queues, with a Biological Application
Multivariate Geometric Anisotropic Cox Processes
Patrolling on Dynamic Ring Networks
Small-time fluctuations for the bridge in a model class of hypoelliptic diffusions of weak Hörmander type
Moments of the SHE under delta initial measure
Large-Scale Study of Curiosity-Driven Learning


August 14, 2018

Magister Dixit

“Analytic and decision making have to be probabilistic; and the system and application has to be conscious of what is ‘good enough’ and not fail in the absence of perfect behavior.” Prasant Misra, Yogesh Simmhan, Jay Warrior (03.02.2015)


Better Analytics for the Product Experience – Aug 21 webinar

Learn a process for discovering the data and analytics needs of your users using user stories, use cases, and mapping to data sources; strategies for balancing priorities and managing expectations; and more.


Book Memo: “Uncertainty Modelling in Data Science”

This book features 29 peer-reviewed papers presented at the 9th International Conference on Soft Methods in Probability and Statistics (SMPS 2018), which was held in conjunction with the 5th International Conference on Belief Functions (BELIEF 2018) in Compiègne, France on September 17-21, 2018. It includes foundational, methodological and applied contributions on topics as varied as imprecise data handling, linguistic summaries, model coherence, imprecise Markov chains, and robust optimisation. These proceedings were produced using EasyChair.

Over recent decades, interest in extensions and alternatives to probability and statistics has increased significantly in diverse areas, including decision-making, data mining and machine learning, and optimisation. This interest stems from the need to enrich existing models, in order to include different facets of uncertainty, like ignorance, vagueness, randomness, conflict or imprecision. Frameworks such as rough sets, fuzzy sets, fuzzy random variables, random sets, belief functions, possibility theory, imprecise probabilities, lower previsions, and desirable gambles all share this goal, but have emerged from different needs.

The advances, results and tools presented in this book are important in the ubiquitous and fast-growing fields of data science, machine learning and artificial intelligence. Indeed, an important aspect of some of the learned predictive models is the trust placed in them. Modelling the uncertainty associated with the data and the models carefully and with principled methods is one of the means of increasing this trust, as the model will then be able to distinguish between reliable and less reliable predictions. In addition, extensions such as fuzzy sets can be explicitly designed to provide interpretable predictive models, facilitating user interaction and increasing trust.


BooST (Boosting Smooth Trees) a new Machine Learning Model for Partial Effect Estimation in Nonlinear Regressions

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos and Yuri Fonseca


We are happy to introduce our new machine learning method called Boosting Smooth Trees (BooST) (full article here). This model is joint work with professors Marcelo Medeiros and Álvaro Veiga. BooST uses a different type of regression tree that allows us to estimate the derivatives of very general nonlinear models: the model is differentiable and has an analytical solution. The consequence is that we can now estimate the partial effects of a characteristic on the response variable, which gives us much more interpretability than traditional importance measures.

The idea behind BooST is to replace traditional Classification and Regression Trees (CART), which are not differentiable, with smooth logistic trees. We show that with this adaptation BooST is a consistent estimator of the model's derivatives under some assumptions.


The example below shows that BooST is very good at recovering the derivatives of nonlinear functions. The data was generated from the following model:

$y_i = \cos(\pi [x_{i,1}+x_{i,2}]) + \varepsilon_i$

where $x_{i,1} \sim N(0,1)$, $x_{i,2}$ is Bernoulli with $p = 0.5$, and $\varepsilon_i \sim N(0,1)$. Note that this is not an easy problem: the function that generates the data is non-monotonic as well as very smooth and nonlinear. The figure below shows how BooST estimates the model and its derivatives with respect to $x_{i,1}$. We simulated 1000 data points and forced an $R^2$ of 0.5 (half of the variation in the data is noise).
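As a hedged sketch of this simulation setup (illustrative NumPy code, not the authors' implementation), the noise can be rescaled so that it accounts for exactly half the variance of $y_i$:

```python
import numpy as np

# Sketch of the data-generating process described above (not the authors' code).
rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(0.0, 1.0, n)        # x_{i,1} ~ N(0, 1)
x2 = rng.binomial(1, 0.5, n)        # x_{i,2} ~ Bernoulli(p = 0.5)
signal = np.cos(np.pi * (x1 + x2))  # the noiseless target

# Force R^2 ~ 0.5: rescale the noise so Var(noise) = Var(signal),
# i.e. half of the variation in y is noise.
noise = rng.normal(0.0, 1.0, n)
noise *= signal.std() / noise.std()
y = signal + noise

r2 = 1.0 - noise.var() / y.var()    # approximately 0.5 by construction
```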

Real data example

To give a more practical example, suppose we want to estimate the effect of price changes on sales without any prior knowledge of the demand function. It is natural to think that this effect will depend on the current price level and possibly on some characteristics of the product. BooST will estimate the partial effects of price on sales conditional on the current (or any) price and any other controls we choose. We will write a post specifically on this example in the future.

Consider yet another example, where we want to estimate how the price of a house changes as we move along latitude or longitude. This is precisely what we did in the figure below, using data on house sales in Melbourne (dataset scraped from the web by Tony Pino). The figure shows that if we start south of the CBD and move north, prices increase a lot. However, if we keep moving north, prices start to decrease as we move away from the CBD.

Other considerations

BooST is also very good for forecasting; we will come back to this topic with examples in future posts. Just keep in mind that if the number of characteristics is too big, it may improve forecasting accuracy, but you will probably lose some interpretability of the partial effects, not because of the model itself but because of some theoretical problems you may run into.

For now we have an implementation of the BooST in R, which is fully documented and straightforward to use. It can be downloaded and installed from:


Note that this implementation is not very fast and is not adequate for very big problems with too many variables. We already have a faster Julia implementation that will be published soon. The even faster C++ version will take some time, but it is also coming. In the next posts we will explore some empirical applications, with code to replicate our examples.


Fonseca, Y.; Medeiros, M.; Vasconcelos, G.; Veiga, A. “BooST: Boosting Smooth Trees for Partial Effect Estimation in Nonlinear Regressions.” arXiv preprint available at (2018).

To leave a comment for the author, please follow the link and comment on their blog: R – insightR.


The Microsoft AI Idea Challenge – Breakthrough Ideas Wanted!

This post is authored by Tara Shankar Jana, Senior Technical Product Marketing Manager at Microsoft.

All of us have creative ideas – ideas that can improve our lives and the lives of thousands, perhaps even millions of others. But how often do we act on turning those ideas into a reality? Most of the time, we do not believe in our ideas strongly enough to pursue them. Other times we feel like we lack a platform to build out our idea or showcase it. Most good ideas don’t go beyond those initial creative thoughts in our head.

If you’re a professional working in the field of artificial intelligence (AI), or an aspiring AI developer or just someone who is passionate about AI and machine learning, Microsoft is excited to offer you an opportunity to transform your most creative ideas into reality. Join the Microsoft AI Idea Challenge Contest today for a chance to win exciting prizes and get your project featured in Microsoft’s AI.lab showcase. Check out the rules, terms and conditions of the contest and then dive right in!

The Challenge

The Microsoft AI Idea Challenge is seeking breakthrough AI solutions from developers, data scientists, professionals and students, preferably developed on the Microsoft AI platform and services. The challenge gives you a platform to freely share AI models and applications, so they are reusable and easily accessible. The ideas you submit are judged on the parameters shown in the figure below: essentially, half the weight is for the originality of your idea, 20% for the feasibility of your solution, and 30% for the complexity (i.e., level of sophistication) of your implementation.

The Microsoft AI Challenge is accepting submissions between now and October 12th, 2018.

To qualify for the competition, individuals or teams are required to submit a working AI model, test dataset, a demo app and a demo video that can be a maximum of three minutes long. We encourage you to register early and upload your projects soon, so that you can begin to plan and build out your solution and turn in the rest of your materials on time. We are looking for solutions across the whole spectrum of use cases – to be inspired, take a look at some of the examples at AI.lab.


The winners of the first three places in the contest will respectively receive a Surface Book 2, a DJI Drone, and an Xbox One X.

We hope that’s motivation to get you started today – good luck!



ebook: Using Deep Learning to Solve Real-World Problems

Read this eBook to learn how deep learning enables image classification, sentiment analysis, and other advanced analysis techniques, and to get a starter workflow for building and training deep learning models.


A transforming river seen from above

The Padma River in Bangladesh is constantly shifting its 75-mile path. Joshua Stevens for the NASA Earth Observatory shows what the shifting looked like through satellite imagery, over a 30-year span.

Kasha Patel:

The upper section of the Padma—the Harirampur region—has experienced the most erosion and shows the most notable changes. The river has become wider at this section by eroding along both banks, although most activity occurred on the left bank. Using topographic, aerial, and satellite imagery, scientists found that the left bank shifted 12 kilometers towards the north from 1860 to 2009 and developed a meandering bend. The river left a scar where the water once flowed, as you can see in the 2018 image.

See also the dramatic shifts of the Ucayali River in Peru.



Distilled News

Neural Network Exchange Format (NNEF)

NNEF reduces machine learning deployment fragmentation by enabling a rich mix of neural network training tools and inference engines to be used by applications across a diverse range of devices and platforms. The goal of NNEF is to enable data scientists and engineers to easily transfer trained networks from their chosen training framework into a wide variety of inference engines. A stable, flexible and extensible standard that equipment manufacturers can rely on is critical for the widespread deployment of neural networks onto edge devices, and so NNEF encapsulates a complete description of the structure, operations and parameters of a trained neural network, independent of the training tools used to produce it and the inference engine used to execute it.

Pandas Tutorial 3: Important Data Formatting Methods (merge, sort, reset_index, fillna)

This is the third episode of my pandas tutorial series. In this one I'll show you four data formatting methods that you might use a lot in data science projects. These are: merge, sort, reset_index and fillna! Of course, there are many others, and at the end of the article, I'll link to a pandas cheat sheet where you can find every function and method you could ever need. Okay! Let's get started!
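To illustrate the four methods named above, here is a minimal pandas sketch; the frames and column names below are made up for this digest, not taken from the tutorial:

```python
import numpy as np
import pandas as pd

# Tiny made-up frames; column names are for illustration only.
zoo = pd.DataFrame({"animal": ["lion", "zebra", "lion"],
                    "water_need": [500.0, np.nan, 450.0]})
info = pd.DataFrame({"animal": ["lion", "zebra"],
                     "family": ["Felidae", "Equidae"]})

merged = zoo.merge(info, on="animal", how="left")            # merge: join on a key column
filled = merged.fillna({"water_need": 0.0})                  # fillna: replace missing values
ordered = filled.sort_values("water_need", ascending=False)  # sort by a column
tidy = ordered.reset_index(drop=True)                        # reset_index: renumber rows 0..n-1
```

The same pipeline can also be written as a single method chain, which is common pandas style.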

Setting up your AI Dev Environment in 5 Minutes

Whether you’re a novice data science enthusiast setting up TensorFlow for the first time, or a seasoned AI engineer working with terabytes of data, getting your libraries, packages, and frameworks installed is always a struggle. Learn how datmo, an open source python package, helps you get started in minutes.
0. Prerequisites
1. Install datmo
2. Initialize a datmo project
3. Start environment setup
4. Select System Drivers (CPU or GPU)
5. Select an environment
6. Select a language version (if applicable)
7. Launch your workspace

Datmo – An open source model tracking and reproducibility tool for developers.

Workflow tools to help you experiment, deploy, and scale. By data scientists, for data scientists. Datmo is an open source model tracking and reproducibility tool for developers. Features:
• One command environment setup (languages, frameworks, packages, etc)
• Tracking and logging for model config and results
• Project versioning (model state tracking)
• Experiment reproducibility (re-run tasks)
• Visualize + export experiment history

data.table is Really Good at Sorting

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.

In-brief: splashr update + High Performance Scraping with splashr, furrr & TeamHG-Memex’s Aquarium

The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium, and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running, and it's best to run it in a Docker container.

IML and H2O: Machine Learning Model Interpretability And Feature Explanation

Model interpretability is critical to businesses. If you want to use high performance models (GLM, RF, GBM, Deep Learning, H2O, Keras, xgboost, etc), you need to learn how to explain them. With machine learning interpretability growing in importance, several R packages designed to provide this capability are gaining in popularity. We analyze the IML package in this article. In recent blog posts we assessed LIME for model agnostic local interpretability functionality and DALEX for both local and global machine learning explanation plots. This post examines the iml package (short for Interpretable Machine Learning) to assess its functionality in providing machine learning interpretability to help you determine if it should become part of your preferred machine learning toolbox. We again utilize the high performance machine learning library, h2o, implementing three popular black-box modeling algorithms: GLM (generalized linear models), RF (random forest), and GBM (gradient boosted machines). For those that want a deep dive into model interpretability, the creator of the iml package, Christoph Molnar, has put together a free book: Interpretable Machine Learning. Check it out.

PCA revisited

Principal component analysis (PCA) is a dimensionality reduction technique that might come in handy when building a predictive model or in the exploratory phase of your data analysis. It is often the case that, when it is most handy, you might have forgotten it exists, but let's neglect this aspect for now 😉

Delayed Impact of Fair Machine Learning

‘Delayed impact of fair machine learning’ (https://…/1803.04383) won a best paper award at ICML this year. It's not an easy read (at least it wasn't for me), but fortunately it's possible to appreciate the main results without following all of the proof details. The central question is how to ensure fair treatment across demographic groups in a population when using a score-based machine learning model to decide who gets an opportunity (e.g. is offered a loan) and who doesn't. Most recently we looked at the equal opportunity and equalized odds models. The underlying assumption for the fairness models studied is, of course, that the fairness criteria promote the long-term well-being of the groups they aim to protect. The big result in this paper is that you can easily end up ‘killing them with kindness’ instead. The potential for this to happen exists when there is a feedback loop in place in the overall system. By overall system here, I mean the human system of which the machine learning model is just a small part. Using the loan/no-loan decision that is a popular study vehicle in fairness papers, we need to consider not just (for example) the opportunity that someone in a disadvantaged group has to qualify for a loan, but also what happens in the future as a result of that loan being made. If the borrower eventually defaults, then they will also see a decline in their credit score, which will make it harder for them to obtain additional loans in the future. A successful lending event, on the other hand, may increase the borrower's credit score.

What is a CapsNet or Capsule Network?

What is a Capsule Network? What is a Capsule? Is CapsNet better than a Convolutional Neural Network (CNN)? In this article I will talk about all the above questions about CapsNet or Capsule Network released by Hinton.

Predicting Employee Churn in Python

Analyze employee churn: find out why employees are leaving the company, and learn to predict who will leave.
In the past, most of the focus was on ‘rates’ such as the attrition rate and the retention rate. HR managers compute these rates and try to predict future rates using data warehousing tools. These rates show the aggregate impact of churn, but that is only half the picture. Another approach is to focus on individual records in addition to aggregates.
Lots of case studies on customer churn are available. In customer churn, you predict who will stop buying and when. Employee churn is similar, but it focuses on the employee rather than the customer: here, you predict who will leave the company and when. Employee churn is expensive, and incremental improvements yield significant results, helping us design better retention plans and improve employee satisfaction. In this tutorial, you are going to cover the following topics:
• Employee Churn Analysis
• Data loading and understanding feature
• Exploratory data analysis and Data visualization
• Cluster analysis
• Building prediction model using Gradient Boosting Tree.
• Evaluating model performance
• Conclusion
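A minimal sketch of the modeling step in the list above (synthetic data with invented feature names; not the tutorial's dataset or code), fitting a gradient boosting classifier to predict who leaves:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an HR dataset; feature names are invented.
rng = np.random.default_rng(0)
n = 2000
satisfaction = rng.uniform(0.0, 1.0, n)     # 0 = unhappy, 1 = happy
monthly_hours = rng.normal(200.0, 40.0, n)
tenure_years = rng.integers(1, 11, n)

# Assumption baked into the toy data: low satisfaction and long hours drive churn.
logit = 2.5 - 5.0 * satisfaction + 0.01 * (monthly_hours - 200.0)
left = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X = np.column_stack([satisfaction, monthly_hours, tenure_years])
X_train, X_test, y_train, y_test = train_test_split(
    X, left, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

On this toy data, model.feature_importances_ should point at satisfaction as the dominant driver, mirroring the kind of conclusion the tutorial draws.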

Unveiling Mathematics behind XGBoost

This article is targeted at people who found it difficult to grasp the original paper. Follow me till the end, and I assure you that you will at least get a sense of what is happening underneath this revolutionary machine learning model. Let's get started.

Time series intervention analysis with fuel prices

Look into whether the regional fuel tax in Auckland has led to changes in fuel prices in other regions of New Zealand.

OBDA – Ontology-Based Data Access – A Data Management Paradigm

Ontology-based Data Access (OBDA) is a new paradigm, based on the use of knowledge representation and reasoning techniques, for governing the resources (data, meta-data, services, processes, etc.) of modern information systems. The challenges of data governance:
• Accessing Data
In large organizations data sources are commonly re-shaped by corrective maintenance and adapted to application requirements, and applications are changed to meet new requirements. The result is that the data stored in different sources, and the processes operating over them, tend to be redundant, mutually inconsistent, and obscure for large classes of users. So accessing data means interacting with IT experts who know where the data are and what they mean in the various contexts, and who can therefore translate the information need expressed by the user into appropriate queries. This process can be both expensive and time-consuming.
• Data Quality
Data quality is often cited as a critical factor in delivering high-value information services. But how can we check data quality, and how can we decide whether it is good, if we do not have a clear understanding of the semantics the data should carry? Moreover, how can we judge the quality of external data from business partners, clients, or even public sources that we connect to? Data quality is also crucial for opening data to external organisations, to favor new business opportunities, or even to the public, as we see more and more in the age of Open Data.
• Process Specification
Information systems are key assets for business organisations, which rely not only on data but also, for instance, on processes and services. Designing and managing processes is an important aspect of information systems, but deciding what a process should do is hard to get right without a clear idea of which data the process will access and how it may change them. The difficulties come from various factors, including the lack of modelling languages and tools for describing processes and data holistically; the problems related to the semantics of data make this task even harder.
• Three-Level Architecture
The key idea of OBDA is to provide users with access to the information in their data sources through a three-level architecture, constituted by the ontology, the sources, and the mapping between the two, where the ontology is a formal description of the domain of interest and is the heart of the system. Through this architecture, OBDA provides a semantic end-to-end connection between users and data sources, allowing users to directly query data spread across multiple distributed sources through the familiar vocabulary of the ontology: the user formulates SPARQL queries over the ontology, which are transformed, through the mapping layer, into SQL queries over the underlying relational databases.
• The Ontology layer in the architecture is the means for pursuing a declarative approach to information integration and, more generally, to data governance. The domain knowledge base of the organization is specified through a formal and high-level description of both its static and dynamic aspects, represented by the ontology. By making the representation of the domain explicit, we gain re-usability of the acquired knowledge, which is not achieved when the global schema is simply a unified description of the underlying data sources.
• The Mapping layer connects the Ontology layer with the Data Source layer by defining the relationships between the domain concepts on the one hand and the data sources on the other. These mappings are not only used for the operation of the information system, but can also be a significant asset for documentation purposes in cases where the information about data is spread across separate pieces of documentation that are often difficult to access and rarely conform to common standards.
• The Data Source layer is constituted by the existing data sources of the organization.
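To make the mapping layer concrete, here is a toy sketch in Python: a hypothetical dictionary plays the role of the mappings, translating an ontology-level concept into a SQL query over the sources. The concept names, table names, and queries are all invented for illustration; real OBDA systems such as Ontop use R2RML mappings and full SPARQL-to-SQL query rewriting.

```python
# Hypothetical mapping layer: each ontology concept is associated with a SQL
# query over the underlying data sources.
mappings = {
    ":Employee": "SELECT id, name FROM hr.staff WHERE active = 1",
    ":Project":  "SELECT pid, title FROM pm.projects",
}

def rewrite(concept: str) -> str:
    """Translate an ontology-level concept into a source-level SQL query."""
    if concept not in mappings:
        raise KeyError(f"no mapping defined for {concept}")
    return mappings[concept]

# A user query phrased in the ontology vocabulary becomes a SQL query.
print(rewrite(":Employee"))
```

The point of the sketch is the indirection: users never see `hr.staff` or `pm.projects`; they query `:Employee` and `:Project`, and the mapping layer does the translation.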

Data Scientist guide for getting started with Docker

Docker is an increasingly popular way to create and deploy applications through virtualization, but can it be useful for data scientists? This guide should help you quickly get started.


Blockchain Technology Will Revolutionize These Five Industries

Analysts forecast that bitcoin will achieve a total value of $1.2 trillion towards the end of 2018. However, forward-thinking individuals see the value of blockchain beyond its original application to digital currency transactions. You've probably heard about blockchain technology through cryptocurrency, but did you know that blockchain technology is expected…

The post Blockchain Technology Will Revolutionize These Five Industries appeared first on Dataconomy.


Microsoft R Open 3.5.1 now available

Microsoft R Open 3.5.1 has been released, combining the latest R language engine with multi-processor performance and tools for managing R packages reproducibly. You can download Microsoft R Open 3.5.1 for Windows, Mac and Linux from MRAN now. Microsoft R Open is 100% compatible with all R scripts and packages, and works with all your favorite R interfaces and development environments.

This update brings a number of minor fixes to the R language engine from the R core team. It also makes available a host of new R packages contributed by the community, including packages for downloading financial data, connecting with analytics systems, applying machine learning algorithms and statistical models, and many more. New R packages are released every day, and you can access packages released after the 1 August 2018 CRAN snapshot used by MRO 3.5.1 using the checkpoint package.

We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

MRAN: Download Microsoft R Open



What's new on arXiv

GARCH(1,1) model of the financial market with the Minkowski metric

We solve a stylized fact about the long-memory process of volatility clustering by using the Minkowski metric for GARCH(1,1), under the assumption that price and time cannot be separated. We provide a Yang-Mills equation in the financial market, and an anomaly on the superspace of time series data as a consequence of the proof from general relativity theory. We use an original idea of Minkowski spacetime embedded in a Kolmogorov space in time series data with the behavior of traders. The result of this work is equivalent to the dark volatility, or hidden risk fear field, induced by the interaction of the behavior of traders in the financial market panic when the market crashes.
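For reference, the standard GARCH(1,1) volatility recursion that the paper builds on is sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}. A minimal simulation of plain GARCH(1,1) (without the paper's Minkowski-metric extension; all parameter values below are illustrative) shows the volatility clustering it is designed to capture:

```python
import numpy as np

# Simulate a plain GARCH(1,1) process.
rng = np.random.RandomState(42)
omega, alpha, beta = 0.05, 0.1, 0.85   # alpha + beta < 1 => stationary
n = 5000
eps = np.empty(n)
sigma2 = np.empty(n)
sigma2[0] = omega / (1 - alpha - beta)  # start at the unconditional variance
eps[0] = np.sqrt(sigma2[0]) * rng.normal()
for t in range(1, n):
    sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    eps[t] = np.sqrt(sigma2[t]) * rng.normal()

# Volatility clustering: squared returns are positively autocorrelated.
r2 = eps ** 2
autocorr = np.corrcoef(r2[:-1], r2[1:])[0, 1]
print("lag-1 autocorrelation of squared returns:", autocorr)
```

The positive lag-1 autocorrelation of the squared returns is the "volatility cluster" stylized fact: large moves tend to be followed by large moves.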

Partial Adversarial Domain Adaptation

Domain adversarial learning aligns the feature distributions across the source and target domains in a two-player minimax game. Existing domain adversarial networks generally assume identical label space across different domains. In the presence of big data, there is strong motivation of transferring deep models from existing big domains to unknown small domains. This paper introduces partial domain adaptation as a new domain adaptation scenario, which relaxes the fully shared label space assumption to that the source label space subsumes the target label space. Previous methods typically match the whole source domain to the target domain, which are vulnerable to negative transfer for the partial domain adaptation problem due to the large mismatch between label spaces. We present Partial Adversarial Domain Adaptation (PADA), which simultaneously alleviates negative transfer by down-weighing the data of outlier source classes for training both source classifier and domain adversary, and promotes positive transfer by matching the feature distributions in the shared label space. Experiments show that PADA exceeds state-of-the-art results for partial domain adaptation tasks on several datasets.

BooST: Boosting Smooth Trees for Partial Effect Estimation in Nonlinear Regressions

In this paper we introduce a new machine learning (ML) model for nonlinear regression called Boosting Smooth Transition Regression Tree (BooST). The main advantage of the BooST is that it estimates the derivatives (partial effects) of very general nonlinear models, providing more interpretation than other tree based models concerning the mapping between the covariates and the dependent variable. We provide some asymptotic theory that shows consistency of the partial derivatives and we present some examples on simulated and empirical data.

LemmaTag: Jointly Tagging and Lemmatizing for Morphologically-Rich Languages with BRNNs

We present LemmaTag, a featureless recurrent neural network architecture that jointly generates part-of-speech tags and lemmatizes sentences of languages with complex morphology, using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network and from using the tagger output as an input to the lemmatizer. We evaluate our model across several morphologically-rich languages, surpassing state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.

Grey-box Process Control Mining for Anomaly Monitoring and Deconstruction

We present a new ‘grey-box’ approach to anomaly detection in smart manufacturing. The approach is designed for tools run by control systems which execute recipe steps to produce semiconductor wafers. Multiple streaming sensors capture trace data to guide the control systems and for quality control. These control systems are typically PI controllers which can be modeled as an ordinary differential equation (ODE) coupled with a control equation, capturing the physics of the process. The ODE ‘white-box’ models capture physical causal relationships that can be used in simulations to determine how the process will react to changes in control parameters, but they have limited utility for anomaly detection. Many ‘black-box’ approaches exist for anomaly detection in manufacturing, but they typically do not exploit the underlying process control. The proposed ‘grey-box’ approach uses the process-control ODE model to derive a parametric function of sensor data. Bayesian regression is used to fit the parameters of these functions to form characteristic ‘shape signatures’. The probabilistic model provides a natural anomaly score for each wafer, which captures poor control and strange shape signatures. The anomaly score can be deconstructed into its constituent parts in order to identify which parameters are contributing to anomalies. We demonstrate how the anomaly score can be used to monitor complex multi-step manufacturing processes to detect anomalies and changes and show how the shape signatures can provide insight into the underlying sources of process variation that are not readily apparent in the sensor data.

Unsupervised Keyphrase Extraction Based on Outlier Detection

We propose a novel unsupervised keyphrase extraction approach based on outlier detection. Our approach starts by training word embeddings on the target document to capture semantic regularities among the words. It then uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the assumption that these vectors come from the same distribution, indicative of their irrelevance to the semantics expressed by the dimensions of the learned vector representation. Candidate keyphrases are based on words that are outliers of this dominant distribution. Empirical results show that our approach outperforms state-of-the-art unsupervised keyphrase extraction methods.
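A toy sketch of the outlier-detection step: scikit-learn's `MinCovDet` implements the minimum covariance determinant estimator, and the candidate keyphrases are the vectors with the largest robust Mahalanobis distances. The synthetic vectors below stand in for the document-trained embeddings.

```python
import numpy as np
from sklearn.covariance import MinCovDet

# Hypothetical word vectors: most "non-keyphrase" words cluster together,
# while a few candidates lie far from that dominant distribution.
rng = np.random.RandomState(0)
background = rng.normal(0, 1, size=(200, 5))   # non-keyphrase word vectors
keywords = rng.normal(6, 1, size=(5, 5))       # semantic outliers
vectors = np.vstack([background, keywords])

# Fit the minimum covariance determinant estimator on the full set;
# it is robust to the small fraction of outliers.
mcd = MinCovDet(random_state=0).fit(vectors)
dist = mcd.mahalanobis(vectors)                # squared robust distances

# Candidate keyphrases: the points farthest from the dominant distribution.
top5 = np.argsort(dist)[-5:]
print(sorted(int(i) for i in top5))
```

Here the five planted outliers (indices 200-204) receive the largest distances, mirroring how outlier words become keyphrase candidates.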

Hierarchical Attention: What Really Counts in Various NLP Tasks

Attention mechanisms in sequence to sequence models have shown great ability and wonderful performance in various natural language processing (NLP) tasks, such as sentence embedding, text generation, machine translation, machine reading comprehension, etc. Unfortunately, existing attention mechanisms only learn either high-level or low-level features. In this paper, we think that the lack of hierarchical mechanisms is a bottleneck in improving the performance of the attention mechanisms, and propose a novel Hierarchical Attention Mechanism (Ham) based on the weighted sum of different layers of a multi-level attention. Ham achieves a state-of-the-art BLEU score of 0.26 on Chinese poem generation task and a nearly 6.5% averaged improvement compared with the existing machine reading comprehension models such as BIDAF and Match-LSTM. Furthermore, our experiments and theorems reveal that Ham has greater generalization and representation ability than existing attention mechanisms.

Building Safer AGI by introducing Artificial Stupidity

Artificial Intelligence (AI) achieved super-human performance in a broad variety of domains. We say that an AI is made Artificially Stupid on a task when some limitations are deliberately introduced to match a human’s ability to do the task. An Artificial General Intelligence (AGI) can be made safer by limiting its computing power and memory, or by introducing Artificial Stupidity on certain tasks. We survey human intellectual limits and give recommendations for which limits to implement in order to build a safe AGI.

Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering

In the last decade, a variety of topic models have been proposed for text engineering. However, apart from Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), most existing topic models are seldom applied or considered in industrial scenarios. This phenomenon is caused by the fact that there are very few convenient tools to support these topic models so far. Intimidated by the demanding expertise and labor of designing and implementing parameter inference algorithms, software engineers are prone to simply resort to PLSA/LDA, without considering whether it is appropriate for their problem at hand or not. In this paper, we propose a configurable topic modeling framework named Familia, in order to bridge the huge gap between academic research fruits and current industrial practice. Familia supports an important line of topic models that are widely applicable in text engineering scenarios. In order to relieve the burden of software engineers without knowledge of Bayesian networks, Familia is able to conduct automatic parameter inference for a variety of topic models. Simply by changing the data organization of Familia, software engineers are able to easily explore a broad spectrum of existing topic models, or even design their own, and find the one that best suits the problem at hand. With its superior extendability, Familia has a novel sampling mechanism that strikes a balance between effectiveness and efficiency of parameter inference. Furthermore, Familia is essentially a big topic modeling framework that supports parallel parameter inference and distributed parameter storage. The utility and necessity of Familia are demonstrated in real-life industrial applications. Familia would significantly enlarge software engineers' arsenal of topic models and pave the way for utilizing highly customized topic models in real-life problems.

Neural Network Encapsulation

A capsule is a collection of neurons which represents different variants of a pattern in the network. The routing scheme ensures only certain capsules which resemble lower counterparts in the higher layer should be activated. However, the computational complexity becomes a bottleneck for scaling up to larger networks, as lower capsules need to correspond to each and every higher capsule. To resolve this limitation, we approximate the routing process with two branches: a master branch which collects primary information from its direct contact in the lower layer and an aide branch that replenishes master based on pattern variants encoded in other lower capsules. Compared with previous iterative and unsupervised routing scheme, these two branches are communicated in a fast, supervised and one-time pass fashion. The complexity and runtime of the model are therefore decreased by a large margin. Motivated by the routing to make higher capsule have agreement with lower capsule, we extend the mechanism as a compensation for the rapid loss of information in nearby layers. We devise a feedback agreement unit to send back higher capsules as feedback. It could be regarded as an additional regularization to the network. The feedback agreement is achieved by comparing the optimal transport divergence between two distributions (lower and higher capsules). Such an add-on witnesses a unanimous gain in both capsule and vanilla networks. Our proposed EncapNet performs favorably better against previous state-of-the-arts on CIFAR10/100, SVHN and a subset of ImageNet.

Knowledge Graph Embedding with Entity Neighbors and Deep Memory Network

Knowledge Graph Embedding (KGE) aims to represent entities and relations of knowledge graph in a low-dimensional continuous vector space. Recent works focus on incorporating structural knowledge with additional information, such as entity descriptions, relation paths and so on. However, common used additional information usually contains plenty of noise, which makes it hard to learn valuable representation. In this paper, we propose a new kind of additional information, called entity neighbors, which contain both semantic and topological features about given entity. We then develop a deep memory network model to encode information from neighbors. Employing a gating mechanism, representations of structure and neighbors are integrated into a joint representation. The experimental results show that our model outperforms existing KGE methods utilizing entity descriptions and achieves state-of-the-art metrics on 4 datasets.

MARVIN: An Open Machine Learning Corpus and Environment for Automated Machine Learning Primitive Annotation and Execution

In this demo paper, we introduce the DARPA D3M program for automatic machine learning (ML) and JPL’s MARVIN tool that provides an environment to locate, annotate, and execute machine learning primitives for use in ML pipelines. MARVIN is a web-based application and associated back-end interface written in Python that enables composition of ML pipelines from hundreds of primitives from the world of Scikit-Learn, Keras, DL4J and other widely used libraries. MARVIN allows for the creation of Docker containers that run on Kubernetes clusters within DARPA to provide an execution environment for automated machine learning. MARVIN currently contains over 400 datasets and challenge problems from a wide array of ML domains including routine classification and regression to advanced video/image classification and remote sensing.

Document Informed Neural Autoregressive Topic Models

Context information around words helps in determining their actual meaning, for example 'networks' used in the contexts of artificial neural networks or biological neuron networks. Generative topic models infer topic-word distributions, taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. This results in improved performance in terms of generalization, interpretability and applicability. We apply our modeling approach to seven data sets from various domains and demonstrate that our approach consistently outperforms state-of-the-art generative topic models. With the learned representations, we show on average a gain of 9.6% (0.57 vs 0.52) in precision at retrieval fraction 0.02 and 7.2% (0.582 vs 0.543) in F1 for text categorization.

jLDADMM: A Java package for the LDA and DMM topic models

In this technical report, we present jLDADMM—an easy-to-use Java toolkit for conventional topic models. jLDADMM is released to provide alternatives for topic modeling on normal or short texts. It provides implementations of the Latent Dirichlet Allocation topic model and the one-topic-per-document Dirichlet Multinomial Mixture model (i.e. mixture of unigrams), using collapsed Gibbs sampling. In addition, jLDADMM supplies a document clustering evaluation to compare topic models. jLDADMM is open-source and available to download at: https://…/jLDADMM

Matrix Factorization on GPUs with Memory Optimization and Approximate Computing

Matrix factorization (MF) discovers latent features from observations, and has shown great promise in the fields of collaborative filtering, data compression, feature extraction, word embedding, etc. While many problem-specific optimization techniques have been proposed, alternating least squares (ALS) remains popular due to its general applicability, e.g. its ability to handle positive-unlabeled inputs, fast convergence, and parallelization capability. Current MF implementations are either optimized for a single machine or require a large computer cluster, and both remain insufficient: a single machine provides limited compute power for large-scale data, while multiple machines suffer from the network communication bottleneck. To address this challenge, accelerating ALS on graphics processing units (GPUs) is a promising direction. We propose a novel approach to enhancing MF efficiency via both memory optimization and approximate computing. The former exploits the GPU memory hierarchy to increase data reuse, while the latter reduces unnecessary computing without hurting the convergence of the learning algorithms. Extensive experiments on large-scale datasets show that our solution not only outperforms competing CPU solutions by a large margin but also has a 2x-4x performance gain compared to state-of-the-art GPU solutions. Our implementations are open-sourced and publicly available.
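For context, here is a minimal CPU sketch of the ALS iteration that the paper accelerates, on fully observed toy ratings (real implementations handle missing entries and weight the regularization per row; sizes and the ridge term below are illustrative):

```python
import numpy as np

# Alternating least squares for matrix factorization: R ~ U @ V.T.
rng = np.random.RandomState(0)
m, n, k = 30, 20, 4
U_true = rng.normal(size=(m, k))
V_true = rng.normal(size=(n, k))
R = U_true @ V_true.T                  # fully observed toy rating matrix

U = rng.normal(size=(m, k))
V = rng.normal(size=(n, k))
lam = 0.01                             # small ridge term keeps solves stable
for _ in range(40):
    # Fix V and solve the least-squares problem for U, then the reverse.
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R).T

err = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
print("relative reconstruction error:", err)
```

Each half-step is an independent least-squares solve per row, which is exactly the structure that makes ALS easy to parallelize across cores or GPU threads.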

Ranking with Features: Algorithm and A Graph Theoretic Analysis

We consider the problem of ranking a set of items from pairwise comparisons in the presence of features associated with the items. Recent works have established that O(n\log(n)) samples are needed to rank well when there is no feature information present. However, this might be sub-optimal in the presence of associated features. We introduce a new probabilistic preference model called the feature-Bradley-Terry-Luce (f-BTL) model, which generalizes the standard BTL model to incorporate feature information. We present a new least-squares based algorithm called fBTL-LS which we show requires far fewer than O(n\log(n)) pairs to obtain a good ranking — precisely, our new sample complexity bound is O(\alpha\log \alpha), where \alpha denotes the number of `independent items' of the set, and in general \alpha << n. Our analysis is novel and makes use of tools from classical graph matching theory to provide tighter bounds that shed light on the true complexity of the ranking problem, capturing the item dependencies in terms of their feature representations. This was not possible with the earlier matrix completion based tools used for this problem. We also prove an information-theoretic lower bound on the sample complexity required for recovering the underlying ranking, which essentially shows the tightness of our proposed algorithms. The efficacy of our proposed algorithms is validated through extensive experimental evaluations on a variety of synthetic and real-world datasets.

Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction

Current state-of-the-art machine translation systems are based on encoder-decoder architectures that first encode the input sequence and then generate an output sequence based on the input encoding. Both are interfaced with an attention mechanism that recombines a fixed encoding of the source tokens based on the decoder state. We propose an alternative approach which instead relies on a single 2D convolutional neural network across both sequences. Each layer of our network re-codes source tokens on the basis of the output sequence produced so far. Attention-like properties are therefore pervasive throughout the network. Our model yields excellent results, outperforming state-of-the-art encoder-decoder systems, while being conceptually simpler and having fewer parameters.

A Consistent Method for Learning OOMs from Asymptotically Stationary Time Series Data Containing Missing Values

In the traditional framework of spectral learning of stochastic time series models, model parameters are estimated based on trajectories of fully recorded observations. However, real-world time series data often contain missing values, and worse, the distributions of missingness events over time are often not independent of the visible process. Recently, a spectral OOM learning algorithm for time series with missing data was introduced and proved to be consistent, albeit under quite strong conditions. Here we refine the algorithm and prove that the original strong conditions can be very much relaxed. We validate our theoretical findings by numerical experiments, showing that the algorithm can consistently handle missingness patterns whose dynamic interacts with the visible process.

A Basic Compositional Model for Spiking Neural Networks

This paper is part of a project on developing an algorithmic theory of brain networks, based on stochastic Spiking Neural Network (SNN) models. Inspired by tasks that seem to be solved in actual brains, we are defining abstract problems to be solved by these networks. In our work so far, we have developed models and algorithms for the Winner-Take-All problem from computational neuroscience [LMP17a,Mus18], and problems of similarity detection and neural coding [LMP17b]. We plan to consider many other problems and networks, including both static networks and networks that learn. This paper is about basic theory for the stochastic SNN model. In particular, we define a simple version of the model. This version assumes that the neurons’ only state is a Boolean, indicating whether the neuron is firing or not. In later work, we plan to develop variants of the model with more elaborate state. We also define an external behavior notion for SNNs, which can be used for stating requirements to be satisfied by the networks. We then define a composition operator for SNNs. We prove that our external behavior notion is ‘compositional’, in the sense that the external behavior of a composed network depends only on the external behaviors of the component networks. We also define a hiding operator that reclassifies some output behavior of an SNN as internal. We give basic results for hiding. Finally, we give a formal definition of a problem to be solved by an SNN, and give basic results showing how composition and hiding of networks affect the problems that they solve. We illustrate our definitions with three examples: building a circuit out of gates, building an ‘Attention’ network out of a ‘Winner-Take-All’ network and a ‘Filter’ network, and a toy example involving combining two networks in a cyclic fashion.

Interpreting Recurrent and Attention-Based Neural Models: a Case Study on Natural Language Inference

Deep learning models have achieved remarkable success in natural language inference (NLI) tasks. While these models are widely explored, they are hard to interpret and it is often unclear how and why they actually work. In this paper, we take a step toward explaining such deep learning based models through a case study on a popular neural model for NLI. In particular, we propose to interpret the intermediate layers of NLI models by visualizing the saliency of attention and LSTM gating signals. We present several examples for which our methods are able to reveal interesting insights and identify the critical information contributing to the model decisions.

Adversarial Personalized Ranking for Recommendation

Item recommendation is a personalized ranking task. To this end, many recommender systems optimize models with pairwise ranking objectives, such as Bayesian Personalized Ranking (BPR). Using Matrix Factorization (MF) — the most widely used model in recommendation — as a demonstration, we show that optimizing it with BPR leads to a recommender model that is not robust. In particular, we find that the resulting model is highly vulnerable to adversarial perturbations of its model parameters, which implies a possibly large generalization error. To enhance the robustness of a recommender model and thus improve its generalization performance, we propose a new optimization framework, namely Adversarial Personalized Ranking (APR). In short, APR enhances the pairwise ranking method BPR by performing adversarial training. It can be interpreted as playing a minimax game, where the minimization of the BPR objective function meanwhile defends against an adversary, which adds adversarial perturbations on model parameters to maximize the BPR objective function. To illustrate how it works, we implement APR on MF by adding adversarial perturbations on the embedding vectors of users and items. Extensive experiments on three public real-world datasets demonstrate the effectiveness of APR — by optimizing MF with APR, it outperforms BPR with a relative improvement of 11.2% on average and achieves state-of-the-art performance for item recommendation. Our implementation is available at: https://…/adversarial_personalized_ranking.

Outer Product-based Neural Collaborative Filtering

In this work, we contribute a new multi-layer neural network architecture named ONCF to perform collaborative filtering. The idea is to use an outer product to explicitly model the pairwise correlations between the dimensions of the embedding space. In contrast to existing neural recommender models that combine user embedding and item embedding via a simple concatenation or element-wise product, our proposal of using the outer product above the embedding layer results in a two-dimensional interaction map that is more expressive and semantically plausible. Above the interaction map obtained by the outer product, we propose to employ a convolutional neural network to learn high-order correlations among embedding dimensions. Extensive experiments on two public implicit feedback datasets demonstrate the effectiveness of our proposed ONCF framework, in particular the positive effect of using the outer product to model the correlations between embedding dimensions in the low levels of a multi-layer neural recommender model. The experiment codes are available at: https://…/ConvNCF
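The core construction is easy to sketch: the outer product of a user embedding and an item embedding yields a 2-D interaction map whose diagonal recovers the element-wise product used by simpler models (the embedding size below is illustrative):

```python
import numpy as np

# Outer-product interaction map, as in ONCF: entry (i, j) is the pairwise
# interaction between user dimension i and item dimension j.
rng = np.random.RandomState(0)
k = 8                                   # embedding size (illustrative)
user_emb = rng.normal(size=k)
item_emb = rng.normal(size=k)

interaction_map = np.outer(user_emb, item_emb)   # shape (k, k)
print("interaction map shape:", interaction_map.shape)

# The diagonal is exactly the element-wise product; the off-diagonal entries
# are the extra pairwise signals a CNN stacked on top can learn from.
assert np.allclose(np.diag(interaction_map), user_emb * item_emb)
```

In the full model this (k, k) map is treated like a single-channel image and fed to a convolutional network, rather than being flattened into a vector.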

Characterizing Neuronal Circuits with Spike-triggered Non-negative Matrix Factorization

Neuronal circuits formed in the brain are complex with intricate connection patterns. Such a complexity is also observed in the retina as a relatively simple neuronal circuit. A retinal ganglion cell receives excitatory inputs from neurons in previous layers as driving forces to fire spikes. Analytical methods are required that can decipher these components in a systematic manner. Recently a method termed spike-triggered non-negative matrix factorization (STNMF) has been proposed for this purpose. In this study, we extend the scope of the STNMF method. By using the retinal ganglion cell as a model system, we show that STNMF can detect various biophysical properties of upstream bipolar cells, including spatial receptive fields, temporal filters, and transfer nonlinearity. In addition, we recover synaptic connection strengths from the weight matrix of STNMF. Furthermore, we show that STNMF can separate spikes of a ganglion cell into a few subsets of spikes where each subset is contributed by one presynaptic bipolar cell. Taken together, these results corroborate that STNMF is a useful method for deciphering the structure of neuronal circuits.

Large-Scale Learnable Graph Convolutional Networks

Convolutional neural networks (CNNs) have achieved great success on grid-like data such as images, but face tremendous challenges in learning from more generic data such as graphs. In CNNs, the trainable local filters enable the automatic extraction of high-level features. The computation with filters requires a fixed number of ordered units in the receptive fields. However, the number of neighboring units is neither fixed nor are they ordered in generic graphs, thereby hindering the applications of convolutional operations. Here, we address these challenges by proposing the learnable graph convolutional layer (LGCL). LGCL automatically selects a fixed number of neighboring nodes for each feature based on value ranking in order to transform graph data into grid-like structures in 1-D format, thereby enabling the use of regular convolutional operations on generic graphs. To enable model training on large-scale graphs, we propose a sub-graph training method to reduce the excessive memory and computational resource requirements suffered by prior methods on graph convolutions. Our experimental results on node classification tasks in both transductive and inductive learning settings demonstrate that our methods can achieve consistently better performance on the Cora, Citeseer, Pubmed citation network, and protein-protein interaction network datasets. Our results also indicate that the proposed methods using sub-graph training strategy are more efficient as compared to prior approaches.

Connecting Sharpe ratio and Student t-statistic, and beyond
Modeling Meaning Associated with Documental Entities: Introducing the Brussels Quantum Approach
Machine Learning Promoting Extreme Simplification of Spectroscopy Equipment
Eikos: a Bayesian unfolding method for differential cross-section measurements
On the Feasibility of FPGA Acceleration of Molecular Dynamics Simulations
Active Learning for Regression Using Greedy Sampling
Affect Estimation in 3D Space Using Multi-Task Active Learning for Regression
Algorithmic No-Cloning Theorem
Relational dynamic memory networks
CT Super-resolution GAN Constrained by the Identical, Residual, and Cycle Learning Ensemble (GAN-CIRCLE)
Self-Organization Scheme for Balanced Routing in Large-Scale Multi-Hop Networks
Model Reduction with Memory and the Machine Learning of Dynamical Systems
Unsupervised Learning of Sentence Representations Using Sequence Consistency
Connectivity-Driven Brain Parcellation via Consensus Clustering
Effective Unsupervised Author Disambiguation with Relative Frequencies
Homophonic Quotients of Linguistic Free Groups: German, Korean, and Turkish
The effective entropy of next/previous larger/smaller value queries
Snapshot compressed sensing: performance bounds and algorithms
Multi-Channel Stochastic Variational Inference for the Joint Analysis of Heterogeneous Biomedical Data in Alzheimer’s Disease
Artin Groups and Iwahori-Hecke algebras over finite fields
Estimation of natural indirect effects robust to unmeasured confounding and mediator measurement error
Saturation Games for Odd Cycles
Unified power system analyses and models using equivalent circuit formulation
Genome-Wide Association Studies: Information Theoretic Limits of Reliable Learning
Development of an 8 channel sEMG wireless device based on ADS1299 with Virtual Instrumentation
Simple versus Optimal Contracts
Jamming and Tiling in Aggregation of Rectangles
This Time with Feeling: Learning Expressive Musical Performance
Pointwise control of the linearized Gear-Grimshaw system
Learning to Represent Bilingual Dictionaries
From POS tagging to dependency parsing for biomedical event extraction
An Implementation, Empirical Evaluation and Proposed Improvement for Bidirectional Splitting Method for Argumentation Frameworks under Stable Semantics
Learning Multi-touch Conversion Attribution with Dual-attention Mechanisms for Online Advertising
Ancient-Modern Chinese Translation with a Large Training Dataset
Sketch for a Theory of Constructs
Trajectory planning optimization for real-time 6DOF robotic patient motion compensation
Dropout during inference as a model for neurological degeneration in an image captioning network
Identification and Bayesian inference for heterogeneous treatment effects under non-ignorable assignment condition
Zero-sum path-dependent stochastic differential games in weak formulation
Resilience-based performance modeling and decision optimization for transportation network
Restricted permutations refined by number of crossings and nestings
The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary
Reciprocity and success in academic careers
Improved Methods for Moment Restriction Models with Marginally Incompatible Data Combination and an Application to Two-sample Instrumental Variable Estimation
Magnetic microstructure machine learning analysis
Statistics of the Eigenvalues of a Noisy Multi-Soliton Pulse
Criticality of Lagrange Multipliers in Variational Systems
The Impact of Automatic Pre-annotation in Clinical Note Data Element Extraction – the CLEAN Tool
Bayesian Bivariate Subgroup Analysis for Risk-Benefit Evaluation
A Full End-to-End Semantic Role Labeler, Syntax-agnostic Over Syntax-aware?
Fast RodFIter for Attitude Reconstruction from Inertial Measurements
Automatically Designing CNN Architectures Using Genetic Algorithm for Image Classification
Constant overhead quantum fault-tolerance with quantum expander codes
Learning Discriminative 3D Shape Representations by View Discerning Networks
Protecting the Grid against IoT Botnets of High-Wattage Devices
Racial Disparities and Mistrust in End-of-Life Care
On modelling positive continuous data with spatio-temporal dependence
A Simple Network of Nodes Moving on the Circle
Sample size determination in superiority or non-inferiority clinical trials with time-to-event data under exponential, Weibull and Gompertz distributions
Self-Supervised Model Adaptation for Multimodal Semantic Segmentation
Fake Sentence Detection as a Training Task for Sentence Encoding
Fully-Automated Analysis of Body Composition from CT in Cancer Patients Using Convolutional Neural Networks
Algorithm to Prove Formulas for the Expected Number of Questions in Mastermind Games
A Stage-wise Decision Framework for Transportation Network Resilience Planning
Upper and Lower Bounds on Zero-Sum Generalized Schur Numbers
A Parameterized Complexity View on Description Logic Reasoning
Neural Importance Sampling
Spectral norm of a symmetric tensor and its computation
Mixed-integer bilevel representability
Ring statistics in 2D-silica: effective temperatures in equilibrium
Orders-of-magnitude speedup in atmospheric chemistry modeling through neural network-based emulation
Compound Poisson Noise Sources in Diffusion-based Molecular Communication
Several classes of minimal linear codes with few weights from weakly regular plateaued functions
Parallelization does not Accelerate Convex Optimization: Adaptivity Lower Bounds for Non-smooth Convex Minimization
Backpressure-based Resource allocation for buffer-aided underlay D2D networks
Approximately uniformly locally finite graphs
Semi-supervised Skin Lesion Segmentation via Transformation Consistent Self-ensembling Model
A note on hypergraph colorings
Robust high dimensional factor models with applications to statistical machine learning
Partitioning a graph into cycles with a specified number of chords
The Stochastic Fejér-Monotone Hybrid Steepest Descent Method
Engineering and Economic Analysis for Electric Vehicle Charging Infrastructure — Placement, Pricing, and Market Design
Iterative Global Similarity Points : A robust coarse-to-fine integration solution for pairwise 3D point cloud registration
Mobility edge and Black Hole Horizon
Addressee and Response Selection for Multilingual Conversation
Linguistic Relativity and Programming Languages
Multimodal Language Analysis with Recurrent Multistage Fusion
Sequence Labeling: A Practical Approach
Discrete-time Risk-sensitive Mean-field Games
$PC$-polynomial of graph
Protocol for an observational study on the effects of playing football in adolescence on mental health in early adulthood
Fine-grained visual recognition with salient feature detection
Denoising of 3-D Magnetic Resonance Images Using a Residual Encoder-Decoder Wasserstein Generative Adversarial Network
Self-Triggered Network Coordination over Noisy Communication Channels
Unsupervised learning for cross-domain medical image synthesis using deformation invariant cycle consistency networks
A New Look at $F$-Tests
An Asymptotically Efficient Metropolis-Hastings Sampler for Bayesian Inference in Large-Scale Educational Measurement
Plithogeny, Plithogenic Set, Logic, Probability, and Statistics
On-Device Federated Learning via Blockchain and its Latency Analysis
A Fourier View of REINFORCE
Open-World Stereo Video Matching with Deep RNN

Continue Reading…


Read More

The side effect of automation is fake data and theft

There is hope in our business journalism! That's the message I'm getting as I delight in the current stream of articles that take a critical look at the inherent unaccountability of technology giants. These stories have been around for years, even decades, but better late than never.

Fake views/plays on YouTube videos (New York Times). This is a companion piece to a previous story about fake social-media followers (New York Times Magazine)

Uber drivers gaming the pricing algorithm (Wall Street Journal)

Google's Android tracks you everywhere whether you allow it or not (Bloomberg)

Finally, some coverage of the shame of IBM's marketing (Wall Street Journal)

As blog readers know, most companies in the Web/mobile ecosystem use an advertising-supported model. Even the most popular products, like YouTube, Google Maps, and Facebook, have not found it possible to charge users anything (even though, in theory, users derive value from such products). Digital advertising has, since the beginning, been heralded as inherently more accountable because it is supposedly more measurable. Just look at the mountains of data generated by the very act of putting ads on a website. You can even trace a user's final action, say a purchase, back to the various interactions prior to it. (This data is called the "click stream".)
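
To make the "click stream" idea concrete, here is a minimal sketch, in Python, of the kind of attribution logic that runs on such data. The event names and the even-credit ("linear") weighting scheme are hypothetical choices for illustration, not any ad platform's actual model.

```python
# Illustrative sketch: crediting a conversion back through a click stream.
# Event names and the linear weighting scheme are made up for illustration.

def linear_attribution(clickstream):
    """Split conversion credit evenly across every ad touch that
    preceded the final 'purchase' event."""
    touches = [event for event in clickstream if event != "purchase"]
    if not touches:
        return {}
    credit = 1.0 / len(touches)
    result = {}
    for touch in touches:
        result[touch] = result.get(touch, 0.0) + credit
    return result

stream = ["banner_ad", "search_ad", "email", "purchase"]
print(linear_attribution(stream))
```

Real attribution systems use far more elaborate, and far more contestable, weighting schemes - which is part of why assigning causality from this data is so suspect.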

The marketing of digital marketing centers on its "efficiency," achieved by automating the process, taking human beings out of the equation, and trusting the black-box machines. This system is flawed even in its most theoretical form, which assumes "complete data". The click stream and other digital data have multiple problems; assigning causality, in particular, is highly suspect. I wrote about this years ago - for example, here.

The opacity of the black-box machines offers not just efficiency but also shelter for fraud. Any kind of verification, quality checking, or common-sense vetting slows the system down. The more efficient the system, the easier and faster it is to commit fraud!

What kind of fraud are we talking about? Fake everything! Fake clicks, fake impressions, fake ads, fake viewers, fake plays, fake followers, fake accounts, fake likes, fake webpages, you name it. Since advertisers are paying money for clicks, impressions, ads, viewers, plays, followers, accounts, etc., and most advertisers have neither the incentives nor the nerves to look under the hood, they are paying real dollars for all the fakes.

The much-lauded efficiency of the black-box advertising model sits on a pile of fake data. The system is seriously flawed even assuming complete data; it is worse when we admit that not only do we not possess complete data, but we are drowning in fake data.

Now, data scientists are hard at work building even more black-box machines that promise to suss out the fake stuff. To understand how this might work, think about spam emails, an analogous problem. In the machine-learning world, spam detection is thought of as a "solved" problem: spam-filtering algorithms have existed for decades. But think about how many fake emails still sit in your inbox, and, conversely, how many legitimate emails land in your spam folder, and you can begin to see why technical solutions are not potent enough to drive out fraudsters.
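
For a flavor of how that kind of filtering works, here is a toy naive-Bayes classifier - the classic technique behind early spam filters. The training messages are invented, and real filters are vastly more sophisticated (which is exactly why the arms race continues).

```python
# A toy naive-Bayes spam filter, the classic technique behind early
# spam detection. Training data and messages are made up for illustration.
from collections import Counter
import math

def train(messages):
    """messages: list of (text, label) pairs, label 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    words = text.lower().split()
    vocab = len(set(counts["spam"]) | set(counts["ham"]))
    scores = {}
    for label in ("spam", "ham"):
        n = sum(counts[label].values())
        # log prior + log likelihoods with add-one smoothing
        score = math.log(totals[label] / sum(totals.values()))
        for w in words:
            score += math.log((counts[label][w] + 1) / (n + vocab))
        scores[label] = score
    return max(scores, key=scores.get)

data = [("win money now", "spam"), ("cheap pills now", "spam"),
        ("meeting at noon", "ham"), ("lunch at noon tomorrow", "ham")]
counts, totals = train(data)
print(classify("win cheap money", counts, totals))  # expected: spam
```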


Below are some comments on selected parts of the latest New York Times article on fake Youtube views (here).

  • YouTube stars can make millions from advertisers by publishing viral videos. Google, which owns YouTube, makes tons of money by having people upload their content to the platform and charging rent for it. Advertisers ultimately pay for those views. YouTube stars, and by extension Google, can make lots of money from fake views, i.e., bots (software) pretending to play the video. Third parties approach wannabes to sell them fake views and take a cut of the action.
  • It's a cat-and-mouse game that the fraudsters are winning. They have the upper hand because products are often designed without anticipating nefarious players who game the system. Adding software to deal preemptively with nefarious players slows down tech startups trying to build an initial market. Preventing bad players from acting badly imposes (the externality of) restrictions on good players, which is another disincentive to tackle the problem.
  • YouTube officially claims that "fake views represent just a tiny fraction of the total" (later defined as 1 percent). This is a guess masquerading as fact. Knowing what fraction of the views on YouTube is "fake" requires knowing whether each view is fake or not. In theory, one can review a sample of the views and then infer the fraction for the entire catalog from that sample. In practice, it is not easy to prove whether a view is fake just from the data collected about the viewer's browser, OS, etc. A machine-learning algorithm will "predict" whether a view is fake, but such predictions should not be used as "ground truth".
  • Legitimate organizations buy views, including marketing agencies, news organizations and musicians. As a former employee said, "View count manipulation will be a problem as long as views and the popularity they signal are the currency of YouTube." It's not just YouTube - we know that when teachers are compensated based on student test scores, they become complicit in abetting their students to cheat.
  • This underground economy is quite well developed now, with wholesalers and resellers of fake views. If you are a reseller, you don't even need to know how to generate fake views. You just buy fakes from the wholesalers and charge a markup!
  • The sellers of fake views are so sure they can evade Youtube's detection technologies that they are openly talking about their clients, business models, etc. to the press. Either they know those technologies don't work, or they know that Youtube isn't serious about cracking down.
  • Several buyers interviewed by the Times said they truly believed the marketing spin of the fake-view sellers, that they were driving real users to watch their videos. This is a tad disingenuous. As one writer who purchased views pointed out, those purchased views led to no sales, and she couldn't get any data about who viewed the videos.
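
On the sampling point above: with hypothetical numbers - say a manual review finds 32 fakes in a random sample of 1,000 views - the textbook normal-approximation interval for the catalog-wide fake fraction looks like this. The catch, as noted, is that the labels themselves have to be trustworthy; if they come from a model's predictions, the interval inherits the model's errors.

```python
# Sketch: infer the catalog-wide fake-view fraction from a labeled sample.
# The sample size and fake count below are hypothetical.
import math

def proportion_ci(successes, sample_size, z=1.96):
    """Point estimate and 95% normal-approximation CI for a proportion."""
    p = successes / sample_size
    se = math.sqrt(p * (1 - p) / sample_size)
    return p, (p - z * se, p + z * se)

p, (low, high) = proportion_ci(successes=32, sample_size=1000)
print(f"estimated fake fraction: {p:.1%}, 95% CI ({low:.1%}, {high:.1%})")
# prints: estimated fake fraction: 3.2%, 95% CI (2.1%, 4.3%)
```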

It's actually very easy to demonstrate the fraud, as the fake views are barely hidden. The article came with a great graphic, which does not appear to be addressed directly in the text, so let me explain.


The reporter purchased fake views from the vendor called "BuyYoutubViews". The top chart shows that close to 100% of those views ("in campaign") played the entire video (over 5 minutes long). The bottom chart shows that once they turned off the "campaign", i.e., stopped buying fake views, almost no one played the video for more than a few seconds. (What is not shown is that the number of views probably plunged after the fake-views campaign was shut down.)
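
That watch-time pattern is itself a simple fraud signal: real audiences drop off within seconds, so a traffic source where nearly everyone finishes the video is suspect. A minimal sketch, with invented watch times:

```python
# Sketch: flag a traffic source whose watch-time distribution looks too
# uniform. Real human audiences drop off quickly; a source where nearly
# every "viewer" watches the whole video is suspicious.
# The durations below are invented for illustration.

def completion_rate(watch_seconds, video_length):
    """Fraction of views that played at least 95% of the video."""
    finished = sum(1 for s in watch_seconds if s >= 0.95 * video_length)
    return finished / len(watch_seconds)

def looks_like_bots(watch_seconds, video_length, threshold=0.9):
    return completion_rate(watch_seconds, video_length) >= threshold

campaign = [300, 305, 298, 301, 300, 299]  # almost everyone "finishes"
organic = [4, 12, 300, 7, 45, 2, 9, 30]    # most people bail early

print(looks_like_bots(campaign, video_length=300))  # True
print(looks_like_bots(organic, video_length=300))   # False
```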

Raise your hand if you think "BuyYoutubViews" is finding real people to watch the video!






Data Scientist guide for getting started with Docker

Docker is an increasingly popular way to create and deploy applications through virtualization, but can it be useful for data scientists? This guide should help you quickly get started.


Solve epileptic seizure prediction! Participate at

Around twenty million people worldwide suffer from drug-resistant epilepsy and the unpredictability of seizures is one of the major factors affecting the quality of life of people with epilepsy.


Vienna overtakes Melbourne as the world’s most liveable city

“WHAT is the city but the people?” The latest liveability index from the Economist Intelligence Unit suggests that Shakespeare might be overplaying the importance of people in his definition.


Unveiling Mathematics Behind XGBoost

Follow me till the end, and I assure you that you will at least get a sense of what is happening underneath this revolutionary machine learning model.


The missing piece of the Smart Contract Puzzle

While there has been a lot of hype around the efficiency of a smart contract, can it replace the traditional way of working? What are the challenges and what is the role blockchain has to play in this shift? Smart contracts have been touted as an evolutionary leap in legal

The post The missing piece of the Smart Contract Puzzle appeared first on Dataconomy.


It was the weeds that bothered him.

Bill Jefferys points to this news article by Denise Grady. Bill noticed the following bit, “In male rats, the studies linked tumors in the heart to high exposure to radiation from the phones. But that problem did not occur in female rats, or any mice,” and asked:

​Forking paths, much?

My reply: The summary of the news article seems reasonable: “But two government studies released on Friday, one in rats and one in mice, suggest that if there is any risk, it is small, health officials said.”

But, yes, later on they get into the weeds: “Scientists do not know why only male rats develop the heart tumors, but Dr. Bucher said one possibility is simply that the males are bigger and absorb more of the radiation.” They didn’t mention the possibility that variations just happen at random, and the fact that a comparison happened to be statistically significant in the data is not necessarily strong evidence that it represents a corresponding pattern in the population.
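
To see how easily this happens, here is a toy simulation - the group sizes, tumor rates, and number of subgroups are all invented, not the actual study design. With no true effect anywhere, looking across a few subgroups turns up at least one "statistically significant" difference far more often than 5% of the time:

```python
# Toy simulation of forking paths: with no true effect anywhere, run
# several subgroup comparisons (male/female rats, mice, ...) and see how
# often at least one comes out "significant" at p < 0.05 by chance alone.
# Group sizes, tumor rates, and subgroup count are invented.
import random

random.seed(0)

def one_experiment(n_subgroups=4, n_per_group=50, tumor_rate=0.1):
    """Return True if any subgroup comparison looks 'significant' under
    a crude two-proportion z-test, though all groups are identical."""
    for _ in range(n_subgroups):
        a = sum(random.random() < tumor_rate for _ in range(n_per_group))
        b = sum(random.random() < tumor_rate for _ in range(n_per_group))
        p1, p2 = a / n_per_group, b / n_per_group
        pooled = (a + b) / (2 * n_per_group)
        if pooled in (0, 1):
            continue  # no variation, no test
        se = (pooled * (1 - pooled) * 2 / n_per_group) ** 0.5
        if abs(p1 - p2) / se > 1.96:
            return True
    return False

trials = 2000
hits = sum(one_experiment() for _ in range(trials))
print(f"{hits / trials:.0%} of null experiments had a 'significant' subgroup")
```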

Bill responded:

Yes, it was the weeds that bothered me.

Overall, then, yes, a good news article.

The post It was the weeds that bothered him. appeared first on Statistical Modeling, Causal Inference, and Social Science.


3 promising areas for AI skills development

O'Reilly survey results and usage data reveal growing trends and topics in artificial intelligence.

One of the key findings of a survey we released earlier this year (How Companies are Putting AI to Work through Deep Learning) was that the leading reason holding companies back from incorporating deep learning was their lack of access to skilled people. One-fifth of respondents pointed to a skills gap as one of the reasons they haven’t integrated deep learning, and at the time of the survey, 75% of respondents indicated their company had some combination of internal and external training programs to address this issue.

We’ve continued to monitor interest in topics relevant to building AI products and systems, specifically areas that also warrant investment in skills development. In this post, I’ll share results of related studies we’ve conducted. I’ll draw from two data sources:

  • We examine usage[1] across all content formats on the O’Reilly online learning platform, as well as demand via volume of search terms.
  • We recently conducted a survey (full report forthcoming) on machine learning adoption, which included more than 6,000 respondents from North America.

I’ll use key portions of our upcoming AI Conference in San Francisco to describe how companies can address the topics and findings surfaced in these two recent studies.

Growing interest in key topics

Through the end of June 2018, we found double-digit growth in key topics associated with AI. Our online learning platform usage metrics encompass many content formats including books, videos, online training, interactive content, and other material:

[Chart: growth in usage]

Growth was strong across many topics associated with AI and machine learning. The chart below provides a sense of how much content usage (“relative popularity”) we’re seeing in some of these key topics: our users remain very interested in machine learning, particularly in deep learning.

[Chart: usage by topic]

It’s one thing to learn about an individual technology or a specific class of modeling techniques, but ultimately, organizations need to be able to design robust AI applications and products. This involves hardware, software infrastructure to manage data pipelines, and elegant user interfaces. For the upcoming AI Conference in San Francisco, we assembled training sessions, tutorials, and case studies on many of these important topics:

We’ve also found that interest in machine learning compares favorably with other areas of technology. We track interest in topics by monitoring search volume on our online learning platform. Alongside Kubernetes and blockchain, machine learning has been one of the fast-growing, high-volume search topics year over year:

[Chart: usage year over year]

Emerging topics

As I noted in the first chart above, we are seeing growing interest in reinforcement learning and PyTorch. It's important to point out that TensorFlow is still by far the most popular deep learning framework, but, as in other surveys, we are seeing that PyTorch is beginning to build a devoted following. Looking closely at interest in topics within data science and AI, we found that interest in reinforcement learning, PyTorch, and Keras has risen substantially this year:

[Chart: top AI/data search terms]

The chart below provides a ranked list of industries that are beginning to explore using reinforcement learning and PyTorch:

[Chart: usage by topic]

We’ve had reinforcement learning tutorial sessions and presentations from the inception of our AI Conference. As tools and libraries get simpler and more tightly integrated with other popular components, I’m expecting to see more mainstream applications of reinforcement learning. We have assembled tutorial sessions and talks at the AI Conference on reinforcement learning and on popular tools for building deep learning applications (including PyTorch and TensorFlow):

Toward a holistic view of AI applications

There is growing awareness among major stakeholders about the importance of data privacy, ethics, and security. Users are beginning to seek more transparency and control over their data, regulators are beginning to introduce data privacy rules, and there is growing interest in ethics and privacy among data professionals.

[Figure: model-building checklist]

There is an emerging set of tools and best practices for incorporating fairness, transparency, privacy, and security into AI systems. For our upcoming AI Conference in San Francisco, we have a wide selection of tutorials and sessions aimed both at technologists wanting to understand how to incorporate ethics and privacy into applications, and at managers needing to understand what these new tools and technologies are able to provide.

Continue reading 3 promising areas for AI skills development.


Four short links: 14 August 2018

Hyrum's Law, Academic Torrents, Logic Textbook, and Suboptimal Fairness

  1. Hyrum's Law -- With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. (via Simon Willison)
  2. Academic Torrents -- a community-maintained distributed repository for data sets and scientific knowledge. 27GB and growing.
  3. Open Logic Project -- a collection of teaching materials on mathematical logic aimed at a non-mathematical audience, intended for use in advanced logic courses as taught in many philosophy departments. It is open source: you can download the LaTeX code. It is open: you’re free to change it whichever way you like, and share your changes. It is collaborative: a team of people is working on it, using the GitHub platform, and we welcome contributions and feedback. And it is written with configurability in mind.
  4. Delayed Impact of Fair Machine Learning (Paper a Day) -- it’s therefore possible to have a fairness intervention with the unintended consequence of leaving the disadvantaged group worse off than they were before.

Continue reading Four short links: 14 August 2018.


A quick reminder on HTTPS everywhere

HTTPS "everywhere" means everywhere—not just the login page, or the page where you accept donations. Everything.

HTTPS Everywhere! So the plugin says, and now browsers are warning users that sites not implementing https:// are security risks. Using HTTPS everywhere is good advice. And this really means "everywhere": the home page, everything. Not just the login page, or the page where you accept donations. Everything.

Implementing HTTPS everywhere has some downsides, as Eric Meyer points out. It breaks caching, which makes the web much slower for people limited to satellite connections (and that's much of the third world); it's a problem for people who, for various reasons, have to use older browsers (there are more ancient browsers and operating systems in the world than you would like to think, trust me); domain names and IP addresses are handled by lower-level protocols that HTTPS doesn't get to touch, so it's not as private as one would like; and more. It's not a great solution, but it's a necessary one. (Meyer's article, and the comments following it, are excellent.)

The real problem isn't HTTPS's downsides; it's that I see and hear more and more complaints from people who run simple non-commercial sites asking why this affects them. Do you need cryptographic security if your site is a simple read-only, text-only site with nothing controversial? Unfortunately, you do. Here's why. Since the ISPs' theft of the web (it's not limited to the loss of Network Neutrality, and not just an issue in the U.S.), the ISPs themselves can legally execute man-in-the-middle attacks to:

The first two are happening already; the third may be. (It's possible that GDPR, which protects European citizens regardless of where they're located, might prevent ISPs from collecting and selling browsing history. I wouldn't count on it, though.)

Yesterday, I poked around a bit and found many sites that don't use HTTPS everywhere. Those sites include an Ivy League university (Cornell, get your act together!), many non-profit organizations (including several that I belong to), several well-known newspapers and magazines, local libraries, and a lot of small businesses. The irony is that most of these sites accept donations, let you read restricted-access materials, or even sell stuff online, and those pages are already using HTTPS (though not always correctly). Protecting the entire site doesn't require that big a change; in many cases, using HTTPS for the entire site is simpler than protecting a limited collection of pages.

I agree that HTTPS is a significant administrative burden for simple, static sites, or sites that are run by groups that don’t have the technical ability to implement HTTPS. Services like Let's Encrypt reduce some of the burden (Let's Encrypt provides free certificates, reducing the process of setting up HTTPS to a few well-placed clicks)—but you still have to do it.

Nothing stays simple and elegant, particularly when it's under attack. And the web is under attack: from the pirate ISPs, from hostile governments (a somewhat different issue, but related), and from many other actors. HTTPS is a solution—a problematic one, I’ll grant, and one that imposes a burden on the sites least capable of dealing with technical overhead, but we don't have a better solution. Well, yes, we do have a better solution: IPSec and IPv6 solve the problem nicely. But we've been waiting for widespread adoption of those for more than 20 years, and we're still waiting. These are problems we need to solve now.

"There's nothing on my site that requires cryptography" doesn't help anyone. That's no more helpful than people who say "I have nothing to hide, so I don't need privacy." You don't need either privacy or HTTPS until you do, and then it's way, way too late. Do it for your users' sake.

Continue reading A quick reminder on HTTPS everywhere.


Robot arm seeks out Waldo, using machine learning

The camera on the slightly creepy arm takes a picture of the pages in the book, the software uses OpenCV to extract faces, and the faces are passed to Google AutoML Vision, which compares them to a Waldo model. The result: There's Waldo.
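
The "find this patch in that image" core of such a pipeline - which OpenCV's matchTemplate and a trained face model do far more robustly - can be sketched as naive template matching on toy grids:

```python
# A minimal sketch of the "find the target in the image" idea using
# naive template matching on toy grids of pixel values - a pure-Python
# stand-in for what OpenCV does far more robustly.

def find_template(image, template):
    """Return (row, col) of the top-left corner where template exactly
    matches a region of image, or None if there is no match."""
    H, W = len(image), len(image[0])
    h, w = len(template), len(template[0])
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            if all(image[r + i][c + j] == template[i][j]
                   for i in range(h) for j in range(w)):
                return (r, c)
    return None

scene = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 3, 4, 0, 0],
    [0, 0, 0, 0, 0],
]
waldo = [[1, 2],
         [3, 4]]

print(find_template(scene, waldo))  # (1, 1)
```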



I started a corporation!

After I released Bite Size Linux in May, something surprising happened – it sold WAY more copies than I expected! (as of today, 2600 copies!!!). If you bought one, thank you! It was super encouraging and makes me want to produce a lot more fun zine ideas :). And 1300 of you have already bought Bite Size Command Line, which is similarly astonishing.

The only downside of selling things is – making money is complicated! I need to track the income so I can pay income taxes, I need to pay sales tax on Canadian sales! Tracking expenses is really annoying!

So I decided to start a corporation (just on the side, I still work at the same place ❤) to manage the logistics. I’m pretty excited about this so I wanted to explain why.

What’s involved in starting a company?

I thought briefly about using Stripe Atlas (which is a cool way to easily start a US corporation) but I live in Canada so I decided to start a Canadian corporation. I hired a lawyer and an accountant to do all the paperwork. Now I have:

  • A numbered Canadian corporation!
  • GST and QST numbers for me to use to pay sales tax!
  • A corporate bank account and credit card!

Even though I hired a lawyer / accountant to help me out, it ended up taking 8 weeks from start to finish, mostly because I was travelling/busy.

It’s easier to pay people!

One of the most exciting things about having a company is that it’s easier to hire people to help me with my zines! And I can use the company’s money to expense ALL KINDS OF THINGS. For example:

  • I’m interested in doing collaborations with folks and paying them for their work. This makes that WAY easier – the fantastic Katie Sylor-Miller and I just announced a collaboration where we’re illustrating oh shit, git!
  • It simplifies hiring illustrators for my covers (and maybe I can get EVEN MORE things illustrated?!)
  • I can hire an accountant to help with my taxes
  • It makes it easier to maybe print & ship zines to people (because it simplifies things like potential kickstarter taxes), which folks have been asking me about FOREVER and I’d really like to do one day.
  • I can hire an IP lawyer to get advice about intellectual property / copyright if I need it!
  • I can more easily invest in equipment (like I’m thinking of buying a fancy colour laser printer or maybe even a machine to help me staple zines)

It’s easier to keep track of money!

Before, all my zine income was mixed in with my personal income. This was okay when there wasn’t that much of it, but now that there’s more it’s really exciting to have it in a separate place. And I’m SO EXCITED about being able to use a corporate credit card to buy company-things.

Adding a company rate for zines

One other sort-of related thing I’m excited about is – for my last 2 zines, I’ve added a “corporate rate” which gives you permission to use a zine at work for $100. So far 47 companies have bought it (thanks ❤), which is really amazing. And one company bought the $500 version which is for companies over 1000 employees!

It’s possible that I should increase those prices in the future, but it’s a start!

some questions about running a zine business

There are more aspects to selling zines that I’m interested in talking about in the future – reframing writing zines as a “small side business” instead of just a “fun thing” raises all kinds of interesting questions like:

  • will more or fewer people read my zines if I sell them than if they’re free?
  • does selling zines make it harder to make weird niche zines because there’s an incentive to make things that are more general-interest?
  • does having a side business make it easier to do high quality work?
  • do people value work more if it’s not free?
  • right now I’m selling zines for $10, but $10 is a lot of money for many people (both in the US and outside the US). how do I keep them accessible?
  • are there better ways to sell zines to companies?
  • does giving away really good work for free normalize the idea that work “should” be given away for free? Does it undermine people who sell their work?

All of my existing zines will remain free, of course :). I’m not sure about newer zines; I’m doing some experiments! (Bite Size Linux and Bite Size Command Line are both not-free.) More on those questions in the future as I learn about them :)

Continue Reading…



If you did not already know

Geometry-Aware Generative Adversarial Network (GAGAN) google
Deep generative models learned through adversarial training have become increasingly popular for their ability to generate naturalistic image textures. However, apart from the visual texture, the visual appearance of objects is significantly affected by their shape geometry, information which is not taken into account by existing generative models. This paper introduces the Geometry-Aware Generative Adversarial Network (GAGAN) for incorporating geometric information into the image generation process. Specifically, in GAGAN the generator samples latent variables from the probability space of a statistical shape model. By mapping the output of the generator to a canonical coordinate frame through a differentiable geometric transformation, we enforce the geometry of the objects and add an implicit connection from the prior to the generated object. Experimental results on face generation indicate that the GAGAN can generate realistic images of faces with arbitrary facial attributes such as facial expression, pose, and morphology, that are of better quality compared to current GAN-based methods. Finally, our method can be easily incorporated into and improve the quality of the images generated by any existing GAN architecture. …

Information Potential Auto-Encoders google
In this paper, we suggest a framework to make use of mutual information as a regularization criterion to train Auto-Encoders (AEs). In the proposed framework, AEs are regularized by minimization of the mutual information between input and encoding variables of AEs during the training phase. In order to estimate the entropy of the encoding variables and the mutual information, we propose a non-parametric method. We also give an information theoretic view of Variational AEs (VAEs), which suggests that VAEs can be considered as parametric methods that estimate entropy. Experimental results show that the proposed non-parametric models have more degree of freedom in terms of representation learning of features drawn from complex distributions such as Mixture of Gaussians, compared to methods which estimate entropy using parametric approaches, such as Variational AEs. …

Spotting anomalies with Privileged Information (SPI) google
We introduce a new unsupervised anomaly detection ensemble called SPI which can harness privileged information – data available only for training examples but not for (future) test examples. Our ideas build on the Learning Using Privileged Information (LUPI) paradigm pioneered by Vapnik et al. [19,17], which we extend to unsupervised learning and in particular to anomaly detection. SPI (for Spotting anomalies with Privileged Information) constructs a number of frames/fragments of knowledge (i.e., density estimates) in the privileged space and transfers them to the anomaly scoring space through ‘imitation’ functions that use only the partial information available for test examples. Our generalization of the LUPI paradigm to unsupervised anomaly detection shepherds the field in several key directions, including (i) domain knowledge-augmented detection using expert annotations as PI, (ii) fast detection using computationally-demanding data as PI, and (iii) early detection using ‘historical future’ data as PI. Through extensive experiments on simulated and real datasets, we show that augmenting privileged information to anomaly detection significantly improves detection performance. We also demonstrate the promise of SPI under all three settings (i-iii); with PI capturing expert knowledge, computationally expensive features, and future data on three real world detection tasks. …

Continue Reading…



data.table is Really Good at Sorting

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.

[Figure: sorting time, data.table versus dplyr, log-log scale]

The graph uses a log-log scale (so things are very compressed), but data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below.

[Figure: ratio of dplyr to data.table run times]

Notice on the above semi-log plot that the run-time ratio grows roughly linearly. This makes sense: data.table uses a radix sort, which can run in near linear time (faster than the n log(n) lower bound known for comparison sorting) for a range of problems (also, we are only showing example sorting times, not worst-case sorting times).

In fact, if we divide the y in the above graph by log(rows) we get something approaching a constant.

[Figure: run-time ratio divided by log(rows)]

The above is consistent with data.table not only being faster than dplyr, but also having a fundamentally different asymptotic running time.
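To make the asymptotics concrete, here is a toy LSD radix sort in Python (purely illustrative; data.table's actual radix sort is a highly tuned C implementation): each pass buckets values stably by one base-256 digit, giving O(k·n) work for k-digit keys rather than the O(n log n) of comparison sorts.

```python
import random

def radix_sort(xs):
    """Toy LSD radix sort for non-negative integers: one stable bucketing pass
    per base-256 digit, so O(k * n) total work for k-digit keys, versus the
    O(n log n) lower bound for comparison sorts."""
    xs = list(xs)
    if not xs:
        return xs
    max_val, shift = max(xs), 0
    while (max_val >> shift) > 0:
        buckets = [[] for _ in range(256)]
        for x in xs:
            buckets[(x >> shift) & 255].append(x)  # stable: preserves prior order
        xs = [x for b in buckets for x in b]
        shift += 8  # move to the next base-256 digit
    return xs

data = [random.randrange(10**9) for _ in range(50_000)]
assert radix_sort(data) == sorted(data)
```

Because the number of digit passes k is fixed by the key width rather than by n, the cost per element stays flat as n grows, which is the behaviour the ratio plot suggests.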

Performance like the above is one of the reasons you should strongly consider data.table for your R projects.

All details of the timings can be found here.

Continue Reading…



Magister Dixit

“You can have data without information, but you cannot have information without data.” Daniel Keys Moran

Continue Reading…



2018 Update of ffanalytics R Package

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)


As part of our effort to give users ways to replicate our analyses and improve their performance in fantasy football, we are continuously looking for ways to improve our tools. Since the ffanalytics package was introduced back in 2016, we have had a tremendous response and plenty of feedback. We have worked through some of the kinks and are now introducing a new version of the ffanalytics package that should be more robust and provide more consistent results.


One of the major concerns with data scrapes is that websites change, and so does the way data are rendered on individual pages. Where the old version of the ffanalytics package was highly dependent on how columns were ordered on the different sites, the new version does not depend on column order. Instead, the package looks for specific components of the page and, based on a predefined mapping, determines which columns are which.
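The idea of mapping columns by recognized header content rather than by position can be sketched as follows (an illustrative Python sketch with made-up header names, not the package's actual mapping):

```python
# Hypothetical header-to-canonical-name mapping; the names are made up
# for illustration and are not ffanalytics' real mapping table.
COLUMN_MAP = {
    "Pass Yds": "pass_yds",
    "Passing Yards": "pass_yds",
    "Rush TD": "rush_tds",
    "Rushing Touchdowns": "rush_tds",
}

def map_columns(scraped_headers):
    """Return {position_index: canonical_name} for recognized headers,
    so the scrape works no matter where each column appears on the page."""
    return {i: COLUMN_MAP[h] for i, h in enumerate(scraped_headers) if h in COLUMN_MAP}

# The same data is recovered regardless of column order:
assert map_columns(["Rush TD", "Pass Yds"]) == {0: "rush_tds", 1: "pass_yds"}
assert map_columns(["Passing Yards", "Junk", "Rushing Touchdowns"]) == {0: "pass_yds", 2: "rush_tds"}
```

Unrecognized columns are simply skipped, so cosmetic page changes don't break the scrape.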

The other major concern is that different sites list the same player under slightly different names (e.g., Rob Kelly vs. Robert Kelly) or at different positions, so matching up the data can be tricky. The new version of the package maps player IDs from the different sites back to the MFL player ID, so it no longer matters what name or position a specific player is listed under on a specific site; the data for a specific player will match up correctly.
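The ID crosswalk can be sketched similarly (illustrative Python; the site names and IDs below are made up):

```python
# Hypothetical crosswalk: (site, site-specific id) -> canonical player id.
# All values are invented for illustration.
ID_MAP = {
    ("site_a", "rob-kelly"): "13132",
    ("site_b", "robert-kelly-rb"): "13132",
}

def canonical_id(site, src_id):
    """Resolve a site-specific player ID to the canonical (MFL-style) ID."""
    return ID_MAP.get((site, src_id))

# "Rob Kelly" and "Robert Kelly" resolve to the same player:
assert canonical_id("site_a", "rob-kelly") == canonical_id("site_b", "robert-kelly-rb") == "13132"
```

Aggregation then keys every stat row on the canonical ID, so name and position variants never collide.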

Finally, the new version of the package relies heavily on the vocabulary and terminology used in the tidyverse package.  We highly recommend familiarizing yourself with that package.

It is our intent that the new version of the package will provide more consistent scrapes for users.

Player information

The player information used in the package is based on player information from MyFantasyLeague (MFL). When the package is loaded, it pulls the most recent player data from MFL into the player_table object. When the scrape runs, it determines the player ID for the player on the site that is being scraped and maps that ID back to the MFL player ID. The scraped data contains an id column which holds the player ID for MFL, and a src_id column that has the ID for the site. However, when aggregating the stats and generating the projections the player ID used is the MFL player ID.

Custom calculations

The customization of calculations is done as input to the projections_table() function. You can specify your own scoring rules, analyst weights, tier thresholds and VOR baselines to be used in the calculations. See ?projections_table for specifics, and check out the scoring_settings vignette on how to customize scoring rules.

Installation and getting started

We have updated the instructions on installing and getting started with the package here, so check that page out, and feel free to leave comments with feedback and suggestions.

The post 2018 Update of ffanalytics R Package appeared first on Fantasy Football Analytics.

To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

Continue Reading…



Document worth reading: “Does putting your emotions into words make you feel better? Measuring the minute-scale dynamics of emotions from online data”

Studies of affect labeling, i.e. putting your feelings into words, indicate that it can attenuate positive and negative emotions. Here we track the evolution of individual emotions for tens of thousands of Twitter users by analyzing the emotional content of their tweets before and after they explicitly report having a strong emotion. Our results reveal how emotions and their expression evolve at the temporal resolution of one minute. While the expression of positive emotions is preceded by a short but steep increase in positive valence and followed by short decay to normal levels, negative emotions build up more slowly, followed by a sharp reversal to previous levels, matching earlier findings of the attenuating effects of affect labeling. We estimate that positive and negative emotions last approximately 1.25 and 1.5 hours from onset to evanescence. A separate analysis for male and female subjects is suggestive of possible gender-specific differences in emotional dynamics. Does putting your emotions into words make you feel better? Measuring the minute-scale dynamics of emotions from online data

Continue Reading…



R Packages worth a look

Robust Generalized Linear Models (GLM) using Mixtures (robmixglm)
Robust generalized linear models (GLM) using a mixture method, as described in Beath (2018) <doi:10.1080/02664763.2017.1414164>. This assumes tha …

Penalized Orthogonal-Components Regression (POCRE)
Penalized orthogonal-components regression (POCRE) is a supervised dimension reduction method for high-dimensional data. It sequentially constructs ort …

Library of Research Designs (DesignLibrary)
A simple interface to build designs using the package ‘DeclareDesign’. In one line of code, users can specify the parameters of individual de …

Generate Random Samples from the Polya-Gamma Distribution (pgdraw)
Generates random samples from the Polya-Gamma distribution using an implementation of the algorithm described in J. Windle’s PhD thesis (2013). …

Web-Based Interactive Omics Visualization (wilson)
Tool-set of modules for creating web-based applications that use plot based strategies to visualize and analyze multi-omics data. This package utilizes …

Continue Reading…



Mongolite 2.0: GridFS, connection pooling, and more

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

This week version 2.0 of the mongolite package has been released to CRAN. Major new features in this release include support for MongoDB 4.0, GridFS, running database commands, and connection pooling.

Mongolite is primarily an easy-to-use client to get data in and out of MongoDB. However, it also supports many advanced features, such as aggregation, indexing, map-reduce, streaming, encryption, and enterprise authentication. The mongolite user manual provides a great introduction with details and worked examples.

GridFS Support!

New in version 2.0 is support for the MongoDB GridFS system. A GridFS is a special type of Mongo collection for storing binary data, such as files. To the user, a GridFS looks like a key-value server with potentially very large values.

We support two interfaces for sending/receiving data from/to GridFS. The fs$read() and fs$write() methods are the most flexible and can stream data from/to an R connection, such as a file, socket or url. These methods support a progress counter and can be interrupted if needed. These methods are recommended for reading or writing single files.

# Assume 'mongod' is running on localhost
fs <- gridfs()
buf <- serialize(nycflights13::flights, NULL)
fs$write(buf, 'flights')
#> [flights]: written 45.11 MB (done)
#>  List of 6
#> $ id      : chr "5b70a37a47a302506117c352"
#> $ name    : chr "flights"
#> $ size    : num 45112028
#> $ date    : POSIXct[1:1], format: "2018-08-12 23:15:38"
#> $ type    : chr NA
#> $ metadata: chr NA

# Read serialized data:
out <- fs$read('flights')
flights <- unserialize(out$data)
# [flights]: read 45.11 MB (done)
#>  A tibble: 6 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
#> 1  2013     1     1      517            515         2      830            819        11 UA        1545
#> 2  2013     1     1      533            529         4      850            830        20 UA        1714
#> 3  2013     1     1      542            540         2      923            850        33 AA        1141
#> 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725
#> 5  2013     1     1      554            600        -6      812            837       -25 DL         461
#> 6  2013     1     1      554            558        -4      740            728        12 UA        1696
#> # ... with 8 more variables: tailnum , origin , dest , air_time , distance ,
#> #   hour , minute , time_hour 

The fs$upload() and fs$download() methods on the other hand copy directly between GridFS and your local disk. This API is vectorized so it can transfer many files at once. Head over to the gridfs chapter in the manual for more examples.

Running Commands

MongoDB exposes an enormous number of database commands, and mongolite cannot provide wrappers for each one. As a compromise, the new version of mongolite exposes an API to run raw commands, so you can manually run the commands for which we do not expose wrappers.

The result data from the command automatically gets simplified into nice R data structures using the jsonlite simplification system (but you can set simplify = FALSE if you prefer JSON structures).

m <- mongo()
m$run('{"ping": 1}')
#> $ok
#> [1] 1

For example we can run the listDatabases command to list all db’s on the server:

admin <- mongo(db = "admin")
admin$run('{"listDatabases": 1}')
#> $databases
#>     name sizeOnDisk empty
#> 1  admin      32768 FALSE
#> 2 config      73728 FALSE
#> 3  local      73728 FALSE
#> 4   test   72740864 FALSE
#> $totalSize
#> [1] 72921088
#> $ok
#> [1] 1

The server tools chapter in the manual has some more examples.

Connection Pooling

Finally another much requested feature was connection pooling. Previously, mongolite would open a new database connection for each collection object in R. However as of version 2.0, mongolite will use existing connections to the same database when possible.

# This will use a single database connection
db.flights <- mongolite::mongo("flights")
db.iris <- mongolite::mongo("iris")
db.files <- mongolite::gridfs()

A database connection is automatically closed when all objects that were using it have either been removed, or explicitly disconnected with the disconnect() method. For example using the example above:

rm(db.flights, db.iris)
# Connection still alive

rm(db.files)
# Now it will disconnect
Mongolite collection and GridFS objects automatically reconnect when needed if they have been disconnected (either explicitly, or automatically, e.g. when restarting your R session). For example:

> db.files$find()
Connection lost. Trying to reconnect with mongo...
#>                         id    name     size                date type metadata
#> 1 5b70a37a47a302506117c352 flights 45112028 2018-08-12 23:15:38      
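The pooling behaviour described above can be sketched with a small reference-counting model (illustrative Python, not mongolite's actual implementation): handles to the same URI share one underlying connection, which is closed only when the last handle releases it.

```python
# Illustrative reference-counted connection pool; a stand-in object plays
# the role of a real database connection.
class Pool:
    def __init__(self):
        self._conns = {}  # uri -> [connection, refcount]

    def acquire(self, uri):
        if uri not in self._conns:
            self._conns[uri] = [object(), 0]  # open one real connection per uri
        self._conns[uri][1] += 1
        return self._conns[uri][0]

    def release(self, uri):
        entry = self._conns[uri]
        entry[1] -= 1
        if entry[1] == 0:       # last user gone: actually close
            del self._conns[uri]

pool = Pool()
a = pool.acquire("mongodb://localhost/test")
b = pool.acquire("mongodb://localhost/test")
assert a is b                                          # two handles, one connection
pool.release("mongodb://localhost/test")
assert "mongodb://localhost/test" in pool._conns       # still alive
pool.release("mongodb://localhost/test")
assert "mongodb://localhost/test" not in pool._conns   # now closed
```

This mirrors the semantics above: removing or disconnecting one collection object leaves the shared connection open until the last user is gone.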

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Continue Reading…



Where to go observe birds in Radolfzell? An answer with R and open data

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

This post is the 1st post of a series showcasing various rOpenSci
packages as if Maëlle were a birder trying to make the most of R in
general and rOpenSci in particular. Although the series' use cases will
mostly feature birds, it'll be the occasion to highlight rOpenSci
packages that are more widely applicable, so read on no matter what your
field is! Moreover, each post should stand on its own.

A logical first step in this birder’s guide to rOpenSci is to use R to
find where to observe birds! In this blog post, we shall use rOpenSci
packages accessing open geographical data
to locate and visualize
where to observe birds near a given location.

First of all, where are we now?

The Max Planck Institute for Ornithology (henceforth shortened to MPI),
where Maëlle will give a talk, is located at Am Obstberg 1, 78315
Radolfzell. Let's geolocate it using rOpenSci's opencage package,
which interfaces the OpenCage Geocoder, a commercial service based on
open data. When choosing to get only one result via limit = 1, one gets
what the API considers to be the best one.

mpi <- opencage::opencage_forward("Am Obstberg 1 78315 Radolfzell", limit = 1)$results
## [1] "tbl_df"     "tbl"        "data.frame"
## [1] "annotations.DMS.lat"    "annotations.DMS.lng"   
## [3] "annotations.MGRS"       "annotations.Maidenhead"
## [5] "annotations.Mercator.x" "annotations.Mercator.y"

This gets us Am Obstberg 1, 78315 Radolfzell, Germany (mpi$formatted)
which is in 🇩🇪 (mpi$annotations.flag gets us a flag!).

Birding in a bird hide?

Where to find a bird hide?

You can most certainly go birding anywhere, but if you can find a bird
hide (or bird blind depending on the English you speak), it might be a
very appropriate observation point. Now that we know where the MPI is,
we can look for bird hide(s) in the vicinity. For that, we shall use
rOpenSci's osmdata package by Mark Padgham and collaborators! This
package is an interface to OpenStreetMap's Overpass API. OpenStreetMap
is a collective map of the world. It contains information about towns'
limits, roads, placenames… but also tags of everything, from bars as
seen in this blog post to trees. You can browse existing features via
OpenStreetMap's wiki. Some parts of the world are better mapped than
others depending on the local OpenStreetMap community. Actually,
OpenCage's blog features an interesting series of country profiles.

To look for a bird hide, we first create a bounding box of 10km around
the MPI, using rOpenSci's bbox package.

bbox <- bbox::lonlat2bbox(mpi$geometry.lng,
                          mpi$geometry.lat,
                          dist = 10, method = "lawn")
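The bounding-box computation can be sketched independently of R (an illustrative Python approximation assuming a spherical Earth; lonlat2bbox's actual method may differ): offset the point by the requested distance in each direction, stretching the longitude offset by the latitude's cosine.

```python
import math

def lonlat_bbox(lon, lat, dist_km):
    """Approximate bounding box of dist_km around (lon, lat).
    Spherical-Earth sketch: 1 degree of latitude ~ R * pi / 180 km,
    and longitude degrees shrink by cos(latitude)."""
    earth_radius_km = 6371.0
    dlat = math.degrees(dist_km / earth_radius_km)
    dlon = math.degrees(dist_km / (earth_radius_km * math.cos(math.radians(lat))))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)

# Roughly 10 km around the MPI (approx. lon 8.97, lat 47.75):
bbox = lonlat_bbox(8.97, 47.75, 10)
assert bbox[0] < 8.97 < bbox[2] and bbox[1] < 47.75 < bbox[3]
```

For the MPI's coordinates this gives approximately the same extent as the bbox reported in the osmdata output later in the post.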

We then use the key and value associated with bird hides in
OpenStreetMap: respectively leisure and bird_hide.

library(osmdata)
## Data (c) OpenStreetMap contributors, ODbL 1.0.
(results <- opq(bbox = bbox) %>%
  add_osm_feature(key = 'leisure', value = 'bird_hide') %>%
  osmdata_sf())
## Object of class 'osmdata' with:
##                  $bbox : 47.6865,8.8753,47.8494,9.1177
##         $overpass_call : The call submitted to the overpass API
##             $timestamp : [ Thu 5 Jul 2018 08:06:35 ]
##            $osm_points : 'sf' Simple Features Collection with 1 points
##             $osm_lines : 'sf' Simple Features Collection with 0 linestrings
##          $osm_polygons : 'sf' Simple Features Collection with 0 polygons
##        $osm_multilines : 'sf' Simple Features Collection with 0 multilinestrings
##     $osm_multipolygons : 'sf' Simple Features Collection with 0 multipolygons
##              leisure            geometry
## 5004940425 bird_hide 8.920901, 47.741569

Yay, we now know where to find a bird hide not too far from the MPI!

Visualizing our location and the bird hide

So one could enter the coordinates of that bird hide in one's favourite
mapping software or app, but to show you where the bird hide is, we can
actually step back and use osmplotr, another package contributed to
rOpenSci by Mark Padgham!

osmplotr works by letting you create a basemap, on which you add
different layers extracted from OpenStreetMap using
osmplotr::extract_osm_objects or osmdata functions directly. Its
strengths are therefore the use of open data, and the customization of
what you're using as background!

Let’s create a basemap for our bounding box, and then add roads and
buildings to it.

library(osmplotr)
## Data (c) OpenStreetMap contributors, ODbL 1.0.
bbox <- get_bbox(bbox)
dat_B <- extract_osm_objects (key = 'building', bbox = bbox)
dat_H <- extract_osm_objects (key = 'highway', bbox = bbox)

map0 <- osm_basemap(bbox = bbox, bg = 'gray20') %>%
  add_osm_objects (dat_B, col = 'gray40') %>%
  add_osm_objects (dat_H, col = 'gray80')
map0 %>%
  add_axes() %>%
  print_osm_map (filename = 'map_a1.png', width = 600,
               units = 'px', dpi = 72)

magick::image_read('map_a1.png') %>%
  magick::image_annotate("Map data © OpenStreetMap contributors",
                         color = "white",
                         boxcolor =  "black",
                         size = 15,
                         gravity = "south")

Quite pretty! The lakes can be seen because of the absence of roads and
buildings on them.

Now, let's plot the bird hide and the MPI on them. Since we used
osmdata::osmdata_sf, the data is already in an sf format that osmplotr
can receive directly:
points_map <- add_osm_objects(map0, results$osm_points, 
                              col = 'salmon',
                              size = 5)

For plotting the MPI, we’ll convert opencage output to an sf point
with the same coordinate reference system as the OpenStreetMap data
extracted with osmdata.

coords <- data.frame(lon = mpi$geometry.lng,
                     lat = mpi$geometry.lat)
crs <- sf::st_crs(results$osm_points)

mpi_sf <- sf::st_as_sf(coords,
                       coords = c("lon", "lat"), 
                       crs = crs)

points_map <- add_osm_objects(points_map, mpi_sf, 
                             col = 'white',
                             size = 5)

We can now visualize both points on the map, the MPI in white and the
bird hide in salmon, South-West from the MPI.

points_map %>%
  add_axes() %>%
  print_osm_map (filename = 'map_a2.png', 
                 width = 600,
                 units = 'px', dpi = 72) 

magick::image_read('map_a2.png') %>%
  magick::image_annotate("Map data © OpenStreetMap contributors",
                         color = "white",
                         boxcolor =  "black",
                         size = 15,
                         gravity = "south")

Aha, now we see where the bird hide is, fantastic! But as Mark noted,
birds can actually be observed from other places.

Birding where birds should be?

Birds are most likely to be found where water lies close to natural
areas, and we can translate this to R code! We shall get all water and
(separately) all non-watery natural areas and find the shortest
distances between them before plotting the results with osmplotr.

First, let's get all water and (separately) all non-watery natural
areas:
dat <- opq(bbox = bbox) %>%
     add_osm_feature(key = 'natural') %>%
     osmdata_sf (quiet = FALSE)
## Issuing query to Overpass API ...

## Rate limit: 2

## Query complete!

## converting OSM data to sf format
indx_W <- which (dat$osm_polygons$natural %in% c ("water", "wetland"))
indx_N <- which (!dat$osm_polygons$natural %in% c ("water", "wetland"))

xy_W <- sf::st_coordinates (dat$osm_polygons [indx_W, ])
xy_N <- sf::st_coordinates (dat$osm_polygons [indx_N, ])

Then we use Mark's geodist package to get pairwise distances between all
points, find minima, and make a data.frame to submit to
add_osm_surface(). We have 7068 watery points and 10065 non-watery
points, so the speed of geodist is crucial here!

t1 <- Sys.time()
d <- geodist::geodist (xy_W, xy_N)
# so fast!!!
Sys.time() - t1
## Time difference of 13.57983 secs
d1 <- apply (d, 1, min)
d2 <- apply (d, 2, min)
xy <- cbind (rbind (xy_W, xy_N), c (d1, d2))
xy <- xy [, c (1, 2, 5)]
colnames (xy) <- c ("x", "y", "z")
xy <- xy [!duplicated (xy), ]

Finally, we plot the results on the roads we got earlier, although we do
not recommend standing in the middle of a road to observe birds! We
re-add the points corresponding to the MPI and bird hide after the
surface colouring. With osmplotr, order matters because layers are
added on top of each other. Note that plotting the distances takes a
while.

# colorblind-friendly palette!
cols <- viridis::viridis_pal (direction = -1) (30)

add_osm_surface (map0, dat_H,
                 dat = xy, col = cols) %>%
  add_axes() %>%
  add_colourbar(cols = cols,
                zlim = range(xy[,"z"])) %>%
  add_osm_objects(results$osm_points,
                  col = 'white', size = 5) %>%
  add_osm_objects(mpi_sf,
                  col = 'white', size = 5) %>%
  print_osm_map (filename = 'map_a3.png', width = 600,
                 units = 'px', dpi = 72)

magick::image_read('map_a3.png') %>%
  magick::image_annotate("Map data © OpenStreetMap contributors",
                         color = "white",
                         boxcolor =  "black",
                         size = 15,
                         gravity = "south")

On the map, the yellower/lighter a road is, the better it is to observe
birds according to Mark’s assumption that birds are most likely to be
found where water lies close to natural areas. With this metric, the MPI
itself is not too badly located, which after all is not too surprising
for an MPI of ornithology. Maybe one should just walk to one of the
nearest lakes to meet some birds.


Open geographical data in R

In this post we showcased three rOpenSci packages helping you use open
geographical data in R: opencage, osmdata and osmplotr, therefore mostly
making use of the awesome OpenStreetMap data (the OpenCage Geocoder uses
a bit more, but OpenStreetMap is still the source warranting
attribution). We therefore added the annotation "Map data ©
OpenStreetMap contributors" to all maps of this post using rOpenSci's
magick package. You might also have noticed in the code earlier that
both osmdata and osmplotr have start-up messages indicating the data
origin and licence.

Another rOpenSci package interacting with open geographical data, which
might be of interest for making maps, is rnaturalearth, which
facilitates interaction with Natural Earth map data.
Other R packages for spatial analyses

We also used two other rOpenSci packages:
bbox to get a bounding box and
magick for image manipulation.
Explore more of our packages suite, including and beyond the geospatial
category, here.

We also used the geodist package for ultra lightweight, ultra fast
calculation of geo distances. This package is developed in the hypertidy
GitHub organization, which is a good place to find useful R geospatial
packages. Other good resources for geospatial analyses with R include
this great book by Robin Lovelace, Jakub Nowosad and Jannes Muenchow,
and more links presented in this blog post of Steph's.

More birding soon!

Stay tuned for the next post in this series, about getting bird
observation data in R! In the meantime, happy birding!

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Continue Reading…



August 13, 2018

Analysis of fake YouTube views

Wherever more attention or the appearance of it equates to more money, there are those who try to game the system. Michael H. Keller for The New York Times examines the business of fake YouTube views:

YouTube’s engineers, statisticians and data scientists are constantly improving in their ability to fight what Ms. O’Connor calls a “very hard problem,” but the attacks have “continually gotten stronger and more sophisticated,” she said.

After the Times reporter presented YouTube with the videos for which he had bought views, the company said sellers had exploited two vulnerabilities that had already been fixed. Later that day, the reporter bought more views from six of the same vendors. The view count rose again, though more slowly. A week later, all but two of the vendors had delivered the full amount.


Continue Reading…



What’s new on arXiv

Overarching Computation Model (OCM)

Existing models of computation, such as a Turing machine (hereafter, TM), do not consider the agent involved in interpreting the outcome of the computation. We argue that a TM, or any other computation model, has no significance if its output is not interpreted by some agent. Furthermore, we argue that including the interpreter in the model definition sheds light on some of the difficult problems faced in computation and mathematics. We provide an analytic process framework to address this limitation. The framework can be overlaid on existing concepts of computation to address many practical and philosophical concerns such as the P vs NP problem. In addition, we provide constructive proof for the P vs NP problem under the assumption that the class NP comprises of problems solvable by non-deterministic algorithms. We utilize the observation that deterministic computational procedures lack fundamental capacity to fully simulate their non-deterministic variant.

Maximum Weight Online Matching with Deadlines

We study the problem of matching agents who arrive at a marketplace over time and leave after d time periods. Agents can only be matched while they are present in the marketplace. Each pair of agents can yield a different match value, and the planner’s goal is to maximize the total value over a finite time horizon. First we study the case in which vertices arrive in an adversarial order. We provide a randomized 0.25-competitive algorithm building on a result by Feldman et al. (2009) and Lehman et al. (2006). We extend the model to the case in which departure times are drawn independently from a distribution with non-decreasing hazard rate, for which we establish a 1/8-competitive algorithm. When the arrival order is chosen uniformly at random, we show that a batching algorithm, which computes a maximum-weighted matching every (d+1) periods, is 0.279-competitive.
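The batching algorithm from the last result can be sketched as follows (illustrative Python, not the authors' code): agents arriving within a window of (d+1) periods are all simultaneously present, so computing one maximum-weight matching per window suffices; brute force is fine at toy sizes.

```python
def best_matching(agents, weight):
    """Brute-force maximum-weight matching on a general graph (toy sizes only)."""
    if len(agents) < 2:
        return 0.0, []
    first, rest = agents[0], agents[1:]
    best_w, best_m = best_matching(rest, weight)   # option: leave `first` unmatched
    for i, other in enumerate(rest):               # option: match `first` with `other`
        w, m = best_matching(rest[:i] + rest[i + 1:], weight)
        if weight(first, other) + w > best_w:
            best_w, best_m = weight(first, other) + w, [(first, other)] + m
    return best_w, best_m

# Toy batch: agents a, b, c all arrive within one (d+1)-period window,
# with made-up pairwise match values.
weights = {("a", "b"): 3, ("a", "c"): 5, ("b", "c"): 4}
w = lambda x, y: weights.get((x, y), weights.get((y, x), 0))
total, matched = best_matching(["a", "b", "c"], w)
assert total == 5 and matched == [("a", "c")]
```

In the paper's setting the matching per batch would be computed with a polynomial-time maximum-weight matching algorithm; the exponential enumeration here is only for clarity.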

NL4Py: Agent-Based Modeling in Python with Parallelizable NetLogo Workspaces

NL4Py is a NetLogo controller software for Python, for the rapid, parallel execution of NetLogo models. NL4Py provides both headless (no graphical user interface) and GUI NetLogo workspace control through Python. Spurred on by the increasing availability of open-source computation and machine learning libraries on the Python package index, there is an increasing demand for such rapid, parallel execution of agent-based models through Python. NetLogo, being the language of choice for a majority of agent-based modeling driven research projects, requires an integration to Python for researchers looking to perform statistical analyses of agent-based model output using these libraries. Unfortunately, until the recent introduction of PyNetLogo, and now NL4Py, such a controller was unavailable. This article provides a detailed introduction into the usage of NL4Py and explains its client-server software architecture, highlighting architectural differences to PyNetLogo. A step-by-step demonstration of global sensitivity analysis and parameter calibration of the Wolf Sheep Predation model is then performed through NL4Py. Finally, NL4Py’s performance is benchmarked against PyNetLogo and its combination with IPyParallel, and shown to provide significant savings in execution time over both configurations.

Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network

Because of their effectiveness in broad practical applications, LSTM networks have received a wealth of coverage in scientific journals, technical blogs, and implementation guides. However, in most articles, the inference formulas for the LSTM network and its parent, RNN, are stated axiomatically, while the training formulas are omitted altogether. In addition, the technique of ‘unrolling’ an RNN is routinely presented without justification throughout the literature. The goal of this paper is to explain the essential RNN and LSTM fundamentals in a single document. Drawing from concepts in signal processing, we formally derive the canonical RNN formulation from differential equations. We then propose and prove a precise statement, which yields the RNN unrolling technique. We also review the difficulties with training the standard RNN and address them by transforming the RNN into the ‘Vanilla LSTM’ network through a series of logical arguments. We provide all equations pertaining to the LSTM system together with detailed descriptions of its constituent entities. Albeit unconventional, our choice of notation and the method for presenting the LSTM system emphasizes ease of understanding. As part of the analysis, we identify new opportunities to enrich the LSTM system and incorporate these extensions into the Vanilla LSTM network, producing the most general LSTM variant to date. The target reader has already been exposed to RNNs and LSTM networks through numerous available resources and is open to an alternative pedagogical approach. A Machine Learning practitioner seeking guidance for implementing our new augmented LSTM model in software for experimentation and research will find the insights and derivations in this tutorial valuable as well.
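The "unrolling" technique the abstract refers to can be made concrete with a tiny numpy sketch (an illustration, not the paper's derivation; sizes and inputs are arbitrary): the same weight matrices are reused at every time step, so the recurrence becomes a finite feed-forward chain.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 3, 4              # sequence length, input size, state size
Wx = rng.normal(size=(n_hid, n_in))   # input-to-hidden weights (shared across t)
Wh = rng.normal(size=(n_hid, n_hid))  # hidden-to-hidden weights (shared across t)
b = np.zeros(n_hid)

x = rng.normal(size=(T, n_in))        # a toy input sequence
h = np.zeros(n_hid)                   # initial hidden state
states = []
for t in range(T):                    # the unrolled (feed-forward) computation
    h = np.tanh(Wx @ x[t] + Wh @ h + b)
    states.append(h)
```

Backpropagation through time then just applies ordinary backpropagation to this unrolled chain, accumulating gradients into the shared Wx and Wh.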

Fuzzy Clustering to Identify Clusters at Different Levels of Fuzziness: An Evolutionary Multi-Objective Optimization Approach

Fuzzy clustering methods identify naturally occurring clusters in a dataset, where the extent to which different clusters are overlapped can differ. Most methods have a parameter to fix the level of fuzziness. However, the appropriate level of fuzziness depends on the application at hand. This paper presents Entropy c-Means (ECM), a method of fuzzy clustering that simultaneously optimizes two contradictory objective functions, resulting in the creation of fuzzy clusters with different levels of fuzziness. This allows ECM to identify clusters with different degrees of overlap. ECM optimizes the two objective functions using two multi-objective optimization methods, Non-dominated Sorting Genetic Algorithm II (NSGA-II), and Multiobjective Evolutionary Algorithm based on Decomposition (MOEA/D). We also propose a method to select a suitable trade-off clustering from the Pareto front. Experiments on challenging synthetic datasets as well as real-world datasets show that ECM leads to better cluster detection compared to the conventional fuzzy clustering methods as well as previously used multi-objective methods for fuzzy clustering.
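For context, the fixed fuzziness parameter the paper seeks to avoid is the exponent m of conventional fuzzy c-means, sketched below (a baseline illustration with made-up toy data, not the ECM method itself).

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)             # fuzzy memberships: rows sum to 1
    for _ in range(iters):
        Um = U ** m                               # the fixed fuzziness exponent at work
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))          # closer centers get larger membership
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# two well-separated toy blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centers, U = fuzzy_c_means(X, c=2)
```

Larger m makes the membership matrix U more uniform (fuzzier clusters); ECM instead explores a range of fuzziness levels via its two objectives.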

Linked Causal Variational Autoencoder for Inferring Paired Spillover Effects

Modeling spillover effects from observational data is an important problem in economics, business, and other fields of research. It helps us infer the causality between two seemingly unrelated sets of events. For example, if consumer spending in the United States declines, it has spillover effects on economies that depend on the U.S. as their largest export market. In this paper, we aim to infer the causation that results in spillover effects between pairs of entities (or units); we call this effect the ‘paired spillover’. To achieve this, we leverage recent developments in variational inference and deep learning techniques to propose a generative model called Linked Causal Variational Autoencoder (LCVA). Similar to variational autoencoders (VAE), LCVA incorporates an encoder neural network to learn the latent attributes and a decoder network to reconstruct the inputs. However, unlike VAE, LCVA treats the latent attributes as confounders that are assumed to affect both the treatment and the outcome of units. Specifically, given a pair of units u and \bar{u}, their individual treatments and outcomes, the encoder network of LCVA samples the confounders by conditioning on the observed covariates of u, the treatments of both u and \bar{u} and the outcome of u. Once inferred, the latent attributes (or confounders) of u capture the spillover effect of \bar{u} on u. Using a network of users from the job training dataset (LaLonde (1986)) and a co-purchase dataset from the Amazon e-commerce domain, we show that LCVA is significantly more robust than existing methods in capturing spillover effects.

Exploiting Structure for Fast Kernel Learning

We propose two methods for exact Gaussian process (GP) inference and learning on massive image, video, spatial-temporal, or multi-output datasets with missing values (or ‘gaps’) in the observed responses. The first method ignores the gaps using sparse selection matrices and a highly effective low-rank preconditioner is introduced to accelerate computations. The second method introduces a novel approach to GP training whereby response values are inferred on the gaps before explicitly training the model. We find this second approach to be greatly advantageous for the class of problems considered. Both of these novel approaches make extensive use of Kronecker matrix algebra to design massively scalable algorithms which have low memory requirements. We demonstrate exact GP inference for a spatial-temporal climate modelling problem with 3.7 million training points as well as a video reconstruction problem with 1 billion points.

Fast Flexible Function Dispatch in Julia

Technical computing is a challenging application area for programming languages to address. This is evinced by the unusually large number of specialized languages in the area (e.g. MATLAB, R), and the complexity of common software stacks, often involving multiple languages and custom code generators. We believe this is ultimately due to key characteristics of the domain: highly complex operators, a need for extensive code specialization for performance, and a desire for permissive high-level programming styles allowing productive experimentation. The Julia language attempts to provide a more effective structure for this kind of programming by allowing programmers to express complex polymorphic behaviors using dynamic multiple dispatch over parametric types. The forms of extension and reuse permitted by this paradigm have proven valuable for technical computing. We report on how this approach has allowed domain experts to express useful abstractions while simultaneously providing a natural path to better performance for high-level technical code.

Fast computation of the principal components of genotype matrices in Julia

Finding the largest few principal components of a matrix of genetic data is a common task in genome-wide association studies (GWASs), both for dimensionality reduction and for identifying unwanted factors of variation. We describe a simple random matrix model for matrices that arise in GWASs, showing that the singular values have a bulk behavior that obeys a Marchenko-Pastur distribution with a handful of large outliers. We also implement Golub-Kahan-Lanczos (GKL) bidiagonalization in the Julia programming language, providing thick restarting and a choice between full and partial reorthogonalization strategies to control numerical roundoff. Our implementation of GKL bidiagonalization is up to 36 times faster than software tools used commonly in genomics data analysis for computing principal components, such as EIGENSOFT and FlashPCA, which use dense LAPACK routines and randomized subspace iteration respectively.

Hierarchical Block Sparse Neural Networks

Sparse deep neural networks (DNNs) are efficient in both memory and compute when compared to dense DNNs. But due to the irregular computation patterns of sparse DNNs, their efficiency is much lower than that of dense DNNs on general-purpose hardware, which leads to poor or no performance benefit. The performance of sparse DNNs can be improved by imposing structure on the sparsity and exploiting it at runtime, but such structural constraints often lead to sparse models with suboptimal accuracy. In this work, we jointly address both the accuracy and the performance of sparse DNNs using our proposed class of neural networks called HBsNN (Hierarchical Block Sparse Neural Networks).

Lingke: A Fine-grained Multi-turn Chatbot for Customer Service

Traditional chatbots usually need a mass of human dialogue data, especially when using supervised machine learning methods. Though they can easily handle single-turn question answering, their performance on multi-turn dialogue is usually unsatisfactory. In this paper, we present Lingke, an information-retrieval-augmented chatbot that can answer questions based on a given product introduction document and handle multi-turn conversations. We introduce a fine-grained pipeline to distill responses from unstructured documents, and attentive sequential context-response matching for multi-turn conversations.

Bagging of Density Estimators

In this work we give new density estimators obtained by averaging classical density estimators, such as the histogram, the frequency polygon and the kernel density estimator, over different bootstrap samples of the original data. We prove the L2-consistency of these new estimators and compare them to several similar approaches in extensive simulations. Based on them, we also give a way to construct nonparametric pointwise confidence intervals for the target density.
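The bagging idea above can be sketched in a few lines: fit the same simple estimator (here a histogram; the paper also covers frequency polygons and KDEs) on many bootstrap resamples and average the results. The data and grid below are made up for illustration.

```python
import numpy as np

def bagged_histogram(data, grid, n_boot=100, bins=10, seed=0):
    rng = np.random.default_rng(seed)
    est = np.zeros_like(grid, dtype=float)
    for _ in range(n_boot):
        boot = rng.choice(data, size=len(data), replace=True)   # bootstrap sample
        dens, edges = np.histogram(boot, bins=bins, density=True)
        idx = np.clip(np.searchsorted(edges, grid, side="right") - 1, 0, bins - 1)
        est += dens[idx]                                        # evaluate this estimator on the grid
    return est / n_boot                                         # the bagged (averaged) estimate

data = np.random.default_rng(42).normal(size=500)
grid = np.linspace(-3, 3, 61)
f_hat = bagged_histogram(data, grid)
```

Averaging over bootstrap replicates smooths out the blocky histogram steps, which is where the variance reduction of bagging comes from.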

AIQ: Measuring Intelligence of Business AI Software

Focusing on Business AI, this article introduces the AIQ quadrant that enables us to measure AI for business applications in a relative comparative manner, i.e. to judge that software A has more or less intelligence than software B. Recognizing that the goal of business software is to maximize value in terms of business results, the dimensions of the quadrant are the key factors that determine the business value of AI software: Level of Output Quality (Smartness) and Level of Automation. The use of the quadrant is illustrated by several software solutions to support the real-life business challenge of field service scheduling. The role of machine learning and conversational digital assistants in increasing the business value is also discussed and illustrated with a recent integration of existing intelligent digital assistants for factory-floor decision making with the new version of Google Glass. Such hands-free AI solutions elevate the AIQ level to its ultimate position.

TwoWingOS: A Two-Wing Optimization Strategy for Evidential Claim Verification

Determining whether a given claim is supported by evidence is a fundamental NLP problem that is best modeled as Textual Entailment. However, given a large collection of text, finding evidence that could support or refute a given claim is a challenge in itself, amplified by the fact that different evidence might be needed to support or refute a claim. Nevertheless, most prior work decouples evidence identification from determining the truth value of the claim given the evidence. We propose to consider these two aspects jointly. We develop TwoWingOS (two-wing optimization strategy), a system that, while identifying appropriate evidence for a claim, also determines whether or not the claim is supported by the evidence. Given the claim, TwoWingOS attempts to identify a subset of the evidence candidates; given the predicted evidence, it then attempts to determine the truth value of the corresponding claim. We treat this challenge as coupled optimization problems, training a joint model for it. TwoWingOS offers two advantages: (i) unlike pipeline systems, it supports flexible-size evidence sets, and (ii) joint training improves both the claim entailment and the evidence identification. Experiments on a benchmark dataset show state-of-the-art performance. Code: https://…/FEVER

Greedy Algorithms for Approximating the Diameter of Machine Learning Datasets in Multidimensional Euclidean Space

Finding the diameter of a dataset in multidimensional Euclidean space is a well-established problem, with well-known algorithms. However, most of the algorithms found in the literature do not scale well with large values of data dimension, so the time complexity grows exponentially in most cases, which makes these algorithms impractical. Therefore, we implemented four simple greedy algorithms for approximating the diameter of a multidimensional dataset; these are based on minimum/maximum l2 norms, hill climbing search, Tabu search and Beam search approaches, respectively. The time complexity of the implemented algorithms is near-linear, as they scale near-linearly with data size and dimension. The results of experiments conducted on different machine learning datasets demonstrate the efficiency of the implemented algorithms, which can therefore be recommended for approximating the diameter whenever machine learning applications need it.
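The two simplest strategies named in the abstract can be sketched as follows (the exact details are assumptions on my part): take the distance between the minimum- and maximum-norm points, and run a hill climb that repeatedly jumps to the farthest point from the current one. Both are lower bounds on the true diameter, and the first hill-climb step alone already guarantees at least half of it by the triangle inequality.

```python
import numpy as np

def diameter_minmax_norm(X):
    norms = np.linalg.norm(X, axis=1)
    return np.linalg.norm(X[np.argmax(norms)] - X[np.argmin(norms)])

def diameter_hill_climb(X, start=0, max_steps=10):
    i, best = start, 0.0
    for _ in range(max_steps):            # each step is O(n), so near-linear overall
        d = np.linalg.norm(X - X[i], axis=1)
        j = int(np.argmax(d))
        if d[j] <= best:                  # no farther point found: stop
            break
        best, i = d[j], j
    return best

X = np.random.default_rng(0).normal(size=(200, 5))
approx = max(diameter_minmax_norm(X), diameter_hill_climb(X))
```

The exact diameter needs all O(n^2) pairwise distances, which is what these heuristics avoid in high dimensions.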

Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning

Multi-layer neural networks have led to remarkable performance on many kinds of benchmark tasks in text, speech and image processing. Nonlinear parameter estimation in hierarchical models is known to be subject to overfitting. One approach to this overfitting and related problems (local minima, colinearity, feature discovery etc.) is called dropout (Srivastava, et al 2014, Baldi et al 2016). This method removes hidden units with a Bernoulli random variable with probability p over updates. In this paper we show that dropout is a special case of a more general model published originally in 1990 called the stochastic delta rule (SDR, Hanson, 1990). SDR parameterizes each weight in the network as a random variable with mean \mu_{w_{ij}} and standard deviation \sigma_{w_{ij}}. These random variables are sampled on each forward activation, consequently creating an exponential number of potential networks with shared weights. Both parameters are updated according to prediction error, thus implementing weight noise injections that reflect a local history of prediction error and efficient model averaging. SDR therefore implements a local gradient-dependent simulated annealing per weight, converging to a Bayes-optimal network. Tests on standard benchmarks (CIFAR) using a modified version of DenseNet show that SDR outperforms standard dropout in error by over 50% and in loss by over 50%. Furthermore, the SDR implementation converges on a solution much faster, reaching a training error of 5 in just 15 epochs with DenseNet-40 compared to standard DenseNet-40’s 94 epochs.
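The core SDR mechanism described above is easy to sketch: each weight is a Gaussian random variable, re-sampled on every forward pass. This toy numpy illustration shows the sampling and the implicit model averaging only; the paper additionally updates both mu and sigma from the prediction error, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
mu = rng.normal(size=(n_out, n_in))      # per-weight means
sigma = 0.1 * np.ones((n_out, n_in))     # per-weight standard deviations

def sdr_forward(x):
    W = rng.normal(mu, sigma)            # sample one of exponentially many shared-weight nets
    return W @ x

x = np.ones(n_in)
samples = np.stack([sdr_forward(x) for _ in range(2000)])
mean_out = samples.mean(axis=0)          # averaging recovers the mean network's output
```

Dropout corresponds to a degenerate special case of this weight distribution, which is the paper's central claim.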

How Complex is your classification problem? A survey on measuring classification complexity

Extracting characteristics from the training datasets of classification problems has proven effective in a number of meta-analyses. Among them, measures of classification complexity can estimate the difficulty in separating the data points into their expected classes. Descriptors of the spatial distribution of the data and estimates of the shape and size of the decision boundary are among the existing measures for this characterization. This information can support the formulation of new data-driven pre-processing and pattern recognition techniques, which can in turn be focused on challenging characteristics of the problems. This paper surveys and analyzes measures which can be extracted from the training datasets in order to characterize the complexity of the respective classification problems. Their use in recent literature is also reviewed and discussed, allowing us to identify opportunities for future work in the area. Finally, descriptions are given of an R package named Extended Complexity Library (ECoL) that implements a set of complexity measures and is made publicly available.
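One classical measure of this kind is Fisher's discriminant ratio (often called F1): the best single-feature separation between two classes. The sketch below is a hedged illustration with made-up data, not taken from the ECoL package; a higher score means an easier (less complex) problem.

```python
import numpy as np

def fisher_ratio(X, y):
    """Max over features of (mu0 - mu1)^2 / (var0 + var1), for two classes."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return float((num / den).max())

rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)
easy = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
hard = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(0.2, 1.0, (100, 2))])
# the well-separated problem scores far higher than the overlapping one
```

Measures like this are computed on the training set alone, before any classifier is fitted, which is what makes them useful for meta-analysis.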

Ensemble Kalman Inversion: A Derivative-Free Technique For Machine Learning Tasks

The standard probabilistic perspective on machine learning gives rise to empirical risk-minimization tasks that are frequently solved by stochastic gradient descent (SGD) and variants thereof. We present a formulation of these tasks as classical inverse or filtering problems and, furthermore, we propose an efficient, gradient-free algorithm for finding a solution to these problems using ensemble Kalman inversion (EKI). Applications of our approach include offline and online supervised learning with deep neural networks, as well as graph-based semi-supervised learning. The essence of the EKI procedure is an ensemble based approximate gradient descent in which derivatives are replaced by differences from within the ensemble. We suggest several modifications to the basic method, derived from empirically successful heuristics developed in the context of SGD. Numerical results demonstrate wide applicability and robustness of the proposed algorithm.
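The "differences from within the ensemble" idea can be sketched on a toy linear inverse problem (an illustrative basic EKI iteration in numpy, not the paper's full method; the forward model, sizes and noise level are made up): sample cross-covariances between parameters and model outputs play the role of a gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))                # toy linear forward model G(u) = A u
u_true = np.array([1.0, -2.0, 0.5])
y = A @ u_true                              # synthetic (noise-free) observations

ens = rng.normal(size=(50, 3))              # ensemble of parameter guesses
gamma = 1e-4 * np.eye(20)                   # (small) observation-noise covariance
for _ in range(10):
    G = ens @ A.T                           # forward map of every member
    du = ens - ens.mean(axis=0)
    dG = G - G.mean(axis=0)
    C_uG = du.T @ dG / (len(ens) - 1)       # cross-covariance stands in for a gradient
    C_GG = dG.T @ dG / (len(ens) - 1)
    K = C_uG @ np.linalg.inv(C_GG + gamma)  # Kalman-style gain
    ens = ens + (y - G) @ K.T               # derivative-free update of each member
u_est = ens.mean(axis=0)
```

Because no derivative of G is ever taken, the same loop applies unchanged when G is a black box, which is the appeal for training tasks where gradients are unavailable or unreliable.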

‘Quantum Equilibrium-Disequilibrium’: Asset Price Dynamics, Symmetry Breaking, and Defaults as Dissipative Instantons
Effective Caching for the Secure Content Distribution in Information-Centric Networking
Self-Adaptive Systems in Organic Computing: Strategies for Self-Improvement
A Hybrid Recommender System for Patient-Doctor Matchmaking in Primary Care
Evolutionary optimisation of neural network models for fish collective behaviours in mixed groups of robots and zebrafish
Random forest prediction of Alzheimer’s disease using pairwise selection from time series data
VerIDeep: Verifying Integrity of Deep Neural Networks through Sensitive-Sample Fingerprinting
Temporal starvation in multi-channel CSMA networks: an analytical framework
Who Falls for Online Political Manipulation?
The frog model on trees with drift
Code-Mixed Sentiment Analysis Using Machine Learning and Neural Network Approaches
$α$-Approximation Density-based Clustering of Multi-valued Objects
On-Chip Optical Convolutional Neural Networks
The Elephant in the Room
Longest Increasing Subsequence under Persistent Comparison Errors
The Effectiveness of Multitask Learning for Phenotyping with Electronic Health Records Data
DeepMag: Source Specific Motion Magnification Using Gradient Ascent
Transport coefficients in multi-component fluids from equilibrium molecular dynamics
On Physical Layer Security over Fox’s $H$-Function Wiretap Fading Channels
Deep Learning for Single Image Super-Resolution: A Brief Review
Uncovering the Spread of Chagas Disease in Argentina and Mexico
Efficient human-like semantic representations via the Information Bottleneck principle
Sequence-Based OOK for Orthogonal Multiplexing of Wake-up Radio Signals and OFDM Waveforms
Error Forward-Propagation: Reusing Feedforward Connections to Propagate Errors in Deep Learning
Collective irregular dynamics in balanced networks of leaky integrate-and-fire neurons
A Panel Quantile Approach to Attrition Bias in Big Data: Evidence from a Randomized Experiment
A note on partial rejection sampling for the hard disks model in the plane
Blue Phase: Optimal Network Traffic Control for Legacy and Autonomous Vehicles
A survey of data transfer and storage techniques in prevalent cryptocurrencies and suggested improvements
Code-division multiplexed resistive pulse sensor networks for spatio-temporal detection of particles in microfluidic devices
Classes of graphs with e-positive chromatic symmetric function
Distinctiveness, complexity, and repeatability of online signature templates
End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning
A note on the critical barrier for the survival of $α-$stable branching random walk with absorption
On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks
Linear time algorithm to check the singularity of block graphs
Efficient Measurement on Programmable Switches Using Probabilistic Recirculation
DeepWrinkles: Accurate and Realistic Clothing Modeling
(De)localization of Fermions in Correlated-Spin Background
Learning and Inference on Generative Adversarial Quantum Circuits
WonDerM: Skin Lesion Classification with Fine-tuned Neural Networks
Power Minimization Based Joint Task Scheduling and Resource Allocation in Downlink C-RAN
Stochastic $R_0$ Tensors to Stochastic Tensor Complementarity Problems
Hybrid approach for transliteration of Algerian arabizi: a primary study
Spin systems on Bethe lattices
On the optimal designs for the prediction of complex Ornstein-Uhlenbeck processes
The Moment-SOS hierarchy
Stability for Intersecting Families of Perfect Matchings
Weakly-Supervised Attention and Relation Learning for Facial Action Unit Detection
The Evolution of Sex Chromosomes through the Baldwin Effect
Cross-location wind speed forecasting for wind energy applications using machine learning based models
Deep Learning Based Speed Estimation for Constraining Strapdown Inertial Navigation on Smartphones
Pulse-laser Based Long-range Non-line-of-sight Ultraviolet Communication with Pulse Response Position Estimation
Construction of cospectral graphs
On the Complexity of Solving Subtraction Games
The Power of Cut-Based Parameters for Computing Edge Disjoint Paths
Extremal process of the zero-average Gaussian Free Field for $d\ge 3$
Model Approximation Using Cascade of Tree Decompositions
ChipNet: Real-Time LiDAR Processing for Drivable Region Segmentation on an FPGA
Making effective use of healthcare data using data-to-text technology
Band selection with Higher Order Multivariate Cumulants for small target detection in hyperspectral images
Optimizing error of high-dimensional statistical queries under differential privacy
On testing for high-dimensional white noise
Evaluation of the Spatial Consistency Feature in the 3GPP GSCM Channel Model
Atmospheric turbulence mitigation for sequences with moving objects using recursive image fusion
Dynamic all scores matrices for LCS score
Ektelo: A Framework for Defining Differentially-Private Computations
Existence of symmetric maximal noncrossing collections of $k$-element sets
Finding a Small Number of Colourful Components
Choosing the optimal multi-point iterative method for the Colebrook flow friction equation — Numerical validation
Densely Connected Convolutional Networks for Speech Recognition
Physics-based Learned Design: Optimized Coded-Illumination for Quantitative Phase Imaging
Enumerating Anchored Permutations with Bounded Gaps
Weakly- and Semi-Supervised Panoptic Segmentation
Johnson type bounds for mixed dimension subspace codes
Shape differentiability of Lagrangians and application to Stokes problem
One-dimensional quasicrystals with power-law hopping
On the moments of the (2+1)-dimensional directed polymer and stochastic heat equation in the critical window
A simplified convolutional sparse filter for impulsive signature enhancement and its application to the prognostic of rotating machinery
New global optimality conditions for nonsmooth DC optimization problems
On the structure of the set of active sets in constrained linear quadratic regulation
Overlapping community detection using superior seed set selection in social networks
Rigidity of proper colorings of $\mathbb{Z}^d$
Using Randomness to Improve Robustness of Machine-Learning Models Against Evasion Attacks
Disease Progression Timeline Estimation for Alzheimer’s Disease using Discriminative Event Based Modeling
Weakly supervised learning of indoor geometry by dual warping
An Iterative Path-Breaking Approach with Mutation and Restart Strategies for the MAX-SAT Problem
Reliability of Relay Networks under Random Linear Network Coding
Automorphism groups of Steiner triple systems
Improved Bit-Flipping Algorithm for Successive Cancellation Decoding of Polar Codes
A Novel Method for Epileptic Seizure Detection Using Coupled Hidden Markov Models
A New Algorithm for the Robust Semi-random Independent Set Problem


PCA revisited: using principal components for classification of faces

(This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers)

This is a short post following the previous one (PCA revisited).

In this post I’m going to apply PCA to a toy problem: the classification of faces. Again I’ll be working on the Olivetti faces dataset. Please visit the previous post PCA revisited to read how to download it.

The goal of this post is to fit a simple classification model to predict, given an image, the label to which it belongs. I’m going to fit two support vector machine models and then compare their accuracy.


  1. The first model uses as input data the raw (scaled) pixels (all 4096 of them). Let’s call this model “the data model”.
  2. The second model uses as input data only some principal components. Let’s call this model “the PCA model”.

This will not be an in-depth and detailed fit and subsequent evaluation of the model, but rather a simple proof of concept to prove that a limited number of principal components can be used to perform a classification task instead of using the actual data.

The setup of the two models

In order to get an idea of what accuracy our models can achieve, I decided to do a simple 20-fold cross-validation of each model.

The data model is a simple support vector machine that is fitted using all the faces in the training set: its input data is a vector of 4096 components.

The PCA model is again the same support vector machine (with the same hyperparameters, which however may need some tweaking) fitted using 30 PCs.

By performing PCA on the dataset I transformed the data and, according to the analysis, 30 PCs account for about 82% of the total variance in the dataset. Indeed you can see from the plots below that the first 30 eigenvalues are those with the highest magnitude.




After running the two cross-validations on the two different models, the main results I found are the following:

  1. The average accuracy of the two models is about the same (the PCA model is slightly better): 93% vs 94%. The standard deviation also favours the PCA model: 3.2% vs 2.3%. The accuracy isn’t great (I obtained 98% on this dataset using other models), but this is just an “out of the box” test with no hyperparameter tuning or other fine-tuning.
  2. The second model fits the data much faster than the first one.

The results are summarised in the boxplot below


Note that the PCA model has less information available than the model using the original data, but that information is condensed into fewer variables (the 30 PCs). This technique is great when you are using models sensitive to correlation in the data: since the PCs are uncorrelated, you can transform your original data (which might be correlated and causing problems in the model) and use the principal components instead.

Another nice advantage is that, since the PCA model uses fewer variables, it is faster. Note that I decided to apply PCA to the whole dataset, not only to the training set of each round of cross-validation, although it could be argued that in practice you only have access to the training set. I then used the eigenvectors to obtain the PCs for both the training and testing sets of each cross-validation iteration.
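Since the author's R code did not survive in this excerpt, here is a hedged sketch of the same experiment in Python with scikit-learn (an illustration, not the author's code: scikit-learn's small digits dataset stands in for the Olivetti faces, and the pipeline refits PCA inside each fold rather than on the whole dataset).

```python
# Compare an SVM on raw (scaled) pixels vs. the same SVM on 30 principal
# components, each scored with 20-fold cross-validation.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # stand-in for the Olivetti faces

# the "data model": SVM on all raw pixels
data_model = make_pipeline(StandardScaler(), SVC())
acc_raw = cross_val_score(data_model, X, y, cv=20).mean()

# the "PCA model": the same SVM fed only 30 PCs
pca_model = make_pipeline(StandardScaler(), PCA(n_components=30), SVC())
acc_pca = cross_val_score(pca_model, X, y, cv=20).mean()
```

Putting PCA inside the pipeline refits it on each fold's training data only, which avoids the train/test leakage question raised above.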

The code I used is below.

To leave a comment for the author, please follow the link and comment on their blog: The Beginner Programmer. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...


In-brief: splashr update + High Performance Scraping with splashr, furrr & TeamHG-Memex’s Aquarium

(This article was first published on R –, and kindly contributed to R-bloggers)

The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running and it’s best to use it in a Docker container.

If you have a large number of sites to scrape, working with splashr and Splash “as-is” can be a bit frustrating since there’s a limit to what a single instance can handle. Sure, it’s possible to set up your own highly available, multi-instance Splash cluster and use it, but that’s work. Thankfully, the folks behind TeamHG-Memex created Aquarium, which uses docker and docker-compose to stand up multiple Splash instances behind a pre-configured HAProxy instance so you can take advantage of parallel requests to the Splash API. As long as you have docker and docker-compose handy (and Python), following the steps on the aforelinked GitHub page should have you up and running with Aquarium in minutes. You use the same default port (8050) to access the Splash API and you get a bonus port of 8036 to watch in your browser (the HAProxy stats page).

This works well when combined with furrr, which is an R package that makes parallel operations very tidy.

One way to use furrr, splashr and Aquarium might look like this:


library(splashr)
library(furrr)    # future_map()
library(purrr)    # safely()
library(urltools) # domain()
library(HARtools) # writeHAR()

sites <- c("http://...", "http://...", ...) # your list of unique URLs

make_a_splash <- function(org_url) {
  splash(
    host = "ip/name of system you started aquarium on", 
    user = "your splash api username", 
    pass = "your splash api password"
  ) %>% 
    splash_response_body(TRUE) %>% # we want to get all the content 
    splash_user_agent(ua_win10_ie11) %>% # splashr has many pre-configured user agents to choose from 
    splash_go(org_url) %>% 
    splash_wait(5) %>% # pick a reasonable timeout; modern web sites with javascript are bloated
    splash_har() # hand back the full HAR for the page
}

safe_splash <- safely(make_a_splash) # splashr/Splash work well but can throw errors. Let's be safe

plan(multiprocess, workers=5) # don't overwhelm the default setup or your internet connection

future_map(sites, ~{
  org <- safe_splash(.x) # go get it!
  if (is.null(org$result)) {
    sprintf("Error retrieving %s (%s)", .x, org$error$message) # this gives us good error messages
  } else {
    HARtools::writeHAR( # HAR format saves *everything*. the files are YUGE
      har = org$result, 
      file = file.path("/place/to/store/stuff", sprintf("%s.har", domain(.x))) # saved with the base domain; you may want to use a UUID via uuid::UUIDgenerate()
    )
    sprintf("Successfully retrieved %s", .x)
  }
}) -> results

(Those with a keen eye will grok why splashr supports Splash API basic authentication, now)

The parallel iterator will return a list that we can flatten to a character vector to check for errors. (I don’t do that by default since it’s safer to get a list back, as a list can hold anything, while map_chr() insists on proper character results.) Something like:

flatten_chr(results) %>% 
  keep(str_detect, "Error")
## [1] "Error retrieving (Service Unavailable (HTTP 503).)"
## [2] "Error retrieving (Gateway Timeout (HTTP 504).)"
## [3] "Error retrieving (Bad Gateway (HTTP 502).)"
## [4] "Error retrieving (Bad Gateway (HTTP 502).)"
## [5] "Error retrieving (Gateway Timeout (HTTP 504).)"

Timeouts would suggest you may need to up the timeout parameter in your Splash call. Service unavailable or bad gateway errors may suggest you need to tweak the Aquarium configuration to add more workers or reduce your plan(…). It’s not unusual to have to create a scraping process that accounts for errors and retries a certain number of times.
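The retry idea can live outside of R entirely. Here is a minimal, hedged sketch in Python of retry-with-exponential-backoff (with_retries and the flaky fetcher are illustrative names for this example, not part of splashr or Splash):

```python
import time

def with_retries(fetch, url, max_tries=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff.

    Backoff suits the transient 502/503/504 errors a busy
    Splash/HAProxy cluster tends to return under load."""
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of tries; surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# usage: a fetcher that fails twice with a 503, then succeeds
calls = {"n": 0}

def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Service Unavailable (HTTP 503)")
    return "har content for " + url

result = with_retries(flaky, "http://example.com", base_delay=0)
```

The same shape works in R with a repeat loop around safe_splash() and Sys.sleep().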

If you were stuck in the splashr/Splash slow-lane before, give this a try to help save you some time and frustration.

To leave a comment for the author, please follow the link and comment on their blog: R – offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

The dangers of AVX-512 throttling: myth or reality?

Modern processors use many tricks to go faster. They are superscalar which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that they have instructions that can operate over wide registers (spanning 128, 256 or even 512 bits).

Regarding vectorization, Intel is currently ahead of the curve with its AVX-512 instruction sets. They have the only commodity-level processors able to work over 512-bit registers. AMD is barely doing 256 bits and your phone is limited to 128 bits.

The more you use your CPU, the more heat it produces. Intel does not want your CPU to burn out. So it throttles the CPU (makes it run slower). Your CPU stays warm but not too hot. When the processor does not have to use AVX-512 instructions, some of the silicon remains dark, and thus less heat is generated.

The cores that are executing SIMD instructions may run at a lower frequency. Thankfully, Intel provides per-core voltage and frequency control, and Intel processors can switch frequency very fast, certainly within a millisecond.

Vlad Krasnov from Cloudflare wrote a blog post last year warning us against AVX-512 throttling:

If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

I am sure there are use cases where AVX-512 causes problems. It is also the case that some people die if you give them aspirin; yet we don’t retire aspirin.

Should we really disable AVX-512 as a precautionary stance?

In an earlier blog post, I tried to measure this throttling on a server I own but found no effect whatsoever. It could be that this server does not have AVX-512 throttling or maybe I did not make use of sufficiently expensive instructions.

Vlad offered me a test case in C. His test case involves AVX-512 multiplications, while much of the running time is spent on a bubble-sort routine. It can run in both AVX-512 mode and in the regular (non-AVX-512) mode. To be clear, it is not meant to be a demonstration in favour of AVX-512: it is meant to show that AVX-512 can be detrimental.

I did not want to run my tests using my own server this time. So I went to Packet and launched a powerful two-CPU Xeon Gold server (Intel Xeon Gold 5120 CPU @ 2.20GHz). Each of these processors has 14 cores, so we have 28 cores in total. Because of hyperthreading, it supports up to 56 hardware threads (running two threads per core).

threads   AVX-512 disabled   with AVX-512
20        8.4 s              7.2 s
40        6.0 s              5.0 s
80        5.7 s              4.7 s

In an earlier test, I was using an older compiler (GCC 5) and when exceeding the number of physically supported threads (56), I was getting poorer results, especially with AVX-512. I am not sure why that would be but I suspect it is not related to throttling; it might have to do with context switching and register initialization (though that is speculation on my part).

In my latest test, I use a more recent compiler (GCC 7). As you can see, the AVX-512 version is always faster. Otherwise, I see no negative effect from the application of AVX-512. If there is throttling, it appears that the benefits of AVX-512 offset it.

My code is available along with all the scripts and the outputs for your inspection. You should be able to reproduce my results. It is not like Xeon Gold processors are magical faeries: anyone can grab an instance. For the record, the bill I got from Packet was $2.

Note: I have no conflict of interest to disclose. I do not own Intel stock.

Continue Reading…


Read More

Book Memo: “Applied Artificial Intelligence”

Where AI Can Be Used In Business
This book deals with artificial intelligence (AI) and its several applications. It is not an organic text that should be read from the first page onwards, but rather a collection of articles that can be read at will (or at need). The idea of this work is indeed to provide some food for thought on how AI is impacting a few verticals (insurance and financial services), affecting horizontal and technical applications (speech recognition and blockchain), and changing organizational structures (introducing new figures or dealing with ethical issues).

Continue Reading…


Read More

Time series intervention analysis with fuel prices by @ellis2013nz

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

A couple of months ago I blogged about consumer spending on vehicle fuel by income. The impetus for that post was the introduction on 1 July 2018 of an 11.5 cent per litre “regional fuel tax” in Auckland.

One vehement critic of the levy, largely on the grounds of its impact on the poor, has been Sam Warburton (@Economissive on Twitter) of the New Zealand Initiative. Since early May 2018 (ie seven weeks before the fuel levy began), Sam has been collecting fuel prices from, and he recently made the data available.

One of the issues of controversy about a levy like this is whether it will lead to “price spreading” – fuel companies absorbing some of the extra tax in Auckland and increasing prices in other regions. A relatively small number of firms make retail fuel pricing decisions in New Zealand, so it’s plausible that imperfect competition makes this possible. I had a look at the data to see if the intervention of the fuel levy in New Zealand’s biggest city has had a visible impact on fuel pricing in the rest of the country.

To cut to the chase, this graphic exemplifies my approach and results:

We see that after the spike at the time the tax was introduced, fuel prices in other regions have converged somewhat on Auckland’s prices (particularly when considering the relative change happening before the tax). The impact of the tax is still clearly felt much more strongly in Auckland than anywhere else (as of course would be expected – the question at issue is whether anywhere else would be impacted at all). More on that later.

The data

First, let’s look at the data Sam’s collected. The Twitter thread linked to above provides an Excel workbook on Dropbox. There’s a worksheet for each region (which are defined similarly, but not identically, to New Zealand’s official Regions) as well as one describing the source. For each region we have data on fuel prices for combinations of Company, Date, fuel type (eg diesel, 91 octane, etc) and Region. If we plot all the data other than liquid petroleum gas (which has particularly sparse observations), it looks like this:

“Companies” have been sorted in order of increasing average price for that graphic, but fairly crudely (ie not taking into account different mixes of fuel type by Company).

We can see we have more data for 91 octane petrol than the other types. For the rest of this post I’ll be focusing on just 91 octane.

Here’s the R code to tidy up the data to this point and draw the graphic. It assumes you’ve manually downloaded the Excel workbook to your working folder (I’m currently working with severely restricted internet, so couldn’t experiment in automating that process.)


library(openxlsx)  # getSheetNames(), read.xlsx()
library(tidyverse) # dplyr, tidyr, forcats, ggplot2
library(scales)    # dollar and percent label formatters

# Import data:
sn <- getSheetNames("fuel price data.xlsx")
sn <- sn[sn != "Source"]

fuel_orig <- list()

for(i in 1:length(sn)){
  tmp <- read.xlsx("fuel price data.xlsx", sheet = sn[i], cols = 1:7, 
                   detectDates = TRUE, na.strings = c("NA", "n/a"))
  tmp[ , "region"] <- sn[i]
  fuel_orig[[i]] <- tmp
}

# Combine into a single data frame
fuel_df <-"rbind", fuel_orig)

# some useful extra information:
south_island <- c("Canterbury", "Nelson", "Otago", "Southland", "West Coast")
big_four <- c("CALTEX", "Z ENERGY", "BP", "MOBIL")

# Make long, thin, tidy version:
fuel_tidy <- fuel_df %>%
  select(-LPG) %>%
  gather(fueltype, value, -Company, -Date, -region) %>%
  filter(! %>%
  mutate(island = ifelse(region %in% south_island, "South", "North"),
         company_type = ifelse(Company %in% big_four, "Big Four", "Smaller")) %>%
  mutate(region = fct_reorder(region, as.numeric(as.factor(island))),
         Company = fct_reorder(Company, value))
# Overview graphic:
fuel_tidy %>%
  ggplot(aes(x = Date, y = value, colour = Company)) +
  facet_grid(fueltype~region, scales = "free_y") +
  geom_point(size = 0.8) +
  scale_y_continuous("Price per litre at the pump", label = dollar) +
  labs(x = "Date in 2018", 
       caption = "Source:, collated by @Economissive") +
  ggtitle("Petrol prices in New Zealand over several months in mid 2018") + 
  guides(colour = guide_legend(override.aes = list(size=5))) +
  scale_colour_brewer(palette = "Set1")

Regional comparisons with Auckland

I tried a couple of different ways of comparing prices in individual regions with those in Auckland. I think this graphic is probably the most informative and straightforward:

The grey line in the background of each facet represents Auckland’s price; the shaded blue rectangle is the post-tax period (ie 1 July 2018 and onwards). The grey shaded area shows the difference between the given region’s price and that of Auckland.

We can see a lot of regional variation here, and an interesting pattern with three (or maybe even all) of the South Island regions experiencing price declines in June then picking up in July. Of course, the 11.5 cent increase in price in Auckland is very obvious in the grey lines and shading. Later on I’ll be using time series intervention analysis on this data; this is an approach commonly used in evaluating the impact of evaluations. If we were only after the direct impact, there would be no need to do any statistical tests beyond this graphic above; the big spike in prices hits you between the eyes, and there is no doubt about the discontinuity in Auckland’s prices on 1 July! The question, of course, is how sustained that impact is, and whether it bled into secondary impacts in other regions.

Here’s a second graphic that tries to visually simplify what’s going on, by calculating a single line of the ratio of prices in each region to those in Auckland. I think what it gains in visual simplicity (less lines and shading) it loses in clear interpretability. In particular, it’s not possible to tell from this graphic what changes in the graphic come from changes in Auckland, and which come from changes in the comparison region. That’s not a deal-breaker for using a graphic like this, but it does strongly suggest we should also include the first one, with the rawer average prices per region plainly shown without transformation, for context.

Note that even after the introduction of the Auckland-specific fuel tax, 91 octane petrol in several regions still costs more than in Auckland. The regions with prices higher than Auckland in the most recent data in the collection are West Coast, Otago, Nelson, Canterbury and Wellington.

Here’s the R code for those two graphics:

# construct two convenient summaries of the data, different ways of comparing regions to Auckland:
p91 <- fuel_tidy %>%
  filter(fueltype == "91") %>%
  group_by(region, island, Date) %>%
  summarise(value = mean(value, trim = 0.2)) %>% # trimmed mean to damp outliers
  ungroup()

p91_rel <- p91 %>%
  group_by(Date) %>%
  mutate(Auckland = value[region == "Auckland"]) %>%
  filter(! region %in% c("Auckland", "Wairarapa")) %>%
  mutate(perc_of_auck = value / Auckland)

# Plot showing original price data
ggplot() +
  # annoying trick necessary here to draw a semi-transparent background rectangle: 
  geom_rect(data = data.frame("hello world"), 
            xmin = as.Date("2018-07-01"), xmax = Inf, ymin = -Inf, ymax = Inf, fill = "blue", alpha = 0.1) +
  # now draw the actual data:
  geom_ribbon(data = p91_rel, aes(x = Date, ymin = Auckland, ymax = value), fill = "grey", alpha = 0.5) +
  geom_line(data = p91_rel, aes(x = Date, y = Auckland), colour = "grey50") +
  geom_line(data = p91_rel, aes(x= Date, y = value, colour = island), size = 1.2) +
  facet_wrap(~region, ncol = 3) +
  scale_y_continuous("Price of 91 octane petrol compared to in Auckland\n", label = dollar) +
  labs(x = "2018; grey line shows Auckland",
       caption = "Source:, collated by @Economissive")

# ratio of Auckland prices to others
ggplot() +
  geom_hline(yintercept = 1, colour = "grey50") +
  geom_rect(data = data.frame("hello world"), 
            xmin = as.Date("2018-07-01"), xmax = Inf, ymin = -Inf, ymax = Inf, fill = "blue", alpha = 0.1) +
  geom_line(data = p91_rel, aes(x= Date, y = perc_of_auck, colour = island)) +
  facet_wrap(~region, ncol = 3) +
  scale_y_continuous("Price of 91 octane petrol as a percentage of in Auckland\n", label = percent) +
  labs(x = "2018",
       caption = "Source:, collated by @Economissive")

Modelling the impact of an intervention over time

When it came to directly addressing our question of interest regarding price spreading, I opted to group all non-Auckland regions together and compare average prices there with those in Auckland. There are better ways of modelling this that make full use of the granular data available (mostly involving mixed effects models, and more complex ways of representing the trend over time than linearly; and they would certainly take into account weighting from the spread in population over regions) but they come with big costs in complexity that I don’t have time for right now. Plus, the difference-of-averages method struck me as the easiest way to interpret and communicate, not to mention think about, the question of whether prices were converging back towards each other after the initial shock of the addition of the tax. This leads me to the graphic I showed earlier in this post:

The linear regression lines shown in that graphic are a simplified version of the formal statistical model we want to fit and use to test our hypothesis. We’re looking for evidence that the slope of the post-tax line is materially less than the slope of the pre-tax line; in other words, is the gap in pricing between Auckland and other regions declining after the initial 11.5 cent shock of the tax.

I defined this as a simple linear model, but fit it using generalized least squares with time series residuals (auto-regressive moving average of order (1, 1)). This is straightforward to specify and fit using the gls function in Pinheiro, Bates et al’s nlme package, but there are other ways of doing it too.

This results in the following:

                     Dependent variable: value
Date                     0.001**
post_tax                19.159**
Date:post_tax           -0.001**
Constant               -10.470**
Observations                  93
Log Likelihood           330.591
Akaike Inf. Crit.       -647.182
Bayesian Inf. Crit.     -629.761
Note: *p<0.1; **p<0.05; ***p<0.01

…which simply confirms what is visually obvious, that there is indeed statistically significant evidence of the slope changing direction downwards after the tax is introduced. In other words, we do have evidence consistent with some degree of “spreading” taking place. After the initial clean shock of the introduction of the tax, prices in Auckland and in the rest of the country are indeed converging somewhat; although nowhere near as much as the full cost of the tax.
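The slope-change logic of the Date * post_tax interaction generalizes beyond nlme::gls. As a hedged illustration on synthetic numbers (not the actual fuel data), here is the same interaction design fit by plain least squares in Python, where the pre- and post-intervention slopes fall straight out of the coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(100, dtype=float)   # stand-in for Date
post = (t >= 50).astype(float)    # intervention indicator
# synthetic "price gap": rising before the intervention, falling after
y = 0.05 * t - 0.08 * (t - 50) * post + rng.normal(scale=0.1, size=t.size)

# design matrix for value ~ t * post: intercept, t, post, t:post
X = np.column_stack([np.ones_like(t), t, post, t * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

pre_slope = beta[1]             # slope before the intervention
post_slope = beta[1] + beta[3]  # the interaction term shifts the slope after it
```

With i.i.d. residuals this is the same value ~ Date * post_tax structure; the gls fit in the post additionally models ARMA(1, 1) errors, which makes the inference more conservative.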

This effect holds whether I use all of New Zealand as the comparison point or just the South Island (which has less competition in fuel retailers) or just the North Island, although in the latter case the effect is not as strong (as can be seen in the graphic). It also doesn’t seem to matter whether we use all available prices, or just those of the “big four” companies that are present in all regions.

We can’t say for sure from the numbers alone that the effect comes from introducing the tax. Drawing that conclusion would require carefully considering other possible causal explanations. For example, one driver of the pattern we’re seeing is clearly that prices in Canterbury, Nelson and Otago stopped declining and started rising slightly in July. What are other plausible causes of that pattern? Understanding and considering such alternative theories would need more knowledge of the fuel market in New Zealand than I have, so I’ll leave it to others to debate that. All I can safely conclude is what I wrote above:

after the spike caused by the tax, fuel prices in Auckland and in the rest of the country are converging somewhat (although much less than the full cost of the tax), and plausibly this is because of companies’ price adjustments down in Auckland and up elsewhere to spread the cost of the tax over a broader base.

Here’s the code for that final graphic and statistical modelling:

# Data on the difference between Auckland's average price and those in other areas:
diff_data <- fuel_tidy %>%
  filter(fueltype == "91" & company_type == "Big Four") %>%
  group_by(Date) %>%
  summarise(auck_v_rest = 
              mean(value[region == "Auckland"]) - 
              mean(value[region != "Auckland"]),
            auck_v_si = 
              mean(value[region == "Auckland"]) - 
              mean(value[island == "South"]),
            auck_v_ni = 
              mean(value[region == "Auckland"]) - 
              mean(value[island == "North" & region != "Auckland"])) %>%
  mutate(post_tax = as.integer(Date >= as.Date("2018-07-01"))) %>%
  gather(comparison, value, -Date, -post_tax) %>%
  mutate(comparison = case_when(
         comparison == "auck_v_si"   ~ "Compared to South Island",
         comparison == "auck_v_ni"   ~ "Compared to rest of North island",
         comparison == "auck_v_rest" ~ "Compared to all NZ except Auckland"))

# Graphic:
ggplot(diff_data, aes(x = Date, y = value)) +
  facet_wrap(~comparison, ncol = 3) +
  geom_line() +
  geom_smooth(aes(group = post_tax), method = "lm") +
  scale_y_continuous("Average price of 91 octane petrol in Auckland\nminus average price in comparison area",
                     label = dollar) +
  labs(x = "Date in 2018\nAverage prices have not been weighted by population or sales",
       caption = "Source:, collated by @Economissive") +
  ggtitle("Fuel prices in Auckland compared to three other comparison areas",
          "Restricted to prices from BP, Caltex, Mobil and Z Energy")

# Modelling:
# first, make a convenient subset of the data (useful later in various tests and diagnostics):
D <- subset(diff_data, comparison == "Compared to all NZ except Auckland")

# Fit model, taking care to specify time series residuals, which aren't as useful for inference
# as i.i.d. residuals and hence lead to more conservative inference:
model <- gls(value ~ Date * post_tax, 
             data = D,
             cor = corARMA(p = 1, q = 1))

# print-friendly summary of coefficients
stargazer::stargazer(model, type = "html")

# more comprehensive summary (not shown in blog):
summary(model)
Observations per region

A topic of interest in fuel pricing debate in New Zealand is the number of companies present in each region, with a particular focus on the presence of Gull. In case of interest, here are the observations in the pricewatch data collected by Sam Warburton:

Auckland 206 143 349 251 279 372 279 279 372
Bay of Plenty 262 200 295 202 278 264 279 161 370
Coromandel 3 6 249 4 209 219 233 69 277
Waikato 0 0 228 227 276 302 278 220 367
Hawke’s Bay 0 45 217 16 261 228 255 132 345
Northland 0 0 205 52 274 229 256 268 275
Manawatū-Whanganui 0 108 193 120 275 265 277 176 371
Taranaki 0 184 178 188 243 88 274 188 301
Wellington 0 91 78 252 278 277 279 213 372
East Coast 0 41 71 158 221 122 145 104 211
Central Plateau 0 0 44 51 159 244 275 0 295
Wairarapa 0 1 0 0 2 42 3 2 103
Canterbury 0 262 0 279 279 278 279 245 372
Nelson 0 50 0 34 216 237 263 163 237
Otago 0 225 0 226 273 265 277 116 364
Southland 0 114 0 197 249 195 226 98 245
West Coast 0 115 0 227 119 121 115 65 242

That table was generated by:

fuel_tidy %>%
  group_by(region, Company) %>%
  summarise(freq = n()) %>%
  spread(Company, freq, fill = 0) %>%
  arrange(desc(GULL))

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

Continue Reading…


Read More

Distilled News

How to Safeguard Your “Data Lake” for Better Decision Making

It is absolutely critical for organizations that deal with large amounts of data to carefully protect their information repositories, just as natural lakes often come with rules and regulations to make sure those who swim are safe. The reasons to do so are overwhelming, as 40 percent of people surveyed recently by PwC report having made a decision based on data. But, if decisions are being made with flawed, inconsistent data, it makes the whole effort of collecting and storing it in the first place irrelevant. It is critical that IT departments set boundaries and give clear directives on how all data is used when the rest of their organization begins ‘swimming’ – or using this data to make decisions – in the data lake. To do so, they should consider setting the following rules:
• Create swimmable areas
• Clear the water deliberately
• Make sure there is a lifeguard on duty
• Always keep a current map

DataRobot Announces Automated Time Series Solution

DataRobot’s automation platform constructs and evaluates hundreds to thousands of different time series models and scores their performance – taking into account all the different temporal conditions to determine real-world accuracy.

Complete tutorial on Text Classification using Conditional Random Fields Model (in Python)


How to Overcome the Curse of Dimensionality

Dimensionality reduction is an important technique for overcoming the curse of dimensionality in data science and machine learning. As the number of predictors (dimensions, or features) in a dataset increases, it becomes computationally more expensive (i.e. increased storage space, longer computation time) and exponentially more difficult to produce accurate predictions with classification or regression models. Moreover, it is hard to wrap our heads around visualizing data points in more than 3 dimensions.
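As a toy illustration of the point, here is a short Python sketch (NumPy only; all names and numbers are made up for the example) in which 50-dimensional data secretly lies on a 2-dimensional plane, and PCA via the singular value decomposition finds that nearly all the variance sits in two components:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points that really live on a 2-D plane, embedded in 50 dimensions
latent = rng.normal(size=(200, 2))
embedding = rng.normal(size=(2, 50))
X = latent @ embedding

# PCA: center the data, then take the SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # variance share of each component

# nearly all variance is in the first two components, so the
# 50 columns can be replaced by 2 with almost no information loss
X_reduced = Xc @ Vt[:2].T        # shape (200, 2)
```

Real data is noisier than this rank-2 toy, but the recipe (center, decompose, keep the top components) is the same.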

Using genetic data to strengthen causal inference in observational research

Causal inference is essential across the biomedical, behavioural and social sciences. By progressing from confounded statistical associations to evidence of causal relationships, causal inference can reveal complex pathways underlying traits and diseases and help to prioritize targets for intervention. Recent progress in genetic epidemiology – including statistical innovation, massive genotyped data sets and novel computational tools for deep data mining – has fostered the intense development of methods exploiting genetic data and relatedness to strengthen causal inference in observational research. In this Review, we describe how such genetically informed methods differ in their rationale, applicability and inherent limitations and outline how they should be integrated in the future to offer a rich causal inference toolbox.

What is the difference between RDF and OWL?

• Triples & URIs
Subject – Predicate – Object
These describe a single fact. Generally URIs are used for the subject and predicate. The object is either another URI or a literal such as a number or string. Literals can have a type (which is also a URI), and they can also have a language. Yes, this means triples can have up to 5 bits of data!
For example, a triple might describe the fact that Charles is Harry’s father.
<http://…/harry> <http://…/1.0#hasFather> <http://…/charles> .
Triples are database normalization taken to a logical extreme. They have the advantage that you can load triples from many sources into one database with no reconfiguration.

• RDF and RDFS
The next layer is RDF – The Resource Description Framework. RDF defines some extra structure to triples. The most important thing RDF defines is a predicate called “rdf:type”. This is used to say that things are of certain types. Everyone uses rdf:type which makes it very useful.
RDFS (RDF Schema) defines some classes which represent the concept of subjects, objects, predicates etc. This means you can start making statements about classes of things, and types of relationship. At the simplest level you can state things like http://…/1.0#hasFather is a relationship between a person and a person. It also allows you to describe in human-readable text the meaning of a relationship or a class. This is a schema. It tells you the legal uses of various classes and relationships. It is also used to indicate that a class or property is a sub-type of a more general type. For example “HumanParent” is a subclass of “Person”. “Loves” is a sub-property of “Knows”.

• RDF Serialisations
RDF can be exported in a number of file formats. The most common is RDF+XML but this has some weaknesses.
N3 is a non-XML format which is easier to read, and there’s some subsets (Turtle and N-Triples) which are stricter.
It’s important to know that RDF is a way of working with triples, NOT the file formats.

XSD is a namespace mostly used to describe property types, like dates, integers and so forth. It’s generally seen in RDF data identifying the specific type of a literal. It’s also used in XML schemas, which is a slightly different kettle of fish.

OWL adds semantics to the schema. It allows you to specify far more about the properties and classes. It is also expressed in triples. For example, it can indicate that “If A isMarriedTo B” then this implies “B isMarriedTo A”. Or that if “C isAncestorOf D” and “D isAncestorOf E” then “C isAncestorOf E”. Another useful thing OWL adds is the ability to say two things are the same; this is very helpful for joining up data expressed in different schemas. You can say that the relationship “sired” in one schema is owl:sameAs “fathered” in some other schema. You can also use it to say two resources are the same, such as the “Elvis Presley” on Wikipedia being the same one on the BBC. This is very exciting as it means you can start joining up data from multiple sites (this is “Linked Data”).
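Those inference rules are mechanical enough to mimic in a few lines. Here is a hedged Python sketch (plain tuples standing in for triples, no RDF library, example names only) that applies a symmetry rule and a transitivity rule until no new facts appear:

```python
# facts as (subject, predicate, object) triples
triples = {
    ("A", "isMarriedTo", "B"),
    ("C", "isAncestorOf", "D"),
    ("D", "isAncestorOf", "E"),
}

def saturate(facts):
    """Apply two OWL-style rules until a fixed point:
    isMarriedTo is symmetric, isAncestorOf is transitive."""
    facts = set(facts)
    while True:
        new = set()
        for s, p, o in facts:
            if p == "isMarriedTo":
                new.add((o, p, s))  # symmetry: A marriedTo B => B marriedTo A
        for s1, p1, o1 in facts:
            for s2, p2, o2 in facts:
                if p1 == "isAncestorOf" and p2 == "isAncestorOf" and o1 == s2:
                    new.add((s1, "isAncestorOf", o2))  # transitivity
        if new <= facts:
            return facts  # nothing new inferred; done
        facts |= new

inferred = saturate(triples)
# ("B", "isMarriedTo", "A") and ("C", "isAncestorOf", "E") are now entailed
```

A real OWL reasoner reads these rules from property declarations (owl:SymmetricProperty, owl:TransitiveProperty) rather than hard-coding them, but the fixed-point idea is the same.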

Four Quadrants of the Enterprise AI business case

In Progressive Orders of Complexity (and Opportunity) the Four Quadrants for the Enterprise AI Business Case Are:
• Experiment Driven: Machine Learning and Deep Learning
• Data Driven: Enterprise Platforms and Data
• Scale Driven: AI Pipeline and Scalability
• Talent Driven: AI Disruption and Stagnation

Using Uncertainty to Interpret your Model

As deep neural networks (DNN) become more powerful, their complexity increases. This complexity introduces new challenges, including model interpretability. Interpretability is crucial in order to build models that are more robust and resistant to adversarial attacks. Moreover, designing a model for a new, not well researched domain is challenging and being able to interpret what the model is doing can help us in the process.

Comparison of the Most Useful Text Processing APIs

Nowadays, text processing is developing rapidly, and several big companies provide products which help to deal successfully with diverse text processing tasks. In case you need to do some text processing there are 2 options available. The first one is to develop the entire system on your own from scratch. This way proves to be very time- and resource-consuming. On the other hand, you can use the already accessible solutions developed by well-known companies. This option is usually faster and simpler. No specific knowledge or experience in natural language processing is required; it would suffice to understand the fundamentals of text processing. At the same time, if you need something exclusive, it is better to implement your own solution rather than to apply one of the above-mentioned.
Working with text processing, the data analyst faces the following tasks:
• Keyphrase extraction;
• Sentiment analysis;
• Text analysis;
• Entity recognition;
• Translation;
• Language detection;
• Topic modeling.
There are several high-level APIs which may be used to perform these tasks. Among them:
• Amazon Comprehend;
• IBM Watson Natural Language Understanding;
• Microsoft Azure (Text analytics API);
• Google Cloud Natural Language;
• Microsoft Azure (Linguistic Analysis API) – beta;
• Google Translate API;
• IBM Watson Translator;
• Amazon Translate;
• Microsoft Azure Translator Text API.

Collections in R: Review and Proposal

R is a powerful tool for data processing, visualization, and modeling. However, R is slower than other languages used for similar purposes, such as Python. One reason for this is that R lacks base support for collections, abstract data types that store, manipulate, and return data (e.g., sets, maps, stacks). An exciting recent trend in the R extension ecosystem is the development of collection packages, packages that provide classes that implement common collections. At least 12 collection packages are available across the two major R extension repositories, the Comprehensive R Archive Network (CRAN) and Bioconductor. In this article, we compare collection packages in terms of their features, design philosophy, ease of use, and performance on benchmark tests. We demonstrate that, when used well, the data structures provided by collection packages are in many cases significantly faster than the data structures provided by base R. We also highlight current deficiencies among R collection packages and propose avenues of possible improvement. This article provides useful recommendations to R programmers seeking to speed up their programs and aims to inform the development of future collection-oriented software for R.

News from the Bioconductor Project

The Bioconductor project provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor 3.7 was released on 1 May, 2018. It is compatible with R 3.5.1 and consists of 1560 software packages, 342 experiment data packages, and 919 up-to-date annotation packages. The release announcement includes descriptions of 98 new software packages and updated NEWS files for many additional packages.

Changes in R

From version 3.5.0 to version 3.5.0 patched

The Blacker the Box

There has been a lot of discussion in the data science community about the use of black-box models, and there is lots of really fascinating ongoing research into methods, algorithms, and tools to help data scientists better introspect their models. While those discussions and that research are important, in this post I discuss the macro-framework I use for evaluating how black the box can be for a prediction product. In this post I do not get into the weeds of complexity penalization algorithms or even how to weigh the tech debt associated with additional complexity. Instead, I want to take a step back and discuss how I think about ‘prediction’ problems at a more macro level, and how I value accuracy and complexity for different types of problems. The thesis of this post is: The faster the feedback on prediction accuracy, the blacker the box can be. The slower the feedback, the more your models should be explicit and formal.

Datasets From Images

This tutorial demonstrates how you can make datasets in CSV format from images and use them for data science on your laptop.

Unsupervised learning demystified

Unsupervised learning may sound like a fancy way to say 'let the kids learn on their own not to touch the hot oven', but it's actually a pattern-finding technique for mining inspiration from your data. It has nothing to do with machines running around without adult supervision, forming their own opinions about things. Let's demystify!


The Future of Data Affects the Whole Team – TDWI Orlando, Nov 11-16

Eliminate Weak Links When You Bring Your Team to Orlando! Super Early Bird Deadline: September 14 - Save up to $915 with code KD20


New York halts new licences for ride-hailing cars

RIDE-HAILING companies like Uber and Lyft are loved by city dwellers but may be jamming roads. In midtown and lower Manhattan, cars have slowed from an average speed of 9.1mph (14.6kph) in 2010 to 7.1mph in 2017.


Setting up your AI Dev Environment in 5 Minutes

Whether you're a novice data science enthusiast setting up TensorFlow for the first time, or a seasoned AI engineer working with terabytes of data, getting your libraries, packages, and frameworks installed is always a struggle. Learn how datmo, an open source python package, helps you get started in minutes.


Evaluating Olive McBride with the Arkham Horror LCG Chaos Bag Simulator in R

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

(If you care, there may be spoilers in this post.)


I love Arkham Horror: The Card Game. I love it more than I really should; it’s ridiculously fun. It’s a cooperative card game where you build a deck representing a character in the Cthulhu mythos universe, and with that deck you play scenarios in a narrative campaign where you grapple with the horrors of the mythos. Your actions in (and between) scenarios have repercussions for how the campaign plays out, changing the story, and you use experience points accumulated in the scenarios to upgrade your deck with better cards.

The game is hard. Really hard. And I don’t mean it’s hard because it has a learning curve (although if you’re not used to deep games like this, it probably does have a learning curve). This is a Cthulhu game, which means that the mythos is going to do everything to beat you down. Your best-laid plans are going to get derailed. The worst thing that could possibly happen to you will happen. This is why this game is unreasonably fun: it’s a game for masochists.

And it makes a damn ton of good stories. I often want to tell the tale of how amazingly awful the events played out, or how I just snatched victory out of the jaws of defeat. Bloomberg recently published an article about the rise of board games in companies. If you’re in one of these companies where employees often play board games together after hours, you should consider adding Arkham Horror (the card game; I’ve never played the board game and I’ve heard mixed reviews) to the mix. It’s a great team builder.

Two elements make Arkham Horror so damn hard and unpredictable: the encounter deck and the chaos bag. The encounter deck is a deck all players draw from every turn, spawning monsters and awful events. The chaos bag is used for the skill tests you make to advance your board state. You draw tokens from the chaos bag, add the modifier of the revealed token to your skill value for that skill test (most of the modifiers are negative numbers), and check the new value against the difficulty of the test; if your modified skill value is at least as great as the difficulty of the test, you pass; otherwise, you fail. (You can pitch cards from your hand to increase your skill value before you reveal a token.)
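In code, the pass/fail rule above boils down to a one-line comparison. Here is a minimal sketch (the function and argument names are mine, not from the game or the scripts below):

```r
# Resolve a single skill test: add the revealed token's modifier to your
# skill value and compare against the test's difficulty.
passes_test <- function(skill_value, token_modifier, difficulty) {
  # The auto-fail token can be modeled as a -Inf modifier: no skill value
  # is high enough to overcome it.
  (skill_value + token_modifier) >= difficulty
}

passes_test(4, -2, 2)    # TRUE: modified value 2 meets difficulty 2
passes_test(4, -3, 2)    # FALSE: modified value 1 falls short
passes_test(4, -Inf, 2)  # FALSE: auto-fail
```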

Chaos bag tokens
(Image source: Ars Technica)

This is the chaos bag in a nutshell, but the chaos bag does more. The elder sign token represents good luck and often helps your board state; it’s considered a great success. But most icon tokens in the bag are trying to hurt you. There are four icon tokens: the skulls, the cultist, the tablets, and the elder things. These often not only apply a modifier but do something bad to you (which changes between scenarios).

And of course there’s this little bastard.


This is the auto-fail token. You just fail. Thanks for playing.

Simulating the Chaos Bag

Bad things happen in Arkham Horror. Sometimes, though, they feel like they happen a lot. For example, you can draw two auto-fails in a row. Or three. I cannot overstate how devastating that can be in a game. Sometimes it feels like this particular game went unusually poorly, and unusually poor games seem to happen frequently.

That’s a contradiction, of course. I think this sense of bad games happening often emerges from humans’ fundamentally poor understanding of probability and how it actually works. I’m not just referring to the mathematics; people’s intuition of what should happen according to probability does not match what probability says happens. This phenomenon is well documented (see, for example, the book Irrationality, by Stuart Sutherland) and is one of the explanations for why people seemed to underestimate Donald Trump’s chances of winning the 2016 election (in short, unlikely events occur more frequently than people perceive; see also Nate Silver’s series of articles). In fact, you are more likely to find some pattern than no pattern at all (and in Arkham Horror, patterns are usually bad for you).

This perception of “unusually many auto-fails” (which, as a statistician, I know cannot be right no matter what my feelings say) prompted me to write an R function that generates a string of pulls from the chaos bag, each pull independent of the previous pull. Here’s the function:

chaos_bag_string_sim <- function(dicebag = list('E' = 1,  # Elder Sign
                                                'P' = 1,  # +1
                                                '0' = 2,
                                                '1' = 3,  # -1
                                                '2' = 2,  # -2
                                                '3' = 1,
                                                '4' = 1,
                                                'S' = 2,  # Skull
                                                'C' = 1,  # Cultist
                                                'T' = 1,  # Tablet
                                                'L' = 1,  # Elder Thing
                                                'A' = 1), # Autofail
                                 n = 24, replace = TRUE) {
  paste0(sample(names(dicebag), replace = replace,
                prob = as.vector(unlist(dicebag)),
                size = n), collapse = '')
}

Notice that the function takes a list, dicebag. This list uses one-character codes representing chaos bag tokens, and the actual contents of the list are numbers representing the frequency of that token in the bag. The bag that serves as a default is a fictitious chaos bag that I believe is representative of Arkham Horror on standard difficulty. Since most numbered tokens are negative, I denote the +1 token with 'P' and the negative tokens with their numbers.

How many pulls from the bag occur in a game? That's really hard to figure out; in fact, this will vary from scenario to scenario dramatically. So let's just make an educated guess to represent the "typical" game. A round consists of a mythos phase, then three actions per player, followed by a monsters phase and an upkeep phase. The mythos phase often prompts skill tests, and I would guess that the average number of skill tests made during each player's turn is two. We'll guess that about half of the draws from the encounter deck prompt skill tests. A game may last about 16 rounds. This adds up to about 40 skill tests per player per game. As a consequence, a four-player game (I mostly play multiplayer, and the more the merrier) would yield 160 skill tests.
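The arithmetic behind that back-of-the-envelope guess can be spelled out explicitly, using the estimates from the paragraph above:

```r
rounds <- 16            # estimated rounds in a typical game
tests_per_turn <- 2     # estimated skill tests during each player's turn
encounter_tests <- 0.5  # fraction of encounter draws prompting a skill test

# Each round, a player makes roughly two tests on their turn plus half a
# test from their encounter draw.
tests_per_player <- rounds * (tests_per_turn + encounter_tests)
tests_per_player  # 40

players <- 4
total_tests <- players * tests_per_player
total_tests       # 160
```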

Let's simulate a string of 160 skill tests, each independent of the others.

(game <- chaos_bag_string_sim(n = 160))
[1] "311113E12121E2LS20A32A4ETT1SE3T3P0AT2S12PLSC033CP22A11CLL31L024P12ES24A1S0E

I represented pulls from the chaos bag as a string because we can use regular expressions to find patterns in the string. For instance, we can use the expression AA to find all occurrences of double auto-fails in the string.

grepl("AA", game)
[1] FALSE

The grepl() function returns a boolean (logical) value identifying whether the pattern appeared in the string; in this game, no two auto-fails appeared in a row.

What about two auto-fails appearing, separated by two other tests? This is matched by the expression A..A (remember, . matches any character).

grepl("A..A", game)
[1] TRUE

We can take this further and use our simulator to estimate the probability events occur. After all, in probability, there are two ways to find the probability of events:

  1. Create a probability model and use mathematics to compute the probability an event of interest occurs; for example, use a model of a die roll to compute the probability of rolling a 6
  2. Perform the experiment many times and count how often the event of interest occurred; for example, roll a die 1000 times and count how many times it rolled a 6 (this is usually done on a computer)
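The die example shows how the two approaches converge (a quick sketch; the model-based answer is exactly 1/6):

```r
set.seed(2018)  # arbitrary seed, for reproducibility

# Approach 1: probability model
p_exact <- 1 / 6

# Approach 2: simulate many die rolls and count sixes
rolls <- sample(1:6, size = 100000, replace = TRUE)
p_sim <- mean(rolls == 6)

abs(p_sim - p_exact)  # small, and shrinks as the number of rolls grows
```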

While the latter approach produces estimates, these estimates are guaranteed to get better the more simulations are performed. In fact, if you want at least a 95% chance of your estimate being accurate up to two decimal places, you could perform the simulation 10,000 times.
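That rule of thumb follows from the binomial margin of error: with n simulations, a 95% confidence interval for an estimated probability has half-width at most 1.96 * sqrt(0.25 / n), which for n = 10,000 is just under 0.01:

```r
n <- 10000
# The worst case occurs at p = 0.5, where the variance p * (1 - p) is largest
margin <- 1.96 * sqrt(0.5 * 0.5 / n)
margin  # 0.0098, i.e. accurate to roughly two decimal places
```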

Using this, we can estimate the probability of seeing two auto-fails in a row.

mean(grepl("AA", replicate(10000, {chaos_bag_string_sim(n = 160)})))
[1] 0.4019

There's about a 40% chance of seeing two auto-fail tokens in a single four-player game. That's pretty damn likely.

Let's ask a slightly different question: what is the average number of times we will see two auto-fails in a row in a game? For this we will want to dip into the stringr package, using the str_count() function.


str_count(game, "AA")
[1] 0
str_count(game, "A..A")  # How many times did we see auto-fails separated by two
                         # tests?
[1] 1

Or we can ask how often we saw two "bad" tokens in a row. In Arkham Horror on standard difficulty, players often aim to pass a test when drawing a -2 or better. This means that "bad" tokens include -3, -4, auto-fail, and often the tablet and elder thing tokens, too. The regular expression pattern that can match "two bad things in a row" is [34TLA]{2} (translation: see either 3, 4, T, L, or A exactly two times).

str_count(game, "[34TLA]{2}")
[1] 14

How could we estimate the average number of times this would occur in a four-player game? The simulation trick still works. Pick a large number of simulations to get an accurate estimate; the larger, the more accurate (but also more work for your computer).

mean(str_count(replicate(10000, {chaos_bag_string_sim(n = 160)}), "[34TLA]{2}"))
[1] 10.6717

You can imagine that this can keep going and perhaps there are queries that you would like to see answered. Below I leave you with an R script that you can use to do these types of experiments. This script is designed to be run from a Unix-flavored command line, though you could source() the script into an R session to use the chaos_bag_string_sim() function interactively. Use regular expressions to define a pattern you are interested in (this is a good opportunity to learn regular expressions for those who want the exercise).

# AHChaosBagSimulator.R
# Curtis Miller
# 2018-08-03
# A script for simulating Arkham Horror's chaos bag

chaos_bag_string_sim <- function(dicebag = list('E' = 1,  # Elder Sign
                                                'P' = 1,  # +1
                                                '0' = 2,
                                                '1' = 3,  # -1
                                                '2' = 2,  # -2
                                                '3' = 1,
                                                '4' = 1,
                                                'S' = 2,  # Skull
                                                'C' = 1,  # Cultist
                                                'T' = 1,  # Tablet
                                                'L' = 1,  # Elder Thing
                                                'A' = 1), # Autofail
                                 n = 24, replace = TRUE) {
  paste0(sample(names(dicebag), replace = replace,
                prob = as.vector(unlist(dicebag)),
                size = n), collapse = '')
}

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  stop("The optparse package is required: install.packages(\"optparse\")")
}

main <- function(pattern, average = FALSE, pulls = 24, replications = 10000,
                 no_replacement = FALSE, dicebag = "", help = FALSE) {

  if (dicebag != "") {
    dicebag_df <- read.csv(dicebag, stringsAsFactors = FALSE)
    dicebag_df$token <- as.character(dicebag_df$token)
    if (!is.numeric(dicebag_df$freq)) {stop("Dicebag freq must be integers.")}
    dicebag <- as.list(dicebag_df$freq)
    names(dicebag) <- dicebag_df$token
  } else {
    dicebag <- list('E' = 1,  # Elder Sign
                    'P' = 1,  # +1
                    '0' = 2,
                    '1' = 3,  # -1
                    '2' = 2,  # -2
                    '3' = 1,
                    '4' = 1,
                    'S' = 2,  # Skull
                    'C' = 1,  # Cultist
                    'T' = 1,  # Tablet
                    'L' = 1,  # Elder Thing
                    'A' = 1)  # Autofail
  }

  games <- replicate(replications, {chaos_bag_string_sim(dicebag = dicebag,
                                                    n = pulls,
                                                    replace = !no_replacement)})

  if (average) {
    cat("Average occurrence of pattern:",
        mean(stringr::str_count(games, pattern)), "\n")
  } else {
    cat("Probability of occurrence of pattern:",
        mean(grepl(pattern, games)), "\n")
  }
}

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
        description = "Simulates the Arkham Horror LCG chaos bag.",
        option_list = list(
          make_option(c("--pattern", "-r"), type = "character",
                      help = "Pattern (regular expression) to match"),
          make_option(c("--average", "-a"), type = "logical",
                      action = "store_true", default = FALSE,
                      help = "If set, computes average number of occurrences"),
          make_option(c("--pulls", "-p"), type = "integer", default = 24,
                      help = "The number of pulls from the chaos bag"),
          make_option(c("--replications", "-N"), type = "integer",
                      default = 10000,
                      help = "Number of replications"),
          make_option(c("--no-replacement", "-n"), type = "logical",
                      action = "store_true", default = FALSE,
                      help = "Draw tokens without replacement"),
          make_option(c("--dicebag", "-d"), type = "character", default = "",
                      help = "(Optional) dice bag distribution CSV file")
        )))

  names(cl_args)[which(names(cl_args) == "no-replacement")] <- "no_replacement"
  do.call(main, cl_args)
}

Below is an example of a CSV file specifying a chaos bag.
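(The CSV shown in the original post is missing here. Based on how the script reads the file — `token` and `freq` columns — a file reproducing the default bag would look like the following reconstruction.)

```
token,freq
E,1
P,1
0,2
1,3
2,2
3,1
4,1
S,2
C,1
T,1
L,1
A,1
```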


Below is some example usage.

$ chmod +x AHChaosBagSimulator.R  # Only need to do this once
$ ./AHChaosBagSimulator.R -r AA -p 160
Probability of occurrence of pattern: 0.4074
$ ./AHChaosBagSimulator.R -r '[34TLA]{2}' -a -p 160
Average occurrence of pattern: 10.6225

To see documentation, type ./AHChaosBagSimulator.R --help. Note that something must always be passed to -r; this is the pattern needed.

Olive McBride

This tool was initially written as a way to explore chaos bag probabilities. I didn't consider the tool to be very useful. It simply helped make the point that unlikely events seem to happen frequently in Arkham Horror. However, I found a way to put the tool to practical use.

Recently, Arkham Horror's designers have been releasing cards that seem to demand more math to fully understand. My favorite Arkham Horror card reviewer, The Man from Leng, has fretted about this in some of his recent reviews about cards using the seal mechanic, which change the composition of the chaos bag by removing tokens from it.

I can only imagine how he would feel about a card like Olive McBride (one of the cards appearing in the upcoming Mythos Pack, "Heart of the Elders", to be released later this week.)

Olive McBride

Olive McBride allows you to reveal three chaos tokens instead of one, and choose two of those tokens to resolve. This effect can be triggered any time you would reveal a chaos token.

I'll repeat that: Olive can trigger any time a chaos token is revealed. Most of the time tokens are revealed during skill tests, but other events lead to dipping into the chaos bag too. Notably, The Dunwich Legacy campaign has you draw tokens without taking skill tests, such as when gambling in "The House Always Wins" or because of some encounter cards. This makes Olive McBride a "master gambler": she can draw three tokens and pick the winning ones when gambling. (She almost breaks the scenario.) Additionally, Olive can be combined with cards like Ritual Candles and Grotesque Statue to further shift skill tests in your favor.

Ritual Candles

Grotesque Statue

These are interesting situations but let's ignore combos like this for now. What does Olive do to your usual skill test? Specifically, what does she do to the modifiers?

Before going on, I need to address verbiage. When players are about to perform a skill test, they will typically make a statement like "I'm at +2 for this test." This means that the skill value of the investigator (after applying all modifiers) is two higher than the difficulty of the test. Thus, if the investigator draws a -2 or better, the investigator passes the test; if the investigator draws a -3 or worse, the investigator fails. My play group does not say this; we say "I'm at -2 for this test," meaning that if the investigator sees a -2 or better from the bag, the investigator will pass. This is more intuitive to me, and I also think it translates more directly to the math.

Presumably when doing a skill test with Olive, if all we care about is passing the test, we will pick the two tokens drawn from the bag that have the best modifiers. We add the modifiers of those tokens together to get the final modifier. Whether this improves your odds of passing the test or not isn't immediately clear.

I've written an R function that simulates skill tests with Olive. With this we can estimate Olive's effects.

olive_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                    "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                    "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                      N = 1000, dicebag = NULL) {
  dargs <- list(n = 3, replace = FALSE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <- replicate(N, {, dargs)})
  vals <- sapply(pulls, function(p) {
    vecp <- strsplit(p, "")[[1]]
    vecnum <- translate[vecp]
    max(apply(combn(3, 2), 2, function(i) {sum(vecnum[i])}))
  })

  vals[which(is.nan(vals))] <- Inf
  vals
}


The parameter translate gives a vector that translates code for chaos tokens to numerical modifiers. Notice that the auto-fail token is assigned -Inf since it will cause any test to fail. If we wanted, say, the elder sign token to auto-succeed (which is Mateo's elder sign effect), we could replace its translation with Inf. By default the function uses the dicebag provided with chaos_bag_string_sim(), but this can be changed.

Here's a single final modifier from a pull using Olive.

olive_sim(N = 1)
C01
 -1

Interestingly, the function returned a named vector, and the name corresponds to what was pulled. In this case, a cultist, 0, and -1 token were pulled; the resulting best modifier is -1. The function is already built to do this many times.

olive_sim(N = 5)
31S 2S1 LS1 120 C2E
 -3  -3  -3  -1   0

Below is a function that can simulate a lot of normal skill tests.

test_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                   "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                   "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                     N = 1000, dicebag = NULL) {
  dargs <- list(n = N, replace = TRUE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <-, dargs)
  vecp <- strsplit(pulls, "")[[1]]
  vecnum <- translate[vecp]
  vecnum
}


Here's a demonstration.

test_sim(N = 5)
 2  S  3  1  0
-2 -2 -3 -1  0

Finally, below is a function that, given a vector of results like those above, produces a table of estimated probabilities of success at skill tests, given a vector of values the tests must beat in order to pass (that is, using the "I'm at -2" type of language).

est_success_table_gen <- function(res, values = (-10):3) {
  r <- sapply(values, function(v) {mean(res >= v)})
  names(r) <- values
  r
}

Let's give it a test run.

u <- test_sim()
as.matrix(est_success_table_gen(u))
3   0.000
2   0.057
1   0.127
0   0.250
-1  0.420
-2  0.657
-3  0.759
-4  0.873
-5  0.931
-6  0.931
-7  0.931
-8  0.931
-9  0.931
-10 0.931

This represents the base probability of success. Let's see what Olive does to this table.

y <- olive_sim()
as.matrix(est_success_table_gen(y))
3   0.024
2   0.063
1   0.140
0   0.238
-1  0.423
-2  0.531
-3  0.709
-4  0.829
-5  0.911
-6  0.960
-7  0.983
-8  0.996
-9  1.000
-10 1.000

Perhaps I can make the relationship more clear with a plot.

plot(stepfun((-10):3, rev(c(0, est_success_table_gen(u)))), verticals = FALSE,
     pch = 20, main = "Probability of Success", ylim = c(0, 1))
lines(stepfun((-10):3, rev(c(0, est_success_table_gen(y)))), verticals = FALSE,
      pch = 20, col = "blue")

The black line is the estimated probability of success without Olive, and the blue line the same with Olive. (I tried reversing the x-axis, but for some reason I could not get good results.) What we see is:

  • Olive improves the chances a "hail Mary" will succeed. If you need +1, +2, or more to succeed, Olive can help make that happen (though the odds still aren't great)
  • Olive can help you guarantee a skill test will succeed. If you boost your skill value to very high numbers, Olive can effectively neuter the auto-fail token. That's a good feeling.
  • Otherwise, Olive slightly hurts your chances of success; being at -2 is noticeably worse with her than without. Most of the time, though, she changes the probability of success too little to notice.

For most investigators, then, Olive doesn't do much to make her worth your while. But I think Olive makes a huge difference for two investigators: Father Mateo and Jim Culver.

Father Mateo Jim Culver

Both of these investigators like some chaos bag tokens a lot. Father Mateo really likes the elder sign since it is an auto-success (in addition to other very good effects), while Jim Culver likes skulls since they are always 0.

What does Olive do for Father Mateo?

translate <- c("E" =  2, "P" =  1, "0" =  0, "1" = -1, "2" = -2, "3" = -3,
               "4" = -4, "S" = -2, "C" = -3, "T" = -4, "L" = -5, "A" = -Inf)
# Mateo
mateo_translate <- translate; mateo_translate["E"] <- Inf
mateo_no_olive <- test_sim(translate = mateo_translate)
mateo_olive <- olive_sim(translate = mateo_translate)
plot(stepfun((-10):3,
             rev(c(0, est_success_table_gen(mateo_no_olive)))),
     verticals = FALSE,
     pch = 20, main = "Mateo's Probabilities", ylim = c(0, 1))
lines(stepfun((-10):3,
              rev(c(0, est_success_table_gen(mateo_olive)))),
      verticals = FALSE, col = "blue", pch = 20)

Without Olive:
3   0.049
2   0.049
1   0.101
0   0.226
-1  0.393
-2  0.627
-3  0.745
-4  0.862
-5  0.928
-6  0.928
-7  0.928
-8  0.928
-9  0.928
-10 0.928
With Olive:
3   0.177
2   0.177
1   0.218
0   0.278
-1  0.444
-2  0.554
-3  0.746
-4  0.842
-5  0.924
-6  0.972
-7  0.989
-8  0.997
-9  1.000
-10 1.000

Next up is Jim Culver.

# Culver
culver_translate <- translate; culver_translate["S"] <- 0
culver_no_olive <- test_sim(translate = culver_translate)
culver_olive <- olive_sim(translate = culver_translate)
plot(stepfun((-10):3,
             rev(c(0, est_success_table_gen(culver_no_olive)))),
     verticals = FALSE,
     pch = 20, main = "Culver's Probabilities", ylim = c(0, 1))
lines(stepfun((-10):3,
              rev(c(0, est_success_table_gen(culver_olive)))),
      verticals = FALSE, col = "blue", pch = 20)

Without Olive:
3   0.000
2   0.065
1   0.112
0   0.333
-1  0.530
-2  0.638
-3  0.761
-4  0.874
-5  0.936
-6  0.936
-7  0.936
-8  0.936
-9  0.936
-10 0.936
With Olive:
3   0.020
2   0.095
1   0.217
0   0.362
-1  0.564
-2  0.666
-3  0.803
-4  0.888
-5  0.951
-6  0.972
-7  0.992
-8  0.999
-9  1.000
-10 1.000

Olive helps these investigators succeed at skill tests more easily, especially Mateo. And we haven't even taken into account the fact that good things happen when certain tokens appear for these investigators! Sealing tokens could also have a big impact on the distribution of the bag when combined with Olive McBride.

Again, there's a lot that could be touched on that I won't here, so I will share with you a script allowing you to do some of these analyses yourself.

# AHOliveDistributionEstimator.R
# Curtis Miller
# 2018-08-03
# Simulates the chaos bag distribution when applying Olive McBride

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  stop("The optparse package is required: install.packages(\"optparse\")")
}

olive_sim <- function(translate = c("E" =  2, "P" =  1, "0" =  0, "1" = -1,
                                    "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                    "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                      N = 1000, dicebag = NULL) {
  dargs <- list(n = 3, replace = FALSE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <- replicate(N, {, dargs)})
  vals <- sapply(pulls, function(p) {
    vecp <- strsplit(p, "")[[1]]
    vecnum <- translate[vecp]
    max(apply(combn(3, 2), 2, function(i) {sum(vecnum[i])}))
  })

  vals[which(is.nan(vals))] <- Inf
  vals
}

test_sim <- function(translate = c("E" =  2, "P" =  1, "0" =  0, "1" = -1,
                                   "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                   "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                     N = 1000, dicebag = NULL) {
  dargs <- list(n = N, replace = TRUE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <-, dargs)
  vecp <- strsplit(pulls, "")[[1]]
  vecnum <- translate[vecp]
  vecnum
}

est_success_table_gen <- function(res, values = (-10):3) {
  r <- sapply(values, function(v) {mean(res >= v)})
  names(r) <- values
  r
}

# Main function
# See definition of cl_args for functionality (help does nothing)
main <- function(dicebag = "", translate = "", replications = 10000,
                 plotfile = "", width = 6, height = 4, basic = FALSE,
                 oliveless = FALSE, lowest = -10, highest = 3,
                 chaos_bag_script = "AHChaosBagSimulator.R",
                 title = "Probability of Success",
                 pos = "topright", help = FALSE) {

  source(chaos_bag_script)  # Provides chaos_bag_string_sim()

  if (dicebag != "") {
    dicebag_df <- read.csv(dicebag, stringsAsFactors = FALSE)
    dicebag <- as.numeric(dicebag_df$freq)
    names(dicebag) <- dicebag_df$token
  } else {
    dicebag <- NULL
  }

  if (translate != "") {
    translate_df <- read.csv(translate, stringsAsFactors = FALSE)
    translate <- as.numeric(translate_df$mod)
    names(translate) <- translate_df$token
  } else {
    translate <- c("E" =  2, "P" =  1, "0" =  0, "1" = -1,
                   "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                   "C" = -3, "T" = -4, "L" = -5, "A" = -Inf)
  }

  if (any(names(translate) != names(dicebag))) {
    stop("Name mismatch between translate and dicebag; check the token columns")
  }

  if (!oliveless) {
    olive_res <- olive_sim(translate = translate, dicebag = dicebag,
                           N = replications)
    cat("Table of success rate with Olive when at least X is needed:\n")
    print(as.matrix(est_success_table_gen(olive_res, values = lowest:highest)))
  }

  if (basic) {
    basic_res <- test_sim(translate = translate, dicebag = dicebag,
                          N = replications)
    cat("Table of success rate for basic test when at least X is needed:\n")
    print(as.matrix(est_success_table_gen(basic_res, values = lowest:highest)))
  }

  if (plotfile != "") {
    png(plotfile, width = width, height = height, units = "in", res = 300)
    if (basic) {
      plot(stepfun(lowest:highest,
                   rev(c(0, est_success_table_gen(basic_res,
                                                  values = lowest:highest)))),
           verticals = FALSE, pch = 20, main = title,
           ylim = c(0, 1))
      if (!oliveless) {
        lines(stepfun(lowest:highest,
                      rev(c(0, est_success_table_gen(olive_res,
                                                  values = lowest:highest)))),
              verticals = FALSE, pch = 20, col = "blue")
        legend(pos, legend = c("No Olive", "Olive"), col = c("black", "blue"),
               lty = 1, pch = 20)
      }
    }
  }
}

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
        description = "Estimate the chaos bag distribution with Olive McBride.",
        option_list = list(
          make_option(c("--dicebag", "-d"), type = "character", default = "",
                      help = "(Optional) dice bag distribution CSV file"),
          make_option(c("--translate", "-t"), type = "character", default = "",
                      help = "(Optional) symbol numeric translation CSV file"),
          make_option(c("--replications", "-r"), type = "integer",
                      default = 10000,
                      help = "Number of replications to perform"),
          make_option(c("--plotfile", "-p"), type = "character", default = "",
                      help = paste("Where to save plot (if not set, no plot",
                                   "made; -b/--basic must be set)")),
          make_option(c("--width", "-w"), type = "integer", default = 6,
                      help = "Width of plot (inches)"),
          make_option(c("--height", "-H"), type = "integer", default = 4,
                      help = "Height of plot (inches)"),
          make_option(c("--basic", "-b"), type = "logical",
                      action = "store_true", default = FALSE,
                      help = "Include results for test without Olive McBride"),
          make_option(c("--oliveless", "-o"), type = "logical",
                      action = "store_true", default = FALSE,
                      help = "Don't include results using Olive McBride"),
          make_option(c("--lowest", "-l"), type = "integer", default = -10,
                      help = "Lowest value to check"),
          make_option(c("--highest", "-i"), type = "integer", default = 3,
                      help = "Highest value to check"),
          make_option(c("--pos", "-s"), type = "character",
                      default = "topright", help = "Position of legend"),
          make_option(c("--title", "-m"), type = "character",
                      default = "Probability of Success",
                      help = "Title of plot"),
          make_option(c("--chaos-bag-script", "-c"), type = "character",
                      default = "AHChaosBagSimulator.R",
                      help = paste("Location of the R file containing",
                                   "chaos_bag_string_sim() definition; by",
                                   "default, assumed to be in the working",
                                   "directory in AHChaosBagSimulator.R"))
        )))

  names(cl_args)[which(
    names(cl_args) == "chaos-bag-script")] <- "chaos_bag_script"
  do.call(main, cl_args)
}

You've already seen an example CSV file for defining the dice bag; here's a file for defining what each token is worth.


Make the script executable and get help like so:

$ chmod +x AHOliveDistributionEstimator.R
$ ./AHOliveDistributionEstimator.R -h


If you've never heard of this game I love, hopefully you've heard of it now. Give the game a look! And if you have heard of this game before, I hope you learned something from this post. If you're a reviewer, perhaps I've given you some tools you could use to help evaluate some of Arkham Horror's more intractable cards.

My final thoughts on Olive: she's not going to displace Arcane Initiate's spot as the best Mystic ally, but she could do well in certain decks. Specifically, I think that Mateo decks and Jim Culver decks planning on running Song of the Dead will want to run her since there are specific chaos tokens those decks want to see; the benefits of Olive extend beyond changing the probability of success. Most of the time you will not want to make a hail Mary skill test and you won't have the cards to push your skill value to a point where anything but the auto-fail is a success, so most of the time Olive will hurt your chances rather than help you, if she has any effect at all. Thus a typical Mystic likely won't find Olive interesting… but some decks will love her.

By the way, if you are in the Salt Lake City area of Utah, I play Arkham Horror LCG at Mind Games, LLC. While I haven't been to many other stores, Mind Games seems to have the best stocking of Arkham Horror (as well as two other Fantasy Flight LCGs, A Game of Thrones and Legend of the Five Rings). Every Tuesday night (the store's late night) a group of card game players come in to play; consider joining us! In addition, Mind Games likely has the best deal when it comes to buying the game; whenever you spend $50 on product, you get an additional $10 off (or, alternatively, you take $10 off for every $60 you spend). Thus Mind Games could be the cheapest way to get the game (without going second-hand or Internet deal hunting).

(Big image credit: Aurore Folney and FFG.)

EDIT: My blogging platform does not like R code and garbled some of the code in the script for the Olive simulation. It should be correct now. If anyone spots other errors, please notify me; I will fix them.

I have created a video course published by Packt Publishing entitled Applications of Statistical Learning with Python, the fourth volume in a four-volume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course discusses how to use Python for data science, emphasizing application. It starts with introducing natural language processing and computer vision. It ends with two case studies; in one, I train a classifier to detect spam e-mail, while in the other, I train a computer vision system to detect emotions in faces. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.

  1. You can also play scenarios stand-alone, which isn’t nearly as fun as playing in a campaign, especially one of the eight-scenario campaigns. ↩
  2. This is not right because sometimes you redraw tokens from the bag without replacement; we'll ignore that case for now. ↩
  3. Using the asymptotically appropriate confidence interval based on the Normal distribution (as described here), the margin of error will not exceed \frac{1}{\sqrt{n}} since \sqrt{\hat{p}(1 - \hat{p})} \leq \frac{1}{2}. Thus, we have m \leq \frac{1}{\sqrt{n}}; solving this for n yields n \geq m^{-2}. So if we want a 95% chance that the margin of error will not exceed m = 0.01 = \frac{1}{100}, we need a dataset of size n \geq (0.01)^{-2} = 10,000. ↩
  4. Note that the earlier calculations that argued the odds of being accurate up to two decimal places were high no longer hold, since many more numbers are being estimated, and one estimate is not independent of the other. That said, the general principle of more simulations leading to better accuracy still holds. ↩
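The sample-size bound in footnote 3 can be double-checked with a few lines of Python (a quick arithmetic check, not part of the original script):

```python
import math

def replications_needed(margin):
    """Smallest n guaranteeing margin of error <= margin, using the
    footnote's bound m <= 1/sqrt(n), i.e. n >= m**-2."""
    n = int(1 / margin ** 2)
    # guard against floating-point slop in the division above
    while 1 / math.sqrt(n) > margin:
        n += 1
    return n

print(replications_needed(0.01))   # 10000
```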

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.


OpenCV People Counter

In this tutorial you will learn how to build a “people counter” with OpenCV and Python. Using OpenCV, we’ll count the number of people who are heading “in” or “out” of a department store in real-time.

Building a person counter with OpenCV has been one of the most-requested topics here on the PyImageSearch blog, and I’ve been meaning to do a blog post on people counting for a year now — I’m incredibly thrilled to be publishing it and sharing it with you today.

Enjoy the tutorial and let me know what you think in the comments section at the bottom of the post!

To get started building a people counter with OpenCV, just keep reading!

Looking for the source code to this post?
Jump right to the downloads section.

OpenCV People Counter with Python

In the first part of today’s blog post, we’ll be discussing the required Python packages you’ll need to build our people counter.

From there I’ll provide a brief discussion on the difference between object detection and object tracking, along with how we can leverage both to create a more accurate people counter.

Afterwards, we’ll review the directory structure for the project and then implement the entire person counting project.

Finally, we’ll examine the results of applying people counting with OpenCV to actual videos.

Required Python libraries for people counting

In order to build our people counting application, we’ll need a number of different Python libraries, including:

  • NumPy
  • OpenCV
  • dlib
  • imutils

Additionally, you’ll also want to use the “Downloads” section of this blog post to download my source code which includes:

  1. My special pyimagesearch module which we’ll implement and use later in this post
  2. The Python driver script used to start the people counter
  3. All example videos used here in the post

I’m going to assume you already have NumPy, OpenCV, and dlib installed on your system.

If you don’t have OpenCV installed, you’ll want to head to my OpenCV install page and follow the relevant tutorial for your particular operating system.

If you need to install dlib, you can use this guide.

Finally, you can install/upgrade your imutils via the following command:

$ pip install --upgrade imutils

Understanding object detection vs. object tracking

There is a fundamental difference between object detection and object tracking that you must understand before we proceed with the rest of this tutorial.

When we apply object detection we are determining where in an image/frame an object is. An object detector is also typically more computationally expensive, and therefore slower, than an object tracking algorithm. Examples of object detection algorithms include Haar cascades, HOG + Linear SVM, and deep learning-based object detectors such as Faster R-CNNs, YOLO, and Single Shot Detectors (SSDs).

An object tracker, on the other hand, will accept the input (x, y)-coordinates of where an object is in an image and will:

  1. Assign a unique ID to that particular object
  2. Track the object as it moves around a video stream, predicting the new object location in the next frame based on various attributes of the frame (gradient, optical flow, etc.)

Examples of object tracking algorithms include MedianFlow, MOSSE, GOTURN, kernalized correlation filters, and discriminative correlation filters, to name a few.

If you’re interested in learning more about the object tracking algorithms built into OpenCV, be sure to refer to this blog post.

Combining both object detection and object tracking

Highly accurate object trackers will combine the concept of object detection and object tracking into a single algorithm, typically divided into two phases:

  • Phase 1 — Detecting: During the detection phase we are running our computationally more expensive object tracker to (1) detect if new objects have entered our view, and (2) see if we can find objects that were “lost” during the tracking phase. For each detected object we create or update an object tracker with the new bounding box coordinates. Since our object detector is more computationally expensive we only run this phase once every N frames.
  • Phase 2 — Tracking: When we are not in the “detecting” phase we are in the “tracking” phase. For each of our detected objects, we create an object tracker to track the object as it moves around the frame. Our object tracker should be faster and more efficient than the object detector. We’ll continue tracking until we’ve reached the N-th frame and then re-run our object detector. The entire process then repeats.

The benefit of this hybrid approach is that we can apply highly accurate object detection methods without as much of the computational burden. We will be implementing such a tracking system to build our people counter.
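The detect-every-N-frames scheduling described above can be sketched in a few lines (a minimal illustration, not the post’s actual script; `detect` and `track` are hypothetical stand-ins for the SSD detector and the dlib trackers):

```python
def process_stream(frames, detect, track, skip_frames=30):
    """Hybrid pipeline: run the expensive detector every `skip_frames`
    frames, and the cheap per-object tracker on every frame in between."""
    trackers = []
    phases = []
    for frame_num, frame in enumerate(frames):
        if frame_num % skip_frames == 0:
            # Phase 1 -- Detecting: refresh the set of trackers
            trackers = detect(frame)
            phases.append("Detecting")
        else:
            # Phase 2 -- Tracking: update each existing tracker
            trackers = [track(t, frame) for t in trackers]
            phases.append("Tracking")
    return phases
```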

Project structure

Let’s review the project structure for today’s blog post. Once you’ve grabbed the code from the “Downloads” section, you can inspect the directory structure with the tree command:


$ tree --dirsfirst
.
├── pyimagesearch
│   ├── __init__.py
│   ├── centroidtracker.py
│   └── trackableobject.py
├── mobilenet_ssd
│   ├── MobileNetSSD_deploy.caffemodel
│   └── MobileNetSSD_deploy.prototxt
├── videos
│   ├── example_01.mp4
│   └── example_02.mp4
├── output
│   ├── output_01.avi
│   └── output_02.avi
└── people_counter.py

4 directories, 10 files

Zeroing in on the two most important directories, we have:

  1. pyimagesearch/ : This module contains the centroid tracking algorithm. The centroid tracking algorithm is covered in the “Combining object tracking algorithms” section, but the code is not. For a review of the centroid tracking code (centroidtracker.py) you should refer to the first post in the series.
  2. mobilenet_ssd/ : Contains the Caffe deep learning model files. We’ll be using a MobileNet Single Shot Detector (SSD) which is covered at the top of this blog post in the section, “Single Shot Detectors for object detection”.

The heart of today’s project is contained within the people_counter.py script — that’s where we’ll spend most of our time. We’ll also review the trackableobject.py script today.

Combining object tracking algorithms

Figure 1: An animation demonstrating the steps in the centroid tracking algorithm.

To implement our people counter we’ll be using both OpenCV and dlib. We’ll use OpenCV for standard computer vision/image processing functions, along with the deep learning object detector for people counting.

We’ll then use dlib for its implementation of correlation filters. We could use OpenCV here as well; however, the dlib object tracking implementation was a bit easier to work with for this project.

I’ll be including a deep dive into dlib’s object tracking algorithm in next week’s post.

Along with dlib’s object tracking implementation, we’ll also be using my implementation of centroid tracking from a few weeks ago. Reviewing the entire centroid tracking algorithm is outside the scope of this blog post, but I’ve included a brief overview below.

At Step #1 we accept a set of bounding boxes and compute their corresponding centroids (i.e., the center of the bounding boxes):

Figure 2: To build a simple object tracking via centroids script with Python, the first step is to accept bounding box coordinates and use them to compute centroids.

The bounding boxes themselves can be provided by either:

  1. An object detector (such as HOG + Linear SVM, Faster R-CNN, SSDs, etc.)
  2. Or an object tracker (such as correlation filters)

In the above image you can see that we have two objects to track in this initial iteration of the algorithm.

During Step #2 we compute the Euclidean distance between any new centroids (yellow) and existing centroids (purple):

Figure 3: Three objects are present in this image. We need to compute the Euclidean distance between each pair of original centroids (red) and new centroids (green).

The centroid tracking algorithm makes the assumption that pairs of centroids with minimum Euclidean distance between them must be the same object ID.

In the example image above we have two existing centroids (purple) and three new centroids (yellow), implying that a new object has been detected (since there is one more new centroid vs. old centroid).

The arrows then represent computing the Euclidean distances between all purple centroids and all yellow centroids.
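Computing that distance matrix, together with the minimum-distance pairing of the next step, can be sketched as follows (a simplified illustration; the real CentroidTracker implementation handles more edge cases):

```python
import numpy as np

def match_centroids(existing, new):
    """Greedily pair each existing centroid with its nearest unused
    new centroid. Returns a dict mapping existing index -> new index;
    new centroids absent from the result would be registered as new
    objects. (Simplified sketch, not the actual CentroidTracker.)"""
    existing = np.asarray(existing, dtype=float)
    new = np.asarray(new, dtype=float)
    # D[i, j] = Euclidean distance from existing centroid i to new centroid j
    D = np.linalg.norm(existing[:, None, :] - new[None, :, :], axis=2)
    matches, used = {}, set()
    # visit existing centroids in order of their smallest available distance
    for i in D.min(axis=1).argsort():
        for j in D[i].argsort():
            if int(j) not in used:
                matches[int(i)] = int(j)
                used.add(int(j))
                break
    return matches
```

With two existing centroids and three new ones (as in the figures), one new centroid is left unmatched, exactly the situation Step #4 resolves by registering a new object.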

Once we have the Euclidean distances we attempt to associate object IDs in Step #3:

Figure 4: Our simple centroid object tracking method has associated objects with minimized object distances. What do we do about the object in the bottom-left though?

In Figure 4 you can see that our centroid tracker has chosen to associate centroids that minimize their respective Euclidean distances.

But what about the point in the bottom-left?

It didn’t get associated with anything — what do we do?

To answer that question we need to perform Step #4, registering new objects:

Figure 5: In our object tracking example, we have a new object that wasn’t matched with an existing object, so it is registered as object ID #3.

Registering simply means that we are adding the new object to our list of tracked objects by:

  1. Assigning it a new object ID
  2. Storing the centroid of the bounding box coordinates for the new object

In the event that an object has been lost or has left the field of view, we can simply deregister the object (Step #5).

Exactly how you handle when an object is “lost” or is “no longer visible” really depends on your exact application, but for our people counter, we will deregister people IDs when they cannot be matched to any existing person objects for 40 consecutive frames.
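The deregistration bookkeeping can be sketched like this (a simplified illustration; `disappeared` and `matched_ids` are hypothetical names, not the CentroidTracker’s actual attributes):

```python
def update_disappeared(disappeared, matched_ids, max_disappeared=40):
    """`disappeared` maps object ID -> consecutive frames without a match;
    any ID unmatched for more than max_disappeared frames is deregistered.
    Returns the list of IDs deregistered this frame."""
    deregistered = []
    for object_id in list(disappeared):
        if object_id in matched_ids:
            disappeared[object_id] = 0        # matched again: reset counter
        else:
            disappeared[object_id] += 1       # missed another frame
            if disappeared[object_id] > max_disappeared:
                del disappeared[object_id]    # Step #5: deregister
                deregistered.append(object_id)
    return deregistered
```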

Again, this is only a brief overview of the centroid tracking algorithm.

Note: For a more detailed review, including an explanation of the source code used to implement centroid tracking, be sure to refer to this post.

Creating a “trackable object”

In order to track and count an object in a video stream, we need an easy way to store information regarding the object itself, including:

  • Its object ID
  • Its previous centroids (so we can easily compute the direction the object is moving)
  • Whether or not the object has already been counted

To accomplish all of these goals we can define an instance of TrackableObject — open up the trackableobject.py file and insert the following code:

class TrackableObject:
	def __init__(self, objectID, centroid):
		# store the object ID, then initialize a list of centroids
		# using the current centroid
		self.objectID = objectID
		self.centroids = [centroid]

		# initialize a boolean used to indicate if the object has
		# already been counted or not
		self.counted = False


The TrackableObject constructor accepts an objectID plus an initial centroid and stores them. The centroids variable is a list because it will contain an object’s centroid location history.

The constructor also initializes counted as False, indicating that the object has not been counted yet.
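As a quick standalone check of how the class accumulates state (the class is repeated here so the snippet runs on its own; the centroid values are made up):

```python
# TrackableObject repeated from above so the snippet is self-contained
class TrackableObject:
    def __init__(self, objectID, centroid):
        self.objectID = objectID
        self.centroids = [centroid]   # centroid location history
        self.counted = False          # not counted yet

to = TrackableObject(1, (200, 150))
to.centroids.append((202, 143))   # hypothetical: the person drifted upward
to.centroids.append((201, 137))

print(to.objectID, len(to.centroids), to.counted)   # 1 3 False
```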

Implementing our people counter with OpenCV + Python

With all of our supporting Python helper tools and classes in place, we are now ready to build our OpenCV people counter.

Open up your people_counter.py file and insert the following code:

# import the necessary packages
from pyimagesearch.centroidtracker import CentroidTracker
from pyimagesearch.trackableobject import TrackableObject
from imutils.video import VideoStream
from imutils.video import FPS
import numpy as np
import argparse
import imutils
import time
import dlib
import cv2

We begin by importing our necessary packages:

  • From the pyimagesearch module, we import our custom CentroidTracker and TrackableObject classes.
  • The VideoStream and FPS modules from imutils.video will help us to work with a webcam and to calculate the estimated Frames Per Second (FPS) throughput rate.
  • We need imutils for its OpenCV convenience functions.
  • The dlib library will be used for its correlation tracker implementation.
  • OpenCV will be used for deep neural network inference, opening video files, writing video files, and displaying output frames to our screen.

Now that all of the tools are at our fingertips, let’s parse command line arguments:

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-p", "--prototxt", required=True,
	help="path to Caffe 'deploy' prototxt file")
ap.add_argument("-m", "--model", required=True,
	help="path to Caffe pre-trained model")
ap.add_argument("-i", "--input", type=str,
	help="path to optional input video file")
ap.add_argument("-o", "--output", type=str,
	help="path to optional output video file")
ap.add_argument("-c", "--confidence", type=float, default=0.4,
	help="minimum probability to filter weak detections")
ap.add_argument("-s", "--skip-frames", type=int, default=30,
	help="# of skip frames between detections")
args = vars(ap.parse_args())

We have six command line arguments which allow us to pass information to our people counter script from the terminal at runtime:

  • --prototxt : Path to the Caffe “deploy” prototxt file.
  • --model : The path to the Caffe pre-trained CNN model.
  • --input : Optional input video file path. If no path is specified, your webcam will be utilized.
  • --output : Optional output video path. If no path is specified, a video will not be recorded.
  • --confidence : With a default value of 0.4, this is the minimum probability threshold which helps to filter out weak detections.
  • --skip-frames : The number of frames to skip before running our DNN detector again on the tracked object. Remember, object detection is computationally expensive, but it does help our tracker to reassess objects in the frame. By default we skip 30 frames between detecting objects with the OpenCV DNN module and our CNN single shot detector model.

Now that our script can dynamically handle command line arguments at runtime, let’s prepare our SSD:

# initialize the list of class labels MobileNet SSD was trained to
# detect
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
	"bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
	"dog", "horse", "motorbike", "person", "pottedplant", "sheep",
	"sofa", "train", "tvmonitor"]

# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])

First, we’ll initialize CLASSES — the list of classes that our SSD supports. This list should not be changed if you’re using the model provided in the “Downloads” section. We’re only interested in the “person” class, but you could count other moving objects as well (however, if your “pottedplant”, “sofa”, or “tvmonitor” grows legs and starts moving, you should probably run out of your house screaming rather than worrying about counting them! 😋 ).

On Line 38 we load our pre-trained MobileNet SSD used to detect objects (but again, we’re just interested in detecting and tracking people, not any other class). To learn more about MobileNet and SSDs, please refer to my previous blog post.

From there we can initialize our video stream:

# if a video path was not supplied, grab a reference to the webcam
if not args.get("input", False):
	print("[INFO] starting video stream...")
	vs = VideoStream(src=0).start()

# otherwise, grab a reference to the video file
else:
	print("[INFO] opening video file...")
	vs = cv2.VideoCapture(args["input"])

First we handle the case where we’re using a webcam video stream (Lines 41-44). Otherwise, we’ll be capturing frames from a video file (Lines 47-49).

We still have a handful of initializations to perform before we begin looping over frames:

# initialize the video writer (we'll instantiate later if need be)
writer = None

# initialize the frame dimensions (we'll set them as soon as we read
# the first frame from the video)
W = None
H = None

# instantiate our centroid tracker, then initialize a list to store
# each of our dlib correlation trackers, followed by a dictionary to
# map each unique object ID to a TrackableObject
ct = CentroidTracker(maxDisappeared=40, maxDistance=50)
trackers = []
trackableObjects = {}

# initialize the total number of frames processed thus far, along
# with the total number of objects that have moved either up or down
totalFrames = 0
totalDown = 0
totalUp = 0

# start the frames per second throughput estimator
fps = FPS().start()

The remaining initializations include:

  • writer : Our video writer. We’ll instantiate this object later if we are writing to video.
  • W and H : Our frame dimensions. We’ll need to plug these into cv2.VideoWriter.
  • ct : Our CentroidTracker. For details on the implementation of CentroidTracker, be sure to refer to my blog post from a few weeks ago.
  • trackers : A list to store the dlib correlation trackers. To learn about dlib correlation tracking, stay tuned for next week’s post.
  • trackableObjects : A dictionary which maps an objectID to a TrackableObject.
  • totalFrames : The total number of frames processed thus far.
  • totalDown and totalUp : The total number of objects/people that have moved either down or up. These variables measure the actual “people counting” results of the script.
  • fps : Our frames per second estimator for benchmarking.

Note: If you get lost in the while loop below, you should refer back to this bulleted listing of important variables.

Now that all of our initializations are taken care of, let’s loop over incoming frames:

# loop over frames from the video stream
while True:
	# grab the next frame and handle if we are reading from either
	# VideoCapture or VideoStream
	frame = vs.read()
	frame = frame[1] if args.get("input", False) else frame

	# if we are viewing a video and we did not grab a frame then we
	# have reached the end of the video
	if args["input"] is not None and frame is None:
		break

	# resize the frame to have a maximum width of 500 pixels (the
	# less data we have, the faster we can process it), then convert
	# the frame from BGR to RGB for dlib
	frame = imutils.resize(frame, width=500)
	rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

	# if the frame dimensions are empty, set them
	if W is None or H is None:
		(H, W) = frame.shape[:2]

	# if we are supposed to be writing a video to disk, initialize
	# the writer
	if args["output"] is not None and writer is None:
		fourcc = cv2.VideoWriter_fourcc(*"MJPG")
		writer = cv2.VideoWriter(args["output"], fourcc, 30,
			(W, H), True)

We begin looping on Line 76. At the top of the loop we grab the next frame (Lines 79 and 80). In the event that we’ve reached the end of the video, we’ll break out of the loop (Lines 84 and 85).

Preprocessing the frame takes place on Lines 90 and 91. This includes resizing and swapping color channels, as dlib requires an RGB image.

We grab the dimensions of the frame for the video writer (Lines 94 and 95).

From there we’ll instantiate the video writer if an output path was provided via command line argument (Lines 99-102). To learn more about writing video to disk, be sure to refer to this post.

Now let’s detect people using the SSD:

	# initialize the current status along with our list of bounding
	# box rectangles returned by either (1) our object detector or
	# (2) the correlation trackers
	status = "Waiting"
	rects = []

	# check to see if we should run a more computationally expensive
	# object detection method to aid our tracker
	if totalFrames % args["skip_frames"] == 0:
		# set the status and initialize our new set of object trackers
		status = "Detecting"
		trackers = []

		# convert the frame to a blob and pass the blob through the
		# network and obtain the detections
		blob = cv2.dnn.blobFromImage(frame, 0.007843, (W, H), 127.5)
		net.setInput(blob)
		detections = net.forward()

We initialize a status as “Waiting” on Line 107. Possible status states include:

  • Waiting: In this state, we’re waiting on people to be detected and tracked.
  • Detecting: We’re actively in the process of detecting people using the MobileNet SSD.
  • Tracking: People are being tracked in the frame and we’re counting the totals as they move up or down.

Our rects list will be populated either via detection or tracking. We go ahead and initialize rects on Line 108.

It’s important to understand that deep learning object detectors are very computationally expensive, especially if you are running them on your CPU.

To avoid running our object detector on every frame, and to speed up our tracking pipeline, we’ll be skipping frames (the interval is set by the --skip-frames command line argument; 30 is the default). Only every N frames will we exercise our SSD for object detection. Otherwise, we’ll simply be tracking moving objects in-between.

Using the modulo operator on Line 112 we ensure that we’ll only execute the code in the if-statement every N frames.

Assuming we’ve landed on a multiple of skip_frames, we’ll update the status to “Detecting” (Line 114).

Then we initialize our new list of trackers (Line 115).

Next, we’ll perform inference via object detection. We begin by creating a blob from the image, followed by passing the blob through the net to obtain detections (Lines 119-121).

Now we’ll loop over each of the detections in hopes of finding objects belonging to the “person” class:

		# loop over the detections
		for i in np.arange(0, detections.shape[2]):
			# extract the confidence (i.e., probability) associated
			# with the prediction
			confidence = detections[0, 0, i, 2]

			# filter out weak detections by requiring a minimum
			# confidence
			if confidence > args["confidence"]:
				# extract the index of the class label from the
				# detections list
				idx = int(detections[0, 0, i, 1])

				# if the class label is not a person, ignore it
				if CLASSES[idx] != "person":
					continue

Looping over detections on Line 124, we proceed to grab the confidence (Line 127) and filter out weak results plus those that don’t belong to the “person” class (Lines 131-138).

Now we can compute a bounding box for each person and begin correlation tracking:

				# compute the (x, y)-coordinates of the bounding box
				# for the object
				box = detections[0, 0, i, 3:7] * np.array([W, H, W, H])
				(startX, startY, endX, endY) = box.astype("int")

				# construct a dlib rectangle object from the bounding
				# box coordinates and then start the dlib correlation
				# tracker
				tracker = dlib.correlation_tracker()
				rect = dlib.rectangle(startX, startY, endX, endY)
				tracker.start_track(rgb, rect)

				# add the tracker to our list of trackers so we can
				# utilize it during skip frames
				trackers.append(tracker)
Computing our bounding box takes place on Lines 142 and 143.

Then we instantiate our dlib correlation tracker on Line 148, followed by passing in the object’s bounding box coordinates to dlib.rectangle, storing the result as rect (Line 149).

Subsequently, we start tracking on Line 150 and append the tracker to the trackers list on Line 154.

That’s a wrap for all operations we do every N skip-frames!

Let’s take care of the typical operations where tracking is taking place in the else block:
	# otherwise, we should utilize our object *trackers* rather than
	# object *detectors* to obtain a higher frame processing throughput
	else:
		# loop over the trackers
		for tracker in trackers:
			# set the status of our system to be 'tracking' rather
			# than 'waiting' or 'detecting'
			status = "Tracking"

			# update the tracker and grab the updated position
			tracker.update(rgb)
			pos = tracker.get_position()

			# unpack the position object
			startX = int(pos.left())
			startY = int(pos.top())
			endX = int(pos.right())
			endY = int(pos.bottom())

			# add the bounding box coordinates to the rectangles list
			rects.append((startX, startY, endX, endY))

Most of the time, we aren’t landing on a skip-frame multiple. During this time, we’ll utilize our trackers to track our object rather than applying detection.

We begin looping over the available trackers on Line 160.

We proceed to update the status to “Tracking” (Line 163) and grab the object position (Lines 166 and 167).

From there we extract the position coordinates (Lines 170-173) followed by populating the information in our rects list (Line 176).
Now let’s draw a horizontal visualization line (that people must cross in order to be tracked) and use the centroid tracker to update our object centroids:

	# draw a horizontal line in the center of the frame -- once an
	# object crosses this line we will determine whether they were
	# moving 'up' or 'down'
	cv2.line(frame, (0, H // 2), (W, H // 2), (0, 255, 255), 2)

	# use the centroid tracker to associate the (1) old object
	# centroids with (2) the newly computed object centroids
	objects = ct.update(rects)

On Line 181 we draw the horizontal line which we’ll be using to visualize people “crossing”; once people cross this line we’ll increment our respective counters.

Then on Line 185, we utilize our CentroidTracker instantiation to accept the list of rects, regardless of whether they were generated via object detection or object tracking. Our centroid tracker will associate object IDs with object locations.

In this next block, we’ll review the logic which counts if a person has moved up or down through the frame:

	# loop over the tracked objects
	for (objectID, centroid) in objects.items():
		# check to see if a trackable object exists for the current
		# object ID
		to = trackableObjects.get(objectID, None)

		# if there is no existing trackable object, create one
		if to is None:
			to = TrackableObject(objectID, centroid)

		# otherwise, there is a trackable object so we can utilize it
		# to determine direction
		else:
			# the difference between the y-coordinate of the *current*
			# centroid and the mean of *previous* centroids will tell
			# us in which direction the object is moving (negative for
			# 'up' and positive for 'down')
			y = [c[1] for c in to.centroids]
			direction = centroid[1] - np.mean(y)

			# check to see if the object has been counted or not
			if not to.counted:
				# if the direction is negative (indicating the object
				# is moving up) AND the centroid is above the center
				# line, count the object
				if direction < 0 and centroid[1] < H // 2:
					totalUp += 1
					to.counted = True

				# if the direction is positive (indicating the object
				# is moving down) AND the centroid is below the
				# center line, count the object
				elif direction > 0 and centroid[1] > H // 2:
					totalDown += 1
					to.counted = True

		# store the trackable object in our dictionary
		trackableObjects[objectID] = to

We begin by looping over the updated bounding box coordinates of the object IDs (Line 188).

On Line 191 we attempt to fetch a TrackableObject for the current objectID.

If the TrackableObject doesn’t exist for the objectID, we create one (Lines 194 and 195).

Otherwise, there is already an existing TrackableObject, so we need to figure out if the object (person) is moving up or down.

To do so, we grab the y-coordinate value for all previous centroid locations for the given object (Line 204). Then we compute the direction by taking the difference between the current centroid location and the mean of all previous centroid locations (Line 205).

The reason we take the mean is to ensure our direction tracking is more stable. If we stored just the previous centroid location for the person we leave ourselves open to the possibility of false direction counting. Keep in mind that object detection and object tracking algorithms are not “magic” — sometimes they will predict bounding boxes that may be slightly off what you may expect; therefore, by taking the mean, we can make our people counter more accurate.
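The stabilizing effect of the mean can be shown with a few numbers. In this illustrative sketch the most recent detection is a noisy outlier, so comparing against it alone flips the sign, while comparing against the mean of the history still reports the true upward motion:

```python
import numpy as np

# previous centroid y-coordinates; the last one (178) is a noisy outlier
history = [200, 198, 202, 178]
current_y = 185  # the person has actually been drifting upward

vs_last = current_y - history[-1]        # +7  -> wrongly suggests "down"
vs_mean = current_y - np.mean(history)   # -9.5 -> correctly suggests "up"
print(vs_last, vs_mean)
```

A single bad bounding box no longer determines the sign of the direction test.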

If the TrackableObject has not been counted (Line 209), we need to determine if it’s ready to be counted yet (Lines 213-222), by:

  1. Checking if the direction is negative (indicating the object is moving up) AND the centroid is above the centerline. In this case we increment totalUp.
  2. Or checking if the direction is positive (indicating the object is moving down) AND the centroid is below the centerline. If this is true, we increment totalDown.

Finally, we store the TrackableObject in our trackableObjects dictionary (Line 225) so we can grab and update it when the next frame is captured.

We’re on the home-stretch!

The next few code blocks handle:

  1. Display (drawing and writing text to the frame)
  2. Writing frames to a video file on disk (if the --output command line argument is present)
  3. Capturing keypresses
  4. Cleanup

First we’ll draw some information on the frame for visualization:

		# draw both the ID of the object and the centroid of the
		# object on the output frame
		text = "ID {}".format(objectID)
		cv2.putText(frame, text, (centroid[0] - 10, centroid[1] - 10),
			cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
		cv2.circle(frame, (centroid[0], centroid[1]), 4, (0, 255, 0), -1)

	# construct a tuple of information we will be displaying on the
	# frame
	info = [
		("Up", totalUp),
		("Down", totalDown),
		("Status", status),
	]

	# loop over the info tuples and draw them on our frame
	for (i, (k, v)) in enumerate(info):
		text = "{}: {}".format(k, v)
		cv2.putText(frame, text, (10, H - ((i * 20) + 20)),
			cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)

Here we overlay the following data on the frame:

  • objectID: Each object’s numerical identifier.
  • centroid: The center of the object, represented by a “dot” created by filling in a circle.
  • info: Includes totalUp, totalDown, and status.

For a review of drawing operations, be sure to refer to this blog post.

Then we’ll write the

  to a video file (if necessary) and handle keypresses:

	# check to see if we should write the frame to disk
	if writer is not None:
		writer.write(frame)

	# show the output frame
	cv2.imshow("Frame", frame)
	key = cv2.waitKey(1) & 0xFF

	# if the `q` key was pressed, break from the loop
	if key == ord("q"):
		break

	# increment the total number of frames processed thus far and
	# then update the FPS counter
	totalFrames += 1
	fps.update()

In this block we:

  • Write the frame, if necessary, to the output video file (Lines 249 and 250)
  • Display the frame and handle keypresses (Lines 253-258). If “q” is pressed, we break out of the frame processing loop.
  • Update our fps counter (Line 263)

We didn’t make too much of a mess, but now it’s time to clean up:

# stop the timer and display FPS information
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

# check to see if we need to release the video writer pointer
if writer is not None:
	writer.release()

# if we are not using a video file, stop the camera video stream
if not args.get("input", False):
	vs.stop()

# otherwise, release the video file pointer
else:
	vs.release()

# close any open windows
cv2.destroyAllWindows()

To finish out the script, we display the FPS info to the terminal, release all pointers, and close any open windows.

Just 283 lines of code later, we are now done 😎.

People counting results

To see our OpenCV people counter in action, make sure you use the “Downloads” section of this blog post to download the source code and example videos.

From there, open up a terminal and execute the following command:

$ python --prototxt mobilenet_ssd/MobileNetSSD_deploy.prototxt \
	--model mobilenet_ssd/MobileNetSSD_deploy.caffemodel \
	--input videos/example_01.mp4 --output output/output_01.avi
[INFO] loading model...
[INFO] opening video file...
[INFO] elapsed time: 37.27
[INFO] approx. FPS: 34.42

Here you can see that our person counter is counting the number of people who:

  1. Are entering the department store (down)
  2. And the number of people who are leaving (up)

At the end of the first video you’ll see there have been 7 people who entered and 3 people who have left.

Furthermore, examining the terminal output you’ll see that our person counter is capable of running in real-time, obtaining 34 FPS throughout.  This is despite the fact that we are using a deep learning object detector for more accurate person detections.

Our 34 FPS throughput rate is made possible through our two-phase process of:

  1. Detecting people once every 30 frames
  2. And then applying a faster, more efficient object tracking algorithm in all frames in between.
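The cadence of the two phases can be sketched in a few lines. This is an illustrative placeholder (the function name and constant are assumptions, not the script’s actual API), but it shows why only a small fraction of frames pay the cost of deep-learning detection:

```python
# run the (expensive) detector every N frames; run the (cheap)
# correlation tracker on the frames in between
SKIP_FRAMES = 30

def phase_for(frame_index):
    if frame_index % SKIP_FRAMES == 0:
        return "detect"   # run the object detector, re-seed trackers
    return "track"        # only update the lightweight trackers

phases = [phase_for(i) for i in range(90)]
print(phases.count("detect"), phases.count("track"))  # 3 87
```

Over 90 frames, only 3 frames invoke the detector; the other 87 use tracking, which is what keeps the overall pipeline real-time.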

Another example of people counting with OpenCV can be seen below:

$ python --prototxt mobilenet_ssd/MobileNetSSD_deploy.prototxt \
	--model mobilenet_ssd/MobileNetSSD_deploy.caffemodel \
	--input videos/example_01.mp4 --output output/output_02.avi 
[INFO] loading model...
[INFO] opening video file...
[INFO] elapsed time: 36.88
[INFO] approx. FPS: 34.79

I’ve included a short GIF below to give you an idea of how the algorithm works:

Figure 7: An example of an OpenCV people counter in action.

A full video of the demo can be seen below:

This time there have been 2 people who have entered the department store and 14 people who have left.

You can see how useful this system would be to a store owner interested in foot traffic analytics.

The same type of system for counting foot traffic with OpenCV can be used to count automobile traffic with OpenCV and I hope to cover that topic in a future blog post.

Additionally, a big thank you to David McDuffee for recording the example videos used here today! David works here with me at PyImageSearch and if you’ve ever emailed PyImageSearch before, you have very likely interacted with him. Thank you for making this post possible, David! Also a thank you to BenSound for providing the music for the video demos included in this post.

What are the next steps?

Congratulations on building your person counter with OpenCV!

If you’re interested in learning more about OpenCV, including building other real-world applications such as face detection, object recognition, and more, I would suggest reading through my book, Practical Python and OpenCV + Case Studies.

Practical Python and OpenCV is meant to be a gentle introduction to the world of computer vision and image processing. This book is perfect if you:

  • Are new to the world of computer vision and image processing
  • Have some past image processing experience but are new to Python
  • Are looking for some great example projects to get your feet wet


Learn OpenCV fundamentals in a single weekend!

If you’re looking for a more detailed dive into computer vision, I would recommend working through the PyImageSearch Gurus course. The PyImageSearch Gurus course is similar to a college survey course, and many students report learning more than they would in a typical university class.

Inside you’ll find over 168 lessons, starting with the fundamentals of computer vision, all the way up to more advanced topics, including:

  • Face recognition
  • Automatic license plate recognition
  • Training your own custom object detectors
  • …and much more!

You’ll also find a thriving community of like-minded individuals who are itching to learn about computer vision. Each day in the community forums we discuss:

  • Your burning questions about computer vision
  • New project ideas and resources
  • Kaggle and other competitions
  • Development environment and code issues
  • …among many other topics!

Master computer vision inside PyImageSearch Gurus!


In today’s blog post we learned how to build a people counter using OpenCV and Python.

Our implementation:

  • Is capable of running in real-time on a standard CPU
  • Utilizes deep learning object detectors for improved person detection accuracy
  • Leverages two separate object tracking algorithms, including both centroid tracking and correlation filters, for improved tracking accuracy
  • Applies both a “detection” and “tracking” phase, making it capable of (1) detecting new people and (2) picking up people that may have been “lost” during the tracking phase

I hope you enjoyed today’s post on people counting with OpenCV!

To download the code to this blog post (and apply people counting to your own projects), just enter your email address in the form below!


If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I’ll send you the code immediately!

The post OpenCV People Counter appeared first on PyImageSearch.


Unsupervised Learning Demystified

Unsupervised learning is a pattern-finding technique for mining inspiration from your data. Let's demystify!


Four short links: 13 August 2018

Algorithms, Feedback, Transliteration, and Diagnosis

  1. Dijkstras in Disguise -- It turns out that many algorithms I've encountered in my computer graphics, finance, and reinforcement learning studies are all variations of this relaxation principle in disguise. [...] This blog post is a gentle tutorial on how all these varied CS topics are connected.
  2. The Blacker the Box -- The thesis of this post is: The faster the feedback on prediction accuracy, the blacker the box can be. The slower the feedback, the more your models should be explicit and formal.
  3. Design Challenges in Named Entity Transliteration -- In order to improve availability of bilingual named entity transliteration data sets, we release personal name bilingual dictionaries mined from Wikidata for English to Russian, Hebrew, Arabic, and Japanese Katakana. Our code and dictionaries are publicly available. GitHub.
  4. AI Spots Fibromyalgia -- A machine-learning algorithm that was programmed to recognize this neurological signature was able to use it to predict which brain scans were indicative of fibromyalgia and which were not. [...] López-Solà’s research is compelling evidence to convince those who are reluctant to accept the existence of fibromyalgia. Interesting because medical science argues whether the condition is real, but the software can reliably identify something.


How feminism has made me a better scientist

Feminism is not a branch of science. It is not a set of testable propositions about the observable world, nor is it any single research method. From my own perspective, feminism is a political movement associated with successes such as votes for women, setbacks such as the failed Equal Rights Amendment, and continuing struggles in areas ranging from reproductive freedom to equality in the workplace. Feminism is also a way of looking at the world through awareness of the social and political struggles of women, historically and in the present. And others will define feminism in other ways.

How is this relevant to the practice of science? As a researcher in statistical methods and applications, I have found feminism to help me do better science. I make this claim based on various experiences in my work.

To start with, statistics is all about learning from data. Statisticians from Florence Nightingale and Francis Galton through Hans Rosling and Nate Silver have discovered unexpected patterns using mathematical modeling and data visualization. What does that have to do with feminism? Feminism is about a willingness to question the dominant, tacitly accepted ideology. This is essential for science. What is labeled “non-ideological” basically means the dominant ideology is accepted without thought. As Angela Davis said in her lecture, Feminism and Abolition, “feminist methodologies impel us to explore connections that are not always apparent. And they drive us to inhabit contradictions and discover what is productive in these contradictions. Feminism insists on methods of thought and action that urge us to think about things together that appear to be separate, and to disaggregate things that appear to naturally belong together.”

Along with questioning the dominant, tacitly accepted ideology is the need to recognize this ideology. This is related to the idea that a valuable characteristic of a scientist is a willingness to be disturbed. We learn from anomalies (see discussion here), which requires recognizing how a given observation or story is anomalous (that is, anomalous with respect to what expectations, exactly?), which in turn is more effective if one can identify the substrates of theories that determine our expectations. An example from our statistical work is our research using the Xbox survey to reveal that apparent swings in public opinion can largely be explained by variations in nonresponse.

On a more specific level, feminism can make us aware of work where the male perspective is unthinkingly taken as a baseline. For example, a paper was released a few years ago presenting survey data from the United States showing that parents of girls were more likely to support the Republican party, compared to parents of boys. I’m not completely sure what to make of this finding—for one thing, an analysis a few years ago of data from Britain found the opposite pattern—but here I want to focus on the reception of this research claim. There’s something oddly asymmetrical about how these results were presented, both by the authors and in the media. Consider the following headlines: “The Effect of Daughters on Partisanship and Social Attitudes Toward Women,” “Does Having Daughters Make You More Republican?”, “Parents With Daughters Are More Likely To Be Republicans, Says New Study,” “Parents Of Daughters Lean Republican, Study Shows,” “The Daughter Theory: Does raising girls make parents conservative?” Here’s our question: Why is it all about “the effect of daughters”? Why not, Does having sons make you support the Democrats? This is not just semantics: the male-as-baseline perspective affects the explanations that are given for this pattern: Lots of discussion about how you, as a parent, might change your views of the world if you have a girl. But not so much about how you might change your views if you have a boy.

The fallacy here was that people were reasoning unidirectionally. In this case, the benefit of a feminist perspective is not political but rather just a recognition of multiple perspectives and social biases, recognizing that in this case the boy baby is considered a default. As the saying goes, the greatest trick the default ever pulled was convincing the world it didn’t exist.

To put it another way: it is not that feminism is some sort of superpower that allows one to consider alternatives to the existing defaults, it’s more that these alternatives are obvious and can only not be seen if you don’t allow yourself to look. Feminism is, for this purpose, not a new way of looking at the world; rather, it is a simple removal of blinders. But uncovering blind spots isn’t that simple, and can be quite powerful.

A broader point is that it’s hard to do good social science if you don’t understand the community you’re studying. The lesson from feminism is not just to avoid taking the male perspective for granted but more generally to recognize the perspective of outgroups. An example came up recently with the so-called gaydar study, a much-publicized paper demonstrating the ability of a machine learning algorithm to distinguish gay and straight faces based on photographs from dating sites. This study was hyped beyond any reasonable level. From a statistical perspective, it’s no surprise at all that two groups of people selected from different populations will differ from each other, and if you have samples from two different populations and a large number of cases, then you should be able to train an algorithm to distinguish them at some level of accuracy. The authors of the paper in question went way beyond this, though, claiming that their results “provide strong support for the PHT [prenatal hormone theory], which argues that same-gender sexual orientation stems from the underexposure of male fetuses and overexposure of female fetuses to prenatal androgens responsible for the sexual differentiation of faces, preferences, and behavior.” Also some goofy stuff about the fact that gay men in this particular sample are less likely to have beards. Beyond the purely statistical problems here is a conceptual error, which is to think of “gay faces” as some sort of innate property of gay people, rather than as cues that gays are deliberately sending to each other. The distinctive and noticeable characteristics of the subpopulation are the result of active choices by members of that group, not (as assumed in the paper under discussion) essential attributes derived from “nature” or “nurture.”

Looking from a different direction, feminism can make us suspicious of simplistic gender-essentialist ideas such as those expressed in various papers that make use of schoolyard evolutionary biology—the idea that, because of evolution, all people are equivalent to all other people, except that all boys are different from all girls. It’s the attitude I remember from the grade school playground, in which any attribute of a person, whether it be how you walked or how you laughed or even how you held your arms when you were asked to look at your fingernails (really) was gender-typed. It’s gender and race essentialism. And combining it with what Tversky and Kahneman called “the law of small numbers” (the naive but common attitude that any underlying pattern should reproduce in any small sample) has led to endless chasing of noise in data analyses. In short, if you believe this sort of essentialism, you can find it just about anywhere you look, hence the value of a feminist perspective which reminds us of the history and fallacies of gender essentialism. Of course there are lots of systematic differences between boys and girls, and between men and women, that are not directly sex-linked. To be a feminist is not to deny these differences; rather, placing these differences within a larger context, and relating them to past and existing power imbalances, is part of what feminism is about.

Many examples of schoolyard evolutionary biology in published science will be familiar to regular readers of this blog: there’s the beauty-and-sex-ratio study, the ovulation-and-clothing study, the fat-arms-and-political attitudes study (a rare example that objectified men instead of women), the ovulation-and-voting study, and various others. Just to be clear: I’m not saying that the claims in those studies have to be wrong. All these claims, to my eyes, look crudely gender-essentialist, but that doesn’t mean they’re false, or that it was a bad idea to study them. No, all those studies had problems in their statistics (in the sense that poor understanding of statistical methods led researchers and observers to wrongly think that those data presented strong evidence in favor of those claims) and in their measurement (in that the collected data were too sparse and noisy to really have a chance of supplying the desired evidence).

A feminist perspective was helpful to me in unraveling these studies for several reasons. To start with, feminism gave me the starting point of skepticism regarding naive gender essentialism—or, I could say, it help keep me from being intimidated by weak theorizing coming from that direction. Second, feminism made me aware of multiple perspectives: if someone can hypothesize that prettier parents are more likely to produce girls, I can imagine an opposite just-so story that makes just as much sense. And, indeed, both stories could be true at different times and different places, which brings me to the third thing I bring from feminism: an awareness of variation. There’s no reason to think that a hypothesized effect will be consistent in magnitude or even sign for different people and under different conditions. Understanding this point is a start toward moving away from naive, one might say “scientistic,” views of what can be learned or proved from simple surveys or social experiments.

I consider myself a feminist but I understand that others have different political views, and I’m not trying to say that being a feminist is necessary for doing science. Of course not. Rather, my point is that I think the political and historical insights of feminism have made me a better statistician and scientist.

As I wrote a few years ago:

At some level, in this post I’m making the unremarkable point that each of us has a political perspective which informs our research in positive and negative ways. The reason that this particular example of the feminist statistician is interesting is that it’s my impression that feminism, like religion, is generally viewed as an anti-scientific stance. I think some of this attitude comes from some feminists themselves who are skeptical of science in that it is a generally male-dominated institution that is in part used to continue male dominance of society, and it also comes from people who might say that reality has an anti-feminist bias.

We can try to step back and account for our own political and ethical positions when doing science, but at the same time we should be honest about the ways that our positions and experiences shape the questions we ask and produce insights.

Feminism, like religion or other social identifications, can be competitive with science or it can be collaborative. See, for example, the blog of Echidne for a collaborative approach. To the extent that feminism represents a set of tenets opposed to reality, it could get in the way of scientific thinking, in the same way that religion would get in the way of scientific thinking if, for example, you tried to apply faith healing principles to do medical research. If you’re serious about science, though, I think of feminism (or, I imagine, Christianity, for example) as a framework rather than a theory—that is, as a way of interpreting the world, not as a set of positive statements. This is in the same way that I earlier wrote that racism is a framework, not a theory. Not all frameworks are equal; my point here is just that, if we’re used to thinking of feminism, or religion, as anti-scientific, it can be useful to consider ways in which these perspectives can help one’s scientific work.

All of this is just one perspective. I did get several useful comments and references from Shira Mitchell and Dan Simpson when preparing this post, but the particular stories and emphases are mine. One could imagine a whole series of such posts—“How feminism made me a worse scientist,” “How science has made me a better feminist,” “How Christianity has made me a better scientist,” and so forth—all written by different people. And each one could be valid.

I was impelled to write the above post after reflecting upon all those many pseudo-scientific stories of cavemen (as Rebecca Solnit put it, “the five-million-year-old suburb”) and reflecting on the difficulties we so often have in communicating with one another; see for example here (where psychologist Steven Pinker, who describes himself as a feminist, gives a list of topics that he feels are “taboo” and cannot be openly discussed among educated Americans; an example is the statement, “Do men have an innate tendency to rape?”) and here (where social critic Charles Murray, who I assume would not call himself a feminist, argues that educated Americans are too “nonjudgmental” and not willing enough to condemn others, for example by saying “that it is morally wrong for a woman to bring a baby into the world knowing that it will not have a father”).

When doing social science, we have to accept that different people have sharply different views. Awareness of multiple perspectives is to me a key step, both in understanding social behavior and also in making sense of the social science we read. I do not think that calling oneself a feminist makes someone a better person, nor do I claim that feminism represents some higher state of virtue. All I’m saying here is that feminism, beyond its political context, happens to be a perspective that can help some of us be better scientists.

The post How feminism has made me a better scientist appeared first on Statistical Modeling, Causal Inference, and Social Science.


Top Stories, Aug 6-12: Eight iconic examples of data visualisation; GitHub Python Data Science Spotlight

Also: Only Numpy: Implementing GANs and Adam Optimizer using Numpy; Understanding Language Syntax and Structure; Eight iconic examples of data visualisation; 5 Data Science Projects That Will Get You Hired in 2018; Seven Practical Ideas For Beginner Data Scientists


IML and H2O: Machine Learning Model Interpretability And Feature Explanation

(This article was first published on - Articles, and kindly contributed to R-bloggers)

Model interpretability is critical to businesses. If you want to use high performance models (GLM, RF, GBM, Deep Learning, H2O, Keras, xgboost, etc), you need to learn how to explain them. With machine learning interpretability growing in importance, several R packages designed to provide this capability are gaining in popularity. We analyze the IML package in this article.

In recent blog posts we assessed LIME for model agnostic local interpretability functionality and DALEX for both local and global machine learning explanation plots. This post examines the iml package (short for Interpretable Machine Learning) to assess its functionality in providing machine learning interpretability to help you determine if it should become part of your preferred machine learning toolbox.

We again utilize the high performance machine learning library, h2o, implementing three popular black-box modeling algorithms: GLM (generalized linear models), RF (random forest), and GBM (gradient boosted machines). For those that want a deep dive into model interpretability, the creator of the iml package, Christoph Molnar, has put together a free book: Interpretable Machine Learning. Check it out.

Articles In The Model Interpretability Series

Articles related to machine learning and black-box model interpretability:

Awesome Data Science Tutorials with LIME for black-box model explanation in business:

Transform Your Data Science Abilities In 10 Weeks

If you’re interested in learning how to apply critical thinking and data science while solving a real-world business problem following an end-to-end data science project, check out Data Science For Business (DS4B 201-R). Over the course of 10 weeks you will solve an end-to-end Employee Churn data science project following our systematic Business Science Problem Framework.

Data Science For Business With R

Start Learning Today!

We have two new course offerings coming soon! Shiny Web Apps and Python And Spark For Customer Churn! Get started with Business Science University.

FREE BOOK: Interpretable Machine Learning

The creator of the IML (Interpretable Machine Learning) package has a great FREE resource for those interested in applying model interpretability techniques to complex, black-box models (high performance models). Check out the book: Interpretable Machine Learning by Christoph Molnar.

Interpretable Machine Learning Book

Interpretable Machine Learning Book by Christoph Molnar

Learning Trajectory

We’ll cover the following topics on iml, combining with the h2o high performance machine learning package:

IML and H2O: Machine Learning Model Interpretability And Feature Explanation

By Brad Boehmke, Director of Data Science at 84.51°

Advantages & disadvantages

The iml package is probably the most robust ML interpretability package available. It provides both global and local model-agnostic interpretation methods. Although the interaction functions are notably slow, the other functions are faster or comparable to existing packages we use or have tested. I definitely recommend adding iml to your preferred ML toolkit. The following provides a quick list of some of its pros and cons:

Advantages:
  • ML model and package agnostic: can be used for any supervised ML model (many features are only relevant to regression and binary classification problems).
  • Variable importance: uses a permutation-based approach for variable importance, which is model agnostic, and accepts any loss function to assess importance.
  • Partial dependence plots: Fast PDP implementation and allows for ICE curves.
  • H-statistic: one of only a few implementations to allow for assessing interactions.
  • Local interpretation: provides both LIME and Shapley implementations.
  • Plots: built with ggplot2 which allows for easy customization

Disadvantages:
  • Does not allow for easy comparisons across models like DALEX.
  • The H-statistic interaction functions do not scale well to wide data (many predictor variables).
  • Only provides permutation-based variable importance scores (which become slow as the number of features increases).
  • LIME implementation has less flexibility and fewer features than lime.
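The permutation-based importance mentioned above is simple enough to sketch. This is a conceptual Python sketch of the idea iml implements, not iml’s R API; the `Lin` model, `mse` loss, and function name are illustrative assumptions:

```python
import numpy as np

def permutation_importance(model, X, y, loss, n_repeats=5, seed=0):
    """Model-agnostic permutation importance: shuffle one feature at a
    time and measure how much the loss degrades relative to baseline.
    `model` only needs a .predict(X) method; `loss` is any
    loss(y_true, y_pred) function."""
    rng = np.random.default_rng(seed)
    base = loss(y, model.predict(X))
    scores = []
    for j in range(X.shape[1]):
        degraded = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])          # break feature j's link to y
            degraded.append(loss(y, model.predict(Xp)))
        scores.append(np.mean(degraded) - base)
    return scores

# Tiny demo: y depends only on column 0, so only it should matter.
class Lin:
    def predict(self, X):
        return X[:, 0]

X = np.column_stack([np.arange(100.0), np.ones(100)])
y = np.arange(100.0)
mse = lambda a, b: float(np.mean((a - b) ** 2))
imp = permutation_importance(Lin(), X, y, mse)
print(imp[0] > imp[1])  # True: column 0 matters, column 1 does not
```

Because only predictions and a loss are required, the same recipe works for any supervised model, which is exactly why the approach is model agnostic.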

Replication Requirements


I leverage the following packages:

# load required packages
library(rsample)   # data splitting
library(ggplot2)   # allows extension of visualizations
library(dplyr)     # basic data transformation
library(h2o)       # machine learning modeling
library(iml)       # ML interpretation

# initialize h2o session
h2o.init()
## H2O is not running yet, starting it now...
## Note:  In case of errors look at the following log files:
##     C:\Users\mdanc\AppData\Local\Temp\RtmpotQxdq/h2o_mdanc_started_from_r.out
##     C:\Users\mdanc\AppData\Local\Temp\RtmpotQxdq/h2o_mdanc_started_from_r.err
## Starting H2O JVM and connecting: . Connection successful!
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 seconds 250 milliseconds 
##     H2O cluster version: 
##     H2O cluster version age:    8 months and 13 days !!! 
##     H2O cluster name:           H2O_started_from_R_mdanc_rkf359 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.53 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.4 (2018-03-15)


To demonstrate iml’s capabilities we’ll use the employee attrition data included in the rsample package. This is a binary classification problem (“Yes” vs. “No”), but the same process can be used for regression problems.
I perform a few housekeeping tasks on the data prior to converting to an h2o object and splitting.

NOTE: The surrogate tree function uses partykit::ctree, which requires all predictors to be numeric or factors. Consequently, we need to coerce any character predictors to factors (or ordinal encode them).

# classification data
df <- rsample::attrition %>% 
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))

# convert to h2o object
df.h2o <- as.h2o(df)

# create train, validation, and test splits
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15), destination_frames = c("train","valid","test"))
names(splits) <- c("train","valid","test")

# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y) 

H2O Models

We will explore how to visualize a few of the more common machine learning algorithms implemented with h2o. For brevity I train default models and do not emphasize hyperparameter tuning. The following produces regularized logistic regression, random forest, and gradient boosting machine models.

# elastic net model 
glm <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
)

# random forest model
rf <- h2o.randomForest(
  x = x, 
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123
)

# gradient boosting machine model
gbm <-  h2o.gbm(
  x = x, 
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123
)

All of the models provide AUCs ranging from approximately 0.75 to 0.79.

# model performance
h2o.auc(glm, valid = TRUE)
## [1] 0.7870935
h2o.auc(rf, valid = TRUE)
## [1] 0.7681021
h2o.auc(gbm, valid = TRUE)
## [1] 0.7468242

Although these models have distinct AUC scores, our objective is to understand how these models come to this conclusion in similar or different ways based on underlying logic and data structure.

IML procedures

In order to work with iml, we need to adapt our data a bit so that we have the following three components:

  1. Create a data frame with just the features (must be of class data.frame, cannot be an H2OFrame or other class).

  2. Create a vector with the actual responses (must be numeric – 0/1 for binary classification problems).

  3. iml has internal support for some machine learning packages (e.g. mlr, caret, randomForest). However, to use iml with several of the other popular packages being used today (e.g. h2o, ranger, xgboost) we need to create a custom function that will take a data set (again, must be of class data.frame) and return the predicted values as a vector.

# 1. create a data frame with just the features
features <-$valid) %>% select(-Attrition)

# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))

# 3. Create custom predict function that returns the predicted values as a
#    vector (probability of attrition in our example)
pred <- function(model, newdata)  {
  results <-, as.h2o(newdata)))
  return(results[[3L]])
}

# example of prediction output
pred(rf, features) %>% head()
## [1] 0.18181818 0.27272727 0.06060606 0.54545455 0.03030303 0.42424242

Once we have these three components we can create a predictor object. Similar to DALEX and lime, the predictor object holds the model, the data, and the class labels to be applied to downstream functions. A unique characteristic of the iml package is that it uses R6 classes, which is rather rare. The main differences between R6 classes and the normal S3 and S4 classes we typically work with are:

  • Methods belong to objects, not generics (we’ll see this in the next code chunk).
  • Objects are mutable: the usual copy-on-modify semantics do not apply (we’ll see this later in this tutorial).

These properties make R6 objects behave more like objects in programming languages such as Python. So to construct a new Predictor object, you call the new() method which belongs to the R6 Predictor object and you use $ to access new():

# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm, 
  data = features, 
  y = response, = pred,
  class = "classification"
)

predictor.rf <- Predictor$new(
  model = rf, 
  data = features, 
  y = response, = pred,
  class = "classification"
)

predictor.gbm <- Predictor$new(
  model = gbm, 
  data = features, 
  y = response, = pred,
  class = "classification"
)

# structure of predictor object
str(predictor.gbm)
## Classes 'Predictor', 'R6' 
##   Public:
##     class: classification
##     clone: function (deep = FALSE) 
##     data: Data, R6
##     initialize: function (model = NULL, data, = NULL, y = NULL, class = NULL) 
##     model: H2OBinomialModel
##     predict: function (newdata) 
##     prediction.colnames: NULL
##     prediction.function: function (newdata) 
##     print: function () 
##     task: unknown
##   Private:
##     predictionChecked: FALSE

Global interpretation

iml provides a variety of ways to understand our models from a global perspective:

  1. Feature Importance
  2. Partial Dependence
  3. Measuring Interactions
  4. Surrogate Model

We’ll go through each.

1. Feature importance

We can measure how important each feature is for the predictions with FeatureImp. The feature importance measure works by calculating the increase of the model’s prediction error after permuting the feature. A feature is “important” if permuting its values increases the model error, because the model relied on the feature for the prediction. A feature is “unimportant” if permuting its values leaves the model error unchanged, because the model ignored the feature for the prediction. This model-agnostic approach is based on Breiman (2001) and Fisher et al. (2018) and follows these steps:

For any given loss function do
1: compute loss function for original model
2: for variable i in {1,...,p} do
     | randomize values
     | apply given ML model
     | estimate loss function
     | compute feature importance (permuted loss / original loss)
3. Sort variables by descending feature importance   
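
The loop above can be sketched directly in base R. The following toy example is hypothetical and not part of the original analysis: it uses a plain linear model on the built-in mtcars data (rather than the h2o models in this post) with MSE as the loss.

```r
# toy permutation importance: linear model on mtcars, MSE loss
set.seed(123)
fit <- lm(mpg ~ ., data = mtcars)
mse <- function(actual, predicted) mean((actual - predicted)^2)

# 1: compute loss function for original model
base_loss <- mse(mtcars$mpg, predict(fit, mtcars))

# 2: for each variable, randomize its values and re-estimate the loss
predictors <- setdiff(names(mtcars), "mpg")
importance <- sapply(predictors, function(p) {
  permuted <- mtcars
  permuted[[p]] <- sample(permuted[[p]])                # randomize values
  mse(mtcars$mpg, predict(fit, permuted)) / base_loss   # permuted / original
})

# 3: sort variables by descending feature importance
sort(importance, decreasing = TRUE)
```

A ratio near 1 means permuting the feature barely changed the loss (unimportant); ratios well above 1 flag influential features.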

We see that all three models find OverTime as the most influential variable; however, after that each model finds unique structure and signals within the data. Note: you can extract the results with imp.rf$results.

# compute feature importance with specified loss metric
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
imp.rf <- FeatureImp$new(predictor.rf, loss = "mse")
imp.gbm <- FeatureImp$new(predictor.gbm, loss = "mse")

# plot output
p1 <- plot(imp.glm) + ggtitle("GLM")
p2 <- plot(imp.rf) + ggtitle("RF")
p3 <- plot(imp.gbm) + ggtitle("GBM")

gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

[Figure: permutation-based feature importance plots for the GLM, RF, and GBM models]

Permutation-based approaches can become slow as your number of predictors grows. Assessing variable importance for all 3 models in this example takes only 8 seconds. However, performing the same procedure on a data set with 80 predictors (AmesHousing::make_ames()) takes roughly 3 minutes. Although this is slower, it is comparable to other permutation-based implementations (e.g. DALEX, ranger).

system.time({
  imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
  imp.rf <- FeatureImp$new(predictor.rf, loss = "mse")
  imp.gbm <- FeatureImp$new(predictor.gbm, loss = "mse")
})
##  user  system elapsed 
## 2.982   0.112   8.164

The following lists some advantages and disadvantages to iml’s feature importance procedure.

Advantages:

  • Model agnostic
  • Simple interpretation that’s comparable across models
  • Can apply any loss function (accepts custom loss functions as well)
  • Plot output uses ggplot2; we can add to or use our internal branding packages with it

Disadvantages:

  • Permutation-based methods are slow
  • Default plot contains all predictors; however, we can subset the $results data frame if desired

2. Partial dependence

The Partial class implements partial dependence plots (PDPs) and individual conditional expectation (ICE) curves. The procedure follows the traditional methodology documented in Friedman (2001) and Goldstein et al. (2014) where the ICE curve for a certain feature illustrates the predicted value for each observation when we force each observation to take on the unique values of that feature. The PDP curve represents the average prediction across all observations.

For a selected predictor (x)
1. Determine grid space of j evenly spaced values across distribution of x
2: for value i in {1,...,j} of grid space do
     | set x to i for all observations
     | apply given ML model
     | estimate predicted value
     | if PDP: average predicted values across all observations
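
The PDP step of this procedure is easy to sketch in base R. As a hypothetical illustration (a linear model on mtcars, not the models from this post), here is the partial dependence of mpg on wt:

```r
# toy partial dependence: marginal effect of wt on predicted mpg
fit <- lm(mpg ~ ., data = mtcars)

# 1: determine a grid of j evenly spaced values across the distribution of wt
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)

# 2: for each grid value, force every observation to that value, predict,
#    and average the predictions (the PDP step)
pdp <- sapply(grid, function(v) {
  tmp <- mtcars
  tmp$wt <- v
  mean(predict(fit, tmp))
})

head(data.frame(wt = grid, avg_prediction = pdp))
```

Skipping the final averaging step and keeping one predicted curve per observation would give the ICE curves instead.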

The following produces “ICE boxplots” and a PDP line (connecting the averages of all observations for each response class) for the most important variable in all three models (OverTime). All three models show a sizable increase in the probability of employees attriting when working overtime. However, you will notice the random forest model experiences less of an increase in probability compared to the other two models.

glm.ot <- Partial$new(predictor.glm, "OverTime") %>% plot() + ggtitle("GLM")
rf.ot <- Partial$new(predictor.rf, "OverTime") %>% plot() + ggtitle("RF") 
gbm.ot <- Partial$new(predictor.gbm, "OverTime") %>% plot() + ggtitle("GBM")

gridExtra::grid.arrange(glm.ot, rf.ot, gbm.ot, nrow = 1)

[Figure: ICE boxplots with PDP line for OverTime across the GLM, RF, and GBM models]

For continuous predictors you can reduce the grid space to make computation time more efficient and center the ICE curves. Note: to produce the centered ICE curves (right plot) you use ice$center and provide it the value to center on. This will modify the existing object in place (recall this is a unique characteristic of R6 –> objects are mutable). The following compares the marginal impact of age on the probability of attriting. The regularized regression model shows a monotonic decrease in the probability (the log-odds probability is linear) while the two tree-based approaches capture the non-linear, non-monotonic relationship.

# GLM model
glm.age <- Partial$new(predictor.glm, "Age", ice = TRUE, grid.size = 50)
p1 <- plot(glm.age) + ggtitle("GLM")

# RF model
rf.age <- Partial$new(predictor.rf, "Age", ice = TRUE, grid.size = 50)
p2 <- plot(rf.age) + ggtitle("RF")

# GBM model
gbm.age <- Partial$new(predictor.gbm, "Age", ice = TRUE, grid.size = 50)
p3 <- plot(gbm.age) + ggtitle("GBM")

gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

[Figure: ICE curves (uncentered and centered) of Age for the three models]

Similar to pdp you can also compute and plot 2-way interactions. Here we assess how the interaction of MonthlyIncome and OverTime influences the predicted probability of attrition for all three models.

p1 <- Partial$new(predictor.glm, c("MonthlyIncome", "OverTime")) %>% 
    plot() + ggtitle("GLM") + ylim(c(0, .4))
p2 <- Partial$new(predictor.rf, c("MonthlyIncome", "OverTime")) %>% 
    plot() + ggtitle("RF") + ylim(c(0, .4))
p3 <- Partial$new(predictor.gbm, c("MonthlyIncome", "OverTime")) %>% plot() + 
    ggtitle("GBM") + ylim(c(0, .4))

gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

[Figure: two-way PDP of MonthlyIncome and OverTime for the three models]

The following lists some advantages and disadvantages to iml’s PDP and ICE procedures.

Advantages:

  • Provides PDPs & ICE curves (unlike DALEX)
  • Allows you to center ICE curves
  • Computationally efficient
  • grid.size allows you to increase/reduce grid space of xi values
  • Rug option illustrates distribution of all xi values
  • Provides convenient plot outputs for categorical predictors

Disadvantages:

  • Only provides heatmap plot of 2-way interaction plots
  • Does not allow for easy comparison across models like DALEX

3. Measuring Interactions

A wonderful feature provided by iml is the ability to measure how strongly features interact with each other. To measure interaction, iml uses the H-statistic proposed by Friedman and Popescu (2008). The H-statistic measures how much of the variation of the predicted outcome depends on the interaction of the features. There are two approaches to measure this. The first measures whether a feature (xi) interacts with any other feature. The algorithm performs the following steps:

1: for variable i in {1,...,p} do
     | f(x) = estimate predicted values with original model
     | pd(x) = partial dependence of variable i
     | pd(!x) = partial dependence of all features excluding i
     | upper = sum(f(x) - pd(x) - pd(!x))
     | lower = variance(f(x))
     | rho = upper / lower
2. Sort variables by descending rho (interaction strength)  
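
To make the statistic concrete, here is a hypothetical from-scratch computation using the squared, centered form of the statistic on a two-predictor additive model (a linear model on mtcars, not the models above). Because an additive model has no interactions, rho should come out essentially zero:

```r
# additive model: predictions decompose exactly into pd(wt) + pd(hp),
# so the one-vs-all H-statistic for wt should be ~0
fit <- lm(mpg ~ wt + hp, data = mtcars)
f <- predict(fit, mtcars)

# partial dependence of wt, evaluated at each observed wt value
pd_wt <- sapply(mtcars$wt, function(v) {
  tmp <- mtcars; tmp$wt <- v; mean(predict(fit, tmp))
})
# partial dependence of all features excluding wt (here just hp)
pd_not_wt <- sapply(mtcars$hp, function(v) {
  tmp <- mtcars; tmp$hp <- v; mean(predict(fit, tmp))
})

# center each component, then rho = squared remainder / variance of f
f_c   <- f - mean(f)
pd1_c <- pd_wt - mean(pd_wt)
pd2_c <- pd_not_wt - mean(pd_not_wt)
rho <- sum((f_c - pd1_c - pd2_c)^2) / sum(f_c^2)
rho  # ~0: no interaction between wt and hp in an additive model
```

Swapping in a model with an explicit wt:hp interaction term would push rho away from zero.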

The interaction strength (rho) ranges from 0 (no interaction at all) to 1 (all of the variation of the predicted outcome depends on a given interaction). All three models capture different interaction structures, although some commonalities exist across models (i.e. OverTime, Age, JobRole). The interaction effects are stronger in the tree-based models than in the GLM model, with the GBM model having the strongest interaction effect of 0.4.

# identify variables with largest interactions in each model
interact.glm <- Interaction$new(predictor.glm) %>% 
    plot() + ggtitle("GLM")
interact.rf  <- Interaction$new(predictor.rf) %>% 
    plot() + ggtitle("RF")
interact.gbm <- Interaction$new(predictor.gbm) %>% 
    plot() + ggtitle("GBM")

# plot
gridExtra::grid.arrange(interact.glm, interact.rf, interact.gbm, nrow = 1)


Considering OverTime exhibits the strongest interaction signal, the next question is: with which other variable is this interaction strongest? The second interaction approach measures the 2-way interaction strength of features xi and xj and performs the following steps:

1: i = a selected variable of interest
2: for remaining variables j in {1,...,p} do
     | pd(ij) = interaction partial dependence of variables i and j
     | pd(i) = partial dependence of variable i
     | pd(j) = partial dependence of variable j
     | upper = sum(pd(ij) - pd(i) - pd(j))
     | lower = variance(pd(ij))
     | rho = upper / lower
3. Sort interaction relationship by descending rho (interaction strength)  

The following measures the two-way interactions of all variables with the OverTime variable. The two tree-based models show MonthlyIncome having the strongest interaction (although it is a weak interaction since rho < 0.13). Identifying these interactions can be useful for understanding which variables create co-dependencies in our models’ behavior. It also helps us identify interactions to visualize with PDPs (which is why I showed the OverTime and MonthlyIncome interaction PDP earlier).

# identify variables with largest interactions w/OverTime
interact.glm <- Interaction$new(predictor.glm, feature = "OverTime") %>% plot()
interact.rf  <- Interaction$new(predictor.rf, feature = "OverTime") %>% plot()
interact.gbm <- Interaction$new(predictor.gbm, feature = "OverTime") %>% plot()

# plot
gridExtra::grid.arrange(interact.glm, interact.rf, interact.gbm, nrow = 1)

Two-Way Interactions

The H-statistic is not widely implemented, so having this feature in iml is beneficial. However, it’s important to note that as your feature set grows, the H-statistic becomes computationally slow. For this data set, measuring the interactions across all three models took only 45 seconds, and 68 seconds for the two-way interactions. However, for a wider data set such as AmesHousing::make_ames(), where there are 80 predictors, this will take upwards of an hour to compute.

4. Surrogate model

Another way to make the models more interpretable is to replace the “black box” model with a simpler model (aka “white box” model) such as a decision tree. This is known as a surrogate model in which we

1. Apply the original model and get predictions
2. Choose an interpretable "white box" model (linear model, decision tree)
3. Train the interpretable model on the original dataset and its predictions
4. Measure how well the surrogate model replicates the predictions of the black box model
5. Interpret / visualize the surrogate model
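
These steps can be illustrated with a small self-contained sketch; the loess “black box” and rpart surrogate below are stand-ins chosen for illustration, not part of the original analysis:

```r
library(rpart)  # recommended package, ships with R

# a stand-in "black box": a loess smoother of mpg on wt and hp
black_box <- loess(mpg ~ wt + hp, data = mtcars)

# steps 1-3: get black-box predictions, then train an interpretable
# decision tree on the original features and those predictions
bb_pred   <- predict(black_box, mtcars)
surrogate <- rpart(bb_pred ~ wt + hp, data = mtcars, maxdepth = 3)

# step 4: measure how well the surrogate replicates the black box (R^2)
sur_pred <- predict(surrogate, mtcars)
r2 <- 1 - sum((bb_pred - sur_pred)^2) / sum((bb_pred - mean(bb_pred))^2)
r2
```

Step 5 would then be interpreting the fitted tree itself, e.g. with plot() and text(), rather than the black box.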

iml provides a simple decision tree surrogate approach, which leverages partykit::ctree. In this example we train a surrogate decision tree with a max depth of 3 on our GBM model. You can see that the white box model does not do a good job of explaining the black box predictions (R^2 = 0.438).

# fit surrogate decision tree model
tree <- TreeSurrogate$new(predictor.gbm, maxdepth = 3)

# how well does this model fit the original results
tree$r.squared
## [1] 0.4384319

The plot illustrates the distribution of the probability of attrition for each terminal node. We see an employee with JobLevel > 1 and DistanceFromHome <= 12 has a very low probability of attriting.

# visualize tree
plot(tree)

[Figure: surrogate decision tree showing the distribution of attrition probability in each terminal node]

When trying to explain a complicated machine learning model to decision makers, surrogate models can help simplify the process. However, it’s important to only use surrogate models for simplified explanations when they are actually good representatives of the black box model (in this example, it is not).

Local interpretation

In addition to providing global explanations of ML models, iml also provides two newer but well-accepted methods for local interpretation.

Local interpretation techniques provide methods to explain why an individual prediction was made for a given observation.

To illustrate, let’s get two observations. The first represents the observation for which our random forest model produces the highest probability of attrition (observation 154, with a 0.666 probability of attrition), and the second represents the observation with the lowest probability (observation 28, with a 0 probability of attrition).

# identify obs with highest and lowest probabilities
(high <- predict(rf, splits$valid) %>% .[, 3] %>% as.vector() %>% which.max()) 
## [1] 154
(low  <- predict(rf, splits$valid) %>% .[, 3] %>% as.vector() %>% which.min())  
## [1] 28
# get these observations
high_prob_ob <- features[high, ]
low_prob_ob  <- features[low, ]

1. Lime

iml implements its own version of local interpretable model-agnostic explanations (Ribeiro et al., 2016). Although it does not use the lime package directly, it does implement the same procedures (see lime tutorial).

A few notable items about iml’s implementation (see the referenced tutorial above for these details within lime):

  • like lime, can change distance metric (default is gower but accepts all distance functions provided by ?dist),
  • like lime, can change kernel (neighborhood size),
  • like lime, computationally efficient –> takes about 5 seconds to compute,
  • like lime, can be applied to multinomial responses,
  • like lime, uses the glmnet package to fit the local model; however…
  • unlike lime, only implements a ridge model (lime allows ridge, lasso, and more),
  • unlike lime, can only do one observation at a time (lime can do multiple),
  • unlike lime, does not provide fit metric such as (R^2) for the local model.

The following fits a local model for the observation with the highest probability of attrition. In this example I look for the 10 variables in each model that are most influential in this observation’s predicted value (k = 10). The results show that the Age of the employee reduces the probability of attrition within all three models. Moreover, all three models show that since this employee works OverTime, there is a sizable increase in the probability of attrition. However, the tree-based models also identify the MaritalStatus and JobRole of this employee as contributing to his/her increased probability of attrition.

# fit local model
lime.glm <- LocalModel$new(predictor.glm, k = 10, x.interest = high_prob_ob) %>% 
    plot() + ggtitle("GLM")
lime.rf  <- LocalModel$new(predictor.rf, k = 10, x.interest = high_prob_ob) %>% 
    plot() + ggtitle("RF")
lime.gbm <- LocalModel$new(predictor.gbm, k = 10, x.interest = high_prob_ob) %>% 
    plot() + ggtitle("GBM")

gridExtra::grid.arrange(lime.glm, lime.rf, lime.gbm, nrow = 1)

[Figure: LIME explanations (k = 10) for the highest-probability observation across the three models]

Here, I reapply the same procedure to low_prob_ob. We see Age, JobLevel, and OverTime all having sizable influence on this employee’s very low predicted probability of attrition (zero).

# fit local model
lime.glm <- LocalModel$new(predictor.glm, k = 10, x.interest = low_prob_ob) %>% 
    plot() + ggtitle("GLM")
lime.rf  <- LocalModel$new(predictor.rf, k = 10, x.interest = low_prob_ob) %>% 
    plot() + ggtitle("RF")
lime.gbm <- LocalModel$new(predictor.gbm, k = 10, x.interest = low_prob_ob) %>% 
    plot() + ggtitle("GBM")

gridExtra::grid.arrange(lime.glm, lime.rf, lime.gbm, nrow = 1)

[Figure: LIME explanations (k = 10) for the lowest-probability observation across the three models]

Although LocalModel does not provide fit metrics (i.e. R^2) for our local model, we can compare the local model’s predicted probability to the global (full) model’s predicted probability.

For the high probability employee, the local model predicts only a 0.34 probability of attrition, whereas it predicts a more accurate 0.12 probability of attrition for the low probability employee. This can help you gauge the trustworthiness of the local model.

High Probability:

# high probability observation
predict(rf, splits$valid) %>% 
  .[high, 3] # actual probability
## [1] 0.6666667
lime_high <- LocalModel$new(predictor.rf, k = 10, x.interest = high_prob_ob)
lime_high$predict(high_prob_ob) # predicted probability
##   prediction
## 1  0.3371567

Low Probability:

# low probability observation
predict(rf, splits$valid) %>% 
  .[low, 3] # actual probability
## [1] 0
lime_low <- LocalModel$new(predictor.rf, k = 10, x.interest = low_prob_ob)
lime_low$predict(low_prob_ob) # predicted probability
##   prediction
## 1  0.1224224

2. Shapley values

An alternative for explaining individual predictions is a method from coalitional game theory that produces what are called Shapley values (Lundberg & Lee, 2016). The idea behind Shapley values is to assess every combination of predictors to determine each predictor’s impact. Focusing on feature xj, the approach tests the accuracy of every combination of features not including xj and then tests how adding xj to each combination improves the accuracy. Unfortunately, computing Shapley values is very computationally expensive. Consequently, iml implements an approximate Shapley estimation algorithm with the following steps:

ob = single observation of interest
1: for variables j in {1,...,p} do
     | m = random sample from data set
     | t = rbind(m, ob)
     | f(all) = compute predictions for t
     | f(!j) = compute predictions for t with feature j values randomized
     | diff = sum(f(all) - f(!j))
     | phi = mean(diff)
2. sort phi in decreasing order
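
As a toy illustration of this sampling loop (again using a linear model on mtcars rather than the models in this post; the single-feature replacement mirrors the pseudocode above):

```r
# toy approximate Shapley values for one observation, following the
# sampling loop above (single-feature replacement version)
set.seed(123)
fit <- lm(mpg ~ ., data = mtcars)
ob  <- mtcars[1, ]                            # single observation of interest
n_samples <- 100

phi <- sapply(setdiff(names(mtcars), "mpg"), function(j) {
  diffs <- replicate(n_samples, {
    m <- mtcars[sample(nrow(mtcars), 1), ]    # random sample from data set
    ob_no_j <- ob
    ob_no_j[[j]] <- m[[j]]                    # feature j value randomized
    predict(fit, ob) - predict(fit, ob_no_j)  # f(all) - f(!j)
  })
  mean(diffs)                                 # phi = mean difference
})

# contribution of each variable, sorted by decreasing phi
sort(phi, decreasing = TRUE)
```

For a linear model each phi should land near the coefficient times the gap between the observation’s feature value and its average, which is the intuition behind the “contribution versus the average prediction” reading below.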

The Shapley value (phi) represents the contribution of each respective variable to the predicted value, compared to the average prediction for the data set.

We use Shapley$new to create a new Shapley object. For this data set it takes about 9 seconds to compute. The time to compute is largely driven by the number of predictors but you can also control the sample size drawn (see sample.size argument) to help reduce compute time. If you look at the results, you will see that the predicted value of 0.667 is 0.496 larger than the average prediction of 0.17. The plot displays the contribution each predictor played in this difference of 0.496.

# compute Shapley values
shapley.rf <- Shapley$new(predictor.rf, x.interest = high_prob_ob)

# look at summary of results
shapley.rf <- readRDS("2018-08-13-iml/shapley_rf.rds")
shapley.rf
## Interpretation method:  Shapley 
## Predicted value: 0.666667, Average prediction: 0.170373 (diff = 0.496293)
## Analysed predictor: 
## Prediction task: unknown 
## Analysed data:
## Sampling from data.frame with 233 rows and 30 columns.
## Head of results:
##            feature          phi      phi.var
## 1              Age  0.031515152 0.0027626123
## 2   BusinessTravel  0.047575758 0.0031212956
## 3        DailyRate  0.011515152 0.0015170994
## 4       Department -0.006363636 0.0003579412
## 5 DistanceFromHome -0.011818182 0.0009256013
## 6        Education  0.001212121 0.0007220042
##                      feature.value
## 1                           Age=28
## 2 BusinessTravel=Travel_Frequently
## 3                    DailyRate=791
## 4  Department=Research_Development
## 5               DistanceFromHome=1
## 6                 Education=Master
# plot results
plot(shapley.rf)

[Figure: Shapley value contributions for the high-probability observation (random forest)]

We can compare the Shapley values across each model to see if common themes appear. Again, OverTime is a common theme across all three models. We also see MonthlyIncome as influential for the tree-based methods, and there are other commonalities among the mildly influential variables across all three models (e.g. StockOptionLevel, JobLevel, Age, MaritalStatus).

shapley.glm <- Shapley$new(predictor.glm, x.interest = high_prob_ob) %>% 
    plot() + ggtitle("GLM")
shapley.rf  <- plot(shapley.rf) + ggtitle("RF")
shapley.gbm <- Shapley$new(predictor.gbm, x.interest = high_prob_ob) %>% 
    plot() + ggtitle("GBM")

gridExtra::grid.arrange(shapley.glm, shapley.rf, shapley.gbm, nrow = 1)

[Figure: Shapley values for the high-probability observation across the three models]

Similarly, we can apply the same procedure to the low probability employee. Some common themes pop out for this employee as well. It appears that the age, total number of working years, and senior position (JobLevel, JobRole) play a large part in the low predicted probability of attrition for this employee.

shapley.glm <- Shapley$new(predictor.glm, x.interest = low_prob_ob) %>% 
    plot() + ggtitle("GLM")
shapley.rf  <- Shapley$new(predictor.rf, x.interest = low_prob_ob) %>% 
    plot() + ggtitle("RF")
shapley.gbm <- Shapley$new(predictor.gbm, x.interest = low_prob_ob) %>% 
    plot() + ggtitle("GBM")

gridExtra::grid.arrange(shapley.glm, shapley.rf, shapley.gbm, nrow = 1)

[Figure: Shapley values for the low-probability observation across the three models]

Shapley values are considered more robust than the results you will get from LIME. However, similar to the different ways you can compute variable importance, although you will often see differences between the two methods, you will also see common variables being identified as highly influential in both approaches. Consequently, we should use these approaches to help identify influential variables but not to definitively label a variable as the most influential.

Next Steps (Transform Your Abilities)

It’s critical to understand that modeling is just one part of the overall data science project. Make no mistake – it’s an important part, but the other parts are equally if not more important, which is why our Data Science For Business With R (DS4B 201-R) course is successfully teaching data science students how to apply data science in the real world.

Data Science For Business With R

We teach end-to-end data science. This means you learn how to:

  • Chapter 1: Sizing The Problem, Custom Functions With Tidy Eval: Learn how to understand if a business problem exists by sizing the problem. In addition, create custom functions using Tidy Eval, a programming interface that is needed if you want to build custom functions using dplyr and ggplot2.

  • Chapter 2: Exploration with GGally, skimr – We show you how to explore data and identify relationships efficiently and effectively

  • Chapter 3: recipes, Premodeling Correlation Analysis: We teach you how to use recipes to develop data transformation workflow and we show you how to perform a pre-modeling correlation analysis so you do not move into modeling prematurely, which again saves you time

  • Chapters 4 and 5: H2O AutoML: You will first learn how to use all of the major h2o functions to perform automated machine learning for binary classification including working with the H2O leaderboard, visualizing results, and performing grid search and cross validation. In the next chapter, you learn how to evaluate multiple models against each other with ROC, Precision vs Recall, Gain and Lift, which is necessary to explain your choices for best model to the business stakeholders (your manager and key decision makers).

  • Chapter 6: LIME: You learn how to create local explanations to interpret black-box machine learning models. Explanations for the model predictions are critical because the business cannot do anything to improve unless they understand why. We show you how with lime.

  • Chapters 7 and 8: Expected Value Framework, Threshold Optimization, Sensitivity Analysis: These are my two favorite chapters because they show you how to link the churn classification model to financial results, and they teach purrr for iterative grid-search style optimization! These are POWERFUL CONCEPTS.

  • Chapter 9: Recommendation Algorithm: We again use a correlation analysis but in a different way. We discretize, creating a special visualization that enables us to develop a recommendation strategy.

Data Science For Business With R (DS4B 201-R)

Learn everything you need to know to complete a real-world, end-to-end data science project with the R programming language. Transform your abilities in 10 weeks.

Get Started Today!

Business Science University Course Roadmap

We have one course out and two courses coming soon!

Data Science For Business With R (DS4B 201-R)

Over the course of 10 weeks, we teach you how to solve an end-to-end data science project. Available now!

DS4B 301-R Shiny Application: Employee Prediction

Transform your abilities by solving employee churn over a 10-week end-to-end data science project in DS4B 201-R

Data Science For Business With R

Get Started Today!

Building A Shiny Application (DS4B 301-R)

Our next course teaches you how to take the H2O Model, LIME explanation results, and the recommendation algorithm you develop in DS4B 201-R and turn it into a Shiny Web Application that predicts employee attrition! Coming in Q3 2018.

DS4B 301-R Shiny Application: Employee Prediction

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in DS4B 301-R (Building A Shiny Web App)

Kelly O’Briant is the lead developer for the Shiny App Course coming soon. She’s a brilliant software engineer / data scientist who knows how to make a great-looking, high-performing Shiny app.

Sign Up! Coming Q3!

Data Science For Business With Python (DS4B 201-P)

Did we mention we have a DS4B Python Course coming? Well, we do! Coming in Q4 2018.

The problem changes: Customer Churn! The tools will be H2O, LIME, and a host of other tools implemented in Python + Spark.

Python Track

Python Track: Data Science For Business With Python And Spark

Favio Vazquez, Principal Data Scientist at OXXO, is building the Python + Spark equivalent of DS4B 201-R. He’s immensely talented, knowing Python, Spark, and R, along with a host of other data science tools.

Sign Up! Coming Q4!

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:

To leave a comment for the author, please follow the link and comment on their blog: - Articles.

Wikimedia Foundation: Manager of Product Analytics (San Francisco, CA or Remote)

Seeking an experienced and hands-on Manager of Product Analytics to lead a team that uses data to help our staff and our communities achieve our vision: a world in which every single human being can freely share in the sum of all knowledge.


Aerial view of sheepdogs herding sheep

Sometimes the visualization takes care of itself. Photographer Tim Whittaker filmed sheepdogs herding thousands of sheep, and the flows from one place to another are like organized randomness.


Book Memo: “Artificial Adaptive Systems using Auto Contractive Maps”

Theory, Applications and Extensions
This book offers an introduction to artificial adaptive systems and a general model of the relationships between the data and algorithms used to analyze them. It subsequently describes artificial neural networks as a subclass of artificial adaptive systems, and reports on the backpropagation algorithm, while also identifying an important connection between supervised and unsupervised artificial neural networks. The book’s primary focus is on the auto contractive map, an unsupervised artificial neural network employing a fixed point method versus traditional energy minimization. This is a powerful tool for understanding, associating and transforming data, as demonstrated in the numerous examples presented here. A supervised version of the auto contractive map is also introduced as an outstanding method for recognizing digits and defects. In closing, the book walks readers through the theory and examples of how the auto contractive map can be used in conjunction with another artificial neural network, the ‘spin-net,’ as a dynamic form of auto-associative memory.


A More Informative Approach to Movie Recommendation

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

A Deluge of Content

Over the last two decades, the accessibility of media has increased dramatically. One of the main reasons for this surge in available content is the evolution of online streaming platforms. Services like Netflix, Hulu, Amazon Prime, and others give millions of consumers access to a seemingly countless number of movies and television shows. With so many choices now available, it is more important than ever to have effective ways to filter all the possible titles and find options that best match your interests. Many streaming services have implemented functionality to suggest possible titles of interest for their users. Without these guiding tools, deciding what to watch next can seem like a monstrous undertaking.


The Goal

In my experience, the “you may also like”-style suggestion algorithms work well. However, I find that when I’m searching for something to watch I want to know more than just whether or not I will like it. What kind of movie is it? What parts of the movie were done better than others? In short, I want to know what aspects of the movie suggest that I will enjoy it. This information is not provided by the suggestion functionality of the streaming platforms I have encountered. So I thought of a different way to approach the problem.

Movie critics work hard to produce reviews that help the general audience decide whether a movie is worth watching. A critic review can provide a much more detailed look into a movie’s worth than an algorithm simply telling you “watch this.” Still, the issue of individual critics’ personal taste and preference exists. Although general trends usually do arise, movie reviews can be highly subjective. How do I know I can trust a particular critic? Do they value the same qualities in a movie that I do? These uncertainties can compromise the validity of a review on a person-to-person basis. A critic’s reviews would carry more weight if I knew that our interests were similar. With this in mind, I decided to create an application that would match a user with the most similarly-minded critics. The user could then follow their matched critics to explore a range of movies they could potentially enjoy.


The Data

I chose to extract movie review data from the movie and TV show review site RottenTomatoes (RT). RT is a great resource for anyone looking for a new TV show to watch or to choose between movies that are currently playing. The site provides ratings by both professional critics and average viewers to give a general idea of the quality of a film or show. RT also gives a certification of “Fresh” or “Rotten” so users can quickly get an idea if the title is worth checking out. I used the Scrapy package in Python to write a script that would extract certain values from the RT site. From RT, I scraped certified critics’ reviews of the top 100 movies per year from 2000 to 2018. From the top 100 movies list, I only inspected those movies that were reviewed by 100 or more unique critics. The 100 critic count mark seemed to be a threshold for widely recognizable movies. On the critic reviews page for each movie, if a critic gave a rating for the movie (some reviews only included a description), I scraped the critic’s name, the organization they belonged to, and the rating they gave. I also scraped the average rating for each of the movies.


Data Cleaning

I used Python to clean the data that I scraped from RT. The main issue that I encountered was that there were many different rating formats used among the different critics. For example, here are some of the formats used:

  • 3/5
  • 3/4
  • 6/10
  • 60/100
  • 3 out of 5
  • A +
  • A plus
  • *** 1/2
  • 3 stars
  • Recommended vs. Not Recommended

In order to normalize the critics’ ratings so I could compare them, I decided to convert them all to fractional values between 0 and 1. The different rating formats made this conversion difficult. I wrote if-statements containing regular expressions for each format to match and extract the rating information. Then I was able to reassign each value as a fractional representation of the original rating (e.g. *** 1/2 = 3.5/5 = 0.7). Because there were many different format types, I only retained the ratings that had the most common formats. I also dropped the ratings that I did not think would add value to the matching algorithm, such as recommended vs. not recommended (too coarse a rating). After the conversion, many ratings had more than five decimal places. For simplicity, I truncated them all to three decimal places.
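The conversion could be sketched like this in Python, covering a few of the formats listed above (the 5-star and 4.3-point letter-grade scales are assumptions, and `normalize_rating` is a hypothetical helper rather than my actual cleaning code):

```python
import re

def normalize_rating(raw):
    """Convert a rating string to a fraction in [0, 1]; None if unrecognized."""
    raw = raw.strip()
    # Numeric fractions: "3/5", "6/10", "60/100"
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)", raw)
    if m:
        return round(float(m.group(1)) / float(m.group(2)), 3)
    # "3 out of 5"
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s+out\s+of\s+(\d+(?:\.\d+)?)", raw, re.I)
    if m:
        return round(float(m.group(1)) / float(m.group(2)), 3)
    # Star ratings like "*** 1/2", assumed to be out of 5 stars
    m = re.fullmatch(r"(\*+)\s*(1/2)?", raw)
    if m:
        stars = len(m.group(1)) + (0.5 if m.group(2) else 0)
        return round(stars / 5, 3)
    # Letter grades ("A +", "A plus"), mapped onto an assumed 4.3-point scale
    m = re.fullmatch(r"([ABCDF])\s*([+-]|plus|minus)?", raw, re.I)
    if m:
        base = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}[m.group(1).upper()]
        mod = (m.group(2) or "").lower()
        base += 0.3 if mod in ("+", "plus") else -0.3 if mod in ("-", "minus") else 0.0
        return round(min(base, 4.3) / 4.3, 3)
    return None  # unrecognized formats get dropped, as described above
```

The fall-through `None` handles the formats I chose to drop, such as recommended vs. not recommended.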


The App

The main focus of this project was to use web scraping to collect unorganized data and then analyze that data. Instead of exploring the RT data to find trends among movies, such as comparing reviews among different genres or movie release dates, I decided to use the data to help people find movies they might like. This was my thought process for the application’s functionality:

  1. Have a user pick five movies they have seen
  2. Have the user rate each movie according to how much they enjoyed it
  3. Compare the user’s ratings to critic ratings and find the critics with the most similar ratings
  4. Show the top critics ordered by closest match and show other movies those critics rated

In order to transform this theoretical process into a reality, I used the Shiny package in R to code an application that could ask the user for input and return a table of top matched critics for them to follow.


The user selects five movies they have seen.


The user rates the movies they selected based on how much they enjoyed them.


The matching algorithm is pretty straightforward, although I utilized two different methods. The first method I used was to take the absolute value of the difference between the user scores and the critic scores and then sum up the differences. I called this match score the “Absolute Distance Score” (the lower the score, the better the match). This approach was simple, but I found that it created some issues. For example, a critic who rated all five movies exactly one point off from the user ratings would be given the same match score as a critic who rated four movies exactly the same as the user but one movie five points off. In this example, the first critic rated each movie almost exactly the same as the user, and the second critic rated four movies exactly the same and rated one very differently. I decided that in this type of situation the critics should not be given the same match score, so I implemented an additional method. I added a “Squared Distance Score,” which squared the differences between the user’s and critic’s ratings and summed them up. This weights distance quadratically, so larger differences increase the match score more. I included both score types in the table that shows the top matches; however, I sorted the table by lowest squared distance score. I hoped that allowing the user to see both scores would provide more insight into the meaning behind the match score.
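As a minimal Python sketch (hypothetical `match_scores` helper, with ratings as the 0–1 fractions produced during cleaning), the two scores and the difference between them look like this:

```python
def match_scores(user_ratings, critic_ratings):
    """Compare a user's ratings to a critic's over their shared movies.

    Both arguments are {movie: rating} dicts with ratings in [0, 1].
    Returns (absolute_distance, squared_distance) -- lower is a better match.
    """
    shared = user_ratings.keys() & critic_ratings.keys()
    abs_dist = sum(abs(user_ratings[m] - critic_ratings[m]) for m in shared)
    sq_dist = sum((user_ratings[m] - critic_ratings[m]) ** 2 for m in shared)
    return round(abs_dist, 3), round(sq_dist, 3)

# The scenario from the text: critic A is 0.1 off on every movie, critic B
# matches four movies exactly but is 0.5 off on the fifth.
user = {"m1": 0.8, "m2": 0.6, "m3": 0.4, "m4": 0.9, "m5": 0.5}
critic_a = {m: r + 0.1 for m, r in user.items()}
critic_b = dict(user, m5=1.0)
print(match_scores(user, critic_a))  # (0.5, 0.05)
print(match_scores(user, critic_b))  # (0.5, 0.25)
```

The two critics tie on absolute distance, but the squared score separates the consistently-close critic from the one with a single large miss.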

An important complication arose in the case where a critic did not review all five of the user’s chosen movies. This could lead to misleading match scores in the match table. If one critic reviewed all five of the user’s movies, there is a greater chance that their match score will be higher than that of a critic who reviewed only three movies (five differences will be summed for the first critic and only three differences for the second). In the match table, two critics could have the same score but perhaps only because data was missing for one or more of the movies. Because of this, I added a review count to the match table. I ordered the table by highest review count and selected the top five match scores for critics who rated five, four, or three of the user’s movies. I did not include critics who rated two or fewer of the user’s movies. I made it clear in the text above the tables that differences in match scores between review count groups should be taken with a grain of salt. Overall, the five review count group will give the most robust matches.
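That grouping and ordering could be sketched as follows (Python with hypothetical names and data; the real app does this in R/Shiny):

```python
def rank_critics(user_ratings, critics, min_reviews=3, top_n=5):
    """Rank critics by review count, then by squared distance within each group.

    `critics` maps critic name -> {movie: rating}; critics covering fewer than
    `min_reviews` of the user's movies are dropped entirely.
    Returns a list of (name, review_count, squared_distance) tuples.
    """
    rows = []
    for name, ratings in critics.items():
        shared = user_ratings.keys() & ratings.keys()
        if len(shared) < min_reviews:
            continue  # two or fewer shared movies: not enough signal
        sq = sum((user_ratings[m] - ratings[m]) ** 2 for m in shared)
        rows.append((name, len(shared), round(sq, 3)))
    # Highest review count first, then best (lowest) squared distance
    rows.sort(key=lambda r: (-r[1], r[2]))
    return rows[:top_n]

user = {"a": 0.8, "b": 0.6, "c": 0.4, "d": 0.9, "e": 0.5}
critics = {
    "c5": {"a": 0.9, "b": 0.7, "c": 0.5, "d": 1.0, "e": 0.6},
    "c3": {"a": 0.8, "b": 0.6, "c": 0.4},
    "c2": {"a": 0.1, "b": 0.2},
}
ranked = rank_critics(user, critics)
```

Here the five-review critic sorts above the perfectly-matching three-review critic, and the two-review critic is excluded, mirroring the grouping described above.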


This table shows the user’s top matched critics based on the movies and ratings chosen.


These tables show other movies that the currently selected critic has reviewed, specifically those movies that the critic rated high or low.


Below the match table, I included two tables that show high and low rated movies for the currently selected critic. My goal here was to recommend movies to the user based on the opinion of their matched critics. Here the user can browse the list of highly rated movies to see which titles they might enjoy. Conversely, the low rated movie table will allow the user to know which movies to avoid.

The last tables in the application are something I was very excited to implement. When searching for a movie to watch, I generally only stumble upon movies that are universally loved or universally disliked because those are the movies that receive the most publicity. Obviously there are many other movies out there that I might really enjoy, but it’s hard to find the diamonds in the rough. The final tables in my application show the user which movies their critics rated much differently from the general public. The left table shows movies that the critic rated highly but had a low average rating; these are the movies no one would ever think to recommend to you, but are ones you might really enjoy based on your matched critic. On the other hand, the right table shows movies that the critic rated low but had a high average rating; these are the movies that everyone keeps telling you to see, but you might be better off avoiding them. I think this is a very useful tool in finding movies that cater to your interests and can open up a whole new world of movies for you to see.
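The two divergence tables boil down to comparing critic and average ratings; here is a Python sketch with a hypothetical gap threshold (the actual cutoff the app uses is not stated above):

```python
def divergent_movies(critic_ratings, average_ratings, threshold=0.3):
    """Split a critic's movies into hidden gems and overrated picks.

    Both arguments are {movie: rating} dicts with ratings in [0, 1];
    `threshold` is the minimum critic-vs-average gap (an arbitrary choice here).
    """
    shared = critic_ratings.keys() & average_ratings.keys()
    # Critic high, average low: movies no one would think to recommend
    gems = sorted(m for m in shared
                  if critic_ratings[m] - average_ratings[m] >= threshold)
    # Critic low, average high: hyped movies this critic would skip
    overrated = sorted(m for m in shared
                       if average_ratings[m] - critic_ratings[m] >= threshold)
    return gems, overrated

critic = {"X": 0.9, "Y": 0.3, "Z": 0.6}
avg = {"X": 0.5, "Y": 0.8, "Z": 0.6}
gems, overrated = divergent_movies(critic, avg)
```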


These tables show the movies that the currently selected critic rated much differently from the general public.


Going Forward

I am really happy with how this application turned out, but there are still a few kinks I would like to work out.

  1. In the match table, I include a column with a list of all the organisations that the critics have been associated with. I populate this column by aggregating all the unique organisations for a critic from the original data set and concatenating them with a “|” as a separator. While the matching calculation does not take very long, this organisation column creation more than doubles the required time for the table to be created. I would like to add a few lines in the python cleaning code to replace each organisation value in the original data set with all the critic’s organisations so that this calculation does not need to be done each time a user submits a new request.
  2. Due to my filtering criteria during scraping, I only included movies that were reviewed by 100 or more critics. Additionally, the list of movies only includes the top 100 movies from each year between 2000 and 2018.  I have not explored how loosening these criteria would affect the data set size or how it would affect the match calculation time, but I would like to add more options in the future for users to choose from (and also to potentially make the algorithm more accurate by adding more uncommon movies).
  3. Also due to the criteria above, the movies in the data set are necessarily popular. This gave rise to a complication in the “Critic Rating High / Average Rating Low” table. Because all of the movies are popular, there are few to no movies with average ratings below 6/10 or so. This somewhat defeats the purpose of the table, which is supposed to introduce the user to unknown or generally disliked movies. On RT, each critic has their own page with every review they have submitted along with RT’s Tomatometer rating (the percentage of critics who gave the movie a “Fresh” rating). In the future, I would like to scrape each critic’s page and use this second data set to populate the two critic vs. average tables. This should give the user a more exhaustive list of movies they should consider seeing or avoiding.
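The pre-aggregation described in item 1 could be sketched in Python like this (hypothetical `aggregate_organisations` helper over the cleaned (critic, organisation) pairs):

```python
from collections import defaultdict

def aggregate_organisations(reviews):
    """Map each critic to a "|"-joined string of all their organisations.

    `reviews` is a list of (critic, organisation) pairs from the cleaned data;
    running this once at cleaning time avoids recomputing per user request.
    """
    orgs = defaultdict(list)
    for critic, org in reviews:
        if org not in orgs[critic]:  # keep unique orgs, in first-seen order
            orgs[critic].append(org)
    return {critic: "|".join(o) for critic, o in orgs.items()}
```

Writing this column back into the data set during cleaning means the Shiny app only has to look the string up, rather than aggregate it on every submit.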

Hey, Thanks!

Thank you for taking the time to read about my project! I was really excited to create this application because I think it serves a real purpose and can be helpful for a large group of people. Feel free to check out my application, and please let me know if it helped you find a movie you enjoyed! Here is a link to my GitHub repository if you would like to explore my code.



R Packages worth a look

Binary Indexed Tree (rbit)
A simple implementation of Binary Indexed Tree by R. The BinaryIndexedTree class supports construction of Binary Indexed Tree from a vector, update of …

Standard Young Tableaux (syt)
Deals with standard Young tableaux (field of combinatorics). Performs enumeration, counting, random generation, the Robinson-Schensted correspondence, …

Estimation and Prediction for Remote Effects Spatial Process Models (telefit)
Implementation of the remote effects spatial process (RESP) model for teleconnection. The RESP model is a geostatistical model that allows a spatially- …

Interactive Visualization of Bayesian Networks (bnviewer)
Interactive visualization of Bayesian Networks. The ‘bnviewer’ package reads various structure learning algorithms provided by the ‘bnlearn’ package an …

Routines for Fit, Inference and Diagnostics in LAD Models (LadR)
LAD (Least Absolute Deviations) estimation for linear regression, confidence intervals, tests of hypotheses, methods for outliers detection, measures o …


Thanks for reading!