# My Data Science Blogs

## August 15, 2018

### TrackML Challenge

I just won a gold medal for my 9th-place finish in the TrackML challenge hosted on Kaggle.  The challenge was proposed by CERN (the European Organization for Nuclear Research).  The problem was to reconstruct the trajectories of high-energy particles from the tracks they leave in the detectors used at CERN.  The data we were given was actually a simulation of forthcoming CERN detectors.  Here is how CERN describes the challenge:

> To explore what our universe is made of, scientists at CERN are colliding protons, essentially recreating mini big bangs, and meticulously observing these collisions with intricate silicon detectors.
>
> While orchestrating the collisions and observations is already a massive scientific accomplishment, analyzing the enormous amounts of data produced from the experiments is becoming an overwhelming challenge.
>
> Event rates have already reached hundreds of millions of collisions per second, meaning physicists must sift through tens of petabytes of data per year. And, as the resolution of detectors improves, ever better software is needed for real-time pre-processing and filtering of the most promising events, producing even more data.
>
> To help address this problem, a team of Machine Learning experts and physics scientists working at CERN (the world's largest high energy physics laboratory) has partnered with Kaggle and prestigious sponsors to answer the question: can machine learning assist high energy physics in discovering and characterizing new particles?
>
> Specifically, in this competition, you're challenged to build an algorithm that quickly reconstructs particle tracks from 3D points left in the silicon detectors.

I used an unsupervised machine learning technique known as clustering; see my detailed writeup.  The key was to preprocess the data so that the clustering algorithm could find the particle tracks more easily, which is very similar to feature engineering for supervised machine learning.  The main issue was the computational resources my solution required: about 20 CPU hours per event, and we needed to predict tracks for 125 events.  This would not have been possible without parallelism on multicore machines.  I used an IBM Power 9 machine with 40 cores in order to compute a submission in less than 3 days.
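
The overall shape of that approach can be sketched as follows. This is an illustrative toy, not the actual competition solution: the (x, y) hit format, the polar-angle feature, and the `eps` threshold are all hypothetical simplifications standing in for the real preprocessing.

```python
# Toy sketch: preprocess each hit into a feature in which hits from the same
# track lie close together, cluster per event, and parallelize over events.
import math
from concurrent.futures import ProcessPoolExecutor

def track_feature(hit):
    """Toy preprocessing: map an (x, y) hit to its polar angle, so hits on the
    same roughly radial track get similar feature values."""
    x, y = hit
    return math.atan2(y, x)

def cluster_event(hits, eps=0.05):
    """Greedy 1-D clustering on the preprocessed feature: sort hits by feature
    and start a new cluster whenever the gap to the previous hit exceeds eps."""
    order = sorted((track_feature(h), i) for i, h in enumerate(hits))
    labels = [0] * len(hits)
    cluster = 0
    for (prev_f, _), (cur_f, idx) in zip(order, order[1:]):
        if cur_f - prev_f > eps:
            cluster += 1
        labels[idx] = cluster
    return labels

if __name__ == "__main__":
    # Two toy "events", each containing two clearly separated tracks.
    events = [
        [(1, 0), (2, 0.01), (0, 1), (0.01, 2)],
        [(1, 1), (2, 2.01), (-1, 0.5), (-2, 1.01)],
    ]
    # Events are independent, so they map cleanly onto a process pool,
    # one worker per core.
    with ProcessPoolExecutor(max_workers=2) as pool:
        all_labels = list(pool.map(cluster_event, events))
    print(all_labels)
```

Because each event is clustered independently, the per-event cost stays the same but wall-clock time divides by the number of cores, which is what made a 20-CPU-hour-per-event solution feasible on a 40-core machine.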

Many participants also had significant running times; for instance, the second-place finisher says some events took him up to 3 CPU days.  This makes the first-place winner's achievement extremely impressive.  Not only did they achieve an amazing detection rate, but their code runs in 8 minutes per event!  I'm not in the same league as them.

Anyway, I'm still happy with my result.  And the cherry on the cake: this gold medal earned me the Kaggle Competitions Grandmaster title.  I'm the second Kaggler to become a Grandmaster in two categories.

### Radial Basis Function Approximations: Comparison and Applications

Approximation of scattered data is a common task in many engineering problems. Radial Basis Function (RBF) approximation is appropriate for large scattered (unordered) datasets in d-dimensional space. The approach is especially useful for higher dimensions (d > 2), because other methods require converting a scattered dataset to an ordered one (i.e. obtaining a semi-regular mesh via some tessellation technique), which is computationally expensive. The RBF approximation is non-separable, as it is based on the distance between two points. The method leads to solving a linear system of equations (LSE) Ac = h. In this paper several RBF approximation methods are briefly introduced and compared with respect to the stability and accuracy of computation. The proposed RBF approximation offers lower memory requirements and better approximation quality.
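
As a minimal illustration of the idea in the abstract above, here is a hedged one-dimensional sketch: build the kernel matrix A, solve Ac = h for the weights, and evaluate the resulting approximant. The Gaussian kernel, the shape parameter `eps`, and the naive dense solver are assumptions for the demo; a real implementation would use an optimized, stability-aware solver and d-dimensional points.

```python
import math

def rbf_fit_eval(xs, hs, query, eps=1.0):
    """Gaussian RBF interpolation in 1-D: build A with
    A[i][j] = exp(-(eps * |x_i - x_j|)^2), solve A c = h for the weights c,
    then evaluate sum_j c_j * phi(|query - x_j|)."""
    n = len(xs)

    def phi(r):
        return math.exp(-(eps * r) ** 2)

    A = [[phi(abs(xs[i] - xs[j])) for j in range(n)] for i in range(n)]
    b = list(hs)

    # Naive Gaussian elimination with partial pivoting (fine for tiny systems;
    # the paper's point is precisely that stability matters for real ones).
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= f * A[col][k]
            b[row] -= f * b[col]

    # Back substitution.
    c = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(A[row][k] * c[k] for k in range(row + 1, n))
        c[row] = (b[row] - s) / A[row][row]

    return sum(c[j] * phi(abs(query - xs[j])) for j in range(n))
```

Since the weights solve Ac = h exactly, the approximant reproduces the data values at the sample points; between them it blends the radial kernels.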

### Data Science Portfolio Project: Is Fandango Still Inflating Ratings?

At Dataquest, we strongly advocate portfolio projects as a means of getting your first data science job. In this blog post, we'll walk you through an example portfolio project.

The project is part of our Statistics Fundamentals course, and it assumes some familiarity with:

• Sampling (simple random sampling, populations, samples, parameters, statistics)
• Variables
• Frequency distributions
• pandas and matplotlib

If you think you need to fill in any gaps before moving forward, we cover the topics above in depth in our Statistics Fundamentals course. The course also gives you more in-depth instructions on how to build this project, and lets you code it in your browser.

To learn more about pandas and matplotlib, check out our Data Scientist Path.

# Is Fandango still inflating ratings?

In October 2015, Walt Hickey from FiveThirtyEight published a popular article where he presented strong evidence that suggested Fandango's movie rating system was biased and dishonest. In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.

We'll work with two samples of movie ratings: the data in one sample was collected previous to Hickey's analysis, while the other sample was collected after. Let's start by reading in the two samples (which are stored as CSV files) and getting familiar with their structure.

```python
import pandas as pd

pd.options.display.max_columns = 100  # avoid truncated display of wide tables

# File names as used in the data sets' repositories (adjust paths as needed)
previous = pd.read_csv('fandango_score_comparison.csv')
after = pd.read_csv('movie_ratings_16_17.csv')

previous.head(3)
```

```
  FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RT_norm RT_user_norm Metacritic_norm Metacritic_user_nom IMDB_norm RT_norm_round RT_user_norm_round Metacritic_norm_round Metacritic_user_norm_round IMDB_norm_round Metacritic_user_vote_count IMDB_user_vote_count Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 74 86 66 7.1 7.8 5.0 4.5 3.70 4.3 3.30 3.55 3.90 3.5 4.5 3.5 3.5 4.0 1330 271107 14846 0.5
1 Cinderella (2015) 85 80 67 7.5 7.1 5.0 4.5 4.25 4.0 3.35 3.75 3.55 4.5 4.0 3.5 4.0 3.5 249 65709 12640 0.5
2 Ant-Man (2015) 80 90 64 8.1 7.8 5.0 4.5 4.00 4.5 3.20 4.05 3.90 4.0 4.5 3.0 4.0 4.0 627 103660 12055 0.5
```

```python
after.head(3)
```

```
  movie year metascore imdb tmeter audience fandango n_metascore n_imdb n_tmeter n_audience nr_metascore nr_imdb nr_tmeter nr_audience
0 10 Cloverfield Lane 2016 76 7.2 90 79 3.5 3.80 3.60 4.5 3.95 4.0 3.5 4.5 4.0
1 13 Hours 2016 48 7.3 50 83 4.5 2.40 3.65 2.5 4.15 2.5 3.5 2.5 4.0
2 A Cure for Wellness 2016 47 6.6 40 47 3.0 2.35 3.30 2.0 2.35 2.5 3.5 2.0 2.5
```

Below, we isolate only the columns that provide information about Fandango, to make the relevant data more readily available for later use. We'll make copies to avoid any SettingWithCopyWarning later on.

```python
fandango_previous = previous[['FILM', 'Fandango_Stars', 'Fandango_Ratingvalue',
                              'Fandango_votes', 'Fandango_Difference']].copy()
fandango_after = after[['movie', 'year', 'fandango']].copy()

fandango_previous.head(3)
```

```
  FILM Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 5.0 4.5 14846 0.5
1 Cinderella (2015) 5.0 4.5 12640 0.5
2 Ant-Man (2015) 5.0 4.5 12055 0.5
```

```python
fandango_after.head(3)
```

```
  movie year fandango
0 10 Cloverfield Lane 2016 3.5
1 13 Hours 2016 4.5
2 A Cure for Wellness 2016 3.0
```

Our goal is to determine whether there has been any change in Fandango's rating system after Hickey's analysis. The population of interest for our analysis is made up of all the movie ratings stored on Fandango's website, regardless of release year.

Because we want to find out whether the parameters of this population changed after Hickey's analysis, we're interested in sampling the population at two different periods in time — before and after Hickey's analysis — so we can compare the two states.

The data we're working with was sampled at the moments we want: one sample was taken previous to the analysis, and the other after the analysis. We want to describe the population, so we need to make sure that the samples are representative. Otherwise we should expect a large sampling error and, ultimately, wrong conclusions.
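
To see why representativeness matters, here is a small synthetic illustration (the ratings below are made up, not Fandango's): a purposive sample that only admits high ratings misses the population mean by far more than a simple random sample does.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# Synthetic population of 10,000 star ratings, uniform over the rating classes.
population = [random.choice([2.5, 3.0, 3.5, 4.0, 4.5, 5.0]) for _ in range(10_000)]
pop_mean = sum(population) / len(population)

# A simple random sample vs. a purposive sample that only admits ratings >= 4.0.
random_sample = random.sample(population, 200)
biased_sample = [r for r in population if r >= 4.0][:200]

# Sampling error: distance between the sample mean and the population mean.
random_error = abs(sum(random_sample) / 200 - pop_mean)
biased_error = abs(sum(biased_sample) / 200 - pop_mean)
print(f"sampling error - random: {random_error:.3f}, biased: {biased_error:.3f}")
```

The biased sample's error is large no matter how many movies it contains, because the selection criterion, not chance, drives the gap.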

From Hickey's article and from the README.md of the data set's repository, we can see that he used the following sampling criteria:

• The movie must have had at least 30 fan ratings on Fandango's website at the time of sampling (Aug. 24, 2015).
• The movie must have had tickets on sale in 2015.

The sampling was clearly not random because not every movie had the same chance of being included in the sample — some movies had no chance at all (those with under 30 fan ratings, or those without tickets on sale in 2015). It's questionable whether this sample is representative of the entire population we're interested in describing. It seems more likely that it isn't, mostly because the sample is subject to temporal trends — e.g. movies in 2015 might have been outstandingly good or bad compared to other years.

The sampling conditions for our other sample were (as described in the README.md of the data set's repository):

• The movie must have been released in 2016 or later.
• The movie must have had a considerable number of votes and reviews (unclear how many from the README.md or from the data).

This second sample is also subject to temporal trends and it's unlikely to be representative of our population of interest.

Both Hickey and the repo author had certain research questions in mind when they sampled the data, and they used a set of criteria to get a sample that would fit their questions. Their sampling method is called purposive sampling (or judgmental/selective/subjective sampling). While these samples were good enough for their research, they don't seem too useful for us.

# Changing the goal of our analysis

At this point, we can either collect new data or change the goal of our analysis. Let's choose the latter and place some limitations on our initial goal.

Instead of trying to determine whether there has been any change in Fandango's rating system after Hickey's analysis, our new goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. This new goal should also be a fairly good proxy for our initial goal.

# Isolating the samples we need

With this new research goal, we have two populations of interest:

1. All Fandango's ratings for popular movies released in 2015.
2. All Fandango's ratings for popular movies released in 2016.

We need to be clear about what counts as popular movies. We'll use Hickey's benchmark of 30 fan ratings and count a movie as popular only if it has 30 fan ratings or more on Fandango's website.

Although one of the sampling criteria in our second sample is movie popularity, the sample doesn't provide information about the number of fan ratings. We should be skeptical once more and ask whether this sample is truly representative and contains popular movies (movies with over 30 fan ratings).

One quick way to check the representativity of this sample is to randomly sample 10 movies from it and then check the number of fan ratings ourselves on Fandango's website. Ideally, at least 8 out of the 10 movies have 30 fan ratings or more.

```python
fandango_after.sample(10, random_state=1)
```

```
    movie year fandango
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5
51 Fantastic Beasts and Where to Find Them 2016 4.5
33 Cell 2016 3.0
59 Genius 2016 3.5
152 Sully 2016 4.5
4 A Hologram for the King 2016 3.0
31 Captain America: Civil War 2016 4.5
```

Above we used a value of 1 as the random seed. This is good practice because it suggests that we weren't trying out various random seeds just to get a favorable sample.

As of April 2018, these are the fan ratings we found:

| Movie | Fan ratings |
|---|---|
| Mechanic: Resurrection | 2247 |
| Warcraft | 7271 |
| Max Steel | 493 |
| Me Before You | 5263 |
| Fantastic Beasts and Where to Find Them | 13400 |
| Cell | 17 |
| Genius | 127 |
| Sully | 11877 |
| A Hologram for the King | 500 |
| Captain America: Civil War | 35057 |

90% of the movies in our sample are popular. This is enough, and we can move forward with a bit more confidence.

Let's also double-check the other data set for popular movies. The documentation states clearly that it contains only movies with at least 30 fan ratings, but it should take only a couple of seconds to verify.

```python
sum(fandango_previous['Fandango_votes'] < 30)
```

```
0
```


If you explore the two data sets, you'll notice that there are movies with a release year other than 2015 or 2016. For our purposes, we'll need to isolate only the movies released in 2015 and 2016.

Let's start with Hickey's data set and isolate only the movies released in 2015. There's no special column for the releasing year, but we should be able to extract it from the strings in the FILM column.

```python
fandango_previous.head(2)
```

```
  FILM Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 5.0 4.5 14846 0.5
1 Cinderella (2015) 5.0 4.5 12640 0.5
```

```python
# The year occupies the last characters of FILM, e.g. "(2015)"
fandango_previous['Year'] = fandango_previous['FILM'].str[-5:-1]
fandango_previous.head(2)
```

```
  FILM Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference Year
0 Avengers: Age of Ultron (2015) 5.0 4.5 14846 0.5 2015
1 Cinderella (2015) 5.0 4.5 12640 0.5 2015
```

```python
fandango_previous['Year'].value_counts()
```

```
2015    129
2014     17
Name: Year, dtype: int64
```

```python
fandango_2015 = fandango_previous[fandango_previous['Year'] == '2015'].copy()
fandango_2015['Year'].value_counts()
```

```
2015    129
Name: Year, dtype: int64
```


Great, now let's isolate the movies in the other data set.

```python
fandango_after.head(2)
```

```
  movie year fandango
0 10 Cloverfield Lane 2016 3.5
1 13 Hours 2016 4.5
```

```python
fandango_after['year'].value_counts()
```

```
2016    191
2017     23
Name: year, dtype: int64
```

```python
fandango_2016 = fandango_after[fandango_after['year'] == 2016].copy()
fandango_2016['year'].value_counts()
```

```
2016    191
Name: year, dtype: int64
```


# Comparing distribution shapes for 2015 and 2016

Our aim is to figure out whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. One way to go about this is to analyze and compare the distributions of movie ratings for the two samples.

We'll start with comparing the shape of the two distributions using kernel density plots. We'll use the FiveThirtyEight style for the plots.

```python
import matplotlib.pyplot as plt
from numpy import arange
%matplotlib inline
plt.style.use('fivethirtyeight')

fandango_2015['Fandango_Stars'].plot.kde(label='2015', legend=True, figsize=(8, 5.5))
fandango_2016['fandango'].plot.kde(label='2016', legend=True)

plt.title("Comparing distribution shapes for Fandango's ratings\n(2015 vs 2016)",
          y=1.07)  # the y parameter pads the title upward
plt.xlabel('Stars')
plt.xlim(0, 5)  # ratings start at 0 and end at 5
plt.xticks(arange(0, 5.1, .5))
plt.show()
```


Two aspects of the figure above are striking:

• Both distributions are strongly left skewed.
• The 2016 distribution is slightly shifted to the left relative to the 2015 distribution.

The left skew suggests that movies on Fandango are given mostly high and very high fan ratings. Coupled with the fact that Fandango sells tickets, the high ratings are a bit dubious. It'd be really interesting to investigate this further, ideally in a separate project, since it falls outside the current goal of our analysis.

The slight left shift of the 2016 distribution is very interesting for our analysis. It suggests that there was indeed a difference between Fandango's ratings for popular movies in 2015 and in 2016, and it also shows the direction of the difference: ratings in 2016 were slightly lower.

# Comparing relative frequencies

It seems we're following a good thread so far, but we need to analyze more granular information. Let's examine the frequency tables of the two distributions to analyze some numbers. Because the data sets have different numbers of movies, we normalize the tables and show percentages instead.

```python
print('2015' + '\n' + '-' * 16)  # label the table so the two distributions aren't mixed up
fandango_2015['Fandango_Stars'].value_counts(normalize=True).sort_index() * 100
```

```
2015
----------------
3.0     8.527132
3.5    17.829457
4.0    28.682171
4.5    37.984496
5.0     6.976744
Name: Fandango_Stars, dtype: float64
```

```python
print('2016' + '\n' + '-' * 16)
fandango_2016['fandango'].value_counts(normalize=True).sort_index() * 100
```

```
2016
----------------
2.5     3.141361
3.0     7.329843
3.5    24.083770
4.0    40.314136
4.5    24.607330
5.0     0.523560
Name: fandango, dtype: float64
```


In 2016, very high ratings (4.5 and 5 stars) had significantly lower percentages than in 2015. Under 1% of the movies in 2016 had a perfect 5-star rating, compared to close to 7% in 2015. Ratings of 4.5 were also more common in 2015: roughly 13 percentage points more movies received a 4.5 in 2015 than in 2016.

The minimum rating is also lower in 2016 — 2.5 instead of 3 stars, the minimum of 2015. There clearly is a difference between the two frequency distributions.

For some other ratings, the percentage went up in 2016: a greater share of movies received 3.5 and 4 stars than in 2015. Since 3.5 and 4.0 are also high ratings, this challenges the direction of the change we saw on the kernel density plots.
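
To make that tension concrete, we can subtract the two frequency tables. The percentages below are hardcoded from the `value_counts` output above so the snippet is self-contained; positive numbers mean a rating class became more common in 2016.

```python
# Percentages from the 2015 and 2016 frequency tables above.
freq_2015 = {3.0: 8.527132, 3.5: 17.829457, 4.0: 28.682171,
             4.5: 37.984496, 5.0: 6.976744}
freq_2016 = {2.5: 3.141361, 3.0: 7.329843, 3.5: 24.083770,
             4.0: 40.314136, 4.5: 24.607330, 5.0: 0.523560}

# Percentage-point change per rating class (a class missing from one year
# simply contributes 0% for that year).
stars = sorted(set(freq_2015) | set(freq_2016))
change = {s: round(freq_2016.get(s, 0) - freq_2015.get(s, 0), 1) for s in stars}
print(change)
```

The 4.5 and 5.0 classes shrink while 3.5 and 4.0 grow, which is why looking at the shape alone isn't enough and we need summary statistics to settle the direction.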

# Determining the direction of the change

Let's take a couple of summary metrics to get a more precise picture about the direction of the change. In what follows, we'll compute the mean, the median, and the mode for both distributions and then use a bar graph to plot the values.

```python
mean_2015 = fandango_2015['Fandango_Stars'].mean()
mean_2016 = fandango_2016['fandango'].mean()

median_2015 = fandango_2015['Fandango_Stars'].median()
median_2016 = fandango_2016['fandango'].median()

mode_2015 = fandango_2015['Fandango_Stars'].mode()[0]  # Series.mode() returns a Series
mode_2016 = fandango_2016['fandango'].mode()[0]

summary = pd.DataFrame()
summary['2015'] = [mean_2015, median_2015, mode_2015]
summary['2016'] = [mean_2016, median_2016, mode_2016]
summary.index = ['mean', 'median', 'mode']
summary
```

```
        2015      2016
mean    4.085271  3.887435
median  4.000000  4.000000
mode    4.500000  4.000000
```

```python
plt.style.use('fivethirtyeight')
summary['2015'].plot.bar(color='#0066FF', align='center', label='2015', width=.25)
summary['2016'].plot.bar(color='#CC0000', align='edge', label='2016', width=.25,
                         rot=0, figsize=(8, 5))

plt.title('Comparing summary statistics: 2015 vs 2016', y=1.07)
plt.ylim(0, 5.5)
plt.yticks(arange(0, 5.1, .5))
plt.ylabel('Stars')
plt.legend(framealpha=0, loc='upper center')
plt.show()
```


The mean rating was lower in 2016 by approximately 0.2 stars, a drop of almost 5% relative to the mean rating in 2015.

```python
(summary.loc['mean'][0] - summary.loc['mean'][1]) / summary.loc['mean'][0]
```

```
0.048426835689519929
```


While the median is the same for both distributions, the mode is lower in 2016 by 0.5. Coupled with what we saw for the mean, the direction of the change we saw on the kernel density plot is confirmed: on average, popular movies released in 2016 were rated slightly lower than popular movies released in 2015.

# Conclusion

Our analysis showed that there's indeed a slight difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We also determined that, on average, popular movies released in 2016 were rated lower on Fandango than popular movies released in 2015.

We cannot be completely sure what caused the change, but the chances are very high that it was caused by Fandango fixing the biased rating system after Hickey's analysis.

### R Packages worth a look

Estimation, Comparison and Selection of Transformations (trafo)
Estimation, selection and comparison of several families of transformations. The families of transformations included in the package are the following: …

Easily Install and Load the ‘Tidymodels’ Packages (tidymodels)
The tidy modeling ‘verse’ is a collection of packages for modeling and statistical analysis that share the underlying design philosophy, grammar, and da …

Sample Size Calculations for Complex Surveys (samplesize4surveys)
Computes the required sample size for estimation of totals, means and proportions under complex sampling designs.

A Colour List and Colour Metric Based on the ISCC-NBS System of Color Designation (rolocISCCNBS)
A colour list and colour metric based on the ISCC-NBS System of Color Designation for use with the ‘roloc’ package for converting colour specifications …

Compute Directly Standardized Rates, Ratios and Differences (dsr)
A set of functions to compute and compare directly standardized rates, rate differences and ratios. A variety of user defined options for analysis (e.g …

### What's new on arXiv

Existing approximate nearest neighbor search systems suffer from two fundamental problems that are of practical importance but have not received sufficient attention from the research community. First, although existing systems perform well for the whole database, it is difficult to run a search over a subset of the database. Second, there has been no discussion concerning the performance decrement after many items have been newly added to a system. We develop a reconfigurable inverted index (Rii) to resolve these two issues. Based on the standard IVFADC system, we design a data layout such that items are stored linearly. This enables us to efficiently run a subset search by switching the search method to a linear PQ scan if the size of a subset is small. Owing to the linear layout, the data structure can be dynamically adjusted after new items are added, maintaining the fast speed of the system. Extensive comparisons show that Rii achieves a comparable performance with state-of-the-art systems such as Faiss.
This paper presents an empirical exploration of the use of capsule networks for text classification. While it has been shown that capsule networks are effective for image classification, their validity in the domain of text has not been explored. In this paper, we show that capsule networks indeed have the potential for text classification and that they have several advantages over convolutional neural networks. We further suggest a simple routing method that effectively reduces the computational complexity of dynamic routing. We utilized seven benchmark datasets to demonstrate that capsule networks, along with the proposed routing method, provide comparable results.
The time series classification literature has expanded rapidly over the last decade, with many new classification approaches published each year. The research focus has mostly been on improving the accuracy and efficiency of classifiers, while their interpretability has been somewhat neglected. Classifier interpretability has become a critical constraint for many application domains and the introduction of the ‘right to explanation’ GDPR EU legislation in May 2018 is likely to further emphasize the importance of explainable learning algorithms. In this work we analyse the state-of-the-art for time series classification, and propose new algorithms that aim to maintain the classifier accuracy and efficiency, but keep interpretability as a key design constraint. We present new time series classification algorithms that advance the state-of-the-art by implementing the following three key ideas: (1) Multiple resolutions of symbolic approximations: we combine symbolic representations obtained using different parameters; (2) Multiple domain representations: we combine symbolic approximations in time (e.g., SAX) and frequency (e.g., SFA) domains; (3) Efficient navigation of a huge symbolic-words space: we adapt a symbolic sequence classifier named SEQL, to make it work with multiple domain representations (e.g., SAX-SEQL, SFA-SEQL), and use its greedy feature selection strategy to effectively filter the best features for each representation. We show that a multi-resolution multi-domain linear classifier, SAX-SFA-SEQL, achieves a similar accuracy to the state-of-the-art COTE ensemble, and to a recent deep learning method (FCN), but uses a fraction of the time required by either COTE or FCN. We discuss the accuracy, efficiency and interpretability of our proposed algorithms. To further analyse the interpretability aspect of our classifiers, we present a case study on an ecology benchmark.
Big data is a term used for very large data sets that are difficult to store and process. Analyzing this much data can lead to information loss. The main goal of this paper is to share data in a way that preserves privacy while keeping information loss to a minimum. Data such as government agency records, university details and medical histories are very useful for an organization to analyze and to predict trends and patterns, but privacy regulations may prevent the data owner from sharing them [1]. By analyzing several anonymization algorithms such as k-anonymity, l-diversity and t-closeness, one can achieve privacy at minimum loss. However, these techniques have some limitations, and a trade-off between privacy and information loss must be maintained. We introduce a novel approach called Differential Privacy.
A time-domain test for the assumption of second order stationarity of a functional time series is proposed. The test is based on combining individual cumulative sum tests which are designed to be sensitive to changes in the mean, variance and autocovariance operators, respectively. The combination of their dependent $p$-values relies on a joint dependent block multiplier bootstrap of the individual test statistics. Conditions under which the proposed combined testing procedure is asymptotically valid under stationarity are provided. A procedure is proposed to automatically choose the block length parameter needed for the construction of the bootstrap. The finite-sample behavior of the proposed test is investigated in Monte Carlo experiments and an illustration on a real data set is provided.
Certain upper triangular matrices, termed as Parikh matrices, are often used in the combinatorial study of words. Given a word, the Parikh matrix of that word elegantly computes the number of occurrences of certain predefined subwords in that word. In this paper, we compute the Parikh matrix of any word raised to an arbitrary power. Furthermore, we propose canonical decompositions of both Parikh matrices and words into normal forms. Finally, given a Parikh matrix, the relation between its normal form and the normal forms of words in the corresponding M-equivalence class is established.
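
As a concrete illustration of the object studied in the abstract above (a hedged sketch; the binary alphabet and example words are arbitrary choices, not taken from the paper), the Parikh matrix of a word over an ordered alphabet can be computed as a product of elementary upper triangular matrices, one per letter:

```python
def parikh_matrix(word, alphabet="ab"):
    """Parikh matrix of `word` over an ordered alphabet of size k: a
    (k+1)x(k+1) upper triangular matrix with 1s on the diagonal, letter
    counts on the first superdiagonal, and scattered-subword counts above."""
    k = len(alphabet)
    # Start from the identity matrix.
    M = [[1 if i == j else 0 for j in range(k + 1)] for i in range(k + 1)]
    for ch in word:
        i = alphabet.index(ch)
        # Right-multiply by the elementary matrix for letter i, which
        # adds column i into column i + 1.
        for row in range(k + 1):
            M[row][i + 1] += M[row][i]
    return M
```

For "abab" over {a, b}, entry (0, 1) counts the a's, entry (1, 2) counts the b's, and entry (0, 2) counts occurrences of "ab" as a scattered subword.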
In this paper, we introduce an embedding model, named CapsE, exploring a capsule network to model relationship triples \textit{(subject, relation, object)}. Our CapsE represents each triple as a 3-column matrix where each column vector represents the embedding of an element in the triple. This 3-column matrix is then fed to a convolution layer where multiple filters are operated to generate different feature maps. These feature maps are used to construct capsules in the first capsule layer. Capsule layers are connected via dynamic routing mechanism. The last capsule layer consists of only one capsule to produce a vector output. The length of this vector output is used to measure the plausibility of the triple. Our proposed CapsE obtains state-of-the-art link prediction results for knowledge graph completion on two benchmark datasets: WN18RR and FB15k-237, and outperforms strong search personalization baselines on SEARCH17 dataset.
Most approaches to Knowledge Base Question Answering are based on semantic parsing. In this paper, we address the problem of learning vector representations for complex semantic parses that consist of multiple entities and relations. Previous work largely focused on selecting the correct semantic relations for a question and disregarded the structure of the semantic parse: the connections between entities and the directions of the relations. We propose to use Gated Graph Neural Networks to encode the graph structure of the semantic parse. We show on two data sets that the graph networks outperform all baseline models that do not explicitly model the structure. The error analysis confirms that our approach can successfully process complex semantic parses.
PatternAttribution is a recent method, introduced in the vision domain, that explains classifications of deep neural networks. We demonstrate that it also generates meaningful interpretations in the language domain.
The concept of Probability of Causation (PC) is critically important in legal contexts and can help in many other domains. While it has been around since 1986, current operationalizations can obtain only the minimum and maximum values of PC, and do not apply for purely observational data. We present a theoretical framework to estimate the distribution of PC from experimental and from purely observational data. We illustrate additional problems of the existing operationalizations and show how our method can be used to address them. We also provide two illustrative examples of how our method is used and how factors like sample size or rarity of events can influence the distribution of PC. We hope this will make the concept of PC more widely usable in practice.
We develop a semiparametric Bayesian approach for estimating the mean response in a missing data model with binary outcomes and a nonparametrically modelled propensity score. Equivalently we estimate the causal effect of a treatment, correcting nonparametrically for confounding. We show that standard Gaussian process priors satisfy a semiparametric Bernstein-von Mises theorem under smoothness conditions. We further propose a novel propensity score-dependent prior that provides efficient inference under strictly weaker conditions. We also show that it is theoretically preferable to model the covariate distribution with a Dirichlet process or Bayesian bootstrap, rather than modelling the covariate density using a Gaussian process prior.
In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes. Therefore, it is crucial that domain specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite these arguments, neural networks are often treated as black boxes. In the attempt to alleviate this shortcoming many analysis methods were proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. The presented library iNNvestigate addresses this by providing a common interface and out-of-the-box implementation for many analysis methods, including the reference implementation for PatternNet and PatternAttribution as well as for LRP-methods. To demonstrate the versatility of iNNvestigate, we provide an analysis of image classifications for a variety of state-of-the-art neural network architectures.
This paper provides a link between causal inference and machine learning techniques – specifically, Classification and Regression Trees (CART) – in observational studies where the receipt of the treatment is not randomized, but the assignment to the treatment can be assumed to be randomized (irregular assignment mechanism). The paper contributes to the growing applied machine learning literature on causal inference, by proposing a modified version of the Causal Tree (CT) algorithm to draw causal inference from an irregular assignment mechanism. The proposed method is developed by merging the CT approach with the instrumental variable framework to causal inference, hence the name Causal Tree with Instrumental Variable (CT-IV). As compared to CT, the main strength of CT-IV is that it can deal more efficiently with the heterogeneity of causal effects, as demonstrated by a series of numerical results obtained on synthetic data. Then, the proposed algorithm is used to evaluate a public policy implemented by the Tuscan Regional Administration (Italy), which aimed at easing the access to credit for small firms. In this context, CT-IV breaks fresh ground for target-based policies, identifying interesting heterogeneous causal effects.
Background: It is still an open research area to theoretically understand why Deep Neural Networks (DNNs)—equipped with many more parameters than training data and trained by (stochastic) gradient-based methods—often achieve remarkably low generalization error. Contribution: We study DNN training by Fourier analysis. Our theoretical framework explains: i) DNN with (stochastic) gradient-based methods endows low-frequency components of the target function with a higher priority during the training; ii) Small initialization leads to good generalization ability of DNN while preserving the DNN’s ability of fitting any function. These results are further confirmed by experiments of DNNs fitting the following datasets, i.e., natural images, one-dimensional functions and MNIST dataset.
Root Cause Analysis for Anomalies is challenging because of the trade-off between accuracy and the explanatory friendliness required for industrial applications. In this paper we propose a framework for simple and friendly RCA within the Bayesian regime under certain restrictions imposed on the predictive posterior (that the Hessian at the mode is diagonal, here referred to as \emph{separability}). We show that this assumption is satisfied for important base models, including Multinomial, Dirichlet-Multinomial and Naive Bayes. To demonstrate the usefulness of the framework, we embed it into a Bayesian network and validate it on web server error logs (a real-world data set).
In this paper, we propose a convolutional neural network (CNN) with 3-D rank-1 filters, which are composed of the outer product of 1-D filters. After training, the 3-D rank-1 filters can be decomposed into 1-D filters at test time for fast inference. The reason we train 3-D rank-1 filters in the training stage instead of consecutive 1-D filters is that a better gradient flow can be obtained with this setting, which makes training possible even in cases where a network with consecutive 1-D filters cannot be trained. The 3-D rank-1 filters are updated by both the gradient flow and the outer product of the 1-D filters in every epoch, where the gradient flow tries to obtain a solution which minimizes the loss function, while the outer product operation tries to make the parameters of the filter live on a rank-1 subspace. Furthermore, we show that convolution with the rank-1 filters results in low-rank outputs, constraining the final output of the CNN to also live on a low-dimensional subspace.
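The separability property this abstract relies on is easy to verify numerically. Below is a 2-D NumPy sketch (the paper works with 3-D filters, but the algebra is identical; all names here are invented for illustration): convolving with the rank-1 filter built by `np.outer` gives exactly the same result as applying the two 1-D filters consecutively.

```python
import numpy as np

def conv2_full(X, K):
    """Naive full 2-D convolution via shifted, weighted copies of X."""
    m, n = X.shape
    p, q = K.shape
    out = np.zeros((m + p - 1, n + q - 1))
    for i in range(p):
        for j in range(q):
            out[i:i + m, j:j + n] += K[i, j] * X
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 8))
u = rng.standard_normal(3)        # 1-D filter applied along columns
v = rng.standard_normal(5)        # 1-D filter applied along rows
K = np.outer(u, v)                # rank-1 2-D filter

direct = conv2_full(X, K)
# Separated version: convolve every row with v, then every column with u
rows = np.apply_along_axis(lambda r: np.convolve(r, v), 1, X)
sep = np.apply_along_axis(lambda c: np.convolve(c, u), 0, rows)

print(np.allclose(direct, sep))   # prints True
```

The speed-up at inference time comes from exactly this identity: two 1-D passes cost O(p + q) multiplications per output instead of O(pq).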

## August 14, 2018

### Magister Dixit

“Analytic and decision making have to be probabilistic; and the system and application has to be conscious of what is “good enough” and not fail in the absence of perfect behavior.” Prasant Misra, Yogesh Simmhan, Jay Warrior (03.02.2015)

### Better Analytics for the Product Experience – Aug 21 webinar

Learn a process for discovering the data and analytics needs of your users using user stories, use cases, and mapping to data sources; strategies for balancing priorities and managing expectations; and more.

### Book Memo: “Uncertainty Modelling in Data Science”

 This book features 29 peer-reviewed papers presented at the 9th International Conference on Soft Methods in Probability and Statistics (SMPS 2018), which was held in conjunction with the 5th International Conference on Belief Functions (BELIEF 2018) in Compiègne, France on September 17-21, 2018. It includes foundational, methodological and applied contributions on topics as varied as imprecise data handling, linguistic summaries, model coherence, imprecise Markov chains, and robust optimisation. These proceedings were produced using EasyChair. Over recent decades, interest in extensions and alternatives to probability and statistics has increased significantly in diverse areas, including decision-making, data mining and machine learning, and optimisation. This interest stems from the need to enrich existing models, in order to include different facets of uncertainty, like ignorance, vagueness, randomness, conflict or imprecision. Frameworks such as rough sets, fuzzy sets, fuzzy random variables, random sets, belief functions, possibility theory, imprecise probabilities, lower previsions, and desirable gambles all share this goal, but have emerged from different needs. The advances, results and tools presented in this book are important in the ubiquitous and fast-growing fields of data science, machine learning and artificial intelligence. Indeed, an important aspect of some of the learned predictive models is the trust placed in them. Modelling the uncertainty associated with the data and the models carefully and with principled methods is one of the means of increasing this trust, as the model will then be able to distinguish between reliable and less reliable predictions. In addition, extensions such as fuzzy sets can be explicitly designed to provide interpretable predictive models, facilitating user interaction and increasing trust.

### BooST (Boosting Smooth Trees) a new Machine Learning Model for Partial Effect Estimation in Nonlinear Regressions

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos and Yuri Fonseca

We are happy to introduce our new machine learning method called Boosting Smooth Trees (BooST) (full article here). This model is joint work with professors Marcelo Medeiros and Álvaro Veiga. BooST uses a different type of regression tree that allows us to estimate the derivatives of very general nonlinear models. In other words, the model is differentiable and has an analytical solution. The consequence is that we can now estimate partial effects of a characteristic on the response variable, which provides much more interpretability than traditional importance measures.

The idea behind BooST is to replace traditional Classification and Regression Trees (CART), which are not differentiable, with smooth logistic trees. We show that with this adaptation BooST is a consistent estimator of the model’s derivatives under some assumptions.

## Example

The example below shows that BooST is very good at recovering the derivatives of nonlinear functions. The data was generated from the following model:

$\displaystyle y_i = \cos( \pi [x_{i,1}+x_{i,2}])+ \varepsilon_i$

where $x_{i,1} \sim N(0,1)$, $x_{i,2}$ is Bernoulli with $p = 0.5$ and $\varepsilon_i \sim N(0,1)$. Note that this is not an easy problem: the function that generates the data is non-monotonic, very smooth, and nonlinear. The figure below shows how BooST estimates the model and its derivatives with respect to $x_{i,1}$. We simulated 1000 data points and forced the data to have an $R^2$ of 0.5 (half of the variation in the data is noise).
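For readers who want to reproduce the setup, here is a minimal simulation of this data-generating process in Python/NumPy (the original post works in R; the variable names and the rescaling trick used to hit the target $R^2$ are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.standard_normal(n)             # x_{i,1} ~ N(0, 1)
x2 = rng.binomial(1, 0.5, size=n)       # x_{i,2} ~ Bernoulli(p = 0.5)
signal = np.cos(np.pi * (x1 + x2))

eps = rng.standard_normal(n)            # eps_i ~ N(0, 1), then rescaled
eps *= signal.std() / eps.std()         # noise variance = signal variance -> R^2 ~ 0.5
y = signal + eps

# The partial effect BooST tries to recover, known here in closed form:
dy_dx1 = -np.pi * np.sin(np.pi * (x1 + x2))
```

Having the analytical derivative `dy_dx1` available is what makes this a good benchmark: the estimated partial effects can be compared against the truth point by point.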

## Real data example

To give a more practical example, suppose we want to estimate the effects of price changes on sales without any prior knowledge of the demand function. It is natural to think that these effects will depend on the current price level and possibly on some characteristics of the product. BooST will estimate the partial effects of price on sales conditional on the current (or any) price and any other controls we choose. We will write a post specifically on this example in the future.

Consider yet another example, where we want to estimate how the price of a house changes as we move along latitude or longitude. This is precisely what we did in the figure below using data from house sales in Melbourne (dataset scraped from the web by Tony Pino). The figure shows that if we are south of the CBD and move north, prices increase a lot. However, if we keep moving north, prices start to decrease as we move away from the CBD.

## Other considerations

BooST is also very good for forecasting; we will come back to this topic with examples in future posts. Just keep in mind that if the number of characteristics is too big, it may improve forecasting accuracy, but you will probably lose some interpretability of the partial effects, not because of the model itself but because of some theoretical issues you may run into.

For now we have an implementation of BooST in R, which is fully documented and straightforward to use. It can be downloaded and installed with:

library(devtools)
install_github("gabrielrvsc/BooST")


Note that this implementation is not very fast and is not adequate for very big problems with too many variables. We already have a faster Julia implementation that will be published soon. The even faster C++ version will take some time, but it is also coming. In the next posts we will explore some empirical applications with code to replicate our examples.

## References

Fonseca, Y.; Medeiros, M.; Vasconcelos; G.; Veiga, A. “BooST: Boosting Smooth Trees for Partial Effect Estimation in Nonlinear Regressions” arXiv preprint available at https://arxiv.org/abs/1808.03698 (2018).

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### The Microsoft AI Idea Challenge – Breakthrough Ideas Wanted!

This post is authored by Tara Shankar Jana, Senior Technical Product Marketing Manager at Microsoft.

All of us have creative ideas – ideas that can improve our lives and the lives of thousands, perhaps even millions of others. But how often do we act on turning those ideas into a reality? Most of the time, we do not believe in our ideas strongly enough to pursue them. Other times we feel like we lack a platform to build out our idea or showcase it. Most good ideas don’t go beyond those initial creative thoughts in our head.

If you’re a professional working in the field of artificial intelligence (AI), or an aspiring AI developer or just someone who is passionate about AI and machine learning, Microsoft is excited to offer you an opportunity to transform your most creative ideas into reality. Join the Microsoft AI Idea Challenge Contest today for a chance to win exciting prizes and get your project featured in Microsoft’s AI.lab showcase. Check out the rules, terms and conditions of the contest and then dive right in!

### The Challenge

The Microsoft AI Idea Challenge is seeking breakthrough AI solutions from developers, data scientists, professionals and students, preferably developed on the Microsoft AI platform and services. The challenge gives you a platform to freely share AI models and applications, so they are reusable and easily accessible. The ideas you submit are judged on the parameters shown in the figure below: essentially, half the weight is for the originality of your idea, 20% for the feasibility of your solution, and 30% for the complexity (i.e. level of sophistication) of your implementation.

The Microsoft AI Challenge is accepting submissions between now and October 12th, 2018.

To qualify for the competition, individuals or teams are required to submit a working AI model, test dataset, a demo app and a demo video that can be a maximum of three minutes long. We encourage you to register early and upload your projects soon, so that you can begin to plan and build out your solution and turn in the rest of your materials on time. We are looking for solutions across the whole spectrum of use cases – to be inspired, take a look at some of the examples at AI.lab.

### Prizes

The winners of the first three places in the contest will respectively receive a Surface Book 2, a DJI Drone, and an Xbox One X.

We hope that’s motivation to get you started today – good luck!

Tara

### ebook: Using Deep Learning to Solve Real-World Problems

Read this eBook to learn how deep learning enables image classification, sentiment analysis, and other advanced analysis techniques, and get a starter workflow for building and training deep learning models.

### A transforming river seen from above

The Padma River in Bangladesh is constantly shifting its 75-mile path. Joshua Stevens for the NASA Earth Observatory shows what the shifting looked like through satellite imagery, over a 30-year span.

Kasha Patel:

The upper section of the Padma—the Harirampur region—has experienced the most erosion and shows the most notable changes. The river has become wider at this section by eroding along both banks, although most activity occurred on the left bank. Using topographic, aerial, and satellite imagery, scientists found that the left bank shifted 12 kilometers towards the north from 1860 to 2009 and developed a meandering bend. The river left a scar where the water once flowed, as you can see in the 2018 image.

### Distilled News

NNEF reduces machine learning deployment fragmentation by enabling a rich mix of neural network training tools and inference engines to be used by applications across a diverse range of devices and platforms. The goal of NNEF is to enable data scientists and engineers to easily transfer trained networks from their chosen training framework into a wide variety of inference engines. A stable, flexible and extensible standard that equipment manufacturers can rely on is critical for the widespread deployment of neural networks onto edge devices, and so NNEF encapsulates a complete description of the structure, operations and parameters of a trained neural network, independent of the training tools used to produce it and the inference engine used to execute it.
This is the third episode of my pandas tutorial series. In this one I'll show you four data formatting methods that you might use a lot in data science projects. These are: merge, sort, reset_index and fillna! Of course, there are many others, and at the end of the article, I'll link to a pandas cheat sheet where you can find every function and method you could ever need. Okay! Let's get started!
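The four methods mentioned are standard pandas operations; here is a minimal self-contained illustration with made-up data (not taken from the tutorial), using `sort_values` for the sorting step:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10.0, None, 30.0]})

df = left.merge(right, on="id", how="left")   # join the two tables on a shared key
df = df.sort_values("id", ascending=False)    # reorder the rows
df = df.reset_index(drop=True)                # rebuild a clean 0..n-1 index
df = df.fillna({"score": 0.0})                # replace missing values per column
print(df)
```

Chaining these four calls is a common cleanup pattern: join, order, re-index, then impute.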
Whether you’re a novice data science enthusiast setting up TensorFlow for the first time, or a seasoned AI engineer working with terabytes of data, getting your libraries, packages, and frameworks installed is always a struggle. Learn how datmo, an open source python package, helps you get started in minutes.
0. Prerequisites
1. Install datmo
2. Initialize a datmo project
3. Start environment setup
4. Select System Drivers (CPU or GPU)
5. Select an environment
6. Select a language version (if applicable)
Workflow tools to help you experiment, deploy, and scale. By data scientists, for data scientists. Datmo is an open source model tracking and reproducibility tool for developers. Features
• One command environment setup (languages, frameworks, packages, etc)
• Tracking and logging for model config and results
• Project versioning (model state tracking)
• Visualize + export experiment history
The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.
The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium, and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running, and it's best to use it in a Docker container.
Model interpretability is critical to businesses. If you want to use high performance models (GLM, RF, GBM, Deep Learning, H2O, Keras, xgboost, etc), you need to learn how to explain them. With machine learning interpretability growing in importance, several R packages designed to provide this capability are gaining in popularity. We analyze the IML package in this article. In recent blog posts we assessed LIME for model agnostic local interpretability functionality and DALEX for both local and global machine learning explanation plots. This post examines the iml package (short for Interpretable Machine Learning) to assess its functionality in providing machine learning interpretability to help you determine if it should become part of your preferred machine learning toolbox. We again utilize the high performance machine learning library, h2o, implementing three popular black-box modeling algorithms: GLM (generalized linear models), RF (random forest), and GBM (gradient boosted machines). For those that want a deep dive into model interpretability, the creator of the iml package, Christoph Molnar, has put together a free book: Interpretable Machine Learning. Check it out.
Principal component analysis (PCA) is a dimensionality reduction technique which might come in handy when building a predictive model or in the exploratory phase of your data analysis. It is often the case that when it is most handy you might have forgotten it exists, but let's set that aside for now.
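As a quick refresher, here is a minimal PCA sketch using scikit-learn (the data is synthetic and the names are ours, not from the linked article): a 5-dimensional dataset that actually lives near a 2-dimensional subspace is projected onto its top two principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples in 5 dimensions, built from only 2 latent directions
X = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 5))
X += 0.05 * rng.standard_normal((200, 5))     # small isotropic noise

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)              # project onto top 2 components
print(X_reduced.shape)                        # (200, 2)
print(pca.explained_variance_ratio_.sum())    # close to 1 for this data
```

Checking `explained_variance_ratio_` is the usual way to decide how many components are worth keeping.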
‘Delayed impact of fair machine learning’ (https://…/1803.04383) won a best paper award at ICML this year. It's not an easy read (at least it wasn't for me), but fortunately it's possible to appreciate the main results without following all of the proof details. The central question is how to ensure fair treatment across demographic groups in a population when using a score-based machine learning model to decide who gets an opportunity (e.g. is offered a loan) and who doesn't. Most recently we looked at the equal opportunity and equalized odds models. The underlying assumption of the studied fairness models is, of course, that the fairness criteria promote the long-term well-being of the groups they aim to protect. The big result in this paper is that you can easily end up ‘killing them with kindness’ instead. The potential for this to happen exists when there is a feedback loop in place in the overall system. By overall system here, I mean the human system of which the machine learning model is just a small part. Using the loan/no-loan decision that is a popular study vehicle in fairness papers, we need to consider not just (for example) the opportunity that someone in a disadvantaged group has to qualify for a loan, but also what happens in the future as a result of that loan being made. If the borrower eventually defaults, they will also see a decline in their credit score, which will make it harder to obtain additional loans in the future. A successful lending event, on the other hand, may increase the credit score for the borrower.
What is a Capsule Network? What is a Capsule? Is CapsNet better than a Convolutional Neural Network (CNN)? In this article I will talk about all the above questions about CapsNet or Capsule Network released by Hinton.
Analyze employee churn. Find out why employees are leaving the company and learn to predict, who will leave the company.
In the past, most of the focus was on ‘rates’ such as the attrition rate and retention rate. HR managers compute these rates and try to predict future rates using data warehousing tools. These rates present the aggregate impact of churn, but this is only half the picture. Another approach is to focus on individual records in addition to the aggregate.
Lots of case studies on customer churn are available. In customer churn, you can predict who will stop buying, and when. Employee churn is similar to customer churn, but it focuses on the employee rather than the customer: here, you can predict who will terminate their service, and when. Employee churn is expensive, and incremental improvements will give significant results. It will help us in designing better retention plans and improving employee satisfaction. In this tutorial, you are going to cover the following topics:
• Employee Churn Analysis
• Exploratory data analysis and Data visualization
• Cluster analysis
• Building prediction model using Gradient Boosting Tree.
• Evaluating model performance
• Conclusion
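The gradient-boosting step in the list above can be sketched as follows with scikit-learn. Everything here is synthetic and hypothetical: the feature names, the label rule, and the threshold are invented for illustration, and the tutorial's actual dataset will differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(0, 1, n),        # e.g. satisfaction level
    rng.integers(1, 8, n),       # e.g. number of projects
    rng.integers(90, 310, n),    # e.g. average monthly hours
])
# Synthetic label: low satisfaction makes leaving more likely
y = (X[:, 0] + 0.1 * rng.standard_normal(n) < 0.35).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))
print(f"held-out accuracy: {acc:.2f}")
```

In a real churn analysis the evaluation step would also look at precision/recall, since the "will leave" class is usually the minority class.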
This article is targeted at people who found it difficult to grasp the original paper. Follow me till the end, and I assure you that you will at least get a sense of what is happening underneath this revolutionary machine learning model. Let's get started.
Look into whether the regional fuel tax in Auckland has led to changes in fuel prices in other regions of New Zealand.
Ontology-based Data Access (OBDA) is a new paradigm, based on the use of knowledge representation and reasoning techniques, for governing the resources (data, meta-data, services, processes, etc.) of modern information systems. The challenges of data governance:
• Accessing Data
In large organizations data sources are commonly re-shaped by corrective maintenance and to adapt to application requirements, and applications are changed to meet new requirements. The result is that the data stored in different sources and the processes operating over them tend to be redundant, mutually inconsistent, and obscure for large classes of users. So, accessing data means interacting with IT experts who know where the data are and what they mean in the various contexts, and can therefore translate the information need expressed by the user into appropriate queries. This process can be both expensive and time-consuming.
• Data Quality
Data quality is cited often as a critical factor in delivering high value information services. But how can we check data quality, and how can we decide if it is good if we do not have a clear understanding of the semantics that data should bring? Moreover, how can we judge the quality of external data of business partners, clients, or even public sources, that we connect to? Data quality is also crucial for opening data to external organisations, to favor new business opportunities, or even to the public, which we are seeing more of nowadays in the age of Open Data.
• Process Specification
Information systems are key assets for business organisations, which rely not only on data but also, for instance, on processes and services. Designing and managing processes is an important aspect of information systems, but deciding what a process should do is tough without a clear idea of which data the process will access and how it will possibly change it. The difficulties come from various factors, including the lack of modelling languages and tools for describing processes and data holistically; the problems related to the semantics of data make this task even harder.
• Three-Level Architecture
• The key idea of OBDA is to provide users with access to the information in their data sources through a three-level architecture, constituted by the ontology, the sources, and the mapping between the two, where the ontology is a formal description of the domain of interest and is the heart of the system. Through this architecture, OBDA provides a semantic end-to-end connection between users and data sources, allowing users to directly query data spread across multiple distributed sources through the familiar vocabulary of the ontology: the user formulates SPARQL queries over the ontology, which are transformed, through the mapping layer, into SQL queries over the underlying relational databases.
• The Ontology layer in the architecture is the mean for pursuing a declarative approach to information integration, and, more generally, to data governance. The domain knowledge base of the organization is specified through a formal and high level description of both its static and dynamic aspects, represented by the ontology. By making the representation of the domain explicit, we gain re-usability of the acquired knowledge, which is not achieved when the global schema is simply a unified description of the underlying data sources.
• The Mapping layer connects the Ontology layer with the Data Source layer by defining the relationships between the domain concepts on the one hand and the data sources on the other hand. These mappings are not only used for the operation of the information system, but can also be a significant asset for documentation purposes in cases where the information about data is widespread into separate pieces of documentation that are often difficult to access and rarely conforming to common standards.
• The Data Source layer is constituted by the existing data sources of the organization.
Docker is an increasingly popular way to create and deploy applications through virtualization, but can it be useful for data scientists? This guide should help you quickly get started.

Analysts forecast that bitcoin will achieve a total value of 1.2 trillion towards the end of 2018. However, forward-thinking individuals see the value of blockchain beyond its original application for digital currency transactions You’ve probably heard about blockchain technology through cryptocurrency, but did you know that blockchain technology is expected The post Blockchain Technology Will Revolutionize These Five Industries appeared first on Dataconomy. Continue Reading… ### Microsoft R Open 3.5.1 now available Microsoft R Open 3.5.1 has been released, combining the latest R language engine with multi-processor performance and tools for managing R packages reproducibly. You can download Microsoft R Open 3.5.1 for Windows, Mac and Linux from MRAN now. Microsoft R Open is 100% compatible with all R scripts and packages, and works with all your favorite R interfaces and development environments. This update brings a number of minor fixes to the R language engine from the R core team. It also makes available a host of new R packages contributed by the community, including packages for downloading financial data, connecting with analytics systems, applying machine learning algorithms and statistical models, and many more. New R packages are released every day, and you can access packages released after the 1 August 2018 CRAN snapshot used by MRO 3.5.1 using the checkpoint package. We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below. Continue Reading… ### Microsoft R Open 3.5.1 now available (This article was first published on Revolutions, and kindly contributed to R-bloggers) Microsoft R Open 3.5.1 has been released, combining the latest R language engine with multi-processor performance and tools for managing R packages reproducibly. 
You can download Microsoft R Open 3.5.1 for Windows, Mac and Linux from MRAN now. Microsoft R Open is 100% compatible with all R scripts and packages, and works with all your favorite R interfaces and development environments. This update brings a number of minor fixes to the R language engine from the R core team. It also makes available a host of new R packages contributed by the community, including packages for downloading financial data, connecting with analytics systems, applying machine learning algorithms and statistical models, and many more. New R packages are released every day, and you can access packages released after the 1 August 2018 CRAN snapshot used by MRO 3.5.1 using the checkpoint package. We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below. To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... Continue Reading… ### Whats new on arXiv We solved a stylized fact on a long memory process of volatility cluster phenomena by using Minkowski metric for GARCH(1,1) under assumption that price and time can not be separated. We provide a Yang-Mills equation in financial market and anomaly on superspace of time series data as a consequence of the proof from the general relativity theory. 
We used an original idea in Minkowski spacetime embedded in Kolmogorov space in time series data with behavior of traders.The result of this work is equivalent to the dark volatility or the hidden risk fear field induced by the interaction of the behavior of the trader in the financial market panic when the market crashed. Domain adversarial learning aligns the feature distributions across the source and target domains in a two-player minimax game. Existing domain adversarial networks generally assume identical label space across different domains. In the presence of big data, there is strong motivation of transferring deep models from existing big domains to unknown small domains. This paper introduces partial domain adaptation as a new domain adaptation scenario, which relaxes the fully shared label space assumption to that the source label space subsumes the target label space. Previous methods typically match the whole source domain to the target domain, which are vulnerable to negative transfer for the partial domain adaptation problem due to the large mismatch between label spaces. We present Partial Adversarial Domain Adaptation (PADA), which simultaneously alleviates negative transfer by down-weighing the data of outlier source classes for training both source classifier and domain adversary, and promotes positive transfer by matching the feature distributions in the shared label space. Experiments show that PADA exceeds state-of-the-art results for partial domain adaptation tasks on several datasets. In this paper we introduce a new machine learning (ML) model for nonlinear regression called Boosting Smooth Transition Regression Tree (BooST). The main advantage of the BooST is that it estimates the derivatives (partial effects) of very general nonlinear models, providing more interpretation than other tree based models concerning the mapping between the covariates and the dependent variable. 
We provide some asymptotic theory that shows consistency of the partial derivatives and we present some examples on simulated and empirical data. We present LemmaTag, a featureless recurrent neural network architecture that jointly generates part-of-speech tags and lemmatizes sentences of languages with complex morphology, using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network and from using the tagger output as an input to the lemmatizer. We evaluate our model across several morphologically-rich languages, surpassing state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic. We present a new ‘grey-box’ approach to anomaly detection in smart manufacturing. The approach is designed for tools run by control systems which execute recipe steps to produce semiconductor wafers. Multiple streaming sensors capture trace data to guide the control systems and for quality control. These control systems are typically PI controllers which can be modeled as an ordinary differential equation (ODE) coupled with a control equation, capturing the physics of the process. The ODE ‘white-box’ models capture physical causal relationships that can be used in simulations to determine how the process will react to changes in control parameters, but they have limited utility for anomaly detection. Many ‘black-box’ approaches exist for anomaly detection in manufacturing, but they typically do not exploit the underlying process control. The proposed ‘grey-box’ approach uses the process-control ODE model to derive a parametric function of sensor data. Bayesian regression is used to fit the parameters of these functions to form characteristic ‘shape signatures’. The probabilistic model provides a natural anomaly score for each wafer, which captures poor control and strange shape signatures. 
The anomaly score can be deconstructed into its constituent parts in order to identify which parameters are contributing to anomalies. We demonstrate how the anomaly score can be used to monitor complex multi-step manufacturing processes to detect anomalies and changes, and show how the shape signatures can provide insight into underlying sources of process variation that are not readily apparent in the sensor data.

We propose a novel unsupervised keyphrase extraction approach based on outlier detection. Our approach starts by training word embeddings on the target document to capture semantic regularities among the words. It then uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the assumption that these vectors come from the same distribution, indicative of their irrelevance to the semantics expressed by the dimensions of the learned vector representation. Candidate keyphrases are based on words that are outliers of this dominant distribution. Empirical results show that our approach outperforms state-of-the-art unsupervised keyphrase extraction methods.

Attention mechanisms in sequence-to-sequence models have shown strong performance in various natural language processing (NLP) tasks, such as sentence embedding, text generation, machine translation, and machine reading comprehension. Unfortunately, existing attention mechanisms only learn either high-level or low-level features. In this paper, we argue that the lack of hierarchical mechanisms is a bottleneck in improving the performance of attention mechanisms, and propose a novel Hierarchical Attention Mechanism (Ham) based on the weighted sum of different layers of a multi-level attention. Ham achieves a state-of-the-art BLEU score of 0.26 on a Chinese poem generation task and a nearly 6.5% average improvement over existing machine reading comprehension models such as BiDAF and Match-LSTM.
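
From the abstract, Ham's weighted sum over attention layers can be sketched as follows. This is a minimal reading on our part; `layer_logits` stands in for the learned combination weights, which is an assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_attention(layer_outputs, layer_logits):
    # Ham-style combination (as we read the abstract): a learned weighted
    # sum over the outputs of all attention layers, rather than using only
    # the top layer. `layer_logits` would be trained; here they are free
    # parameters.
    weights = softmax(np.asarray(layer_logits))
    stacked = np.stack(layer_outputs)          # (num_layers, seq, dim)
    return np.tensordot(weights, stacked, 1)   # (seq, dim)
```

With equal logits this degenerates to a plain average of the layers; training the logits lets the model emphasize whichever level of the attention stack is most useful.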
Furthermore, our experiments and theorems reveal that Ham has greater generalization and representation ability than existing attention mechanisms.

Artificial Intelligence (AI) has achieved super-human performance in a broad variety of domains. We say that an AI is made Artificially Stupid on a task when some limitations are deliberately introduced to match a human’s ability to do the task. An Artificial General Intelligence (AGI) can be made safer by limiting its computing power and memory, or by introducing Artificial Stupidity on certain tasks. We survey human intellectual limits and give recommendations for which limits to implement in order to build a safe AGI.

In the last decade, a variety of topic models have been proposed for text engineering. However, apart from Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), most existing topic models are seldom applied or considered in industrial scenarios, largely because there are so far very few convenient tools to support them. Intimidated by the expertise and labor demanded by designing and implementing parameter inference algorithms, software engineers are prone to simply resort to PLSA/LDA without considering whether it is appropriate for the problem at hand. In this paper, we propose a configurable topic modeling framework named Familia to bridge the huge gap between academic research and current industrial practice. Familia supports an important line of topic models that are widely applicable in text engineering scenarios. To relieve the burden on software engineers without knowledge of Bayesian networks, Familia is able to conduct automatic parameter inference for a variety of topic models.
Simply by changing the data organization in Familia, software engineers can easily explore a broad spectrum of existing topic models, or even design their own, and find the one that best suits the problem at hand. Beyond its extendability, Familia has a novel sampling mechanism that strikes a balance between the effectiveness and efficiency of parameter inference. Furthermore, Familia is essentially a big topic modeling framework that supports parallel parameter inference and distributed parameter storage. The utility and necessity of Familia are demonstrated in real-life industrial applications. Familia would significantly enlarge software engineers’ arsenal of topic models and pave the way for utilizing highly customized topic models in real-life problems.

A capsule is a collection of neurons which represents different variants of a pattern in the network. The routing scheme ensures that only those capsules in a higher layer which resemble their lower-layer counterparts are activated. However, the computational complexity becomes a bottleneck for scaling up to larger networks, as lower capsules need to correspond to each and every higher capsule. To resolve this limitation, we approximate the routing process with two branches: a master branch which collects primary information from its direct contact in the lower layer, and an aide branch that replenishes the master based on pattern variants encoded in other lower capsules. Compared with previous iterative and unsupervised routing schemes, these two branches communicate in a fast, supervised, single-pass fashion. The complexity and runtime of the model are therefore decreased by a large margin. Motivated by routing’s goal of making higher capsules agree with lower capsules, we extend the mechanism to compensate for the rapid loss of information in nearby layers. We devise a feedback agreement unit that sends higher capsules back as feedback.
It can be regarded as an additional regularization of the network. The feedback agreement is achieved by comparing the optimal transport divergence between two distributions (lower and higher capsules). This add-on yields a consistent gain in both capsule and vanilla networks. Our proposed EncapNet performs favorably against previous state-of-the-art methods on CIFAR10/100, SVHN, and a subset of ImageNet.

Knowledge Graph Embedding (KGE) aims to represent the entities and relations of a knowledge graph in a low-dimensional continuous vector space. Recent works focus on incorporating structural knowledge with additional information, such as entity descriptions, relation paths, and so on. However, commonly used additional information usually contains plenty of noise, which makes it hard to learn valuable representations. In this paper, we propose a new kind of additional information, called entity neighbors, which contains both semantic and topological features about a given entity. We then develop a deep memory network model to encode information from neighbors. Employing a gating mechanism, representations of structure and neighbors are integrated into a joint representation. The experimental results show that our model outperforms existing KGE methods utilizing entity descriptions and achieves state-of-the-art metrics on 4 datasets.

In this demo paper, we introduce the DARPA D3M program for automatic machine learning (ML) and JPL’s MARVIN tool, which provides an environment to locate, annotate, and execute machine learning primitives for use in ML pipelines. MARVIN is a web-based application and associated back-end interface written in Python that enables composition of ML pipelines from hundreds of primitives from the world of Scikit-Learn, Keras, DL4J, and other widely used libraries. MARVIN allows for the creation of Docker containers that run on Kubernetes clusters within DARPA to provide an execution environment for automated machine learning.
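
MARVIN's own interface is not public API we can reproduce here, but the kind of Scikit-Learn pipeline it composes might look like this minimal sketch (dataset and primitives chosen purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A two-primitive pipeline: a preprocessing step followed by an estimator.
# An AutoML system like MARVIN composes longer chains of such primitives
# automatically and evaluates them against a challenge problem.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=200))])
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
```

Because every primitive exposes the same fit/transform/predict interface, swapping one primitive for another is a purely mechanical operation, which is what makes automated pipeline search tractable.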
MARVIN currently contains over 400 datasets and challenge problems from a wide array of ML domains, ranging from routine classification and regression to advanced video/image classification and remote sensing.

Context information around words helps in determining their actual meaning, for example ‘networks’ used in the context of artificial neural networks or biological neuron networks. Generative topic models infer topic-word distributions, taking little or no context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. This results in improved performance in terms of generalization, interpretability, and applicability. We apply our modeling approach to seven data sets from various domains and demonstrate that our approach consistently outperforms state-of-the-art generative topic models. With the learned representations, we show an average gain of 9.6% (0.57 vs. 0.52) in precision at retrieval fraction 0.02 and 7.2% (0.582 vs. 0.543) in F1 for text categorization.

In this technical report, we present jLDADMM—an easy-to-use Java toolkit for conventional topic models. jLDADMM is released to provide alternatives for topic modeling on normal or short texts. It provides implementations of the Latent Dirichlet Allocation topic model and the one-topic-per-document Dirichlet Multinomial Mixture model (i.e. mixture of unigrams), using collapsed Gibbs sampling. In addition, jLDADMM supplies a document clustering evaluation to compare topic models. jLDADMM is open-source and available to download at: https://…/jLDADMM

Matrix factorization (MF) discovers latent features from observations, and has shown great promise in the fields of collaborative filtering, data compression, feature extraction, word embedding, etc.
While many problem-specific optimization techniques have been proposed, alternating least squares (ALS) remains popular due to its general applicability (e.g., it easily handles positive-unlabeled inputs), fast convergence, and parallelization capability. Current MF implementations are either optimized for a single machine or require a large computer cluster, yet both remain insufficient. This is because a single machine provides limited compute power for large-scale data, while multiple machines suffer from the network communication bottleneck. To address this challenge, accelerating ALS on graphics processing units (GPUs) is a promising direction. We propose a novel approach to enhancing MF efficiency via both memory optimization and approximate computing. The former exploits the GPU memory hierarchy to increase data reuse, while the latter reduces unnecessary computation without hurting the convergence of the learning algorithms. Extensive experiments on large-scale datasets show that our solution not only outperforms the competing CPU solutions by a large margin but also has a 2x-4x performance gain compared to the state-of-the-art GPU solutions. Our implementations are open-sourced and publicly available.

We consider the problem of ranking a set of items from pairwise comparisons in the presence of features associated with the items. Recent works have established that $O(n\log(n))$ samples are needed to rank well when there is no feature information present. However, this might be sub-optimal in the presence of associated features. We introduce a new probabilistic preference model called the feature-Bradley-Terry-Luce (f-BTL) model, which generalizes the standard BTL model to incorporate feature information.
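
One plausible parametrization of f-BTL (our reading of the abstract; the paper's exact form may differ) keeps the BTL comparison rule but derives item scores from features:

$$\Pr(i \succ j) \;=\; \frac{e^{s_i}}{e^{s_i} + e^{s_j}}, \qquad s_i = w^{\top} x_i,$$

where $x_i$ is the feature vector of item $i$ and $w$ is a shared weight vector. With one-hot features this reduces to the standard BTL model with free scores $s_i$; with shared features, every comparison involving one item also carries information about feature-correlated items, which is why far fewer than $O(n\log(n))$ pairs can suffice.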
We present a new least squares based algorithm called fBTL-LS, which we show requires far fewer than $O(n\log(n))$ pairs to obtain a good ranking — precisely, our new sample complexity bound is $O(\alpha\log \alpha)$, where $\alpha$ denotes the number of ‘independent’ items in the set; in general $\alpha \ll n$. Our analysis is novel and makes use of tools from classical graph matching theory to provide tighter bounds that shed light on the true complexity of the ranking problem, capturing the item dependencies in terms of their feature representations. This was not possible with the earlier matrix completion based tools used for this problem. We also prove an information theoretic lower bound on the sample complexity required to recover the underlying ranking, which essentially shows the tightness of our proposed algorithms. The efficacy of our proposed algorithms is validated through extensive experimental evaluations on a variety of synthetic and real world datasets.

Current state-of-the-art machine translation systems are based on encoder-decoder architectures that first encode the input sequence and then generate an output sequence based on the input encoding. Both are interfaced with an attention mechanism that recombines a fixed encoding of the source tokens based on the decoder state. We propose an alternative approach which instead relies on a single 2D convolutional neural network across both sequences. Each layer of our network re-codes source tokens on the basis of the output sequence produced so far. Attention-like properties are therefore pervasive throughout the network. Our model yields excellent results, outperforming state-of-the-art encoder-decoder systems, while being conceptually simpler and having fewer parameters.

In the traditional framework of spectral learning of stochastic time series models, model parameters are estimated from trajectories of fully recorded observations.
However, real-world time series data often contain missing values, and worse, the distributions of missingness events over time are often not independent of the visible process. Recently, a spectral OOM learning algorithm for time series with missing data was introduced and proved to be consistent, albeit under quite strong conditions. Here we refine the algorithm and prove that the original strong conditions can be substantially relaxed. We validate our theoretical findings by numerical experiments, showing that the algorithm can consistently handle missingness patterns whose dynamics interact with the visible process.

This paper is part of a project on developing an algorithmic theory of brain networks, based on stochastic Spiking Neural Network (SNN) models. Inspired by tasks that seem to be solved in actual brains, we are defining abstract problems to be solved by these networks. In our work so far, we have developed models and algorithms for the Winner-Take-All problem from computational neuroscience [LMP17a, Mus18], and for problems of similarity detection and neural coding [LMP17b]. We plan to consider many other problems and networks, including both static networks and networks that learn. This paper is about basic theory for the stochastic SNN model. In particular, we define a simple version of the model, which assumes that a neuron’s only state is a Boolean indicating whether the neuron is firing or not. In later work, we plan to develop variants of the model with more elaborate state. We also define an external behavior notion for SNNs, which can be used for stating requirements to be satisfied by the networks. We then define a composition operator for SNNs. We prove that our external behavior notion is ‘compositional’, in the sense that the external behavior of a composed network depends only on the external behaviors of the component networks. We also define a hiding operator that reclassifies some output behavior of an SNN as internal.
We give basic results for hiding. Finally, we give a formal definition of a problem to be solved by an SNN, and give basic results showing how composition and hiding of networks affect the problems that they solve. We illustrate our definitions with three examples: building a circuit out of gates, building an ‘Attention’ network out of a ‘Winner-Take-All’ network and a ‘Filter’ network, and a toy example involving combining two networks in a cyclic fashion.

Deep learning models have achieved remarkable success in natural language inference (NLI) tasks. While these models are widely explored, they are hard to interpret and it is often unclear how and why they actually work. In this paper, we take a step toward explaining such deep learning based models through a case study on a popular neural model for NLI. In particular, we propose to interpret the intermediate layers of NLI models by visualizing the saliency of attention and LSTM gating signals. We present several examples for which our methods are able to reveal interesting insights and identify the critical information contributing to the model decisions.

Item recommendation is a personalized ranking task. To this end, many recommender systems optimize models with pairwise ranking objectives, such as Bayesian Personalized Ranking (BPR). Using Matrix Factorization (MF) — the most widely used model in recommendation — as a demonstration, we show that optimizing it with BPR leads to a recommender model that is not robust. In particular, we find that the resulting model is highly vulnerable to adversarial perturbations of its model parameters, which implies possibly large generalization error. To enhance the robustness of a recommender model and thus improve its generalization performance, we propose a new optimization framework, namely Adversarial Personalized Ranking (APR). In short, APR enhances the pairwise ranking method BPR by performing adversarial training.
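
A schematic of that adversarial step (not the authors' implementation): a single FGSM-style perturbation of the item embeddings on one (user, positive item, negative item) triple, with names and the L2-ball budget chosen here for illustration:

```python
import numpy as np

def bpr_loss(u, i, j):
    # BPR: -log sigmoid(score(u,i) - score(u,j)), preferring the observed
    # item i over the unobserved item j; scores are dot products.
    x = u @ i - u @ j
    return -np.log(1.0 / (1.0 + np.exp(-x)))

def adversarial_perturbation(u, i, j, eps=0.5):
    # Worst-case perturbation of the item embeddings within an L2 ball of
    # radius eps, taken along the loss gradient (one FGSM-style step).
    x = u @ i - u @ j
    sig = 1.0 / (1.0 + np.exp(-x))
    grad_i, grad_j = -(1 - sig) * u, (1 - sig) * u   # d(loss)/d(i), d(loss)/d(j)
    d_i = eps * grad_i / (np.linalg.norm(grad_i) + 1e-12)
    d_j = eps * grad_j / (np.linalg.norm(grad_j) + 1e-12)
    return d_i, d_j
```

Adding the returned perturbations increases the BPR loss on the triple; APR's outer loop then minimizes the loss under such worst-case perturbations instead of the clean loss.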
It can be interpreted as playing a minimax game, where minimizing the BPR objective function simultaneously defends against an adversary that adds perturbations to the model parameters in order to maximize the BPR objective function. To illustrate how it works, we implement APR on MF by adding adversarial perturbations to the embedding vectors of users and items. Extensive experiments on three public real-world datasets demonstrate the effectiveness of APR — by optimizing MF with APR, it outperforms BPR with a relative improvement of 11.2% on average and achieves state-of-the-art performance for item recommendation. Our implementation is available at: https://…/adversarial_personalized_ranking.

In this work, we contribute a new multi-layer neural network architecture named ONCF to perform collaborative filtering. The idea is to use an outer product to explicitly model the pairwise correlations between the dimensions of the embedding space. In contrast to existing neural recommender models that combine user embedding and item embedding via a simple concatenation or element-wise product, our proposal of using an outer product above the embedding layer results in a two-dimensional interaction map that is more expressive and semantically plausible. Above the interaction map obtained by the outer product, we propose to employ a convolutional neural network to learn high-order correlations among embedding dimensions. Extensive experiments on two public implicit feedback datasets demonstrate the effectiveness of our proposed ONCF framework, and in particular the positive effect of using the outer product to model correlations between embedding dimensions in the lower layers of a multi-layer neural recommender model. The experiment codes are available at: https://…/ConvNCF

Neuronal circuits formed in the brain are complex, with intricate connection patterns. Such complexity is also observed in the retina, a relatively simple neuronal circuit.
A retinal ganglion cell receives excitatory inputs from neurons in previous layers as driving forces to fire spikes. Analytical methods are required that can decipher these components in a systematic manner. Recently, a method termed spike-triggered non-negative matrix factorization (STNMF) was proposed for this purpose. In this study, we extend the scope of the STNMF method. Using the retinal ganglion cell as a model system, we show that STNMF can detect various biophysical properties of upstream bipolar cells, including spatial receptive fields, temporal filters, and transfer nonlinearity. In addition, we recover synaptic connection strengths from the weight matrix of STNMF. Furthermore, we show that STNMF can separate the spikes of a ganglion cell into a few subsets, where each subset is contributed by one presynaptic bipolar cell. Taken together, these results corroborate that STNMF is a useful method for deciphering the structure of neuronal circuits.

Convolutional neural networks (CNNs) have achieved great success on grid-like data such as images, but face tremendous challenges in learning from more generic data such as graphs. In CNNs, the trainable local filters enable the automatic extraction of high-level features. The computation with filters requires a fixed number of ordered units in the receptive fields. However, in generic graphs the number of neighboring units is neither fixed nor ordered, thereby hindering the application of convolutional operations. Here, we address these challenges by proposing the learnable graph convolutional layer (LGCL). LGCL automatically selects a fixed number of neighboring nodes for each feature based on value ranking, in order to transform graph data into grid-like structures in 1-D format, thereby enabling the use of regular convolutional operations on generic graphs.
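
The value-ranking step just described can be sketched as follows (a simplified NumPy illustration of the k-largest selection; in the real LGCL this sits inside a trained network):

```python
import numpy as np

def lgcl_layer_input(features, neighbors, k):
    # For each node, rank each feature independently over its neighbors and
    # keep the k largest values (zero-padded), turning irregular
    # neighborhoods into a fixed (k+1) x d grid: the node's own features
    # first, then the k ranked rows. A regular 1-D convolution can then
    # slide over this grid.
    n, d = features.shape
    out = np.zeros((n, k + 1, d))
    for v in range(n):
        out[v, 0] = features[v]
        nbr = features[list(neighbors[v])]   # (num_neighbors, d)
        ranked = -np.sort(-nbr, axis=0)      # per-feature descending sort
        rows = min(k, ranked.shape[0])
        out[v, 1:1 + rows] = ranked[:rows]
    return out
```

Note that the ranked rows mix values from different neighbors (each feature is sorted independently), which is exactly the trick that removes the need for a canonical neighbor ordering.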
To enable model training on large-scale graphs, we propose a sub-graph training method that reduces the excessive memory and computational resource requirements suffered by prior methods on graph convolutions. Our experimental results on node classification tasks in both transductive and inductive learning settings demonstrate that our methods achieve consistently better performance on the Cora, Citeseer, and Pubmed citation networks and on a protein-protein interaction network dataset. Our results also indicate that our methods, using the sub-graph training strategy, are more efficient than prior approaches.

Continue Reading…

### The side effect of automation is fake data and theft

There is hope in our business journalism! That's the message I'm getting as I delight in the current stream of articles that take a critical look at the inherent unaccountability of technology giants. These stories have been around for years, even decades, but better late than never.

• Fake views/plays on YouTube videos (New York Times)
• This is a companion piece to a previous story about fake social-media followers (New York Times Magazine)
• Uber drivers gaming the pricing algorithm (Wall Street Journal)
• Google's Android tracks you everywhere whether you allow it or not (Bloomberg)
• Finally some coverage of the shame of IBM's marketing (Wall Street Journal)

As blog readers know, most companies in the Web/mobile ecosystem use an advertising-supported model. Even the most popular products like YouTube, Google Maps, and Facebook have not found it possible to charge users anything (when in theory, users derive value from such products). Such digital advertising, since the beginning, has been heralded as inherently more accountable because it was supposedly more measurable. Just look at the mountains of data that are generated by the very act of putting up ads on a website. You can even trace a user's final action, say a purchase, back to various interactions prior to the action.
(This data is called the "click stream".) The marketing of digital marketing centers on its "efficiency," achieved through automating the process, taking human beings out of the equation, and trusting the black-box machines.

This system is flawed even in its most theoretical form, assuming "complete data". The clickstream and other digital data have multiple problems. Assigning causality is highly suspect. I wrote about this years ago - for example, here.

The opacity of the black-box machines offers not just efficiency but also shelter for fraud. Any kind of verification, quality checking, or common-sense vetting slows the system down. The more efficient the system, the easier and faster it is to commit fraud!

What kind of fraud are we talking about? Fake everything! Fake clicks, fake impressions, fake ads, fake viewers, fake plays, fake followers, fake accounts, fake likes, fake webpages, you name it. Since advertisers are paying money for clicks, impressions, ads, viewers, plays, followers, accounts, etc., and most advertisers have neither the incentive nor the nerve to look under the hood, they are paying real dollars for all the fakes. The much-lauded efficiency of the black-box advertising model sits on a pile of fake data. The system is seriously flawed even assuming complete data; it is worse when we admit that not only do we not possess complete data, but we are drowning in fake data.

Now, data scientists are hard at work building even more black-box machines that promise to suss out the fake stuff. To understand how this might work, think about spam emails, an analogous problem. In the machine-learning world, spam detection is thought of as a "solved" problem. Spam filtering algorithms have existed for decades. Think about how many fake emails continue to sit in your inbox, and how many legitimate emails land in your spam folder - and you can begin to see why technical solutions are not potent enough to drive out fraudsters.
***

Below are some comments on selected parts of the latest New York Times article on fake YouTube views (here).

• YouTube stars can make millions from advertisers by publishing viral videos. Google, which owns YouTube, makes tons of money by having people upload their content to the platform and charging rent for it. Advertisers ultimately pay for those views. YouTube stars, and by extension Google, can make lots of money from fake views, i.e. bots (software) pretending to play the video. Third parties approach wannabes to sell them fake views and take a cut of the action.

• It's a cat-and-mouse game that the fraudsters are winning. They have the upper hand because the products are often designed without assuming nefarious players gaming the system. Adding software to deal preemptively with nefarious players slows down tech startups trying to build an initial market. Preventing bad players from acting badly imposes (the externality of) restrictions on good players, which is another disincentive to tackle the problem.

• YouTube officially claims: "fake views represent just a tiny fraction of the total" (later defined as 1 percent). This is a guess masquerading as fact. To know what fraction of the views on YouTube is "fake" requires knowing whether each view is fake or not. In theory, one can review a sample of the views and then infer the fraction in the entire catalog from that sample. In practice, it is not easy to prove that a view is fake just from the data collected about the viewer's browser, OS, etc. A machine-learning algorithm will "predict" whether a view is fake, but such predictions should not be used as "ground truth".

• Legitimate organizations buy views, including marketing agencies, news organizations, and musicians. As a former employee said, "View count manipulation will be a problem as long as views and the popularity they signal are the currency of YouTube."
It's not just YouTube - we know that when teachers are compensated based on student test scores, they become complicit in abetting their students to cheat.

• This underground economy is quite well developed now, with wholesalers and resellers of fake views. If you are a reseller, you don't even need to know how to generate fake views. You just buy fakes from the wholesalers and charge a markup!

• The sellers of fake views are so sure they can evade YouTube's detection technologies that they openly talk about their clients, business models, etc. to the press. Either they know those technologies don't work, or they know that YouTube isn't serious about cracking down.

• Several buyers interviewed by the Times said they truly believed the marketing spin of the fake-view sellers, that they were driving real users to watch their videos. This is a tad disingenuous. As one writer who purchased views pointed out, those purchased views led to no sales, and she couldn't get any data about who viewed the videos. It's actually very easy to demonstrate fraud, as the fake views are barely hidden.

The article came with a great graphic, which is not addressed directly in the text. So let me explain. The reporter purchased fake views from the vendor called "BuyYoutubViews". The top chart shows that close to 100% of those views ("in campaign") played the entire video (over 5 minutes long). The bottom chart shows that when they turned off the "campaign", i.e. stopped buying fake views, almost no one plays the video for more than a few seconds. (What is not shown is that the number of views probably plunged after the fake-views campaign was shut down.) Raise your hand if you think "BuyYoutubViews" is finding real people to watch the video!

Continue Reading…

### Data Scientist guide for getting started with Docker

Docker is an increasingly popular way to create and deploy applications through virtualization, but can it be useful for data scientists?
This guide should help you quickly get started.

Continue Reading…

### Solve epileptic seizure prediction! Participate at epilepsyecosystem.org

Around twenty million people worldwide suffer from drug-resistant epilepsy, and the unpredictability of seizures is one of the major factors affecting the quality of life of people with epilepsy.

Continue Reading…

### Vienna overtakes Melbourne as the world’s most liveable city

“WHAT is the city but the people?” The latest liveability index from the Economist Intelligence Unit suggests that Shakespeare might be overplaying the importance of people in his definition.

Continue Reading…

### Unveiling Mathematics Behind XGBoost

Follow me till the end, and I assure you, you will at least get a sense of what is happening underneath this revolutionary machine learning model.

Continue Reading…

### The missing piece of the Smart Contract Puzzle

While there has been a lot of hype around the efficiency of smart contracts, can they replace the traditional way of working? What are the challenges, and what role does blockchain have to play in this shift? Smart contracts have been touted as an evolutionary leap in legal…

The post The missing piece of the Smart Contract Puzzle appeared first on Dataconomy.

Continue Reading…

### It was the weeds that bothered him.

Bill Jefferys points to this news article by Denise Grady. Bill noticed the following bit, “In male rats, the studies linked tumors in the heart to high exposure to radiation from the phones. But that problem did not occur in female rats, or any mice,” and asked:

​Forking paths, much?

My reply: The summary of the news article seems reasonable: “But two government studies released on Friday, one in rats and one in mice, suggest that if there is any risk, it is small, health officials said.” But, yes, later on they get into the weeds: “Scientists do not know why only male rats develop the heart tumors, but Dr.
Bucher said one possibility is simply that the males are bigger and absorb more of the radiation.” They didn’t mention the possibility that variations just happen at random, and that a comparison happening to be statistically significant in the data is not necessarily strong evidence of a corresponding pattern in the population.

Bill responded: Yes, it was the weeds that bothered me.

Overall, then, yes, a good news article.

The post It was the weeds that bothered him. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…

### 3 promising areas for AI skills development

O'Reilly survey results and usage data reveal growing trends and topics in artificial intelligence.

One of the key findings of a survey we released earlier this year (How Companies are Putting AI to Work through Deep Learning) was that the leading reason holding companies back from incorporating deep learning was their lack of access to skilled people. One-fifth of respondents pointed to a skills gap as one of the reasons they haven’t integrated deep learning, and at the time of the survey, 75% of respondents indicated their company had some combination of internal and external training programs to address this issue.

We’ve continued to monitor interest in topics relevant to building AI products and systems, specifically areas that also warrant investment in skills development. In this post, I’ll share results of related studies we’ve conducted. I’ll draw from two data sources:

• We examine usage[1] across all content formats on the O’Reilly online learning platform, as well as demand via volume of search terms.
• We recently conducted a survey (full report forthcoming) on machine learning adoption, which included more than 6,000 respondents from North America.

I’ll use key portions of our upcoming AI Conference in San Francisco to describe how companies can address the topics and findings surfaced in these two recent studies.
## Growing interest in key topics Through the end of June 2018, we found double-digit growth in key topics associated with AI. Our online learning platform usage metrics encompass many content formats including books, videos, online training, interactive content, and other material: Growth was strong across many topics associated with AI and machine learning. The chart below provides a sense of how much content usage (“relative popularity”) we’re seeing in some of these key topics: our users remain very interested in machine learning, particularly in deep learning. It’s one thing to learn about an individual technology or a specific class of modeling techniques, but ultimately, organizations need to be able to design robust AI applications and products. This involves hardware, software infrastructure to manage data pipelines, and elegant user interfaces. For the upcoming AI Conference in San Francisco, we assembled training sessions, tutorials, and case studies on many of these important topics: We’ve also found that interest in machine learning compares favorably with other areas of technology. We track interest in topics by monitoring search volume on our online learning platform. Alongside Kubernetes and blockchain, machine learning has been one of the fast-growing, high-volume search topics year over year: ## Emerging topics As I noted in the first chart above, we are seeing growing interest in reinforcement learning and PyTorch. It’s important to point out that TensorFlow is still by far the most popular deep learning framework, but as with other surveys we are seeing that PyTorch is beginning to build a devoted following. 
Looking closely at interest in topics within data science and AI, we found that interest in reinforcement learning, PyTorch, and Keras has risen substantially this year: The chart below provides a ranked list of industries that are beginning to explore using reinforcement learning and PyTorch: We’ve had reinforcement learning tutorial sessions and presentations from the inception of our AI Conference. As tools and libraries get simpler and more tightly integrated with other popular components, I’m expecting to see more mainstream applications of reinforcement learning. We have assembled tutorial sessions and talks at the AI Conference on reinforcement learning and on popular tools for building deep learning applications (including PyTorch and TensorFlow): ## Toward a holistic view of AI applications There is growing awareness among major stakeholders about the importance of data privacy, ethics, and security. Users are beginning to seek more transparency and control over their data, regulators are beginning to introduce data privacy rules, and there is growing interest in ethics and privacy among data professionals. There is an emerging set of tools and best practices for incorporating fairness, transparency, privacy, and security into AI systems. For our upcoming AI Conference in San Francisco, we have a wide selection of tutorials and sessions aimed both at technologists wanting to understand how to incorporate ethics and privacy into applications and at managers needing to understand what these new tools and technologies are able to provide: Continue reading 3 promising areas for AI skills development. Continue Reading… ### Four short links: 14 August 2018 Hyrum's Law, Academic Torrents, Logic Textbook, and Suboptimal Fairness 1. Hyrum's Law -- With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. (via Simon Willison) 2.
Academic Torrents -- a community-maintained distributed repository for data sets and scientific knowledge. 27GB and growing. 3. Open Logic Project -- a collection of teaching materials on mathematical logic aimed at a non-mathematical audience, intended for use in advanced logic courses as taught in many philosophy departments. It is open source: you can download the LaTeX code. It is open: you’re free to change it whichever way you like, and share your changes. It is collaborative: a team of people is working on it, using the GitHub platform, and we welcome contributions and feedback. And it is written with configurability in mind. 4. Delayed Impact of Fair Machine Learning (Paper a Day) -- it’s therefore possible to have a fairness intervention with the unintended consequence of leaving the disadvantaged group worse off than they were before. Continue reading Four short links: 14 August 2018. Continue Reading… ### A quick reminder on HTTPS everywhere HTTPS "everywhere" means everywhere—not just the login page, or the page where you accept donations. Everything. HTTPS Everywhere! So the plugin says, and now browsers are warning users that sites not implementing https:// are security risks. Using HTTPS everywhere is good advice. And this really means "everywhere": the home page, everything. Not just the login page, or the page where you accept donations. Everything. Implementing HTTPS everywhere has some downsides, as Eric Meyer points out. It breaks caching, which makes the web much slower for people limited to satellite connections (and that's much of the third world); it's a problem for people who, for various reasons, have to use older browsers (there are more ancient browsers and operating systems in the world than you would like to think, trust me); domain names and IP addresses are handled by lower-level protocols that HTTPS doesn't get to touch, so it's not as private as one would like; and more. It's not a great solution, but it's a necessary one.
(Meyer's article, and the comments following it, are excellent.) The real problem isn't HTTPS's downsides; it's that I see and hear more and more complaints from people who run simple non-commercial sites asking why this affects them. Do you need cryptographic security if your site is a simple read-only, text-only site with nothing controversial? Unfortunately, you do. Here's why. Since the ISPs' theft of the web (it's not limited to the loss of Network Neutrality, and not just an issue in the U.S.), the ISPs themselves can legally execute man-in-the-middle attacks to: The first two are happening already; the third may be. (It's possible that GDPR, which protects European citizens regardless of where they're located, might prevent ISPs from collecting and selling browsing history. I wouldn't count on it, though.) Yesterday, I poked around a bit and found many sites that don't use HTTPS everywhere. Those sites include an ivy league university (Cornell, get your act together!), many non-profit organizations (including several that I belong to), several well-known newspapers and magazines, local libraries, and a lot of small businesses. The irony is most of these sites accept donations, let you read restricted-access materials, and even sell stuff online, and these pages are already using HTTPS (though not always correctly). Protecting the entire site doesn't require that big a change. In many cases, using HTTPS for the entire site is simpler than protecting a limited collection of pages. I agree that HTTPS is a significant administrative burden for simple, static sites, or sites that are run by groups that don’t have the technical ability to implement HTTPS. Services like Let's Encrypt reduce some of the burden (Let's Encrypt provides free certificates, reducing the process of setting up HTTPS to a few well-placed clicks)—but you still have to do it. Nothing stays simple and elegant, particularly when it's under attack. 
And the web is under attack: from the pirate ISPs, from hostile governments (a somewhat different issue, but related), and from many other actors. HTTPS is a solution—a problematic one, I’ll grant, and one that imposes a burden on the sites least capable of dealing with technical overhead, but we don't have a better solution. Well, yes, we do have a better solution: IPSec and IPv6 solve the problem nicely. But we've been waiting for widespread adoption of those for more than 20 years, and we're still waiting. These are problems we need to solve now. "There's nothing on my site that requires cryptography" doesn't help anyone. That's no more helpful than people who say "I have nothing to hide, so I don't need privacy." You don't need either privacy or HTTPS until you do, and then it's way, way too late. Do it for your users' sake. Continue reading A quick reminder on HTTPS everywhere. Continue Reading… ### Robot arm seeks out Waldo, using machine learning The camera on the slightly creepy arm takes a picture of the pages in the book, the software uses OpenCV to extract faces, and the faces are passed to Google Auto ML Vision comparing the faces to a Waldo model. The result: There’s Waldo. Continue Reading… ### I started a corporation! After I released Bite Size Linux in May, something surprising happened – it sold WAY more copies than I expected! (as of today, 2600 copies!!!). If you bought one, thank you! It was super encouraging and makes me want to produce a lot more fun zine ideas :). And 1300 of you have already bought Bite Size Command Line, which is similarly astonishing. The only downside of selling things is – making money is complicated! I need to track the income so I can pay income taxes, I need to pay sales tax on Canadian sales! Tracking expenses is really annoying! So I decided to start a corporation (just on the side, I still work at the same place ❤) to manage the logistics. I’m pretty excited about this so I wanted to explain why.
### What’s involved in starting a company? I thought briefly about using Stripe Atlas (which is a cool way to easily start a US corporation) but I live in Canada so I decided to start a Canadian corporation. I hired a lawyer and an accountant to do all the paperwork. Now I have: • A numbered Canadian corporation! • GST and QST numbers for me to use to pay sales tax! • A corporate bank account and credit card! Even though I hired a lawyer / accountant to help me out, it ended up taking 8 weeks from start to finish, mostly because I was travelling/busy. ### It’s easier to pay people! One of the most exciting things about having a company is that it’s easier to hire people to help me with my zines! And I can use the company’s money to expense ALL KINDS OF THINGS. For example: • I’m interested in doing collaborations with folks and paying them for their work. This makes that WAY easier – the fantastic Katie Sylor-Miller and I just announced a collaboration where we’re illustrating oh shit, git! • It simplifies hiring illustrators for my covers (and maybe I can get EVEN MORE things illustrated?!) • I can hire an accountant to help with my taxes • It makes it easier to maybe print & ship zines to people (because it simplifies things like potential kickstarter taxes), which folks have been asking me about FOREVER and I’d really like to do one day. • I can hire an IP lawyer to get advice about intellectual property / copyright if I need it! • I can more easily invest in equipment (like I’m thinking of buying a fancy colour laser printer or maybe even a machine to help me staple zines) ### It’s easier to keep track of money! Before, all my zine income was mixed in with my personal income. This was okay when there wasn’t that much of it, but now that there’s more it’s really exciting to have it in a separate place. And I’m SO EXCITED about being able to use a corporate credit card to buy company-things.
### Adding a company rate for zines One other sort-of related thing I’m excited about is – for my last 2 zines, I’ve added a “corporate rate” which gives you permission to use a zine at work for $100. So far 47 companies have bought it (thanks ❤), which is really amazing. And one company bought the $500 version which is for companies over 1000 employees! It’s possible that I should increase those prices in the future, but it’s a start! ### some questions about running a zine business There are more aspects to selling zines that I’m interested in talking about in the future – reframing writing zines as a “small side business” instead of just a “fun thing” raises all kinds of interesting questions like: • will more or fewer people read my zines if I sell them than if they’re free? • does selling zines make it harder to make weird niche zines because there’s an incentive to make things that are more general-interest? • does having a side business make it easier to do high quality work? • do people value work more if it’s not free? • right now I’m selling zines for $10, but $10 is a lot of money for many people (both in the US and outside the US). how do I keep them accessible? • are there better ways to sell zines to companies? • does giving away really good work for free normalize the idea that work “should” be given away for free? Does it undermine people who sell their work? All of the existing zines that you can find at https://jvns.ca/zines will remain free, of course :). I’m not sure about newer zines. I’m doing some experiments! (Bite Size Linux and Bite Size Command Line are both not-free) More on those questions in the future as I learn about them :) Continue Reading… ### If you did not already know Geometry-Aware Generative Adversarial Network (GAGAN) Deep generative models learned through adversarial training have become increasingly popular for their ability to generate naturalistic image textures.
However, apart from the visual texture, the visual appearance of objects is significantly affected by their shape geometry, information which is not taken into account by existing generative models. This paper introduces the Geometry-Aware Generative Adversarial Network (GAGAN) for incorporating geometric information into the image generation process. Specifically, in GAGAN the generator samples latent variables from the probability space of a statistical shape model. By mapping the output of the generator to a canonical coordinate frame through a differentiable geometric transformation, we enforce the geometry of the objects and add an implicit connection from the prior to the generated object. Experimental results on face generation indicate that the GAGAN can generate realistic images of faces with arbitrary facial attributes such as facial expression, pose, and morphology, that are of better quality compared to current GAN-based methods. Finally, our method can be easily incorporated into and improve the quality of the images generated by any existing GAN architecture. … Information Potential Auto-Encoders In this paper, we suggest a framework to make use of mutual information as a regularization criterion to train Auto-Encoders (AEs). In the proposed framework, AEs are regularized by minimization of the mutual information between input and encoding variables of AEs during the training phase. In order to estimate the entropy of the encoding variables and the mutual information, we propose a non-parametric method. We also give an information theoretic view of Variational AEs (VAEs), which suggests that VAEs can be considered as parametric methods that estimate entropy. 
Experimental results show that the proposed non-parametric models have more degrees of freedom in terms of representation learning of features drawn from complex distributions such as Mixture of Gaussians, compared to methods which estimate entropy using parametric approaches, such as Variational AEs. … Spotting anomalies with Privileged Information (SPI) We introduce a new unsupervised anomaly detection ensemble called SPI which can harness privileged information – data available only for training examples but not for (future) test examples. Our ideas build on the Learning Using Privileged Information (LUPI) paradigm pioneered by Vapnik et al. [19,17], which we extend to unsupervised learning and in particular to anomaly detection. SPI (for Spotting anomalies with Privileged Information) constructs a number of frames/fragments of knowledge (i.e., density estimates) in the privileged space and transfers them to the anomaly scoring space through ‘imitation’ functions that use only the partial information available for test examples. Our generalization of the LUPI paradigm to unsupervised anomaly detection shepherds the field in several key directions, including (i) domain knowledge-augmented detection using expert annotations as PI, (ii) fast detection using computationally-demanding data as PI, and (iii) early detection using ‘historical future’ data as PI. Through extensive experiments on simulated and real datasets, we show that augmenting privileged information to anomaly detection significantly improves detection performance. We also demonstrate the promise of SPI under all three settings (i-iii); with PI capturing expert knowledge, computationally expensive features, and future data on three real world detection tasks. … Continue Reading… ### data.table is Really Good at Sorting The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.
The graph is using a log-log scale (so things are very compressed). But data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below. Notice on the above semi-log plot the run time ratio is growing roughly linearly. This makes sense: data.table uses a radix sort which has the potential to perform in near linear time (faster than the n log(n) lower bound known for comparison sorting) for a range of problems (also we are only showing example sorting times, not worst-case sorting times). In fact, if we divide the y in the above graph by log(rows) we get something approaching a constant. The above is consistent with data.table not only being faster than dplyr, but also having a fundamentally different asymptotic running time. Performance like the above is one of the reasons you should strongly consider data.table for your R projects. All details of the timings can be found here. Continue Reading… ### Magister Dixit “You can have data without information, but you cannot have information without data.” Daniel Keys Moran Continue Reading… ### 2018 Update of ffanalytics R Package (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers) ## Introduction As part of our effort to provide users ways to replicate our analyses and improve their performance in fantasy football, we are continuously looking at ways we can improve our tools. Since the ffanalytics package was introduced back in 2016, we have had a tremendous response and feedback on the package and have worked through some of the kinks and are now introducing a new version of the ffanalytics package that should be more robust and provide more consistent results. ## Improvements One of the major concerns with data scrapes is that websites change and so does the way data are rendered on individual pages.
Where the old version of the ffanalytics package was highly dependent on how columns were ordered on the different sites, the new version of the package does not depend on column order. Instead the package looks for specific components of the page and, based on predefined mapping, can determine which columns are which. The other major concern is that different sites list the same player with a slightly different name (e.g. Rob Kelly vs. Robert Kelly) or at different positions, so matching up the data can be tricky. The new version of the package maps player IDs from the different sites back to the player ID used by MyFantasyLeague.com, so it now doesn’t matter what name or position a specific player is listed as on a specific site because the data for a specific player will match up correctly. Finally, the new version of the package relies heavily on the vocabulary and terminology used in the tidyverse package. We highly recommend familiarizing yourself with that package. It is our intent that the new version of the package will provide more consistent scrapes for users. ## Player information The player information used in the package is based on player information from MyFantasyLeague.com (MFL). When the package is loaded, it pulls the most recent player data from MFL into the player_table object. When the scrape runs, it determines the player ID for the player on the site that is being scraped and maps that ID back to the MFL player ID. The scraped data contains an id column which holds the player ID for MFL, and a src_id column that has the ID for the site. However, when aggregating the stats and generating the projections the player ID used is the MFL player ID. ## Custom calculations The customization of calculations is done as input to the projections_table() function. You can specify your own scoring rules, analyst weights, tier thresholds and VOR baselines to be used in the calculations.
See ?projections_table for specifics, and check out the scoring_settings vignette on how to customize scoring rules. ## Installation and getting started We have updated the instructions on installing and getting the package here so you can check that page out, and feel free to leave comments with feedback and suggestions. The post 2018 Update of ffanalytics R Package appeared first on Fantasy Football Analytics. To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more... Continue Reading… ### Document worth reading: “Does putting your emotions into words make you feel better? Measuring the minute-scale dynamics of emotions from online data” Studies of affect labeling, i.e. putting your feelings into words, indicate that it can attenuate positive and negative emotions. Here we track the evolution of individual emotions for tens of thousands of Twitter users by analyzing the emotional content of their tweets before and after they explicitly report having a strong emotion. Our results reveal how emotions and their expression evolve at the temporal resolution of one minute. While the expression of positive emotions is preceded by a short but steep increase in positive valence and followed by short decay to normal levels, negative emotions build up more slowly, followed by a sharp reversal to previous levels, matching earlier findings of the attenuating effects of affect labeling. We estimate that positive and negative emotions last approximately 1.25 and 1.5 hours from onset to evanescence.
A separate analysis for male and female subjects is suggestive of possible gender-specific differences in emotional dynamics. Does putting your emotions into words make you feel better? Measuring the minute-scale dynamics of emotions from online data Continue Reading… ### R Packages worth a look Robust Generalized Linear Models (GLM) using Mixtures (robmixglm) Robust generalized linear models (GLM) using a mixture method, as described in Beath (2018) <doi:10.1080/02664763.2017.1414164>. This assumes tha … Penalized Orthogonal-Components Regression (POCRE) Penalized orthogonal-components regression (POCRE) is a supervised dimension reduction method for high-dimensional data. It sequentially constructs ort … Library of Research Designs (DesignLibrary) A simple interface to build designs using the package ‘DeclareDesign’. In one line of code, users can specify the parameters of individual de … Generate Random Samples from the Polya-Gamma Distribution (pgdraw) Generates random samples from the Polya-Gamma distribution using an implementation of the algorithm described in J. Windle’s PhD thesis (2013) … Web-Based Interactive Omics Visualization (wilson) Tool-set of modules for creating web-based applications that use plot based strategies to visualize and analyze multi-omics data. This package utilizes … Continue Reading… ### Mongolite 2.0: GridFS, connection pooling, and more (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers) This week version 2.0 of the mongolite package has been released to CRAN. Major new features in this release include support for MongoDB 4.0, GridFS, running database commands, and connection pooling. Mongolite is primarily an easy-to-use client to get data in and out of MongoDB. However it supports increasingly many advanced features like aggregation, indexing, map-reduce, streaming, encryption, and enterprise authentication.
The mongolite user manual provides a great introduction with details and worked examples.

## GridFS Support!

New in version 2.0 is support for the MongoDB GridFS system. A GridFS is a special type of Mongo collection for storing binary data, such as files. To the user, a GridFS looks like a key-value server with potentially very large values. We support two interfaces for sending/receiving data from/to GridFS. The fs$read() and fs$write() methods are the most flexible and can stream data from/to an R connection, such as a file, socket or url. These methods support a progress counter and can be interrupted if needed. These methods are recommended for reading or writing single files.

# Assume 'mongod' is running on localhost
fs <- gridfs()
buf <- serialize(nycflights13::flights, NULL)
fs$write(buf, 'flights')
#> [flights]: written 45.11 MB (done)
#>  List of 6
#> $id : chr "5b70a37a47a302506117c352" #>$ name    : chr "flights"
#> $size : num 45112028 #>$ date    : POSIXct[1:1], format: "2018-08-12 23:15:38"
#> $type : chr NA #>$ metadata: chr NA

out <- fs$read('flights')
flights <- unserialize(out$data)
# [flights]: read 45.11 MB (done)
#>  A tibble: 6 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
#> 1  2013     1     1      517            515         2      830            819        11 UA        1545
#> 2  2013     1     1      533            529         4      850            830        20 UA        1714
#> 3  2013     1     1      542            540         2      923            850        33 AA        1141
#> 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725
#> 5  2013     1     1      554            600        -6      812            837       -25 DL         461
#> 6  2013     1     1      554            558        -4      740            728        12 UA        1696
#> # ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
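The write/read pair above works because GridFS just stores raw bytes; R's serialize()/unserialize() do the object-to-bytes conversion. The roundtrip can be sanity-checked without a running MongoDB at all — a minimal sketch using a toy data frame in place of nycflights13:

```r
# GridFS stores raw bytes; serialize()/unserialize() is what turns an
# R object into a byte stream and back. No server needed to verify that:
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
buf <- serialize(df, NULL)   # raw vector, like the one passed to fs$write()
out <- unserialize(buf)
identical(df, out)
# [1] TRUE
```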


The fs$upload() and fs$download() methods on the other hand copy directly between GridFS and your local disk. This API is vectorized so it can transfer many files at once. Head over to the gridfs chapter in the manual for more examples.

## Running Commands

MongoDB exposes an enormous number of database commands, and mongolite cannot provide wrappers for each command. As a compromise, the new version of mongolite exposes an API to run raw commands, so you can manually run the commands for which we do not expose wrappers.

The result data from the command automatically gets simplified into nice R data structures using the jsonlite simplification system (but you can set simplify = FALSE if you prefer JSON structures).

m <- mongo()
m$run('{"ping": 1}')
#> $ok
#> [1] 1


For example, we can run the listDatabases command to list all databases on the server:

admin <- mongo(db = "admin")
admin$run('{"listDatabases":1}')
#> $databases
#>     name sizeOnDisk empty
#> 2 config      73728 FALSE
#> 3  local      73728 FALSE
#> 4   test   72740864 FALSE
#>
#> $totalSize
#> [1] 72921088
#>
#> $ok
#> [1] 1


The server tools chapter in the manual has some more examples.
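Since run() takes a raw JSON string, command documents can also be built programmatically rather than pasted together by hand. A small sketch using jsonlite (the same library mongolite uses for simplification); auto_unbox = TRUE keeps scalars from becoming one-element arrays:

```r
library(jsonlite)

# Build the listDatabases command document as JSON rather than by string-pasting
cmd <- toJSON(list(listDatabases = 1), auto_unbox = TRUE)
as.character(cmd)
# [1] "{\"listDatabases\":1}"

# admin$run(cmd) would then submit it (requires a live connection)
```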

## Connection Pooling

Finally, another much-requested feature was connection pooling. Previously, mongolite would open a new database connection for each collection object in R. However, as of version 2.0, mongolite will use existing connections to the same database when possible.

# This will use a single database connection
db.flights <- mongolite::mongo("flights")
db.iris <- mongolite::mongo("iris")
db.files <- mongolite::gridfs()


A database connection is automatically closed when all objects that were using it have either been removed, or explicitly disconnected with the disconnect() method. Continuing the example above:

# Connection still alive
rm(db.flights)
db.files$disconnect()

# Now it will disconnect
db.iris$disconnect()


Mongolite collection and GridFS objects automatically reconnect when needed if they have been disconnected (either explicitly or automatically, e.g. when restarting your R session). For example:

> db.files$find()
Connection lost. Trying to reconnect with mongo...
#>                         id    name     size                date type metadata
#> 1 5b70a37a47a302506117c352 flights 45112028 2018-08-12 23:15:38 <NA>     <NA>

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. Continue Reading… ### Where to go observe birds in Radolfzell? An answer with R and open data (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers) This post is the 1st post of a series showcasing various rOpenSci packages as if Maëlle were a birder trying to make the most of R in general and rOpenSci in particular. Although the series’ use cases will mostly feature birds, it’ll be the occasion to highlight rOpenSci’s packages that are more widely applicable, so read on no matter what your field is! Moreover, each post should stand on its own. A logical first step in this birder’s guide to rOpenSci is to use R to find where to observe birds! In this blog post, we shall use rOpenSci packages accessing open geographical data to locate and visualize where to observe birds near a given location. ### First of all, where are we now? The Max Planck Institute for Ornithology (henceforth shortened to MPI), where Maëlle will give a talk, is located at Am Obstberg 1 78315 Radolfzell. Let’s geolocate it using rOpenSci’s opencage package that interfaces the OpenCage Geocoder, a commercial service based on open data. When choosing to get only one result via limit = 1, one gets what the API considers to be the best one.

mpi <- opencage::opencage_forward("Am Obstberg 1 78315 Radolfzell", limit = 1)$results
class(mpi)

## [1] "tbl_df"     "tbl"        "data.frame"

head(names(mpi))

## [1] "annotations.DMS.lat"    "annotations.DMS.lng"
## [3] "annotations.MGRS"       "annotations.Maidenhead"
## [5] "annotations.Mercator.x" "annotations.Mercator.y"


This gets us Am Obstberg 1, 78315 Radolfzell, Germany (mpi$formatted) which is in 🇩🇪 (mpi$annotations.flag gets us a flag!).
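With coordinates in hand, it is easy to measure how far any candidate birding spot is from the MPI. A minimal great-circle (haversine) helper in base R — the sample coordinates below are illustrative, not the geocoder's exact output:

```r
# Great-circle (haversine) distance in km between two lon/lat points
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- function(deg) deg * pi / 180
  dlat <- to_rad(lat2 - lat1)
  dlon <- to_rad(lon2 - lon1)
  a <- sin(dlat / 2)^2 +
    cos(to_rad(lat1)) * cos(to_rad(lat2)) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Illustrative points near Radolfzell (not the actual geocoder output)
haversine_km(8.97, 47.73, 8.92, 47.74)
```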

### Birding in a bird hide?

#### Where to find a bird hide?

You can most certainly go birding anywhere, but if you can find a bird hide (or bird blind depending on the English you speak), it might be a very appropriate observation point. Now that we know where the MPI is, we can look for bird hide(s) in the vicinity. For that, we shall use rOpenSci’s osmdata package by Mark Padgham and collaborators! Note that incidentally, Mark did his PhD in ecology. This package is an interface to OpenStreetMap’s Overpass API. OpenStreetMap is a collective map of the world. It contains information about towns’ limits, roads, placenames… but also tags of everything, from bars as seen in this blog post to trees. You can browse existing features via OpenStreetMap’s wiki. Some parts of the world are better mapped than others depending on the local OpenStreetMap community. Actually, OpenCage’s blog features an interesting series of country profiles.

To look for a bird hide, we first create a bounding box of 10km around
the MPI, using rOpenSci’s bbox package.

bbox <- bbox::lonlat2bbox(mpi$geometry.lng, mpi$geometry.lat,
                          dist = 10, method = "lawn")


We then use the key and value associated with bird hides in
OpenStreetMap: respectively leisure and bird_hide.

library("osmdata")

## Data (c) OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright

library("magrittr")
(results <- opq(bbox = bbox) %>%
  add_osm_feature(key = 'leisure', value = 'bird_hide') %>%
  osmdata_sf())

## Object of class 'osmdata' with:
##                  $bbox : 47.6865,8.8753,47.8494,9.1177
##         $overpass_call : The call submitted to the overpass API
##             $timestamp : [ Thu 5 Jul 2018 08:06:35 ]
##            $osm_points : 'sf' Simple Features Collection with 1 points
##             $osm_lines : 'sf' Simple Features Collection with 0 linestrings
##          $osm_polygons : 'sf' Simple Features Collection with 0 polygons
##        $osm_multilines : 'sf' Simple Features Collection with 0 multilinestrings
##     $osm_multipolygons : 'sf' Simple Features Collection with 0 multipolygons

results$osm_points

##              leisure            geometry
## 5004940425 bird_hide 8.920901, 47.741569

Yay, we now know where to find a bird hide not too far from the MPI!

#### Visualizing our location and the bird hide

So one could enter the coordinates of that bird hide in one’s favourite mapping software or app, but to show you where the bird hide is we can actually step back and use osmplotr, another package contributed to rOpenSci by Mark Padgham! The way osmplotr works is letting you create a basemap, on which you’ll add different layers extracted from OpenStreetMap using osmplotr::extract_osm_objects or osmdata functions directly. Its strengths are therefore the use of open data, and the customization of what you’re using as background! Let’s create a basemap for our bounding box, and then add roads and buildings to it.

library("osmplotr")

## Data (c) OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright

bbox <- get_bbox(bbox)
dat_B <- extract_osm_objects(key = 'building', bbox = bbox)
dat_H <- extract_osm_objects(key = 'highway', bbox = bbox)
map0 <- osm_basemap(bbox = bbox, bg = 'gray20') %>%
  add_osm_objects(dat_B, col = 'gray40') %>%
  add_osm_objects(dat_H, col = 'gray80')

map0 %>%
  add_axes() %>%
  print_osm_map(filename = 'map_a1.png', width = 600,
                units = 'px', dpi = 72)

library("magrittr")
magick::image_read('map_a1.png') %>%
  magick::image_annotate("Map data © OpenStreetMap contributors",
                         color = "white",
                         boxcolor = "black",
                         size = 15,
                         gravity = "south")

Quite pretty! The lakes can be seen because of the absence of roads and buildings on them. Now, let’s plot the bird hide and the MPI on them. Since we used osmdata::osmdata_sf, we had gotten the data in a receivable class, sf.

points_map <- add_osm_objects(map0, results$osm_points,
col = 'salmon',
size = 5)


For plotting the MPI, we’ll convert opencage output to an sf point
with the same coordinate reference system as the OpenStreetMap data
extracted with osmdata.

coords <- data.frame(lon = mpi$geometry.lng, lat = mpi$geometry.lat)
crs <- sf::st_crs(results$osm_points)
mpi_sf <- sf::st_as_sf(coords, coords = c("lon", "lat"), crs = crs)
points_map <- add_osm_objects(points_map, mpi_sf,
                              col = 'white', size = 5)

We can now visualize both points on the map, the MPI in white and the bird hide in salmon, South-West from the MPI.

points_map %>%
  add_axes() %>%
  print_osm_map(filename = 'map_a2.png', width = 600,
                units = 'px', dpi = 72)

magick::image_read('map_a2.png') %>%
  magick::image_annotate("Map data © OpenStreetMap contributors",
                         color = "white",
                         boxcolor = "black",
                         size = 15,
                         gravity = "south")

Aha, now we see where the bird hide is, fantastic! But as Mark noted, birds can actually be observed from other places.

### Birding where birds should be?

Birds are most likely to be found where water lies close to natural areas, and we can translate this to R code! We shall get all water and (separately) all non-watery natural areas and find the shortest distances between them before plotting the results using add_osm_surface.

First, let’s get all water and (separately) all non-watery natural areas.

dat <- opq(bbox = bbox) %>%
  add_osm_feature(key = 'natural') %>%
  osmdata_sf(quiet = FALSE)

## Issuing query to Overpass API ...
## Rate limit: 2
## Query complete!
## converting OSM data to sf format

indx_W <- which(dat$osm_polygons$natural %in% c("water", "wetland"))
indx_N <- which(!dat$osm_polygons$natural %in% c("water", "wetland"))
xy_W <- sf::st_coordinates(dat$osm_polygons[indx_W, ])
xy_N <- sf::st_coordinates(dat$osm_polygons[indx_N, ])

Then we use Mark’s geodist package to get pairwise distances between all points, find minima, and make a data.frame to submit to add_osm_surface(). We have 7068 watery points and 10065 non-watery points so the speed of geodist is crucial here!

t1 <- Sys.time()
d <- geodist::geodist(xy_W, xy_N) # so fast!!!
Sys.time() - t1

## Time difference of 13.57983 secs

d1 <- apply(d, 1, min)
d2 <- apply(d, 2, min)
xy <- cbind(rbind(xy_W, xy_N), c(d1, d2))
xy <- xy[, c(1, 2, 5)]
colnames(xy) <- c("x", "y", "z")
xy <- xy[!duplicated(xy), ]

Finally we plot the results on the roads we had gotten earlier, although we do not recommend staying in the middle of a road to observe birds! We re-add the points corresponding to the MPI and bird hide after the surface coloring. With osmplotr, order matters because layers are added on top of each other. Note that plotting the distances takes a while.

# colorblind-friendly palette!
cols <- viridis::viridis_pal(direction = -1)(30)
add_osm_surface(map0, dat_H, dat = xy, col = cols) %>%
  add_axes() %>%
  add_colourbar(cols = cols, zlim = range(xy[,"z"])) %>%
  add_osm_objects(mpi_sf, col = 'white', size = 5) %>%
  add_osm_objects(results$osm_points,
                  col = 'white', size = 5) %>%
  print_osm_map(filename = 'map_a3.png', width = 600,
                units = 'px', dpi = 72)

magick::image_read('map_a3.png') %>%
  magick::image_annotate("Map data © OpenStreetMap contributors",
                         color = "white",
                         boxcolor = "black",
                         size = 15,
                         gravity = "south")


On the map, the yellower/lighter a road is, the better it is to observe
birds, according to Mark’s assumption that birds are most likely to be
found where water lies close to natural areas. With this metric, the MPI
itself is not too badly located, which after all is not too surprising
for an MPI of ornithology. Maybe one should just walk to one of the
nearest lakes to meet some birds.

### Conclusion

#### Open geographical data in R

In this post we showcased three rOpenSci packages helping you use open
geographical data in R: opencage, osmdata and osmplotr, therefore mostly
making use of the awesome OpenStreetMap data (the OpenCage Geocoder uses
a bit more than that), which warranted adding the annotation “Map data ©
OpenStreetMap contributors” to all maps of this post using rOpenSci’s
magick package. You might also have noticed in the code earlier that
both osmdata and osmplotr have start-up messages indicating the data
origin and licence.

Another rOpenSci package interacting with open geographical data, which
might be of interest for making maps, is rnaturalearth, which
facilitates interaction with Natural Earth map data.

#### Other R packages for spatial analyses

We also used two other rOpenSci packages: bbox to get a bounding box
and magick for image manipulation. Explore more of our suite of
packages, including and beyond the geospatial category, here.

We also used the geodist package for ultra lightweight, ultra fast
calculation of geo distances. This package is developed in the hypertidy
GitHub organization, which is a good place to find useful R geospatial
packages. Other good resources for geospatial analyses with R include
the r-spatial.org website and the great Geocomputation with R book by
Robin Lovelace, Jakub Nowosad and Jannes Muenchow, presented in this
blog post of Steph Locke’s.

#### More birding soon!

Stay tuned for the next post in this series, about getting bird
observation data in R! In the meantime, happy birding!


## August 13, 2018

### Analysis of fake YouTube views

Wherever more attention or the appearance of it equates to more money, there are those who try to game the system. Michael H. Keller for The New York Times examines the business of fake YouTube views:

YouTube’s engineers, statisticians and data scientists are constantly improving in their ability to fight what Ms. O’Connor calls a “very hard problem,” but the attacks have “continually gotten stronger and more sophisticated,” she said.

After the Times reporter presented YouTube with the videos for which he had bought views, the company said sellers had exploited two vulnerabilities that had already been fixed. Later that day, the reporter bought more views from six of the same vendors. The view count rose again, though more slowly. A week later, all but two of the vendors had delivered the full amount.


### What’s new on arXiv

Existing models of computation, such as a Turing machine (hereafter, TM), do not consider the agent involved in interpreting the outcome of the computation. We argue that a TM, or any other computation model, has no significance if its output is not interpreted by some agent. Furthermore, we argue that including the interpreter in the model definition sheds light on some of the difficult problems faced in computation and mathematics. We provide an analytic process framework to address this limitation. The framework can be overlaid on existing concepts of computation to address many practical and philosophical concerns such as the P vs NP problem. In addition, we provide a constructive proof for the P vs NP problem under the assumption that the class NP comprises problems solvable by non-deterministic algorithms. We utilize the observation that deterministic computational procedures lack the fundamental capacity to fully simulate their non-deterministic variant.
We study the problem of matching agents who arrive at a marketplace over time and leave after d time periods. Agents can only be matched while they are present in the marketplace. Each pair of agents can yield a different match value, and the planner’s goal is to maximize the total value over a finite time horizon. First we study the case in which vertices arrive in an adversarial order. We provide a randomized 0.25-competitive algorithm building on a result by Feldman et al. (2009) and Lehman et al. (2006). We extend the model to the case in which departure times are drawn independently from a distribution with non-decreasing hazard rate, for which we establish a 1/8-competitive algorithm. When the arrival order is chosen uniformly at random, we show that a batching algorithm, which computes a maximum-weighted matching every (d+1) periods, is 0.279-competitive.
NL4Py is a NetLogo controller software for Python, for the rapid, parallel execution of NetLogo models. NL4Py provides both headless (no graphical user interface) and GUI NetLogo workspace control through Python. Spurred on by the increasing availability of open-source computation and machine learning libraries on the Python package index, there is an increasing demand for such rapid, parallel execution of agent-based models through Python. NetLogo, being the language of choice for a majority of agent-based modeling driven research projects, requires an integration to Python for researchers looking to perform statistical analyses of agent-based model output using these libraries. Unfortunately, until the recent introduction of PyNetLogo, and now NL4Py, such a controller was unavailable. This article provides a detailed introduction into the usage of NL4Py and explains its client-server software architecture, highlighting architectural differences to PyNetLogo. A step-by-step demonstration of global sensitivity analysis and parameter calibration of the Wolf Sheep Predation model is then performed through NL4Py. Finally, NL4Py’s performance is benchmarked against PyNetLogo and its combination with IPyParallel, and shown to provide significant savings in execution time over both configurations.
Because of their effectiveness in broad practical applications, LSTM networks have received a wealth of coverage in scientific journals, technical blogs, and implementation guides. However, in most articles, the inference formulas for the LSTM network and its parent, RNN, are stated axiomatically, while the training formulas are omitted altogether. In addition, the technique of ‘unrolling’ an RNN is routinely presented without justification throughout the literature. The goal of this paper is to explain the essential RNN and LSTM fundamentals in a single document. Drawing from concepts in signal processing, we formally derive the canonical RNN formulation from differential equations. We then propose and prove a precise statement, which yields the RNN unrolling technique. We also review the difficulties with training the standard RNN and address them by transforming the RNN into the ‘Vanilla LSTM’ network through a series of logical arguments. We provide all equations pertaining to the LSTM system together with detailed descriptions of its constituent entities. Albeit unconventional, our choice of notation and the method for presenting the LSTM system emphasizes ease of understanding. As part of the analysis, we identify new opportunities to enrich the LSTM system and incorporate these extensions into the Vanilla LSTM network, producing the most general LSTM variant to date. The target reader has already been exposed to RNNs and LSTM networks through numerous available resources and is open to an alternative pedagogical approach. A Machine Learning practitioner seeking guidance for implementing our new augmented LSTM model in software for experimentation and research will find the insights and derivations in this tutorial valuable as well.
Fuzzy clustering methods identify naturally occurring clusters in a dataset, where the extent to which different clusters are overlapped can differ. Most methods have a parameter to fix the level of fuzziness. However, the appropriate level of fuzziness depends on the application at hand. This paper presents Entropy $c$-Means (ECM), a method of fuzzy clustering that simultaneously optimizes two contradictory objective functions, resulting in the creation of fuzzy clusters with different levels of fuzziness. This allows ECM to identify clusters with different degrees of overlap. ECM optimizes the two objective functions using two multi-objective optimization methods, Non-dominated Sorting Genetic Algorithm II (NSGA-II), and Multiobjective Evolutionary Algorithm based on Decomposition (MOEA/D). We also propose a method to select a suitable trade-off clustering from the Pareto front. Experiments on challenging synthetic datasets as well as real-world datasets show that ECM leads to better cluster detection compared to the conventional fuzzy clustering methods as well as previously used multi-objective methods for fuzzy clustering.
Modeling spillover effects from observational data is an important problem in economics, business, and other fields of research. % It helps us infer the causality between two seemingly unrelated set of events. For example, if consumer spending in the United States declines, it has spillover effects on economies that depend on the U.S. as their largest export market. In this paper, we aim to infer the causation that results in spillover effects between pairs of entities (or units), we call this effect as \textit{paired spillover}. To achieve this, we leverage the recent developments in variational inference and deep learning techniques to propose a generative model called Linked Causal Variational Autoencoder (LCVA). Similar to variational autoencoders (VAE), LCVA incorporates an encoder neural network to learn the latent attributes and a decoder network to reconstruct the inputs. However, unlike VAE, LCVA treats the \textit{latent attributes as confounders that are assumed to affect both the treatment and the outcome of units}. Specifically, given a pair of units $u$ and $\bar{u}$, their individual treatment and outcomes, the encoder network of LCVA samples the confounders by conditioning on the observed covariates of $u$, the treatments of both $u$ and $\bar{u}$ and the outcome of $u$. Once inferred, the latent attributes (or confounders) of $u$ captures the spillover effect of $\bar{u}$ on $u$. Using a network of users from job training dataset (LaLonde (1986)) and co-purchase dataset from Amazon e-commerce domain, we show that LCVA is significantly more robust than existing methods in capturing spillover effects.
We propose two methods for exact Gaussian process (GP) inference and learning on massive image, video, spatial-temporal, or multi-output datasets with missing values (or ‘gaps’) in the observed responses. The first method ignores the gaps using sparse selection matrices and a highly effective low-rank preconditioner is introduced to accelerate computations. The second method introduces a novel approach to GP training whereby response values are inferred on the gaps before explicitly training the model. We find this second approach to be greatly advantageous for the class of problems considered. Both of these novel approaches make extensive use of Kronecker matrix algebra to design massively scalable algorithms which have low memory requirements. We demonstrate exact GP inference for a spatial-temporal climate modelling problem with 3.7 million training points as well as a video reconstruction problem with 1 billion points.
Technical computing is a challenging application area for programming languages to address. This is evinced by the unusually large number of specialized languages in the area (e.g. MATLAB, R), and the complexity of common software stacks, often involving multiple languages and custom code generators. We believe this is ultimately due to key characteristics of the domain: highly complex operators, a need for extensive code specialization for performance, and a desire for permissive high-level programming styles allowing productive experimentation. The Julia language attempts to provide a more effective structure for this kind of programming by allowing programmers to express complex polymorphic behaviors using dynamic multiple dispatch over parametric types. The forms of extension and reuse permitted by this paradigm have proven valuable for technical computing. We report on how this approach has allowed domain experts to express useful abstractions while simultaneously providing a natural path to better performance for high-level technical code.
Finding the largest few principal components of a matrix of genetic data is a common task in genome-wide association studies (GWASs), both for dimensionality reduction and for identifying unwanted factors of variation. We describe a simple random matrix model for matrices that arise in GWASs, showing that the singular values have a bulk behavior that obeys a Marchenko-Pastur distribution, with a handful of large outliers. We also implement Golub-Kahan-Lanczos (GKL) bidiagonalization in the Julia programming language, providing thick restarting and a choice between full and partial reorthogonalization strategies to control numerical roundoff. Our implementation of GKL bidiagonalization is up to 36 times faster than software tools used commonly in genomics data analysis for computing principal components, such as EIGENSOFT and FlashPCA, which use dense LAPACK routines and randomized subspace iteration respectively.
Sparse deep neural networks (DNNs) are efficient in both memory and compute when compared to dense DNNs. But due to irregularity in the computation of sparse DNNs, their efficiency is much lower than that of dense DNNs on general-purpose hardware. This leads to poor/no performance benefits for sparse DNNs. The performance issue for sparse DNNs can be alleviated by bringing structure to the sparsity and leveraging it to improve runtime efficiency. But such structural constraints often lead to sparse models with suboptimal accuracies. In this work, we jointly address both accuracy and performance of sparse DNNs using our proposed class of neural networks called HBsNN (Hierarchical Block Sparse Neural Networks).
Traditional chatbots usually need a mass of human dialogue data, especially when using supervised machine learning method. Though they can easily deal with single-turn question answering, for multi-turn the performance is usually unsatisfactory. In this paper, we present Lingke, an information retrieval augmented chatbot which is able to answer questions based on given product introduction document and deal with multi-turn conversations. We will introduce a fine-grained pipeline processing to distill responses based on unstructured documents, and attentive sequential context-response matching for multi-turn conversations.
In this work we give new density estimators by averaging classical density estimators such as the histogram, the frequency polygon and the kernel density estimators obtained over different bootstrap samples of the original data. We prove the L2-consistency of these new estimators and compare them to several similar approaches by extensive simulations. Based on them, we also give a way to construct nonparametric pointwise confidence intervals for the target density.
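As a minimal sketch of the averaging idea (my own toy illustration, not the authors' estimators): compute base R kernel density estimates on bootstrap resamples over a common grid and average them. The grid and the number of replicates are arbitrary choices here.

```r
set.seed(42)
x <- rnorm(200)                          # toy sample
grid <- seq(-4, 4, length.out = 512)     # common evaluation grid
B <- 50                                  # number of bootstrap replicates

# Average kernel density estimates computed on B bootstrap resamples
boot_dens <- replicate(B, {
  xb <- sample(x, replace = TRUE)
  density(xb, from = -4, to = 4, n = 512)$y
})
avg_dens <- rowMeans(boot_dens)          # the averaged estimator on `grid`
```

The B columns of boot_dens could also be used pointwise (e.g. via quantiles) to get the confidence intervals the abstract mentions.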
Determining whether a given claim is supported by evidence is a fundamental NLP problem that is best modeled as Textual Entailment. However, given a large collection of text, finding evidence that could support or refute a given claim is a challenge in itself, amplified by the fact that different evidence might be needed to support or refute a claim. Nevertheless, most prior work decouples evidence identification from determining the truth value of the claim given the evidence. We propose to consider these two aspects jointly. We develop TwoWingOS (two-wing optimization strategy), a system that, while identifying appropriate evidence for a claim, also determines whether or not the claim is supported by the evidence. Given the claim, TwoWingOS attempts to identify a subset of the evidence candidates; given the predicted evidence, it then attempts to determine the truth value of the corresponding claim. We treat this challenge as coupled optimization problems, training a joint model for it. TwoWingOS offers two advantages: (i) Unlike pipeline systems, it facilitates flexible-size evidence set, and (ii) Joint training improves both the claim entailment and the evidence identification. Experiments on a benchmark dataset show state-of-the-art performance. Code: https://…/FEVER
Finding the diameter of a dataset in multidimensional Euclidean space is a well-established problem, with well-known algorithms. However, most of the algorithms found in the literature do not scale well with large values of data dimension, so the time complexity grows exponentially in most cases, which makes these algorithms impractical. Therefore, we implemented 4 simple greedy algorithms to be used for approximating the diameter of a multidimensional dataset; these are based on minimum/maximum l2 norms, hill climbing search, Tabu search and Beam search approaches, respectively. The time complexity of the implemented algorithms is near-linear, as they scale near-linearly with data size and its dimensions. The results of the experiments (conducted on different machine learning data sets) prove the efficiency of the implemented algorithms and can therefore be recommended for finding the diameter to be used by different machine learning applications when needed.
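The simplest of these heuristics, the minimum/maximum l2-norm approach, can be sketched as follows (a toy illustration of the general idea, not the authors' implementation): take the point furthest from the centroid, then scan once more for the point furthest from it; two linear scans give a lower bound on the diameter.

```r
set.seed(1)
X <- matrix(rnorm(1000 * 10), ncol = 10) # 1000 points in 10 dimensions

# 1st scan: point with the maximum l2 norm around the centroid
ctr <- colMeans(X)
norms <- sqrt(rowSums(sweep(X, 2, ctr)^2))
far <- which.max(norms)

# 2nd scan: point furthest from that candidate endpoint
d_to_far <- sqrt(rowSums(sweep(X, 2, X[far, ])^2))
approx_diam <- max(d_to_far)             # lower bound on the true diameter
```

By the triangle inequality this estimate is never worse than half the true diameter, and the cost is O(n * d) rather than the O(n^2 * d) of the exact pairwise computation.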
Multi-layer neural networks have led to remarkable performance on many kinds of benchmark tasks in text, speech and image processing. Nonlinear parameter estimation in hierarchical models is known to be subject to overfitting. One approach to this overfitting and related problems (local minima, collinearity, feature discovery etc.) is called dropout (Srivastava, et al 2014, Baldi et al 2016). This method removes hidden units with a Bernoulli random variable with probability $p$ over updates. In this paper we will show that Dropout is a special case of a more general model published originally in 1990 called the stochastic delta rule (SDR, Hanson, 1990). SDR parameterizes each weight in the network as a random variable with mean $\mu_{w_{ij}}$ and standard deviation $\sigma_{w_{ij}}$. These random variables are sampled on each forward activation, consequently creating an exponential number of potential networks with shared weights. Both parameters are updated according to prediction error, thus implementing weight noise injections that reflect a local history of prediction error and efficient model averaging. SDR therefore implements a local gradient-dependent simulated annealing per weight converging to a Bayes-optimal network. Tests on standard benchmarks (CIFAR) using a modified version of DenseNet show that SDR outperforms standard dropout in error by over 50% and in loss by over 50%. Furthermore, the SDR implementation converges on a solution much faster, reaching a training error of 5 in just 15 epochs with DenseNet-40 compared to standard DenseNet-40’s 94 epochs.
Extracting characteristics from the training datasets of classification problems has proven effective in a number of meta-analyses. Among them, measures of classification complexity can estimate the difficulty in separating the data points into their expected classes. Descriptors of the spatial distribution of the data and estimates of the shape and size of the decision boundary are among the existing measures for this characterization. This information can support the formulation of new data-driven pre-processing and pattern recognition techniques, which can in turn be focused on challenging characteristics of the problems. This paper surveys and analyzes measures which can be extracted from the training datasets in order to characterize the complexity of the respective classification problems. Their use in recent literature is also reviewed and discussed, allowing one to prospect opportunities for future work in the area. Finally, descriptions are given on an R package named Extended Complexity Library (ECoL) that implements a set of complexity measures and is made publicly available.
The standard probabilistic perspective on machine learning gives rise to empirical risk-minimization tasks that are frequently solved by stochastic gradient descent (SGD) and variants thereof. We present a formulation of these tasks as classical inverse or filtering problems and, furthermore, we propose an efficient, gradient-free algorithm for finding a solution to these problems using ensemble Kalman inversion (EKI). Applications of our approach include offline and online supervised learning with deep neural networks, as well as graph-based semi-supervised learning. The essence of the EKI procedure is an ensemble based approximate gradient descent in which derivatives are replaced by differences from within the ensemble. We suggest several modifications to the basic method, derived from empirically successful heuristics developed in the context of SGD. Numerical results demonstrate wide applicability and robustness of the proposed algorithm.

### PCA revisited: using principal components for classification of faces

(This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers)

This is a short post following the previous one (PCA revisited).

In this post I’m going to apply PCA to a toy problem: the classification of faces. Again I’ll be working on the Olivetti faces dataset. Please visit the previous post PCA revisited to read how to download it.

The goal of this post is to fit a simple classification model to predict, given an image, the label to which it belongs. I’m going to fit two support vector machine models and then compare their accuracy.

The two models are:

1. The first model uses as input data the raw (scaled) pixels (all 4096 of them). Let’s call this model “the data model”.
2. The second model uses as input data only some principal components. Let’s call this model “the PCA model”.

This will not be an in-depth and detailed fit and subsequent evaluation of the model, but rather a simple proof of concept to prove that a limited number of principal components can be used to perform a classification task instead of using the actual data.

## The setup of the two models

In order to get an idea of what accuracy our models can achieve, I decided to do a simple 20-fold cross-validation of each model.
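A 20-fold split like the one described can be sketched as follows (fold-assignment logic only; the SVM fitting call, not shown in the post either, is left as a placeholder):

```r
set.seed(123)
n <- 400                                  # Olivetti faces: 400 images
k <- 20                                   # 20-fold cross-validation
folds <- sample(rep(1:k, length.out = n)) # a random fold label per image

accuracies <- vapply(1:k, function(i) {
  test_idx  <- which(folds == i)
  train_idx <- which(folds != i)
  # fit the SVM on train_idx and compute accuracy on test_idx here;
  # this sketch returns a placeholder instead
  NA_real_
}, numeric(1))
```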

The data model is a simple support vector machine that is fitted using all the faces in the training set: its input data is a vector of 4096 components.

The PCA model is again the same support vector machine (with the same hyperparameters, which however may need some tweaking) fitted using 30 PCs.

By performing PCA on the dataset I transformed the data and, according to the analysis, 30 PCs account for about 82% of the total variance in the dataset. Indeed you can see from the plots below that the first 30 eigenvalues are those with the highest magnitude.
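The "30 PCs account for about 82% of the variance" figure comes from the cumulative proportion of variance, which prcomp makes easy to compute. Here is a sketch on small random stand-in data (the real analysis would use the 400 × 4096 Olivetti matrix):

```r
set.seed(7)
faces <- matrix(rnorm(100 * 500), nrow = 100) # small stand-in for the face matrix

pca <- prcomp(faces, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2) # per-component share of variance
cum_var <- cumsum(var_explained)

cum_var[30] # proportion of total variance captured by the first 30 PCs
```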

## Results

After running the two cross-validations on the two different models, the main results I found are the following:

1. The average accuracy of the two models is about the same (the PCA model has a slightly better average accuracy): 93% vs 94%. The standard deviation is again in favour of the PCA model: 3.2% vs 2.3%. The accuracy isn’t great; I obtained 98% on this dataset using other models, but of course this is just an “out of the box” test with no hyperparameter tuning or other fine-tuning operations.
2. The second model fits the data much faster than the first one.

The results are summarised in the boxplot below.

Note that the PCA model has less information available compared to the model using the original data. However, that information is condensed into fewer variables (the 30 PCs) in a more compact way. This technique is great when you are using models sensitive to correlation in the data: since the PCs are not correlated, you can transform your original data (which might be correlated and causing problems in the model) and use the principal components instead.

Another nice advantage is that since the PCA model uses fewer variables it is faster. Note that I decided to apply PCA to the whole dataset and not only to the training set of each round of cross-validation, although it could be argued that in practice you only have access to the training set (of course). I then used the eigenvectors to obtain the PCs for both the training and testing set of each cross-validation iteration.
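The projection step just described, deriving a rotation from PCA and applying it to both training and testing images, might be sketched like this (random stand-in matrices, and the rotation is fitted on the training set only; predict.prcomp applies the stored centering, scaling and rotation):

```r
set.seed(99)
train <- matrix(rnorm(80 * 200), nrow = 80) # stand-in training images
test  <- matrix(rnorm(20 * 200), nrow = 20) # stand-in test images

pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Keep the first 30 PCs as features for the SVM
train_pcs <- pca$x[, 1:30]
# Project the test set with the training-set rotation
test_pcs  <- predict(pca, newdata = test)[, 1:30]

dim(train_pcs) # 80 x 30
dim(test_pcs)  # 20 x 30
```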

The code I used is below.


### In-brief: splashr update + High Performance Scraping with splashr, furrr & TeamHG-Memex’s Aquarium

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The development version of splashr now support authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running and it’s best to use it in a Docker container.

If you have a large number of sites to scrape, working with splashr and Splash “as-is” can be a bit frustrating since there’s a limit to what a single instance can handle. Sure, it’s possible to set up your own highly available, multi-instance Splash cluster and use it, but that’s work. Thankfully, the folks behind TeamHG-Memex created Aquarium, which uses docker and docker-compose to stand up multiple Splash instances behind a pre-configured HAProxy instance so you can take advantage of parallel requests to the Splash API. As long as you have docker and docker-compose handy (and Python), following the steps on the aforelinked GitHub page should have you up and running with Aquarium in minutes. You use the same default port (8050) to access the Splash API, and you get a bonus port of 8036 to watch in your browser (the HAProxy stats page).

This works well when combined with furrr which is an R package that makes parallel operations very tidy.

One way to use furrr, splashr and Aquarium might look like this:

library(splashr)
library(HARtools)
library(urltools)
library(furrr)
library(tidyverse)

sites <- c("http://...", "http://...", ...)  # unique URLs to scrape

make_a_splash <- function(org_url) {
  splash(
    host = "ip/name of system you started aquarium on"
  ) %>%
    splash_response_body(TRUE) %>% # we want to get all the content
    splash_user_agent(ua_win10_ie11) %>% # splashr has many pre-configured user agents to choose from
    splash_go(org_url) %>%
    splash_wait(5) %>% # pick a reasonable timeout; modern web sites with javascript are bloated
    splash_har()
}

safe_splash <- safely(make_a_splash) # splashr/Splash work well but can throw errors. Let's be safe

plan(multiprocess, workers=5) # don't overwhelm the default setup or your internet connection

future_map(sites, ~{

  org <- safe_splash(.x) # go get it!

  if (is.null(org$result)) {
    sprintf("Error retrieving %s (%s)", .x, org$error$message) # this gives us good error messages
  } else {
    HARtools::writeHAR( # HAR format saves *everything*; the files are YUGE
      har = org$result,
      file = file.path("/place/to/store/stuff", sprintf("%s.har", domain(.x))) # saved with the base domain; you may want a UUID via uuid::UUIDgenerate()
    )
    sprintf("Successfully retrieved %s", .x)
  }

}) -> results

(Those with a keen eye will grok why splashr supports Splash API basic authentication, now)

The parallel iterator returns a list, which we can flatten to a character vector to check for errors. (I don’t flatten by default since it’s safer to get a list back: a list can hold anything, whereas map_chr() insists on proper objects.) Something like:

flatten_chr(results) %>%
keep(str_detect, "Error")
## [1] "Error retrieving www.1.example.com (Service Unavailable (HTTP 503).)"
## [2] "Error retrieving www.100.example.com (Gateway Timeout (HTTP 504).)"
## [3] "Error retrieving www.3000.example.com (Bad Gateway (HTTP 502).)"
## [4] "Error retrieving www.a.example.com (Bad Gateway (HTTP 502).)"
## [5] "Error retrieving www.z.examples.com (Gateway Timeout (HTTP 504).)"

Timeouts would suggest you may need to up the timeout parameter in your Splash call. Service unavailable or bad gateway errors may suggest you need to tweak the Aquarium configuration to add more workers or reduce your plan(…). It’s not unusual to have to create a scraping process that accounts for errors and retries a certain number of times.
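A retry loop of the kind just mentioned could be sketched as follows. This is a hypothetical helper, not code from the post; it assumes a safely()-wrapped scraper like the safe_splash() defined earlier.

```r
# Hypothetical retry helper (not from the post): call a safely()-wrapped
# scraper up to max_tries times, sleeping between attempts.
with_retries <- function(url, scraper, max_tries = 3, wait = 5) {
  res <- NULL
  for (i in seq_len(max_tries)) {
    res <- scraper(url)
    if (!is.null(res$result)) return(res)  # success: stop retrying
    Sys.sleep(wait * i)                    # linear back-off before the next try
  }
  res  # still a safely() result; res$error holds the last failure
}

# e.g. org <- with_retries("http://...", safe_splash)
```

Dropping this in place of the bare safe_splash(.x) call inside future_map() gives each worker its own retry budget.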

If you were stuck in the splashr/Splash slow-lane before, give this a try to help save you some time and frustration.


### The dangers of AVX-512 throttling: myth or reality?

Modern processors use many tricks to go faster. They are superscalar which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that they have instructions that can operate over wide registers (spanning 128, 256 or even 512 bits).

Regarding vectorization, Intel is currently ahead of the curve with its AVX-512 instruction sets. They have the only commodity-level processors able to work over 512-bit registers. AMD is barely doing 256 bits and your phone is limited to 128 bits.

The more you use your CPU, the more heat it produces. Intel does not want your CPU to burn out. So it throttles the CPU (makes it run slower). Your CPU stays warm but not too hot. When the processor does not have to use AVX-512 instructions, some of the silicon remains dark, and thus less heat is generated.

The cores that are executing SIMD instructions may run at a lower frequency. Thankfully, Intel supports per-core voltage and frequency control, and Intel processors can switch frequency very fast, certainly within a millisecond.

Vlad Krasnov from Cloudflare wrote a blog post last year warning us against AVX-512 throttling:

If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

I am sure that it is the case that AVX-512 can cause problems for some use cases. It is also the case that some people die if you give them aspirin; yet we don’t retire aspirin.

Should we really disable AVX-512 as a precautionary stance?

In an earlier blog post, I tried to measure this throttling on a server I own but found no effect whatsoever. It could be that this server does not have AVX-512 throttling or maybe I did not make use of sufficiently expensive instructions.

Vlad offered me a test case in C. His test case involves AVX-512 multiplications, while much of the running time is spent on a bubble-sort routine. It can run in both AVX-512 mode and in the regular (non-AVX-512) mode. To be clear, it is not meant as a demonstration in favour of AVX-512: it is meant to show that AVX-512 can be detrimental.

I did not want to run my tests using my own server this time. So I went to Packet and launched a powerful two-CPU Xeon Gold server (Intel Xeon Gold 5120 CPU @ 2.20GHz). Each of these processors has 14 cores, so we have 28 cores in total. Because of hyperthreading, the machine supports up to 56 hardware threads (two threads per core).

| Threads | Without AVX-512 | With AVX-512 |
|--------:|----------------:|-------------:|
| 20      | 8.4 s           | 7.2 s        |
| 40      | 6.0 s           | 5.0 s        |
| 80      | 5.7 s           | 4.7 s        |

In an earlier test, I was using an older compiler (GCC 5), and when exceeding the number of supported hardware threads (56) I was getting poorer results, especially with AVX-512. I am not sure why that would be, but I suspect it is not related to throttling; it might have to do with context switching and register initialization (though that is speculation on my part).

In my latest test, I use a more recent compiler (GCC 7). As you can see, the AVX-512 version is always faster. Otherwise, I see no negative effect from the application of AVX-512. If there is throttling, it appears that the benefits of AVX-512 offset it.

### New York halts new licences for ride-hailing cars

RIDE-HAILING companies like Uber and Lyft are loved by city dwellers but may be jamming roads. In midtown and lower Manhattan, cars have slowed from an average speed of 9.1mph (14.6kph) in 2010 to 7.1mph in 2017.

### Setting up your AI Dev Environment in 5 Minutes

Whether you're a novice data science enthusiast setting up TensorFlow for the first time, or a seasoned AI engineer working with terabytes of data, getting your libraries, packages, and frameworks installed is always a struggle. Learn how datmo, an open source python package, helps you get started in minutes.

### Evaluating Olive McBride with the Arkham Horror LCG Chaos Bag Simulator in R

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

(If you care, there may be spoilers in this post.)

## Introduction

I love Arkham Horror: The Card Game. I love it more than I really should; it’s ridiculously fun. It’s a cooperative card game where you build a deck representing a character in the Cthulhu mythos universe, and with that deck you play scenarios in a narrative campaign where you grapple with the horrors of the mythos. Your actions in (and between) scenarios have repercussions for how the campaign plays out, changing the story, and you use experience points accumulated in the scenarios to upgrade your deck with better cards.

The game is hard. Really hard. And I don’t mean it’s hard because it has a learning curve (although if you’re not used to deep games like this, it probably does have a learning curve). This is a Cthulhu game, which means that the mythos is going to do everything to beat you down. Your best-laid plans are going to get derailed. The worst thing that could possibly happen to you will happen. This is why this game is unreasonably fun: it’s a game for masochists.

And it makes a damn ton of good stories. I often want to tell the tale of how amazingly awful the events played out, or how I just snatched victory out of the jaws of defeat. Bloomberg recently published an article about the rise of board games in companies. If you’re in one of these companies where employees often play board games together after hours, you should consider adding Arkham Horror (the card game; I’ve never played the board game and I’ve heard mixed reviews) to the mix. It’s a great team builder.

Two elements make Arkham Horror so damn hard and unpredictable: the encounter deck and the chaos bag. The encounter deck is a deck that all players draw from every turn that spawns monsters and awful events. The chaos bag is used for the skill tests you make to advance your board state. You draw a token from the chaos bag, add the modifier of the revealed token to your skill value for that skill test (most of the modifiers are negative numbers), and check the new value against the difficulty of the test: if your modified skill value is at least as great as the difficulty of the test, you pass; otherwise, you fail. (You can pitch cards from your hand to increase your skill value before you reveal a token.)

(Image source: Ars Technica)

This is the chaos bag in a nutshell, but the chaos bag does more. The elder sign token represents good luck and often helps your board state; it’s considered a great success. But most icon tokens in the bag are trying to hurt you. There are four icon tokens: the skulls, the cultist, the tablets, and the elder things. These often not only apply a modifier but do something bad to you (which changes between scenarios).

And of course there’s this little bastard.

This is the auto-fail token. You just fail. Thanks for playing.

## Simulating the Chaos Bag

Bad things happen in Arkham Horror. Sometimes, though, they feel like they happen a lot. For example, you can draw two auto-fails in a row. Or three. I cannot overstate how devastating that can be in a game. Sometimes it feels like this particular game went unusually poorly, and unusually poor games seem to happen frequently.

That’s a contradiction, of course; unusually poor games cannot be the norm. I think this sense of bad games happening often emerges from humans’ fundamentally poor understanding of probability and how it actually works. I’m not just referring to the mathematics; people’s intuition of what should happen according to probability does not match what probability says happens. This phenomenon is well documented (see, for example, the book Irrationality, by Stuart Sutherland) and is one of the explanations for why people seemed to underestimate Donald Trump’s chances of winning the 2016 election (in short, unlikely events occur more frequently than people perceive; see also Nate Silver’s series of articles). In fact, you are more likely to find some pattern than no pattern at all (and in Arkham Horror, patterns are usually bad for you).

This perception of “unusually many auto-fails” (which, as a statistician, I know cannot be right no matter what my feelings say) prompted me to write an R function that generates a string of pulls from the chaos bag, each pull independent of the previous pull. Here’s the function:

chaos_bag_string_sim <- function(dicebag = list('E' = 1,  # Elder Sign
                                                'P' = 1,  # +1
                                                '0' = 2,
                                                '1' = 3,  # -1
                                                '2' = 2,  # -2
                                                '3' = 1,
                                                '4' = 1,
                                                'S' = 2,  # Skull
                                                'C' = 1,  # Cultist
                                                'T' = 1,  # Tablet
                                                'L' = 1,  # Elder Thing
                                                'A' = 1), # Autofail
                                 n = 24, replace = TRUE) {
  paste0(sample(names(dicebag), replace = replace,
                prob = as.vector(unlist(dicebag)),
                size = n), collapse = '')
}


Notice that the function takes a list, dicebag. This list uses one-character codes representing chaos bag tokens, and the actual contents of the list are numbers representing the frequency of that token in the bag. The bag that serves as a default is a fictitious chaos bag that I believe is representative of Arkham Horror on standard difficulty. Since most numbered tokens are negative, I denote the +1 token with 'P' and the negative tokens with their numbers.

How many pulls from the bag occur in a game? That's really hard to figure out; in fact, this will vary from scenario to scenario dramatically. So let's just make an educated guess to represent the "typical" game. A round consists of a mythos phase, then three actions per player, followed by a monsters phase and an upkeep phase. The mythos phase often prompts skill tests, and I would guess that the average number of skill tests made during each player's turn is two. We'll guess that about half of the draws from the encounter deck prompt skill tests. A game may last about 16 rounds. This adds up to about 40 skill tests per player per game. As a consequence, a four-player game (I mostly play multiplayer, and the more the merrier) would yield 160 skill tests.
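Spelling out that back-of-the-envelope arithmetic (my own restatement of the assumptions above, not code from the post):

```r
# Rough skill-test count per game, per the assumptions in the text
rounds           <- 16   # rounds in a typical game
tests_per_turn   <- 2    # skill tests during each player's own turn
encounter_tests  <- 0.5  # ~half of encounter draws (one per round) force a test
per_player       <- rounds * (tests_per_turn + encounter_tests)
four_player_game <- 4 * per_player
c(per_player = per_player, four_players = four_player_game)  # 40 and 160
```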

Let's simulate a string of 160 skill tests, each independent of the others.

(game <- chaos_bag_string_sim(n = 160))

[1] "311113E12121E2LS20A32A4ETT1SE3T3P0AT2S12PLSC033CP22A11CLL31L024P12ES24A1S0E
E3PSSS12C13T21224104L43PCT13TTCSSASE20SA1TT2SSC2CPC20012CC41S234PPEP101E0LS
1P1110TSS1"


I represented pulls from the chaos bag as a string because we can use regular expressions to find patterns in the string. For instance, we can use the expression AA to find all occurrences of double auto-fails in the string.

grepl("AA", game)

[1] FALSE


The grepl() function returns a boolean (logical) value identifying whether the pattern appeared in the string; in this game, no two auto-fails appeared in a row.

What about two auto-fails appearing, separated by two other tests? This is matched by the expression A..A (remember, . matches any character).

grepl("A..A", game)

[1] TRUE


We can take this further and use our simulator to estimate the probability events occur. After all, in probability, there are two ways to find the probability of events:

1. Create a probability model and use mathematics to compute the probability an event of interest occurs; for example, use a model of a die roll to compute the probability of rolling a 6
2. Perform the experiment many times and count how often the event of interest occurred; for example, roll a die 1000 times and count how many times it rolled a 6 (this is usually done on a computer)

While the latter approach produces estimates, these estimates are guaranteed to get better the more simulations are performed. In fact, if you want at least a 95% chance of your estimate being accurate to two decimal places, you could perform the simulation 10,000 times.
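That 10,000 figure is consistent with the usual normal-approximation margin of error for an estimated proportion, which is largest when the true probability is 0.5. A quick check (my addition, not from the post):

```r
# Worst-case 95% margin of error for an estimated probability:
# 1.96 * sqrt(p * (1 - p) / n), maximized at p = 0.5.
n   <- 10000
moe <- 1.96 * 0.5 / sqrt(n)
moe  # about 0.0098: accurate to roughly two decimal places
```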

Using this, we can estimate the probability of seeing two auto-fails in a row.

mean(grepl("AA", replicate(10000, {chaos_bag_string_sim(n = 160)})))

[1] 0.4019


There's about a 40% chance of seeing two auto-fail tokens in a single four-player game. That's pretty damn likely.
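As a sanity check on that estimate (my addition, not from the post), the same probability can be computed without simulation by a small dynamic program over the pulls, using the fact that the default bag holds 17 tokens, exactly one of which is the auto-fail:

```r
# Exact P(at least one "AA", i.e. two auto-fails in a row) in n independent
# pulls with auto-fail probability p. Track two states among "no AA yet":
# a = last pull was not an auto-fail, b = last pull was an auto-fail.
p_double_autofail <- function(n = 160, p = 1 / 17) {
  a <- 1 - p
  b <- p
  for (i in 2:n) {
    a_new <- (a + b) * (1 - p)  # any survivor followed by a non-A
    b_new <- a * p              # only a non-A survivor may draw an A
    a <- a_new
    b <- b_new
  }
  1 - (a + b)  # complement of "no AA anywhere"
}

p_double_autofail()  # close to the ~0.40 simulation estimate
```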

Let's ask a slightly different question: what is the average number of times we will see two auto-fails in a row in a game? For this we will want to dip into the stringr package, using the str_count() function.

library(stringr)

str_count(game, "AA")

[1] 0

str_count(game, "A..A")  # How many times did we see auto-fails separated by two tests?

[1] 1


Or we can ask how often we saw two "bad" tokens in a row. In Arkham Horror on standard difficulty, players often aim to pass a test when drawing a -2 or better. This means that "bad" tokens include -3, -4, auto-fail, and often the tablet and elder thing tokens, too. The regular expression pattern that can match "two bad things in a row" is [34TLA]{2} (translation: see either 3, 4, T, L, or A exactly two times).

str_count(game, "[34TLA]{2}")

[1] 14


How could we estimate the average number of times this would occur in a four-player game? The simulation trick still works. Pick a large number of simulations to get an accurate estimate; the larger, the more accurate (but also more work for your computer).

mean(str_count(replicate(10000, {chaos_bag_string_sim(n = 160)}), "[34TLA]{2}"))

[1] 10.6717


You can imagine that this can keep going and perhaps there are queries that you would like to see answered. Below I leave you with an R script that you can use to do these types of experiments. This script is designed to be run from a Unix-flavored command line, though you could source() the script into an R session to use the chaos_bag_string_sim() function interactively. Use regular expressions to define a pattern you are interested in (this is a good opportunity to learn regular expressions for those who want the exercise).

#!/usr/bin/Rscript
#######################################
# AHChaosBagSimulator.R
#######################################
# Curtis Miller
# 2018-08-03
# A script for simulating Arkham Horror's chaos bag
#######################################

chaos_bag_string_sim <- function(dicebag = list('E' = 1,  # Elder Sign
                                                'P' = 1,  # +1
                                                '0' = 2,
                                                '1' = 3,  # -1
                                                '2' = 2,  # -2
                                                '3' = 1,
                                                '4' = 1,
                                                'S' = 2,  # Skull
                                                'C' = 1,  # Cultist
                                                'T' = 1,  # Tablet
                                                'L' = 1,  # Elder Thing
                                                'A' = 1), # Autofail
                                 n = 24, replace = TRUE) {
  paste0(sample(names(dicebag), replace = replace,
                prob = as.vector(unlist(dicebag)),
                size = n), collapse = '')
}

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
install.packages("optparse")
require("optparse")
}

main <- function(pattern, average = FALSE, pulls = 24, replications = 10000,
                 no_replacement = FALSE, dicebag = "", help = FALSE) {

  if (dicebag != "") {
    dicebag_df <- read.csv(dicebag, stringsAsFactors = FALSE)
    dicebag_df$token <- as.character(dicebag_df$token)
    if (!is.numeric(dicebag_df$freq)) {stop("Dicebag freq must be integers.")}
    dicebag <- as.list(dicebag_df$freq)
    names(dicebag) <- dicebag_df$token
  } else {
    dicebag <- list('E' = 1,  # Elder Sign
                    'P' = 1,  # +1
                    '0' = 2,
                    '1' = 3,  # -1
                    '2' = 2,  # -2
                    '3' = 1,
                    '4' = 1,
                    'S' = 2,  # Skull
                    'C' = 1,  # Cultist
                    'T' = 1,  # Tablet
                    'L' = 1,  # Elder Thing
                    'A' = 1)  # Autofail
  }

  games <- replicate(replications,
                     {chaos_bag_string_sim(dicebag = dicebag, n = pulls,
                                           replace = !no_replacement)})

  if (average) {
    cat("Average occurrence of pattern:",
        mean(stringr::str_count(games, pattern)), "\n")
  } else {
    cat("Probability of occurrence of pattern:",
        mean(grepl(pattern, games)), "\n")
  }

  quit()
}

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = "Simulates the Arkham Horror LCG chaos bag.",
    option_list = list(
      make_option(c("--pattern", "-r"), type = "character",
                  help = "Pattern (regular expression) to match"),
      make_option(c("--average", "-a"), type = "logical",
                  action = "store_true", default = FALSE,
                  help = "If set, computes average number of occurrences"),
      make_option(c("--pulls", "-p"), type = "integer", default = 24,
                  help = "The number of pulls from the chaos bag"),
      make_option(c("--replications", "-N"), type = "integer", default = 10000,
                  help = "Number of replications"),
      make_option(c("--no-replacement", "-n"), type = "logical",
                  action = "store_true", default = FALSE,
                  help = "Draw tokens without replacement"),
      make_option(c("--dicebag", "-d"), type = "character", default = "",
                  help = "(Optional) dice bag distribution CSV file")
    )
  ))

  names(cl_args)[which(names(cl_args) == "no-replacement")] <- "no_replacement"
  do.call(main, cl_args)
}


Below is an example of a CSV file specifying a chaos bag.

token,freq
E,1
P,1
0,2
1,3
2,2
3,1
4,1
S,2
C,1
T,1
L,1
A,1

Below is some example usage.

$ chmod +x AHChaosBagSimulator.R  # Only need to do this once
$ ./AHChaosBagSimulator.R -r AA -p 160
Probability of occurrence of pattern: 0.4074
$ ./AHChaosBagSimulator.R -r "[34TLA]{2}" -a -p 160
Average occurrence of pattern: 10.6225


To see documentation, type ./AHChaosBagSimulator.R --help. Note that something must always be passed to -r; this is the pattern needed.

## Olive McBride

This tool was initially written as a way to explore chaos bag probabilities. I didn't consider the tool to be very useful. It simply helped make the point that unlikely events seem to happen frequently in Arkham Horror. However, I found a way to put the tool to practical use.

Recently, Arkham Horror's designers have been releasing cards that seem to demand more math to fully understand. My favorite Arkham Horror card reviewer, The Man from Leng, has fretted about this in some of his recent reviews about cards using the seal mechanic, which change the composition of the chaos bag by removing tokens from it.

I can only imagine how he would feel about a card like Olive McBride (one of the cards appearing in the upcoming Mythos Pack, "Heart of the Elders", to be released later this week.)

Olive McBride allows you to reveal three chaos tokens instead of one, and choose two of those tokens to resolve. This effect can be triggered any time you would reveal a chaos token.

I'll repeat that again: Olive can trigger any time a chaos token is revealed. Most of the time tokens are revealed during skill tests, but other events lead to dipping into the chaos bag too. Notably, The Dunwich Legacy leads to drawing tokens without taking skill tests, such as when gambling in "The House Always Wins", due to some encounter cards. This makes Olive McBride a "master gambler", since she can draw three tokens and pick the winning ones when gambling. (She almost breaks the scenario.) Additionally, Olive can be combined with cards like Ritual Candles and Grotesque Statue to further shift skill tests in your favor.

These are interesting situations but let's ignore combos like this for now. What does Olive do to your usual skill test? Specifically, what does she do to the modifiers?

Before going on, I need to address verbiage. When investigators are about to perform a skill test, they will typically make a statement like "I'm at +2 for this test". This means that the skill value of the investigator (after applying all modifiers) is two higher than the difficulty of the test. Thus, if the investigator draws a -2 or better, the investigator will pass the test; if the investigator draws a -3 or worse, the investigator fails. My play group does not say this; we say "I'm at -2 for this test," meaning that if the investigator sees a -2 or better from the bag, the investigator will pass. This is more intuitive to me, and I also think it translates more directly to the math.

Presumably when doing a skill test with Olive, if all we care about is passing the test, we will pick the two tokens drawn from the bag that have the best modifiers. We add the modifiers of those tokens together to get the final modifier. Whether this improves your odds of passing the test or not isn't immediately clear.

I've written an R function that simulates skill tests with Olive. With this we can estimate Olive's effects.

olive_sim <- function(translate = c("E" =  2, "P" =  1, "0" =  0, "1" = -1,
                                    "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                    "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                      N = 1000, dicebag = NULL) {
  dargs <- list(n = 3, replace = FALSE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <- replicate(N, {do.call(chaos_bag_string_sim, dargs)})
  vals <- sapply(pulls, function(p) {
    vecp <- strsplit(p, "")[[1]]
    vecnum <- translate[vecp]
    max(apply(combn(3, 2), 2, function(i) {sum(vecnum[i])}))
  })
  vals[which(is.nan(vals))] <- Inf

  return(vals)
}


The parameter translate gives a vector that translates codes for chaos tokens into numerical modifiers. Notice that the auto-fail token is assigned -Inf since it causes any test to fail. If we wanted, say, the elder sign token to auto-succeed (which is Mateo's elder sign effect), we could replace its translation with Inf. By default the function uses the dicebag provided with chaos_bag_string_sim(), but this can be changed.

Here's a single final modifier from a pull using Olive.

olive_sim(N = 1)

C01
 -1


Interestingly, the function returned a named vector, and the name corresponds to what was pulled. In this case, a cultist, 0, and -1 token were pulled; the resulting best modifier is -1. The function is already built to do this many times.

olive_sim(N = 5)

31S 2S1 LS1 120 C2E
 -3  -3  -3  -1   0


Below is a function that can simulate a lot of normal skill tests.

test_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                   "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                   "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                     N = 1000, dicebag = NULL) {
  dargs <- list(n = N, replace = TRUE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <- do.call(chaos_bag_string_sim, dargs)
  vecp <- strsplit(pulls, "")[[1]]
  vecnum <- translate[vecp]

  return(vecnum)
}


Here's a demonstration.

test_sim(N = 5)

 2  S  3  1  0
-2 -2 -3 -1  0


Finally, below is a function that, when given a vector of results like those above, produces a table of estimated probabilities of success at skill tests, given a vector of values the tests must beat in order to pass (that is, using the "I'm at -2" type of language).

est_success_table_gen <- function(res, values = (-10):3) {
  r <- sapply(values, function(v) {mean(res >= v)})
  names(r) <- values
  return(rev(r))
}


Let's give it a test run.

u <- test_sim()
est_success_table_gen(u)

    3     2     1     0    -1    -2    -3    -4    -5    -6    -7    -8    -9
0.000 0.057 0.127 0.250 0.420 0.657 0.759 0.873 0.931 0.931 0.931 0.931 0.931
  -10
0.931

as.matrix(est_success_table_gen(u))
[,1]
3   0.000
2   0.057
1   0.127
0   0.250
-1  0.420
-2  0.657
-3  0.759
-4  0.873
-5  0.931
-6  0.931
-7  0.931
-8  0.931
-9  0.931
-10 0.931


This represents the base probability of success. Let's see what Olive does to this table.

y <- olive_sim()
as.matrix(est_success_table_gen(y))

     [,1]
3   0.024
2   0.063
1   0.140
0   0.238
-1  0.423
-2  0.531
-3  0.709
-4  0.829
-5  0.911
-6  0.960
-7  0.983
-8  0.996
-9  1.000
-10 1.000


Perhaps I can make the relationship clearer with a plot.

plot(stepfun((-10):3, rev(c(0, est_success_table_gen(u)))), verticals = FALSE,
pch = 20, main = "Probability of Success", ylim = c(0, 1))
lines(stepfun((-10):3, rev(c(0, est_success_table_gen(y)))), verticals = FALSE,
pch = 20, col = "blue")


The black line is the estimated probability of success without Olive, and the blue line the same with Olive. (I tried reversing the $x$-axis, but for some reason I could not get good results.) What we see is:

• Olive improves the chances a "hail Mary" will succeed. If you need +1, +2, or more to succeed, Olive can help make that happen (though the odds still aren't great)
• Olive can help you guarantee a skill test will succeed. If you boost your skill value to very high numbers, Olive can effectively neuter the auto-fail token. That's a good feeling.
• Otherwise, Olive hurts your chances of success. Being at -2 is noticeably worse with Olive than without. However, most of the time she changes the probability of success too little to notice.

For most investigators, then, Olive doesn't do much to make her worth your while. But I think Olive makes a huge difference for two investigators: Father Mateo and Jim Culver.

Both of these investigators like some chaos bag tokens a lot. Father Mateo really likes the elder sign since it is an auto-success (in addition to other very good effects), while Jim Culver likes skulls since they are always 0.

What does Olive do for Father Mateo?

translate <- c("E" =  2, "P" =  1, "0" =  0, "1" = -1, "2" = -2, "3" = -3,
"4" = -4, "S" = -2, "C" = -3, "T" = -4, "L" = -5, "A" = -Inf)
# Mateo
mateo_translate <- translate; mateo_translate["E"] <- Inf
mateo_no_olive <- test_sim(translate = mateo_translate)
mateo_olive <- olive_sim(translate = mateo_translate)
plot(stepfun((-10):3,
rev(c(0, est_success_table_gen(mateo_no_olive)))),
verticals = FALSE,
pch = 20, main = "Mateo's Probabilities", ylim = c(0, 1))
lines(stepfun((-10):3,
rev(c(0, est_success_table_gen(mateo_olive)))),
verticals = FALSE, col = "blue", pch = 20)


as.matrix(est_success_table_gen(mateo_no_olive))

     [,1]
3   0.049
2   0.049
1   0.101
0   0.226
-1  0.393
-2  0.627
-3  0.745
-4  0.862
-5  0.928
-6  0.928
-7  0.928
-8  0.928
-9  0.928
-10 0.928

as.matrix(est_success_table_gen(mateo_olive))

     [,1]
3   0.177
2   0.177
1   0.218
0   0.278
-1  0.444
-2  0.554
-3  0.746
-4  0.842
-5  0.924
-6  0.972
-7  0.989
-8  0.997
-9  1.000
-10 1.000


Next up is Jim Culver.

# Culver
culver_translate <- translate; culver_translate["S"] <- 0
culver_no_olive <- test_sim(translate = culver_translate)
culver_olive <- olive_sim(translate = culver_translate)
plot(stepfun((-10):3,
rev(c(0, est_success_table_gen(culver_no_olive)))),
verticals = FALSE,
pch = 20, main = "Culver's Probabilities", ylim = c(0, 1))
lines(stepfun((-10):3,
rev(c(0, est_success_table_gen(culver_olive)))),
verticals = FALSE, col = "blue", pch = 20)


as.matrix(est_success_table_gen(culver_no_olive))

     [,1]
3   0.000
2   0.065
1   0.112
0   0.333
-1  0.530
-2  0.638
-3  0.761
-4  0.874
-5  0.936
-6  0.936
-7  0.936
-8  0.936
-9  0.936
-10 0.936

as.matrix(est_success_table_gen(culver_olive))

     [,1]
3   0.020
2   0.095
1   0.217
0   0.362
-1  0.564
-2  0.666
-3  0.803
-4  0.888
-5  0.951
-6  0.972
-7  0.992
-8  0.999
-9  1.000
-10 1.000


Olive helps these investigators succeed at skill tests more easily, especially Mateo. We haven't even taken account of the fact that good things happen when certain tokens appear for these investigators! Sealing tokens could also have a big impact on the distribution of the bag when combined with Olive McBride.

Again, there's a lot that could be touched on that I won't here, so I will share with you a script allowing you to do some of these analyses yourself.

#!/usr/bin/Rscript
#######################################
# AHOliveDistributionEstimator.R
#######################################
# Curtis Miller
# 2018-08-03
# Simulates the chaos bag distribution when applying Olive McBride
#######################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
install.packages("optparse")
require("optparse")
}

olive_sim <- function(translate = c("E" =  2, "P" =  1, "0" =  0, "1" = -1,
                                    "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                    "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                      N = 1000, dicebag = NULL) {
  dargs <- list(n = 3, replace = FALSE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <- replicate(N, {do.call(chaos_bag_string_sim, dargs)})
  vals <- sapply(pulls, function(p) {
    vecp <- strsplit(p, "")[[1]]
    vecnum <- translate[vecp]
    max(apply(combn(3, 2), 2, function(i) {sum(vecnum[i])}))
  })
  vals[which(is.nan(vals))] <- Inf

  return(vals)
}

test_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                   "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                   "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                     N = 1000, dicebag = NULL) {
  dargs <- list(n = N, replace = TRUE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <- do.call(chaos_bag_string_sim, dargs)
  vecp <- strsplit(pulls, "")[[1]]
  vecnum <- translate[vecp]

  return(vecnum)
}

est_success_table_gen <- function(res, values = (-10):3) {
r <-  sapply(values, function(v) {mean(res >= v)})
names(r) <- values
return(rev(r))
}

# Main function
# See definition of cl_args for functionality (help does nothing)
main <- function(dicebag = "", translate = "", replications = 10000,
plotfile = "", width = 6, height = 4, basic = FALSE,
oliveless = FALSE, lowest = -10, highest = 3,
chaos_bag_script = "AHChaosBagSimulator.R",
title = "Probability of Success",
pos = "topright", help = FALSE) {

source(chaos_bag_script)

if (dicebag != "") {
dicebag_df <- read.csv(dicebag, stringsAsFactors = FALSE)
dicebag <- as.numeric(dicebag_df$freq)
names(dicebag) <- dicebag_df$token
} else {
dicebag <- NULL
}

if (translate != "") {
translate_df <- read.csv(translate, stringsAsFactors = FALSE)
translate <- as.numeric(translate_df$mod)
names(translate) <- translate_df$token
} else {
translate <- c("E" =  2, "P" =  1, "0" =  0, "1" = -1,
"2" = -2, "3" = -3, "4" = -4, "S" = -2,
"C" = -3, "T" = -4, "L" = -5, "A" = -Inf)
}

if (any(names(translate) != names(dicebag))) {
stop("Name mismatch between translate and dicebag; check the token columns")
}

if (!oliveless) {
olive_res <- olive_sim(translate = translate, dicebag = dicebag,
N = replications)
cat("Table of success rate with Olive when at least X is needed:\n")
print(as.matrix(est_success_table_gen(olive_res, values = lowest:highest)))
cat("\n\n")
}

if (basic) {
basic_res <- test_sim(translate = translate, dicebag = dicebag,
N = replications)
cat("Table of success rate for basic test when at least X is needed:\n")
print(as.matrix(est_success_table_gen(basic_res, values = lowest:highest)))
cat("\n\n")
}

if (plotfile != "") {
png(plotfile, width = width, height = height, units = "in", res = 300)
if (basic) {
plot(stepfun(lowest:highest, rev(c(0, est_success_table_gen(basic_res)))),
verticals = FALSE, pch = 20, main = title,
ylim = c(0, 1))
if (!oliveless) {
lines(stepfun(lowest:highest,
rev(c(0, est_success_table_gen(olive_res)))),
verticals = FALSE, pch = 20, col = "blue")
legend(pos, legend = c("No Olive", "Olive"), col = c("black", "blue"),
lty = 1, pch = 20)
}
}
dev.off()
}
quit()
}

if (sys.nframe() == 0) {
cl_args <- parse_args(OptionParser(
description = "Estimate the chaos bag distribution with Olive McBride.",
option_list = list(
make_option(c("--dicebag", "-d"), type = "character", default = "",
help = "(Optional) dice bag distribution CSV file"),
make_option(c("--translate", "-t"), type = "character", default = "",
help = "(Optional) symbol numeric translation CSV file"),
make_option(c("--replications", "-r"), type = "integer",
default = 10000,
help = "Number of replications to perform"),
make_option(c("--plotfile", "-p"), type = "character", default = "",
help = paste("Where to save plot (if not set, no plot",
"will be saved)")),
make_option(c("--width", "-w"), type = "integer", default = 6,
help = "Width of plot (inches)"),
make_option(c("--height", "-H"), type = "integer", default = 4,
help = "Height of plot (inches)"),
make_option(c("--basic", "-b"), type = "logical",
action = "store_true", default = FALSE,
help = "Include results for test without Olive McBride"),
make_option(c("--oliveless", "-o"), type = "logical",
action = "store_true", default = FALSE,
help = "Don't include results using Olive McBride"),
make_option(c("--lowest", "-l"), type = "integer", default = -10,
help = "Lowest value to check"),
make_option(c("--highest", "-i"), type = "integer", default = 3,
help = "Highest value to check"),
make_option(c("--pos", "-s"), type = "character",
default = "topright", help = "Position of legend"),
make_option(c("--title", "-m"), type = "character",
default = "Probability of Success",
help = "Title of plot"),
make_option(c("--chaos-bag-script", "-c"), type = "character",
default = "AHChaosBagSimulator.R",
help = paste("Location of the R file containing",
"chaos_bag_string_sim() definition; by",
"default, assumed to be in the working",
"directory in AHChaosBagSimulator.R"))
)
))

names(cl_args)[which(
names(cl_args) == "chaos-bag-script")] = "chaos_bag_script"

do.call(main, cl_args)
}


You've already seen an example CSV file for defining the dice bag; here's a file for defining what each token is worth.

token,mod
E,2
P,1
0,0
1,-1
2,-2
3,-3
4,-4
S,-2
C,-3
T,-4
L,-5
A,-Inf


Make the script executable and get help like so:

$ chmod +x AHOliveDistributionEstimator.R
$ ./AHOliveDistributionEstimator.R -h


## Conclusion

If you've never heard of this game I love, hopefully you've heard of it now. Give the game a look! And if you have heard of this game before, I hope you learned something from this post. If you're a reviewer, perhaps I've given you some tools you could use to help evaluate some of Arkham Horror's more intractable cards.

My final thoughts on Olive: she's not going to displace Arcane Initiate's spot as the best Mystic ally, but she could do well in certain decks. Specifically, I think that Mateo decks and Jim Culver decks planning on running Song of the Dead will want to run her since there are specific chaos tokens those decks want to see; the benefits of Olive extend beyond changing the probability of success. Most of the time you will not want to make a hail Mary skill test and you won't have the cards to push your skill value to a point where anything but the auto-fail is a success, so most of the time Olive will hurt your chances rather than help you, if she has any effect at all. Thus a typical Mystic likely won't find Olive interesting… but some decks will love her.

By the way, if you are in the Salt Lake City area of Utah, I play Arkham Horror LCG at Mind Games, LLC. While I haven't been to many other stores, Mind Games seems to have the best stocking of Arkham Horror (as well as two other Fantasy Flight LCGs, A Game of Thrones and Legend of the Five Rings). Every Tuesday night (the store's late night) a group of card game players come in to play; consider joining us! In addition, Mind Games likely has the best deal when it comes to buying the game; whenever you spend $50 on product, you get an additional $10 off (or, alternatively, you take $10 off for every $60 you spend). Thus Mind Games could be the cheapest way to get the game (without going second-hand or Internet deal hunting).

(Big image credit: Aurore Folney and FFG.)

EDIT: WordPress.com does not like R code and garbled some of the code in the script for Olive simulation. It should be correct now. If anyone spots other errors, please notify me; I will fix them.

I have created a video course published by Packt Publishing entitled Applications of Statistical Learning with Python, the fourth volume in a four-volume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course discusses how to use Python for data science, emphasizing application. It starts with introducing natural language processing and computer vision. It ends with two case studies; in one, I train a classifier to detect spam e-mail, while in the other, I train a computer vision system to detect emotions in faces. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.

1. You can also play scenarios stand-alone, which isn’t nearly as fun as playing in a campaign, especially one of the eight-scenario campaigns.
2. This is not right because sometimes you redraw tokens from the bag without replacement; we'll ignore that case for now.
3. Using the asymptotically appropriate confidence interval based on the Normal distribution (as described here), the margin of error will not exceed $\frac{1}{\sqrt{n}}$ since $\sqrt{\hat{p}(1 - \hat{p})} \leq \frac{1}{2}$. Thus, we have $m \leq \frac{1}{\sqrt{n}}$; solving this for $n$ yields $n \geq m^{-2}$. So if we want a 95% chance that the margin of error will not exceed $m = 0.01 = \frac{1}{100}$, we need a dataset of size $n \geq (0.01)^{-2} = 10,000$.
4. Note that the earlier calculations that argued the odds of being accurate up to two decimal places were high no longer hold, since many more numbers are being estimated, and one estimate is not independent of the other. That said, the general principle of more simulations leading to better accuracy still holds.
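The sample-size bound in footnote 3 is easy to double-check numerically; this is a minimal sketch of the arithmetic, nothing more:

```python
import math

m = 0.01                 # desired margin of error
n = math.ceil(m ** -2)   # required sample size from n >= m^(-2)
print(n)                 # -> 10000

# the worst-case margin at that sample size is 1/sqrt(n)
worst_margin = 1 / math.sqrt(n)
print(worst_margin)      # -> 0.01
```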


### OpenCV People Counter

In this tutorial you will learn how to build a “people counter” with OpenCV and Python. Using OpenCV, we’ll count the number of people who are heading “in” or “out” of a department store in real-time.

Building a person counter with OpenCV has been one of the most-requested topics here on the PyImageSearch blog and I’ve been meaning to do a blog post on people counting for a year now — I’m incredibly thrilled to be publishing it and sharing it with you today.

Enjoy the tutorial and let me know what you think in the comments section at the bottom of the post!

To get started building a people counter with OpenCV, just keep reading!


## OpenCV People Counter with Python

In the first part of today’s blog post, we’ll be discussing the required Python packages you’ll need to build our people counter.

From there I’ll provide a brief discussion on the difference between object detection and object tracking, along with how we can leverage both to create a more accurate people counter.

Afterwards, we’ll review the directory structure for the project and then implement the entire person counting project.

Finally, we’ll examine the results of applying people counting with OpenCV to actual videos.

### Required Python libraries for people counting

In order to build our people counting application, we’ll need a number of different components, including:

1. My special `pyimagesearch` module, which we’ll implement and use later in this post
2. The Python driver script used to start the people counter
3. All example videos used here in the post

I’m going to assume you already have NumPy, OpenCV, and dlib installed on your system.

If you don’t have OpenCV installed, you’ll want to head to my OpenCV install page and follow the relevant tutorial for your particular operating system.

If you need to install dlib, you can use this guide.

$ pip install --upgrade imutils

### Understanding object detection vs. object tracking

There is a fundamental difference between object detection and object tracking that you must understand before we proceed with the rest of this tutorial.

When we apply object detection we are determining where in an image/frame an object is. An object detector is also typically more computationally expensive, and therefore slower, than an object tracking algorithm. Examples of object detection algorithms include Haar cascades, HOG + Linear SVM, and deep learning-based object detectors such as Faster R-CNNs, YOLO, and Single Shot Detectors (SSDs).

An object tracker, on the other hand, will accept the input (x, y)-coordinates of where an object is in an image and will:

1. Assign a unique ID to that particular object
2. Track the object as it moves around a video stream, predicting the new object location in the next frame based on various attributes of the frame (gradient, optical flow, etc.)

Examples of object tracking algorithms include MedianFlow, MOSSE, GOTURN, kernelized correlation filters, and discriminative correlation filters, to name a few.

If you’re interested in learning more about the object tracking algorithms built into OpenCV, be sure to refer to this blog post.

### Combining both object detection and object tracking

Highly accurate object trackers will combine the concept of object detection and object tracking into a single algorithm, typically divided into two phases:

• Phase 1 — Detecting: During the detection phase we are running our computationally more expensive object detector to (1) detect if new objects have entered our view, and (2) see if we can find objects that were “lost” during the tracking phase. For each detected object we create or update an object tracker with the new bounding box coordinates. Since our object detector is more computationally expensive, we only run this phase once every N frames.
• Phase 2 — Tracking: When we are not in the “detecting” phase we are in the “tracking” phase. For each of our detected objects, we create an object tracker to track the object as it moves around the frame. Our object tracker should be faster and more efficient than the object detector. We’ll continue tracking until we’ve reached the N-th frame and then re-run our object detector. The entire process then repeats.

The benefit of this hybrid approach is that we can apply highly accurate object detection methods without as much of the computational burden. We will be implementing such a tracking system to build our people counter.

### Project structure

Let’s review the project structure for today’s blog post. Once you’ve grabbed the code from the “Downloads” section, you can inspect the directory structure with the tree command:

$ tree --dirsfirst
.
├── pyimagesearch
│   ├── __init__.py
│   ├── centroidtracker.py
│   └── trackableobject.py
├── mobilenet_ssd
│   ├── MobileNetSSD_deploy.caffemodel
│   └── MobileNetSSD_deploy.prototxt
├── videos
│   ├── example_01.mp4
│   └── example_02.mp4
├── output
│   ├── output_01.avi
│   └── output_02.avi
└── people_counter.py

4 directories, 10 files

Zeroing in on the two most important directories, we have:

1. `pyimagesearch/`: This module contains the centroid tracking algorithm. The centroid tracking algorithm is covered in the “Combining object tracking algorithms” section, but the code is not. For a review of the centroid tracking code (`centroidtracker.py`) you should refer to the first post in the series.
2. `mobilenet_ssd/`: Contains the Caffe deep learning model files. We’ll be using a MobileNet Single Shot Detector (SSD), which is covered at the top of this blog post in the section, “Single Shot Detectors for object detection”.

The heart of today’s project is contained within the `people_counter.py` script — that’s where we’ll spend most of our time. We’ll also review the `trackableobject.py` script today.

### Combining object tracking algorithms

Figure 1: An animation demonstrating the steps in the centroid tracking algorithm.

To implement our people counter we’ll be using both OpenCV and dlib. We’ll use OpenCV for standard computer vision/image processing functions, along with the deep learning object detector for people counting.

We’ll then use dlib for its implementation of correlation filters. We could use OpenCV here as well; however, the dlib object tracking implementation was a bit easier to work with for this project.

I’ll be including a deep dive into dlib’s object tracking algorithm in next week’s post.

Along with dlib’s object tracking implementation, we’ll also be using my implementation of centroid tracking from a few weeks ago. Reviewing the entire centroid tracking algorithm is outside the scope of this blog post, but I’ve included a brief overview below.

At Step #1 we accept a set of bounding boxes and compute their corresponding centroids (i.e., the center of the bounding boxes):

Figure 2: To build a simple object tracking via centroids script with Python, the first step is to accept bounding box coordinates and use them to compute centroids.

The bounding boxes themselves can be provided by either:

1. An object detector (such as HOG + Linear SVM, Faster R-CNN, SSDs, etc.)
2. Or an object tracker (such as correlation filters)

In the above image you can see that we have two objects to track in this initial iteration of the algorithm.

During Step #2 we compute the Euclidean distance between any new centroids (yellow) and existing centroids (purple):

Figure 3: Three objects are present in this image. We need to compute the Euclidean distance between each pair of original centroids (red) and new centroids (green).

The centroid tracking algorithm makes the assumption that pairs of centroids with minimum Euclidean distance between them must be the same object ID.

In the example image above we have two existing centroids (purple) and three new centroids (yellow), implying that a new object has been detected (since there is one more new centroid vs. old centroid).

The arrows then represent computing the Euclidean distances between all purple centroids and all yellow centroids.

Once we have the Euclidean distances we attempt to associate object IDs in Step #3:

Figure 4: Our simple centroid object tracking method has associated objects with minimized object distances. What do we do about the object in the bottom-left though?

In Figure 4 you can see that our centroid tracker has chosen to associate centroids that minimize their respective Euclidean distances.

But what about the point in the bottom-left?

It didn’t get associated with anything — what do we do?

To answer that question we need to perform Step #4, registering new objects:

Figure 5: In our object tracking example, we have a new object that wasn’t matched with an existing object, so it is registered as object ID #3.

Registering simply means that we are adding the new object to our list of tracked objects by:

1. Assigning it a new object ID
2. Storing the centroid of the bounding box coordinates for the new object
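Steps #2 through #4 boil down to a distance matrix, a nearest-neighbour association, and registering whatever is left over. Here is a toy sketch of that idea; the coordinates are invented, and the greedy closest-match below glosses over the conflict handling the real centroid tracker performs:

```python
import math

# two existing centroids (purple) and three new centroids (yellow)
existing = [(100, 100), (300, 200)]
new = [(105, 108), (295, 205), (50, 400)]

def dist(a, b):
    """Euclidean distance between two centroids."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Step #2/#3: each existing object claims the closest new centroid
matches = [min(range(len(new)), key=lambda j: dist(c, new[j]))
           for c in existing]
print(matches)  # -> [0, 1]

# Step #4: any new centroid left unmatched is registered as a new object
unmatched = sorted(set(range(len(new))) - set(matches))
print(unmatched)  # -> [2]
```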

In the event that an object has been lost or has left the field of view, we can simply deregister the object (Step #5).

Exactly how you handle when an object is “lost” or is “no longer visible” really depends on your exact application, but for our people counter, we will deregister people IDs when they cannot be matched to any existing person objects for 40 consecutive frames.
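The deregistration rule amounts to a simple per-object counter. A minimal sketch — the 40-frame threshold mirrors the rule just described, while the object IDs and centroids are made up:

```python
# bookkeeping used to deregister objects that stay unmatched too long
maxDisappeared = 40
objects = {0: (120, 80), 1: (340, 210)}  # objectID -> last known centroid
disappeared = {0: 0, 1: 0}               # objectID -> consecutive unmatched frames

def mark_unmatched(objectID):
    """Call once per frame for every object with no matching centroid."""
    disappeared[objectID] += 1
    if disappeared[objectID] > maxDisappeared:
        del objects[objectID]
        del disappeared[objectID]

# object 1 goes unmatched for 41 consecutive frames and is deregistered
for _ in range(41):
    mark_unmatched(1)

print(sorted(objects))  # -> [0]
```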

Again, this is only a brief overview of the centroid tracking algorithm.

Note: For a more detailed review, including an explanation of the source code used to implement centroid tracking, be sure to refer to this post.

### Creating a “trackable object”

In order to track and count an object in a video stream, we need an easy way to store information regarding the object itself, including:

• Its object ID
• Its previous centroids (so we can easily compute the direction the object is moving)
• Whether or not the object has already been counted

To accomplish all of these goals we can define an instance of `TrackableObject` — open up the `trackableobject.py` file and insert the following code:

class TrackableObject:
	def __init__(self, objectID, centroid):
		# store the object ID, then initialize a list of centroids
		# using the current centroid
		self.objectID = objectID
		self.centroids = [centroid]

		# initialize a boolean used to indicate if the object has
		# already been counted or not
		self.counted = False

The `TrackableObject` constructor accepts an `objectID` and a `centroid` and stores them. The `centroids` variable is a list because it will contain an object’s centroid location history.

The constructor also initializes `counted` as `False`, indicating that the object has not been counted yet.
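As a quick illustration of how this class gets used later, each tracked person accumulates a centroid history that the counting logic can inspect for direction. The class is restated here so the snippet runs standalone, and the centroid values are hypothetical:

```python
class TrackableObject:
    def __init__(self, objectID, centroid):
        self.objectID = objectID
        self.centroids = [centroid]
        self.counted = False

# a person first detected at (250, 300), then tracked moving up the frame
to = TrackableObject(7, (250, 300))
to.centroids.append((251, 282))
to.centroids.append((249, 265))

# negative difference in y (current minus mean of history) means "up"
y = [c[1] for c in to.centroids]
direction = to.centroids[-1][1] - sum(y) / len(y)
print(direction < 0)  # -> True
```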

### Implementing our people counter with OpenCV + Python

With all of our supporting Python helper tools and classes in place, we are now ready to build our OpenCV people counter.

Open up your `people_counter.py` file and insert the following code:

# import the necessary packages
from pyimagesearch.centroidtracker import CentroidTracker
from pyimagesearch.trackableobject import TrackableObject
from imutils.video import VideoStream
from imutils.video import FPS
import numpy as np
import argparse
import imutils
import time
import dlib
import cv2

We begin by importing our necessary packages:

• From the `pyimagesearch` module, we import our custom `CentroidTracker` and `TrackableObject` classes.
• The `VideoStream` and `FPS` modules from `imutils.video` will help us to work with a webcam and to calculate the estimated Frames Per Second (FPS) throughput rate.
• We need `imutils` for its OpenCV convenience functions.
• The `dlib` library will be used for its correlation tracker implementation.
• OpenCV will be used for deep neural network inference, opening video files, writing video files, and displaying output frames to our screen.

Now that all of the tools are at our fingertips, let’s parse command line arguments:

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-p", "--prototxt", required=True,
	help="path to Caffe 'deploy' prototxt file")
ap.add_argument("-m", "--model", required=True,
	help="path to Caffe pre-trained model")
ap.add_argument("-i", "--input", type=str,
	help="path to optional input video file")
ap.add_argument("-o", "--output", type=str,
	help="path to optional output video file")
ap.add_argument("-c", "--confidence", type=float, default=0.4,
	help="minimum probability to filter weak detections")
ap.add_argument("-s", "--skip-frames", type=int, default=30,
	help="# of skip frames between detections")
args = vars(ap.parse_args())

We have six command line arguments which allow us to pass information to our people counter script from the terminal at runtime:

• `--prototxt`: Path to the Caffe “deploy” prototxt file.
• `--model`: The path to the Caffe pre-trained CNN model.
• `--input`: Optional input video file path. If no path is specified, your webcam will be utilized.
• `--output`: Optional output video path. If no path is specified, a video will not be recorded.
• `--confidence`: With a default value of `0.4`, this is the minimum probability threshold which helps to filter out weak detections.
• `--skip-frames`: The number of frames to skip before running our DNN detector again on the tracked object. Remember, object detection is computationally expensive, but it does help our tracker to reassess objects in the frame. By default we skip `30` frames between detecting objects with the OpenCV DNN module and our CNN single shot detector model.

Now that our script can dynamically handle command line arguments at runtime, let’s prepare our SSD:

# initialize the list of class labels MobileNet SSD was trained to
# detect
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
"bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
"dog", "horse", "motorbike", "person", "pottedplant", "sheep",
"sofa", "train", "tvmonitor"]

# load our serialized model from disk
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])

First, we’ll initialize `CLASSES` — the list of classes that our SSD supports. This list should not be changed if you’re using the model provided in the “Downloads” section. We’re only interested in the “person” class, but you could count other moving objects as well (however, if your “pottedplant”, “sofa”, or “tvmonitor” grows legs and starts moving, you should probably run out of your house screaming rather than worrying about counting them!).

On Line 38 we load our pre-trained MobileNet SSD used to detect objects (but again, we’re just interested in detecting and tracking people, not any other class). To learn more about MobileNet and SSDs, please refer to my previous blog post.
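For reference, “person” sits at index 15 of the `CLASSES` list above, which is why the class-label check later in the script can simply index into this list:

```python
# the MobileNet SSD class labels, copied from the script above
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
           "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
           "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
           "sofa", "train", "tvmonitor"]

print(CLASSES.index("person"))  # -> 15
```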

From there we can initialize our video stream:

# if a video path was not supplied, grab a reference to the webcam
if not args.get("input", False):
	print("[INFO] starting video stream...")
	vs = VideoStream(src=0).start()
	time.sleep(2.0)

# otherwise, grab a reference to the video file
else:
	print("[INFO] opening video file...")
	vs = cv2.VideoCapture(args["input"])

First we handle the case where we’re using a webcam video stream (Lines 41-44). Otherwise, we’ll be capturing frames from a video file (Lines 47-49).

We still have a handful of initializations to perform before we begin looping over frames:

# initialize the video writer (we'll instantiate later if need be)
writer = None

# initialize the frame dimensions (we'll set them as soon as we read
# the first frame from the video)
W = None
H = None

# instantiate our centroid tracker, then initialize a list to store
# each of our dlib correlation trackers, followed by a dictionary to
# map each unique object ID to a TrackableObject
ct = CentroidTracker(maxDisappeared=40, maxDistance=50)
trackers = []
trackableObjects = {}

# initialize the total number of frames processed thus far, along
# with the total number of objects that have moved either up or down
totalFrames = 0
totalDown = 0
totalUp = 0

# start the frames per second throughput estimator
fps = FPS().start()

The remaining initializations include:

• `writer`: Our video writer. We’ll instantiate this object later if we are writing to video.
• `W` and `H`: Our frame dimensions. We’ll need to plug these into `cv2.VideoWriter`.
• `ct`: Our `CentroidTracker`. For details on the implementation of `CentroidTracker`, be sure to refer to my blog post from a few weeks ago.
• `trackers`: A list to store the dlib correlation trackers. To learn about dlib correlation tracking, stay tuned for next week’s post.
• `trackableObjects`: A dictionary which maps an `objectID` to a `TrackableObject`.
• `totalFrames`: The total number of frames processed.
• `totalDown` and `totalUp`: The total number of objects/people that have moved either down or up. These variables measure the actual “people counting” results of the script.
• `fps`: Our frames per second estimator for benchmarking.

Note: If you get lost in the `while` loop below, you should refer back to this bulleted listing of important variables.

Now that all of our initializations are taken care of, let’s loop over incoming frames:

# loop over frames from the video stream
while True:
	# grab the next frame and handle if we are reading from either
	# VideoCapture or VideoStream
	frame = vs.read()
	frame = frame[1] if args.get("input", False) else frame

	# if we are viewing a video and we did not grab a frame then we
	# have reached the end of the video
	if args["input"] is not None and frame is None:
		break

	# resize the frame to have a maximum width of 500 pixels (the
	# less data we have, the faster we can process it), then convert
	# the frame from BGR to RGB for dlib
	frame = imutils.resize(frame, width=500)
	rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

	# if the frame dimensions are empty, set them
	if W is None or H is None:
		(H, W) = frame.shape[:2]

	# if we are supposed to be writing a video to disk, initialize
	# the writer
	if args["output"] is not None and writer is None:
		fourcc = cv2.VideoWriter_fourcc(*"MJPG")
		writer = cv2.VideoWriter(args["output"], fourcc, 30,
			(W, H), True)

We begin looping on Line 76. At the top of the loop we grab the next `frame` (Lines 79 and 80). In the event that we’ve reached the end of the video, we’ll `break` out of the loop (Lines 84 and 85).

Preprocessing the `frame` takes place on Lines 90 and 91. This includes resizing and swapping color channels, as dlib requires an `rgb` image.

We grab the dimensions of the `frame` for the video `writer` (Lines 94 and 95).

From there we’ll instantiate the video `writer` if an output path was provided via command line argument (Lines 99-102). To learn more about writing video to disk, be sure to refer to this post.

Now let’s detect people using the SSD:

	# initialize the current status along with our list of bounding
	# box rectangles returned by either (1) our object detector or
	# (2) the correlation trackers
	status = "Waiting"
	rects = []

	# check to see if we should run a more computationally expensive
	# object detection method to aid our tracker
	if totalFrames % args["skip_frames"] == 0:
		# set the status and initialize our new set of object trackers
		status = "Detecting"
		trackers = []

		# convert the frame to a blob and pass the blob through the
		# network and obtain the detections
		blob = cv2.dnn.blobFromImage(frame, 0.007843, (W, H), 127.5)
		net.setInput(blob)
		detections = net.forward()

We initialize a `status` of “Waiting” on Line 107. Possible `status` states include:

• Waiting: In this state, we’re waiting on people to be detected and tracked.
• Detecting: We’re actively in the process of detecting people using the MobileNet SSD.
• Tracking: People are being tracked in the frame and we’re counting the `totalUp` and `totalDown`.

Our `rects` list will be populated either via detection or tracking. We go ahead and initialize `rects` on Line 108.

It’s important to understand that deep learning object detectors are very computationally expensive, especially if you are running them on your CPU.

To avoid running our object detector on every frame, and to speed up our tracking pipeline, we’ll be skipping N frames (set by the command line argument `--skip-frames`, where `30` is the default). Only every N frames will we exercise our SSD for object detection. Otherwise, we’ll simply be tracking moving objects in between.

Using the modulo operator on Line 112 we ensure that we’ll only execute the code in the if-statement every N frames.
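The effect of the modulo test is easy to see in isolation: over 90 frames with the default of 30 skip-frames, the expensive detector runs only three times. This is a toy sketch, not the actual script:

```python
SKIP_FRAMES = 30  # default value of the --skip-frames argument

# frames on which the expensive SSD detector actually runs
detect_frames = [f for f in range(90) if f % SKIP_FRAMES == 0]
print(detect_frames)       # -> [0, 30, 60]
print(len(detect_frames))  # -> 3
```

Every other frame falls through to the cheaper correlation-tracking branch.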

Assuming we’ve landed on a multiple of `skip_frames`, we’ll update the `status` to “Detecting” (Line 114).

Then we initialize our new list of `trackers` (Line 115).

Next, we’ll perform inference via object detection. We begin by creating a `blob` from the image, followed by passing the `blob` through the net to obtain `detections` (Lines 119-121).

Now we’ll loop over each of the `detections` in hopes of finding objects belonging to the “person” class:

		# loop over the detections
		for i in np.arange(0, detections.shape[2]):
			# extract the confidence (i.e., probability) associated
			# with the prediction
			confidence = detections[0, 0, i, 2]

			# filter out weak detections by requiring a minimum
			# confidence
			if confidence > args["confidence"]:
				# extract the index of the class label from the
				# detections list
				idx = int(detections[0, 0, i, 1])

				# if the class label is not a person, ignore it
				if CLASSES[idx] != "person":
					continue

Looping over the `detections` on Line 124, we proceed to grab the `confidence` (Line 127) and filter out weak results, as well as those that don’t belong to the “person” class (Lines 131-138).
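The indexing above is easier to follow once you know each detection row stores `[batch_id, class_id, confidence, startX, startY, endX, endY]` with normalized coordinates. A toy stand-in for the SSD output (the values are invented) shows what the filter keeps:

```python
CONFIDENCE = 0.4
PERSON_IDX = 15  # index of "person" in the CLASSES list

# toy stand-in for the SSD's (1, 1, N, 7) output tensor:
# each row is [batch_id, class_id, confidence, startX, startY, endX, endY]
detections = [[[
    [0, 15, 0.92, 0.10, 0.20, 0.30, 0.90],  # confident person -> kept
    [0, 12, 0.85, 0.40, 0.10, 0.60, 0.70],  # dog -> rejected by class
    [0, 15, 0.20, 0.70, 0.30, 0.80, 0.95],  # weak person -> rejected by confidence
]]]

kept = []
for i in range(len(detections[0][0])):
    confidence = detections[0][0][i][2]
    if confidence > CONFIDENCE:
        idx = int(detections[0][0][i][1])
        if idx == PERSON_IDX:
            kept.append(i)

print(kept)  # -> [0]
```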

Now we can compute a bounding box for each person and begin correlation tracking:

				# compute the (x, y)-coordinates of the bounding box
				# for the object
				box = detections[0, 0, i, 3:7] * np.array([W, H, W, H])
				(startX, startY, endX, endY) = box.astype("int")

				# construct a dlib rectangle object from the bounding
				# box coordinates and then start the dlib correlation
				# tracker
				tracker = dlib.correlation_tracker()
				rect = dlib.rectangle(startX, startY, endX, endY)
				tracker.start_track(rgb, rect)

				# add the tracker to our list of trackers so we can
				# utilize it during skip frames
				trackers.append(tracker)

Computing our bounding `box` takes place on Lines 142 and 143.

Then we instantiate our dlib correlation `tracker` on Line 148, followed by passing in the object’s bounding box coordinates to `dlib.rectangle`, storing the result as `rect` (Line 149).

Subsequently, we start tracking on Line 150 and append the `tracker` to the `trackers` list on Line 154.

That’s a wrap for all operations we do every N skip-frames!

Let’s take care of the typical operations, where tracking is taking place, in the `else` block:

	# otherwise, we should utilize our object *trackers* rather than
	# object *detectors* to obtain a higher frame processing throughput
	else:
		# loop over the trackers
		for tracker in trackers:
			# set the status of our system to be 'tracking' rather
			# than 'waiting' or 'detecting'
			status = "Tracking"

			# update the tracker and grab the updated position
			tracker.update(rgb)
			pos = tracker.get_position()

			# unpack the position object
			startX = int(pos.left())
			startY = int(pos.top())
			endX = int(pos.right())
			endY = int(pos.bottom())

			# add the bounding box coordinates to the rectangles list
			rects.append((startX, startY, endX, endY))

Most of the time, we aren’t landing on a skip-frame multiple. During this time, we’ll utilize our `trackers` to track our object rather than applying detection.

We begin looping over the available `trackers` on Line 160.

We proceed to update the `status` to “Tracking” (Line 163) and grab the object position (Lines 166 and 167).

From there we extract the position coordinates (Lines 170-173), followed by populating the information in our `rects` list.

Now let’s draw a horizontal visualization line (that people must cross in order to be tracked) and use the centroid tracker to update our object centroids:

# draw a horizontal line in the center of the frame -- once an
# object crosses this line we will determine whether they were
# moving 'up' or 'down'
cv2.line(frame, (0, H // 2), (W, H // 2), (0, 255, 255), 2)

# use the centroid tracker to associate the (1) old object
# centroids with (2) the newly computed object centroids
objects = ct.update(rects)

On Line 181 we draw the horizontal line which we’ll use to visualize people “crossing”. Once people cross this line, we’ll increment our respective counters.

Then on Line 185, we utilize our `CentroidTracker` instantiation to accept the list of `rects`, regardless of whether they were generated via object detection or object tracking. Our centroid tracker will associate object IDs with object locations.
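Under the hood, centroid tracking boils down to matching old centroids to new ones by Euclidean distance. The following is a simplified, hypothetical sketch of that association step (the real `CentroidTracker` also handles registering new objects and deregistering disappeared ones):

```python
import numpy as np

def match_centroids(old_centroids, new_rects):
    """Greedily associate existing object IDs with new bounding boxes
    by smallest Euclidean distance (simplified sketch)."""
    # compute the centroid of each new bounding box
    new_centroids = np.array(
        [((sx + ex) // 2, (sy + ey) // 2) for (sx, sy, ex, ey) in new_rects])
    assignments = {}
    used = set()
    for object_id, (cx, cy) in old_centroids.items():
        # distance from this object's old centroid to every new centroid
        d = np.linalg.norm(new_centroids - np.array([cx, cy]), axis=1)
        # pick the closest new centroid not already claimed
        for j in np.argsort(d):
            if int(j) not in used:
                used.add(int(j))
                assignments[object_id] = tuple(int(v) for v in new_centroids[j])
                break
    return assignments

old = {0: (100, 100), 1: (300, 200)}
rects = [(290, 190, 330, 230), (90, 95, 120, 115)]
print(match_centroids(old, rects))  # {0: (105, 105), 1: (310, 210)}
```

Note that objects 0 and 1 keep their IDs regardless of the order in which the bounding boxes arrive, because each ID is matched to its nearest new centroid.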

In this next block, we’ll review the logic which counts if a person has moved up or down through the frame:

# loop over the tracked objects
for (objectID, centroid) in objects.items():
    # check to see if a trackable object exists for the current
    # object ID
    to = trackableObjects.get(objectID, None)

    # if there is no existing trackable object, create one
    if to is None:
        to = TrackableObject(objectID, centroid)

    # otherwise, there is a trackable object so we can utilize it
    # to determine direction
    else:
        # the difference between the y-coordinate of the *current*
        # centroid and the mean of *previous* centroids will tell
        # us in which direction the object is moving (negative for
        # 'up' and positive for 'down')
        y = [c[1] for c in to.centroids]
        direction = centroid[1] - np.mean(y)
        to.centroids.append(centroid)

        # check to see if the object has been counted or not
        if not to.counted:
            # if the direction is negative (indicating the object
            # is moving up) AND the centroid is above the center
            # line, count the object
            if direction < 0 and centroid[1] < H // 2:
                totalUp += 1
                to.counted = True

            # if the direction is positive (indicating the object
            # is moving down) AND the centroid is below the
            # center line, count the object
            elif direction > 0 and centroid[1] > H // 2:
                totalDown += 1
                to.counted = True

    # store the trackable object in our dictionary
    trackableObjects[objectID] = to

We begin by looping over the updated bounding box coordinates of the object IDs (Line 188).

On Line 191 we attempt to fetch a `TrackableObject` for the current `objectID`.

If the `TrackableObject` doesn’t exist for the `objectID`, we create one (Lines 194 and 195).

Otherwise, there is already an existing `TrackableObject`, so we need to figure out if the object (person) is moving up or down.

To do so, we grab the y-coordinate values for all previous centroid locations for the given object (Line 204). Then we compute the `direction` by taking the difference between the current centroid location and the mean of all previous centroid locations (Line 205).

The reason we take the mean is to ensure our direction tracking is more stable. If we stored just the previous centroid location for the person we leave ourselves open to the possibility of false direction counting. Keep in mind that object detection and object tracking algorithms are not “magic” — sometimes they will predict bounding boxes that may be slightly off what you may expect; therefore, by taking the mean, we can make our people counter more accurate.
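A toy example (with hypothetical centroid values) shows how the mean smooths out a single jittery detection:

```python
import numpy as np

# Hypothetical y-coordinates of one person's centroid over the last
# five frames: the person is moving up, but the last box jittered low
previous_y = [200, 196, 192, 188, 180]
current_y = 184

# comparing only against the previous frame wrongly suggests "down" (+4)
print(current_y - previous_y[-1])  # 4

# comparing against the mean of the history correctly suggests "up"
direction = current_y - np.mean(previous_y)
print(round(direction, 1))  # -7.2
```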

If the `TrackableObject` has not been `counted` (Line 209), we need to determine if it’s ready to be counted yet (Lines 213-222), by:

1. Checking if the `direction` is negative (indicating the object is moving up) AND the centroid is above the centerline. In this case we increment `totalUp`.
2. Or checking if the `direction` is positive (indicating the object is moving down) AND the centroid is below the centerline. If this is true, we increment `totalDown`.

Finally, we store the `TrackableObject` in our `trackableObjects` dictionary (Line 225) so we can grab and update it when the next frame is captured.

We’re on the home-stretch!

The remaining code blocks handle:

1. Display (drawing and writing text to the frame)
2. Writing frames to a video file on disk (if the `--output` command line argument is present)
3. Capturing keypresses

First we’ll draw some information on the frame for visualization:

# draw both the ID of the object and the centroid of the
# object on the output frame
text = "ID {}".format(objectID)
cv2.putText(frame, text, (centroid[0] - 10, centroid[1] - 10),
    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
cv2.circle(frame, (centroid[0], centroid[1]), 4, (0, 255, 0), -1)

# construct a tuple of information we will be displaying on the
# frame
info = [
    ("Up", totalUp),
    ("Down", totalDown),
    ("Status", status),
]

# loop over the info tuples and draw them on our frame
for (i, (k, v)) in enumerate(info):
    text = "{}: {}".format(k, v)
    cv2.putText(frame, text, (10, H - ((i * 20) + 20)),
        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)

Here we overlay the following data on the frame:

• `objectID`: each object’s numerical identifier.
• `centroid`: the center of the object, represented by a “dot” which is created by filling in a circle.
• `info`: includes `totalUp`, `totalDown`, and `status`.

For a review of drawing operations, be sure to refer to this blog post.

Then we’ll write the `frame` to a video file (if necessary) and handle keypresses:

# check to see if we should write the frame to disk
if writer is not None:
    writer.write(frame)

# show the output frame
cv2.imshow("Frame", frame)
key = cv2.waitKey(1) & 0xFF

# if the q key was pressed, break from the loop
if key == ord("q"):
    break

# increment the total number of frames processed thus far and
# then update the FPS counter
totalFrames += 1
fps.update()

In this block we:

• Write the `frame`, if necessary, to the output video file (Lines 249 and 250)
• Display the `frame` and handle keypresses (Lines 253-258). If “q” is pressed, we `break` out of the frame processing loop.
• Update our `fps` counter (Line 263)

We didn’t make too much of a mess, but now it’s time to clean up:

# stop the timer and display FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

# check to see if we need to release the video writer pointer
if writer is not None:
    writer.release()

# if we are not using a video file, stop the camera video stream
if not args.get("input", False):
    vs.stop()

# otherwise, release the video file pointer
else:
    vs.release()

# close any open windows
cv2.destroyAllWindows()

To finish out the script, we display the FPS info to the terminal, release all pointers, and close any open windows.

Just 283 lines of code later, we are now done.

### People counting results

To see our OpenCV people counter in action, make sure you use the “Downloads” section of this blog post to download the source code and example videos.

From there, open up a terminal and execute the following command:

$ python people_counter.py --prototxt mobilenet_ssd/MobileNetSSD_deploy.prototxt \
	--model mobilenet_ssd/MobileNetSSD_deploy.caffemodel \
	--input videos/example_01.mp4 --output output/output_01.avi
[INFO] loading model...
[INFO] opening video file...
[INFO] elapsed time: 37.27
[INFO] approx. FPS: 34.42

Here you can see that our person counter is counting:

1. The number of people who are entering the department store (down)
2. And the number of people who are leaving (up)

At the end of the first video you’ll see there have been 7 people who entered and 3 people who have left.

Furthermore, examining the terminal output you’ll see that our person counter is capable of running in real-time, obtaining 34 FPS throughout. This is despite the fact that we are using a deep learning object detector for more accurate person detections.

Our 34 FPS throughput rate is made possible through our two-phase process of:

1. Detecting people once every 30 frames
2. And then applying a faster, more efficient object tracking algorithm in all frames in between

Another example of people counting with OpenCV can be seen below:

$ python people_counter.py --prototxt mobilenet_ssd/MobileNetSSD_deploy.prototxt \
	--model mobilenet_ssd/MobileNetSSD_deploy.caffemodel \
	--input videos/example_01.mp4 --output output/output_02.avi
[INFO] opening video file...
[INFO] elapsed time: 36.88
[INFO] approx. FPS: 34.79
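The two-phase schedule boils down to a single modulo test on the frame counter. A minimal sketch (the helper name is ours for illustration, not from the script):

```python
def choose_phase(total_frames, skip_frames=30):
    """Run the expensive deep learning detector every `skip_frames`
    frames; fall back to the cheap correlation trackers otherwise."""
    return "detect" if total_frames % skip_frames == 0 else "track"

# frames 0 and 30 trigger detection; every frame in between only tracks
phases = [choose_phase(f) for f in range(32)]
print(phases.count("detect"), phases.count("track"))  # 2 30
```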

I’ve included a short GIF below to give you an idea of how the algorithm works:

Figure 7: An example of an OpenCV people counter in action.

A full video of the demo can be seen below:

This time there have been 2 people who have entered the department store and 14 people who have left.

You can see how useful this system would be to a store owner interested in foot traffic analytics.

The same type of system for counting foot traffic with OpenCV can be used to count automobile traffic with OpenCV and I hope to cover that topic in a future blog post.

Additionally, a big thank you to David McDuffee for recording the example videos used here today! David works here with me at PyImageSearch and if you’ve ever emailed PyImageSearch before, you have very likely interacted with him. Thank you for making this post possible, David! Also a thank you to BenSound for providing the music for the video demos included in this post.

### What are the next steps?

Congratulations on building your person counter with OpenCV!

If you’re interested in learning more about OpenCV, including building other real-world applications such as face detection, object recognition, and more, I would suggest reading through my book, Practical Python and OpenCV + Case Studies.

Practical Python and OpenCV is meant to be a gentle introduction to the world of computer vision and image processing. This book is perfect if you:

• Are new to the world of computer vision and image processing
• Have some past image processing experience but are new to Python
• Are looking for some great example projects to get your feet wet

Learn OpenCV fundamentals in a single weekend!

If you’re looking for a more detailed dive into computer vision, I would recommend working through the PyImageSearch Gurus course. The PyImageSearch Gurus course is similar to a college survey course, and many students report that they learn more than in a typical university class.

Inside you’ll find over 168 lessons, starting with the fundamentals of computer vision, all the way up to more advanced topics, including:

• Face recognition
• Training your own custom object detectors
• …and much more!

You’ll also find a thriving community of like-minded individuals who are itching to learn about computer vision. Each day in the community forums we discuss:

• New project ideas and resources
• Kaggle and other competitions
• Development environment and code issues
• …among many other topics!

Master computer vision inside PyImageSearch Gurus!

## Summary

In today’s blog post we learned how to build a people counter using OpenCV and Python.

Our implementation is:

• Capable of running in real-time on a standard CPU
• Utilizes deep learning object detectors for improved person detection accuracy
• Leverages two separate object tracking algorithms, including both centroid tracking and correlation filters for improved tracking accuracy
• Applies both a “detection” and “tracking” phase, making it capable of (1) detecting new people and (2) picking up people that may have been “lost” during the tracking phase

I hope you enjoyed today’s post on people counting with OpenCV!

The post OpenCV People Counter appeared first on PyImageSearch.

### Unsupervised Learning Demystified

Unsupervised learning is a pattern-finding technique for mining inspiration from your data. Let's demystify!

### Four short links: 13 August 2018

Algorithms, Feedback, Transliteration, and Diagnosis

1. Dijkstras in Disguise -- It turns out that many algorithms I've encountered in my computer graphics, finance, and reinforcement learning studies are all variations of this relaxation principle in disguise. [...] This blog post is a gentle tutorial on how all these varied CS topics are connected.
2. The Blacker the Box -- The thesis of this post is: The faster the feedback on prediction accuracy, the blacker the box can be. The slower the feedback, the more your models should be explicit and formal.
3. Design Challenges in Named Entity Transliteration -- In order to improve availability of bilingual named entity transliteration data sets, we release personal name bilingual dictionaries mined from Wikidata for English to Russian, Hebrew, Arabic, and Japanese Katakana. Our code and dictionaries are publicly available. GitHub.
4. AI Spots Fibromyalgia -- A machine-learning algorithm that was programmed to recognize this neurological signature was able to use it to predict which brain scans were indicative of fibromyalgia and which were not. [...] López-Solà’s research is compelling evidence to convince those who are reluctant to accept the existence of fibromyalgia. Interesting because medical science still debates whether the condition is real, yet the software can reliably identify something.

### How feminism has made me a better scientist

Feminism is not a branch of science. It is not a set of testable propositions about the observable world, nor is it any single research method. From my own perspective, feminism is a political movement associated with successes such as votes for women, setbacks such as the failed Equal Rights Amendment, and continuing struggles in areas ranging from reproductive freedom to equality in the workplace. Feminism is also a way of looking at the world through awareness of the social and political struggles of women, historically and in the present. And others will define feminism in other ways.

How is this relevant to the practice of science? As a researcher in statistical methods and applications, I have found feminism to help me do better science. I make this claim based on various experiences in my work.

To start with, statistics is all about learning from data. Statisticians from Florence Nightingale and Francis Galton through Hans Rosling and Nate Silver have discovered unexpected patterns using mathematical modeling and data visualization. What does that have to do with feminism? Feminism is about a willingness to question the dominant, tacitly accepted ideology. This is essential for science. What is labeled “non-ideological” basically means the dominant ideology is accepted without thought. As Angela Davis said in her lecture, Feminism and Abolition, “feminist methodologies impel us to explore connections that are not always apparent. And they drive us to inhabit contradictions and discover what is productive in these contradictions. Feminism insists on methods of thought and action that urge us to think about things together that appear to be separate, and to disaggregate things that appear to naturally belong together.”

Along with questioning the dominant, tacitly accepted ideology is the need to recognize this ideology. This is related to the idea that a valuable characteristic of a scientist is a willingness to be disturbed. We learn from anomalies (see discussion here), which requires recognizing how a given observation or story is anomalous (that is, anomalous with respect to what expectations, exactly?), which in turn is more effective if one can identify the substrates of theories that determine our expectations. An example from our statistical work is our research using the Xbox survey to reveal that apparent swings in public opinion can largely be explained by variations in nonresponse.

On a more specific level, feminism can make us aware of work where the male perspective is unthinkingly taken as a baseline. For example, a paper was released a few years ago presenting survey data from the United States showing that parents of girls were more likely to support the Republican party, compared to parents of boys. I’m not completely sure what to make of this finding—for one thing, an analysis a few years ago of data from Britain found the opposite pattern—but here I want to focus on the reception of this research claim. There’s something oddly asymmetrical about how these results were presented, both by the authors and in the media. Consider the following headlines: “The Effect of Daughters on Partisanship and Social Attitudes Toward Women,” “Does Having Daughters Make You More Republican?”, “Parents With Daughters Are More Likely To Be Republicans, Says New Study,” “Parents Of Daughters Lean Republican, Study Shows,” “The Daughter Theory: Does raising girls make parents conservative?” Here’s our question: Why is it all about “the effect of daughters”? Why not, Does having sons make you support the Democrats? This is not just semantics: the male-as-baseline perspective affects the explanations that are given for this pattern: Lots of discussion about how you, as a parent, might change your views of the world if you have a girl. But not so much about how you might change your views if you have a boy.

The fallacy here was that people were reasoning unidirectionally. In this case, the benefit of a feminist perspective is not political but rather just a recognition of multiple perspectives and social biases, recognizing that in this case the boy baby is considered a default. As the saying goes, the greatest trick the default ever pulled was convincing the world it didn’t exist.

To put it another way: it is not that feminism is some sort of superpower that allows one to consider alternatives to the existing defaults, it’s more that these alternatives are obvious and can only not be seen if you don’t allow yourself to look. Feminism is, for this purpose, not a new way of looking at the world; rather, it is a simple removal of blinders. But uncovering blind spots isn’t that simple, and can be quite powerful.

A broader point is that it’s hard to do good social science if you don’t understand the community you’re studying. The lesson from feminism is not just to avoid taking the male perspective for granted but more generally to recognize the perspective of outgroups. An example came up recently with the so-called gaydar study, a much-publicized paper demonstrating the ability of a machine learning algorithm to distinguish gay and straight faces based on photographs from dating sites. This study was hyped beyond any reasonable level. From a statistical perspective, it’s no surprise at all that two groups of people selected from different populations will differ from each other, and if you have samples from two different populations and a large number of cases, then you should be able to train an algorithm to distinguish them at some level of accuracy. The authors of the paper in question went way beyond this, though, claiming that their results “provide strong support for the PHT [prenatal hormone theory], which argues that same-gender sexual orientation stems from the underexposure of male fetuses and overexposure of female fetuses to prenatal androgens responsible for the sexual differentiation of faces, preferences, and behavior.” Also some goofy stuff about the fact that gay men in this particular sample are less likely to have beards. Beyond the purely statistical problems here is a conceptual error, which is to think of “gay faces” as some sort of innate property of gay people, rather than as cues that gays are deliberately sending to each other. The distinctive and noticeable characteristics of the subpopulation are the result of active choices by members of that group, not (as assumed in the paper under discussion) essential attributes derived from “nature” or “nurture.”

Looking from a different direction, feminism can make us suspicious of simplistic gender-essentialist ideas such as expressed in various papers that make use of schoolyard evolutionary biology—the idea that, because of evolution, all people are equivalent to all other people, except that all boys are different from all girls. It’s the attitude I remember from the grade school playground, in which any attribute of a person, whether it be how you walked or how you laughed or even how you held your arms when you were asked to look at your fingernails (really) were gender-typed. It’s gender and race essentialism. And when you combine it with what Tversky and Kahneman called “the law of small numbers” (the naive but common attitude that any underlying pattern should reproduce in any small sample) has led to endless chasing of noise in data analyses. In short, if you believe this sort of essentialism, you can find it just about anywhere you look, hence the value of a feminist perspective which reminds us of the history and fallacies of gender essentialism. Of course there are lots of systematic differences between boys and girls, and between men and women, that are not directly sex-linked. To be a feminist is not to deny these differences; rather, placing these differences within a larger context, and relating them to past and existing power imbalances, is part of what feminism is about.

Many examples of schoolyard evolutionary biology in published science will be familiar to regular readers of this blog: there’s the beauty-and-sex-ratio study, the ovulation-and-clothing study, the fat-arms-and-political attitudes study (a rare example that objectified men instead of women), the ovulation-and-voting study, and various others. Just to be clear: I’m not saying that the claims in those studies have to be wrong. All these claims, to my eyes, look crudely gender-essentialist, but that doesn’t mean they’re false, or that it was a bad idea to study them. No, all those studies had problems in their statistics (in the sense that poor understanding of statistical methods led researchers and observers to wrongly think that those data presented strong evidence in favor of those claims) and in their measurement (in that the collected data were too sparse and noisy to really have a chance of supplying the desired evidence).

A feminist perspective was helpful to me in unraveling these studies for several reasons. To start with, feminism gave me the starting point of skepticism regarding naive gender essentialism—or, I could say, it helped keep me from being intimidated by weak theorizing coming from that direction. Second, feminism made me aware of multiple perspectives: if someone can hypothesize that prettier parents are more likely to produce girls, I can imagine an opposite just-so story that makes just as much sense. And, indeed, both stories could be true at different times and different places, which brings me to the third thing I bring from feminism: an awareness of variation. There’s no reason to think that a hypothesized effect will be consistent in magnitude or even sign for different people and under different conditions. Understanding this point is a start toward moving away from naive, one might say “scientistic,” views of what can be learned or proved from simple surveys or social experiments.

I consider myself a feminist but I understand that others have different political views, and I’m not trying to say that being a feminist is necessary for doing science. Of course not. Rather, my point is that I think the political and historical insights of feminism have made me a better statistician and scientist.

As I wrote a few years ago:

At some level, in this post I’m making the unremarkable point that each of us has a political perspective which informs our research in positive and negative ways. The reason that this particular example of the feminist statistician is interesting is that it’s my impression that feminism, like religion, is often viewed as a generally anti-scientific stance. I think some of this attitude comes from some feminists themselves who are skeptical of science in that it is a generally male-dominated institution that is in part used to continue male dominance of society, and it also comes from people who might say that reality has an anti-feminist bias.

We can try to step back and account for our own political and ethical positions when doing science, but at the same time we should be honest about the ways that our positions and experiences shape the questions we ask and produce insights.

Feminism, like religion or other social identifications, can be competitive with science or it can be collaborative. See, for example, the blog of Echidne for a collaborative approach. To the extent that feminism represents a set of tenets that are opposed to reality, it could get in the way of scientific thinking, in the same way that religion would get in the way of scientific thinking if, for example, you tried to apply faith healing principles to do medical research. If you’re serious about science, though, I think of feminism (or, I imagine, Christianity, for example) as a framework rather than a theory—that is, as a way of interpreting the world, not as a set of positive statements. This is in the same way that I earlier wrote that racism is a framework, not a theory. Not all frameworks are equal; my point here is just that, if we’re used to thinking of feminism, or religion, as anti-scientific, it can be useful to consider ways in which these perspectives can help one’s scientific work.

All of this is just one perspective. I did get several useful comments and references from Shira Mitchell and Dan Simpson when preparing this post, but the particular stories and emphases are mine. One could imagine a whole series of such posts—“How feminism made me a worse scientist,” “How science has made me a better feminist,” “How Christianity has made me a better scientist,” and so forth—all written by different people. And each one could be valid.

I was impelled to write the above post after reflecting upon all those many pseudo-scientific stories of cavemen (as Rebecca Solnit put it, “the five-million-year-old suburb”) and reflecting on the difficulties we so often have in communicating with one another; see for example here (where psychologist Steven Pinker, who describes himself as a feminist, gives a list of topics that he feels are “taboo” and cannot be openly discussed among educated Americans; an example is the statement, “Do men have an innate tendency to rape?”) and here (where social critic Charles Murray, who I assume would not call himself a feminist, argues that educated Americans are too “nonjudgmental” and not willing enough to condemn others, for example by saying “that it is morally wrong for a woman to bring a baby into the world knowing that it will not have a father”).

When doing social science, we have to accept that different people have sharply different views. Awareness of multiple perspectives is to me a key step, both in understanding social behavior and also in making sense of the social science we read. I do not think that calling oneself a feminist makes someone a better person, nor do I claim that feminism represents some higher state of virtue. All I’m saying here is that feminism, beyond its political context, happens to be a perspective that can help some of us be better scientists.

The post How feminism has made me a better scientist appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Top Stories, Aug 6-12: Eight iconic examples of data visualisation; GitHub Python Data Science Spotlight

Also: Only Numpy: Implementing GANs and Adam Optimizer using Numpy; Understanding Language Syntax and Structure; Eight iconic examples of data visualisation; 5 Data Science Projects That Will Get You Hired in 2018; Seven Practical Ideas For Beginner Data Scientists

### IML and H2O: Machine Learning Model Interpretability And Feature Explanation

Model interpretability is critical to businesses. If you want to use high performance models (GLM, RF, GBM, Deep Learning, H2O, Keras, xgboost, etc), you need to learn how to explain them. With machine learning interpretability growing in importance, several R packages designed to provide this capability are gaining in popularity. We analyze the IML package in this article.

In recent blog posts we assessed LIME for model agnostic local interpretability functionality and DALEX for both local and global machine learning explanation plots. This post examines the iml package (short for Interpretable Machine Learning) to assess its functionality in providing machine learning interpretability to help you determine if it should become part of your preferred machine learning toolbox.

We again utilize the high performance machine learning library, h2o, implementing three popular black-box modeling algorithms: GLM (generalized linear models), RF (random forest), and GBM (gradient boosted machines). For those that want a deep dive into model interpretability, the creator of the iml package, Christoph Molnar, has put together a free book: Interpretable Machine Learning. Check it out.

## Articles In The Model Interpretability Series

Articles related to machine learning and black-box model interpretability:

Awesome Data Science Tutorials with LIME for black-box model explanation in business:

## Transform Your Data Science Abilities In 10 Weeks

If you’re interested in learning how to apply critical thinking and data science while solving a real-world business problem following an end-to-end data science project, check out Data Science For Business (DS4B 201-R). Over the course of 10 weeks you will solve an end-to-end Employee Churn data science project following our systematic Business Science Problem Framework.

We have two new course offerings coming soon! Shiny Web Apps and Python And Spark For Customer Churn! Get started with Business Science University.

## FREE BOOK: Interpretable Machine Learning

The creator of the IML (Interpretable Machine Learning) package has a great FREE resource for those interested in applying model interpretability techniques to complex, black-box models (high performance models). Check out the book: Interpretable Machine Learning by Christoph Molnar.

Interpretable Machine Learning Book by Christoph Molnar

## Learning Trajectory

We’ll cover the following topics on iml, combining with the h2o high performance machine learning package:

## IML and H2O: Machine Learning Model Interpretability And Feature Explanation

By Brad Boehmke, Director of Data Science at 84.51°

The iml package is probably the most robust ML interpretability package available. It provides both global and local model-agnostic interpretation methods. Although the interaction functions are notably slow, the other functions are faster or comparable to existing packages we use or have tested. I definitely recommend adding iml to your preferred ML toolkit. The following provides a quick list of some of its pros and cons:

• ML model and package agnostic: can be used for any supervised ML model (many features are only relevant to regression and binary classification problems).
• Variable importance: uses a permutation-based approach for variable importance, which is model agnostic, and accepts any loss function to assess importance.
• Partial dependence plots: Fast PDP implementation and allows for ICE curves.
• H-statistic: one of only a few implementations to allow for assessing interactions.
• Local interpretation: provides both LIME and Shapley implementations.
• Plots: built with ggplot2 which allows for easy customization

• Does not allow for easy comparisons across models like DALEX does.
• The H-statistic interaction functions do not scale well to wide data (many predictor variables).
• Only provides permutation-based variable importance scores (which become slow as the number of features increases).
• The LIME implementation has less flexibility and fewer features than the lime package.

### Replication Requirements

#### Libraries

I leverage the following packages:

#### Data

To demonstrate iml’s capabilities we’ll use the employee attrition data included in the rsample package. This is a binary classification problem (“Yes” vs. “No”), but the same process you’ll observe can be used for a regression problem.
I perform a few housekeeping tasks on the data prior to converting it to an h2o object and splitting.

NOTE: The surrogate tree function uses partykit::ctree, which requires all predictors to be numeric or factors. Consequently, we need to coerce any character predictors to factors (or ordinal encode them).

#### H2O Models

We will explore how to visualize a few of the more common machine learning algorithms implemented with h2o. For brevity I train default models and do not emphasize hyperparameter tuning. The following produces regularized logistic regression, random forest, and gradient boosting machine models.

All of the models produce AUCs ranging from 0.75 to 0.79. Although these models have distinct AUC scores, our objective is to understand how these models come to their conclusions in similar or different ways based on underlying logic and data structure.

### IML procedures

In order to work with iml, we need to adapt our data a bit so that we have the following three components:

1. Create a data frame with just the features (must be of class data.frame, cannot be an H2OFrame or other class).

2. Create a vector with the actual responses (must be numeric – 0/1 for binary classification problems).

3. iml has internal support for some machine learning packages (e.g. mlr, caret, randomForest). However, to use iml with several of the more popular packages being used today (e.g. h2o, ranger, xgboost), we need to create a custom function that takes a data set (again, must be of class data.frame) and provides the predicted values as a vector.

Once we have these three components we can create a predictor object. Similar to DALEX and lime, the predictor object holds the model, the data, and the class labels to be applied to downstream functions. A unique characteristic of the iml package is that it uses R6 classes, which are rather rare. The main differences between R6 classes and the normal S3 and S4 classes we typically work with are:

• Methods belong to objects, not generics (we’ll see this in the next code chunk).
• Objects are mutable: the usual copy-on-modify semantics do not apply (we’ll see this later in this tutorial).
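As a loose analogy, a hypothetical Python class exhibits the same two properties — methods attached to the object, and in-place mutation:

```python
class Counter:
    """Toy illustration: methods belong to the object, and state mutates in place."""

    def __init__(self):
        self.n = 0

    def increment(self):
        # Modifies the existing object instead of returning a modified copy,
        # mirroring R6 semantics (no copy-on-modify).
        self.n += 1
        return self


c = Counter()
alias = c            # both names reference the same object
alias.increment()    # method belongs to the object; mutation is in place
print(c.n)           # 1 -- the change is visible through every reference
```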

These properties make R6 objects behave more like objects in programming languages such as Python. So to construct a new Predictor object, you call the new() method, which belongs to the R6 Predictor object, and you use $ to access new():

### Global interpretation

iml provides a variety of ways to understand our models from a global perspective:

1. Feature Importance
2. Partial Dependence
3. Measuring Interactions
4. Surrogate Model

We’ll go through each.

#### 1. Feature importance

We can measure how important each feature is for the predictions with FeatureImp. The feature importance measure works by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if permuting its values increases the model error, because the model relied on the feature for the prediction. A feature is “unimportant” if permuting its values leaves the model error unchanged, because the model ignored the feature for the prediction. This model-agnostic approach is based on Breiman (2001) and Fisher et al. (2018) and follows these steps:

For any given loss function do
1: compute loss function for original model
2: for variable i in {1,...,p} do
| randomize values
| apply given ML model
| estimate loss function
| compute feature importance (permuted loss / original loss)
end
3: sort variables by descending feature importance

We see that all three models find OverTime to be the most influential variable; however, after that each model finds unique structure and signals within the data. Note: you can extract the results with imp.rf$results.
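The permutation loop above can be sketched generically — a Python illustration (not iml's implementation), assuming only a model object that exposes a predict method:

```python
import numpy as np

def permutation_importance(model, X, y, loss, n_repeats=5, seed=0):
    """Generic sketch of permutation-based importance (Breiman 2001;
    Fisher et al. 2018). `loss(y, pred)` can be any loss function;
    importance is reported as permuted loss / original loss.
    """
    rng = np.random.default_rng(seed)
    base = loss(y, model.predict(X))          # step 1: loss of the original model
    scores = {}
    for j in range(X.shape[1]):               # step 2: loop over variables
        ratios = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the feature/target link
            ratios.append(loss(y, model.predict(Xp)) / base)
        scores[j] = float(np.mean(ratios))
    # step 3: sort by descending importance
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Averaging over several permutations smooths out the randomness of any single shuffle.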

Permutation-based approaches can become slow as your number of predictors grows. Assessing variable importance for all 3 models in this example takes only 8 seconds. However, performing the same procedure on a data set with 80 predictors (AmesHousing::make_ames()) takes roughly 3 minutes. Although this is slower, it is comparable to other permutation-based implementations (e.g. DALEX, ranger).

The following lists some advantages and disadvantages to iml’s feature importance procedure.

• Model agnostic
• Simple interpretation that’s comparable across models
• Can apply any loss function (accepts custom loss functions as well)
• Plot output uses ggplot2; we can add to or use our internal branding packages with it

• Permutation-based methods are slow
• Default plot contains all predictors; however, we can subset the $results data frame if desired

#### 2. Partial dependence

The Partial class implements partial dependence plots (PDPs) and individual conditional expectation (ICE) curves. The procedure follows the traditional methodology documented in Friedman (2001) and Goldstein et al. (2014), where the ICE curve for a certain feature illustrates the predicted value for each observation when we force each observation to take on the unique values of that feature. The PDP curve represents the average prediction across all observations.

For a selected predictor (x)
1: determine grid space of j evenly spaced values across distribution of x
2: for value i in {1,...,j} of grid space do
| set x to i for all observations
| apply given ML model
| estimate predicted value
| if PDP: average predicted values across all observations
end

The following produces “ICE boxplots” and a PDP line (connecting the averages of all observations for each response class) for the most important variable in all three models (OverTime). All three models show a sizable increase in the probability of employees attriting when working overtime. However, you will notice the random forest model experiences less of an increase in probability compared to the other two models.

For continuous predictors you can reduce the grid space to make computation more efficient and center the ICE curves. Note: to produce the centered ICE curves (right plot) you use ice$center and provide it the value to center on. This will modify the existing object in place (recall this is a unique characteristic of R6 –> objects are mutable). The following compares the marginal impact of Age on the probability of attriting. The regularized regression model shows a monotonic decrease in the probability (the log-odds probability is linear) while the two tree-based approaches capture the non-linear, non-monotonic relationship.
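The grid-and-average procedure can be sketched in a few lines; this is a generic Python illustration (not iml's implementation), assuming a model with a predict method over a numeric feature matrix:

```python
import numpy as np

def partial_dependence(model, X, feature, grid_size=20):
    """Generic sketch of the PDP/ICE procedure. Returns the grid, the ICE
    curves (one row per observation), and the PDP (the average ICE value
    at each grid point).
    """
    lo, hi = X[:, feature].min(), X[:, feature].max()
    grid = np.linspace(lo, hi, grid_size)   # step 1: evenly spaced grid over x
    ice = np.empty((X.shape[0], grid_size))
    for g, value in enumerate(grid):        # step 2: force x = value for everyone
        Xv = X.copy()
        Xv[:, feature] = value
        ice[:, g] = model.predict(Xv)       # one predicted value per observation
    pdp = ice.mean(axis=0)                  # PDP: average across observations
    return grid, ice, pdp
```

Centered ICE curves (what ice$center produces in iml) can then be obtained by subtracting each curve's value at a chosen grid point, e.g. `ice - ice[:, [0]]`.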

Similar to pdp you can also compute and plot 2-way interactions. Here we assess how the interaction of MonthlyIncome and OverTime influences the predicted probability of attrition for all three models.

The following lists some advantages and disadvantages to iml’s PDP and ICE procedures.

• Provides PDPs & ICE curves (unlike DALEX)
• Allows you to center ICE curves
• Computationally efficient
• grid.size allows you to increase/reduce grid space of xi values
• Rug option illustrates distribution of all xi values
• Provides convenient plot outputs for categorical predictors

• Only provides heatmap plot of 2-way interaction plots
• Does not allow for easy comparison across models like DALEX

#### 3. Measuring Interactions

A wonderful feature provided by iml is to measure how strongly features interact with each other. To measure interaction, iml uses the H-statistic proposed by Friedman and Popescu (2008). The H-statistic measures how much of the variation of the predicted outcome depends on the interaction of the features. There are two approaches to measure this. The first measures if a feature (xi) interacts with any other feature. The algorithm performs the following steps:

1: for variable i in {1,...,p} do
| f(x) = estimate predicted values with original model
| pd(x) = partial dependence of variable i
| pd(!x) = partial dependence of all features excluding i
| upper = sum(f(x) - pd(x) - pd(!x))
| lower = variance(f(x))
| rho = upper / lower
end
2. Sort variables by descending rho (interaction strength)


The interaction strength (rho) ranges from 0, when there is no interaction at all, to 1, when all of the variation in the predicted outcome depends on a given interaction. All three models capture different interaction structures, although some commonalities exist across models (e.g. OverTime, Age, JobRole). The interaction effects are stronger in the tree-based models than in the GLM model, with the GBM model having the strongest interaction effect of 0.4.

Considering OverTime exhibits the strongest interaction signal, the next question is with which other variable this interaction is strongest. The second approach measures the 2-way interaction strength of features xi and xj and performs the following steps:

1: i = a selected variable of interest
2: for remaining variables j in {1,...,p} do
| pd(ij) = interaction partial dependence of variables i and j
| pd(i) = partial dependence of variable i
| pd(j) = partial dependence of variable j
| upper = sum(pd(ij) - pd(i) - pd(j))
| lower = variance(pd(ij))
| rho = upper / lower
end
3. Sort interaction relationship by descending rho (interaction strength)


The following measures the two-way interactions of all variables with the OverTime variable. The two tree-based models show MonthlyIncome as having the strongest interaction (although it is a weak interaction, since rho < 0.13). Identifying these interactions can be useful to understand which variables create co-dependencies in our models’ behavior. It also helps us identify interactions to visualize with PDPs (which is why I showed the example of the OverTime and MonthlyIncome interaction PDP earlier).
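The two-way steps above can be sketched roughly in Python. This is a simplified illustration, not iml's implementation: partial dependence is evaluated at each observation's own feature values rather than on a grid, and deviations are squared and centered as in the original H-statistic definition:

```python
import numpy as np

def h_statistic_2way(model, X, i, j):
    """Rough sketch of the two-way H-statistic (Friedman & Popescu, 2008)."""
    def pd(features):
        # Partial dependence of `features` at each observed value: clamp the
        # features to row r's values and average predictions over the data.
        out = np.empty(X.shape[0])
        for r in range(X.shape[0]):
            Xc = X.copy()
            Xc[:, features] = X[r, features]
            out[r] = model.predict(Xc).mean()
        return out

    pd_ij = pd([i, j])
    pd_i, pd_j = pd([i]), pd([j])
    center = lambda v: v - v.mean()
    # upper / lower ratio from the pseudocode above: how much of the joint
    # partial dependence is NOT explained by the two univariate effects
    upper = np.sum((center(pd_ij) - center(pd_i) - center(pd_j)) ** 2)
    lower = np.sum(center(pd_ij) ** 2)
    return upper / lower
```

For a purely additive model the ratio is essentially zero; for a multiplicative interaction it approaches one.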

The H-statistic is not widely implemented, so having this feature in iml is beneficial. However, it’s important to note that as your feature set grows, the H-statistic becomes computationally slow. For this data set, measuring the interactions across all three models took only 45 seconds, and the two-way interactions took 68 seconds. However, for a wider data set such as AmesHousing::make_ames(), where there are 80 predictors, this will take upwards of an hour to compute.

#### 4. Surrogate model

Another way to make the models more interpretable is to replace the “black box” model with a simpler model (aka “white box” model) such as a decision tree. This is known as a surrogate model in which we

1. Apply the original model and get predictions
2. Choose an interpretable "white box" model (linear model, decision tree)
3. Train the interpretable model on the original dataset and its predictions
4. Measure how well the surrogate model replicates the predictions of the black box model
5. Interpret / visualize the surrogate model
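These steps can be sketched generically; here a linear model stands in as the interpretable “white box” (iml itself fits a tree), assuming only a black-box model with a predict method, and reporting the fidelity R² of step 4:

```python
import numpy as np

def linear_surrogate(model, X):
    """Sketch of a linear surrogate: fit an interpretable model to the
    black-box predictions and measure how faithfully it replicates them.
    """
    yhat = model.predict(X)                      # step 1: black-box predictions
    A = np.column_stack([np.ones(len(X)), X])    # intercept + original features
    coef, *_ = np.linalg.lstsq(A, yhat, rcond=None)  # step 3: train on predictions
    resid = yhat - A @ coef
    r2 = 1 - resid.var() / yhat.var()            # step 4: fidelity to the black box
    return coef, r2
```

A high R² means the surrogate's coefficients are a trustworthy summary of the black box; a low R² (as in the GBM example below) means the simple explanation should not be trusted.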


iml provides a simple decision tree surrogate approach, which leverages partykit::ctree. In this example we train a decision tree with a max depth of 3 on our GBM model. You can see that the white box model does not do a good job of explaining the black box predictions (R^2 = 0.438).

The plot illustrates the distribution of the probability of attrition for each terminal node. We see an employee with JobLevel > 1 and DistanceFromHome <= 12 has a very low probability of attriting.

When trying to explain a complicated machine learning model to decision makers, surrogate models can help simplify the process. However, it’s important to only use surrogate models for simplified explanations when they are actually good representatives of the black box model (in this example, it is not).

### Local interpretation

In addition to providing global explanations of ML models, iml also provides two newer but well-accepted methods for local interpretation.

Local interpretation techniques provide methods to explain why an individual prediction was made for a given observation.

To illustrate, let’s get two observations. The first represents the observation for which our random forest model produces the highest probability of attrition (observation 154, with a 0.666 probability), and the second represents the observation with the lowest probability (observation 28, with a probability of 0).

#### 1. Lime

iml implements its own version of local interpretable model-agnostic explanations (Ribeiro et al., 2016). Although it does not use the lime package directly, it does implement the same procedures (see lime tutorial).

A few notable items about iml’s implementation (see the referenced tutorial above for these details within lime):

• like lime, can change distance metric (default is gower but accepts all distance functions provided by ?dist),
• like lime, can change kernel (neighborhood size),
• like lime, computationally efficient –> takes about 5 seconds to compute,
• like lime, can be applied to multinomial responses,
• like lime, uses the glmnet package to fit the local model; however…
• unlike lime, only implements a ridge model (lime allows ridge, lasso, and more),
• unlike lime, can only do one observation at a time (lime can do multiple),
• unlike lime, does not provide fit metric such as (R^2) for the local model.

The following fits a local model for the observation with the highest probability of attrition. In this example I look for the 10 variables in each model that are most influential in this observation’s predicted value (k = 10). The results show that the Age of the employee reduces the probability of attrition within all three models. Moreover, all three models show that since this employee works OverTime, this sizably increases the probability of attrition. However, the tree-based models also identify the MaritalStatus and JobRole of this employee as contributing to his/her increased probability of attrition.

Here I reapply the same model to low_prob_ob. We see Age, JobLevel, and OverTime all having sizable influence on this employee’s very low predicted probability of attrition (zero).

Although LocalModel does not provide fit metrics (e.g. R^2) for our local model, we can compare the local model’s predicted probability to the global (full) model’s predicted probability.

For the high-probability employee, the local model predicts only a 0.34 probability of attrition, whereas for the low-probability employee it predicts a more accurate 0.12. This can help you gauge the trustworthiness of the local model.

High Probability:

Low Probability:

#### 2. Shapley values

An alternative for explaining individual predictions is a method from coalitional game theory that produces what are called Shapley values (Lundberg & Lee, 2016). The idea behind Shapley values is to assess every combination of predictors to determine each predictor’s impact. Focusing on feature xj, the approach tests the accuracy of every combination of features not including xj and then tests how adding xj to each combination improves the accuracy. Unfortunately, computing Shapley values is very computationally expensive. Consequently, iml implements an approximate Shapley estimation algorithm that follows these steps:

ob = single observation of interest
1: for variables j in {1,...,p} do
| m = random sample from data set
| t = rbind(m, ob)
| f(all) = compute predictions for t
| f(!j) = compute predictions for t with feature j values randomized
| diff = sum(f(all) - f(!j))
| phi = mean(diff)
end
2. sort phi in decreasing order


The Shapley value ($\phi$) represents the contribution of each respective variable towards the predicted value compared to the average prediction for the data set.
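The sampling loop above can be sketched loosely in Python. This is a simplified illustration of a sampling-based Shapley approximation, not iml's exact algorithm (which also permutes the remaining features when forming each sampled instance):

```python
import numpy as np

def shapley_sketch(model, X, x_obs, n_samples=100, seed=0):
    """Loose sketch: for each feature j, compare the prediction for the
    observation of interest against predictions where feature j is replaced
    by values drawn from the data, and average the differences (phi).
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    phi = np.empty(p)
    for j in range(p):
        rows = rng.integers(0, len(X), n_samples)   # m = random sample from data
        T = np.tile(x_obs, (n_samples, 1))          # t = copies of the observation
        f_all = model.predict(T)                    # f(all)
        T[:, j] = X[rows, j]                        # randomize feature j
        f_without_j = model.predict(T)              # f(!j)
        phi[j] = np.mean(f_all - f_without_j)       # phi = mean difference
    return phi
```

Features the model ignores receive a phi of zero; influential features receive contributions relative to their background distribution.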

We use Shapley\$new to create a new Shapley object. For this data set it takes about 9 seconds to compute. The time to compute is largely driven by the number of predictors but you can also control the sample size drawn (see sample.size argument) to help reduce compute time. If you look at the results, you will see that the predicted value of 0.667 is 0.496 larger than the average prediction of 0.17. The plot displays the contribution each predictor played in this difference of 0.496.

We can compare the Shapley values across each model to see if common themes appear. Again, OverTime is a common theme across all three models. We also see MonthlyIncome as influential for the tree-based methods, and there are other commonalities for the mildly influential variables across all three models (e.g. StockOptionLevel, JobLevel, Age, MaritalStatus).

Similarly, we can apply the same approach for the low probability employee. Some common themes pop out for this employee as well. It appears that age, total number of working years, and a senior position (JobLevel, JobRole) play a large part in the low predicted probability of attrition for this employee.

Shapley values are considered more robust than the results you will get from LIME. However, much as with the different ways you can compute variable importance, although you will see differences between the two methods, you will often see common variables being identified as highly influential in both approaches. Consequently, we should use these approaches to help identify influential variables, but not to definitively label a variable as the most influential.

## Next Steps (Transform Your Abilities)

It’s critical to understand that modeling is just one part of the overall data science project. Make no mistake – it’s an important part, but other parts are equally if not more important, which is why our Data Science For Business With R (DS4B 201-R) course is successfully teaching data science students how to apply data science in the real world.

We teach end-to-end data science. This means you learn how to:

• Chapter 1: Sizing The Problem, Custom Functions With Tidy Eval: Learn how to understand if a business problem exists by sizing the problem. In addition, create custom functions using Tidy Eval, a programming interface that is needed if you want to build custom functions using dplyr and ggplot2.

• Chapter 2: Exploration with GGally, skimr – We show you how to explore data and identify relationships efficiently and effectively

• Chapter 3: recipes, Premodeling Correlation Analysis: We teach you how to use recipes to develop a data transformation workflow, and we show you how to perform a pre-modeling correlation analysis so you do not move into modeling prematurely, which again saves you time.

• Chapters 4 and 5: H2O AutoML: You will first learn how to use all of the major h2o functions to perform automated machine learning for binary classification including working with the H2O leaderboard, visualizing results, and performing grid search and cross validation. In the next chapter, you learn how to evaluate multiple models against each other with ROC, Precision vs Recall, Gain and Lift, which is necessary to explain your choices for best model to the business stakeholders (your manager and key decision makers).

• Chapter 6: LIME: You learn how to create local explanations to interpret black-box machine learning models. Explanations for the model predictions are critical because the business cannot do anything to improve unless they understand why. We show you how with lime.

• Chapters 7 and 8: Expected Value Framework, Threshold Optimization, Sensitivity Analysis: These are my two favorite chapters because they show you how to link the churn classification model to financial results, and they teach purrr for iterative grid-search style optimization! These are POWERFUL CONCEPTS.

• Chapter 9: Recommendation Algorithm: We again use a correlation analysis but in a different way. We discretize, creating a special visualization that enables us to develop a recommendation strategy.

## Data Science For Business With R (DS4B 201-R)

Learn everything you need to know to complete a real-world, end-to-end data science project with the R programming language. Transform your abilities in 10 weeks.

We have one course out and two courses coming soon!

### Data Science For Business With R (DS4B 201-R)

Over the course of 10 weeks, we teach you how to solve an end-to-end data science project. Available now!

Transform your abilities by solving employee churn over a 10-week end-to-end data science project in DS4B 201-R

### Building A Shiny Application (DS4B 301-R)

Our next course teaches you how to take the H2O Model, LIME explanation results, and the recommendation algorithm you develop in DS4B 201-R and turn it into a Shiny Web Application that predicts employee attrition! Coming in Q3 2018.

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in DS4B 301-R (Building A Shiny Web App)

Kelly O’Briant is lead developer for the Shiny App Course coming soon. She’s a brilliant software engineer / data scientist that knows how to make a great looking and performing Shiny app.

### Data Science For Business With Python (DS4B 201-P)

Did we mention we have a DS4B Python course coming? Well, we do! Coming in Q4 2018.

The problem changes: Customer Churn! The tools will be H2O, LIME, and a host of other tools implemented in Python + Spark.

Python Track: Data Science For Business With Python And Spark

Favio Vazquez, Principal Data Scientist at OXXO, is building the Python + Spark equivalent of DS4B 201-R. He’s immensely talented, knowing Python, Spark, and R, along with a host of other data science tools.

## Don’t Miss A Beat

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:


### Wikimedia Foundation: Manager of Product Analytics (San Francisco, CA or Remote)

Seeking an experienced and hands-on Manager of Product Analytics to lead a team that uses data to help our staff and our communities achieve our vision: a world in which every single human being can freely share in the sum of all knowledge.

### Aerial view of sheepdogs herding sheep

Sometimes the visualization takes care of itself. Photographer Tim Whittaker filmed sheepdogs herding thousands of sheep, and the flows one place to another are like organized randomness.


### Book Memo: “Artificial Adaptive Systems using Auto Contractive Maps”

Theory, Applications and Extensions. This book offers an introduction to artificial adaptive systems and a general model of the relationships between the data and algorithms used to analyze them. It subsequently describes artificial neural networks as a subclass of artificial adaptive systems, and reports on the backpropagation algorithm, while also identifying an important connection between supervised and unsupervised artificial neural networks. The book’s primary focus is on the auto contractive map, an unsupervised artificial neural network employing a fixed point method versus traditional energy minimization. This is a powerful tool for understanding, associating and transforming data, as demonstrated in the numerous examples presented here. A supervised version of the auto contractive map is also introduced as an outstanding method for recognizing digits and defects. In closing, the book walks the readers through the theory and examples of how the auto contractive map can be used in conjunction with another artificial neural network, the ‘spin-net,’ as a dynamic form of auto-associative memory.

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

## A Deluge of Content

Over the last two decades, the accessibility of media has increased dramatically. One of the main reasons for this surge in available content is the evolution of online streaming platforms. Services like Netflix, Hulu, Amazon Prime, and others enable access for millions of consumers to a seemingly countless number of movies and television shows. With so many choices now available, it is more important than ever to have effective ways to filter all the possible titles and find options that best match your interests. Many streaming services have implemented functionality to suggest possible titles of interests for their users. Without these guiding tools, deciding what to watch next can seem like a monstrous undertaking.

## The Goal

In my experience, the “you may also like”-style suggestion algorithms work well. However, I find that when I’m searching for something to watch, I want to know more than just whether or not I will like it. What kind of movie is it? What parts of the movie were done better than others? In short, I want to know what aspects of the movie suggest that I will enjoy it. This information is not provided by the suggestion functionality of the streaming platforms I have encountered. So I thought of a different way to approach the problem.

Movie critics work hard to produce reviews that help the general audience decide whether a movie is worth watching. A critic review can provide a much more detailed look into a movie’s worth than an algorithm simply telling you “watch this.” Still, the issue of individual critics’ personal taste and preference exists. Although general trends usually do arise, movie reviews can be highly subjective. How do I know I can trust a particular critic? Do they value the same qualities in a movie that I do? These uncertainties can compromise the validity of a review on a person to person basis. A critic’s reviews would carry more weight if I knew that our interests were similar. With this in mind, I decided to create an application that would match a user with the most similarly-minded critics. The user could then follow their matched critics to explore a range of movies they could potentially enjoy.

## The Data

I chose to extract movie review data from the movie and TV show review site Rotten Tomatoes (RT). RT is a great resource for anyone looking for a new TV show to watch or choosing between movies that are currently playing. The site provides ratings by both professional critics and average viewers to give a general idea of the quality of a film or show. RT also gives a certification of “Fresh” or “Rotten” so users can quickly get an idea of whether a title is worth checking out. I used the scrapy package in Python to write a script that would extract certain values from the RT site. From RT, I scraped certified critics’ reviews of the top 100 movies per year from 2000 to 2018. From the top 100 movies list, I only inspected those movies that were reviewed by 100 or more unique critics; the 100-critic mark seemed to be a threshold for widely recognizable movies. On the critic reviews page for each movie, if a critic gave a rating for the movie (some reviews only included a description), I scraped the critic’s name, the organization they belonged to, and the rating they gave. I also scraped the average rating for each of the movies.

## Data Cleaning

I used Python to clean the data that I scraped from RT. The main issue I encountered was that many different rating formats were used among the different critics. For example, here are some of the formats used:

• 3/5
• 3/4
• 6/10
• 60/100
• 3 out of 5
• A +
• A plus
• *** 1/2
• 3 stars
• Recommended vs. Not Recommended

In order to normalize the critics’ ratings so I could compare them, I decided to convert them all to fractional values between 0 and 1. The different rating formats made this conversion difficult. I wrote if-statements containing regular expressions for each format to match and extract the rating information. Then I was able to reassign each value as a fractional representation of the original rating (e.g. *** 1/2 = 3.5/5 = 0.7). Because there were many different format types, I only retained the ratings that had the most common formats. I also dropped the ratings that I did not think would add value to the matching algorithm, such as recommended vs. not recommended (too coarse of a rating). After the conversion, many ratings had more than five decimal places. For simplicity, I truncated them all to three decimal places.
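A sketch of this regex-per-format conversion might look like the following (illustrative Python with only a few of the formats; the actual cleaning script handled more):

```python
import re

def normalize_rating(raw):
    """Sketch: each branch matches one rating format and maps it to a
    fraction in [0, 1], truncated to three decimal places.
    """
    raw = raw.strip()
    if m := re.fullmatch(r"(\d+(?:\.\d+)?)\s*(?:/|out of)\s*(\d+)", raw):
        value = float(m.group(1)) / float(m.group(2))   # "3/5", "60/100", "3 out of 5"
    elif re.fullmatch(r"\*+(?:\s*1/2)?", raw):          # "*** 1/2" -> 3.5 stars out of 5
        stars = raw.count("*") + (0.5 if "1/2" in raw else 0)
        value = stars / 5
    elif m := re.fullmatch(r"(\d+(?:\.\d+)?)\s*stars?", raw):   # "3 stars"
        value = float(m.group(1)) / 5
    else:
        return None                                     # unsupported format: drop it
    # truncate (not round) to three decimals; the epsilon guards against
    # floating-point representation error (e.g. 0.7 * 1000 == 699.999...)
    return int(value * 1000 + 1e-9) / 1000
```

Formats that carry too little information (e.g. recommended vs. not recommended) fall through to the final branch and are dropped.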

## The App

The main focus of this project was to use web scraping to collect unorganized data and then analyze that data. Instead of exploring the RT data to find trends among movies, such as comparing reviews among different genres or movie release dates, I decided to use the data to help people find movies they might like. This was my thought process for the application’s functionality:

1. Have a user pick five movies they have seen
2. Have the user rate each movie according to how much they enjoyed it
3. Compare the user’s ratings to critic ratings and find the critics with the most similar ratings
4. Show the top critics ordered by closest match and show other movies those critics rated

In order to transform this theoretical process into a reality, I used the Shiny package in R to code an application that could ask the user for input and return a table of top matched critics for them to follow.

The user selects five movies they have seen.

The user rates the movies they selected based on how much they enjoyed them.

The matching algorithm is pretty straightforward, although I utilized two different methods. The first was to take the absolute value of the difference between the user’s scores and the critic’s scores and then sum the differences. I called this match score the “Absolute Distance Score” (the lower the score, the better the match). This approach was simple, but I found that it created some issues. For example, a critic that rated all five movies exactly one point off from the user’s ratings would be given the same match score as a critic that rated four movies exactly the same as the user but one movie five points off. In this example, the first critic rated each movie almost exactly the same as the user, while the second critic rated one movie very differently. I decided that in this type of situation the critics should not be given the same match score, so I implemented an additional method. I added a “Squared Distance Score,” which squares the difference between the user’s and critic’s ratings and sums them. This weights larger differences more heavily, so big mismatches increase the match score more. I included both score types in the table that shows the top matches; however, I sorted the table by lowest squared distance score. I hoped that allowing the user to see both scores would provide more insight into the meaning behind the match score.
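The two scoring methods can be sketched as follows (hypothetical Python; the app itself computes them in R):

```python
def match_scores(user, critic):
    """Sketch of the two match scores described above. `user` and `critic`
    map movie title -> rating in [0, 1]; only movies the critic reviewed
    are compared. Lower scores mean a closer match.
    """
    shared = [title for title in user if title in critic]
    diffs = [abs(user[t] - critic[t]) for t in shared]
    absolute = sum(diffs)                  # Absolute Distance Score
    squared = sum(d ** 2 for d in diffs)   # Squared Distance Score: big misses weigh more
    return absolute, squared, len(shared)  # review count qualifies the comparison
```

With the example from above, a critic uniformly 0.1 off on five movies ties a critic who is exact on four but 0.5 off on one under the absolute score (0.5 each), while the squared score separates them (0.05 vs. 0.25).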

An important complication arose in the case where a critic did not review all five of the user’s chosen movies. This could lead to misleading match scores in the match table. If one critic reviewed all five of the user’s movies, there is a greater chance that their match score will be higher than that of a critic who reviewed only three (five differences will be summed for the first critic and only three for the second). In the match table, two critics could have the same score, but perhaps only because data was missing for one or more of the movies. Because of this, I added a review count to the match table. I ordered the table by highest review count and selected the top five match scores for critics who rated five, four, or three of the user’s movies. I did not include critics who rated two or fewer of the user’s movies. I made it clear in the text above the tables that differences in match scores between review count groups should be taken with a grain of salt. Overall, the five-review-count group will give the most robust matches.

This table shows the user’s top matched critics based on the movies and ratings chosen.

These tables show other movies that the currently selected critic has reviewed, specifically those movies that the critic rated high or low.

Below the match table, I included two tables that show high- and low-rated movies for the currently selected critic. My goal here was to recommend movies to the user based on the opinion of their matched critics. The user can browse the list of highly rated movies to see which titles they might enjoy; conversely, the low-rated movie table lets the user know which movies to avoid.

The last tables in the application are something I was very excited to implement. When searching for a movie to watch, I generally only stumble upon movies that are universally loved or universally disliked, because those are the movies that receive the most publicity. Obviously there are many other movies out there that I might really enjoy, but it's hard to find the diamonds in the rough. The final tables in my application show the user which movies their critics rated very differently from the general public. The left table shows movies that the critic rated highly but that had a low average rating; these are the movies no one would ever think to recommend to you, but ones you might really enjoy based on your matched critic. On the other hand, the right table shows movies that the critic rated low but that had a high average rating; these are the movies that everyone keeps telling you to see, but you might be better off avoiding them. I think this is a very useful tool for finding movies that cater to your interests, and it can open up a whole new world of movies for you to see.
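A minimal sketch of how such divergence tables could be built, assuming per-movie critic and public average ratings on the same 0–10 scale (the threshold and names are invented for illustration, not taken from the app):

```python
# Flag movies where the selected critic diverges from the public average.
def divergent(reviews, threshold=3):
    """reviews: list of (title, critic_rating, average_rating) tuples."""
    # Critic high, public low: hidden gems nobody would recommend.
    hidden_gems = [r for r in reviews if r[1] - r[2] >= threshold]
    # Public high, critic low: popular movies your critic says to skip.
    overrated = [r for r in reviews if r[2] - r[1] >= threshold]
    return hidden_gems, overrated

reviews = [("X", 9, 5), ("Y", 4, 8), ("Z", 7, 7)]
gems, over = divergent(reviews)
print(gems)  # [('X', 9, 5)]
print(over)  # [('Y', 4, 8)]
```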

These tables show the movies that the currently selected critic rated very differently from the general public.

## Going Forward

I am really happy with how this application turned out, but there are still a few kinks I would like to work out.

1. In the match table, I include a column listing all the organisations that each critic has been associated with. I populate this column by aggregating all the unique organisations for a critic from the original data set and concatenating them with a “|” separator. While the matching calculation does not take very long, building this organisation column more than doubles the time required to create the table. I would like to add a few lines to the Python cleaning code to replace each organisation value in the original data set with all of the critic's organisations, so that this calculation does not need to be repeated each time a user submits a new request.
2. Due to my filtering criteria during scraping, I only included movies that were reviewed by 100 or more critics. Additionally, the list of movies only includes the top 100 movies from each year between 2000 and 2018.  I have not explored how loosening these criteria would affect the data set size or how it would affect the match calculation time, but I would like to add more options in the future for users to choose from (and also to potentially make the algorithm more accurate by adding more uncommon movies).
3. Also due to the criteria above, the movies in the data set are necessarily popular. This gave rise to a complication in the “Critic Rating High / Average Rating Low” table. Because all of the movies are popular, there are few to no movies with average ratings below roughly 6/10. This somewhat defeats the purpose of the table, which is supposed to introduce the user to unknown or generally disliked movies. On RT, each critic has their own page listing every review they have submitted, along with RT's Tomatometer rating (the percentage of critics who gave the movie a “Fresh” rating). In the future, I would like to scrape each critic's page and use this second data set to populate the two critic-vs.-average tables. This should give the user a more exhaustive list of movies they should consider seeing or avoiding.
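The one-time pre-aggregation described in point 1 could be sketched in the Python cleaning code roughly like this. The record layout (`critic`, `organisation` fields) is an assumption about the cleaned data set, not the app's actual schema:

```python
from collections import defaultdict

# Illustrative records; the real data set comes from the RT scrape.
records = [
    {"critic": "ann", "organisation": "NYT"},
    {"critic": "ann", "organisation": "Vulture"},
    {"critic": "bob", "organisation": "NYT"},
    {"critic": "ann", "organisation": "NYT"},
]

# Collect each critic's unique organisations once, up front.
orgs = defaultdict(set)
for rec in records:
    orgs[rec["critic"]].add(rec["organisation"])

# Join with "|" and write the result back onto every record, so the app
# no longer has to rebuild the column on each user request.
joined = {critic: "|".join(sorted(names)) for critic, names in orgs.items()}
for rec in records:
    rec["organisations"] = joined[rec["critic"]]

print(joined["ann"])  # NYT|Vulture
```

Doing this once during cleaning moves the cost out of the request path, which is the speed-up point 1 is after.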

## Hey, Thanks!

Thank you for taking the time to read about my project! I was really excited to create this application because I think it serves a real purpose and can be helpful for a large group of people. Feel free to check out my application, and please let me know if it helped you find a movie you enjoyed! Here is a link to my GitHub repository if you would like to explore my code.


### R Packages worth a look

Binary Indexed Tree (rbit)
A simple implementation of Binary Indexed Tree by R. The BinaryIndexedTree class supports construction of Binary Indexed Tree from a vector, update of …

Standard Young Tableaux (syt)
Deals with standard Young tableaux (field of combinatorics). Performs enumeration, counting, random generation, the Robinson-Schensted correspondence, …

Estimation and Prediction for Remote Effects Spatial Process Models (telefit)
Implementation of the remote effects spatial process (RESP) model for teleconnection. The RESP model is a geostatistical model that allows a spatially- …

Interactive Visualization of Bayesian Networks (bnviewer)
Interactive visualization of Bayesian Networks. The ‘bnviewer’ package reads various structure learning algorithms provided by the ‘bnlearn’ package an …