# My Data Science Blogs

## September 26, 2018

### Distilled News

Microsoft, Adobe and SAP understand the customer experience is no longer a sales management conversation. CEOs are breaking down the silos of the status quo so they can get all people inside their companies focused on serving people outside their companies. With the Open Data Initiative, we will help businesses run with a true single view of the customer.
Learn about the basics of feature selection and how to implement and investigate various feature selection techniques in Python
In this tutorial, you will learn what data engineering entails along with learning about our future data engineering course offerings.
Prophet is a forecasting tool developed by Facebook to quickly forecast time series data, available in R and Python. In this post I’ll walk you through a quick example of how to forecast U.S. candy sales using Prophet and Python.
If software ate the world, models will run it. But are we ready to be controlled by black-box intelligent software? Probably not. And this is fair. We, as humans, need to understand how AI works – especially when it drives our behaviours or businesses. That’s why in a previous post, we spotted machine learning transparency as one of the hottest AI trends. Let us walk through a brief history of machine learning model explainability – illustrated by real examples from our AI Claim Management solution for insurers.
As a research scientist at IBM, Malioutov spends part of his time building machine learning systems that solve difficult problems faced by IBM’s corporate clients. One such program was meant for a large insurance corporation. It was a challenging assignment, requiring a sophisticated algorithm. When it came time to describe the results to his client, though, there was a wrinkle. ‘We couldn’t explain the model to them because they didn’t have the training in machine learning.’ In fact, it may not have helped even if they were machine learning experts. That’s because the model was an artificial neural network, a program that takes in a given type of data – in this case, the insurance company’s customer records – and finds patterns in them. These networks have been in practical use for over half a century, but lately they’ve seen a resurgence, powering breakthroughs in everything from speech recognition and language translation to Go-playing robots and self-driving cars.
Automated machine learning is a rapidly developing segment of artificial intelligence – it’s time to define what an AutoML product is so end-users can compare product capabilities intelligently.
In the second part of this blog series, I showed how to compute spatial kernel density estimates based on area-level data. The Kernelheaping package also supports boundary-corrected kernel density estimation, which allows us to exclude certain areas where we know that the density must be zero. One example is estimating the population density, where we would like to exclude uninhabited areas such as lakes, forests, and parks. The Kernelheaping package employs a boundary correction method, where each single kernel is restricted to the area of interest.
This post assumes basic knowledge of Artificial Neural Network (ANN) architecture, also called fully connected networks (FCN). These notes were originally made for myself. They will benefit others who have already taken Course 4 and quickly want to brush up during interviews, or who need help with theory when getting stuck in development. They are not supposed to cover everything from scratch. Hence, for someone who has not taken the course, the content might look daunting and might scare them away from Deep Learning. My suggestion is not to read beyond 2 if you haven't taken the course.
Designing deep neural nets can be a painful task considering how many parameters are involved, and no general formula seems to fit all use cases. We can use CNNs for image classification and LSTMs for NLP-related tasks, but the number of features, size of features, number of neurons, number of hidden layers, choice of activation functions, initialization of weights, etc. will still vary across use cases.
In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb.
Recently I started working on a speech classification problem. As I know very little about speech/audio processing, I had to recap the very basics. In this post, I want to go over some of the things I learned. For this purpose, I want to work on the ‘speech MNIST’ dataset, i.e., a set of recorded spoken digits.
Difference between NLP, NLU, NLG and the possible things which can be achieved when implementing an NLP engine for chatbots.
Previously released under the name of SQL Operations Studio, Azure Data Studio offers a modern editor experience for managing data across multiple sources with fast intellisense, code snippets, source control integration, and an integrated terminal. Azure Data Studio is engineered with the data platform user in mind, with built-in charting of query result-sets and customizable dashboards. Azure Data Studio is complementary to SQL Server Management Studio with experiences around query editing and data development, while SQL Server Management Studio still offers the broadest range of administrative functions, and remains the flagship tool for platform management tasks. Azure Data Studio will continue to be updated on a monthly basis and currently offers built-in support for SQL Server on-premises and Azure SQL Database, along with preview support for Azure SQL Managed Instance, Azure SQL Data Warehouse, and SQL Server 2019 Big Data.
When working on a supervised machine learning problem with a given data set, we try different algorithms and techniques to search for models to produce general hypotheses, which then make the most accurate predictions possible about future instances. The same principles apply to text (or document) classification, where many models can be used to train a text classifier. The answer to the question ‘What machine learning model should I use?’ is always ‘It depends.’ Even the most experienced data scientists can’t tell which algorithm will perform best before experimenting with them. This is what we are going to do today: use everything that we have presented about text classification in the previous articles (and more) and compare the text classification models we trained in order to choose the most accurate one for our problem.
Hi and welcome to an Illustrated Guide to LSTM’s and GRU’s. I’m Michael, and I’m a Machine Learning Engineer in the AI voice assistant space. In this post, we’ll start with the intuition behind LSTM’s and GRU’s. Then I’ll explain the internal mechanisms that allow LSTM’s and GRU’s to perform so well. If you want to understand what’s happening under the hood for these two networks, then this post is for you. You can also watch the video version of this post on youtube if you prefer.

### Python could become the world’s most popular coding language

But its rivals are unlikely to disappear

### R Packages worth a look

R Bindings for ‘Selenium WebDriver’ (RSelenium)
Provides a set of R bindings for the ‘Selenium 2.0 WebDriver’ (see <https://… …

An Interface to the ‘DescribeDisplay’ ‘GGobi’ Plugin (DescribeDisplay)
Produce publication quality graphics from output of ‘GGobi’ describe display plugin.

Surrogate Survival ROC (surrosurvROC)
Nonparametric and semiparametric estimations of the time-dependent ROC curve for an incomplete failure time data with surrogate failure time endpoints.

### Understanding Regression Error Metrics

Human brains are built to recognize patterns in the world around us. For example, we observe that if we practice our programming every day, our related skills grow. But how do we precisely describe this relationship to other people? How can we describe how strong this relationship is? Luckily, we can describe relationships between phenomena, such as practice and skill, in terms of formal mathematical estimations called regressions.

Regressions are one of the most commonly used tools in a data scientist's kit. When you learn Python or R, you gain the ability to create regressions in single lines of code without having to deal with the underlying mathematical theory. But this ease can cause us to forget to evaluate our regressions to ensure that they are a sufficient representation of our data.

We can plug our data back into our regression equation to see if the predicted outputs match the corresponding observed values in the data. The quality of a regression model is determined by how well its predictions match up against actual values, but how do we actually evaluate quality?

Luckily, smart statisticians have developed error metrics to judge the quality of a model and enable us to compare regressions against other regressions with different parameters. These metrics are short and useful summaries of the quality of our model. This article will dive into four common regression metrics and discuss their use cases.

There are many types of regression, but this article will focus exclusively on metrics related to linear regression. Linear regression is the most commonly used model in research and business and is the simplest to understand, so it makes sense to start developing your intuition on how it is assessed. The intuition behind many of the metrics we'll cover here extends to other types of models and their respective metrics.

If you'd like a quick refresher on the linear regression, you can consult this fantastic blog post or the Linear Regression Wiki page.

# A primer on linear regression

In the context of regression, models refer to mathematical equations used to describe the relationship between two variables. In general, these models deal with prediction and estimation of values of interest in our data called outputs. Models will look at other aspects of the data called inputs that we believe affect the outputs, and use them to generate estimated outputs. These inputs and outputs have many names that you may have heard before. Inputs can also be called independent variables or predictors, while outputs are also known as responses or dependent variables. Simply speaking, models are just functions where the outputs are some function of the inputs.

The linear part of linear regression refers to the fact that a linear regression model is described mathematically in the form:
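The equation itself appears as an image in the original post; in standard notation, with $y$ the output, $x_1, \dots, x_p$ the inputs, $\beta_0, \dots, \beta_p$ the coefficients, and $\epsilon$ the error term discussed below, the form it refers to is:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon
```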

If that looks too mathematical, take solace in that linear thinking is particularly intuitive. If you've ever heard of "practice makes perfect," then you know that more practice means better skills; there is some linear relationship between practice and perfection.

The regression part of linear regression does not refer to some return to a lesser state. Regression here simply refers to the act of estimating the relationship between our inputs and outputs. In particular, regression deals with the modelling of continuous values (think: numbers) as opposed to discrete states (think: categories).

Taken together, a linear regression creates a model that assumes a linear relationship between the inputs and outputs. The higher the inputs are, the higher (or lower, if the relationship is negative) the outputs are.

The strength and direction of the relationship between the inputs and outputs are governed by our coefficients. The first coefficient, the one without an input, is called the intercept, and it adjusts what the model predicts when all your inputs are 0.

We will not delve into how these coefficients are calculated, but know that there exists a method to calculate the optimal coefficients, given which inputs we want to use to predict the output. Given the coefficients, if we plug in values for the inputs, the linear regression will give us an estimate for what the output should be.

As we'll see, these outputs won't always be perfect. Unless our data is a perfectly straight line, our model will not precisely hit all of our data points. One of the reasons for this is the ϵ (named "epsilon") term. This term represents error that comes from sources out of our control, causing the data to deviate slightly from their true position.

Our error metrics will be able to judge the differences between prediction and actual values, but we cannot know how much the error has contributed to the discrepancy. While we cannot ever completely eliminate epsilon, it is useful to retain a term for it in a linear model.

# Comparing model predictions against reality

Since our model will produce an output given any input or set of inputs, we can then check these estimated outputs against the actual values that we tried to predict. We call the difference between the actual value and the model's estimate a residual. We can calculate the residual for every point in our data set, and each of these residuals will be of use in assessment. These residuals will play a significant role in judging the usefulness of a model.
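As a minimal sketch with made-up numbers (the values below are illustrative, not taken from the sales data), residuals are just the element-wise differences between actual values and estimates:

```python
# Hypothetical actual values and model estimates (illustrative only)
actual = [10.0, 12.0, 9.0]
predicted = [11.0, 11.5, 10.5]

# Residual = actual value minus the model's estimate, one per data point
residuals = [a - p for a, p in zip(actual, predicted)]
print(residuals)  # [-1.0, 0.5, -1.5]
```

Small residuals (in magnitude) mean the model's estimates sit close to the observed data.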

If our collection of residuals is small, it implies that the model that produced them does a good job at predicting our output of interest. Conversely, if these residuals are generally large, it implies that the model is a poor estimator.

We technically can inspect all of the residuals to judge the model's accuracy, but unsurprisingly, this does not scale if we have thousands or millions of data points. Thus, statisticians have developed summary measurements that take our collection of residuals and condense them into a single value that represents the predictive ability of our model.

There are many of these summary statistics, each with their own advantages and pitfalls. For each, we'll discuss what it represents, the intuition behind it, and its typical use case. We'll cover:

• Mean Absolute Error
• Mean Square Error
• Mean Absolute Percentage Error
• Mean Percentage Error

Note: Even though you see the word error here, it does not refer to the epsilon term from above! The error described in these metrics refers to the residuals!

# Staying rooted in real data

In discussing these error metrics, it is easy to get bogged down by the various acronyms and equations used to describe them. To keep ourselves grounded, we'll use a model that I've created using the Video Game Sales Data Set from Kaggle.

The specifics of the model I've created are shown below.

My regression model takes in two inputs (critic score and user score), so it is a multiple variable linear regression. The model took in my data and found that 0.039 and -0.099 were the best coefficients for the inputs. For my model, I chose my intercept to be zero since I'd like to imagine there'd be zero sales for scores of zero. Thus, the intercept term is crossed out. Finally, the error term is crossed out because we do not know its true value in practice. I have shown it because it depicts a more detailed description of what information is encoded in the linear regression equation.
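Written out with the coefficients reported above (and with the intercept and error term dropped, as described), the fitted model is:

```latex
\widehat{\text{sales}} = 0.039 \times \text{critic score} - 0.099 \times \text{user score}
```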

## Rationale behind the model

Let's say that I'm a game developer who just created a new game, and I want to know how much money I will make. I don't want to wait, so I developed a model that predicts total global sales (my output) based on an expert critic's judgment of the game and general player judgment (my inputs). If both critics and players love the game, then I should make more money... right? When I actually get the critic and user reviews for my game, I can predict how much glorious money I'll make.

Currently, I don't know if my model is accurate or not, so I need to calculate my error metrics to check if I should perhaps include more inputs or if my model is even any good!

# Mean absolute error

The mean absolute error (MAE) is the simplest regression error metric to understand. We'll calculate the residual for every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out. We then take the average of all these residuals. Effectively, MAE describes the typical magnitude of the residuals. If you're unfamiliar with the mean, you can refer back to this article on descriptive statistics. The formal equation is shown below:
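The formal equation appears as an image in the original post; in standard notation, with $y_i$ the actual values, $\hat{y}_i$ the model's predictions, and $n$ the number of data points, it is:

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```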

The picture below is a graphical description of the MAE. The green line represents our model's predictions, and the blue points represent our data.

The MAE is also the most intuitive of the metrics since we're just looking at the absolute difference between the data and the model's predictions. Because we use the absolute value of the residual, the MAE does not indicate underperformance or overperformance of the model (whether or not the model under or overshoots actual data). Each residual contributes proportionally to the total amount of error, meaning that larger errors will contribute linearly to the overall error.

Like we've said above, a small MAE suggests the model is great at prediction, while a large MAE suggests that your model may have trouble in certain areas. A MAE of 0 means that your model is a perfect predictor of the outputs (but this will almost never happen).

While the MAE is easily interpretable, using the absolute value of the residual often is not as desirable as squaring this difference. Depending on how you want your model to treat outliers, or extreme values, in your data, you may want to bring more attention to these outliers or downplay them. The issue of outliers can play a major role in which error metric you use.

## Calculating MAE against our model

Calculating MAE is relatively straightforward in Python. In the code below, sales contains a list of all the sales numbers, and X contains a list of tuples of size 2. Each tuple contains the critic score and user score corresponding to the sale at the same index. The variable lm contains a LinearRegression object from scikit-learn, which I used to create the model itself. This object also contains the coefficients. The predict method takes in inputs and gives the actual prediction based on those inputs.

```
# Perform the initial fitting to get the LinearRegression object
from sklearn import linear_model
lm = linear_model.LinearRegression()
lm.fit(X, sales)

mae_sum = 0
for sale, x in zip(sales, X):
    prediction = lm.predict([x])[0]  # predict expects a 2D array of inputs
    mae_sum += abs(sale - prediction)
mae = mae_sum / len(sales)

print(mae)
>>> 0.7602603
```


Our model's MAE is 0.760, which is fairly small given that our data's sales range from 0.01 to about 83 (in millions).

# Mean square error

The mean square error (MSE) is just like the MAE, but squares the difference before summing them all instead of using the absolute value. We can see this difference in the equation below.
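The equation is an image in the original post; written with the same notation as the MAE, the only change is that each residual is squared rather than taken in absolute value:

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```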

## Consequences of the Square Term

Because we are squaring the difference, the MSE will almost always be bigger than the MAE. For this reason, we cannot directly compare the MAE to the MSE. We can only compare our model's error metrics to those of a competing model.

The effect of the square term in the MSE equation is most apparent with the presence of outliers in our data. While each residual in MAE contributes proportionally to the total error, the error grows quadratically in MSE. This ultimately means that outliers in our data will contribute to much higher total error in the MSE than they would in the MAE. Similarly, our model will be penalized more for making predictions that differ greatly from the corresponding actual value. This is to say that large differences between actual and predicted values are punished more in MSE than in MAE. The following picture graphically demonstrates what an individual residual in the MSE might look like.

Outliers will produce these quadratically larger differences, and it is our job to judge how we should approach them.
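A minimal sketch with made-up residuals shows how a single outlier dominates the MSE while contributing only linearly to the MAE:

```python
# Hypothetical residuals: three typical errors and one outlier
errors = [1.0, 1.0, 1.0, 10.0]

mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)

print(mae)  # 3.25  -- the outlier contributes linearly to the total
print(mse)  # 25.75 -- the outlier's square (100) dominates the total
```

Without the outlier, both metrics would be 1.0; one extreme residual quadruples the MAE but multiplies the MSE by roughly 26.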

## The problem of outliers

Outliers in our data are a constant source of discussion for the data scientists that try to create models. Do we include the outliers in our model creation or do we ignore them? The answer to this question depends on the field of study, the data set at hand, and the consequences of having errors in the first place.

For example, I know that some video games achieve a superstar status and thus have disproportionately higher earnings. Therefore, it would be foolish of me to ignore these outlier games because they represent a real phenomenon within the data set. I would want to use the MSE to ensure that my model takes these outliers into account more. If I wanted to downplay their significance, I would use the MAE since the outlier residuals won't contribute as much to the total error as MSE.

Ultimately, the choice between MSE and MAE is application-specific and depends on how you want to treat large errors. Both are still viable error metrics, but will describe different nuances about the prediction errors of your model.

## A note on MSE and a close relative

Another error metric you may encounter is the root mean squared error (RMSE). As the name suggests, it is the square root of the MSE. Because the MSE is squared, its units do not match that of the original output. Researchers will often use RMSE to convert the error metric back into similar units, making interpretation easier.

Since the MSE and RMSE both square the residual, they are similarly affected by outliers. The RMSE is analogous to the standard deviation (MSE to variance) and is a measure of how spread out your residuals are.
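As a small illustration with hypothetical residuals (made-up values, not from the sales model), the RMSE is just the square root of the MSE, which puts the error back in the output's own units:

```python
import math

# Hypothetical residuals (illustrative only)
residuals = [0.5, -1.2, 2.0, -0.3, 0.8]

mse = sum(r ** 2 for r in residuals) / len(residuals)
rmse = math.sqrt(mse)  # same units as the output, like a standard deviation

print(round(mse, 3))   # 1.284
print(round(rmse, 3))  # 1.133
```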

Both MAE and MSE can range from 0 to positive infinity, so as both of these measures get higher, it becomes harder to interpret how well your model is performing. Another way we can summarize our collection of residuals is by using percentages so that each prediction is scaled against the value it's supposed to estimate.

## Calculating MSE against our model

Like MAE, we'll calculate the MSE for our model. Thankfully, the calculation is just as simple as MAE.

```
mse_sum = 0
for sale, x in zip(sales, X):
    prediction = lm.predict([x])[0]
    mse_sum += (sale - prediction) ** 2
mse = mse_sum / len(sales)

print(mse)
>>> 3.53926581
```


With the MSE, we would expect it to be much larger than MAE due to the influence of outliers. We find that this is the case: the MSE is an order of magnitude higher than the MAE. The corresponding RMSE would be about 1.88, indicating that our model misses actual sale values by about $1.8M.

# Mean absolute percentage error

The mean absolute percentage error (MAPE) is the percentage equivalent of MAE. The equation looks just like that of MAE, but with adjustments to convert everything into percentages.

Just as MAE is the average magnitude of error produced by your model, the MAPE is how far the model's predictions are off from their corresponding outputs on average. Like MAE, MAPE also has a clear interpretation since percentages are easier for people to conceptualize. Both MAPE and MAE are robust to the effects of outliers thanks to the use of absolute value.

However, for all of its advantages, we are more limited in using MAPE than we are MAE. Many of MAPE's weaknesses actually stem from its use of the division operation. Now that we have to scale everything by the actual value, MAPE is undefined for data points where the value is 0. Similarly, the MAPE can grow unexpectedly large if the actual values are exceptionally small themselves. Finally, the MAPE is biased towards predictions that are systematically less than the actual values themselves. That is to say, MAPE will be lower when the prediction is lower than the actual compared to a prediction that is higher by the same amount. The quick calculation below demonstrates this point.

We have a measure similar to MAPE in the form of the mean percentage error. While the absolute value in MAPE eliminates any negative values, the mean percentage error incorporates both positive and negative errors into its calculation.
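The quick calculation referenced above appears as an image in the original post; a sketch of the same point with made-up numbers is below. Because each error is scaled by the actual value, the same 50-unit miss costs more in MAPE when the model overshoots a small actual than when it undershoots a large one:

```python
def ape(actual, prediction):
    """Absolute percentage error for a single data point."""
    return abs(actual - prediction) / actual

# Over-prediction: actual is 50, model predicts 100 (off by 50 units)
print(ape(50, 100))  # 1.0 -> a 100% error
# Under-prediction: actual is 100, model predicts 50 (also off by 50 units)
print(ape(100, 50))  # 0.5 -> only a 50% error
```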
## Calculating MAPE against our model

```
mape_sum = 0
for sale, x in zip(sales, X):
    prediction = lm.predict([x])[0]
    mape_sum += abs(sale - prediction) / sale
mape = mape_sum / len(sales)

print(mape)
>>> 5.68377867
```

We know for sure that there are no data points for which there are zero sales, so we are safe to use MAPE. Remember that we must interpret it in terms of percentage points. MAPE states that our model's predictions are, on average, 5.6% off from the actual values.

# Mean percentage error

The mean percentage error (MPE) equation is exactly like that of MAPE. The only difference is that it lacks the absolute value operation.

Even though the MPE lacks the absolute value operation, it is actually its absence that makes MPE useful. Since positive and negative errors will cancel out, we cannot make any statements about how well the model predictions perform overall. However, if there are more negative or positive errors, this bias will show up in the MPE. Unlike MAE and MAPE, MPE is useful to us because it allows us to see if our model systematically underestimates (more negative error) or overestimates (positive error).

If you're going to use a relative measure of error like MAPE or MPE rather than an absolute measure of error like MAE or MSE, you'll most likely use MAPE. MAPE has the advantage of being easily interpretable, but you must be wary of data that will work against the calculation (i.e. zeroes). You can't use MPE in the same way as MAPE, but it can tell you about systematic errors that your model makes.

## Calculating MPE against our model

```
mpe_sum = 0
for sale, x in zip(sales, X):
    prediction = lm.predict([x])[0]
    mpe_sum += (sale - prediction) / sale
mpe = mpe_sum / len(sales)

print(mpe)
>>> -4.77081497
```

All the other error metrics have suggested to us that, in general, the model did a fair job at predicting sales based off of critic and user scores. However, the MPE indicates to us that it actually systematically underestimates the sales.
Knowing this aspect about our model is helpful to us since it allows us to look back at the data and reiterate on which inputs to include that may improve our metrics. Overall, I would say that my assumptions in predicting sales were a good start. The error metrics revealed trends that would have been unclear or unseen otherwise.

# Conclusion

We've covered a lot of ground with the four summary statistics, but remembering them all correctly can be confusing. The table below gives a quick summary of the acronyms and their basic characteristics.

| Acronym | Full Name | Residual Operation? | Robust To Outliers? |
| --- | --- | --- | --- |
| MAE | Mean Absolute Error | Absolute Value | Yes |
| MSE | Mean Squared Error | Square | No |
| RMSE | Root Mean Squared Error | Square | No |
| MAPE | Mean Absolute Percentage Error | Absolute Value | Yes |
| MPE | Mean Percentage Error | N/A | Yes |

All of the above measures deal directly with the residuals produced by our model. For each of them, we use the magnitude of the metric to decide if the model is performing well. Small error metric values point to good predictive ability, while large values suggest otherwise. That being said, it's important to consider the nature of your data set when choosing which metric to present. Outliers may change your choice in metric, depending on whether you'd like to give them more significance in the total error. Some fields may just be more prone to outliers, while others may not see them so much.

In any field, though, having a good idea of what metrics are available to you is always important. We've covered a few of the most common error metrics used, but there are others that also see use. The metrics we covered use the mean of the residuals, but the median residual also sees use. As you learn other types of models for your data, remember the intuition we developed behind our metrics and apply them as needed.

# Further Resources

If you'd like to explore the linear regression more, Dataquest offers an excellent course on its use and application!
We used scikit-learn to apply the error metrics in this article, so you can read the docs to get a better look at how to use them!

### Job opening at CDC

“The Statistician will play a central role in guiding the statistical methods of all major projects of the Epidemiology and Prevention Branch of the CDC Influenza Division, and aid in designing, analyzing, and interpreting research intended to understand the burden of influenza in the US and internationally and identify the best influenza vaccines and vaccine strategies.”

This sounds super interesting:

Vacancy Information: Mathematical Statistician, GS-1529-14

Please apply at one of the following:

· DE (External candidates to the US GOV) Announcement: HHS-CDC-D3-18-10312897

· MP (Internal candidates to the US GOV) Announcement: HHS-CDC-M3-18-10312898

Location: Atlanta, GA – Centers for Disease Control and Prevention – National Center for Immunization and Respiratory Disease – Influenza Division – Epidemiology and Prevention Branch

Salary: $108,281 to $140,765 per year

Position summary: The Statistician will play a central role in guiding the statistical methods of all major projects of the Epidemiology and Prevention Branch of the CDC Influenza Division, and aid in designing, analyzing, and interpreting research intended to understand the burden of influenza in the US and internationally and identify the best influenza vaccines and vaccine strategies. This new position is expected to bring novel solutions and innovative methods to research challenges in the influenza burden, vaccine immunogenicity and effectiveness, and antiviral effectiveness fields. Working closely with Branch leadership, the position will advise across five research and surveillance teams and a >$25 million annual research portfolio. Specifically, the Statistician will:

– Design, develop, and adapt mathematical methods and techniques to statistical processes to advance public health program research methods.

– Provide assistance in logistic regression analysis, categorical data analysis, multiple regression analysis, and mixed model techniques.

– Perform analysis of research studies utilizing statistical packages and programming languages.

– Write and present comprehensive statistical reports to provide technical advice and consultation to public health professionals, senior scientists, and management officials.

– Develop, implement, and coordinate national health interview survey segments covering various health related issues.

– Conduct analyses and evaluations to determine the suitability and adequacy of data collected, and adapt procedures independently as needed.

– Occasional travel required.

Basic Qualification Requirements:

A degree that included 24 semester hours of mathematics and statistics, of which at least 12 semester hours were in mathematics and 6 semester hours were in statistics -OR- a combination of education and experience — at least 24 semester hours of mathematics and statistics, including at least 12 hours in mathematics and 6 hours in statistics, as shown above, plus appropriate experience or additional education. In addition to meeting the basic requirements above, applicants must also have at least one year of specialized experience at or equivalent to the GS-13 grade level in the Federal service as defined as “…experience which is directly related to the position which has equipped the applicant with the particular knowledge, skills and abilities (KSAs) to successfully perform the duties of the position, to include experience administering and providing professional consultation in the application of statistical approaches for the study of infectious diseases and vaccine or antiviral effectiveness.”

### Document worth reading: “Human-Machine Inference Networks For Smart Decision Making: Opportunities and Challenges”

The emerging paradigm of Human-Machine Inference Networks (HuMaINs) combines complementary cognitive strengths of humans and machines in an intelligent manner to tackle various inference tasks and achieves higher performance than either humans or machines by themselves. While inference performance optimization techniques for human-only or sensor-only networks are quite mature, HuMaINs require novel signal processing and machine learning solutions. In this paper, we present an overview of the HuMaINs architecture with a focus on three main issues that include architecture design, inference algorithms including security/privacy challenges, and application areas/use cases. Human-Machine Inference Networks For Smart Decision Making: Opportunities and Challenges

### If you did not already know

Distributed Data Shuffling
Data shuffling of training data among different computing nodes (workers) has been identified as a core element to improve the statistical performance of modern large-scale machine learning algorithms. Data shuffling is often considered one of the most significant bottlenecks in such systems due to the heavy communication load. Under a master-worker architecture (where a master has access to the entire dataset and only communication between the master and workers is allowed), coding has recently been proved to considerably reduce the communication load. In this work, we consider a different communication paradigm referred to as distributed data shuffling, where workers, connected by a shared link, are allowed to communicate with one another while no communication between the master and workers is allowed. Under the constraint of uncoded cache placement, we first propose a general coded distributed data shuffling scheme, which achieves the optimal communication load within a factor of two. Then, we propose an improved scheme achieving exact optimality for either large memory sizes or at most four workers in the system. …

Deep Graph Translation
Inspired by the tremendous success of deep generative models on generating continuous data like images and audio, in recent years a few deep graph generative models have been proposed to generate discrete data such as graphs. These are typically unconditioned generative models with no control over the modes of the graphs being generated. In contrast, in this paper we are interested in a new problem named \emph{Deep Graph Translation}: given an input graph, we want to infer a target graph based on their underlying (both global and local) translation mapping. Graph translation could be highly desirable in many applications such as disaster management and rare event forecasting, where rare and abnormal graph patterns (e.g., traffic congestions and terrorism events) are inferred prior to their occurrence, even without historical data on the abnormal patterns for this graph (e.g., a road network or human contact network). To achieve this, we propose a novel Graph-Translation-Generative Adversarial Network (GT-GAN) which generates a graph translator from input to target graphs. GT-GAN consists of a graph translator, for which we propose new graph convolution and deconvolution layers to learn the global and local translation mapping, and a new conditional graph discriminator that classifies target graphs by conditioning on input graphs. Extensive experiments on multiple synthetic and real-world datasets demonstrate the effectiveness and scalability of the proposed GT-GAN. …

Kriging
In statistics, originally in geostatistics, Kriging or Gaussian process regression is a method of interpolation for which the interpolated values are modeled by a Gaussian process governed by prior covariances, as opposed to a piecewise-polynomial spline chosen to optimize smoothness of the fitted values. Under suitable assumptions on the priors, Kriging gives the best linear unbiased prediction of the intermediate values. Interpolating methods based on other criteria such as smoothness need not yield the most likely intermediate values. The method is widely used in the domain of spatial analysis and computer experiments. The technique is also known as Wiener-Kolmogorov prediction (after Norbert Wiener and Andrey Kolmogorov). The theoretical basis for the method was developed by the French mathematician Georges Matheron based on the Master’s thesis of Danie G. Krige, the pioneering plotter of distance-weighted average gold grades at the Witwatersrand reef complex in South Africa. Krige sought to estimate the most likely distribution of gold based on samples from a few boreholes. The English verb is to krige and the most common noun is Kriging; both are often pronounced with a hard ‘g’, following the pronunciation of the name ‘Krige’.
Spatio-Temporal Kriging in R
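To make the Kriging predictor concrete, here is a minimal simple-kriging sketch (my illustration, not part of the linked tutorial): with a known zero mean and an assumed exponential covariance, the prediction is just the Gaussian-process conditional mean k₀ᵀK⁻¹y, computable in base R.

```r
# Simple-kriging sketch in base R: interpolate 1D data as a Gaussian process
# with exponential prior covariance k(h) = exp(-|h| / range) and zero mean.
krige1d <- function(x_obs, y_obs, x_new, range = 1) {
  K  <- exp(-abs(outer(x_obs, x_obs, "-")) / range)  # covariance among observations
  k0 <- exp(-abs(outer(x_new, x_obs, "-")) / range)  # covariance of new vs observed
  drop(k0 %*% solve(K, y_obs))                       # BLUP: k0' K^{-1} y
}

x <- c(0, 1, 2.5, 4)
y <- sin(x)
krige1d(x, y, c(0.5, 3))  # predictions between the observation sites
```

Because Kriging is an exact interpolator, the prediction at any observed site reproduces the observation exactly; for drift terms, variogram fitting and spatio-temporal covariances one would turn to a dedicated package such as gstat, as in the tutorial linked above.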

## September 25, 2018

### Book Memo: “Simulating Business Processes for Descriptive, Predictive, and Prescriptive Analytics”

This book outlines the benefits and limitations of simulation, what is involved in setting up a simulation capability in an organization, the steps involved in developing a simulation model, and how to ensure model results are implemented. In addition, detailed example applications are provided to show where the tool is useful and what it can offer the decision maker. In Simulating Business Processes for Descriptive, Predictive, and Prescriptive Analytics, Andrew Greasley provides an in-depth discussion on

• Business process simulation and how it can enable business analytics
• How business process simulation can provide speed, cost, dependability, quality, and flexibility metrics
• Industrial case studies, including improving service delivery while ensuring an efficient use of staff in public sector organizations such as the police service, testing the capacity of planned production facilities in manufacturing, and ensuring on-time delivery in logistics systems
• State-of-the-art developments in business process simulation regarding the use of big data, simulating advanced services, and modeling people's behavior

Managers and decision makers will learn how simulation provides a faster, cheaper, and less risky way of observing the future performance of a real-world system. The book will also benefit personnel already involved in simulation development by providing a business perspective on managing the process of simulation, ensuring simulation results are implemented, and improving performance.

### Le Monde puzzle [#1068]

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

And here is the third Le Monde mathematical puzzle open for competition:

Consider for this puzzle only integers with no zero digits. Among these, an integer x=a¹a²a³… is refined if it is a multiple of its scion, the integer defined as x without the first digit, y=a²a³…. Find the largest refined integer x such that the sequence of the successive scions of x with more than one digit is entirely refined. An integer x=a¹a²a³… is distilled if it is a multiple of its grand-scion, the integer defined as x without the first two digits, z=a³…. Find the largest distilled integer x such that the sequence of the successive scions of x with more than two digits is entirely distilled.

Another puzzle amenable to an R resolution by low-tech exploration of the possible integers, first by finding refined integers, with no solutions between 10⁶ and 10⁸ [meaning there is no refined integer larger than 10⁶], and then by checking which refined integers have refined descendants, e.g.,

raf <- NULL
for (x in (1e1 + 1):(1e6 - 1)){
  y <- x %% 10^trunc(log(x, 10))  # scion: x without its first digit
  if (y > 0){
    if (x %% y == 0)
      raf <- c(raf, x)}}


that returns 95 refined integers. And then

for (i in length(raf):1){
  gason <- raf[i]
  keep <- gason %in% raf
  while (keep & (gason > 100)){
    gason <- gason %% 10^trunc(log(gason, 10))  # move to the scion
    keep <- keep & (gason %in% raf)}
  if (keep) break}


that returns 95,625 as the largest refined integer with the right descendants. Rather than finding all refined integers at once, it is actually faster to go one digit at a time and check which solutions have proper descendants.
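The digit-at-a-time strategy can be sketched as follows (my reconstruction, not the post's code). A (d+1)-digit integer a·10^d + y has scion y, so it is refined with fully refined scions exactly when y already qualifies and y divides a·10^d; each generation is therefore built from the previous one.

```r
# Build refined integers generation by generation: seed with the two-digit
# refined integers (no zero digit), then prepend a leading digit a whenever
# the current scion y divides a * 10^d.
cand   <- (11:99)[(11:99) %% 10 > 0]         # two-digit, last digit non-zero
scions <- cand[cand %% (cand %% 10) == 0]    # refined: multiples of their scion
best <- max(scions)
d <- 2
while (length(scions) > 0 && d < 8) {
  lead <- c(outer((1:9) * 10^d, scions, "+"))       # candidates a * 10^d + y
  keep <- c(outer((1:9) * 10^d, scions, "%%")) == 0 # y must divide a * 10^d
  scions <- lead[keep]
  if (length(scions) > 0) best <- max(best, max(scions))
  d <- d + 1
}
best  # largest refined integer with fully refined scions
```

The construction stops by itself once no integer can be extended, and recovers 95,625 without ever scanning the full range up to 10⁶.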

Similarly, running an exploration code up to 10⁹ produces 1,748 distilled integers, with maximum 997,734,375, so it is unlikely this is the right upper bound; among these, the maximum with the right distilled descendants is 81,421,875, as derived by

topsol <- 0
for (dig in 2:8){
  for (x in (10^dig + 1):(10^(dig + 1) - 1)){
    # keep only integers with no zero digit
    if (min(as.integer(substring(x, seq(nchar(x)), seq(nchar(x))))) > 0){
      y <- x %% 10^(dig - 1)          # grand-scion: x without its first two digits
      if ((y > 0) & (x %% y == 0)){   # x is distilled
        z <- x %% 10^dig              # first scion
        keep <- TRUE
        while (keep & (z > 99)){      # check scions with more than two digits
          u <- z %% 10^(trunc(log(z, 10)) - 1)  # grand-scion of the current scion
          keep <- (u > 0) & (z %% u == 0)
          z <- z %% 10^trunc(log(z, 10))}
        if (keep) topsol <- x}}}}
topsol


R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### U. of Zurich: Assistant Professorship in Interacting with Data (Non-tenure Track) [Zurich, Switzerland]

Candidates should have a Ph.D. in CS with a specialization in Interactive Data Analysis, Visual Analytics, Information Visualization, or related areas, and an excellent academic record. Apply by 30 Nov 2018.

### Beyond Refuge: Natural Language Understanding Engineer

Beyond Refuge is seeking a Natural Language Understanding Engineer passionate about social change and getting involved on a leadership level with a startup-like idea within an innovative, agile nonprofit.

### Magister Dixit

“AI won’t replace managers, but managers who use AI will replace those who don’t.” Erik Brynjolfsson, Andy McAfee (January 16, 2018)

### Machine Learning for Health #NIPS2018 workshop call for proposals

(This article was first published on Rstats – bayesianbiologist, and kindly contributed to R-bloggers)

The theme for this year’s workshop will be “Moving beyond supervised learning in healthcare”. This will be a great forum for those who work on computational solutions to the challenges facing clinical medicine. The submission deadline is Friday Oct 26, 2018. Hope to see you there!

https://ml4health.github.io/2018/pages/call-for-papers.html


### R developer’s guide to Azure

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you want to run R in the cloud, you can of course run it in a virtual machine in the cloud provider of your choice. And you can do that in Azure too. But Azure provides seven dedicated services that provide the ability to run R code, and you can learn all about them in the new R Developer's Guide to Azure at Microsoft Docs. The services include:

Click on the links above for detailed documentation on how to run R in each of these services. Like all Microsoft Docs, this guide is hosted on GitHub, so if you have suggestions for modifications or additions to this document, you can use the "Content Feedback" link to provide suggestions directly in the repository.

Microsoft Docs: R developer's guide to Azure


### Label line ends in time series with ggplot2

(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here with a quick share on making great use of the secondary y axis with ggplot2 – super helpful if you’re plotting groups of time series!

Here’s an example of what I want to show you how to create (pay attention to the numbers on the right):

## Setup

To set up, we’ll need the tidyverse package and the Orange data set that comes with R. This data set tracks the circumference growth of five orange trees over time.

library(tidyverse)

d <- Orange
head(d)

#> Grouped Data: circumference ~ age | Tree
#>   Tree  age circumference
#> 1    1  118            30
#> 2    1  484            58
#> 3    1  664            87
#> 4    1 1004           115
#> 5    1 1231           120
#> 6    1 1372           142


## Template code

To create the basic case where the numbers appear at the end of your time series lines, your code might look something like this:

# You have a data set with:
# - GROUP column
# - X column (say, time)
# - Y column (the values of interest)
DATA_SET

# Create a vector of the last (furthest-right) y-axis values for each group
DATA_SET_ENDS <- DATA_SET %>%
  group_by(GROUP) %>%
  top_n(1, X) %>%
  pull(Y)

# Create the plot with sec.axis
ggplot(DATA_SET, aes(X, Y, color = GROUP)) +
  geom_line() +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(sec.axis = sec_axis(~ ., breaks = DATA_SET_ENDS))


## Let’s see it!

Let’s break it down a bit. We already have our data set, where the group column is Tree, the X value is age, and the Y value is circumference.

So first get a vector of the last (furthest right) values for each group:

d_ends <- d %>%
  group_by(Tree) %>%
  top_n(1, age) %>%
  pull(circumference)

d_ends
#> [1] 145 203 140 214 177


Next, let’s set up the basic plot without the numbers to see how each layer adds up.

ggplot(d, aes(age, circumference, color = Tree)) +
  geom_line()


Now we can use scale_y_*, with the argument sec.axis to create a second axis on the right, with numbers to be displayed at breaks, defined by our vector of line ends:

ggplot(d, aes(age, circumference, color = Tree)) +
  geom_line() +
  scale_y_continuous(sec.axis = sec_axis(~ ., breaks = d_ends))


This is a great start. The only major addition I suggest is expanding the margins of the x-axis so the gap disappears. You do this with scale_x_* and the expand argument:

ggplot(d, aes(age, circumference, color = Tree)) +
  geom_line() +
  scale_y_continuous(sec.axis = sec_axis(~ ., breaks = d_ends)) +
  scale_x_continuous(expand = c(0, 0))


## Polishing it up

Like it? Here’s the code to recreate the first polished plot:

library(tidyverse)

d <- Orange %>%
  as_tibble()

d_ends <- d %>%
  group_by(Tree) %>%
  top_n(1, age) %>%
  pull(circumference)

d %>%
  ggplot(aes(age, circumference, color = Tree)) +
  geom_line(size = 2, alpha = .8) +
  theme_minimal() +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(sec.axis = sec_axis(~ ., breaks = d_ends)) +
  ggtitle("Orange trees getting bigger with age",
          subtitle = "Based on the Orange data set in R") +
  labs(x = "Days old", y = "Circumference (mm)", caption = "Plot by @drsimonj")


## Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.


### A Better Example of the Confused By The Environment Issue

Our interference-from-the-environment issue was a bit subtle. But there are variations that can be a bit more insidious.

library("dplyr")

# unrelated value that happens
# to be in our environment
z <- "y"

data.frame(x = 1, y = 2, z = 3) %>%
  select(-z)
#   x y
# 1 1 2

data.frame(x = 1, y = 2) %>% # oops, no "z"
  select(-z)
#   x
# 1 1

# notice column "y" was removed and
# no error or warning was signalled.


When the data.frame has a lot of columns and is coming from somewhere else (even as an argument to a function), we may not notice the column loss until very much later (making for hard debugging or even unreliable results).
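One way to guard against this (assuming a dplyr/tidyselect recent enough to provide all_of()) is to select with quoted column names, which errors on missing columns instead of silently falling back to the environment:

```r
library("dplyr")

z <- "y"  # unrelated value that happens to be in our environment

# all_of("z") insists that a column named "z" exists:
data.frame(x = 1, y = 2, z = 3) %>%
  select(-all_of("z"))
#   x y
# 1 1 2

# without the column we now get an error,
# instead of silently losing column "y"
try(data.frame(x = 1, y = 2) %>% select(-all_of("z")))
```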

### What’s new on arXiv

In this paper, we develop a privacy implementation for symbolic control systems. Such systems generate sequences of non-numerical data, and these sequences can be represented by words or strings over a finite alphabet. This work uses the framework of differential privacy, which is a statistical notion of privacy that makes it unlikely that privatized data will reveal anything meaningful about underlying sensitive data. To bring differential privacy to symbolic control systems, we develop an exponential mechanism that approximates a sensitive word using a randomly chosen word that is likely to be near it. The notion of ‘near’ is given by the Levenshtein distance, which counts the number of operations required to change one string into another. We then develop a Levenshtein automaton implementation of our exponential mechanism that efficiently generates privatized output words. This automaton has letters as its states, and this work develops transition probabilities among these states that give overall output words obeying the distribution required by the exponential mechanism. Numerical results are provided to demonstrate this technique for both strings of English words and runs of a deterministic transition system, demonstrating in both cases that privacy can be provided in this setting while maintaining a reasonable degree of accuracy.
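As an aside, the Levenshtein distance used by this mechanism is the classic edit distance (the minimum number of insertions, deletions and substitutions turning one string into another); a short dynamic-programming version in R, which base R's adist() also computes:

```r
# Levenshtein distance via dynamic programming
# (base R's adist() returns the same quantity)
lev <- function(a, b) {
  m <- nchar(a); n <- nchar(b)
  D <- matrix(0, m + 1, n + 1)
  D[, 1] <- 0:m; D[1, ] <- 0:n              # distances to/from the empty string
  A <- strsplit(a, "")[[1]]; B <- strsplit(b, "")[[1]]
  for (i in 1:m) for (j in 1:n)
    D[i + 1, j + 1] <- min(D[i, j + 1] + 1,           # deletion
                           D[i + 1, j] + 1,           # insertion
                           D[i, j] + (A[i] != B[j]))  # substitution (free on match)
  D[m + 1, n + 1]
}

lev("kitten", "sitting")  # 3
```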
Toxic online content has become a major issue in today’s world due to an exponential increase in the use of the internet by people of different cultures and educational backgrounds. Differentiating hate speech from offensive language is a key challenge in the automatic detection of toxic text content. In this paper, we propose an approach to automatically classify tweets on Twitter into three classes: hateful, offensive and clean. Using a Twitter dataset, we perform experiments considering n-grams as features and passing their term frequency-inverse document frequency (TFIDF) values to multiple machine learning models. We perform a comparative analysis of the models considering several values of n in n-grams and TFIDF normalization methods. After tuning the model giving the best results, we achieve 95.6% accuracy upon evaluating it on test data. We also create a module which serves as an intermediate between the user and Twitter.
The aim of this research is to introduce a novel structural design process that allows architects and engineers to extend their typical design space horizon, thereby promoting the idea of creativity in structural design. The theoretical base of this work builds on the combination of structural form-finding and state-of-the-art machine learning algorithms. In the first step of the process, Combinatorial Equilibrium Modelling (CEM) is used to generate a large variety of spatial networks in equilibrium for given input parameters. In the second step, these networks are clustered and represented in a form-map through the implementation of a Self Organizing Map (SOM) algorithm. In the third step, the solution space is interpreted with the help of a Uniform Manifold Approximation and Projection algorithm (UMAP). This allows gaining important insights into the structure of the solution space. A specific case study is used to illustrate how the infinite equilibrium states of a given topology can be defined and represented by clusters. Furthermore, three classes, related to the non-linear interaction between the input parameters and the form space, are verified and a statement about the entire manifold of the solution space of the case study is made. To conclude, this work presents an innovative approach to how the manifold of a solution space can be grasped with a minimum amount of data and how to operate within the manifold in order to increase the diversity of solutions.
This work presents a method for visual text recognition without using any paired supervisory data. We formulate the text recognition task as one of aligning the conditional distribution of strings predicted from given text images, with lexically valid strings sampled from target corpora. This enables fully automated, and unsupervised learning from just line-level text-images, and unpaired text-string samples, obviating the need for large aligned datasets. We present detailed analysis for various aspects of the proposed method, namely – (1) the impact of the length of training sequences on convergence, (2) relation between character frequencies and the order in which they are learnt, and (3) demonstrate the generalisation ability of our recognition network to inputs of arbitrary lengths. Finally, we demonstrate excellent text recognition accuracy on both synthetically generated text images, and scanned images of real printed books, using no labelled training examples.
In classic fair division problems such as cake cutting and rent division, envy-freeness requires that each individual (weakly) prefer his allocation to anyone else’s. On a conceptual level, we argue that envy-freeness also provides a compelling notion of fairness for classification tasks. Our technical focus is the generalizability of envy-free classification, i.e., understanding whether a classifier that is envy free on a sample would be almost envy free with respect to the underlying distribution with high probability. Our main result establishes that a small sample is sufficient to achieve such guarantees, when the classifier in question is a mixture of deterministic classifiers that belong to a family of low Natarajan dimension.
Neural architectures for named entity recognition have achieved great success in the field of natural language processing. Currently, the dominating architecture consists of a bi-directional recurrent neural network (RNN) as the encoder and a conditional random field (CRF) as the decoder. In this paper, we propose a deformable stacked structure for named entity recognition, in which the connections between two adjacent layers are dynamically established. We evaluate the deformable stacked structure by adapting it to different layers. Our model achieves state-of-the-art performance on the OntoNotes dataset.
Control of large-scale networked systems often necessitates the availability of complex models for the interactions amongst the agents. However in many applications, building accurate models of agents or interactions amongst them might be infeasible or computationally prohibitive due to the curse of dimensionality or the complexity of these interactions. In the meantime, data-guided control methods can circumvent model complexity by directly synthesizing the controller from the observed data. In this paper, we propose a distributed $Q$-learning algorithm to design a feedback mechanism based on a given underlying graph structure parameterizing the agents’ interaction network. We assume that the distributed nature of the system arises from the cost function of the corresponding control problem and show that for the specific case of identical dynamically decoupled systems, the learned controller converges to the optimal Linear Quadratic Regulator (LQR) controller for each subsystem. We provide a convergence analysis and verify the result with an example.
We propose a novel linear discriminant analysis approach for the classification of high-dimensional matrix-valued data that commonly arises from imaging studies. Motivated by the equivalence of the conventional linear discriminant analysis and the ordinary least squares, we consider an efficient nuclear norm penalized regression that encourages a low-rank structure. Theoretical properties including a non-asymptotic risk bound and a rank consistency result are established. Simulation studies and an application to electroencephalography data show the superior performance of the proposed method over the existing approaches.
5th generation networks are envisioned to provide seamless and ubiquitous connection to 1000-fold more devices and are believed to provide ultra-low latency and higher data rates up to tens of Gbps. Different technologies enabling these requirements are being developed, including mmWave communications, Massive MIMO and beamforming, Device to Device (D2D) communications and Heterogeneous Networks. D2D communication is a promising technology to enable applications requiring high bandwidth such as online streaming and online gaming. It can also provide the ultra-low latencies required for applications like vehicle to vehicle communication for autonomous driving. D2D communication can provide higher data rates with high energy efficiency and spectral efficiency compared to conventional communication. The performance benefits of D2D communication are best achieved when D2D users reuse the spectrum being utilized by the conventional cellular users. This spectrum sharing in a multi-tier heterogeneous network will introduce complex interference among D2D users and cellular users, which needs to be resolved. Motivated by the limited number of surveys on interference mitigation and resource allocation in D2D-enabled heterogeneous networks, we have surveyed different conventional and artificial intelligence based interference mitigation and resource allocation schemes developed in recent years. Our contribution lies in the analysis of conventional interference mitigation techniques and their shortcomings. Finally, the strengths of AI based techniques are determined and open research challenges deduced from the recent research are presented.
We introduce a novel type of text representation that preserves the 2D layout of a document. This is achieved by encoding each document page as a two-dimensional grid of characters. Based on this representation, we present a generic document understanding pipeline for structured documents. This pipeline makes use of a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes. We demonstrate its capabilities on an information extraction task from invoices and show that it significantly outperforms approaches based on sequential text or document images.
With increasing amounts of visual data being created in the form of videos and images, visual data selection and summarization are becoming ever increasing problems. We present Vis-DSS, an open-source toolkit for Visual Data Selection and Summarization. Vis-DSS implements a framework of models for summarization and data subset selection using submodular functions, which are becoming increasingly popular today for these problems. We present several classes of models, capturing notions of diversity, coverage, representation and importance, along with optimization/inference and learning algorithms. Vis-DSS is the first open-source toolkit for several data selection and summarization tasks, including Image Collection Summarization, Video Summarization, Training Data Selection for Classification and Diversified Active Learning. We demonstrate state-of-the-art performance on all these tasks, and also show how we can scale to large problems. Vis-DSS allows easy integration for applications to be built on it, and can also serve as a general skeleton that can be extended to several use cases, including video and image sharing platforms for creating GIFs, image montage creation, or as a component of surveillance systems, and we demonstrate this by providing a graphical user-interface (GUI) desktop app built over the Qt framework. Vis-DSS is available at https://…/vis-dss
We study the time complexity of induced subgraph isomorphism problems where the pattern graph is fixed. The earliest known example of an improvement over trivial algorithms is by Itai and Rodeh (1978) who sped up triangle detection in graphs using fast matrix multiplication. This algorithm was generalized by Ne\v{s}et\v{r}il and Poljak (1985) to speed up detection of k-cliques. Improved algorithms are known for certain small-sized patterns. For example, a linear-time algorithm is known for detecting length-4 paths. In this paper, we give the first pattern detection algorithm that improves upon Ne\v{s}et\v{r}il and Poljak’s algorithm for arbitrarily large pattern graphs (not cliques). The algorithm is obtained by reducing the induced subgraph isomorphism problem to the problem of detecting multilinear terms in constant-degree polynomials. We show that the same technique can be used to reduce the induced subgraph isomorphism problem of many pattern graphs to constructing arithmetic circuits computing homomorphism polynomials of these pattern graphs. Using this, we obtain faster combinatorial algorithms (algorithms that do not use fast matrix multiplication) for k-paths and k-cycles. We also obtain faster algorithms for 5-paths and 5-cycles that match the runtime for triangle detection. We show that these algorithms are expressible using polynomial families that we call graph pattern polynomial families. We then define a notion of reduction among these polynomials that allows us to compare the complexity of various pattern detection problems within this framework. For example, we show that the induced subgraph isomorphism polynomial for any pattern that contains a k-clique is harder than the induced subgraph isomorphism polynomial for k-clique. An analogue of this theorem is not known with respect to general algorithmic hardness.
This investigation aims to study different adaptive fuzzy inference algorithms capable of real-time sequential learning and prediction of time-series data. A brief qualitative description of these algorithms, namely the meta-cognitive fuzzy inference system (McFIS), the sequential adaptive fuzzy inference system (SAFIS) and the evolving Takagi-Sugeno (ETS) model, provides a comprehensive comparison of their working principles, and their unique characteristics are discussed. These algorithms are then simulated with a dataset collected at one of the academic buildings at Nanyang Technological University, Singapore. Performance is compared by means of the root mean squared error (RMSE) and non-destructive error index (NDEI) of the predicted output. Analysis shows that McFIS gives promising results, either with lower RMSE and NDEI or with lower architectural complexity, over ETS and SAFIS. Statistical analysis also reveals the significance of the outcome of these algorithms.
High-throughput data acquisition in synthetic biology leads to an abundance of data that need to be processed and aggregated into useful biological models. Building dynamical models based on this wealth of data is of paramount importance to understand and optimize designs of synthetic biology constructs. However, building models manually for each data set is inconvenient and might become infeasible for highly complex synthetic systems. In this paper, we present state-of-the-art system identification techniques and combine them with chemical reaction network theory (CRNT) to generate dynamic models automatically. On the system identification side, Sparse Bayesian Learning offers methods to learn from data the sparsest set of dictionary functions necessary to capture the dynamics of the system into ODE models; on the CRNT side, building on such sparse ODE models, all possible network structures within a given parameter uncertainty region can be computed. Additionally, the system identification process can be complemented with constraints on the parameters to, for example, enforce stability or non-negativity—thus offering relevant physical constraints over the possible network structures. In this way, the wealth of data can be translated into biologically relevant network structures, which then steers the data acquisition, thereby providing a vital step for closed-loop system identification.
Deep learning architectures have proved versatile in a number of drug discovery applications, including the modelling of in vitro compound activity. While controlling for prediction confidence is essential to increase the trust, interpretability and usefulness of virtual screening models in drug discovery, techniques to estimate the reliability of the predictions generated with deep learning networks remain largely underexplored. Here, we present Deep Confidence, a framework to compute valid and efficient confidence intervals for individual predictions using the deep learning technique Snapshot Ensembling and conformal prediction. Specifically, Deep Confidence generates an ensemble of deep neural networks by recording the network parameters throughout the local minima visited during the optimization phase of a single neural network. This approach serves to derive a set of base learners (i.e., snapshots) with comparable predictive power on average, that will however generate slightly different predictions for a given instance. The variability across base learners and the validation residuals are in turn harnessed to compute confidence intervals using the conformal prediction framework. Using a set of 24 diverse IC50 data sets from ChEMBL 23, we show that Snapshot Ensembles perform on par with Random Forest (RF) and ensembles of independently trained deep neural networks. In addition, we find that the confidence regions predicted using the Deep Confidence framework span a narrower set of values. Overall, Deep Confidence represents a highly versatile error prediction framework that can be applied to any deep learning-based application at no extra computational cost.
Event extraction is of practical utility in natural language processing. In the real world, it is common for multiple events to exist in the same sentence, and extracting them is more difficult than extracting a single event. Previous works that model the associations between events with sequential methods suffer from low efficiency in capturing very long-range dependencies. In this paper, we propose a novel Jointly Multiple Events Extraction (JMEE) framework to jointly extract multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model graph information. The experiment results demonstrate that our proposed framework achieves competitive results compared with state-of-the-art methods.
The generative learning phase of the Autoencoder (AE) and its successor, the Denoising Autoencoder (DAE), enhances the flexibility of data stream methods in exploiting unlabelled samples. Nonetheless, the feasibility of the DAE for data stream analytics deserves in-depth study because it is characterized by a fixed network capacity which cannot adapt to rapidly changing environments. An automated construction of a denoising autoencoder, namely the deep evolving denoising autoencoder (DEVDAN), is proposed in this paper. DEVDAN features an open structure in both the generative and the discriminative phase, where input features can be automatically added and discarded on the fly. A network significance (NS) method, derived from the bias-variance concept, is formulated in this paper. This method is capable of estimating the statistical contribution of the network structure and its hidden units, signalling an ideal state in which to add or prune input features. Furthermore, DEVDAN is free of problem-specific thresholds and works fully in the single-pass learning fashion. The efficacy of DEVDAN is numerically validated using nine non-stationary data stream problems simulated under the prequential test-then-train protocol, where DEVDAN delivers improved classification accuracy over recently published online learning works while retaining flexibility in the automatic extraction of robust input features and in adapting to rapidly changing environments.
Implicit probabilistic models are models defined naturally in terms of a sampling procedure; they often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.

### afex_plot(): Publication-Ready Plots for Factorial Designs

(This article was first published on Computational Psychology - Henrik Singmann, and kindly contributed to R-bloggers)

I am happy to announce that a new version of afex (version 0.22-1) has appeared on CRAN. This version comes with two major changes; for more, see the NEWS file. To get the new version, including all packages used in the examples, run:

install.packages("afex", dependencies = TRUE)


First, afex does not load or attach package emmeans automatically anymore. This reduces the package footprint and makes it more lightweight. If you want to use afex without using emmeans, you can do this now. The consequence of this is that you have to attach emmeans explicitly if you want to continue using emmeans() et al. in the same manner. Simply add library("emmeans") to the top of your script just below library("afex") and things remain unchanged. Alternatively, you can use emmeans::emmeans() without attaching the package.

Second and more importantly, I have added a new plotting function to afex. afex_plot() visualizes results from factorial experiments, combining estimated marginal means and associated uncertainties (i.e., error bars) in the foreground with a depiction of the raw data in the background. Currently, afex_plot() supports ANOVAs and mixed models fitted with afex as well as mixed models fitted with lme4 (support for more models will come in the next version). As shown in the example below, afex_plot() makes it easy to produce nice looking plots that are ready to be incorporated into publications. Importantly, afex_plot() allows different types of error bars, including within-subjects confidence intervals, which makes it particularly useful for fields where such designs are very common (e.g., psychology). Furthermore, afex_plot() is built on ggplot2 and designed in a modular manner, making it easy to customize the plot to one’s personal preferences.

afex_plot() requires the fitted model object as its first argument and then has three arguments determining which factor or factors are displayed, and how:
• x is necessary and specifies the factor(s) plotted on the x-axis.
• trace is optional and specifies the factor(s) plotted as separate lines (i.e., with each factor level present at each x-axis tick).
• panel is optional and specifies the factor(s) which separate the plot into different panels.

The further arguments make it easy to customize the plot in various ways. A comprehensive overview is provided in the new vignette; further details, specifically regarding which types of error bars are supported, are given on its help page (which also has many more examples).

Let us look at an example. We take data from a 3 by 2 within-subject experiment that also features prominently in the vignette. Note that we plot within-subjects confidence intervals (by setting error = "within") and then customize the plot quite a bit by changing the theme, using nicer labels, removing some y-axis ticks, adding colour, and using a customized geom (geom_boxjitter from the ggpol package) for displaying the data in the background.

library("afex")
library("ggplot2")

data(md_12.1)
aw <- aov_ez("id", "rt", md_12.1, within = c("angle", "noise"))

afex_plot(aw, x = "angle", trace = "noise", error = "within",
          mapping = c("shape", "fill"), dodge = 0.7,
          data_geom = ggpol::geom_boxjitter,
          data_arg = list(
            width = 0.5,
            jitter.width = 0,
            jitter.height = 10,
            outlier.intersect = TRUE),
          point_arg = list(size = 2.5),
          error_arg = list(size = 1.5, width = 0),
          factor_levels = list(angle = c("0°", "4°", "8°"),
                               noise = c("Absent", "Present")),
          legend_title = "Noise") +
  labs(y = "RTs (in ms)", x = "Angle (in degrees)") +
  scale_y_continuous(breaks = seq(400, 900, length.out = 3)) +
  theme_bw(base_size = 15) +
  theme(legend.position = "bottom", panel.grid.major.x = element_blank())

ggsave("afex_plot.png", device = "png", dpi = 600,
       width = 8.5, height = 8, units = "cm")


In the plot, the black dots are the means and the thick black lines the 95% within-subjects confidence intervals. The raw data is displayed in the background with a half box plot showing the median and the upper and lower quartiles, alongside the individual data points. The points are jittered on the y-axis to avoid perfect overlap.

One final thing to note: in the vignette on CRAN, as well as on the help page, there is an error in the code. The name of the argument for changing the labels of the factor levels is factor_levels and not new_levels. The vignette linked above and here uses the correct argument name. This is already corrected on GitHub and will be corrected on CRAN with the next release.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### The Whys and Hows of Web Scraping – A Lethal Weapon in Your Data Arsenal

We break down the various aspects of web scraping, from why businesses need to do it to instructions on how to go about acquiring this data with PromptCloud, a pioneer in Data as a Service solutions specializing in large-scale and custom web data extraction.

### Building Online Interactive Simulators for Predictive Models in R

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Correctly interpreting predictive models can be tricky. One solution to this problem is to create interactive simulators, where users can manipulate the predictor variables and see how the predictions change. This post describes a simple approach for creating online interactive simulators. It works for any model where there is a predict method. Better yet, if the model’s not top secret, you can build and share the model for no cost, using the free version of Displayr!

In this post I show how to create the very simple simulator shown below. Click the image to interact with it, or click the button below to explore and edit the code.

## Step 1: Create the model

The first step is to create a model. There are lots of ways to do this, including:

• Creating the model using R code from within Displayr. I illustrate this below.
• Pasting in estimates that you have already computed (Insert > Paste Table).
• Using Displayr’s graphical user interface.
• Creating an R model somewhere else, saving it somewhere on the web (e.g., Dropbox), and then reading it into Displayr using readRDS. (See How to Link Documents in Displayr for a discussion of some utilities we have created for reading from Dropbox.)

In this post I will illustrate by using one of my all-time favorite models – a generalized additive model – via the gam function in the mgcv package. The process for creating this in Displayr is:

• Log in to Displayr (if you don’t already have an account, click the button at the top-right of the screen).
• Press Insert > R Output (Analysis).
• Enter your code into the R Output and press the CALCULATE button at the top of the Object Inspector. In the example below I have fitted a GAM using some of IBM’s telco churn example data.

## Step 2: Add controls for each of the predictors

• Press Insert > Control (More) (this option is on the far right of the ribbon).
• In the Object Inspector > Properties > GENERAL, set the Name to cSeniorCitizen. You can give it any name you wish, but it is usually helpful to have a clear naming standard. In this example, I am using c so that whenever I refer to the control in code it is obvious to me that it is a control.
• Click on the Control tab of the Object Inspector and set the Item list to No; Yes, which means that the user will have a choice between No and Yes when using the control.
• Press Insert > Text box and click and drag to draw a text box to the left of the control. Type Senior Citizen into the text box, set it to be right-aligned (in the Appearance tab of the ribbon), with a font size of 10. You can micro-control layout by selecting the textbox, holding down your control key, and clicking the arrow keys on your keyboard.
• Click on the control and select No. It should look as shown below.

• Now, using shift and your mouse, select the text box and the control and press Home > Duplicate, and drag the copies to be neatly arranged underneath. Repeat this until you have four sets of labels and controls, one under each other.
• Update the textboxes, and each control’s Name, and Item list, as follows:
• Tenure (months), cTenure: 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 53; 54; 55; 56; 57; 58; 59; 60; 61; 62; 63; 64; 65; 66; 67; 68; 69; 70; 71; 72
• Internet service, cInternetService: No; DSL; Fiber optic
• Monthly charges, cMonthlyCharges: $0; $10; $20; $30; $40; $50; $60; $70; $80; $90; $100; $110; $120
• Select any option from each of the controls (it does not matter which you choose).

## Step 3: Computing the prediction

Press Insert > R Output (Analysis) and then enter the code below, modifying it as per your needs. For example, with the code SeniorCitizen = cSeniorCitizen, the variable name used in the model is SeniorCitizen and cSeniorCitizen is the name of the control. The item names in the control must exactly match the values of the variables in the data set. It is for this reason that the MonthlyCharges code is a bit more complicated, as it needs to strip out the $ from the control and convert it into a number (as the variable in the data set just contains numbers).


predict(my.gam,
        type = "response",
        newdata = data.frame(SeniorCitizen = cSeniorCitizen,
                             Tenure = as.numeric(cTenure),
                             InternetService = cInternetService,
                             MonthlyCharges = as.numeric(gsub("\\$", "", cMonthlyCharges))))[1] * 100

### Confidence bands

Provided that the predict method supports them, the same approach easily extends to computing confidence intervals and other quantities from models. This code snippet computes the confidence intervals for the GAM used above.

pred <- predict(my.gam,
                se.fit = TRUE,
                newdata = data.frame(SeniorCitizen = cSeniorCitizen,
                                     Tenure = as.numeric(cTenure),
                                     InternetService = cInternetService,
                                     MonthlyCharges = as.numeric(gsub("\\$", "", cMonthlyCharges))))
bounds = plogis(pred$fit + c(-1.96, 0, 1.96) * pred$se.fit) * 100
names(bounds) = c("Lower 95% CI", "Predicted", "Upper 95% CI")
bounds


### Computing predictions from coefficients

And, of course, you can also make predictions directly from coefficients, rather than from model objects. For example, the following code makes a prediction for a logistic regression:


coefs = my.logistic.regression$coef
XB = coefs["(Intercept)"] +
  switch(cSeniorCitizen,
         No = 0,
         Yes = coefs["SeniorCitizenYes"]) +
  as.numeric(cTenure) * coefs["Tenure"] +
  switch(cInternetService,
         No = coefs["InternetServiceNo"],
         "Fiber optic" = coefs["InternetServiceFiber optic"],
         DSL = 0) +
  as.numeric(gsub("\\$", "", cMonthlyCharges)) * coefs["MonthlyCharges"]
100 / (1 + exp(-XB))


### Making safe predictions

Sometimes models perform “unsafe” transformations of the data in their internals. For example, some machine learning models standardize inputs (subtract the mean and divide by standard deviation). This can create a problem at prediction time, as the predict method may, in the background, attempt to repeat the standardization using the data for the prediction. This will cause an error (as the standard deviation of a single input observation is 0). Similarly, it is possible to create unsafe predictions from even the most well-written model (e.g., if using poly or scale in your model formula). There are a variety of ways of dealing with unsafe predictions, but a safe course of action is to perform any transformations outside of the model (i.e., not in the model formula).
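As a minimal illustration of that advice (a Python sketch with made-up numbers, not Displayr code): compute the standardization statistics once on the training data, store them, and reuse them verbatim at prediction time, rather than letting the model re-derive them from the prediction data.

```python
import numpy as np

# Fit-time: standardize using training statistics, and KEEP them.
X_train = np.array([[60.0], [70.0], [80.0], [90.0]])  # made-up predictor values
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_train_std = (X_train - mu) / sd  # pass THIS to the model-fitting code

# Prediction-time: reuse the stored mu and sd. Re-standardizing the new
# data "from scratch" would divide by a standard deviation of 0, because
# a single prediction row has no spread.
x_new = np.array([[75.0]])
x_new_std = (x_new - mu) / sd
```

This is exactly the failure mode described above: `x_new.std()` is 0 for a single observation, so any transformation that recomputes statistics at prediction time breaks, while reusing the training-time `mu` and `sd` is safe.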

## Step 4: Export the simulator

If everything has gone to plan you can now use the simulator. To export it so that others can use it, click Export > Public Web Page, and you can then share the link with whoever you wish. The version that I have created here is very simple, but you can do a lot more if you want to make something pretty or more detailed (see the Displayr Dashboard Showcase for more examples).

Click here to interact with the published dashboard, or click here to open a copy of the Displayr document that I created when writing this post. It is completely live, so you can interact with it. Click on any of the objects on the page to view the underlying R code, which will appear in the Object Inspector > Properties > R CODE.


### What’s new on arXiv

Recommender systems are used in a variety of domains affecting people’s lives. This has raised concerns about possible biases and discrimination that such systems might exacerbate. There are two primary kinds of biases inherent in recommender systems: observation bias and bias stemming from imbalanced data. Observation bias exists due to a feedback loop which causes the model to learn to only predict recommendations similar to previous ones. Imbalance in data occurs when systematic societal, historical, or other ambient bias is present in the data. In this paper, we address both biases by proposing a hybrid fairness-aware recommender system. Our model provides efficient and accurate recommendations by incorporating multiple user-user and item-item similarity measures, content, and demographic information, while addressing recommendation biases. We implement our model using a powerful and expressive probabilistic programming language called probabilistic soft logic. We experimentally evaluate our approach on a popular movie recommendation dataset, showing that our proposed model can provide more accurate and fairer recommendations, compared to a state-of-the-art fair recommender system.
The ever-increasing number of Android malware samples has always been a concern for cybersecurity professionals. Even though plenty of anti-malware solutions exist, a rational and pragmatic approach to the problem is rare and has to be inspected further. In this paper, we propose a novel two-set feature selection approach based on Rough Set and Statistical Test, named RSST, to extract relevant system calls. To address the problem of a higher-dimensional attribute set, we derived a suboptimal system call space by applying the proposed feature selection method to maximize the separability between malware and benign samples. Comprehensive experiments conducted on a dataset consisting of 3500 samples with 30 RSST-derived essential system calls resulted in an accuracy of 99.9%, an Area Under Curve (AUC) of 1.0, and a 1% False Positive Rate (FPR). In contrast, other feature selectors (Information Gain, CFsSubsetEval, ChiSquare, FreqSel and Symmetric Uncertainty) used in the domain of malware analysis resulted in an accuracy of 95.5% with an 8.5% FPR. Besides, empirical analysis shows that RSST-derived system calls outperform other attributes such as permissions, opcodes, API, methods, call graphs, Droidbox attributes and network traces.
Non-Intrusive Load Monitoring (NILM) is an important application to monitor household appliance activities and provide related information to the homeowner and/or utility company via a single sensor installed at the electrical entry of the house. It can be used for different purposes in the residential and industrial sectors. Thus, an increasing number of new algorithms have been developed in recent years. In these algorithms, researchers either use existing public datasets or collect their own data, which causes such problems as insufficiency of electrical parameters, missing ground-truth data, absence of many appliances, and lack of appliance information. To solve these problems, this paper presents a model-based platform for NILM system development, namely the Functional Intrusive Load Monitor (FILM). By using this platform, the state transitions and activities of all the involved appliances can be preset by researchers, and multiple electrical parameters such as harmonics and power factor can be monitored or calculated. This platform will help researchers save the time of collecting experimental data, utilize precise control of individual appliance activities, and develop load signatures of devices. This paper describes the steps, structure, and requirements of building this platform. A case study is presented to help understand this platform.
Permutation invariant Gaussian matrix models were recently developed for applications in computational linguistics. A 5-parameter family of models was solved. In this paper, we use a representation theoretic approach to solve the general 13-parameter Gaussian model, which can be viewed as a zero-dimensional quantum field theory. We express the two linear and eleven quadratic terms in the action in terms of representation theoretic parameters. These parameters are coefficients of simple quadratic expressions in terms of appropriate linear combinations of the matrix variables transforming in specific irreducible representations of the symmetric group $S_D$ where $D$ is the size of the matrices. They allow the identification of constraints which ensure a convergent Gaussian measure and well-defined expectation values for polynomial functions of the random matrix at all orders. A graph-theoretic interpretation is known to allow the enumeration of permutation invariants of matrices at linear, quadratic and higher orders. We express the expectation values of all the quadratic graph-basis invariants and a selection of cubic and quartic invariants in terms of the representation theoretic parameters of the model.
This chapter focuses on the Internet of Things from the nanoscale point of view. The chapter starts with Section 1, which provides an introduction to nanothings and nanotechnologies. The nanoscale communication paradigms and the different approaches to nanodevice development are discussed. Nanodevice characteristics are discussed and the architecture of wireless nanodevices is outlined. Section 2 describes the Internet of NanoThings (IoNT), its network architecture, and the challenges of nanoscale communication, which is essential for enabling IoNT. Section 3 gives some practical applications of IoNT. The Internet of Bio-NanoThings (IoBNT) and relevant biomedical applications are discussed. Other applications, such as military, industrial, and environmental applications, are also outlined.
Q-learning is one of the most popular methods in Reinforcement Learning (RL). Transfer learning aims to utilize knowledge learned from source tasks to improve the sample complexity of new tasks. Considering that data collection in RL is both time- and cost-consuming, and that Q-learning converges slowly compared to supervised learning, different kinds of transfer RL algorithms have been designed. However, most of them are heuristic, with no theoretical guarantee of the convergence rate. It is therefore important to clearly understand when and how transfer learning will help RL methods and to provide theoretical guarantees for the improvement in sample complexity. In this paper, we propose to transfer the Q-function learned in the source task to the target of the Q-learning in the new task when certain safe conditions are satisfied. We call this new transfer Q-learning method target transfer Q-learning. The safe conditions are necessary to avoid harming the new tasks and thus ensure the convergence of the algorithm. We study the convergence rate of target transfer Q-learning. We prove that if the two tasks are similar with respect to their MDPs, the optimal Q-functions in the source and new RL tasks are similar, which means the error of the transferred target Q-function in the new MDP is small. Also, the convergence rate analysis shows that target transfer Q-learning will converge faster than Q-learning if the error of the transferred target Q-function is smaller than that of the current Q-function in the new task. Based on our theoretical results, we design the safe condition as the Bellman error of the transferred target Q-function being less than that of the current Q-function. Our experiments are consistent with our theoretical findings and verify the effectiveness of the proposed target transfer Q-learning method.
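The safe condition described in this abstract can be sketched directly: compare the Bellman error of the transferred Q-function against that of the current one on sampled transitions, and only adopt the transferred function as the target when its error is smaller. The toy MDP, Q-tables, and function names below are invented for illustration; this is a minimal sketch of the condition, not the paper's algorithm.

```python
import numpy as np

def bellman_error(Q, transitions, gamma=0.9):
    """Mean absolute Bellman error of Q over sampled (s, a, r, s') tuples."""
    errs = [abs(Q[s, a] - (r + gamma * Q[s2].max())) for s, a, r, s2 in transitions]
    return sum(errs) / len(errs)

def choose_target_q(Q_current, Q_source, transitions, gamma=0.9):
    """Safe condition (as described in the abstract): adopt the transferred
    Q-function as the learning target only if its Bellman error is smaller."""
    if bellman_error(Q_source, transitions, gamma) < bellman_error(Q_current, transitions, gamma):
        return Q_source
    return Q_current

# Invented toy MDP samples: (state, action, reward, next_state)
transitions = [(0, 0, 1.0, 1), (1, 1, 0.0, 0), (0, 1, 0.0, 0)]
Q_new = np.zeros((2, 2))                 # fresh target-task estimate
Q_transferred = np.array([[5.26, 4.74],  # nearly Bellman-consistent
                          [0.00, 4.74]]) # estimate from a source task
target = choose_target_q(Q_new, Q_transferred, transitions)
# Here the transferred Q has the smaller Bellman error, so it is selected
```

If the source task were dissimilar, its Q-function would show a large Bellman error on the new task's transitions and the check would fall back to the current estimate, which is exactly the harm-avoidance role the safe condition plays.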
The present paper surveys neural approaches to conversational AI that have been developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) chatbots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between them and traditional approaches, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.
We study the problem of learning a generalizable action policy for an intelligent agent to actively approach an object of interest in an indoor environment solely from its visual inputs. While scene-driven or recognition-driven visual navigation has been widely studied, prior efforts suffer severely from limited generalization capability. In this paper, we first argue that the object-searching task is environment-dependent while the approaching ability is general. To learn a generalizable approaching policy, we present a novel solution dubbed GAPLE which adopts two channels of visual features, depth and semantic segmentation, as the inputs to the policy learning module. The empirical studies conducted on the House3D dataset as well as on a physical platform in a real-world scenario validate our hypothesis, and we further provide in-depth qualitative analysis.
Understanding the world around us and making decisions about the future is a critical component to human intelligence. As autonomous systems continue to develop, their ability to reason about the future will be the key to their success. Semantic anticipation is a relatively under-explored area for which autonomous vehicles could take advantage of (e.g., forecasting pedestrian trajectories). Motivated by the need for real-time prediction in autonomous systems, we propose to decompose the challenging semantic forecasting task into two subtasks: current frame segmentation and future optical flow prediction. Through this decomposition, we built an efficient, effective, low overhead model with three main components: flow prediction network, feature-flow aggregation LSTM, and end-to-end learnable warp layer. Our proposed method achieves state-of-the-art accuracy on short-term and moving objects semantic forecasting while simultaneously reducing model parameters by up to 95% and increasing efficiency by greater than 40x.
Autonomous cyber-physical agents and systems play an increasingly large role in our lives. To ensure that agents behave in ways aligned with the values of the societies in which they operate, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. These constraints and norms can come from any number of sources including regulations, business process guidelines, laws, ethical principles, social norms, and moral values. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations of the task, and reinforcement learning to learn to maximize the environment rewards. More precisely, we assume that an agent can observe traces of behavior of members of the society but has no access to the explicit set of constraints that give rise to the observed behavior. Inverse reinforcement learning is used to learn such constraints, that are then combined with a possibly orthogonal value function through the use of a contextual bandit-based orchestrator that picks a contextually-appropriate choice between the two policies (constraint-based and environment reward-based) when taking actions. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward maximizing or constrained policy. In addition, the orchestrator is transparent on which policy is being employed at each time step. We test our algorithms using a Pac-Man domain and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.
Transfer-learning and meta-learning are two effective methods to apply knowledge learned from large data sources to new tasks. In few-class, few-shot target task settings (i.e. when there are only a few classes and training examples available in the target task), meta-learning approaches that optimize for future task learning have outperformed the typical transfer approach of initializing model weights from a pre-trained starting point. But as we experimentally show, meta-learning algorithms that work well in the few-class setting do not generalize well in many-shot and many-class cases. In this paper, we propose a joint training approach that combines both transfer-learning and meta-learning. Benefiting from the advantages of each, our method obtains improved generalization performance on unseen target tasks in both few- and many-class and few- and many-shot scenarios.
Preferences are central to decision making by both machines and humans. Representing, learning, and reasoning with preferences is an important area of study both within computer science and across the sciences. When working with preferences it is necessary to understand and compute the distance between sets of objects, e.g., the preferences of a user and the descriptions of objects to be recommended. We present CPDist, a novel neural network to address the problem of learning to measure the distance between structured preference representations. We use the popular CP-net formalism to represent preferences and then leverage deep neural networks to learn a recently proposed metric function that is computationally hard to compute directly. CPDist is a novel metric learning approach based on deep siamese networks which learn the Kendall tau distance between partial orders induced by compact preference representations. We find that CPDist is able to learn the distance function with high accuracy and outperform existing approximation algorithms on both the regression and classification tasks using less computation time. Performance remains good even when CPDist is trained with only a small number of samples compared to the dimension of the solution space, indicating that the network generalizes well.
Joint analysis of data from multiple information repositories facilitates uncovering the underlying structure in heterogeneous datasets. Single and coupled matrix-tensor factorization (CMTF) has been widely used in this context for imputation-based recommendation from ratings, social network, and other user-item data. When this side information is in the form of item-item correlation matrices or graphs, existing CMTF algorithms may fall short. Alleviating current limitations, we introduce a novel model coined coupled graph-tensor factorization (CGTF) that judiciously accounts for graph-related side information. The CGTF model has the potential to overcome practical challenges, such as missing slabs from the tensor and/or missing rows/columns from the correlation matrices. A novel alternating direction method of multipliers (ADMM) is also developed that recovers the nonnegative factors of CGTF. Our algorithm enjoys closed-form updates that result in reduced computational complexity and allow for convergence claims. A novel direction is further explored by employing the interpretable factors to detect graph communities having the tensor as side information. The resulting community detection approach is successful even when some links in the graphs are missing. Results with real data sets corroborate the merits of the proposed methods relative to state-of-the-art competing factorization techniques in providing recommendations and detecting communities.
Link prediction is one of the fundamental tools in social network analysis, used to identify relationships that are not otherwise observed. Commonly, link prediction is performed by means of a similarity metric, with the idea that a pair of similar nodes are likely to be connected. However, traditional link prediction based on similarity metrics assumes that available network data is accurate. We study the problem of adversarial link prediction, where an adversary aims to hide a target link by removing a limited subset of edges from the observed subgraph. We show that optimal attacks on local similarity metrics—that is, metrics which use only the information about the node pair and their network neighbors—can be found in linear time. In contrast, attacking Katz and ACT metrics which use global information about network topology is NP-Hard. We present an approximation algorithm for optimal attacks on Katz similarity, and a principled heuristic for ACT attacks. Extensive experiments demonstrate the efficacy of our methods.
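A toy version of the local similarity metrics under attack here can make the setting concrete. The sketch below (the defender's scoring, not the paper's attack algorithm; the graph and names are invented) ranks non-edges by the Jaccard similarity of their endpoints' neighborhoods, a metric that uses only node-pair and neighbor information.

```python
def jaccard_score(adj, u, v):
    """Local similarity: |N(u) ∩ N(v)| / |N(u) ∪ N(v)|."""
    nu, nv = adj[u], adj[v]
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

def top_predicted_links(adj, k=3):
    """Rank non-edges by Jaccard similarity (higher = more likely a link)."""
    nodes = sorted(adj)
    candidates = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
                  if v not in adj[u]]
    return sorted(candidates, key=lambda e: -jaccard_score(adj, *e))[:k]

# Invented toy graph as adjacency sets; nodes 0 and 3 share neighbors 1 and 2
adj = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {1, 2, 4},
    4: {3},
}
predictions = top_predicted_links(adj, k=2)
# (0, 3) ranks first: it has the highest Jaccard score (2/3)
```

Because such metrics depend only on shared neighbors, an adversary hiding the pair (0, 3) need only delete edges touching nodes 1 or 2, which is the local structure the paper shows can be attacked optimally in linear time.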
Unsupervised representation learning algorithms such as word2vec and ELMo improve the accuracy of many supervised NLP models, mainly because they can take advantage of large amounts of unlabeled text. However, the supervised models only learn from task-specific labeled data during the main training phase. We therefore propose Cross-View Training (CVT), a semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data. On labeled examples, standard supervised learning is used. On unlabeled examples, CVT teaches auxiliary prediction modules that see restricted views of the input (e.g., only part of a sentence) to match the predictions of the full model seeing the whole input. Since the auxiliary modules and the full model share intermediate representations, this in turn improves the full model. Moreover, we show that CVT is particularly effective when combined with multi-task learning. We evaluate CVT on five sequence tagging tasks, machine translation, and dependency parsing, achieving state-of-the-art results.
Every new privacy regulation brings along the question of whether it improves privacy for users or creates more barriers to understanding and exercising their rights. The EU General Data Protection Regulation (GDPR) is one of the most demanding and comprehensive privacy regulations of all time. Hence, a few months after it went into effect, it is natural to study its impact on the landscape of online privacy policies. In this work, we conduct the first longitudinal, in-depth, and at-scale assessment of privacy policies before and after the GDPR. We gauge the complete consumption cycle of these policies, from first user impressions to compliance assessment. We create a diverse corpus of 3,086 English-language privacy policies for which we fetch the pre-GDPR and post-GDPR versions. Via a user study with 530 participants on Amazon Mechanical Turk, we discover that the visual presentation of privacy policies has slightly improved in limited data-sensitive categories, in addition to the top European websites. We also find that the readability of privacy policies suffers under the GDPR, with almost 30% more sentences and words, despite efforts to reduce the reliance on passive sentences. We further develop a new workflow for the automated assessment of requirements in privacy policies, building on automated natural language processing techniques. We find evidence for positive changes triggered by the GDPR, with the ambiguity level, averaged over 8 metrics, improving in over 20.5% of the policies. Finally, we show that privacy policies cover more data practices, particularly around data retention, user access rights, and specific audiences, and that an average of 15.2% of the policies improved across 8 compliance metrics. Our analysis, however, reveals a large gap between the current status quo and the ultimate goals of the GDPR.
Collaborative filtering (CF) has been successfully employed by many modern recommender systems. Conventional CF-based methods use the user-item interaction data as the sole information source to recommend items to users. However, CF-based methods are known to suffer from cold-start and data-sparsity problems. Hybrid models that utilize auxiliary information on top of interaction data have increasingly gained attention. A few ‘collaborative learning’-based models, which tightly bridge two heterogeneous learners through mutual regularization, have recently been proposed for hybrid recommendation. However, the ‘collaboration’ in these existing methods is actually asynchronous, due to the alternating optimization of the two learners. Leveraging recent advances in variational autoencoders (VAEs), we propose a model consisting of two streams of mutually linked VAEs, named the variational collaborative model (VCM). Unlike the mutual regularization used in previous works, where the two learners are optimized asynchronously, VCM enables a synchronous collaborative learning mechanism. Besides, the two-stream VAE setup allows VCM to fully leverage Bayesian probabilistic representations in collaborative learning. Extensive experiments on three real-life datasets show that VCM outperforms several state-of-the-art methods.
Deep learning is a multi-layer neural network. It can be regarded as a chain of complete bipartite graphs. The nodes of the first partite form the input layer and those of the last form the output layer. The edges of each bipartite graph function as weights, represented as a matrix. The values of the i-th partite are computed by multiplying the weight matrix with the values of the (i-1)-th partite. Using mass training and teacher data, the weight parameters are estimated little by little. Overfitting (or overlearning) refers to a model that fits the training data too well; it then becomes difficult for the model to generalize to new data that were not in the training set. The most popular method to avoid overfitting is called dropout. Dropout sets a random sample of activations (nodes) to zero during the training process. A random sample of nodes causes the dropped edges to occur with irregular frequency. We propose a combinatorial design on the dropout nodes of each partite that balances the frequency of dropped edges. We analyze and construct such designs in this paper.
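For reference, standard (inverted) dropout as described above can be sketched in a few lines of NumPy. This is the baseline random-mask variant the paper contrasts with, not the proposed combinatorial design:

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Inverted dropout: zero a random fraction p_drop of activations and
    rescale the survivors so the expected activation is unchanged."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = dropout(x, p_drop=0.5, rng=rng)
# Surviving entries are rescaled to 2.0; the dropped ones are exactly 0.
```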
We present a stable mergesort that exploits the existence of monotonic runs to sort partially sorted data efficiently. We also prove that, although this algorithm is simple to implement, its computational cost, in number of comparisons performed, is optimal up to an additive linear term.
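The run-exploiting idea is easy to see in code. Below is a generic natural mergesort sketch in Python (not the paper's algorithm, which has a specific merge policy): decompose the input into maximal non-decreasing runs, then merge runs pairwise. Already-sorted stretches cost almost nothing to detect and merge.

```python
def natural_mergesort(xs):
    """Stable mergesort that first splits the input into maximal
    non-decreasing runs, then merges runs pairwise until one remains."""
    if not xs:
        return []
    # 1. Decompose into maximal non-decreasing runs.
    runs, run = [], [xs[0]]
    for x in xs[1:]:
        if x >= run[-1]:
            run.append(x)
        else:
            runs.append(run)
            run = [x]
    runs.append(run)

    # 2. Merge runs pairwise until a single sorted run remains.
    def merge(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:          # <= keeps the sort stable
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        return out + a[i:] + b[j:]

    while len(runs) > 1:
        runs = [merge(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
                for k in range(0, len(runs), 2)]
    return runs[0]

print(natural_mergesort([5, 6, 7, 1, 2, 3, 4]))  # → [1, 2, 3, 4, 5, 6, 7]
```

On this input only two runs exist, so a single merge finishes the sort; a fully random input degrades gracefully to ordinary mergesort.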
Online Learning to Rank (OLTR) methods optimize rankers based on user interactions. State-of-the-art OLTR methods are built specifically for linear models. Their approaches do not extend well to non-linear models such as neural networks. We introduce an entirely novel approach to OLTR that constructs a weighted differentiable pairwise loss after each interaction: Pairwise Differentiable Gradient Descent (PDGD). PDGD breaks away from the traditional approach that relies on interleaving or multileaving and extensive sampling of models to estimate gradients. Instead, its gradient is based on inferring preferences between document pairs from user clicks and can optimize any differentiable model. We prove that the gradient of PDGD is unbiased w.r.t. user document pair preferences. Our experiments on the largest publicly available Learning to Rank (LTR) datasets show considerable and significant improvements under all levels of interaction noise. PDGD outperforms existing OLTR methods both in terms of learning speed as well as final convergence. Furthermore, unlike previous OLTR methods, PDGD also allows for non-linear models to be optimized effectively. Our results show that using a neural network leads to even better performance at convergence than a linear model. In summary, PDGD is an efficient and unbiased OLTR approach that provides a better user experience than previously possible.
In this paper, several two-dimensional clustering scenarios are given. In those scenarios, soft partitioning clustering algorithms (Fuzzy C-means (FCM) and Possibilistic C-means (PCM)) are applied. Afterward, VAT is used to investigate the clustering tendency visually, and then, to check cluster validation, three types of indices (PC, DI, and DBI) are used. After observing the clustering algorithms, it is evident that each of them has its limitations; however, PCM is more robust to noise than FCM, since FCM forces a noise point to be a member of one of the clusters.
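A minimal FCM sketch in NumPy shows why a noise point must belong somewhere: the membership update normalizes each row of the membership matrix to sum to 1, so every point, outlier or not, distributes its full unit of membership across the clusters. The toy blobs below are illustrative, not the paper's scenarios.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Minimal Fuzzy C-means: alternate the membership update and the
    weighted-mean center update of the standard FCM objective."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)        # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # distances of every point to every center (floored to avoid /0)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)   # forced normalization
    return centers, U

# Two well-separated toy blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (10, 2)), rng.normal(5, 0.2, (10, 2))])
centers, U = fuzzy_c_means(X, c=2)
```

PCM replaces that row normalization with per-cluster "typicality" values, which is what lets it assign a noise point low typicality everywhere.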
The combination of large open data sources with machine learning approaches presents a potentially powerful way to predict events such as protest or social unrest. However, accounting for uncertainty in such models, particularly when using diverse, unstructured datasets such as social media, is essential to guarantee the appropriate use of such methods. Here we develop a Bayesian method for predicting social unrest events in Australia using social media data. This method uses machine learning methods to classify individual postings to social media as being relevant, and an empirical Bayesian approach to calculate posterior event probabilities. We use the method to predict events in Australian cities over a period in 2017/18.
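The core of any such posterior event probability is Bayes' rule. The sketch below is a deliberately simplified two-outcome version with made-up numbers, not the paper's empirical-Bayes procedure, but it shows the update that the classifier outputs feed into:

```python
def posterior_event_prob(prior, p_signal_given_event, p_signal_given_quiet):
    """Bayes' rule for a single binary signal: update the prior event
    probability after observing a burst of relevant postings."""
    num = p_signal_given_event * prior
    return num / (num + p_signal_given_quiet * (1.0 - prior))

# A rare event (1% base rate): even a fairly reliable signal
# (80% hit rate, 10% false-alarm rate) only lifts the posterior
# to about 7.5% -- which is why calibrated probabilities matter.
p = posterior_event_prob(prior=0.01, p_signal_given_event=0.8,
                         p_signal_given_quiet=0.1)
```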
We propose a collection of three shift-based primitives for building efficient compact CNN-based networks. These three primitives (channel shift, address shift, shortcut shift) reduce inference time on GPU while maintaining prediction accuracy. These shift-based primitives only move pointers and avoid memory copies, and are therefore very fast. For example, the channel shift operation is 12.7x faster than channel shuffle in ShuffleNet while achieving the same accuracy. The address shift and channel shift can be merged into the point-wise group convolution, invoking only a single kernel call and taking little time to perform spatial convolution and channel shift. Shortcut shift requires no time to realize a residual connection, since space is allocated in advance. We blend these shift-based primitives with point-wise group convolution and build two inference-efficient CNN architectures, named AddressNet and Enhanced AddressNet. Experiments on the CIFAR100 and ImageNet datasets show that our models are faster and achieve comparable or better accuracy.
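One plausible reading of the channel-shift primitive, sketched in NumPy: rotate the channels within each group by one position. In a framework with raw pointers this is just an address offset; `np.roll` copies, but it shows the same data movement. This is an illustrative interpretation, not the paper's implementation.

```python
import numpy as np

def channel_shift(x, groups):
    """Rotate channels by one position within each of `groups`
    equal-sized channel groups of an NCHW feature map."""
    n, c, h, w = x.shape
    per = c // groups
    out = np.empty_like(x)
    for g in range(groups):
        sl = slice(g * per, (g + 1) * per)
        out[:, sl] = np.roll(x[:, sl], shift=1, axis=1)
    return out

x = np.arange(4, dtype=float).reshape(1, 4, 1, 1)   # channels [0, 1, 2, 3]
y = channel_shift(x, groups=2)
# channels rotate within each group: [0, 1] -> [1, 0], [2, 3] -> [3, 2]
```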
When a bilingual student learns to solve word problems in math, we expect the student to be able to solve these problems in both languages the student is fluent in, even if the math lessons were taught in only one language. However, current representations in machine learning are language dependent. In this work, we present a method to decouple the language from the problem by learning language-agnostic representations, thereby allowing a model trained in one language to be applied to a different one in a zero-shot fashion. We learn these representations by taking inspiration from linguistics and formalizing Universal Grammar as an optimization process (Chomsky, 2014; Montague, 1970). We demonstrate the capabilities of these representations by showing that models trained on a single language using language-agnostic representations achieve very similar accuracies in other languages.
Methods based on Bayesian decision tree ensembles have proven valuable in constructing high-quality predictions, and are particularly attractive in certain settings because they encourage low-order interaction effects. Despite adapting to the presence of low-order interactions for prediction purposes, we show that Bayesian decision tree ensembles are generally anti-conservative when used for interaction detection. We address this problem by introducing Dirichlet process forests (DP-Forests), which leverage the presence of low-order interactions by clustering the trees so that trees within the same cluster focus on detecting a specific interaction. We show on both simulated and benchmark data that DP-Forests perform well relative to existing techniques for detecting low-order interactions, attaining very low false-positive and false-negative rates while maintaining the same predictive performance on a comparable computational budget.
We are concerned with reliably harvesting data collected from service-based systems hosted on a mobile ad hoc network (MANET). More specifically, we are concerned with time-bounded and time-sensitive time-series monitoring data describing the state of the network and system. The data are harvested in order to perform an analysis, usually one that requires a global view of the data taken from distributed sites. For example, network- and application-state data are typically analysed in order to make operational and maintenance decisions. MANETs are a challenging environment in which to harvest monitoring data, due to the inherently unstable and unpredictable connectivity between nodes, and the overhead of transferring data in a wireless medium. These limitations must be overcome to support time-series analysis of perishable and time-critical data. We present an epidemic, delay tolerant, and intelligent method to efficiently and effectively transfer time-series data between the mobile nodes of MANETs. The method establishes a network-wide synchronization overlay to transfer increments of the data over intermediate nodes in periodic cycles. The data are then accessible from local stores at the nodes. We implemented the method in Java EE and present an evaluation on a run-time dependence discovery method for Web Service applications hosted on MANETs, along with a comparison to four other methods, demonstrating that our method performs significantly better in both data availability and network overhead.
Previous transfer learning methods based on deep networks assume that knowledge should be transferred between the same hidden layers of the source and target domains. This assumption does not always hold, especially when the data from the two domains are heterogeneous with different resolutions. In such cases, the most suitable numbers of layers for the source domain data and the target domain data would differ, and high-level knowledge from the source domain would be transferred to the wrong layer of the target domain. Based on this observation, the question of ‘where to transfer’, proposed in this paper, is a novel research frontier. We propose a new mathematical model named DT-LET to solve this heterogeneous transfer learning problem. To select the best matching of layers for transferring knowledge, we define a specific loss function to estimate the correspondence between high-level features of data in the source domain and the target domain. To verify this proposed cross-layer model, experiments on two cross-domain recognition/classification tasks are conducted, and the superior results achieved demonstrate the necessity of layer correspondence searching.
Non-maximum suppression (NMS) is essential for state-of-the-art object detectors to localize objects from a set of candidate locations. However, an accurate candidate location is sometimes not associated with a high classification score, which leads to object localization failure during NMS. In this paper, we introduce a novel bounding box regression loss for learning bounding box transformation and localization variance together. The resulting localization variance exhibits a strong connection to localization accuracy, which we then utilize in a new non-maximum suppression method to improve localization accuracy for object detection. On MS-COCO, we boost the AP of VGG-16 Faster R-CNN from 23.6% to 29.1% with a single model and nearly no additional computational overhead. More importantly, our method improves the AP of ResNet-50 FPN Fast R-CNN from 36.8% to 37.8%, achieving state-of-the-art bounding box refinement results.
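For readers unfamiliar with the baseline the paper improves on, here is standard greedy NMS in NumPy. It keeps only the highest-scoring box in each overlapping cluster, which is exactly the failure mode described: an accurately placed but lower-scored box gets suppressed. The boxes below are made-up toy data.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping
    it above iou_thresh, repeat. Boxes are (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * \
                (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
# the second box overlaps the first (IoU ~ 0.68) and is suppressed
```

The paper's variance-aware variant instead uses the learned localization variance to combine overlapping candidates rather than discard them outright.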
Although nonstationary data are more common in the real world, most existing causal discovery methods do not take nonstationarity into consideration. In this letter, we propose a kernel embedding-based approach, ENCI, for nonstationary causal model inference where data are collected from multiple domains with varying distributions. In ENCI, we transform the complicated relation of a cause-effect pair into a linear model of variables of which observations correspond to the kernel embeddings of the cause-and-effect distributions in different domains. In this way, we are able to estimate the causal direction by exploiting the causal asymmetry of the transformed linear model. Furthermore, we extend ENCI to causal graph discovery for multiple variables by transforming the relations among them into a linear nongaussian acyclic model. We show that by exploiting the nonstationarity of distributions, both cause-effect pairs and two kinds of causal graphs are identifiable under mild conditions. Experiments on synthetic and real-world data are conducted to justify the efficacy of ENCI over major existing methods.
Understanding searchers’ queries is an essential component of semantic search systems. In many cases, search queries involve specific attributes of an entity in a knowledge base (KB), which can be further used to find query answers. In this study, we aim to move forward the understanding of queries by identifying their related entity attributes from a knowledge base. To this end, we introduce the task of entity attribute identification and propose two methods to address it: (i) a model based on Markov Random Field, and (ii) a learning to rank model. We develop a human annotated test collection and show that our proposed methods can bring significant improvements over the baseline methods.
Interpretability is a key factor in the design of automatic classifiers for medical diagnosis. Deep learning models have proven to be very effective classifiers when trained in a supervised way with enough data. The main concern is the difficulty of inferring rational interpretations from them. Various attempts have been made in recent years to convert deep learning classifiers from high-confidence statistical black-box machines into self-explanatory models. In this paper we go a step further in the generation of explanations by identifying the independent causes that a deep learning model uses to classify an image into a certain class. We use a combination of Independent Component Analysis and a score visualization technique. We study the medical problem of classifying an eye fundus image into 5 levels of Diabetic Retinopathy. We conclude that only 3 independent components are enough for differentiation and correct classification between the 5 standard disease classes. We propose a method for visualizing them and detecting lesions from the generated visual maps.
The inference of the causal relationship between a pair of observed variables is a fundamental problem in science, and most existing approaches are based on a single causal model. In practice, however, observations are often collected from multiple sources with heterogeneous causal models due to certain uncontrollable factors, which renders causal analysis results obtained by a single model questionable. In this paper, we generalize the Additive Noise Model (ANM) to a mixture model consisting of a finite number of ANMs, and provide the condition for its causal identifiability. For model estimation, we propose the Gaussian Process Partially Observable Model (GPPOM), and incorporate independence enforcement into it to learn the latent parameter associated with each observation. Causal inference and clustering according to the underlying generating mechanisms of the mixture model are addressed in this work. Experiments on synthetic and real data demonstrate the effectiveness of our proposed approach.

### ebook: DATAx Guide to Data Visualization

Get the free ebook, DATAx Guide to Data Visualization in 2019, the definitive foundation to help you prepare for the future of data visualization, AI, and machine learning.

### Unfolding Naive Bayes From Scratch

Whether you are a beginner in Machine Learning or you have been trying hard to understand the Super Natural Machine Learning Algorithms and you still feel that the dots do not connect somehow, this post is definitely for you!
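Since the linked post builds Naive Bayes from scratch, here is a compact multinomial Naive Bayes with add-one (Laplace) smoothing as a taste of what it covers. The tiny spam/ham corpus is invented for illustration; this is a sketch, not the post's code.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing.
    Returns a predict(doc) function."""
    classes = set(labels)
    # log prior: fraction of training documents in each class
    priors = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}

    def predict(doc):
        def score(c):
            total = sum(word_counts[c].values())
            # log P(c) + sum of smoothed log likelihoods of each word
            return priors[c] + sum(
                math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                for w in doc.split())
        return max(classes, key=score)

    return predict

predict = train_nb(
    ["free money now", "win free prize", "meeting at noon", "lunch at noon"],
    ["spam", "spam", "ham", "ham"])
```

Without the +1 smoothing, any unseen word would zero out an entire class's probability, which is the classic pitfall such tutorials warn about.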

### PCA plot with fill, color, and shape all together

(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers)

When I plotted the PCA results (e.g. scatter plot for PC1 and PC2) and was about to annotate the dataset with different covariates (e.g. gender, diagnosis, and ethnic group), I noticed that it’s not straightforward to annotate >2 covariates at the same time using ggplot.

Here is what works for me in ggplot:

```r
# vsd and plotPCA come from the DESeq2 package; they are not specific
# to the example below.
pcaData <- plotPCA(vsd, intgroup = c("Diagnosis", "Ethnicity", "Sex"),
                   returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))

ggplot(pcaData, aes(x = PC1, y = PC2,
                    color = factor(Diagnosis), shape = factor(Ethnicity))) +
  geom_point(size = 3, aes(fill = factor(Diagnosis), alpha = as.character(Sex))) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(21, 22)) +
  scale_alpha_manual(values = c("F" = 0, "M" = 1)) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  ggtitle("PCA of all genes, no covariate adjusted")
```

I also found that you can use the male and female symbols (♂ and ♀) as shapes in your plot. Here is how:

```r
df <- data.frame(x = runif(10), y = runif(10),
                 sex = sample(c("m", "f"), 10, replace = TRUE))

qplot(x, y, data = df, shape = sex, size = I(5)) +
  scale_shape_manual(values = c("m" = "\u2642", "f" = "\u2640"))
```

(Reference: https://github.com/kmiddleton/rexamples/blob/master/ggplot2%20male-female%20symbols.R)

I’ve not figured out a way to combine the two ideas above.


### When Bayes, Ockham, and Shannon come together to define machine learning

A beautiful idea, which binds together concepts from statistics, information theory, and philosophy.

### “Auto-What?” –  A Taxonomy of Automated Machine Learning

Automated machine learning is a rapidly developing segment of artificial intelligence - it’s time to define what an AutoML product is so end-users can compare product capabilities intelligently.

### You’ve got data on 35 countries, but it’s really just N=3 groups.

Jon Baron points to a recent article, “Societal inequalities amplify gender gaps in math,” by Thomas Breda, Elyès Jouini, and Clotilde Napp (supplementary materials here), and writes:

A particular issue bothers me whenever I read studies like this, which use nations as the unit of analysis and then make some inference from correlations across nations. And I suspect that the answer to my concern is well known (but maybe not to the authors of these studies, and not to me) and you could just say what it is.

My concern is this. The results here are based on correlations across 35 nations. But, if you look at the supplement, you will see that these nations fall into distinct groups. A whole bunch of them are “northern European.” These countries are related to each other in many ways, including geography, history, and culture. It seems to me that these kinds of relationships reduce the true number of observations to some number much smaller than 35. The whole result in the paper could come from the fact that this particular culture happens to have two features: low inequality and high emphasis on women’s education. In reality, these two features need not have any relationship. (I suspect that Japan and China would help break that correlation, for example.)

To take an extreme example, suppose you had a sample consisting of all the countries of Europe and all the countries of Africa. That would be quite a few countries. And, within that total sample, you could find lots of highly significant correlations. But this would clearly be an N of 2, for all practical purposes.

Or suppose we count all the U.S. states as if they were separate “states” (in the sense of nations). Why not? Surely Massachusetts and Alabama are no more similar than Germany and Norway.

I can’t think of a purely statistical way to solve this problem. But that doesn’t mean there isn’t one.

I replied with a link to this discussion from a few years ago on that controversial claim that high genetic diversity, or low genetic diversity, is bad for the economy, in which I wrote:

Two economics professors, Quamrul Ashraf and Oded Galor, wrote a paper, “The Out of Africa Hypothesis, Human Genetic Diversity, and Comparative Economic Development,” that is scheduled to appear in the American Economic Review. . . . Ashraf and Galor have, however, been somewhat lucky in their enemies, in that they’ve been attacked by a bunch of anthropologists who have criticized them on political as well as scientific grounds. This gives the pair of economists the scientific and even moral high ground, in that they can feel that, unlike their antagonists, they are the true scholars, the ones pursuing truth wherever it leads them, letting the chips fall where they may.

The real issue for me is that the chips aren’t quite falling the way Ashraf and Galor think they are. . . .

The way to go is to start with the big pattern they noticed: the most genetically diverse countries (according to their measure) are in east Africa, and they’re poor. The least genetically diverse countries are remote undeveloped places like Bolivia and are pretty poor. Industrialized countries are not so remote (thus they have some diversity) but they’re not filled with east Africans (thus they’re not extremely genetically diverse). From there, you can look at various subsets of the data and perform various side analyses, as the authors indeed do for much of their paper.

And this post from a couple years later, where I wrote:

I continue to think that Ashraf and Galor’s paper is essentially an analysis of three data points (sub-Saharan Africa, remote Andean countries and Eurasia). It offered little more than the already-known stylized fact that sub-Saharan African countries are very poor, Amerindian countries are somewhat poor, and countries with Eurasians and their descendants tend to have middle or high incomes.

Baron replied:

Yes and no. The second reference seems to state the problem, but in a way that is specific to this case. Yet I see this sort of thing all the time. I’m not concerned about the direction of causality. That is a separate problem. I think that the correlations across countries are highly deceptive when the countries fall into groups of very similar countries. When correlations are deceptive in this way, they are not useful for inferring any sort of causality, even if we can infer its direction from other considerations.

In assessing the reliability of a correlation (by any method) it helps when N is higher. With an N of 35, a correlation can be clearly significant (for example) when the same correlation would be nowhere close to significant if the N is 3. Yet, if 35 countries fall into 3 groups of highly similar countries, the effective N is more like 3 than 35. Even worse, if the countries fall into two groups, you cannot even compute a true correlation at all. It only appears that you can when you use country as the unit of analysis. In the present study, this problem is exacerbated because the sampling of countries used, out of the population of all countries, is not at all random.

The same problem occurs, however, when analyzing U.S. states, in studies that look at the entire population of states. Many of these correlations arise because states fall into groups: Confederacy, New England plus West Coast, flyover country. For example, I’m sure that “average humidity” correlates with “percent of evangelical Christians” across states, but that is really the result of the Confederacy alone, hence a historical accident.

I guess an analogous problem occurs with time series. If “year” is the unit of analysis, you can get a nice correlation that is really the result of two linear trends over some period of time. (You can even use “day” and have a huge N.) I think this problem has been solved statistically. But I don’t know of any way of solving the problem of what might be called spatial or multi-dimensional similarity.

I don’t know that I’d say that the high percentage of evangelical Christians in the southern U.S. is a result of the Confederacy—maybe we’d still see it if those states had never seceded and that war had never happened—but that’s not really the point here.

My response to Baron’s question is that you can deal with this sort of clustering by fitting a multilevel model including indicators for group as well as country. That said, this won’t solve the whole problem. As always, inferences depend on the specification of the model. In particular, including group indicators in your regression won’t necessarily resolve the problem. Ultimately I think you have to go to more careful models. For example, if you are comparing what’s going on in 35 countries, but they’re all in 3 groups, you might want to separately do analyses between and within groups.
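Baron's worry is easy to demonstrate numerically. In the hypothetical simulation below, 35 "countries" fall into 3 groups; the two traits are generated independently within every group, yet the pooled cross-country correlation is large purely because the group means line up:

```python
import numpy as np

rng = np.random.default_rng(42)
# Three "cultural groups" with group-level means for two traits;
# within a group the two traits are generated independently.
group_means = {"A": (0.0, 0.0), "B": (3.0, 3.0), "C": (6.0, 6.0)}
sizes = {"A": 12, "B": 12, "C": 11}   # 35 "countries" total

x, y, within_r = [], [], []
for g, (mx, my) in group_means.items():
    gx = mx + rng.normal(size=sizes[g])
    gy = my + rng.normal(size=sizes[g])
    x.extend(gx)
    y.extend(gy)
    within_r.append(np.corrcoef(gx, gy)[0, 1])

pooled_r = np.corrcoef(x, y)[0, 1]
# pooled_r is large (driven entirely by the 3 group means) even though
# the within-group correlations just hover around zero.
```

A multilevel model with group indicators, as suggested above, would attribute the shared variation to the group level instead of inflating the country-level correlation.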

### Deep Learning: The Impact of NVIDIA DGX Station

Read this IDC report & see how a deep learning workstation may solve IT problems of many researchers, developers, and creative professionals.

### One Drink Per Day, Your Chances of Developing an Alcohol-Related Condition

While a drink a day might increase your risk of experiencing an alcohol-related condition, the change is low in absolute numbers.

### Four short links: 25 September 2018

Software Engineering, ML Hardware Trends, Time Series, and Eng Team Playbooks

1. Notes to Myself on Software Engineering -- Code isn’t just meant to be executed. Code is also a means of communication across a team, a way to describe to others the solution to a problem. Readable code is not a nice-to-have; it is a fundamental part of what writing code is about. A solid list of advice/lessons learned.
2. Machine Learning Shifts More Work To FPGAs, SoCs -- compute power used for AI/ML is doubling every 3.5 months. FPGAs and ASICs are already predicted to be 25% of the market for machine learning accelerators in 2018. Why? FPGAs and ASICs use far less power than GPUs, CPUs, or even the 75 watts per hour Google’s TPU burns under heavy load. [...] They can also deliver a performance boost in specific functions chosen by customers that can be changed along with a change in programming.
3. Time Series Forecasting -- one of those "three surprising things" articles. The three surprising things: You need to retrain your model every time you want to generate a new prediction; sometimes you have to do away with train/test splits; and the uncertainty of the forecast is just as important as, or even more so, than the forecast itself.
4. Health Monitor -- Atlassian's measures of whether your team is doing well. Their whole set of playbooks is great reading for engineering managers.

### EARLy bird catches the worm!

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

Two weeks ago was our most successful EARL London conference in its 5-year history, which I had the pleasure of attending for both days of talks. Now I must admit, as a Python user, I did feel a little bit like I was being dragged along to an event where everyone would be talking about the latest R packages for customising RMarkdown and Shiny applications (… and there was a little bit of that – I’m pretty sure I heard someone joke that it should be called the Shiny conference).

However, I was pleasantly surprised to find a diverse forum of passionate and inspiring data scientists from a wide range of specialisations (and countries!), each with unique personal insights to share. Although the conference was R focused, the concepts that were discussed are universally applicable across the Data Science profession, and I learned a great deal from attending these talks. If you weren’t fortunate enough to attend, or would like a refresher, here are my top 5 takeaways from the conference (you can find the slides for all the talks here; click on the speaker image to find them):

### 1. Data science must be driven by business challenges

Steven Wilkins, Edwina Dunn, Rich Pugh

For data to have a positive impact within an organisation, data science projects need to be defined according to the challenges impacting the business and those important decisions that the business needs to make. There’s no use building a model to describe past behaviour or predict future sales if this can’t be translated into action. I’ve heard this from Rich a thousand times since I’ve been at Mango Solutions, but hearing Steven Wilkins describe how this allowed Hiscox to successfully deliver business value from analytics really drove the point home for me. Similarly, Edwina Dunn demonstrated that those organisations which take the world by storm (e.g. Netflix, Amazon, Uber and AirBnB) are those which first and foremost are able to identify customer needs and then use data to meet those needs.

### 2. Communication drives change within organisations

Rich Pugh, Edwina Dunn, Leanne Fitzpatrick, Steven Wilkins

However, even the best run analytics projects won’t have any impact if the organisation does not value the insights they deliver. People are at the heart of the business, and organisations need to undergo a cultural shift if they want data to drive their decision making. An organisation can only become truly data-driven if all of its members can see the value of making decisions based on data and not intuition. Obviously, an important part of data science is the ability to communicate insights to external stakeholders, by means of storytelling and visualisations. However, even within an organisation, communication is just as important to instil this much needed cultural change.

### 3. Setting up frameworks streamlines productivity

Leanne Fitzpatrick, Steven Wilkins, Garrett Grolemund, Scott Finnie & Nick Forrester, George Cushen

Taking the time to set up frameworks ensures that company vision can be translated into day to day productivity. In reference to point 1, setting up a framework for prototyping of data science projects allows rapid evaluation of their potential impact to the business. Similarly, a consistent framework should be applied to communication within organisations, such as establishing how to educate the business to promote cultural change, or in the form of documentation and code reviews for developers.

On the technical side, pre-defined frameworks should also be used to bridge the gap between modelling and deployment. Leanne Fitzpatrick’s presentation demonstrated how the use of Docker images, YAML, project templates and engineer-defined test frameworks minimises unnecessary back and forth between data scientists and data engineers and therefore can streamline productivity. To enable this, however, it is important to teach modellers the importance of keeping production in mind during development, and to teach model requirements to data engineers, which hugely improved collaboration at Hymans according to Scott Finnie & Nick Forrester.

In the same vein, I was really intrigued by the flexibility of RMarkdown for creating re-usable templates. Garrett Grolemund from RStudio mentioned that we are currently experiencing a reproducibility crisis, in which the validity of scientific studies is put to question by the fact that most of their results are not reproducible. Using a tool such as RMarkdown to publish code used in statistical studies makes sharing and reviewing code much simpler, and minimises the risk of oversight. Similarly, RMarkdown seems to be a valuable tool for documentation and can even become a simple way of creating project websites, when combined with R packages such as George Cushen’s Kickstart-R.

### 4. Interpretability beats complexity (sometimes)

Kasia Kulma, Wojtek Kostelecki, Jeremy Horne, Jo-fai Chow

Stakeholders might not always be willing to trust models, and might prefer to fall back on their own experience. Therefore, being able to clearly interpret modelling results is essential to engage people and drive decision-making. One way of addressing this concern is to use simple models such as linear regression or logistic regression for time-series econometrics and market attribution, as demonstrated by Wojtek Kostelecki. The advantage of these is that we can assess the individual contribution of variables to the model, and therefore clearly quantify their impact on the business.
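As a toy illustration of this point (my own sketch, not code from the talk), the coefficients of a simple linear model can be read directly as per-unit contributions of each variable:

```r
# Toy data: sales driven by two marketing channels. With a linear model,
# each coefficient directly quantifies a variable's contribution.
set.seed(42)
n <- 200
tv_spend    <- runif(n, 0, 100)
radio_spend <- runif(n, 0, 50)
sales <- 5 + 0.8 * tv_spend + 0.3 * radio_spend + rnorm(n, sd = 2)

fit <- lm(sales ~ tv_spend + radio_spend)
coef(fit)  # estimated contributions, close to the true 0.8 and 0.3
```

Because the model is linear, "one more unit of TV spend adds about 0.8 units of sales" is a statement a stakeholder can act on, which is exactly the interpretability argument made above.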

However, there are some cases where a more sophisticated model should be favoured over a simple one. Jeremy Horne’s example of customer segmentation proved that we aren’t always able to implement geo-demographic rules to help identify which customers are likely to engage with the business. “This is the reason why we use sophisticated machine learning models”, since they are better able to distinguish between different people from the same socio-demographic group, for example. This links back to Edwina Dunn’s mention of how customers should no longer be categorised by their profession or geo-demographics, but by their passions and interests.

Nevertheless, ‘trusting the model’ is a double-edged sword, and there are some serious ethical issues to consider, especially when dealing with sensitive personal information. I’m also pretty sure I heard the word ‘GDPR’ mentioned at every talk I attended. But fear not, here comes LIME to the rescue! Kasia Kulma explained how Local Interpretable Model-Agnostic Explanations (say that 5 times fast) allow modellers to sanity check their models by giving interpretable explanations as to why a model predicted a certain result. By extension, this can help prevent bias and discrimination, and help avoid exploitative marketing.
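To make the LIME idea concrete, here is a hand-rolled sketch in base R (illustrative only, and not the API of the lime package): perturb a single instance, query the black-box model, weight the perturbations by proximity, and fit an interpretable weighted linear surrogate whose coefficients explain the local prediction:

```r
# Minimal LIME-style local surrogate (my own toy sketch).
set.seed(1)
black_box <- function(x1, x2) plogis(3 * x1 - 2 * x2)  # pretend this is opaque

instance <- c(x1 = 0.5, x2 = 0.5)  # the prediction we want to explain

# 1. Perturb the instance of interest
pert <- data.frame(x1 = rnorm(500, instance["x1"], 0.3),
                   x2 = rnorm(500, instance["x2"], 0.3))

# 2. Query the black box and weight samples by proximity to the instance
pert$y <- black_box(pert$x1, pert$x2)
w <- exp(-((pert$x1 - instance["x1"])^2 + (pert$x2 - instance["x2"])^2) / 0.25)

# 3. Fit an interpretable surrogate; its coefficients explain the local
#    behaviour: x1 pushes the prediction up, x2 pulls it down
surrogate <- lm(y ~ x1 + x2, data = pert, weights = w)
coef(surrogate)
```

The surrogate's signs and magnitudes are the "interpretable explanation" for this one prediction; the lime package automates exactly this kind of procedure for real models.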

### 5. R and Python can learn from each other

David Smith (during the panellist debate)

Now comes the fiery debate. Python or R? Call me controversial but, how about both? This was one of the more intriguing concepts that I heard, which came as the result of a question during the engaging panellist debate about the R and data science community. What this conference has demonstrated to me is that R is undergoing a massive transformation from being the simple statistical tool it once was, to a fully-fledged programming language which even has tools for production! Not only this, but it has the advantage of being a domain-specific language, which results in a very tight-knit community – which seemed to be the general consensus amongst the panel.

However, there are still a few things R can learn from Python, namely its vast array of tools for transitioning from modelling to deployment. It does seem like R is making steady progress in this regard, with tools such as Plumber to create REST APIs, Shiny Server for serving Shiny web apps online and RStudio Connect to tie these all together with RMarkdown and dashboards. Similarly, machine learning frameworks and cloud services which were more Python focused are now available in R. Keras, for example, provides a nice way to use TensorFlow from R, and there are many R packages available for deploying those models to production servers, as mentioned by Andrie de Vries.

Conversely, Python could learn from R in its approach to data analysis. David Smith remarked that there is a tendency within the Python world to have a model-centric approach to data science. This is also something that I have personally noticed. Whereas R is historically embedded in statistics, and therefore brings many tools for exploratory data analysis, this seems to take a backstage in the Python world. This tendency is exacerbated by popular Python machine learning frameworks such as scikit-learn and TensorFlow, which seem to recommend throwing whole datasets into the model and expecting the algorithm to select significant features for us. Python needs to learn from R tools such as ggplot2, Shiny and the tidyverse, which make it easier to interactively explore datasets.

Another part of the conference I really enjoyed was the lightning talks, which proved how challenging it can be to effectively pitch an idea within a single 10-minute presentation! As a result, here are my…

Lightning takeaways!

• “Companies should focus on what data they need, not the data they have.” (Edwina Dunn – Starcount)
• “Don’t give in to the hype” (Andrie de Vries – RStudio)
• “Trust the model” (Jeremy Horne – MC&C Media)
• “h2o + Spark = hot” (Paul Swiontkowski – Microsoft)
• “Shiny dashboards are cool” (Literally everyone at EARL)

I’m sorry to all the speakers who I haven’t mentioned. I heard great things about all the talks, but this is all I could attend!

Finally, my personal highlight of the conference was the unlimited free drinks – er I mean, getting the opportunity to talk to so many knowledgeable and approachable people from such a wide range of fields! It really was a pleasure meeting and learning from all of you.

If you enjoyed this post, be sure to join us at LondonR at Ball’s Brothers on Tuesday 25th September, where other Mangoes will share their experience of the conference, in addition to the usual workshops, talks and networking drinks.

If you live in the US, or happen to be visiting this November, then come join us at one of our EARL 2018 US Roadshow events: EARL Seattle (WA) on 7th November, EARL Houston (TX) on 9th November, and EARL Boston (MA) on 13th November. Our highlights of the EARL London Conference will be online soon.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Introducing the Kernel Heaping Package III

(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

In the second part of this blog series, I showed how to compute spatial kernel density estimates based on area-level data. The Kernelheaping package also supports boundary-corrected kernel density estimation, which allows us to exclude certain areas where we know that the density must be zero. One example is estimating the population density, where we would like to exclude uninhabited areas such as lakes, forests, parks etc. The Kernelheaping package employs a boundary correction method, where each single kernel is restricted to the area of interest. We continue with our example of elderly people in Berlin from part two:

library(maptools)
library(dplyr)
library(fields)
library(ggplot2)
library(RColorBrewer)
library(Kernelheaping)
library(rgeos)
library(rgdal)

Again, we load a shapefile with the administrative districts, available from: https://www.statistik-berlin-brandenburg.de/opendata/RBS_OD_LOR_2015_12.zip

data <- read.csv2("EWR201512E_Matrix.csv")
berlin <- readOGR("RBS_OD_LOR_2015_12/RBS_OD_LOR_2015_12.shp")
berlin <- spTransform(berlin, CRS("+proj=longlat +datum=WGS84"))

We load an OpenStreetMap file including shapes or polygons with information on uninhabited areas such as lakes, rivers, forests and parks: https://daten.berlin.de/datensaetze/openstreetmap-daten-für-berlin

berlinN <- readOGR("berlin-latest-free.shp/gis_osm_landuse_a_free_1.shp") # land
berlinWater <- readOGR("berlin-latest-free.shp/gis_osm_water_a_free_1.shp") # water

We specifically exclude residential areas and split the shapefile into the two remaining categories (“Nature” and “Other”):

table(berlinN@data$fclass)
berlinN <- berlinN[!(berlinN@data$fclass == "residential"), ]
berlinGreen <- berlinN[(berlinN@data$fclass %in%
                          c("forest", "grass", "nature_reserve", "park",
                            "cemetery", "allotments", "farm", "meadow",
                            "orchard", "vineyard", "heath")), ]
berlinOther <- berlinN[!(berlinN@data$fclass %in%
                           c("forest", "grass", "nature_reserve", "park",
                             "cemetery", "allotments", "farm", "meadow",
                             "orchard", "vineyard", "heath")), ]

These shapes are very complicated with many polygons. Thus we simplify them with the gSimplify() function from the rgeos package:

berlinGreen <- spTransform(gSimplify(berlinGreen, tol = 0.0005, topologyPreserve = FALSE),
                           CRS("+proj=longlat +datum=WGS84"))
berlinOther <- spTransform(gSimplify(berlinOther, tol = 0.0005, topologyPreserve = FALSE),
                           CRS("+proj=longlat +datum=WGS84"))
berlinWater <- spTransform(gSimplify(berlinWater, tol = 0.0005, topologyPreserve = FALSE),
                           CRS("+proj=longlat +datum=WGS84"))

For the dshapebivr() and dshapebivrProp() functions we need a single shapefile; therefore we have to unite the water, nature and other shapefiles:

berlinUnInhabitated <- gUnion(gSimplify(gUnion(berlinGreen, berlinOther), tol = 0.0005),
                              berlinWater)
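As an aside, the boundary-correction principle that Kernelheaping applies in two dimensions can be illustrated with a minimal one-dimensional sketch (my own toy example, not package code): each kernel is renormalised by the share of its mass that falls inside the allowed region, so no density leaks into the excluded areas:

```r
# 1-D illustration of boundary correction: restrict each Gaussian kernel
# to the allowed region [0, 1] by dividing by its mass inside that region.
set.seed(7)
x <- rbeta(300, 2, 2)           # observations, all inside [0, 1]
h <- 0.1                        # bandwidth
grid <- seq(0, 1, length.out = 201)

corrected_kde <- function(g) {
  k <- dnorm(g, mean = x, sd = h)
  # Mass of each kernel that falls inside [0, 1]; dividing by it makes
  # each kernel a proper density on the region again.
  inside <- pnorm(1, mean = x, sd = h) - pnorm(0, mean = x, sd = h)
  mean(k / inside)
}
dens <- sapply(grid, corrected_kde)

# The corrected estimate integrates to ~1 over [0, 1], with no mass
# leaking outside the boundary.
sum(dens) * (grid[2] - grid[1])
```

Kernelheaping does the analogous renormalisation in 2-D against the polygons passed via deleteShapes, which is why the estimates below respect lakes, forests and parks.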

Now we perform the same data preparation steps as in the previous part and estimate the boundary-corrected density of people between 65 and 80 in Berlin. The shapefile of uninhabited areas now goes into the deleteShapes argument:

dataIn <- cbind(do.call(rbind, lapply(berlin@polygons, function(x) x@labpt)),
                data$E_E65U80)
est <- dshapebivr(data = dataIn, burnin = 5, samples = 15, adaptive = FALSE,
                  shapefile = berlin, deleteShapes = berlinUnInhabitated,
                  gridsize = 325, boundary = TRUE)

To plot the map in ggplot2, we need to perform some additional data preparation steps:

berlin@data$id <- as.character(berlin@data$PLR)
berlin@data$E_E65U80 <- data$E_E65U80
berlinPoints <- fortify(berlin, region = "id")
berlin@data$E_E65U80density <- berlin@data$E_E65U80 / (gArea(berlin, byid = TRUE) / 1000000)
berlinDf <- left_join(berlinPoints, berlin@data, by = "id")
kData <- data.frame(expand.grid(long = est$Mestimates$eval.points[[1]],
                                lat = est$Mestimates$eval.points[[2]]),
                    Density = est$Mestimates$estimate %>% as.vector) %>%
  filter(Density > 0)

Now, we are able to plot the density together with the administrative districts and uninhabited areas of different types:

ggplot(kData) +
  geom_raster(aes(long, lat, fill = Density)) +
  ggtitle("Bivariate density of Inhabitants between 65 and 80 years") +
  scale_fill_gradientn(colours = c("#FFFFFF", "coral1")) +
  geom_polygon(fill = "grey20", data = fortify(gIntersection(berlin, berlinOther)),
               aes(long, lat, group = group), alpha = 0.25) +
  geom_polygon(fill = "darkolivegreen3", data = fortify(gIntersection(berlin, berlinGreen)),
               aes(long, lat, group = group), alpha = 0.25) +
  geom_polygon(fill = "deepskyblue3", data = fortify(gIntersection(berlin, berlinWater)),
               aes(long, lat, group = group), alpha = 0.25) +
  geom_path(color = "#000000", data = berlinDf, size = 0.1,
            aes(long, lat, group = group)) +
  coord_quickmap()

## Smooth Estimates of Proportion

One may not only estimate the density, but also the proportion of a certain group relative to the overall population. The Kernelheaping package provides the dshapebivrProp() function which smoothly estimates the spatial proportion using a Nadaraya-Watson-type estimator. Naturally, it includes boundary correction as well. We use another open data example for Berlin on inhabitants with migration background from https://daten.berlin.de/datensaetze/einwohnerinnen-und-einwohner-mit-migrationshintergrund-berlin-lor-planungsräumen

First, we load the dataset and merge the area ids such that they fit with the shapefile of Berlin:

berlinMigration <- read.csv2("EWRMIGRA201512H_Matrix.csv")
berlinMigration$RAUMID <- as.character(berlinMigration$RAUMID)
berlinMigration$RAUMID[nchar(berlinMigration$RAUMID) == 7] <-
  paste0("0", berlinMigration$RAUMID[nchar(berlinMigration$RAUMID) == 7])
berlinMigration <- berlinMigration[order(berlinMigration$RAUMID), ]

We model the spatial proportion of inhabitants with Turkish migration background. For the proportion, a fourth column with the total number of people in that area is necessary:

dataTurk <- cbind(do.call(rbind, lapply(berlin@polygons, function(x) x@labpt)),
                  berlinMigration$HK_Turk, berlinMigration$MH_E)

We estimate the proportion with the dshapebivrProp() function now:

estTurk <- dshapebivrProp(data = dataTurk,
                          burnin = 5,
                          samples = 10,
                          adaptive = FALSE,
                          deleteShapes = berlinUnInhabitated,
                          shapefile = berlin,
                          gridsize = 325,
                          boundary = TRUE,
                          numChains = 4,
                          numThreads = 4)

Now we can plot these proportions:

gridBerlin <- expand.grid(long = estTurk$Mestimates$eval.points[[1]],
                          lat = estTurk$Mestimates$eval.points[[2]])
kDataTurk <- data.frame(gridBerlin,
                        Proportion = estTurk$proportions %>% as.vector) %>%
  filter(Proportion > 0)
ggplot(kDataTurk) +
  geom_raster(aes(long, lat, fill = Proportion)) +
  ggtitle("Proportion of inhabitants with Turkish migration background") +
  scale_fill_gradientn(colours = c("#FFFFFF", "coral1")) +
  geom_polygon(fill = "grey20", data = fortify(gIntersection(berlin, berlinOther)),
               aes(long, lat, group = group), alpha = 0.25) +
  geom_polygon(fill = "darkolivegreen3", data = fortify(gIntersection(berlin, berlinGreen)),
               aes(long, lat, group = group), alpha = 0.25) +
  geom_polygon(fill = "deepskyblue3", data = fortify(gIntersection(berlin, berlinWater)),
               aes(long, lat, group = group), alpha = 0.25) +
  geom_path(color = "#000000", data = berlinDf, size = 0.1,
            aes(long, lat, group = group)) +
  coord_quickmap()

## Hotspot Estimation

Spatial kernel density estimates are a great tool to identify subpopulation hotspots. Three different countries / regions of origin are compared: Arabian countries, countries of the former Soviet Union and Poland. We perform the usual data preparation and estimation steps:

dataArab <- cbind(do.call(rbind, lapply(berlin@polygons, function(x) x@labpt)),
                  berlinMigration$HK_Arab)
dataSU <- cbind(do.call(rbind, lapply(berlin@polygons, function(x) x@labpt)),
                berlinMigration$HK_EheSU)
dataPol <- cbind(do.call(rbind, lapply(berlin@polygons, function(x) x@labpt)),
                 berlinMigration$HK_Polen)
estArab <- dshapebivr(data = dataArab, burnin = 5, samples = 10, adaptive = FALSE,
                      shapefile = berlin, gridsize = 325, boundary = TRUE)
estSU <- dshapebivr(data = dataSU, burnin = 5, samples = 10, adaptive = FALSE,
                    shapefile = berlin, gridsize = 325, boundary = TRUE)
estPol <- dshapebivr(data = dataPol, burnin = 5, samples = 10, adaptive = FALSE,
                     shapefile = berlin, gridsize = 325, boundary = TRUE)
gridBerlin <- expand.grid(long = estArab$Mestimates$eval.points[[1]],
                          lat = estArab$Mestimates$eval.points[[2]])

Now we use the 97.5% quantile of the inhabited area to define hotspots:

kDataArab <- data.frame(gridBerlin,
                        Density = estArab$Mestimates$estimate %>% as.vector) %>%
  filter(Density > 0) %>%
  filter(Density > quantile(Density, 0.975)) %>%
  mutate(Density = "Arabian countries")
kDataSU <- data.frame(gridBerlin,
                      Density = estSU$Mestimates$estimate %>% as.vector) %>%
  filter(Density > 0) %>%
  filter(Density > quantile(Density, 0.975)) %>%
  mutate(Density = "Former Soviet Union")
kDataPol <- data.frame(gridBerlin,
                       Density = estPol$Mestimates$estimate %>% as.vector) %>%
  filter(Density > 0) %>%
  filter(Density > quantile(Density, 0.975)) %>%
  mutate(Density = "Poland")

Now, we display the hotspots of all three population subgroups in a single plot:

ggplot() +
  geom_raster(aes(long, lat), fill = "#FFFFFF", data = kData, alpha = 0.6) +
  geom_raster(aes(long, lat, fill = Density), data = kDataArab, alpha = 0.6) +
  geom_raster(aes(long, lat, fill = Density), data = kDataSU, alpha = 0.6) +
  geom_raster(aes(long, lat, fill = Density), data = kDataPol, alpha = 0.6) +
  scale_fill_manual(guide_legend(title = ""), values = c("#f8eb4a", "#DD9123", "#8A3B89")) +
  ggtitle("Hotspots of Inhabitants With Different Migration Background") +
  geom_polygon(fill = "grey20", data = fortify(gIntersection(berlin, berlinOther)),
               aes(long, lat, group = group), alpha = 0.25) +
  geom_polygon(fill = "darkolivegreen3", data = fortify(gIntersection(berlin, berlinGreen)),
               aes(long, lat, group = group), alpha = 0.25) +
  geom_polygon(fill = "deepskyblue3", data = fortify(gIntersection(berlin, berlinWater)),
               aes(long, lat, group = group), alpha = 0.25) +
  geom_path(color = "#000000", data = berlinDf, size = 0.1,
            aes(long, lat, group = group)) +
  coord_quickmap() +
  theme(legend.position = "top")

Further parts of the article series Introducing the Kernelheaping Package:


## Bringing unprecedented reliability and performance to cloud data lakes

Designed by Databricks in collaboration with Microsoft, Azure Databricks combines the best of Databricks’ Apache Spark™-based cloud service and Microsoft Azure. The integrated service provides the Databricks Unified Analytics Platform integrated with the Azure cloud platform, encompassing the Azure Portal, Azure Active Directory, and other data services on Azure, including Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Storage, and Microsoft Power BI.

Databricks Delta, a component of Azure Databricks, addresses the data reliability and performance challenges of data lakes by bringing unprecedented data reliability and query performance to cloud data lakes. It is a unified data management system that delivers ML readiness for both batch and stream data at scale while simplifying the underlying data analytics architecture.

Further, it is easy to port code to use Delta. With today’s public preview, Azure Databricks Premium customers can start using Delta straight away and benefiting from the acceleration that large, reliable datasets can provide to their ML efforts. Others can try it out using the Azure Databricks 14-day trial.

## Common Data Lake Challenges

Many organizations have responded to their ever-growing data volumes by adopting data lakes as places to collect their data ahead of making it available for analysis. While this has tended to improve the situation somewhat, data lakes also present some key challenges:

Query performance – The required ETL processes can add significant latency, such that it may take hours before incoming data manifests in a query response, so users do not benefit from the latest data. Further, increasing scale and the resulting longer query run times can prove unacceptable for users.

Data reliability – The complex data pipelines are error-prone and consume inordinate resources. Further, schema evolution as business needs change can be effort-intensive. Finally, errors or gaps in incoming data, a not uncommon occurrence, can cause failures in downstream applications.

System complexity – It is difficult to build flexible data engineering pipelines that combine streaming and batch analytics. Building such systems requires complex and low-level code. Interventions during stream processing with batch correction, or programming multiple streams from the same sources or to the same destinations, are restricted.

## Databricks Delta To The Rescue

Databricks Delta is already in use by several customers (handling more than 300 billion rows and more than 100 TB of data per day) as part of a private preview. Today we are excited to announce that it is entering Public Preview status for Microsoft Azure Databricks Premium customers, expanding its reach to many more.

Using an innovative new table design, Delta supports both batch and streaming use cases with high query performance and strong data reliability while requiring a simpler data pipeline architecture:

Increased query performance – Able to deliver 10 to 100 times faster performance than Apache Spark™ on Parquet through the use of key enablers such as compaction, flexible indexing, multi-dimensional clustering and data caching.

Improved data reliability – By employing ACID (“all or nothing”) transactions, schema validation / enforcement, exactly once semantics, snapshot isolation and support for UPSERTS and DELETES.

Reduced system complexity – Through the unification of batch and streaming in a common pipeline architecture – being able to operate on the same table also means a shorter time from data ingest to query result. Schema evolution provides the ability to infer schema from input data making it easier to deal with changing business needs.

## The Versatility of Delta

Delta can be deployed to help address a myriad of use cases including IoT, clickstream analytics and cyber security. Indeed, some of our customers are already finding value with Delta for these – I hope to share more on that in future posts. My colleagues have written a blog post (Simplify Streaming Stock Data Analysis Using Databricks Delta) to showcase Delta that you might find interesting.

## Easy to Adopt: Check Out Delta Today

Porting existing Spark code to use Delta is as simple as changing

“CREATE TABLE … USING parquet”

to

“CREATE TABLE … USING delta”

If you are already using Azure Databricks Premium, you can explore Delta today using:

If you are not already using Databricks, you can try Databricks Delta for free by signing up for the free Azure Databricks 14 day trial.

--

The post Databricks Delta: Now Available in Preview as Part of Microsoft Azure Databricks appeared first on Databricks.

### Distilled News

Do you have a favorite coffee place in town? When you think of having a coffee, you might just go to this place as you´re almost sure that you will get the best coffee. But this means you´re missing out on the coffee served by this place´s cross-town competitor. And if you try out all the coffee places one by one, the probability of tasting the worse coffee of your life would be pretty high! But then again, there´s a chance you´ll find an even better coffee brewer. But what does all of this have to do with reinforcement learning?
• The Incredible Inventions of Intuitive AI – Maurice Conti
• How Algorithms Shape our World – Kevin Slavin
• What Happens When our Computers Get Smarter than We Are? – Nick Bostrom
• Can we Build AI without Losing Control Over it? – Sam Harris
• How a Driverless Car Sees the Road – Chris Urmson
• How We´re Teaching Computers to Understand Pictures – Fei-Fei Li
• How Computers Learn to Recognize Objects Instantly – Joseph Redmon
• The Jobs We´ll Lose to Machines – Anthony Goldbloom
• How AI can Enhance our Memory, Work and Social Lives – Tom Gruber
• How AI can Compose a Personalized Soundtrack to your Life – Pierre Barreau
Conversational user interface (UI) is changing the way that we interact. Intelligent assistants, chatbots and voice-enabled devices, like Amazon Alexa and Google Home, offer a new, natural, and intuitive human-machine interaction and open up a whole new world for us as humans. Chatbots and voicebots ease, speed up, and improve daily tasks. They increase our efficiency and compared to humans, they are also very cost effective for the businesses employing them. This article will address the concept of conversational UIs by initially exploring what they are, how they evolved, what they offer. The article provides an introduction to the conversational world. We will take a look at how UI has developed over the years and the difference between voice control, chatbots, virtual assistants, and conversational solutions.
As companies become more data-driven there is often a proliferation of data from both internal sources as well as third parties being consumed. Rarely have I seen firms try and centralise where datasets are stored. Instead, data is often copied onto infrastructure for individual teams and departments. This allows teams to not disrupt others with their work as well as avoid disruption from other teams. Data sources are often refreshed in batches ranging from every few minutes to monthly updates. The file formats, compression schemes and encryption systems used to proliferate these datasets can vary greatly. There is no one single tool I use for collection and analysis of new datasets. I do my best to pick tools that help me avoid writing a lot of bespoke code while taking advantage of the hardware available on any one system I may be using. In this guide I’ll walk through an exercise in consuming, transforming and analysing a data dump of the English language version of Wikipedia.
The here package is pretty simple (only 3 functions), but I cannot remember how to use it to navigate folders, so this is my aide-memoire. It might be useful for others too. here() finds the root of your current folder / working directory. If you use Projects in RStudio, that will usually be the root of your project folder. If not, you can use set_here() to create a small file which will set the root location.
Generalized linear models – and generalized linear mixed models – are called generalized linear because they connect a model´s outcome to its predictors in a linear way. The function used to make this connection is called a link function. Link functions sound like an exotic term, but they´re actually much simpler than they sound.
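A quick example on toy data (my own sketch, not from the article): in logistic regression the link is the logit, log(p / (1 − p)) = β0 + β1x, which connects a probability in (0, 1) to the linear predictor:

```r
# Toy logistic regression: the logit link connects probabilities to a
# linear predictor, and the inverse link maps predictions back to (0, 1).
set.seed(3)
x <- runif(400, -2, 2)
p <- plogis(1.5 * x)            # true probabilities via the inverse link
y <- rbinom(400, 1, p)

fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                        # slope estimated near the true 1.5

# Predictions come back through the inverse link, so they stay in (0, 1)
pr <- predict(fit, type = "response")
range(pr)
```

Swapping link = "logit" for, say, "probit" changes only the connecting function, not the linear structure of the model, which is the point the paragraph makes.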
An experiment about modeling price elasticity as an example and, after analyzing the model with residual plots, it turned out there’s a problem after the 1st of September in the test data set …
What’s New in Deep Learning Research: Reducing Bias and Discrimination in Machine Learning Models with AI Fairness 360
At this point, we are ready to deal with another type of neural network, the so-called convolutional neural networks, widely used in computer vision tasks. These networks are composed of an input layer, an output layer and several hidden layers, some of which are convolutional, hence the name.
Bots that play Dota2, AI that beat the best Go players in the world, computers that excel at Doom. What’s going on? Is there a reason why the AI community has been so busy playing games? Let me put it that way. If you want a robot to learn how to walk what do you do? You build one, program it and release it on the streets of New York? Of course not. You build a simulation, a game, and you use that virtual space to teach it how to move around it. Zero cost, zero risks. That’s why games are so useful in research areas. But how do you teach it to walk? The answer is the topic of today’s article and is probably the most exciting field of Machine learning at the time:
Text processing is one of the most common tasks in many ML applications. Below are some examples of such applications.
The curse of dimensionality is the bane of all classification problems. What is the curse of dimensionality? As the number of features (dimensions) increases linearly, the amount of training data required for classification increases exponentially. If the classification is determined by a single feature we need a-priori classification data over a range of values for this feature, so we can predict the class of a new data point. For a feature x with 100 possible values, the required training data is of order O(100). But if there is a second feature y as well that is needed to determine the class, and y has 50 possible values, then we will need training data of order O(5000) – i.e. over the grid of possible values for the pair ‘x, y’. Thus the measure of the required data is the volume of the feature space and it increases exponentially as more features are added.
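The arithmetic in that paragraph is easy to check directly (a trivial sketch of my own):

```r
# Required training data scales with the volume of the feature space:
# the grid over (x, y) from the example above has 100 * 50 cells.
values_per_feature <- c(x = 100, y = 50)
prod(values_per_feature)         # 5000, matching the O(5000) in the text

# Adding features multiplies the requirement: exponential growth.
n_features <- 1:6
data_needed <- 100 ^ n_features  # assuming 100 values per feature
data_needed
```

Even at six modest 100-valued features, the grid already has 10^12 cells, which is why dimensionality reduction and feature selection matter.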
This article describes how to export data from iTunes with R and itunesr (by Abdul Majed Raja) followed by visualized ratings and reviews. It also covers how to use googleLanguageR for translating reviews by performing language processing with Google Cloud Machine Learning API before conducting basic sentiment analysis.

### Magister Dixit

“There are just not enough brain cells on the planet to even look or even glance at that data, let alone analyze it and extract knowledge from it.” Yann LeCun

### Create stylish tables in R using formattable

(This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

I love a good visualization to assist in telling the story of your data. Because of this I am completely hooked on a variety of data visualization packages and tooling. But what happens when you need to visualize the raw numbers? Do you open up the data set in the viewer and screenshot? Do you save the summarized data set locally and add a bit of formatting in Excel? That’s what I used to do with my R summary tables. But it got me thinking: why can’t tables be treated as a first-class data visualization too? Tables need a little pizazz as much as the next data object!

Enter the R package formattable! The formattable package is used to transform vectors and data frames into more readable and impactful tabular formats.
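As a minimal taste of what that means, here is a tiny sketch (assuming formattable is installed) using its built-in percent() format on a plain vector:

```r
library(formattable)

# percent() wraps a numeric vector so it prints as percentages
# rather than raw decimals, while keeping the underlying numbers intact
p <- percent(c(0.1, 0.256))
p
as.numeric(p)   # the original values are still there
```

The same idea scales up to whole data frames, which is what the rest of this tutorial does.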

I’m going to walk you through a step-by-step example of using the formattable R package to make your data frame more presentable for data storytelling.

## Set Up R

In terms of setting up the R working environment, we have a couple of options open to us.  We can use something like R Studio for a local analytics on our personal computer.  Or we can use a free, hosted, multi-language collaboration environment like Watson Studio.  If you’d like to get started with R in IBM Watson Studio, please have a look at the tutorial I wrote.

## Install and load packages, Set variables

R packages contain a grouping of R data functions and code that can be used to perform your analysis. We need to install and load them in your environment so that we can call upon them later.  We are also going to assign a few custom color variables that we will use when setting the colors on our table. If you are in Watson Studio, enter the following code into a cell (or multiple cells), highlight the cell and hit the “run cell”  button.

# Install the relevant libraries - do this one time
install.packages("data.table")
install.packages("dplyr")
install.packages("formattable")
install.packages("tidyr")
#Load the libraries
library(data.table)
library(dplyr)
library(formattable)
library(tidyr)

#Set a few color variables to make our table more visually appealing
customGreen0 = "#DeF7E9"
customGreen = "#71CA97"
customRed = "#ff7f7f"

For our tutorial we are going to be using a data set from the Austin Open Data Portal. It’s a website designed to facilitate easy access to open government data. I’ve been playing around with it somewhat frequently and I’m really impressed with the consistency of design and features per data set. Many other open data portals do not make it this easy to find and download data.

We are going to be using formattable on the Imagine Austin Indicators dataset. As per the Imagine Austin website, the data set tracks key performance indicators (KPIs) of Austin’s progress in creating a connected, vibrant and livable city.

#Download the Austin indicator data set
austinData= fread('https://raw.githubusercontent.com/lgellis/MiscTutorial/master/Austin/Imagine_Austin_Indicators.csv', data.table=FALSE, header = TRUE, stringsAsFactors = FALSE)
head(austinData)
attach(austinData)

## Modify the Data Set

We are going to narrow down the data set to focus on 4 key health metrics: specifically, the prevalence of obesity, tobacco use, cardiovascular disease and diabetes. We are then going to select only the indicator name and yearly KPI value columns. Finally we are going to make extra columns to display the 2011 to 2016 yearly average and the 2011 to 2016 metric improvements.

i1 <- austinData %>%
  filter(`Indicator Name` %in% 
           c('Prevalence of Obesity', 'Prevalence of Tobacco Use', 
             'Prevalence of Cardiovascular Disease', 'Prevalence of Diabetes')) %>%
  select(c(`Indicator Name`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`)) %>%
  # Column names with spaces or leading digits need backticks
  mutate(Average = round(rowMeans(
    cbind(`2011`, `2012`, `2013`, `2014`, `2015`, `2016`), na.rm = TRUE), 2), 
    Improvement = round((`2011` - `2016`) / `2011` * 100, 2))
i1

## View the table data in its raw format

We now have the data in the table we want, so let’s display it to our audience. We can start by viewing the table in its raw format.

i1

## View the Data with the Formattable Package

Viewing the data by simply printing it did not produce a nice looking table. Let’s see what formattable gives us out of the box.

#0) Throw it in the formattable function
formattable(i1)

Not bad! But let’s spruce it up a little. We will left-align the first column, right-align the last column and center-align the rest. Additionally, we will bold the Indicator Name column and color it grey.

#1) First Data Table
formattable(i1, 
            align = c("l","c","c","c","c","c","c","c","r"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey", font.weight = "bold")) 
))

## Add a Color Tile for All Year Columns

We will add the color_tile function to all year columns. This creates the effect of a column-by-column heat map, and it looks great! Note that we are using our own custom colors, declared at the very beginning of the code, to ensure our table has the look and feel we want.

#2) Add the color mapping for all 2011 to 2016.
formattable(i1, align = c("l","c","c","c","c","c","c","c","r"), list(
  `Indicator Name` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")), 
  `2011` = color_tile(customGreen, customGreen0),
  `2012` = color_tile(customGreen, customGreen0),
  `2013` = color_tile(customGreen, customGreen0),
  `2014` = color_tile(customGreen, customGreen0),
  `2015` = color_tile(customGreen, customGreen0),
  `2016` = color_tile(customGreen, customGreen0)
))

## Add a Color Bar for the Average Column

We will now add the color_bar function to the average column. Rather than using a heat map, it will display the same background color each time. However, it will draw a bar whose length indicates the relative size of each row’s value.

#3) Add the color bar to the average column
formattable(i1, align = c("l","c","c","c","c","c","c","c","r"), list(
  `Indicator Name` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")), 
  `2011` = color_tile(customGreen, customGreen0),
  `2012` = color_tile(customGreen, customGreen0),
  `2013` = color_tile(customGreen, customGreen0),
  `2014` = color_tile(customGreen, customGreen0),
  `2015` = color_tile(customGreen, customGreen0),
  `2016` = color_tile(customGreen, customGreen0),
  Average = color_bar(customRed)
))

## Add Our Own Format Function

One great tip that I learned from the vignette is that you can make your own formatting functions really easily. Using their examples in the vignette and on bioinfo.irc.ca, I made a slight modification to create our own improvement_formatter function that bolds the text and colors it our custom red or green depending on its value.

#4) Add sign formatter to improvement over time

improvement_formatter <- formatter(
  "span",
  style = x ~ style(
    font.weight = "bold",
    color = ifelse(x > 0, customGreen, ifelse(x < 0, customRed, "black"))))

formattable(i1, align = c("l","c","c","c","c","c","c","c","r"), list(
  `Indicator Name` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
  `2011` = color_tile(customGreen, customGreen0),
  `2012` = color_tile(customGreen, customGreen0),
  `2013` = color_tile(customGreen, customGreen0),
  `2014` = color_tile(customGreen, customGreen0),
  `2015` = color_tile(customGreen, customGreen0),
  `2016` = color_tile(customGreen, customGreen0),
  Average = color_bar(customRed),
  Improvement = improvement_formatter
))

## Modify the Format Function To Display Images

We are going to slightly modify the format to display an up or down arrow depending on the value of improvement. Note that in the video above, I also change the formatter to display a thumbs-up symbol on the winning improvement value. The code for this and other examples is available on my github repo.

#5) For improvement formatter add icons
# Up and down arrow with greater than comparison from the vignette
improvement_formatter <- formatter("span", 
                                   style = x ~ style(font.weight = "bold", 
                                                     color = ifelse(x > 0, customGreen, ifelse(x < 0, customRed, "black"))), 
                                   x ~ icontext(ifelse(x>0, "arrow-up", "arrow-down"), x)
                                   )
formattable(i1, align = c("l","c","c","c","c","c","c","c","r"), list(
  `Indicator Name` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")), 
  `2011` = color_tile(customGreen, customGreen0),
  `2012` = color_tile(customGreen, customGreen0),
  `2013` = color_tile(customGreen, customGreen0),
  `2014` = color_tile(customGreen, customGreen0),
  `2015` = color_tile(customGreen, customGreen0),
  `2016` = color_tile(customGreen, customGreen0),
  Average = color_bar(customRed),
  Improvement = improvement_formatter
))

## Add an Icon in The Row Title Based on the Value in Another Column

We are going to make one last modification to append an image to the indicator name column based on a value located in another column. This is an important departure from our previous behavior, because previously we were only assigning the format of a single column based on its own values. To enable cross-column comparison, we just need to remove the x in front of the ~ style and ~ icontext conditions. This will allow us to explicitly specify the columns we want to reference.

#6) Add a star to the max value.  Use  if/else value = max(value)
improvement_formatter <- formatter("span", 
                                   style = x ~ style(font.weight = "bold", 
                                                     color = ifelse(x > 0, customGreen, ifelse(x < 0, customRed, "black"))), 
                                   x ~ icontext(ifelse(x == max(x), "thumbs-up", ""), x)
)
## Based on Name
formattable(i1, align = c("l","c","c","c","c","c","c","c","r"), list(
  `Indicator Name` = formatter("span",
                               style = x ~ style(color = "gray"),
                               x ~ icontext(ifelse(x == "Prevalence of Tobacco Use", "star", ""), x)), 
  `2011` = color_tile(customGreen, customGreen0),
  `2012` = color_tile(customGreen, customGreen0),
  `2013` = color_tile(customGreen, customGreen0),
  `2014` = color_tile(customGreen, customGreen0),
  `2015` = color_tile(customGreen, customGreen0),
  `2016` = color_tile(customGreen, customGreen0),
  Average = color_bar(customRed),
  Improvement = improvement_formatter
))

## Compare Column to Column

Finally, we are going to do a simple cross-column, row-wise comparison. We’ll take our same data set but strip it back to just the 2015 and 2016 data. We will then compare the values and mark up the 2016 column with an up or down arrow, colored red for an increase and green for a decrease, based on comparing the 2016 value to the 2015 value.

##7) Compare column to column
#Drop the rest and show just 2015 and 2016
i2 <- austinData %>%
  filter(`Indicator Name` %in% c('Prevalence of Obesity', 'Prevalence of Tobacco Use', 'Prevalence of Cardiovascular Disease', 'Prevalence of Diabetes')) %>%
  select(c(`Indicator Name`, `2015`, `2016`)) 
head(i2)
formattable(i2, align = c("l","c","c"), list(
  `Indicator Name` = formatter("span",
                               style = ~ style(color = "gray")), 
  `2016` = formatter("span", style = ~ style(color = ifelse(`2016` > `2015`, "red", "green")),
                     ~ icontext(ifelse(`2016` > `2015`, "arrow-up", "arrow-down"), `2016`))
))

## Extras

In the full github code, you will see a number of other examples. As a bonus, I’ve also included the code to create the animation using the magick package!


### If you did not already know

Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This type of data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework of learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures. We propose an annealed version of the standard EM algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains.

The adjacency diagram is a space-filling variant of the node-link diagram; rather than drawing a link between parent and child in the hierarchy, nodes are drawn as solid areas (either arcs or bars), and their placement relative to adjacent nodes reveals their position in the hierarchy. The icicle layout in figure 4D is similar to the first node-link diagram in that the root node appears at the top, with child nodes underneath. Because the nodes are now space-filling, however, we can use a length encoding for the size of software classes and packages. This reveals an additional dimension that would be difficult to show in a node-link diagram. …

Least Absolute Shrinkage and Screening Operator (LASSO)
Slide 31: ‘Tibshirani (1996):
LASSO = Least Absolute Shrinkage and Selection Operator
new translation:
LASSO = Least Absolute Shrinkage and Screening Operator’ …

### Document worth reading: “Data Innovation for International Development: An overview of natural language processing for qualitative data analysis”

Availability, collection and access to quantitative data, as well as its limitations, often make qualitative data the resource upon which development programs heavily rely. Both traditional interview data and social media analysis can provide rich contextual information and are essential for research, appraisal, monitoring and evaluation. These data may be difficult to process and analyze both systematically and at scale. This, in turn, limits the ability of timely data driven decision-making which is essential in fast evolving complex social systems. In this paper, we discuss the potential of using natural language processing to systematize analysis of qualitative data, and to inform quick decision-making in the development context. We illustrate this with interview data generated in a format of micro-narratives for the UNDP Fragments of Impact project. Data Innovation for International Development: An overview of natural language processing for qualitative data analysis

### Mapping the 2018 East Africa floods from space with smapr

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Hundreds of thousands of people in east Africa have been displaced and hundreds have died as a result of torrential rains which ended a drought but saturated soils and engorged rivers, resulting in extreme flooding in 2018.
This post will explore these events using the R package smapr, which provides access to global satellite-derived soil moisture data collected by the NASA Soil Moisture Active-Passive (SMAP) mission and abstracts away some of the complexity associated with finding, acquiring, and working with the HDF5 files that contain the observations (shout out to Laura DeCicco and Marco Sciaini for reviewing smapr, and Noam Ross for editing in the rOpenSci onboarding process).
We will focus on Somalia and Kenya, two of the hardest hit countries.
We’ll also lean on another rOpenSci package, rnoaa, to link precipitation to soil moisture.

First, let’s get spatial boundaries for the study area:

library(raster)
library(tidyverse)
library(smapr)
library(rworldmap)
library(rnoaa)
library(plotly)
library(rasterVis)
library(animation)
library(patchwork)
library(sf)

worldmap <- getMap()
study_area <- subset(worldmap, NAME_SORT %in% c('Somalia', 'Kenya'))
study_area_sf <- as(study_area, 'sf')

plot(worldmap)
plot(study_area, add = TRUE, col = 'dodgerblue')


## Finding soil moisture data

Next, we can use the find_smap function to find global soil moisture data for any day.
The SMAP satellite was launched in 2015 with two sensors: an active microwave sensor and a passive radiometer.
The active sensor has since failed, so we will use the radiometer data, specifically version five of the ‘SPL3SMP’ data product (for a full list of data products see https://smap.jpl.nasa.gov/data/).

find_smap('SPL3SMP', dates = '2018-03-01', version = 5)

##                               name       date                     dir
## 1 SMAP_L3_SM_P_20180301_R16010_001 2018-03-01 SPL3SMP.005/2018.03.01/


This returns a data frame with one row per file – we can see here that there is one file available for that date.
If we wanted to search over a range of dates, we could provide a date sequence:

date_seq <- seq(as.Date('2018-03-01'), as.Date('2018-03-06'), by = 1)
files <- find_smap('SPL3SMP', dates = date_seq, version = 5)
files

##                               name       date                     dir
## 1 SMAP_L3_SM_P_20180301_R16010_001 2018-03-01 SPL3SMP.005/2018.03.01/
## 2 SMAP_L3_SM_P_20180302_R16010_001 2018-03-02 SPL3SMP.005/2018.03.02/
## 3 SMAP_L3_SM_P_20180303_R16010_001 2018-03-03 SPL3SMP.005/2018.03.03/
## 4 SMAP_L3_SM_P_20180304_R16010_001 2018-03-04 SPL3SMP.005/2018.03.04/
## 5 SMAP_L3_SM_P_20180305_R16010_001 2018-03-05 SPL3SMP.005/2018.03.05/
## 6 SMAP_L3_SM_P_20180306_R16010_001 2018-03-06 SPL3SMP.005/2018.03.06/


The download_smap function takes the results from find_smap and downloads files locally:

downloads <- download_smap(files, overwrite = FALSE)

##                               name       date                     dir
## 1 SMAP_L3_SM_P_20180301_R16010_001 2018-03-01 SPL3SMP.005/2018.03.01/
## 2 SMAP_L3_SM_P_20180302_R16010_001 2018-03-02 SPL3SMP.005/2018.03.02/
## 3 SMAP_L3_SM_P_20180303_R16010_001 2018-03-03 SPL3SMP.005/2018.03.03/
## 4 SMAP_L3_SM_P_20180304_R16010_001 2018-03-04 SPL3SMP.005/2018.03.04/
## 5 SMAP_L3_SM_P_20180305_R16010_001 2018-03-05 SPL3SMP.005/2018.03.05/
## 6 SMAP_L3_SM_P_20180306_R16010_001 2018-03-06 SPL3SMP.005/2018.03.06/
##                   local_dir
## 1 /home/rstudio/.cache/smap
## 2 /home/rstudio/.cache/smap
## 3 /home/rstudio/.cache/smap
## 4 /home/rstudio/.cache/smap
## 5 /home/rstudio/.cache/smap
## 6 /home/rstudio/.cache/smap


These are HDF5 files, so each file contains multiple datasets.
For this data product, a soil moisture data set is named Soil_Moisture_Retrieval_Data_AM/soil_moisture, and we can use the extract_smap function to generate rasters from this dataset (see list_smap for a list of all datasets contained in any file):

sm_raster <- extract_smap(downloads, name = "Soil_Moisture_Retrieval_Data_AM/soil_moisture")
sm_raster

## class       : RasterBrick
## dimensions  : 406, 964, 391384, 6  (nrow, ncol, ncell, nlayers)
## resolution  : 36032.22, 36032.22  (x, y)
## extent      : -17367530, 17367530, -7314540, 7314540  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=cea +lon_0=0 +lat_ts=30 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : /home/rstudio/.cache/smap/tmp.tif
## names       : SMAP_L3_S//R16010_001, SMAP_L3_S//R16010_001, SMAP_L3_S//R16010_001, SMAP_L3_S//R16010_001, SMAP_L3_S//R16010_001, SMAP_L3_S//R16010_001
## min values  :                  0.02,                  0.02,                  0.02,                  0.02,                  0.02,                  0.02
## max values  :             0.9122642,             0.9390723,             0.9531010,             0.9122642,             0.9531010,             0.9390723


Let’s plot the RasterBrick to see what the soil moisture data look like:

levelplot(sm_raster)


The striping occurs because of the orbit of the SMAP satellite.
If we were interested in some weekly measure of soil moisture, we could simply average across days to get a more continuous picture of soil moisture:

weekly_sm <- mean(sm_raster, na.rm = TRUE)
plot(weekly_sm)


We’ll want to do that set of operations a bunch of times for this case study, so we’ll write a little function that takes a date range and returns a raster averaging over dates:

average_smap <- function(date_range) {
  # download_smap is needed between finding and extracting, as in the
  # workflow above; without it extract_smap has no local files to read
  mean_sm <- find_smap('SPL3SMP', dates = date_range, version = 5) %>%
    download_smap(overwrite = FALSE) %>%
    extract_smap(name = "Soil_Moisture_Retrieval_Data_AM/soil_moisture") %>%
    mean(na.rm = TRUE)
  mean_sm
}


Notice that spatial coverage is still not 100%.
Notably, places the satellite missed and frozen regions in the Arctic have NA values.

## Getting global precipitation data with rnoaa

We can better interpret soil moisture if we know about precipitation, and the NOAA Climate Prediction Center (CPC) has global precipitation data that are a cinch to get with the cpc_prcp function.

cpc_prcp('2018-03-01')

## # A tibble: 259,200 x 3
##      lon   lat precip
##    <dbl> <dbl>  <dbl>
##  1  0.25 -89.8      0
##  2  0.75 -89.8      0
##  3  1.25 -89.8      0
##  4  1.75 -89.8      0
##  5  2.25 -89.8      0
##  6  2.75 -89.8      0
##  7  3.25 -89.8      0
##  8  3.75 -89.8      0
##  9  4.25 -89.8      0
## 10  4.75 -89.8      0
## # ... with 259,190 more rows


Of course, we don’t want to get data just for one date.
Instead, just like we used a date range to get soil moisture data, we can get precipitation data for each date within a range using the map function from the purrr package.
We want to map the cpc_prcp function to each date in our vector date_seq:

date_seq %>%
  map(cpc_prcp) %>%
  str

## List of 6
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':  259200 obs. of  3 variables:
##   ..$ lon   : num [1:259200] 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
##   ..$ lat   : num [1:259200] -89.8 -89.8 -89.8 -89.8 -89.8 ...
##   ..$ precip: num [1:259200] 0 0 0 0 0 0 0 0 0 0 ...
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':  259200 obs. of  3 variables:
##   ..$ lon   : num [1:259200] 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
##   ..$ lat   : num [1:259200] -89.8 -89.8 -89.8 -89.8 -89.8 ...
##   ..$ precip: num [1:259200] 0 0 0 0 0 0 0 0 0 0 ...
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':  259200 obs. of  3 variables:
##   ..$ lon   : num [1:259200] 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
##   ..$ lat   : num [1:259200] -89.8 -89.8 -89.8 -89.8 -89.8 ...
##   ..$ precip: num [1:259200] 0 0 0 0 0 0 0 0 0 0 ...
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':  259200 obs. of  3 variables:
##   ..$ lon   : num [1:259200] 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
##   ..$ lat   : num [1:259200] -89.8 -89.8 -89.8 -89.8 -89.8 ...
##   ..$ precip: num [1:259200] 0 0 0 0 0 0 0 0 0 0 ...
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':  259200 obs. of  3 variables:
##   ..$ lon   : num [1:259200] 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
##   ..$ lat   : num [1:259200] -89.8 -89.8 -89.8 -89.8 -89.8 ...
##   ..$ precip: num [1:259200] 0 0 0 0 0 0 0 0 0 0 ...
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':  259200 obs. of  3 variables:
##   ..$ lon   : num [1:259200] 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
##   ..$ lat   : num [1:259200] -89.8 -89.8 -89.8 -89.8 -89.8 ...
##   ..$ precip: num [1:259200] 0 0 0 0 0 0 0 0 0 0 ...


Amazing!
We now have a list where each element contains a data frame with the amount of precipitation over a consistent spatial grid that covers the whole globe, over a one week interval.
To merge these data frames together into one, we can use bind_rows, and we’ll also filter out NA values, which are represented as negative numbers:

date_seq %>%
  map(cpc_prcp) %>%
  bind_rows %>%
  filter(precip >= 0)

## # A tibble: 558,357 x 3
##      lon   lat precip
##    <dbl> <dbl>  <dbl>
##  1  0.25 -89.8      0
##  2  0.75 -89.8      0
##  3  1.25 -89.8      0
##  4  1.75 -89.8      0
##  5  2.25 -89.8      0
##  6  2.75 -89.8      0
##  7  3.25 -89.8      0
##  8  3.75 -89.8      0
##  9  4.25 -89.8      0
## 10  4.75 -89.8      0
## # ... with 558,347 more rows


Now we have one data frame, and we can compute a mean over all dates for each grid cell using a group_by, summarize operation.

date_seq %>%
  map(cpc_prcp) %>%
  bind_rows %>%
  filter(precip >= 0) %>%
  group_by(lon, lat) %>%
  summarize(precip = mean(precip, na.rm = TRUE))

## # A tibble: 93,060 x 3
## # Groups:   lon [?]
##      lon   lat precip
##    <dbl> <dbl>  <dbl>
##  1  0.25 -89.8      0
##  2  0.25 -89.2      0
##  3  0.25 -88.8      0
##  4  0.25 -88.2      0
##  5  0.25 -87.8      0
##  6  0.25 -87.2      0
##  7  0.25 -86.8      0
##  8  0.25 -86.2      0
##  9  0.25 -85.8      0
## 10  0.25 -85.2      0
## # ... with 93,050 more rows


One last little detail: the longitude values range from 0 to 360, but it’s going to be easier later if they range from -180 to 180, so we’ll use mutate to get longitude defined over (-180, 180).
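The wrap-around can be sanity-checked on a few toy longitude values before applying it to the full grid:

```r
# Longitudes above 180 degrees wrap to the equivalent negative longitude
lon <- c(0.25, 179.75, 180.25, 359.75)
ifelse(lon > 180, lon - 360, lon)
# [1]    0.25  179.75 -179.75   -0.25
```

Values at or below 180 pass through unchanged, and everything above 180 lands in the western hemisphere where we expect it.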

weekly_pr <- date_seq %>%
  map(cpc_prcp) %>%
  bind_rows %>%
  filter(precip >= 0) %>%
  group_by(lon, lat) %>%
  summarize(precip = mean(precip, na.rm = TRUE)) %>%
  ungroup %>%
  mutate(lon = ifelse(lon > 180, lon - 360, lon))


We can generate a raster object by way of a SpatialGridDataFrame, which will be useful later to ensure that the soil moisture and precipitation data are on the same spatial grid:

# little helper function
make_precip_raster <- function(prcp_df) {
  coordinates(prcp_df) <- ~ lon + lat
  gridded(prcp_df) <- TRUE
  prcp_df <- as(prcp_df, "SpatialGridDataFrame") # to full grid
  proj4string(prcp_df) <- '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'
  raster(prcp_df)
}

# create a precip raster
pr_raster <- make_precip_raster(weekly_pr)

# plot it
plot(pr_raster)


Just like we did for the soil moisture data, we’ll bundle up all of these steps into a helper function:

average_precip <- function(date_range) {
  # use the date_range argument, not the global date_seq
  date_range %>%
    map(cpc_prcp) %>%
    bind_rows %>%
    filter(precip >= 0) %>%
    group_by(lon, lat) %>%
    summarize(precip = mean(precip, na.rm = TRUE)) %>%
    ungroup %>%
    mutate(lon = ifelse(lon > 180, lon - 360, lon)) %>%
    make_precip_raster
}


## Fetching global soil moisture and precipitation from 2015 to present

Now we have two functions average_smap and average_precip that we can use to get global soil moisture and precipitation data for any date range.
Next, we’ll use these functions to get data at weekly intervals from the beginning of the SMAP data archive in 2015 through the end of August 2018.

start_dates <- seq(as.Date("2015-04-01"), as.Date("2018-09-01"), by = 7)
end_dates <- start_dates + 6

weekly_smap <- vector(mode = 'list', length = length(start_dates))
weekly_precip <- vector(mode = 'list', length = length(start_dates))

for (i in seq_along(start_dates)) {
  date_seq <- seq(start_dates[i], end_dates[i], by = 1)

  geotiff_name <- paste0('sm-', i, '.tif')
  if (!file.exists(geotiff_name)) {
    average_smap(date_seq) %>%
      writeRaster(geotiff_name)
  }
  weekly_smap[[i]] <- raster(geotiff_name)

  weekly_precip[[i]] <- average_precip(date_seq)
}


Now we have two lists, weekly_smap and weekly_precip, where each element is a raster.
We’d like to convert these to RasterStack objects, and get them on the same spatial grid, in the same projection as our study area polygon.

weekly_smap <- stack(weekly_smap)
names(weekly_smap) <- start_dates
weekly_smap <- projectRaster(weekly_smap, crs = projection(study_area))

weekly_precip <- stack(weekly_precip)
names(weekly_precip) <- start_dates
weekly_precip <- resample(weekly_precip, weekly_smap)


Now that we have RasterStack objects, we will create tidy data frames that we can use in ggplot2.
Because the code to do this is the same for both objects, I’ll write a little helper function.

make_study_area_df <- function(raster) {
  raster %>%
    trim %>%
    as('SpatialPixelsDataFrame') %>%
    as.data.frame() %>%
    gather(date, value, -x, -y) %>%
    as_tibble %>%
    mutate(date = start_dates[as.numeric(as.factor(date))])
}

soil_moisture <- make_study_area_df(weekly_smap) %>%
  rename(sm = value)
precip <- make_study_area_df(weekly_precip) %>%
  rename(pr = value)


Let’s check these out!

soil_moisture

## # A tibble: 171,124 x 4
##        x     y date          sm
##     <dbl> <dbl> <date>     <dbl>
##  1  50.7  11.9 2015-04-01 0.185
##  2  51.1  11.9 2015-04-01 0.127
##  3  50.3  11.6 2015-04-01 0.337
##  4  50.7  11.6 2015-04-01 0.150
##  5  51.1  11.6 2015-04-01 0.139
##  6  43.2  11.3 2015-04-01 0.340
##  7  48.5  11.3 2015-04-01 0.557
##  8  48.8  11.3 2015-04-01 0.361
##  9  49.2  11.3 2015-04-01 0.398
## 10  49.6  11.3 2015-04-01 0.150
## # ... with 171,114 more rows


So we have a data frame where each row is a pixel with a date corresponding to the first day of the week, and then the mean soil moisture for that week in the sm column.
What do these data look like?

my_theme <- theme_minimal() +
  theme(panel.grid.minor = element_blank())

soil_moisture %>%
  ggplot(aes(date, sm)) +
  geom_point(alpha = .02) +
  my_theme +
  xlab('') +
  ylab('Mean soil moisture (m^3 water per m^3 soil)')


precip %>%
  ggplot(aes(date, pr)) +
  geom_point(alpha = .02) +
  my_theme +
  xlab('') +
  ylab('Mean precipitation (mm)')


What does the relationship between precipitation and soil moisture look like?

soil_moisture %>%
  left_join(precip) %>%
  ggplot(aes(pr, sm)) +
  geom_point(alpha = .02) +
  my_theme +
  ylab('Mean soil moisture (m^3 water per m^3 soil)') +
  xlab('Mean precipitation (mm)') +
  scale_y_log10() +
  scale_x_log10()

## Joining, by = c("x", "y", "date")


Notice the band of zero-precipitation points that sit on the y-axis.
Among nonzero precipitation values, it seems like there is a nonlinear relationship between precipitation and soil moisture.

So, now let’s take a look at the spring and summer of 2018, when the flooding was worst in Somalia and Kenya.

soil_moisture %>%
  filter(date > as.Date('2018-01-01')) %>%
  ggplot(aes(x = x, y = y, fill = sm)) +
  geom_raster() +
  scale_fill_viridis_c(direction = -1,
                       'Soil moisture') +
  facet_wrap(~date, nrow = 7) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  geom_sf(data = study_area_sf, inherit.aes = FALSE, fill = NA, size = .2) +
  xlab('') +
  ylab('')


Soil moisture

precip %>%
  filter(date > as.Date('2018-01-01')) %>%
  ggplot(aes(x = x, y = y, fill = pr)) +
  geom_raster() +
  # the scale call was partly lost in extraction; the low colour is assumed
  scale_fill_gradient(low = 'white',
                      high = 'dodgerblue',
                      'Precipitation') +
  facet_wrap(~date, nrow = 7) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  geom_sf(data = study_area_sf, inherit.aes = FALSE, fill = NA, size = .2) +
  xlab('') +
  ylab('')


Precipitation

To get a sense for how these values compare to the overall distribution of values in space, we can plot the same data, but rather than using the soil moisture and precipitation values to color the map, we can color the map using the empirical cumulative distribution function (CDF).
This will give us values between 0 and 1, which tell us the fraction of values below a particular value.
So for instance if the empirical CDF at a particular value gives us 0.9, then 90% of the observations were less than that value.
One quick thing to notice is that the distribution of values is different for every pixel:

sm_ecdf <- soil_moisture %>%
  group_by(x, y) %>%
  mutate(ecdf = ecdf(sm)(sm)) %>%
  ungroup

sm_ecdf %>%
  ggplot(aes(sm, ecdf, group = interaction(x, y))) +
  geom_line(alpha = .1) +
  my_theme +
  xlab('Soil moisture (m^3 water per m^3 soil)') +
  ylab('Empirical cumulative distribution function')


pr_ecdf <- precip %>%
  group_by(x, y) %>%
  mutate(ecdf = ecdf(pr)(pr)) %>%
  ungroup

pr_ecdf %>%
  ggplot(aes(pr, ecdf, group = interaction(x, y))) +
  geom_line(alpha = .1) +
  my_theme +
  xlab('Precipitation (mm)') +
  ylab('Empirical cumulative distribution function')


A few things to notice about these empirical cumulative distribution functions:

• The distributions of soil moisture and precipitation are quite different (e.g., the distribution of precipitation has a lot of zero values, and a long tail).
• The empirical CDFs provide a mapping from the range of data to the interval from zero to one, so that we can visualize how any particular amount of soil moisture or precipitation compares to the full distribution of values for each pixel. Essentially, when the empirical CDF gives a value close to 0, that’s a low value (a small fraction of values are less than or equal to it). If the empirical CDF gives a value close to one, that’s a high value (a large fraction of values are less than or equal to it).
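A toy example (with made-up numbers) may make the behavior of ecdf() concrete before we map it:

```r
vals <- c(2, 5, 5, 9)
f <- ecdf(vals)   # ecdf() returns a step function

# Fraction of observations less than or equal to each query value
f(2)   # 0.25: one of the four values is <= 2
f(5)   # 0.75: three of the four values are <= 5
f(9)   # 1:    every value is <= 9
```

This is exactly the per-pixel transformation applied above: each soil moisture or precipitation value is replaced by its rank within that pixel's own history.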

Let’s see what the empirical CDF values look like on a map:

sm_ecdf %>%
  filter(date > as.Date('2018-01-01')) %>%
  ggplot(aes(x = x, y = y, fill = ecdf)) +
  geom_raster() +
  scale_fill_viridis_c('Soil moisture\nempirical CDF',
                       option = 'B', direction = -1) +
  facet_wrap(~date, nrow = 7) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  geom_sf(data = study_area_sf, inherit.aes = FALSE, fill = NA, size = .2) +
  xlab('') +
  ylab('')


precip %>%
  group_by(x, y) %>%
  mutate(ecdf = ecdf(pr)(pr)) %>%
  filter(date > as.Date('2018-01-01')) %>%
  ggplot(aes(x = x, y = y, fill = ecdf)) +
  geom_raster() +
  scale_fill_viridis_c('Precipitation\nempirical CDF',
                       option = 'B',
                       direction = -1) +
  facet_wrap(~date, nrow = 7) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  geom_sf(data = study_area_sf, inherit.aes = FALSE, fill = NA, size = .2) +
  xlab('') +
  ylab('')


Across much of Somalia and Kenya, soils were exceptionally dry prior to March 2018 (light colors), and with the prolonged heavy rains beginning in March, soils were much more saturated than usual through mid to late May (dark colors).
The rain throughout April fell on wet soils, which are less able to absorb water, leading to overland flow and flooding.

We can put this all together in a visualization that shows the evolution of soil moisture and precipitation:

Here the soil moisture and precipitation values are mapped over time, with line plots below showing the median value (black line), the interquartile range (dark grey ribbon), and the 10% and 90% quantiles (light grey ribbon).
The exceptional nature of the flooding in 2018 shows up as a high and broad peak in soil moisture and precipitation, and the drought that preceded the flooding in 2017 is visible as a period of low rains and dry soil.

### Want to contribute?

If you’re psyched on global soil moisture and want to contribute to smapr, there is room to develop support for more level 2 products.
If you’re not familiar with the different levels of NASA data products: level 0 is the raw instrument data; level 1 is a bit more processed, e.g., the raw data with ancillary information; level 2 has derived geophysical variables; level 3 has these variables mapped on a uniform grid; and level 4 data consist of model output and derived variables.
Currently, smapr primarily supports the more processed level 3 science grade and level 4 enhanced value products.
For a starting point, check out: https://github.com/ropensci/smapr/issues/35

### Acknowledgements

The smapr package was developed in Earth Lab with help from Matt Oakley who worked with the Earth Lab Analytics Hub.
The idea for the package emerged from a NOAA Data Partnership event in 2016, where we began working with the NASA SMAP data and realized that we had all the makings for an R package.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### simstudy update: improved correlated binary outcomes

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

An updated version of the simstudy package (0.1.10) is now available on CRAN. The impetus for this release was a series of requests about generating correlated binary outcomes. In the last post, I described a beta-binomial data generating process that uses the recently added beta distribution. In addition to that update, I’ve added functionality to genCorGen and addCorGen, functions that generate correlated data from Gaussian and non-Gaussian distributions such as the Poisson, gamma, and binary distributions. Most significantly, there is a newly implemented algorithm based on the work of Emrich & Piedmonte, which I mentioned the last time around.

### Limitation of copula algorithm

The existing copula algorithm is limited when generating correlated binary data. (I did acknowledge this when I first introduced the new functions.) The generated marginal means are what we would expect, though the observed correlation on the binary scale is biased downwards towards zero. Using the copula algorithm, the specified correlation really pertains to the underlying normal data used in the data generation process; information is lost when moving from the continuous to the dichotomous distribution:

library(simstudy)

set.seed(736258)
d1 <- genCorGen(n = 1000, nvars = 4, params1 = c(0.2, 0.5, 0.6, 0.7),
                dist = "binary", rho = 0.3, corstr = "cs", wide = TRUE,
                method = "copula")

d1
##         id V1 V2 V3 V4
##    1:    1  0  0  0  0
##    2:    2  0  1  1  1
##    3:    3  0  1  0  1
##    4:    4  0  0  1  0
##    5:    5  0  1  0  1
##   ---
##  996:  996  0  0  0  0
##  997:  997  0  1  0  0
##  998:  998  0  1  1  1
##  999:  999  0  0  0  0
## 1000: 1000  0  0  0  0
d1[, .(V1 = mean(V1), V2 = mean(V2),
       V3 = mean(V3), V4 = mean(V4))]
##       V1    V2    V3    V4
## 1: 0.184 0.486 0.595 0.704
d1[, round(cor(cbind(V1, V2, V3, V4)), 2)]
##      V1   V2   V3   V4
## V1 1.00 0.18 0.17 0.17
## V2 0.18 1.00 0.19 0.23
## V3 0.17 0.19 1.00 0.15
## V4 0.17 0.23 0.15 1.00
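
This attenuation is easy to reproduce outside of simstudy with a bare-bones sketch of the copula idea (my own toy code, thresholding correlated normals; not simstudy's internals):

```r
set.seed(1)
n <- 1e5
rho <- 0.3

# draw a pair of standard normals with correlation rho
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# dichotomize to match marginal means of 0.2 and 0.5
b1 <- as.integer(z1 < qnorm(0.2))
b2 <- as.integer(z2 < qnorm(0.5))

cor(z1, z2)  # close to the specified 0.3
cor(b1, b2)  # noticeably smaller: information is lost in dichotomizing
```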

### The ep option offers an improvement

Data generated using the Emrich & Piedmonte algorithm, done by specifying the “ep” method, does much better; the observed correlation is much closer to what we specified. (Note that the E&P algorithm may restrict the range of possible correlations; if you specify a correlation outside of the range, an error message is issued.)

set.seed(736258)
d2 <- genCorGen(n = 1000, nvars = 4, params1 = c(0.2, 0.5, 0.6, 0.7),
                dist = "binary", rho = 0.3, corstr = "cs", wide = TRUE,
                method = "ep")

d2[, .(V1 = mean(V1), V2 = mean(V2),
       V3 = mean(V3), V4 = mean(V4))]
##       V1    V2    V3    V4
## 1: 0.199 0.504 0.611 0.706
d2[, round(cor(cbind(V1, V2, V3, V4)), 2)]
##      V1   V2   V3   V4
## V1 1.00 0.33 0.33 0.29
## V2 0.33 1.00 0.32 0.31
## V3 0.33 0.32 1.00 0.28
## V4 0.29 0.31 0.28 1.00
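
As an aside on those range restrictions: for a pair of binary variables with means p1 and p2, the attainable correlation is limited by the Fréchet bounds on the joint probability. A quick sketch of the arithmetic (my own helper function, not part of simstudy):

```r
# correlation = (p11 - p1 * p2) / sqrt(p1 * (1 - p1) * p2 * (1 - p2)),
# where p11 = P(X1 = 1, X2 = 1) is constrained to
# [max(0, p1 + p2 - 1), min(p1, p2)]
binary_cor_bounds <- function(p1, p2) {
  denom <- sqrt(p1 * (1 - p1) * p2 * (1 - p2))
  c(lower = (max(0, p1 + p2 - 1) - p1 * p2) / denom,
    upper = (min(p1, p2) - p1 * p2) / denom)
}

round(binary_cor_bounds(0.2, 0.7), 2)  # lower -0.76, upper 0.33:
                                       # rho = 0.3 is attainable here
```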

If we generate the data using the “long” form, we can fit a GEE marginal model to recover the parameters used in the data generation process:

library(geepack)

set.seed(736258)
d3 <- genCorGen(n = 1000, nvars = 4, params1 = c(0.2, 0.5, 0.6, 0.7),
                dist = "binary", rho = 0.3, corstr = "cs", wide = FALSE,
                method = "ep")

geefit3 <- geeglm(X ~ factor(period), id = id, data = d3,
                  family = binomial, corstr = "exchangeable")

summary(geefit3)
##
## Call:
## geeglm(formula = X ~ factor(period), family = binomial, data = d3,
##     id = id, corstr = "exchangeable")
##
##  Coefficients:
##                 Estimate  Std.err  Wald Pr(>|W|)
## (Intercept)     -1.39256  0.07921 309.1   <2e-16 ***
## factor(period)1  1.40856  0.08352 284.4   <2e-16 ***
## factor(period)2  1.84407  0.08415 480.3   <2e-16 ***
## factor(period)3  2.26859  0.08864 655.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)        1 0.01708
##
## Correlation: Structure = exchangeable  Link = identity
##
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha   0.3114 0.01855
## Number of clusters:   1000   Maximum cluster size: 4

And the point estimates for each variable on the probability scale:

round(1/(1+exp(1.3926 - c(0, 1.4086, 1.8441, 2.2686))), 2)
## [1] 0.20 0.50 0.61 0.71

### Longitudinal (repeated) measures

One researcher wanted to generate individual-level longitudinal data that might be analyzed using a GEE model. This is not so different from what I just did, but incorporates a specific time trend to define the probabilities. In this case, the steps are to (1) generate longitudinal data using the addPeriods function, (2) define the longitudinal probabilities, and (3) generate correlated binary outcomes with an AR-1 correlation structure.

set.seed(393821)
probform <- "-2 + 0.3 * period"

def1 <- defDataAdd(varname = "p", formula = probform,
                   dist = "nonrandom", link = "logit")

dx <- genData(1000)
dx <- addPeriods(dx, nPeriods = 4)

dg <- addCorGen(dx, nvars = 4,
                corMatrix = NULL, rho = .4, corstr = "ar1",
                dist = "binary", param1 = "p",
                method = "ep", formSpec = probform,
                periodvar = "period")

The correlation matrix from the observed data is reasonably close to having an AR-1 structure, where $$\rho = 0.4$$, $$\rho^2 = 0.16$$, and $$\rho^3 = 0.064$$.

cor(dcast(dg, id ~ period, value.var = "X")[,-1])
##         0      1      2       3
## 0 1.00000 0.4309 0.1762 0.04057
## 1 0.43091 1.0000 0.3953 0.14089
## 2 0.17618 0.3953 1.0000 0.36900
## 3 0.04057 0.1409 0.3690 1.00000
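
For reference, the exact AR-1 matrix implied by the specified correlation can be constructed directly:

```r
rho <- 0.4

# entry (i, j) of an AR-1 correlation matrix is rho^|i - j|
ar1 <- outer(0:3, 0:3, function(i, j) rho^abs(i - j))
round(ar1, 3)  # 1 on the diagonal, then 0.4, 0.16, 0.064 moving away from it
```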

And again, the model recovers the time trend parameter defined in variable probform as well as the correlation parameter:

geefit <- geeglm(X ~ period, id = id, data = dg, corstr = "ar1",
                 family = binomial)
summary(geefit)
##
## Call:
## geeglm(formula = X ~ period, family = binomial, data = dg, id = id,
##     corstr = "ar1")
##
##  Coefficients:
##             Estimate Std.err  Wald Pr(>|W|)
## (Intercept)  -1.9598  0.0891 484.0   <2e-16 ***
## period        0.3218  0.0383  70.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)        1  0.0621
##
## Correlation: Structure = ar1  Link = identity
##
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.397  0.0354
## Number of clusters:   1000   Maximum cluster size: 4

### Model mis-specification

And just for fun, here is an example of how simulation might be used to investigate the performance of a model. Let’s say we are interested in the implications of mis-specifying the correlation structure. In this case, we can fit two GEE models (one correctly specified and one mis-specified) and assess the sampling properties of the estimates from each:

library(broom)

dx <- genData(100)
dx <- addPeriods(dx, nPeriods = 4)

iter <- 1000
rescorrect <- vector("list", iter)
resmisspec <- vector("list", iter)

for (i in 1:iter) {

  dw <- addCorGen(dx, nvars = 4,
                  corMatrix = NULL, rho = .5, corstr = "ar1",
                  dist = "binary", param1 = "p",
                  method = "ep", formSpec = probform,
                  periodvar = "period")

  correctfit <- geeglm(X ~ period, id = id, data = dw,
                       corstr = "ar1", family = binomial)

  misfit <- geeglm(X ~ period, id = id, data = dw,
                   corstr = "independence", family = binomial)

  rescorrect[[i]] <- data.table(i, tidy(correctfit))
  resmisspec[[i]] <- data.table(i, tidy(misfit))
}

rescorrect <-
  rbindlist(rescorrect)[term == "period"][, model := "correct"]

resmisspec <-
  rbindlist(resmisspec)[term == "period"][, model := "misspec"]

Here are the average, standard deviation, and average standard error of the point estimates under the correct specification:

rescorrect[, c(mean(estimate), sd(estimate), mean(std.error))]
## [1] 0.304 0.125 0.119

And for the incorrect specification:

resmisspec[, c(mean(estimate), sd(estimate), mean(std.error))]
## [1] 0.303 0.126 0.121

The estimates of the time trend from both models are unbiased, and the observed standard errors of the estimates are essentially the same for each model, which in turn are not far off from the average estimated standard errors. This becomes quite clear when we look at the virtually identical densities of the estimates:

As an added bonus, here is a conditional generalized linear mixed effects model fit to the larger data set generated earlier. The conditional estimates are quite different from the marginal GEE estimates, but this is not surprising given the binary outcomes. (For comparison, the marginal model estimated the period coefficient to be 0.32.)

library(lme4)

glmerfit <- glmer(X ~ period + (1 | id), data = dg, family = binomial)
summary(glmerfit)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: X ~ period + (1 | id)
##    Data: dg
##
##      AIC      BIC   logLik deviance df.resid
##     3595     3614    -1795     3589     3997
##
## Scaled residuals:
##    Min     1Q Median     3Q    Max
## -1.437 -0.351 -0.284 -0.185  2.945
##
## Random effects:
##  Groups Name        Variance Std.Dev.
##  id     (Intercept) 2.38     1.54
## Number of obs: 4000, groups:  id, 1000
##
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -2.7338     0.1259   -21.7   <2e-16 ***
## period        0.4257     0.0439     9.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
##        (Intr)
## period -0.700
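
As a rough cross-check (this uses Zeger, Liang & Albert's attenuation approximation, which is not part of the post's code), a conditional logit coefficient can be deflated to an approximate marginal coefficient using the random intercept variance:

```r
sigma2 <- 2.38       # random intercept variance from the glmer fit above
beta_cond <- 0.4257  # conditional estimate of the period coefficient

# approximate marginal coefficient: beta_cond / sqrt(1 + 0.346 * sigma2)
beta_cond / sqrt(1 + 0.346 * sigma2)  # roughly 0.32, near the GEE estimate
```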


## September 24, 2018

### Whats new on arXiv

This paper argues for the need for research to realize uncertainty-aware artificial intelligence and machine learning (AI & ML) systems for decision support by describing a number of motivating scenarios. Furthermore, the paper defines uncertainty-awareness and lays out the challenges along with surveying some promising research directions. A theoretical demonstration illustrates how two emerging uncertainty-aware ML and AI technologies could be integrated and be of value for a route planning operation.
Mediation analysis aims at disentangling the effects of a treatment on an outcome through alternative causal mechanisms and has become a popular practice in biomedical and social science applications. The causal framework based on counterfactuals is currently the standard approach to mediation, with important methodological advances introduced in the literature in the last decade, especially for simple mediation, that is with one mediator at a time. Among a variety of alternative approaches, K. Imai et al. showed theoretical results and developed an R package to deal with simple mediation as well as with multiple mediation involving multiple mediators conditionally independent given the treatment and baseline covariates. This approach does not allow one to consider the often encountered situation in which an unobserved common cause induces a spurious correlation between the mediators. In this context, which we refer to as mediation with uncausally related mediators, we show that, under appropriate hypotheses, the natural direct and indirect effects are non-parametrically identifiable. These results are promptly translated into unbiased estimators using the same quasi-Bayesian algorithm developed by Imai et al. We validate our method with an original simulation study. As an illustration, we apply our method to a real data set from a large cohort to assess the effect of hormone replacement treatment on breast cancer risk through three mediators, namely dense mammographic area, nondense area and body mass index.
Artificial Intelligence (AI) approaches to problem-solving and decision-making are becoming more and more complex, leading to a decrease in the understandability of solutions. The European Union’s new General Data Protection Regulation tries to tackle this problem by stipulating a ‘right to explanation’ for decisions made by AI systems. One of the AI paradigms that may be affected by this new regulation is Answer Set Programming (ASP). Thanks to the emergence of efficient solvers, ASP has recently been used for problem-solving in a variety of domains, including medicine, cryptography, and biology. To ensure the successful application of ASP as a problem-solving paradigm in the future, explanations of ASP solutions are crucial. In this survey, we give an overview of approaches that provide an answer to the question of why an answer set is a solution to a given problem, notably off-line justifications, causal graphs, argumentative explanations and why-not provenance, and highlight their similarities and differences. Moreover, we review methods explaining why a set of literals is not an answer set or why no solution exists at all.
Simulation plays an essential role in comprehending a target system in many fields of social and industrial sciences. A major task in simulation is the estimation of parameters, and the optimal parameters to express the observed data need to directly elucidate the properties of the target system, as the design of the simulator is based on the expert’s domain knowledge. However, skilled human experts struggle to find the desired parameters. Data assimilation therefore becomes an unavoidable task in simulator design to reduce the cost of simulator optimization. Another necessary task is extrapolation; in many practical cases, the prediction based on simulation results will often be outside of the dominant range of the given data area, and this is referred to as the covariate shift. This paper focuses on the regression problem with the covariate shift. While parameter estimation for the covariate shift has been studied thoroughly in parametric and nonparametric settings, conventional statistical methods of parameter searching are not applicable in the data assimilation of the simulation owing to the properties of the likelihood function, which may be intractable or nondifferentiable. To address these problems, we propose a novel framework of Bayesian inference based on kernel mean embedding that comprises an extended kernel approximate Bayesian computation (ABC) of the importance weighted regression, kernel herding, and the kernel sum rule. This framework makes the prediction available in covariate shift situations, and its effectiveness is evaluated in both synthetic numerical experiments and a widely used production simulator.
Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the programming language of source code files. However, determining the programming language of a code snippet or a few lines of source code is still a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that can identify the programming language of code snippets written in 21 different programming languages. A Multinomial Naive Bayes (MNB) classifier is employed which is trained using Stack Overflow posts. It is shown to achieve an accuracy of 75%, which is higher than that of Programming Languages Identification (PLI, a proprietary online classifier of snippets), whose accuracy is only 55.5%. The average precision, recall, and F1 scores with the proposed tool are 0.76, 0.75, and 0.75, respectively. In addition, it can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version such as C# 3.0, C# 4.0 and C# 5.0.
Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted in Stack Overflow usually contain a code snippet. Stack Overflow relies on users to properly tag the programming language of a question and it simply assumes that the programming language of the snippets inside a question is the same as the tag of the question itself. In this paper, we propose a classifier to predict the programming language of questions posted in Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). The classifier achieves an accuracy of 91.1% in predicting the 24 most popular programming languages by combining features from the title, body and the code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 81.1%. Finally, we propose a classifier of code snippets only that achieves an accuracy of 77.7%. These results show that deploying Machine Learning techniques on the combination of text and the code snippets of a question provides the best performance. These results demonstrate also that it is possible to identify the programming language of a snippet of few lines of source code. We visualize the feature space of two programming languages Java and SQL in order to identify some special properties of information inside the questions in Stack Overflow corresponding to these languages.
Most existing recommender systems leverage data from only one type of user behavior, such as the purchase behavior in E-commerce that is directly related to the business KPI (Key Performance Indicator) of conversion rate. Besides the key behavioral data, we argue that other forms of user behaviors also provide valuable signal on a user’s preference, such as views, clicks, adding a product to shop carts and so on. They should be taken into account properly to provide quality recommendations for users. In this work, we contribute a novel solution named NMTR (short for Neural Multi-Task Recommendation) for learning recommender systems from multiple types of user behaviors. We develop a neural network model to capture the complicated and multi-type interactions between users and items. In particular, our model accounts for the cascading relationship among behaviors (e.g., a user must click on a product before purchasing it). To fully exploit the signal in the data of multiple types of behaviors, we perform a joint optimization based on the multi-task learning framework, where the optimization on a behavior is treated as a task. Extensive experiments on two real-world datasets demonstrate that NMTR significantly outperforms state-of-the-art recommender systems that are designed to learn from both single-behavior data and multi-behavior data. Further analysis shows that modeling multiple behaviors is particularly useful for providing recommendations for sparse users that have very few interactions.
Recently, along with the rapid development of mobile communication technology, edge computing theory and techniques have been attracting more and more attention from global researchers and engineers, as they can significantly bridge the capacity of the cloud and the requirements of devices at the network edges, and thus can accelerate content delivery and improve the quality of mobile services. In order to bring more intelligence to the edge systems, compared to traditional optimization methodology, and driven by the current deep learning techniques, we propose to integrate the Deep Reinforcement Learning techniques and Federated Learning framework with the mobile edge systems, for optimizing the mobile edge computing, caching and communication. And thus, we design the ‘In-Edge AI’ framework in order to intelligently utilize the collaboration among devices and edge nodes to exchange the learning parameters for a better training and inference of the models, and thus to carry out dynamic system-level optimization and application-level enhancement while reducing the unnecessary system communication load. ‘In-Edge AI’ is evaluated and proved to have near-optimal performance but relatively low overhead of learning, while the system is cognitive and adaptive to the mobile communication systems. Finally, we discuss several related challenges and opportunities for unveiling a promising upcoming future of ‘In-Edge AI’.
Neural networks are increasingly deployed in real-world safety-critical domains such as autonomous driving, aircraft collision avoidance, and malware detection. However, these networks have been shown to often mispredict on inputs with minor adversarial or even accidental perturbations. Consequences of such errors can be disastrous and even potentially fatal as shown by the recent Tesla autopilot crash. Thus, there is an urgent need for formal analysis systems that can rigorously check neural networks for violations of different safety properties such as robustness against adversarial perturbations within a certain $L$-norm of a given image. An effective safety analysis system for a neural network must be able to either ensure that a safety property is satisfied by the network or find a counterexample, i.e., an input for which the network will violate the property. Unfortunately, most existing techniques for performing such analysis struggle to scale beyond very small networks and the ones that can scale to larger networks suffer from high false positives and cannot produce concrete counterexamples in case of a property violation. In this paper, we present a new efficient approach for rigorously checking different safety properties of neural networks that significantly outperforms existing approaches by multiple orders of magnitude. Our approach can check different safety properties and find concrete counterexamples for networks that are 10$\times$ larger than the ones supported by existing analysis techniques. We believe that our approach to estimating tight output bounds of a network for a given input range can also help improve the explainability of neural networks and guide the training process of more robust neural networks.
We propose SoaAlloc, a dynamic object allocator for Single-Method Multiple-Objects applications in CUDA. SoaAlloc is the first allocator for GPUs that (a) arranges allocations in a SIMD-friendly Structure of Arrays (SOA) data layout, (b) provides a do-all operation for maximizing the benefit of SOA, and (c) is on par with state-of-the-art memory allocators for raw (de)allocation time. Our benchmarks show that the SOA layout leads to significantly better memory bandwidth utilization, resulting in a 2x speedup of application code.
We present an analysis into the inner workings of Convolutional Neural Networks (CNNs) for processing text. CNNs used for computer vision can be interpreted by projecting filters into image space, but for discrete sequence inputs CNNs remain a mystery. We aim to understand the method by which the networks process and classify text. We examine common hypotheses to this problem: that filters, accompanied by global max-pooling, serve as ngram detectors. We show that filters may capture several different semantic classes of ngrams by using different activation patterns, and that global max-pooling induces behavior which separates important ngrams from the rest. Finally, we show practical use cases derived from our findings in the form of model interpretability (explaining a trained model by deriving a concrete identity for each filter, bridging the gap between visualization tools in vision tasks and NLP) and prediction interpretability (explaining predictions).
This paper describes how to carry out a feasibility study for a potential knowledge based system application. It discusses factors to be considered under three headings: the business case, the technical feasibility, and stakeholder issues. It concludes with a case study of a feasibility study for a KBS to guide surgeons in diagnosis and treatment of thyroid conditions.
While the machine learning literature dedicated to fully automated reasoning algorithms is abundant, the number of methods enabling the inference process on the basis of previously defined knowledge structures is far smaller. Fuzzy Cognitive Maps (FCMs) are neural networks that can be exploited towards this goal because of their flexibility to handle external knowledge. However, FCMs suffer from a number of issues that range from the limited prediction horizon to the absence of theoretically sound learning algorithms able to produce accurate predictions. In this paper, we propose a neural network system named Short-term Cognitive Networks that tackles some of these limitations. In our model, weights are not constrained and may or may not have a causal nature. As a second contribution, we present a nonsynaptic learning algorithm to improve the network performance without modifying the previously defined weights. Moreover, we derive a stop condition to prevent the learning algorithm from iterating without decreasing the simulation error.
In open set learning, a model must be able to generalize to novel classes when it encounters a sample that does not belong to any of the classes it has seen before. Open set learning poses a realistic learning scenario that is receiving growing attention. Existing studies on open set learning mainly focused on detecting novel classes, but few studies tried to model them for differentiating novel classes. We recognize that novel classes should be different from each other, and propose distribution networks for open set learning that can learn and model different novel classes. We hypothesize that, through a certain mapping, samples from different classes with the same classification criterion should follow different probability distributions from the same distribution family. We estimate the probability distribution for each known class and a novel class is detected when a sample is not likely to belong to any of the known distributions. Due to the large feature dimension in the original feature space, the probability distributions in the original feature space are difficult to estimate. Distribution networks map the samples in the original feature space to a latent space where the distributions of known classes can be jointly learned with the network. In the latent space, we also propose a distribution parameter transfer strategy for novel class detection and modeling. By novel class modeling, the detected novel classes can serve as known classes to the subsequent classification. Our experimental results on image datasets MNIST and CIFAR10 and text dataset Ohsumed show that the distribution networks can detect novel classes accurately and model them well for the subsequent classification tasks.
The prosperity of smart mobile devices has made mobile crowdsensing (MCS) a promising paradigm for completing complex sensing and computation tasks. In the past, great efforts have been made on the design of incentive mechanisms and task allocation strategies from the MCS platform’s perspective to motivate mobile users’ participation. However, in practice, MCS participants face many uncertainties coming from their sensing environment as well as other participants’ strategies, and how they interact with each other and make sensing decisions is not well understood. In this paper, we take MCS participants’ perspective to derive an online sensing policy to maximize their payoffs via MCS participation. Specifically, we model the interactions of mobile users and sensing environments as a multi-agent Markov decision process. Each participant cannot observe others’ decisions, but needs to decide its effort level in sensing tasks only based on local information, e.g., its own record of sensed signals’ quality. To cope with the stochastic sensing environment, we develop an intelligent crowdsensing algorithm IntelligentCrowd by leveraging the power of multi-agent reinforcement learning (MARL). Our algorithm leads to the optimal sensing policy for each user to maximize the expected payoff against stochastic sensing environments, and can be implemented at the individual participant’s level in a distributed fashion. Numerical simulations demonstrate that IntelligentCrowd significantly improves users’ payoffs in sequential MCS tasks under various sensing dynamics.
As Artificial Intelligence (AI) technologies proliferate, concern has centered around the long-term dangers of job loss or threats of machines causing harm to humans. All of this concern, however, detracts from the more pertinent and already existing threats posed by AI today: its ability to amplify bias found in training datasets, and swiftly impact marginalized populations at scale. Government and public sector institutions have a responsibility to citizens to establish a dialogue with technology developers and release thoughtful policy around data standards to ensure diverse representation in datasets to prevent bias amplification and ensure that AI systems are built with inclusion in mind.
In this paper, we present LaSOT, a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT the largest, to the best of our knowledge, densely annotated tracking benchmark. The average sequence length of LaSOT is more than 2,500 frames, and each sequence comprises various challenges deriving from the wild where target objects may disappear and re-appear in the view. By releasing LaSOT, we expect to provide the community with a large-scale, high-quality dedicated benchmark for both the training of deep trackers and the veritable evaluation of tracking algorithms. Moreover, considering the close connections of visual appearance and natural language, we enrich LaSOT by providing additional language specification, aiming at encouraging the exploration of natural linguistic features for tracking. A thorough experimental evaluation of 35 tracking algorithms on LaSOT is presented with detailed analysis, and the results demonstrate that there is still significant room for improvement. The benchmark and evaluation results are made publicly available at https://…/.
Bayesian model-based clustering is a widely applied procedure for discovering groups of related observations in a dataset. These approaches use Bayesian mixture models, estimated with MCMC, which provide posterior samples of the model parameters and clustering partition. While inference on model parameters is well established, inference on the clustering partition is less developed. A new method is developed for estimating the optimal partition from the pairwise posterior similarity matrix generated by a Bayesian cluster model. This approach uses non-negative matrix factorization (NMF) to provide a low-rank approximation to the similarity matrix. The factorization permits hard or soft partitions and is shown to perform better than several popular alternatives under a variety of penalty functions.
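The NMF step is easy to prototype. Below is an illustrative sketch (not the paper's algorithm) that factorizes a toy pairwise similarity matrix with symmetric multiplicative updates and reads off both soft and hard partitions; the matrix, function name, and all parameters are made up for demonstration:

```python
import numpy as np

def nmf_partition(S, k, n_iter=500, seed=0):
    """Rank-k symmetric NMF of a pairwise posterior similarity matrix S
    (n x n, entries in [0, 1]): S ~ W @ W.T with W >= 0. Rows of W,
    normalized, give soft memberships; argmax gives a hard partition."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(S.shape[0], k))
    for _ in range(n_iter):
        # Damped multiplicative update for symmetric NMF
        W *= 0.5 + 0.5 * (S @ W) / np.maximum(W @ (W.T @ W), 1e-12)
    soft = W / W.sum(axis=1, keepdims=True)
    return soft, soft.argmax(axis=1)

# Toy similarity matrix: observations 0-2 co-cluster often, as do 3-4.
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.9, 0.2, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
])
soft, hard = nmf_partition(S, k=2)
```

On a similarity matrix with clear block structure like this one, the two NMF factors recover the two blocks, while the soft memberships retain the uncertainty that a hard partition discards.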
This document introduces a strategy to solve linear optimization problems. The strategy is based on the bounding condition each constraint produces on each of the problem’s dimensions. The solution of a linear optimization problem is located at the intersection of the constraints defining the extreme vertex. By identifying the constraints that limit the growth of the objective function value, we formulate a system of linear equations leading to the optimization problem’s solution. The most complex operation of the algorithm is the inversion of a matrix sized by the number of dimensions of the problem. Therefore, the algorithm’s complexity is comparable to that of the classical Simplex method and of more recently developed linear programming algorithms. However, the algorithm offers the advantage of being non-iterative.
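To make the "solution = intersection of binding constraints" idea concrete, here is a hedged numpy sketch on a tiny made-up LP: it solves the d x d linear system for each subset of d constraints and keeps the best feasible vertex. Note the paper's contribution is identifying the binding constraints directly; this brute-force enumeration only illustrates the geometric principle.

```python
import numpy as np
from itertools import combinations

# Maximize c^T x subject to A x <= b (non-negativity folded into A, b).
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0],    # x + y <= 4
              [1.0, 0.0],    # x     <= 2
              [0.0, 1.0],    # y     <= 3
              [-1.0, 0.0],   # x     >= 0
              [0.0, -1.0]])  # y     >= 0
b = np.array([4.0, 2.0, 3.0, 0.0, 0.0])

best_x, best_val = None, -np.inf
for rows in combinations(range(len(b)), 2):   # every pair of constraints
    M = A[list(rows)]
    if abs(np.linalg.det(M)) < 1e-12:
        continue                              # constraints not independent
    x = np.linalg.solve(M, b[list(rows)])     # the d x d solve from the text
    if np.all(A @ x <= b + 1e-9):             # the vertex must be feasible
        val = c @ x
        if val > best_val:
            best_x, best_val = x, val
```

Here the optimum lands at the intersection of `x + y <= 4` and `x <= 2`: once those two binding constraints are known, a single 2x2 solve yields the answer, which is the non-iterative shortcut the document describes.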
This paper presents a novel approach for automatic rule learning applicable to an autonomous driving system using real driving data.
Background: Finding biomedical named entities is one of the most essential tasks in biomedical text mining. Recently, deep learning-based approaches have been applied to biomedical named entity recognition (BioNER) and showed promising results. However, as deep learning approaches need an abundant amount of training data, a lack of data can hinder performance. BioNER datasets are scarce resources and each dataset covers only a small subset of entity types. Furthermore, many biomedical entities are polysemous, which is one of the major obstacles in named entity recognition. Results: To address the lack of data and the entity type misclassification problem, we propose CollaboNet, which utilizes a combination of multiple NER models. In CollaboNet, models trained on different datasets are connected to each other so that a target model obtains information from other collaborator models to reduce false positives. Every model is an expert on its target entity type and takes turns serving as a target and a collaborator model during training time. The experimental results show that CollaboNet can be used to greatly reduce the number of false positives and misclassified entities, including polysemous words. CollaboNet achieved state-of-the-art performance in terms of precision, recall and F1 score. Conclusions: We demonstrated the benefits of combining multiple models for BioNER. Our model has successfully reduced the number of misclassified entities and improved the performance by leveraging multiple datasets annotated for different entity types. Given the state-of-the-art performance of our model, we believe that CollaboNet can improve the accuracy of downstream biomedical text mining applications such as bio-entity relation extraction.
Machine learning methods such as convolutional neural networks (CNNs) are becoming an integral part of scientific research in many disciplines, yet spatial vector data often cannot be analyzed with these powerful learning methods because of their irregular structure. With the aid of the graph Fourier transform and the convolution theorem, it is possible to express the convolution as a point-wise product in the Fourier domain and construct a CNN learning architecture on graphs for the analysis of irregular spatial data. In this study, we used the classification of building patterns as a case study to test this method; experiments showed that it achieves outstanding results in identifying regular and irregular patterns, and significantly outperforms other methods.
Existing dialog datasets contain a sequence of utterances and responses without any explicit background knowledge associated with them. This has resulted in the development of models which treat conversation as a sequence-to-sequence generation task (i.e., given a sequence of utterances, generate the response sequence). This is not only an overly simplistic view of conversation but it is also emphatically different from the way humans converse by heavily relying on their background knowledge about the topic (as opposed to simply relying on the previous sequence of utterances). For example, it is common for humans to (involuntarily) produce utterances which are copied or suitably modified from background articles they have read about the topic. To facilitate the development of such natural conversation models which mimic the human process of conversing, we create a new dataset containing movie chats wherein each response is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments and reviews about the movie. We establish baseline results on this dataset (90K utterances from 9K conversations) using three different models: (i) pure generation based models which ignore the background knowledge, (ii) generation based models which learn to copy information from the background knowledge when required, and (iii) span prediction based models which predict the appropriate response span in the background knowledge.

### September #SWDchallenge recap: MAKEOVER edition

This month, we challenged you to remake a dual pie chart and a record number of you delivered—96 submissions to be exact!

With everyone starting from the same visual, it’s fascinating to see the diversity in your makeovers—from chart types to design choices to points of emphasis. This reinforces one underlying theme of storytelling with data’s guiding principles: there is no single “right” answer and two people can approach the same dataviz challenge in completely different ways. We believe there’s room for diversity of thought and creativity in this space, as long as you use that creative license appropriately. Ultimately, we should always strive to design with the audience in mind—their needs trump all else!

Nearly 50% of makeovers included slopegraphs. This can be an appropriate choice when you want the audience to focus on the change or difference between two data points. We’ve expressed our affinity for the intuitive nature of slopegraphs many times, including examples of employee feedback, market share and your own topics from the June challenge. Within the many slopegraphs, it’s neat to see the wide variety of design choices and points of emphasis. Many focused on the decreases (Dennis utilized a clean design with red to focus attention on Europe, Paul emphasized all decreases with red, Colin chose blue to focus attention on Europe), while others called out the relative increase in Asia Pacific (Clayton made good use of font size and color, Steve included a takeaway and call to action in his title, Joost’s decluttered visual allows for easy interpretation). Others emphasized both increase and decrease (Aleksandra’s, Joey’s and Kristina’s thoughtful choice of two different colors tied together with annotations).
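For anyone building their own, a slopegraph of this shape is only a few lines of matplotlib; a minimal sketch follows, where the regional shares are made-up placeholders rather than the actual challenge data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical regional shares (% of total tourism), for illustration only;
# these are NOT the real challenge numbers.
shares = {
    "Asia Pacific":  (25, 32),
    "Europe":        (35, 28),
    "North America": (22, 19),
    "Latin America": (9, 9),
    "Middle East":   (4, 5),
    "Other":         (5, 7),
}

fig, ax = plt.subplots(figsize=(4, 5))
for region, (y2000, y2016) in shares.items():
    focus = region == "Asia Pacific"          # emphasize one line, mute the rest
    ax.plot([0, 1], [y2000, y2016], marker="o",
            color="#1f77b4" if focus else "#bbbbbb",
            linewidth=2.5 if focus else 1.5)
    ax.text(-0.05, y2000, f"{region}  {y2000}%", ha="right", va="center")
    ax.text(1.05, y2016, f"{y2016}%", ha="left", va="center")

ax.set_xticks([0, 1])
ax.set_xticklabels(["2000", "2016"])
ax.set_yticks([])                             # declutter: labels carry the values
for side in ("top", "right", "left"):
    ax.spines[side].set_visible(False)
fig.savefig("slopegraph.png", bbox_inches="tight")
```

The color choice mirrors the submissions above: a single saturated line to focus attention, everything else in muted gray, with direct labels replacing an axis and legend.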

While slopegraphs have a visually intuitive benefit, one downside is that some audiences may be less familiar with how to interpret one, potentially a reason against using one. From that standpoint, check out how the other 50% of submissions spanned dot plots, rocket charts, a waffle chart, a circular flow chart, and many variations of bars (standard bars, diverging bars, bullet bars, and 100% stacked bars). Allan made a nice choice of diverging bars, calling attention to the latest data points with darker blue. Also don’t miss Kat’s submission, including a link to a video demonstrating how she’d progress through a story.

A number of readers pointed out that there’s some broader context missing in this data: the absolute numbers behind the percentages (which represent regional % of total tourism). In the absence of this, directly comparing percentages in 2000 vs. 2016 could lead to false conclusions about travel increasing or decreasing on an absolute scale. Check out Dan Z’s submission where he articulates this caveat well. This is a good reminder to always consider what additional context your audience may need to draw a meaningful conclusion.
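The caveat is easy to see with made-up numbers: a region's share can fall even while its absolute value grows, whenever the total grows faster.

```python
# Made-up totals to illustrate the caveat (not the real figures):
# Europe's SHARE falls from 35% to 28%, yet its ABSOLUTE value still
# grows, because the total market more than doubled.
total_2000, total_2016 = 3.0, 7.0                  # hypothetical totals
europe_share_2000, europe_share_2016 = 0.35, 0.28

europe_abs_2000 = europe_share_2000 * total_2000   # ~1.05
europe_abs_2016 = europe_share_2016 * total_2016   # ~1.96

share_fell = europe_share_2016 < europe_share_2000
absolute_grew = europe_abs_2016 > europe_abs_2000
# Both are True: comparing shares alone would suggest decline.
```

This is exactly why pairing share charts with the underlying totals (or at least noting them) protects the audience from a false conclusion.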

To everyone who submitted examples: THANK YOU for taking the time to create and share your work! The makeovers are posted below in reverse alphabetical order (to give some visibility to those at the end of the alphabet). If you tweeted or thought you submitted one but don't see it here, email your submission (including your graph attached as .png) to SWDchallenge@storytellingwithdata.com and we'll work to include any late entries this week (just a reminder that tweeting on its own isn't enough—we unfortunately don't have time to scrape Twitter for entries, so emailing is the sure way to get your creations included).

The next monthly challenge will be announced on October 1 and will run through midnight PST on October 8th. Until then, check out the #SWDchallenge page for the archives of previous months' challenges and submissions.

As a reminder, here’s where we started with the dual pie charts. Scroll down to see the many makeovers. We hope you enjoy perusing this month’s creations!

### Zunaira

I chose a simple layout with minimum colour. I tried to use colour to highlight the message that I wanted the end user to see. Also added directionality in the chart to highlight increase and decrease in percentage value.

### Zach

Here is a link to the Tableau Public post. My goal was to clearly show the growth and rankings of the regions over the given time frame. I did this by drawing arrows on a graph starting where the region ranked in 2000, ending where the region ranked in 2016, and ordered by 2016 rank. I also wanted to make the viz visually appealing, so I used regional colors to pop and emphasize the text, along with a simple plane background that I created in Powerpoint.

### Vidya

The takeaway message in the original is that Asia Pacific is the largest travel market in 2016. The title of the chart makes this clear. In the chart though, Asia Pacific is lost in the sea of colours. I was torn between a slopegraph and a back-to-back bar chart for this makeover. The slopegraph shows the trend very clearly but it doesn't make it clear that Asia Pacific ate into Europe and North America's share. The part to whole relationship is lost in a slopegraph. In a back-to-back bar chart, the bar lengths make it evident who the largest contributors for each year are and the part to whole relationship is intact. But Asia Pacific's increase in travel market between the years does not stand out as clearly. It came down to the key message I wanted to communicate with this makeover. I decided the key message I wanted to highlight is Asia Pacific has the largest share of travel market in 2016. The growth in Asia Pacific travel market from 2000 to 2016 is my secondary message.
Blog

### Valentina

In the first graph my eyes went to the most colorful slices of the pies, so I decided to highlight with colors the region that grew most between 2000 and 2016. I chose a different kind of graph to show the relationship between the two periods. It would be interesting to have more data about this process: for example, I'd like to know which parts of each continent are more visited. I've used Excel.

### Tinju

Simple bar charts grouped per region, designed in Tableau.

### Tim

I ended up just creating this by hand with one of our design tools. My first thought was to go with a slopegraph, but my concern was that the percentages didn't stand out enough, and, while the story seems to be about the change, I didn't think the relative sizes should be buried. I thus married a dot-based waffle chart to a panel bar arrangement to a slopegraph. The color strategy borrows from a piece I saw sometime back, and now can't find. NYT, I think. What am I still uncertain about? I couldn't figure out how to elegantly add actual numeric value labels to the dots. Some may miss them. I decided the relative number of dots was sufficient. I can also imagine some finding the dots more clutter than value (as opposed to just a bar, for example).

### Thys

Created in PowerBI.

### Susan

I got inspiration from @jimvansistine's entry; I thought his rocket lines worked well, so I riffed on that. The viz was created in Tableau.

### Steve

I wanted to keep the original message of the graph the same, while making it clearer to follow. So I changed to a slope chart to make the change over time and ranking particularly clear. I used colour sparingly and made it consistent across the commentary and the visualization, and made sure that wording was consistent throughout (I found the original use of both APAC and Asia Pacific slightly confusing). I've written this up in more detail on my blog and have shared the visualization on twitter.

### Stela

Since we are comparing multiple categories, I think using a bar chart is a better option. Also, since we are comparing how these categories have changed between two time periods, I decided to create a bidirectional horizontal bar chart. I think this chart type shows the two important takeaways from this data: the increase in travel to APAC, and the overall change in travel behavior (decreases in travel to Europe and North America).

### Solange

I decided a slopegraph would show the change between the years better, and it would also make it easier to understand which regions changed the most and their ranking. I used colour to focus on Asia Pacific, but made sure to label the other regions as well. Then I considered a possible context for this data: an airline looking for new routes? Or an investor who needs to decide where to invest next? I realized that either way, there was not enough information to recommend an action, but there was an opportunity to spark curiosity, which is why I opted for a question as the title. To recommend an action I would need to understand a bit more: not only the contribution of each region but also the absolute values. The contribution of some regions has decreased, but has the absolute value between those years increased? Has tourism in general increased as well?

### Shahnaz

When I first looked at the chart, my eyes were drawn straight to the labels outside the pie chart and to the year. Then I found the header misleading, as APAC became the largest travel market in 2016, not in 2000. When I look at the chart for 2000, my eyes are drawn straight to Europe (probably due to the dark color). For 2016, my eyes are drawn to Europe as well. Due to the light colour, APAC didn't stand out for me as the largest travel market. I would redo the graph as a bar or slope chart; in my makeover, I chose a slope chart. Also, the data is at a very high level. We can say that APAC became the largest travel market in 2016 but cannot say why. I wish I had the data at a more detailed level (e.g. which months people travel more, season, popular travel destinations…). I have used Tableau.

### Sashmir

When I was looking at the original image, my eyes were drawn to the two circles. After inspecting the shapes, my eyes went back up to the title, and it took a while for me to register what APAC meant. Once I understood that it was about the Asian continent, I inspected the shapes again and looked from left to right repeatedly to compare the values of the different continents. Given this experience, I found two issues with the image. First, the title could be improved by using Asia instead of APAC, since it is a more common term. Second, comparing the two pie charts forced my eyes to keep moving back and forth to compare the different slices. To address the first issue I used the word Asia and placed a small description after the title to convey a clear message. For the second issue, I used a horizontally stacked bar chart for easier comparison, and to emphasize Asia I chose a strong red color and used a grayscale palette for the other continents.

### Sarah

I wanted to build something that was easy to read and draw comparisons between the years. This is very difficult in the original pie chart version of the viz. With this in mind I decided to build a bar-in-bar chart which clearly shows the latest year versus the earlier year used for comparison. I ordered the bars by the latest value so APAC appears at the top since this is the focus of the story.
Interactive viz

### Sanne & Michiel

Our first thought was a single slopegraph, but we had already seen so many online, so we decided to use a slopegraph-small-multiple combination. This allowed us to show the different layers in the story. The three continents with strong movement we placed on top, the continents with small change below. We used color and dark grey to create hierarchy between the top and bottom charts. Next we ranked the charts by size in 2016 and placed the “Other” as last. The visual is made in Excel. We wrote and aligned the title and subtitle in such a way that the colored text part is near the charts they belong to. Small numbers use a single decimal to show change. The year is written in two numbers instead of four. The main line in each chart crosses over the others.

### Sandeep

Clearly, looking at the title, the author wants to convey that APAC has become the largest travel market (in terms of share of global GDP). This means he is comparing each regional share to what it was in the year 2000 and also highlighting the number-one-ranked regional share for the year 2016. Hence, my story would be to show how each regional share has changed compared to 2000, then rank them according to the current year's (2016) regional share of global GDP. This is built in Tableau Public. Link to dashboard here.

### Rodrigo

Yes, it is a simple barchart, as I feel it does a much better job of drawing the 2000/2016 comparison than the original set of pies. To enhance the bars I've opted to add the region name, its measure and the growth percentage to allow users to understand the change more quickly. For the use of color I wanted to make it clear that the focus is the APAC region (as the takeaway title declares) and left the remaining regions muted in a light gray for context.

### Roman

Had a few thoughts about rebuilding this chart, and before long, I realized I had quite a bit more than a paragraph! Here’s a link to a blog post I made on the topic.

### Raymond

A little zany! Why not!

### Pietro

I've made a really simple makeover based on the basics learned from SWD! Two different pies aren't a good choice for a comparison, and columns and lines are better when you are showing data over time. I've used Power BI Desktop and Paint.

### Paul

When I saw the original pie charts, my eyes immediately went to the largest slices.  However it's difficult to compare changes in segment size between the two pie charts. Based on the original title, only the second pie chart is needed to see that APAC now has the largest travel market.  In that case though, the first pie wouldn't be needed.  Instead of just reducing the data to only 2016, I realized that showing the changes in segment size would be a more valuable use of this dataset. Since there are only two dates in the dataset (2000 and 2016) and my goal with this viz was to show change in travel share over time, my mind immediately went to a slopegraph.  Especially after having created a slopegraph for June's #SWDChallenge, I knew it would be a great way to not only rank the segments for each of the two years, but to also show changes between the time periods. In my updated version, it's now easy to see that while APAC is the largest market, the market share for both North America and Europe decreased the most over the 16 year period.
Interactive viz

### Patricia

For this makeover I used the lessons I learned from last month's challenge using dot plots. Along with slope graphs, dot plots are a great way to illustrate changes especially between two points of reference. I used Python's Plotly to create this graph, and I wrote a tutorial on how to make connected dot plots here.

### Olga

I decided to make a simple slope graph, partially for the sake of exercise, since I had not done this type of chart in Tableau before. I also thought it would deliver the main message really well. I went with a simple colour-blind-friendly palette. I was interested in how big the contribution of tourism to global GDP is and how many jobs it generates. I decided to add a couple of extra sentences of insight/description and a link to the World Travel & Tourism Council 2017 report.

Blog

### Nick

A slope graph seems to be the best way to see change between the two years for each region so it's a pretty basic viz, nothing fancy.

### Natalia

For this makeover, I decided that the most important idea was to highlight the changes among the top-3 players; that's where the most drama was happening. That's why I moved the Latin America and Middle East data into the comments.

### Matt

Tools: Microsoft Excel 365 (Android App, Desktop Program). This is my first challenge, so the process took a little longer than I was expecting. The title in the original graph gave me an idea of what data needed to be highlighted. From that point I knew I needed to make the point that the Asia Pacific region was the “Belle of the Ball.” In my day to day, metrics are reported with gauges, and although a gauge graph would be appropriate, it would not show the full story.

### Marian

For me the original visualization made it difficult to compare the percentages per region per year, so I felt a bar chart grouped by region would give the best visual. I got feedback to put a larger gap between the regions and I’m pretty pleased with how it turned out. Depending on when the original viz was published, you could also argue that the original title is a bit misleading. Since we are now in 2018 and are comparing numbers from 2000 and 2016 only, I thought it was more appropriate to state that in 2016 APAC was the biggest travel market rather than ‘Asia is the biggest travel market’.

### Maria

After considering a bar chart (stacked horizontal or simple vertical) I chose a slopegraph. With a slopegraph I could easily show the visual change from 2000 to 2016 for each region via the slope of the respective line. I used color to visually emphasize the "Asia Pacific" results that made up the main message of the initial graph. I also added text boxes on the right side, using the respective colors to make a link with the slopes. I decided to show one more digit on the data labels of Latin America & Middle East. As there is minimal change from 2000 to 2016, these slopes are not quite horizontal, so I needed to explain this with more accurate numbers. I find that they do not attract attention due to the color that pushes them to the background. I used Microsoft Excel.

### Liyang

I just finished reading Cole’s storytelling with data and the idea to use a slope graph to makeover the original chart came to me right away. Additionally, the highlight and annotation to draw attention are also important takeaways from the book which I applied here.

### Lora

Made in Excel. Wanted my story to focus on the drastic changes that occurred over the 16-year period.

### Lise

Made in Excel, with the help of a SWD blog post on bullet graphs. I wanted to show the change more clearly in general, and specifically the change in the Asia Pacific market. I decided on this graph type because I find it easy to read and because it shows the development reasonably clearly. The blue 'dots' (lines) make it a little unclear because I've gone for quite chunky lines to match the bars, and it's impossible to know whether the 2000 figure is the top, bottom or middle of the line. However, given that my aim is to show the change from 2000 to 2016 rather than the exact difference, I find this acceptable. The bars are sorted largest -> smallest by the 2016 numbers. For precision, a slope graph would have been better.

### Lily

My immediate thought when looking at the 'before' pie charts was "Europe has gotten smaller" because the dark navy blue caught my eye and that slice of the pie seemed most prominent. However, the title indicated that the Asia Pacific slice was really what the chart was intended to illustrate. So I redesigned it as a slope chart, with Asia Pacific in a bright red and the rest of the regions in shades of blue and green. This brings attention to that specific part of the story. I also re-titled it with "Asia Pacific" instead of "APAC" because not everyone knows that acronym and it isn't labeled that way in the chart itself. Time constraints prevented me from thinking about the broader context of the data, but I'm happy with this as a starting point.
Website

### Laura S

Please accept the attached graph on behalf of the graduate students at Middlebury Institute for International Studies, who are using “Storytelling with Data” as a textbook this semester in the course Analytics and Thick Data. We love the blog and were excited for this opportunity to put our data visualization techniques to the test!

### Laura M

I decided not to use any tool. For me, it's important to start by drawing my ideas first and then use a tool only if it's needed. In this case I wanted to present my proposal as a hand-drawn sketch to demonstrate that tools are usually overrated: we usually start the process by opening a tool and creating a "new file," when the most important thing is to start drawing in order to tidy our ideas (most important KPIs, data we don't need, colors, sorting, etc.).

### Larissa

This is my first time participating in the SWDchallenge. I used Tableau public and Power Point to make my graph.

### Kristen

I love the simplicity of a slopegraph and it seemed the perfect option for this data. Created using Tableau.

### Kristine

As I understand it, the "Before" graph aims to underscore that APAC had already become the largest travel market come 2016. In this sense, only the APAC marker (2016) was left colored, and a callout stating the key takeaway was placed pointing to the marker.

### Kristina

I believe the point of the original chart is to show how the share of each of the six regions has changed from 2000 to 2016. However, the pie chart is a format that makes that comparison quite difficult. One chart type that I came to know and love thanks to SWD is the slopegraph and I thought it was the perfect format for this data. We have two states - data for 2000 (before) and data for 2016 (after) and also the numbers are suitable to show the dynamics in the share of Europe and Asia Pacific - the two regions with the most significant change - positive or negative. This is the reason those two are in color - blue to mark the increase for Asia Pacific and orange to mark the decrease for Europe. However, one more reason to decide to focus on these two regions is because I'm from Europe and my story here is that I want to present a proposal to increase the Europe share by making North and Southeast Europe more attractive. The mentioned European Tourism Association is fictitious.

### Kirill

Please see my very first attempt to participate in the data challenge. Created in Excel.

### Kaushik

Made in Excel 2016: Given I have data for two time periods, which are quite far apart, I chose to use a "Stacked Bar" chart to show trends:

1. Normalized bars - each bar has the same length, so there is no misrepresentation from varying bar lengths across regions.
2. Stacked bar - shades of black and lighter gray used to show gradual change. The more color in the bar, the larger the contribution. I avoided jarring colours as they only draw attention to themselves and not to what needs to be understood.
3. Data labels - big data labels to improve readability. Red used to draw the user's attention to the steepest decline in share.
4. Trends - added trends to aid interpretation of the graph.

### Kat

One word in particular stood out to me in this month’s challenge - “storytelling". I chose to focus my efforts on uncovering a story in the makeover graphs and then animate this using Adobe After Effects. You can see the final version and a look into the process here.

### Julian

Slope graph was an obvious choice from my point of view. Done with python and matplotlib and published on my Twitter account.

### Joost

Upon seeing the pie charts, my eyes were drawn to the title and then to the large share of Europe in the first pie. It took me a while to figure out what the main message was, which is the increase in the Asia Pacific region in 2016 compared to 2000. Obviously I thought that a pie chart was not the best chart type to visualize here. I kind of automatically started to link the shares of the first pie chart to the corresponding shares of the second pie charts, creating kind of like a slope graph in my mind. So I decided to go for the Slope Chart to make over this chart. Though I normally work with QlikView handling larger datasets, I created this with Excel.

I am wondering what the statistics were in the years between 2000 and 2016: whether this was a trend, or just random fluctuation from year to year. I am also wondering what the statistics are for 2018, as I am leaving tomorrow for the APAC region. I assume I will be visiting the most popular travel region.

### John

A slope graph made a ton of sense to show changes in the global tourism market between 2000 and 2016. This visual allows us to easily see the rate and direction of change in global tourism market share across world regions. Our goal here is to highlight the dramatic increase in market share for Asia Pacific, so I chose to highlight the Asia Pacific slope in blue to make it stand out from the other regions (for which I used a muted grey). I used the same shade of blue in the text above the visual to help the reader connect references to Asia Pacific in the paragraph to the visualization.

### Jim

For this month's challenge I thought of a slope graph at first.  After seeing so many of those, I tried to test out a variation by measuring the percentage point change from 2000 to 2016.  I like how it highlights the growing and shrinking share, which was one of the main takeaways in the data.  But I think it obscures the ranking of each region, so I called that out in the text labels.  Created in Tableau.

### Jeremy & Sarah

We thought we would try the rarely-popular doughnut chart – it will be interesting to see what people think. It’s not as accurate as a stacked bar chart, but the circle does lend itself to a more engaging layout: a bar chart would not have nearly as much room for adding commentary boxes. We stuck closely to the provided data – it would have been interesting to see absolute figures, and the changes in these year-on-year. With more time I would also have liked to explore the WT&TC’s figures and see what countries were included to make up each of the regions. As it is, while I think the map is a helpful guide, it’s also a best guess (and so any errors are mine!)

### Jennifer

Created in Data Studio – tool tips available on hover.

### Jennie

I used Excel's basic line graph function and decided a slope graph would show the changes clearly.  Our corporate colours are pink and blue so decided to use those to highlight the 2 main changes, with the most interesting (to me) in bright pink to catch the eye.

### Jenna

This is my first #SWDChallenge and my first time applying what I've learned about R / ggplot2 to plot on my own. I thought a simple slopegraph like this would be most effective to communicate the main point of the original chart - that Asia Pacific has become the largest travel market - surpassing Europe and N. America in the process.

### Jared

I ended up going with a slope chart - like everyone else, it seems. Despite this, I can say I learned how to create a slope chart in Tableau, something I've never done before!

### Jamie

My main goal was to highlight both the change in share over time, and the current relative shares. The growth in the Asia Pacific market was pretty significant, as was the drop for Europe, so I wanted to be able to highlight that. Slope graphs are still fresh in my mind, so I used that for the change factor. With more time I might consider a dot plot with directional connectors to show the change in position. The slope graph does allow for a comparison of the shares in each year, so the bar chart may be redundant, but I felt that it makes the comparison much more clear. The need for two charts in this arrangement may be, in itself, a good argument for going with the dot plot instead.

### James

The key message for me was the fact that Asia Pacific has become the largest share of global tourism, secondary message being that Europe and North America have seen their share drop. I’ve used Excel and the quintessential storytelling with data colour palette of grey and blue to highlight the key message. For the year 2000 data points I’ve used a faded outline to try and convey a sense of movement between them and the  solid data points used for the year 2016.

As we know, humans have a hard time judging the area segments of a pie chart, let alone comparing the areas across two pie charts. So my first thought was to change the visual to a type of graph that would better display trends and rankings over time; I decided to go with a slope graph. With the more appropriate graph type and green color call-out, readers can now easily see that Asia Pacific moved from #3 to the #1 slot in 2016. I also thought it was noteworthy to point out that Europe’s travel market had experienced decline. It would be helpful here to have actual dollars of global GDP to give a little more context as to why the changes in percentages were happening.
Interactive viz

### Ilya

The classic slope graph seemed a natural choice to show both the rankings and the dynamics of regions’ market share for two moments in time: with only one line going upwards from the third place all the way to the top, highlighted in blue for more contrast, it reads in unison with the title. Made with Highcharts with styling and interactivity stripped to bare minimum.

### Inta

I used PowerPoint and a column chart to show the percentage increase or decrease for each region, emphasizing the change with green for increases and grey for decreases. I showed only one number for each region: its percentage increase or decrease.

### The FrankensTeam

The most important message of the original visualization is how the shares changed and - as we can read from the title - that the Asia Pacific region became the leader of the market. Pie charts are a good choice to represent parts of a whole, but they cannot draw your focus to the change in shares and in rank. So our choice is a stacked column chart with sorted data, where we show the change in position with "flow". Our visualization was created using Excel 2016. It is not a built-in chart; we did some magic with scatter lines. You can find more of our work in our chart gallery.

### Franck

This chart is my contribution to the September 2018 #SWDChallenge. I wanted the chart to focus on rank shifts, so I designed it this way:

* colours focus attention on rank shifts; any other information (such as percentages) gives the reader a deeper understanding, but uses less eye-catching colours (i.e. grays) in order to not disturb the main message

* two distinct colours differentiate rise and fall, respectively in green and red

* the chart is vertically oriented, so that shapes of the chords themselves indicate a rise or a fall

Also, even if my first idea was to use a basic flow chart, it was more appealing and fun to use a circular flow chart :-) Many thanks to Nadieh Bremer for sharing How to create a Flow diagram with a circular Twist.

### Emily

The original slide focused on only half the story (the rise for Asia/Pacific) and not the corresponding drop in Europe. That, plus change over time, called for a slopegraph that de-emphasized the less interesting markets.

### Dragos

The pie charts were a good representation of what the title led to; it just took a little time to notice the areas that grew and the ones that decreased. I used two clustered bar charts to show the two years in parallel and the differences between the two dates. Some data about the changes in presidency in these regions would have been nice; maybe this made the difference. I used Power BI.

### Dorian

I am using Tableau to compare the travel market share using bar charts and reference lines.
Viz by Dorian Banutoiu (Canonicalized) • Tableau Public profile.

### Dirk

I used MS Excel. I've also posted to Twitter.

### Diego

I tried to keep it simple, just adding some colour to the important text and numbers.
Viz Tableau Public

### Dianna

I’d use the data as a starting-off point to explain why Asia Pacific is doing so much better with tourism compared to all the other areas we had data for. Listing Asia Pacific at the top in green, with all other regions in either yellow (for no change) or red (for decreases), draws the user’s attention to a clear conclusion that Asia Pacific is performing best. Depending on my audience, I’d try to answer a few questions to create this data narrative. Things like what Asia Pacific might be doing better than everyone else, and other causes for the increase, would be a good starting point. Likewise for the no-change and decrease groups: are there things that they could be doing differently to increase tourism? What are the reasons for the decrease in tourism? I’m sure we could cater the data story to them to figure out a plan of action based on these questions. I’d also use the percent changes of actual tourism earnings rather than showing changes in the percent of GDP; that way we have something a little more concrete to show true changes in tourism, rather than just how tourism earnings and GDP relate.

### Dennis

On Twitter I already saw some designs coming along all drawing the same conclusion. That's why I forced myself to make assumptions (as part of the exercise) and followed a different path. I am aware that the conclusion I am drawing from this data is not fully supported by the data. I would need more data, for example tourist data by year, to see where the decrease started and see if there is a correlation with the terrorist attacks. The absolute numbers on tourism contribution would also help to see if there is a real decrease or that other parts of the economy did increase over time. I decided to use a slope chart to show the decrease between the two available points. This tells us the top 1 position Europe had in 2000 on the one hand and the decrease ever since on the other. As this decrease (and the subject itself) are both negative the use of a red color seemed the most appropriate. To highlight Europe I also increased the font size, the line thickness and the shape size. This was all done using Microsoft Power BI, my main tool for data analytics.

### David

The power of testing: I did a little audience testing and found my snappy titles were a bit too snappy. I originally said "Asia Pacific: Your Target Market". This misled my audience into thinking that 31% was somehow a target. My more wordy title separated the facts - "Asia Pacific: The largest travel market" - from the call to action: "Your target market?"

### Darryl

I thought it would be useful to see the numbers on the scale of a graph to easily compare the relative values of the 2000 and 2016 data for the different regions. I then used a modified dumbbell-type graph to clearly show how the different regions had moved over the period, adding colour so this stands out a little more.

### Daniel

In this illustration, I want to highlight the 2016 figures, and also highlight the change since 2010 (inside the bar).

### Dan

There is nothing wrong with investigating the evolution of a market share. We do this very often, and it implies comparing percentages. IMO, comparing the numerical attributes of the same entity (in our case, a region) that are estimated in very different contexts (periods of time) is like comparing different variables. The market share variation of the same item (calculated either by difference or by ratio) becomes less relevant and, in this particular case, it looks very different from the variation in absolute values. One of the most often used solutions to encode variation, the slopegraph, becomes in this case more like a parallel coordinates plot, because the slopes of the lines lose their “difference” encoding sense. Let’s look at Europe, for instance. A ratio calculation of the market shares between 2000 and 2016 shows a serious -23% decrease. But the absolute figures (not included, but you can get them here: choose Download Source Data in Excel format) show a +23% rise. Without the absolute figures in hand, I know that I need to be more careful in designing graphs that can influence my audience’s conclusions. Neither -8% (difference) nor -23% (ratio) properly describes Europe’s evolution in tourism. In the absence of absolute values, my design tries to discourage precise visual decoding of differences between values that belong to the same region but relate to such different contexts (years), and to emphasize other relevant observations: the updated market share rankings (with Asia Pacific becoming travelers’ first destination), a similar cumulative share now and then (83%) among the top three players, and a more balanced market share distribution between them in recent years (6% = 1st-3rd in 2016 vs 14% = 1st-3rd in 2000).

### Crystal

This month's #SWDChallenge makeover gave me the opportunity to try out a new butterfly/diverging bar chart. I think this makeover allows for easier comparisons over the original pie charts.

### Colin

Data visualizations can tell an interesting narrative.  However, sometimes the focus can be lost in the chart choice and design.  The challenge came with some useful guidance on how to make over a chart, which I decided to follow:

* choosing an appropriate chart type to show change over time, such as a slope chart
* using pre-attentive attributes like colour and size to draw our audience's eyes to the increase in the Asia Pacific travel market size
* identifying clutter and removing elements which may be confusing or complicated
* evaluating the wider context and adding a call to action for further research into this topic
Interactive viz | Blog

### Clayton

Since reading Cole's Storytelling with Data and the blog, I have enjoyed discovering the slopegraph: a useful visual for comparison situations that also displays rank. In my mind, perfect for this makeover. The visual also quickly highlights that no other region has experienced a gain in market share.

### Christophe

I used two main principles from the literature: 1) “Visualizing is categorizing (ordering)” from J. Bertin, and 2) “There must be a common (horizontal or vertical) basis for visual comparison” from Cleveland and McGill (1984). So, to help the reader compare market shares, whether increasing or decreasing, the visual variables (here, bars) should be somehow aligned. I chose horizontal stacked bars with common references for starting or ending points, so the bars are aligned either on the right or on the left. My trick was to slightly cheat with the way one usually does horizontal stacked bars: I changed the order of continents, inserted some blank space, and aligned Europe (on the left) and North America (on the right) to highlight the increase of Asia-Pacific in the middle. The rest is classic (colors, values, text, highlight, etc.).

### Burkhardt

In order to focus the graphic on the essentials, I sorted the data series and clearly marked them in color. With the headline I then pointed out the important points: a) Europe has clearly lost and b) Asia Pacific has clearly won. What I lacked for a better presentation were the absolute numbers. For example, it could be that the number of trips in 2016 decreased significantly (probably not so....). Which tools did I use? Simply MS PowerPoint with the add-on "Thinkcell" to put the graphic cleanly into a suitable format. I completely invented the call to action. Here it surely depends on who looks at this diagram: the European tourism minister will certainly have other to-dos than the tourism industry in the Asia-Pacific region.

The first change I thought would help make the message more clear (for me, anyway) was to change the two pie charts to a single slope graph. I think the pie charts were a little hard to follow, as I had to compare them for a bit to find out the difference. I chose a slope graph as I only had two dates to work with to show the change, and I find a slope graph is great at that type of comparison. Like a pie chart, a slope graph is also easily understood by most audiences. I don't know much about the audience this was intended for, but as a person outside of the travel industry it took me a few seconds to understand what APAC stood for. I'm assuming this makeover is for a general audience, so I changed APAC to Asian Pacific. I agree with the original design that the Asian Pacific market taking over first place from the European market was the most interesting message, so I highlighted those markets in the title and in the graph. If this was about exploring all of the regions, I might not have done that. The last key difference I added was giving context to how much money travel accounts for, so I did some additional research and found that the Asian Pacific market had 326 billion in travel bookings in 2012. That is a big number, so I added it below the graph. I used the full number without the "billion" abbreviation to show how monstrously big that number is.

### Brent

I chose a slope graph to show the change over time, mainly because there are only two years of data to compare. Instead of using the arbitrary colors on the pie chart, I used a standard blue-orange diverging color palette, which worked well to grey out the regions that didn't experience change while highlighting the shifts in both Asia and Europe. I built the graph in Tableau and it is posted on Tableau Public.

### Brad

I added some completely fictional data from a fictional company to generate some business-relevant takeaways.
### Bosley

I used Tableau to create this slope chart, which I believe is much easier to read than the original. I went back and forth on the color to highlight Asia Pacific but landed on the dark red, as I felt it was punchier than a blue version of the same chart.

### Benjamin

First time of many for me participating! I thought this dataset was a prime target for a slope graph comparing two different years. I think the raw GDP values in a separate plot, probably not another slope graph, would be helpful in supplementing the percentages shown. Otherwise, this single plot makes it look as if Asia took the tourism from the European region. Perhaps it's my age group and what I'm exposed to on social media, but it seems like traveling is becoming a much more common use of disposable income. So additional data on the total tourism GDP difference between 2010 and 2016 could further illustrate how much bigger tourism to Asia is.

### Annamarie

I used Tableau to illustrate the story. Graphs are colored by increase (green) or decrease (light gray) in overall tourism contribution to GDP from 2000 to 2016. Wider bars represent 2000 figures while thin bars represent 2016 figures. "Travel" was on my mind when I chose the green with typewriter fonts.

### Anna

I created just one simple bar chart in Tableau. I added color to Asia Pacific to catch the eye on this region.

### Anabela

I decided to follow along with the original image's idea to promote the growth of Asian Pacific. In order to do that, I decided to use a slope chart (I'd never seen one), built with Excel, where the dark blue lets you focus on Asian Pacific.

### Ana

For the September challenge, I wanted to find a way not only to represent the growth of APAC and the decline of Europe, but also that the market is still stable and dominated by the same three regions. That was the reason for selecting a bar chart instead of a line chart. I also wanted to play with unusual colours to explore other possibilities.
I prefer to keep to Excel and PowerPoint, as this is what most companies will use for their presentations.

### Amy

I often present to audiences that are very analytical and prefer to have access to all the data to make sure they come to the same conclusions. For this type of audience, I developed a second visual that is a stacked bar chart. Again, I used size and color to draw the eye to the Asia Pacific region, but here I included all the data labels. I also think my analytical audiences would appreciate a sense of the magnitude of the data (although I didn't include additional information in my visuals). For example, in the slope chart, it looks like Europe is on the decline. However, it's not clear if the overall number is going down or just growing at a slower rate compared to the other regions.

### Amit

I used Excel and relied on a line graph to demonstrate each region’s change over a 16-year period. The original graph was an OK start, but I found the sub-text confusing and could not interpret what the percentage represented. I.e., it reads like “Europe got 35% of GDP from tourism in 2000.” I assumed the percentage was a region’s market share of global tourism spend AND that total tourism spend remained flat. The real story for me was that APAC grew very strongly, while other regions remained flat or dropped. Europe had the largest drop. This is what I show. I used RED to contrast against blue shades to draw the reader’s attention. A sarcastic title at top adds humor. Whole dollar amounts would make the data and graph more useful.

### Amber

For this makeover I used a line graph in Excel.

### Allen

For this challenge I’ve recreated a set of charts, starting with pulling the source data into Tableau and playing with different visualizations to see what is interesting. The final display has a few different views to display my findings.
I believe combining different views to build the story has worked here, while doing my best to stick with the theory of keeping it as simple as possible. More data would be interesting; I’d like a more granular view to see how countries in these markets are changing, as well as how the market itself is changing.

### Allan

I decided to make a simple diverging bar chart for this challenge using Tableau.

### Aleksandra

I wanted to focus on the change in the ranking (also highlighting the drop of Europe) and some reasons behind it.

### Adam

A simple bar embellished with a bit of #stickmanstats treatment.

BLOG | TWITTER | LINKEDIN

Click "like" if you've made it to the bottom—this helps us know that the time it takes to pull this together is worthwhile! Check out the #SWDchallenge page for more, including details on the next challenge. Thanks for reading!

Continue Reading…

### R Packages worth a look

Multi-Data-Driven Sparse PLS Robust to Missing Samples (ddsPLS)
Allows to build Multi-Data-Driven Sparse PLS models. Multi-blocks with high-dimensional settings are particularly sensible to this.

Tidying Methods for Mixed Models (broom.mixed)
Convert fitted objects from various R mixed-model packages into tidy data frames along the lines of the ‘broom’ package. The package provides three S3 …

A Collection of Utility Functions (ojUtils)
This is a collection of utility functions. Currently, it provides alternatives to base ifelse() and base combn() functions utilizing ‘Rcpp’, and provid …

Continue Reading…

### Highlight Sessions from Alibaba, Uber, The Washington Post – at Predictive Analytics World London

The Predictive Analytics World London 2018 (Sep 17-18) agenda is now live. Have a look at what all the excitement is about!
Continue Reading…

### Book Memo: “Decision Tree and Ensemble Learning Based on Ant Colony Optimization”

This book not only discusses the important topics in the area of machine learning and combinatorial optimization, it also combines them into one. This was decisive for choosing the material to be included in the book and determining its order of presentation. Decision trees are a popular method of classification as well as of knowledge representation. At the same time, they are easy to implement as the building blocks of an ensemble of classifiers. Admittedly, however, the task of constructing a near-optimal decision tree is a very complex process. The good results typically achieved by the ant colony optimization algorithms when dealing with combinatorial optimization problems suggest the possibility of also using that approach for effectively constructing decision trees. The underlying rationale is that both problem classes can be presented as graphs. This fact leads to the option of considering a larger spectrum of solutions than those based on the heuristic. Moreover, ant colony optimization algorithms can be used to advantage when building ensembles of classifiers. This book is a combination of a research monograph and a textbook. It can be used in graduate courses, but is also of interest to researchers, both specialists in machine learning and those applying machine learning methods to cope with problems from any field of R&D.

Continue Reading…

### Improving Debt Collection with Predictive Models

FICO scores will soon be improved by predictive analytics. This new approach is more accurate and can extend to the entire debt management process. Badly assessed financial risks were at the core of the financial crisis in the late 2000s. Banks and credit companies used faulty models which did not … The post Improving Debt Collection with Predictive Models appeared first on Dataconomy.
Continue Reading…

### Building a Machine Learning Model through Trial and Error

A step-by-step guide that includes suggestions on how to preprocess data and derive features from it. This article also contains links to help you explore additional resources about machine learning methods and other examples.

Continue Reading…

### Uncertainty in Data Science (Transcript)

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is a link to the podcast.

## Introducing Allen Downey

Hugo: Hi, there, Allen, and welcome to DataFramed.

Allen: Hey, Hugo. Thank you very much.

Hugo: Such a pleasure to have you on the show, and I’m really excited to have you here to talk about uncertainty in data science, how we think about prediction, and how we can think probabilistically, and how we do it right, and how we can get it wrong as well, but before we get into that, I’d love to find out a bit about you, and so I’m wondering what you’re known for in the data community.

Allen: Right. Well, I’m working on a book series that’s called Think X, for all X, so hopefully some people know about that. Think Python is kind of the starting point, and then for data science, Think Stats and Think Bayes, for data science and for Bayesian statistics.

Hugo: Great, and so why Think?

Allen: Came about, roundabout, the original book was called How to Think Like a Computer Scientist, and it was originally a Java book, and then it became a Python book, and then it wasn’t really about programming. It was about bigger ideas, and so then when I started the other books, the premise of the books is that you’re using computation as a tool to learn something else, so it’s a way of thinking, it’s an approach to the topic, and so that’s how we got to the schema that’s always think something for various values of something.

### Computation

Hugo: Right.
I like that a lot, and speaking to this idea of computation, I know you’re a huge proponent of the role of computation in helping us to think, so maybe you can speak to that for a minute.

Allen: Sure. I mean, it partly comes … I’ve been teaching in an engineering program, and engineering education has been very math-focused for a long time, so the curriculum, you have to take a lot of calculus and linear algebra before you get to do any engineering, and it doesn’t have to be that way at all. I think there are a lot of ideas in engineering that you can get to very quickly computationally that are much harder mathematically.

Allen: One of the examples that comes up all the time is integration, which is a little bit of a difficult idea. Students, when they see an integral sign, immediately there’s gonna be some challenge there, but if you do everything discretely, you can take all of those integrals, you just turn them into summations, and then if you do it computationally, you take all of the summations and turn them into for loops, and then you can have very clear code where you’re looping through space, you’re adding up all of the elements. That’s what an integral is.

Hugo: Absolutely, and I think another place that you’ve thought about a lot, and a lot of us have worked in where this rears its head, is the idea of using computation and sampling and re-sampling datasets to get an idea about statistics. Right?

Allen: Right. Yeah. I think for classical statistical inference, looking at things like confidence intervals and hypothesis tests, re-sampling is a very powerful tool. You’re running simulations of the system, and you can compute things like a sampling distribution or a p-value in a very straightforward way, meaning that it’s easy to do, but it also just makes the concept transparent. It’s really obvious what’s going on.
Hugo: That’s right, and you actually … We’ve had a segment on the podcast previously, which is … It’s blog post of the week, and we had one on your blog post, There Is Only One Test, which really spells out the idea that in the world of statistical hypothesis testing, there is really only one test, and the idea that you can actually see that, and this is one of your great points, you can see that when you take the sampling, re-sampling, bootstrapping approach. Right?

Allen: Right. Yeah. I think it makes the framework visible, that hypothesis tests, there’s a model of the null hypothesis, and that’s gonna be different for different scenarios, and there’s the test statistic, and that’s gonna be different for different scenarios, but once you’ve specified those two pieces, everything else is the same. You’re running the same framework. So, I think it makes the concept much clearer.

Hugo: Great, and we’ll link to that in the show notes. We’ll also link to your fantastic followup post called "There Is Still Only One Test".

Allen: Well, that’s just because I didn’t explain it very well the first time, so I had to try again.

### How did you get into data science?

Hugo: It also proves the point, though, that there is still only one test, and I’ll repeat that, that there is still only one test. So, how did you get into data science originally?

Allen: Well, my background is computer science, so there are a lot of ways, a lot of doors into data science, but I think computer science is certainly one of the big ones. I did … My master’s thesis was on computer vision, so that was kind of a step in that direction.
My PhD was all about measuring and modeling computational systems, so there are a lot of things that come in there like long tail distributions, and then in 2009 I did a sabbatical, and I was working at Google in a group that was working on internet performance, so we were doing a lot of measurement, modeling, statistical descriptions, and predictive modeling, so that’s kind of where it started to get serious, and that’s where I started when I was working on Think Stats for the first time.

Hugo: So, this origin story of you getting involved in data science I think makes an interesting point, that you’ve actually touched a lot of different types of data, and I know that you’re a huge fan of the idea that data science isn’t necessarily only for data scientists, that it actually could be of interest to everyone because it touches … There are so many touch points with the way we live and data science. Right?

Allen: Right. Yeah. This is one of my things that I get a little upset about, is when people talk about data science, and then they talk about big data, and then they talk about quantitative finance and business analytics, like that’s all there is, and I use a broader notion of what data science is. I’d like to push the idea that it’s any time that you’re using data to answer questions and to guide decision making, because that includes a lot of science, which is often about answering questions, a lot about engineering where you’re designing a system to achieve a particular goal, and of course, decision making, both on an individual or a business or a national public policy level. So, I’d like to see data science involved in all of those pieces.

Hugo: Absolutely. So, we’re here to talk about uncertainty today.
One part of data science is making predictions, which we’ll get to, but the fact that we live in an uncertain world is incredibly interesting because what we do as a culture and a society, we use probability to think about uncertainty, so I’m wondering your thoughts on whether we humans are actually good at thinking probabilistically.

Allen: Right. It’s funny because we are and we are not at the same time.

Hugo: I’m glad you didn’t say we probably are.

Allen: Right. Yeah. That would’ve been good. So, we do seem to have some instinct for probabilistic thinking, even for young children. We do something that’s like a Bayesian update. When we get new data, if we’re uncertain about something, we get new evidence, we update our beliefs, and in some cases we actually do a pretty good approximation of an accurate Bayesian update, typically for things that are kind of in the middling range of probability, maybe from about 25% to 75%. At the same time, we’re terrible at very rare things. Small probabilities we’re pretty bad at, and then there are a bunch of ways that we can be consistently fooled because we’re not actually doing the math. We’re doing approximations to it, and those approximations fail consistently in ways that behavioral psychologists have pointed out, things like confirmation bias and other cognitive failures like that.

## "Why Are We So Surprised?"

Hugo: Absolutely. So, I want to speak to an article you wrote on your blog called Why Are We So Surprised?, in which you stated, “In theory, we should not be surprised by the outcome of the 2016 presidential election, but in practice, we are.” So, I’m wondering why you think we shouldn’t have been surprised.

Allen: Right. Well, a lot of the forecasts, a lot of the models coming from FiveThirtyEight and from The New York Times, they were predicting that Trump had about a 25% chance, maybe more, of winning the election.
So, if something’s got a 25% chance, that’s the same as flipping a coin twice and getting heads twice. You wouldn’t be particularly surprised by that. So, in theory a 25% risk shouldn’t be surprising, but in practice, I think people still don’t really understand probabilistic predictions.

Allen: One reason we can see that is the lack of symmetry, which is, if I tell you that Trump has a 25% chance of winning, you think, “Well, okay. That might happen,” but when FiveThirtyEight said that Hillary Clinton had a 70% chance of winning, I think a lot of people interpreted that as a deterministic prediction, that FiveThirtyEight was saying, “Hillary Clinton is going to win,” and then when that didn’t happen, they said, “Well, then FiveThirtyEight was wrong,” and I don’t think that’s the right interpretation of a probabilistic prediction. If someone tells you there’s a 70% chance and it doesn’t happen, that should be mildly surprising, but it doesn’t necessarily mean that the prediction was wrong.

Hugo: Yeah, and in your article, you actually make a related point that everybody predicted at some level, well, predicted that Hillary had over a 50% chance of winning, and you made the point that people interpreted this as there was consensus that Hillary would win with different degrees of confidence, but that’s … So, as you stated, that’s interpreting it as deterministic predictions, not probabilistic predictions. Right?

Allen: Yeah, I think that’s right, and it also … It fails the symmetry test again because different predictions, they ranged all the way from 70% to 99%, and people reacted as if that was a consensus, but that’s not a consensus.
If you flip it around, that’s the range from saying that Trump has anywhere between 1% and 30% chance of winning, and if the predictions had been expressed that way, I think people would’ve looked at that and said, “Oh, clearly there’s not a consensus there, because there’s a big difference between 1% and 30%.” Hugo: I really like this analogy to flipping coins, because it puts a lot of things in perspective, and another example, as you mention in your article, The New York Times gave Trump a 9% chance of winning, and if you flip a coin four times in a row and get four heads, that’s relatively surprising, but you wouldn’t be like, “Oh, I can’t believe that happened,” and that has a 6.25% chance of happening. Right? Allen: Right. Yeah, I think that’s a good way to get a sense for what these probabilities mean. Hugo: Absolutely. So, you mentioned also that these models were actually relatively credible models, so maybe you can speak to that. Allen: Yeah. I think going in, two reasons to think that these predictions were credible, one of them was just past performance, that FiveThirtyEight and The New York Times had done well in previous elections, but maybe more important, their methodology was transparent. They were showing you all of the poll data that they were using as inputs, and I think they weren’t actually publishing the algorithms, but they gave a lot of detail about how these things were working. Some polls are more believable than others. They were applying correction factors, and they also had … They were taking time into account. So, a more recent poll would be weighted more heavily than a poll that was farther into the past. So, all of those, I think ahead of the fact, we had good reasons to believe the predictions, and after the fact, even though the outcome wasn’t what we expected, that really just doesn’t mean that the models are wrong. 
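The arithmetic behind the coin-flip analogy is easy to verify; as a quick illustration in Python:

```python
def prob_all_heads(k):
    """Chance of getting heads on every one of k fair coin flips."""
    return 0.5 ** k

# Two heads in a row matches a 25% forecast for Trump; four heads in
# a row (6.25%) is close to The New York Times's 9% forecast.
print(prob_all_heads(2))  # 0.25
print(prob_all_heads(4))  # 0.0625
```

Events in this range happen all the time, which is the point of the analogy: a 9% outcome is surprising, but not unbelievable.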
Hugo: So, with all of this knowledge around how uncertain we are about uncertainty and how we can be good and bad about thinking probabilistically, what approaches can we as a data reporting community take to communicate around uncertainty better in the future? Allen: Right. I think we don’t know yet, but one of the things that I think is good is that people are trying a lot of different things. So, again, taking the election as an example, The New York Times had the twitchy needle that was sort of famously maybe not the best way to represent that information. There were other examples. Nate Silver’s predictions are based on running many simulations. So, he would show a histogram that would show the outcome of doing many, many simulations, and that I think probably works for some audiences. I think it’s tough for other audiences. Allen: One of the suggestions I made that I would love to see someone try is instead of running many simulations and trying to summarize the results, I’d love to see one simulation per day with the results of one simulation presented in detail. So, thinking back to 2016, suppose that every day you looked in the paper, and it showed you one possible outcome of the election, and let’s say that Nate Silver’s predictions were right, and there was a 70% chance that Clinton would win. So, in a given week, you would see Clinton win maybe four or five times. You would see Trump win two or three times, and I think at the end of that week, your intuition would actually have a good sense for that probability. Hugo: I think that’s an incredible idea, because what it speaks to for me personally is you’re not really looking at these simulations or these results in the abstract. You’re actually experiencing them firsthand in some way. Allen: Exactly. So, you get the emotional effect of opening the paper and seeing that Trump won, and if that’s already happened a few times in simulation, then the reality would be a lot less surprising. Hugo: Absolutely.
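The one-simulation-per-day idea can be sketched in a few lines of Python; the 70% figure follows the example above, and the seed is an arbitrary choice so the "week" is reproducible:

```python
import random

random.seed(2016)  # arbitrary seed for reproducibility

def daily_simulation(p_clinton=0.70):
    """Draw one simulated election outcome, like one day's front page."""
    return "Clinton" if random.random() < p_clinton else "Trump"

# A week of front pages: on average about 4-5 Clinton wins, 2-3 Trump wins.
week = [daily_simulation() for _ in range(7)]
print(week)
```

Each day's reader sees a single concrete outcome rather than a summary of thousands, which is exactly the experiential effect described above.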
Are there any other types of approaches or ways of thinking that you’d like to see more in the future? Allen: Well, as I said, I think there are a lot of experiments going, so I think we will get better at communicating these ideas, and I think the audience is also learning, so different visualizations that wouldn’t have worked very well a few years ago, now people are I think just better at interpreting data, interpreting visualizations, because it’s become part of the media in a way that it wasn’t. If you’d look back not that long ago, I don’t know if you remember when USA Today started doing infographics, and that was a thing. People were really excited about those infographics, and you look back at those things now, and they’re terrible. It’ll be like- Hugo: Mm-hmm (affirmative). We’ve come a long way. Allen: It’s something that’s really just a bar chart, except that the bar is made up of stacked up apples and stacked up oranges, and that was data visualization, say, 20 years ago, and now you look at the things that The New York Times is doing with interactive visualizations. I saw one the other day, which is their three-dimensional visualization of the yield curve, which is a tough idea in finance and economics, and a 3-D visualization is tough, and interactive visualization is challenging, so maybe it doesn’t work for every audience, but I really appreciated just the ambition of it. Hugo: So, you mentioned the role of data science in decision making in general, and I think in a lot of ways, we make decisions based on all the data we have, and then a decision is made, but a lot of the time, the quality of the decision will be rated on the quality of the outcome, which isn’t necessarily the correct way to think about these things. Right? Allen: Right. I gave an example about Blackjack, that you can make the right play in Blackjack. You take a hit when you’re supposed to take a hit, and if you go bust, it’s tempting to say, “Oh. 
Well, I guess I shouldn’t have done that,” but that’s not correct. You made the right play, and in the long run that’s the right decision. Any specific outcome is not necessarily gonna go your way. Hugo: Yeah, but we know that in that case because we can evaluate the predictions based on the theory we have and the simulations we have in our mind or computationally. Right? On long-term rates, essentially. Allen: Right. Yeah. Blackjack is easy because every game of Blackjack is kind of the same, so you’ve got these identical trials. You’ve got long-term rates. We have a harder time with single-case predictions, single-case probabilities. Hugo: Like election forecasting? Allen: Like elections, right, but in that case, right, you can’t evaluate a single prediction. You can’t say specifically whether it’s right or wrong, but you can evaluate the prediction process. You can check to make sure that probabilistic predictions are calibrated. So, maybe getting back to Nate Silver again, in The Signal and the Noise, he uses a nice example, which is the National Weather Service, which is, they make probabilistic predictions. They say, “20% chance of rain, 80% chance of rain,” and on any given day, you don’t know if they were wrong. Allen: So, if they say 20% and then it rains, or if they say 80% and it doesn’t rain, that’s a little bit surprising, but it doesn’t make them wrong. But in the long run, if you keep track of every single time that they say 20% and then you count up how many times does it actually rain on 20% days, and how many times does it rain on 80% days, if the answer is 20% and 80%, then that’s a well-calibrated probabilistic prediction. ## Where is uncertainty prevalent in society? Hugo: Absolutely. So, this is another example. The weather is one. We’ve talked about election forecasting, and these are both examples where we really need to think about uncertainty.
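The calibration check Allen describes, bucketing forecasts and comparing each stated probability with the observed frequency, might look like this in Python (the rain records here are made up for illustration):

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """For each distinct forecast probability, return the observed
    frequency of the event among the days with that forecast."""
    buckets = defaultdict(list)
    for p, happened in zip(forecasts, outcomes):
        buckets[p].append(happened)
    return {p: sum(days) / len(days) for p, days in sorted(buckets.items())}

# Made-up records: five "20% chance of rain" days, five "80%" days.
forecasts = [0.2] * 5 + [0.8] * 5
outcomes = [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0]  # 1 = it rained
print(calibration_table(forecasts, outcomes))  # {0.2: 0.2, 0.8: 0.8}
```

A well-calibrated forecaster's table is close to the identity: it rains on about 20% of the "20%" days and about 80% of the "80%" days.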
I’m wondering what other examples in society are where we need to think about uncertainty and why they’re important. Allen: Yep. Well, a big one … Anything that’s related to health and safety, those are all cases where we’re talking about risks, we’re talking about interventions that have certain probabilities of good outcomes, certain probabilities of side effects, and those are other cases, I think, where sometimes our heuristics are good, and other times we make really consistent cognitive errors. Hugo: There are a lot of cognitive biases, and one that I fall prey to constantly is, I’m not even sure what it’s called, but it’s when you have a small sample size, and I see something occur several times, I’m like, “Oh, that’s probably the way things work.” Allen: Right. Yeah. I guess that’s a form of over-fitting. In statistics, there’s sort of a joke that people talk about the law of small numbers, but that’s right. I think that’s a version of jumping to conclusions. That’s an example where I think doctors have had a version of that in the past, which is they make decisions often about treatment that are based on their own patients, so, “Such-and-such a drug has worked well for my patients, and I’ve seen bad outcomes with my patients,” as contrasted with using large randomized trials, which we’ve got a lot of evidence now that randomized trials are a more reliable form of evidence than the example that you gave of generalizing from small numbers. Hugo: So, health and safety, as you said, are two relevant examples. What can we do to combat this, do you think? Allen: That one’s tough. I’m thinking about some of the ways that we get health wrong, some of the ways that we get safety. Certainly, one of the problems is that we’re very bad at small risks, small probabilities. 
There’s some evidence that we can do a little bit better if we express things in terms of natural frequencies, so if I tell you that something has a .01% probability, you might have a really hard time making sense of that, but if I tell you that it’s something like one person out of 10,000, then you might have a way to picture that. You could say, “Well, okay. At a baseball game, there might be 30,000 people, so there could be three people here right now who have such-and-such a condition.” So, I think expressing things in terms of natural frequencies might be one thing that helps. Hugo: Interesting. So, essentially, these are, I suppose, linguistic technologies and adopting things that we know work in language. Allen: Yeah, I think so. I think graphical visualizations are important, too. Certainly, we have this incredibly powerful tool, which is our vision system, that’s able to take a huge amount of data and process it quickly, so that’s, I think, one of the best ways to get information off a page and into someone’s brain. Hugo: Yeah. Look, this actually just reminded me of something I haven’t thought about in years, but it must’ve been 10 or 15 years ago, I was at an art show in Melbourne, Australia, and there was an artwork which was visualizing how many people had been in certain situations or done certain things using grains of rice. So, they had a bowl for, like, the total population of Australia, the total population of the US, and then the number of people who were killed during the Holocaust and the number of people who’ve stepped on the moon, and that type of stuff, and it was actually incredibly vivid and memorable, and you got a strong sense of magnitude there. Allen: Yes. I think that works.
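The natural-frequency translation described above, from a raw probability like .01% to "one person out of 10,000" in a crowd, is mechanical; a small helper along these lines (the function name is ours, the 30,000-person ballpark comes from the example):

```python
def natural_frequency(p, crowd=30_000):
    """Re-express a probability as 'about 1 in N', plus the expected
    count of affected people in a crowd of the given size."""
    one_in = round(1 / p)
    expected = p * crowd
    return f"about 1 in {one_in:,}, or roughly {expected:g} people in a crowd of {crowd:,}"

print(natural_frequency(0.0001))  # .01% -> about 1 in 10,000, ~3 people at the ballgame
```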
There’s a video I saw, we’ll have to find this and maybe put in a link, about war casualties and showing a little individual person for each casualty, but then adding it up and showing colored rectangles of different casualties in different wars, the number of people from each country, and that was very effective, and then I’m reminded that XKCD has done several really nice examples to show the relative sizes of things, just by mapping them onto area on the page. One of the ones that I think is really good is different doses of radioactivity, where he was able to show many different orders of magnitude by starting with a small unit that was represented by a single square, and then scaling it up, and then scaling it up, so that you could see that there are orders of magnitude between things like dental x-rays that we really should not be worrying about, and other kinds of exposure that are actual health risks. ## Uncertainty Misconceptions Hugo: Incredible. So, what are the most important misconceptions regarding uncertainty that you think we, as data-oriented educators, need to correct? Allen: Right. Well, we talked about probabilistic predictions. I think that’s a big one. I think the other big one that I think about is the shapes of distributions, that when you try to summarize a distribution, if I just tell you the mean, then people generally assume that it’s something like a bell-shaped curve, and we have some intuition for what that’s like, that if I tell you that the average human being is about 165 centimeters tall, or I think it’s more than that, but anyway, you get a sense of, “Okay. So, probably there are some people who are over 200, and probably there are some people who are less than 60, but there probably isn’t anybody who is a kilometer tall.” We have a sense of that distribution.
Allen: But then you get things like the Pareto distribution, and this is one of the examples I use in my book, what I call Pareto World, which is the same as our world, because the average height is about the same, but the distribution is shaped like a Pareto distribution, which is one of these crazy long-tailed distributions, and in Pareto World, the average height is between one and two meters, but the vast majority of people are only a centimeter tall, and if you have seven billion people in Pareto World, the tallest one is probably a hundred kilometers tall. ### Pareto Distributions Hugo: That’s incredible, and just quickly, what type of phenomena are Pareto distributions known to model? Allen: Right. Well, I think wealth and income are two of the big ones. In fact, I think that’s the original domain where Pareto was looking at these long-tailed distributions, and that’s the case where a few people have almost all of the wealth, and the vast majority of people have almost none. So, that’s a case where if I tell you the mean and you are imagining a bell-shaped distribution, you have totally the wrong picture of what’s going on. The mean is really not telling you what a typical person has. In fact, there may be no typical person. Hugo: Absolutely, and in fact, that’s a great example. Another example is if you have a bimodal distribution with nothing in the middle, then at the mean there could actually be no one with that particular quantity of whatever we’re talking about. Allen: Yeah, that’s a good example. Hugo: So Allen, when you were discussing the Pareto distribution and the normal distribution, something really struck me: as stakeholders and decision makers and research scientists and data scientists, we seem to be more comfortable in thinking about summary statistics and concrete numbers instead of distributions. So what I mean by that is, we like to report the mean, the mode, the median and measures of spread such as the variance.
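A Pareto World of this kind is easy to simulate. The parameters below (a 1 cm minimum height and a shape parameter of 1.2) are illustrative choices, not necessarily the book's exact numbers, but they show the same qualitative effect: a typical person a couple of centimeters tall, a mean pulled far above the median, and a wildly tall maximum.

```python
import numpy as np

rng = np.random.default_rng(42)
xm, alpha = 0.01, 1.2        # minimum height 1 cm, heavy-tailed shape (illustrative)
n = 1_000_000                # a scaled-down population

# Classic Pareto samples: xm * (1 + standard Pareto variate)
heights = xm * (1 + rng.pareto(alpha, size=n))

print(np.median(heights))    # the typical person is a couple of centimeters tall
print(heights.mean())        # the mean is dragged far above the median by the tail
print(heights.max())         # the tallest person towers over everyone else
```

Reporting only the mean of `heights` would give a completely misleading picture of this population, which is the point of the example.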
And there seems to be some sort of discomfort we feel, and we’re not great at thinking about distributions, which seem kind of necessary to quantify and think about uncertainty. Allen: No, I think that’s right. It doesn’t come naturally. You know, I work with students. It takes a while to just understand the idea of what a distribution is. But I think it’s important because it captures all of the information that you have about a prediction. You want to know all possible outcomes, and the probability for each possible outcome. That’s what a distribution is. It captures exactly the information that you need as a decision maker. Hugo: Exactly. So, I mean, instead of communicating, for example, P-values in hypothesis testing, we can actually show the distribution of the possible effect sizes, right? Allen: Right, and this is the strength of Bayesian methods, because what you’ve got is a posterior distribution that captures this information. And if you now feed that into a decision making process, it answers all the questions that you might want to ask. If you only care about the central tendency you can get that, but very often there’s a cost function that says, you know, if this value turns out to be very high, there’s a cost associated with that. If it’s low, there’s a cost associated with that. So if you’ve got the whole distribution, you can feed that into a cost benefit analysis and make better decisions. Hugo: Absolutely. And I love the point that you made, which I think about a lot of the time, and when I teach Bayesian thinking and Bayesian inference, I make this incredibly explicit all the time, that from the posterior, from the distribution, you can get out so many of the other things that you need and you would want to report. Allen: Right, so maybe you care, you know, what’s the probability of a given catastrophic outcome. So, in that case you would be looking at, you know, the tails of that distribution.
Or something like, you know, what’s the probability that I’ll be off by a certain amount or again, you know, things like the mean and the spread. Whatever the number is, you can get it from the distribution. ## What technologies are best suited for thinking and communicating around uncertainty? Hugo: Absolutely. And this is actually … this leads to another question which I wanted to talk about. Bayesian inference I think of in a number of ways, as a technology that we’ve developed to deal with these types of questions and concepts. I think also we have reached a point in the past decades where Bayesian inference now, because of computational power we have, is actually far more feasible to do in a robust and efficient manner. And I think we may get to that in a bit. But I’m wondering in general, so what technologies, to your mind, are best suited for thinking and communicating around uncertainty, Allen? Allen: Well, you know, a couple of the visualizations that people use all the time, and of course, you know, the classic one is a histogram. And that one, I think, is most appropriate for a general audience. Most people understand histograms. Violin plots are kinda similar, that’s just two histograms back-to-back. And I think those are good because people understand them, but problematic. I mean, I’ve seen a number of articles of people pointing out that you kinda have to get histograms right. If the bin size is too big, then you’re smoothing away a lot of information that you might care about. If the bin size is too small, you’re getting a lot of noise and it can be hard to see the shape of the distribution through the noise. Allen: So, one of the things I advocate for is using CDFs instead of histograms, or PDFs, as the default visualization. And when I’m exploring a data set, I’m almost always looking at CDFs because you get the best view of the shape of the distribution, you can see modes, you can see central tendencies, you can see spread. 
But also if you’ve got weird outliers, they jump out, and if you’ve got repeated values, you can see those clearly in a CDF, with less visual noise that distracts you from the important stuff. So I love CDFs. The only problem is that people don’t understand them. But I think this is another case where the audience is getting educated, that the more people are consuming data journalism, the more they’re seeing visualizations like this. And there’s some implicit learning that’s going on. Allen: I saw one example very recently, someone showing the altitude that human populations live at. ‘Cause they were talking about sea levels rising and talking about the fraction of people who live less than four meters above sea level. But the visualization was kind of a sneaky CDF, they showed what was actually a CDF sideways. But it was done in a way where a person who doesn’t necessarily have technical training would be able to figure out what that graph was showing. So I think that’s a step in a good direction. Hugo: I like that a lot. And just to clarify, a CDF is a cumulative distribution function? Allen: Yes. Sorry, I should’ve said that. Hugo: Yeah. Allen: And in particular I’m talking about empirical CDFs, where you’re just taking it straight from data and generating the cumulative distribution function. Hugo: Fantastic. And one of the nice things there, for each point on the x-axis, the y value will correspond to the fraction of data points less than or equal to that particular point. And one of the great things is, you can also read off all your percentiles, right? Allen: Exactly, right. You can read it in both directions. So, if you start on the y-axis, you can pick the percentile you want, like the median, the 50th percentile. And then read off the corresponding x value. Or, the flip side is exactly what you said. If you want to know what fraction of the values are below a certain threshold, then you just read off that threshold and get the corresponding y-value. Hugo: Yeah.
And one of the other things that I love, you mentioned a bunch of, well, several very attractive characteristics of empirical CDFs, ECDFs. I also love that you can plot, you know, your control and a lot of different experiments just on the same figure and actually see how they differ, as opposed to when you try to plot a bunch of histograms together, you gotta do wacky transparencies and all this stuff, right? Allen: Yes, that’s exactly right. And you can stack lots of CDFs on the same axes, and the differences that you see are really the differences that matter. When you compare histograms, you’re seeing a lot of noise and you can see differences between histograms that are just random. When you’re looking at CDFs, you get a pretty robust view of what the differences are and where in the distribution those differences happen. Hugo: Yeah. Fantastic. Look, I’m very excited for a day in which the general populace appreciates CDFs and they appear in the mainstream media. I think that’s a bright future. Allen: Yeah, and I think we’re close. I’ve seen one example, there have got to be more. Hugo: Are there any other technologies or ways of thinking about uncertainty that you think are useful? Allen: Well, we talked a little bit about visualizing simulations, I think that matters. There’s one example, getting back to the 2016 election: I think one of the issues that came up is that a lot of the predictions, when they showed you a map of the different states, they were showing a color scale where there would be a red state and a blue state, but also pink and light blue and purple. And they were trying to show uncertainty using that color map, but, you know, that’s not how the electoral college works. The electoral college, every state is either all red or all blue, with just a couple of exceptions.
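An empirical CDF like the ones discussed above takes only a couple of lines; this sketch uses NumPy and made-up height data:

```python
import numpy as np

def ecdf(sample):
    """Sorted values and cumulative fractions: the empirical CDF."""
    xs = np.sort(sample)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

rng = np.random.default_rng(0)
data = rng.normal(170, 10, size=1000)      # pretend: heights in cm
xs, ys = ecdf(data)

# Read one direction: the x where the curve crosses y = 0.5 is the median.
median = xs[np.searchsorted(ys, 0.5)]

# Read the other direction: what fraction of the values fall below 180 cm?
frac_below_180 = ys[np.searchsorted(xs, 180) - 1]
print(median, frac_below_180)
```

Several ECDFs can be drawn on one set of axes, one `plot(xs, ys)` call per group, which is the stack-them-for-comparison trick mentioned above, with no bin sizes or transparencies to fiddle with.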
So that was a case where the predictions ended up looking very different from what the final results looked like, and I think that’s part of why we were uncomfortable with the predictions and the results. Hugo: Interesting. So what is a fix for that, do you think? Allen: Well, again coming back to my suggestion about, you know, don’t try to show me all possible simulation outcomes, but show me one simulation per day. And in that case, the result that you show me, the daily result, would be all red or all blue. So, the predictions in that sense would look exactly like the outcome. And then when you see the outcome, the chances are that it’s gonna resemble at least one of the predictions that you made. Hugo: Great. Now I just had kind of a future flash, a brainwave into a future where we can use virtual reality technologies to drop people into potential simulations. But that’s definitely future music. Allen: Yes. I think that’s interesting. ## What does the future of data science look like to you? Hugo: Yeah. So speaking of the future, we’ve talked a lot about modern data science and uncertainty. I’m wondering what the future of data science looks like to you? Allen: I think a big part of it looks like more people being involved. So not just highly trained technical statisticians, but, as we’ve been discussing, data journalists, for example, people who have the technical skill to look at data, but also the storytelling skill to ask interesting questions, get answers, and then communicate those answers. I’d love to see all of that become more a part of general education, starting in primary school and continuing in secondary school: working with data, working with some of these visualizations we’ve been talking about. Using data to answer questions. Using data to explore and find out about the world, you know, at the stage that’s appropriate at different levels of education.
Allen: There’s a lot of talk about trying to get maybe less calculus in the world and more data science, and I think that’s gotta be the direction we go. If you look at what people really need to know and what they’re likely to use, practically everybody is going to be a consumer of data science and I think more and more people are gonna be producers of data science. So I think that’s gotta be part of a core education. And calculus, I love calculus. But, it’s just not as important for as many people. Hugo: Yeah. And arguably, coming from your engineering background, you’d agree calculus is incredibly important for engineers and physicists, but for other people who need to be quantitative, I think your point is very strong that learning how to actually work with data, and the statistics around that, is arguably a lot more essential. Allen: Yeah. I think, as I said, more and more people are gonna be doing at least some kind of data science where they’re taking advantage of all of the data now that’s freely available, and that’s, you know, government agencies are producing huge volumes of data and often they don’t have the resources to really do anything with it. They’ve got a mandate to produce the data, but they don’t have the people to do that. But the flip side of that is there’s a huge opportunity for anyone with basic data skills to get in there and find interesting things. Often, you’re one of the first people to explore a data set, you know, if you jump in there on the day it’s published, you can find all kinds of things, not necessarily using, you know, powerful or complex statistical methods, just basic exploratory data analysis. Hugo: Yeah, and the ability now to get, you know, learners, students, people in education institutions, involved in data science by letting them realize that it’s relevant to them, that there’s data about their lives or about their physiological systems that they can analyze and explore, I think, is a huge win. Allen: It is.
It’s really empowering, and this is one of the reasons that I … I call myself a data optimist. And what I mean by that is I think there are huge opportunities here to use data science for social good. Getting into these data sets, as you said, they are relevant to people’s lives. You can find things. I saw a great example at a conference recently, I was talking to a young guy from Brazil, who had worked on an application that was going through government data that was available online and flagging evidence of corruption, evidence of budgets that were being misspent. And they would tweet about it. There was just a robot that would find suspicious things in these accounts, and tweet them out there, which is, you know, the kind of transparency that I think makes governments better. So I think there’s a lot of potential there. Hugo: That’s incredible. Actually, that reminded me. I met a lawyer a while ago who was non-technical and non-computational, but he was learning a bit of machine learning, a bit of Python. He was trying to figure out whether you could predict judgements handed down by the Supreme Court based on previous judgements, and who would vote in a particular way. And that’s just because that’s something that really interests him professionally and in terms of social justice, as well. Allen: Right. And I think, you know, people who are not necessarily experts in that field, but amateurs for lack of a better word, can get in there and really do useful work. I think, you know, there are a lot of concerns, too. And this is getting a lot of attention right now, I’m actually in the middle of reading Weapons of Math Destruction, Cathy O’Neil’s book. And there are a lot of concerns and I think there are things that are scary that we should be thinking about, but one of the things I’m actually thinking about now and trying to figure out is, how do we balance this discussion?
‘Cause I think we’re having, or at least starting, a good public discussion about this. It’s good to get the problems on the table and address them, but how do we get the right balance between the optimism that I think is appropriate and the concerns that we should be dealing with? Hugo: Yeah, absolutely. And as you say, there are more and more books being published, more and more conversations happening in public. I mean, just in the past several weeks, Mike Loukides, Hilary Mason, and DJ Patil have posted their series of articles on data ethics and what they would like to see adopted in culture and in tech, among other places. I do think Weapons of Math Destruction is very interesting as part of this conversation, because of course one of the key parts of Cathy O’Neil’s definition of a Weapon of Math Destruction is that it’s not transparent, right? So all the cases we’re talking about kind of involve necessary transparency, so if we see more of that going forward, we’ll at least be able to have a conversation around it. Allen: Right, and I agree with both O’Neil and with you. I think that’s a crucial part of these algorithms and, you know, open science and reproducible science is based on transparency and open data, and you know, also open code and open methodology. Hugo: Absolutely. And this actually brings me to another question; a through line here is the ability of everybody, every citizen, to interact with data science in some sense. And I’m wondering for you in your practice, and as a data scientist and an educator, what is the role of open source in the ability of everybody to interact with data science? Allen: Right, I think it’s huge. You know, reproducible science doesn’t work if your code is proprietary. If you, you know, if you only share your data but not your methods, that only goes so far.
It also doesn’t help very much if I publish my code but it’s in a language that’s not accessible to everybody, you know, languages that are very expensive to get your hands on. Even among relatively affluent countries, you’re not necessarily gonna have access to that code. And then when you go worldwide, there are, you know, a great majority of people in the world that are not gonna have access to that, as contrasted with languages like R and Python that are freely available. Now, you still have to have access to technology, and that’s not universal, but it’s better, and I think free software is an important part of that. Hugo: Yeah. Allen: This is, you know, part of the reason that I put my books up under free licenses is I know that there are a lot of people in the world who are not gonna buy hard copies of these books, but I want to make them available, and I do, you know, I get a lot of correspondence from people who are using my books in electronic form, who would not have access to them in hard copy. ## Favorite Data Science Technique Hugo: So, Allen, we’ve talked about a bunch of techniques that are dear to your heart. I’m wondering what one of your favorite data science-y techniques or methodologies is. Allen: Right. I have a lot. Hugo: Let’s do it. Allen: This might not be a short list. Hugo: Sure. Allen: So I am at heart a Bayesian. I do a certain amount of computational inference, you know, as you do in classical statistical inference, but I’m really interested in helping Bayesian methods spread. And I think one of the challenges there is just understanding the ideas. It’s one of these ideas that seems hard when you first encounter it, and then at some point there’s a breakthrough, and then it seems obvious. Once you’ve got it, it is such a beautiful simple idea that it changes how you see everything. So that’s what I want to help readers get to, and my students, is get that transition from the initial confusion into that moment of clarity.
Allen: One of the methods I use for that, and this is what I use in Think Bayes a lot, is just grid algorithms, where you take everything that’s continuous and break it up into discrete chunks, and then all the integrals become for loops, and I think it makes the ideas very clear. And then I think the other part of it that’s important is the algorithms, particularly MCMC algorithms, which are what make Bayesian methods practical for substantial problems. You mentioned earlier that, you know, the computational power has become available. And that’s a big part of what makes Bayes practical. But I think the algorithms are just as important, particularly when you start to get up into higher dimensions. It’s just not feasible without modern algorithms that are really quite new, developed in the last decade or so. Hugo: Yeah. And I just want to speak to the idea of grid methods and, as you said, turning integrals into for loops. I think this is something which has actually been behind a lot of what we’ve been discussing, and something that actually attracted me to your pedagogy and all of your work initially: this idea of turning math into computation. And we see the same with techniques such as the bootstrap and resampling: taking concepts that seem relatively abstract and seeing how they actually play out in a computational structure, and making that translational step there. Allen: Right. Yeah, I’ve found that very powerful for me as a learner. I’ve had that experience over and over, of reading something expressed using mathematical concepts, and then I turn it into code and I feel like that’s how I get to understand it. Partly because you get to see it happening; often it’s very visual in a way that the math is not, at least for me. But the other is that it’s debuggable. If you have a misunderstanding, then when you try to represent it in code, you’re gonna see evidence of the misunderstanding. 
It’s gonna pop up as a bug. So, when you’re debugging your code, you’re also debugging your understanding. Which, for me, builds confidence: when I’ve got working code, it also makes me believe that I understand the thing. Hugo: Absolutely, and a related concept is the idea that breaking it down into chunks of code allows you to understand smaller concepts and build up the entire concept in smaller steps. Allen: Right, yeah. I think that’s a good point, too. Hugo: Great. So, are there any other favorite techniques? You can have one or two more if you’d like. Allen: I’ll mention one, which is survival analysis, partly because it doesn’t come up in an introductory class most of the time, but it’s something I keep coming back to. I’ve used it for several projects, not necessarily looking at survival or medicine, but things like a study I did of marriage patterns: how long it is until someone has a first child, or gets married for the first time, or how long the marriage itself lasts until a divorce. So, as I say, it’s not an idea that everybody sees, but once you learn it, you start seeing a lot of applications for it. Hugo: Absolutely. And this did make it into your Think Stats book, do I recall correctly? Allen: Yes. Yeah, I’ve got a section on survival analysis. ## Call to Action Hugo: Yeah, fantastic. So I’ll definitely link to that in the show notes, as well. So, my last question is, do you have a call to action for our listeners out there? Allen: Maybe two. I think if you have not yet had a chance to study data science, you should. And I think there are a lot of great resources that are available now that just weren’t around not too long ago. And especially if you took a statistics class in high school or college, and it did not connect with you, the problem is not necessarily you. The standard curriculum in statistics for a long time I think has just not been right for most people. 
I think it’s just spent way too much time on esoteric hypothesis tests. It gets bogged down in some statistical philosophy that’s actually not very good philosophy, and not very good science. Allen: If you come back to it now from a data science point of view, it’s much more likely that you’re gonna find classes and educational resources that are much more relevant. They’re gonna be based on data. They’re gonna be much more compelling. So give it another shot. I think that’s my first call to action. Hugo: I would second that. Allen: And then the other is, for people who have got data science skills, there are a lot of ways to use them to do social good in the world. I think a lot of data scientists end up doing, you know, quantitative finance and business analytics; those are kinda the two big application domains. And there’s nothing wrong with that, but I also think there are a lot of ways to use the skills that you’ve got to do something good, to, you know, find stories about what’s happening and get those stories out. To, you know, use those stories as a way to effect change. Or if nothing else, just to answer questions about the world. If there’s something that interests you, very often you can find data and answer questions. Hugo: And there are a lot of very interesting data-for-social-good programs out there, and we’ve actually had Peter Bull on the podcast to talk about data for good in general, and I’ll put some links in the show notes as well. Allen: Yes, and then I’ve got actually a talk that I want to link to that I’ve done a couple of times, and it’s called Data Science, Data Optimism. And the last part of the talk is my call for data science for social good. I’ve got a bunch of links there that I’ve collected, that are just really the people that I know and groups that I know who are working in this area, but it’s not complete by any means. So I would love to hear more from people, and maybe get some help expanding my list. 
Hugo: Fantastic. And people can reach out to you on Twitter, as well? Is that right? Allen: Yes. I’m Allen Downey. Hugo: Fantastic. Allen, it’s been an absolute pleasure having you on the show. Allen: Thank you very much. It’s been great talking with you. To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. Continue Reading… ### Diversity in Data Science: Overview and Strategy We take a hard look at diversity within the tech industry, root causes, and potential solutions and highlight resources/initiatives that can connect readers with programs aiding their professional development. Continue Reading… ### Applications of R presented at EARL London 2018 During the EARL (Enterprise Applications of the R Language) conference in London last week, the organizers asked me how I thought the conference had changed over the years. (This is the conference's fifth year, and I'd been to each one.) My response was that it reflected the increasing maturity of R in the enterprise. The early years featured many presentations that were about using R in research and the challenges (both technical and procedural) for integrating that research into the day-to-day processes of the business. This year, though, just about every presentation was about R in production, as a mainstream part of the operational infrastructure for analytics. That theme began in earnest with Garrett Grolemund's keynote presentation on the consequences of scientific research that can't be replicated independently. (This slide, based on this 2016 JAMA paper, was an eye-opener for me.) 
The R language has been at the forefront of providing the necessary tools and infrastructure to remove barriers to reproducible research in science, and the RMarkdown package, with its streamlined integration in RStudio, is particularly helpful. Garrett's keynote was recorded but the video isn't yet available; when it's published I highly recommend taking the time to watch this excellent presentation. The rest of the program was jam-packed with a variety of applications of R in industry as well. I couldn't see them all (it was a three-track conference), but every one I attended demonstrated mature, production-scale applications of R to solve difficult business problems with data. Here are just a few examples: • Rainmakers, a market research firm, uses R to estimate the potential market for new products. They make use of the officer package to automate the generation of PowerPoint reports. • Geolytix uses R to help companies choose locations for new stores that maximize profitability while reducing the risk of cannibalizing sales from other nearby locations. They use SQL Server ML Services to deploy these models to their clients. • Dyson uses R (and the prophet package) to forecast the expected sales of new models of vacuum cleaners, hand dryers, and other products, so that the manufacturing plants can ramp up (or down) production as needed. • Google uses R to design and analyze the results of customer surveys, to make the best decisions of which product features to invest in next. • N Brown Group, a fashion retailer, uses R to analyze online product reviews from customers. • Marks and Spencer uses R to increase revenues by optimizing the products shown and featured in the online store. • PartnerRe, the reinsurance firm, has built an analytics team around R, and uses Shiny to deploy R-based applications throughout the company. 
• Amazon Web Services uses containerized applications with R to identify customers who need additional onboarding assistance or who may be dissatisfied, and to detect fraud. • Microsoft uses R and Spark (via sparklyr in HDInsight) to support marketing efforts for Xbox, Windows and Surface with propensity modeling, to identify who is most likely to respond to an offer. You can see many more examples at the list of speakers linked below. (I blogged about my own talk at EARL earlier this week.) Click through to see detailed summaries and (in most cases) a link to download slides. EARL London 2018: Speakers Continue Reading… ### OpenCV Face Recognition In this tutorial, you will learn how to use OpenCV to perform face recognition. To build our face recognition system, we’ll first perform face detection, extract face embeddings from each face using deep learning, train a face recognition model on the embeddings, and then finally recognize faces in both images and video streams with OpenCV. Today’s tutorial is also a special gift for my fiancée, Trisha (who is now officially my wife). Our wedding was over the weekend, and by the time you’re reading this blog post, we’ll be at the airport preparing to board our flight for the honeymoon. To celebrate the occasion, and show her how much her support of me, the PyImageSearch blog, and the PyImageSearch community means to me, I decided to use OpenCV to perform face recognition on a dataset of our faces. You can swap in your own dataset of faces of course! All you need to do is follow my directory structure and insert your own face images. As a bonus, I’ve also included how to label “unknown” faces that cannot be classified with sufficient confidence. To learn how to perform OpenCV face recognition, just keep reading! Looking for the source code to this post? Jump right to the downloads section. ## OpenCV Face Recognition In today’s tutorial, you will learn how to perform face recognition using the OpenCV library. 
You might be wondering how this tutorial is different from the one I wrote a few months back on face recognition with dlib. Well, keep in mind that the dlib face recognition post relied on two important external libraries: 1. dlib (obviously) 2. face_recognition (which is an easy-to-use set of face recognition utilities that wraps around dlib) While we used OpenCV to facilitate face recognition, OpenCV itself was not responsible for identifying faces. In today’s tutorial, we’ll learn how we can apply deep learning and OpenCV together (with no libraries other than scikit-learn) to: 1. Detect faces 2. Compute 128-d face embeddings to quantify a face 3. Train a Support Vector Machine (SVM) on top of the embeddings 4. Recognize faces in images and video streams All of these tasks will be accomplished with OpenCV, enabling us to obtain a “pure” OpenCV face recognition pipeline. ### How OpenCV’s face recognition works Figure 1: An overview of the OpenCV face recognition pipeline. The key step is a CNN feature extractor that generates 128-d facial embeddings. (source) In order to build our OpenCV face recognition pipeline, we’ll be applying deep learning in two key steps: 1. To apply face detection, which detects the presence and location of a face in an image, but does not identify it 2. To extract the 128-d feature vectors (called “embeddings”) that quantify each face in an image I’ve discussed how OpenCV’s face detection works previously, so please refer to it if you have not detected faces before. The model responsible for actually quantifying each face in an image is from the OpenFace project, a Python and Torch implementation of face recognition with deep learning. This implementation comes from Schroff et al.’s 2015 CVPR publication, FaceNet: A Unified Embedding for Face Recognition and Clustering. Reviewing the entire FaceNet implementation is outside the scope of this tutorial, but the gist of the pipeline can be seen in Figure 1 above. 
First, we input an image or video frame to our face recognition pipeline. Given the input image, we apply face detection to detect the location of a face in the image. Optionally we can compute facial landmarks, enabling us to preprocess and align the face. Face alignment, as the name suggests, is the process of (1) identifying the geometric structure of the faces and (2) attempting to obtain a canonical alignment of the face based on translation, rotation, and scale. While optional, face alignment has been demonstrated to increase face recognition accuracy in some pipelines. After we’ve (optionally) applied face alignment and cropping, we pass the input face through our deep neural network: Figure 2: How the deep learning face recognition model computes the face embedding. The FaceNet deep learning model computes a 128-d embedding that quantifies the face itself. But how does the network actually compute the face embedding? The answer lies in the training process itself, including: 1. The input data to the network 2. The triplet loss function To train a face recognition model with deep learning, each input batch of data includes three images: 1. The anchor 2. The positive image 3. The negative image The anchor is our current face and has identity A. The second image is our positive image — this image also contains a face of person A. The negative image, on the other hand, does not have the same identity, and could belong to person B, C, or even Y! The point is that the anchor and positive image both belong to the same person/face while the negative image does not contain the same face. The neural network computes the 128-d embeddings for each face and then tweaks the weights of the network (via the triplet loss function) such that: 1. The 128-d embeddings of the anchor and positive image lie closer together 2. 
While at the same time, pushing the embeddings for the negative image farther away In this manner, the network is able to learn to quantify faces and return highly robust and discriminating embeddings suitable for face recognition. And furthermore, we can actually reuse the OpenFace model for our own applications without having to explicitly train it! Even though the deep learning model we’re using today has (very likely) never seen the faces we’re about to pass through it, the model will still be able to compute embeddings for each face — ideally, these face embeddings will be sufficiently different such that we can train a “standard” machine learning classifier (SVM, SGD classifier, Random Forest, etc.) on top of the face embeddings, and therefore obtain our OpenCV face recognition pipeline. If you are interested in learning more about the details surrounding triplet loss and how it can be used to train a face embedding model, be sure to refer to my previous blog post as well as the Schroff et al. publication. ### Our face recognition dataset Figure 3: A small example face dataset for face recognition with OpenCV. The dataset we are using today contains three people: • Myself • Trisha (my wife) • “Unknown”, which is used to represent faces of people we do not know and wish to label as such (here I just sampled faces from the movie Jurassic Park which I used in a previous post — you may want to insert your own “unknown” dataset). As I mentioned in the introduction to today’s face recognition post, I was just married over the weekend, so this post is a “gift” to my new wife. Each class contains a total of six images. If you are building your own face recognition dataset, ideally, I would suggest having 10-20 images per person you wish to recognize — be sure to refer to the “Drawbacks, limitations, and how to obtain higher face recognition accuracy” section of this blog post for more details. 
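To make the triplet loss described above concrete, here is a minimal NumPy sketch of the objective (an illustration only; this is not the FaceNet training code, and the margin value and toy 128-d embeddings are assumptions):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # squared L2 distances between the embeddings
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    # hinge: the positive must be closer than the negative by at
    # least the margin alpha, otherwise the triplet incurs a penalty
    return max(pos_dist - neg_dist + alpha, 0.0)

rng = np.random.default_rng(1)
anchor = rng.normal(size=128)                    # a face of person A
positive = anchor + 0.01 * rng.normal(size=128)  # another image of person A
negative = rng.normal(size=128)                  # a face of someone else

print(triplet_loss(anchor, positive, negative))  # → 0.0 (triplet already satisfied)
print(triplet_loss(anchor, negative, positive) > 0)  # → True (would drive an update)
```

During training, minimizing this penalty over many triplets is what pulls embeddings of the same person together and pushes different people apart; at inference time only the embedding network is kept.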
### Project structure

Once you’ve grabbed the zip from the “Downloads” section of this post, go ahead and unzip the archive and navigate into the directory. From there, you may use the tree command to have the directory structure printed in your terminal:

$ tree --dirsfirst
.
├── dataset
│   ├── adrian [6 images]
│   ├── trisha [6 images]
│   └── unknown [6 images]
├── images
│   ├── patrick_bateman.jpg
├── face_detection_model
│   ├── deploy.prototxt
│   └── res10_300x300_ssd_iter_140000.caffemodel
├── output
│   ├── embeddings.pickle
│   ├── le.pickle
│   └── recognizer.pickle
├── extract_embeddings.py
├── openface_nn4.small2.v1.t7
├── train_model.py
├── recognize.py
└── recognize_video.py

7 directories, 31 files

There are quite a few moving parts for this project — take the time now to carefully read this section so you become familiar with all the files in today’s project.

Our project has four directories in the root folder:

• dataset/
: Contains our face images organized into subfolders by name.
• images/
: Contains three test images that we’ll use to verify the operation of our model.
• face_detection_model/
: Contains a pre-trained Caffe deep learning model provided by OpenCV to detect faces. This model detects and localizes faces in an image.
• output/
: Contains my output pickle files. If you’re working with your own dataset, you can store your output files here as well. The output files include:
• embeddings.pickle
: A serialized facial embeddings file. Embeddings have been computed for every face in the dataset and are stored in this file.
• le.pickle
: Our label encoder. Contains the name labels for the people that our model can recognize.
• recognizer.pickle
: Our Linear Support Vector Machine (SVM) model. This is a machine learning model rather than a deep learning model and it is responsible for actually recognizing faces.

Let’s summarize the five files in the root directory:

• extract_embeddings.py
: We’ll review this file in Step #1 which is responsible for using a deep learning feature extractor to generate a 128-D vector describing a face. All faces in our dataset will be passed through the neural network to generate embeddings.
• openface_nn4.small2.v1.t7
: A Torch deep learning model which produces the 128-D facial embeddings. We’ll be using this deep learning model in Steps #1, #2, and #3 as well as the Bonus section.
• train_model.py
: Our Linear SVM model will be trained by this script in Step #2. We’ll detect faces, extract embeddings, and fit our SVM model to the embeddings data.
• recognize.py
: In Step #3, we’ll recognize faces in images. We’ll detect faces, extract embeddings, and query our SVM model to determine who is in an image. We’ll draw boxes around faces and annotate each box with a name.
• recognize_video.py
: Our Bonus section describes how to recognize who is in frames of a video stream just as we did in Step #3 on static images.

Let’s move on to the first step!

### Step #1: Extract embeddings from face dataset

Now that we understand how face recognition works and reviewed our project structure, let’s get started building our OpenCV face recognition pipeline.

Open up the extract_embeddings.py file and insert the following code:

# import the necessary packages
from imutils import paths
import numpy as np
import argparse
import imutils
import pickle
import cv2
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--dataset", required=True,
	help="path to input directory of faces + images")
ap.add_argument("-e", "--embeddings", required=True,
	help="path to output serialized db of facial embeddings")
ap.add_argument("-d", "--detector", required=True,
	help="path to OpenCV's deep learning face detector")
ap.add_argument("-m", "--embedding-model", required=True,
	help="path to OpenCV's deep learning face embedding model")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
args = vars(ap.parse_args())

We import our required packages on Lines 2-8. You’ll need to have OpenCV and imutils installed. To install OpenCV, simply follow one of my guides (I recommend OpenCV 3.4.2, so be sure to download the right version while you follow along). My imutils package can be installed with pip:

$ pip install --upgrade imutils

Next, we process our command line arguments:

• --dataset : The path to our input dataset of face images.
• --embeddings : The path to our output embeddings file. Our script will compute face embeddings which we’ll serialize to disk.
• --detector : Path to OpenCV’s Caffe-based deep learning face detector used to actually localize the faces in the images.
• --embedding-model : Path to the OpenCV deep learning Torch embedding model. This model will allow us to extract a 128-D facial embedding vector.
• --confidence : Optional threshold for filtering weak face detections.

Now that we’ve imported our packages and parsed command line arguments, let’s load the face detector and embedder from disk:

# load our serialized face detector from disk
print("[INFO] loading face detector...")
protoPath = os.path.sep.join([args["detector"], "deploy.prototxt"])
modelPath = os.path.sep.join([args["detector"],
	"res10_300x300_ssd_iter_140000.caffemodel"])
detector = cv2.dnn.readNetFromCaffe(protoPath, modelPath)

# load our serialized face embedding model from disk
print("[INFO] loading face recognizer...")
embedder = cv2.dnn.readNetFromTorch(args["embedding_model"])

Here we load the face detector and embedder:

• detector : Loaded via Lines 26-29. We’re using a Caffe-based DL face detector to localize faces in an image.
• embedder : Loaded on Line 33. This model is Torch-based and is responsible for extracting facial embeddings via deep learning feature extraction.

Notice that we’re using the respective cv2.dnn functions to load the two separate models. The dnn module wasn’t made available like this until OpenCV 3.3, but I recommend using OpenCV 3.4.2 or higher for this blog post. 
Moving forward, let’s grab our image paths and perform initializations:

# grab the paths to the input images in our dataset
print("[INFO] quantifying faces...")
imagePaths = list(paths.list_images(args["dataset"]))

# initialize our lists of extracted facial embeddings and
# corresponding people names
knownEmbeddings = []
knownNames = []

# initialize the total number of faces processed
total = 0

The imagePaths list, built on Line 37, contains the path to each image in the dataset. I’ve made this easy via my imutils function, paths.list_images . Our embeddings and corresponding names will be held in two lists: knownEmbeddings and knownNames (Lines 41 and 42). We’ll also be keeping track of how many faces we’ve processed via a variable called total (Line 45).

Let’s begin looping over the image paths — this loop will be responsible for extracting embeddings from faces found in each image:

# loop over the image paths
for (i, imagePath) in enumerate(imagePaths):
	# extract the person name from the image path
	print("[INFO] processing image {}/{}".format(i + 1,
		len(imagePaths)))
	name = imagePath.split(os.path.sep)[-2]

	# load the image, resize it to have a width of 600 pixels (while
	# maintaining the aspect ratio), and then grab the image
	# dimensions
	image = cv2.imread(imagePath)
	image = imutils.resize(image, width=600)
	(h, w) = image.shape[:2]

We begin looping over imagePaths on Line 48. First, we extract the name of the person from the path (Line 52). To explain how this works, consider the following example in my Python shell:

$ python
>>> from imutils import paths
>>> import os
>>> imagePaths = list(paths.list_images("dataset"))
>>> imagePath = imagePaths[0]
>>> imagePath
>>> imagePath.split(os.path.sep)
>>> imagePath.split(os.path.sep)[-2]
>>>

Notice how by using imagePath.split and providing the split character (the OS path separator — "/" on Unix and "\" on Windows), the function produces a list of folder/file names (strings) which walk down the directory tree. We grab the second-to-last index, the person’s name, which in this case is 'adrian'.
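The same split can be demonstrated with a hypothetical path (the folder and file names below are invented for illustration; only 'adrian' comes from our dataset):

```python
import os

# a hypothetical image path, assembled so the example is OS-independent
imagePath = os.path.sep.join(["dataset", "adrian", "00004.jpg"])
parts = imagePath.split(os.path.sep)

print(parts)      # → ['dataset', 'adrian', '00004.jpg']
print(parts[-2])  # → adrian
```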

We then load the image and resize it to a known width (Lines 57 and 58).

Let’s detect and localize faces:

# construct a blob from the image
imageBlob = cv2.dnn.blobFromImage(
	cv2.resize(image, (300, 300)), 1.0, (300, 300),
	(104.0, 177.0, 123.0), swapRB=False, crop=False)

# apply OpenCV's deep learning-based face detector to localize
# faces in the input image
detector.setInput(imageBlob)
detections = detector.forward()

From there we detect faces in the image by passing the imageBlob through the detector network (Lines 68 and 69).

Let’s process the detections:

# ensure at least one face was found
if len(detections) > 0:
	# we're making the assumption that each image has only ONE
	# face, so find the bounding box with the largest probability
	i = np.argmax(detections[0, 0, :, 2])
	confidence = detections[0, 0, i, 2]

	# ensure that the detection with the largest probability also
	# meets our minimum probability test (thus helping filter out
	# weak detections)
	if confidence > args["confidence"]:
		# compute the (x, y)-coordinates of the bounding box for
		# the face
		box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
		(startX, startY, endX, endY) = box.astype("int")

		# extract the face ROI and grab the ROI dimensions
		face = image[startY:endY, startX:endX]
		(fH, fW) = face.shape[:2]

		# ensure the face width and height are sufficiently large
		if fW < 20 or fH < 20:
			continue

The detections list contains probabilities and coordinates to localize faces in an image.

Assuming we have at least one detection, we’ll proceed into the body of the if-statement (Line 72).

We make the assumption that there is only one face in the image, so we extract the detection with the highest confidence and check to make sure that the confidence meets the minimum probability threshold used to filter out weak detections (Lines 75-81).

Assuming we’ve met that threshold, we extract the face ROI and grab/check dimensions to make sure the face ROI is sufficiently large (Lines 84-93).
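As a quick aside on the box computation: the detector returns coordinates normalized to the range [0, 1], which is why they are multiplied by the image dimensions. A small sketch with made-up values (the image size and detection below are hypothetical):

```python
import numpy as np

(h, w) = (450, 600)  # hypothetical image height and width
# a hypothetical detection, normalized as [startX, startY, endX, endY]
box = np.array([0.25, 0.20, 0.75, 0.90]) * np.array([w, h, w, h])
(startX, startY, endX, endY) = box.astype("int")

print(startX, startY, endX, endY)  # → 150 90 450 405
```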

From there, we’ll take advantage of our embedder CNN and extract the face embeddings:

# construct a blob for the face ROI, then pass the blob
# through our face embedding model to obtain the 128-d
# quantification of the face
faceBlob = cv2.dnn.blobFromImage(face, 1.0 / 255,
	(96, 96), (0, 0, 0), swapRB=True, crop=False)
embedder.setInput(faceBlob)
vec = embedder.forward()

# add the name of the person + corresponding face
# embedding to their respective lists
knownNames.append(name)
knownEmbeddings.append(vec.flatten())
total += 1

We construct another blob, this time from the face ROI (not the whole image as we did before) on Lines 98 and 99.

Subsequently, we pass the faceBlob through the embedder CNN (Lines 100 and 101). This generates a 128-D vector (vec) which describes the face. We’ll leverage this data to recognize new faces via machine learning.
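To build some intuition for why these vectors are useful, here is a toy sketch with random stand-in embeddings (our pipeline trains an SVM on the embeddings rather than comparing raw distances, so this is illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
emb_a1 = rng.normal(size=128)                  # person A, image 1
emb_a2 = emb_a1 + 0.05 * rng.normal(size=128)  # person A, image 2 (near image 1)
emb_b = rng.normal(size=128)                   # person B (unrelated)

same = np.linalg.norm(emb_a1 - emb_a2)  # small: same identity
diff = np.linalg.norm(emb_a1 - emb_b)   # large: different identity
print(same < diff)  # → True
```

A well-trained embedding network produces exactly this geometry for real faces, which is what lets a simple classifier separate identities.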

And then we simply add the name and embedding vec to knownNames and knownEmbeddings, respectively (Lines 105 and 106).

We also can’t forget about the variable we set to track the total number of faces either — we go ahead and increment the value on Line 107.

We continue this process of looping over images, detecting faces, and extracting face embeddings for each and every image in our dataset.

All that’s left when the loop finishes is to dump the data to disk:

# dump the facial embeddings + names to disk
print("[INFO] serializing {} encodings...".format(total))
data = {"embeddings": knownEmbeddings, "names": knownNames}
f = open(args["embeddings"], "wb")
f.write(pickle.dumps(data))
f.close()

We add the name and embedding data to a dictionary and then serialize the data in a pickle file on Lines 110-114.

At this point we’re ready to extract embeddings by running our script.

From there, open up a terminal and execute the following command to compute the face embeddings with OpenCV:

$ python extract_embeddings.py --dataset dataset \
	--embeddings output/embeddings.pickle \
	--detector face_detection_model \
	--embedding-model openface_nn4.small2.v1.t7
[INFO] loading face detector...
[INFO] loading face recognizer...
[INFO] quantifying faces...
[INFO] processing image 1/18
[INFO] processing image 2/18
[INFO] processing image 3/18
[INFO] processing image 4/18
[INFO] processing image 5/18
[INFO] processing image 6/18
[INFO] processing image 7/18
[INFO] processing image 8/18
[INFO] processing image 9/18
[INFO] processing image 10/18
[INFO] processing image 11/18
[INFO] processing image 12/18
[INFO] processing image 13/18
[INFO] processing image 14/18
[INFO] processing image 15/18
[INFO] processing image 16/18
[INFO] processing image 17/18
[INFO] processing image 18/18
[INFO] serializing 18 encodings...

Here you can see that we have extracted 18 face embeddings, one for each of the images (6 per class) in our input face dataset.

### Step #2: Train face recognition model

At this point we have extracted 128-d embeddings for each face — but how do we actually recognize a person based on these embeddings? The answer is that we need to train a “standard” machine learning model (such as an SVM, k-NN classifier, Random Forest, etc.) on top of the embeddings. In my previous face recognition tutorial we discovered how a modified version of k-NN can be used for face recognition on 128-d embeddings created via the dlib and face_recognition libraries. Today, I want to share how we can build a more powerful classifier on top of the embeddings — you’ll be able to use this same method in your dlib-based face recognition pipelines as well if you are so inclined. 
Open up the train_model.py file and insert the following code:

# import the necessary packages
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
import argparse
import pickle

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-e", "--embeddings", required=True,
	help="path to serialized db of facial embeddings")
ap.add_argument("-r", "--recognizer", required=True,
	help="path to output model trained to recognize faces")
ap.add_argument("-l", "--le", required=True,
	help="path to output label encoder")
args = vars(ap.parse_args())

We’ll need scikit-learn, a machine learning library, installed in our environment prior to running this script. You can install it via pip:

$ pip install scikit-learn
We import our packages and modules on Lines 2-5. We’ll be using scikit-learn’s implementation of Support Vector Machines (SVM), a common machine learning model.

From there we parse our command line arguments:

• --embeddings : The path to the serialized embeddings (we exported it by running the previous extract_embeddings.py script).
• --recognizer : This will be our output model that recognizes faces. It is based on SVM. We’ll be saving it so we can use it in the next two recognition scripts.
• --le : Our label encoder output file path. We’ll serialize our label encoder to disk so that we can use it and the recognizer model in our image/video face recognition scripts.

Each of these arguments is required.

Let’s load our facial embeddings and encode our labels:

# load the face embeddings
print("[INFO] loading face embeddings...")
data = pickle.loads(open(args["embeddings"], "rb").read())
# encode the labels
print("[INFO] encoding labels...")
le = LabelEncoder()
labels = le.fit_transform(data["names"])

Here we load our embeddings from Step #1 on Line 19. We won’t be generating any embeddings in this model training script — we’ll use the embeddings previously generated and serialized.

Then we initialize our scikit-learn LabelEncoder and encode our name labels (Lines 23 and 24).

Now it’s time to train our SVM model for recognizing faces:

# train the model used to accept the 128-d embeddings of the face and
# then produce the actual face recognition
print("[INFO] training model...")
recognizer = SVC(C=1.0, kernel="linear", probability=True)
recognizer.fit(data["embeddings"], labels)

On Line 29 we initialize our SVM model, and on Line 30 we fit the model (also known as “training the model”).

Here we are using a Linear Support Vector Machine (SVM) but you can try experimenting with other machine learning models if you so wish.
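To make the classifier step concrete in isolation, here is a minimal sketch that trains the same kind of Linear SVM on synthetic 128-d “embeddings” for two made-up people. The names, cluster centers, and noise level are all hypothetical stand-ins, not the tutorial’s data:

```python
# Sketch: train a Linear SVM on synthetic 128-d embeddings (hypothetical data,
# standing in for the real OpenFace embeddings from extract_embeddings.py).
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# two made-up people, each clustering around its own 128-d center
centers = {"adrian": rng.normal(0, 1, 128), "trisha": rng.normal(0, 1, 128)}
names, embeddings = [], []
for person, center in centers.items():
    for _ in range(6):  # 6 images per class, mirroring the tutorial's dataset
        embeddings.append(center + rng.normal(0, 0.1, 128))
        names.append(person)

le = LabelEncoder()
labels = le.fit_transform(names)
recognizer = SVC(C=1.0, kernel="linear", probability=True)
recognizer.fit(np.array(embeddings), labels)

# a held-out probe near the "adrian" cluster should map back to that name
probe = centers["adrian"] + rng.normal(0, 0.1, 128)
pred = le.classes_[recognizer.predict(np.array([probe]))[0]]
```

Swapping SVC for, say, KNeighborsClassifier or RandomForestClassifier is a one-line change, which is what makes the “train on top of embeddings” design so flexible.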

After training the model we output the model and label encoder to disk as pickle files.

# write the actual face recognition model to disk
f = open(args["recognizer"], "wb")
f.write(pickle.dumps(recognizer))
f.close()

# write the label encoder to disk
f = open(args["le"], "wb")
f.write(pickle.dumps(le))
f.close()

We write two pickle files to disk in this block — the face recognizer model and the label encoder.
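The round trip is plain pickle: whatever pickle.dumps writes, pickle.loads restores. A tiny sketch (the label names below are hypothetical):

```python
# Sketch: the pickle round trip used for the recognizer and label encoder.
import pickle
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["adrian", "trisha", "unknown"])  # hypothetical labels
blob = pickle.dumps(le)        # what train_model.py writes to disk
restored = pickle.loads(blob)  # what the recognition scripts later read back
classes = list(restored.classes_)
```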

At this point, be sure you executed the code from Step #1 first. You can grab the zip containing the code and data from the “Downloads” section.

Now that we have finished coding train_model.py as well, let’s apply it to our extracted face embeddings:

$ python train_model.py --embeddings output/embeddings.pickle \
	--recognizer output/recognizer.pickle \
	--le output/le.pickle
[INFO] loading face embeddings...
[INFO] encoding labels...
[INFO] training model...
$ ls output/
embeddings.pickle	le.pickle		recognizer.pickle

Here you can see that our SVM has been trained on the embeddings and both the (1) SVM itself and (2) the label encoding have been written to disk, enabling us to apply them to input images and video.

### Step #3: Recognize faces with OpenCV

We are now ready to perform face recognition with OpenCV!

We’ll start with recognizing faces in images in this section and then move on to recognizing faces in video streams in the following section.

Open up the recognize.py file in your project and insert the following code:

# import the necessary packages
import numpy as np
import argparse
import imutils
import pickle
import cv2
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
ap.add_argument("-d", "--detector", required=True,
	help="path to OpenCV's deep learning face detector")
ap.add_argument("-m", "--embedding-model", required=True,
	help="path to OpenCV's deep learning face embedding model")
ap.add_argument("-r", "--recognizer", required=True,
	help="path to model trained to recognize faces")
ap.add_argument("-l", "--le", required=True,
	help="path to label encoder")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
args = vars(ap.parse_args())

We import our required packages on Lines 2-7. At this point, you should have each of these packages installed.

Our six command line arguments are parsed on Lines 10-23:

• --image: The path to the input image. We will attempt to recognize the faces in this image.
• --detector: The path to OpenCV’s deep learning face detector. We’ll use this model to detect where in the image the face ROIs are.
• --embedding-model: The path to OpenCV’s deep learning face embedding model. We’ll use this model to extract the 128-D face embedding from the face ROI — we’ll feed the data into the recognizer.
• --recognizer: The path to our recognizer model. We trained our SVM recognizer in Step #2. This is what will actually determine who a face is.
• --le: The path to our label encoder. This contains our face labels such as 'adrian' or 'trisha'.
• --confidence: The optional threshold to filter weak face detections.

Be sure to study these command line arguments — it is important to know the difference between the two deep learning models and the SVM model. If you find yourself confused later in this script, you should refer back to here.

Now that we’ve handled our imports and command line arguments, let’s load the three models from disk into memory:

# load our serialized face detector from disk
print("[INFO] loading face detector...")
protoPath = os.path.sep.join([args["detector"], "deploy.prototxt"])
modelPath = os.path.sep.join([args["detector"],
	"res10_300x300_ssd_iter_140000.caffemodel"])
detector = cv2.dnn.readNetFromCaffe(protoPath, modelPath)

# load our serialized face embedding model from disk
print("[INFO] loading face recognizer...")
embedder = cv2.dnn.readNetFromTorch(args["embedding_model"])

# load the actual face recognition model along with the label encoder
recognizer = pickle.loads(open(args["recognizer"], "rb").read())
le = pickle.loads(open(args["le"], "rb").read())

We load three models in this block. At the risk of being redundant, I want to explicitly remind you of the differences among the models:

1. detector: A pre-trained Caffe DL model to detect where in the image the faces are (Lines 27-30).
2. embedder: A pre-trained Torch DL model to calculate our 128-D face embeddings (Line 34).
3. recognizer: Our Linear SVM face recognition model (Line 37). We trained this model in Step #2.

Both 1 & 2 are pre-trained meaning that they are provided to you as-is by OpenCV. They are buried in the OpenCV project on GitHub, but I’ve included them for your convenience in the “Downloads” section of today’s post. I’ve also numbered the models in the order that we’ll apply them to recognize faces with OpenCV.

We also load our label encoder which holds the names of the people our model can recognize (Line 38).

Now let’s load our image and detect faces:

# load the image, resize it to have a width of 600 pixels (while
# maintaining the aspect ratio), and then grab the image dimensions
image = cv2.imread(args["image"])
image = imutils.resize(image, width=600)
(h, w) = image.shape[:2]

# construct a blob from the image
imageBlob = cv2.dnn.blobFromImage(
cv2.resize(image, (300, 300)), 1.0, (300, 300),
(104.0, 177.0, 123.0), swapRB=False, crop=False)

# apply OpenCV's deep learning-based face detector to localize
# faces in the input image
detector.setInput(imageBlob)
detections = detector.forward()

Here we:

• Load the image into memory and construct a blob (Lines 42-49). Learn about cv2.dnn.blobFromImage here.
• Localize faces in the image via our detector (Lines 53 and 54).
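Under the hood, blobFromImage resizes the image, subtracts the per-channel mean, and reorders the pixels into NCHW layout. A numpy-only sketch of that preprocessing (assuming no scaling or channel swap, as in the detector call above, and an image already at the target size; to_blob is my own illustrative helper, not an OpenCV function):

```python
# Sketch of what cv2.dnn.blobFromImage does for the face detector:
# mean subtraction plus an HWC -> NCHW reorder (numpy only, no OpenCV).
import numpy as np

def to_blob(image, mean=(104.0, 177.0, 123.0)):
    # image is H x W x 3 (BGR); the real call also resizes to 300x300 first
    shifted = image.astype("float32") - np.array(mean, dtype="float32")
    return shifted.transpose(2, 0, 1)[np.newaxis, ...]  # shape (1, 3, H, W)

img = np.full((300, 300, 3), 127, dtype="uint8")  # a flat gray test image
blob = to_blob(img)
```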

Given our new detections, let’s recognize faces in the image. But first we need to filter weak detections and extract the face ROI:

# loop over the detections
for i in range(0, detections.shape[2]):
# extract the confidence (i.e., probability) associated with the
# prediction
confidence = detections[0, 0, i, 2]

# filter out weak detections
if confidence > args["confidence"]:
# compute the (x, y)-coordinates of the bounding box for the
# face
box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
(startX, startY, endX, endY) = box.astype("int")

# extract the face ROI
face = image[startY:endY, startX:endX]
(fH, fW) = face.shape[:2]

# ensure the face width and height are sufficiently large
if fW < 20 or fH < 20:
continue

You’ll recognize this block from Step #1. I’ll explain it here once more:

• We loop over the detections on Line 57 and extract the confidence of each on Line 60.
• Then we compare the confidence to the minimum probability detection threshold contained in our command line args dictionary, ensuring that the computed probability is larger than the minimum probability (Line 63).
• From there, we extract the face ROI (Lines 66-70) and ensure its spatial dimensions are sufficiently large (Lines 74 and 75).
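The box math on Line 66 is worth seeing with concrete numbers: the detector reports box corners as fractions of the frame, so multiplying by [w, h, w, h] recovers pixel coordinates. A sketch with made-up values:

```python
# Sketch: converting the detector's fractional box coordinates to pixels.
import numpy as np

(h, w) = (450, 600)  # a 600-pixel-wide resized image (hypothetical height)
fractions = np.array([0.25, 0.2, 0.75, 0.8])  # (startX, startY, endX, endY)
box = fractions * np.array([w, h, w, h])
(startX, startY, endX, endY) = box.astype("int")
```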

Recognizing the name of the face ROI requires just a few steps:

# construct a blob for the face ROI, then pass the blob
# through our face embedding model to obtain the 128-d
# quantification of the face
faceBlob = cv2.dnn.blobFromImage(face, 1.0 / 255, (96, 96),
(0, 0, 0), swapRB=True, crop=False)
embedder.setInput(faceBlob)
vec = embedder.forward()

# perform classification to recognize the face
preds = recognizer.predict_proba(vec)[0]
j = np.argmax(preds)
proba = preds[j]
name = le.classes_[j]

First, we construct a faceBlob (from the face ROI) and pass it through the embedder to generate a 128-D vector which describes the face (Lines 80-83).

Then, we pass the vec through our SVM recognizer model (Line 86), the result of which is our predictions for who is in the face ROI. We take the highest probability index (Line 87) and query our label encoder to find the name (Line 89). In between, I extract the probability on Line 88.

Note: You can further filter out weak face recognitions by applying an additional threshold test on the probability. For example, inserting if proba < T (where T is a variable you define) can provide an additional layer of filtering to ensure there are fewer false-positive face recognitions.
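That extra check can be a one-liner. A hedged sketch (the filter_recognition helper and the 0.5 threshold are my own illustration, not from the tutorial):

```python
# Sketch: fall back to "unknown" when the recognizer's probability is low.
def filter_recognition(name, proba, T=0.5):
    # T is the variable you define; 0.5 is an arbitrary starting point
    return name if proba >= T else "unknown"

confident = filter_recognition("adrian", 0.82)  # passes the threshold
doubtful = filter_recognition("adrian", 0.31)   # suppressed as "unknown"
```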

Now, let’s display OpenCV face recognition results:

# draw the bounding box of the face along with the associated
# probability
text = "{}: {:.2f}%".format(name, proba * 100)
y = startY - 10 if startY - 10 > 10 else startY + 10
cv2.rectangle(image, (startX, startY), (endX, endY),
(0, 0, 255), 2)
cv2.putText(image, text, (startX, y),
cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 0, 255), 2)

# show the output image
cv2.imshow("Image", image)
cv2.waitKey(0)

For every face we recognize in the loop (including “unknown” people):

• We construct a text string containing the name and probability on Line 93.
• And then we draw a rectangle around the face and place the text above the box (Lines 94-98).

And then finally we visualize the results on the screen until a key is pressed (Lines 101 and 102).

It is time to recognize faces in images with OpenCV!

To apply our OpenCV face recognition pipeline to my provided images (or your own dataset + test images), make sure you use the “Downloads” section of the blog post to download the code, trained models, and example images.

From there, open up a terminal and execute the following command:

$ python recognize.py --detector face_detection_model \
	--embedding-model openface_nn4.small2.v1.t7 \
	--recognizer output/recognizer.pickle \
	--le output/le.pickle \
	--image images/adrian.jpg
[INFO] loading face detector...
[INFO] loading face recognizer...

Figure 4: OpenCV face recognition has recognized me at the Jurassic World: Fallen Kingdom movie showing.

Here you can see me sipping on a beer and sporting one of my favorite Jurassic Park shirts, along with a special Jurassic World pint glass and commemorative book. My face prediction only has 47.15% confidence; however, that confidence is higher than the “Unknown” class.

Let’s try another OpenCV face recognition example:

$ python recognize.py --detector face_detection_model \
	--embedding-model openface_nn4.small2.v1.t7 \
	--recognizer output/recognizer.pickle \
	--le output/le.pickle \
[INFO] loading face recognizer...

Figure 5: My wife, Trisha, and I are recognized in a selfie picture on an airplane with OpenCV + deep learning facial recognition.

Here are Trisha and I, ready to start our vacation!

In a final example, let’s look at what happens when our model is unable to recognize the actual face:

$ python recognize.py --detector face_detection_model \
	--embedding-model openface_nn4.small2.v1.t7 \
	--recognizer output/recognizer.pickle \
	--le output/le.pickle \
	--image images/patrick_bateman.jpg
[INFO] loading face detector...
[INFO] loading face recognizer...

Figure 6: Facial recognition with OpenCV has determined that this person is “unknown”.

The third image is an example of an “unknown” person who is actually Patrick Bateman from American Psycho — believe me, this is not a person you would want to see show up in your images or video streams!

### BONUS: Recognize faces in video streams

As a bonus, I decided to include a section dedicated to OpenCV face recognition in video streams!

The actual pipeline itself is near identical to recognizing faces in images, with only a few updates which we’ll review along the way.

Open up the recognize_video.py file and let’s get started:

# import the necessary packages
from imutils.video import VideoStream
from imutils.video import FPS
import numpy as np
import argparse
import imutils
import pickle
import time
import cv2
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--detector", required=True,
	help="path to OpenCV's deep learning face detector")
ap.add_argument("-m", "--embedding-model", required=True,
	help="path to OpenCV's deep learning face embedding model")
ap.add_argument("-r", "--recognizer", required=True,
	help="path to model trained to recognize faces")
ap.add_argument("-l", "--le", required=True,
	help="path to label encoder")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
args = vars(ap.parse_args())

Our imports are the same as the Step #3 section above, except for Lines 2 and 3 where we use the imutils.video module. We’ll use VideoStream to capture frames from our camera and FPS to calculate frames per second statistics.
The command line arguments are also the same except we aren’t passing a path to a static image via the command line. Rather, we’ll grab a reference to our webcam and then process the video. Refer to Step #3 if you need to review the arguments.

Our three models and label encoder are loaded here:

# load our serialized face detector from disk
print("[INFO] loading face detector...")
protoPath = os.path.sep.join([args["detector"], "deploy.prototxt"])
modelPath = os.path.sep.join([args["detector"],
	"res10_300x300_ssd_iter_140000.caffemodel"])
detector = cv2.dnn.readNetFromCaffe(protoPath, modelPath)

# load our serialized face embedding model from disk
print("[INFO] loading face recognizer...")
embedder = cv2.dnn.readNetFromTorch(args["embedding_model"])

# load the actual face recognition model along with the label encoder
recognizer = pickle.loads(open(args["recognizer"], "rb").read())
le = pickle.loads(open(args["le"], "rb").read())

Here we load the face detector, the face embedder model, the face recognizer model (Linear SVM), and the label encoder. Again, be sure to refer to Step #3 if you are confused about the three models or label encoder.
Let’s initialize our video stream and begin processing frames:

# initialize the video stream, then allow the camera sensor to warm up
print("[INFO] starting video stream...")
vs = VideoStream(src=0).start()
time.sleep(2.0)

# start the FPS throughput estimator
fps = FPS().start()

# loop over frames from the video file stream
while True:
	# grab the frame from the threaded video stream
	frame = vs.read()

	# resize the frame to have a width of 600 pixels (while
	# maintaining the aspect ratio), and then grab the image
	# dimensions
	frame = imutils.resize(frame, width=600)
	(h, w) = frame.shape[:2]

	# construct a blob from the image
	imageBlob = cv2.dnn.blobFromImage(
		cv2.resize(frame, (300, 300)), 1.0, (300, 300),
		(104.0, 177.0, 123.0), swapRB=False, crop=False)

	# apply OpenCV's deep learning-based face detector to localize
	# faces in the input image
	detector.setInput(imageBlob)
	detections = detector.forward()

Our VideoStream object is initialized and started on Line 43. We wait for the camera sensor to warm up on Line 44.

We also initialize our frames per second counter (Line 47) and begin looping over frames on Line 50. We grab a frame from the webcam on Line 52.

From here everything is the same as Step #3. We resize the frame (Line 57) and then we construct a blob from the frame + detect where the faces are (Lines 61-68).
Now let’s process the detections:

# loop over the detections
for i in range(0, detections.shape[2]):
	# extract the confidence (i.e., probability) associated with
	# the prediction
	confidence = detections[0, 0, i, 2]

	# filter out weak detections
	if confidence > args["confidence"]:
		# compute the (x, y)-coordinates of the bounding box for
		# the face
		box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
		(startX, startY, endX, endY) = box.astype("int")

		# extract the face ROI
		face = frame[startY:endY, startX:endX]
		(fH, fW) = face.shape[:2]

		# ensure the face width and height are sufficiently large
		if fW < 20 or fH < 20:
			continue

Just as in the previous section, we begin looping over detections and filter out weak ones (Lines 71-77). Then we extract the face ROI and ensure its spatial dimensions are sufficiently large for the next steps (Lines 84-89).

Now it’s time to perform OpenCV face recognition:

# construct a blob for the face ROI, then pass the blob
# through our face embedding model to obtain the 128-d
# quantification of the face
faceBlob = cv2.dnn.blobFromImage(face, 1.0 / 255, (96, 96),
	(0, 0, 0), swapRB=True, crop=False)
embedder.setInput(faceBlob)
vec = embedder.forward()

# perform classification to recognize the face
preds = recognizer.predict_proba(vec)[0]
j = np.argmax(preds)
proba = preds[j]
name = le.classes_[j]

# draw the bounding box of the face along with the
# associated probability
text = "{}: {:.2f}%".format(name, proba * 100)
y = startY - 10 if startY - 10 > 10 else startY + 10
cv2.rectangle(frame, (startX, startY), (endX, endY),
	(0, 0, 255), 2)
cv2.putText(frame, text, (startX, y),
	cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 0, 255), 2)

# update the FPS counter
fps.update()

Here we:

• Construct the faceBlob (Lines 94 and 95) and calculate the facial embeddings via deep learning (Lines 96 and 97).
• Recognize the most-likely name of the face while calculating the probability (Lines 100-103).
• Draw a bounding box around the face and the person’s name + probability (Lines 107-112).

Our fps counter is updated on Line 115.

Let’s display the results and clean up:

# show the output frame
cv2.imshow("Frame", frame)
key = cv2.waitKey(1) & 0xFF

# if the q key was pressed, break from the loop
if key == ord("q"):
	break

# stop the timer and display FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

# do a bit of cleanup
cv2.destroyAllWindows()
vs.stop()

To close out the script, we:

• Display the annotated frame (Line 118) and wait for the “q” key to be pressed at which point we break out of the loop (Lines 119-123).
• Stop our fps counter and print statistics in the terminal (Lines 126-128).
• Cleanup by closing windows and releasing pointers (Lines 131 and 132).

To execute our OpenCV face recognition pipeline on a video stream, open up a terminal and execute the following command:

$ python recognize_video.py --detector face_detection_model \
--embedding-model openface_nn4.small2.v1.t7 \
--recognizer output/recognizer.pickle \
--le output/le.pickle
[INFO] starting video stream...
[INFO] elapsed time: 12.52
[INFO] approx. FPS: 16.13

Figure 7: Face recognition in video with OpenCV.

As you can see, both Trisha and I are correctly identified! Our OpenCV face recognition pipeline is also obtaining ~16 FPS on my iMac. On my MacBook Pro I was getting a ~14 FPS throughput rate.

### Drawbacks, limitations, and how to obtain higher face recognition accuracy

Figure 8: All face recognition systems are error-prone. There will never be a 100% accurate face recognition system.

Inevitably, you’ll run into a situation where OpenCV does not recognize a face correctly.

What do you do in those situations?

And how do you improve your OpenCV face recognition accuracy? In this section, I’ll detail a few of the suggested methods to increase the accuracy of your face recognition pipeline.

#### You may need more data

Figure 9: Most people aren’t training their OpenCV face recognition models with enough data. (image source)

My first suggestion is likely the most obvious one, but it’s worth sharing.

In my previous tutorial on face recognition, a handful of PyImageSearch readers asked why their face recognition accuracy was low and faces were being misclassified — the conversation went something like this (paraphrased):

Them: Hey Adrian, I am trying to perform face recognition on a dataset of my classmate’s faces, but the accuracy is really low. What can I do to increase face recognition accuracy?

Me: How many face images do you have per person?

Them: Only one or two.

Me: Gather more data.

I get the impression that most readers already know they need more face images when they only have one or two example faces per person, but I suspect they are hoping for me to pull a computer vision technique out of my bag of tips and tricks to solve the problem.

It doesn’t work like that.

If you find yourself with low face recognition accuracy and only have a few example faces per person, gather more data — there are no “computer vision tricks” that will save you from the data gathering process.

Invest in your data and you’ll have a better OpenCV face recognition pipeline. In general, I would recommend a minimum of 10-20 faces per person.

Note: You may be thinking, “But Adrian, you only gathered 6 images per person in today’s post!” Yes, you are right — and I did that to prove a point. The OpenCV face recognition system we discussed here today worked but can always be improved. There are times when smaller datasets will give you your desired results, and there’s nothing wrong with trying a small dataset — but when you don’t achieve your desired accuracy you’ll want to gather more data.

#### Perform face alignment

Figure 9: Performing face alignment for OpenCV facial recognition can dramatically improve face recognition performance.

The face recognition model OpenCV uses to compute the 128-d face embeddings comes from the OpenFace project.

The OpenFace model will perform better on faces that have been aligned.

Face alignment is the process of:

1. Identifying the geometric structure of faces in images.
2. Attempting to obtain a canonical alignment of the face based on translation, rotation, and scale.

As you can see from Figure 9 at the top of this section, I have:

1. Detected faces in the image and extracted the ROIs (based on the bounding box coordinates).
2. Applied facial landmark detection to extract the coordinates of the eyes.
3. Computed the centroid for each respective eye along with the midpoint between the eyes.
4. And based on these points, applied an affine transform to resize the face to a fixed size and dimension.

If we apply face alignment to every face in our dataset, then in the output coordinate space, all faces should:

1. Be centered in the image.
2. Be rotated such that the eyes lie on a horizontal line (i.e., the face is rotated such that the eyes lie along the same y-coordinates).
3. Be scaled such that the size of the faces is approximately identical.
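The rotation part of alignment reduces to a small trigonometry step: from the two eye centroids you can compute the angle the face must rotate so the eyes sit on a horizontal line. A sketch with hypothetical eye coordinates:

```python
# Sketch: the eye-leveling angle used in face alignment.
import math

left_eye = (120, 160)   # hypothetical eye centroids, (x, y) in pixels
right_eye = (200, 140)
dx = right_eye[0] - left_eye[0]
dy = right_eye[1] - left_eye[1]
# rotating the face by -angle would place both eyes on the same y-coordinate
angle = math.degrees(math.atan2(dy, dx))
```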

Applying face alignment to our OpenCV face recognition pipeline was outside the scope of today’s tutorial, but if you would like to further increase your face recognition accuracy using OpenCV and OpenFace, I would recommend you apply face alignment.

Check out my blog post, Face Alignment with OpenCV and Python.

#### Tune your hyperparameters

My second suggestion is for you to attempt to tune your hyperparameters on whatever machine learning model you are using (i.e., the model trained on top of the extracted face embeddings).

For this tutorial, we used a Linear SVM; however, we did not tune the C value, which is typically the most important value of an SVM to tune.

The C value is a “strictness” parameter and controls how much you want to avoid misclassifying each data point in the training set.

Larger values of C will be more strict and try harder to classify every input data point correctly, even at the risk of overfitting.

Smaller values of C will be more “soft”, allowing some misclassifications in the training data, but ideally generalizing better to testing data.

It’s interesting to note that, according to one of the classification examples in the OpenFace GitHub, they actually recommend not tuning the hyperparameters, as, from their experience, they found that setting C=1 obtains satisfactory face recognition results in most settings.

Still, if your face recognition accuracy is not sufficient, it may be worth the extra effort and computational cost of tuning your hyperparameters via either a grid search or random search.
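A grid search over C takes only a few lines with scikit-learn. The sketch below uses synthetic, well-separated embeddings rather than real face data:

```python
# Sketch: grid-searching the SVM's C on synthetic 128-d embeddings.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two easily separable synthetic classes, 20 samples each
X = np.vstack([rng.normal(0, 1, (20, 128)), rng.normal(3, 1, (20, 128))])
y = np.array([0] * 20 + [1] * 20)

grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```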

#### Use dlib’s embedding model (but not its k-NN for face recognition)

In my experience using both OpenCV’s face recognition model and dlib’s face recognition model, I’ve found that dlib’s face embeddings are more discriminative, especially for smaller datasets.

Furthermore, I’ve found that dlib’s model is less dependent on:

1. Preprocessing such as face alignment
2. Using a more powerful machine learning model on top of extracted face embeddings

If you take a look at my original face recognition tutorial, you’ll notice that we utilized a simple k-NN algorithm for face recognition (with a small modification to throw out nearest neighbor votes whose distance was above a threshold).

The k-NN model worked extremely well, but as we know, more powerful machine learning models exist.

To improve accuracy further, you may want to use dlib’s embedding model, and then instead of applying k-NN, follow Step #2 from today’s post and train a more powerful classifier on the face embeddings.
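For reference, the “modified k-NN” idea can be sketched in a few lines: discard neighbor votes whose distance exceeds a threshold, and answer “unknown” when no votes survive. The function name, toy 2-d embeddings, and 0.6 threshold here are illustrative, not the original implementation:

```python
# Sketch: k-NN face recognition with a distance cutoff for "unknown" faces.
import numpy as np
from collections import Counter

def knn_recognize(query, gallery, names, k=3, max_dist=0.6):
    dists = np.linalg.norm(gallery - query, axis=1)  # distance to each example
    nearest = np.argsort(dists)[:k]
    # throw out votes from neighbors that are too far away
    votes = [names[i] for i in nearest if dists[i] <= max_dist]
    return Counter(votes).most_common(1)[0][0] if votes else "unknown"

gallery = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])  # toy 2-d embeddings
names = ["adrian", "adrian", "trisha"]
near = knn_recognize(np.array([0.05, 0.0]), gallery, names)
far = knn_recognize(np.array([10.0, 10.0]), gallery, names)
```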

### Did you encounter a “USAGE” error running today’s Python face recognition scripts?

Each week I receive emails that (paraphrased) go something like this:

Hi Adrian, I can’t run the code from the blog post.

My error looks like this:

usage: extract_embeddings.py [-h] -i DATASET -e EMBEDDINGS
-d DETECTOR -m EMBEDDING_MODEL [-c CONFIDENCE]
extract_embeddings.py: error: the following arguments are required:
-i/--dataset, -e/--embeddings, -d/--detector, -m/--embedding-model

Or this:

I’m using Spyder IDE to run the code. It isn’t running as I encounter a “usage” message in the command box.

There are three separate Python scripts in this tutorial, and furthermore, each of them requires that you (correctly) supply the respective command line arguments.

If you’re new to command line arguments, that’s fine, but you need to read up on how Python, argparse, and command line arguments work before you try to run these scripts!

I’ll be honest with you — face recognition is an advanced technique. Command line arguments are a very beginner/novice concept. Make sure you walk before you run, otherwise you will trip up. Take the time now to educate yourself on how command line arguments work.

Secondly, I always include the exact command you can copy and paste into your terminal or command line and run the script. You might want to modify the command line arguments to accommodate your own image or video data, but essentially I’ve done the work for you. With a knowledge of command line arguments you can update the arguments to point to your own data without having to modify a single line of code.

For the readers that want to use an IDE like Spyder or PyCharm my recommendation is that you learn how to use command line arguments in the command line/terminal first. Program in the IDE, but use the command line to execute your scripts.

I also recommend that you don’t bother trying to configure your IDE for command line arguments until you understand how they work by typing them in first. In fact, you’ll probably learn to love the command line as it is faster than clicking through a GUI menu to input the arguments each time you want to change them. Once you have a good handle on how command line arguments work, you can then configure them separately in your IDE.

From a quick search through my inbox, I see that I’ve answered 500-1,000 command line argument-related questions. I’d estimate I’ve answered another 1,000+ such questions replying to comments on the blog.

Don’t let me discourage you from commenting on a post or emailing me for assistance — please do. But if you are new to programming, I urge you to read and try the concepts discussed in my command line arguments blog post as that will be the tutorial I’ll link you to if you need help.

## Summary

In today’s blog post we used OpenCV to perform face recognition.

Our OpenCV face recognition pipeline was created using a four-stage process:

1. Create your dataset of face images
2. Extract face embeddings for each face in the image (again, using OpenCV)
3. Train a model on top of the face embeddings
4. Utilize OpenCV to recognize faces in images and video streams

Since I was married over this past weekend, I used photos of myself and Trisha (my now wife) to keep the tutorial fun and festive.

You can, of course, swap in your own face dataset provided you follow the directory structure of the project detailed above.

If you need help gathering your own face dataset, be sure to refer to this post on building a face recognition dataset.

I hope you enjoyed today’s tutorial on OpenCV face recognition!

The post OpenCV Face Recognition appeared first on PyImageSearch.

### "Maximum Mean Discrepancy for Training Generative Adversarial Networks" (TODAY at the statistics seminar)

Attention conservation notice: Last-minute notice of a technical talk in a city you don't live in. Only of interest if you (1) care about actor/critic or co-training methods for fitting generative models, and (2) have free time in Pittsburgh this afternoon.

I have been remiss in blogging the statistics department's seminars for the new academic year. So let me try to rectify that:

Arthur Gretton, "The Maximum Mean Discrepancy for Training Generative Adversarial Networks"
Abstract: Generative adversarial networks (GANs) use neural networks as generative models, creating realistic samples that mimic real-life reference samples (for instance, images of faces, bedrooms, and more). These networks require an adaptive critic function while training, to teach the networks how to improve their samples to better match the reference data. I will describe a kernel divergence measure, the maximum mean discrepancy, which represents one such critic function. With gradient regularisation, the MMD is used to obtain current state-of-the-art performance on challenging image generation tasks, including 160 × 160 CelebA and 64 × 64 ImageNet. In addition to adversarial network training, I'll discuss issues of gradient bias for GANs based on integral probability metrics, and mechanisms for benchmarking GAN performance.
Time and place: 4:00--5:00 pm on Monday, 24 September 2018, in the Mellon Auditorium (room A35), Posner Hall, Carnegie Mellon University

As always, talks are free and open to the public.

### Deep Learning Framework Power Scores 2018

Who’s on top in usage, interest, and popularity?

### Save On an Annual DataCamp Subscription (Less Than 2 Days Left)

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

DataCamp is now offering a discount on unlimited access to their course curriculum. Access 170+ courses in R, Python, SQL and more taught by experts and thought leaders in data science such as Mine Cetinkaya-Rundel (RStudio), Hadley Wickham (RStudio), Max Kuhn (caret) and more. Check out this link to get the discount!

Below are some of the tracks available. You can choose a career track which is a deep dive into a subject that covers all the skills needed. Or a skill track which focuses on a specific subject.

Tidyverse Fundamentals (Skill Track)
Experience the whole data science pipeline from importing and tidying data to wrangling and visualizing data to modeling and communicating with data. Gain exposure to each component of this pipeline from a variety of different perspectives in this tidyverse R track.

Finance Basics with R (Skill Track)
If you are just starting to learn about finance and are new to R, this is the right track to kick things off! In this track, you will learn the basics of R and apply your new knowledge directly to finance examples, start manipulating your first (financial) time series, and learn how to pull financial data from local files as well as from internet sources.

Data Scientist with R (Career Track)
A Data Scientist combines statistical and machine learning techniques with R programming to analyze and interpret complex data. This career track gives you exposure to the full data science toolbox.

Quantitative Analyst with R (Career Track)
In finance, quantitative analysts ensure portfolios are risk balanced, help find new trading opportunities, and evaluate asset prices using mathematical models. Interested? This track is for you.

And much more – the offer ends September 25th so don’t wait!

DataCamp is an online learning platform that uses high-quality video and interactive in-browser coding challenges to teach you data science using R, Python, SQL, and more. All courses can be taken at your own pace. To date, over 2.5 million data science enthusiasts have taken one or more courses at DataCamp.


### Don’t calculate post-hoc power using observed estimate of effect size

Aleksi Reito writes:

The statement below was included in a recent issue of Annals of Surgery:

But, as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if less than 80%—with the given sample size and effect size observed in that study.

It is the highest ranking journal in the field of surgery. I find it worrying that they suggest calculating post-hoc power.

I agree. This is a well known error; see references here, where we write:

The idea that published effect-size estimates tend to be too large, essentially because of publication bias, is not new (Hedges, 1984; Lane & Dunlap, 1978; for a more recent example, also see Button et al., 2013). . . .

After data have been collected, and a result is in hand, statistical authorities commonly recommend against performing power calculations (see, e.g., Goodman & Berlin, 1994; Lenth, 2007; Senn, 2002).

It’s fine to estimate power (or, more generally, statistical properties of estimates) after the data have come in—but only only only only only if you do this based on a scientifically grounded assumed effect size. One should not not not not not estimate the power (or other statistical properties) of a study based on the “effect size observed in that study.” That’s just terrible, and it’s too bad that the Annals of Surgery is ignoring a literature that goes back at least to 1994 (and I’m sure earlier) that warns against this.
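The problem is easy to demonstrate by simulation. The sketch below (Python, with illustrative numbers: a true effect of 1 standard error, so true power is about 17%) computes the "post-hoc power" implied by each study's observed effect size; among studies that happen to reach statistical significance, that number is always above 50%, no matter how underpowered the design actually is:

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(effect, se):
    """Two-sided power at alpha = 0.05 for a normally distributed estimate."""
    z = effect / se
    return norm_cdf(z - 1.96) + norm_cdf(-z - 1.96)

random.seed(1)
true_effect, se = 1.0, 1.0            # illustrative numbers; true power is about 0.17
print(round(power(true_effect, se), 2))

post_hoc = []                          # post-hoc power across all simulated studies
sig_post_hoc = []                      # ... among studies reaching significance
for _ in range(10_000):
    est = random.gauss(true_effect, se)   # noisy observed effect size in one study
    p = power(est, se)                    # "power" computed from that estimate
    post_hoc.append(p)
    if abs(est / se) > 1.96:              # study happened to be significant
        sig_post_hoc.append(p)

print(round(sum(post_hoc) / len(post_hoc), 2))
print(round(sum(sig_post_hoc) / len(sig_post_hoc), 2))
```

Because a significant result means the observed z-statistic exceeds 1.96, plugging it back into the power formula mechanically returns at least 50%, which is the overconfidence-conditional-on-significance described above.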

Reito continues:

I still can’t understand how it is possible that authors suggest a revision to the CONSORT and STROBE guidelines to include an assessment of post-hoc power, and this gets published in the highest-ranking surgical journal. They try to tackle the issues with reproducibility but show a complete lack of understanding of basic statistical concepts. I look forward to the discussion on this matter.

I too look forward to this discussion. Hey, Annals of Surgery, whassup?

P.S. I guess I could write a letter to the editor of the journal but I doubt they’d publish it, as I don’t speak the language of medical journals.

But, hey, let’s give it a try! I’ll go over to the webpage of Annals of Surgery, set up an account, write a letter . . .

Here it is:

Don’t calculate post-hoc power using observed estimate of effect size

Andrew Gelman

28 Mar 2018

In an article recently published in the Annals of Surgery, Bababekov et al. (2018) write: “as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%—with the given sample size and effect size observed in that study.” This would be a bad idea. The problem is that the (estimated) effect size observed in a study is noisy, especially so in the sorts of studies discussed by the authors. Using estimated effect size can give a terrible estimate of power, and in many cases can lead to drastic overestimates of power (thus, extreme overconfidence of the sort that is rightly deplored by Bababekov et al. in their article), with the problem becoming even worse for studies that happen to achieve statistical significance. The problem is well known in the statistical and medical literatures; see, e.g., Lane and Dunlap (1978), Hedges (1984), Goodman and Berlin (1994), Senn (2002), and Lenth (2007). For some discussion of the systemic consequences of biased power calculations based on noisy estimates of effect size, see Button et al. (2013), and for an alternative approach to design and power analysis, see Gelman and Carlin (2014).

That said, I agree with much of what Bababekov et al. (2018) say. I agree that the routine assumption of 80% power is a mistake, and that requirements of 80% power encourage researchers to exaggerate effect sizes in their experimental designs and to cheat in their analyses in order to attain the statistical significance that they were supposedly all but assured of (Gelman, 2017b).

More generally, demands for near-certainty, along with the availability of statistical analysis tools that can yield statistical significance even in the absence of real effects (Simmons et al., 2011), have led to a replication crisis and general corruption in many areas of science (Ioannidis, 2016), a problem which I believe is structural and persists even in the presence of honest intentions of many or most participants in the process (Gelman, 2017a). I appreciate the concerns of Bababekov et al. (2018) and I agree with their goals and general recommendations, including their conclusion that “we need to begin to convey the uncertainty associated with our studies so that patients and providers can be empowered to make appropriate decisions.” There is just one problem with their recommendation to calculate power using observed effect sizes.

References

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B., Flint, J., Robinson, E. S. J., and Munafo, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14, 1-12.

Gelman, A. (2017a). Honesty and transparency are not enough. Chance 30 (1), 37-39.

Gelman, A. (2017b). The “80% power” lie. Statistical Modeling, Causal Inference, and Social Science blog, 4 Dec. https://andrewgelman.com/2017/12/04/80-power-lie/

Gelman, A., and Carlin, J. B. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science 9, 641-651.

Goodman, S. N., and Berlin, J. A. (1994). The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine 121, 200-206.

Hedges, L. V. (1984). Estimation of effect size under nonrandom sampling: The effects of censoring studies yielding statistically insignificant mean differences. Journal of Educational Statistics 9, 61-85.

Ioannidis, J. (2016). Evidence-based medicine has been hijacked: A report to David Sackett. Journal of Clinical Epidemiology 73, 82-86.

Lane, D. M., and Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology 31, 107-112.

Lenth, R. V. (2007). Statistical power calculations. Journal of Animal Science 85, E24-E29.

Senn, S. J. (2002). Power is indeed irrelevant in interpreting completed studies. British Medical Journal 325, Article 1304.

Simmons, J., Nelson, L., and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22, 1359-1366.

. . . upload it to the journal’s submission website. Done!

That took an hour. An hour worth spending? Who knows. I doubt the journal will accept the letter, but we’ll see. I assume their editorial review system is faster than this blog: the submission went in on 28 Mar 2018, and this post is scheduled for 24 Sept 2018.

### Dataquest helped me get my dream job at Noodle.ai

Dataquest’s mission is to prepare real-world data scientists.

Sunishchal Dev wanted a career in data science. He had a degree in Technology & Innovation Management and solid business skills, but he needed to improve his technical skills and learn Python to get the job he really wanted.

Sunishchal had experience with online learning, but it hadn’t clicked until he started using Dataquest as part of a General Assembly bootcamp. “I tried to use Code Academy and Data Camp but felt that their learning modules were not interactive enough to keep me engaged. I also don't like watching video lectures, as they are not skimmable.”

Dataquest was different, Sunishchal explains, “The lessons were easy to absorb and had the right balance of theory and practical knowledge.” Access to office hours and the community through his premium subscription helped him build the foundation he needed to move forward.

He believes that, when it comes to learning, you get what you pay for. He’d spent a long time trying to learn to program using free resources, but found that his commitment level matched what he paid. “Making the leap of faith in getting a Dataquest subscription really lit the fire under me. I started completing the lessons and got a ton of value out of my small investment.”

Sunishchal now works for Noodle.ai, an Enterprise AI startup that builds custom data pipelines and machine learning models for large enterprises. He spends his days on risk modeling for a jet engine manufacturer. “I can truly say I've found my dream job!”

Are you next? Start learning >>

### Versa Shore: Data Scientist [Seattle, WA]

Versa Shore is seeking a Data Scientist in Seattle, WA: a generalist in data science with deeper experience in one of optimization, stochastic processes, deep learning, or reinforcement learning.